Staff Software Engineer, TPU, Performance

Google · Big Tech · Sunnyvale, CA +1

Staff Software Engineer focused on optimizing the performance of ML models (including Gemini and OSS models) on TPU systems for both JAX and PyTorch platforms. The role involves identifying and maintaining ML benchmarks, analyzing performance metrics, and collaborating with compiler and runtime teams to improve performance. It also includes engaging with product teams and researchers to solve performance problems for large-scale ML training and serving.
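To make the flavor of the work concrete, here is a minimal sketch of the kind of TPU microbenchmark such a role maintains, written with JAX. The `step` function, shapes, and iteration count are illustrative assumptions, not anything specified in the posting; a real benchmark would wrap an actual model step.

```python
# Minimal TPU microbenchmark sketch (illustrative, not from the posting).
import time

import jax
import jax.numpy as jnp


@jax.jit
def step(a, b):
    # Stand-in for one training/serving step; here, a single matmul.
    return a @ b


def benchmark(n=4096, iters=50):
    ka, kb = jax.random.split(jax.random.PRNGKey(0))
    a = jax.random.normal(ka, (n, n), dtype=jnp.bfloat16)
    b = jax.random.normal(kb, (n, n), dtype=jnp.bfloat16)

    step(a, b).block_until_ready()  # warm up: compile before timing

    start = time.perf_counter()
    for _ in range(iters):
        out = step(a, b)
    out.block_until_ready()  # JAX dispatch is async; wait for the device
    elapsed = time.perf_counter() - start

    flops = 2 * n**3 * iters  # matmul FLOP count: 2*M*K*N per iteration
    print(f"{elapsed / iters * 1e3:.2f} ms/iter, "
          f"{flops / elapsed / 1e12:.1f} TFLOP/s "
          f"on {jax.devices()[0].platform}")


if __name__ == "__main__":
    benchmark()
```

Blocking on the result before reading the clock matters: JAX returns control to Python before the TPU finishes, so a benchmark without `block_until_ready()` measures dispatch time, not device time.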

What you'd actually do

  1. Identify and maintain ML training and serving benchmarks that are representative of Google production and the broader ML industry.
  2. Achieve performance targets for customer launches and, in the case of third-party/Open-Source Software (3P/OSS) models, for engaged benchmark submissions (MLCommons, InferenceMAX, etc.).
  3. Use the benchmarks to identify performance opportunities and drive out-of-the-box performance improvements in the compiler, runtime, etc., in collaboration with those teams.
  4. Engage with Google product teams and researchers to solve their performance problems (e.g., onboarding new ML models and products onto new Google TPU hardware, or enabling giant models to train efficiently at very large scale, i.e., thousands of TPUs).
  5. Analyze performance and efficiency metrics to identify bottlenecks, then design and implement solutions at Google fleet-wide scale (see the profiling sketch after this list).
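As a concrete illustration of item 5, below is a hedged sketch of how bottleneck analysis often starts in JAX: capture a device trace around a hot region and inspect it in TensorBoard's profiler. The workload and the output directory are hypothetical placeholders; in practice the trace would wrap a real model step.

```python
# Sketch of trace capture for bottleneck analysis (workload is a placeholder).
import jax
import jax.numpy as jnp


@jax.jit
def hot_region(x):
    # Placeholder compute; a real analysis would trace a model step.
    for _ in range(8):
        x = jnp.tanh(x @ x)
    return x


x = jax.random.normal(jax.random.PRNGKey(0), (2048, 2048))

# Writes a trace under /tmp/jax-trace; view it with
# `tensorboard --logdir /tmp/jax-trace` (needs tensorboard-plugin-profile).
with jax.profiler.trace("/tmp/jax-trace"):
    hot_region(x).block_until_ready()
```

The resulting timeline shows per-op device time, gaps between kernels, and host/device overlap, which is typically where compiler- or runtime-level opportunities (item 3) first become visible.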

Skills

Required

  • software development
  • speech/audio
  • reinforcement learning
  • ML infrastructure
  • ML design
  • model deployment
  • model evaluation
  • data processing
  • debugging
  • fine-tuning
  • software testing
  • software launching
  • software design
  • software architecture

Nice to have

  • data structures
  • algorithms
  • compiler optimization
  • code generation
  • runtime systems for GPU architectures
  • OpenXLA
  • MLIR
  • Triton
  • tailoring algorithms and ML models to exploit ML accelerator architecture strengths
  • low-level GPU programming
  • CUDA
  • OpenCL
  • performance tuning techniques
  • GPU architectures
  • TPU architectures
  • ML accelerator architectures
  • memory hierarchies
  • performance bottlenecks

What the JD emphasized

  • performance optimization
  • ML models
  • TPU systems
  • JAX and PyTorch platforms
  • compiler
  • runtime
  • large-scale ML training
  • performance problems

Other signals

  • TPU performance optimization
  • ML model performance
  • Gemini and OSS models
  • compiler and runtime optimization
  • large-scale ML training