Engineering Manager, ML Performance

Google Google · Big Tech · Sunnyvale, CA +2

Engineering Manager for Google's TPU Performance team, focusing on optimizing the speed and efficiency of AI/ML model training and inference on custom TPU hardware. The role involves leading a team to develop and maintain ML benchmarks, identify performance opportunities, drive optimizations (near-term and out-of-the-box), and participate in algorithmic innovations and co-designing TPU-friendly models. This includes work on inference serving, quantization, and compiler optimizations, serving both internal Google teams and external AI companies.

What you'd actually do

  1. Lead a team of software engineers focused on identifying and maintaining ML training and serving benchmarks that are representative to Google production and the broader ML industry.
  2. Achieve performance for customer launches, and in case of third-party/open-source software (OSS) models, for engaged benchmark submissions (ML Commons, InferenceX, etc.).
  3. Use benchmarks to identify performance opportunities and drive both near-term SOTA (e.g., custom kernels) and out-of the box performance (compiler/runtime optimizations, agentic tooling, auto-sharding) directly and in collaboration with partner teams.
  4. Participate in algorithmic innovations exploiting new TPU hardware features and model-preserving optimizations (speculative decoding, sparsity, quantization, LoRA, etc.).
  5. Participate in co-designing models that are TPU-friendly to showcase model quality at performance advanced to OSS models typically designed on GPUs.

Skills

Required

  • software development
  • leading ML design
  • optimizing ML infrastructure
  • technical leadership
  • people management
  • ML performance analysis
  • benchmarking
  • computer architecture

Nice to have

  • ML accelerators (GPUs, TPUs)
  • low-level kernel programming/tuning
  • CUDA
  • Triton
  • Pallas
  • compiler optimization
  • MLIR
  • OpenXLA
  • integrating frameworks/serving libraries
  • PyTorch
  • JAX
  • vLLM
  • hardware design

What the JD emphasized

  • ML training and serving benchmarks
  • performance
  • ML performance analysis, benchmarking, and computer architecture
  • ML accelerators (GPUs, TPUs)
  • low-level kernel programming/tuning
  • compiler optimization
  • serving libraries

Other signals

  • TPU Performance team
  • maximizing the speed and efficiency of Google’s custom AI chips (TPUs) for training and running massive AI/ML models
  • optimization partners for both Google's internal teams and major external AI companies and foundation model builders
  • customer launches
  • ML Commons, InferenceX
  • compiler/runtime optimizations
  • agentic tooling
  • auto-sharding
  • speculative decoding, sparsity, quantization, LoRA
  • ML accelerators (GPUs, TPUs)
  • low-level kernel programming/tuning
  • compiler optimization (MLIR, OpenXLA)
  • integrating frameworks/serving libraries (PyTorch, JAX, vLLM)