Senior Staff Software Engineer, TPU Performance

Google · Big Tech · Sunnyvale, CA +1

Senior Staff Software Engineer focused on optimizing ML training and serving performance on Google's TPUs. This role involves identifying and maintaining benchmarks, driving performance improvements through compiler/runtime optimizations and algorithmic innovations, and co-designing TPU-friendly models. Experience with ML infrastructure, speech/audio, or reinforcement learning is required.

What you'd actually do

  1. Identify and maintain ML training and serving benchmarks that are representative of Google production and the broader ML industry.
  2. Achieve performance goals for customer launches and, in the case of third-party/OSS models, for the benchmark submissions the team engages in (MLCommons, InferenceX, etc.).
  3. Use the benchmarks to identify performance opportunities and drive both near-term state-of-the-art performance (e.g., custom kernels) and out-of-the-box performance (compiler/runtime optimizations, agentic tooling, auto-sharding), both directly and in collaboration with partner teams.
  4. Participate in algorithmic innovations that exploit new TPU hardware features and model-preserving optimizations (speculative decoding, sparsity, quantization, LoRA, etc.).
  5. Participate in co-designing TPU-friendly models to showcase model quality and performance that compare favorably with OSS models, which are typically designed on GPUs.

Skills

Required

  • Software development
  • Technical project strategy
  • ML design
  • ML infrastructure
  • Model deployment
  • Model evaluation
  • Data processing
  • Debugging
  • Fine tuning
  • Speech/audio
  • Reinforcement learning
  • Design and architecture
  • Software product testing/launching
  • Python
  • C++

Nice to have

  • Master’s degree or PhD in Engineering, Computer Science, or a related technical field
  • Data structures and algorithms
  • Technical leadership
  • Matrixed organization experience
  • Performance analysis and debugging
  • PyTorch
  • JAX

What the JD emphasized

  • 7 years of experience leading technical project strategy, ML design, and working with ML infrastructure (e.g., model deployment, model evaluation, data processing, debugging, fine tuning).
  • 5 years of experience with one or more of the following: Speech/audio (e.g., technology duplicating and responding to the human voice), reinforcement learning (e.g., sequential decision making), ML infrastructure, or specialization in another ML field.

Other signals

  • TPU performance
  • ML training and serving benchmarks
  • compiler/runtime optimizations
  • PyTorch and JAX