Staff Software Engineer, AI/ML Performance

Google · Big Tech · Sunnyvale, CA +1

Staff Software Engineer focused on optimizing AI/ML training and serving workloads on TPUs. The role involves identifying performance opportunities and driving optimizations through custom kernels, compiler/runtime improvements, and algorithmic innovation. It also includes co-designing TPU-friendly models and working with frontier labs, hyperscalers, and foundation model builders.

What you'd actually do

  1. Identify and maintain ML training and serving benchmarks.
  2. Achieve state-of-the-art performance for customer launches and, for third-party (3P) and OSS models, for competitive benchmark submissions (MLCommons, InferenceX, etc.).
  3. Use the benchmarks to identify performance opportunities and directly drive both near-term SOTA (e.g. custom kernels) and out-of-the-box performance (e.g. compiler/runtime optimizations, agentic tooling, auto-sharding) in collaboration with partner teams.
  4. Participate in algorithmic innovation, exploiting new TPU hardware features and model-preserving optimizations (e.g. speculative decoding, sparsity, quantization, LoRA).
  5. Participate in co-designing TPU-friendly models that showcase model quality at the performance of OSS models, which are typically designed on GPUs.

Skills

Required

  • programming in C++ or Python
  • testing and launching software products
  • performance, large-scale systems data analysis, visualization tools, or debugging
  • software design and architecture

Nice to have

  • compiler optimization
  • code generation
  • runtime systems for popular accelerators
  • modern GPU, TPU, or other ML accelerator architectures
  • memory hierarchies
  • performance bottlenecks
  • tailoring algorithms and ML models to exploit ML accelerator architecture strengths and minimize weaknesses

What the JD emphasized

  • bleeding edge performance
  • maximum efficiency
  • state-of-the-art performance
  • custom kernels
  • compiler/runtime optimizations
  • agentic tooling
  • algorithmic innovation
  • model-preserving optimizations
  • speculative decoding
  • sparsity
  • quantization
  • LoRA

Other signals

  • Optimizing AI/ML training and serving workloads
  • Extracting maximum efficiency for AI/ML workloads
  • Driving optimizations for Cloud TPU and on-prem TPU customers
  • Leveraging custom kernels, compiler optimizations, quantization, sparsity, and agentic tooling