Staff Software Engineer, AI/ML Performance

Google · Big Tech · Sunnyvale, CA (+1 location)

Staff Software Engineer focused on optimizing AI/ML training and serving workloads on TPUs. The role involves identifying performance bottlenecks, driving optimizations through custom kernels and compiler/runtime improvements, and collaborating with partner teams to achieve state-of-the-art performance for foundation model builders and hyperscalers. It also involves algorithmic innovation and co-designing TPU-friendly models.

What you'd actually do

  1. Identify and maintain ML training and serving benchmarks.
  2. Achieve state-of-the-art performance for customer launches and, in the case of third-party/OSS models, for competitive benchmark submissions (MLCommons, InferenceX, etc.).
  3. Use these benchmarks to identify performance opportunities and directly drive both near-term SOTA gains (e.g., custom kernels) and out-of-the-box performance (e.g., compiler/runtime optimizations, agentic tooling, auto-sharding) in collaboration with partner teams.
  4. Participate in algorithmic innovation, exploiting new TPU hardware features and model-preserving optimizations (e.g. speculative decoding, sparsity, quantization, LoRA, etc.).
  5. Participate in co-designing TPU-friendly models that showcase model quality at the performance level of OSS models typically designed for GPUs.
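As a minimal illustration of one model-preserving optimization named in point 4, a symmetric per-tensor int8 quantization round-trip can be sketched as follows (a NumPy sketch for clarity only; production TPU paths would quantize on-device with hardware-aware scales):

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor int8 quantization: w ~= scale * q."""
    scale = float(np.max(np.abs(w))) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from int8 values."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.02], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)  # close to w, stored at 1/4 the bytes
```

The point of such "model-preserving" techniques is that the quantized model's outputs stay numerically close to the original while memory traffic and compute cost drop, which is exactly the trade-off this role would be measuring with the benchmarks above.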

Skills

Required

  • C++ or Python programming
  • Testing and launching software products
  • Performance, large-scale systems data analysis, visualization tools, or debugging
  • Software design and architecture

Nice to have

  • Master’s degree or PhD in Engineering, Computer Science, or a related technical field
  • Data structures and algorithms
  • Technical leadership experience leading project teams and setting technical direction
  • Compiler optimization, code generation, and runtime systems for popular accelerators
  • Understanding of modern GPU, TPU, or other ML accelerator architectures, memory hierarchies, and performance bottlenecks
  • Tailoring algorithms and ML models to exploit ML accelerator architecture strengths and minimize weaknesses

What the JD emphasized

  • bleeding edge performance
  • maximum efficiency
  • major frontier lab hyperscalers
  • foundation model builders
  • state-of-the-art performance
  • custom kernels
  • compiler/runtime optimizations
  • agentic tooling
  • auto-sharding
  • algorithmic innovation
  • speculative decoding
  • sparsity
  • quantization
  • LoRA

Other signals

  • performance optimization
  • ML training and serving
  • TPU
  • compiler optimizations
  • quantization
  • custom kernels