Senior Engineering Manager, AI Inference Platform, Distributed Cloud

Google · Big Tech · Sunnyvale, CA +1

Senior Engineering Manager for the AI Inference Platform, Distributed Cloud. The role focuses on architecting and optimizing the serving stack for models like Gemini in an on-prem cloud environment, improving speed, efficiency, and cost-effectiveness. Responsibilities include leading a team, defining the technical vision for the LLM serving stack, overseeing performance analysis and benchmarking, and driving the design and implementation of advanced serving architectures.

What you'd actually do

  1. Lead, mentor, and grow a high-performing team of systems and ML engineers. Drive a culture of excellence, psychological safety, and continuous learning while guiding career paths and OKRs.
  2. Define the technical vision and strategy for enhancing the LLM serving stack, focusing on performance, scalability, and resource efficiency.
  3. Oversee the infrastructure and tooling for in-depth performance analysis, profiling, and benchmarking of LLMs on GPU accelerators.
  4. Partner closely with Research, SRE, Product, and core library teams to optimize and deploy LLMs globally.
  5. Drive the design, implementation, and optimization of advanced serving architectures (including disaggregated serving) while collaborating with core library and kernel partners to eliminate low-level performance bottlenecks, maximize resource utilization, and minimize latency.

Skills

Required

  • C++ or Python programming
  • optimizing, profiling, and scaling production-grade systems on GPU accelerators or specialized AI hardware
  • people management or team leadership
  • managing engineering organizations across multi-team infrastructure dependencies

Nice to have

  • Master’s degree or PhD in Engineering, Computer Science, or a related technical field
  • working in a complex, matrixed organization
  • implementing advanced LLM serving architectures and optimization techniques (e.g., disaggregated serving, continuous batching, specialized compiler technologies)
  • utilizing deep-dive ML profiling tools (e.g., Nsight, xprof) to troubleshoot and resolve low-level bottlenecks within major frameworks (e.g., JAX, PyTorch, TensorFlow)

What the JD emphasized

  • optimizing the serving stack
  • improve speed and efficiency
  • run faster and more cost-effectively
  • performance analysis, profiling, and benchmarking
  • optimize and deploy LLMs globally
  • optimization of advanced serving architectures
  • eliminate low-level performance bottlenecks
  • maximize resource utilization
  • minimize latency
  • optimizing, profiling, and scaling production-grade systems on GPU accelerators
  • implementing advanced LLM serving architectures and optimization techniques
  • deep-dive ML profiling tools

Other signals

  • LLM serving stack optimization
  • performance profiling
  • GPU accelerators
  • disaggregated serving