Senior Engineering Manager, AI Inference Platform, Distributed Cloud

Google · Big Tech · Sunnyvale, CA +1

Senior Engineering Manager for AI Inference Platform, Distributed Cloud. The role focuses on architecting and optimizing the serving stack for models like Gemini in an on-prem cloud environment to improve speed, efficiency, and cost-effectiveness. Responsibilities include leading a team, defining the technical vision for the LLM serving stack, overseeing performance analysis and benchmarking, and driving the design and implementation of advanced serving architectures.

What you'd actually do

Lead, mentor, and grow a high-performing team of systems and ML engineers. Drive a culture of excellence, psychological safety, and continuous learning while guiding career paths and OKRs.
Define the technical vision and strategy for enhancing the LLM serving stack, focusing on performance, scalability, and resource efficiency.
Oversee the infrastructure and tooling for in-depth performance analysis, profiling, and benchmarking of LLMs on GPU accelerators.
Partner closely with Research, SRE, Product, and core library teams to optimize and deploy LLMs globally.
Drive the design, implementation, and optimization of advanced serving architectures—including disaggregated serving—while collaborating with core library and kernel partners to eliminate low-level performance bottlenecks, maximize resource utilization, and minimize latency.

Skills

Required

C++ or Python programming
optimizing, profiling, and scaling production-grade systems on GPU accelerators or specialized AI hardware
people management or team leadership
managing engineering organizations across multi-team infrastructure dependencies

Nice to have

Master’s degree or PhD in Engineering, Computer Science, or a related technical field
working in a complex, matrixed organization
implementing advanced LLM serving architectures and optimization techniques (e.g., disaggregated serving, continuous batching, specialized compiler technologies)
utilizing deep-dive ML profiling tools (e.g., Nsight, xprof) to troubleshoot and resolve low-level bottlenecks within major frameworks (e.g., JAX, PyTorch, TensorFlow)

What the JD emphasized

optimizing the serving stack
improve speed and efficiency
run faster and more cost-effectively
performance analysis, profiling, and benchmarking
optimize and deploy LLMs globally
optimization of advanced serving architectures
eliminate low-level performance bottlenecks
maximize resource utilization
minimize latency
optimizing, profiling, and scaling production-grade systems on GPU accelerators
implementing advanced LLM serving architectures and optimization techniques
deep-dive ML profiling tools

Other signals

LLM serving stack optimization
performance profiling
GPU accelerators
disaggregated serving

In this role, you will be pivotal in architecting and optimizing the serving stack for models like Gemini in an on-prem cloud environment, addressing exciting challenges to improve speed and efficiency. This is a unique opportunity to go deep, leading system-level design and performance profiling, ensuring Google's LLMs run faster and more cost-effectively than ever before.

Google Cloud accelerates every organization’s ability to digitally transform its business and industry. We deliver enterprise-grade solutions that leverage Google’s cutting-edge technology, and tools that help developers build more sustainably. Customers in more than 200 countries and territories turn to Google Cloud as their trusted partner to enable growth and solve their most critical business problems.

Individual pay is determined by factors including job-related skills, experience, and relevant education or training.

US: $262,000 - $365,000 (USD) + 25% bonus target + equity + benefits

Qualifications

Minimum qualifications:

Bachelor's degree or equivalent practical experience.
8 years of experience programming in C++ or Python.
7 years of experience optimizing, profiling, and scaling production-grade systems on GPU accelerators or specialized AI hardware.
5 years of experience directly managing and leading engineering teams focused on machine learning infrastructure, AI platforms, or high-performance distributed computing systems.
5 years of experience in a people management or team leadership role.
4 years of experience managing engineering organizations across multi-team infrastructure dependencies.

Preferred qualifications:

Master’s degree or PhD in Engineering, Computer Science, or a related technical field.
5 years of experience working in a complex, matrixed organization.
5 years of experience implementing advanced LLM serving architectures and optimization techniques, such as disaggregated serving, continuous batching, or specialized compiler technologies (e.g., XLA).
4 years of experience utilizing deep-dive ML profiling tools (e.g., Nsight, xprof) to troubleshoot and resolve low-level bottlenecks within major frameworks like JAX, PyTorch, or TensorFlow.