Technical Program Manager III, GPU Infrastructure Reliability, Google Cloud

Google · Big Tech · Sunnyvale, CA +2

This role is for a Technical Program Manager III focused on GPU Infrastructure Reliability within Google Cloud. The primary responsibility is to lead the end-to-end development, project planning, and delivery of next-generation Cloud GPU products, including software qualification and release strategies for AI hypercompute clusters. The role involves managing escalations, mitigating risks, and coordinating cross-functional initiatives related to AI infrastructure customer onboarding and production support. It also includes participating in the development of core management software, monitoring, and diagnostic tooling for scalable Cloud ML solutions. While the role supports AI products, it is not directly building or researching AI models but rather managing the infrastructure that enables them.

What you'd actually do

  1. Lead the end-to-end development, project planning, and delivery of next-gen AI Infra GPU products from concept to production.
  2. Lead software qualifications, release strategy, and test infrastructure management for AI hypercompute clusters.
  3. Manage escalations and critical incidents while proactively identifying and mitigating risks that could impact project success.
  4. Coordinate with TPMs in AI2 (e.g., ACI, Platforms, and CSCO) and ACI leadership on cross-functional initiatives related to AI Infra customer onboarding and production support.
  5. Participate in the development of core management software, monitoring, and diagnostic tooling for scalable Cloud ML solutions.

Skills

Required

  • Degree in a technical field, or equivalent practical experience
  • 5 years of experience in program management
  • Experience with infrastructure reliability
  • Experience with GPUs or GPU systems

Nice to have

  • 5 years of experience managing cross-functional or cross-team projects
  • 5 years of experience in technical program management, with a focus on software engineering and ML infrastructure projects
  • Knowledge of software development, distributed systems, and ML infrastructure or GPU systems
  • Ability to think critically and solve problems
  • Excellent project management skills and experience with project planning, execution, and risk management
  • Excellent communication and collaboration skills, with the ability to build relationships and influence across all levels of the organization

What the JD emphasized

  • end-to-end development and delivery of next-generation Cloud GPU products
  • software qualification and release strategies for AI hypercompute clusters
  • ML workload monitoring and diagnostic tooling