Technical Program Manager III, ML Infrastructure Resource Management, Google Cloud

Google · Big Tech · Sunnyvale, CA +2

This Technical Program Manager III role sits in ML Infrastructure Resource Management within Google Cloud. It focuses on managing Google's global ML accelerator fleet, specifically the LLM Serving pool, and acting as a PARM partner for client groups. The core mission is to plan, deploy, and optimize TPU and GPU capacity in line with strategic priorities. The role requires collaborating with SWE and SRE teams, owning operational execution of capacity allocations, driving tool and process optimizations, and using data analysis to improve efficiency. A technical or engineering background is critical, and experience deploying large-scale ML models is preferred.

What you'd actually do

  1. Act as a trusted advisor to Product Area partners, understanding their TPU/GPU requirements and delivering a guided, seamless resource management experience.
  2. Collaborate closely with Software Engineering (SWE) and Site Reliability Engineering (SRE) teams to uncover, analyze, and execute on efficiency opportunities across our managed resource footprints.
  3. Own the operational execution of capacity allocations and allied workflows using core Google tooling. A technical or engineering background is critical to successfully navigating this significant operational component.
  4. Partner cross-functionally to drive tool and process optimizations. Leverage strong data analysis skills to convert fleet metrics into actionable business value and automated scalability.
  5. Utilize an understanding of ML fundamentals to inform resourcing decisions, with a preference for practical experience in deploying large-scale ML models.

Skills

Required

  • program management
  • infrastructure resource management
  • infrastructure capacity planning
  • data analytics tools
  • SQL
  • Python
  • Databases
  • programming languages

Nice to have

  • cross-functional project management
  • large scale distributed infrastructure
  • deploying large language models
  • distributed machine learning
  • supply chain management
  • data center capacity planning
  • compute/storage infrastructure

What the JD emphasized

  • technical or engineering background is critical
  • Experience in infrastructure resource management or infrastructure capacity planning
  • Experience with deploying large language models or distributed machine learning