Principal Technical Program Manager - AI Infrastructure

Microsoft · Big Tech · Redmond, WA +2 · Technical Program Management

This role focuses on managing the delivery of AI infrastructure platforms, emphasizing performance, scalability, and cost efficiency for frontier AI workloads. The Principal Technical Program Manager will own end-to-end delivery from development through production readiness, driving execution across complex, cross-functional programs involving hardware and software stacks. Key responsibilities include integrated planning, dependency management, risk mitigation, and ensuring platform readiness for AI training and inference.

What you'd actually do

  1. Own end-to-end delivery from development through production readiness, including integrated planning across the software stack
  2. Drive execution by managing dependencies, risks, and cross-team tradeoffs to keep delivery on track
  3. Ensure platform and performance readiness (bring-up, key workloads, benchmarking, optimization)
  4. Establish strong operating rhythm (reporting, alignment, and clear escalation paths) while improving tools and processes to increase predictability
  5. Identify systemic gaps and act as the bridge across infrastructure, research, and product, driving alignment and translating complexity into clear, actionable updates

Skills

Required

  • Bachelor's Degree AND 8+ years of experience in engineering, product/technical program management, data analysis, or product development OR equivalent experience.
  • 6+ years of experience managing cross-functional and/or cross-team projects.
  • Ability to meet Microsoft, customer, and/or government security screening requirements.

Nice to have

  • Experience driving platform bring-up, system integration, or hardware/software co-development programs, including managing dependencies across the full platform stack and coordinating engineering stakeholders during early development phases.
  • Strong communication and collaboration skills, with experience partnering across hardware, software, and external vendor teams, and the ability to influence technical discussions involving architecture, trade-offs, system-level challenges, AI training/inference workloads, and performance optimization.

What the JD emphasized

  • platform readiness
  • performance readiness
  • AI training/inference workloads

Other signals

  • AI infrastructure platforms
  • frontier AI workloads
  • performance, scalability, and cost efficiency
  • cross-stack software delivery
  • system bring-up, validation, and production deployment
  • novel scaling challenges