Technical Program Manager, Infrastructure

Anthropic · AI Frontier · San Francisco, CA · Technical Program Management

This Technical Program Manager role sits in Anthropic's Infrastructure organization and coordinates complex programs across developer productivity, tooling, reliability, and operations for AI systems. It involves driving strategic initiatives, improving developer workflows, ensuring system reliability, and bridging communication between research, engineering, and product teams.

What you'd actually do

  1. Drive cross-functional programs to improve developer environments, CI/CD infrastructure, and release processes that enable rapid innovation while maintaining high security standards
  2. Drive programs to establish and achieve reliability targets across training infrastructure and production services
  3. Serve as the critical bridge between infrastructure teams, research, and product, translating technical complexities into clear updates for a variety of audiences

Skills

Required

  • 5+ years of technical program management experience
  • Deep technical understanding of infrastructure systems
  • Stakeholder management skills
  • Experience with developer productivity initiatives, CI/CD systems, or infrastructure scaling
  • Experience with Kubernetes, cloud platforms (AWS, GCP, Azure), and ML infrastructure (GPU/TPU/Trainium clusters)
  • Background working with research teams
  • Experience driving adoption of AI tools to improve engineering productivity
  • Familiarity with observability tooling and practices

Nice to have

  • Thrives in ambiguity
  • Makes everyone around them more effective
  • Passion for supporting internal partners like research
  • Passion for AI infrastructure
  • Understanding of the unique challenges of building and operating systems at frontier scale

What the JD emphasized

  • Track record of successfully delivering complex infrastructure programs in ML/AI systems or large-scale distributed systems
  • Deep technical understanding of infrastructure systems
  • Excel at creating structure and processes in ambiguous environments
  • Comfortable navigating competing priorities and using data to drive technical decisions
  • Experience with Kubernetes, cloud platforms (AWS, GCP, Azure), and ML infrastructure (GPU/TPU/Trainium clusters)
  • Background working with research teams and translating their needs into concrete technical requirements
  • Obsessed with reliability, scalability, security, and continuous improvement

Other signals

  • scaling challenges
  • frontier models
  • production infrastructure
  • developer platforms
  • ML/AI systems
  • large-scale distributed systems
  • infrastructure scaling
  • ML infrastructure