Technical Program Manager, Compute

Anthropic · AI Frontier · San Francisco, CA · Technical Program Management

The Technical Program Manager on the Compute team plans, coordinates, and executes programs for Anthropic's compute infrastructure at scale. The role manages the compute lifecycle from procurement to utilization and partners with engineering and research teams to keep the compute fleet, which supports model training, evaluation, and inference, operating efficiently.

What you'd actually do

  1. Own and drive critical programs across the compute lifecycle, coordinating execution across multiple engineering, research, and operations teams
  2. Build and maintain operational visibility into the compute fleet, ensuring the organization has a clear picture of supply, demand, utilization, and health
  3. Lead cross-functional coordination for compute transitions: bringing new capacity online, migrating workloads, and managing decommissions across cloud providers and hardware platforms
  4. Partner with engineering and research leadership to navigate competing priorities and drive alignment on how compute resources are planned, allocated, and used
  5. Identify and close operational gaps across the compute pipeline, whether through new tooling, improved processes, or better cross-team communication

Skills

Required

  • Technical program management
  • Infrastructure
  • Platform engineering
  • Compute-intensive environments
  • Cross-functional program leadership
  • Cloud infrastructure
  • Cluster management
  • Job scheduling
  • Resource orchestration
  • Communication skills
  • Building trust with engineering teams

Nice to have

  • Managing compute capacity across multiple cloud providers
  • Job scheduling systems
  • Resource orchestration systems
  • Workload management systems
  • GPU or accelerator infrastructure
  • ML training and inference workloads
  • Observability for infrastructure systems
  • Capacity planning
  • Demand forecasting
  • Cost modeling
  • Hardware lifecycle management
  • Scaling through hypergrowth in AI/ML, HPC, or large-scale cloud environments

What the JD emphasized

  • 7+ years of technical program management experience in infrastructure, platform engineering, or compute-intensive environments
  • Experience working with research or ML teams
  • Experience with GPU or accelerator infrastructure, including the unique challenges of large-scale ML training and inference workloads