Principal Capacity Engineer, Compute

Anthropic · AI Frontier · Compute

This role focuses on capacity engineering for AI workloads: planning, forecasting, and optimizing a global infrastructure fleet. The engineer will design and deliver capacity management systems, build usage attribution, oversee planning tools and guardrails, model costs for research and training workloads, identify efficiency opportunities, and partner with Finance and leadership on strategic decisions. Experience with AI workload capacity, cross-functional projects, LLMs, and observability is preferred.

What you'd actually do

  1. Design, develop, and deliver capacity management systems for AI workloads on heterogeneous infrastructure
  2. Build and maintain robust attribution of usage and enable in-depth data-driven insights
  3. Oversee design and implementation of planning tools and systems-level guardrails for capacity planning and quota management
  4. Build a deep understanding of research and training workloads to accurately model cost-to-serve and cost-to-train
  5. Proactively identify efficiency opportunities and collaborate with teams across the organization to increase total effective compute for Anthropic

Skills

Required

  • capacity planning
  • forecasting
  • infrastructure optimization
  • data analysis
  • cross-functional project management
  • stakeholder management
  • observability
  • cost modeling

Nice to have

  • experience with AI workloads
  • experience at a major cloud provider or hyperscaler
  • experience with LLMs
  • interest in model training and serving efficiency
  • experience building observability for complex systems
  • interpersonal skills
  • influence without authority
  • past experience as a lead capacity engineer
  • past experience partnering with senior leadership
  • past experience working on model training or model inference

What the JD emphasized

  • capacity management for AI workloads (preferred)
  • experience working with LLMs and/or a deep interest in learning about model training and serving efficiency
  • building observability for complex systems
  • model training or model inference

Other signals

  • capacity planning for AI workloads
  • optimizing global infrastructure fleet
  • scalable systems for capacity management
  • high-quality data and insights for planning
  • engineering roadmaps that deliver efficiency wins
  • increase total effective compute