Engineering Manager, Capacity

Anthropic · AI Frontier · Compute

Engineering Manager for Anthropic's Capacity team, responsible for managing cloud spend across a multi-cloud environment. The role involves designing and delivering capacity management systems for AI workloads, forecasting infrastructure needs, identifying efficiency opportunities, and partnering with Finance and leadership on strategic decisions. It requires experience managing significant infrastructure spend and building scalable capacity management systems; familiarity with LLMs and AI research/training workloads is a plus.

What you'd actually do

  1. Design, develop, and deliver capacity management systems for AI workloads on heterogeneous infrastructure
  2. Build and maintain robust usage attribution and enable in-depth, actionable, data-driven insights
  3. Build a deep understanding of research and training workloads to accurately forecast infrastructure needs
  4. Oversee design and implementation of forecasting tools and software systems for managing billions of dollars in spend
  5. Proactively identify efficiency opportunities and collaborate with teams across the org to increase effective capacity for Anthropic

Skills

Required

  • Experience managing $XXXM to $XB in infrastructure spend
  • Experience working with public clouds (AWS, GCP, Azure, etc.) and/or hybrid on-prem/cloud environments
  • Experience setting up capacity management systems that scale with growing organizations
  • Comfortable leveraging data, with experience building observability for complex systems
  • Strong interpersonal skills that enable you to influence and build cross-organizational support for capacity initiatives

Nice to have

  • Familiarity with LLMs and a deep interest in learning more about research and model training workloads
  • Past experience managing capacity for AI research and production workloads
  • Past experience partnering with senior leadership, both technical and non-technical, to drive company-level reporting and decision making

What the JD emphasized

  • manage $XXXM to $XB in infrastructure spend
  • capacity management systems that scale
  • building observability for complex systems
  • Past experience managing capacity for AI research and production workloads