Engineering Manager - AI Reliability

Anthropic · AI Frontier · AI Research & Engineering

Engineering Manager for AI Reliability at Anthropic, leading a team focused on defining and achieving reliability metrics for large language model serving systems. The role oversees monitoring, high-availability infrastructure, incident response, and cost optimization for large-scale AI infrastructure, while also pioneering the use of AI for reliability engineering.

What you'd actually do

  1. Lead and grow a team of reliability engineers responsible for large language model serving.
  2. Drive the development of Service Level Objectives (SLOs) that balance availability and latency with development velocity across the organization.
  3. Oversee the design and implementation of comprehensive monitoring systems for availability, latency, and other critical metrics.
  4. Guide your team in architecting high-availability language model serving infrastructure capable of supporting millions of external customers and high-traffic internal workloads.
  5. Lead the strategy for automated failover and recovery systems across multiple regions and cloud providers.
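To make the SLO responsibility above concrete, here is a minimal sketch of how an availability SLO translates into an error budget and a burn rate. All numbers, function names, and targets are illustrative assumptions, not Anthropic's actual SLOs:

```python
# Illustrative SLO / error-budget arithmetic (hypothetical targets).

def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime (in minutes) over the window for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)

def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed; 1.0 means exactly on budget."""
    allowed_error_ratio = 1.0 - slo_target
    return observed_error_ratio / allowed_error_ratio

# A 99.9% availability SLO over 30 days allows ~43.2 minutes of downtime.
budget = error_budget_minutes(0.999)

# If 0.5% of requests are currently failing, the budget burns ~5x too fast,
# which is the kind of signal multi-window burn-rate alerts act on.
rate = burn_rate(0.005, 0.999)
```

This is the standard error-budget framing: the SLO target fixes the budget, and burn-rate alerting (rather than raw error counts) is what lets a team trade availability against development velocity.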

Skills

Required

  • experience managing and scaling reliability or infrastructure engineering teams
  • deep technical knowledge of distributed-systems observability and monitoring at scale
  • understanding of the unique challenges of operating AI infrastructure
  • a track record of successfully implementing SLO/SLA frameworks
  • comfort leading technical discussions and translating between ML engineers and infrastructure teams
  • excellent leadership and communication skills, with the ability to influence at all levels
  • strong hiring and talent development capabilities

Nice to have

  • experience managing teams that operate large-scale model training or serving infrastructure (>1000 GPUs)
  • hands-on experience with ML hardware accelerators (GPUs, TPUs, Trainium, etc.)
  • understanding of ML-specific networking optimizations and their operational implications
  • experience leading teams through major reliability transformations or infrastructure migrations
  • experience building reliability engineering practices from the ground up
  • contributions to, or leadership of, open-source infrastructure or ML tooling initiatives
  • demonstrated thought leadership in the reliability engineering community

What the JD emphasized

  • reliability metrics for Anthropic's critical serving systems
  • large language model serving
  • high-availability language model serving infrastructure
  • critical AI services
  • large-scale AI infrastructure
  • operating AI infrastructure
  • AI-specific performance indicators
  • ML hardware accelerators (GPUs, TPUs, Trainium, etc.)
  • large-scale model training or serving infrastructure (>1000 GPUs)

Other signals

  • leading reliability engineering teams
  • defining and achieving reliability metrics for AI serving systems
  • pioneering the use of modern AI capabilities to reengineer reliability engineering itself