Engineering Manager, AI Reliability Engineering

Anthropic · AI Frontier · AI Research & Engineering

Engineering Manager for AI Reliability Engineering at Anthropic, leading a team that defines and achieves reliability metrics for internal and external AI products and services, including LLM serving and training systems. The role involves driving SLOs, overseeing monitoring, architecting high-availability infrastructure, leading incident response, and optimizing AI infrastructure costs.

What you'd actually do

  1. Lead and grow a team of reliability engineers responsible for large language model serving and training systems
  2. Drive the development of service level objectives (SLOs) that balance availability/latency with development velocity across the organization
  3. Oversee the design and implementation of comprehensive monitoring systems for availability, latency and other critical metrics
  4. Guide your team in architecting high-availability language model serving infrastructure capable of supporting millions of external customers and high-traffic internal workloads
  5. Lead the strategy for automated failover and recovery systems across multiple regions and cloud providers
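The SLO work in item 2 can be sketched in miniature: an availability target implies a concrete error budget that teams spend against, which is the mechanism that balances reliability with development velocity. A minimal illustration (the figures and function names are illustrative, not from the posting):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)

def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means the SLO is breached)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget

# A 99.9% availability SLO allows roughly 43.2 minutes of downtime per 30 days.
print(round(error_budget_minutes(0.999), 1))
```

When the remaining budget runs low, a team following this model slows feature rollout; when budget is plentiful, it can ship faster — which is how an SLO trades availability against velocity in practice.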

Skills

Required

  • Experience managing and scaling reliability or infrastructure engineering teams
  • Deep technical knowledge of distributed systems observability and monitoring at scale
  • Experience operating AI infrastructure
  • Experience implementing SLO/SLA frameworks
  • Familiarity with both traditional infrastructure metrics and AI-specific performance indicators
  • Strength in leading technical discussions
  • Strong hiring and talent development capabilities

Nice to have

  • Experience managing teams that operate large-scale model training or serving infrastructure (>1000 GPUs)
  • Hands-on experience with ML hardware accelerators (GPUs, TPUs, Trainium, etc.)
  • Familiarity with ML-specific networking optimizations
  • Experience leading teams through major reliability transformations or infrastructure migrations
  • Experience building reliability engineering practices from the ground up
  • Contributions to, or leadership of, open-source infrastructure or ML tooling initiatives
  • Thought leadership in the reliability engineering community

What the JD emphasized

  • large language model serving and training systems
  • high-availability language model serving infrastructure
  • large-scale model training or serving infrastructure
  • ML hardware accelerators
