Staff Software Engineer, AI Reliability Engineering

Anthropic · AI Frontier · AI Research & Engineering

Staff Software Engineer focused on AI Reliability Engineering, responsible for defining and achieving reliability metrics for Anthropic's AI systems, including LLM serving and training infrastructure. The role spans monitoring design, high-availability serving systems, automated failover and recovery, incident response, and cost optimization for large-scale AI infrastructure.

What you'd actually do

  1. Develop appropriate Service Level Objectives (SLOs) for large language model serving and training systems, balancing availability and latency against development velocity (see the SLO error-budget sketch after this list)
  2. Design and implement monitoring systems that track availability, latency, and other salient metrics
  3. Assist in the design and implementation of high-availability language model serving infrastructure that can handle millions of external customers and high-traffic internal workloads
  4. Develop and manage automated failover and recovery systems for model serving deployments across multiple regions and cloud providers (see the failover sketch after this list)
  5. Lead incident response for critical AI services, ensuring rapid recovery and systematic improvements from each incident
  6. Build and maintain cost optimization systems for large-scale AI infrastructure, focusing on accelerator (GPU/TPU/Trainium) utilization and efficiency (see the cost-per-token sketch after this list)
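
Item 1 in practice: below is a minimal sketch of SLO error-budget accounting in Python. Everything here is an illustrative assumption, not an actual Anthropic objective: the Slo dataclass, the 99.9%/28-day target, and the event counts are all made up.

```python
from dataclasses import dataclass

@dataclass
class Slo:
    """An availability SLO over a rolling window."""
    name: str
    target: float      # e.g. 0.999 means 99.9% of requests must succeed
    window_days: int   # rolling evaluation window

def error_budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent; <= 0.0 means the SLO is violated."""
    if total_events == 0:
        return 1.0
    allowed_failures = (1.0 - slo.target) * total_events
    actual_failures = total_events - good_events
    if allowed_failures == 0:
        return 1.0 if actual_failures == 0 else 0.0
    return 1.0 - actual_failures / allowed_failures

# Illustrative target and counts only.
serving_slo = Slo(name="inference-availability", target=0.999, window_days=28)
print(error_budget_remaining(serving_slo, good_events=9_995_000, total_events=10_000_000))
# -> 0.5: half the 28-day budget has been burned.
```

Burn-rate alerting (paging when the budget is being consumed faster than the window allows) is the usual layer built on top of a calculation like this, and is what ties items 1 and 2 together.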
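
Item 4's core decision, sketched under the assumption of a sticky-primary policy: keep traffic in the current region unless it is unhealthy, then fail over to the first healthy alternative. The region names and the is_healthy probe are hypothetical placeholders; a production system would add hysteresis, capacity checks, and gradual traffic shifting.

```python
from typing import Callable

# Hypothetical region names; list ordering encodes failover priority.
REGIONS = ["us-east", "us-west", "eu-west"]

def pick_serving_region(
    regions: list[str],
    is_healthy: Callable[[str], bool],
    current: str,
) -> str:
    """Sticky-primary failover: keep traffic where it is unless the region is unhealthy."""
    if is_healthy(current):
        return current
    for region in regions:
        if region != current and is_healthy(region):
            return region
    raise RuntimeError("no healthy region; escalate to incident response")

# Example probe: everything but us-east is up.
print(pick_serving_region(REGIONS, is_healthy=lambda r: r != "us-east", current="us-east"))
# -> "us-west"
```

Keeping the primary sticky avoids flapping between regions when health signals are noisy.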
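
And for item 6, a back-of-the-envelope sketch of why accelerator utilization dominates serving cost: idle chip-hours are paid for but produce no tokens, so cost per token scales with 1/utilization. All prices and throughput figures are invented for illustration.

```python
def cost_per_million_tokens(
    hourly_chip_cost: float,    # $/hour for one accelerator chip
    chips: int,
    tokens_per_second: float,   # aggregate throughput while busy
    utilization: float,         # fraction of wall-clock time doing useful work
) -> float:
    """Effective serving cost per million tokens; idle time is paid for but yields nothing."""
    effective_tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_chip_cost * chips / effective_tokens_per_hour * 1_000_000

# Illustrative numbers only: 8 chips at $2/hr, 10k tokens/s when busy, 50% utilization.
print(cost_per_million_tokens(2.0, 8, 10_000.0, 0.5))  # ~0.89 dollars per million tokens
```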

Skills

Required

  • distributed systems observability and monitoring at scale
  • operating AI infrastructure, including model serving, batch inference, and training pipelines
  • implementing and maintaining SLO/SLA frameworks for business-critical services
  • defining and tracking both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence)
  • chaos engineering and systematic resilience testing (see the fault-injection sketch after this list)
  • bridging the gap between ML engineers and infrastructure teams
  • communication skills
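
A sketch of what the chaos-engineering bullet above can look like in code: probabilistic fault injection wrapped around a call site, used to verify that retries, timeouts, and failover actually engage. call_model_backend is a hypothetical stand-in for a real serving RPC; anything like this belongs in staging or behind an experiment flag, never in the production path by default.

```python
import random
import time
from functools import wraps

def inject_faults(error_rate: float = 0.01, max_extra_latency_s: float = 0.2):
    """Probabilistically inject timeouts and latency at a call site (staging only)."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < error_rate:
                raise TimeoutError("chaos: injected upstream timeout")
            time.sleep(random.uniform(0.0, max_extra_latency_s))  # injected jitter
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(error_rate=0.05)
def call_model_backend(prompt: str) -> str:
    # Hypothetical stand-in for the real serving RPC.
    return "ok"
```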

Nice to have

  • operating large-scale model training infrastructure or serving infrastructure (>1000 GPUs)
  • ML hardware accelerators (e.g., GPUs, TPUs, Trainium)
  • ML-specific networking optimizations like RDMA and InfiniBand
  • AI-specific observability tools and frameworks
  • ML model deployment strategies and their reliability implications
  • contributions to open-source infrastructure or ML tooling

What the JD emphasized

  • large language model serving and training systems
  • high-availability language model serving infrastructure
  • accelerator (GPU/TPU/Trainium) utilization and efficiency

Other signals

  • reliability metrics for AI systems
  • incident response for critical AI services
  • cost optimization for AI infrastructure