Staff Software Engineer, AI Reliability Engineering

Anthropic · AI Frontier · AI Research & Engineering

Staff Software Engineer focused on AI Reliability Engineering, responsible for defining and meeting reliability targets for Anthropic's AI products and services, including LLM serving and training systems. The role spans designing monitoring, building high-availability infrastructure, leading incident response, and optimizing costs for large-scale AI infrastructure.

What you'd actually do

  1. Develop appropriate Service Level Objectives for large language model serving and training systems, balancing availability/latency with development velocity
  2. Design and implement monitoring systems covering availability, latency, and other salient metrics
  3. Assist in the design and implementation of high-availability language model serving infrastructure capable of handling the needs of millions of external customers and high-traffic internal workloads
  4. Develop and manage automated failover and recovery systems for model serving deployments across multiple regions and cloud providers
  5. Lead incident response for critical AI services, ensuring rapid recovery and systematic improvements from each incident
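Responsibility 1 above centers on SLOs and the availability/velocity trade-off. As a rough illustration of the arithmetic behind an availability SLO and its error budget (all targets and request counts here are hypothetical, not Anthropic's), a burn-rate check can be sketched as:

```python
# Hypothetical sketch: availability SLO, error budget, and burn rate.
# All numbers are illustrative.

SLO_TARGET = 0.999  # 99.9% of requests must succeed within the window


def error_budget_fraction(slo_target: float) -> float:
    """Fraction of requests allowed to fail within the SLO window."""
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """How fast the error budget is being consumed; 1.0 = exactly on budget."""
    observed_error_ratio = failed / total
    return observed_error_ratio / error_budget_fraction(slo_target)


# Example: 5,000 failed requests out of 2,000,000 served.
rate = burn_rate(failed=5_000, total=2_000_000, slo_target=SLO_TARGET)
# rate is about 2.5: the budget is being burned 2.5x faster than sustainable,
# which would typically page the on-call well before the window is exhausted.
```

Multi-window burn-rate alerts built on exactly this ratio are a common way to balance fast detection against alert noise.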

Skills

Required

  • distributed systems observability and monitoring at scale
  • operating AI infrastructure
  • SLO/SLA frameworks for business-critical services
  • traditional metrics (latency, availability)
  • AI-specific metrics (model performance, training convergence)
  • chaos engineering and systematic resilience testing
  • strong communication skills
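The chaos-engineering bullet above refers to deliberately injecting failures and verifying that the system degrades and recovers as designed. A toy sketch of that loop, using a hypothetical replicated service rather than any real framework:

```python
import random

class ReplicatedService:
    """Toy service with N replicas; healthy while at least one replica is up."""

    def __init__(self, replicas: int) -> None:
        self.up = [True] * replicas

    def kill_random_replica(self) -> None:
        """Chaos injection: take down one randomly chosen live replica."""
        alive = [i for i, ok in enumerate(self.up) if ok]
        if alive:
            self.up[random.choice(alive)] = False

    def heal(self) -> None:
        """Automated recovery: restore full redundancy."""
        self.up = [True] * len(self.up)

    def healthy(self) -> bool:
        return any(self.up)


# A minimal resilience test: losing one of three replicas must not
# take the service down, and recovery must restore all replicas.
svc = ReplicatedService(replicas=3)
svc.kill_random_replica()
assert svc.healthy()
svc.heal()
assert all(svc.up)
```

Real chaos tooling injects failures into production-like environments (killed pods, dropped packets, degraded disks) and asserts on the same kind of invariants.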

Nice to have

  • operating large-scale model training infrastructure or serving infrastructure (>1000 GPUs)
  • ML hardware accelerators (e.g., GPUs, TPUs, Trainium)
  • ML-specific networking optimizations like RDMA and InfiniBand
  • AI-specific observability tools and frameworks
  • ML model deployment strategies and their reliability implications
  • contributing to open-source infrastructure or ML tooling
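The deployment-strategies bullet above typically means canary or blue-green rollouts, whose reliability implication is that only a bounded slice of traffic sees a new model version at first. A minimal canary-routing sketch (the percentage and routing key are hypothetical):

```python
def route_to_canary(request_id: int, canary_percent: int) -> bool:
    """Deterministically send a fixed slice of traffic to the canary version.

    Hashing or bucketing on a stable key keeps each caller's experience
    consistent while the rollout percentage is ramped up.
    """
    return request_id % 100 < canary_percent


# With a 5% canary, exactly 5 of every 100 request ids hit the new version,
# so a regression in the canary affects at most that slice of traffic.
canary_hits = sum(route_to_canary(i, canary_percent=5) for i in range(1_000))
```

Ramping `canary_percent` gradually while watching SLO metrics on the canary slice, and rolling back if its error rate diverges from the stable fleet, is the standard reliability pattern here.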

What the JD emphasized

  • large language model serving
  • training systems
  • high-availability language model serving infrastructure
  • model serving deployments
  • critical AI services
  • large-scale AI infrastructure
  • operating AI infrastructure
  • model serving
  • batch inference
  • training pipelines
  • ML hardware accelerators
  • ML model deployment strategies

Other signals

  • reliability engineering for AI systems
  • operating AI infrastructure
  • SLO/SLA frameworks for AI services
  • incident response for AI services