Staff Software Engineer, AI Reliability Engineering

Anthropic · AI Frontier · AI Research & Engineering

Staff Software Engineer role focused on AI Reliability Engineering at Anthropic, responsible for defining and achieving reliability targets for LLM serving and training systems. The work spans designing monitoring, building high-availability serving infrastructure, leading incident response, and optimizing costs for large-scale AI infrastructure.

What you'd actually do

  1. Develop appropriate Service Level Objectives (SLOs) for large language model serving and training systems, balancing availability and latency against development velocity (a toy error-budget sketch follows this list).
  2. Design and implement monitoring systems that track availability, latency, and other salient metrics.
  3. Assist in the design and implementation of high-availability language model serving infrastructure that handles millions of external customers and high-traffic internal workloads.
  4. Develop and manage automated failover and recovery systems for model serving deployments across multiple regions and cloud providers.
  5. Lead incident response for critical AI services, ensuring rapid recovery and systematic improvements from each incident.
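
To make the SLO item concrete, here is a minimal sketch of error-budget accounting for an availability SLO. The class and function names, the 99.9% target, the 28-day window, and the traffic numbers are all illustrative assumptions, not anything stated in the JD.

```python
# A minimal sketch of SLO error-budget accounting. The target, window,
# and traffic figures below are hypothetical, chosen only to illustrate
# the arithmetic behind "balancing availability with development velocity".
from dataclasses import dataclass


@dataclass
class SLO:
    name: str
    target: float      # e.g. 0.999 means 99.9% of requests must succeed
    window_days: int   # rolling evaluation window


def error_budget_remaining(slo: SLO, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent (negative = blown)."""
    allowed_failures = (1.0 - slo.target) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1.0 - (failed_requests / allowed_failures)


if __name__ == "__main__":
    serving_slo = SLO(name="inference-availability", target=0.999, window_days=28)
    # Hypothetical traffic over the window: 50M requests, 30k failures.
    remaining = error_budget_remaining(
        serving_slo, total_requests=50_000_000, failed_requests=30_000
    )
    print(f"{serving_slo.name}: {remaining:.1%} of error budget remaining")
```

A budget framed this way is what lets the role trade reliability against velocity: ship faster while budget remains, freeze risky changes once it is spent.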

Skills

Required

  • distributed systems observability and monitoring at scale
  • operating AI infrastructure (model serving, batch inference, training pipelines)
  • implementing and maintaining SLO/SLA frameworks
  • traditional metrics (latency, availability)
  • AI-specific metrics (model performance, training convergence)
  • chaos engineering and systematic resilience testing (see the sketch after this list)
  • bridging ML engineers and infrastructure teams
  • communication skills
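
As a toy illustration of the chaos-engineering bullet, the sketch below wraps a callable so a small fraction of calls fail or slow down, which is one way to verify that retries, timeouts, and failover paths actually fire before a real incident does. The decorator name, rates, and latency values are my assumptions, not tooling named in the JD.

```python
# A minimal fault-injection sketch. inject_faults, its failure_rate, and
# the injected latency are hypothetical values for illustration only.
import functools
import random
import time


def inject_faults(failure_rate: float = 0.01, extra_latency_s: float = 0.25):
    """Wrap a callable so some calls raise and some slow down."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            roll = random.random()
            if roll < failure_rate:
                # Simulate a hard dependency failure.
                raise RuntimeError(f"chaos: injected failure in {fn.__name__}")
            if roll < failure_rate * 2:
                # Simulate a slow dependency for an equal share of calls.
                time.sleep(extra_latency_s)
            return fn(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(failure_rate=0.05)
def serve_request(prompt: str) -> str:
    return f"response to: {prompt}"


if __name__ == "__main__":
    failures = 0
    for i in range(200):
        try:
            serve_request(f"prompt {i}")
        except RuntimeError:
            failures += 1
    print(f"injected failures observed: {failures}/200")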

Nice to have

  • operating large-scale model training infrastructure or serving infrastructure (>1000 GPUs)
  • ML hardware accelerators (GPUs, TPUs, Trainium)
  • ML-specific networking optimizations (RDMA, InfiniBand)
  • AI-specific observability tools and frameworks
  • ML model deployment strategies and their reliability implications
  • contributed to open-source infrastructure or ML tooling

What the JD emphasized

  • operating AI infrastructure
  • model serving
  • training pipelines
  • SLO/SLA frameworks
  • AI-specific metrics
  • large-scale model training infrastructure
  • serving infrastructure
  • ML hardware accelerators
  • AI-specific observability tools

Other signals

  • reliability engineering for AI systems
  • large language model serving and training infrastructure
  • high-availability
  • incident response
  • cost optimization