Staff + Sr. Software Engineer, AI Reliability

Anthropic · AI Frontier · New York, NY +2 · Software Engineering - Infrastructure

This role focuses on improving the reliability of AI serving systems, including infrastructure, API layers, and accelerators. Responsibilities include developing SLOs, designing monitoring and observability systems, assisting with high-availability infrastructure, leading incident response for critical AI services, and supporting safeguard model serving. The role requires strong distributed systems and reliability backgrounds, with experience in large-scale model serving infrastructure being a plus.

What you'd actually do

  1. Develop appropriate Service Level Objectives for large language model serving systems, balancing availability and latency with development velocity
  2. Design and implement monitoring and observability systems across the token path
  3. Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers
  4. Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements
  5. Support the reliability of safeguard model serving, which is critical for both site reliability and Anthropic's safety commitments

Skills

Required

  • distributed systems
  • infrastructure
  • reliability engineering
  • background as an SRE or reliability-minded software engineer
  • monitoring
  • observability
  • high-availability serving infrastructure
  • incident response
  • safeguard model serving

Nice to have

  • operating large-scale model serving or training infrastructure (>1000 GPUs)
  • ML hardware accelerators (GPUs, TPUs, Trainium)
  • ML-specific networking optimizations (RDMA, InfiniBand)
  • AI-specific observability tools and frameworks
  • chaos engineering
  • systematic resilience testing
  • contributions to open-source infrastructure or ML tooling

What the JD emphasized

  • critical serving paths
  • critical AI services
  • large-scale model serving or training infrastructure (>1000 GPUs)

Other signals

  • improving reliability across our most critical serving paths
  • make the systems that deliver Claude more robust and resilient
  • lead incident response for critical AI services
  • operating large-scale model serving or training infrastructure