Staff Software Engineer, AI Reliability Engineering

Anthropic · AI Frontier · London, United Kingdom · Software Engineering - Infrastructure

This Staff Software Engineer role at Anthropic focuses on improving the reliability, robustness, and resilience of AI serving systems, specifically for large language models like Claude. Responsibilities include developing SLOs, designing monitoring and observability, assisting with high-availability infrastructure, leading incident response for critical AI services, and supporting the reliability of safeguard model serving.

What you'd actually do

  1. Develop appropriate Service Level Objectives for large language model serving systems, balancing availability and latency with development velocity.
  2. Design and implement monitoring and observability systems across the token path.
  3. Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers.
  4. Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.
  5. Support the reliability of safeguard model serving, which is critical for both site reliability and Anthropic's safety commitments.

Skills

Required

  • distributed systems
  • infrastructure
  • reliability engineering background (SREs or reliability-minded software engineers)
  • operating large-scale model serving infrastructure
  • ML hardware accelerators
  • ML-specific networking optimizations
  • AI-specific observability tools and frameworks
  • chaos engineering
  • systematic resilience testing

Nice to have

  • experience with one or more ML hardware accelerators (GPUs, TPUs, Trainium)
  • understanding of ML-specific networking optimizations such as RDMA and InfiniBand
  • expertise in AI-specific observability tools and frameworks
  • experience with chaos engineering and systematic resilience testing
  • contributions to open-source infrastructure or ML tooling

What the JD emphasized

  • critical serving paths
  • critical AI services
  • safeguard model serving

Other signals

  • improving reliability across our most critical serving paths
  • make the systems that deliver Claude more robust and resilient
  • Develop appropriate Service Level Objectives for large language model serving systems
  • Design and implement monitoring and observability systems across the token path
  • Support the reliability of safeguard model serving