Staff Software Engineer, AI Reliability Engineering

Anthropic · AI Frontier · Dublin, Ireland · Software Engineering - Infrastructure

Staff Software Engineer focused on AI Reliability Engineering for large language model serving systems. Responsibilities include developing SLOs, designing monitoring and observability systems, implementing high-availability infrastructure, and leading incident response for critical AI services. This role partners with teams across Anthropic to improve reliability across serving paths.

What you'd actually do

  1. Develop appropriate Service Level Objectives for large language model serving systems, balancing availability and latency with development velocity.
  2. Design and implement monitoring and observability systems across the token path.
  3. Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers.
  4. Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.
  5. Support the reliability of safeguard model serving -- critical for both site reliability and Anthropic's safety commitments.

Skills

Required

  • distributed systems
  • infrastructure
  • reliability engineering background (SRE or reliability-minded software engineering)
  • monitoring
  • observability systems
  • high-availability serving infrastructure
  • incident response
  • communication
  • collaboration

Nice to have

  • operating large-scale model serving or training infrastructure (>1000 GPUs)
  • ML hardware accelerators (GPUs, TPUs, Trainium)
  • ML-specific networking optimizations like RDMA and InfiniBand
  • AI-specific observability tools and frameworks
  • chaos engineering
  • systematic resilience testing
  • open-source infrastructure or ML tooling

What the JD emphasized

  • large-scale systems
  • large-scale model serving
  • AI-specific observability tools

Other signals

  • AI Reliability Engineering
  • large language model serving systems
  • high-availability serving infrastructure
  • incident response for critical AI services