Staff Software Engineer, Systems

Anthropic · AI Frontier · London, United Kingdom · Software Engineering - Infrastructure

This Staff Software Engineer, Systems role at Anthropic focuses on building and maintaining the infrastructure that supports AI clusters at massive scale. The work spans compute uptime, resilience, networking, and reliability, enabling frontier AI research and safe deployment to millions of users. The role requires deep knowledge of distributed systems, reliability, and cloud platforms, along with experience in systems languages and a track record of leading infrastructure projects.

What you'd actually do

  1. Lead infrastructure projects from design through delivery, owning scope, execution, and outcomes
  2. Build and maintain systems that support AI clusters at massive scale (thousands to hundreds of thousands of machines)
  3. Partner with cloud providers and internal teams to solve compute, networking, and reliability challenges
  4. Tackle difficult technical problems in your domain and proactively fill gaps in tooling, documentation, and processes
  5. Contribute to operational practices including incident response, postmortems, and on-call rotations

Skills

Required

  • 6+ years of software engineering experience
  • Led technical projects end-to-end over multiple months, including scoping, breaking down work, and driving delivery
  • Deep knowledge of distributed systems, reliability, and cloud platforms (Kubernetes, IaC, AWS/GCP)
  • Strong proficiency in at least one systems language (Python, Rust, Go, Java)
  • Ability to solve hard problems independently and judgment about when to pull others in
  • Help teammates grow through knowledge sharing and thoughtful technical guidance
  • Communicate clearly in design docs, presentations, and cross-functional discussions

Nice to have

  • Expertise in security and privacy best practices
  • Experience with machine learning infrastructure such as GPUs, TPUs, or Trainium, as well as supporting networking infrastructure such as NCCL
  • Low-level systems experience, for example Linux kernel tuning and eBPF
  • Technical breadth: quickly grasping systems design tradeoffs and keeping pace with rapidly evolving software systems

What the JD emphasized

  • massive scale
  • thousands to hundreds of thousands of machines
  • compute uptime and resilience
  • frontier AI research
  • safely deployable to customers
  • deep knowledge of distributed systems, reliability, and cloud platforms
  • led technical projects end-to-end