Staff Software Engineer, Node Infra

Anthropic · AI Frontier · London, United Kingdom · Software Engineering - Infrastructure

The Staff Software Engineer, Node Infra at Anthropic owns the technical strategy and roadmap for node lifecycle management (ingestion, bring-up, health checking, automated repair) for AI clusters. The work includes driving cross-team initiatives to scale AI clusters across multiple clouds and accelerator families, designing systems for hardware health and remediation, and defining infrastructure architecture. It also calls for close collaboration with cloud providers and internal teams on compute and infrastructure strategy, and for establishing operational excellence practices. Candidates need deep expertise in distributed systems, reliability, cloud platforms, and systems languages, plus hands-on experience with ML accelerators.

What you'd actually do

  1. Own the technical strategy and roadmap for node lifecycle management - ingestion, bring-up, health checking, and automated repair
  2. Drive cross-team initiatives to build and scale AI clusters across multiple clouds and accelerator families
  3. Design and operate the systems that detect, isolate, and remediate unhealthy hardware automatically, driving up fleet MTBI and minimizing stranded capacity
  4. Define infrastructure architecture, ensuring the hardest problems get solved - whether by you directly or by working through others
  5. Work closely with cloud providers and internal research/inference/product teams to shape long-term compute, data, and infrastructure strategy

Skills

Required

  • Distributed systems
  • Reliability
  • Cloud platforms (Kubernetes, IaC, AWS/GCP/Azure)
  • Rust
  • Go
  • Python
  • Terraform
  • Machine learning accelerators (GPUs, TPUs, or Trainium)
  • Leading complex technical initiatives
  • Stakeholder management
  • Effective communication

Nice to have

  • Technical lead experience
  • Large-scale compute infrastructure management
  • Capacity management
  • Kubernetes internals
  • Cluster orchestration systems
  • Node provisioning pipelines
  • Low-level systems experience (kernel, virtualization, device drivers, firmware, hardware health/diagnostics daemons)
  • High-performance networking (EFA, RDMA, InfiniBand)
  • Production reliability for high-throughput, latency-sensitive systems
  • Open-source contributions
  • Systems design tradeoffs

What the JD emphasized

  • Deep expertise in distributed systems, reliability, and cloud platforms
  • Hands-on experience with machine learning accelerators (GPUs, TPUs, or Trainium)
  • Track record of leading complex, multi-quarter technical initiatives that span multiple teams or systems
  • Ability to build alignment across senior stakeholders and communicate effectively at all levels
  • Experience managing large-scale compute infrastructure at hyperscale (10K+ nodes), including capacity management and efficiency
  • Familiarity with high-performance networking (EFA, RDMA, InfiniBand) for distributed ML workloads
  • Demonstrated ownership of production reliability for high-throughput, latency-sensitive systems