Staff Infrastructure Engineer, Cluster Infrastructure

Anthropic · AI Frontier · London, United Kingdom · Software Engineering - Infrastructure

Staff Infrastructure Engineer focused on building and scaling compute clusters for AI model training and inference. This role involves owning the technical strategy for agent-driven cluster lifecycle management, ensuring high-bandwidth connectivity, security, scalability, and fault tolerance. The engineer will collaborate with cloud providers and internal teams to shape long-term compute strategy and establish operational excellence practices.

What you'd actually do

  1. Own the technical strategy and roadmap for agent-driven cluster lifecycle management - provisioning, updates and decommissioning
  2. Partner across teams to ensure new compute capacity is ingested on time
  3. Align with partner teams on physical build-out and leverage cloud solutions to deliver high-bandwidth inter-cluster connectivity
  4. Collaborate with security owners to ensure clusters are provisioned secure-by-default
  5. Define and drive strategy on cluster scalability, homogeneity and fault tolerance

Skills

Required

  • Deep expertise in distributed systems, reliability, and cloud platforms (e.g., Kubernetes, IaC, AWS/GCP/Azure)
  • Strong proficiency in at least one systems language (e.g., Rust, Go, or Python) and IaC proficiency with Terraform
  • Track record of leading complex, multi-quarter technical initiatives spanning multiple teams or systems
  • Ability to build alignment across senior stakeholders and communicate effectively at all levels

Nice to have

  • 8+ years of software engineering experience, including time as a technical lead setting direction for a team
  • Experience operating compute infrastructure at hyperscale (100+ clusters, 10K+ nodes)
  • Depth in one or more of: Kubernetes internals, cluster provisioning and management systems, cluster orchestration systems (Mesos, Borg-like)
  • Experience with cloud networking: VPC design and peering, Shared VPC/Transit Gateway, Cloud Interconnect/Direct Connect, Cloud NAT, cross-cloud private connectivity, BGP and route control, edge load balancing and DDoS mitigation (Cloud Armor / AWS Shield)
  • Experience with cluster and host networking: CNI (Cilium), eBPF, NetworkPolicy, multi-NIC, sFlow, service mesh (Istio/Envoy/Linkerd, mTLS)
  • Experience with cluster security: pod security standards and admission control, RBAC and least-privilege IAM, node and container hardening, supply-chain/image provenance
  • Deep experience with infrastructure-as-code (Terraform, Atlantis) and workflow orchestration (Temporal, Argo Workflows)
  • Skill in quickly understanding systems design tradeoffs and keeping track of rapidly evolving software systems

What the JD emphasized

  • Deep expertise in distributed systems, reliability, and cloud platforms
  • Track record of leading complex, multi-quarter technical initiatives spanning multiple teams or systems
  • Ability to build alignment across senior stakeholders and communicate effectively at all levels