Staff Infrastructure Engineer, Cluster Infrastructure

Anthropic · AI Frontier · New York, NY +1 · Software Engineering - Infrastructure

This Staff Infrastructure Engineer role covers the full lifecycle of compute clusters, including agent-driven automation for provisioning, updates, and decommissioning. It involves keeping clusters high-bandwidth, secure, and fault-tolerant to support AI model training and inference at scale.

What you'd actually do

  1. Own the technical strategy and roadmap for agent-driven cluster lifecycle management: provisioning, updates, and decommissioning
  2. Partner across teams to ensure new compute capacity is ingested on time
  3. Align with partner teams on physical build-out and leverage cloud solutions to deliver high-bandwidth inter-cluster connectivity
  4. Collaborate with security owners to ensure clusters are provisioned secure-by-default
  5. Define and drive strategy on cluster scalability, homogeneity, and fault tolerance

Skills

Required

  • distributed systems
  • reliability
  • cloud platforms
  • Kubernetes
  • IaC
  • AWS
  • GCP
  • Azure
  • Rust
  • Go
  • Python
  • Terraform
  • leading complex technical initiatives
  • stakeholder management
  • communication

Nice to have

  • 8+ years of software engineering experience
  • technical lead experience
  • operating large-scale compute infrastructure
  • Kubernetes internals
  • cluster provisioning and management systems
  • cluster orchestration systems
  • cloud networking
  • VPC design
  • Shared VPC/Transit Gateway
  • Cloud Interconnect/Direct Connect
  • Cloud NAT
  • cross-cloud private connectivity
  • BGP
  • route control
  • edge load balancing
  • DDoS mitigation
  • cluster and host networking
  • CNI
  • eBPF
  • NetworkPolicy
  • multi-NIC
  • sFlow
  • service mesh
  • Istio
  • Envoy
  • Linkerd
  • mTLS
  • cluster security
  • pod security standards
  • admission control
  • RBAC
  • least-privilege IAM
  • node and container hardening
  • supply-chain/image provenance
  • workflow orchestration
  • Temporal
  • Argo Workflows
  • systems design tradeoffs

What the JD emphasized

  • Deep expertise in distributed systems, reliability, and cloud platforms (e.g., Kubernetes, IaC, AWS/GCP/Azure)
  • Strong proficiency in at least one systems language (e.g., Rust, Go, or Python) and IaC proficiency with Terraform
  • Track record of leading complex, multi-quarter technical initiatives spanning multiple teams or systems
  • Ability to build alignment across senior stakeholders and communicate effectively at all levels
  • Experience operating compute infrastructure at hyperscale (100+ clusters, 10K+ nodes)