Staff Software Engineer, Kubernetes Platform

Anthropic · AI Frontier · London, United Kingdom · Software Engineering - Infrastructure

Staff Software Engineer on the Kubernetes Platform team at Anthropic, responsible for owning, operating, and extending large-scale Kubernetes clusters (hundreds of thousands of nodes) used to train, research, and serve frontier AI models. This includes building custom scheduling plugins, scaling the control plane, and building core cluster services. The role requires deep Kubernetes experience and a track record in production distributed systems.

What you'd actually do

  1. Own, operate, and extend the Kubernetes scheduler for Anthropic's accelerator fleets, including custom scheduling plugins and policies for gang scheduling, topology awareness, and preemption
  2. Scale the Kubernetes control plane (apiserver, etcd, controller-manager) to support clusters far beyond typical limits, and find the next bottleneck before it finds us
  3. Design, build, and operate core cluster services such as service discovery that every workload in the fleet depends on
  4. Build and maintain custom controllers, operators, and CRDs
  5. Partner with research, training, and inference to understand workload shapes and turn their requirements into platform capabilities
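To make the gang-scheduling responsibility in item 1 concrete, here is a minimal, self-contained sketch of the core idea: a pod group is admitted only if every member can be placed at once, so a distributed training job never starts with a partial set of workers. This is an illustrative toy model, not the actual kube-scheduler plugin API; the `Node`, `PodGroup`, and `tryAdmit` names are invented for this example, and a real plugin would implement the scheduling-framework interfaces instead.

```go
package main

import "fmt"

// Node is a simplified view of schedulable capacity.
type Node struct {
	Name     string
	FreeGPUs int
}

// PodGroup models a gang: Members pods that must all run together.
type PodGroup struct {
	Name       string
	Members    int
	GPUsPerPod int
}

// tryAdmit returns a placement (one node name per pod) if the entire
// gang fits on the given nodes, or nil if any member cannot be placed.
// Rejecting the whole group on a single failure is the essence of gang
// scheduling: all-or-nothing admission.
func tryAdmit(g PodGroup, nodes []Node) []string {
	free := make([]int, len(nodes))
	for i, n := range nodes {
		free[i] = n.FreeGPUs
	}
	placement := make([]string, 0, g.Members)
	for p := 0; p < g.Members; p++ {
		placed := false
		for i := range nodes {
			if free[i] >= g.GPUsPerPod {
				free[i] -= g.GPUsPerPod
				placement = append(placement, nodes[i].Name)
				placed = true
				break
			}
		}
		if !placed {
			return nil // one unplaceable member rejects the whole gang
		}
	}
	return placement
}

func main() {
	nodes := []Node{{"node-a", 8}, {"node-b", 8}}
	// 4 pods x 4 GPUs = 16 GPUs: the gang fits.
	fmt.Println(tryAdmit(PodGroup{"trainer", 4, 4}, nodes))
	// 5 pods x 4 GPUs = 20 GPUs: the whole gang waits.
	fmt.Println(tryAdmit(PodGroup{"trainer-xl", 5, 4}, nodes) == nil)
}
```

In production this logic lives behind the scheduling framework's PreFilter/Permit extension points (as in Kueue or Volcano), with topology awareness and preemption layered on top.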

Skills

Required

  • Significant software engineering experience building and operating production distributed systems
  • Proficiency in at least one systems-appropriate language (e.g., Go, Python, Rust, or C++)
  • Deep, hands-on Kubernetes experience (well beyond "user of"): scheduler internals, controllers, the apiserver, or operating large multi-tenant clusters
  • Demonstrated ability to debug complex issues across the stack, from API behavior down to node and network-level root causes
  • A track record of designing for reliability, correctness, and clear failure semantics in systems other engineers depend on
  • Strong written and verbal communication; comfort building consensus with internal stakeholders

Nice to have

  • Experience with Kubernetes internals or contributions: kube-scheduler / scheduling framework, apiserver, etcd, client-go, controller-runtime, or similar
  • Experience building or operating cluster schedulers or batch systems (e.g., Kueue, Volcano, Slurm, or in-house equivalents)
  • Background scaling control planes or coordination systems (etcd, ZooKeeper, Consul, or large DNS/service-mesh deployments)
  • Familiarity with ML infrastructure: GPUs, TPUs, or Trainium; gang scheduling; topology-aware placement; collective networking such as NCCL
  • Experience with GCP and/or AWS, including GKE/EKS internals and Infrastructure as Code
  • Low-level systems experience such as Linux kernel tuning, cgroups, or eBPF
  • 8+ years of relevant industry experience, including time leading large, ambiguous infrastructure projects
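The controller/operator experience mentioned above centers on one pattern: level-triggered reconciliation. Below is a minimal sketch of that idea in plain Go, assuming invented `DesiredState`/`ObservedState` types for illustration; a real controller would use client-go informers or controller-runtime's `Reconciler` interface rather than this toy model.

```go
package main

import "fmt"

// DesiredState is what the spec asks for; ObservedState is what exists.
type DesiredState struct{ Replicas int }
type ObservedState struct{ Replicas int }

// Action is the convergence step a reconcile pass computes.
type Action struct {
	Create int // replicas to add
	Delete int // replicas to remove
}

// reconcile compares desired with observed state and returns the action
// needed to converge, independent of event history. It is idempotent:
// rerunning it after the action is applied yields a no-op, which is what
// makes retries and periodic resyncs safe.
func reconcile(desired DesiredState, observed ObservedState) Action {
	diff := desired.Replicas - observed.Replicas
	switch {
	case diff > 0:
		return Action{Create: diff}
	case diff < 0:
		return Action{Delete: -diff}
	default:
		return Action{} // converged: nothing to do
	}
}

func main() {
	fmt.Println(reconcile(DesiredState{5}, ObservedState{3})) // scale up by 2
	fmt.Println(reconcile(DesiredState{5}, ObservedState{5})) // converged
}
```

The same compare-and-converge loop underlies CRD-backed operators: the CRD defines the desired state's schema, and the controller repeatedly drives observed state toward it.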

What the JD emphasized

  • core craft
  • production distributed systems
  • Deep, hands-on Kubernetes experience
  • complex issues across the stack
  • designing for reliability, correctness
  • ML infrastructure

Other signals

  • operating at scale
  • hundreds of thousands of nodes
  • train, research, and serve frontier AI models
  • scale the control plane
  • support clusters far beyond typical limits
  • ML workloads
  • accelerators