Staff Software Engineer, Kubernetes Platform

Anthropic · AI Frontier · New York, NY +1 · Software Engineering - Infrastructure

This role focuses on building and operating large-scale Kubernetes infrastructure to support Anthropic's AI model training, research, and serving. The engineer will own and extend the Kubernetes scheduler, scale the control plane, and build core cluster services to handle massive compute footprints.

What you'd actually do

  1. Own, operate, and extend the Kubernetes scheduler for Anthropic's accelerator fleets, including custom scheduling plugins and policies for gang scheduling, topology awareness, and preemption
  2. Scale the Kubernetes control plane (apiserver, etcd, controller-manager) to support clusters far beyond typical limits, and find the next bottleneck before it finds us
  3. Design, build, and operate core cluster services such as service discovery that every workload in the fleet depends on
  4. Build and maintain custom controllers, operators, and CRDs
  5. Partner with research, training, and inference to understand workload shapes and turn their requirements into platform capabilities

Skills

Required

  • software engineering experience building and operating production distributed systems
  • Go, Python, Rust, or C++
  • Deep, hands-on Kubernetes experience (well beyond "user of"): working on the scheduler, controllers, or apiserver, or operating large multi-tenant clusters
  • debugging complex issues across the stack
  • designing for reliability, correctness, and clear failure semantics
  • written and verbal communication
  • building consensus with internal stakeholders

Nice to have

  • Kubernetes internals or contributions: kube-scheduler / scheduling framework, apiserver, etcd, client-go, controller-runtime, or similar
  • building or operating cluster schedulers or batch systems (e.g., Kueue, Volcano, Slurm, or in-house equivalents)
  • scaling control planes or coordination systems (etcd, ZooKeeper, Consul, or large DNS/service-mesh deployments)
  • ML infrastructure: GPUs, TPUs, or Trainium; gang scheduling; topology-aware placement; collective networking such as NCCL
  • GCP and/or AWS, including GKE/EKS internals and Infrastructure as Code
  • Low-level systems experience such as Linux kernel tuning, cgroups, or eBPF
  • 8+ years of relevant industry experience, including time leading large, ambiguous infrastructure projects

What the JD emphasized

  • Kubernetes
  • scheduler
  • control plane
  • large-scale
  • production distributed systems
  • reliability
  • correctness