Staff Software Engineer, Node Infra

Anthropic · AI Frontier · New York, NY +1 · Software Engineering - Infrastructure

The Staff Software Engineer, Node Infra role at Anthropic owns the full lifecycle of accelerator capacity: ingesting, provisioning, scaling, and maintaining AI clusters across multiple clouds and datacenters. This involves designing and operating systems for hardware health, diagnostics, and automated repair so that GPU, TPU, and Trainium nodes are usable for AI research and inference. The role requires deep expertise in distributed systems, reliability, cloud platforms, and systems programming, with a focus on scaling and operating large compute infrastructure.

What you'd actually do

  1. Own the technical strategy and roadmap for node lifecycle management - ingestion, bring-up, health checking, and automated repair
  2. Drive cross-team initiatives to build and scale AI clusters across multiple clouds and accelerator families
  3. Design and operate the systems that detect, isolate, and remediate unhealthy hardware automatically, driving up fleet MTBI and minimizing stranded capacity
  4. Define infrastructure architecture, ensuring the hardest problems get solved - whether by you directly or by working through others
  5. Work closely with cloud providers and internal research/inference/product teams to shape long-term compute, data, and infrastructure strategy

Skills

Required

  • distributed systems
  • reliability
  • cloud platforms
  • Kubernetes
  • IaC
  • AWS
  • GCP
  • Azure
  • Rust
  • Go
  • Python
  • Terraform
  • machine learning accelerators
  • GPUs
  • TPUs
  • Trainium
  • leading complex technical initiatives

Nice to have

  • technical lead experience
  • capacity management
  • Kubernetes internals
  • cluster orchestration systems
  • node provisioning pipelines
  • low-level systems experience
  • kernel
  • virtualization
  • device drivers
  • firmware
  • hardware health/diagnostics daemons
  • high-performance networking
  • EFA
  • RDMA
  • InfiniBand
  • production reliability
  • high-throughput systems
  • latency-sensitive systems
  • open-source projects
  • systems design tradeoffs

What the JD emphasized

  • Deep expertise in distributed systems, reliability, and cloud platforms
  • Hands-on experience with machine learning accelerators (GPUs, TPUs, or Trainium)
  • Track record of leading complex, multi-quarter technical initiatives that span multiple teams or systems
  • Ability to build alignment across senior stakeholders and communicate effectively at all levels
  • Experience managing compute infrastructure at hyperscale (10K+ nodes), including capacity management and efficiency