Staff Engineer, Datacenter Server Lifecycle

Anthropic · AI Frontier · San Francisco, CA · Software Engineering - Infrastructure

Staff Engineer to own the end-to-end operational journey of every machine in Anthropic's datacenter, from provisioning to decommissioning. The role defines processes, tooling, and operational standards, and works closely with security to ensure trusted compute standards across the lifecycle. It requires hands-on experience with server hardware and modern cloud infrastructure, and ideally with GPU/AI accelerator hardware.

What you'd actually do

  1. Lead the build-out of automation to support datacenters containing tens of thousands of servers.
  2. Define and own the end-to-end server lifecycle strategy — from provisioning and deployment through operation, maintenance, refresh, and decommissioning — and maintain automation and operational procedures for common lifecycle events (e.g., hardware failures, firmware upgrades, fleet rotations).
  3. Partner closely with Infrastructure Security to design and enforce trusted compute standards across the server lifecycle.
  4. Work closely with our Networking team to ensure end-to-end connectivity across all sites.
  5. Build and maintain tooling to track machine health, configuration, and operational status across the full datacenter fleet.

Skills

Required

  • server hardware
  • hardware lifecycle management
  • Python
  • Rust
  • Go
  • Java
  • Kubernetes
  • Infrastructure as Code
  • AWS
  • GCP
  • communication
  • cross-functional problem solving

Nice to have

  • 8+ years of experience
  • GPU or AI accelerator hardware
  • NVIDIA A100/H100
  • AMD MI300
  • Google TPUs
  • AWS Trainium
  • coreboot
  • LinuxBoot
  • u-root
  • datacenter automation
  • fleet management platforms
  • server operating system distributions
  • capacity planning
  • hardware refresh strategy
  • secure boot
  • TPM
  • hardware attestation
  • firmware verification

What the JD emphasized

  • trusted compute standards
  • trusted compute
  • hardware security concepts