Staff Engineer, Datacenter Server Lifecycle

Anthropic · AI Frontier · London, United Kingdom · Software Engineering - Infrastructure

Staff Engineer responsible for the end-to-end operational journey of servers in Anthropic's datacenters: provisioning, deployment, maintenance, refresh, and decommissioning. The role defines processes, tooling, and operational standards for datacenters containing tens of thousands of servers, with a strong emphasis on security and trusted compute standards. It requires close collaboration with the Networking and Infrastructure Security teams.

What you'd actually do

  1. Lead the build-out of automation to support datacenters containing tens of thousands of servers.
  2. Define and own the end-to-end server lifecycle strategy — from provisioning and deployment through operation, maintenance, refresh, and decommissioning — and maintain automation and operational procedures for common lifecycle events (e.g., hardware failures, firmware upgrades, fleet rotations).
  3. Partner closely with Infrastructure Security to design and enforce trusted compute standards across the server lifecycle.
  4. Work closely with our Networking team to ensure end-to-end connectivity across all sites.
  5. Build and maintain tooling to track machine health, configuration, and operational status across the full datacenter fleet.

Skills

Required

  • server hardware
  • hardware lifecycle management
  • Python
  • Rust
  • Go
  • Java
  • modern cloud infrastructure
  • Kubernetes
  • Infrastructure as Code
  • AWS
  • GCP
  • communication
  • cross-functional problem solving

Nice to have

  • 8+ years of experience in datacenter operations
  • GPU or AI accelerator hardware
  • NVIDIA A100/H100
  • AMD MI300
  • Google TPUs
  • AWS Trainium
  • coreboot
  • LinuxBoot
  • u-root
  • datacenter automation
  • fleet management platforms
  • server operating system distributions
  • capacity planning
  • hardware refresh strategy
  • secure boot
  • TPM
  • hardware attestation
  • firmware verification

What the JD emphasized

  • trusted compute standards
  • trusted compute
  • hardware security concepts