Staff + Senior Software Engineer, Inference Deployment

Anthropic · AI Frontier · San Francisco, CA · Software Engineering - Infrastructure

This role designs and builds the deployment infrastructure for AI inference services across multiple hardware accelerators (GPUs, TPUs, and Trainium). The work spans deployment orchestration, capacity-aware scheduling, and observability. The goal is to ship production releases safely, quickly, and without interruption, while working within resource constraints and minimizing the cycle time from code merge to production.

What you'd actually do

  1. Own the deployment orchestration that continuously moves validated inference builds into production across GPU, TPU, and Trainium fleets, running unattended under normal conditions
  2. Improve capacity-aware deployment scheduling to maximize deployment throughput within constrained accelerator budgets and variable fleet sizes
  3. Extend deployment observability: dashboards and tooling that answer "what code is running in production?", "where is my commit?", and "what validation passed for this deploy?"
  4. Drive down cycle time from code merge to production with pipeline architectures that minimize serial dependencies and maximize parallelism
  5. Optimize fleet rollout strategies for large-scale deployments across thousands of accelerator chips, minimizing disruption to serving capacity

Skills

Required

  • Strong software engineering skills
  • Designing systems that manage complex state machines and multi-stage pipelines
  • Kubernetes-based deployments
  • Rolling update mechanics
  • Container orchestration
  • Building deployment, release, or delivery infrastructure where resource constraints shape the design
  • Automation that measurably improves deployment velocity and reliability
  • Working across the stack, from backend services and databases to CLI tools and web UIs
  • Strong communication skills
  • Working closely with on-call engineers, model teams, and infrastructure partners

Nice to have

  • 5+ years of experience building deployment, release, or delivery infrastructure at scale
  • Python and/or Rust in production systems
  • ML inference or training infrastructure deployment, particularly across multiple accelerator types (GPU, TPU, Trainium)
  • Capacity planning or resource-constrained scheduling (e.g., bin-packing, fleet management, job scheduling with hardware affinity)
  • Progressive delivery in systems with long validation cycles: canary/soak testing, blue-green deployments, traffic shifting, automated rollback
  • Experience at companies with large-scale release engineering challenges (mobile release trains, monorepo deployments, multi-datacenter rollouts)

What the JD emphasized

  • resource-constrained optimization problem
  • resource constraints
  • accelerator chips
  • fleet sizes
  • validation and deployment consume the same accelerator chips that serve customer traffic

Other signals

  • inference deployment
  • deployment infrastructure
  • resource-constrained optimization
  • accelerator chips
  • cycle time reduction