Staff + Sr. Software Engineer, Inference Deployment

Anthropic · AI Frontier · New York, NY (+2 locations) · Software Engineering - Infrastructure

This role focuses on building and maintaining the infrastructure that deploys AI inference code to production across GPU, TPU, and Trainium accelerator fleets. The core responsibility is a continuous, unattended deployment system that works within tight resource constraints, minimizes cycle time, and stays reliable at scale. The work spans capacity-aware scheduling, deployment observability, and self-service onboarding for new models.

What you'd actually do

  1. Own deployment orchestration that continuously moves validated inference builds into production across GPU, TPU, and Trainium fleets, unattended under normal conditions
  2. Improve capacity-aware deployment scheduling to maximize deployment throughput under constrained accelerator budgets and variable fleet sizes
  3. Extend deployment observability — dashboards and tooling that answer "what code is running in production," "where is my commit," and "what validation passed for this deploy"
  4. Drive down cycle time from code merge to production with pipeline architectures that minimize serial dependencies and maximize parallelism
  5. Optimize fleet rollout strategies for large-scale deployments across thousands of GPU, TPU, and Trainium chips, minimizing disruption to serving capacity

Skills

Required

  • 5+ years of experience building deployment, release, or delivery infrastructure at scale
  • Strong software engineering skills with experience designing systems that manage complex state machines and multi-stage pipelines
  • Experience with deployment systems where resource constraints shape the design
  • A track record of building automation that measurably improves deployment velocity and reliability
  • Proficiency with Kubernetes-based deployments, rolling update mechanics, and container orchestration
  • Comfort working across the stack — from backend services and databases to CLI tools and web UIs
  • Strong communication skills and the ability to work closely with on-call engineers, model teams, and infrastructure partners

Nice to have

  • Experience with ML inference or training infrastructure deployment, particularly across multiple accelerator types (GPU, TPU, Trainium)
  • Background in capacity planning or resource-constrained scheduling (e.g., bin-packing, fleet management, job scheduling with hardware affinity)
  • Experience with progressive delivery in systems with long validation cycles: canary/soak testing, blue-green deployments, traffic shifting, automated rollback
  • Experience at companies with large-scale release engineering challenges (mobile release trains, monorepo deployments, multi-datacenter rollouts)
  • Experience with Python and/or Rust in production systems

What the JD emphasized

  • inference deployment
  • resource constraints
  • deployment velocity
  • large-scale release engineering

Other signals

  • GPU, TPU, and Trainium fleets
  • resource-constrained optimization
  • continuous deployment
  • large-scale deployments