Staff / Senior Software Engineer, Compute Capacity

Anthropic · AI Frontier · San Francisco, CA · Compute

This role focuses on building and operating the production systems that manage Anthropic's large accelerator fleet. Responsibilities include developing data pipelines for telemetry ingestion, creating observability tooling for fleet health, and measuring compute efficiency across training, inference, and eval workloads. The role requires strong software engineering, Kubernetes, and data pipeline experience, with a focus on internal tooling and operational depth.

What you'd actually do

  1. Build and operate data pipelines that ingest accelerator occupancy, utilization, and cost data from multiple cloud providers into BigQuery. Own data completeness, latency SLOs, gap detection, and backfill automation.
  2. Develop and maintain observability infrastructure — Prometheus recording rules, Grafana dashboards, and alerting systems — that surface actionable signals about fleet health, occupancy, and efficiency.
  3. Instrument and analyze compute efficiency metrics across training, inference, and eval workloads. Build benchmarking infrastructure, establish per-config baselines, and work with system-owning teams to improve utilization.
  4. Build internal tooling and platforms that enable capacity planning, workload attribution, and cluster debugging. The consumers are other engineering teams, finance, and leadership — not external users.
  5. Operate Kubernetes-native systems at scale — deploying data collection agents, managing workload labeling infrastructure, and understanding how taints, reservations, and scheduling affect capacity.

Skills

Required

  • 5+ years of software engineering experience
  • Kubernetes fluency at operational depth
  • Data pipeline engineering experience
  • Production-quality code
  • Kubernetes-native infrastructure
  • Data engineering
  • Systems engineering
  • Observability
  • Prometheus
  • Grafana
  • BigQuery
  • AWS
  • GCP
  • Azure

Nice to have

  • Comfort in a high-autonomy, high-ambiguity environment
  • Ability to move comfortably between data engineering, systems engineering, and observability
  • Product thinking

What the JD emphasized

  • Production systems
  • Kubernetes fluency at operational depth
  • Data pipeline engineering experience