Staff / Senior Software Engineer, Compu… at Anthropic

About Anthropic

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.

About the Role

Anthropic manages one of the largest and fastest-growing accelerator fleets in the industry — spanning multiple accelerator families and clouds. The Accelerator Capacity Engineering (ACE) team is responsible for making sure every chip in that fleet is accounted for, well-utilized, and efficiently allocated. We own the data, tooling, and operational systems that let Anthropic plan, measure, and maximize utilization across first-party and third-party compute.

As an engineer on ACE, you will build the production systems that power this work: data pipelines that ingest and normalize telemetry from heterogeneous cloud environments, observability tooling that gives the org real-time visibility into fleet health, and performance instrumentation that measures how efficiently every major workload uses the hardware it’s running on. You will be expected to write production-quality code every day, operate alongside Kubernetes-native infrastructure at meaningful scale, and directly influence decisions around one of Anthropic’s largest areas of spend.

You’ll collaborate closely with research engineering, infrastructure, inference, and finance teams. The work requires someone who can move between data engineering, systems engineering, and observability with comfort — and who thrives in a high-autonomy, high-ambiguity environment.

What This Team Owns

The team’s work spans three functional areas. Depending on your background and interests, you’ll focus primarily in one, but the boundaries are fluid and the problems overlap:

**Data infrastructure — **collecting, normalizing, and serving the fleet-wide data that powers everything else. This means building pipelines that ingest occupancy and utilization telemetry from Kubernetes clusters, normalizing billing and usage data across cloud providers, and maintaining the BigQuery layer that the rest of the org queries against. Correctness, completeness, and latency matter here.
**Fleet observability — **making the state of the accelerator fleet legible and actionable in real time. This means building cluster health tooling, capacity planning platforms, alerting on occupancy drops and allocation problems, and driving systemic improvements to scheduling and fragmentation. The work sits at the intersection of Kubernetes operations and cross-team coordination.
**Compute efficiency — **measuring and improving how effectively every major workload uses the hardware it’s running on. This means instrumenting utilization metrics across training, inference, and eval systems, building benchmarking infrastructure, establishing per-config baselines, and collaborating directly with system-owning teams to close efficiency gaps.
**Internal compute tooling — **building the platforms and interfaces that make capacity data usable across the org. This includes capacity planning tools, workload attribution systems, cost dashboards, and self-service APIs. The consumers are research engineers, infrastructure teams, finance, and leadership — each with different needs and different levels of technical depth. The work involves product thinking as much as engineering: figuring out what people actually need, defining schema contracts, and making the data discoverable.

You will be placed on a pod based on your background and interests. We are especially focused on hiring for Data Platform, but strong candidates for any of the three active pods will move forward.

What You’ll Do

**Build and operate data pipelines **that ingest accelerator occupancy, utilization, and cost data from multiple cloud providers into BigQuery. Own data completeness, latency SLOs, gap detection, and backfill automation.
**Develop and maintain observability infrastructure **— Prometheus recording rules, Grafana dashboards, and alerting systems — that surface actionable signals about fleet health, occupancy, and efficiency.
**Instrument and analyze compute efficiency metrics **across training, inference, and eval workloads. Build benchmarking infrastructure, establish per-config baselines, and work with system-owning teams to improve utilization.
**Build internal tooling and platforms **that enable capacity planning, workload attribution, and cluster debugging. The consumers are other engineering teams, finance, and leadership — not external users.
**Operate Kubernetes-native systems at scale **— deploying data collection agents, managing workload labeling infrastructure, and understanding how taints, reservations, and scheduling affect capacity.
**Normalize and reconcile data across heterogeneous sources **— including AWS, GCP, and Azure billing exports, vendor-specific telemetry formats, and internal systems with different schemas and billing arrangements.
**Collaborate across organizational boundaries **with research engineering, infrastructure, inference, and finance teams. Gather requirements from technical stakeholders, translate them into useful systems, and communicate trade-offs to non-technical audiences.

You May Be a Good Fit If You Have

**5+ years of software engineering experience **with a strong track record building and operating production systems. You write code every day — this is a hands-on engineering role, not a planning or coordination role.
**Kubernetes fluency at operational depth **— you’ve operated production K8s at meaningful scale, not just written manifests. Comfort with scheduling, taints, labels, node management, and debugging cluster-level issues.
**Data pipeline engineering experience **— designing, building, and owning the full lifecycle of production data pipelines. Experience with data warehouses (BigQuery preferred), schema management, streaming ingestion, SLOs for latency and completeness, and a strong instinct for correctness.
**Observability tooling experience **— Prometheus, PromQL, and Grafana are in the critical path for this team. Experience writing recording rules, understanding metric semantics, and building monitoring systems that engineering teams actually rely on.
**Python and SQL at production quality. **Most pipeline code is Python; the presentation layer is BigQuery SQL including table-valued functions and views. Both need to be idiomatic, well-tested, and maintainable.
**Familiarity with at least one major cloud provider (AWS, GCP, or Azure) **at the infrastructure level — compute, billing, usage APIs, cost management tooling. Multi-cloud experience is a strong plus.
**High autonomy and strong cross-team communication. **You can gather your own requirements, navigate ambiguity, and work across organizational boundaries. Scrappiness and ownership matter more than polish.

Strong Candidates May Also Have

**Multi-cloud data ingestion experience **— especially working with AWS and GCP APIs, billing exports, or vendor-specific telemetry formats. Experience normalizing data from external providers with different billing arrangements is directly applicable.
**Accelerator infrastructure familiarity **— GPU metrics (DCGM), TPU utilization, Trainium power and utilization metrics, or experience working with ML training/inference systems at the hardware level.
**Performance engineering and benchmarking experience **— building benchmark harnesses, establishing baselines, reasoning about compute efficiency (FLOPs utilization, memory bandwidth, interconnect throughput), and working with system teams to diagnose and improve performance.
**Data-as-product thinking **— experience building internal data products with self-service access, schema contracts, API serving, documentation, and discoverability. Not just building pipelines, but thinking about how platform data gets consumed.
**Experience with capacity planning, resource management, or cost attribution systems **at a hyperscaler or large-scale ML environment. FinOps, chargeback systems, or infrastructure cost modeling.

**Familiarity with ClickHouse, Terraform, or Rust. **ClickHouse is the team’s current streaming store; Terraform for infrastructure-as-code; Rust for high-performance data collection agents.

The annual compensation range for this role is listed below.

For sales roles, the range provided is the role’s On Target Earnings ("OTE") range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role.

Annual Salary:

$320,000—$405,000 USD

Logistics

**Minimum education: **Bachelor’s degree or an equivalent combination of education, training, and/or experience

**Required field of study: **A field relevant to the role as demonstrated through coursework, training, or professional experience

**Minimum years of experience: **Years of experience required will correlate with the internal job level requirements for the position

Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.

Visa sponsorship: We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.

We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you're interested in this work. We think AI systems like the ones we're building have enormous social and ethical implications. We think this makes representation even more important, and we strive to include a range of diverse perspectives on our team.

Your safety matters to us. To protect yourself from potential scams, remember that Anthropic recruiters only contact you from @anthropic.com email addresses. In some cases, we may partner with vetted recruiting agencies who will identify themselves as working on behalf of Anthropic. Be cautious of emails from other domains. Legitimate Anthropic recruiters will never ask for money, fees, or banking information before your first day. If you're ever unsure about a communication, don't click any links—visit anthropic.com/careers directly for confirmed position openings.

How we're different

We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact — advancing our long-term goals of steerable, trustworthy AI — rather than work on smaller and more specific puzzles. We view AI research as an empirical science, which has as much in common with physics and biology as with traditional efforts in computer science. We're an extremely collaborative group, and we host frequent research discussions to ensure that we are pursuing the highest-impact work at any given time. As such, we greatly value communication skills.

The easiest way to understand our research directions is to read our recent research. This research continues many of the directions our team worked on prior to Anthropic, including: GPT-3, Circuit-Based Interpretability, Multimodal Neurons, Scaling Laws, AI & Compute, Concrete Problems in AI Safety, and Learning from Human Preferences.

Come work with us!

Anthropic is a public benefit corporation headquartered in San Francisco. We offer competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and a lovely office space in which to collaborate with colleagues. Guidance on Candidates' AI Usage: Learn about our policy for using AI in our application process

About Anthropic

About the Role

What This Team Owns

The team’s work spans three functional areas. Depending on your background and interests, you’ll focus primarily in one, but the boundaries are fluid and the problems overlap:

**Data infrastructure — **collecting, normalizing, and serving the fleet-wide data that powers everything else. This means building pipelines that ingest occupancy and utilization telemetry from Kubernetes clusters, normalizing billing and usage data across cloud providers, and maintaining the BigQuery layer that the rest of the org queries against. Correctness, completeness, and latency matter here.
**Fleet observability — **making the state of the accelerator fleet legible and actionable in real time. This means building cluster health tooling, capacity planning platforms, alerting on occupancy drops and allocation problems, and driving systemic improvements to scheduling and fragmentation. The work sits at the intersection of Kubernetes operations and cross-team coordination.
**Compute efficiency — **measuring and improving how effectively every major workload uses the hardware it’s running on. This means instrumenting utilization metrics across training, inference, and eval systems, building benchmarking infrastructure, establishing per-config baselines, and collaborating directly with system-owning teams to close efficiency gaps.
**Internal compute tooling — **building the platforms and interfaces that make capacity data usable across the org. This includes capacity planning tools, workload attribution systems, cost dashboards, and self-service APIs. The consumers are research engineers, infrastructure teams, finance, and leadership — each with different needs and different levels of technical depth. The work involves product thinking as much as engineering: figuring out what people actually need, defining schema contracts, and making the data discoverable.

You will be placed on a pod based on your background and interests. We are especially focused on hiring for Data Platform, but strong candidates for any of the three active pods will move forward.

What You’ll Do

**Build and operate data pipelines **that ingest accelerator occupancy, utilization, and cost data from multiple cloud providers into BigQuery. Own data completeness, latency SLOs, gap detection, and backfill automation.
**Develop and maintain observability infrastructure **— Prometheus recording rules, Grafana dashboards, and alerting systems — that surface actionable signals about fleet health, occupancy, and efficiency.
**Instrument and analyze compute efficiency metrics **across training, inference, and eval workloads. Build benchmarking infrastructure, establish per-config baselines, and work with system-owning teams to improve utilization.
**Build internal tooling and platforms **that enable capacity planning, workload attribution, and cluster debugging. The consumers are other engineering teams, finance, and leadership — not external users.
**Operate Kubernetes-native systems at scale **— deploying data collection agents, managing workload labeling infrastructure, and understanding how taints, reservations, and scheduling affect capacity.
**Normalize and reconcile data across heterogeneous sources **— including AWS, GCP, and Azure billing exports, vendor-specific telemetry formats, and internal systems with different schemas and billing arrangements.
**Collaborate across organizational boundaries **with research engineering, infrastructure, inference, and finance teams. Gather requirements from technical stakeholders, translate them into useful systems, and communicate trade-offs to non-technical audiences.

You May Be a Good Fit If You Have

**5+ years of software engineering experience **with a strong track record building and operating production systems. You write code every day — this is a hands-on engineering role, not a planning or coordination role.
**Kubernetes fluency at operational depth **— you’ve operated production K8s at meaningful scale, not just written manifests. Comfort with scheduling, taints, labels, node management, and debugging cluster-level issues.
**Data pipeline engineering experience **— designing, building, and owning the full lifecycle of production data pipelines. Experience with data warehouses (BigQuery preferred), schema management, streaming ingestion, SLOs for latency and completeness, and a strong instinct for correctness.
**Observability tooling experience **— Prometheus, PromQL, and Grafana are in the critical path for this team. Experience writing recording rules, understanding metric semantics, and building monitoring systems that engineering teams actually rely on.
**Python and SQL at production quality. **Most pipeline code is Python; the presentation layer is BigQuery SQL including table-valued functions and views. Both need to be idiomatic, well-tested, and maintainable.
**Familiarity with at least one major cloud provider (AWS, GCP, or Azure) **at the infrastructure level — compute, billing, usage APIs, cost management tooling. Multi-cloud experience is a strong plus.
**High autonomy and strong cross-team communication. **You can gather your own requirements, navigate ambiguity, and work across organizational boundaries. Scrappiness and ownership matter more than polish.

Strong Candidates May Also Have

**Multi-cloud data ingestion experience **— especially working with AWS and GCP APIs, billing exports, or vendor-specific telemetry formats. Experience normalizing data from external providers with different billing arrangements is directly applicable.
**Accelerator infrastructure familiarity **— GPU metrics (DCGM), TPU utilization, Trainium power and utilization metrics, or experience working with ML training/inference systems at the hardware level.
**Performance engineering and benchmarking experience **— building benchmark harnesses, establishing baselines, reasoning about compute efficiency (FLOPs utilization, memory bandwidth, interconnect throughput), and working with system teams to diagnose and improve performance.
**Data-as-product thinking **— experience building internal data products with self-service access, schema contracts, API serving, documentation, and discoverability. Not just building pipelines, but thinking about how platform data gets consumed.
**Experience with capacity planning, resource management, or cost attribution systems **at a hyperscaler or large-scale ML environment. FinOps, chargeback systems, or infrastructure cost modeling.

**Familiarity with ClickHouse, Terraform, or Rust. **ClickHouse is the team’s current streaming store; Terraform for infrastructure-as-code; Rust for high-performance data collection agents.

The annual compensation range for this role is listed below.

Annual Salary:

$320,000—$405,000 USD

Logistics

**Minimum education: **Bachelor’s degree or an equivalent combination of education, training, and/or experience

**Required field of study: **A field relevant to the role as demonstrated through coursework, training, or professional experience

**Minimum years of experience: **Years of experience required will correlate with the internal job level requirements for the position

Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.

Staff / Senior Software Engineer, Compute Capacity

What you'd actually do

Skills

Required

Nice to have

What the JD emphasized

About Anthropic

About the Role

What This Team Owns

What You’ll Do

You May Be a Good Fit If You Have

Strong Candidates May Also Have

Logistics

How we're different

Come work with us!

About Anthropic

About the Role

What This Team Owns

What You’ll Do

You May Be a Good Fit If You Have

Strong Candidates May Also Have

Logistics

How we're different

Come work with us!