Data Scientist, Platform (reliability/latency/inference)

Anthropic · AI Frontier · Data Science & Analytics

A Data Scientist role focused on platform reliability and latency for AI systems, analyzing user behavior and system performance to drive infrastructure improvements and optimize resource allocation. The role is central to delivering a reliable, responsive user experience with AI products at scale.

What you'd actually do

  1. Design and execute comprehensive analyses to understand how latency, reliability, errors, and refusal rates affect user engagement, satisfaction, and retention across our platform (see the sketch after this list)
  2. Identify and prioritize high-impact infrastructure improvements by analyzing user behavior patterns, system performance metrics, and the relationship between technical performance and business outcomes
  3. Develop robust methodologies to measure platform reliability and performance, including defining key metrics, establishing baselines, and creating monitoring systems that enable proactive optimization
  4. Collaborate with engineering teams to design A/B tests and controlled experiments that measure the impact of platform improvements on user experience and system performance
  5. Investigate performance anomalies, conduct root cause analysis of reliability issues, and provide data-driven insights to guide engineering priorities and architectural decisions
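
A minimal sketch of the kind of analysis item 1 describes, assuming a hypothetical per-user dataframe; the column names (p95_latency_ms, error_rate, retained_30d) and the synthetic data are illustrative, not from the posting:

```python
# Sketch: relate request latency and error rate to user retention.
# All data here is synthetic; in practice the per-user aggregates
# would come from request logs joined to retention labels.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 10_000
users = pd.DataFrame({
    "p95_latency_ms": rng.gamma(shape=2.0, scale=400.0, size=n),
    "error_rate": rng.beta(1, 50, size=n),
})
# Synthetic ground truth: retention drops as latency and errors rise.
logit = 1.5 - 0.0015 * users["p95_latency_ms"] - 8 * users["error_rate"]
users["retained_30d"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Logistic regression of retention on latency and error rate.
X = sm.add_constant(users[["p95_latency_ms", "error_rate"]])
model = sm.Logit(users["retained_30d"], X).fit(disp=0)
print(model.summary2())

# Retention rate by latency decile shows the dose-response curve directly.
users["latency_decile"] = pd.qcut(users["p95_latency_ms"], 10, labels=False)
print(users.groupby("latency_decile")["retained_30d"].mean())
```

The decile breakdown complements the regression: it shows the shape of the latency-retention relationship without assuming linearity in the log-odds.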

Skills

Required

  • Advanced degree in Statistics, Computer Science, Engineering, Mathematics, or related quantitative field
  • 5+ years of hands-on data science experience
  • Understanding of distributed systems, cloud infrastructure, and performance engineering
  • Analyzing large-scale system metrics
  • Experimental design
  • Causal inference
  • Statistical modeling
  • A/B testing frameworks (illustrated in the sketch after this list)
  • Python
  • SQL
  • Data analysis tools
  • Working with large datasets
  • Real-time streaming data
  • Translating technical performance metrics into user experience insights
  • Working effectively with engineering teams
  • Translating complex technical analyses into actionable recommendations
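
As an illustration of the experimental-design and A/B-testing items above, a minimal significance check for an infrastructure experiment, assuming invented session counts for a control and a treatment serving configuration:

```python
# Sketch: significance test for an infrastructure A/B experiment.
# Counts are invented for illustration; in practice they would come
# from experiment assignment joined to request logs.
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

successes = [9_620, 9_741]   # completed sessions: control, treatment
trials = [10_000, 10_000]    # sessions per arm

stat, p_value = proportions_ztest(successes, trials)
print(f"z = {stat:.2f}, p = {p_value:.4f}")

# Confidence intervals per arm, to report effect size, not just a p-value.
for arm, (s, t) in zip(["control", "treatment"], zip(successes, trials)):
    lo, hi = proportion_confint(s, t, alpha=0.05, method="wilson")
    print(f"{arm}: {s/t:.4f} [{lo:.4f}, {hi:.4f}]")
```

Reporting Wilson intervals alongside the p-value keeps the focus on effect size, which matters when deciding whether a platform change is worth its cost.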

Nice to have

  • Observability tools
  • APM systems
  • Infrastructure monitoring platforms (e.g., Prometheus, Grafana, Datadog)
  • Machine learning infrastructure
  • Model serving
  • Performance characteristics of AI/ML systems
  • SRE practices
  • Error budgets
  • SLOs/SLIs (see the worked sketch after this list)
  • Reliability engineering principles
  • Analyzing performance of real-time or near-real-time systems
  • Latency distributions
  • Tail behavior
  • User behavior analysis
  • Growth metrics
  • Product analytics
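
Several of the items above (latency distributions, tail behavior, SLOs/SLIs, error budgets) lend themselves to a short worked sketch; the latency distribution, request counts, and 99.9% SLO target below are all invented for illustration:

```python
# Sketch: tail-latency percentiles and a simple error-budget calculation.
import numpy as np

rng = np.random.default_rng(1)
# Heavy-tailed (lognormal) latencies mimic real serving systems, where
# the mean hides the tail that users actually feel.
latency_ms = rng.lognormal(mean=5.5, sigma=0.6, size=100_000)

for q in (50, 95, 99, 99.9):
    print(f"p{q}: {np.percentile(latency_ms, q):8.1f} ms")

# Error budget: with a 99.9% availability SLO over a 30-day window,
# the budget is the allowed fraction of failed requests.
slo_target = 0.999
total_requests = 50_000_000
failed_requests = 32_000
budget = (1 - slo_target) * total_requests
burn = failed_requests / budget
print(f"error budget: {budget:,.0f} failures; consumed: {burn:.1%}")
```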

What the JD emphasized

  • 5+ years of hands-on data science experience
  • Deep understanding of distributed systems, cloud infrastructure, and performance engineering
  • Expertise in experimental design, causal inference, statistical modeling, and A/B testing frameworks
  • Strong skills in Python, SQL, and data analysis tools
  • Experience translating technical performance metrics into user experience insights
  • Proven ability to work effectively with engineering teams and translate complex technical analyses into actionable recommendations

Other signals

  • Analyze how platform performance impacts user behavior
  • Improve system reliability and responsiveness (see the anomaly-detection sketch below)
  • Quantify user sensitivity to latency, reliability, errors, and refusal rates
  • Translate insights into actionable recommendations for platform infrastructure
  • Shape the technical foundation for scaling AI globally
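
Finally, a minimal sketch of the anomaly-investigation side of this work, using a rolling robust z-score on a synthetic per-minute latency series; the window size and threshold are illustrative and would be tuned against real traffic:

```python
# Sketch: flag latency anomalies with a rolling median/MAD z-score.
# The time series is synthetic, with one injected incident.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
ts = pd.Series(rng.normal(200, 15, size=24 * 60),
               index=pd.date_range("2025-01-01", periods=24 * 60, freq="min"))
ts.iloc[700:720] += 120  # injected incident: a 20-minute latency spike

rolling = ts.rolling(window=60, min_periods=30)
median = rolling.median()
mad = (ts - median).abs().rolling(window=60, min_periods=30).median()
robust_z = 0.6745 * (ts - median) / mad

anomalies = ts[robust_z > 5]
print(f"{len(anomalies)} anomalous minutes, first at {anomalies.index.min()}")
```

A median/MAD score is less easily contaminated by the incident itself than a mean/standard-deviation z-score, which keeps the baseline honest during the spike.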