Data Scientist, Platform (reliability/latency/inference)

Anthropic · AI Frontier · Data Science & Analytics

A Data Scientist role focused on platform reliability and latency for AI systems, analyzing user behavior and system performance to drive infrastructure improvements and optimize resource allocation. The role is central to delivering a reliable, responsive user experience with AI products at scale.

What you'd actually do

  1. Design and execute comprehensive analyses to understand how latency, reliability, errors, and refusal rates affect user engagement, satisfaction, and retention across our platform (see the sketch after this list)
  2. Identify and prioritize high-impact infrastructure improvements by analyzing user behavior patterns, system performance metrics, and the relationship between technical performance and business outcomes
  3. Develop robust methodologies to measure platform reliability and performance, including defining key metrics, establishing baselines, and creating monitoring systems that enable proactive optimization
  4. Collaborate with engineering teams to design A/B tests and controlled experiments that measure the impact of platform improvements on user experience and system performance
  5. Investigate performance anomalies, conduct root cause analysis of reliability issues, and provide data-driven insights to guide engineering priorities and architectural decisions
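
A minimal sketch of the kind of analysis item 1 describes, assuming a hypothetical per-user dataframe; the column names (p95_latency_ms, error_rate, retained_30d) and the synthetic data are illustrative, not from the posting:

```python
# Sketch: relate request latency and error rate to user retention.
# All data here is synthetic; in practice the per-user aggregates
# would come from request logs joined to retention labels.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 10_000
users = pd.DataFrame({
    "p95_latency_ms": rng.gamma(shape=2.0, scale=400.0, size=n),
    "error_rate": rng.beta(1, 50, size=n),
})
# Synthetic ground truth: retention drops as latency and errors rise.
logit = 1.5 - 0.0015 * users["p95_latency_ms"] - 8 * users["error_rate"]
users["retained_30d"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Logistic regression of retention on latency and error rate.
X = sm.add_constant(users[["p95_latency_ms", "error_rate"]])
model = sm.Logit(users["retained_30d"], X).fit(disp=0)
print(model.summary2())

# Retention rate by latency decile shows the dose-response curve directly.
users["latency_decile"] = pd.qcut(users["p95_latency_ms"], 10, labels=False)
print(users.groupby("latency_decile")["retained_30d"].mean())
```

The decile breakdown complements the regression: it shows the shape of the latency-retention relationship without assuming linearity in the log-odds.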

Skills

Required

  • Advanced degree in Statistics, Computer Science, Engineering, Mathematics, or related quantitative field
  • 5+ years of hands-on data science experience
  • Understanding of distributed systems, cloud infrastructure, and performance engineering
  • Analyzing large-scale system metrics
  • Experimental design
  • Causal inference
  • Statistical modeling
  • A/B testing frameworks (illustrated in the sketch after this list)
  • Python
  • SQL
  • Data analysis tools
  • Working with large datasets
  • Real-time streaming data
  • Translating technical performance metrics into user experience insights
  • Working effectively with engineering teams
  • Translating complex technical analyses into actionable recommendations
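
As an illustration of the experimental-design and A/B-testing items above, a minimal significance check for an infrastructure experiment, assuming invented session counts for a control and a treatment serving configuration:

```python
# Sketch: significance test for an infrastructure A/B experiment.
# Counts are invented for illustration; in practice they would come
# from experiment assignment joined to request logs.
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

successes = [9_620, 9_741]   # completed sessions: control, treatment
trials = [10_000, 10_000]    # sessions per arm

stat, p_value = proportions_ztest(successes, trials)
print(f"z = {stat:.2f}, p = {p_value:.4f}")

# Confidence intervals per arm, to report effect size, not just a p-value.
for arm, (s, t) in zip(["control", "treatment"], zip(successes, trials)):
    lo, hi = proportion_confint(s, t, alpha=0.05, method="wilson")
    print(f"{arm}: {s/t:.4f} [{lo:.4f}, {hi:.4f}]")
```

Reporting Wilson intervals alongside the p-value keeps the focus on effect size, which matters when deciding whether a platform change is worth its cost.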

Nice to have

  • Observability tools
  • APM systems
  • Infrastructure monitoring platforms (e.g., Prometheus, Grafana, Datadog)
  • Machine learning infrastructure
  • Model serving
  • Performance characteristics of AI/ML systems
  • SRE practices
  • Error budgets
  • SLOs/SLIs (see the worked sketch after this list)
  • Reliability engineering principles
  • Analyzing performance of real-time or near-real-time systems
  • Latency distributions
  • Tail behavior
  • User behavior analysis
  • Growth metrics
  • Product analytics
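
Several of the items above (latency distributions, tail behavior, SLOs/SLIs, error budgets) lend themselves to a short worked sketch; the latency distribution, request counts, and 99.9% SLO target below are all invented for illustration:

```python
# Sketch: tail-latency percentiles and a simple error-budget calculation.
import numpy as np

rng = np.random.default_rng(1)
# Heavy-tailed (lognormal) latencies mimic real serving systems, where
# the mean hides the tail that users actually feel.
latency_ms = rng.lognormal(mean=5.5, sigma=0.6, size=100_000)

for q in (50, 95, 99, 99.9):
    print(f"p{q}: {np.percentile(latency_ms, q):8.1f} ms")

# Error budget: with a 99.9% availability SLO over a 30-day window,
# the budget is the allowed fraction of failed requests.
slo_target = 0.999
total_requests = 50_000_000
failed_requests = 32_000
budget = (1 - slo_target) * total_requests
burn = failed_requests / budget
print(f"error budget: {budget:,.0f} failures; consumed: {burn:.1%}")
```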

What the JD emphasized

  • 5+ years of hands-on data science experience
  • Deep understanding of distributed systems, cloud infrastructure, and performance engineering
  • Expertise in experimental design, causal inference, statistical modeling, and A/B testing frameworks
  • Strong skills in Python, SQL, and data analysis tools
  • Experience translating technical performance metrics into user experience insights
  • Proven ability to work effectively with engineering teams and translate complex technical analyses into actionable recommendations

Other signals

  • Analyze how platform performance impacts user behavior
  • Improve system reliability and responsiveness (see the anomaly-detection sketch below)
  • Quantify user sensitivity to latency, reliability, errors, and refusal rates
  • Translate insights into actionable recommendations for platform infrastructure
  • Shape the technical foundation for scaling AI globally
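
Finally, a minimal sketch of the anomaly-investigation side of this work, using a rolling robust z-score on a synthetic per-minute latency series; the window size and threshold are illustrative and would be tuned against real traffic:

```python
# Sketch: flag latency anomalies with a rolling median/MAD z-score.
# The time series is synthetic, with one injected incident.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
ts = pd.Series(rng.normal(200, 15, size=24 * 60),
               index=pd.date_range("2025-01-01", periods=24 * 60, freq="min"))
ts.iloc[700:720] += 120  # injected incident: a 20-minute latency spike

rolling = ts.rolling(window=60, min_periods=30)
median = rolling.median()
mad = (ts - median).abs().rolling(window=60, min_periods=30).median()
robust_z = 0.6745 * (ts - median) / mad

anomalies = ts[robust_z > 5]
print(f"{len(anomalies)} anomalous minutes, first at {anomalies.index.min()}")
```

A median/MAD score is less easily contaminated by the incident itself than a mean/standard-deviation z-score, which keeps the baseline honest during the spike.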