Data Scientist 5 - AI Evals

Netflix · Big Tech · United States · Remote · Data & Insights

Netflix is seeking an experienced Senior Data Scientist specializing in AI Evals to architect systems and frameworks for measuring, validating, and optimizing GenAI systems in production for both player-facing games and internal agentic tools. The role involves building evaluation pipelines, curating datasets, designing experiments to link technical attributes with user experience, and guiding evaluations for agentic systems.

What you'd actually do

  1. Partner with the GenAI research team to graduate GenAI products from R&D into production at scale and live operations.
  2. Build and operate robust evaluation pipelines on production-stage GenAI experiences, using a mix of automated metrics, LLM-as-a-Judge frameworks, human-in-the-loop grading, and simulation-based testing.
  3. Curate high-quality golden datasets, test suites, adversarial challenge sets, and synthetic testbeds to establish ground-truth performance across various generative tasks.
  4. Design experiments to understand the trade-offs between technical attributes and end-user experience quality in a real-time game environment.
  5. Measure the coherence, fluency, relevance, and joy value of AI-powered game features.

Skills

Required

  • Ph.D. in Data Science, Computer Science, Statistics, Cognitive Science, or a related quantitative field.
  • 4+ years of industry experience in Data Science, ML, or AI.
  • Strong foundation in experimental design, causal inference, A/B testing, and uncertainty quantification.
  • Experience with modern AI Evals and observability frameworks (e.g., OpenAI/Anthropic evaluation suites).
  • Proven track record of evaluating LLM and agentic systems.
  • Deep understanding of prompt engineering, RAG Evals, and agentic Evals.
  • Understanding of agent architectures and how to evaluate long-horizon reasoning and complex tool-use.

Nice to have

  • Experience with defining core user experience metrics in gaming or streaming.
  • Experience working with game development teams, particularly game design and engineering.
  • Experience with building production-grade ML systems, including MLOps best practices.

What the JD emphasized

  • architect the systems and framework to measure, validate, and optimize GenAI systems in production
  • rigorous, unbiased measurement
  • evaluate GenAI systems
  • evaluate LLM and agentic systems
  • evaluate long-horizon reasoning and complex tool-use

Other signals

  • evaluating GenAI systems in production
  • architecting evaluation systems and frameworks
  • measuring, validating, and optimizing GenAI systems
  • evaluating player-facing game experiences
  • evaluating internal agentic tools