What you'd actually do

Partner with the GenAI research team to graduate GenAI products from R&D into production at scale and live operations.

Build and operate robust evaluation pipelines on production-stage GenAI experiences, using a mix of automated metrics, LLM-as-a-Judge frameworks, human-in-the-loop grading, and simulation-based testing.

Curate high-quality golden datasets, test suites, adversarial challenge sets, and synthetic testbeds to establish ground-truth performance across various generative tasks.

Design experiments to understand the trade-offs between technical attributes and end-user experience quality in a real-time game environment.

Measure the coherence, fluency, relevance, and joy value of AI-powered game features.

Skills

Required

Ph.D. in Data Science, Computer Science, Statistics, Cognitive Science, or a related quantitative field.
4+ years of industry experience in Data Science, ML, or AI.
Strong foundation in experimental design, causal inference, A/B testing, and uncertainty quantification.
Experience with modern AI Evals and observability frameworks (e.g., OpenAI/Anthropic evaluation suites).
Proven track record of evaluating LLM and agentic systems.
Deep understanding of prompt engineering, RAG Evals, and agentic Evals.
Understanding of agent architectures and how to evaluate long-horizon reasoning and complex tool-use.

Nice to have

Experience with defining core user experience metrics in gaming or streaming.
Experience working with game development teams, particularly game design and engineering.
Experience with building production-grade ML systems, including MLOps best practices.

At Netflix, our mission is to entertain the world. Together, we are writing the next episode - pushing the boundaries of storytelling, global fandom and making the unimaginable a reality. We are a dream team obsessed with the uncomfortable excitement of discovering what happens when you merge creativity, intuition and cutting-edge technology. Come be a part of what’s next.

Games are our next big frontier and an incredible opportunity for us to deliver new experiences to delight and entertain our quickly growing membership. You will be jumping in at the very beginning of this adventure and be in a position to help us redefine what a Netflix subscription means for our members around the world.

Data Science and Engineering (DSE) at Netflix uses data, analytics, causal inference, machine learning, and sciences to improve various aspects of our business. The AI initiative at Netflix Games is dedicated to pioneering the next generation of interactive entertainment. Our ambition is to transform how players interact with stories, characters, and worlds by empowering the gameplay experience with AI. We work at the intersection of creative game design and cutting-edge machine learning, ensuring that dynamic storytelling is not only novel but also coherent, immersive, and safe for our players.

We are seeking an experienced Senior Data Scientist specializing in AI Evals to architect the systems and frameworks that measure, validate, and optimize GenAI systems in production. You will bridge the gap between model capabilities and end-user experience with rigorous, unbiased measurement. Your work will span two critical domains: a) player-facing game experiences, evaluating AI-powered storytelling, interactions, and gameplay to ensure our games are engaging, grounded, and safe; b) internal agentic tools, designing evaluation harnesses for agentic systems used by our internal technical, business, and creative teams. You will work alongside world-class Scientists, Engineers, and Designers to define what good looks like for frontier AI systems in games, ensuring we ship with confidence and iterate fast without compromising the player experience or technical excellence.

In this role, you will:

Partner with the GenAI research team to graduate GenAI products from R&D into production at scale and live operations.
Build and operate robust evaluation pipelines on production-stage GenAI experiences, using a mix of automated metrics, LLM-as-a-Judge frameworks, human-in-the-loop grading, and simulation-based testing.
Curate high-quality golden datasets, test suites, adversarial challenge sets, and synthetic testbeds to establish ground-truth performance across various generative tasks.
Design experiments to understand the trade-offs between technical attributes and end-user experience quality in a real-time game environment.
Measure the coherence, fluency, relevance, and joy value of AI-powered game features.
Design red-teaming protocols and safety evaluators to detect and mitigate toxicity, hallucinations, jailbreaks, and out-of-character behavior in gaming environments.
Guide Evals for internal agentic tools (e.g., data analytics, experimentation).

Who Will Succeed in This Role:

Ph.D. in Data Science, Computer Science, Statistics, Cognitive Science, or a related quantitative field.
4+ years of industry experience in Data Science, ML, or AI with a strong foundation in experimental design, causal inference, A/B testing, and uncertainty quantification.
Experience with modern AI Evals and observability frameworks (e.g., OpenAI/Anthropic evaluation suites).
Proven track record of evaluating LLM and agentic systems. Deep understanding of prompt engineering, RAG Evals, and agentic Evals.
Understanding of agent architectures and how to evaluate long-horizon reasoning and complex tool-use.
Ability to bridge the gap between art and science, collaborating effectively with game teams to translate creative objectives into data specifications.
Passion for developing AI that serves joy, entertainment, and storytelling in a high-visibility consumer product.

Nice to Have

Experience with defining core user experience metrics in gaming or streaming.
Experience working with game development teams, particularly game design and engineering.
Experience with building production-grade ML systems, including MLOps best practices.

Generally, our compensation structure consists solely of an annual salary; we do not have bonuses. You choose each year how much of your compensation you want in salary versus stock options. To determine your personal top of market compensation, we rely on market indicators and consider your specific job family, background, skills, and experience to determine your compensation in the market range. The range for this role is $372,000.00 - $600,000.00. This compensation range will vary based on location.

Netflix provides comprehensive benefits including Health Plans, Mental Health support, a 401(k) Retirement Plan with employer match, Stock Option Program, Disability Programs, Health Savings and Flexible Spending Accounts, Family-forming benefits, and Life and Serious Injury Benefits. We also offer paid leave of absence programs. Full-time hourly employees accrue 35 days annually for paid time off to be used for vacation, holidays, and sick paid time off. Full-time salaried employees are immediately entitled to flexible time off. See more details about our Benefits here.

Netflix is a unique culture and environment. Learn more here.

Inclusion is a Netflix value and we strive to host a meaningful interview experience for all candidates. If you want an accommodation/adjustment for a disability or any other reason during the hiring process, please send a request to your recruiting partner.

We are an equal-opportunity employer and celebrate diversity, recognizing that diversity builds stronger teams. We approach diversity and inclusion seriously and thoughtfully. We do not discriminate on the basis of race, religion, color, ancestry, national origin, caste, sex, sexual orientation, gender, gender identity or expression, age, disability, medical condition, pregnancy, genetic makeup, marital status, or military service.

Job is open for no less than 7 days and will be removed when the position is filled.

Data Scientist 5 - AI Evals
