Member of Technical Staff - Data Scientist

Microsoft · Big Tech · Mountain View, CA +4 · Software Engineering

Data Scientist role focused on building next-generation post-training methods for frontier models at Microsoft AI. Responsibilities include designing evaluations, producing high-quality training data, building scalable data pipelines, and running post-training experiments to improve model capabilities like instruction following, coding, and agentic behaviors. The role operates across the full post-training lifecycle, from data generation to reward modeling and reinforcement learning, with a focus on turning raw model capability into reliable and measurable performance improvements.

What you'd actually do

  1. Design evaluations of advanced model capabilities and use them to drive rapid, high-signal iteration loops
  2. Work with vendors to produce high-quality evaluation and training data
  3. Build scalable data pipelines to produce high-quality evaluation and training data
  4. Build data flywheels to hill-climb on model weaknesses, using data from various surfaces where our models are deployed
  5. Ensure optimal quality, quantity, and coverage of data across all post-training stages
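The pipeline and data-quality responsibilities above can be pictured as a filtering stage. This is a minimal illustrative sketch, not anything from the JD: the `Example` schema, score field, and thresholds are all hypothetical placeholders for whatever a real post-training pipeline would use.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Example:
    prompt: str
    response: str
    quality_score: float  # hypothetical: e.g. a reward-model or rater score

def filter_training_data(examples, min_score=0.7, min_response_len=10):
    """Toy pipeline stage: dedupe on exact prompt, drop short or
    low-scoring responses. Thresholds are illustrative only."""
    seen = set()
    kept = []
    for ex in examples:
        if ex.prompt in seen:
            continue  # exact-prompt dedup (real pipelines use fuzzy/near-dup)
        if len(ex.response) < min_response_len:
            continue  # drop trivially short responses
        if ex.quality_score < min_score:
            continue  # drop low-quality examples
        seen.add(ex.prompt)
        kept.append(ex)
    return kept
```

In practice each of these checks would be its own stage (near-duplicate detection, model-based quality scoring, coverage balancing), but the shape, a stream of examples passing through explicit keep/drop rules, is the same.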

Skills

Required

  • Hands-on experience with large language models, including training or applying them in production
  • Designing and running post-training experiments (evals, ablations, preference tuning / RLHF-style methods)
  • Building and owning scalable data pipelines for training and evaluation data
  • Strong Python skills for ML experimentation, data processing, and analysis
  • Solid statistical, experimental, and general engineering fundamentals

Nice to have

  • Demonstrated SOTA results in any area of large-scale training, inference, or evaluation

What the JD emphasized

  • Hands-on experience with large language models, including training them or applying them in production (not just prompting)
  • Designing and running post-training experiments (evals, ablations, preference tuning / RLHF-style methods)
  • Building and owning scalable data pipelines for training and evaluation data

Other signals

  • post-training methods for frontier models
  • evaluation design
  • high-quality training data
  • scalable data pipelines
  • state-of-the-art foundation models