What you'd actually do

Iterate on the full stack of datasets, training stages, and hyperparameters that determine model behavior. Measure how choices compound across evals and production performance, not just isolated benchmarks.

Build evals that actually capture what matters. The loop never ends: define, optimize, realize the gaps, and rebuild. You'll be responsible for making numbers go up and making sure the numbers mean something.

When training produces results that don't make sense, you dig until you understand why. The goal isn't just to fix it; it's to carry that understanding forward to the next problem.

Apply and advance techniques like RLHF, RLAIF, and constitutional approaches to shape how agents reason, act, and collaborate with humans in long-horizon tasks.

Measure how performance scales with data and compute, and develop new methodologies when existing ones hit ceilings. We expect both rigor and invention.

Skills

Required

post-training
alignment methods
RLHF
RLAIF
preference modeling
reward learning
probability
statistics
ML theory
experimental data analysis
original contributions
large-scale distributed training
systems-level thinking
fast-moving research environments

Nice to have

PhD
competitive programmers
former founders
frontier of AI research

Who We Are

We are an applied AI lab building end-to-end software agents. We're the team behind Devin, the first AI software engineer, and Windsurf, an AI-native IDE. These products represent our vision for AI that doesn't just assist engineers, but works alongside them as a genuine teammate.

Our team is small and talent-dense: world-class competitive programmers, former founders, and researchers from the frontier of AI, including Scale AI, Palantir, Cursor, Google DeepMind, and others.

Role Mission

Post-training is the critical bridge between raw model capability and a system that is actually useful, safe, and effective in the real world. You will shape how our agents learn by iterating on training recipes, evaluations, and alignment methods that directly determine what Devin and our future systems can do. This role blends deep research and hands-on engineering. We don't distinguish between the two.

What You'll Accomplish

**Post-Training Recipe Development:** Iterate on the full stack of datasets, training stages, and hyperparameters that determine model behavior. Measure how choices compound across evals and production performance, not just isolated benchmarks.
**Evaluation Design and Integrity:** Build evals that actually capture what matters. The loop never ends: define, optimize, realize the gaps, and rebuild. You'll be responsible for making numbers go up and making sure the numbers mean something.
**Deep Understanding:** When training produces results that don't make sense, you dig until you understand why. The goal isn't just to fix it; it's to carry that understanding forward to the next problem.
**Alignment and Agent Behavior:** Apply and advance techniques like RLHF, RLAIF, and constitutional approaches to shape how agents reason, act, and collaborate with humans in long-horizon tasks.
**Scaling and Exploration:** Measure how performance scales with data and compute, and develop new methodologies when existing ones hit ceilings. We expect both rigor and invention.

Exceptional Candidates Have Demonstrated

A track record of advancing ML systems through post-training, alignment, or related methods: RLHF, RLAIF, preference modeling, reward learning, or equivalent
Strong fundamentals in probability, statistics, and ML theory. The ability to look at experimental data and distinguish real effects from noise and bugs
Evidence of original contributions: publications at top venues, open-source impact, or equivalent industry results
Experience with large-scale distributed training and the debugging that comes with it
Systems-level thinking: not just model optimization, but understanding how training pipelines, data, and evaluation interact
Comfort with ambiguity and fast-moving research environments where priorities shift quickly
We care more about demonstrated capability than credentials. A PhD is one signal among many.

Resources & Environment

Small, highly selective team where research and product move together; prototypes reach real deployment quickly
Compute is not a constraint: large allocations with training jobs routinely running across thousands of GPUs from day one
The environment rewards speed, autonomy, and technical depth with minimal process overhead; this is one of the most competitive and fast-moving problems in AI
Everything needed to operate at frontier scale from day one.

Equal Opportunity

Cognition is an equal opportunity employer. We do not discriminate on the basis of race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, veteran status, or any other protected characteristic under applicable law. We are committed to providing reasonable accommodations for candidates with disabilities throughout the hiring process; please let us know if you need any.
