Agent Post-training, Context Research

OpenAI · AI Frontier · San Francisco, CA · Research

Research role focused on training frontier agents, specifically improving post-training capabilities and scaling the compute spent on context. Responsibilities include designing experiments, owning improvements to the post-training stack (RL, data pipelines, graders, reward signals, evals), building evals and environments, partnering with product teams, working on early-training and alignment interventions, and debugging model failures. The role calls for hands-on experience with LLMs, RL, RLHF/RLAIF, post-training, evals, and production ML systems, and prioritizes product impact and model behavior over benchmark movement.

What you'd actually do

  1. Design and run experiments that improve scaling of compute on context.
  2. Own end-to-end improvements to the post-training stack, including RL, data pipelines, graders, reward signals, evals, diagnostics, and model-behavior analysis.
  3. Build evals and environments that expose the next set of model failures, then turn those failures into training data, product fixes, or new research directions.
  4. Partner with Codex and ChatGPT product teams to understand what users need and translate product signal into model improvements.
  5. Work on early-training and alignment interventions, including data mixtures, objectives, synthetic data, and eval loops that shape downstream agent behavior.

Skills

Required

  • strong technical fundamentals in machine learning, software engineering, systems, statistics, or a related field
  • learn quickly in the areas you have not worked in before
  • hands-on experience with LLMs, RL, RLHF/RLAIF, post-training, evals, graders, synthetic data, model training, coding agents, tool-using agents, or production ML systems
  • excited by open-ended problems where the path is unclear, the signal is noisy, and the right answer requires both research taste and engineering execution
  • care about product impact and model behavior, not just benchmark movement
  • opinions about what makes an agent useful, reliable, honest, tasteful, and easy to work with
  • can move from a vague behavioral problem to a concrete experiment: define the hypothesis, build the pipeline, run the model, analyze the result, and decide what to do next
  • comfortable working across research, product, infrastructure, data, evals, and safety boundaries, and can communicate clearly with each group
  • like building load-bearing systems and processes when that is what the team needs
  • want to train and ship the models that make agents genuinely useful for developers, enterprises, researchers, and everyday users

Nice to have

  • multi-agent coordination
  • long-horizon execution
  • factuality
  • instruction following
  • calibrated reasoning
  • taste
  • Codex Chronicle
  • Codex and ChatGPT product teams
  • model training
  • product infrastructure
  • production agent harness
  • multi-agent systems
  • training directly against production-like environments

What the JD emphasized

  • frontier agents
  • post-training
  • scale compute spent on context
  • frontier training stack
  • iterative deployment
  • major model runs
  • product interface
  • product signal into model improvements
  • early-training and alignment interventions
  • agent behavior
  • production agent harness
  • multi-agent systems
  • training directly against production-like environments
  • debug hard failures in shipped or near-shipped models
  • LLMs
  • RL
  • RLHF/RLAIF
  • evals
  • graders
  • synthetic data
  • model training
  • coding agents
  • tool-using agents
  • production ML systems
  • product impact
  • model behavior
  • useful, reliable, honest, tasteful, and easy to work with
  • research taste and engineering execution
  • move from a vague behavioral problem to a concrete experiment
  • build the pipeline, run the model, analyze the result, and decide what to do next
  • communicate clearly with each group
  • build load-bearing systems and processes
  • train and ship the models that make agents genuinely useful

Other signals

  • training the models behind our agents
  • build the data, environments, graders, training methods, and feedback loops