Agent Post-training, API & Power Users

OpenAI · AI Frontier · San Francisco, CA · Research

OpenAI's Agent Post-Training team is responsible for training frontier agents for products like Codex, ChatGPT, and the API. This role focuses on improving the capabilities, reliability, and product fit of agentic models for power users and API developers. Responsibilities include designing experiments, building training environments, creating evals, and driving behavior improvements from discovery to launch, working across research, engineering, data, and product teams. The role requires hands-on experience with LLMs, post-training techniques, and a strong understanding of developer and expert-user needs.

What you'd actually do

  1. Design and run experiments that improve model behavior in API and power-user workflows: function calling, tool use, coding, planning, long-horizon execution, factuality, instruction following, error recovery, and calibrated reasoning.
  2. Build evals, graders, and environments from real developer and power-user workflows, then turn observed failures into training data, model-behavior hypotheses, and shipped improvements.
  3. Partner with API developers and power users to identify high-leverage behavior gaps and convert product signals into post-training interventions.
  4. Improve how models behave when composed into systems: using tools reliably, respecting developer intent, handling partial failures, asking for clarification when appropriate, and maintaining coherence across multi-step tasks.
  5. Own end-to-end model behavior projects, from qualitative failure analysis through data generation, training experiments, eval design, integration into major runs, and launch readiness.

Skills

Required

  • LLMs
  • post-training
  • RL/RLHF/RLAIF
  • evals
  • graders
  • synthetic data
  • coding agents
  • tool-using agents
  • API products
  • production ML systems
  • ML
  • software engineering
  • systems
  • statistics
  • applied research

Nice to have

  • model behavior
  • developer and expert-user experience
  • ambiguous capability problems
  • agentic systems

What the JD emphasized

  • improve the capabilities, reliability, and product fit
  • design and run experiments
  • build evals, graders, and environments
  • turn qualitative model failures into training data, evals, or post-training interventions
  • drive a behavior improvement from discovery through post-training, integration, and launch
  • comfortable turning ambiguous model behavior problems into concrete progress
  • improving tool use, planning, instruction following, recovery from mistakes, or how models behave in API-based workflows
  • work across research, engineering, data, evals, and product
  • decide which behaviors matter, how to measure them, how to train them, and when they are ready for major model runs
  • high-agency role
  • function calling, tool use, coding, planning, long-horizon execution, factuality, instruction following, error recovery, and calibrated reasoning
  • build evals, graders, and environments from real developer and power-user workflows
  • turn observed failures into training data, model-behavior hypotheses, and shipped improvements
  • partner with API developers and power users to identify high-leverage behavior gaps
  • convert product signals into post-training interventions
  • improve how models behave when composed into systems
  • using tools reliably, respecting developer intent, handling partial failures, asking for clarification when appropriate, and maintaining coherence across multi-step tasks
  • own end-to-end model behavior projects
  • qualitative failure analysis
  • data generation
  • training experiments
  • eval design
  • integration into major runs
  • launch readiness
  • develop feedback loops
  • power-user traces
  • API usage patterns
  • production-like environments
  • discover the next frontier of agentic model failures and gaps
  • help decide which agentic capabilities, behavioral fixes, and partner-team integrations are ready for inclusion in major model runs
  • debug hard failures in shipped or near-shipped models
  • moving between traces, evals, training data, model outputs, and product context
  • work on early-training and alignment interventions
  • data mixtures, objectives, synthetic data, and eval loops that shape downstream agent behavior
  • improve the machinery for large-scale training and launch
  • experiment velocity, reliability, observability, reproducibility, cost, latency, and production readiness
  • take on cross-functional projects
  • model training, product infrastructure, and the production agent harness
  • multi-agent systems
  • training directly against production-like environments
  • strong technical fundamentals in ML, software engineering, systems, statistics, or applied research
  • can quickly learn across unfamiliar parts of the stack
  • hands-on experience with LLMs, post-training, RL/RLHF/RLAIF, evals, graders, synthetic data, coding agents, tool-using agents, API products, or production ML systems
  • strong taste for model behavior
  • look at a transcript, trace, eval failure, or API interaction and form concrete hypotheses about what the model needs to learn
  • excited by ambiguous capability problems where the signal is noisy, the failures are qualitative, and the solution may involve data, training, evals, product changes, or all of the above
  • deeply care about developer and expert-user experience
  • how models behave when embedded in real user workflows, API products, and agent harnesses
  • comfortable working across research, product, infrastructure, data, evals, and safety boundaries
  • can communicate clearly with each group
  • like building load-bearing systems and processes when that is what the team needs
  • want to train and ship the models that make agents genuinely useful for developers, enterprises, researchers, and everyday users

Other signals

  • training frontier agents
  • improving capabilities, reliability, and product fit
  • designing evals from real developer workflows
  • building training environments around production-like tool use
  • turning qualitative model failures into training data, evals, or post-training interventions
  • driving behavior improvement from discovery through post-training, integration, and launch
  • working across research, engineering, data, evals, and product
  • improving model behavior in API and power-user workflows
  • building evals, graders, and environments from real developer and power-user workflows
  • partnering with API developers and power users to identify high-leverage behavior gaps
  • improving how models behave when composed into systems
  • owning end-to-end model behavior projects
  • developing feedback loops that use power-user traces, API usage patterns, and production-like environments
  • deciding which agentic capabilities, behavioral fixes, and partner-team integrations are ready for inclusion in major model runs
  • debugging hard failures in shipped or near-shipped models
  • working on early-training and alignment interventions
  • improving the machinery for large-scale training and launch
  • taking on cross-functional projects that touch model training, product infrastructure, and the production agent harness
  • train and ship the models that make agents genuinely useful for developers, enterprises, researchers, and everyday users