Researcher, Loss of Control

OpenAI OpenAI · AI Frontier · San Francisco, CA · Safety Systems

Researcher focused on mitigating loss of control risk in frontier AI models, designing and implementing an end-to-end mitigation stack for preventing, monitoring, detecting, containing, and enforcing against intentionally subversive or insufficiently controllable model behavior. This involves integrating safeguards across products and research, evaluating technical trade-offs, collaborating with risk modeling and evaluations teams, and executing rigorous testing and red-teaming workflows against advanced AI behaviors like sandbagging, monitor evasion, exploit-seeking, unsafe tool use, or strategic deception.

What you'd actually do

  1. Design and implement mitigation components for loss of control risk—spanning prevention, monitoring, detection, containment, and enforcement—under the guidance of senior technical and risk leadership.
  2. Integrate safeguards across product and research surfaces in partnership with product, engineering, and research teams, helping ensure protections are consistent, low-latency, and resilient as usage and model autonomy increase.
  3. Evaluate technical trade-offs within the loss of control domain (coverage, robustness, latency, model utility, and operational complexity) and propose pragmatic, testable solutions.
  4. Collaborate closely with risk modeling, evaluations, and policy partners to align mitigation design with anticipated failure modes and high-severity threat scenarios, including deceptive alignment, hidden subgoals, reward hacking, and attempts to evade oversight.
  5. Execute rigorous testing and red-teaming workflows, helping stress-test the mitigation stack against increasingly capable and potentially subversive model behaviors—such as sandbagging, monitor evasion, exploit-seeking, unsafe tool use, or strategic deception—and iterate based on findings.

Skills

Required

  • deep learning
  • transformer models
  • PyTorch or TensorFlow
  • data structures
  • algorithms
  • software engineering principles
  • training and fine-tuning large language models
  • distillation
  • supervised fine-tuning
  • policy optimization
  • designing and evaluating technical safeguards
  • control mechanisms
  • monitoring systems for advanced AI behavior

Nice to have

  • AI safety
  • alignment
  • control
  • interpretability
  • robustness
  • adversarial ML

What the JD emphasized

  • loss of control risk
  • subversive model behavior
  • safeguards
  • mitigation stack
  • frontier AI models
  • deceptive alignment
  • hidden subgoals
  • reward hacking
  • sandbagging
  • monitor evasion
  • exploit-seeking
  • unsafe tool use
  • strategic deception

Other signals

  • loss of control risk
  • subversive model behavior
  • safeguards
  • mitigation stack
  • frontier AI models