Research Scientist, Learning & Cognitive Outcomes

OpenAI · AI Frontier · London, United Kingdom · Go To Market

Research Scientist focused on building scientific and evaluation infrastructure to understand how AI systems affect learning, cognition, and capability development over time. The role involves designing rigorous studies, developing scalable evaluation methods, and measuring cognitive outcomes beyond engagement. It sits at the intersection of learning science, cognitive science, experimental design, LLM evaluation, and applied product research, with an initial focus on young users and education settings. This is an applied, empirical role: the evidence systems it builds must be scientifically credible, operationally useful, and influential in model and product development.

What you'd actually do

  1. design rigorous studies
  2. develop scalable evaluation methods (a toy metric sketch follows this list)
  3. help answer a central question: do AI systems help people become more capable over time?
  4. build classifiers and graders
  5. translate findings into model and product improvements
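
Item 2 hints at the day-to-day mechanics: turning raw interaction logs into per-user behavioral metrics. A toy sketch in Python is below; the log schema, the event names, and the idea that help-seeking can be read straight off logs are all illustrative assumptions, not anything the posting specifies.

```python
# Toy behavioral-metric pipeline: per-user help-seeking rate from event logs.
# The schema and event names ("ask_hint", "submit") are hypothetical.
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "event":   ["ask_hint", "submit", "submit", "submit", "ask_hint", "submit"],
})

# Help-seeking rate: the share of a user's events that are hint requests.
rate = (
    events.assign(is_hint=events["event"].eq("ask_hint"))
          .groupby("user_id")["is_hint"]
          .mean()
          .rename("help_seeking_rate")
)
print(rate)  # user 1: 0.33, user 2: 0.50, user 3: 0.00
```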

Skills

Required

  • strong grounding in learning science, cognitive science, educational psychology, behavioral science, HCI, or a related empirical field
  • experience designing and executing rigorous empirical research, including RCTs, field experiments, large-scale behavioral studies, or other causal evaluation methods (see the analysis sketch after this list)
  • ability to design studies that measure meaningful cognitive and learning outcomes
  • technical fluency to work with data directly, prototype analyses, inspect model outputs, and reason about classifier and grader performance
  • ability to operate independently in ambiguous environments
  • ability to communicate clearly with technical, scientific, partner, and executive audiences
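
As a concrete anchor for the RCT bullet above, here is a minimal two-arm analysis in Python on simulated data: Welch's t-test plus Cohen's d. The outcome (post-test gains), sample sizes, and effect sizes are invented for illustration.

```python
# Minimal two-arm RCT analysis on simulated post-test gains (illustrative).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=5.0, scale=8.0, size=200)    # no-tutor arm
treatment = rng.normal(loc=8.0, scale=8.0, size=200)  # AI-tutor arm

# Welch's t-test: does not assume equal variances across arms.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# Cohen's d with a pooled SD, the usual standardized effect size.
pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
cohens_d = (treatment.mean() - control.mean()) / pooled_sd

print(f"t = {t_stat:.2f}, p = {p_value:.4f}, d = {cohens_d:.2f}")
```

In practice a study like this would also adjust for baseline covariates (e.g., regression on pre-test scores), but the skeleton is the same.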

Nice to have

  • experience working in frontier AI, big tech research, edtech, learning platforms, tutoring systems, assessment, or other technically sophisticated product environments
  • experience building or evaluating LLM-based graders, classifiers, model-as-judge systems, benchmark datasets, automated assessment tools, or behavioral measurement pipelines (a minimal grader sketch follows this list)
  • familiarity with outcomes such as reasoning quality, transfer, metacognition, self-regulated learning, motivation, autonomy, cognitive offloading, overreliance, help-seeking, feedback use, or durable skill acquisition
  • experience running multi-site studies or managing external research programs
  • familiarity with psychometrics, measurement validation, causal inference, longitudinal study design, mixed-methods research, or large-scale behavioral data analysis
  • experience with research involving young users and educational institutions
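
The grader bullet is concrete enough to sketch. Below is a hedged model-as-judge skeleton: `call_model` is a hypothetical stand-in for a real LLM call (it returns a canned response here), and the rubric text and JSON shape are illustrative choices, not anything specified in the posting.

```python
# Hedged sketch of a rubric-based model-as-judge grader.
import json

RUBRIC = (
    "Score the student's explanation from 1 (restates the answer) to 5 "
    "(correct, step-by-step justification). Reply as JSON: "
    '{"score": <int>, "rationale": "<one sentence>"}'
)

def call_model(prompt: str) -> str:
    """Hypothetical placeholder for an LLM completion call."""
    return '{"score": 4, "rationale": "Sound reasoning with one skipped step."}'

def grade(question: str, answer: str) -> dict:
    prompt = f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"
    # Parse strictly so a malformed judge response fails loudly, not silently.
    return json.loads(call_model(prompt))

print(grade("Why does ice float?", "Ice is less dense than liquid water, so..."))
```

A judge like this is only useful once validated against human labels, which is exactly what the next section emphasizes.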

What the JD emphasized

  • measure whether users develop better reasoning, stronger metacognition, greater autonomy, deeper understanding, improved transfer, and more durable skills
  • build and validate evaluation systems for learning and cognitive outcomes, including rubrics, classifiers, graders, benchmarks, behavioral metrics, and model-based evaluators
  • develop methods for detecting both positive and negative effects of AI use: positives such as improved reasoning, better metacognition, durable learning, and transfer; negatives such as overreliance, shallow fluency, answer-copying, reduced agency, and unproductive cognitive offloading
  • understand the practical strengths and limitations of LLM-based evaluation methods, including model-as-judge systems, rubric design, validation, calibration, inter-rater reliability, and precision/recall tradeoffs (a validation sketch follows this list)
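
To make that validation vocabulary concrete, here is a small Python sketch: agreement between a hypothetical binary "overreliance" classifier and human labels via Cohen's kappa, then the precision/recall tradeoff as the decision threshold moves. All labels and scores are simulated.

```python
# Validating a binary classifier against human labels (simulated data).
import numpy as np
from sklearn.metrics import cohen_kappa_score, precision_recall_curve

rng = np.random.default_rng(1)
human = rng.integers(0, 2, size=500)  # human rater labels (0/1)
scores = np.clip(human * 0.5 + rng.normal(0.3, 0.25, size=500), 0, 1)

# Chance-corrected agreement at the default 0.5 threshold.
model = (scores >= 0.5).astype(int)
print("kappa:", round(cohen_kappa_score(human, model), 3))

# Precision/recall tradeoff: e.g., hold precision >= 0.9, maximize recall.
precision, recall, thresholds = precision_recall_curve(human, scores)
ok = precision[:-1] >= 0.9  # precision[i] pairs with thresholds[i]
if ok.any():
    best = int(np.argmax(recall[:-1] * ok))
    print(f"threshold = {thresholds[best]:.2f}, recall = {recall[best]:.3f}")
```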

Other signals

  • evaluating AI systems
  • measuring cognitive outcomes
  • designing empirical studies
  • translating findings into model and product improvements