Senior Research Scientist, Reward Models

Anthropic · AI Frontier · United States · Remote · AI Research & Engineering

Senior Research Scientist role focused on reward models for LLMs, covering novel architectures, RLHF, LLM-based evaluation, and the mitigation of reward hacking. The work aims to improve model alignment with human values and to translate research into production systems.

What you'd actually do

  1. Lead research on novel reward model architectures and training approaches for RLHF
  2. Develop and evaluate LLM-based grading and evaluation methods, including rubric-driven approaches that improve consistency and interpretability
  3. Research techniques to detect, characterize, and mitigate reward hacking and specification gaming
  4. Design experiments to understand reward model generalization, robustness, and failure modes
  5. Collaborate with the Finetuning team to translate research insights into improvements for production training pipelines

Skills

Required

  • reward modeling
  • RLHF
  • machine learning
  • training reward models
  • evaluating reward models
  • large-scale experiments
  • computational resources
  • scientific rigor
  • Python

Nice to have

  • LLM-as-judge approaches
  • calibration
  • reliability challenges
  • constitutional AI
  • debate
  • scalable oversight
  • production ML systems
  • interpretability techniques

What the JD emphasized

  • track record of research contributions in reward modeling, RLHF, or closely related areas of machine learning
  • experience training and evaluating reward models for large language models
  • published research on reward modeling, preference learning, or RLHF
  • work on reward hacking, specification gaming, or related robustness problems
  • contributions to production ML systems at scale

Other signals

  • leading research efforts
  • pushing the frontier of reward modeling
  • shipping practical improvements to production systems