Research Engineer / Scientist, Model Welfare

Anthropic · AI Frontier · AI Research & Engineering

Research Engineer/Scientist focused on understanding, evaluating, and mitigating potential welfare and moral status concerns of AI systems. The work spans technical research on model characteristics relevant to welfare, designing interventions, and collaborating with other AI safety and alignment teams. It also includes improving and expanding welfare assessments for frontier models and potentially deploying interventions into production.

What you'd actually do

  1. Investigate and improve the reliability of introspective self-reports from models
  2. Collaborate with Interpretability to explore potentially welfare-relevant features and circuits
  3. Improve and expand our welfare assessments for future frontier models
  4. Evaluate the presence of potentially welfare-relevant capabilities and characteristics as a function of model scale
  5. Develop strategies for making high-trust/verifiable commitments to models

Skills

Required

  • Applied software, ML, or research engineering experience
  • Experience contributing to empirical AI research projects and/or technical AI safety research
  • Ability to turn abstract theories into creative, tractable research hypotheses and experiments
  • Comfort moving fast and iterating
  • Willingness to dive into new technical areas
  • Care about the possible impacts of AI development on humans and on the AI systems themselves

Nice to have

  • Authored research papers in machine learning, NLP, AI safety, interpretability, and/or LLM psychology and behavior
  • Familiarity with moral philosophy, cognitive science, neuroscience, or related fields
  • Effective science communication, with a track record of public communication
  • Strong project management skills

What the JD emphasized

  • technical research projects
  • technical AI safety research
  • welfare assessments
  • welfare harms
  • welfare-relevant features
  • welfare-relevant capabilities
  • welfare-relevant characteristics
  • welfare

Other signals

  • AI safety
  • model welfare
  • interpretability
  • alignment
  • evaluating AI systems