Research Engineer, Model Evaluations

Anthropic · AI Frontier · AI Research & Engineering

Research Engineer role focused on designing and implementing Anthropic's model evaluation platform, a system that directly influences training decisions and the model development roadmap. The work involves leading the architecture of scalable evaluation pipelines, analyzing results, partnering with research teams, and contributing to publications. It sits at the intersection of research and engineering, with a strong emphasis on AI safety and model capabilities.

What you'd actually do

  1. Design novel evaluation methodologies to assess model capabilities across diverse domains including reasoning, safety, helpfulness, and harmlessness
  2. Lead the design and architecture of Anthropic's evaluation platform, ensuring it scales with our rapidly evolving model capabilities and research needs
  3. Implement and maintain high-throughput evaluation pipelines that run during production training, providing real-time insights to guide training decisions (a minimal pipeline sketch follows this list)
  4. Analyze evaluation results to identify patterns, failure modes, and opportunities for model improvement, translating complex findings into actionable insights
  5. Partner with research teams to develop domain-specific evaluations that probe for emerging capabilities and potential risks
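
As a rough illustration of item 3, here is a minimal Python sketch of a high-throughput evaluation loop: tasks are scored concurrently against a model checkpoint and rolled up into a suite-level summary that could be reported back during a training run. All names here (EvalTask, StubModel, run_suite) are hypothetical placeholders for whatever the real platform provides; this is a sketch of the general pattern, not Anthropic's implementation.

    from concurrent.futures import ThreadPoolExecutor
    from dataclasses import dataclass
    from statistics import mean
    from typing import Callable


    @dataclass
    class EvalTask:
        """One prompt plus a grading function that returns a score in [0, 1]."""
        prompt: str
        grade: Callable[[str], float]


    class StubModel:
        """Stands in for a model checkpoint served during training."""
        def generate(self, prompt: str) -> str:
            return prompt.upper()  # placeholder completion


    def run_suite(model, tasks, max_workers: int = 8) -> dict:
        """Score every task concurrently and aggregate into a suite summary."""
        def score(task: EvalTask) -> float:
            completion = model.generate(task.prompt)
            return task.grade(completion)

        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            scores = list(pool.map(score, tasks))
        return {"n": len(scores), "mean_score": mean(scores)}


    if __name__ == "__main__":
        tasks = [EvalTask(prompt=p, grade=lambda c: float(c.isupper()))
                 for p in ["hello", "world"]]
        print(run_suite(StubModel(), tasks))  # e.g. {'n': 2, 'mean_score': 1.0}

A production system would presumably add batching, retries, result persistence, and per-domain suites, but the core "generate, grade, aggregate, report" loop is the part that has to stay fast and reliable during production training.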

Skills

Required

  • Python
  • distributed computing frameworks
  • systems engineering
  • experimental design
  • statistical analysis
  • large-scale experimental data analysis
  • technical leadership
  • designing and implementing evaluation systems for machine learning models
  • translating research needs to engineering constraints

Nice to have

  • evaluation during model training in production environments
  • safety evaluation frameworks
  • red teaming methodologies
  • psychometrics
  • experimental psychology
  • reinforcement learning evaluation
  • multi-agent systems
  • prompt engineering
  • managing evaluation infrastructure at scale
  • machine learning evaluation research
  • benchmarking
  • multi-step reasoning evaluation
  • tool use evaluation
  • regression detection in model performance/safety (see the statistical sketch after this list)
  • human evaluation at scale
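
The regression-detection and statistical-analysis items above pair naturally. As a hedged sketch (the two-proportion z-test and the 0.05 threshold are assumptions chosen for illustration, not a method stated in the JD), comparing a candidate checkpoint's pass rate against a baseline on the same eval might look like this:

    from math import erf, sqrt


    def regression_p_value(pass_base: int, n_base: int, pass_cand: int, n_cand: int) -> float:
        """One-sided p-value that the candidate's pass rate is lower than the
        baseline's, via a two-proportion z-test."""
        p_base, p_cand = pass_base / n_base, pass_cand / n_cand
        pooled = (pass_base + pass_cand) / (n_base + n_cand)
        se = sqrt(pooled * (1 - pooled) * (1 / n_base + 1 / n_cand))
        if se == 0:
            return 1.0
        z = (p_base - p_cand) / se
        # P(Z >= z) under the null hypothesis of equal pass rates.
        return 0.5 * (1 - erf(z / sqrt(2)))


    if __name__ == "__main__":
        # Baseline passes 880/1000; the candidate checkpoint passes 850/1000.
        p = regression_p_value(pass_base=880, n_base=1000, pass_cand=850, n_cand=1000)
        if p < 0.05:
            print(f"possible regression (p = {p:.4f}), flag for review")
        else:
            print(f"no significant regression (p = {p:.4f})")

At the scale of thousands of experiments, corrections for multiple comparisons and minimum effect sizes matter as much as the raw test, which is where the large-scale experimental data analysis requirement comes in.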

What the JD emphasized

  • critical system that shapes how we understand, measure, and improve our models' capabilities and safety
  • directly influences our training decisions and model development roadmap
  • highest standards before deployment
  • technical leadership role
  • designing and implementing evaluation systems for machine learning models, particularly large language models
  • demonstrated technical leadership experience
  • skilled at both systems engineering and experimental design
  • strong programming skills in Python
  • translate between research needs and engineering constraints
  • results-oriented and thrive in fast-paced environments where priorities can shift based on research findings
  • AI safety and the societal impacts of the systems we build
  • statistical analysis and can draw meaningful conclusions from large-scale experimental data
  • evaluation during model training, particularly in production environments
  • safety evaluation frameworks and red teaming methodologies
  • reinforcement learning evaluation or multi-agent systems
  • managing evaluation infrastructure at scale (thousands of experiments)
  • machine learning evaluation, benchmarking, or related areas
  • multi-step reasoning or tool use (see the tool-use sketch after this list)
  • regression in model performance or safety properties
  • human evaluation at scale
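
For the multi-step reasoning and tool use emphasis, a grader might check the trace of tool calls rather than only the final answer. A minimal illustration follows; the (tool_name, arguments) log format and the grade_tool_use helper are assumptions made for this example, not part of the JD.

    def grade_tool_use(logged_calls, expected_calls) -> float:
        """Fraction of expected (tool_name, arguments) calls the model made, in order."""
        matched, idx = 0, 0
        for call in logged_calls:
            if idx < len(expected_calls) and call == expected_calls[idx]:
                matched += 1
                idx += 1
        return matched / len(expected_calls) if expected_calls else 1.0


    if __name__ == "__main__":
        expected = [("search", {"query": "population of France"}),
                    ("calculator", {"expression": "68e6 * 0.2"})]
        logged = [("search", {"query": "population of France"}),
                  ("calculator", {"expression": "68e6 * 0.2"})]
        print(grade_tool_use(logged, expected))  # 1.0 -> full credit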

Other signals

  • design and implementation of Anthropic's evaluation platform
  • develop and implement model evaluations
  • influences our training decisions and model development roadmap
  • lead the design and architecture of Anthropic's evaluation platform
  • high-throughput evaluation pipelines that run during production training
  • Analyze evaluation results to identify patterns, failure modes, and opportunities for model improvement
  • Partner with research teams to develop domain-specific evaluations
  • Build infrastructure to enable rapid iteration on evaluation design
  • Establish best practices and standards for evaluation development
  • Coordinate evaluation efforts during critical training runs
  • Contribute to research publications