Evals Engineer, Applied AI

Scale AI · San Francisco, CA · Enterprise Engineering

Scale AI is looking for an AI Research Engineer to join its Enterprise Evaluations team, focusing on building and improving GenAI Evaluation Suites for enterprise LLM-powered workflows and agents. The role involves creating human-rated datasets, designing LLM-as-a-Judge autorater frameworks, and researching new methodologies for evaluating AI systems.

What you'd actually do

  1. Partner with Scale’s Operations team and enterprise customers to translate ambiguous customer requirements into structured evaluation data, guiding the creation and maintenance of gold-standard human-rated datasets and expert rubrics that anchor AI evaluation systems.
  2. Analyze rater feedback and collected data to identify patterns, refine evaluation frameworks, and establish iterative improvement loops that enhance the quality and relevance of human-curated assessments (a minimal agreement-check sketch follows this list).
  3. Design, research, and develop LLM-as-a-Judge autorater frameworks and AI-assisted evaluation systems. This includes creating models that critique, grade, and explain agent outputs (e.g., RLAIF, model-judging-model setups), along with scalable evaluation pipelines and diagnostic tools; a minimal autorater sketch also follows this list.
  4. Pursue research initiatives that explore new methodologies for automatically analyzing, evaluating, and improving the behavior of enterprise agents, pushing the boundaries of how AI systems are assessed and optimized in real-world contexts.
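
For items 1 and 2 above, here is a minimal sketch of what a gold-standard rating record and an inter-rater agreement check might look like. The HumanRating schema and pairwise_agreement helper are hypothetical illustrations rather than Scale’s actual tooling, and the sketch assumes scikit-learn is available for Cohen’s kappa.

```python
from dataclasses import dataclass

from sklearn.metrics import cohen_kappa_score  # chance-corrected agreement

@dataclass
class HumanRating:
    """One gold-standard rating (hypothetical schema, for illustration)."""
    example_id: str
    rater_id: str
    score: int       # e.g. 1-5 against the expert rubric
    rationale: str   # free-text justification from the rater

def pairwise_agreement(ratings_a: list[HumanRating],
                       ratings_b: list[HumanRating]) -> float:
    """Cohen's kappa between two raters over their shared examples.

    Low kappa usually signals rubric ambiguity, a cue to tighten the
    annotator guidelines before the dataset anchors any autorater.
    """
    scores_a = {r.example_id: r.score for r in ratings_a}
    scores_b = {r.example_id: r.score for r in ratings_b}
    shared = sorted(scores_a.keys() & scores_b.keys())
    return cohen_kappa_score([scores_a[i] for i in shared],
                             [scores_b[i] for i in shared])
```

One plausible improvement loop: double-rate a small batch of examples each cycle, and treat a kappa well below roughly 0.6 as a trigger to revise the rubric rather than ship the batch.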
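
For item 3, here is a minimal LLM-as-a-Judge sketch. call_judge_model and JUDGE_SYSTEM_PROMPT are hypothetical placeholders for whatever model client and grading instructions a real autorater would use, and the JSON verdict format is an assumption made for illustration.

```python
import json

def call_judge_model(system_prompt: str, user_prompt: str) -> str:
    """Hypothetical stand-in for the judge model's chat-completion client."""
    raise NotImplementedError("wire this to an actual model API")

JUDGE_SYSTEM_PROMPT = (
    "You are a strict evaluator. Grade the candidate answer against the "
    'rubric. Respond only with JSON: {"score": <1-5>, "critique": "<why>"}'
)

def judge(task: str, candidate_answer: str, rubric: str) -> dict:
    """Score one agent output with an LLM judge; return score plus critique."""
    user_prompt = (
        f"Task:\n{task}\n\n"
        f"Rubric:\n{rubric}\n\n"
        f"Candidate answer:\n{candidate_answer}"
    )
    raw = call_judge_model(JUDGE_SYSTEM_PROMPT, user_prompt)
    verdict = json.loads(raw)  # assumes the judge keeps to the JSON contract
    if not 1 <= verdict["score"] <= 5:
        raise ValueError(f"judge returned an out-of-range score: {verdict}")
    return verdict
```

In a model-judging-model setup, the same judge can be run over outputs from two candidate systems and the score distributions compared; sampling the judge several times per item and averaging reduces grading variance.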

Skills

Required

  • Python
  • PyTorch
  • TensorFlow
  • Large Language Models (LLMs)
  • Generative AI
  • Frontier model evaluation methodologies
  • Statistical analysis
  • Assessing model quality

Nice to have

  • Master’s or Ph.D.
  • Published research in leading ML or AI conferences
  • LLM-as-a-Judge frameworks
  • Automated evaluation systems
  • Human annotator guidelines
  • ML research engineering
  • Stochastic systems
  • Observability
  • LLM-powered applications for model evaluation and analysis
  • Scalable pipelines
  • Distributed computing frameworks
  • Modern cloud infrastructure

What the JD emphasized

  • Stays current with the latest literature in AI evaluation
  • Passionate about integrating novel research ideas into the team’s workflows
  • Familiarity with the current research landscape

Other signals

  • GenAI Evaluation Suite
  • LLM-powered workflows and agents
  • LLM-as-a-Judge autorater frameworks
  • AI-assisted evaluation systems
  • enterprise agents