Machine Learning Systems Engineer, Model Evaluations

Anthropic · AI Frontier · AI Research & Engineering

This role focuses on building and maintaining the infrastructure for Model Evaluations and Research Inference, enabling researchers to systematically test and assess model capabilities. It involves designing scalable systems, optimizing APIs, creating data pipelines, and implementing monitoring for research-focused inference systems. The goal is to accelerate the model development lifecycle and support Anthropic's mission of creating safe and beneficial AI.

What you'd actually do

  1. Design, build, and maintain Model Evaluations infrastructure that enables researchers to systematically test and assess model capabilities
  2. Develop and optimize APIs and infrastructure for Research Inference to accelerate the model development lifecycle
  3. Create scalable data pipelines for collecting, processing, and analyzing research outputs
  4. Implement monitoring, logging, and performance optimization for research-focused inference systems
  5. Build intuitive interfaces and tools that allow researchers to configure, run, and analyze complex evaluation workflows

Skills

Required

  • 5+ years of significant software engineering experience
  • Python proficiency
  • Cloud infrastructure experience (AWS, GCP)
  • Experience with data infrastructure and processing large datasets
  • Excellent communication skills
  • Ability to collaborate effectively with research teams
  • Ability to work independently and take ownership of projects
  • Ability to anticipate the needs of research users and design systems that meet them
  • Bachelor's degree in a related field or equivalent experience

Nice to have

  • High-performance, large-scale ML systems
  • GPUs
  • Kubernetes
  • PyTorch
  • ML acceleration hardware
  • Building evaluation frameworks for machine learning models
  • Working in or adjacent to ML research teams
  • Distributed systems design and optimization
  • Real-time inference systems for large language models
  • Performance profiling and optimization
  • Infrastructure as Code and CI/CD pipelines

What the JD emphasized

  • Model Evaluations infrastructure
  • Research Inference
  • evaluate models
  • inference tasks
  • research mission
  • researchers
  • model development lifecycle
  • research outputs
  • research-focused inference systems
  • evaluation workflows
  • research teams
  • research needs
  • research users
  • ML research teams
  • large language models

Other signals

  • Model Evaluations infrastructure
  • Research Inference APIs and infrastructure
  • Scalable systems for researchers to evaluate models
  • Accelerate the model development lifecycle
  • Scalable data pipelines for research outputs
  • Monitoring, logging, and performance optimization for research inference
  • Intuitive interfaces and tools for evaluation workflows
  • High-performance, large-scale ML systems
  • Real-time inference systems for large language models