Research Engineer, Pretraining Scaling

Anthropic · AI Frontier · San Francisco, CA · AI Research & Engineering

A Research Engineer role focused on training production pretrained models at scale. The work involves performance optimization, debugging, experimental design, and incident response during model launches. The role bridges research and engineering and spans the full training stack.

What you'd actually do

  1. Own critical aspects of our production pretraining pipeline, including model operations, performance optimization, observability, and reliability
  2. Debug and resolve complex issues across the full stack—from hardware errors and networking to training dynamics and evaluation infrastructure
  3. Design and run experiments to improve training efficiency, reduce step time, increase uptime, and enhance model performance
  4. Respond to on-call incidents during model launches, diagnosing problems quickly and coordinating solutions across teams
  5. Build and maintain production logging, monitoring dashboards, and evaluation infrastructure

Skills

Required

  • Hands-on experience training large language models
  • Deep expertise with JAX, TPU, PyTorch, or large-scale distributed systems
  • Skill in debugging complex, ambiguous problems across multiple layers of the stack
  • Clear communication and effective collaboration
  • On-call duty for production systems

Nice to have

  • Contributed to open-source LLM frameworks (e.g., open_lm, llm-foundry, mesh-transformer-jax)
  • Published research on model training, scaling laws, or ML systems
  • Experience with production ML systems, observability tools, or evaluation infrastructure
  • Background as a systems engineer, quant, or in other roles requiring both technical depth and operational excellence

What the JD emphasized

  • production pretraining pipeline
  • on-call incidents during model launches
  • debugging complex, ambiguous problems across multiple layers of the stack
  • working long days during launches
  • respond to issues on evenings and weekends

Other signals

  • trains our production pretrained models
  • frontier models train reliably, efficiently, and at scale
  • performance optimization, hardware debugging, experimental design, and launch coordination
  • debug and resolve complex issues across the full stack
  • design and run experiments to improve training efficiency
  • respond to on-call incidents during model launches