Research Engineer, Pretraining Scaling - London

Anthropic · AI Frontier · London, United Kingdom · AI Research & Engineering

Research Engineer role focused on pretraining and scaling large language models. The work spans performance optimization, debugging, experimental design, and keeping production training pipelines reliable. The role is highly operational: it requires on-call incident response during model launches, and it involves building and maintaining training infrastructure and adding capabilities to the training codebase.

What you'd actually do

  1. Own critical aspects of our production pretraining pipeline, including model operations, performance optimization, observability, and reliability
  2. Debug and resolve complex issues across the full stack—from hardware errors and networking to training dynamics and evaluation infrastructure
  3. Design and run experiments to improve training efficiency, reduce step time, increase uptime, and enhance model performance
  4. Respond to on-call incidents during model launches, diagnosing problems quickly and coordinating solutions across teams
  5. Build and maintain production logging, monitoring dashboards, and evaluation infrastructure

Skills

Required

  • Hands-on experience training large language models
  • Extensive experience with JAX, TPU, PyTorch, or large-scale distributed systems
  • Ability to debug complex, ambiguous problems across multiple layers of the stack
  • Willingness to be on-call for production systems
  • Willingness to work long days during launches
  • Ability to solve hard problems under pressure
  • Clear communication and effective collaboration

Nice to have

  • Contributed to open-source LLM frameworks (e.g., open_lm, llm-foundry, mesh-transformer-jax)
  • Published research on model training, scaling laws, or ML systems
  • Experience with production ML systems, observability tools, or evaluation infrastructure
  • Background as a systems engineer, quant, or in other roles requiring both technical depth and operational excellence

What the JD emphasized

  • production pretraining pipeline
  • frontier models
  • performance optimization
  • hardware debugging
  • experimental design
  • launch coordination
  • production issues
  • on-call incidents
  • diagnosing problems quickly
  • coordinating solutions
  • production logging
  • monitoring dashboards
  • evaluation infrastructure
  • training codebase
  • long context support
  • novel architectures
  • on-call for production systems
  • working long days during launches
  • solving hard problems under pressure
  • debugging complex, ambiguous problems
  • multiple layers of the stack
  • training LLMs
  • JAX/TPU, PyTorch
  • ML frameworks at scale
  • published research on model training, scaling laws, or ML systems
  • production ML systems
  • observability tools
  • highly operational
  • responsive to incidents
  • flexible about priorities
  • comfortable with uncertainty
  • extraordinary learning opportunities
  • largest, most sophisticated training runs in the industry

Other signals

  • trains our production pretrained models
  • frontier models train reliably, efficiently, and at scale
  • performance optimization, hardware debugging, experimental design, and launch coordination
  • debug and resolve complex issues across the full stack
  • design and run experiments to improve training efficiency
  • build and maintain production logging, monitoring dashboards, and evaluation infrastructure
  • add new capabilities to the training codebase
  • hands-on experience training large language models
  • large-scale distributed systems
  • on-call for production systems
  • solving hard problems under pressure
  • debugging complex, ambiguous problems across multiple layers of the stack
  • training LLMs or working extensively with JAX/TPU, PyTorch, or other ML frameworks at scale
  • published research on model training, scaling laws, or ML systems
  • experience with production ML systems, observability tools, or evaluation infrastructure
  • highly operational
  • responsive to incidents
  • flexible about priorities
  • comfortable with uncertainty
  • extraordinary learning opportunities
  • largest, most sophisticated training runs in the industry