Research Engineer, Tokens ML Infra

Anthropic · AI Frontier · AI Research & Engineering

Research Engineer focused on ML training infrastructure for large language models: framework work in JAX/PyTorch, distributed systems, performance optimization, and MLOps tooling that supports novel training architectures and experimentation.

What you'd actually do

  1. Design and implement high-performance ML training infrastructure for large language model research
  2. Develop and maintain core ML framework primitives in JAX, PyTorch, etc.
  3. Create robust automated evaluation and benchmarking systems for model performance
  4. Implement comprehensive monitoring and debugging tools for ML workflows
  5. Design and optimize data loading pipelines that maximize training throughput
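Not part of the posting itself, but as an illustrative sketch of the data-pipeline work named in item 5: a minimal background-thread prefetcher in plain Python (stdlib only, all names hypothetical) that overlaps batch preparation with the training step by keeping a small buffer of ready batches.

```python
import queue
import threading

def prefetch(batch_iter, buffer_size=2):
    """Run `batch_iter` on a background thread, keeping up to
    `buffer_size` prepared batches buffered so the consumer (e.g. a
    training step) rarely waits on data preparation."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def producer():
        for batch in batch_iter:
            q.put(batch)  # blocks when the buffer is full
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is sentinel:
            return
        yield batch

# Hypothetical stand-in for a real batch generator.
def fake_batches(n):
    for i in range(n):
        yield [i] * 4

# Usage: for batch in prefetch(fake_batches(100)): train_step(batch)
```

Real training stacks layer sharding, host-to-device transfer, and fault tolerance on top of this basic producer/consumer pattern; the bounded queue is what caps memory while hiding preparation latency.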

Skills

Required

  • Strong software engineering skills and experience building distributed systems
  • Expertise in Python and experience with distributed computing frameworks
  • Deep understanding of cloud computing platforms and distributed systems architecture
  • Experience with high-throughput, fault-tolerant system design
  • Strong background in performance optimization and system scaling
  • Excellent problem-solving skills and attention to detail
  • Strong communication skills and ability to work in a collaborative environment

Nice to have

  • Advanced degree (MS or PhD) in Computer Science or related field
  • Experience with language model training infrastructure
  • Strong background in distributed systems and parallel computing
  • Expertise in tokenization algorithms and techniques
  • Experience building high-throughput, fault-tolerant systems
  • Deep knowledge of monitoring and observability practices
  • Experience with infrastructure-as-code and configuration management
  • Background in MLOps or ML infrastructure
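As an aside on the "tokenization algorithms and techniques" item: the posting names no specific method, but byte-pair encoding (BPE) is the standard family for LLM tokenizers. A minimal training sketch (illustrative only, not Anthropic's implementation) repeatedly merges the most frequent adjacent pair of symbols:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent symbol pairs; return the most frequent, or None."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged, out, i = pair[0] + pair[1], [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def bpe_train(text, num_merges):
    """Learn up to `num_merges` BPE merges from a character sequence."""
    tokens, merges = list(text), []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        merges.append(pair)
        tokens = merge_pair(tokens, pair)
    return tokens, merges
```

For example, two merge steps over `"abababab"` learn `("a", "b")` then `("ab", "ab")`, compressing eight characters into two `"abab"` tokens. Production tokenizers add byte-level fallback, pre-tokenization rules, and vocabulary serialization on top of this core loop.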

What the JD emphasized

  • high-performance ML training infrastructure
  • large language model research
  • ML framework primitives
  • automated evaluation and benchmarking systems
  • ML workflows
  • data loading pipelines
  • MLOps tooling
  • novel training architectures
  • hyperparameter sweeps
  • architecture search
  • distributed systems
  • high-throughput, fault-tolerant system design
  • performance optimization and system scaling
  • language model training infrastructure
  • distributed systems and parallel computing
  • tokenization algorithms and techniques
  • monitoring and observability practices
