Staff Infrastructure Engineer, Pre-training

Anthropic · AI Frontier · AI Research & Engineering

Staff Infrastructure Engineer focused on the data processing infrastructure for large language model pre-training. This role involves designing, implementing, and optimizing scalable systems for data quality, validation, and distributed computing at web scale, in close collaboration with research teams.

What you'd actually do

  1. Design and implement high-performance data processing infrastructure for large language model training
  2. Develop and maintain core processing primitives (e.g., tokenization, deduplication, chunking) with a focus on scalability
  3. Build robust systems for data quality assurance and validation at scale
  4. Implement comprehensive monitoring systems for data processing infrastructure
  5. Create and optimize distributed computing systems for processing web-scale datasets

Skills

Required

  • Strong software engineering skills with experience in building distributed systems
  • Expertise in Python and Rust
  • Hands-on experience with distributed computing frameworks, particularly Apache Spark
  • Deep understanding of cloud computing platforms and distributed systems architecture
  • Experience with high-throughput, fault-tolerant system design
  • Strong background in performance optimization and system scaling
  • Excellent problem-solving skills and attention to detail
  • Strong communication skills and ability to work in a collaborative environment
  • Advanced degree in Computer Science or related field
  • Experience with language model training infrastructure
  • Strong background in distributed systems and parallel computing
  • Expertise in tokenization algorithms and techniques
  • Experience building high-throughput, fault-tolerant systems
  • Deep knowledge of monitoring and observability practices
  • Experience with infrastructure-as-code and configuration management
  • Background in MLOps or ML infrastructure

Nice to have

  • Significant experience building and maintaining large-scale distributed systems
  • Passionate about system reliability and performance
  • Enjoy solving complex technical challenges at scale
  • Comfortable working with ambiguous requirements and evolving specifications
  • Take ownership of problems and drive solutions independently
  • Excited about contributing to the development of safe and ethical AI systems
  • Can balance technical excellence with practical delivery
  • Eager to learn about machine learning research and its infrastructure requirements

What the JD emphasized

  • 7+ years of experience, excluding internships
  • Expertise in Python and Rust
  • Hands-on experience with distributed computing frameworks, particularly Apache Spark
  • Deep understanding of cloud computing platforms and distributed systems architecture
  • Experience with high-throughput, fault-tolerant system design
  • Experience with language model training infrastructure
  • Strong background in distributed systems and parallel computing
  • Expertise in tokenization algorithms and techniques
  • Experience building high-throughput, fault-tolerant systems
  • Deep knowledge of monitoring and observability practices

Other signals

  • Develop the next generation of large language models
  • Design and implement high-performance data processing infrastructure for large language model training
  • Build robust systems for data quality assurance and validation at scale
  • Create and optimize distributed computing systems for processing web-scale datasets
  • Collaborate with research teams to implement novel data processing architectures