Machine Learning Systems Engineer - Infrastructure & Runtime, Horizons

Anthropic · AI Frontier · AI Research & Engineering

This Machine Learning Systems Engineer role focuses on building and maintaining foundational infrastructure for AI research, specifically for reinforcement learning, agentic AI, and model evaluation. It involves designing data pipelines, creating secure execution environments, optimizing distributed computing infrastructure, and translating research requirements into scalable systems.

What you'd actually do

  1. Design and implement high-performance data pipelines for processing large-scale code datasets with an emphasis on reliability and reproducibility
  2. Build and maintain secure sandboxed execution environments using virtualization technologies like gVisor and Firecracker
  3. Develop infrastructure for reinforcement learning training environments, balancing security requirements with performance needs
  4. Optimize resource utilization across our distributed computing infrastructure through profiling, benchmarking, and systems-level improvements
  5. Collaborate with researchers to translate their requirements into scalable, production-grade systems for AI experimentation

Skills

Required

  • Python
  • async/concurrent programming
  • Trio
  • container technologies
  • virtualization systems
  • systems programming
  • performance optimization
  • data pipeline development
  • ETL processes
  • code quality
  • testing
  • performance
  • effective communication

Nice to have

  • cloud infrastructure
  • Kubernetes orchestration
  • infrastructure-as-code tools (Terraform, Pulumi, etc.)
  • Rust
  • C++
  • security controls for code execution
  • ML research concepts

What the JD emphasized

  • reinforcement learning research and development
  • scalable RL infrastructure
  • secure model evaluation systems
  • large-scale code datasets
  • secure sandboxed execution environments
  • reinforcement learning training environments
  • distributed computing infrastructure
  • production-grade systems for AI experimentation

Other signals

  • Scalable RL infrastructure and training methodologies
  • Foundational systems for LLM training
  • Agentic AI capabilities