Research Engineer, Discovery

Anthropic · AI Frontier · San Francisco, CA · AI Research & Engineering

Research Engineer focused on building and optimizing infrastructure for AI scientist training, evaluation, and inference. The role involves identifying and resolving infrastructure blockers, developing evaluation frameworks, managing data pipelines, and optimizing training and inference for reinforcement learning in distributed environments.

What you'd actually do

  1. Design and implement large-scale infrastructure systems to support AI scientist training, evaluation, and deployment across distributed environments
  2. Identify and resolve infrastructure bottlenecks impeding progress toward scientific capabilities
  3. Develop robust and reliable evaluation frameworks for measuring progress toward scientific AGI
  4. Build scalable and performant VM/sandboxing/container architectures to safely execute long-horizon AI tasks and scientific workflows
  5. Collaborate to translate experimental requirements into production-ready infrastructure

Skills

Required

  • infrastructure engineering
  • large-scale distributed systems
  • performance optimization
  • system architectures
  • high-throughput ML workloads
  • containerization technologies (Docker, Kubernetes)
  • orchestration at scale
  • large-scale data pipelines
  • distributed storage systems
  • diagnosing and resolving complex infrastructure challenges
  • full ML stack
  • scaling experimental ideas
  • cloud platforms (AWS, GCP)

Nice to have

  • language model training infrastructure
  • distributed ML frameworks (PyTorch, JAX)
  • language model inference optimization
  • VM and container orchestration
  • workflow orchestration tools
  • experiment management systems
  • large-scale reinforcement learning
  • Beam
  • Spark
  • Dask

What the JD emphasized

  • 6+ years of highly-relevant experience in infrastructure engineering
  • large-scale distributed systems
  • performance optimization techniques
  • high-throughput ML workloads
  • containerization technologies
  • orchestration at scale
  • large-scale data pipelines
  • distributed storage systems
  • diagnosing and resolving complex infrastructure challenges in production environments
  • full ML stack
  • scale experimental ideas
  • language model training infrastructure
  • distributed ML frameworks
  • language model inference optimization
  • cloud platforms
  • VM and container orchestration
  • workflow orchestration tools
  • experiment management systems
  • large-scale reinforcement learning

Other signals

  • AI scientist training
  • infrastructure blockers
  • scientific AGI
  • evaluation frameworks
  • large-scale data pipelines
  • reinforcement learning