Full-stack Software Engineer, Reinforcement Learning

Anthropic · AI Frontier · San Francisco, CA · AI Research & Engineering

Anthropic is hiring a Full-Stack Software Engineer to build platforms, tools, and interfaces for RL environment creation, data collection, and training observability. The role involves owning product surfaces end-to-end, iterating on data collection strategies, and partnering with researchers to ship reliable products.

What you'd actually do

  1. Build and extend web platforms for RL environment creation, management, and quality review — including environment configuration, versioning, and validation workflows
  2. Develop vendor-facing interfaces and tooling that let external partners create, submit, and iterate on training environments with minimal friction
  3. Design and implement platforms for human data collection at scale, including labeling workflows, quality assurance systems, and feedback mechanisms that surface reward signal integrity issues early
  4. Build evaluation dashboards and observability UIs that give researchers real-time insight into environment quality, training run health, and reward hacking
  5. Create backend services and APIs that connect environment authoring tools, data collection systems, and RL training infrastructure
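To make item 1 concrete, here is a minimal sketch of what an environment configuration with versioning and validation might look like. All names (`EnvConfig`, `validate`, the specific fields) are hypothetical illustrations, not details from the posting:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class EnvConfig:
    """A hypothetical versioned RL environment configuration."""
    name: str
    version: int
    max_steps: int
    reward_spec: dict = field(default_factory=dict)

def validate(cfg: EnvConfig) -> list[str]:
    """Return a list of validation errors; an empty list means the config passes."""
    errors = []
    if not cfg.name:
        errors.append("name must be non-empty")
    if cfg.version < 1:
        errors.append("version must be >= 1")
    if cfg.max_steps <= 0:
        errors.append("max_steps must be positive")
    if "goal" not in cfg.reward_spec:
        errors.append("reward_spec must define a goal")
    return errors

cfg = EnvConfig(name="grid-nav", version=1, max_steps=500,
                reward_spec={"goal": "reach_target"})
print(validate(cfg))  # []
```

A validation workflow like this would typically run both in the authoring UI (fast feedback for environment creators) and again server-side before a config version is accepted into training.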

Skills

Required

  • Strong Python
  • A modern web stack (React, TypeScript, or similar)
  • A track record of shipping systems that solved a hard problem
  • High agency
  • A focus on UX
  • Clear communication

Nice to have

  • Experience building data collection, labeling, or annotation platforms
  • Experience with multi-tenant platforms with role-based access, audit trails, and vendor management workflows
  • Familiarity with cloud infrastructure (GCP or AWS), Docker, and CI/CD pipelines
  • Exposure to LLM training, fine-tuning, or evaluation workflows
  • Experience with async Python (Trio, asyncio) or high-throughput API design
  • Experience building dashboards, monitoring, or observability tooling
  • Experience working directly with external vendors or partners on technical integrations
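The async Python bullet maps to a common pattern in high-throughput API work: bounding in-flight requests with a semaphore while fanning out with `asyncio.gather`. A minimal sketch, with illustrative names (`fetch_label`, `collect`) that are not from the posting:

```python
import asyncio

async def fetch_label(item_id: int, sem: asyncio.Semaphore) -> dict:
    """Simulate one labeling API call; the semaphore caps concurrent requests."""
    async with sem:
        await asyncio.sleep(0.01)  # stand-in for network latency
        return {"item_id": item_id, "label": "ok"}

async def collect(n: int, max_concurrency: int = 8) -> list[dict]:
    """Fan out n requests, never more than max_concurrency in flight at once."""
    sem = asyncio.Semaphore(max_concurrency)
    tasks = [fetch_label(i, sem) for i in range(n)]
    return await asyncio.gather(*tasks)

results = asyncio.run(collect(20))
print(len(results))  # 20
```

The same bounded-concurrency shape works with Trio via `trio.CapacityLimiter` and a nursery; the choice between the two is mostly about structured-concurrency guarantees rather than throughput.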

What the JD emphasized

  • Ship a polished, reliable product, fast
  • The quality of Claude's next generation depends on the quality of the data we train it on
  • Iterate on data collection strategies to distill the knowledge of thousands of human experts around the world into our models
  • The RL org is organized around four goals: solving the science of long-horizon tasks and continual learning, scaling RL data and environments to be comprehensive and diverse, automating software engineering end-to-end, and training the frontier production model

Other signals

  • Building platforms for data collection and training observability
  • Shipping polished, reliable products
  • Iterating on data collection strategies
  • Building environments, evaluation systems, data pipelines, and tooling