Engineering Manager, Agent Prompts & Evals

Anthropic · AI Frontier · San Francisco, CA · Engineering & Design - Product

An Engineering Manager role leading the Agent Prompts & Evals team, which is responsible for the infrastructure that enables shipping model and prompt changes with confidence: eval frameworks, system prompt pipelines, and regression-detection systems. The team acts as a platform for model behavior, sitting between product engineering and research, and partners with other evals groups and product teams. The role involves leading and growing the team, owning the product-side eval platform and system prompt infrastructure, managing model launches, fostering collaboration, recruiting engineers, and shaping team investment in areas like frontier eval development and launch automation.

What you'd actually do

  1. Lead and grow a team of prompt engineers and platform software engineers
  2. Own the product-side eval platform: the frameworks, dashboards, bulk runners, and CI integrations that product teams use to measure Claude’s behavior and catch regressions before they ship
  3. Own system prompt infrastructure: versioning, deployment, rollback, and review tooling for the prompts that run in production across claude.ai, the API, and agentic surfaces
  4. Be a steady hand through model launches — these are the team’s highest-stakes operational moments and the EM is the backstop when things get chaotic
  5. Build durable collaboration with other evals groups across the company; this means real work on ownership boundaries, shared roadmaps, and avoiding tragedy-of-the-commons on shared eval infrastructure

Skills

Required

  • Software engineering management
  • Platform engineering
  • Infrastructure management
  • Developer tooling
  • System design
  • Pipeline architecture
  • Collaboration
  • Recruiting
  • Team leadership

Nice to have

  • LLM evals
  • ML experimentation platforms
  • Model quality work
  • A/B testing infrastructure
  • Feature flagging
  • Gradual rollout systems
  • Devtools
  • CI/CD platforms
  • Testing infrastructure at scale
  • AI safety and alignment

What the JD emphasized

  • 8+ years in software engineering with 3+ years managing engineering teams
  • platform, infra, or developer-tooling team
  • building “pits of success”
  • mixed charter: platform ownership, service-to-other-teams, and a launch-driven operational rhythm
  • technical depth to engage on system design, review pipeline architecture
  • build and maintain peer relationships with partner orgs
  • recruiting and closing senior ICs

Other signals

  • eval frameworks
  • system prompt pipelines
  • regression-detection systems
  • model behavior
  • platform team