Product Manager, Claude Code Model Performance

Anthropic · AI Frontier · San Francisco, CA · Product Management, Support, & Operations

Product Manager for Anthropic's Claude Code Model Performance team, responsible for driving model launches, building agentic evals, and translating research improvements into developer-facing outcomes. Requires experience building agentic evals, a systems-thinking approach, and comfort working across both research and engineering.

What you'd actually do

  1. Own model launch planning and execution for Claude Code: define readiness criteria, coordinate across research and product engineering, and ensure launches land cleanly with developers
  2. Design and implement agentic evals that measure real-world coding performance
  3. Drive the engineering team's eval roadmap
  4. Partner with researchers working on coding capabilities to define target behaviors and influence model development with evidence from real usage
  5. Talk with users and analyze transcripts to understand capability gaps, then turn research progress into shipped improvements

Skills

Required

  • Product Management
  • AI Concepts
  • Model Behavior
  • Prompt Engineering
  • Evaluation Methodology
  • Systems Thinking
  • Agentic Evals
  • Coding Agent Usage

Nice to have

  • Engineering Background
  • Hacker Spirit

What the JD emphasized

  • personally built agentic evals
  • agentic evals
  • eval roadmap
  • model behavior
  • evaluation methodology
  • systems thinker
  • launched products or capabilities in ambiguous, research-adjacent environments

Other signals

  • drive model launches end-to-end
  • build evals that measure what matters
  • partner directly with researchers and product engineers
  • translate model improvements into developer-facing outcomes
  • synthesize signal from internal users, external developers, and competitive benchmarks into clear priorities