Product Manager, Agent Harness

at Cursor · Coding AI · San Francisco, CA · Product Management

Product Manager for Cursor's Agent Harness, responsible for the framework that determines how AI agents decompose tasks, interact with the file system and terminal, handle failures, and remain observable and steerable by developers. The role involves turning research advances into product, analyzing agent traces, designing evaluation frameworks, and defining agent extensibility primitives.




Our mission is to automate coding. The first step in our journey is to build the best tool for professional programmers, using a combination of inventive research, design, and engineering. Our organization is very flat, and our team is small and talent dense. We particularly like people who are truth-seeking, passionate, and creative. We enjoy spirited debate, crazy ideas, and shipping code.

About the Role

The Agent Harness is what makes Cursor's agents actually work. It determines how agents decompose tasks into subtasks, how they interact with the file system and terminal, how they handle failures and retries, and how developers observe and steer what's happening. When an agent gets stuck, loops, or hallucinates, the harness is why—and the harness is how you fix it.
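To make the harness mechanics above concrete, here is a minimal, purely illustrative sketch of a plan/execute/retry loop with tracing. Every name in it (`Step`, `Trace`, `run_task`, the `plan` and `execute` callbacks) is hypothetical and does not reflect Cursor's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    description: str
    attempts: int = 0

@dataclass
class Trace:
    """Records what the agent did, so developers can observe and debug it."""
    events: list = field(default_factory=list)

    def log(self, kind, detail):
        self.events.append((kind, detail))

def run_task(task, plan, execute, max_retries=2):
    """Minimal harness loop: decompose the task, execute each step with
    retries, and record a trace of everything that happened."""
    trace = Trace()
    steps = [Step(d) for d in plan(task)]           # task decomposition
    trace.log("plan", [s.description for s in steps])
    for step in steps:
        while True:
            step.attempts += 1
            ok, result = execute(step.description)  # tool call: fs, terminal, etc.
            trace.log("step", (step.description, ok, result))
            if ok:
                break
            if step.attempts > max_retries:         # failure handling: give up
                trace.log("fail", step.description)
                return False, trace
    return True, trace
```

A real harness adds the dimensions the role description emphasizes: mid-task steering, guardrails on what steps may attempt, and loop/stall detection over the trace.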

As a Product Manager for the Agent Harness, you will own this framework. Agent quality is improving rapidly—we shipped Composer 2, our own frontier coding model, and are training agents through real-time RL on user data. Your job is to turn those research advances into product that developers can feel.

This is not a role where you write specs and hand them off. You'll be reading agent traces, analyzing failure modes, designing evaluation frameworks, and making judgment calls about what an agent should and shouldn't attempt. You'll work at the boundary between research and product, where the roadmap is shaped by empirical results as much as customer feedback.

Example projects include...

  • Owning the agent planning and execution framework: how agents decompose tasks, decide what tools to use, and recover when a step fails. Balancing autonomy with predictability.
  • Designing how developers observe and steer agents: real-time progress, guardrails, the ability to redirect mid-task. The experience should build trust without requiring micromanagement.
  • Building evaluation and benchmarking systems: defining what "good" means for agent quality—task completion rate, error recovery, hallucination frequency—and building the harnesses to measure it. These measurements drive engineering and research priorities.
  • Analyzing agent traces at scale: identifying where agents get stuck, loop, hallucinate, or take unproductive paths, and turning those patterns into concrete improvements.
  • Defining the primitives for agent extensibility: how agents use tools, access codebase context, call external services via MCPs and plugins on the Cursor Marketplace, and how developers customize agent behavior through rules and constraints.
  • Improving the default Cursor agent experience (the “Auto” model setting): making smart model choices based on user needs, model capabilities, and cost appetite.
  • Shaping multi-agent coordination: how subagents share context and avoid conflicts when executing in parallel across files and systems. This matters more as developers spin up fleets of agents simultaneously.

You may be a fit if

  • You have built or evaluated AI agents, LLM applications, or ML-powered developer tools.
  • You're deeply technical. You're comfortable reading code, analyzing traces, and reasoning about system behavior at a low level.
  • You have strong intuition for evaluation and measurement. You know how to define metrics that capture quality, not just activity.
  • You can move between the big picture and the details—from "what should agents be capable of in six months?" to "why did this agent fail on this specific task?"
  • You're comfortable in a research-adjacent environment where the roadmap is shaped by empirical results, not just customer requests.
  • You have experience with reinforcement learning, agent frameworks, or AI evaluation—either as a practitioner or working closely with researchers.
  • You thrive in ambiguous, fast-moving environments and enjoy making hard tradeoffs with incomplete information.
