Engineering Manager, Evals

Cursor · Coding AI · San Francisco, CA · Engineering

Engineering Manager for the Evals team at Cursor, responsible for creating high-signal evaluation datasets for coding agents, building tools for engineers to write and run evals, and owning online evaluation systems that track agent quality in production. The role involves setting the eval roadmap, leading a team of engineers and researchers, guiding the development of evaluation benchmarks like CursorBench, defining online quality signals, and integrating evals into decision-making processes for launches, deploys, and model training.

What you'd actually do

  1. Set the eval roadmap end-to-end: what we measure, why it matters, and how signals turn into shipping and training decisions.
  2. Lead and grow a high-impact team of engineers and researchers building eval datasets and developer-friendly tools to write and run evals.
  3. Guide the next generation of [CursorBench](https://cursor.com/blog/cursorbench) so it continues to reflect real developer workflows at Cursor, and expand it with new evals that measure other properties developers value.
  4. Define crisp online quality signals and turn regressions into robust guardrails.
  5. Integrate evals into decision-making cadence for launches, deploys, and model training loops.

Skills

Required

  • People leadership and coaching skills
  • Experience leading engineering teams shipping production systems
  • Ability to align research, product, data, and infrastructure on metrics and processes
  • Strong data acumen
  • Experience building and operating evaluation or measurement systems

Nice to have

  • Good taste and strong opinions on model and agent behaviors
  • Up-to-date on emerging research and industry trends
  • Effective collaboration with data scientists and researchers

What the JD emphasized

  • high-signal evaluation datasets
  • coding agents
  • online evaluation systems
  • agent quality in production
  • CursorBench
  • eval roadmap
  • eval datasets
  • run evals
  • new evals
  • online quality signals
  • guardrails
  • evals into decision-making cadence
  • launches
  • deploys
  • model training loops
  • evaluation or measurement systems
  • AI evals
  • experimentation platforms
  • ranking/relevance
  • search quality
  • reliability instrumentation

Other signals

  • evaluation systems
  • coding agents
  • quality measurable