Software Engineer, Agent Evaluation and Quality

Cursor · Coding AI · San Francisco, CA · Engineering

Software Engineer on the Agent Quality team at Cursor, responsible for building the measurement, evaluation, and feedback-loop infrastructure used to improve the Cursor core agent. The role involves designing and building AI evaluation systems, feedback loops from real usage, and analysis tooling for agent behavior, as well as improving reliability and guardrails by making quality measurable.

What you'd actually do

  1. Designing and building a best-in-class AI evaluation system: curated datasets, offline replay, scorers / judges, regression alerts, and dashboards.
  2. Designing feedback loops from real usage: collecting, cleaning, and interpreting user signals to inform model and harness changes.
  3. Developing analysis tooling and workflows for debugging agent behavior: deep dives on failure modes, clustering themes, and surfacing actionable insights.
  4. Improving reliability and guardrails by making quality measurable and operational: defining “good/bad/degraded” sessions, alerting, and triage primitives.

Skills

Required

  • AI evaluation systems
  • measurement systems
  • data pipelines
  • analysis tooling
  • debugging agent behavior
  • software engineering fundamentals
  • shipping production systems

Nice to have

  • AI evals
  • experimentation
  • ranking/relevance
  • search quality
  • data acumen
  • collaboration with data scientists and researchers
  • taste and strong opinions on model and agent behaviors
  • staying up-to-date on emerging research and industry trends

What the JD emphasized

  • build the measurement, evaluation, and feedback-loop infrastructure
  • instrument what matters
  • define how we judge quality
  • analyze agent behavior at scale
  • turn insights into improvements
  • AI evaluation system
  • feedback loops
  • analysis tooling and workflows for debugging agent behavior
  • improving reliability and guardrails by making quality measurable and operational
  • built and operated evaluation or measurement systems
  • turn ambiguous “quality” questions into concrete metrics, pipelines, and decisions

Other signals

  • AI evaluation system
  • feedback loops
  • agent behavior analysis
  • quality measurement