Software Engineer, Agent Dev Velocity

Notion · Enterprise · San Francisco, CA · Engineering

This role focuses on building and improving the infrastructure for AI evaluations at Notion. The goal is to make AI evaluations easy to create, cheap to run, and impactful, enabling engineers to iterate with confidence and ship high-quality AI faster and more safely. The role involves building scalable eval runners, improving tooling for engineers to add evals, maintaining benchmark and dataset tooling, and enhancing the reliability and observability of eval execution.

What you'd actually do

  1. Build and improve scalable eval runners and harnesses that work locally, in CI, and on scheduled runs.
  2. Make it easy for engineers to add high-signal evals: better templates, fixtures, debugging tools, and clear workflows.
  3. Build and maintain benchmark and dataset tooling (curation pipelines, versioning, artifact management, and regression tracking).
  4. Improve reliability and observability for eval execution (retries, idempotency, cost and latency visibility, and failure triage).
  5. Partner closely with AI product, AI platform, and infrastructure teams to integrate evals into day-to-day shipping workflows.

Skills

Required

  • Strong software engineering fundamentals and experience shipping production systems.
  • Proficiency with TypeScript/Node and/or Python.
  • Experience building reliable systems in distributed environments (queues, retries, idempotency, and backfills).
  • Comfort working with data pipelines (batch processing, data quality, versioning, and reproducibility).
  • Practical experience designing measurement or evaluation systems.

Nice to have

  • Experience building developer tooling (CLI tools, CI integrations, or internal platforms).
  • Familiarity with LLM evaluation techniques (rubrics, human review loops, dataset curation, and regression detection).
  • Experience collaborating across teams to roll out new workflows and drive adoption.

What the JD emphasized

  • evals at scale
  • durable benchmarks and datasets
  • reusable eval workspaces
  • data-driven workflows
  • continuous measurement
  • scalable eval runners
  • high-signal evals
  • benchmark and dataset tooling
  • regression tracking
  • reliability and observability
  • cost and latency visibility
  • failure triage

Other signals

  • building systems for running and maintaining evals at scale
  • creating durable benchmarks and datasets
  • enabling reusable eval workspaces and data-driven workflows
  • improving reliability and observability for eval execution