User Researcher, AI Evaluations

Notion · Enterprise · San Francisco, CA · User Research and Product Operations

Notion is seeking an experienced UX Researcher to define and scale how it evaluates its AI-powered experiences, focusing on model output quality and the end-to-end product experience. The role involves running studies to uncover user mental models and translating insights into reusable rubrics, workflows, and measurement approaches for product, design, engineering, and data science. The researcher will also identify failure modes, study recovery behaviors, and operationalize evaluation with partners.

What you'd actually do

  1. Define what “good” looks like (frameworks & rubrics): Establish clear, reusable evaluation criteria that reflect real user expectations—helpfulness, trust, tone, control, and transparency. You’ll translate qualitative insight into scoring guidance that can be applied consistently across teams and over time.
  2. Run recurring evals (longitudinal & feature-specific): Conduct longitudinal and feature-specific surveys and studies on a regular cadence to measure experience quality over time against defined rubrics. Lead qualitative studies, side-by-side comparisons, and human-in-the-loop evaluation efforts to deepen understanding of where experiences break down and how they can improve. You’ll help teams spot regressions, benchmark improvements, and understand when expectations shift.
  3. Anchor evaluation in real workflows (context > isolated feedback): Ensure evals reflect jobs-to-be-done, user intent, and the full interaction journey (goal setting, delegation, review, iteration), not just decontextualized thumbs up/down. You’ll help teams understand _who_ is evaluating, _what_ they’re trying to do, and _why_ outputs succeed or fail.
  4. Identify failure modes & recovery behavior (guardrails): Uncover breakdowns, regressions, and edge cases across the system—from model behavior to UI and integrations—and study how people notice issues, correct them, and continue their work. You’ll turn these insights into actionable guidance for guardrails, fixes, and prioritization.
  5. Operationalize evaluation with partners (process & tooling): Collaborate closely with Product, Design, Engineering, and Data Science to align on target use cases and build scalable evaluation loops (human-in-the-loop review, longitudinal studies, and calibration of automated/LLM-judge approaches against human judgment).

Skills

Required

  • Ability to operationalize insight into measurement
  • AI fluency and systems thinking
  • Clear communication and impact orientation
  • Strong UX research craft (quant + qual)
  • Pragmatism in fast-moving environments
  • 5+ years doing UX research in industry

Nice to have

  • Familiarity with LLM-as-judge methods, prompt design for evaluators, or “golden dataset” creation
  • Experience using AI research tooling for rapid synthesis and communication (e.g., Dovetail, Listen Labs, Maze, Outset), as well as AI observability tooling like Braintrust
  • Experience using data querying languages (e.g., SQL), scripting languages (e.g., Python), or statistical/mathematical software (e.g., R, SAS, MATLAB)
  • Master’s or PhD in HCI, Psychology, Behavioral Science, Anthropology, Sociology, or a related field
  • Familiarity with the work of computing heroes like Douglas Engelbart, Alan Kay, and Bret Victor, and an understanding of why we’re big fans

What the JD emphasized

  • define and scale how we evaluate
  • evaluation criteria
  • measure experience quality
  • evaluation loops

Other signals

  • evaluating AI-powered experiences
  • translate qualitative insight into scoring guidance
  • measure experience quality over time
  • operationalize evaluation with partners