Model Behavior Engineer

Notion · Enterprise · New York, NY · Engineering

This role focuses on owning the quality bar for Notion AI products by designing and implementing evaluation systems, analyzing production data, and driving quality improvements. It involves context engineering, debugging, building measurement systems, evaluating and launching new models from leading AI labs, and collaborating with product and engineering teams to prioritize quality.

What you'd actually do

  1. Context engineering — Design, test, and iterate on system prompts, tool prompts, and context strategies that shape how Notion's AI products behave. Understand the nuances of how models respond to different context structures and use that knowledge to drive quality improvements directly.
  2. Understand & debug — Live in production data: transcripts, logs, user feedback. Reproduce issues, identify root causes, and translate symptoms into actionable problem statements. Find signal in noisy data.
  3. Build evals & measurement — Design eval strategies, build datasets, run evaluations. Track quality over time. Identify issues before users do. Own the loop: define quality goals, create evals, test, and improve.
  4. Evaluate and launch new models with leading research labs — Work with models from OpenAI, Anthropic, Google, and others, from evaluation through launch. Benchmark across dimensions: quality, latency, cost, edge cases. Help shape Notion's model strategy based on real data.
  5. Drive quality priorities — Work embedded with eng and product teams to surface the most important issues. Own the quality narrative: severity, frequency, what to fix and why. Be the voice of quality in the room.
  6. Build tooling & systems — Help manage AI observability and eval platforms (e.g., Braintrust). Build the playbooks and tools that enable all teams at Notion to build AI products.

Skills

Required

  • Driver mentality
  • Curiosity
  • Analytical instinct
  • Comfortable working with data
  • Clear communication
  • Experience with LLMs, prompting, or AI products

Nice to have

  • Background in engineering, product, data science, research, or consulting
  • You've built something on your own to solve a problem — side project, startup, tool, whatever

What the JD emphasized

  • own the quality bar
  • understanding and shaping how our AI products behave
  • shape Notion's model strategy
  • problem-seeking generalists interested in 0 → 1
  • build a new function
  • real ownership from day one
  • help write the playbook
  • Context engineering
  • Understand & debug
  • Build evals & measurement
  • Evaluate and launch new models
  • Drive quality priorities
  • Build tooling & systems
  • Driver mentality
  • Analytical instinct
  • Experience with LLMs

Other signals

  • evaluating and launching new models
  • shaping Notion's model strategy
  • building systems to define what 'good' looks like
  • driving changes to deliver reliable and high-quality AI experiences