Senior Software Engineer, AI Evals

Sentry · Enterprise · San Francisco, CA · Engineering

This Senior Software Engineer role focuses on building the evaluation infrastructure for Sentry's AI/ML systems, including debugging agents and AI-powered features. It involves designing datasets, benchmarks, and test harnesses to measure accuracy, reliability, and performance, so that AI systems behave correctly and safely as they scale. The work spans creating evaluation frameworks, curating datasets, building metrics pipelines, and owning the evaluation lifecycle.

What you'd actually do

  1. Design and build robust evaluation frameworks to measure accuracy, reliability, regressions, and edge cases in AI systems
  2. Create and curate high-quality datasets, golden test cases, and benchmarks grounded in real production data
  3. Build automated test harnesses and metrics pipelines to continuously evaluate models, prompts, and agentic workflows
  4. Partner closely with applied AI engineers and product leaders to define what “good” looks like and translate it into measurable criteria
  5. Own the evaluation lifecycle for major AI initiatives, from early experimentation through production monitoring

Skills

Required

  • 5+ years of professional experience
  • Bachelor’s degree in computer science, machine learning, or a related field
  • Experience building testing, evaluation, or data infrastructure for complex systems
  • Ability to write production-quality code in Python and TypeScript
  • Experience working with structured and unstructured datasets, labeling workflows, or data quality pipelines
  • Familiarity with modern ML systems and evaluation techniques (e.g., offline metrics, online evaluation, regression testing for models or prompts)

Nice to have

  • AI/ML experience strongly preferred
  • Experience evaluating LLMs, agentic systems, or AI-assisted developer tools

What the JD emphasized

  • AI/ML team
  • AI systems
  • AI-powered features
  • AI behavior
  • AI with confidence
  • AI team
  • AI initiatives
  • AI/ML experience
  • modern ML systems
  • evaluating LLMs
  • agentic systems
  • AI-assisted developer tools

Other signals

  • building evaluation infrastructure
  • measures accuracy, reliability, and real-world performance of AI systems
  • design datasets, benchmarks, and test harnesses
  • turn ambiguous AI behavior into measurable signals