Research Engineer, Evaluations

AssemblyAI · AI Frontier · New York, NY · Remote · Research

This role focuses on owning and building evaluation infrastructure for AssemblyAI's voice AI models. It bridges research, product, and engineering: defining metrics, developing benchmarking pipelines, and translating customer feedback into research priorities. It calls for a solid grasp of ML fundamentals, strong Python skills, good metric intuition, and familiarity with the voice agent stack.

What you'd actually do

  1. Own end-to-end and integration-level model evaluation across accuracy, latency, and feature-specific metrics (e.g., turn detection latency, endpointing accuracy)
  2. Build and maintain competitive benchmarking pipelines against other providers in the market
  3. Design and run systematic experiments to measure the impact of model changes
  4. Onboard, curate, and maintain evaluation datasets—both public benchmarks and internal test sets
  5. Create evaluation subsets that stress-test specific capabilities and edge cases
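For a flavor of the metric work in responsibility 1, here is a minimal, illustrative sketch of the standard ASR accuracy metric, word error rate (WER), computed via word-level Levenshtein distance. The function name and normalization (lowercasing, whitespace tokenization) are assumptions for the example, not AssemblyAI's actual pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six reference words -> WER of 1/6
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Real evaluation stacks typically layer text normalization (punctuation, numerals, casing) on top of this before scoring, which is itself a source of metric intuition the role asks for.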

Skills

Required

  • ML fundamentals
  • Strong Python skills
  • Metric intuition
  • Voice agent stack familiarity
  • Tinkerer mentality
  • Communication skills
  • Ownership mindset

Nice to have

  • SQL
  • Cloud infrastructure

What the JD emphasized

  • Owning the evaluation infrastructure
  • Translating customer pain points into quantifiable research targets
  • Measuring the right things
  • Benchmarking against the right competitors
  • Understanding how ASR fits into the broader voice agent stack
  • Being rigorous about measurement

Other signals

  • Operates at the frontier of the voice agent ecosystem
  • Acts as connective tissue between customer needs and what researchers build