Research Engineer, Agents

Decagon · Vertical AI · New York, NY · Engineering

Research Engineer role focused on building and evolving agent harnesses, runtime systems, and control-plane logic for a conversational AI platform. The role involves designing distributed systems for agent orchestration, optimizing for latency and reliability, and iterating based on real-world failures and experimentation. It operates in a fast-moving, ambiguous space with tight feedback loops, collaborating with Research, Infra, and Product teams.

What you'd actually do

  1. Design and evolve agent harnesses that power different product experiences
  2. Build core runtime systems, including AOP execution and multi-model orchestration
  3. Develop control-plane logic for routing, planning, and tool invocation with strong safety guarantees
  4. Optimize agent systems for latency, reliability, and production correctness
  5. Analyze real-world failures and use data to drive iterative improvements

Skills

Required

  • Strong experience building distributed systems or backend platforms in production environments
  • Comfort working in ambiguous, fast-moving environments with rapid iteration cycles
  • Experience owning systems end-to-end, from design through production and iteration
  • Familiarity with experimentation, evaluation, or data-driven product improvement loops
  • A track record of improving system reliability, performance, and observability
  • Ability to debug complex systems and identify root causes of failures

Nice to have

  • You’ve built or worked on agent harnesses, orchestration layers, or execution frameworks
  • You think in terms of control planes, feedback loops, and system-level optimization, not just features
  • You’re excited about diagnosing failure modes and iterating toward measurable improvements
  • You care deeply about production quality—not just making systems work, but making them reliable, safe, and scalable
  • You’re motivated by pushing the frontier of how intelligent systems behave in the real world

What the JD emphasized

  • strong safety guarantees
  • real-world failures
  • offline evaluation
  • online experimentation
  • production correctness
  • system-level optimization
  • production quality
  • reliable, safe, and scalable
  • push the frontier

Other signals

  • agent orchestration
  • multi-model orchestration
  • tool use
  • safety guarantees
  • latency requirements
  • real-world failures
  • offline evaluation
  • online experimentation
  • system design
  • agent performance