Senior Machine Learning Engineer, Agentic AI

Robinhood · Fintech · Bellevue, WA +1 · ENG Data and AI Platform Division

Robinhood is seeking a Senior Machine Learning Engineer for its Agentic AI team. The role focuses on building and evaluating agentic AI systems for customer experiences: defining the technical direction for agent evaluation, designing scalable evaluation frameworks, driving model selection, and partnering with cross-functional teams to ensure the quality and reliability of production systems. It also involves mentoring engineers and influencing technical direction.

What you'd actually do

  1. Lead the design and evolution of agentic AI systems that power intelligent customer experiences across Robinhood.
  2. Define the technical direction for evaluating autonomous agents, including reasoning quality, planning, tool selection, memory, task completion, safety, latency, and overall user experience.
  3. Design and build scalable evaluation frameworks for agentic systems using automated evals, benchmark datasets, LLM-as-a-Judge techniques, and human feedback to continuously improve agent performance.
  4. Drive model selection and optimization across frontier foundation models, fine-tuned models, retrieval systems, and tool-using agents, balancing quality, latency, cost, and reliability.
  5. Partner closely with Product, Data Science, and Engineering to establish launch criteria, quality standards, and measurable success metrics for production agentic systems.

Skills

Required

  • Significant experience building and deploying production AI systems powered by large language models, autonomous agents, or multi-step reasoning workflows.
  • Deep understanding of modern agent architectures, including tool calling, planning, memory, retrieval-augmented generation (RAG), orchestration, and multi-agent systems.
  • Experience designing evaluation frameworks for agentic AI, including automated evals, benchmark datasets, LLM-as-a-Judge methodologies, human evaluation pipelines, and continuous quality measurement.
  • Strong understanding of the tradeoffs between prompting, fine-tuning, retrieval, and agent orchestration, and when to apply each approach.
  • Experience evaluating frontier foundation models across quality, latency, safety, cost, robustness, and production readiness.
  • Proven ability to debug complex agent behaviors, identify failure modes, and improve reasoning, reliability, and overall system performance.
  • Strong software engineering skills with experience building scalable distributed systems and production ML infrastructure.
  • Demonstrated technical leadership through architecture design, mentorship, and influencing engineering direction across multiple teams.

Nice to have

  • Experience with agent frameworks, AI observability platforms, model evaluation tooling, or regulated AI applications.

What the JD emphasized

  • Building and deploying production AI systems powered by large language models, autonomous agents, or multi-step reasoning workflows.
  • Deep understanding of modern agent architectures, including tool calling, planning, memory, retrieval-augmented generation (RAG), orchestration, and multi-agent systems.
  • Experience designing evaluation frameworks for agentic AI, including automated evals, benchmark datasets, LLM-as-a-Judge methodologies, human evaluation pipelines, and continuous quality measurement.
  • Proven ability to debug complex agent behaviors, identify failure modes, and improve reasoning, reliability, and overall system performance.
  • Demonstrated technical leadership through architecture design, mentorship, and influencing engineering direction across multiple teams.

Other signals

  • building agentic AI systems
  • design evaluation frameworks
  • guide model selection
  • partner with product, data science, and engineering teams
  • ensure systems meet clear standards for correctness, safety, latency, and user satisfaction