Program Lead: Product Operations - AI Observability

Uber · Consumer · Sunnyvale, CA · Community Operations

This role focuses on establishing and implementing frameworks for monitoring, understanding, and improving Uber's GenAI-powered agentic systems. It involves defining methodologies for agentic reasoning observability, developing automated evaluation systems, and designing simulators to test AI performance. The goal is to translate complex agent behaviors into actionable insights and metrics to ensure accuracy, safety, and reliability.

What you'd actually do

  1. Own the strategy for understanding AI agentic reasoning, enabling deep analysis of step-by-step agent decision-making.
  2. Design and roll out automated evaluation systems (LLM-as-a-judge) to provide a scalable, high-confidence "pulse" on AI performance across conversational and voice interfaces.
  3. Develop granular signals within agentic activity (identifying latent failures, reasoning loops, or tool-calling inefficiencies) to drive product improvements.
  4. Partner with Product & Engineering to build and maintain simulation environments that test AI agents against edge cases before deployment, and make these tools accessible to Operations teams.
  5. Act as the primary liaison between Product, Engineering, and Data Science to ensure observability tooling is integrated into the development lifecycle and directly informs release "Go/No-Go" decisions.

Skills

Required

  • 5+ years of experience in Technical Program Management, Product Operations, AI Quality, or Observability.
  • Bachelor’s degree in Engineering, Computer Science, Data Science, or a related technical field.

Nice to have

  • Deep understanding of GenAI systems, including LLM orchestration, agentic workflows, and the nuances of reasoning chains (e.g., Chain of Thought).
  • Proven experience designing technical frameworks or evaluation pipelines (e.g., autoevals, RAG evaluation, or model benchmarking).
  • Ability to define and track complex technical metrics (micrometrics) and correlate them with high-level business KPIs.
  • Demonstrated ability to drive complex initiatives in an IC capacity by building strong partnerships with Engineering and Product teams.
  • Experience with "LLM-as-a-judge" frameworks, prompt engineering for evaluations, and fine-tuning feedback loops.
  • Background in building simulators, "digital twins," or robust A/B testing frameworks for conversational AI or autonomous agents.
  • Familiarity with AI observability tools.
  • Exceptional ability to turn "noisy" AI logs into structured failure pattern analysis.
  • Strong ability to translate highly technical agent behaviors into business-relevant insights for non-technical stakeholders.
  • Experience in Customer Support technology, Voice UX, or high-volume automated workflows.

What the JD emphasized

  • agentic reasoning observability
  • automated evaluation (autoeval) systems
  • micrometrics
  • LLM orchestration
  • agentic workflows
  • reasoning chains
  • autoevals
  • model benchmarking
  • LLM-as-a-judge
  • prompt engineering for evaluations
  • fine-tuning feedback loops
  • conversational AI
  • autonomous agents

Other signals

  • AI Observability
  • GenAI agentic systems
  • automated evaluation
  • micrometrics