About the Role

The AI Observability Program Leader will own the end-to-end strategy, design, and implementation of the frameworks used to monitor, understand, and improve Uber’s GenAI-powered agentic systems. This role sits within the Global Digital Experience team, the operational arm of Uber’s customer support tech organization, and is a critical driver of accuracy, safety, and reliability across Uber’s next-generation AI solutions.

This leader will bridge the gap between raw AI logs and actionable product insights. You will define the methodologies for agentic reasoning observability, develop automated evaluation (autoeval) systems, and design simulators to stress-test AI performance before it reaches the customer. You will partner closely with Product, Engineering, and Data Science to translate complex agent behaviors into micrometrics: the granular signals that help us pinpoint exactly where a reasoning chain succeeded or failed.

The ideal candidate brings a systems-thinking mindset, technical literacy in LLM orchestration, and the ability to influence technical roadmaps through rigorous data and observability frameworks.

What You'll Do

Architect Observability Frameworks: Own the strategy for understanding AI agentic reasoning, enabling deep analysis of step-by-step agent decision-making.
Drive Autoeval Strategy: Design and roll out automated evaluation systems (LLM-as-a-judge) to provide a scalable, high-confidence "pulse" on AI performance across conversational and voice interfaces.
Define Micrometrics: Develop granular signals within agentic activity (identifying latent failures, reasoning loops, or tool-calling inefficiencies) to drive product improvements.
Lead Pre-Launch Simulation: Partner with Product & Engineering to build and maintain simulation environments that test AI agents against edge cases before deployment, and democratize these tools for Operations teams.
Cross-Functional Technical Partnership: Act as the primary liaison between Product, Engineering, and Data Science to ensure observability tooling is integrated into the development lifecycle and directly informs release "Go/No-Go" decisions.
Insight Synthesis: Package complex technical observability data into clear, actionable narratives for leadership, highlighting specific failure patterns and opportunities for CX improvement.
Operational Excellence: Establish the standards and tooling for how AI performance is reported globally, ensuring consistency across different regions and support modalities.

Basic Qualifications

5+ years of experience in Technical Program Management, Product Operations, AI Quality, or Observability.
Bachelor’s degree in Engineering, Computer Science, Data Science, or a related technical field.

Preferred Qualifications

AI Literacy: Deep understanding of GenAI systems, including LLM orchestration, agentic workflows, and the nuances of reasoning chains (e.g., Chain of Thought).
Systems Thinking: Proven experience designing technical frameworks or evaluation pipelines (e.g., autoevals, RAG evaluation, or model benchmarking).
Analytical Rigor: Ability to define and track complex technical metrics (micrometrics) and correlate them with high-level business KPIs.
Influence without Authority: Demonstrated ability to drive complex initiatives in an IC capacity by building strong partnerships with Engineering and Product teams.
Advanced AI Expertise: Experience with "LLM-as-a-judge" frameworks, prompt engineering for evaluations, and fine-tuning feedback loops.
Simulation & Testing: Background in building simulators, "digital twins," or robust A/B testing frameworks for conversational AI or autonomous agents.
Tooling Proficiency: Familiarity with AI observability tools.
Problem Solving: Exceptional ability to turn "noisy" AI logs into structured failure pattern analysis.
Communication: Strong ability to translate highly technical agent behaviors into business-relevant insights for non-technical stakeholders.
Domain Knowledge: Experience in Customer Support technology, Voice UX, or high-volume automated workflows.

For Sunnyvale, CA-based roles: The base salary range for this role is USD$162,000 per year - USD$180,000 per year.

You will be eligible to participate in Uber's bonus program, and may be offered an equity award and other types of compensation. All full-time employees are eligible to participate in a 401(k) plan. You will also be eligible for various benefits. More details can be found at the following link: https://jobs.uber.com/en/benefits.

Uber's mission is to reimagine the way the world moves for the better. Here, bold ideas create real-world impact, challenges drive growth, and speed fuels progress. What moves us, moves the world - let's move it forward, together.

Uber is proud to be an Equal Opportunity employer. All qualified applicants will receive consideration for employment without regard to sex, gender identity, sexual orientation, race, color, religion, national origin, disability, protected Veteran status, age, or any other characteristic protected by law. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you have a disability or special need that requires accommodation, please let us know by completing this form.

Offices continue to be central to collaboration and Uber's cultural identity. Unless formally approved to work fully remotely, Uber expects employees to spend at least half of their work time in their assigned office. For certain roles, such as those based at green-light hubs, employees are expected to be in-office for 100% of their time. Please speak with your recruiter to better understand in-office expectations for this role.