Staff Applied Scientist - Observability at Uber

About the Role

We are looking for an experienced Applied Scientist with a passion for building software solutions where customer experiences take centre stage and products are built with service quality at heart.

We are building a real-time data platform to enable customer experience observability and analytics at scale: key ingredients to ensure we deliver best-in-class experiences for our users. The platform helps detect and respond to degradations in customer experience, supports safe code deployments and fast feature rollouts through real-time monitoring, and powers deeper analytics that inform product improvements, enabling both reactive and proactive service quality processes.

This is an outstanding opportunity for an applied scientist with a collaborative spirit to the core, who will work with the engineering team to drive an ambitious observability platform. It’s a high-impact role where you will collaborate on challenges across domains and functions, spanning time-series anomaly detection, statistical monitoring and guardrails, and the data foundations needed to make customer experience measurable and actionable.

If you have the technical chops, we invite you to join us to solve tough large-scale data challenges and raise the bar of service quality.

What You Will Do

Incident Detection & Mitigation

Design and improve state-of-the-art anomaly detection and alerting for multivariate time series metrics.
Build methods to reduce incident impact, such as shortening incident time-to-detection and time-to-resolution while reducing alert fatigue (deduplication, correlation, grouping, etc.).
Contribute to intelligent incident response workflows: auto-triage to the right team, suspected root-cause hints, auto-mitigation actions, and agentic mitigation flows (supporting on-call engineers in debugging and mitigating).

Rollout Safety & Speed (Experimentation & Monitoring)

Develop statistical monitoring approaches for code deployment safety and feature rollout safety (e.g., near-real-time sequential A/B testing, before/after system degradation detection).
Support safe and fast product releases by adjusting code deployment soak times or feature rollout speed based on statistical significance in guardrail metrics.

Analytics Enablement

Partner with Engineering to build data infrastructure that produces “analytics-ready” datasets: consistent definitions, clean data, scalable feature/metric computation.
Define best practices in instrumentation and metric definitions to facilitate incident detection, including SOPs and templates for common patterns to be applied across different user flows and user traffic patterns.
Contribute to monitoring coverage and to assisted observability and monitoring.

Scientific & Operational Excellence

Define success metrics for incident detection systems (precision, recall, time to detect, coverage, etc.) and create evaluation harnesses using historical incidents and annotated alerts.
Communicate results clearly to technical and non-technical stakeholders; drive alignment on tradeoffs, OKRs and roadmap.

Basic Qualifications

M.S. or Ph.D. in Computer Science, Machine Learning, Statistics, Operations Research, Economics, or another quantitative field.
6+ years of proven experience as an Applied Scientist, Machine Learning Scientist/Engineer, Research Scientist, or equivalent.
Strong expertise in causal inference / experimentation, including designing, executing, and analyzing A/B tests; experience with related methodologies (e.g., quasi-experimental designs, uplift/heterogeneous treatment effects) is highly valued.
Strong expertise in anomaly detection and time-series analysis, with hands-on experience building production-grade, scalable detection and alerting pipelines for large-scale, real-time systems (including time-series feature engineering, modeling, monitoring, and drift/seasonality handling).
Experience in production coding and deployment of ML, statistical, causal, and/or optimization models in real-time or near-real-time systems (end-to-end: data, modeling, evaluation, deployment, monitoring, and iteration).
Ability to use Python (or similar languages) to work efficiently at scale with large datasets in production environments; strong software engineering fundamentals (testing, reliability, performance).
Proficiency in SQL and distributed data processing (e.g. PySpark, Flink SQL).
Excellent communication skills in cross-functional settings, with demonstrated ability to translate business/system problems into technical solutions and influence stakeholders.
Thought leadership and ownership to drive multi-functional initiatives from conceptualization through productionization, including setting technical direction and raising the quality bar.

Preferred Qualifications

Experience with real-time or near-real-time pipelines and large-scale data systems (e.g., Spark, streaming, Kafka-like systems, OLAP stores).
Experience in observability, user analytics, experimentation platforms, or reliability monitoring.
Familiarity with event correlation and change attribution (e.g., linking regressions to code/config/feature flag changes).
Experience building tools that improve workflow quality (onboarding, annotation, diagnosis dashboards).

Uber's mission is to reimagine the way the world moves for the better. Here, bold ideas create real-world impact, challenges drive growth, and speed fuels progress. What moves us, moves the world - let’s move it forward, together.

Offices continue to be central to collaboration and Uber's cultural identity. Unless formally approved to work fully remotely, Uber expects employees to spend at least half of their work time in their assigned office. For certain roles, such as those based at green-light hubs, employees are expected to be in-office for 100% of their time. Please speak with your recruiter to better understand in-office expectations for this role.

*Accommodations may be available based on religious and/or medical conditions, or as required by applicable law. To request an accommodation, please reach out to accommodations@uber.com.