What you'd actually do

Architect Observability Systems: Design and scale an observability platform capable of ingesting and analyzing real-time health telemetry from thousands of distributed vehicle nodes.

Build for Edge Constraints: Develop systems that remain performant despite hardware diversity, intermittent connectivity, and rapid fleet scaling.

Define Criticality Models: Establish alerting strategies that distinguish transient anomalies from systemic issues impacting sensor uptime and data yield.

Detect Complex Failure Modes: Design detection logic for "silent" failures, such as sensor degradation, compute saturation, or recording pipeline stalls.

Scale Through Automation: Design automated detection, triage, and mitigation mechanisms to eliminate manual intervention as the fleet grows.

Skills

Required

5+ years of relevant industry experience in software engineering, site reliability, or systems engineering
Distributed Systems experience
modern observability platforms (e.g., Prometheus, Grafana, ELK)
edge, IoT, or hardware-integrated environments
Go, Python, or C++ coding skills
building and operating production systems
Linux internals proficiency
shell scripting
debugging across services, containers (Docker), and networking stacks
owning reliability, infrastructure, or platform systems for large-scale production workloads
designing and operating observability systems (metrics, logging, alerting, and dashboards)
defining and implementing SLIs and SLOs
Deep understanding of networking protocols (TCP/IP, gRPC, or MQTT)
data handling in bandwidth-constrained environments
driving complex technical projects and architectural reviews

Nice to have

Knowledge of sensor data protocols (e.g., Camera, LiDAR, Radar)
Experience with "Grey Failure" detection and management
Proven track record in 'Fleet Health' for large-scale hardware deployments

What the JD emphasized

sensor reliability

data yield

supply hours

sensor uptime

data recording capability

hardware diversity

intermittent connectivity

fleet scaling

systemic issues

silent failures

sensor degradation

compute saturation

recording pipeline stalls

manual intervention

hardware and software failure scenarios

operational efficiency

diagnose and deploy mitigations rapidly

latent patterns in fleet telemetry

proactive detection

systemic regressions

hardware degradation

edge, IoT, or hardware-integrated environments

hardware-to-cloud data ingestion pipelines

Grey Failure

Fleet Health

automation was used to replace manual intervention

About the Role

We are looking for a hardware-focused Senior Reliability Engineer responsible for sensor and hardware system reliability, owning the observability, alerting, and automation that ensure Uber’s in-vehicle sensor data collection systems operate reliably at scale.

This role is centered on maximizing sensor uptime, data yield, and supply hours across a large, geographically distributed fleet. You will design systems that determine when to react to issues impacting data recording capability, whether caused by failing sensors, degraded onboard computers, software regressions, or systemic environmental factors.

As the technical owner for sensor reliability and observability, you will build the infrastructure that converts low-level signals into actionable intelligence and automated responses. This is a senior role requiring strong software engineering fundamentals, deep systems thinking, and the ability to drive cross-team technical direction without direct authority.

What You Will Do

Architect Observability Systems: Design and scale an observability platform capable of ingesting and analyzing real-time health telemetry from thousands of distributed vehicle nodes.
Build for Edge Constraints: Develop systems that remain performant despite hardware diversity, intermittent connectivity, and rapid fleet scaling.
Define Criticality Models: Establish alerting strategies that distinguish transient anomalies from systemic issues impacting sensor uptime and data yield.
Detect Complex Failure Modes: Design detection logic for "silent" failures, such as sensor degradation, compute saturation, or recording pipeline stalls.
Scale Through Automation: Design automated detection, triage, and mitigation mechanisms to eliminate manual intervention as the fleet grows.
Partner on Mitigation: Collaborate with Operations and Engineering to build safe, automated responses to recurring hardware and software failure scenarios.
Drive Operational Efficiency: Build technical interfaces to help Operations surface issues and Engineering diagnose and deploy mitigations rapidly (TTD/TTM).
Lead Technical Strategy: Drive reliability-focused design reviews and translate operational pain points into concrete technical requirements and roadmaps.
Uncover Proactive Insights: Apply advanced data analytics to identify latent patterns in fleet telemetry, enabling the proactive detection of systemic regressions and hardware degradation before they impact operations.

Basic Qualifications

5+ years of relevant industry experience in software engineering, site reliability, or systems engineering.
Distributed Systems: Experience with modern observability platforms (e.g., Prometheus, Grafana, ELK) in edge, IoT, or hardware-integrated environments.
Language Proficiency: Coding skills in one or more of Go, Python, or C++, with experience building and operating production systems.
Systems Expertise: Proficiency in Linux internals and shell scripting for triaging and debugging edge devices or hardware-adjacent systems.
Engineering Fundamentals: Ability to debug across services, containers (Docker), and networking stacks.
Reliability Experience: Proven track record owning reliability, infrastructure, or platform systems for large-scale production workloads.
Observability Tooling: Experience designing and operating observability systems (metrics, logging, alerting, and dashboards).
Metrics-Driven: Experience defining and implementing SLIs and SLOs for system availability or data yield.
Networking Knowledge: Deep understanding of networking protocols (TCP/IP, gRPC, or MQTT) and data handling in bandwidth-constrained environments.
Leadership: Experience driving complex technical projects and architectural reviews across multiple teams from design through production.

Preferred Qualifications

Experience with modern observability platforms (e.g., Prometheus, Grafana, ELK) in edge, IoT, or hardware-integrated environments.
Knowledge of sensor data protocols (e.g., Camera, LiDAR, Radar) or hardware-to-cloud data ingestion pipelines.
Experience with "Grey Failure" detection and management in complex, distributed systems.
Proven track record in 'Fleet Health' for large-scale hardware deployments (e.g., cloud infrastructure, global server fleets, or industrial IoT) where automation was used to replace manual intervention.

For Sunnyvale, CA-based roles: The base salary range for this role is USD$180,000 per year - USD$200,000 per year.

You will be eligible to participate in Uber's bonus program, and may be offered an equity award & other types of comp. All full-time employees are eligible to participate in a 401(k) plan. You will also be eligible for various benefits. More details can be found at the following link https://jobs.uber.com/en/benefits.

Uber's mission is to reimagine the way the world moves for the better. Here, bold ideas create real-world impact, challenges drive growth, and speed fuels progress. What moves us, moves the world - let's move it forward, together.

Uber is proud to be an Equal Opportunity employer. All qualified applicants will receive consideration for employment without regard to sex, gender identity, sexual orientation, race, color, religion, national origin, disability, protected Veteran status, age, or any other characteristic protected by law. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you have a disability or special need that requires accommodation, please let us know by completing this form.

Offices continue to be central to collaboration and Uber's cultural identity. Unless formally approved to work fully remotely, Uber expects employees to spend at least half of their work time in their assigned office. For certain roles, such as those based at green-light hubs, employees are expected to be in-office for 100% of their time. Please speak with your recruiter to better understand in-office expectations for this role.
