Senior Reliability Engineer - AV Labs

Uber · Consumer · Sunnyvale, CA · Engineering

This Senior Reliability Engineer role focuses on sensor and hardware system reliability for autonomous vehicles, with ownership of observability, alerting, and automation for data collection systems. It involves architecting observability platforms, building for edge constraints, defining criticality models, detecting failure modes, scaling through automation, and driving technical strategy for fleet health.

What you'd actually do

  1. Architect Observability Systems: Design and scale an observability platform capable of ingesting and analyzing real-time health telemetry from thousands of distributed vehicle nodes.
  2. Build for Edge Constraints: Develop systems that remain performant despite hardware diversity, intermittent connectivity, and rapid fleet scaling.
  3. Define Criticality Models: Establish alerting strategies that distinguish transient anomalies from systemic issues impacting sensor uptime and data yield.
  4. Detect Complex Failure Modes: Design detection logic for "silent" failures, such as sensor degradation, compute saturation, or recording pipeline stalls.
  5. Scale Through Automation: Design automated detection, triage, and mitigation mechanisms to eliminate manual intervention as the fleet grows.

Skills

Required

  • 5+ years of relevant industry experience in software engineering, site reliability, or systems engineering
  • Distributed systems experience
  • Experience with modern observability platforms (e.g., Prometheus, Grafana, ELK)
  • Experience in edge, IoT, or hardware-integrated environments
  • Coding skills in Go, Python, or C++
  • Experience building and operating production systems
  • Proficiency in Linux internals
  • Shell scripting
  • Debugging across services, containers (Docker), and networking stacks
  • Experience owning reliability, infrastructure, or platform systems for large-scale production workloads
  • Experience designing and operating observability systems (metrics, logging, alerting, and dashboards)
  • Experience defining and implementing SLIs and SLOs
  • Deep understanding of networking protocols (TCP/IP, gRPC, or MQTT)
  • Experience with data handling in bandwidth-constrained environments
  • Experience driving complex technical projects and architectural reviews

Nice to have

  • Knowledge of sensor data protocols (e.g., Camera, LiDAR, Radar)
  • Experience with "Grey Failure" detection and management
  • Proven track record in "Fleet Health" for large-scale hardware deployments

What the JD emphasized

  • sensor reliability
  • data yield
  • supply hours
  • sensor uptime
  • data recording capability
  • hardware diversity
  • intermittent connectivity
  • fleet scaling
  • systemic issues
  • silent failures
  • sensor degradation
  • compute saturation
  • recording pipeline stalls
  • manual intervention
  • hardware and software failure scenarios
  • operational efficiency
  • diagnose and deploy mitigations rapidly
  • latent patterns in fleet telemetry
  • proactive detection
  • systemic regressions
  • hardware degradation
  • edge, IoT, or hardware-integrated environments
  • hardware-to-cloud data ingestion pipelines
  • Grey Failure
  • Fleet Health
  • using automation to replace manual intervention