Senior Site Reliability Engineer, Vehicle SW

Wayve · Robotics · London, United Kingdom · Vehicle SW Engineering

Wayve is seeking a Senior Site Reliability Engineer for its Vehicle Software team. The role focuses on ensuring the reliability, availability, and performance of the autonomous driving fleet's software systems. Responsibilities include building and operating monitoring, logging, and alerting tools; driving incident response; designing automation for fleet operations; and partnering with other teams to define reliability metrics. The ideal candidate has proven SRE experience with complex distributed systems, strong Linux fundamentals, knowledge of CI/CD, containers, and orchestration, proficiency in Python, C++, or Rust, and experience with observability stacks. Experience with real-time or safety-critical systems, fleet operations, and cloud platforms is desirable.

What you'd actually do

  1. Own and improve the reliability, availability, and performance of vehicle software systems used across the dev fleet.
  2. Take part in a team on-call rotation, providing out-of-hours support for live systems when required.
  3. Build and operate monitoring, logging, alerting, and on-call tooling that enables fast detection, diagnosis, and recovery.
  4. Drive incident response and post-incident learning, translating root causes into durable fixes and preventative controls.
  5. Design and deliver automation for fleet operations, deployments, and repetitive workflows to reduce manual intervention.

Skills

Required

  • Experience in an SRE, production reliability, or platform operations role for complex distributed systems
  • Linux fundamentals
  • CI/CD
  • containers (Docker)
  • orchestration (Kubernetes)
  • Python, C++, or Rust
  • automation
  • troubleshooting across networking, distributed systems, and databases
  • designing observability stacks
  • Datadog, Prometheus, Grafana, OpenTelemetry, Splunk, or Humio
  • Clear communication skills, including incident leadership, writing postmortems, and influencing engineering priorities

Nice to have

  • Cloud platform experience (AWS, GCP, or Azure)
  • infrastructure-as-code
  • secure production operations
  • real-time or safety-critical systems
  • hardware-in-the-loop
  • embedded/robotics environments
  • fleet operations
  • telemetry pipelines
  • operating software on edge devices at scale
  • defining and running SLOs/SLIs and reliability programs

What the JD emphasized

  • safety-critical systems
  • reliability
  • performance
  • availability
  • monitoring
  • logging
  • alerting
  • incident response
  • automation
  • fleet operations
  • edge devices