Staff Cloud SRE – AI/ML Platform & GPU Compute

Wayve · Robotics · London, United Kingdom · AI Platform

Staff Cloud SRE role focused on building and scaling reliability for AI cloud platforms, including the Model Development Platform and the GPU Compute infrastructure used for model training and inference. The role owns reliability, incident response, observability, and automation for large-scale AI systems and GPU clusters.

What you'd actually do

  1. Own the reliability, availability, and performance of the Model Development Platform and GPU Compute environments.
  2. Participate in a 24/7 on-call rotation as first-line response for cloud and cluster-related incidents.
  3. Design and operate monitoring, logging, tracing, and alerting systems that enable rapid detection and recovery.
  4. Build automation for cluster operations, training workflows, remediation, and scaling tasks.

Skills

Required

  • SRE
  • Production Engineer
  • Cloud Reliability
  • GPU-backed environments
  • large-scale ML infrastructure
  • model training or inference pipelines
  • MLOps
  • Kubernetes
  • AWS
  • GCP
  • Azure
  • distributed systems
  • compute-heavy workloads
  • high-performance workloads
  • large compute clusters
  • Linux
  • Python
  • Go
  • C++
  • automation
  • troubleshooting
  • networking
  • storage
  • performance at scale
  • observability stacks
  • Datadog
  • Prometheus
  • Grafana
  • OpenTelemetry
  • communication skills
  • incident leadership
  • postmortems

Nice to have

  • infrastructure-as-code
  • Terraform
  • secure cloud production environments
  • SLOs/SLIs
  • reliability programs
  • founding SRE hire
  • establishing processes from scratch
  • leadership responsibilities

What the JD emphasized

  • founding Staff SRE
  • from the ground up
  • won’t inherit a mature SRE function, you’ll help create it
  • define the frameworks, automation, and operational standards
  • intersection of AI research, large-scale cloud infrastructure, and production operations
  • directly enable faster model training, reliable experimentation, and scalable AI deployment
  • Essential skills:
  • Proven experience in an SRE, Production Engineer, or Cloud Reliability role supporting large-scale cloud systems.
  • Experience operating GPU-backed environments or large-scale ML infrastructure.
  • Experience running model training or inference pipelines in production (MLOps).
  • Strong Kubernetes experience, including operating production clusters.
  • Hands-on experience running production workloads in AWS, GCP, or Azure.
  • Experience operating complex distributed systems in production, ideally including compute-heavy or high-performance workloads.
  • Experience working with large compute clusters; exposure to AI/ML training or inference workloads strongly preferred.
  • Strong Linux fundamentals and proficiency in at least one scripting or systems language (e.g. Python, Go, C++) with a bias toward automation.
  • Deep troubleshooting skills across networking, storage, distributed systems, and performance at scale.
  • Experience designing and operating observability stacks (e.g. Datadog, Prometheus, Grafana, OpenTelemetry).
  • Clear communication skills, including leading incidents, writing postmortems, and influencing teams to prioritise reliability improvements.

Other signals

  • GPU Compute
  • AI/ML Platform
  • Model Development Platform
  • large-scale, multi-tenant GPU fleets
  • model training and inference at scale