Principal Engineer, Model Development Platform

Wayve · Robotics · Sunnyvale, CA · AI Platform

This is a Principal Engineer role on the Model Development Platform at Wayve, focused on the end-to-end AI model lifecycle for Embodied AI in autonomous vehicles. The role owns the architecture, reliability, scalability, and coherence of the platform, which supports data ingestion, training, experiment scheduling, and on-road testing. It requires deep technical leadership across web applications, distributed compute, ML Ops, data pipelines, and optimization algorithms, so that researchers and engineers can iterate on and deploy models safely and efficiently.

What you'd actually do

  1. System architecture & reliability - Design and evolve the platform's overall architecture for reliability, observability, and scalability. Set performance, latency, and availability targets, and drive the engineering standards to meet them.
  2. Cross-domain technical leadership - Unify the platform across disciplines, from front-end UIs and distributed training to Spark data pipelines and optimization-based experiment scheduling, ensuring systems interoperate cleanly.
  3. Hands-on problem solving - Dive into the hardest challenges across subteams, lead architectural reviews, and propose pragmatic solutions that balance innovation with operational simplicity.
  4. Experimentation & scheduling systems - Build systems that optimize how models are tested in simulation and on-road, using techniques like linear programming and heuristic optimization to balance hardware, safety, and research priorities while improving throughput and turnaround.
  5. Data & compute infrastructure - Architect pipelines that ingest, transform, and enrich petabytes of fleet sensor data, and drive efficient compute use across GPU, CPU, cloud, and edge for both prototyping and large-scale training.

Skills

Required

  • Technical Leadership at Scale
  • Architectural Depth & Breadth
  • Reliability and Performance
  • Hands-On Systems Design
  • Collaborative Influence
  • Mentorship & Culture

Nice to have

  • Optimization & Scheduling Expertise
  • ML Ops & Experimentation Systems
  • Domain Experience
  • Full-Stack Fluency
  • Data Governance

What the JD emphasized

  • 10+ years of experience designing and building large-scale distributed systems, ML/AI infrastructure, full-stack web applications, or developer platforms, including at least 3 years as a staff or principal-level engineer.
  • Proven ability to design systems spanning web platforms, ML pipelines, and large-scale compute orchestration (e.g., Spark, Ray, Kubernetes, Airflow, MLflow).
  • Experience driving platform reliability improvements, defining SLAs/SLOs, and building self-healing and observable systems that operate at “four nines” availability or better.
  • Deep understanding of distributed computing, workflow orchestration, data modeling, and API design, with the ability to write and review production-quality code.
  • Experience applying algorithmic or mathematical optimization (e.g., linear programming, graph algorithms) to operational or scheduling problems.
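
For a sense of scale on the availability bullet above, "four nines" means 99.99% uptime. The sketch below works out the downtime that target allows. It is an illustration only: the function name downtime_budget_minutes is hypothetical, the 365-day year and 30-day window are assumptions, and none of it comes from the JD or from Wayve's actual SLO definitions.

```python
# Downtime budget implied by an availability target (illustrative sketch).
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability: float, days: float) -> float:
    """Return the allowed downtime, in minutes, over a window of `days`.

    `availability` is a fraction between 0 and 1 (0.9999 for four nines).
    The name and signature are hypothetical, chosen for this example.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * days * MINUTES_PER_DAY


four_nines = 0.9999
per_year = downtime_budget_minutes(four_nines, days=365)    # assumes a 365-day year
per_30_days = downtime_budget_minutes(four_nines, days=30)  # assumes a 30-day window

print(f"{per_year:.1f} min/year, {per_30_days:.1f} min per 30 days")
# prints "52.6 min/year, 4.3 min per 30 days"
```

Under those assumptions, the entire yearly budget is under an hour, which is why the bullet pairs the target with self-healing and observable systems rather than manual recovery.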

Other signals

  • building large-scale distributed systems
  • ML/AI infrastructure
  • platform reliability, scalability, and coherence
  • accelerate model development and fleet learning
  • optimize how models are tested in simulation and on-road