Staff ML Performance Engineer (Training Efficiency)

Wayve · Robotics · Sunnyvale, CA · AI Platform

Staff ML Performance Engineer focused on optimizing large-scale ML training and inference workloads for Embodied AI technology in autonomous driving systems. The role involves profiling ML jobs, designing and implementing efficiency improvements (parallelism, model compilation, mixed precision), developing observability tools, and creating benchmarking tools to track performance gains. Close collaboration with Research teams is key to integrating efficiency improvements and fostering a culture of performance optimization.

What you'd actually do

  1. Profile ML workloads to identify their bottlenecks, e.g. using NVIDIA Nsight Systems
  2. Design and implement efficiency improvements to maximize MFU and throughput, e.g. parallelism, model compilation, mixed precision
  3. Design and implement observability tools to identify bottlenecks and drive performance improvements, e.g. to track MFU, throughput and latency
  4. Design and implement benchmarking tools, e.g. to track efficiency gains or regressions
  5. Collaborate closely with Research teams to integrate training efficiency improvements and create a culture of performance optimization

Skills

Required

  • Python
  • BS or MS in Machine Learning, Computer Science, Engineering, or a related technical discipline or equivalent experience

Nice to have

  • Experience working with concurrent, parallel and distributed computing.
  • Experience using NVIDIA Nsight Systems or other system profilers.
  • Experience implementing GPU kernels (CUDA, Triton, etc.).
  • Knowledge of computing fundamentals: what makes code fast, secure and reliable.

What the JD emphasized

  • 10+ years of industry experience driving performance engineering across ML systems, GPU compute infrastructure, distributed platforms or a similar field.
  • Experience optimizing large-scale jobs on GPU compute clusters.
  • Experience in writing, reporting, and tracking performance benchmarks in an open and accessible way.

Other signals

  • Optimizing large-scale ML jobs
  • Increase efficiency of training and inference workloads
  • Train larger models faster
  • MFU and throughput
  • GPU compute clusters