Staff Software Engineer, ML Training and Inference Infrastructure

Rivian · Auto · London, United Kingdom · Autonomous Driving

Staff Software Engineer focused on ML training and inference infrastructure for perception systems in autonomous vehicles. Responsibilities include optimizing the performance of deep learning training on GPUs and reducing inference latency on onboard systems, with a focus on transformer architectures. Requires deep knowledge of PyTorch and experience with large-scale distributed training.

What you'd actually do

  1. Optimize the performance of deep learning training workloads on NVIDIA GPU systems at large scale
  2. Optimize the latency of model inference and model pre- and post-processing on onboard systems
  3. Design, train, and deploy large deep learning models that can leverage the vast amount of labeled and unlabeled data

Skills

Required

  • PyTorch
  • Large-scale distributed training
  • Transformer architecture
  • Model training frameworks (e.g. PyTorch Lightning, Ray)
  • Performance optimization (training and inference)
  • Profiling models

Nice to have

  • NVIDIA GPU systems
  • Model pre- and post-processing on onboard systems

What the JD emphasized

  • PhD in CS/CE/EE, or equivalent industry experience
  • Deep knowledge of PyTorch
  • In-depth knowledge of transformer architecture and ways to accelerate the training and inference of transformer models
  • Experience performing large-scale distributed training of models
  • A track record of profiling models and doing detective work to improve model training and inference speed

Other signals

  • optimize training performance
  • optimize inference latency
  • design, train, and deploy large deep learning models
  • distributed training
  • transformer architecture acceleration