Staff Software Engineer, ML Training and Inference Infrastructure

Rivian · Auto · Palo Alto, CA · Autonomous Driving

Staff Software Engineer focused on optimizing ML training and inference infrastructure for large autonomous driving models on NVIDIA GPU systems, including on-board inference latency optimization and distributed training.

What you'd actually do

  1. Optimize the performance of large-scale deep learning training workloads on NVIDIA GPU systems
  2. Optimize the latency of model inference and model pre- and post-processing on onboard systems
  3. Design, train, and deploy large deep learning models that can leverage vast amounts of labeled and unlabeled data

Skills

Required

  • PyTorch
  • model training frameworks
  • transformer architecture
  • large-scale distributed training
  • model profiling
  • inference optimization

Nice to have

  • CUDA
  • Triton
  • NVIDIA TensorRT
  • NCCL
  • edge computing systems

What the JD emphasized

  • Deep knowledge of PyTorch
  • Knowledge of model training frameworks (e.g., PyTorch Lightning, Ray)
  • In-depth knowledge of transformer architecture and ways to accelerate the training and inference of transformer models
  • Experience performing large-scale distributed training of models
  • A track record of profiling models and doing detective work to improve model training and inference speed

Other signals

  • optimize training performance
  • optimize inference latency
  • design, train, and deploy large deep learning models
  • large-scale distributed training
  • transformer architecture acceleration