AI Infrastructure Engineer, Distributed Training, Optimus

Tesla · Auto · Palo Alto, CA · Tesla AI

The AI Infrastructure Engineer will build and improve the training infrastructure, pipelines, and deployment tools for neural networks used in Optimus robots. This role focuses on enabling faster and more stable training, validating PyTorch models, managing datasets, and deploying trained models to Tesla hardware, with a significant emphasis on scaling training jobs across GPU clusters.

What you'd actually do

  1. Build and improve our Python training infrastructure for stable and faster training
  2. Build the tooling and infrastructure for reporting and visualizing model metrics and performance
  3. Build the pipelines to run and validate our PyTorch models
  4. Manage, analyze, and visualize our training and test datasets
  5. Coordinate with the team managing the hardware cluster to maintain high availability and high job throughput for machine learning workloads
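
Responsibilities 1 and 2 center on training infrastructure and metric reporting. As a rough illustration of the kind of tooling involved, a minimal metrics logger might aggregate scalars per step and emit JSON lines that a dashboard could plot. The class name, API, and format here are illustrative assumptions, not Tesla's actual tooling:

```python
# Hypothetical minimal metrics logger for training runs: aggregates scalar
# metrics per step and emits JSON lines that downstream dashboards could plot.
# The API and on-disk format are illustrative assumptions, not Tesla tooling.
import json
from collections import defaultdict


class MetricsLogger:
    def __init__(self) -> None:
        # name -> list of (step, value) pairs, for later visualization
        self._history = defaultdict(list)

    def log(self, step: int, **metrics: float) -> str:
        """Record metrics for one step and return a JSON line."""
        for name, value in metrics.items():
            self._history[name].append((step, value))
        record = {"step": step, **metrics}
        return json.dumps(record, sort_keys=True)

    def latest(self, name: str) -> float:
        """Most recently logged value for a metric."""
        return self._history[name][-1][1]


# Example: log a couple of training steps.
logger = MetricsLogger()
line = logger.log(step=1, loss=0.93, lr=1e-3)
logger.log(step=2, loss=0.71, lr=1e-3)
```

In a real system the JSON lines would be shipped to a metrics store or dashboard rather than kept in memory.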

Skills

Required

  • Python
  • C++
  • System-level software
  • Hardware-software interactions
  • Resource utilization
  • Modern machine learning concepts
  • State-of-the-art deep learning
  • Training frameworks
  • PyTorch
  • Scaling neural network training jobs across clusters of GPUs

Nice to have

  • Deep learning deployment
  • Profiling and optimizing CPU-GPU interactions
  • Pipelining compute/transfers

What the JD emphasized

  • Scaling neural network training jobs across clusters of GPUs
  • Deploying trained neural nets to Tesla hardware
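
On the deployment side, a standard way to get a trained PyTorch model off the Python training stack and onto embedded hardware is to export it to TorchScript, which a C++ runtime can load without a Python interpreter. The posting doesn't specify Tesla's actual deployment path; this is just one common approach, with a placeholder model:

```python
# Hypothetical sketch of packaging a trained PyTorch model for deployment:
# export to TorchScript so the network can run without Python, e.g. from a
# C++ runtime on embedded hardware. The model here is a placeholder.
import torch

model = torch.nn.Sequential(torch.nn.Linear(8, 4), torch.nn.ReLU())
model.eval()  # freeze dropout/batch-norm behavior before export

example = torch.randn(1, 8)
scripted = torch.jit.trace(model, example)  # record the graph via an example input
# scripted.save("model.pt")  # the artifact a C++/embedded runtime would load

out = scripted(example)
```

Real deployment pipelines usually add quantization, operator-compatibility checks, and numerical parity tests against the original model.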

Other signals

  • Build and improve our Python training infrastructure
  • Build the pipelines to run and validate our PyTorch models
  • Demonstrated experience scaling neural network training jobs across clusters of GPUs
  • Build and improve tooling to deploy trained neural nets to Tesla hardware