Training Performance Engineer

OpenAI · AI Frontier · San Francisco, CA · Scaling

The Training Performance Engineer will drive efficiency improvements across OpenAI's distributed training stack, optimizing GPU utilization, throughput, and uptime for large-scale model training. The role involves profiling training runs, analyzing performance bottlenecks in compute, communication, and storage, and collaborating with engineers to improve kernel efficiency, scheduling, and collective communication performance, particularly for pre-training.

What you'd actually do

  1. Profile end-to-end training runs to identify performance bottlenecks across compute, communication, and storage.
  2. Optimize GPU utilization and throughput for large-scale distributed model training.
  3. Collaborate with runtime and systems engineers to improve kernel efficiency, scheduling, and collective communication performance.
  4. Implement model graph transforms to improve end-to-end throughput.
  5. Build tooling to monitor and visualize MFU (model FLOPs utilization), throughput, and uptime across clusters.

Skills

Required

  • Python
  • C++
  • distributed systems debugging
  • performance analysis
  • HPC clusters
  • multi-GPU systems
  • PyTorch
  • JAX
  • TensorFlow

Nice to have

  • Rust
  • CUDA
  • NCCL
  • MPI
  • UCX
  • large-scale data loading
  • checkpointing systems
  • training runtime
  • distributed scheduling
  • ML compiler optimization

What the JD emphasized

  • performance engineering
  • distributed training
  • large-scale training
  • pre-training

Other signals

  • distributed training runtime
  • performance engineering
  • GPU utilization
  • large-scale model training
  • pre-training