Senior High-Performance LLM Training Engineer

NVIDIA · Semiconductors · Santa Clara, CA

NVIDIA is seeking an experienced Senior High-Performance LLM Training Engineer to optimize LLM training workloads on advanced computing systems. The role focuses on improving the efficiency of NVIDIA's high-performance LLM software stack using frameworks like PyTorch and JAX for training on thousands of GPUs, and influencing future hardware roadmaps.

What you'd actually do

  1. Understand, analyze, profile, and optimize AI training workloads on innovative hardware and software platforms.
  2. Understand the big picture of training performance on GPUs, prioritizing and then solving problems across all state-of-the-art neural networks.
  3. Implement production-quality software in multiple layers of NVIDIA's deep learning platform stack, from drivers to DL frameworks.
  4. Build and support NVIDIA submissions to the MLPerf Training benchmark suite.
  5. Implement key DL training workloads in NVIDIA's proprietary processor and system simulators to enable future architecture studies.

Skills

Required

  • PhD in Computer Science, Electrical Engineering, or Computer Engineering and 5+ years of meaningful work experience; or MS (or equivalent experience) and 8+ years.
  • Strong background in deep learning and neural networks, in particular training.
  • A deep background in computer architecture and familiarity with the fundamentals of GPU architecture.
  • Proven experience analyzing and tuning application performance, as well as processor and system-level performance modelling.
  • Programming skills in C++, Python, and CUDA.

Nice to have

  • PyTorch
  • JAX
  • MLPerf Training benchmark suite
  • NVIDIA's proprietary processor and system simulators

What the JD emphasized

  • high-performance training on thousands of GPUs
  • production-quality software
  • deep learning and neural networks, in particular training
  • computer architecture
  • GPU architecture
  • analyzing and tuning application performance
  • processor and system-level performance modelling
  • C++, Python, and CUDA

Other signals

  • optimizing LLM training workloads
  • shaping hardware roadmaps for next-gen GPUs