Senior AI Training Performance Engineer

NVIDIA · Semiconductors · Shanghai, China

NVIDIA is seeking a Senior AI Training Performance Engineer to optimize AI training workloads on state-of-the-art hardware and software platforms. The role involves analyzing, profiling, and optimizing performance across the hardware/software stack, implementing production-quality software, and building automation tools. It requires a strong background in deep learning training, computer architecture (especially GPU), and performance tuning, plus programming skills in C++, Python, and CUDA.

What you'd actually do

  1. Understand, analyze, profile, and optimize AI and deep learning training workloads on state-of-the-art hardware and software platforms.
  2. Understand the big picture of training performance on GPUs, prioritizing and then solving problems across many dozens of state-of-the-art neural networks.
  3. Implement production-quality software in multiple layers of NVIDIA's deep learning platform stack, from drivers to DL frameworks.
  4. Implement key DL training workloads in NVIDIA's proprietary processor and system simulators to enable future architecture studies.
  5. Build tools to automate workload analysis, workload optimization, and other critical workflows.

Skills

Required

  • PhD (or equivalent experience) in CS, EE, or CSEE with 5+ years of relevant work experience; or MS with 8+ years of relevant work experience.
  • Strong background in deep learning and neural networks, in particular training.
  • Deep understanding of computer architecture, and familiarity with the fundamentals of GPU architecture.
  • Proven experience analyzing and tuning application performance.
  • Experience with processor and system-level performance modelling.
  • Programming skills in C++, Python, and CUDA.
  • Fluency in English.

What the JD emphasized

  • obsessed with performance analysis and optimization
  • squeeze every last clock cycle
  • unafraid to work across all layers of the hardware/software stack
  • peak performance
  • directly impact the hardware and software roadmap
  • Deep understanding of computer architecture
  • fundamentals of GPU architecture
  • Proven experience analyzing and tuning application performance
  • Experience with processor and system-level performance modelling

Other signals

  • AI training workloads
  • GPU architecture
  • Deep Learning Framework
  • performance analysis and optimization