Senior Performance Engineer - Deep Learning

NVIDIA · Semiconductors · Santa Clara, CA

Senior Performance Engineer at NVIDIA focused on optimizing Deep Learning models and frameworks (PyTorch, JAX) for NVIDIA GPUs. The role involves building and supporting Transformer Engine, collaborating on systems research for performance improvements, implementing and benchmarking new DL models, contributing to MLPerf, and engaging with the open-source community and enterprise customers. It also involves influencing future hardware and software design.

What you'd actually do

  1. Build and support Transformer Engine, the open-source library for accelerating the training of Large Language Models.
  2. Collaborate on systems research that improves Deep Learning model performance, such as extremely low-precision training and parallelism methods.
  3. Implement, benchmark, and optimize new Deep Learning models, such as LLMs, straight out of groundbreaking research so that they scale efficiently on NVIDIA GPUs and systems.
  4. Build and contribute to NVIDIA submissions on community benchmarks such as MLPerf.
  5. Engage with the open-source community as well as support enterprise customers and partners by delivering the benefits of NVIDIA’s latest hardware and software innovations.

Skills

Required

  • C++
  • Python
  • parallel systems programming
  • GPU programming
  • Computer Architecture
  • Code Optimization
  • Operating Systems
  • developing large software projects
  • communication skills

Nice to have

  • PyTorch
  • JAX
  • DL frameworks
  • performance analysis
  • profiling
  • code optimization techniques
  • multi-GPU or multi-node systems
  • LLM architectures
  • attention mechanisms
  • cuBLAS
  • cuDNN
  • cuSOLVER
  • CUDA
  • OpenAI Triton
  • Pallas
  • open source contributions

What the JD emphasized

  • Transformer Engine
  • accelerating the training of Large Language Models
  • Deep Learning model performance
  • training using extremely low precision
  • parallelism methods
  • LLMs
  • scale efficiently on NVIDIA GPUs and systems
  • MLPerf
  • open-source community
  • enterprise customers
  • NVIDIA’s latest hardware and software innovations
  • performance analysis
  • profiling
  • code optimization techniques
  • multi-GPU or multi-node systems
  • modern LLM architectures
  • attention mechanisms
  • low-level DL libraries
  • GPU kernels
  • CUDA
  • OpenAI Triton
  • Pallas

Other signals

  • optimizing deep learning frameworks
  • accelerating training of large language models
  • implementing and benchmarking new deep learning models
  • contributing to community benchmarks
  • influencing hardware and software design