Senior Deep Learning Framework Communications Engineer

NVIDIA · Semiconductors · Santa Clara, CA +4 · Remote

This Senior Deep Learning Framework Communications Engineer role at NVIDIA focuses on integrating and optimizing communication libraries (NCCL, NVSHMEM) within AI frameworks (PyTorch, TRT-LLM, vLLM, JAX) to improve performance for large-scale AI training and inference. The role involves deep analysis of AI workloads, compiler improvements, and kernel authoring for multi-GPU systems.

What you'd actually do

  1. Integrate new communication library features into AI frameworks, from PoC through performance analysis to production.
  2. Perform deep analysis of AI workloads and frameworks to identify multi-GPU communication requirements and opportunities. Collaborate hands-on with teams working on the latest AI models.
  3. Improve AI compilers to hide communications or perform automatic fusion.
  4. Conduct in-depth AI workload performance characterization on multi-GPU clusters.
  5. Design fault-tolerant and elastic solutions for large-scale or dynamic AI workloads.

Skills

Required

  • Python
  • C++
  • CUDA
  • PyTorch
  • JAX
  • TRT-LLM
  • vLLM
  • SGLang
  • Triton
  • cuTe
  • AI compilers
  • performance benchmarking
  • HPC/AI communication concepts

Nice to have

  • NCCL
  • NVSHMEM
  • MPI
  • computer system architecture
  • HW-SW interactions
  • operating systems principles
  • Distributed inference
  • MoE
  • Reinforcement Learning
  • kernel authoring
  • AI compiler pattern matching
  • memory hierarchy
  • consistency model
  • tensor layout

What the JD emphasized

  • 5+ years of software engineering and HPC/AI experience
  • Development or integration experience with Deep Learning Frameworks such as PyTorch and JAX, and Inference Engines such as TRT-LLM, vLLM, and SGLang
  • Rapid prototyping and development with Python, C++, CUDA or related DSLs (Triton, cuTe)
  • Solid grasp of AI models, parallelisms, and/or compiler technologies (e.g. torch.compile)
  • Experience conducting performance benchmarking on AI clusters
  • Understanding of HPC/AI communication concepts (1-sided vs. 2-sided communication, elasticity, resiliency, topology discovery, etc.)

Other signals

  • integrating communication libraries into AI frameworks
  • improving AI compilers
  • authoring custom communication kernels
  • performance analysis of AI workloads