Senior Deep Learning Communication Architect

NVIDIA · Semiconductors · Santa Clara, CA +1

Senior Deep Learning Communication Architect role focused on optimizing communication performance for large-scale distributed deep learning training and inference. This involves identifying bottlenecks, designing efficient protocols, collaborating on hardware/software co-design, and exploring new communication technologies. The role requires deep understanding of parallelism techniques and experience with DNN frameworks and GPU computing.

What you'd actually do

  1. Optimizing communication performance: Identify and eliminate bottlenecks in data transfer and synchronization during distributed deep learning training and inference.
  2. Designing efficient communication protocols: Develop and implement communication algorithms and protocols tailored for deep learning workloads, minimizing communication overhead and latency.
  3. Hardware and software co-design: Collaborate with hardware and software teams to design systems that make effective use of high-speed interconnects (e.g., NVLink, InfiniBand, SPC-X) and communication libraries (e.g., MPI, NCCL, UCX, UCC, NVSHMEM).
  4. Exploring innovative communication technologies: Research and evaluate new communication technologies and techniques to enhance the performance and scalability of deep learning systems.
  5. Developing and implementing solutions: Build proofs-of-concept, conduct experiments, and perform quantitative modeling to validate and deploy new communication strategies.

Skills

Required

  • Ph.D., Master's, or BS in Computer Science (CS), Electrical Engineering (EE), Computer Science and Electrical Engineering (CSEE), or a closely related field, or equivalent experience
  • 6+ years of experience in building DNNs, scaling of DNNs, parallelism of DNN frameworks, or deep learning training and inference workloads
  • Experience in evaluating, analyzing, and optimizing LLM training and inference performance of state-of-the-art models on cutting-edge hardware
  • Deep understanding of parallelism techniques, including Data Parallelism, Pipeline Parallelism, Tensor Parallelism, Expert Parallelism, and FSDP
  • Understanding of emerging serving architectures such as disaggregated serving, and of inference servers such as Dynamo and Triton
  • Proficiency in developing code for one or more deep neural network (DNN) training and inference frameworks, such as PyTorch, TensorRT-LLM, vLLM, or SGLang
  • Strong programming skills in C++ and Python
  • Familiarity with GPU computing, including CUDA and OpenCL
  • Familiarity with InfiniBand and RoCE networks

Nice to have

  • Prior contributions to one or more DNN training and inference frameworks
  • Deep understanding of, and contributions to, the scaling of LLMs on large-scale systems

What the JD emphasized

  • Scaling of DNNs
  • deep learning training and inference workloads
  • LLM training and inference performance
  • large-scale systems

Other signals

  • Scaling DNN training and inference frameworks to hundreds of thousands of nodes
  • Optimizing communication performance for distributed deep learning
  • Designing efficient communication protocols for deep learning workloads
  • Hardware and software co-design for high-speed interconnects
  • Evaluating and optimizing LLM training and inference performance