Principal Deep Learning Communication Architect

NVIDIA · Semiconductors · Santa Clara, CA +2 · Remote

NVIDIA is seeking a Principal Deep Learning Communication Architect to lead the technical roadmap for communication libraries across next-generation platforms, ensuring seamless scaling of models to massive clusters. The role involves designing and optimizing communication primitives for heterogeneous interconnects, co-designing with application developers and silicon architects, and developing analytical models for system behavior. Expertise in parallel computing, HPC/distributed deep learning, inference engines, and GPU architecture is required.

What you'd actually do

  1. Define the long-term technical roadmap for communication libraries across NVIDIA’s next-generation platforms. You will ensure the seamless scaling of models to clusters comprising hundreds of thousands of nodes.
  2. Lead the development of next-generation communication primitives and collective algorithms. This includes optimizing for heterogeneous interconnects such as NVLink, Spectrum-X (Ethernet), and Quantum-X (InfiniBand).
  3. Partner with application developers to architect and implement specialized communication primitives. You will ensure that AI and HPC libraries (including NCCL, NIXL, NVSHMEM, UCC, and UCX) evolve to meet the requirements of trillion-parameter LLMs and Agentic AI.
  4. Collaborate with silicon architects and software engineers to influence hardware specifications for next-generation networking, ensuring they meet the evolving demands of trillion-parameter LLMs and Agentic AI.
  5. Develop high-fidelity analytical models and simulators to predict system behavior under emerging workloads.

Skills

Required

  • Ph.D. or M.S. in Computer Science, Electrical Engineering, or a related field (or equivalent experience), with 12+ years of industry experience in high-performance computing (HPC) or distributed deep learning.
  • Deep understanding of 3D parallelism (Data, Tensor, Pipeline) and advanced strategies including Context Parallelism, Expert Parallelism, and Zero Redundancy Optimizer (ZeRO) variants.
  • Deep technical proficiency with NCCL, UCX, UCC, NVSHMEM, or MPI.
  • Experience with RDMA, RoCE, and low-level InfiniBand verbs.
  • Advanced knowledge of high-throughput inference engines and schedulers, specifically TensorRT-LLM, vLLM, SGLang, and NVIDIA Dynamo.
  • Expert knowledge of the NVIDIA GPU memory hierarchy (HBM3e/HBM4, L2 cache) and CUDA programming models.

Nice to have

  • Hands-on experience developing within Megatron-Core, DeepSpeed, or JAX/XLA, with an understanding of how these frameworks interact with low-level communication runtimes.
  • Significant upstream contributions to major open-source projects (e.g., PyTorch Distributed, KServe, or Ray).
  • A proven track record of deploying and optimizing models on NVIDIA platforms or similar rack-scale systems.
  • A strong portfolio of patents or papers in top-tier systems/architecture venues (e.g., ISCA, ASPLOS, NeurIPS, SC).

What the JD emphasized

  • trillion-parameter LLMs and Agentic AI
  • high-throughput inference engines and schedulers
  • Expert knowledge of the NVIDIA GPU memory hierarchy (HBM3e/HBM4, L2 cache) and CUDA programming models.

Other signals

  • trillion-parameter LLMs
  • Agentic AI
  • next-generation platforms
  • hundreds of thousands of nodes