Senior GPU Networking Architect

NVIDIA · Semiconductors · Zurich, Switzerland +4 · Remote

This role focuses on building and optimizing GPU communication kernels for large-scale AI systems, at the boundary between GPU computing and networking. The Senior GPU Networking Architect applies deep knowledge of GPU architecture to improve kernel efficiency, minimize latency, and overlap computation with communication. Responsibilities include developing GPU-resident communication primitives, profiling and tuning kernels, and co-designing communication strategies with network software, hardware, and AI framework teams. The role requires strong CUDA programming, GPU architecture fundamentals, and systems-level C/C++ development.

What you'd actually do

  1. Build and optimize GPU communication kernels that underpin collective and point-to-point operations in large-scale AI systems.
  2. Leverage deep knowledge of GPU architecture—thread scheduling, memory hierarchy, execution pipelines—to improve kernel efficiency, minimize latency, and overlap computation with communication.
  3. Develop GPU-resident communication primitives and device-side APIs that enable fine-grained, kernel-initiated data movement across nodes and accelerators.
  4. Profile and tune GPU kernels end-to-end, identifying bottlenecks at the intersection of compute, memory, and network, and driving targeted optimizations.
  5. Collaborate with network software, hardware, and AI framework teams to co-design communication strategies that align with GPU execution patterns and emerging model architectures.

Skills

Required

  • CUDA programming
  • GPU architecture fundamentals
  • systems-level C/C++ development
  • GPU data movement mechanisms
  • GPU performance profiling

Nice to have

  • NCCL, NVSHMEM, or similar GPU-aware communication frameworks
  • distributed deep learning parallelism techniques
  • RDMA, InfiniBand, high-speed networking, GPU system topology
  • compute-communication overlap techniques (kernel pipelining, persistent kernels, cooperative groups)
  • LLM training or inference workloads

What the JD emphasized

  • 5+ years of hands-on CUDA programming, including writing and optimizing non-trivial GPU kernels.
  • Strong understanding of GPU architecture fundamentals.
  • Familiarity with GPU data movement mechanisms such as GPUDirect RDMA and GPU-initiated communication.
  • Ability to read and reason about GPU performance profiles.
  • Experience developing or optimizing communication kernels in libraries such as NCCL, NVSHMEM, or similar GPU-aware communication frameworks.
  • Understanding of distributed deep learning parallelism techniques.
  • Background in RDMA, InfiniBand, high-speed networking, and GPU system topology.
  • Experience with overlap techniques such as kernel pipelining, persistent kernels, or cooperative groups to hide communication latency behind compute.
  • Proven experience evaluating and optimizing large-scale LLM training or inference workloads.

Other signals

  • GPU communication kernels
  • large-scale AI systems
  • GPU architecture
  • low-latency
  • distributed deep learning