Senior HPC and AI Network Software Architect

NVIDIA · Semiconductors · Zurich, Switzerland +4 · Remote

NVIDIA is seeking a Senior HPC and AI Network Software Architect to design and build scalable AI infrastructure for distributed training and inference. The role involves developing software and hardware approaches that optimize communication efficiency and performance across large-scale systems, in collaboration with AI framework and hardware teams.

What you'd actually do

  1. Build and evolve the architecture of scalable software systems for distributed AI training and inference, focusing on throughput, latency, resiliency, and memory efficiency across cluster-scale deployments.
  2. Develop and evaluate next-generation communication and runtime capabilities in libraries such as NCCL, UCX, and UCC, tailored to the evolving demands of frontier AI workloads.
  3. Partner with AI framework teams (e.g., TensorFlow, PyTorch, JAX) and internal platform teams to build integrations, explore new approaches, and improve end-to-end performance and reliability.
  4. Collaborate on hardware and system-level features across GPUs, DPUs, and interconnects to speed up data movement and enable new capabilities for training, inference, and model serving at scale.
  5. Drive innovation across runtime systems, communication libraries, and AI-specific protocol layers, helping turn new ideas into practical capabilities and robust implementations.

Skills

Required

  • systems programming
  • parallel or distributed computing
  • high-performance networking
  • large-scale data movement
  • C++
  • Python
  • CUDA
  • AI frameworks (PyTorch, TensorFlow, JAX)
  • communication libraries
  • runtime systems
  • high-throughput, low-latency systems
  • software stacks
  • hardware capabilities
  • system bottlenecks
  • collaboration skills

Nice to have

  • NCCL
  • UCX
  • UCC
  • networking and communication protocols
  • RDMA
  • collective communications
  • congestion-aware transport
  • accelerator-aware networking
  • large model training
  • inference serving
  • hardware-software co-design
  • GPU
  • DPU
  • interconnect
  • runtime capabilities
  • infrastructure for deployment of LLMs
  • transformer-based models
  • sharding
  • pipelining
  • expert parallelism
  • hybrid parallelism

What the JD emphasized

  • Ph.D., or equivalent industry experience
  • 5+ years of experience
  • track record of building production-quality performance-critical software
  • extensive hands-on experience
  • solid grasp
  • demonstrated success
  • strong collaboration skills

Other signals

  • distributed training
  • real-time inference
  • communication efficiency
  • large systems
  • AI workloads