System Software Architect, AI and GPU Networking

NVIDIA · Semiconductors · Beijing, China

NVIDIA is seeking a System Software Architect to research and develop advanced networking solutions for AI data centers, focusing on accelerating AI workloads, inference, and model serving. The role involves enhancing GPU networking offerings, designing optimizations for data movement, and evaluating new technologies.

What you'd actually do

  1. Enhance NVIDIA's GPU networking offerings for accelerating AI workloads, such as NVIDIA Dynamo, NVIDIA NIXL, and NVIDIA UCX.
  2. Design and prototype features and optimizations that accelerate data movement and enable new capabilities for inference and model serving, focusing on throughput, latency, and memory efficiency.
  3. Identify and evaluate new technologies, innovations, and partner relationships for alignment with our technology roadmap and business value.
  4. Develop and evaluate innovative features with respect to runtime systems, communication libraries, and AI-specific technologies.
  5. Develop and evaluate enhancements to communication libraries such as NIXL, UCX, and GPUnetIO, tailored to the unique demands of AI workloads.

Skills

Required

  • M.Sc. or Ph.D. in Computer Science, Electrical or Computer Engineering
  • 5+ years of industry experience in system architecture, AI systems architecture, AI scaling, parallelism of AI frameworks, or deep learning training workloads.
  • Algorithm design
  • System programming
  • Computer architecture
  • Operating systems
  • Virtualization
  • Networking
  • Storage
  • Performance profiling and optimization techniques
  • C++
  • Python
  • CUDA or other GPU programming models

Nice to have

  • Demonstrated research track record
  • System architecture
  • CPU/GPU/memory/storage/networking
  • Deep Learning frameworks
  • AI communication libraries (NCCL, UCX, MPI and equivalents)
  • Inference and training workloads and optimizations, such as prefill/decode, data parallelism, tensor parallelism, FSDP, and others.

What the JD emphasized

  • 5+ years of industry experience (or equivalent) in system architecture, AI systems architecture, AI scaling, parallelism of AI frameworks, or deep learning training workloads.
  • Deep understanding of performance profiling and optimization techniques, along with experience defining and using hardware features.
  • Deep understanding of inference and training workloads and optimizations, such as prefill/decode, data parallelism, tensor parallelism, FSDP, and others.

Other signals

  • AI workloads
  • AI data centers
  • distributed AI
  • deep learning solutions
  • inference
  • model serving