Senior Software Engineer, NCCL

NVIDIA · Semiconductors · Santa Clara, CA

A Senior Software Engineer role to design, implement, and maintain highly optimized communication runtimes for Deep Learning frameworks (e.g. NCCL for TensorFlow/PyTorch) and HPC programming interfaces (e.g. UCX for MPI/OpenSHMEM) on GPU clusters. The role also involves designing system software for GPU interactions and building proofs of concept for new designs.

What you'd actually do

  1. Design, implement, and maintain highly optimized communication runtimes for Deep Learning frameworks (e.g. NCCL for TensorFlow/PyTorch) and HPC programming interfaces (e.g. UCX for MPI/OpenSHMEM) on GPU clusters.
  2. Participate in and contribute to parallel programming interface specifications such as MPI and OpenSHMEM.
  3. Design, implement, and maintain system software that enables interactions among GPUs and between GPUs and other system components.
  4. Create proofs of concept to evaluate and motivate extensions to programming models, new runtime designs, and new hardware features.

Skills

Required

  • C/C++ programming
  • Linux
  • Computer system architecture
  • Operating systems
  • Parallel programming interfaces
  • Communication runtimes

Nice to have

  • CUDA programming
  • NVIDIA GPUs
  • High-performance networks (InfiniBand, iWARP)
  • HPC applications
  • Deep Learning Frameworks (PyTorch, TensorFlow)
  • Collaborative and interpersonal skills

What the JD emphasized

  • Excellent C/C++ programming and debugging skills.
  • Expert understanding of computer system architecture and operating systems.
  • Experience with parallel programming interfaces and communication runtimes.