Senior Software Engineer, NCCL and CUDA - CSP Engagements

NVIDIA · Semiconductors · Santa Clara, CA +4 · Remote

Senior Software Engineer specializing in NCCL and CUDA for ML software stack functionality and performance in datacenter products. Focuses on customer engagements, root cause analysis, performance optimization, and debugging of libraries and system software for large-scale deployments, particularly for deep learning workloads.

What you'd actually do

  1. Engage with our CSPs to root cause functional and performance issues in NCCL and CUDA libraries.
  2. Analyze and improve multi-GPU workload performance through profiling, benchmarking, and tuning.
  3. Understand and solve NCCL and NVSHMEM data movement issues in multi-node clusters.
  4. Understand and solve CUDA porting issues for customer workloads.
  5. Apply datacenter-specific scheduling and topologies for optimal performance.

Skills

Required

  • parallel programming models
  • communication library runtimes (MPI, NCCL, NVSHMEM)
  • performance optimization and profiling tools (e.g., Nsight, nvprof)
  • C/C++ programming and debugging skills
  • CUDA development
  • PCIe and NVLink
  • operating systems and data-center system architecture
  • high-performance networking such as InfiniBand and RoCE
  • compute, networking, and cloud deployment, specifically on bare metal and VMs
  • containers, cloud provisioning and scheduling tools such as Docker, Kubernetes, SLURM, and Ansible
  • system software validation experience
  • effective communication and collaboration with partner and customer teams

Nice to have

  • Strong software architecture experience
  • Experience with deep learning training and inference workloads
  • Experience conducting performance benchmarking and developing tooling on HPC clusters

What the JD emphasized

  • 8+ years of system software validation experience