Senior System Software Engineer - GPU Performance

NVIDIA · Semiconductors · Santa Clara, CA +1 · Remote

Senior System Software Engineer focused on GPU performance for communication libraries (NCCL, NVSHMEM, UCX) used in Deep Learning and HPC applications. The role involves performance characterization and analysis, root-causing performance issues, and building data visualization tools on large multi-GPU, multi-node clusters. It requires experience with parallel programming, communication runtimes, and systems software fundamentals.

What you'd actually do

  1. Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters.
  2. Study the interaction of our libraries with all HW (GPU, CPU, Networking) and SW components in the stack.
  3. Evaluate proofs of concept and conduct trade-off analysis when multiple solutions are available.
  4. Triage and root-cause performance issues reported by our customers.
  5. Collect large volumes of performance data; build tools and infrastructure to visualize and analyze the information.

Skills

Required

  • parallel programming
  • communication runtimes (MPI, NCCL, UCX, NVSHMEM)
  • performance benchmarking
  • triage on large-scale HPC clusters
  • computer system architecture
  • HW-SW interactions
  • operating system principles
  • micro-benchmarks in C/C++
  • scripting language (Python)
  • containers
  • cloud provisioning
  • scheduling tools (Kubernetes, SLURM, Ansible, Docker)

Nice to have

  • InfiniBand/Ethernet networks
  • RDMA
  • network topologies
  • congestion control
  • network issues debugging
  • CUDA programming
  • GPUs
  • Deep Learning Frameworks (PyTorch, TensorFlow)

What the JD emphasized

  • performance engineering
  • HPC
  • parallel programming
  • communication runtime
  • performance benchmarking
  • large-scale HPC clusters
  • computer system architecture
  • HW-SW interactions
  • systems software fundamentals
  • performance issues
  • large-scale deployments