Senior System Software Engineer - AI Performance and Efficiency Tools

NVIDIA · Semiconductors · Santa Clara, CA

A Senior System Software Engineer role focused on developing and improving tools for the performance and efficiency of AI workloads on GPU clusters, in support of AI researchers and SW/HW teams. The work involves building profiling, debugging, and benchmarking tools and partnering with hardware architects.

What you'd actually do

  1. Build internal profiling and analysis tools for AI workloads at large scale
  2. Build debugging tools for commonly encountered problems, such as memory or networking issues
  3. Create benchmarking and simulation technologies for AI systems or GPU clusters
  4. Partner with HW architects to propose new features or improve existing ones based on real-world use cases

Skills

Required

  • BS+ in Computer Science or related (or equivalent experience) and 5+ years of software development
  • Strong software skills in design, coding (C++ and Python), analysis, and debugging
  • Good understanding of deep learning frameworks such as PyTorch and TensorFlow, and of distributed training and inference
  • Knowledge of GPU cluster job scheduling (Slurm or Kubernetes), storage, and networking
  • Experience with NVIDIA GPUs, CUDA programming, and NCCL
  • Motivated self-starter with strong problem-solving skills and customer-facing communication skills
  • Passion for continuous learning and the ability to work concurrently with multiple global groups

Nice to have

  • Proven experience with continuous profiling and analysis tools or platforms at GPU-cluster scale
  • Solid experience in performance analysis of large AI jobs for training and inference workloads
  • Knowledge of Linux device drivers and/or compiler implementation
  • Knowledge of GPU and/or CPU architecture and general computer architecture principles

What the JD emphasized

  • Strong software skills in design, coding (C++ and Python), analysis, and debugging
  • Good understanding of deep learning frameworks such as PyTorch and TensorFlow, and of distributed training and inference
  • Experience with NVIDIA GPUs, CUDA programming, and NCCL
  • Proven experience with continuous profiling and analysis tools or platforms at GPU-cluster scale
  • Solid experience in performance analysis of large AI jobs for training and inference workloads

Other signals

  • Developing tools for AI researchers and SW/HW teams running AI workloads in GPU clusters
  • Build internal profiling and analysis tools for AI workloads at large scale
  • Create benchmarking and simulation technologies for AI systems or GPU clusters
  • Solid experience in performance analysis of large AI jobs for training and inference workloads