Senior System Software Engineer – Data Center GPU Compute Diagnostics

NVIDIA · Semiconductors · Durham, NC

Senior System Software Engineer at NVIDIA focused on developing diagnostics and stress tests for Data Center GPU compute engines in AI supercomputer systems. The role involves low-level hardware interaction, performance tuning, and validation of next-generation GPUs, with a strong emphasis on CUDA, C++, and Python programming and an understanding of computer architecture and GPU memory systems. The engineer will also mentor other team members and collaborate with hardware and manufacturing teams.

What you'd actually do

  1. Working closely with hardware architecture, driver, manufacturing, and field teams throughout the product development lifecycle of rack-scale AI systems.
  2. Crafting the CUDA/C++ diagnostic workloads and software infrastructure required for new chip development, validation, productization, and field triage.
  3. Designing and implementing GPU compute tests that stress Tensor Cores, SMs, L2/cache hierarchy, HBM memory, and related power/thermal operating points.
  4. Developing and tuning GEMM-style diagnostic workloads, including tests combined with additional load on NVLink, PCIe, or CPU subsystems.
  5. Developing and integrating higher-level AI workload tests, including PyTorch-based large model workloads to stress GPUs, memory, interconnects, thermals, and system software under realistic rack-scale AI use cases.

Skills

Required

  • BS or MS degree in Electrical Engineering, Computer Engineering, Computer Science, or equivalent experience.
  • 12+ years of system software, GPU software, embedded software, or hardware validation experience.
  • Experience driving technical work across multiple engineers, mentoring others, or leading development of a complex software component.
  • Strong C/C++ and Python programming skills.
  • Understanding of memory systems, ECC behavior, cache hierarchy, bandwidth bottlenecks, and hardware failure signatures.
  • Understanding of GEMM-style workloads and how workload shape, precision, runtime, and verification affect compute stress, power, memory, and thermal behavior.
  • Experience with voltage/frequency characterization, thermal testing, power stress, or related silicon validation concepts such as Vmin/Fmax and P-state testing.
  • Background with PCIe, NVLink, or networking technologies such as InfiniBand and Ethernet.

Nice to have

  • Experience with Linux device drivers, CUDA kernels, GPU compute workloads, or related accelerator programming (strongly preferred).

What the JD emphasized

  • Experience writing diagnostics and stress tests that interface to low-level hardware drivers and hardware registers.
  • Understanding of memory systems, ECC behavior, cache hierarchy, bandwidth bottlenecks, and hardware failure signatures.
  • Understanding of GEMM-style workloads and how workload shape, precision, runtime, and verification affect compute stress, power, memory, and thermal behavior.
  • Experience with voltage/frequency characterization, thermal testing, power stress, or related silicon validation concepts such as Vmin/Fmax and P-state testing.
  • Background with PCIe, NVLink, or networking technologies such as InfiniBand and Ethernet.