System Software Engineer – Data Center GPU Compute Diagnostics

NVIDIA · Semiconductors · Durham, NC

System Software Engineer focused on diagnostics for Data Center GPUs in AI supercomputer systems. The role involves creating and executing applications to stress GPU components, working closely with hardware and software teams throughout the product lifecycle. Responsibilities include implementing CUDA/C++ diagnostic workloads, tuning GPU compute tests, and contributing to higher-level AI workload tests.

What you'd actually do

  1. Working closely with hardware architecture, driver, manufacturing, and field teams through the product development lifecycle of rack-scale AI systems.
  2. Implementing and maintaining CUDA/C++ diagnostic workloads and software infrastructure used in chip development, validation, productization, and field triage.
  3. Writing and tuning GPU compute tests that stress Tensor Cores, SMs, L2/cache hierarchy, HBM memory, and related power/thermal operating points.
  4. Implementing and tuning GEMM-style diagnostic workloads, including tests combined with additional load on the NVLink, PCIe, or CPU subsystems.
  5. Contributing to higher-level AI workload tests, including PyTorch-based large model workloads that stress GPUs, memory, interconnects, thermals, and system software under realistic rack-scale AI use cases.

Skills

Required

  • BS or MS degree in Electrical Engineering, Computer Engineering, Computer Science, or equivalent experience.
  • 5+ years of system software, GPU software, embedded software, or hardware validation experience.
  • Strong C/C++ and Python programming skills.
  • Strong problem-solving and low-level debugging skills.

Nice to have

  • Exposure to GPU architecture, CUDA kernels, GPU compute workloads, or related accelerator programming.
  • Working knowledge of memory systems, ECC behavior and DMA engines.
  • Familiarity with GEMM-style workloads.
  • Awareness of voltage/frequency characterization, thermal testing, power stress, or related silicon validation concepts such as Vmin/Fmax and P-state testing.
  • Experience using modern AI development and analysis tools to improve engineering velocity, including code development, debugging, and test creation.
  • Experience writing low-level diagnostics and working with device firmware and hardware-level debuggers.

What the JD emphasized

  • Experience writing low-level diagnostics and working with device firmware and hardware-level debuggers.
  • Strong C/C++ and Python programming skills.
  • Exposure to GPU architecture, CUDA kernels, GPU compute workloads, or related accelerator programming is strongly preferred.
  • Working knowledge of memory systems, ECC behavior and DMA engines.
  • Familiarity with GEMM-style workloads.
  • Awareness of voltage/frequency characterization, thermal testing, power stress, or related silicon validation concepts such as Vmin/Fmax and P-state testing.
  • Experience using modern AI development and analysis tools to improve engineering velocity, including code development, debugging, and test creation.
  • Strong problem-solving and low-level debugging skills.