Senior System Software Engineer, Enterprise Mods

NVIDIA · Semiconductors · Santa Clara, CA +1

NVIDIA is seeking a Senior System Software Engineer to develop and lead diagnostic systems for its data center platforms, with a focus on hardware and software tools that stress test CPUs, GPUs, memory, storage, and interconnects. The role covers platform bring-up and integration, hardware validation strategy, root cause analysis of failures, and influence over long-term diagnostic architecture and roadmaps for NVIDIA and its partners. It requires deep systems knowledge, C/C++/Python expertise, and experience with high-speed interconnects. Although NVIDIA is at the forefront of AI, this role centers on engineering and diagnostics for the hardware and software infrastructure that supports AI workloads, not on AI/ML model development or research.

What you'd actually do

  1. Develop diagnostic systems for NVIDIA data center platforms, including hardware and software tools that generate worst-case stress workloads for CPUs, GPUs, memory, storage, and interconnects.
  2. Lead platform bring-up and integration, ensuring diagnostics are embedded early and effectively across the server lifecycle.
  3. Drive hardware validation strategy in collaboration with architecture and hardware teams, crafting robust validation plans for new server generations.
  4. Analyze root causes of complex failures, acting as a Level 2 engineering contact for critical issues and offering scalable solutions across the stack.
  5. Develop diagnostics software to ensure quality and performance at scale across ODM and partner production lines.

Skills

Required

  • architecting diagnostics for complex server systems
  • SW/HW interface
  • x86/ARM architectures
  • Linux/Windows OS internals
  • firmware (UEFI/BIOS)
  • BMC
  • platform security
  • C
  • C++
  • Python
  • high-speed interconnects (PCIe, InfiniBand, NVLink, Ethernet)
  • communication skills
  • 8+ years of engineering experience in diagnostics, embedded systems, or cloud platforms

Nice to have

  • rack-level or cluster-level deployments
  • cloud-scale infrastructure
  • partner engagement
  • influencing product direction and vendor roadmaps
  • mentoring and building high-performing teams

What the JD emphasized

  • diagnostics for complex server systems
  • SW/HW interface
  • systems knowledge
  • diagnostics software
  • quality and performance at scale