Senior Software Engineer - NVLink Rack Scale Stability and Reliability

NVIDIA · Semiconductors · Santa Clara, CA +5 · Remote

A Senior Software Engineer role focused on the stability and reliability of NVLink rack-scale systems, which are critical to large-scale AI infrastructure. The role involves platform bringup, software validation, developing diagnostics and automation, leading reliability validation, and triaging complex issues across the software, firmware, networking, and platform layers. Experience with large-scale AI systems and data center infrastructure is required.

What you'd actually do

  1. Drive platform bringup, feature enablement, end-to-end software validation, and debug for next-generation NVLink-based GPU and rack-scale systems.
  2. Develop tools, diagnostics, automation, and infrastructure for system validation, regression testing, and fleet support.
  3. Lead reliability and MTBI validation through stress testing, telemetry analysis, failure injection, and issue resolution.
  4. Triage complex software, firmware, networking, and platform issues across validation, deployment, and production environments.
  5. Collaborate with architecture, hardware, firmware, software, and customer engagement teams to improve system quality and reliability.

Skills

Required

  • C/C++
  • Python
  • system-level debugging
  • networking fundamentals
  • large-scale AI systems
  • telemetry analysis
  • root-cause debugging

Nice to have

  • Bash/Shell scripting
  • NVIDIA GPU systems
  • NVLink
  • NVSwitch
  • CUDA
  • large-scale AI/HPC clusters
  • PCIe
  • memory hierarchy
  • DMA
  • high-speed interconnects
  • distributed training/inference systems
  • server management technologies
  • data center operations
  • cluster provisioning
  • scaling
  • fleet monitoring
  • diagnostics
  • automation
  • CI/CD pipelines
  • dashboards
  • reliability tooling

What the JD emphasized

  • large-scale AI infrastructure
  • next-generation AI infrastructure
  • large-scale AI systems