Manager, Distinguished Engineer - DGX Systems Software

NVIDIA · Semiconductors · Santa Clara, CA +1 · Remote

The Manager, Distinguished Engineer for DGX Systems Software at NVIDIA is responsible for the end-to-end delivery of DGX compute systems, ensuring seamless integration of firmware, OS, drivers, CUDA, networking, and AI applications. The role involves leading firmware development, defining validation strategies, driving platform bring-up and architecture, and managing the complete product delivery lifecycle for next-generation platforms.

What you'd actually do

  1. End-to-End Stack Readiness: Ensure every DGX platform is ready for the full NVIDIA software stack (firmware, DGX OS, GPU drivers, CUDA toolkit, DCGM, DOCA/OFED, and management tools) as a validated, production-quality product. Own the GA SW/FW release process, delivering firmware bundles, BaseOS ISOs, and release notes to OEM/OSV partners. Ensure platforms support AI agents such as NemoClaw and Hermes agents, NIM microservices, and the workloads customers expect out of the box.
  2. Platform Firmware Development: Lead development of the manageability firmware stack (BMC, BIOS) for all DGX platforms. Ensure firmware from partner teams (GPU, CPU, networking) integrates correctly at the system level. Manage third-party vendors and drive platform requirements (NVPOR) across all firmware areas.
  3. Validation Strategy: Define the validation strategy that proves each DGX platform is production-ready, with end-to-end system validation covering firmware regression, NVQual certification, DL workload performance, OS/CUDA stack testing, multi-user scenarios, power/thermal validation, and field upgrade reliability. Establish quality gates and a zero ship-stopper discipline.
  4. Platform Bring-Up & Architecture: Drive platform bring-up for each new DGX system—coordinating first boot across new silicon (CPU, GPU), board design, and firmware teams. Own architectural strategy for next-generation platforms including firmware update mechanisms, system security posture, and AI application readiness.
  5. Customer Deployment & Enablement: Ensure firmware release flows meet CSP and enterprise deployment requirements. Represent DGX platform readiness in executive reviews and strategic planning with VP/SVP leadership. Engage with industry standards bodies (DMTF Redfish, OCP).

Skills

Required

  • BS or MS in Computer Science, Electrical Engineering, or a related field, or equivalent experience
  • 12+ years overall in systems firmware/software engineering
  • 5+ years in engineering leadership
  • Deep expertise in the server system stack, including SBIOS, BMC, OS, and applications, and in system-level integration of complex multi-component products
  • Proven track record delivering multi-generation server or data center platforms from architecture through customer deployment
  • Experience managing engineering organizations across multiple geographies in a matrix environment
  • Strong understanding of server hardware: CPU, GPU, interconnect, memory, PCIe, power delivery
  • Experience owning end-to-end product quality—from firmware validation through full-stack system testing to field deployment

Nice to have

  • Experience with NVIDIA DGX or other GPU-accelerated server platforms
  • Track record driving server bring-up for new silicon and system architecture redesigns
  • Familiarity with DMTF Redfish, OCP standards, and server manageability ecosystems
  • Experience with AI/DL workload validation and performance optimization at the platform level
  • Demonstrated ability to operate at VP/SVP level, influencing cross-BU strategic decisions

What the JD emphasized

  • end-to-end delivery
  • production-ready system
  • customer deployment
  • firmware
  • AI applications
  • end-to-end product quality