Senior Debug System Engineer, Datacenter

NVIDIA · Semiconductors · Santa Clara, CA

NVIDIA is seeking a Senior System Debug Engineer for its datacenter product engineering team. The role involves driving failure analysis and debug efforts during the mass production phase of datacenter GPU products, identifying root causes of factory build problems, and collaborating with cross-functional teams to ensure product quality. Responsibilities include performing failure analysis on GPU baseboards and servers, analyzing hardware, software, and firmware failures, building experiments, and developing debug guides.

What you'd actually do

  1. Perform failure analysis (FA) on GPU baseboards and servers at the rack, system, and/or component level (covering integration levels L6 through L11/rack).
  2. Analyze logs and failures that may span Hardware (HW), Software (SW), and Firmware (FW) and propose debug and mitigation strategies.
  3. Build experiments and collect and analyze data to determine the root cause of failures.
  4. Provide root cause and corrective action plans in a timely manner and write clear and complete reports detailing steps taken and findings.
  5. Develop debug guides for partner teams and customers.

Skills

Required

  • failure analysis
  • debug
  • Hardware
  • Software
  • Firmware
  • servers
  • motherboards
  • graphics cards
  • datacenter products
  • DFx
  • Test
  • Validation
  • oscilloscopes
  • analyzers

Nice to have

  • negotiation
  • organization
  • time management
  • problem solving
  • teamwork
  • independent work
  • communication

What the JD emphasized

  • 12+ years of work experience in a related field
  • Excellent failure analysis or debug experience on motherboards, graphics cards, servers, PCs, or datacenter products
  • Proven understanding and strong skills in one or more areas: Hardware, Software, Component, Process, Test, Validation