Senior Debug System Engineer, Datacenter

NVIDIA · Semiconductors · Santa Clara, CA

This role is for a Senior Debug System Engineer on NVIDIA's datacenter product engineering team, focusing on failure analysis and debug during the New Product Introduction (NPI) phase of GPU server products. The engineer will identify root causes of factory build problems, analyze logs spanning hardware (HW), software (SW), and firmware (FW), build experiments, and develop debug guides.

What you'd actually do

  1. Perform failure analysis (FA) on GPU baseboards and servers at the rack, system, and/or component level, covering integration levels L6 through L11 (rack level).
  2. Analyze logs and failures that may span Hardware (HW), Software (SW), and Firmware (FW) and propose debug and mitigation strategies.
  3. Build experiments and collect and analyze data to determine the root cause of failures.
  4. Provide root cause and corrective action plans in a timely manner and write clear and complete reports detailing steps taken and findings.
  5. Develop debug guides for partner teams.

Skills

Required

  • failure analysis
  • debug
  • Hardware
  • Software
  • Firmware
  • server systems
  • motherboards
  • graphics cards
  • datacenter products
  • DFx
  • Test
  • Validation
  • oscilloscopes
  • analyzers

Nice to have

  • NPI
  • root cause analysis
  • mitigation strategies
  • experiment design
  • corrective action plans
  • debug guides
  • communication
  • negotiation
  • organization
  • time management
  • problem solving
  • teamwork
  • independent work

What the JD emphasized

  • 8+ years of working experience in a related field
  • Bachelor’s or Master’s degree in Electrical Engineering or a related field (or equivalent experience)
  • Excellent failure analysis or debug experience on motherboards, graphics cards, servers, PCs, or datacenter products.
  • Proven understanding and strong skills in one or more areas: Hardware, Software, Component, Process, Test, Validation.