Failure Analysis Engineer - Power & Thermal

AMD · Semiconductors · Secaucus, NJ · Engineering

AMD is seeking a Failure Analysis Engineer focused on Power and Thermal issues for GPU accelerators. This role involves PCB triage, board-level fault isolation, developing debug strategies, running tests, and collaborating with design, validation, FW, and manufacturing teams to identify root causes and implement corrective actions. The engineer will analyze power behavior, thermal behavior, and liquid-cooling performance, document findings, and present them to stakeholders. Experience with hardware debug, power/thermal analysis, liquid cooling, PCB triage, scripting (Python), and firmware is preferred.

What you'd actually do

  1. Support internal and external requests to troubleshoot AMD GPU product failures, with a primary focus on Power and Thermal failure analysis, PCB triage, and board-level failure isolation for continuous yield, quality, and customer support improvements.
  2. Develop and execute diagnostics, scope-based measurements, and functional test DOEs to reproduce, characterize, and isolate difficult board-, power-, and thermal-related failures.
  3. Develop automation and tools to run tests and analyze results/logs.
  4. Perform structured PCB triage by narrowing failures to the board, component, power rail, layout interaction, or system integration level, and work with the contract manufacturer and internal AMD teams to reproduce failures, isolate root cause, and determine the most effective next steps for debug and corrective action.
  5. Use schematics, layout data, lab measurements, and power/thermal behavior knowledge to understand system behavior, trace likely fault paths, form debug hypotheses, and build targeted validation plans that drive efficient fault isolation and high-quality failure analysis.

Skills

Required

  • Power and Thermal failure analysis
  • PCB triage
  • board-level fault isolation
  • debug strategies
  • schematics review
  • functional test DOEs
  • root cause analysis
  • corrective actions
  • liquid-cooling fundamentals
  • Python
  • shell scripting
  • Windows and Linux environments
  • firmware, drivers, and hardware interactions
  • hardware verification
  • system integration
  • computer system and server assembly, installation, and configuration
  • communication
  • documentation
  • collaboration
  • presentation skills
  • schematics reading
  • datasheet interpretation
  • component identification
  • soldering/rework

Nice to have

  • oscilloscopes
  • logic analyzers
  • power analyzers
  • custom test tools
  • high-speed digital design
  • power delivery networks
  • voltage regulator behavior
  • memory interfaces (HBM, GDDR)
  • PCIe
  • display output