System Failure Analysis Engineer (GPU Servers / Data Center)

AMD · Semiconductors · Austin, TX · Engineering

This role focuses on failure analysis for GPU-accelerated server platforms in data center environments. The engineer will diagnose and resolve complex system failures, leveraging advanced diagnostic tools and AI-assisted techniques to improve system reliability. Responsibilities include platform and system failure analysis, ODM factory enablement, and cross-functional collaboration. Experience with server bring-up, rack-level troubleshooting, and AI tools for debug workflows is preferred.

What you'd actually do

  1. Perform component-, system-, and rack-level failure analysis on GPU-accelerated server platforms
  2. Debug issues across CPU, GPU, memory, PCIe, networking, storage, power, and thermal subsystems
  3. Support server bring-up and manufacturing test failures (POST, BIOS/UEFI configuration, PCIe enumeration, firmware interactions)
  4. Analyze BIOS, BMC, IPMI, and system logs to identify hardware/firmware interaction issues
  5. Support ODM manufacturing test, failure debug, and troubleshooting

Skills

Required

  • Bachelor’s degree in Electrical Engineering
  • Experience with server platforms
  • Experience with rack-level or data center environments
  • Experience with GPU servers and multi-node systems
  • Understanding of power sequencing, PCIe, memory, and thermal systems
  • Familiarity with BIOS/UEFI, BMC, IPMI, and firmware-level debugging
  • Proficiency with lab debug tools (oscilloscopes, logic analyzers, protocol analyzers, power analyzers)
  • Cross-functional collaboration and communication skills
  • Technical documentation skills

Nice to have

  • Master’s degree in Electrical or Systems Engineering
  • Experience using AI tools for log analysis, debug workflows, or knowledge development
  • US citizenship required
  • Willingness to travel up to 25%

What the JD emphasized

  • server bring-up
  • rack-level troubleshooting
  • AI-assisted debug
  • AI tools for log analysis, debug workflows, or knowledge development