Principal AI Systems Design Engineer

AMD · Semiconductors · Santa Clara, CA · Engineering

This Principal AI Systems Design Engineer role at AMD focuses on customer engineering, specifically debugging AI infrastructure on AMD GPU platforms. The role requires deep expertise in high-speed memory standards (DDR, HBM), silicon bring-up, and full-stack hardware/firmware/software debugging to resolve customer issues and enable successful adoption of AMD's GPU platforms for AI workloads.

What you'd actually do

  1. Drive resolution of customer issues using innovative debug methods, with the goal of finding the root cause and unblocking customers in a fast-paced environment
  2. Provide technical leadership on issue debug, working closely with SoC, Memory Technology, Design, Validation, and Manufacturing teams to drive issues to root cause
  3. Set up hardware systems and probe components in the system; check electrical and power signals; and validate the system using different AI workloads
  4. Communicate and document flows and methods for bring-up, system initialization, running stress workloads, and debug
  5. Lead technical presentations that demonstrate a good understanding of customer applications, infrastructure, and system design

Skills

Required

  • Significant hands-on experience with high-speed memories such as DDR5/LPDDR5/HBM3
  • Silicon bring-up and debug
  • Understanding of memory controllers and PHYs, training algorithms and FW interactions, ECC, manufacturing, and reliability mechanisms
  • Hands-on system development experience
  • Debugging of complex full-stack SW/FW/HW issues
  • Understanding of memory bottlenecks throughout the system
  • Validation of items connecting to the GPU SoC (HBM, VRs, internal networking)
  • Communication skills
  • Technical presentations
  • SoC architecture
  • Memory standards
  • Debug of complex system-level issues
  • Ability to debug memory protocols on server CPU/GPU/FPGA devices in single-node and multi-node platforms
  • Troubleshooting experience
  • Use of industry debug tools and scopes, as well as examining board-level signal and power integrity
  • Hardware, architecture, and software expertise
  • Programming skills in Python, C, or C++
  • Scripting languages such as Perl, Ruby, and Shell script
  • Running and analyzing system benchmarks, such as those based on JEDEC standards
  • Revision control (Git, SVN, and CVS)
  • Driving resolution of critical problems in a lab or datacenter
  • Relationships with external customers and partners

Nice to have

  • Hands-on experience with hardware in a silicon/system lab environment

What the JD emphasized

  • debug AI infrastructure
  • high-speed memory standards such as DDR and HBM
  • silicon bring-up and debug
  • debugging complex full-stack SW/FW/HW issues is a must
  • debug memory bottlenecks throughout the system
  • validate the items connecting to the GPU SoC (HBM, VRs, internal networking)