Lead Engineer for Manufacturing and Datacenter Lab, Trainium Manufacturing, Quality and Reliability

Amazon · Big Tech · Austin, TX · Hardware Development

Lead Engineer for Amazon's Trainium Manufacturing Quality & Reliability organization, focusing on bridging manufacturing outcomes with datacenter operational performance. This role involves establishing and operating a preparedness lab to analyze datacenter performance of manufactured systems, identify root causes of field issues, and feed insights back into manufacturing processes, test strategies, and design improvements. The position also influences hardware design for DFM, DFR, and DFT, establishes data-driven analytics frameworks connecting manufacturing test data to datacenter performance (leveraging ML), and develops manufacturing processes at ODMs/CMs. The role requires experience with AI/ML acceleration systems and managing cross-functional teams.

What you'd actually do

  1. Own operational production performance of Trainium systems across the entire product lifecycle, from manufacturing through datacenter deployment and fleet operations
  2. Design and build a preparedness lab that replicates datacenter conditions for assembly, repair, and system testing
  3. Define and drive assembly and repair recipes in the manufacturing lab as the baseline prior to high-volume manufacturing and datacenter deployment
  4. Ensure all manufacturing and datacenter test flows are regressed in the manufacturing lab prior to deployment
  5. Influence hardware design strategy for Design for Manufacturing (DFM), Design for Reliability (DFR), and Design for Test (DFT) based on field failure analysis

Skills

Required

  • BS or MS degree in Electrical Engineering, Mechanical Engineering, Computer Engineering, Industrial Engineering, or related technical fields
  • 8+ years of industry experience in one or more of the following: Manufacturing Engineering, Test Engineering, Quality Engineering, Reliability Engineering, or Datacenter Infrastructure Engineering
  • 7+ years working directly with engineering teams in cross-functional environments
  • Experience with AI/ML acceleration systems, high-performance computing servers, or complex multi-rack systems
  • Demonstrated track record delivering stable, performant hardware solutions meeting cost and quality targets
  • Experience with System Mechanical & Thermal design for air-cooled and liquid-cooled systems
  • Strong problem-solving capabilities to isolate, define, and resolve complex problems spanning manufacturing quality and field reliability
  • Experience with root cause analysis methodologies (8D, 5-Why, Fishbone, FMEA) and implementing corrective/preventive actions
  • Proficiency in data analysis tools, statistical methods, and programming (Python, Bash/shell scripting) in Linux environments
  • Ability to take a hardware concept from requirements through fabrication and deployment
  • Ability to collaborate effectively with teams spanning multiple sites and develop detailed specifications for product teams
  • Experience working with ODMs, JDMs, component vendors, and internal design teams on cross-boundary triaging, debugging, and resolving issues
  • Strong communication skills with ability to influence senior leadership and cross-functional stakeholders
  • Experience in Design for Manufacturing (DFM), also known as Design for Manufacturability: a product design approach that optimizes the ease and cost of manufacturing a product

Nice to have

  • Experience applying ML techniques to predict field failures

What the JD emphasized

  • Experience with AI/ML acceleration systems