Sr Manager, AI Systems Quality & Reliability, Annapurna AI Servers and Systems

Amazon · Big Tech · Austin, TX · Hardware Development

Senior Manager, AI Systems Quality & Reliability at Amazon's Annapurna Labs, leading the Quality and Reliability (QnR) function for Trainium AI server products. The role involves owning quality and reliability outcomes from component qualification through fleet performance, defining reliability strategy for liquid-cooled and air-cooled platforms, building quality systems across a global manufacturing base, driving failure investigations, and establishing reliability characterization for next-generation technologies. It requires extensive experience in root cause analysis, reliability and quality engineering for server compute platforms, and people management.

What you'd actually do

  1. Lead and grow a QnR engineering team, hiring, developing, and retaining top reliability and quality engineering talent.
  2. Set technical direction for component qualification, reliability testing (HALT, HTOL, thermal cycling, QRV), DFMEA, and vendor quality standards across all Trainium programs.
  3. Own quality and reliability outcomes end-to-end — from DFM input during design through fleet reliability performance.
  4. Drive component-specific manufacturing process quality improvements in partnership with Manufacturing Engineering, establishing incoming quality requirements and process controls at all supplier sites.
  5. Build and maintain the reliability prediction and monitoring infrastructure — ensuring fleet performance is tracked against predictions, degradation trends are identified early, and corrective actions are data-driven.

Skills

Required

  • root cause analysis
  • error correction
  • reliability engineering
  • quality engineering
  • server compute platforms
  • semiconductor packaging
  • high-volume electronics manufacturing
  • people management
  • quality management systems
  • reliability programs

Nice to have

  • leading teams across multiple locations
  • complex manufacturing/production environments
  • fast-paced, rapidly changing operations environment
  • liquid cooling reliability
  • advanced semiconductor packaging reliability
  • vendor quality standards
  • reliability prediction methodologies
  • manufacturing quality tools
  • executive communication skills

What the JD emphasized

  • server compute platforms
  • high-volume electronics manufacturing
  • quality management systems
  • reliability programs