AI Hardware Systems Manager, Annapurna Labs, Trainium Machine Learning Fleet Operations

Amazon · Big Tech · Austin, TX · Software Development

This role manages a team of engineers responsible for operating an ML hardware fleet, focusing on automation, tooling, and operational excellence for advanced server hardware. The role involves building and growing the team, defining roadmaps, and ensuring the health and optimization of ML infrastructure at scale.

What you'd actually do

  1. Build, hire, mentor, and grow a team of platform development engineers responsible for ML fleet operations across multiple accelerator platforms
  2. Define team roadmap and technical strategy for fleet health, automation, and data infrastructure — balancing near-term operational demands against long-term engineering investments
  3. Drive operational excellence by establishing metrics, SLAs, and processes that maximize platform sellability and customer experience
  4. Partner with hardware engineering, software engineering, and product teams to prioritize debug efforts and translate fleet learnings into permanent design fixes
  5. Own escalation paths for critical fleet incidents and lead cross-functional war rooms to resolution

Skills

Required

  • engineering team management
  • Python scripting language
  • troubleshooting/debugging of hardware
  • designing, building, operating, and managing large-scale distributed systems or web services
  • systems engineering
  • platform engineering
  • SRE
  • hardware operations

Nice to have

  • automating, deploying, and supporting large-scale infrastructure
  • server technologies such as thermal, mechanical, power, and signal integrity
  • working cross-functionally across several teams, both technical and non-technical
  • GPU, ML accelerator, or high-performance computing hardware
  • managing teams through ambiguity on new or unreleased products

What the JD emphasized

  • ML fleet operations
  • accelerator platforms
  • large-scale distributed systems
  • hardware systems
  • fleet health
  • automation
  • data infrastructure
  • customer experience
  • debug efforts
  • critical fleet incidents