SoC Platform Software Engineering Manager, Annapurna Labs Machine Learning Acceleration, AWS

Amazon · Big Tech · Austin, TX · Software Development

Engineering Manager for a SoC Platform Software team at AWS that builds the Hardware Abstraction Layer (HAL) for ML accelerator chips (Trainium and Inferentia). The role involves managing a team of engineers who develop and maintain a single C++ codebase that compiles and runs across three different execution environments: chip verification, system emulation, and production microcontrollers. The team's work enables ML training across clusters of accelerators, but the role does not require ML expertise; it centers on platform software, hardware-software interaction, and API design.

What you'd actually do

  1. Manage, coach, and grow a team of 6 engineers — set technical direction, own hiring, and create an environment where strong engineers want to stay
  2. Own the platform abstraction layer that enables one C++ codebase to compile and run correctly across three target environments with fundamentally different runtime characteristics
  3. Shape the external API contracts that verification, emulation, and production teams build on — balancing stability for consumers against the need to evolve as new chip generations arrive
  4. Drive the architecture of our C++ template metaprogramming framework that generates type-safe register interfaces for every hardware block, and our BUTR (Built-in Unit Test for Registers) and HITL (Hardware-in-the-Loop) test infrastructure
  5. Build and maintain the CI/CD and validation strategy that catches integration issues across all three platforms before they reach customers
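
For readers unfamiliar with the idea behind items 2 and 4, here is a minimal sketch of what a template-generated, type-safe register interface over a swappable backend can look like. It is not the team's actual framework. Every name in it (FakeBus, Field, DmaEnable, DmaBurstLen, the 0x1000 address) is hypothetical and chosen only for illustration. Each field's address, bit offset, and width are fixed at compile time, and the same accessor code runs against any backend type that provides read and write.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical backend: a fake register file held in host memory, standing in
// for a simulation or emulation target. A production backend would expose the
// same two member functions but perform real bus accesses instead.
struct FakeBus {
    std::unordered_map<std::uint32_t, std::uint32_t> regs;
    std::uint32_t read(std::uint32_t addr) { return regs[addr]; }
    void write(std::uint32_t addr, std::uint32_t value) { regs[addr] = value; }
};

// A field is described entirely at compile time: register address, bit
// offset, and width. A layout that does not fit fails to compile.
template <std::uint32_t Addr, unsigned Offset, unsigned Width>
struct Field {
    static_assert(Width >= 1 && Offset + Width <= 32,
                  "field must fit in a 32-bit register");

    static constexpr std::uint32_t kMask =
        (0xFFFFFFFFu >> (32u - Width)) << Offset;

    // Read the register through the backend and extract this field.
    template <typename Bus>
    static std::uint32_t get(Bus& bus) {
        return (bus.read(Addr) & kMask) >> Offset;
    }

    // Read-modify-write so that neighbouring fields are left untouched.
    template <typename Bus>
    static void set(Bus& bus, std::uint32_t value) {
        assert(value <= (kMask >> Offset) && "value does not fit in field");
        const std::uint32_t old = bus.read(Addr);
        bus.write(Addr, (old & ~kMask) | ((value << Offset) & kMask));
    }
};

// Hypothetical register layout, for illustration only.
using DmaEnable   = Field<0x1000, 0, 1>;
using DmaBurstLen = Field<0x1000, 4, 4>;
```

A caller would write DmaBurstLen::set(bus, 0xA) and DmaEnable::get(bus) against whichever backend object it was handed. Because the accessors hold no state of their own, every call reflects whatever the backend currently reports.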

Skills

Required

  • 3+ years of engineering team management experience
  • 7+ years of professional software development in C or C++, including systems, platform, or infrastructure software
  • 4+ years of designing or architecting software systems (platform abstractions, API design, multi-target build systems)
  • Experience developing software that interfaces with hardware or runs across multiple execution environments
  • Experience designing APIs or abstraction layers consumed by other engineering teams

Nice to have

  • Experience recruiting, hiring, mentoring, coaching, and managing teams of software engineers, helping them improve their skills and become more effective product engineers
  • Experience building or maintaining hardware abstraction layers, board support packages, or platform software for SoC, ASIC, or embedded systems
  • Experience with multi-platform or cross-compilation build systems (targeting simulation, emulation, and production from a single source tree)
  • Familiarity with bus protocols (APB, AXI, PCIe) or memory subsystems (HBM, DDR)
  • Experience with C++ template metaprogramming or code generation frameworks
  • Experience with pre-silicon software development (simulation, emulation, or virtual platforms)

What the JD emphasized

  • single source tree
  • three radically different execution environments
  • platform abstractions
  • external API contracts
  • stateless
  • survive live-updates on running production servers without reboots
  • correct down to individual register bits
  • single abstraction leak can break chip verification, stall emulation, or misconfigure millions of servers in AWS's global fleet
  • HAL must resume managing the SoC by querying hardware state on-demand
  • resilience possible while keeping the complexity invisible to consumers
  • pre-silicon simulation
  • production fleet
  • full ML training workload within 12 hours of first power-on