Post-silicon Systems Validation Engineer, Annapurna Labs

Amazon · Big Tech · Austin, TX · Software Development

This role focuses on validating next-generation machine learning accelerators for AWS, covering the full stack from silicon to system. The engineer will develop and execute validation strategies, conduct hands-on bring-up and debug, and collaborate with cross-functional teams to ensure the quality and performance of the AI/ML accelerators that power AI training and inference in AWS data centers.

What you'd actually do

  1. Developing comprehensive validation strategies and detailed test plans covering functional, performance, power, and stress testing from silicon bring-up to product release
  2. Executing complex test plans across RTL simulation and emulation environments through physical silicon validation
  3. Conducting hands-on silicon bring-up and debug in the lab using oscilloscopes, logic analyzers, and protocol analyzers
  4. Validating ML accelerator performance, accuracy, and reliability using real-world neural network workloads
  5. Building test infrastructure, CI/CD, and automated regression frameworks to enable efficient validation at scale

Skills

Required

  • Python
  • Lua
  • C/C++
  • Rust
  • Go
  • computer architecture
  • AWS services
  • cloud infrastructure
  • firmware development (BIOS, BMC, drivers)
  • PCIe
  • HBM
  • GPUs
  • neural networks
  • ML HW architecture
  • CI/CD
  • RTL simulation (SystemVerilog/UVM, VCS, Questa, Xcelium)
  • emulation (Palladium, Zebu, Veloce)
  • silicon failure analysis and debug
  • general troubleshooting/debugging of hardware
  • Linux environments
  • Git

Nice to have

  • Machine Learning Hardware/Software Architecture
  • EDA Simulations or Emulation

What the JD emphasized

  • next-generation machine learning accelerators
  • AI training and inference
  • ML workloads
  • ML accelerator performance, accuracy, and reliability

Other signals

  • validating next-generation machine learning accelerators
  • power AWS's cloud computing infrastructure