AI Hardware Systems Engineer, Annapurna Labs, Trainium Machine Learning Fleet Operations

Amazon · Big Tech · Austin, TX · Software Development

The role focuses on operating and optimizing a fleet of ML servers and accelerators, debugging hardware and software issues, developing data infrastructure, and creating automation software for operational scaling. It involves system remediation, root cause analysis of hardware failures, and implementing system-level testing.

What you'd actually do

  1. Work on a team responsible for system remediation, operational excellence, and customer experience on bleeding-edge ML products
  2. Use data to root-cause hardware failures and identify live trends on the most complex systems in AWS
  3. Implement and improve system-level testing across the product lifecycle
  4. Develop software that can be maintained, improved upon, documented, tested, and reused
  5. Dive deep on issues at the intersection of hardware and software

Skills

Required

  • Software development experience
  • Experience designing or architecting systems
  • Administrative experience in networking, storage systems, and operating systems, plus hands-on systems engineering experience
  • Systems engineering fundamentals (networking, storage, operating systems)
  • Experience programming with at least one modern language such as C++, C#, Java, Python, Golang, PowerShell, Ruby
  • Experience with Linux/Unix
  • Experience with debugging and systems analysis to identify and quickly resolve or mitigate issues
  • Bachelor's degree in Computer Science, Computer Engineering, or Electrical Engineering

Nice to have

  • Experience in hardware design and validation of components, subsystems and systems
  • Experience with SoC bring-up and post-silicon validation
  • Master's degree in Computer Science, Computer Engineering, or Electrical Engineering

What the JD emphasized

  • customer experience
  • hardware and software