Systems Development Engineer (AWS Generative AI & ML Servers), AWS Hardware Engineering Accelerators

Amazon · Big Tech · Cupertino, CA · Systems, Quality, & Security Engineering

This role focuses on designing, delivering, and operating AWS cloud offerings that enable high performance and scalability for AI/ML and HPC workloads, with a specific focus on continuous price-performance improvements for large-scale LLM training and inference. The engineer works across the full technical stack, from bare-metal server hardware up to userland software, acting as a technical leader and systems debugger.

What you'd actually do

  1. Be a technical leader for your team, setting the standards for engineering best practices and operational excellence.
  2. Design technology solutions and architectures that solve complex business or technical problems related to server system testability, reliability, and diagnosis.
  3. Own your team’s systems, proactively identifying and fixing extant risks, limitations, and deficiencies.
  4. Partner with your manager and other team leaders to develop your team’s technical strategy.
  5. Resolve the contributing causes of endemic problems, including architectural deficiencies and areas where your team limits the innovation of other teams.

Skills

Required

  • Systems development engineering
  • Hardware and software integration
  • Cloud computing
  • AI/ML infrastructure
  • HPC workloads
  • Debugging complex systems
  • Technical leadership
  • Architectural design
  • Problem-solving

Nice to have

  • Server hardware design
  • Performance optimization
  • Scalability
  • Operational excellence

What the JD emphasized

  • high performance and scalability in AI/ML and HPC workloads
  • continuous price-performance improvements in the cloud for training multi-billion-parameter LLMs
  • full technical stack, vertically from bare-metal server hardware up to the software in userland
