Systems Development Engineer (AWS Generative AI & ML Servers), AWS Hardware Engineering Accelerators

Amazon · Big Tech · Seattle, WA · Systems, Quality, & Security Engineering

This role focuses on building and operating AWS cloud offerings that enable high performance and scalability for AI/ML and HPC workloads, specifically targeting generative AI training and inference. The Systems Development Engineer will work on server designs, optimize price-performance, and ensure the reliability and scalability of the underlying hardware infrastructure.

What you'd actually do

  1. You will be a technical leader solving complex architectural problems which may not be defined beforehand.
  2. You will own the team's systems and work proactively to identify deficiencies, write tactical code to solve issues before they impact customers, and work with your team to scale the solutions.
  3. You will decompose large, difficult problems in server system testability, reliability, and diagnosis into straightforward tasks, components, or features, which you will deliver yourself and through others in parallel.
  4. You will apply a combination of hardware, software, and system design knowledge, along with expertise in x86 architecture, processes, diagnosis, and operations.
  5. You will work with a variety of roles (SDEs, SDETs, hardware engineers, TPMs, managers, Principal Engineers) and groups (AWS Hardware Engineering, EC2, and other AWS services) through server conception, test, launch, and operations.
  6. You will drive high quality and reliability into new and future accelerated server designs for the AWS Cloud.

Skills

Required

  • Systems development
  • Hardware engineering
  • Cloud computing
  • AI/ML infrastructure
  • Server design
  • Performance optimization
  • Scalability
  • Debugging
  • Problem-solving
  • Technical leadership
  • x86 architecture

Nice to have

  • Generative AI
  • LLMs
  • HPC workloads

What the JD emphasized

  • building the backbone of Generative AI cloud at AWS
  • build the future of the cloud for AI training and inference
  • delivering continuous price-performance improvements in the cloud for AI model training for multi-billion-parameter LLMs
  • designing, delivering and operating AWS cloud offerings that enable high performance and scalability in AI/ML and HPC workloads
  • full technical stack - vertically from bare-metal server hardware up to the software in userland
  • systems debugger
  • technical leader solving complex architectural problems
  • owning the team's systems
  • scale the solution
  • server conception, test, launch, and operations
