Systems Development Eng (AWS Generative AI & ML Servers), AWS Hardware Engineering Accelerators

Amazon · Big Tech · Seattle, WA · Systems, Quality, & Security Engineering

This role focuses on building and operating AWS cloud offerings that enable high performance and scalability in AI/ML and HPC workloads, specifically designing and delivering server hardware for AI training and inference. The candidate will work across the full technical stack, from bare-metal hardware to userland software, solving complex architectural problems related to server systems and their testability, reliability, and diagnosis. The goal is to deliver continuous price-performance improvements for large language models at cloud scale.

What you'd actually do

  1. You will be a technical leader solving complex architectural problems that may not be defined beforehand.
  2. You will own the team's systems and work proactively to identify deficiencies, write tactical code to solve issues before they impact customers, and work with your team to scale the solution.
  3. You will decompose large, difficult server-system testability, reliability, and diagnosis problems into straightforward tasks, components, or features that you will deliver yourself and lead others to deliver in parallel.
  4. You will use a combination of hardware, software, system design, x86 architecture, process, diagnosis, and operations knowledge.
  5. You will work with a variety of job roles (SDEs, SDETs, Hardware Engineers, TPMs, Managers, Principals) and groups (AWS Hardware Engineering, EC2, other AWS services) through server conception, test, launch, and operations.
  6. You will drive high quality and reliability into new and future designs of AWS Accelerated server solutions for the AWS Cloud.

Skills

Required

  • Systems development engineering
  • Hardware engineering
  • Generative AI and ML infrastructure
  • Cloud computing
  • Server hardware design
  • Software development
  • Systems debugging
  • Architectural problem solving
  • x86 architecture
  • Operations knowledge
  • Technical leadership
  • Communication skills
  • Organizational and planning skills

Nice to have

  • HPC workloads
  • LLM training optimization
  • AWS services

What the JD emphasized

  • AWS Generative AI & ML Servers
  • AWS Hardware Engineering Accelerators
  • AI training and inference
  • price performance improvements in the cloud for AI model training
  • multi billion variable LLMs
  • high performance and scalability in AI/ML and HPC workloads
  • full technical stack - vertically from baremetal server hardware up to the software in userland
  • systems and software decisions impact the user
  • excellent systems debugger
  • next-generation AWS platforms
  • server conception, test, launch, and operations
  • AWS Accelerated server solutions

Other signals

  • AWS Generative AI & ML Servers
  • AWS Hardware Engineering Accelerators
  • AI training and inference
  • price performance improvements in the cloud for AI model training
  • multi billion variable LLMs
  • high performance and scalability in AI/ML and HPC workloads
  • cloud scale
  • baremetal server hardware up to the software in userland
  • systems and software decisions impact the user
  • systems debugger
  • next-generation AWS platforms
  • server conception, test, launch, and operations
  • AWS Accelerated server solutions
  • Hardware Engineering AI / ML development team