Sr. Systems Development Engineer (AWS Generative AI & ML Servers), AWS HW Engineering

Amazon · Big Tech · Seattle, WA · Systems, Quality, & Security Engineering

This role focuses on building and operating AWS cloud infrastructure for AI training and inference, specifically targeting generative AI and large language models. The engineer will be responsible for designing, delivering, and optimizing server hardware and software systems to enable high performance and scalability for AI/ML workloads, with a focus on price-performance improvements. The role involves creating automation through agentic workflows and implementing AI-driven tools to enhance engineer productivity and influence AI implementation and core architecture.

What you'd actually do

  1. You will be a technical leader solving complex architectural problems that may not be defined beforehand.
  2. You will own the team's systems and work proactively to identify deficiencies, write tactical code to solve issues before they impact customers, and work with your team to scale the solution.
  3. You will decompose large, difficult server-system testability, reliability, and diagnosis problems into straightforward tasks, components, or features, and lead their delivery yourself and through others in parallel.
  4. You will use a combination of hardware, software, system design, x86 architecture, process, diagnosis, and operations knowledge.
  5. In this role you will create automation through agentic workflows.
  6. You will develop smart automation solutions, implement AI-driven tools and workflows, and be part of the AI transformation.

Skills

Required

  • Systems Development Engineering
  • AWS cloud infrastructure
  • Generative AI and ML
  • Server hardware design and operation
  • Software development
  • Systems debugging
  • Automation
  • Agentic workflows
  • AI-driven tools
  • Technical leadership
  • Architectural problem-solving
  • x86 architecture

Nice to have

  • HPC workloads
  • LLMs
  • Cloud scale
  • Supply chain
  • Security

What the JD emphasized

  • building the backbone of Generative AI cloud
  • delivering continuous price performance improvements in the cloud for AI model training
  • high performance and scalability in AI/ML and HPC workloads
  • building intelligent systems that drive the debug and development of next-generation cloud technologies
  • full technical stack, vertically from bare-metal server hardware up to the software in userland
  • systems and software decisions impact the user
  • excellent systems debugger
  • create automation through agentic workflows
  • implement AI-driven tools and workflows
  • AI transformation

Other signals

  • designing, delivering and operating AWS cloud offerings that enable high performance and scalability in AI/ML and HPC workloads