Cloud Hardware Dev Engineer (AWS Generative AI & ML Servers), Accelerator Servers

Amazon · Big Tech · Seattle, WA · Hardware Development

This role focuses on hardware engineering for the accelerated servers that AWS uses for AI training and inference. The engineer will be responsible for the design, development, root cause analysis, and oversight of these servers, working with internal and external customers, component vendors, and manufacturing partners. The goal is to deliver high-performance, scalable, and cost-effective server solutions for AI/ML and HPC workloads.

What you'd actually do

  1. Own and lead the design, development, and root cause analysis of a new segment of accelerated servers.
  2. Work closely with our customers to understand their technical needs and business goals, leveraging your experience with server design and the knowledge of various teams to architect the solutions that we will deploy at scale.
  3. Work with an interdisciplinary team of component, firmware, test, qualification, and integration engineers, and lead our design and manufacturing partners to bring these servers to the data center.
  4. Oversee the fleet of servers you develop, monitoring their quality and how well they meet customer requirements.
  5. Interface with our internal and external customers to understand project requirements and facilitate system development on top of your server design.

Skills

Required

  • Experience in developing functional specifications, design verification plans and functional test procedures
  • Bachelor's degree or above in electrical engineering, computer engineering, or equivalent
  • English-language communication skills, both written and verbal
  • Experience with design & innovation and research & development
  • Knowledge of operating systems, hardware, storage, network, security, database administration and cloud infrastructure
  • Experience in server technologies such as thermal, mechanical, power, and signal integrity
  • 5+ years of professional work (non-internship) experience

Nice to have

  • 5+ years of experience in hardware design and validation of components, subsystems, and systems
  • Experience in server technologies: board design, high-speed bus design and signal integrity, failure analysis, server components (CPU, GPU, SSDs, memory), BIOS, BMC, and networking
  • Experience developing and executing test procedures for mechanical or electrical systems/components
  • Experience working with ODMs/manufacturers through the product development and manufacturing lifecycle
  • Experience building predictive failure detection or proactive remediation systems at fleet scale
  • Experience with storage/compute/GPU/accelerator platforms including integration, diagnostics, or performance validation
  • Familiarity with PCIe topology, NVLink, NVMe, and accelerator interconnects
  • Experience with large-scale datacenter or cloud environments

What the JD emphasized

  • accelerated servers
  • AI training and inference
  • AI/ML and HPC workloads
  • server design