Sr. System Development Engineer, High-performance Accelerator Servers for AI/ML

Amazon · Big Tech · Austin, TX · Systems, Quality, & Security Engineering

This role is for a Senior System Development Engineer focused on high-performance accelerator servers for AI/ML at AWS. The engineer will be responsible for designing, delivering, and operating next-generation infrastructure for AI training and inference, focusing on performance, efficiency, and scalability. The role involves deep technical understanding of the full stack from hardware to software, systems debugging, and leading complex projects.

What you'd actually do

  1. You will be a technical leader solving complex problems.
  2. You will decompose big, difficult server system testability, reliability, and diagnosis problems into straightforward tasks, components, or features, and lead their delivery both yourself and through others working in parallel.
  3. You will use a combination of hardware, software, system design, x86 architecture, process, diagnosis, and operations knowledge.
  4. You will drive high quality and reliability into new and future designs of AWS accelerated server solutions for the AWS Cloud.

Skills

Required

  • 4+ years of non-internship professional software development experience
  • 4+ years of experience deploying and operating in a Linux/Unix environment
  • 4+ years of systems development experience in an IT or data center environment
  • 3+ years of experience programming in at least one modern language such as C++, C#, Java, Python, Golang, PowerShell, or Ruby
  • 2+ years of experience designing or architecting new and existing systems (design patterns, reliability, and scaling)
  • 2+ years of systems design, software development, operations, automation, and process improvement experience
  • Experience leading the design, build and deployment of complex and performant (reliable and scalable) software solutions in production

Nice to have

  • 3+ years of experience with a development, programming, or scripting language (Python, Java, Bash, Perl)
  • Knowledge of engineering practices and patterns for the full software/hardware/networks development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and livesite operations
  • Experience taking a leading role in building complex software or computing infrastructure that has been successfully delivered to customers
  • Experience debugging, integrating, and validating complex AI/ML and Cloud Computing servers

What the JD emphasized

  • complex problems
  • big difficult server system testability, reliability and diagnosis problems
  • complex and performant (reliable and scalable) software solutions
  • debugging, integrating, and validating complex AI/ML and Cloud Computing servers

Other signals

  • building the foundation of the world’s most advanced cloud for AI training and inference
  • multi-billion-parameter models come to life at scale
  • design, deliver, and operate next-generation infrastructure that powers breakthrough innovation in AI/ML and HPC workloads
  • pushing the limits of performance, efficiency, and scalability in the cloud
  • build the systems that define what’s next for AWS — and for the entire AI industry