Senior Software Development Engineer, EC2 Nitro

Amazon · Big Tech · Seattle, WA · Software Development

A Senior Software Development Engineer role building and optimizing infrastructure for AI/ML workloads on EC2 Nitro. The focus is performance measurement, benchmarking, regression testing, and influencing future hardware designs for LLMs, multimodal systems, and emerging architectures. The role combines customer-facing performance problem-solving with foundational infrastructure development.

What you'd actually do

  1. Design and implement scalable performance measurement infrastructure that serves as the foundation for ML benchmarking across AWS, incorporating critical metrics like tokens/second, latency, and accelerator utilization
  2. Lead technical projects establishing EC2 as the definitive source for ML performance best practices across diverse applications including LLMs, multimodal systems, and emerging model architectures
  3. Develop and maintain comprehensive regression testing systems that validate performance across major component releases including frameworks, firmware, drivers, and networking infrastructure
  4. Collaborate with hardware engineering teams to influence future accelerator platform designs based on performance insights gathered from state-of-the-art research and customer workloads
  5. Build customer relationships by investigating complex performance challenges, developing solutions, and publishing actionable best practices through multiple channels
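The first and third responsibilities above boil down to throughput measurement and regression gating. A minimal sketch of both, assuming a hypothetical `generate` callable that stands in for a real inference call and returns a token count (all names here are illustrative, not AWS or EC2 APIs):

```python
import time
import statistics

def measure_tokens_per_second(generate, prompt, n_runs=5):
    """Time a generation callable over several runs; return median tokens/second.

    `generate` is a hypothetical stand-in for a real inference call; it is
    assumed to return the number of tokens it produced.
    """
    rates = []
    for _ in range(n_runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        rates.append(tokens / elapsed)
    # Median damps warm-up and scheduler noise better than the mean.
    return statistics.median(rates)

def check_regression(measured, baseline, tolerance=0.05):
    """Pass only if throughput stays within `tolerance` of the stored baseline."""
    return measured >= baseline * (1 - tolerance)

# Stub generator standing in for a real model call.
def fake_generate(prompt):
    time.sleep(0.01)   # simulate inference latency
    return 128         # pretend 128 tokens were produced

rate = measure_tokens_per_second(fake_generate, "hello")
print(f"measured {rate:.0f} tokens/s")
```

In practice the baseline would come from a stored result for the same instance type, driver, and framework versions, so a failed check points at whichever component release moved the number.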

Skills

Required

  • 5+ years of professional (non-internship) software development experience
  • 5+ years of programming experience in at least one software language
  • 5+ years of experience leading the design or architecture (design patterns, reliability, scaling) of new and existing systems
  • Experience as a mentor, tech lead, or leader of an engineering team
  • Knowledge of Machine Learning and LLM fundamentals, including transformer architecture, training/inference lifecycles, and optimization techniques

Nice to have

  • 5+ years of experience across the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Bachelor's degree in computer science or equivalent
  • Knowledge of ML frameworks including JAX, PyTorch, vLLM, SGLang, Dynamo, TorchXLA, and TensorRT
  • Knowledge of machine learning model architecture and inference

What the JD emphasized

  • revolutionize supercomputing in the cloud
  • build and optimize infrastructure powering the most computationally intensive AI/ML workloads
  • establish EC2 as the definitive source for best-known-configurations across diverse ML applications
  • influencing future accelerated platform designs
  • deep expertise in ML systems performance
  • full stack from low-level hardware optimization to high-level frameworks
  • translate state of the art ML research into practical platform improvements
  • build foundational measurement infrastructure
  • directly support customers with performance challenges
  • solving complex performance optimization problems at massive scale
  • directly influencing product strategy
  • scalable performance measurement infrastructure
  • ML benchmarking
  • tokens/second, latency, and accelerator utilization
  • ML performance best practices
  • LLMs, multimodal systems, and emerging model architectures
  • regression testing systems
  • frameworks, firmware, drivers, and networking infrastructure
  • accelerator platform designs
  • state-of-the-art research and customer workloads
  • complex performance challenges
  • large language model training workflow
  • framework engineers
  • platform design review
  • future hardware decisions
  • bootstrap team
  • scale-out machine learning platforms
  • training and inference workloads
  • computationally intensive AI/ML workloads
  • push the boundaries of what's possible with machine learning
  • accelerate machine learning workflows
