Software Development Engineer, EC2 Nitro

Amazon · Big Tech · Seattle, WA · Software Development

A Software Development Engineer to build and optimize performance measurement infrastructure for AI/ML workloads on AWS EC2 Nitro. The role spans low-level systems, ML frameworks, and serving layers, translating performance insights into technical requirements for future platform designs.

What you'd actually do

  1. Design and build foundational infrastructure for ML performance measurement that scales with business demand and operates as reliable CI/CD systems, ensuring high-quality implementations that balance customer requirements with operational excellence
  2. Develop comprehensive regression test coverage across all major component releases including frameworks, firmware, drivers, and networking technologies to maintain optimal platform performance
  3. Collaborate with cross-functional teams to establish EC2 as the definitive source for best-known-configurations across diverse ML applications including LLMs, multimodal models, and MoE architectures
  4. Document and communicate performance insights to influence future platform designs by translating technical findings from research and customer workloads into actionable recommendations
  5. Identify and resolve complex performance challenges through systematic analysis of training and inference performance KPIs across accelerated platforms, working directly with customers to improve their ML system efficiency
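The regression-testing and KPI-analysis duties above can be pictured as a minimal nightly benchmark gate. This is a hypothetical sketch for illustration only; the metric names, baseline values, and 5% threshold are invented and do not describe Amazon's actual tooling.

```python
# Hypothetical nightly regression gate: compare a benchmark run against a
# stored baseline and flag metrics that regressed past a tolerance.
# All names and numbers below are illustrative assumptions.

BASELINE = {"tokens_per_sec": 1250.0, "p50_latency_ms": 42.0}
TOLERANCE = 0.05  # allow 5% drift before flagging a regression

# Metrics where a larger value is better (others: smaller is better).
HIGHER_IS_BETTER = {"tokens_per_sec"}

def find_regressions(run: dict[str, float]) -> list[str]:
    """Return the names of metrics that regressed beyond tolerance."""
    regressed = []
    for name, baseline in BASELINE.items():
        value = run.get(name)
        if value is None:
            continue  # metric missing from this run; skip it
        if name in HIGHER_IS_BETTER:
            # e.g. throughput dropping more than 5% is a regression
            if value < baseline * (1 - TOLERANCE):
                regressed.append(name)
        else:
            # e.g. latency rising more than 5% is a regression
            if value > baseline * (1 + TOLERANCE):
                regressed.append(name)
    return regressed

# Throughput fell from 1250 to 1100 (>5% drop); latency is within bounds.
print(find_regressions({"tokens_per_sec": 1100.0, "p50_latency_ms": 41.0}))
```

A real system would load baselines per hardware configuration and framework version, but the core pass/fail comparison is this simple.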

Skills

Required

  • 3+ years of non-internship professional software development experience
  • 2+ years of non-internship experience designing or architecting new and existing systems (design patterns, reliability, scaling)
  • Experience programming in at least one programming language
  • Knowledge of Machine Learning and LLM fundamentals, including transformer architecture, training/inference lifecycles, and optimization techniques

Nice to have

  • 3+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Bachelor's degree in computer science or equivalent
  • Knowledge of ML frameworks including JAX, PyTorch, vLLM, SGLang, Dynamo, TorchXLA, and TensorRT
  • Knowledge of machine learning model architecture and inference

What the JD emphasized

  • revolutionize accelerated computing
  • build and optimize the performance measurement infrastructure
  • computationally intensive AI/ML workloads
  • low-level systems (CUDA, EFA, firmware)
  • ML frameworks
  • serving layers
  • deep technical knowledge
  • complex performance data
  • machine learning infrastructure at cloud scale
  • high-performance computing
  • distributed systems
  • machine learning technologies
  • foundational infrastructure for ML performance measurement
  • scales with business demand
  • reliable CI/CD systems
  • high-quality implementations
  • customer requirements
  • operational excellence
  • comprehensive regression test coverage
  • frameworks, firmware, drivers, and networking technologies
  • optimal platform performance
  • cross-functional teams
  • definitive source for best-known-configurations
  • LLMs, multimodal models, and MoE architectures
  • performance insights
  • future platform designs
  • technical findings from research and customer workloads
  • actionable recommendations
  • complex performance challenges
  • systematic analysis
  • training and inference performance KPIs
  • accelerated platforms
  • customers to improve their ML system efficiency
  • performance data from overnight benchmark runs
  • ML frameworks and hardware configurations
  • investigate anomalies
  • optimization opportunities
  • design reviews
  • future platform capabilities
  • building measurement infrastructure
  • analyzing performance trends
  • documenting best practices
  • customers optimize their workloads
  • development, operations, and maintenance of scale-out machine learning platforms
  • training and inference workloads
  • infrastructure that powers some of the most computationally intensive AI/ML workloads
  • reliable, high-performance systems
  • customers to push the boundaries of what's possible with machine learning
  • influence the future of supercomputing in the cloud
  • solving complex technical challenges at massive scale
  • collaborate closely with customers and internal teams
  • continuously improve our platforms
  • deliver innovations that accelerate machine learning workflows
  • Knowledge of Machine Learning and LLM fundamentals
  • transformer architecture
  • training/inference lifecycles
  • optimization techniques
  • Knowledge of ML frameworks including JAX, PyTorch, vLLM, SGLang, Dynamo, TorchXLA, and TensorRT
  • Knowledge of machine learning model architecture and inference

Other signals

  • performance measurement infrastructure
  • AI/ML workloads
  • accelerated computing
  • training and inference performance