Join our EC2 Nitro Machine Learning Systems team to lead a pivotal performance engineering group at the intersection of state-of-the-art ML research and production-scale infrastructure. You'll build and manage a team focused on establishing EC2 as the definitive source for ML performance optimization across diverse workloads including LLMs, multimodal models, and next-generation AI applications. Your work will directly influence product management, marketing, and executive decision-making by delivering comprehensive performance data and best practices for AWS's accelerated computing platforms.
This role spans the full ML systems stack—from low-level CUDA optimization and accelerator firmware through frameworks to serving layers—requiring both technical depth and strategic vision. You'll translate performance insights into platform design recommendations that shape future AWS hardware, while simultaneously establishing foundational benchmarking infrastructure that scales with rapidly evolving customer demands.
Key job responsibilities
- Lead the design, implementation, and delivery of foundational ML performance measurement infrastructure that operates as reliable CI/CD systems across diverse accelerator platforms
- Build and nurture a high-performing engineering team focused on establishing EC2 as the definitive source for ML performance best-known configurations
- Drive architectural decisions that influence future platform design by feeding insights from state-of-the-art research and customer workloads into accelerated platform launches
- Develop comprehensive regression coverage across all major component releases including frameworks, firmware, drivers, and networking components
- Establish mechanisms to scale performance engineering practices from current LLM focus to multimodal models, Mixture-of-Experts architectures, and emerging AI application domains
A day in the life
Your day involves strategic context-switching across multiple domains. You might start the morning diving deep into CI/CD triage of performance regressions, transition to product management calls where you translate technical insights into business value, then finish by collaborating with platform architects to shape next-generation hardware requirements. You serve as the connective tissue between ML research, platform engineering, customer needs, and business strategy, requiring exceptional translation skills to convert technical performance data into actionable business insights and platform improvements.
About the team
The EC2 Nitro Machine Learning Systems team is responsible for development, operations, and maintenance of scale-out machine learning platforms used for training and inference workloads. We build and optimize the infrastructure that powers some of the most computationally intensive AI/ML workloads in the cloud. Our team is passionate about creating reliable, high-performance systems that enable customers to push the boundaries of what's possible with machine learning.
Working with us means having the opportunity to influence the future of supercomputing in the cloud while solving complex technical challenges at massive scale. We collaborate closely with customers and internal teams to continuously improve our platforms and deliver innovations that accelerate machine learning workflows.
Basic Qualifications
- 3+ years of engineering team management experience
- 7+ years of experience working directly within engineering teams
- 3+ years of experience designing or architecting new and existing systems (design patterns, reliability, and scaling)
- Knowledge of engineering practices and patterns for the full software/hardware/networks development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and livesite operations
- Experience partnering with product or program management teams
Preferred Qualifications
- Experience communicating with users, other technical teams, and senior leadership to collect requirements and describe software product features, technical designs, and product strategy
- Experience recruiting, hiring, mentoring/coaching, and managing teams of Software Engineers to improve their skills and make them more effective product software engineers
Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.
Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit https://amazon.jobs/content/en/how-we-hire/accommodations for more information. If the country/region you’re applying in isn’t listed, please contact your Recruiting Partner.
The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits including health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance with an option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, and Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave. Learn more about our benefits at https://amazon.jobs/en/benefits.
USA, WA, Seattle - 184,900.00 - 250,200.00 USD annually