Senior Software Development Engineer, EC2 Nitro

Amazon · Big Tech · Seattle, WA · Software Development

A Senior Software Development Engineer role building and optimizing infrastructure for AI/ML workloads on EC2 Nitro. The focus is performance measurement, benchmarking, regression testing, and influencing future hardware designs for LLMs, multimodal systems, and emerging architectures. The role combines customer-facing performance problem-solving with foundational infrastructure development.

What you'd actually do

  1. Design and implement scalable performance measurement infrastructure that serves as the foundation for ML benchmarking across AWS, incorporating critical metrics like tokens/second, latency, and accelerator utilization
  2. Lead technical projects establishing EC2 as the definitive source for ML performance best practices across diverse applications including LLMs, multimodal systems, and emerging model architectures
  3. Develop and maintain comprehensive regression testing systems that validate performance across major component releases including frameworks, firmware, drivers, and networking infrastructure
  4. Collaborate with hardware engineering teams to influence future accelerator platform designs based on performance insights gathered from state-of-the-art research and customer workloads
  5. Build customer relationships by investigating complex performance challenges, developing solutions, and publishing actionable best practices through multiple channels
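The first and third responsibilities above boil down to throughput measurement and regression gating. A minimal sketch of both, assuming a hypothetical `generate` callable that stands in for a real inference call and returns a token count (all names here are illustrative, not AWS or EC2 APIs):

```python
import time
import statistics

def measure_tokens_per_second(generate, prompt, n_runs=5):
    """Time a generation callable over several runs; return median tokens/second.

    `generate` is a hypothetical stand-in for a real inference call; it is
    assumed to return the number of tokens it produced.
    """
    rates = []
    for _ in range(n_runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        rates.append(tokens / elapsed)
    # Median damps warm-up and scheduler noise better than the mean.
    return statistics.median(rates)

def check_regression(measured, baseline, tolerance=0.05):
    """Pass only if throughput stays within `tolerance` of the stored baseline."""
    return measured >= baseline * (1 - tolerance)

# Stub generator standing in for a real model call.
def fake_generate(prompt):
    time.sleep(0.01)   # simulate inference latency
    return 128         # pretend 128 tokens were produced

rate = measure_tokens_per_second(fake_generate, "hello")
print(f"measured {rate:.0f} tokens/s")
```

In practice the baseline would come from a stored result for the same instance type, driver, and framework versions, so a failed check points at whichever component release moved the number.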

Skills

Required

  • 5+ years of professional (non-internship) software development experience
  • 5+ years of programming experience in at least one software language
  • 5+ years of experience leading the design or architecture (design patterns, reliability, scaling) of new and existing systems
  • Experience as a mentor, tech lead, or leader of an engineering team
  • Knowledge of Machine Learning and LLM fundamentals, including transformer architecture, training/inference lifecycles, and optimization techniques

Nice to have

  • 5+ years of experience across the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Bachelor's degree in computer science or equivalent
  • Knowledge of ML frameworks including JAX, PyTorch, vLLM, SGLang, Dynamo, TorchXLA, and TensorRT
  • Knowledge of machine learning model architecture and inference

What the JD emphasized

  • revolutionize supercomputing in the cloud
  • build and optimize infrastructure powering the most computationally intensive AI/ML workloads
  • establish EC2 as the definitive source for best-known-configurations across diverse ML applications
  • influencing future accelerated platform designs
  • deep expertise in ML systems performance
  • full stack from low-level hardware optimization to high-level frameworks
  • translate state of the art ML research into practical platform improvements
  • build foundational measurement infrastructure
  • directly support customers with performance challenges
  • solving complex performance optimization problems at massive scale
  • directly influencing product strategy
  • scalable performance measurement infrastructure
  • ML benchmarking
  • tokens/second, latency, and accelerator utilization
  • ML performance best practices
  • LLMs, multimodal systems, and emerging model architectures
  • regression testing systems
  • frameworks, firmware, drivers, and networking infrastructure
  • accelerator platform designs
  • state-of-the-art research and customer workloads
  • complex performance challenges
  • large language model training workflow
  • framework engineers
  • platform design review
  • future hardware decisions
  • bootstrap team
  • scale-out machine learning platforms
  • training and inference workloads
  • computationally intensive AI/ML workloads
  • push the boundaries of what's possible with machine learning
  • accelerate machine learning workflows
