Software Development Engineer, EC2 Nitro

Amazon · Big Tech · Seattle, WA · Software Development

A Software Development Engineer to build and optimize performance measurement infrastructure for AI/ML workloads on AWS EC2 Nitro. The role spans low-level systems, ML frameworks, and serving layers, translating performance insights into technical requirements for future platform designs.

What you'd actually do

  1. Design and build foundational infrastructure for ML performance measurement that scales with business demand and operates as reliable CI/CD systems, ensuring high-quality implementations that balance customer requirements with operational excellence
  2. Develop comprehensive regression test coverage across all major component releases including frameworks, firmware, drivers, and networking technologies to maintain optimal platform performance
  3. Collaborate with cross-functional teams to establish EC2 as the definitive source for best-known-configurations across diverse ML applications including LLMs, multimodal models, and MoE architectures
  4. Document and communicate performance insights to influence future platform designs by translating technical findings from research and customer workloads into actionable recommendations
  5. Identify and resolve complex performance challenges through systematic analysis of training and inference performance KPIs across accelerated platforms, working directly with customers to improve their ML system efficiency
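The regression-testing and KPI-analysis duties above can be pictured as a minimal nightly benchmark gate. This is a hypothetical sketch for illustration only; the metric names, baseline values, and 5% threshold are invented and do not describe Amazon's actual tooling.

```python
# Hypothetical nightly regression gate: compare a benchmark run against a
# stored baseline and flag metrics that regressed past a tolerance.
# All names and numbers below are illustrative assumptions.

BASELINE = {"tokens_per_sec": 1250.0, "p50_latency_ms": 42.0}
TOLERANCE = 0.05  # allow 5% drift before flagging a regression

# Metrics where a larger value is better (others: smaller is better).
HIGHER_IS_BETTER = {"tokens_per_sec"}

def find_regressions(run: dict[str, float]) -> list[str]:
    """Return the names of metrics that regressed beyond tolerance."""
    regressed = []
    for name, baseline in BASELINE.items():
        value = run.get(name)
        if value is None:
            continue  # metric missing from this run; skip it
        if name in HIGHER_IS_BETTER:
            # e.g. throughput dropping more than 5% is a regression
            if value < baseline * (1 - TOLERANCE):
                regressed.append(name)
        else:
            # e.g. latency rising more than 5% is a regression
            if value > baseline * (1 + TOLERANCE):
                regressed.append(name)
    return regressed

# Throughput fell from 1250 to 1100 (>5% drop); latency is within bounds.
print(find_regressions({"tokens_per_sec": 1100.0, "p50_latency_ms": 41.0}))
```

A real system would load baselines per hardware configuration and framework version, but the core pass/fail comparison is this simple.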

Skills

Required

  • 3+ years of non-internship professional software development experience
  • 2+ years of non-internship experience designing or architecting new and existing systems (design patterns, reliability, scaling)
  • Experience programming in at least one programming language
  • Knowledge of Machine Learning and LLM fundamentals, including transformer architecture, training/inference lifecycles, and optimization techniques

Nice to have

  • 3+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Bachelor's degree in computer science or equivalent
  • Knowledge of ML frameworks including JAX, PyTorch, vLLM, SGLang, Dynamo, TorchXLA, and TensorRT
  • Knowledge of machine learning model architecture and inference

What the JD emphasized

  • revolutionize accelerated computing
  • build and optimize the performance measurement infrastructure
  • computationally intensive AI/ML workloads
  • low-level systems (CUDA, EFA, firmware)
  • ML frameworks
  • serving layers
  • deep technical knowledge
  • complex performance data
  • machine learning infrastructure at cloud scale
  • high-performance computing
  • distributed systems
  • machine learning technologies
  • foundational infrastructure for ML performance measurement
  • scales with business demand
  • reliable CI/CD systems
  • high-quality implementations
  • customer requirements
  • operational excellence
  • comprehensive regression test coverage
  • frameworks, firmware, drivers, and networking technologies
  • optimal platform performance
  • cross-functional teams
  • definitive source for best-known-configurations
  • LLMs, multimodal models, and MoE architectures
  • performance insights
  • future platform designs
  • technical findings from research and customer workloads
  • actionable recommendations
  • complex performance challenges
  • systematic analysis
  • training and inference performance KPIs
  • accelerated platforms
  • customers to improve their ML system efficiency
  • performance data from overnight benchmark runs
  • ML frameworks and hardware configurations
  • investigate anomalies
  • optimization opportunities
  • design reviews
  • future platform capabilities
  • building measurement infrastructure
  • analyzing performance trends
  • documenting best practices
  • customers optimize their workloads
  • development, operations, and maintenance of scale-out machine learning platforms
  • training and inference workloads
  • infrastructure that powers some of the most computationally intensive AI/ML workloads
  • reliable, high-performance systems
  • customers to push the boundaries of what's possible with machine learning
  • influence the future of supercomputing in the cloud
  • solving complex technical challenges at massive scale
  • collaborate closely with customers and internal teams
  • continuously improve our platforms
  • deliver innovations that accelerate machine learning workflows
  • Knowledge of Machine Learning and LLM fundamentals
  • transformer architecture
  • training/inference lifecycles
  • optimization techniques
  • Knowledge of ML frameworks including JAX, PyTorch, vLLM, SGLang, Dynamo, TorchXLA, and TensorRT
  • Knowledge of machine learning model architecture and inference

Other signals

  • performance measurement infrastructure
  • AI/ML workloads
  • accelerated computing
  • training and inference performance