Software Development Engineer, AI/ML, AWS Neuron, Model Inference

Amazon · Big Tech · Cupertino, CA · Software Development

Software Development Engineer focused on optimizing and enabling AI/ML model inference on AWS's custom hardware accelerators (Inferentia and Trainium), working across the stack from frameworks like PyTorch/JAX to hardware-specific optimizations and kernel development.
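
Concretely, "enabling a model" on this stack usually starts with compiling it for NeuronCores. The sketch below uses torch_neuronx, the Neuron SDK's PyTorch integration, and assumes it is installed on an Inferentia/Trainium (Inf2/Trn1) instance; the toy model is purely illustrative.

```python
import torch
import torch_neuronx  # AWS Neuron SDK integration for PyTorch (assumed installed)

# Toy stand-in for a real model; .eval() because we compile for inference.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()
example = torch.rand(1, 128)

# Ahead-of-time compile for NeuronCores; the result behaves like a
# TorchScript module and can be saved and loaded for serving.
neuron_model = torch_neuronx.trace(model, example)
torch.jit.save(neuron_model, "model_neuron.pt")

restored = torch.jit.load("model_neuron.pt")
print(restored(example).shape)  # torch.Size([1, 10])
```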

What you'd actually do

  1. Design, develop, and optimize machine learning models and frameworks for deployment on custom ML hardware accelerators.
  2. Participate in all stages of the ML system development lifecycle: distributed-computing-based architecture design, implementation, performance profiling, hardware-specific optimization, testing, and production deployment.
  3. Build infrastructure to systematically analyze and onboard multiple models with diverse architectures.
  4. Design and implement high-performance kernels and features for ML operations, leveraging the Neuron architecture and its programming models (see the kernel sketch after this list).
  5. Analyze and optimize system-level performance across multiple generations of Neuron hardware.
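
For a flavor of the kernel work in item 4, here is a minimal element-wise-add kernel modeled on the getting-started examples for the Neuron Kernel Interface (NKI) in the Neuron SDK. Treat the exact module paths and signatures as assumptions to verify against the current SDK docs, and note that real kernels must tile inputs to fit on-chip buffer limits.

```python
from neuronxcc import nki            # NKI: kernel interface shipped with the Neuron SDK
import neuronxcc.nki.language as nl  # tile-level language primitives

@nki.jit
def add_kernel(a_input, b_input):
    """Element-wise add: load tiles to on-chip SBUF, compute, store to HBM."""
    # Output tensor lives in device HBM; shapes assumed small enough for one tile.
    c_output = nl.ndarray(a_input.shape, dtype=a_input.dtype, buffer=nl.shared_hbm)
    a_tile = nl.load(a_input)   # HBM -> on-chip buffer
    b_tile = nl.load(b_input)
    nl.store(c_output, value=a_tile + b_tile)  # on-chip -> HBM
    return c_output
```

On a Trainium/Inferentia instance the kernel runs on device tensors; the SDK also provides a CPU simulation path (nki.simulate_kernel in recent releases) for testing without hardware.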

Skills

Required

  • Python
  • System-level programming
  • ML knowledge
  • Experience optimizing inference performance for both latency and throughput (see the benchmarking sketch after this list)
  • System-level optimizations
  • PyTorch or JAX
  • Low-level optimization
  • System architecture
  • ML model acceleration
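
Since the latency/throughput requirement is one the posting stresses explicitly, here is a framework-agnostic sketch of how such numbers are typically measured. Every name in it is hypothetical; any callable model can be plugged in.

```python
import statistics
import time

def benchmark(infer, request, iters=200, warmup=20):
    """Measure single-request latency percentiles and sequential throughput."""
    for _ in range(warmup):                 # warm caches/JIT before timing
        infer(request)
    latencies = []
    for _ in range(iters):
        t0 = time.perf_counter()
        infer(request)
        latencies.append(time.perf_counter() - t0)
    latencies.sort()
    return {
        "p50_ms": statistics.median(latencies) * 1e3,
        "p99_ms": latencies[int(0.99 * (len(latencies) - 1))] * 1e3,
        "throughput_rps": iters / sum(latencies),  # sequential; batched serving differs
    }

# Example with a dummy "model":
print(benchmark(lambda x: sum(i * i for i in range(x)), 10_000))
```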

Nice to have

  • Deep learning and GenAI workloads, especially large language models (e.g., the Llama family, DeepSeek)
  • ML frameworks (PyTorch, JAX) plus ML compiler, runtime, and application-framework internals
  • ML inference and training performance; distributed inference solutions and support
  • AWS ML accelerators (Trainium, Inferentia) and the Neuron SDK, including its architecture and programming models
  • High-performance kernels for ML functions and AI acceleration across frameworks, kernels, compiler, runtime, and collectives
  • High-performance computing and distributed architectures, including novel and future architecture designs
  • Model enablement, development, and performance tuning for the highest performance and efficiency at scale
  • Detailed performance analysis with profiling tools to identify and resolve bottlenecks
  • Optimization techniques such as fusion, sharding, tiling, and scheduling (see the tiling sketch after this list)
  • Comprehensive testing: unit and end-to-end model testing, with continuous deployment and releases through pipelines
  • Working with customers to enable and optimize their ML models on AWS accelerators
  • Collaborating on a cross-functional team of applied scientists, compiler engineers, runtime engineers, system engineers, and product managers
  • Debugging performance issues and optimizing memory usage for state-of-the-art Generative AI inference
  • Architecting business-critical features; designing and coding solutions, driving efficiencies in software architecture, creating metrics, implementing automation, and resolving the root cause of software defects
  • Contributing to open source ecosystems and shaping the future of Neuron's inference stack
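
One of those techniques, tiling, is easy to show in isolation: compute a matmul block by block so each working set fits in a fast buffer. The sketch below is a toy NumPy illustration, not Neuron code; on real hardware, tile sizes come from on-chip buffer limits rather than the arbitrary 64 used here.

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Compute C = A @ B one (tile x tile) block at a time."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                # Accumulate one output block from matching tiles of A and B;
                # each inner product touches only a small, cache-friendly slice.
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

A = np.random.rand(256, 256).astype(np.float32)
B = np.random.rand(256, 256).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```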

What the JD emphasized

  • critical to this role
  • must have
  • hard requirement

Other signals

  • The AWS Neuron SDK spans an ML compiler, runtime, and application framework, integrating with PyTorch and JAX to accelerate deep learning and GenAI workloads on custom machine learning accelerators
  • The role sits on the Inference Enablement and Acceleration team, which works at the hardware-software boundary to enable models with novel architectures and maximize their performance
  • The work lies at the intersection of machine learning, high-performance computing, and distributed architectures, spanning compiler to runtime and collectives, and informs future architecture designs
  • The performance bar is the highest performance and efficiency on AWS Trainium and Inferentia silicon and servers, for both latency and throughput
  • Solutions are built for both Amazon and the Open Source Community