Senior Software Development Engineer, AI/ML, AWS Neuron, Model Inference

Amazon · Big Tech · Seattle, WA · Software Development

Senior Software Development Engineer role focused on optimizing and enabling AI/ML model inference on AWS's custom hardware accelerators (Inferentia and Trainium). The role spans the stack, from frameworks (PyTorch, JAX) down to hardware: building infrastructure, optimizing performance for both latency and throughput, and collaborating with internal teams and customers to run large language models and other GenAI workloads efficiently. Experience with inference serving platforms such as vLLM is required.
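
Since vLLM experience is called out as required, here is a minimal sketch of offline batch inference with vLLM, for context only; the model name, parallelism degree, and sampling settings are illustrative assumptions, not details from the listing:

```python
# Minimal vLLM offline-inference sketch. Model name and sampling
# settings are illustrative assumptions, not from the job listing.
from vllm import LLM, SamplingParams

prompts = [
    "Explain tensor parallelism in one sentence.",
    "What does an inference runtime do?",
]

# Low temperature and a short output cap for a quick smoke test.
sampling_params = SamplingParams(temperature=0.2, top_p=0.95, max_tokens=64)

# LLM() loads the model and pre-allocates KV-cache memory;
# tensor_parallel_size shards the weights across devices.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)

# generate() schedules the prompts with continuous batching.
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```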

What you'd actually do

  1. Work with state-of-the-art LLMs, spanning open-source and internal model families, and run large-scale performance and benchmark evaluations (see the benchmark sketch after this list).
  2. Develop and performance-tune a wide variety of LLM families, including 500B+ large language models such as Llama, DeepSeek, and beyond.
  3. Work side by side with performance, compiler, and runtime engineers to design, build, and tune distributed inference solutions on Trainium and Inferentia.
  4. Build infrastructure to systematically analyze and onboard multiple models with diverse architectures.
  5. Collaborate with the performance team to enable and evaluate optimizations such as fusion, sharding, tiling, and scheduling.
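
As a rough illustration of the benchmarking work in items 1 and 5, the sketch below measures per-request latency percentiles and aggregate token throughput for any generate-style callable. The `generate_fn` signature and its token accounting are assumptions; a real Neuron evaluation would use the actual serving stack and hardware:

```python
# Hypothetical latency/throughput micro-benchmark harness.
# generate_fn is an assumed callable returning (text, tokens_generated).
import statistics
import time
from typing import Callable, List, Tuple


def benchmark(generate_fn: Callable[[str], Tuple[str, int]],
              prompts: List[str],
              warmup: int = 3) -> None:
    # Warm up so one-time costs (compilation, cache allocation)
    # don't skew the measured latencies.
    for p in prompts[:warmup]:
        generate_fn(p)

    latencies = []
    total_tokens = 0
    start = time.perf_counter()
    for p in prompts:
        t0 = time.perf_counter()
        _, n_tokens = generate_fn(p)
        latencies.append(time.perf_counter() - t0)
        total_tokens += n_tokens
    wall = time.perf_counter() - start

    latencies.sort()
    p50 = statistics.median(latencies)
    p99 = latencies[int(0.99 * (len(latencies) - 1))]
    print(f"p50 latency: {p50 * 1e3:.1f} ms, p99: {p99 * 1e3:.1f} ms")
    print(f"throughput: {total_tokens / wall:.1f} tokens/s")
```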

Skills

Required

  • Python
  • System-level programming
  • ML knowledge
  • Low-level optimization
  • System architecture
  • ML model acceleration
  • Optimizing inference performance for both latency and throughput
  • Experience with vLLM, SGLang, TensorRT, or similar platforms

Nice to have

  • PyTorch
  • JAX
  • Deep learning
  • GenAI workloads
  • ML compiler
  • Runtime
  • Application framework
  • Distributed architectures
  • Frameworks
  • Kernels
  • Collectives
  • Future architecture designs
  • Large-scale performance and benchmark evaluations
  • Fusion, sharding, tiling, and scheduling
  • Unit and end-to-end model testing
  • Continuous deployment and releases through pipelines
  • Online/offline inference serving
  • Collaboration with applied scientists and product managers
  • Debugging performance issues
  • Optimizing memory usage
  • Software architecture
  • Metrics
  • Automation
  • Software defects

What the JD emphasized

  • Performance work is called out as "critical to this role" and a "must have"
  • Experience optimizing inference performance for both latency and throughput on such large models, across the stack from system-level optimizations through to PyTorch or JAX, is a must-have
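
As a toy illustration of the framework end of that stack (an illustrative assumption, not the team's actual workflow), the sketch below compares eager and `torch.compile` latency for a small PyTorch module; the model, shapes, and iteration count are made up:

```python
# Toy eager-vs-compiled latency comparison in PyTorch 2.x.
# The MLP, input shape, and iteration count are illustrative assumptions.
import time

import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).eval()

x = torch.randn(8, 1024)


def time_fn(fn, iters=50):
    fn(x)  # warmup; triggers compilation for the compiled variant
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - t0) / iters


with torch.inference_mode():
    eager_ms = time_fn(model) * 1e3
    compiled_ms = time_fn(torch.compile(model)) * 1e3

print(f"eager: {eager_ms:.2f} ms/iter, compiled: {compiled_ms:.2f} ms/iter")
```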

Other signals

  • AWS Neuron SDK
  • Accelerating deep learning and GenAI workloads
  • ML compiler, runtime, and application framework
  • ML inference and training performance
  • LLM model families
  • Distributed inference solutions
  • Optimizing inference performance for both latency and throughput
  • Low-level optimization, system architecture, and ML model acceleration
  • vLLM, SGLang, TensorRT