Software Development Engineer, AI/ML, Inference Serving, AWS Neuron

Amazon · Big Tech · Cupertino, CA · Software Development

This role leads and architects next-generation model serving infrastructure for generative AI applications on AWS Inferentia and Trainium accelerators, focusing on the performance, reliability, and scalability of inference serving systems.

What you'd actually do

  1. Architect and lead the design of distributed ML serving systems optimized for generative AI workloads
  2. Drive technical excellence in performance optimization and system reliability across the Neuron ecosystem
  3. Design and implement scalable solutions for both offline and online inference workloads
  4. Lead integration efforts with frameworks such as vLLM, SGLang, Torch XLA, TensorRT, and Triton (a minimal vLLM sketch follows this list)
  5. Develop and optimize system components for tensor/data parallelism and disaggregated serving
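
To ground items 4 and 5, here is a minimal sketch of offline batch inference through vLLM's Python API with tensor parallelism enabled. The model name and tensor_parallel_size are illustrative assumptions, not requirements from the JD, and Neuron-specific backend/device selection varies by vLLM and Neuron SDK version, so it is not shown.

```python
# Minimal offline-inference sketch using vLLM's Python API.
# Assumptions: model name and tensor_parallel_size are illustrative;
# Inferentia/Trainium backend selection is version-specific and omitted.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model choice
    tensor_parallel_size=2,  # shard weights across 2 devices (illustrative)
)

sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

prompts = [
    "Explain disaggregated serving in one sentence.",
    "What is a KV cache?",
]

# generate() runs continuous batching internally and returns one
# RequestOutput per prompt, each holding generated completions.
for out in llm.generate(prompts, sampling):
    print(out.outputs[0].text)
```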

Skills

Required

  • 5+ years of programming in a modern language such as Java, C++, or C#, including object-oriented design experience
  • 5+ years leading the design or architecture (design patterns, reliability, scaling) of new and existing systems
  • 5+ years across the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • 5+ years of non-internship professional software development experience
  • Experience as a mentor, tech lead, or leader of an engineering team

Nice to have

  • Master's degree in computer science or equivalent
  • Deep expertise in ML frameworks/libraries such as JAX, PyTorch, vLLM, SGLang, Dynamo, Torch XLA, and TensorRT

What the JD emphasized

  • lead and architect
  • next-generation model serving infrastructure
  • large-scale generative AI applications
  • performance optimization
  • system reliability
  • scalable solutions
  • disaggregated serving
  • distributed KV cache management (see the toy sketch after this list)
  • container-native solutions
  • pushing the boundaries of what's possible in large-scale ML serving
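
Several of these emphases (disaggregated serving, distributed KV cache management) describe one architecture: prefill and decode run on separate worker pools, and the prompt's KV cache is handed off between them. The toy sketch below shows only the control flow of that handoff; every name in it is invented for illustration, and the dict-based "store" stands in for a real transport such as a fast interconnect or a shared distributed cache tier.

```python
# Toy control-flow sketch of prefill/decode disaggregation.
# Everything here (class names, the in-memory "store") is hypothetical;
# a real system moves KV blocks over a fast interconnect or a shared
# distributed cache tier.
from dataclasses import dataclass, field


@dataclass
class KVCacheStore:
    """Stand-in for a distributed KV cache: request_id -> KV blocks."""
    blocks: dict = field(default_factory=dict)

    def put(self, request_id: str, kv) -> None:
        self.blocks[request_id] = kv

    def get(self, request_id: str):
        # Hand off ownership: the decode side takes the blocks.
        return self.blocks.pop(request_id)


class PrefillWorker:
    """Runs the full prompt once and publishes the resulting KV cache."""
    def __init__(self, store: KVCacheStore):
        self.store = store

    def prefill(self, request_id: str, prompt: str) -> None:
        kv = [f"kv({tok})" for tok in prompt.split()]  # fake KV blocks
        self.store.put(request_id, kv)


class DecodeWorker:
    """Consumes the published KV cache and generates tokens one at a time."""
    def __init__(self, store: KVCacheStore):
        self.store = store

    def decode(self, request_id: str, max_tokens: int) -> list[str]:
        kv = self.store.get(request_id)  # fetch the handed-off cache
        return [f"tok{i}<ctx={len(kv)}>" for i in range(max_tokens)]


store = KVCacheStore()
PrefillWorker(store).prefill("req-1", "hello neuron world")
print(DecodeWorker(store).decode("req-1", max_tokens=3))
```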

Other signals

  • AWS Neuron software stack
  • AWS Inferentia and Trainium machine learning accelerators
  • high-performance, low-cost inference at scale
  • serving modern machine learning models
  • large language models (LLMs) and multimodal workloads
  • generative AI applications
  • distributed ML serving systems
  • performance optimization and system reliability
  • offline and online inference workloads
  • vLLM, SGLang, Torch XLA, TensorRT, and Triton
  • tensor/data parallelism and disaggregated serving
  • custom PyTorch operators and NKI kernels (see the operator-registration sketch after this list)
  • disaggregated serving, distributed KV cache management, CPU offloading, and container-native solutions
  • upstreaming Neuron SDK contributions to the open-source community
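
The "custom PyTorch operators and NKI kernels" signal points at operator-extension work. As a rough illustration, the sketch below registers a custom op through torch.library (PyTorch 2.4+); the op and the "neuron_demo" namespace are invented for this example, and a real Neuron kernel would be authored in NKI and bound to a device backend rather than implemented in eager PyTorch.

```python
# Minimal sketch of registering a custom PyTorch operator with
# torch.library (PyTorch 2.4+). The op (a scaled residual add) and the
# "neuron_demo" namespace are invented for illustration.
import torch


@torch.library.custom_op("neuron_demo::scaled_add", mutates_args=())
def scaled_add(x: torch.Tensor, y: torch.Tensor, alpha: float) -> torch.Tensor:
    # Eager reference implementation; a production op would dispatch
    # to a device kernel here.
    return x + alpha * y


@scaled_add.register_fake
def _(x, y, alpha):
    # Shape/dtype propagation so the op works under torch.compile.
    return torch.empty_like(x)


a = torch.randn(4)
b = torch.randn(4)
out = torch.ops.neuron_demo.scaled_add(a, b, 0.5)
print(torch.allclose(out, a + 0.5 * b))  # True
```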