Senior Software Development Engineer, AI/ML, AWS Neuron, Model Inference

Amazon · Big Tech · Cupertino, CA · Software Development

Senior Software Development Engineer role focused on optimizing and enabling AI/ML model inference on AWS's custom hardware accelerators (Inferentia and Trainium) through the Neuron SDK. The role involves system-level optimizations, performance tuning for latency and throughput, building infrastructure for model onboarding, and collaborating across hardware, software, and framework teams to ensure optimal performance for customers running large language models and other GenAI workloads.
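
For context on "building infrastructure for model onboarding": in the Neuron stack, PyTorch models are typically compiled ahead of time with torch-neuronx before they can run on Inferentia or Trainium. A minimal sketch of that flow, assuming torch_neuronx is installed on a Neuron-capable instance; the toy model and shapes are illustrative, not taken from the posting:

```python
import torch
import torch_neuronx  # Neuron's PyTorch integration; needs an Inf/Trn-class instance

# Toy stand-in for a real model; any traceable torch.nn.Module works.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.GELU(),
    torch.nn.Linear(1024, 1024),
).eval()

# The example input pins the shapes the Neuron compiler specializes for.
example = torch.rand(1, 1024)

# Ahead-of-time compile for NeuronCores; returns a TorchScript module.
neuron_model = torch_neuronx.trace(model, example)

# Save and reload like any TorchScript artifact, then run inference as usual.
torch.jit.save(neuron_model, "model_neuron.pt")
restored = torch.jit.load("model_neuron.pt")
output = restored(example)
```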

What you'd actually do

  1. Design, develop, and optimize machine learning models and frameworks for deployment on custom ML hardware accelerators.
  2. Participate in all stages of the ML system development lifecycle, including distributed-computing-based architecture design, implementation, performance profiling, hardware-specific optimizations, testing, and production deployment.
  3. Build infrastructure to systematically analyze and onboard multiple models with diverse architectures.
  4. Design and implement high-performance kernels and features for ML operations, leveraging the Neuron architecture and programming models (a minimal kernel sketch follows this list).
  5. Analyze and optimize system-level performance across multiple generations of Neuron hardware.
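
Item 4 maps to the Neuron Kernel Interface (NKI), the SDK's Python-level kernel programming model. Below is a minimal element-wise kernel adapted from the public NKI getting-started examples; it assumes the neuronxcc package is installed, exact APIs vary across SDK versions, and inputs must fit the on-chip tile limits:

```python
from neuronxcc import nki
import neuronxcc.nki.language as nl

@nki.jit
def add_kernel(a_input, b_input):
    # Allocate the kernel output in device HBM.
    c_output = nl.ndarray(a_input.shape, dtype=a_input.dtype, buffer=nl.shared_hbm)
    # Load both operands from HBM into on-chip memory (SBUF).
    a_tile = nl.load(a_input)
    b_tile = nl.load(b_input)
    # Element-wise add on-chip, then write the result back to HBM.
    nl.store(c_output, value=a_tile + b_tile)
    return c_output
```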

Skills

Required

  • Python
  • System-level programming
  • ML knowledge
  • Low-level optimization
  • System architecture
  • ML model acceleration
  • Performance profiling
  • Distributed computing
  • Hardware-specific optimizations

Nice to have

  • PyTorch
  • JAX
  • Compiler
  • Runtime
  • Frameworks
  • Kernels
  • LLM model families
  • Llama family
  • DeepSeek

What the JD emphasized

  • "critical to this role"
  • "must have"
  • "optimize inference performance for both latency and throughput"
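
The latency/throughput emphasis reflects a real tension: small batches minimize per-request latency, while large batches keep the accelerator busy and maximize throughput. A framework-agnostic sketch of how both are commonly measured together; `infer` here is a placeholder for any compiled model call, not a Neuron API:

```python
import time
import statistics

def benchmark(infer, batch, batch_size, warmup=10, iters=100):
    """Report per-batch latency percentiles and derived throughput.

    `infer` is any callable that runs one batch end to end (a stand-in
    for a compiled Neuron model); `batch_size` is items per call.
    """
    for _ in range(warmup):  # warm-up absorbs one-time compile/cache costs
        infer(batch)
    latencies = []
    for _ in range(iters):
        start = time.perf_counter()
        infer(batch)
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    return {
        "p50_s": statistics.median(latencies),
        "p99_s": latencies[int(0.99 * (len(latencies) - 1))],
        "items_per_s": batch_size / statistics.mean(latencies),
    }
```

Raising the batch size typically improves items_per_s while pushing p99_s up, which is exactly the trade-off this role would manage.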

Other signals

  • AWS Neuron SDK
  • accelerate deep learning and GenAI workloads
  • custom machine learning accelerators
  • ML inference and training performance
  • distributed inference solutions
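
On "distributed inference solutions": models too large for a single NeuronCore are sharded across devices, most commonly with tensor parallelism, where each device holds a slice of every weight matrix and computes a partial output. A toy single-process illustration of the column-sharding arithmetic in plain PyTorch (no Neuron APIs; the shard count of 4 is arbitrary):

```python
import torch

torch.manual_seed(0)
x = torch.rand(4, 256)    # a batch of activations
w = torch.rand(256, 512)  # the full weight of one linear layer

# Tensor parallelism, column-sharded: each of 4 "devices" holds a
# 256x128 slice of w and computes its slice of the output.
shards = torch.chunk(w, 4, dim=1)
partials = [x @ shard for shard in shards]  # runs on separate devices in practice

# Concatenating the partial outputs (an all-gather on real hardware)
# reproduces the unsharded result.
y_parallel = torch.cat(partials, dim=1)
assert torch.allclose(y_parallel, x @ w, atol=1e-5)
```

On real hardware the torch.cat is an all-gather collective across devices, which is why interconnect behavior shows up alongside kernel quality in this kind of role.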