Software Engineer - AI/ML, AWS Neuron Distributed Training - Performance Optimization

Amazon · Big Tech · Seattle, WA · Software Development

Software Engineer focused on performance optimization for distributed training of large-scale AI/ML models (LLMs, multi-modal) on AWS Neuron accelerators. This involves tuning across the software stack, including collective communications, memory utilization, compiler optimizations, and kernel performance, working with PyTorch and JAX.

What you'd actually do

  1. Optimize distributed training performance on Trainium, with a primary focus on maximizing training throughput, model FLOPs utilization, and efficiency across the Neuron software stack.
  2. Work across PyTorch, JAX, and the Neuron compiler and runtime to enable and tune large-scale training workloads on the latest Trainium instances.
  3. Identify and resolve performance bottlenecks across the stack, from collective communications and memory utilization to compiler optimizations and kernel performance.
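The "model FLOPs utilization" metric named above can be made concrete with a short sketch: MFU is the ratio of FLOP/s a training job actually achieves to the hardware's peak FLOP/s. The parameter values and peak-throughput figure below are illustrative assumptions, not Trainium specifications.

```python
def training_flops_per_token(n_params: float) -> float:
    # Common approximation for dense transformer training:
    # ~6 FLOPs per parameter per token (forward + backward pass).
    return 6.0 * n_params

def mfu(n_params: float, tokens_per_second: float,
        peak_flops_per_second: float) -> float:
    # MFU = achieved training FLOP/s divided by hardware peak FLOP/s.
    achieved = training_flops_per_token(n_params) * tokens_per_second
    return achieved / peak_flops_per_second

# Illustrative example: a 7B-parameter model processing 1,500 tokens/s
# per device on an accelerator with an assumed 190 TFLOP/s peak.
print(f"MFU: {mfu(7e9, 1500, 190e12):.1%}")  # → MFU: 33.2%
```

Raising this ratio is what the tuning work across collectives, memory, compiler, and kernels ultimately targets.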

Skills

Required

  • 3+ years of non-internship professional software development experience
  • 2+ years of non-internship design or architecture (design patterns, reliability and scaling) of new and existing systems experience
  • Experience programming with at least one programming language

Nice to have

  • 3+ years of full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations experience
  • Bachelor's degree in computer science or equivalent

What the JD emphasized

  • performance tuning
  • performance optimization
  • distributed training

Other signals

  • ML accelerators
  • large language models
  • multi-modal models