Sr. Software Engineer - AI/ML, AWS Neuron Distributed Training - Performance Optimization

Amazon · Big Tech · Seattle, WA · Software Development

Senior Software Engineer focused on performance optimization for distributed AI model training on AWS Trainium accelerators. The role involves working with frameworks like PyTorch and JAX, optimizing the Neuron software stack, and improving training throughput and efficiency for large-scale models.

What you'd actually do

  1. You will lead efforts to optimize distributed training performance on Trainium, focusing on maximizing training throughput, model FLOPs utilization (MFU), and efficiency across the Neuron software stack.
  2. You will work across PyTorch, JAX, and the Neuron compiler and runtime to enable and tune large-scale training workloads on the latest Trainium instances.
  3. You will identify and resolve performance bottlenecks across the stack, from collective communications and memory utilization to compiler optimizations and kernel performance.
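Model FLOPs utilization, one of the headline metrics above, is the ratio of achieved training FLOPs to the hardware's peak FLOPs. A minimal sketch of the calculation, assuming the common 6N FLOPs-per-token approximation for dense transformer training (the function name and all numbers below are hypothetical, for illustration only):

```python
def model_flops_utilization(params: float,
                            tokens_per_second: float,
                            peak_flops_per_second: float) -> float:
    """Estimate MFU for dense transformer training.

    Uses the widely cited ~6*N FLOPs-per-token approximation
    (forward + backward pass for a model with N parameters).
    """
    achieved_flops_per_second = 6 * params * tokens_per_second
    return achieved_flops_per_second / peak_flops_per_second

# Hypothetical example: a 7B-parameter model processing 4,000 tokens/s
# on an accelerator with 400 TFLOP/s peak throughput.
mfu = model_flops_utilization(7e9, 4_000, 400e12)
print(f"MFU: {mfu:.1%}")  # → MFU: 42.0%
```

Raising this number is what "resolving bottlenecks across the stack" ultimately targets: better kernels, overlap of collectives with compute, and higher memory efficiency all show up as improved MFU.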

Skills

Required

  • 5+ years of non-internship professional software development experience
  • 5+ years of professional experience programming in at least one software language
  • 5+ years of experience leading the design or architecture (design patterns, reliability, scaling) of new and existing systems
  • 5+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Experience as a mentor, tech lead, or leader of an engineering team

Nice to have

  • Knowledge of machine learning frameworks and end-to-end model training

What the JD emphasized

  • performance optimization
  • distributed training
  • training throughput
  • model FLOPs utilization
  • efficiency
  • compiler optimizations
  • kernel performance

Other signals

  • AWS Neuron
  • AWS Trainium
  • large language models
  • multi-modal generation models