Sr. Software Engineer - AI/ML, AWS Neuron Distributed Training

Amazon · Big Tech · Cupertino, CA · Software Development

Senior Software Engineer role focused on developing, enabling, and optimizing large-scale ML model training (pre-training and post-training) on AWS Trainium accelerators. This involves working with distributed training frameworks, mixed-precision techniques, and performance tuning across various model families including LLMs, multimodal models, and RL workloads.

What you'd actually do

  1. You will design, implement, and optimize distributed training solutions for large-scale ML models running on Trainium instances. A significant part of your work will involve extending and optimizing popular distributed training frameworks, including FSDP (Fully Sharded Data Parallel), torchtitan, and Hugging Face libraries, for the Neuron ecosystem.
  2. A core focus of this role involves developing and optimizing mixed-precision and low-precision training techniques. You will work with BF16, FP8, and emerging numerical formats to maximize training throughput while maintaining model accuracy and convergence quality. This requires implementing precision-aware training strategies, loss scaling techniques, and careful gradient management to ensure training stability across reduced-precision formats. Understanding the tradeoffs between computational efficiency and numerical fidelity is essential to success in this position.
  3. Beyond precision optimization, you will profile, analyze, and tune end-to-end training pipelines to achieve optimal performance on Trainium hardware. You will partner with hardware, compiler, and runtime teams to influence system design and unlock new capabilities. Additionally, you will work directly with AWS solution architects and customers to deploy and optimize training workloads at scale.
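The loss scaling mentioned in item 2 can be sketched in plain Python. This is an illustrative, framework-free sketch of *dynamic* loss scaling, not any Neuron or PyTorch API: the class name, default scale, and growth interval are assumptions chosen for clarity (they loosely mirror common defaults in mixed-precision libraries).

```python
class DynamicLossScaler:
    """Illustrative sketch of dynamic loss scaling for low-precision training.

    The loss is multiplied by `scale` before backprop so that small gradients
    survive the limited dynamic range of formats like FP8/BF16; gradients are
    divided by `scale` before the optimizer step. When an overflow (inf/NaN
    gradient) is detected, the step is skipped and the scale is halved; after
    `growth_interval` consecutive clean steps, the scale is doubled.
    """

    def __init__(self, init_scale: float = 2.0 ** 15, growth_interval: int = 2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def scale_loss(self, loss: float) -> float:
        # Amplify the loss so backprop produces scaled-up gradients.
        return loss * self.scale

    def unscale(self, grads):
        # Undo the scaling before the optimizer consumes the gradients.
        return [g / self.scale for g in grads]

    def update(self, found_overflow: bool) -> bool:
        """Adjust the scale after a step; returns True if the step is safe to apply."""
        if found_overflow:
            # Back off: halve the scale and skip this optimizer step.
            self.scale = max(self.scale / 2.0, 1.0)
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # Long run of stable steps: try a larger scale again.
            self.scale *= 2.0
            self._good_steps = 0
        return True
```

In a real training loop the overflow check would inspect the gradients themselves (e.g. for inf/NaN after unscaling); here that signal is passed in explicitly to keep the sketch self-contained.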
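To make the efficiency-versus-fidelity tradeoff in item 2 concrete, here is a minimal, framework-free sketch of BF16's reduced precision. The helper name is illustrative; the assumption is the usual BF16 conversion, i.e. keeping the top 16 bits of the float32 pattern with round-to-nearest-even on the dropped half:

```python
import struct

def to_bf16(x: float) -> float:
    """Simulate rounding a value to BF16 (8-bit exponent, 7-bit mantissa).

    BF16 keeps float32's exponent range but drops 16 mantissa bits, so
    values keep their magnitude while losing precision. (Sketch only: no
    special handling for NaN/inf or exponent overflow near float32 max.)
    """
    # Reinterpret the float32 representation of x as a 32-bit integer.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Round to nearest, ties to even, on the low 16 bits being discarded.
    rounding = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding) & 0xFFFF0000
    # Reinterpret the truncated pattern back as a float.
    return struct.unpack(">f", struct.pack(">I", bits))[0]
```

For example, `to_bf16(0.1)` lands on the nearest 7-bit-mantissa value rather than on 0.1 itself; the relative error (order of 1e-3) is exactly the kind of numerical-fidelity cost that precision-aware training strategies have to manage.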

Skills

Required

  • machine learning
  • large-scale training with LLMs
  • PyTorch
  • distributed training solutions
  • mixed-precision training techniques
  • performance optimization
  • software development life cycle

Nice to have

  • computer architecture
  • JAX
  • TensorFlow
  • distributed libraries and frameworks
  • end-to-end model training

What the JD emphasized

  • large-scale ML model training
  • pre-training
  • post-training
  • mixed-precision
  • low-precision training techniques
  • performance optimization

Other signals

  • large-scale ML model training
  • distributed training frameworks
  • mixed-precision and low-precision training techniques
  • performance optimization on Trainium hardware