Sr. Software Engineer - AI/ML, AWS Neuron Distributed Training

Amazon · Big Tech · Seattle, WA · Software Development

Senior Software Engineer role focused on developing, enabling, and performance tuning distributed training solutions for large-scale ML models (LLMs, Stable Diffusion, ViT) on AWS Neuron accelerators using PyTorch. The role involves building distributed training support into PyTorch, the Neuron compiler, and runtime stacks, with a focus on strategies like FSDP, PP, and context parallelism. Experience with post-training strategies is a plus.

What you'd actually do

  1. You will lead efforts to build distributed training support into PyTorch, the Neuron compiler, and runtime stacks.
  2. You will enable distributed training strategies and use them to optimize models to achieve peak performance and maximize efficiency on AWS custom silicon, including Trainium servers.
  3. Strong software development skills, the ability to deep dive, work effectively within cross-functional teams, and a solid foundation in Machine Learning are critical for success in this role.

Skills

Required

  • Bachelor's degree in computer science or equivalent
  • 5+ years of non-internship professional software development experience
  • 5+ years of programming experience with at least one software programming language
  • 5+ years of experience leading design or architecture (design patterns, reliability and scaling) of new and existing systems
  • 5+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Experience as a mentor, tech lead or leading an engineering team
  • Experience in machine learning and large-scale training with LLMs, and expertise in PyTorch

Nice to have

  • Master's degree in computer science or equivalent
  • Experience in computer architecture
  • Previous software engineering expertise with PyTorch/JAX/TensorFlow, distributed libraries and frameworks, and end-to-end model training
  • Experience in post-training strategies like DPO/PPO/HF torch-tune

What the JD emphasized

  • Experience with training these large models using PyTorch is a must.
  • Strong software development skills
  • A solid foundation in Machine Learning is critical for success in this role.

Other signals

  • AWS Neuron
  • AWS Trainium
  • distributed training
  • LLMs
  • PyTorch