Sr. Software Engineer - AI/ML, AWS Neuron Distributed Training

Amazon · Big Tech · Cupertino, CA · Software Development

Senior Software Engineer role focused on distributed training of large-scale ML models (LLMs, Stable Diffusion, ViTs) on AWS custom silicon (Trainium, Inferentia). Responsibilities include leading efforts to build distributed training support in PyTorch and JAX, optimizing models for performance and efficiency, and working with chip architects and compiler engineers. Requires strong software development skills, a solid ML foundation, and experience with distributed training libraries.

What you'd actually do

  1. You will lead efforts to build distributed training support into PyTorch and JAX using XLA, the Neuron compiler, and runtime stacks.
  2. You will optimize models to achieve peak performance and maximize efficiency on AWS custom silicon, including Trainium and Inferentia, as well as Trn2, Trn1, Inf1, and Inf2 servers.
  3. You will work within cross-functional teams, including chip architects and compiler engineers. Strong software development skills, the ability to deep dive, and a solid foundation in Machine Learning are critical for success in this role.

Skills

Required

  • Bachelor's degree in computer science or equivalent
  • 5+ years of non-internship professional software development experience
  • 5+ years of programming experience with at least one programming language
  • 5+ years of experience leading design or architecture (design patterns, reliability, and scaling) of new and existing systems
  • 5+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Experience as a mentor, tech lead or leading an engineering team
  • Experience in machine learning, data mining, information retrieval, statistics or natural language processing
  • Experience training large-scale ML models using Python is a must

Nice to have

  • Master's degree in computer science or equivalent
  • Experience in computer architecture
  • Previous software engineering expertise with PyTorch/JAX/TensorFlow, distributed libraries and frameworks, and end-to-end model training

What the JD emphasized

  • critical for success

Other signals

  • AWS Neuron
  • AWS Trainium
  • AWS Inferentia
  • distributed training
  • LLMs
  • Stable Diffusion
  • Vision Transformers
  • PyTorch
  • JAX
  • XLA
  • Neuron compiler
  • runtime stacks
  • FSDP
  • DeepSpeed
  • NeMo