Software Engineer - AI/ML, AWS Neuron

Amazon · Big Tech · Cupertino, CA · Software Development

Software Engineer role focused on building and tuning distributed training solutions for AWS Inferentia and Trainium accelerators, specifically for large language models and other ML model families. The role involves working with PyTorch, JAX, XLA, and the Neuron compiler and runtime to maximize performance and efficiency on AWS Trainium.

What you'd actually do

  1. Help lead the effort to build distributed training support into PyTorch and JAX using XLA and the Neuron compiler and runtime stacks.
  2. Help tune these models for the highest performance and maximum efficiency when running on customers' AWS Trainium accelerators.
  3. Experience training these large models using Python is a must.
  4. FSDP, DeepSpeed, and other distributed training libraries are central to this work, and extending them to Neuron-based systems is key.

Skills

Required

  • 3+ years of non-internship professional software development experience
  • 2+ years of non-internship experience in the design or architecture (design patterns, reliability, and scaling) of new and existing systems
  • Experience programming with at least one software programming language
  • Experience training these large models using Python is a must.

Nice to have

  • 3+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Bachelor's degree in computer science or equivalent

What the JD emphasized

  • Strong software development and ML knowledge are both critical to this role.

Other signals

  • distributed training
  • performance tuning
  • large language models
  • accelerators