Software Engineer I - AI/ML, AWS Neuron Distributed Training

Amazon · Big Tech · Cupertino, CA · Software Development

Software Engineer role focused on developing, enabling, and optimizing large-scale ML model training (pre-training and post-training of LLMs, multimodal, and RL workloads) on AWS Trainium accelerators. The work involves distributed training frameworks, mixed-precision techniques, and performance tuning on Trainium hardware.

What you'd actually do

  1. You will contribute to the design and implementation of distributed training solutions for large-scale ML models running on Trainium instances.
  2. A significant part of your work will involve extending and optimizing popular distributed training frameworks including FSDP, torchtitan, and Hugging Face libraries for the Neuron ecosystem.
  3. A core focus of this role involves developing and optimizing mixed-precision and low-precision training techniques.
  4. You will work with BF16, FP8, and emerging numerical formats to improve training throughput while maintaining model accuracy and convergence quality.
  5. Beyond precision optimization, you will profile, analyze, and tune end-to-end training pipelines to achieve optimal performance on Trainium hardware.

Skills

Required

  • Bachelor's degree or above in computer science, computer engineering, or a related field
  • 1+ years of programming experience with at least one software programming language (including academic projects, internships, or research)
  • Experience with software development practices including code reviews, source control, testing, and build processes
  • Experience with machine learning concepts and at least one ML framework (PyTorch, JAX, or TensorFlow)

Nice to have

  • Master's degree or above in computer science or equivalent
  • Experience with large-scale distributed training or LLM workloads
  • Experience with computer architecture or hardware-software co-optimization
  • Experience with distributed systems, libraries, or frameworks
  • Familiarity with end-to-end model training pipelines
  • Previous internship or research experience in ML infrastructure or systems software

What the JD emphasized

  • large-scale ML model training
  • pre-training and post-training of LLMs
  • mixed-precision and low-precision training techniques
  • performance optimization on Trainium hardware

Other signals

  • distributed training frameworks