Software Development Manager, AWS Neuron SDK - Distributed Training

Amazon · Big Tech · Cupertino, CA · Software Development

This role focuses on engineering and optimizing distributed training for large-scale ML models, particularly LLMs with multi-modal inputs/outputs, on AWS Neuron accelerators. The primary goal is to enhance training resiliency and performance across thousands of nodes, ensuring Trainium devices are first-class citizens for ML acceleration.

What you'd actually do

  1. Lead the effort to build large-cluster stability support for distributed training into PyTorch and JAX, using XLA and the Neuron compiler and runtime stacks.
  2. Help tune these models for the highest performance and maximize their efficiency when running on customer AWS Trainium TRN2+ servers.
  3. Own the full development life cycle of distributed training support for multi-modal transformer models such as MM-Llama3.2, DiT/Pixart, and CLIP.
  4. Develop scalability features and performance optimizations in the Neuron ML framework components to make Trainium devices first-class citizens for ML acceleration.
  5. Scale and optimize the application stack for LLMs that use multi-modal input and output generation, such as text, vision, video, and audio.

Skills

Required

  • Knowledge of object-oriented design, data structures, and algorithms
  • Experience (non-internship) in professional software development
  • 3+ years of engineering team management experience
  • 7+ years of experience working directly within engineering teams
  • 3+ years of experience designing or architecting (design patterns, reliability, and scaling) new and existing systems
  • 8+ years of experience leading the definition and development of multi-tier web services
  • Knowledge of engineering practices and patterns for the full software/hardware/networks development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and livesite operations
  • Experience partnering with product or program management teams

Nice to have

  • Experience designing and building large-scale systems in a multi-tiered, distributed environment (Service Oriented Architecture)
  • Experience in distributed training on thousands of nodes
  • Experience in communicating with users, other technical teams, and senior leadership to collect requirements, describe software product features, technical designs, and product strategy
  • Experience in recruiting, hiring, mentoring/coaching, and managing teams of Software Engineers to improve their skills and make them more effective product software engineers

What the JD emphasized

  • Scaling and Stabilizing Machine Learning Distributed Training components
  • Scaling model training across thousands of nodes is a must
  • Distributed Training on thousands of nodes

Other signals

  • Distributed Training
  • LLMs
  • Multi-modal
  • Scalability
  • Performance Optimization