Software Development Manager, AWS Neuron SDK - Distributed Training

Amazon · Big Tech · Cupertino, CA · Software Development

Software Development Manager role for the AWS Neuron SDK, focused on distributed training for ML accelerators. The role involves leading a team that designs and deploys new products, optimizes the performance of ML models at scale, and ensures support for key ML functionality. Responsibilities include customer onboarding, maximizing Model FLOPS Utilization (MFU), building tooling, partnering with other teams, and driving technical strategy for frontier model architectures.

What you'd actually do

  1. Lead a team of engineers focused on enabling new ML training customers on the Neuron SDK / Trainium platform.
  2. Own the customer onboarding journey from model evaluation through production training at scale.
  3. Drive engineering initiatives to maximize Model FLOPS Utilization (MFU) for customer workloads through performance analysis, profiling, and tuning tools.
  4. Build and maintain tooling, automation, and documentation that accelerates time-to-first-training for new customer models.
  5. Partner with compiler, runtime, and framework teams to identify and resolve blockers in customer workloads.

Skills

Required

  • Experience working with PyTorch or JAX
  • 3+ years of engineering team management experience
  • 7+ years of experience working directly within engineering teams
  • 3+ years of experience designing or architecting new and existing systems (design patterns, reliability, and scaling)
  • Experience partnering with product or program management teams
  • 3+ years of experience in deep learning / machine learning, including model training workflows
  • Experience with distributed training at scale (multi-node, multi-accelerator)

Nice to have

  • Experience in developing and deploying LLMs in production on GPUs, Neuron, TPU or other AI acceleration hardware
  • Experience directly managing scientists or machine learning engineers
  • Experience debugging, profiling, and implementing best software engineering practices in large-scale systems
  • Experience with CUDA kernels or ML/low-level kernels
  • Experience with performance analysis, profiling, and optimization for deep learning training workloads

What the JD emphasized

  • direct customer-facing experience
  • strong technical ability
  • motivation to achieve results
  • Experience in Machine Learning and software development is also a must
  • performance optimization
  • accuracy
  • resilience
  • deep learning / machine learning, including model training workflows
  • distributed training at scale

Other signals

  • leading a team
  • customer onboarding
  • performance optimization
  • distributed training
  • frontier model architectures