Software Development Manager, Neuron Tools, Annapurna Labs

Amazon · Big Tech · Seattle, WA · Software Development

Software Development Manager for the AWS Neuron Tools team, responsible for leading engineers who develop and maintain high-performance monitoring and profiling tools for AI accelerators (Inferentia, Trainium). The role involves managing the full development lifecycle of the Neuron Profiler, ensuring scalability, reliability, and usability, and collaborating with cross-functional teams to optimize AI workloads. Experience with ML-specific profiler tools and performance analysis is required.

What you'd actually do

  1. lead a talented team of engineers to develop and maintain high-performance monitoring and profiling tools for machine learning applications and AI accelerators
  2. oversee the design, development, and deployment of the Neuron Profiler and other Neuron Tools
  3. manage the full development life cycle of the Neuron Profiler/Tools toolchain, ensuring scalability, reliability, and usability
  4. collaborate with cross-functional teams to ensure that our C++ compiler and runtime generate the key information customers need to understand and optimize the performance of our custom hardware
  5. drive innovations that allow the profiler to support multiple frameworks, such as PyTorch, TensorFlow, and XLA

Skills

Required

  • 3+ years of engineering team management experience
  • 7+ years of experience working directly within engineering teams
  • 3+ years of experience designing or architecting new and existing systems (design patterns, reliability, scaling)
  • Experience partnering with product or program management teams
  • Experience in C++, Go, and Python

Nice to have

  • 2+ years of experience leading teams working in Machine Learning development, including building and training large models with PyTorch and/or TensorFlow using large distributed fleets of GPUs or other accelerated systems
  • Experience with Linux distributions such as Ubuntu or CentOS, kernel development, and tooling such as perf and gdb
  • Experience with performance profiling, tracing, and analysis of AI training/inference applications
  • Experience with large-scale, distributed AI training/inference applications, including libfabric, MPI, Slurm, and EKS
  • Experience with fleet monitoring, debugging, and reliability
  • Knowledge of AI-powered optimization suggestions for profiling

What the JD emphasized

  • high-performance monitoring and profiling tools for machine learning applications and AI accelerators
  • optimizing AI workloads
  • performance bottlenecks
  • ML-specific profiler tools
  • performance profiling, tracing, and analysis of AI training/inference applications

Other signals

  • AWS Neuron
  • AWS Inferentia and Trainium
  • Neuron Profiler
  • AI accelerators
  • optimizing AI workloads
  • performance bottlenecks
  • C++ compiler and runtime
  • PyTorch, TensorFlow, and XLA