Software Development Engineer, Neuron Collectives, Annapurna Labs

Amazon · Big Tech · Cupertino, CA · Software Development

Software Development Engineer role focused on optimizing collective operations for AWS Trainium, a purpose-built AI training chip. The work is done in C/C++ and involves enhancing collective algorithms and topologies, optimizing compute for the specific topologies used to train modern LLMs, and working closely with hardware teams to maximize performance. The goal is to scale AI compute across the data center for training frontier AI models.

What you'd actually do

  1. Enhance collective algorithms and topologies for optimal training performance
  2. Use tools like Neuron Explorer to identify bottlenecks in compute and bus bandwidth utilization
  3. Monitor and analyze processor, DMA, firmware, and workload metrics
  4. Optimize collective operations to scale AI compute across the data center
  5. Work closely with the hardware team to co-optimize software and Trainium silicon

Skills

Required

  • Experience building complex software systems that have been successfully delivered to customers
  • Experience contributing to the architecture and design of new and current systems, including design patterns, reliability, and scaling
  • Bachelor's degree in computer science or equivalent
  • Knowledge of engineering practices and patterns for the full software/hardware/networks development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and livesite operations
  • Experience in development in the last 3 years, or experience in embedded development in C/C++

Nice to have

  • Master's degree in computer science or equivalent
  • Experience with hardware/software integration and real-time systems
  • Familiarity with collective communication algorithms (e.g., all-reduce, all-gather) or distributed training frameworks

What the JD emphasized

  • optimize compute for the specific topologies used to train modern LLMs
  • fully utilize compute and bus bandwidth to scale across the data center
  • impact how AI training runs at AWS scale

Other signals

  • AWS Trainium
  • scale AI compute
  • frontier AI models
  • collective operations
  • LLMs