Neuron Collectives Software Engineer, Trainium Collectives

Amazon · Big Tech · Cupertino, CA · Software Development

Software Engineer role focused on enhancing collective algorithms and topologies for optimal AI training performance on Amazon's Trainium chips. The work involves optimizing communication primitives to scale AI compute across data centers, collaborating closely with hardware teams, and developing C/C++ implementations of the collectives used to train modern LLMs.

What you'd actually do

  1. Enhance collective algorithms and topologies for optimal training performance
  2. Use tools like Neuron Explorer to identify bottlenecks in compute and bus bandwidth utilization
  3. Monitor and analyze processor, DMA, firmware, and workload metrics
  4. Optimize collective operations to scale AI compute across the data center
  5. Work closely with the hardware team to co-optimize software and Trainium silicon

Skills

Required

  • 3+ years of non-internship professional software development experience
  • 2+ years of non-internship experience designing or architecting (design patterns, reliability, and scaling) new and existing systems
  • Experience programming in at least one programming language
  • C/C++

Nice to have

  • 3+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Bachelor's degree in computer science or equivalent

What the JD emphasized

  • optimal training performance
  • scale AI compute
  • modern LLMs
  • AI training hardware
  • critical initiatives
  • Machine Learning (ML)
  • purpose-built AI training chip
  • collective operations
  • AI training to scale
  • frontier models
  • AI today
  • maximum performance
  • compute and interconnect bandwidth
  • hardware, firmware, and distributed systems

Other signals

  • optimize collective operations to scale AI compute
  • training topologies used by modern LLMs
  • purpose-built AI training chip
  • collective operations — the communication primitives that allow AI training to scale across thousands of chips