What you'd actually do

Enhance collective algorithms and topologies for optimal training performance
Use tools like Neuron Explorer to identify bottlenecks in compute and bus bandwidth utilization
Monitor and analyze processor, DMA, firmware, and workload metrics
Optimize collective operations to scale AI compute across the data center
Work closely with the hardware team to co-optimize software and Trainium silicon

Skills

Required

Experience building complex software systems that have been successfully delivered to customers
Experience contributing to the architecture and design (architecture, design patterns, reliability and scaling) of new and current systems
Bachelor's degree in computer science or equivalent
Knowledge of engineering practices and patterns for the full software/hardware/networks development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and livesite operations
Experience in development in the last 3 years, or experience in embedded development in C/C++

Nice to have

Master's degree in computer science or equivalent
Experience with hardware/software integration and real-time systems
Familiarity with collective communication algorithms (e.g., all-reduce, all-gather) or distributed training frameworks

Annapurna Labs is an integral part of AWS and develops hardware and software components that are critical building blocks for EC2 infrastructure. We specialize in designing software, systems and chips that optimize the AWS customer experience.

The AWS Neuron Collectives team is seeking a Software Engineer to optimize collective operations for AWS Trainium. Trainium is one of Amazon's highest-priority initiatives, powering the frontier AI models being trained today. Collectives are the critical operations that scale AI compute across the data center. You'll work in depth to optimize compute for the specific topologies used to train modern LLMs. Working closely with the hardware team, you'll push for maximum performance using C/C++, interfacing with DMA and firmware, and investigating detailed topologies. You'll analyze current collective algorithms using publicly accessible tools like Neuron Explorer and optimize them to fully utilize compute and bus bandwidth as workloads scale across the data center. This is a unique opportunity to impact how AI training runs at AWS scale, while growing your technical breadth and depth.

Key job responsibilities

As a Neuron Collectives Software Developer, you will:

Enhance collective algorithms and topologies for optimal training performance
Use tools like Neuron Explorer to identify bottlenecks in compute and bus bandwidth utilization
Monitor and analyze processor, DMA, firmware, and workload metrics
Optimize collective operations to scale AI compute across the data center
Work closely with the hardware team to co-optimize software and Trainium silicon
Develop and optimize C/C++ implementations of collective communication patterns
Investigate and implement improvements for specific training topologies used by modern LLMs
Build and maintain analysis frameworks and automation solutions

The role offers opportunities to work on cutting-edge AI training hardware while contributing to one of Amazon's most critical initiatives.

A day in the life

Inclusive Team Culture

Here at AWS, we embrace our differences. We are committed to furthering our culture of inclusion. We have ten employee-led affinity groups, reaching 40,000 employees in over 190 chapters globally. We have innovative benefit offerings, and host annual and ongoing learning experiences, including our Conversations on Race and Ethnicity (CORE) and AmazeCon (gender diversity) conferences. Amazon’s culture of inclusion is reinforced within our 16 Leadership Principles, which remind team members to seek diverse perspectives, learn and be curious, and earn trust.

Work/Life Balance

Our team puts a high value on work-life balance. It isn’t about how many hours you spend at home or at work; it’s about the flow you establish that brings energy to both parts of your life. We believe striking the right balance between your personal and professional life is critical to life-long happiness and fulfillment. We offer flexibility in working hours and encourage you to find your own balance between your work and personal lives.

Mentorship & Career Growth

Our team is dedicated to supporting new members. We have a broad mix of experience levels and tenures, and we’re building an environment that celebrates knowledge sharing and mentorship. We care about your career growth and strive to assign projects based on what will help each team member develop into a better-rounded professional and enable them to take on more complex tasks in the future.

About the team

Annapurna Labs, part of AWS, created Trainium as a purpose-built AI training chip to revolutionize machine learning at Amazon scale. The Neuron Collectives team owns the software stack that enables collective operations: the communication primitives that allow AI training to scale across thousands of chips in the data center. Our work is essential to training the frontier models that power AI today. We work closely with hardware teams to extract maximum performance from Trainium, ensuring that compute and interconnect bandwidth are fully utilized. Our team sits at the intersection of hardware, firmware, and distributed systems.

Basic Qualifications

Experience building complex software systems that have been successfully delivered to customers
Experience contributing to the architecture and design (architecture, design patterns, reliability and scaling) of new and current systems
Bachelor's degree in computer science or equivalent
Knowledge of engineering practices and patterns for the full software/hardware/networks development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and livesite operations
Experience in development in the last 3 years, or experience in embedded development in C/C++

Preferred Qualifications

Master's degree in computer science or equivalent
Experience with hardware/software integration and real-time systems
Familiarity with collective communication algorithms (e.g., all-reduce, all-gather) or distributed training frameworks

Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.

Los Angeles County applicants: Job duties for this position include: work safely and cooperatively with other employees, supervisors, and staff; adhere to standards of excellence despite stressful conditions; communicate effectively and respectfully with employees, supervisors, and staff to ensure exceptional customer service; and follow all federal, state, and local laws and Company policies. Criminal history may have a direct, adverse, and negative relationship with some of the material job duties of this position. These include the duties and responsibilities listed above, as well as the abilities to adhere to company policies, exercise sound judgment, effectively manage stress and work safely and respectfully with others, exhibit trustworthiness and professionalism, and safeguard business operations and the Company’s reputation. Pursuant to the Los Angeles County Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.

Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit https://amazon.jobs/content/en/how-we-hire/accommodations for more information. If the country/region you’re applying in isn’t listed, please contact your Recruiting Partner.

The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits including health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave. Learn more about our benefits at https://amazon.jobs/en/benefits.

USA, CA, Cupertino - 165,200.00 - 223,600.00 USD annually

Software Development Engineer, Neuron Collectives, Annapurna Labs
