Software Development Engineer II, AI/ML Elastic Collectives - Annapurna Labs

Amazon · Big Tech · Cupertino, CA · Software Development

Software Development Engineer II at Amazon's Annapurna Labs, focusing on distributed AI/ML systems and the collective operations that let AI scale across multiple accelerators and servers. The role requires strong C/C++ and Linux skills; experience with embedded systems, high-speed networking, or HPC interconnects is valuable. The position is at the forefront of AI/ML, working with large-scale clusters and models within AWS's EC2 infrastructure.

What you'd actually do

  1. Work on distributed AI/ML systems
  2. Work on collective operations, the fundamental operations that enable AI to scale across multiple accelerators and servers
  3. Apply solid knowledge of Linux, kernels, and performant code
  4. Work with HPC and ML customers, iterate fast, and deliver meaningful solutions at scale
  5. Build features for the largest clusters, with the largest customers, for the largest AI models

Skills

Required

  • 3+ years of non-internship professional software development experience
  • 2+ years of non-internship experience designing or architecting new and existing systems (design patterns, reliability, and scaling)
  • Experience programming with at least one software programming language
  • Knowledge of Linux fundamentals

Nice to have

  • 3+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Bachelor's degree in computer science or equivalent
  • Knowledge of Python and/or C++ programming
  • Experience with embedded systems is valued
  • Experience with high-speed networking or HPC interconnects is highly valued

What the JD emphasized

  • solid knowledge of Linux, kernels, and performant code is important

Other signals

  • distributed AI/ML systems
  • collective operations
  • scale across multiple accelerators & servers
  • performant code
  • largest AI models