Software Development Engineer III — Neuron Containers, Annapurna Labs

Amazon · Big Tech · Cupertino, CA · Software Development

Software Development Engineer III role focused on designing and leading the development of container platform integrations for ML accelerator resource management, including device plugins, DRA drivers, and operators. The role involves solving scalability challenges, simplifying systems by deprecating legacy software, driving operational excellence, and raising team quality through code reviews and contributions. The team enables customers to run training and inference workloads on Neuron at scale, owning integrations with Kubernetes, ECS, and Slurm.

What you'd actually do

  1. Lead multi-person projects end-to-end — from design documentation and architecture reviews through to delivery
  2. Design container platform integrations — device plugins, DRA drivers, and operator development for ML accelerator resource management
  3. Solve scalability challenges — diagnose performance issues across thousand-node customer clusters
  4. Simplify systems — deprecate legacy software and reduce complexity in container delivery pipelines
  5. Drive operational excellence — own on-call responsibilities, proactively triage test failures, and drive ticket resolution

Skills

Required

  • Go or a similar systems language
  • Distributed systems experience
  • Experience with the full software development life cycle
  • Experience leading the design or architecture of systems
  • Programming experience in at least one language
  • Non-internship professional software development experience
  • Experience as a mentor, tech lead, or leader of an engineering team

Nice to have

  • Kubernetes internals: device plugins, schedulers, controllers, DRA drivers
  • Helm, Prometheus, Kubernetes operator frameworks
  • ML training/inference infrastructure and container image pipelines
  • AWS compute services (EC2, EKS, ECS, ECR)
  • Deep Learning Containers or Deep Learning AMIs

What the JD emphasized

  • ML accelerator resource management
  • Kubernetes
  • training and inference workloads