SDE II, ML Infra Services, Annapurna Labs

Amazon · Big Tech · Seattle, WA · Software Development

Software engineer to lead the development of tools that run, optimize, and analyze machine learning workloads on AWS Neuron ML accelerators. The role focuses on the ML infrastructure platform: capacity management, workload scheduling, and fleet orchestration.

What you'd actually do

  1. Lead the design and implementation of the ML infrastructure platform, building systems for capacity management, workload scheduling, and fleet orchestration across ML accelerators
  2. Work with ML scientists, training infrastructure engineers, hardware teams, and internal customers to ensure the ML Infra service delivers seamless ML accelerator access with low wait times, high utilization, and zero-config deployment from various environments
  3. Create metrics, implement automation and other improvements, and resolve the root causes of software defects
  4. Build high-impact solutions to deliver to our large customer base
  5. Participate in design discussions and code reviews, and communicate with internal and external stakeholders

Skills

Required

  • 3+ years of non-internship professional software development experience
  • 2+ years of non-internship experience in design or architecture (design patterns, reliability, and scaling) of new and existing systems
  • Experience programming in at least one programming language

Nice to have

  • 3+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Bachelor's degree in computer science or equivalent
  • Experience taking a leading role in building complex software or computing infrastructure that has been successfully delivered to customers
  • Experience with AWS Services including EC2, Lambda, S3, DynamoDB, SQS
  • Experience with Kubernetes, Docker, or the container ecosystem, or experience managing full application stacks from the OS up through custom applications, and experience with any big data architecture
  • Experience with version control systems and CI/CD pipeline implementation
  • Strong proficiency in Go/Java, Python, and JavaScript/TypeScript
  • Application and kernel performance profiling and optimization
  • Proficiency in integrated software/hardware performance analysis and optimization
  • Experience designing and operating production services

What the JD emphasized

  • leading machine learning tool projects
  • deep knowledge of profiling and optimization
  • resource management
  • scheduling
  • code generation
  • new instruction set architectures

Other signals

  • ML infrastructure platform
  • ML accelerators
  • capacity management
  • workload scheduling
  • fleet orchestration