SDE II, Neuron Infra Services

Amazon · Big Tech · Cupertino, CA · Software Development

Software Engineer role focused on developing and optimizing services for AWS Inferentia and Trainium machine learning accelerators. Responsibilities include leading the design and implementation of tools, pipelines, and automation for ML workloads, managing infrastructure, monitoring performance, and ensuring security. Requires experience with distributed systems, ML optimization, and cloud technologies.

What you'd actually do

  1. Lead the design and implementation of new tools, pipelines, and automation, working with developers, system architects, hardware engineers, and users both inside and outside Amazon to ensure the new toolset is compatible with existing and next-generation AI accelerators.
  2. Design, implement, and maintain CI/CD pipelines to automate the software release process.
  3. Collaborate with development teams to integrate new software releases.
  4. Manage and automate infrastructure provisioning.
  5. Implement monitoring solutions to track system performance.

Skills

Required

  • 3+ years of non-internship professional software development experience
  • 2+ years of non-internship experience designing or architecting new and existing systems (design patterns, reliability, and scaling)
  • Experience programming with at least one software programming language
  • Knowledge of system performance, memory management, and parallel computing principles
  • Experience debugging, profiling, and implementing software engineering best practices in large-scale systems
  • Experience with AWS or cloud technologies

Nice to have

  • 3+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Bachelor's degree in computer science or equivalent
  • Knowledge of the fundamentals of networking, security, databases (relational or NoSQL), and operating systems (Unix, Linux, and/or Windows)
  • Knowledge of the fundamentals of machine learning and LLMs, including their architecture, along with work experience with specific LLMs

What the JD emphasized

  • AWS Inferentia and Trainium
  • machine learning accelerators
  • optimization
  • resource management
  • scheduling
  • large-scale systems
  • AWS or cloud technologies