Neuron Runtime Software Development Engineer, Neuron Runtime

Amazon · Big Tech · Seattle, WA · Software Development

Software Development Engineer responsible for developing and maintaining high-performance runtime libraries and drivers for AWS ML accelerators (Inferentia and Trainium). The role involves managing the full development lifecycle of the Neuron Runtime, optimizing AI workloads, improving ML kernels and frameworks, and supporting multiple ML frameworks. Requires experience in distributed systems, AWS services, and end-to-end service ownership.

What you'd actually do

  1. develop and maintain high-performance runtime libraries and drivers for machine learning applications and AI accelerators
  2. work on design, development, and deployment of Neuron Runtime and other Neuron components
  3. manage the full development life cycle of the Neuron Runtime, ensuring scalability, reliability, and usability
  4. collaborate with cross-functional teams to ensure that our C++ compiler generates key information so customers can understand and optimize the performance of our custom hardware
  5. drive innovations that allow the profiler to support multiple frameworks, such as PyTorch, JAX, and XLA

Skills

Required

  • 3+ years of non-internship professional software development experience
  • 2+ years of non-internship experience in the design or architecture (design patterns, reliability, and scaling) of new and existing systems
  • Experience programming with at least one software programming language
  • Experience architecting, building, and operating distributed systems with a focus on high availability and fault tolerance
  • Hands-on experience with AWS services (e.g., EC2, ECS, CloudWatch, S3, Lambda) in production environments
  • Track record of owning services end-to-end, including deployment, monitoring, alarming, on-call, and post-incident review

Nice to have

  • 3+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Bachelor's degree in computer science or equivalent

What the JD emphasized

  • high-performance runtime libraries and drivers for machine learning applications and AI accelerators
  • optimizing AI workloads across hardware platforms
  • Improving performance of ML Kernels and ML Frameworks
  • manage the full development life cycle of the Neuron Runtime
  • architecting, building, and operating distributed systems with a focus on high availability and fault tolerance
  • Owning services end-to-end including deployment, monitoring, alarming, on-call, and post-incident review

Other signals

  • AWS Inferentia and Trainium cloud-scale machine learning accelerators
  • high-performance runtime libraries and drivers for machine learning applications
  • optimizing AI workloads across hardware platforms
  • Improving performance of ML Kernels and ML Frameworks
  • manage the full development life cycle of the Neuron Runtime
  • C++ compiler generates key information so customers can understand and optimize the performance of our custom hardware
  • support multiple frameworks, such as PyTorch, JAX, and XLA
  • architecting, building, and operating distributed systems with a focus on high availability and fault tolerance
  • AWS services (e.g., EC2, ECS, CloudWatch, S3, Lambda) in production environments
  • Owning services end-to-end including deployment, monitoring, alarming, on-call, and post-incident review
  • build massive-scale distributed training and inference solutions