Software Development Engineer, Neuron Foundation Tools

Amazon · Big Tech · Seattle, WA · Software Development

Software Development Engineer responsible for developing and maintaining high-performance monitoring and profiling tools for AWS Neuron AI accelerators (Inferentia and Trainium). The role involves managing the full development life cycle of the Neuron Profiler/Tools toolchain, optimizing ML kernels and frameworks, and collaborating with compiler and runtime teams to give customers the insights they need to optimize their AI workloads.

What you'd actually do

  1. work alongside a team of engineers to develop and maintain high-performance monitoring and profiling tools for machine learning applications and AI accelerators
  2. work on the design, development, and deployment of the Neuron Profiler and other Neuron Tools
  3. manage the full development life cycle of the Neuron Profiler/Tools toolchain, ensuring scalability, reliability, and usability
  4. collaborate with cross-functional teams to ensure that the C++ compiler and runtime generate the key information customers need to understand and optimize the performance of our custom hardware
  5. drive innovations that allow the profiler to support multiple frameworks, such as PyTorch, JAX, and XLA

Skills

Required

  • 3+ years of non-internship professional software development experience
  • 2+ years of non-internship experience in the design or architecture (design patterns, reliability, and scaling) of new and existing systems
  • Experience programming with at least one software programming language

Nice to have

  • 3+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Bachelor's degree in computer science or equivalent
  • Experience with ML-specific profiler tools (such as PyTorch Profiler or TensorFlow Profiler)
  • Direct customer-facing experience

What the JD emphasized

  • high-performance
  • optimizing AI workloads
  • performance bottlenecks
  • improving performance
  • scalability, reliability, and usability
  • performance of our custom hardware
  • performance analysis tools
  • ML-specific profiler tools
  • achieve results

Other signals

  • AWS Neuron
  • AI accelerators
  • performance analysis tools
  • optimizing AI workloads