Sr. ML Kernel Performance Engineer, AWS Neuron, Annapurna Labs

Amazon · Big Tech · Cupertino, CA · Software Development

Senior ML Kernel Performance Engineer for the AWS Neuron SDK, focused on optimizing deep learning and GenAI workloads on custom ML accelerators (Inferentia, Trainium). The role involves designing and implementing high-performance compute kernels, optimizing performance at the hardware-software boundary, and collaborating with customers and internal teams on model enablement and acceleration.

What you'd actually do

  1. Design and implement high-performance compute kernels for ML operations, leveraging the Neuron architecture and programming models
  2. Analyze and optimize kernel-level performance across multiple generations of Neuron hardware
  3. Conduct detailed performance analysis using profiling tools to identify and resolve bottlenecks
  4. Implement compiler optimizations such as fusion, sharding, tiling, and scheduling
  5. Work directly with customers to enable and optimize their ML models on AWS accelerators
  6. Collaborate across teams to develop innovative kernel optimization techniques
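As a loose illustration of the "tiling" optimization named in item 4: a blocked matrix multiply processes small cache-sized sub-blocks rather than whole rows, which is the same locality idea kernel engineers apply on accelerator on-chip memory. This is a generic NumPy sketch, not Neuron code; the function name and tile size are illustrative assumptions.

```python
import numpy as np

def matmul_tiled(a, b, tile=64):
    """Blocked (tiled) matrix multiply.

    Instead of computing C = A @ B in one pass, iterate over
    tile x tile sub-blocks so each working set stays small --
    the classic cache/SRAM locality optimization. Illustrative
    sketch only; real accelerator kernels tile for specific
    on-chip buffer sizes and engine shapes.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((m, n), dtype=np.result_type(a, b))
    for i0 in range(0, m, tile):          # rows of the output tile
        for j0 in range(0, n, tile):      # columns of the output tile
            for k0 in range(0, k, tile):  # accumulate over the shared dim
                out[i0:i0 + tile, j0:j0 + tile] += (
                    a[i0:i0 + tile, k0:k0 + tile]
                    @ b[k0:k0 + tile, j0:j0 + tile]
                )
    return out
```

Slicing past the end of an array is clamped in NumPy, so the sketch also handles shapes that are not multiples of the tile size.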

Skills

Required

  • 5+ years of non-internship professional software development experience
  • 5+ years of professional experience with at least one programming language
  • 5+ years leading design or architecture work (design patterns, reliability, scalability, maintainability)
  • low-level optimization
  • system architecture
  • ML model acceleration
  • compiler optimizations (fusion, sharding, tiling, scheduling)
  • performance analysis and profiling tools

Nice to have

  • deep hardware knowledge
  • ML expertise
  • experience with AWS Neuron SDK
  • experience with PyTorch
  • experience with ML compilers and runtimes
  • experience with distributed architectures

What the JD emphasized

  • maximizing performance
  • high-performance kernels
  • optimal performance
  • push the boundaries of what's possible
  • unparalleled ML inference and training performance
  • optimize current performance
  • cutting-edge products
  • optimize machine learning workloads
  • low-level optimization
  • kernel-level performance
  • optimize kernel-level performance
  • optimize their ML models
  • kernel optimization techniques

Other signals

  • AWS Neuron SDK
  • deep learning and GenAI workloads
  • custom machine learning accelerators
  • high-performance kernels
  • ML compiler
  • runtime
  • PyTorch
  • ML inference and training performance
  • compiler optimizations
  • customer model enablement
  • low-level optimization
  • system architecture
  • ML model acceleration