Senior ML Kernel Performance Engineer

Amazon · Big Tech · CA, ON +1 · Software Development

The Annapurna Labs team at Amazon is seeking a Senior ML Kernel Performance Engineer to optimize deep learning and GenAI workloads on Amazon's custom ML accelerators (Inferentia and Trainium). This role involves crafting high-performance kernels, pushing the limits of AI acceleration at the hardware-software interface, and collaborating with customers to enable their models. The engineer will work on compiler optimizations and performance analysis, and contribute to future architecture designs.

What you'd actually do

  1. Design and implement high-performance compute kernels for ML operations, leveraging the Neuron architecture and programming models
  2. Analyze and optimize kernel-level performance across multiple generations of Neuron hardware
  3. Conduct detailed performance analysis using profiling tools to identify and resolve bottlenecks
  4. Implement compiler optimizations such as fusion, sharding, tiling, and scheduling
  5. Work directly with customers to enable and optimize their ML models on AWS accelerators
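To illustrate the kind of optimization named in item 4, here is a minimal sketch of loop tiling (cache blocking) in Python/NumPy. The function name, tile size, and shapes are illustrative only, not taken from the posting; production kernels for Neuron or GPUs would express the same idea in a lower-level programming model.

```python
import numpy as np

def tiled_matmul(a, b, tile=64):
    """Cache-blocked matrix multiply: compute the output in tile x tile
    blocks so each block's operands fit in fast memory. Slicing handles
    edge blocks when dimensions are not multiples of the tile size."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=a.dtype)
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # Accumulate one partial product into the (i0, j0) block.
                c[i0:i0 + tile, j0:j0 + tile] += (
                    a[i0:i0 + tile, k0:k0 + tile]
                    @ b[k0:k0 + tile, j0:j0 + tile]
                )
    return c
```

The tile size is the key tuning knob: it trades off reuse of loaded operands against the capacity of the fastest memory level, which is exactly the kind of parameter a kernel performance engineer profiles and sweeps.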

Skills

Required

  • 5+ years of non-internship professional software development experience
  • 5+ years of professional experience programming in at least one software language
  • 5+ years of experience leading the design or architecture (design patterns, reliability, scaling) of new and existing systems
  • Experience as a mentor or tech lead, or leading an engineering team

Nice to have

  • 5+ years of experience across the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Bachelor's degree in computer science or equivalent
  • Expertise in accelerator architectures for ML or HPC such as GPUs, CPUs, FPGAs, or custom architectures
  • Experience with GPU kernel optimization and GPGPU computing (e.g., CUDA, NKI, Triton, OpenCL, SYCL, or ROCm)
  • Demonstrated experience with NVIDIA PTX and/or AMD GPU ISA
  • Experience developing high performance libraries for HPC applications
  • Proficiency in low-level performance optimization for GPUs
  • Experience with LLVM/MLIR backend development for GPUs
  • Knowledge of ML frameworks (PyTorch, TensorFlow) and their GPU backends
  • Experience with parallel programming and optimization techniques
  • Understanding of GPU memory hierarchies and optimization strategies

What the JD emphasized

  • high-performance kernels for ML functions
  • push the boundaries of what's possible in AI acceleration
  • optimize current performance
  • future architecture designs
  • cutting-edge products
  • inventing
  • experimenting
  • unique learning culture
  • optimal performance
  • high-performance compute kernels
  • kernel-level performance
  • performance analysis
  • compiler optimizations
  • optimize their ML models
  • kernel optimization techniques

Other signals

  • optimizing ML workloads
  • accelerating deep learning and GenAI workloads