Sr. Software Development Engineer – GPU Kernel Development

AMD · Semiconductors · Santa Clara, CA · Engineering

Senior software development engineer focused on optimizing deep learning frameworks and GPU kernels for AMD GPUs, improving training and inference performance on multi-GPU and multi-node systems with advanced compiler technologies.

What you'd actually do

  1. Optimize Deep Learning Frameworks: Enhance and optimize frameworks like TensorFlow and PyTorch for AMD GPUs in open-source repositories.
  2. Develop GPU Kernels: Create and optimize GPU kernels to maximize performance for specific AI operations.
  3. Develop & Optimize Models: Design and optimize deep learning models specifically for AMD GPU performance.
  4. Collaborate with GPU Library Teams: Work closely with internal teams to analyze and improve training and inference performance on AMD GPUs.
  5. Collaborate with Open-Source Maintainers: Engage with framework maintainers to ensure code changes are aligned with requirements and integrated upstream.

Skills

Required

  • C++
  • Linux
  • TensorFlow
  • PyTorch
  • Python
  • performance tuning
  • debugging
  • test design
  • large-scale, heterogeneous compute environments
  • compiler internals
  • LLVM
  • ROCm

Nice to have

  • HIP
  • CUDA
  • assembly (ASM)
  • AMD architectures (GCN, RDNA)
  • low-level programming
  • Composable Kernel (CK)
  • CUTLASS
  • Triton
  • graph compilers

What the JD emphasized

  • deep learning frameworks
  • AMD GPUs
  • GPU kernels
  • training and inference performance
  • multi-GPU and multi-node systems
  • compiler technologies
  • Deep expertise in designing and optimizing GPU kernels for deep learning on AMD GPUs using HIP, CUDA, and assembly (ASM).
  • Strong knowledge of AMD architectures (GCN, RDNA) and low-level programming to maximize performance for AI operations, using tools like Composable Kernel (CK), CUTLASS, and Triton for multi-GPU and multi-platform performance.
  • Proven experience integrating GPU-accelerated compute into ML frameworks (e.g., PyTorch, TensorFlow), with a focus on throughput, scalability, and efficient execution for training and inference workloads.
  • Thorough understanding of compiler internals, LLVM, and ROCm, with the ability to drive system-level optimizations from source to machine code.

Other signals

  • optimizing deep learning frameworks
  • enhancing GPU kernels
  • training/inference performance
  • multi-GPU and multi-node systems
  • compiler technologies