GPU Kernel Development Engineer

AMD · Semiconductors · Shanghai, China · Engineering

GPU Kernel Development Engineer at AMD, focused on optimizing deep learning frameworks (TensorFlow, PyTorch) for AMD GPUs. The role involves developing and optimizing GPU kernels and deep learning models, and improving training/inference performance across distributed systems, using compiler technologies and low-level programming. Close collaboration with internal GPU library teams and open-source maintainers is a core part of the job.

What you'd actually do

  1. Optimize Deep Learning Frameworks: Enhance and optimize frameworks like TensorFlow and PyTorch for AMD GPUs in open-source repositories.
  2. Develop GPU Kernels: Create and optimize GPU kernels to maximize performance for specific AI operations.
  3. Develop & Optimize Models: Design and optimize deep learning models specifically for AMD GPU performance.
  4. Collaborate with GPU Library Teams: Work closely with internal teams to analyze and improve training and inference performance on AMD GPUs.
  5. Collaborate with Open-Source Maintainers: Engage with framework maintainers to ensure code changes are aligned with requirements and integrated upstream.

Skills

Required

  • C++ development
  • Linux environments
  • software engineering best practices
  • Python
  • debugging
  • performance tuning
  • test design

Nice to have

  • GPU Kernel Development & Optimization
  • HIP
  • CUDA
  • assembly (ASM)
  • AMD architectures (GCN, RDNA)
  • low-level programming
  • Composable Kernel (CK)
  • CUTLASS
  • Triton
  • multi-GPU
  • multi-platform performance
  • Deep Learning Integration
  • TensorFlow
  • PyTorch
  • scaling
  • throughput
  • High-Performance Computing
  • heterogeneous compute clusters
  • Compiler Optimization
  • compiler theory
  • LLVM
  • ROCm

What the JD emphasized

  • deep learning frameworks
  • AMD GPUs
  • GPU kernels
  • deep learning models
  • training/inference performance
  • multi-GPU
  • multi-node systems
  • compiler technologies
  • HIP
  • CUDA
  • assembly (ASM)
  • AMD architectures (GCN, RDNA)
  • low-level programming
  • Composable Kernel (CK)
  • CUTLASS
  • Triton
  • TensorFlow
  • PyTorch
  • Python
  • C++
  • LLVM
  • ROCm

Other signals

  • optimizing deep learning frameworks
  • enhancing GPU kernels
  • training/inference performance
  • multi-GPU and multi-node systems
  • compiler technologies