AI Product Performance Engineer

AMD · Semiconductors · Shenzhen, China · Engineering

This role focuses on designing, implementing, and optimizing high-performance GPU kernels for AI/ML workloads, specifically data center AI applications such as LLMs and generative AI. The engineer will analyze and optimize kernel execution for latency and throughput, evaluate each kernel's performance impact on full-stack AI models, and use profiling tools to identify bottlenecks. Another key responsibility is collaborating with software stack teams to integrate optimized kernels into high-level frameworks and inference engines.

What you'd actually do

  1. Design, implement, and optimize high-performance GPU kernels for AI/ML workloads to maximize hardware utilization.
  2. Analyze and optimize kernel execution for latency and throughput, addressing bottlenecks in memory bandwidth, instruction latency, and thread divergence.
  3. Evaluate the end-to-end performance impact of individual kernels on full-stack AI models, ensuring that micro-optimizations translate to application-level speedups.
  4. Use advanced GPU profiling tools (e.g., ROCm Profiler, PyTorch Profiler) to identify performance cliffs, pipeline stalls, and memory hierarchy inefficiencies.
  5. Collaborate with software stack teams to expose optimized kernels within high-level frameworks and inference engines.

Skills

Required

  • C++
  • parallel computing
  • NVIDIA CUDA or AMD HIP kernel programming
  • GPU profiling tools
  • GPU architectures

Nice to have

  • OpenAI Triton
  • vLLM
  • SGLang
  • TensorRT-LLM
  • PyTorch custom extensions
  • Python DSLs
  • Hardware agnosticism

What the JD emphasized

  • high-performance GPU kernels
  • AI/ML workloads
  • performance optimization
  • latency and throughput
  • GPU profiling tools
  • GPU architectures
  • NVIDIA CUDA
  • AMD HIP
  • OpenAI Triton
  • vLLM
  • SGLang
  • TensorRT-LLM
  • PyTorch

Other signals

  • High-Performance Kernel Development
  • Performance Optimization
  • Workload Analysis
  • Profiling & Tuning
  • Architecture Adaptation
  • Framework Integration