Lead GPU Kernel Optimization Engineer

AMD · Semiconductors · Hyderabad, India · Engineering

This role focuses on optimizing low-level GPU kernels to accelerate inference and training of large machine learning models. It involves multi-GPU and multi-node optimization, performance profiling, and parallel computing techniques. The candidate will work with frameworks such as PyTorch and vLLM, and the role requires deep expertise in CUDA, GPU programming, and C++ optimization.

What you'd actually do

  1. Develop and optimize low-level GPU kernels to accelerate inference and training of large machine learning models. Maximize computational efficiency and reduce execution time while preserving model accuracy.
  2. Design and implement strategies for distributed model training and inference across multiple GPUs and nodes. Address data parallelism and model parallelism challenges to fully utilize available resources.
  3. Profile and analyze system and application performance to identify bottlenecks and areas for improvement. Use profiling tools to understand and optimize hardware resource utilization.
  4. Leverage parallel computing techniques to improve the scalability and performance of machine learning workloads. Implement multi-threading and GPU synchronization techniques.
  5. Develop benchmarks and testing procedures to assess the performance and stability of optimized models and frameworks. Ensure that the solutions meet or exceed the defined performance criteria.

Skills

Required

  • GPU Kernel Optimization
  • C++17, C++20, C++23
  • PyTorch
  • vLLM
  • CUTLASS
  • Kokkos
  • GPU/CPU architectures
  • Python
  • CUDA
  • GPU programming
  • Distributed computing
  • Multi-GPU environments
  • Performance profiling tools
  • Parallel computing

Nice to have

  • Optimization in assembly
  • Software development processes
  • Good coding practices
  • Design pattern identification and implementation
  • Agile processes and procedures

What the JD emphasized

  • Optimizing GPU kernels in C++17, C++20, and C++23
  • Strong experience in low-level GPU kernel optimization
  • Proficiency in CUDA and GPU programming
  • Experience with distributed computing and multi-GPU environments

Other signals

  • optimize GPU kernels for ML inference and training
  • distributed model training and inference
  • GPU/CPU architectures
  • CUDA and GPU programming