Sr. Software Development Engineer

AMD · Semiconductors · Beijing, China · Engineering

Software engineer focused on optimizing AI inference systems for AMD GPUs, working on LLMs and multimodal models. Responsibilities include framework optimization, performance-conscious coding, profiling, and collaboration with internal and open-source teams.

What you'd actually do

  1. Deep Learning & LLM Framework Optimization: Optimize major DL/LLM frameworks (PyTorch, vLLM, SGLang) for AMD GPUs and contribute improvements upstream.
  2. Model-Aware Implementation: Build features that interact closely with LLMs and multimodal architectures (e.g., Llama, Qwen-VL, Wan), requiring understanding of attention mechanisms, cross-modal fusion, KV caching, and quantization.
  3. Performance-Conscious Coding: Write efficient, scalable code while considering memory usage, concurrency, and bottlenecks in multi-GPU environments.
  4. Profiling: Use profiling tools to evaluate the impact of your changes, identify regressions, and validate performance improvements as part of the development cycle.
  5. End-to-End Performance Engineering: Perform comprehensive profiling to identify bottlenecks and implement system, memory, and communication optimizations across multi-GPU and multi-node setups.

Skills

Required

  • Python
  • C++
  • Linux development environment
  • Transformer architectures
  • attention mechanisms
  • vision-language alignment
  • inference pipelines
  • Transformer/Attention/MoE/KV Cache
  • quantization (FP8/FP4)
  • command-line tools
  • Git
  • debugging/profiling utilities
  • multi-GPU and multi-node environments
  • distributed inference for large-scale models
  • Tensor Parallel
  • Pipeline Parallel

Nice to have

  • async programming
  • GPU Kernel Development & Optimization
  • HIP
  • CUDA
  • ASM
  • CK
  • CUTLASS
  • Triton
  • Compiler & System-Level Optimization
  • LLVM
  • ROCm
  • multimodal models (e.g., Qwen-VL, Qwen-Image-Edit, Wan)
  • diffusion-based generative models
  • PagedAttention
  • continuous batching
  • speculative decoding
  • GPU computing (ROCm, CUDA)
  • performance profiling tools (e.g., PyTorch Profiler)
  • open-source contributions

What the JD emphasized

  • production-quality code
  • production systems
  • production-quality performance optimizations

Other signals

  • building robust, efficient software components that enable high-performance execution of large language models and multimodal models across multi-GPU systems
  • implement features that improve throughput, latency, and scalability
  • full-stack development within AI inference systems
  • model behavior and framework integration