AI Framework Engineer

AMD · Semiconductors · Shanghai, China · Engineering

Software engineering role focused on building and optimizing robust, efficient software components for high-performance execution of large language models and multimodal models on multi-GPU systems. The role involves collaborating with internal teams and open-source maintainers to improve throughput, latency, and scalability, with an emphasis on full-stack development across AI inference systems and an understanding of model behavior.

What you'd actually do

  1. Deep Learning & LLM Framework Optimization: Optimize major DL/LLM frameworks (PyTorch, vLLM, SGLang) for AMD GPUs and contribute improvements upstream.
  2. Model-Aware Implementation: Build features that interact closely with LLMs and multimodal architectures (e.g., Llama, Qwen-VL, Wan), which requires an understanding of attention mechanisms, cross-modal fusion, KV caching, and quantization.
  3. Performance-Conscious Coding: Write efficient, scalable code while considering memory usage, concurrency, and bottlenecks in multi-GPU environments.
  4. Profiling: Use profiling tools to evaluate the impact of your changes, identify regressions, and validate performance improvements as part of the development cycle.
  5. End-to-End Performance Engineering: Perform comprehensive profiling to identify bottlenecks and implement system, memory, and communication optimizations across multi-GPU and multi-node setups.

Skills

Required

  • Python
  • Linux development environment
  • Understanding of LLM or multimodal model concepts
  • Transformer architectures
  • Attention mechanisms
  • Vision-language alignment
  • Inference pipelines
  • Transformer/Attention/MoE/KV Cache
  • quantization (FP8/FP4)
  • command-line tools
  • Git
  • debugging/profiling utilities
  • End-to-End LLM Performance Engineering
  • compute, memory, and communication bottlenecks
  • multi-GPU and multi-node environments
  • Software Engineering Excellence

Nice to have

  • C++
  • async programming
  • GPU Kernel Development & Optimization
  • HIP
  • CUDA
  • ASM
  • CK
  • CUTLASS
  • Triton
  • Compiler & System-Level Optimization
  • LLVM
  • ROCm
  • compiler-driven techniques
  • multimodal models (e.g., Qwen-VL, Qwen-Image-Edit, Wan)
  • diffusion-based generative models
  • GPU computing (ROCm, CUDA)
  • performance profiling tools (e.g., PyTorch Profiler)
  • Distributed Systems Experience
  • distributed inference for large-scale models
  • Tensor Parallel
  • Pipeline Parallel
  • open-source contributions
  • self-motivation

What the JD emphasized

  • production-quality code
  • performance
  • multi-GPU
  • multi-node
  • LLM
  • multimodal models
  • optimization

Other signals

  • Optimizing DL/LLM frameworks for AMD GPUs
  • Implementing features for throughput, latency, and scalability
  • Full-stack development within AI inference systems
  • Production-quality code balancing functionality, correctness, and performance