Principal GenAI Inference Optimization Engineer

AMD · Semiconductors · San Jose, CA · Engineering

This role focuses on improving the performance, efficiency, and scalability of generative AI inference workloads on AMD GPU platforms. The Principal GenAI Inference Optimization Engineer will optimize latency, throughput, and cost efficiency for large-scale models, working across the software-hardware stack, including kernels, runtimes, and serving frameworks.

What you'd actually do

  1. Optimize performance of GenAI inference workloads on AMD GPU platforms across single-node and distributed environments.
  2. Improve latency, throughput, and cost efficiency for LLM and multimodal model serving in production.
  3. Analyze and resolve bottlenecks across compute, memory, and communication (e.g., kernel efficiency, KV-cache usage, memory bandwidth, scheduling).
  4. Contribute to cross-stack optimizations spanning kernels, runtimes, communication libraries, and inference/serving frameworks (e.g., vLLM, SGLang, Triton, or similar systems).
  5. Implement and evaluate inference optimization techniques such as batching strategies, quantization, prefix caching, and speculative decoding.

Skills

Required

  • GenAI inference optimization
  • GPU performance
  • large-scale serving systems
  • GPU architecture
  • memory systems
  • communication patterns
  • kernels
  • runtimes
  • frameworks
  • serving systems
  • batching strategies
  • quantization
  • prefix caching
  • speculative decoding
  • request scheduling
  • resource utilization
  • profiling
  • benchmarking
  • performance analysis tools
  • Python
  • C++
  • CUDA
  • HIP

Nice to have

  • distributed systems
  • ML frameworks (PyTorch, JAX, or TensorFlow)
  • vLLM
  • SGLang
  • Triton
  • TensorRT-LLM

What the JD emphasized

  • GenAI inference optimization
  • LLM and multimodal model serving
  • inference optimization techniques
  • inference/serving frameworks

Other signals

  • Optimize performance of GenAI inference workloads on AMD GPU platforms
  • Improve latency, throughput, and cost efficiency for LLM and multimodal model serving in production
  • Contribute to cross-stack optimizations spanning kernels, runtimes, communication libraries, and inference/serving frameworks