Senior Software Development Engineer – LLM Inference Framework

AMD · Semiconductors · Santa Clara, CA · Engineering

This Senior Software Development Engineer role focuses on building and optimizing LLM inference frameworks and runtimes on AMD GPUs. It involves architecting distributed inference solutions, improving performance and scalability, and contributing to open-source projects such as vLLM and SGLang. It requires expertise in systems-level optimization, GPU architectures, and high-performance computing.

What you'd actually do

  1. Architect and optimize distributed LLM inference runtimes based on in-house LLM engines or open-source stacks such as vLLM, SGLang, and llm-d
  2. Design and improve TP / PP / EP (MoE) hybrid execution, including KV-cache management, attention dispatch, and token scheduling
  3. Implement and optimize multi-node inference pipelines using RCCL, RDMA, and collective-based execution
  4. Drive improvements in throughput, latency, and memory efficiency across single-GPU systems and multi-GPU clusters
  5. Work with AMD GPU libraries (AITER, HIPBLAS-LT, RCCL, ROCm runtime) to ensure inference frameworks efficiently use FP8 / FP4 GEMM and FlashAttention / MLA

Skills

Required

  • Python
  • C/C++
  • debugging
  • performance tuning
  • test design
  • large-scale systems
  • heterogeneous GPU clusters
  • compiler and runtime systems
  • GPU code generation
  • distributed systems
  • GPU runtime and kernel backends
  • throughput
  • latency
  • memory movement
  • scheduling
  • large-scale inference frameworks
  • debugging performance across GPUs and nodes
  • collaborating with kernel, compiler, and networking teams
  • open source
  • driving architecture-level improvements
  • in-house LLM engines
  • open-source stacks
  • TP / PP / EP (MoE) hybrid execution
  • KV-cache management
  • attention dispatch
  • token scheduling
  • multi-node inference pipelines
  • RCCL
  • RDMA
  • collective-based execution
  • continuous batching
  • speculative decoding
  • KV-cache paging
  • prefix caching
  • multi-turn serving
  • AMD GPU libraries
  • AITER
  • HIPBLAS-LT
  • ROCm runtime
  • FP8 / FP4 GEMM
  • FlashAttention / MLA
  • Triton
  • LLVM
  • ROCm
  • framework-level performance
  • vLLM
  • SGLang
  • llm-d
  • customer PoCs
  • production deployments
  • benchmark-grade inference pipelines
  • PyTorch
  • TensorFlow
  • high-throughput and scalable inference
  • NVIDIA GPU architectures
  • AMD GPU architectures
  • kernel development
  • optimizing for efficiency and scalability

Nice to have

  • Master’s or PhD in Computer Science, Computer Engineering, Electrical Engineering, or a related field

What the JD emphasized

  • production-grade
  • distributed inference runtimes
  • open-source inference frameworks
  • vLLM
  • SGLang
  • customer-facing deployments
  • benchmarking platforms
  • AMD GPUs
  • large-scale inference frameworks
  • performance across GPUs and nodes
  • open source
  • inference platforms
  • multi-node inference pipelines
  • multi-GPU clusters
  • AMD GPU libraries
  • compiler teams
  • upstream features and performance fixes
  • customer PoCs and production deployments
  • benchmark-grade inference pipelines
  • distributed inference scaling
  • contributing to upstream open-source projects
  • high-throughput and scalable inference
  • GPU architectures
  • kernel development
  • large-scale systems
  • heterogeneous GPU clusters
  • compiler and runtime systems
  • GPU code generation

Other signals

  • optimize production-grade inference runtimes
  • enabling tensor parallelism, pipeline parallelism, expert parallelism (MoE)
  • upstreamed into open-source inference frameworks such as vLLM and SGLang