Senior Software Engineer - AI Inference

NVIDIA · Semiconductors · Santa Clara, CA +3 · Remote

Senior Software Engineer optimizing and contributing to open-source LLM inference serving engines such as vLLM and SGLang so they run efficiently on NVIDIA GPUs, with an emphasis on high-throughput, low-latency inference at scale.

What you'd actually do

  1. Contribute features, fixes, and optimizations upstream to vLLM/SGLang: author PRs, participate in reviews, write benchmarks/tests, and help drive designs to completion.
  2. Implement and optimize inference‑runtime capabilities: batching and scheduling policies, streaming, request lifecycle management, and KV‑cache efficiency (paging/sharding) to improve throughput and tail latency.
  3. Profile and improve hot paths across layers, from Python orchestration to C++/CUDA kernels, using data to guide optimization work.
  4. Improve multi‑GPU inference performance and reliability: parallelism strategies, communication patterns, and resource utilization across NVIDIA platforms.
  5. Build and maintain performance and correctness regression tests to prevent slowdowns and ensure stable behavior across model and hardware configurations.

Skills

Required

  • systems engineering fundamentals
  • LLM inference/serving stacks
  • Python
  • C++
  • CUDA
  • profiling
  • performance investigation
  • distributed systems concepts
  • concurrency
  • open-source communities

Nice to have

  • Open-source contributions to vLLM, SGLang, PyTorch, Triton, NCCL, Dynamo
  • Performance work (attention/KV cache efficiency, speculative decoding, scheduler improvements, quantization-aware serving, streaming latency reductions)
  • Reproducible benchmarking and performance regression infrastructure
  • Systems performance background (memory bandwidth, kernel fusion, PCIe/NVLink effects, network fabrics)

What the JD emphasized

  • 5+ years building production software with solid systems engineering fundamentals and a track record of delivering performance or reliability improvements.
  • Experience with LLM inference/serving stacks (e.g., vLLM, SGLang) and an understanding of the tradeoffs that drive real production performance.
  • Strong programming skills in Python plus C++ and/or CUDA; ability to debug and optimize performance‑critical code.
  • Experience with profiling and performance investigation (microbenchmarks, flame graphs, GPU profiling) and a measurement‑driven mindset.
  • Familiarity with distributed systems concepts and concurrency (queues/schedulers, multi‑process/multi‑threading, scaling across GPUs/nodes).

Other signals

  • shipping high-quality changes
  • improving the underlying stack
  • contributing directly to upstream inference engines