Principal Software Engineer - AI Inference

NVIDIA · Semiconductors · Santa Clara, CA +1 · Remote

This Principal Software Engineer role focuses on advancing open-source LLM serving: contributing to inference engines such as vLLM and SGLang and optimizing them for NVIDIA GPUs and systems to deliver high-throughput, low-latency inference at scale. It requires deep technical expertise in inference-runtime architecture, GPU performance engineering, and distributed systems.

What you'd actually do

  1. Drive upstream-first engineering in vLLM/SGLang: author and land PRs, engage in development discussions, help shape roadmaps, and build durable maintainer relationships.
  2. Design and implement inference-runtime features that improve efficiency, latency, and tail behavior: request scheduling, batching policies, KV-cache management (paging/sharding), memory planning, and streaming (a toy KV-cache sketch follows this list).
  3. Optimize core hot paths across the stack—from Python orchestration down to C++/CUDA kernels—using profiling and measurement to guide decisions.
  4. Improve multi-GPU and multi-node inference: communication patterns, parallelism strategies (tensor/sequence/pipeline), and system-level scaling and efficiency (see the tensor-parallel sketch after this list).
  5. Strengthen correctness, robustness, and operability: determinism where needed, graceful degradation, backpressure, observability hooks, and performance regression testing.
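
To make item 2 concrete: paged KV-cache management (popularized by vLLM's PagedAttention) maps each request's logical token positions onto fixed-size physical cache blocks, so memory is allocated on demand rather than reserved up front for the maximum sequence length. Below is a minimal sketch, assuming a 16-token block size (vLLM's default, used here purely for illustration); `PagedKVCacheManager` and its methods are hypothetical names, not any engine's real API:

```python
import math
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per physical KV block; illustrative, mirrors vLLM's default

@dataclass
class BlockTable:
    """One request's mapping from logical token positions to physical blocks."""
    blocks: list[int] = field(default_factory=list)

class PagedKVCacheManager:
    """Toy allocator over a fixed pool of physical KV blocks.

    A real engine also handles prefix sharing, copy-on-write, CPU swapping,
    and per-layer tensors; none of that is modeled here.
    """

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))
        self.tables: dict[str, BlockTable] = {}

    def can_admit(self, num_tokens: int) -> bool:
        """Admission check a scheduler would run before starting a request."""
        return math.ceil(num_tokens / BLOCK_SIZE) <= len(self.free_blocks)

    def allocate(self, request_id: str, num_tokens: int) -> BlockTable:
        needed = math.ceil(num_tokens / BLOCK_SIZE)
        if needed > len(self.free_blocks):
            raise MemoryError("out of KV blocks; scheduler should queue or preempt")
        table = BlockTable([self.free_blocks.pop() for _ in range(needed)])
        self.tables[request_id] = table
        return table

    def ensure_capacity(self, request_id: str, seq_len: int) -> None:
        """Grow a table by one block when a decode step crosses a block boundary."""
        table = self.tables[request_id]
        while len(table.blocks) * BLOCK_SIZE < seq_len:
            if not self.free_blocks:
                raise MemoryError("out of KV blocks; scheduler should queue or preempt")
            table.blocks.append(self.free_blocks.pop())

    def free(self, request_id: str) -> None:
        """Return a finished request's blocks to the pool."""
        self.free_blocks.extend(self.tables.pop(request_id).blocks)
```

Fixed-size blocks bound fragmentation, and preemption reduces to releasing a whole table, which is what makes admission control and graceful degradation tractable.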
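
And for item 4, the canonical tensor-parallel communication pattern is a partial matmul per GPU followed by an NCCL all-reduce. A runnable sketch with torch.distributed (file name and tensor dimensions are illustrative; real engines fuse this into kernels and overlap communication with compute):

```python
import torch
import torch.distributed as dist

def row_parallel_linear(x_shard: torch.Tensor, w_shard: torch.Tensor) -> torch.Tensor:
    """Row-parallel linear layer, the core TP pattern.

    x_shard: [batch, in_features // tp]  -- this rank's slice of the activation
    w_shard: [in_features // tp, out_features]
    Each rank computes a partial product; the NCCL all-reduce sums the
    partials so every rank holds the full [batch, out_features] output.
    """
    partial = x_shard @ w_shard
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)
    return partial

if __name__ == "__main__":
    # Launch with: torchrun --nproc-per-node=2 tp_sketch.py
    dist.init_process_group(backend="nccl")
    rank, tp = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank)
    torch.manual_seed(0)  # same CPU seed so every rank builds identical tensors
    x = torch.randn(4, 1024).cuda()    # dims assume divisibility by tp
    w = torch.randn(1024, 4096).cuda()
    x_shard = x.chunk(tp, dim=1)[rank].contiguous()  # split activation columns
    w_shard = w.chunk(tp, dim=0)[rank].contiguous()  # split weight rows
    out = row_parallel_linear(x_shard, w_shard)
    if rank == 0:
        print("max error vs. dense matmul:", (out - x @ w).abs().max().item())
    dist.destroy_process_group()
```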

Skills

Required

  • LLM inference/serving systems
  • Rust
  • C++
  • Python
  • CUDA
  • GPU performance analysis
  • distributed systems
  • concurrency

Nice to have

  • vLLM
  • SGLang
  • PyTorch
  • Triton
  • NCCL
  • paged attention
  • speculative decoding
  • quantization-aware serving
  • low-latency streaming optimizations
  • tokenizer and Python runtime overheads
  • kernel fusion
  • memory bandwidth
  • PCIe/NVLink effects
  • network fabrics
  • InfiniBand
  • benchmarking and regression infrastructure

What the JD emphasized

  • 15+ years building production software with significant depth in systems engineering; strong track record of owning ambiguous, high-impact technical problems end-to-end.
  • Demonstrated expertise in LLM inference/serving systems (e.g., vLLM, SGLang) and the tradeoffs that drive real production performance.
  • Strong programming skills in Rust, C++, Python, CUDA; ability to read, modify, and optimize performance-critical code across layers.
  • Experience with GPU performance analysis tools and methodologies (profiling, microbenchmarking, memory/comms analysis) and a strong measurement culture (a minimal timing harness follows these bullets).
  • Solid foundation in distributed systems and concurrency: queues/schedulers, RPC/streaming, multi-process/multi-threaded runtime behavior, and scaling patterns across nodes.
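
As a small example of the measurement culture that bullet describes: timing GPU code with CUDA events avoids the host-side skew of wall-clock timing around asynchronous kernel launches. A minimal sketch in PyTorch; `benchmark_cuda` is a hypothetical helper, not a tool the JD names:

```python
import torch

def benchmark_cuda(fn, *, warmup: int = 10, iters: int = 100) -> float:
    """Return the mean latency of `fn()` in milliseconds, timed with CUDA events.

    Warmup absorbs one-time costs (lazy init, autotuning, allocator growth);
    events avoid measuring only the asynchronous kernel-launch overhead.
    """
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # elapsed_time returns milliseconds

if __name__ == "__main__":
    a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    ms = benchmark_cuda(lambda: a @ b)
    # A matmul does 2 * M * N * K FLOPs; report achieved TFLOP/s.
    tflops = 2 * 4096**3 / (ms / 1e3) / 1e12
    print(f"{ms:.3f} ms/iter, {tflops:.1f} TFLOP/s")
```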

Other signals

  • LLM serving
  • inference engines
  • NVIDIA GPUs
  • high-throughput, low-latency inference