Senior Software Engineer, AI Inference Systems

NVIDIA · Semiconductors · Santa Clara, CA

Senior Software Engineer focused on building and optimizing AI inference systems, including vLLM, GPU kernels, and orchestration for large-scale model deployments. The role spans performance engineering, benchmarking (MLPerf), and integrating research into NVIDIA's software products.

What you'd actually do

  1. Contribute features to vLLM that bring the newest models to the latest NVIDIA GPU hardware features; profile and optimize the inference framework with techniques such as speculative decoding, data/tensor/expert/pipeline parallelism, and prefill-decode disaggregation (a toy draft-and-verify sketch follows this list).
  2. Develop, optimize, and benchmark GPU kernels (hand-tuned and compiler-generated) using techniques such as fusion, autotuning, and memory/layout optimization; build and extend high-level DSLs and compiler infrastructure to boost kernel developer productivity while approaching peak hardware utilization (an autotuning sketch follows this list).
  3. Define and build inference benchmarking methodologies and tools; contribute both new benchmarks and NVIDIA's submissions to the industry-leading MLPerf Inference benchmarking suite (a minimal latency/throughput harness is sketched below).
  4. Architect the scheduling and orchestration of containerized large-scale inference deployments on GPU clusters across clouds.
  5. Conduct and publish original research that pushes the Pareto frontier of the field of ML systems; survey recent publications and integrate promising research ideas and prototypes into NVIDIA's software products.
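
For item 1, a minimal sketch of the draft-and-verify loop at the heart of speculative decoding. The `draft_model` and `target_model` callables and the greedy accept rule are illustrative assumptions, not vLLM's API; real systems verify against sampled distributions and batch the checks into one forward pass.

```python
from typing import Callable, List

def speculative_decode(
    target_model: Callable[[List[int]], int],  # greedy next-token fn -- illustrative stand-in
    draft_model: Callable[[List[int]], int],   # cheaper model, same interface -- illustrative
    prompt: List[int],
    max_new_tokens: int = 32,
    k: int = 4,                                # draft tokens proposed per step
) -> List[int]:
    """Draft k tokens with the cheap model, verify with the target model,
    and keep the longest verified prefix. One target pass can accept up to
    k tokens at once, which is where the speedup comes from."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) Draft: propose k tokens autoregressively with the cheap model.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_model(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify: check each draft token in order against the target
        #    (a real engine fuses these checks into a single batched pass).
        accepted = 0
        for i, t in enumerate(draft):
            if target_model(tokens + draft[:i]) == t:
                accepted += 1
            else:
                break
        tokens.extend(draft[:accepted])
        # 3) Always emit one target token so we progress even on rejection.
        tokens.append(target_model(tokens))
    return tokens[: len(prompt) + max_new_tokens]

# Toy usage: "models" that emit deterministic next tokens.
if __name__ == "__main__":
    target = lambda ctx: (ctx[-1] + 1) % 100
    draft = lambda ctx: (ctx[-1] + 1) % 100  # agrees with target -> all drafts accepted
    print(speculative_decode(target, draft, [0], max_new_tokens=8))
```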
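For item 2, a hedged sketch of the autotuning idea: time every candidate kernel configuration and keep the fastest. The pure-Python blocked matmul and its tile-size search space are toy stand-ins; a real autotuner (e.g., Triton's) searches GPU launch parameters the same way.

```python
import time

def blocked_matmul(a, b, block: int):
    """Toy cache-blocked matmul over Python lists; `block` is the tile size."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + block, p)):
                            c[i][j] += aik * b[k][j]
    return c

def autotune(kernel, configs, args, warmup=1, iters=3):
    """Benchmark each config and return the fastest (exhaustive search)."""
    best_cfg, best_t = None, float("inf")
    for cfg in configs:
        for _ in range(warmup):          # warmup run excluded from timing
            kernel(*args, **cfg)
        t0 = time.perf_counter()
        for _ in range(iters):
            kernel(*args, **cfg)
        t = (time.perf_counter() - t0) / iters
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg, best_t

if __name__ == "__main__":
    n = 64
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    cfg, t = autotune(blocked_matmul, [{"block": b} for b in (4, 8, 16, 32)], (a, a))
    print(f"best block={cfg['block']}, {t * 1e3:.1f} ms/iter")
```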
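For item 3, benchmarking methodology largely comes down to careful measurement: warmup to exclude cold starts, a fixed request count, and tail-latency percentiles rather than means. A minimal harness, with `send_request` as a hypothetical stand-in for one inference client call:

```python
import time
import statistics

def benchmark(send_request, num_requests=100, warmup=10):
    """Measure per-request latency and overall throughput for a callable.
    `send_request` is a hypothetical stand-in for one inference call."""
    for _ in range(warmup):              # warmup: exclude cold-start effects
        send_request()
    latencies = []
    t0 = time.perf_counter()
    for _ in range(num_requests):
        start = time.perf_counter()
        send_request()
        latencies.append(time.perf_counter() - start)
    wall = time.perf_counter() - t0
    latencies.sort()
    return {
        "throughput_rps": num_requests / wall,
        "p50_ms": 1e3 * latencies[len(latencies) // 2],
        "p99_ms": 1e3 * latencies[int(len(latencies) * 0.99)],
        "mean_ms": 1e3 * statistics.mean(latencies),
    }

if __name__ == "__main__":
    print(benchmark(lambda: time.sleep(0.002)))  # fake 2 ms "request"
```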

Skills

Required

  • Python
  • C/C++
  • CS fundamentals: algorithms & data structures, operating systems, computer architecture, parallel programming, distributed systems, deep learning theory
  • performance engineering in ML frameworks (e.g., PyTorch) and inference engines (e.g., vLLM and SGLang)
  • GPU programming and performance: CUDA, memory hierarchy, streams, NCCL (see the stream-overlap sketch after this list)
  • profiling/debug tools (e.g., Nsight Systems/Compute)
  • containers and orchestration (Docker, Kubernetes, Slurm)
  • Linux namespaces and cgroups
  • debugging
  • problem-solving
  • communication skills
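
One concrete instance of the streams bullet above: overlapping host-to-device copies with compute using pinned memory and a side stream. This is a sketch built on the public torch.cuda API; it needs a CUDA-capable GPU to run.

```python
import torch

def overlapped_forward(batches, model):
    """Copy batch i+1 to the GPU on a side stream while batch i computes
    on the default stream; pinned host memory makes the copy truly async."""
    assert torch.cuda.is_available()
    copy_stream = torch.cuda.Stream()
    outputs, next_dev = [], None
    for host in batches:
        host = host.pin_memory()                   # required for async H2D copies
        with torch.cuda.stream(copy_stream):
            dev = host.to("cuda", non_blocking=True)
        if next_dev is not None:
            outputs.append(model(next_dev))        # compute overlaps the copy above
        torch.cuda.current_stream().wait_stream(copy_stream)  # copy done before use
        # production code would also call dev.record_stream(...) for allocator safety
        next_dev = dev
    outputs.append(model(next_dev))
    return outputs

if __name__ == "__main__":
    model = torch.nn.Linear(1024, 1024).cuda()
    batches = [torch.randn(256, 1024) for _ in range(4)]
    outs = overlapped_forward(batches, model)
    torch.cuda.synchronize()
    print(len(outs), outs[0].shape)
```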

Nice to have

  • Go
  • Rust
  • ML compilers and DSLs (e.g., Triton, TorchDynamo/Inductor, MLIR/LLVM, XLA) (a toy Triton kernel follows this list)
  • GPU libraries (e.g., CUTLASS)
  • containerization/virtualization technologies such as containerd/CRI-O/CRIU
  • cloud platforms (AWS/GCP/Azure)
  • infrastructure as code
  • CI/CD
  • production observability
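
Triton, from the DSL bullet above, is a good first taste of the kernel-DSL skill set: block-level Python that compiles to GPU code. The standard vector-add sketch below uses Triton's public API but needs triton and a CUDA GPU to run; the 1024-element tile size is an arbitrary illustrative choice.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide tile of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the ragged final tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)       # one program per 1024-element tile
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

if __name__ == "__main__":
    x = torch.randn(4097, device="cuda")
    y = torch.randn(4097, device="cuda")
    torch.testing.assert_close(add(x, y), x + y)
    print("ok")
```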

What the JD emphasized

  • extreme efficiency
  • high-performance inference stacks
  • optimize GPU kernels and compilers
  • push the frontier of accelerated computing for AI
  • profile and optimize the inference framework
  • peak hardware utilization
  • MLPerf Inference benchmarking suite
  • large-scale inference deployments
  • original research that pushes the Pareto frontier
  • ML Systems

Other signals

  • Serve large-scale models with extreme efficiency
  • Optimize GPU kernels and compilers
  • Scale workloads across multi-GPU, multi-node, and multi-cloud environments
  • Push the frontier of accelerated computing for AI