Senior Software Engineer, AI Inference Systems

NVIDIA · Semiconductors · Toronto, ON

NVIDIA is seeking a Senior Software Engineer to build and optimize AI inference systems for large-scale models, focusing on extreme efficiency and performance across multi-GPU, multi-node, and multi-cloud environments. The role involves architecting inference stacks, optimizing GPU kernels and compilers, driving benchmarks (MLPerf), and orchestrating large-scale deployments.

What you'd actually do

  1. Contribute features to vLLM that enable the newest models to use the latest NVIDIA GPU hardware features; profile and optimize the inference framework (vLLM) with methods such as speculative decoding, data/tensor/expert/pipeline parallelism, and prefill-decode disaggregation.
  2. Develop, optimize, and benchmark GPU kernels (hand-tuned and compiler-generated) using techniques such as fusion, autotuning, and memory/layout optimization; build and extend high-level DSLs and compiler infrastructure to boost kernel developer productivity while approaching peak hardware utilization.
  3. Define and build inference benchmarking methodologies and tools; contribute both new benchmarks and NVIDIA’s submissions to the industry-leading MLPerf Inference benchmarking suite.
  4. Architect the scheduling and orchestration of containerized large-scale inference deployments on GPU clusters across clouds.
  5. Conduct and publish original research that pushes the Pareto frontier of the ML Systems field; survey recent publications and find ways to integrate research ideas and prototypes into NVIDIA’s software products.

Skills

Required

  • Python
  • C/C++
  • CS fundamentals: algorithms & data structures, operating systems, computer architecture, parallel programming, distributed systems, deep learning theory
  • performance engineering in ML frameworks (e.g., PyTorch) and inference engines (e.g., vLLM and SGLang)
  • GPU programming and performance: CUDA, memory hierarchy, streams, NCCL
  • profiling/debug tools (e.g., Nsight Systems/Compute)
  • containers and orchestration (Docker, Kubernetes, Slurm)
  • Linux namespaces and cgroups
  • debugging, problem-solving, and communication skills

Nice to have

  • Go
  • Rust
  • ML compilers and DSLs (e.g., Triton, TorchDynamo/Inductor, MLIR/LLVM, XLA)
  • GPU libraries (e.g., CUTLASS)
  • CUDA Graph
  • Tensor Cores
  • containerization/virtualization technologies such as containerd/CRI-O/CRIU
  • cloud platforms (AWS/GCP/Azure)
  • infrastructure as code
  • CI/CD
  • production observability
  • open-source projects
  • published papers and artifacts

What the JD emphasized

  • extreme efficiency
  • high-performance inference stacks
  • optimize GPU kernels and compilers
  • push the frontier of accelerated computing for AI
  • original research that pushes the Pareto frontier for the field of ML Systems
  • top-tier publications in ML Systems, GPU architecture, or high-performance computing

Other signals

  • large-scale models
  • extreme efficiency
  • high-performance inference stacks
  • GPU kernels and compilers
  • industry benchmarks
  • multi-GPU, multi-node, multi-cloud