Senior Software Engineer, AI Inference Systems

NVIDIA · Semiconductors · Germany +5 · Remote

Senior Software Engineer role focused on building and optimizing AI inference systems for large-scale models. The work spans GPU kernel optimization, inference framework development (vLLM), benchmarking (MLPerf), and orchestration of distributed deployments.

What you'd actually do

  1. Contribute features to vLLM that let the newest models take advantage of the latest NVIDIA GPU hardware features; profile and optimize vLLM with methods such as speculative decoding, data/tensor/expert/pipeline parallelism, and prefill-decode disaggregation.
  2. Develop, optimize, and benchmark GPU kernels (hand-tuned and compiler-generated) using techniques such as fusion, autotuning, and memory/layout optimization; build and extend high-level DSLs and compiler infrastructure to boost kernel developer productivity while approaching peak hardware utilization.
  3. Define and build inference benchmarking methodologies and tools; contribute both new benchmarks and NVIDIA’s submissions to the industry-leading MLPerf Inference benchmarking suite.
  4. Architect the scheduling and orchestration of containerized large-scale inference deployments on GPU clusters across clouds.
  5. Conduct and publish original research that pushes the Pareto frontier of ML systems; survey recent publications and find ways to integrate research ideas and prototypes into NVIDIA’s software products.

Skills

Required

  • Python
  • C/C++
  • algorithms & data structures
  • operating systems
  • computer architecture
  • parallel programming
  • distributed systems
  • deep learning theory
  • performance engineering in ML frameworks
  • inference engines
  • GPU programming
  • CUDA
  • memory hierarchy
  • streams
  • NCCL
  • profiling/debug tools
  • containers
  • orchestration
  • Linux namespaces
  • cgroups
  • debugging
  • problem-solving
  • communication skills

Nice to have

  • Go
  • Rust
  • vLLM
  • SGLang
  • Nsight Systems/Compute
  • Docker
  • Kubernetes
  • Slurm
  • containerd/CRI-O/CRIU
  • AWS/GCP/Azure
  • infrastructure as code
  • CI/CD
  • production observability
  • Triton
  • TorchDynamo/Inductor
  • MLIR/LLVM
  • XLA
  • CUTLASS
  • CUDA Graph
  • Tensor Cores

What the JD emphasized

  • 7+ years of experience
  • 5+ years of experience
  • PhD degree with the thesis and top-tier publications
  • performance engineering
  • GPU programming and performance
  • profiling/debug tools
  • containers and orchestration
  • building and optimizing LLM inference engines
  • ML compilers and DSLs
  • GPU libraries
  • cloud platforms
  • production observability
  • Contributions to open-source projects and/or publications

Other signals

  • large-scale models
  • extreme efficiency
  • high-performance inference stacks
  • GPU kernels and compilers
  • multi-GPU, multi-node, and multi-cloud environments
  • push the frontier of accelerated computing for AI