Senior Software Engineer, AI Inference

NVIDIA · Semiconductors · Toronto, ON

Senior Software Engineer focused on optimizing and scaling AI inference for large language models, working with customers and contributing to open-source projects like vLLM.

What you'd actually do

  1. Work directly with customer engineering teams through long-term technical partnerships, understanding their LLM serving architectures and performance goals, then designing and implementing end-to-end benchmarking campaigns across Kubernetes and Slurm environments to surface actionable insights.
  2. Set up and operate vLLM serving deployments on GPU clusters, tune configurations for throughput, latency, and efficiency, and collect Nsight Systems / Nsight Compute profiling traces to identify performance gaps relative to reference frameworks.
  3. Develop detailed performance plans based on profiling findings and collaborate with NVIDIA's kernel engineering and OSS vLLM teams to drive improvements that benefit both your customers and the broader community.
  4. Build internal tools, benchmarking harnesses, and automation pipelines that raise the productivity of teammates and customers alike, acting as a multiplier who makes everyone around you more effective.
  5. Document architectures, findings, and recommendations with clarity for technical audiences, and contribute improvements back to vLLM and related open-source projects where appropriate.

Skills

Required

  • 5+ years of industry experience building and operating complex, production-grade software systems.
  • Hands-on experience deploying and operating LLM inference workloads, particularly with vLLM.
  • Proficiency with container orchestration (Kubernetes) and HPC scheduling (Slurm).
  • Solid understanding of LLM serving fundamentals: batching strategies (continuous batching, chunked prefill), KV cache management, and tensor/pipeline parallelism.
  • Familiarity with GPU performance analysis: memory hierarchy, utilization, roofline modeling, and profiling with Nsight Systems or Nsight Compute.
  • Strong written and verbal communication skills.

Nice to have

  • Experience with NVIDIA Dynamo or other disaggregated inference serving frameworks.
  • Contributions to open-source inference or ML systems projects, particularly vLLM or SGLang.
  • Background with ML compilers or GPU kernel development (Triton, CUTLASS, TorchInductor).
  • Experience building developer tools or internal platforms that meaningfully improved team productivity.
  • Prior experience in a customer-facing or forward-deployed engineering capacity within a technical product organization.

What the JD emphasized

  • LLM serving
  • vLLM
  • performance

Other signals

  • customer-facing
  • performance optimization
  • open-source contributions