AI Inference Performance Engineer - New College Grad 2026

NVIDIA · Semiconductors · Santa Clara, CA

NVIDIA is seeking an AI Inference Performance Engineer to optimize and benchmark GenAI inference on its accelerators, working with frameworks like TensorRT-LLM, SGLang, and vLLM. The role involves driving industry benchmark results, defining cutting-edge workloads, architecting distributed inference, establishing performance methodology, and influencing the ecosystem through open-source contributions and cross-functional partnerships. It requires strong programming skills, DL framework expertise, and a deep understanding of LLM inference mechanics.

What you'd actually do

  1. Drive industry benchmark results: own the end-to-end optimization pipeline, implement and integrate optimizations in quantization, scheduling, memory management, and distributed inference across TensorRT-LLM, SGLang, and vLLM.
  2. Define and optimize cutting-edge workloads: identify and shape next-generation inference benchmarks covering multi-turn coding, agentic workflows, and other emerging AI use cases. Collaborate with framework and kernel teams to push performance to its limits on large-scale LLM-MoE models, vision-language models, video diffusion models, recommendation, and speech workloads.
  3. Architect distributed inference: design and optimize execution, and manage performance, at every scale from a single GPU to rack-scale clusters.
  4. Establish performance methodology: apply roofline analysis and systematic profiling to decompose bottlenecks across CUDA kernels, frameworks, and serving layers.
  5. Influence the ecosystem: contribute to TensorRT-LLM, vLLM, SGLang, and other open-source projects. Partner with architecture, kernel, and compiler teams to shape GPU roadmaps based on real workload data.
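
For readers unfamiliar with the roofline analysis named in item 4, the idea can be sketched in a few lines. This is a minimal illustration, not anything taken from the posting: the function names and the hardware figures (1000 TFLOP/s peak compute, 2 TB/s memory bandwidth) are hypothetical round numbers chosen for easy arithmetic.

```python
def attainable_flops(peak_flops, peak_bandwidth, arithmetic_intensity):
    """Roofline bound in FLOP/s: the lower of the compute roof and the
    memory roof (bandwidth in bytes/s times intensity in FLOP/byte)."""
    return min(peak_flops, peak_bandwidth * arithmetic_intensity)


def classify(peak_flops, peak_bandwidth, arithmetic_intensity):
    """Label a kernel by comparing its intensity with the ridge point."""
    ridge = peak_flops / peak_bandwidth  # FLOP/byte where the two roofs meet
    return "memory-bound" if arithmetic_intensity < ridge else "compute-bound"


# Hypothetical accelerator, not a real product specification.
PEAK_FLOPS = 1000e12      # 1000 TFLOP/s
PEAK_BANDWIDTH = 2e12     # 2 TB/s

# Batch-1 decode with 16-bit weights does roughly 2 FLOPs per 2-byte weight
# read, so its arithmetic intensity is about 1 FLOP/byte.
decode_intensity = 1.0
decode_bound = attainable_flops(PEAK_FLOPS, PEAK_BANDWIDTH, decode_intensity)
decode_kind = classify(PEAK_FLOPS, PEAK_BANDWIDTH, decode_intensity)
```

Under these assumed numbers the ridge point sits at 500 FLOP/byte, so a kernel at about 1 FLOP/byte is limited by memory bandwidth rather than compute, which is the usual explanation for the decode-phase bottlenecks listed under Skills.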

Skills

Required

  • BS, MS, or PhD in Computer Science, Computer Engineering, Electrical Engineering, or equivalent experience.
  • 2+ years of relevant software development experience.
  • Strong Python or C++ programming, software design, and software engineering skills.
  • Expertise with a DL framework such as PyTorch or JAX.
  • Proven track record of delivering measurable performance improvements in deep learning inference or high-performance systems.
  • Deep understanding of LLM/VLM architectures and inference mechanics: attention, KV caching, batching strategies, decode-phase bottlenecks, speculative decoding, disaggregated serving, etc.

Nice to have

  • Prior experience in inference, deployment, algorithms, or implementation with an LLM framework (TensorRT-LLM, vLLM, SGLang, etc.) or a DL compiler.
  • Prior experience with performance modeling, profiling, debugging, and code optimization of a DL/HPC/high-performance application.
  • Experience with scale-out inference orchestration (MPI, NCCL, K8S) on large GPU clusters.
  • Expertise in kernel development (CUTLASS, cuteDSL, tilelang, OpenAI Triton) or compiler/runtime paths (torch.compile, graph lowering, operator fusion). Architectural knowledge of CPU, GPU, FPGA, or other DL accelerators; GPU programming experience (CUDA).
  • Track record of leading ambiguous, high-impact technical programs across multiple teams under tight deadlines.

What the JD emphasized

  • Proven track record of delivering measurable performance improvements in deep learning inference or high-performance systems.
  • Deep understanding of LLM/VLM architectures and inference mechanics: attention, KV caching, batching strategies, decode-phase bottlenecks, speculative decoding, disaggregated serving, etc.

Other signals

  • optimizing inference performance
  • benchmarking GenAI
  • TensorRT-LLM, SGLang, vLLM