Senior Deep Learning Software Engineer, TensorRT Performance

NVIDIA · Semiconductors · Santa Clara, CA +1 · Remote

NVIDIA is seeking a Senior Deep Learning Software Engineer to analyze and improve the performance of its deep learning inference ecosystem, with a focus on TensorRT. The role involves optimizing inference solutions for NVIDIA accelerators, contributing to inference frameworks, and developing new model pipelines for generative AI and other applications.

What you'd actually do

  1. Establish performance benchmarking methodologies and analysis workflows, and identify performance issues and opportunities across NVIDIA’s inference ecosystem (e.g. TensorRT, TensorRT-EdgeLLM, Torch-TensorRT).
  2. Contribute features and code to NVIDIA and open-source inference frameworks, including but not limited to TensorRT, TensorRT-EdgeLLM, and Torch-TensorRT.
  3. Develop new model pipelines for NVIDIA’s inference ecosystem with optimized performance, in areas including quantization, scheduling, memory management, and distributed inference, to set the gold standard for generative AI performance.
  4. Collaborate with teams inside and outside NVIDIA across generative AI, automotive, robotics, image understanding, and speech understanding to set direction and develop innovative inference solutions.
  5. Scale performance of deep learning models across different architectures and types of NVIDIA accelerators.
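The benchmarking work described above typically starts with a careful latency-measurement loop. A minimal sketch in plain Python (the `benchmark` helper and its parameters are illustrative, not part of the posting; real TensorRT benchmarking would drive an engine execution context and time it with CUDA events rather than wall-clock timers):

```python
import statistics
import time

def benchmark(fn, warmup=5, iters=50):
    """Hypothetical helper: warm up, then collect per-call latency samples for fn()."""
    for _ in range(warmup):          # warmup runs excluded from measurement
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)  # convert s -> ms
    ordered = sorted(samples)
    return {
        "p50_ms": statistics.median(samples),
        "p99_ms": ordered[max(0, int(0.99 * len(ordered)) - 1)],
        "mean_ms": statistics.fmean(samples),
    }

# Stand-in workload; in practice this would be one inference call on a built engine.
stats = benchmark(lambda: sum(range(10_000)))
```

Reporting percentiles rather than a single average is the usual convention here, since tail latency (p99) often matters as much as median throughput for inference workloads.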

Skills

Required

  • Bachelor's, Master's, PhD, or equivalent experience in a relevant field (Computer Science, Computer Engineering, EECS, AI).
  • At least 3 years of relevant software development experience.
  • Strong C++ and Python programming and software engineering skills.
  • Experience with DL frameworks (e.g. PyTorch, JAX, TensorFlow, ONNX) and inference libraries (e.g. TensorRT, TensorRT-LLM, vLLM, SGLang, FlashInfer).
  • Experience with performance analysis and performance optimization.

Nice to have

  • Strong foundation and architectural knowledge of GPUs.
  • Deep understanding of modern deep learning models and workloads (e.g. Transformers, Recommenders, ASR, TTS, Visual Understanding).
  • Proficiency in one of the deep learning programming domain specific languages (e.g. CUDA/TileIR/CuTeDSL/cutlass/Triton).
  • Prior contributions to major LLM inference frameworks (e.g. vLLM) or prior experience with graph compilers in deep learning inference (e.g. TorchDynamo/TorchInductor).
  • Prior experience optimizing performance for low-latency, resource-constrained systems or embedded AI pipelines (e.g. Jetson systems or other edge AI accelerators).

What the JD emphasized

  • performance analysis and performance optimization
  • performance optimization
  • performance
  • performance modeling
  • performance analysis

Other signals

  • performance optimization
  • inference
  • GPU acceleration
  • deep learning models