Senior Software Engineer – TensorRT Edge-LLM

NVIDIA · Semiconductors · Santa Clara, CA +2 · Remote

This Senior Software Engineer role develops and optimizes a state-of-the-art inference framework for large language, vision-language, and multimodal models on edge and embedded platforms, with a focus on real-time performance in resource-constrained environments.

What you'd actually do

  1. Develop and evolve a state-of-the-art inference framework in modern C++ that extends TensorRT with autoregressive model serving capabilities, including speculative decoding, LoRA, MoE, and KV cache management.
  2. Design and implement compiler and runtime optimizations tailored for transformer-based models running on constrained, real-time platforms.
  3. Collaborate with teams across CUDA, kernel libraries, compilers, and robotics to deliver high-performance, production-ready solutions.
  4. Contribute to CUDA kernel and operator development for critical transformer components such as attention, GEMM, and MoE.
  5. Benchmark, profile, and optimize inference performance across diverse embedded and automotive environments.

Skills

Required

  • 4+ years of relevant software development experience
  • Deep understanding of transformer models and inference optimization techniques (e.g., quantization, tensor parallelism, or memory-efficient scheduling)
  • Proficiency in modern C++ (C++11/14/17 and beyond)
  • Familiarity with popular LLM frameworks and libraries such as TensorRT, TensorRT-LLM, vLLM, SGLang, MLC-LLM, or FlashInfer

Nice to have

  • Demonstrated development experience or open-source contributions to LLM inference frameworks and libraries, such as SGLang, vLLM, or FlashInfer
  • Proficiency with CUDA, including efficient kernel development, performance profiling, and GPU architecture fundamentals
  • Prior work on autoregressive LLM serving systems, including speculative decoding or KV cache management
  • Familiarity with compiler infrastructure for large language model inference
  • Exposure to robotics or embedded AI pipelines, including optimizing for low-latency, resource-constrained systems

What the JD emphasized

  • modern C++
  • transformer-based models
  • constrained, real-time platforms
  • production-ready solutions
  • CUDA kernel
  • LLM/VLM ecosystem

Other signals

  • LLM inference framework development
  • Optimizing transformer models for edge devices
  • CUDA kernel development for AI components