Senior Software Engineer, Quantized Inference

NVIDIA · Semiconductors · Redmond, WA +1

Senior software engineering role focused on optimizing quantized LLM inference: implementing quantization recipes, developing kernels, and collaborating on inference engines such as vLLM and TRT-LLM. The role also covers model export pipelines, benchmarking, and data analysis tooling.

What you'd actually do

  1. Implement quantized and sparse recipes in inference engines (vLLM, TRT-LLM, SGLang)
  2. Own model export pipelines (ModelOpt, Megatron-LM <-> HuggingFace), ensuring quantized checkpoints serialize correctly for downstream serving
  3. Build prototypes and benchmarking harnesses to evaluate recipe throughput/interactivity before full optimization
  4. Develop data analysis tooling and visualizations for numerics debugging
  5. Improve developer productivity across the team by strengthening CI, build systems, and training infrastructure, and by reducing pipeline friction
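The posting does not say what item 4's numerics-debugging tooling looks like. As an illustration only, here is a minimal sketch of one check such tooling might run. It assumes symmetric per-tensor INT8 fake quantization, a recipe chosen for simplicity (real recipes such as FP8 or per-channel and block scaling differ), and reports error statistics against the unquantized reference. The helpers `fake_quant_int8` and `quant_error_report` are hypothetical and do not come from ModelOpt, vLLM, or TRT-LLM. Plain Python lists stand in for PyTorch tensors so the sketch has no dependencies.

```python
import math
from dataclasses import dataclass


@dataclass
class QuantErrorReport:
    scale: float
    max_abs_error: float
    rmse: float
    sqnr_db: float  # signal-to-quantization-noise ratio in decibels


def fake_quant_int8(values: list[float]) -> tuple[list[float], float]:
    """Hypothetical helper: symmetric per-tensor INT8 quantize, then dequantize."""
    amax = max((abs(v) for v in values), default=0.0)
    scale = amax / 127.0 if amax > 0 else 1.0
    # Python's round() rounds half to even; clamp to the symmetric range [-127, 127].
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return [qi * scale for qi in q], scale


def quant_error_report(reference: list[float]) -> QuantErrorReport:
    """Hypothetical helper: compare a tensor against its fake-quantized copy."""
    if not reference:
        raise ValueError("reference must be non-empty")
    dequantized, scale = fake_quant_int8(reference)
    errors = [r - d for r, d in zip(reference, dequantized)]
    n = len(reference)
    noise_power = sum(e * e for e in errors) / n
    signal_power = sum(r * r for r in reference) / n
    # Zero noise means the round trip was exact, so SQNR is unbounded.
    sqnr = math.inf if noise_power == 0 else 10 * math.log10(signal_power / noise_power)
    return QuantErrorReport(
        scale=scale,
        max_abs_error=max(abs(e) for e in errors),
        rmse=math.sqrt(noise_power),
        sqnr_db=sqnr,
    )


if __name__ == "__main__":
    print(quant_error_report([0.02, -1.5, 3.0, 0.7, -0.003]))
```

A per-tensor scale keeps the example short. In practice, outlier-heavy activations are usually the reason to compare this report against finer-grained scaling.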

Skills

Required

  • Python
  • C++
  • Software Engineering Fundamentals
  • ML Accelerators
  • PyTorch Internals
  • Open Source Contribution
  • Agile Development

Nice to have

  • Triton Kernel Development
  • Inference Serving Frameworks (vLLM, TRT-LLM, SGLang)
  • Numerical Debugging
  • Mixed-Precision Boundaries
  • Model Compression (PTQ, QAT, Sparsity)

What the JD emphasized

  • Proficient in Python
  • Familiarity with C++
  • Strong software engineering fundamentals
  • Experience with ML accelerators
  • Familiarity with PyTorch internals
  • Experience reading, modifying, or contributing to a large open-source codebase
  • Demonstrated ability to move fast with ambiguous requirements

Other signals

  • inference optimization
  • quantization
  • kernel implementation
  • model export