Software Engineer, AI and DL Kernel Libraries

NVIDIA · Semiconductors · Shanghai, China

Software Engineer focused on developing and optimizing AI inference systems software, including deep learning primitives, kernel libraries, and LLM inference runtimes, on NVIDIA GPUs.

What you'd actually do

Develop production-quality software that ships as part of NVIDIA's AI software stack, including cuDNN, FlashInfer, and optimized support for large language model inference workloads.
Innovate and develop new AI systems technologies for efficient inference, with a focus on performance, scalability, maintainability, and usability.
Design, implement, and optimize kernels for high-impact AI workloads across LLM inference, generative AI, computer vision, autonomous driving, and recommender systems.
Design and implement extensible software abstractions for deep learning libraries, LLM serving engines, and runtime systems.
Build and improve just-in-time compilation, code generation, and runtime technologies for performance-critical GPU workloads.

Skills

Required

Master's degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
3+ years of relevant industry, research, or systems software development experience in machine learning, deep learning systems, compilers, or GPU software.
Strong programming skills in C/C++ and Python, with hands-on experience developing high-performance software.
Solid experience with CUDA development and GPU programming fundamentals.
Strong experience developing or using deep learning frameworks such as PyTorch, JAX, TensorFlow, or ONNX.
Good understanding of linear algebra, performance analysis, profiling, and code optimization.
Experience designing software abstractions, APIs, or higher-level system architecture for performance-sensitive systems.
Familiarity with modern machine learning and inference system trends, especially around LLMs and generative AI.

Nice to have

Hands-on experience with inference engines and runtimes such as vLLM, SGLang, MLC, TensorRT-LLM, or similar systems.
Background in domain-specific compiler, code generation, or library solutions for LLM inference and training.
Expertise in machine learning compilers or IR systems such as MLIR, Apache TVM, TensorIR, or related technologies.
Practical experience with GPU performance modeling, computer architecture, or accelerator-oriented software design.
Open-source project ownership or meaningful contributions in deep learning systems, compilers, kernels, or inference infrastructure.

What the JD emphasized

production-quality software
efficient inference
performance
scalability
maintainability
usability
kernels
LLM inference
generative AI
computer vision
autonomous driving
recommender systems
software abstractions
LLM serving engines
runtime systems
just-in-time compilation
code generation
runtime technologies
performance-critical GPU workloads
workload performance
tune current software
future software and hardware-software interfaces
deep learning frameworks
PyTorch
JAX
TensorFlow
ONNX
linear algebra
performance analysis
profiling
code optimization
APIs
higher-level system architecture
performance-sensitive systems
modern machine learning
inference system trends
LLMs
GPU kernel development
performance optimization
CUDA C/C++
cuTile
Triton

Other signals

Develop production-quality software that ships as part of NVIDIA's AI software stack
Innovate and develop new AI systems technologies for efficient inference
Design, implement, and optimize kernels for high-impact AI workloads
Collaborate with world-class engineers across deep learning software, compilers, GPU architecture, and open-source inference ecosystems

Full job description

We're looking for outstanding AI systems software engineers to develop groundbreaking technologies across the inference systems software stack. Our team builds core AI systems software that accelerates high-impact workloads on NVIDIA GPUs, from deep learning primitives and kernel libraries to LLM inference runtimes, serving abstractions, and code generation technologies. As a member of the team, you will help design, build, optimize, and ship production-quality software that powers NVIDIA's AI software stack.

This role spans both foundational library engineering and next-generation inference systems work, with opportunities to contribute across the stack from low-level kernels and performance primitives to serving runtimes and developer-facing abstractions. You may work on GPU-accelerated deep learning primitives, efficient attention kernel implementations, LLM serving components, just-in-time compilation systems, software abstractions, and performance-critical runtime infrastructure for large language models, agents, and other advanced AI workloads. You will collaborate with world-class engineers across deep learning software, compilers, GPU architecture, and open-source inference ecosystems, and your work will directly impact NVIDIA's AI platform and the performance of real-world workloads at scale.

What you'll be doing:

Develop production-quality software that ships as part of NVIDIA's AI software stack, including cuDNN, FlashInfer, and optimized support for large language model inference workloads.
Innovate and develop new AI systems technologies for efficient inference, with a focus on performance, scalability, maintainability, and usability.
Design, implement, and optimize kernels for high-impact AI workloads across LLM inference, generative AI, computer vision, autonomous driving, and recommender systems.
Design and implement extensible software abstractions for deep learning libraries, LLM serving engines, and runtime systems.
Build and improve just-in-time compilation, code generation, and runtime technologies for performance-critical GPU workloads.
Analyze workload performance, tune current software, and propose improvements to future software and hardware-software interfaces.
Collaborate closely with engineers across deep learning frameworks, libraries, kernels, compilers, and GPU architecture teams at NVIDIA.
Contribute to open-source communities and ecosystem integrations where relevant, including projects such as FlashInfer, vLLM, and SGLang.

What we need to see:

Master's degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
3+ years of relevant industry, research, or systems software development experience in machine learning, deep learning systems, compilers, or GPU software. More experience is expected for senior-level candidates.
Strong programming skills in C/C++ and Python, with hands-on experience developing high-performance software.
Solid experience with CUDA development and GPU programming fundamentals.
Strong experience developing or using deep learning frameworks such as PyTorch, JAX, TensorFlow, or ONNX.
Good understanding of linear algebra, performance analysis, profiling, and code optimization.
Experience designing software abstractions, APIs, or higher-level system architecture for performance-sensitive systems.
Familiarity with modern machine learning and inference system trends, especially around LLMs and generative AI.
For senior candidates, strong experience in GPU kernel development and performance optimization, especially using CUDA C/C++, cuTile, Triton, or similar technologies, is expected.

Ways to stand out from the crowd:

Hands-on experience with inference engines and runtimes such as vLLM, SGLang, MLC, TensorRT-LLM, or similar systems.
Background in domain-specific compiler, code generation, or library solutions for LLM inference and training.
Expertise in machine learning compilers or IR systems such as MLIR, Apache TVM, TensorIR, or related technologies.
Practical experience with GPU performance modeling, computer architecture, or accelerator-oriented software design.
Open-source project ownership or meaningful contributions in deep learning systems, compilers, kernels, or inference infrastructure.
