Deep Learning Performance Architect, CUTLASS DSL

NVIDIA · Semiconductors · Shanghai, China +1

NVIDIA is seeking an engineer to develop and optimize CUTLASS DSL, a Python-native language for GPU kernel development, and its associated MLIR dialects and lowering passes. The role involves accelerating kernel compilation for NVIDIA's next-generation AI platforms, aiming for performance comparable to CUTLASS C++.

What you'd actually do

Design, develop, and optimize CUTLASS DSL, a Python-native language for high-performance GPU kernel development
Build and advance the MLIR dialects, lowering passes, and code generation flows that power the CUTLASS DSL stack
Drive innovations that improve kernel compilation speed while maintaining performance on par with CUTLASS C++
Collaborate closely with architecture, research, software product teams, and the open-source community to bring cutting-edge optimizations into real products

Skills

Required

MS, PhD, or equivalent experience in Computer Science, Software Engineering, or a related field
2+ years of relevant work experience
Excellent programming skills in Python and strong proficiency in C++
Hands-on experience with DSLs, compilers, or code generation systems
Strong command of the MLIR/LLVM stack, including IR design and pass optimization
Strong communication skills and the ability to thrive in a highly collaborative environment

Nice to have

Deep understanding of the CUDA GPU programming model, GPU microarchitecture, and performance analysis and optimization techniques
Familiarity with key high-performance computing abstractions such as Layout, Tile, MMA, and TMA in the CuTe ecosystem

What the JD emphasized

Python-native language for GPU kernel development
MLIR dialects
lowering passes
code generation flows
kernel compilation speed
performance comparable to CUTLASS C++

Other signals

AI platforms
GPU kernel development
high-performance kernel development

Full job description

Are you passionate about programming languages, compiler technology, and GPU performance? Do you want to help shape the future of high-performance kernel development for AI? We are looking for outstanding engineers to build CUTLASS DSL, a Python-native language for GPU kernel development, along with the MLIR dialects and lowering passes behind it. In this role, you will also help accelerate kernel compilation while delivering performance comparable to CUTLASS C++, enabling efficient hardware-software co-design for NVIDIA's next generation of AI platforms.

**What you'll be doing:**

Design, develop, and optimize CUTLASS DSL, a Python-native language for high-performance GPU kernel development
Build and advance the MLIR dialects, lowering passes, and code generation flows that power the CUTLASS DSL stack
Drive innovations that improve kernel compilation speed while maintaining performance on par with CUTLASS C++
Collaborate closely with architecture, research, software product teams, and the open-source community to bring cutting-edge optimizations into real products

**What we need to see:**

MS, PhD, or equivalent experience in Computer Science, Software Engineering, or a related field
2+ years of relevant work experience
Excellent programming skills in Python and strong proficiency in C++
Hands-on experience with DSLs, compilers, or code generation systems
Strong command of the MLIR/LLVM stack, including IR design and pass optimization
Strong communication skills and the ability to thrive in a highly collaborative environment

**Ways to stand out from the crowd:**

Deep understanding of the CUDA GPU programming model, GPU microarchitecture, and performance analysis and optimization techniques
Familiarity with key high-performance computing abstractions such as Layout, Tile, MMA, and TMA in the CuTe ecosystem
