Senior Software Engineer, Cutlass Kernels

NVIDIA · Semiconductors · Santa Clara, CA +4

This Senior Software Engineer role develops and optimizes high-performance deep learning kernels (e.g., GEMM, attention, convolution) using CUTLASS CUDA C++ and Python DSL for NVIDIA GPUs and future architectures. The work involves optimizing kernels for peak throughput and collaborating with NVIDIA's architecture, compiler, library, and DL framework teams. It requires strong C++ and CUDA experience, an understanding of computer architecture, and experience with parallel programming languages targeting accelerators.

What you'd actually do

  1. Write Tensor Core-based deep learning kernels such as grouped-GEMM, attention, and convolution using CUTLASS CUDA C++ and Python DSL for Blackwell, Rubin, and future architectures.
  2. Optimize kernels for peak throughput on both silicon and software performance simulators.
  3. Collaborate with teams across NVIDIA including the GPU architecture, NVVM/PTX compiler, CUDA library, and DL frameworks teams to ensure fast, functional, and timely kernel delivery to customers.

Skills

Required

  • Master's or PhD degree in Computer Science, Computer Engineering, or related field (or equivalent experience)
  • 3+ years of relevant industry experience
  • Strong proficiency in C++ programming and software design, including debugging, performance evaluation, and testing.
  • Experience with CUDA, OpenCL, HIP, SYCL, Mojo, Pallas, Triton, Mosaic, Halide, or any general-purpose or domain-specific programming language targeting highly parallel accelerators.
  • Deep understanding of computer architecture and some experience working at the assembly level.

Nice to have

  • Experience writing code specifically targeting NVIDIA Tensor Cores, particularly through PTX or CUDA/cuTile.
  • Open-source contributions to math kernel libraries or frameworks.

What the JD emphasized

  • highest performance out of the hardware architecture
  • peak throughput
  • fast, functional, and timely kernel delivery

Other signals

  • high-performance computing platforms
  • AI revolution
  • deep learning kernels
  • NVIDIA GPUs
  • NVIDIA architectures