Senior Performance Compiler Engineer - Triton

NVIDIA · Semiconductors · Redmond, WA +5 · Remote

NVIDIA is seeking a Senior Performance Compiler Engineer to work on the open-source Triton compiler project, using compiler technology to improve AI performance on NVIDIA GPUs for large language models, agents, and other AI applications. The role involves investigating GPU hardware, designing and implementing MLIR-based compiler technology that optimizes kernel descriptions for efficient GPU code generation, and collaborating with internal teams.

What you'd actually do

  1. Investigating the latest and future NVIDIA GPU hardware architecture and programming models.
  2. Working on the frontier of AI by understanding advanced algorithms (like attention sinks and MoEs) and numerics (like block-scaled floating point) to identify new opportunities for optimization.
  3. Designing and implementing compiler technology using MLIR to optimize high-level kernel descriptions (written in Triton's Python DSL), with a focus on generating efficient, low-level GPU code.
  4. Engaging in a dynamic, iterative process of optimization, sometimes starting with the kernel and sometimes with the compiler, to find the most efficient path to peak performance.
  5. Collaborating with teams across NVIDIA, including hardware architects and the CUDA compiler team, to influence future products and ensure we are always operating at maximum efficiency.

Skills

Required

  • Bachelor's, Master's, or Ph.D. degree, or equivalent experience, in Computer Science, Computer Engineering, Applied Math, or a related field.
  • 8+ years of relevant industry experience in software development.
  • Demonstrated strong C++ programming and software design skills.
  • Experienced in parallel programming, including CUDA/OpenCL GPU programming or other parallel models such as OpenMP.
  • Solid understanding of computer architecture and hands-on experience with assembly-level programming.

Nice to have

  • Experience in tuning BLAS or deep learning library kernels.
  • Background in numerics and linear algebra.
  • Experience with machine learning compilers like TVM or MLIR.
  • Contributions to open-source projects, especially in the AI/ML or compiler space.
  • Familiarity with the latest research in AI algorithms and numerics as well as a strong track record of contributions to open-source projects, particularly in the AI/ML, compiler, or high-performance computing domains.

What the JD emphasized

  • performance analysis and debugging
  • parallel programming, including CUDA/OpenCL GPU programming or other parallel models such as OpenMP
  • assembly-level programming
  • machine learning compilers like TVM or MLIR
  • contributions to open-source projects, especially in the AI/ML or compiler space
  • latest research in AI algorithms and numerics
  • strong track record of contributions to open-source projects, particularly in the AI/ML, compiler, or high-performance computing domains

Other signals

  • AI performance on NVIDIA GPUs
  • Triton compiler project
  • optimizing high-level kernel descriptions
  • generating efficient, low-level GPU code