Software Engineer - GPU Kernels

Baseten · Data AI · San Francisco, CA · EPD

Software engineer focused on optimizing GPU kernels for ML inference (matrix multiplications, attention mechanisms, and quantization) using CUDA and PTX assembly.

What you'd actually do

  1. Design and implement high-performance GPU kernels for key ML operations, including matrix multiplications, attention mechanisms, and mixture-of-experts routing
  2. Write and optimize code using CUDA, PTX assembly, and architecture-specific techniques
  3. Apply advanced performance optimization methods such as memory coalescing, warp-level programming, tensor core acceleration, and compute/memory overlap
  4. Implement cutting-edge features like quantization (FP8/FP4), sparsity, and compute/communication overlap
  5. Identify and resolve performance bottlenecks using tools like Nsight Systems, Nsight Compute, and Torch Profiler
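To give a flavor of the techniques named in items 2 and 3, here is a minimal, hypothetical CUDA sketch (not Baseten code) that combines coalesced global loads with warp-level programming via register shuffles:

```cuda
#include <cuda_runtime.h>

// Sum-reduce an array, illustrating two techniques from the list above:
// coalesced memory access and warp-level programming.
__global__ void warpReduceSum(const float* __restrict__ in, float* out, int n) {
    // Consecutive threads read consecutive addresses -> coalesced access.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (idx < n) ? in[idx] : 0.0f;

    // Warp-level reduction in registers via shuffles: no shared memory,
    // no __syncthreads() needed within the warp.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);

    // Lane 0 of each warp holds the warp's partial sum; accumulate globally.
    if ((threadIdx.x & 31) == 0)
        atomicAdd(out, v);
}
```

In a production kernel this would be one building block; bottlenecks in the full pipeline would then be located with Nsight Systems/Nsight Compute as described in item 5.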

Skills

Required

  • GPU architecture and GPU programming paradigms
  • C++ and the CUDA C++ API
  • GPU performance profiling tools
  • Memory access patterns and bandwidth optimization
  • Numerical precision and quantization strategies
  • Modern GPU features

Nice to have

  • Transformer models
  • Attention optimization (e.g., Flash Attention)
  • GPU kernel libraries (CUTLASS, Triton, Thrust, CUB)
  • GEMM tuning
  • Distributed/multi-GPU compute
  • Open-source GPU projects
  • Research publications or conference presentations on GPU performance

What the JD emphasized

  • High-performance GPU kernels
  • ML operations
  • Matrix multiplications
  • Attention mechanisms
  • CUDA
  • PTX assembly
  • Performance optimization
  • Quantization

Other signals

  • GPU kernel optimization
  • ML operations
  • Inference performance