Member of Technical Staff, Performance Optimization

Fireworks AI · Data AI · San Mateo, CA · Engineering

Software Engineer role focused on performance optimization for AI infrastructure: improving speed and efficiency across the stack for LLMs, VLMs, and video models. Responsibilities include low-level GPU kernel optimization, distributed-systems scaling, and performance analysis for both training and inference.

What you'd actually do

  1. Optimize system and GPU performance for high-throughput AI workloads across training and inference
  2. Analyze and improve latency, throughput, memory usage, and compute efficiency
  3. Profile system performance to detect and resolve GPU- and kernel-level bottlenecks
  4. Implement low-level optimizations using CUDA, Triton, and other performance tooling
  5. Drive improvements in execution speed and resource utilization for large-scale model workloads (LLMs, VLMs, and video models)
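The analysis work in items 2 and 3 usually starts with a roofline-style estimate: compare a kernel's arithmetic intensity (FLOPs per byte of DRAM traffic) to the machine's balance point to decide whether it is memory-bound or compute-bound. A minimal sketch, using illustrative hardware numbers (roughly A100-class; these are assumptions, not figures from the JD):

```python
# Roofline-style check: is a kernel memory-bound or compute-bound?
# Peak numbers below are illustrative assumptions (roughly A100-class FP16),
# not figures from the job description.

PEAK_FLOPS = 312e12  # assumed peak tensor throughput, FLOP/s
PEAK_BW = 2.0e12     # assumed peak HBM bandwidth, bytes/s

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of DRAM traffic."""
    return flops / bytes_moved

def bound_kind(flops: float, bytes_moved: float) -> str:
    """Compare a kernel's intensity to the machine balance point."""
    machine_balance = PEAK_FLOPS / PEAK_BW  # FLOP/byte where the roofline bends
    ai = arithmetic_intensity(flops, bytes_moved)
    return "compute-bound" if ai >= machine_balance else "memory-bound"

# Example: elementwise add of two 1M-element FP16 vectors:
# 1 FLOP per element, 3 * 2 bytes per element (two loads, one store).
n = 1_000_000
print(bound_kind(flops=n, bytes_moved=6 * n))  # memory-bound
```

Elementwise ops land far below the balance point, which is why kernel fusion (e.g., via Triton or torch.compile) is a standard first optimization: it removes round trips to HBM rather than adding FLOPs.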

Skills

Required

  • CUDA
  • ROCm
  • GPU profiling tools
  • PyTorch
  • distributed system debugging
  • GPU architecture
  • parallel programming models
  • compute kernels

Nice to have

  • Master’s or PhD
  • compiler stacks
  • ML compilers
  • torch.compile
  • Triton
  • XLA
  • open-source ML or HPC infrastructure
  • cloud-scale AI infrastructure
  • Kubernetes
  • ML systems engineering
  • hardware-aware model design

What the JD emphasized

  • 5+ years of experience working on performance optimization or high-performance computing systems
  • Proficiency in CUDA or ROCm and experience with GPU profiling tools (e.g., Nsight, nvprof, CUPTI)
  • Deep understanding of GPU architecture, parallel programming models, and compute kernels
  • Experience optimizing large models for training and inference (LLMs, VLMs, or video models)
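The "GPU architecture, parallel programming models, and compute kernels" bullet is the kind of thing interviewers probe with occupancy reasoning: how per-block thread, register, and shared-memory usage limits the number of resident blocks per SM. A simplified sketch with assumed SM limits (roughly A100-class; real tools like Nsight Compute or the CUDA occupancy API account for more detail):

```python
# Theoretical SM occupancy estimate from per-block resource usage.
# SM limits below are assumptions (roughly NVIDIA A100-class), and the model
# ignores granularity effects that real occupancy calculators handle.

SM_MAX_THREADS = 2048       # assumed max resident threads per SM
SM_MAX_BLOCKS = 32          # assumed max resident blocks per SM
SM_REGISTERS = 65536        # assumed 32-bit registers per SM
SM_SHARED_MEM = 164 * 1024  # assumed shared memory per SM, bytes

def occupancy(threads_per_block: int, regs_per_thread: int,
              smem_per_block: int) -> float:
    """Fraction of an SM's thread slots a kernel can keep resident."""
    by_threads = SM_MAX_THREADS // threads_per_block
    by_regs = SM_REGISTERS // (regs_per_thread * threads_per_block)
    by_smem = SM_SHARED_MEM // smem_per_block if smem_per_block else SM_MAX_BLOCKS
    blocks = min(by_threads, by_regs, by_smem, SM_MAX_BLOCKS)
    return blocks * threads_per_block / SM_MAX_THREADS

# A register-heavy kernel: 256 threads/block, 64 regs/thread, no shared memory.
print(occupancy(256, 64, 0))  # 0.5 — register pressure caps residency
```

Here registers, not threads, are the binding constraint (4 blocks fit by registers vs. 8 by thread count), the classic trade-off when a kernel spills or when `__launch_bounds__` / `maxrregcount` tuning comes up.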

Other signals

  • optimizing performance at every layer of the stack—from low-level GPU kernels to large-scale distributed systems
  • maximizing the performance of our most demanding workloads, including large language models (LLMs), vision-language models (VLMs), and next-generation video models
  • implement low-level optimizations using CUDA, Triton, and other performance tooling
  • scale inference and training systems across multi-GPU, multi-node environments
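The multi-GPU, multi-node signal can be sketched with a back-of-envelope strong-scaling model: per-step time is compute divided by GPU count plus a ring all-reduce term of roughly 2(n-1)/n × bytes / bandwidth. All constants below are illustrative assumptions, not measured figures:

```python
# Back-of-envelope scaling model for data-parallel training:
# per-step time = compute_time / n_gpus + ring all-reduce cost.
# All constants are illustrative assumptions, not measured figures.

def allreduce_time(n_gpus: int, grad_bytes: float, bus_bw: float) -> float:
    """Ring all-reduce cost model: 2*(n-1)/n * bytes / bandwidth."""
    if n_gpus == 1:
        return 0.0
    return 2 * (n_gpus - 1) / n_gpus * grad_bytes / bus_bw

def scaling_efficiency(n_gpus: int, compute_time: float,
                       grad_bytes: float, bus_bw: float) -> float:
    """Achieved speedup over one GPU divided by the ideal speedup n."""
    t_n = compute_time / n_gpus + allreduce_time(n_gpus, grad_bytes, bus_bw)
    return (compute_time / t_n) / n_gpus

# Assumed: 100 ms of compute per step, 2 GB of gradients, 300 GB/s effective bus.
print(f"{scaling_efficiency(8, 0.100, 2e9, 300e9):.2f}")  # 0.52
```

Even this crude model shows why communication overlap, gradient compression, and faster interconnects dominate the conversation: at these assumed numbers, half the ideal 8-GPU speedup is lost to the all-reduce.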