Machine Learning Performance Engineer

Jane Street · Quant · New York, NY · Machine Learning

Seeking an engineer with low-level systems programming and optimization experience to join the ML team, focusing on optimizing the performance of ML models for both training and inference in a real-time trading environment. This role requires a deep understanding of GPU architecture, networking, and distributed systems to ensure efficient large-scale training and low-latency, high-throughput inference.

What you'd actually do

  1. Optimize the performance of our models – both training and inference
  2. Improve straightforward CUDA, though the interesting part needs a whole-systems approach that includes storage systems, networking, and host- and GPU-level considerations
  3. Ensure our platform makes sense even at the lowest level – is all that throughput actually goodput?
  4. Debug a training run’s performance end to end
  5. Use networking technologies such as InfiniBand, RoCE, GPUDirect, and NVLink to link up GPU clusters

Skills

Required

  • Experience in low-level systems programming and optimization
  • Understanding of modern ML techniques and toolsets
  • The experience and systems knowledge required to debug a training run’s performance end to end
  • Low-level GPU knowledge of PTX, SASS, warps, cooperative groups, Tensor Cores, and the memory hierarchy
  • Debugging and optimization experience using tools like CUDA-GDB, Nsight Systems, and Nsight Compute
  • Library knowledge of Triton, CUTLASS, CUB, Thrust, cuDNN, and cuBLAS
  • Intuition about the latency and throughput characteristics of CUDA graph launch, tensor core arithmetic, warp-level synchronization, and asynchronous memory loads
  • Background in InfiniBand, RoCE, GPUDirect, PXN, rail optimization, and NVLink, and how to use these networking technologies to link up GPU clusters
  • An understanding of the collective algorithms supporting distributed GPU training in NCCL or MPI
  • An inventive approach and the willingness to ask hard questions about whether we're taking the right approaches and using the right tools

Nice to have

  • If you’ve never thought about a career in finance, you’re in good company.

Other signals

  • Optimizing ML model performance for training and inference
  • Low-level systems programming and optimization
  • Whole-systems approach including storage, networking, and host/GPU considerations
  • Debugging and optimization using specialized tools