Machine Learning Performance Engineer

Jane Street · Quant · Hong Kong · Machine Learning

The Machine Learning Performance Engineer at Jane Street focuses on optimizing the performance of ML models for both training and inference. The role requires deep expertise in low-level systems programming and GPU optimization, plus a whole-systems approach to performance that covers storage and networking, within a high-frequency trading environment.

What you'd actually do

  1. Optimise the performance of our models – both training and inference
  2. Improve straightforward CUDA where that is enough; the interesting part needs a whole-systems approach, including storage systems, networking and host- and GPU-level considerations
  3. Ensure our platform makes sense even at the lowest level – is all that throughput actually goodput?
  4. Debug a training run’s performance end to end

Skills

Required

  • Experience in low-level systems programming and optimisation
  • Understanding of modern ML techniques and toolsets
  • Systems knowledge to debug training run performance end to end
  • Low-level GPU knowledge (PTX, SASS, warps, cooperative groups, Tensor Cores, memory hierarchy)
  • Debugging and optimisation experience with tools like CUDA-GDB, Nsight Systems, Nsight Compute
  • Library knowledge of Triton, CUTLASS, CUB, Thrust, cuDNN, cuBLAS
  • Intuition about latency and throughput characteristics of CUDA graph launch, tensor core arithmetic, warp-level synchronization, asynchronous memory loads
  • Background in InfiniBand, RoCE, GPUDirect, PXN, rail optimisation, NVLink
  • Understanding of collective algorithms supporting distributed GPU training in NCCL or MPI
  • Inventive approach and willingness to ask hard questions

Nice to have

  • Experience in finance

What the JD emphasized

  • low-level systems programming and optimisation
  • efficient large-scale training
  • low-latency inference
  • high-throughput inference
  • low-level GPU knowledge
  • debugging and optimisation experience
  • library knowledge of Triton, CUTLASS, CUB, Thrust, cuDNN and cuBLAS
  • collective algorithms supporting distributed GPU training

Other signals

  • Optimizing ML model performance for training and inference
  • Low-latency and high-throughput inference in real-time systems
  • Whole-systems approach to performance optimization (storage, networking, host/GPU)
  • Low-level GPU programming and optimization (CUDA, PTX, SASS, Tensor Cores, memory hierarchy)
  • Distributed GPU training optimization (NCCL, MPI)