Member of Technical Staff - ML Performance

Modal · Data AI · New York, NY · Engineering

Seeking an ML Performance Engineer with 5+ years of experience to optimize ML systems for higher throughput and lower latency. The role involves working with inference engines like vLLM or TensorRT, understanding GPU architecture, and improving ML performance at scale.

What you'd actually do

  1. Contribute to open-source projects and Modal’s container runtime to push language and diffusion models toward higher throughput and lower latency

Skills

Required

  • high-quality, high-performance code
  • torch
  • high-level ML frameworks
  • inference engines (vLLM or TensorRT)
  • NVIDIA GPU architecture
  • CUDA
  • ML performance engineering

Nice to have

  • low-level operating system foundations (Linux kernel, file systems, containers, etc.)

What the JD emphasized

  • experience with ML performance engineering
  • boosting GPU performance
  • debugging SM occupancy issues
  • rewriting an algorithm to be compute-bound
  • eliminating host overhead

Other signals

  • ML performance engineering
  • GPU performance
  • low-latency inference
  • high throughput