Member of Technical Staff - ML Training Systems

Modal · Data AI · New York, NY · Engineering

This role covers engineering for ML training systems: optimizing how production machine learning models, including language models, are trained. The work involves PyTorch and high-level training libraries, and centers on finding and removing performance bottlenecks.

What you'd actually do

  1. Experience working with torch and high-level training frameworks (Hugging Face, verl, slime)
  2. Experience with ML training optimization (tell us a story about eliminating data loading bottlenecks, overlapping communication with compute, rewriting a trainer to handle off-policy rollouts, etc.)

Skills

Required

  • PyTorch
  • ML training optimization
  • high-performance code

Nice to have

  • Linux kernel
  • file systems
  • containers

What the JD emphasized

  • 5+ years of experience writing high-quality, high-performance code
  • Experience with ML training optimization

Other signals

  • ML training optimization
  • training production machine learning models
  • evolving Modal's infrastructure to train the next generation of language models