Post-training Research Engineer

Baseten · Data AI · San Francisco, CA · EPD

Baseten is seeking a Post-Training Research Engineer to build in-house tooling for post-training AI models at scale. The role involves deep technical dives into ML techniques, distributed computing, and systems-level concepts to support customers' custom models, which are critical to Baseten's inference platform.

What you'd actually do

  1. building the in-house tooling that supports Baseten's post-training work
  2. training a wide spectrum of model architectures efficiently and at scale, using a variety of techniques
  3. zooming deep into particular technical topics
  4. working across the stack as a whole: systems-level concepts like Kubernetes, cgroups, storage systems, and networking topologies, as well as PyTorch distributed tensor computation and GPU kernels
  5. diving into messy problems, working with researchers, deriving specifications by asking important questions, and executing
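The parallelism work mentioned above often starts with simple shard-shape arithmetic. As a hypothetical sketch (illustrative dimensions, not any real Baseten model), here is how a transformer MLP's weights split under Megatron-style tensor parallelism:

```python
# Hypothetical sketch: splitting a transformer MLP's weight matrices
# across tensor-parallel ranks. Dimensions are made up for illustration.

def tp_shard_shapes(d_model: int, d_ff: int, tp: int):
    """Return per-rank shapes for the two MLP weight matrices.

    The up-projection (d_model x d_ff) is column-split across `tp` ranks;
    the down-projection (d_ff x d_model) is row-split, so each rank holds
    a contiguous slice and only one all-reduce is needed per layer.
    """
    assert d_ff % tp == 0, "hidden dim must divide evenly across ranks"
    up = (d_model, d_ff // tp)    # column-parallel shard
    down = (d_ff // tp, d_model)  # row-parallel shard
    return up, down

up, down = tp_shard_shapes(d_model=4096, d_ff=16384, tp=8)
print(up, down)  # (4096, 2048) (2048, 4096)
```

The same shape bookkeeping generalizes to data, pipeline, and context parallelism, where the split is over the batch, layer, or sequence dimension instead.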

Skills

Required

  • strong experience in machine learning
  • solid foundations in maths and computer science
  • deep understanding of modern ML techniques and tools for training transformers
  • advanced experience with a tensor/array computation library such as PyTorch, TensorFlow, JAX, or similar
  • detailed understanding of transformer training parallelism strategies like data parallelism, sharded data parallelism, tensor parallelism, pipeline parallelism, and context parallelism
  • ability to profile and improve the performance of a distributed GPU program in PyTorch or a similar library
  • ability to perform roofline analysis on a transformer training setup
  • willingness to dive into messy problems, work with researchers, derive specifications by asking important questions, and execute
  • familiarity with HPC and distributed computing platforms like Slurm, Ray, Kubernetes, and Dask
  • familiarity with cluster networking technologies like InfiniBand, RoCE, and GPUDirect
  • solid fundamentals in operating systems concepts like processes, files, kernel drivers, containerisation, and networking protocols
  • sense of creativity and willingness to ask difficult questions about our approach, assumptions, and tooling choices
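The roofline-analysis requirement above comes down to comparing a kernel's arithmetic intensity against the hardware's FLOPs-per-byte ridge point. A back-of-envelope sketch with illustrative hardware numbers (roughly A100-class, not a spec):

```python
# Back-of-envelope roofline check: is a matmul compute-bound or
# memory-bound on a given GPU? Hardware numbers are illustrative
# (roughly A100-class); swap in real specs for a real analysis.

PEAK_FLOPS = 312e12           # ~BF16 tensor-core peak, FLOP/s (illustrative)
PEAK_BW = 1.6e12              # ~HBM bandwidth, bytes/s (illustrative)
RIDGE = PEAK_FLOPS / PEAK_BW  # FLOPs per byte needed to saturate compute

def matmul_intensity(m, n, k, bytes_per_el=2):
    """Arithmetic intensity of an (m,k) @ (k,n) matmul, in FLOPs/byte."""
    flops = 2 * m * n * k                             # multiply-accumulates
    traffic = bytes_per_el * (m * k + k * n + m * n)  # read A, B; write C
    return flops / traffic

# Large training-shaped GEMM: high intensity, compute-bound.
big = matmul_intensity(8192, 8192, 8192)
# Skinny batch-1 GEMM: low intensity, memory-bound.
small = matmul_intensity(1, 8192, 8192)

print(f"ridge point ~{RIDGE:.0f} FLOPs/byte")
print(f"big GEMM: {big:.0f} FLOPs/byte -> {'compute' if big > RIDGE else 'memory'}-bound")
print(f"small GEMM: {small:.1f} FLOPs/byte -> {'compute' if small > RIDGE else 'memory'}-bound")
```

Kernels landing left of the ridge point are bandwidth-limited regardless of tensor-core throughput, which is why profiling and roofline reasoning go hand in hand for transformer training setups.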

Nice to have

  • exposure to a variety of ML startups

What the JD emphasized

  • post-trained
  • post-training

Other signals

  • post-trained models
  • custom models
  • training transformers
  • distributed GPU program
  • in-house tooling