Tech Lead Manager - MLRE, ML Systems

Scale AI · Data AI · San Francisco, CA · Research

The Tech Lead Manager for MLRE, ML Systems at Scale AI focuses on building and optimizing the internal distributed framework for large language model training and evaluation. The role involves collaborating with ML teams to accelerate research and development, and integrating state-of-the-art technologies to optimize the ML system for both training and inference.

What you'd actually do

  1. Build, profile and optimize our training and inference framework.
  2. Collaborate with ML and research teams to accelerate their research and development, and enable them to develop the next generation of models and data curation methods.
  3. Research and integrate state-of-the-art technologies to optimize our ML system.

Skills

Required

  • Experience with multi-node LLM training and inference
  • Experience with developing large-scale distributed ML systems
  • Experience with post-training methods like RLHF/RLVR and related algorithms like PPO/GRPO etc.
  • Strong software engineering skills, with proficiency in frameworks and tools such as CUDA, PyTorch, transformers, FlashAttention, etc.
  • Strong written and verbal communication skills to operate in a cross-functional team environment.

Nice to have

  • Demonstrated expertise in post-training methods and/or next-generation use cases for large language models, such as instruction tuning, RLHF, tool use, reasoning, agents, and multimodal.

What the JD emphasized

  • multi-node LLM training and inference
  • post-training methods like RLHF/RLVR and related algorithms like PPO/GRPO etc.
  • Strong software engineering skills
  • post-training methods
  • instruction tuning, RLHF, tool use, reasoning, agents, and multimodal

Other signals

  • LLM post-training platform
  • distributed framework for LLM training and evaluation
  • enabling next generation LLM training, inference and data curation