ML Research Engineer, ML Systems

Scale AI · Data AI · San Francisco, CA · Research

ML Research Engineer focused on building and optimizing the internal distributed framework for large language model training and inference, supporting ML research and development.

What you'd actually do

  1. Build, profile and optimize our training and inference framework
  2. Collaborate with ML teams to accelerate their research and development, enabling them to build the next generation of models and data curation
  3. Research and integrate state-of-the-art technologies to optimize our ML system

Skills

Required

  • Strong software engineering skills
  • CUDA
  • PyTorch
  • transformers
  • FlashAttention

Nice to have

  • multi-node LLM training and inference
  • developing large-scale distributed ML systems
  • instruction tuning
  • RLHF
  • tool use
  • reasoning
  • agents
  • multimodal

What the JD emphasized

  • multi-node LLM training and inference
  • large-scale distributed ML systems
  • system optimization

Other signals

  • ML platform
  • distributed framework
  • LLM training and inference
  • system optimization