Machine Learning Systems Research Engineer, Agent Post-training - Enterprise GenAI

Scale AI · Data AI · San Francisco, CA · Enterprise Engineering

Scale AI is seeking an ML Systems Research Engineer to work on building algorithms for their next-gen Agent RL training platform, supporting large-scale training, and researching/integrating state-of-the-art technologies to optimize ML systems. The role involves post-training state-of-the-art models for enterprise engagements and creating next-gen agent training algorithms for multi-agent/multi-tool rollouts.

What you'd actually do

  1. Build, profile and optimize our training and inference framework.
  2. Post-train state-of-the-art models, developed both internally and from the community, to define stable post-training recipes for our enterprise engagements.
  3. Collaborate with ML teams to accelerate their research and development, and enable them to develop the next generation of models and data curation.
  4. Create a next-gen agent training algorithm for multi-agent/multi-tool rollouts.

Skills

Required

  • LLM training in a production environment
  • post-training methods like RLHF/RLVR and related algorithms like PPO/GRPO etc.
  • operating modern GPU cluster architectures
  • multi-node LLM training and inference
  • Strong software engineering skills
  • CUDA
  • PyTorch
  • transformers
  • FlashAttention
  • Strong written and verbal communication skills
  • PhD or Masters in Computer Science or a related field

Nice to have

  • Passion for system optimization

What the JD emphasized

  • LLM training in a production environment
  • post-training methods like RLHF/RLVR and related algorithms like PPO/GRPO etc.
  • multi-node LLM training and inference

Other signals

  • post-training algorithms
  • complex agents
  • enterprise clients
  • next-gen Agent RL training platform