VP of Product, Research and Training Infrastructure

Weights & Biases · Data AI · Bellevue, WA +4 · Technology

VP of Product for Research and Training Infrastructure at an AI cloud provider. The role owns product strategy and engineering execution for the services that power AI research labs, with a focus on specialized orchestration, evaluation, and iteration tools for massive-scale pre-training and post-training. Key responsibilities include evolving SUNK (Slurm on Kubernetes), developing automated training-based evaluation frameworks, and building infrastructure for RL/RLHF pipelines. Requires deep knowledge of HPC and distributed training, plus experience supporting frontier model research.

What you'd actually do

  1. Oversee the evolution of SUNK (Slurm on Kubernetes) to provide researchers with deterministic, bare-metal performance through a cloud-native interface.
  2. Drive the development of next-generation orchestrators and automated training-based evaluation frameworks that ensure model quality throughout the lifecycle.
  3. Build the infrastructure required for sophisticated Reinforcement Learning (RL) and RLHF pipelines, enabling labs to refine foundation models with maximum efficiency.
  4. Act as the primary technical partner for lead researchers at global AI labs, translating their "future-state" requirements into actionable product roadmaps.

Skills

Required

  • 15+ years of experience in engineering leadership
  • 5+ years managing large-scale infrastructure at a top-tier research lab or an AI-native cloud provider
  • Deep, hands-on knowledge of Slurm
  • Deep, hands-on knowledge of Kubernetes
  • Deep, hands-on knowledge of the specific networking requirements (InfiniBand/RDMA) for distributed training clusters
  • Experience supporting frontier model research (pre-training and post-training)
  • Track record of delivering mission-critical services on multi-thousand GPU clusters (H100/Blackwell/Rubin architectures)
  • Ability to define "what’s next" in the AI stack

Nice to have

  • Product strategy
  • Engineering execution
  • HPC
  • Cloud-native agility
  • Automated training-based evaluation frameworks
  • Reinforcement Learning (RL)
  • RLHF pipelines
  • Customer advocacy
  • Technical partner for lead researchers
  • Actionable product roadmaps
  • Strategic vision

What the JD emphasized

  • AI Cloud Provider
  • Research Training Stack
  • pre-training and post-training
  • frontier model research
  • multi-thousand GPU clusters

Other signals

  • AI Cloud Provider
  • HPC for AI
  • Frontier Model Research
  • Training Infrastructure