Staff Technical Program Manager - Cluster Orchestration & Applied Training

Weights & Biases · Data AI · Bellevue, WA · Technology

The Staff Technical Program Manager leads cross-functional programs for AI/ML Platform Services, focusing on two areas: Cluster Orchestration (scheduling, launching, and managing AI workloads) and Applied Training (enabling researchers to use the infrastructure for pre-training, fine-tuning, RL, and evaluations). The role partners with engineering, product, and research teams to improve how workloads execute and how users interact with training platforms, drives delivery across AI training workflows, and ensures successful launches with clear operational ownership.

What you'd actually do

  1. Drive end-to-end program execution for cluster orchestration initiatives spanning workload scheduling, self-service provisioning, upgrade and migration flows, and platform integrations.
  2. Lead cross-functional programs that improve how AI training, evaluation, RL, and mixed workloads run across CoreWeave clusters.
  3. Partner with engineering and product leaders to define roadmap priorities and deliver measurable improvements in utilization, reliability, scalability, observability, and user experience.
  4. Drive delivery for applied training initiatives across pre-training, fine-tuning, reinforcement learning, sandbox environments, and evaluation systems.
  5. Coordinate dependencies across platform engineering, infrastructure, product, customer-facing teams, and ecosystem partners to ensure successful launches and clear operational ownership.

Skills

Required

  • Technical Program Management
  • Cloud Infrastructure
  • Distributed Systems
  • AI/ML Platforms
  • Kubernetes
  • Slurm or comparable schedulers
  • AI training workflows
  • Program metrics definition
  • Stakeholder communication

Nice to have

  • Kueue
  • Ray
  • GPU infrastructure
  • Capacity planning
  • Multi-tenant execution
  • Distributed training tradeoffs
  • Launch processes
  • Release governance
  • Dependency management
  • Operational review mechanisms
  • W&B
  • SkyPilot

What the JD emphasized

  • 8+ years of technical program management experience in cloud infrastructure, distributed systems, or AI/ML platforms.
  • Experience leading large-scale cross-functional programs involving scheduling systems, cluster infrastructure, or ML platform capabilities.
  • Strong technical fluency in Kubernetes, Slurm or comparable schedulers, distributed systems, and AI training workflows.

Other signals

  • driving programs across orchestration systems
  • scale the environments, tooling, and operational mechanisms
  • AI training, evaluation, RL, and mixed workloads
  • pre-training, fine-tuning, reinforcement learning, sandbox environments, and evaluation systems