Staff Software Engineer - AI Workload Orchestration

Weights & Biases · Data AI · Bellevue, WA +1 · Technology

Staff Software Engineer on the AI Workload Orchestration Platform team, responsible for the technical vision and architecture of CoreWeave's Kubernetes-native orchestration strategy for AI workloads, covering admission, scheduling, and governance across large GPU clusters using frameworks such as Kueue, Volcano, and Ray. The platform serves both training and inference workloads.

What you'd actually do

  1. Own the technical vision and architecture for major portions of the AI Workload Orchestration Platform
  2. Design scalable, reliable orchestration primitives for AI workloads across multiple schedulers and runtimes
  4. Lead cross-team architecture reviews and drive alignment across infrastructure, CKS (CoreWeave Kubernetes Service), and managed inference teams
  4. Define platform standards for reliability, observability, capacity management, and operational excellence
  5. Identify and resolve systemic performance, scalability, and fairness issues across large GPU clusters
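The responsibilities above revolve around quota-aware admission and fairness across GPU clusters. As a deliberately minimal sketch of the kind of primitive involved (all types here are illustrative, loosely inspired by Kueue's ClusterQueue concept, not the real Kueue API):

```go
package main

import "fmt"

// Workload asks for a number of GPUs. Illustrative type only.
type Workload struct {
	Name string
	GPUs int
}

// Queue tracks a GPU quota and current usage for one tenant.
type Queue struct {
	Quota int
	Used  int
}

// Admit accepts the workload only if it fits in the remaining quota;
// otherwise the workload stays pending.
func (q *Queue) Admit(w Workload) bool {
	if q.Used+w.GPUs > q.Quota {
		return false // would exceed quota
	}
	q.Used += w.GPUs
	return true
}

func main() {
	q := &Queue{Quota: 8}
	fmt.Println(q.Admit(Workload{Name: "train-a", GPUs: 6})) // true
	fmt.Println(q.Admit(Workload{Name: "train-b", GPUs: 4})) // false: only 2 GPUs left
}
```

Real systems layer preemption, borrowing between queues, and fairness policies on top of this basic fits-or-waits check.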

Skills

Required

  • 8+ years of professional software engineering experience
  • Deep expertise in distributed systems or cloud platforms
  • Strong proficiency in Go
  • Experience designing large-scale, long-lived production systems
  • Deep knowledge of Kubernetes internals, scheduling mechanisms, and controller-based architectures
  • Demonstrated experience designing or evolving orchestration, scheduling, or resource-management platforms
  • Proven ability to lead technical initiatives across teams without direct authority
  • Strong operational mindset with experience owning mission-critical systems at scale
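The "controller-based architectures" called for above share one core pattern: observe current state, compare it to the desired spec, and act to close the gap. A minimal, library-free sketch of a single reconcile pass (types are hypothetical stand-ins, not client-go APIs):

```go
package main

import "fmt"

// State is a stand-in for a Kubernetes object's spec/status;
// the name is illustrative, not a real client-go type.
type State struct {
	Replicas int
}

// Reconcile computes the scaling action needed to move the observed
// state toward the desired state. Positive means scale up.
func Reconcile(desired, observed State) int {
	return desired.Replicas - observed.Replicas
}

func main() {
	// One pass of the loop: a real controller re-runs this on every
	// watch event until the difference reaches zero.
	diff := Reconcile(State{Replicas: 4}, State{Replicas: 1})
	fmt.Println(diff) // 3: add three replicas
}
```

Production controllers wrap this comparison in watch-driven work queues with retries and backoff, but the level-triggered compare-and-act core is the same.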

Nice to have

  • Hands-on experience with Kueue, Volcano, Ray, or similar Kubernetes-native orchestration frameworks
  • Background in AI infrastructure, ML platforms, HPC, or large-scale batch and streaming systems
  • Deep understanding of scheduling concepts including fairness, preemption, quota management, and multi-tenant isolation
  • Experience defining and operating SLOs, capacity models, and large-scale reliability improvements
  • Contributions to open-source infrastructure or orchestration projects

What the JD emphasized

  • AI Workload Orchestration Platform
  • Kubernetes-native orchestration strategy for AI workloads
  • Kueue, Volcano, and Ray
  • SUNK (Slurm on Kubernetes)
  • training and inference workloads
  • large GPU clusters
  • Kubernetes internals
  • scheduling mechanisms
  • controller-based architectures
  • orchestration, scheduling, or resource-management platforms
