Principal Engineer, Cluster Orchestration

Weights & Biases · Data AI · Bellevue, WA +1 · Technology

CoreWeave is seeking a Principal Engineer to lead the design and evolution of the cluster orchestration systems behind its AI infrastructure, including Slurm, Kubernetes, and SUNK (Slurm on Kubernetes). The role involves defining long-term architecture, solving scaling problems, and ensuring reliable, efficient use of GPU resources for AI training and inference workloads.

What you'd actually do

  1. Define the long-term architecture for CoreWeave’s orchestration platforms across Kubernetes, Slurm, SUNK, Kueue, and related systems.
  2. Lead the evolution of Kubernetes-native control planes, including SUNK and custom operators.
  3. Set standards for reliability, observability, and operational readiness across orchestration services.
  4. Write and review production code for Kubernetes controllers, schedulers, admission logic, and internal tooling.
  5. Mentor senior and staff engineers and help grow technical leaders.

Skills

Required

  • Distributed systems
  • Kubernetes
  • Slurm
  • GPU platforms
  • Go
  • Cloud-native systems development

Nice to have

  • Kueue
  • Kubeflow
  • Argo Workflows
  • Ray
  • Istio
  • Knative
  • ML platform engineering
  • model onboarding
  • lifecycle management
  • scheduling strategies
  • pre-emption
  • quota enforcement
  • elastic scaling
  • highly reliable systems
  • SLOs
  • incident processes
  • Kubernetes contributions
  • ML infrastructure contributions
  • open-source projects
  • mentoring senior engineers

What the JD emphasized

  • Deep, practical knowledge of Kubernetes and Slurm internals.
  • Experience running GPU-heavy platforms for AI training, inference, or HPC workloads.
  • Proven ability to set technical direction across teams without direct authority.
  • Comfort with making high-impact technical decisions in complex systems.

Other signals

  • GPU clusters
  • AI training
  • inference
  • model onboarding
  • Kubernetes
  • Slurm