Staff Software Engineer, Cluster Orch (SUNK)

Weights & Biases · Data AI · Bellevue, WA +1 · Technology

Staff Software Engineer role focused on advancing CoreWeave's orchestration platform, including SUNK (Slurm on Kubernetes), for AI training and inference at scale. The role involves technical leadership, architectural direction, and ensuring efficient, reliable workload execution across large GPU clusters.

What you'd actually do

  1. Play a key role in advancing CoreWeave’s orchestration platform, the Kubernetes-native foundation that powers AI training and inference at scale, including SUNK (Slurm on Kubernetes) and beyond.
  2. Ensure workloads run seamlessly, reliably, and efficiently across massive GPU clusters.
  3. Build the systems that eliminate infrastructure bottlenecks and create new orchestration capabilities.
  4. Act as a technical leader, shaping the long-term strategy for CoreWeave’s orchestration platform.
  5. Define architectural direction, own critical parts of the orchestration platform and other managed services, and drive cross-org initiatives in scheduling, quota enforcement, and scaling at hyperscale.

Skills

Required

  • 8–12 years of professional software engineering experience
  • designing and operating large-scale distributed systems in production
  • Slurm/Kubernetes internals and cloud-native development
  • Go and distributed systems design
  • setting technical direction and influencing cross-team architecture
  • mentoring senior engineers and elevating organizational standards

Nice to have

  • orchestration and workflow technologies such as Ray, Kubeflow, Kueue, Istio, Knative, or Argo Workflows
  • distributed workloads, GPU-based applications, or ML pipelines
  • scheduling concepts like quota enforcement, pre-emption, and scaling strategies
  • reliability practices including SLOs, alarms, and post-incident reviews
  • AI infrastructure and workloads (ML training, inference, or HPC)

What the JD emphasized

  • AI training and inference at scale
  • massive GPU clusters
  • next-generation AI workloads
  • large-scale distributed systems
  • Slurm/Kubernetes internals
  • Go and distributed systems design
  • scheduling concepts
  • AI infrastructure and workloads