Senior Software Engineer, Cluster Orchestration

Weights & Biases · Data AI · Bellevue, WA +1 · Technology

CoreWeave is seeking a Senior Software Engineer to join its Cluster Orchestration team. The role focuses on advancing CoreWeave's orchestration platform, including SUNK (Slurm on Kubernetes) and its Kubernetes-native foundation, which powers AI training and inference at scale. The engineer will ensure workloads run seamlessly, reliably, and efficiently across massive GPU clusters, eliminating infrastructure bottlenecks and building new orchestration capabilities for customers. Responsibilities include owning multiple services, leading design and code reviews, decomposing projects into milestones, driving improvements in reliability and performance, defining SLIs/SLOs, strengthening operational practices, and mentoring junior engineers.

What you'd actually do

  1. Own multiple services within the orchestration platform
  2. Lead design and code reviews, decompose projects into milestones, and drive measurable improvements in reliability and performance
  3. Define SLIs/SLOs for your services, strengthen operational practices, and mentor IC1/IC2 engineers
  4. Ensure customers see consistent improvements in throughput, latency, and system resilience

Skills

Required

  • 3–5 years of professional software engineering experience building distributed systems or cloud services
  • Strong coding skills in Go
  • Solid CS fundamentals
  • Hands-on experience running Kubernetes at production scale
  • Familiarity with observability stacks (Prometheus, Grafana, OpenTelemetry)
  • Proven ability to improve service reliability and performance using metrics (P95/P99 latency, throughput, error budgets)

Nice to have

  • Experience with Python or C++
  • Experience with orchestration and workflow technologies such as Ray, Kubeflow, Kueue, Istio, Knative, or Argo Workflows
  • Experience with distributed workloads, GPU-based applications, or ML pipelines
  • Familiarity with scheduling concepts such as quota enforcement, preemption, and scaling strategies
  • Familiarity with reliability practices including SLOs, alarms, and post-incident reviews

What the JD emphasized

  • SUNK (Slurm on Kubernetes)
  • Kubernetes-native foundation
  • AI training and inference at scale
  • massive GPU clusters
  • eliminate infrastructure bottlenecks
  • customers to innovate faster
  • push the boundaries of what’s possible with AI
  • production scale
  • distributed systems
  • cloud services
  • reliability and performance
  • SLIs/SLOs
  • throughput, latency, and system resilience