Senior Software Engineer II, AI Workload Orchestration

Weights & Biases · Data AI · Bellevue, WA +1 · Technology

CoreWeave is seeking a Senior Software Engineer II on the AI Workload Orchestration team to build and operate a Kubernetes-native platform for AI workloads. The role covers designing, building, and operating services for AI workload orchestration and scheduling; improving scheduling latency, cluster utilization, and workload reliability; and contributing to architectural discussions. The ideal candidate has strong experience in distributed systems, Kubernetes, and Go, plus familiarity with AI infrastructure and scheduling concepts.

What you'd actually do

  1. Design, build, and operate Kubernetes-native services for AI workload orchestration and scheduling (see the reconcile-loop sketch after this list)
  2. Own one or more platform components end-to-end, including design, implementation, testing, and on-call support
  3. Improve scheduling latency, cluster utilization, and workload reliability through metrics-driven engineering
  4. Contribute to architectural discussions across services and influence design decisions within the platform
  5. Work closely with adjacent teams (CKS, infrastructure, managed inference) to ensure clean interfaces and integrations
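
The first responsibility above is the standard Kubernetes controller pattern: watch objects, compare desired and observed state, and converge. As a minimal sketch only, here is what that reconcile loop looks like in Go with the controller-runtime library; the PodReconciler type and its placeholder logic are illustrative assumptions, not this team's actual code.

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// PodReconciler is an illustrative reconciler. A real orchestration
// service would reconcile richer workload objects than bare Pods.
type PodReconciler struct {
	client.Client
}

// Reconcile runs whenever a watched Pod changes.
func (r *PodReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var pod corev1.Pod
	if err := r.Get(ctx, req.NamespacedName, &pod); err != nil {
		// The Pod may have been deleted after the event was queued.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Placement, gang admission, or quota logic would go here.
	return ctrl.Result{}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Pod{}).
		Complete(&PodReconciler{Client: mgr.GetClient()}); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```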

Skills

Required

  • 5–8 years of professional software engineering experience in distributed systems, cloud infrastructure, or platform engineering
  • Strong experience building production systems in Go
  • Solid understanding of Kubernetes fundamentals, APIs, controllers, and operating services in production
  • Experience working with scheduling, resource management, or quota-based systems
  • Proven ability to improve system reliability and performance using data and operational metrics (a latency-histogram sketch follows this list)
  • Comfortable owning services in production and participating in on-call rotations
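
On the metrics-driven bullet, a minimal sketch of the instrumentation side: a Prometheus histogram for end-to-end scheduling latency in Go. The metric name, bucket layout, and observePlacement helper are hypothetical, chosen here only to show the shape of the technique.

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// schedulingLatency tracks time from workload submission to placement.
// Exponential buckets cover roughly 10ms to 40s.
var schedulingLatency = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "workload_scheduling_latency_seconds",
	Help:    "Time from workload submission to node placement.",
	Buckets: prometheus.ExponentialBuckets(0.01, 2, 12),
})

// observePlacement records one placement; call it at the point where
// the scheduler binds a workload to a node.
func observePlacement(submitted time.Time) {
	schedulingLatency.Observe(time.Since(submitted).Seconds())
}

func main() {
	// Expose /metrics for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":2112", nil)
}
```

Histograms like this are what make "improve scheduling latency" testable: you can set and alert on a p99 objective rather than rely on anecdotes.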

Nice to have

  • Experience with Python or C++
  • Experience with Kubernetes-native orchestration frameworks such as Kueue, Volcano, Ray, Kubeflow, or Argo Workflows
  • Familiarity with GPU-based workloads, ML training, or inference pipelines
  • Knowledge of scheduling concepts such as quota enforcement, preemption, and backfilling (see the admission sketch after this list)
  • Experience with reliability practices including SLOs, alerting, and incident response
  • Exposure to AI infrastructure, HPC, or large-scale distributed compute environments
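
To ground the quota-enforcement and preemption bullet above, here is a toy admission check in Go: a workload is admitted if it fits under a GPU quota, and otherwise strictly lower-priority running work is preempted to make room. The Workload type and admit function are illustrative assumptions, not a real scheduler API such as Kueue's.

```go
package main

import (
	"fmt"
	"sort"
)

// Workload is a toy model; the fields are hypothetical.
type Workload struct {
	Name     string
	GPUs     int
	Priority int // higher means more important
}

// admit returns whether incoming fits under quota, along with any
// lower-priority workloads that must be preempted to make room.
func admit(running []Workload, quota int, incoming Workload) (bool, []Workload) {
	used := 0
	for _, w := range running {
		used += w.GPUs
	}
	if used+incoming.GPUs <= quota {
		return true, nil // fits without preemption
	}
	// Consider victims cheapest-first by priority.
	sort.Slice(running, func(i, j int) bool { return running[i].Priority < running[j].Priority })
	var preempted []Workload
	for _, w := range running {
		if w.Priority >= incoming.Priority {
			break // never preempt equal- or higher-priority work
		}
		preempted = append(preempted, w)
		used -= w.GPUs
		if used+incoming.GPUs <= quota {
			return true, preempted
		}
	}
	return false, nil // cannot make room even with preemption
}

func main() {
	running := []Workload{{"batch-a", 8, 1}, {"train-b", 16, 5}}
	ok, victims := admit(running, 24, Workload{"infer-c", 8, 10})
	fmt.Println(ok, victims) // true [{batch-a 8 1}]
}
```

Backfilling would be the complementary move: slotting small, short jobs into capacity that a large queued job cannot yet use.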

What the JD emphasized

  • AI workloads at scale
  • AI training and inference workflows
  • Kubernetes-native services for AI workload orchestration and scheduling
  • scheduling, resource management, or quota-based systems
  • GPU-based workloads, ML training, or inference pipelines
  • AI infrastructure, HPC, or large-scale distributed compute environments

Other signals

  • Kubernetes-native platform for AI workloads
  • scheduling and operating AI workloads at scale
  • integrates multiple orchestration and scheduling frameworks
  • support modern AI training and inference workflows
  • own meaningful components of the platform
  • drive reliability and performance improvements
  • scale the system as customer demand and workload complexity continue to grow