Staff Software Engineer, Applied Training

Weights & Biases · Data AI · Bellevue, WA +1 · Technology

Staff Software Engineer role focused on building and scaling the Kubernetes-native research cluster platform and sandbox client for agentic training and evaluation. The goal is to provide AI labs with research infrastructure. The role calls for expertise in distributed systems, ML infrastructure, or developer platforms, with a strong emphasis on Kubernetes and researcher productivity.

What you'd actually do

  1. Contribute to the roadmap for Applied Training — figure out what actually unlocks new workloads and what's just nice to have
  2. Work directly and closely with customers and with other teams inside CoreWeave that are building cloud-native primitives: compute, storage, networking, etc.
  3. For the research cluster platform: design and build a complete research cluster experience — CLI, job configuration schema, Kubernetes operators, daemons — solving the problems researchers actually hit: code distribution, checkpoint-triggered evaluation, cross-cluster scheduling, programmatic job control
  4. For sandbox infrastructure: own the Python SDK and work in a tight loop with the backend team, enabling RL training runs to spawn thousands of isolated containers for agent rollouts and agent benchmarks at scale
  5. Write documentation for running popular OSS training frameworks on CoreWeave to unblock customers and help them succeed

Skills

Required

  • 8–12+ years building distributed systems, ML infrastructure, or developer platforms
  • Real Kubernetes experience: custom controllers, operators, scheduling, CRDs, workload orchestration at scale
  • Understanding of researcher productivity needs
  • Familiarity with model training: distributed job scheduling, rank initialization, scaling issues
  • Shipped production infrastructure
  • Strong communication skills

Nice to have

  • Experience building internal ML platforms or research clusters at a company doing large-scale training
  • Familiarity with agentic AI: RL training with rollouts, agent evaluation, sandbox isolation for running untrusted code
  • Background with Slurm, Ray, or similar workload orchestration
  • Experience with container runtimes, isolation (gVisor, Kata), or serverless platforms
  • OSS contributions to Kubernetes SIGs, Ray, PyTorch, or similar

What the JD emphasized

  • Kubernetes experience: custom controllers, operators, scheduling, CRDs, workload orchestration at scale
  • shipped infrastructure that other people rely on daily
  • agentic AI: RL training with rollouts, agent evaluation, sandbox isolation for running untrusted code

Other signals

  • Kubernetes-native research cluster platform
  • sandbox client for agentic training and evaluation
  • research infrastructure for AI labs
  • Python SDK for RL training runs
  • agent rollouts and agent benchmarks at scale