Staff Software Engineer, Inference

Weights & Biases · Data AI · Bellevue, WA +1 · Technology

Staff Software Engineer on the Inference Platform Team at CoreWeave, focusing on building and operating a Kubernetes-native inference platform for AI workloads. The role involves technical leadership in architecture, performance optimization (latency, throughput, GPU utilization), and system reliability for low-latency, high-throughput systems at massive scale, with deep work in distributed systems and Kubernetes infrastructure.

What you'd actually do

  1. Act as a technical leader driving architecture, performance, and reliability across multiple services and teams
  2. Lead cross-team design initiatives
  3. Optimize inference performance (latency, throughput, and GPU utilization)
  4. Improve system reliability at scale
  5. Work deeply in distributed systems and Kubernetes-based infrastructure, focusing on areas like scheduling, batching, and memory optimization

Skills

Required

  • 8–12+ years of experience building and operating large-scale distributed systems or cloud platforms
  • Proven experience leading cross-team technical initiatives impacting multiple services or organizations
  • Strong programming skills in Go, Python, or C++
  • Deep expertise in Kubernetes at production scale, including orchestration, scheduling, and service design
  • Strong understanding of distributed systems, networking, and performance optimization
  • Experience designing and operating low-latency, high-throughput systems with strict P95/P99 latency requirements
  • Hands-on experience with inference systems, including batching or micro-batching strategies, caching, and memory optimization
  • Experience improving system performance using metrics-driven approaches (e.g., latency, throughput, utilization)
  • Familiarity with mixed precision (BF16, FP8) and streaming inference workloads

Nice to have

  • Experience with inference frameworks such as vLLM, Triton, TensorRT-LLM, Ray Serve, or TorchServe
  • Experience with GPU systems and performance optimization (CUDA, NCCL, RDMA, NUMA, GPU interconnects)
  • Experience leading multi-team or org-level technical initiatives
  • Exposure to large-scale AI/ML infrastructure or hyperscale cloud environments

What the JD emphasized

  • strict P95/P99 latency requirements
  • low-latency, high-throughput systems

Other signals

  • operating low-latency, high-throughput AI workloads at massive scale
  • Kubernetes-native inference platform
  • scheduling, batching, and memory optimization
  • optimize inference performance (latency, throughput, and GPU utilization)