Senior Software Engineer II, Inference

Weights & Biases · Data AI · Bellevue, WA +1 · Technology

Senior Software Engineer II focused on owning and optimizing CoreWeave's Kubernetes-native inference platform to meet strict P99 SLAs at scale. Responsibilities include leading design reviews, implementing advanced optimizations for latency and throughput, strengthening incident posture, and mentoring junior engineers. Requires strong experience in distributed systems, Python/Go, networked systems performance, Kubernetes, and ML inference internals.

What you'd actually do

  1. Lead design reviews and drive architecture within the team; decompose multi-service work into clear milestones.
  2. Define and own SLIs/SLOs; ensure post-incident actions land and reliability improves release-over-release.
  3. Implement advanced optimizations (e.g., micro-batch schedulers, speculative decoding, KV-cache reuse) and quantify impact.
  4. Strengthen incident posture: capacity planning, autoscaling policy, graceful degradation, rollback/traffic-shift strategies.
  5. Own an area spanning multiple services and teams (e.g., request routing & adaptive scheduling, cost-per-token analytics, GPU resource isolation).

Skills

Required

  • distributed systems
  • cloud services
  • Python
  • Go
  • networked systems
  • performance optimization
  • Kubernetes
  • CI/CD
  • observability stacks (Prometheus, Grafana, OpenTelemetry)
  • ML inference internals (batching, caching, mixed precision, streaming token delivery)

Nice to have

  • C++
  • CUDA kernels
  • inference frameworks (vLLM, Triton, TensorRT-LLM, Ray Serve, TorchServe)
  • NCCL/SHARP
  • RDMA/NUMA
  • GPU interconnect topologies
  • leading multi-team initiatives
  • partnering with customers

What the JD emphasized

  • P99 SLAs
  • tail latency (P95/P99)
  • optimize end-to-end ML system performance
  • developing and tuning CUDA kernels
  • reducing model latency
  • maximizing compute and memory bandwidth utilization
  • leveraging custom accelerators for high-efficiency workloads
  • practical knowledge of inference internals: batching, caching, mixed precision (BF16/FP8), streaming token delivery

Other signals

  • inference platform
  • P99 SLAs
  • Kubernetes-native
  • optimize ML system performance
  • CUDA kernels
  • low-latency LLM
  • tail latency