Staff Engineer, Storage Control Plane

Weights & Biases · Data AI · Bellevue, WA +5 · Technology

The Staff Storage Engineer designs, builds, and operates the control plane for a high-performance AI storage platform. The role involves evolving storage systems into reliable, scalable, high-throughput solutions that power AI workloads. Responsibilities include designing a scalable multi-tenant control plane, contributing to exabyte-scale object storage and distributed file systems, and optimizing storage performance using technologies like RDMA and GPU Direct Storage. The role also calls for close collaboration with infrastructure, compute, and platform teams, and for improving the reliability, durability, and observability of the storage stack.

What you'd actually do

  1. Design and implement a highly scalable multi-tenant control plane that supports CoreWeave’s growing AI storage and cloud infrastructure needs.
  2. Contribute to the development of exabyte-scale, S3-compatible object storage and distributed file systems, and integrate dedicated storage clusters into diverse customer environments.
  3. Work with technologies such as RDMA, GPU Direct Storage, RoCE, InfiniBand, SPDK, and distributed filesystems to optimize storage performance and efficiency.
  4. Participate in efforts to improve the reliability, durability, and observability of our storage stack.
  5. Collaborate with operations teams to monitor, analyze, and optimize storage systems using telemetry, metrics, and dashboards to improve performance, latency, and resilience.

Skills

Required

  • Storage systems engineering or infrastructure
  • Object storage or distributed filesystems
  • S3, NFS
  • Ceph, DAOS
  • Go, C++, or Rust
  • Storage observability tools and telemetry pipelines
  • Cloud-native infrastructure, Kubernetes
  • Scalable system architecture
  • Debugging and problem-solving skills in distributed, high-performance environments

Nice to have

  • RDMA
  • GPU Direct Storage
  • RoCE
  • InfiniBand
  • SPDK
  • ClickHouse
  • Prometheus
  • Grafana

What the JD emphasized

  • 10+ years of experience working in storage systems engineering or infrastructure
  • Strong hands-on experience with object storage or distributed filesystems in production environments
  • Proficiency in a systems programming language such as Go, C++, or Rust