Staff Engineer, Storage Engine

Weights & Biases · Data AI · Bellevue, WA +4 · Technology

CoreWeave is seeking a Staff Engineer for its Storage Engine team to design and implement distributed storage solutions for AI workloads. Responsibilities include developing exabyte-scale, S3-compatible object storage, integrating dedicated storage clusters, optimizing performance with technologies such as RDMA and GPU Direct Storage, and improving the reliability, durability, security, and observability of the storage stack. The role involves analyzing telemetry data, collaborating with cross-functional teams, and mentoring engineers. It requires 8–10+ years of experience in storage systems engineering, proficiency in a systems programming language (Go, C, or Rust), and familiarity with storage protocols (S3, NFS) and systems (Ceph, DAOS).

What you'd actually do

  1. Design and implement distributed storage solutions to support scaling data-intensive AI workloads.
  2. Contribute to the development of exabyte-scale, S3-compatible object storage and integrate dedicated storage clusters into diverse customer environments.
  3. Work with technologies such as RDMA and GPU Direct Storage, and with distributed filesystem protocols such as NFS or FUSE, to optimize storage performance and efficiency.
  4. Lead efforts to improve the reliability, durability, security, and observability of our storage stack.
  5. Analyze telemetry and system data to drive improvements in throughput, latency, and resilience.

Skills

Required

  • 8–10+ years of experience working in storage systems engineering or infrastructure.
  • Strong hands-on experience with object storage or distributed filesystems in production environments.
  • Experience with one or more storage protocols (e.g., S3, NFS) and file systems such as Ceph, DAOS, or similar.
  • Proficiency in a systems programming language such as Go, C, or Rust.
  • Familiarity with storage observability tools and telemetry pipelines (e.g., ClickHouse, Prometheus, Grafana).
  • Experience working with cloud-native infrastructure, Kubernetes, and scalable system architectures.

Nice to have

  • Proficiency leveraging AI tools to augment software development.

What the JD emphasized

  • exabyte-scale
  • S3-compatible object storage
  • RDMA
  • GPU Direct Storage
  • distributed filesystems
  • NFS
  • FUSE
  • reliability
  • durability
  • security
  • observability
  • telemetry
  • throughput
  • latency
  • resilience
  • object storage
  • Ceph
  • DAOS