Senior Software Development Engineer - Distributed KV Caching and Storage Systems

ByteDance · Big Tech · Seattle, WA · Infrastructure

ByteDance is hiring a Senior Software Development Engineer to design and develop core distributed KV caching and storage systems, with a focus on low latency, high throughput, and high availability. Responsibilities include building planet-scale reliability, driving efficiency improvements, creating a production-grade ecosystem with automation and monitoring, and implementing capabilities such as backup and tiered storage. The role also involves researching new hardware and technologies, including "AI+DB" directions.

What you'd actually do

  1. Design and develop core KV caching and storage systems, including distributed caching systems and Redis-compatible KV storage systems, with a focus on low latency, high throughput, and high availability.
  2. Build planet-scale reliability, leading or contributing to HA architecture, failure isolation, multi-AZ/multi-region disaster recovery, and large-scale stability engineering for always-on business workloads.
  3. Drive compute/storage efficiency improvements (CPU, memory, IO, network), including cache hierarchy designs (memory/SSD), read/write amplification reductions, and capacity planning for billion-scale request traffic.
  4. Build a production-grade ecosystem, including automated orchestration operations (provisioning, scaling, placement, scheduling) and monitoring systems (tracing, profiling, incident response runbooks).
  5. Implement and evolve capabilities such as Bulkload, backup & restore, point-in-time recovery, tiered storage, and integration with upstream/downstream data systems to enrich data ecosystems.
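To make the cache-hierarchy idea in item 3 concrete, here is a minimal two-tier sketch: a small in-memory LRU tier backed by a larger, slower tier standing in for SSD. This is purely illustrative (the `TieredCache` class and its dict-backed "SSD" tier are assumptions for the sketch, not ByteDance's design).

```python
from collections import OrderedDict

class TieredCache:
    """Illustrative two-tier cache: hot in-memory LRU tier, plus a
    dict standing in for a colder SSD tier. Not a production design."""

    def __init__(self, mem_capacity: int):
        self.mem = OrderedDict()   # hot tier, maintained in LRU order
        self.ssd = {}              # cold tier (stand-in for SSD)
        self.mem_capacity = mem_capacity

    def get(self, key):
        if key in self.mem:        # memory hit: refresh LRU position
            self.mem.move_to_end(key)
            return self.mem[key]
        if key in self.ssd:        # SSD hit: promote back into memory
            value = self.ssd[key]
            self._put_mem(key, value)
            return value
        return None                # full miss

    def put(self, key, value):
        self._put_mem(key, value)

    def _put_mem(self, key, value):
        self.mem[key] = value
        self.mem.move_to_end(key)
        if len(self.mem) > self.mem_capacity:
            old_key, old_val = self.mem.popitem(last=False)
            self.ssd[old_key] = old_val  # demote coldest entry to SSD tier
```

Demotion on memory pressure and promotion on read are the two moves that shape the read/write amplification trade-offs the role mentions.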

Skills

Required

  • BS or a higher degree in Computer Science or related fields, or equivalent practical experience
  • Proficiency in one or more programming languages (C, C++, Java, Go, Python, Rust) with strong coding skills in a Linux environment
  • Solid fundamentals in distributed systems, database/storage principles, networking, and multi-threaded programming
  • Strong debugging and performance analysis skills (profiling, tracing, flame graphs, lock contention, tail latency)
  • Hands-on experience building or operating large-scale distributed systems (high QPS, high concurrency, strict SLO/SLA)
  • Clear and logical thinking, coupled with a product-oriented mindset, self-driven initiative, and strong project management skills
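"Tail latency" in the skills list refers to high-percentile latencies (p99/p999) rather than averages. As a minimal sketch of what that means, a nearest-rank percentile over latency samples (production systems typically use streaming sketches such as HDR histograms instead):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample that is >= p% of the
    distribution. Illustrative only, not a production metrics pipeline."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

latencies_ms = [1, 2, 2, 3, 3, 3, 5, 8, 13, 200]  # one slow outlier
p50 = percentile(latencies_ms, 50)  # median: 3 ms
p99 = percentile(latencies_ms, 99)  # tail: 200 ms, dominated by the outlier
```

The gap between the median and the p99 here is why the JD stresses tail latency: a single slow request class can violate an SLO even when averages look healthy.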

Nice to have

  • 3+ years in database internals/storage engine/cache system development, or equivalent large-scale infrastructure experience
  • Familiarity with or contributions to systems such as Redis, Tair, MemoryDB, RocksDB, pika, TiDB, etc.
  • Strong knowledge of distributed consensus algorithms, with experience in database kernel development
  • Experience with Linux kernel-level performance tuning, networking stack optimization, or IO subsystem
  • Familiarity with RDMA, CXL, ZNS SSD, or modern storage hardware
  • Interest or experience in applying AI techniques to database systems (e.g., cost modeling, workload prediction, auto-tuning)
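"Redis-compatible" (mentioned in both the responsibilities and the systems listed above) means speaking the RESP wire protocol. A minimal sketch of encoding a command as a RESP array of bulk strings (the helper name is an assumption for illustration):

```python
def encode_command(*args):
    """Encode a Redis command as a RESP array of bulk strings.
    RESP arrays start with '*<count>\\r\\n'; each bulk string is
    '$<length>\\r\\n<data>\\r\\n'. Illustrative sketch only."""
    parts = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        data = arg.encode() if isinstance(arg, str) else arg
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)

# SET key value on the wire:
assert encode_command("SET", "key", "value") == \
    b"*3\r\n$3\r\nSET\r\n$3\r\nkey\r\n$5\r\nvalue\r\n"
```

Length-prefixed bulk strings make the protocol binary-safe, which is part of why so many KV systems adopt RESP compatibility.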

What the JD emphasized

  • Strict requirements on availability, latency, throughput, global deployment, and cost efficiency
  • Large-scale distributed systems (high QPS, high concurrency, strict SLO/SLA)
  • Proven ability to improve stability, performance, and cost
  • Interest or experience in applying AI techniques to database systems (e.g., cost modeling, workload prediction, auto-tuning)