Software Development Engineer - Distributed KV Caching and Storage Systems

ByteDance · Big Tech · Seattle, WA · Infrastructure

Software Development Engineer focused on building and optimizing distributed KV caching and storage systems for ByteDance's global infrastructure, serving core business scenarios with strict availability, latency, and cost requirements. Responsibilities include designing and developing these systems, ensuring their reliability and efficiency, building the surrounding operational ecosystem, and exploring AI applications in database systems.

What you'd actually do

  1. Design and develop core KV caching and storage systems, including distributed caching systems and Redis-compatible KV storage systems, with a focus on low latency, high throughput, and high availability.
  2. Build planet-scale reliability, leading or contributing to HA architecture, failure isolation, multi-AZ/multi-region disaster recovery, and large-scale stability engineering for always-on business workloads.
  3. Drive compute/storage efficiency improvements (CPU, memory, IO, network), including cache hierarchy designs (memory/SSD), read/write amplification reductions, and capacity planning for billion-level request traffic.
  4. Build a production-grade ecosystem, including automated orchestration operations (provisioning, scaling, placement, scheduling) and monitoring systems (tracing, profiling, incident response runbooks).
  5. Implement and evolve capabilities such as bulk loading, backup & restore, point-in-time recovery, tiered storage, and integration with upstream/downstream data systems to enrich the data ecosystem.

Skills

Required

  • BS or a higher degree in Computer Science or related fields, or equivalent practical experience
  • Proficiency in one or more programming languages (C, C++, Java, Go, Python, Rust) with strong coding skills in a Linux environment
  • Solid fundamentals in distributed systems, database/storage principles, networking, and multi-threaded programming
  • Strong debugging and performance analysis skills (profiling, tracing, flame graphs, lock contention, tail latency)
  • Hands-on experience building or operating large-scale distributed systems (high QPS, high concurrency, strict SLO/SLA), with proven ability to improve stability, performance, and cost
  • Clear and logical thinking, coupled with a product-oriented mindset, self-driven initiative, and strong project management skills

Nice to have

  • 3+ years in database internals/storage engine/cache system development, or equivalent large-scale infrastructure experience
  • Familiarity with or contributions to systems such as Redis, Tair, MemoryDB, RocksDB, pika, TiDB, etc.
  • Strong knowledge of distributed consensus algorithms, with experience in database kernel development
  • Experience with Linux kernel-level performance tuning, networking stack optimization, or IO subsystem
  • Familiarity with RDMA, CXL, ZNS SSD, or modern storage hardware
  • Interest or experience in applying AI techniques to database systems (e.g., cost modeling, workload prediction, auto-tuning)

What the JD emphasized

  • mission-critical
  • massive scale
  • strict requirements on availability, latency, throughput, global deployment, and cost efficiency
  • low latency, high throughput, and high availability
  • planet-scale reliability
  • large-scale stability engineering
  • billion-level request traffic
  • production-grade ecosystem
  • always-on business workloads
  • strict SLO/SLA
  • proven ability to improve stability, performance, and cost