Senior Software Development Engineer - Distributed KV Caching and Storage Systems

ByteDance · Big Tech · San Jose, CA · Infrastructure

ByteDance is hiring a Senior Software Development Engineer to design and develop core KV caching and storage systems for its global infrastructure. Responsibilities include building planet-scale reliability, driving efficiency improvements, creating a production-grade ecosystem with automated operations and monitoring, and implementing capabilities like bulkload and backup. The role requires strong fundamentals in distributed systems, databases, networking, and multi-threaded programming, with hands-on experience in large-scale distributed systems. Experience with modern hardware and applying AI techniques to database systems is preferred.

What you'd actually do

  1. Design and develop core KV caching and storage systems, including distributed caching systems and Redis-compatible KV storage systems, with a focus on low latency, high throughput, and high availability.
  2. Build planet-scale reliability, leading or contributing to HA architecture, failure isolation, multi-AZ/multi-region disaster recovery, and large-scale stability engineering for always-on business workloads.
  3. Drive compute/storage efficiency improvements (CPU, memory, IO, network), including cache hierarchy designs (memory/SSD), read/write amplification reductions, and capacity planning for billion-scale request traffic.
  4. Build a production-grade ecosystem, including automated orchestration operations (provisioning, scaling, placement, scheduling) and monitoring systems (tracing, profiling, incident response runbooks).
  5. Implement and evolve capabilities such as bulkload, backup & restore, point-in-time recovery, tiered storage, and integration with upstream/downstream data systems to enrich the data ecosystem.
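
The core of item 1 is cache eviction under a capacity budget. As a refresher (not ByteDance's actual design), here is a minimal sketch of an LRU cache, the textbook policy that production systems like these typically extend; the class name and capacity are illustrative:

```python
from collections import OrderedDict

class LRUCache:
    """Toy in-memory LRU cache: evicts the least recently used key
    once capacity is exceeded. A sketch, not a production design."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._store: "OrderedDict[str, str]" = OrderedDict()

    def get(self, key: str):
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark key as most recently used
        return self._store[key]

    def put(self, key: str, value: str) -> None:
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("a", "1")
cache.put("b", "2")
cache.get("a")       # touching "a" makes "b" the LRU entry
cache.put("c", "3")  # capacity exceeded: "b" is evicted
```

Real systems layer sharding, SSD tiers, and approximate-LRU sampling (as Redis does) on top of this basic idea, but the invariant is the same.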

Skills

Required

  • BS or higher degree in Computer Science or a related field, or equivalent practical experience
  • Proficiency in one or more programming languages (C, C++, Java, Go, Python, Rust) with strong coding skills in a Linux environment
  • Solid fundamentals in distributed systems, database/storage principles, networking, and multi-threaded programming
  • Strong debugging and performance analysis skills (profiling, tracing, flame graphs, lock contention, tail latency)
  • Hands-on experience building or operating large-scale distributed systems (high QPS, high concurrency, strict SLO/SLA)
  • Clear and logical thinking, coupled with a product-oriented mindset, self-driven initiative, and strong project management skills
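
"Tail latency" in the debugging bullet refers to high percentiles (p99, p999) rather than averages, since a few slow requests dominate user experience at high QPS. A minimal sketch of nearest-rank percentile computation over latency samples (the sample data is hypothetical):

```python
import random

def percentile(samples, p):
    """Nearest-rank percentile of a sample list; p in (0, 100]."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

# Hypothetical request latencies in milliseconds, for illustration only.
random.seed(0)
latencies = [random.expovariate(1 / 5) for _ in range(10_000)]

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)  # tail latency: far above the median
```

Production monitoring stacks usually compute these from streaming sketches (e.g. t-digest or HDR histograms) rather than sorting raw samples, but the quantity being tracked is the same.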

Nice to have

  • 3+ years in database internals/storage engine/cache system development, or equivalent large-scale infrastructure experience
  • Familiarity with or contributions to systems such as Redis, Tair, MemoryDB, RocksDB, pika, TiDB, etc.
  • Strong knowledge of distributed consensus algorithms, with experience in database kernel development
  • Experience with Linux kernel-level performance tuning, networking stack optimization, or IO subsystem
  • Familiarity with RDMA, CXL, ZNS SSD, or modern storage hardware
  • Interest or experience in applying AI techniques to database systems (e.g., cost modeling, workload prediction, auto-tuning)

What the JD emphasized

  • strict requirements on availability, latency, throughput, global deployment, and cost efficiency
  • low latency, high throughput, and high availability
  • planet-scale reliability
  • compute/storage efficiency improvements
  • automated orchestration operations
  • monitoring systems
  • large-scale distributed systems (high QPS, high concurrency, strict SLO/SLA)
  • proven ability to improve stability, performance, and cost