Senior Software Development Engineer - Distributed KV Caching and Storage Systems

ByteDance · Big Tech · San Jose, CA · Infrastructure

This role is for a Senior Software Development Engineer who will design and develop core KV caching and storage systems for ByteDance's global infrastructure. Responsibilities include building planet-scale reliability, driving efficiency improvements, creating a production-grade ecosystem with automated operations and monitoring, and implementing capabilities like bulkload and backup. The role requires strong fundamentals in distributed systems, databases, networking, and multi-threaded programming, with hands-on experience in large-scale distributed systems. Experience with modern hardware and applying AI techniques to database systems is preferred.

What you'd actually do

Design and develop core KV caching and storage systems, including distributed caching systems and Redis-compatible KV storage systems, with a focus on low latency, high throughput, and high availability.
Build planet-scale reliability, leading or contributing to HA architecture, failure isolation, multi-AZ/multi-region disaster recovery, and large-scale stability engineering for always-on business workloads.
Drive compute/storage efficiency improvements (CPU, memory, IO, network), including cache hierarchy designs (memory/SSD), read/write amplification reductions, and capacity planning for billion-level request traffic.
Build a production-grade ecosystem, including automated orchestration operations (provisioning, scaling, placement, scheduling) and monitoring systems (tracing, profiling, incident response runbooks).
Implement and evolve capabilities such as Bulkload, backup & restore, point-in-time recovery, tiered storage, and integration with upstream/downstream data systems to enrich data ecosystems.

Skills

Required

BS or a higher degree in Computer Science or related fields, or equivalent practical experience
Proficiency in one or more programming languages (C, C++, Java, Go, Python, Rust) with strong coding skills in a Linux environment
Solid fundamentals in distributed systems, database/storage principles, networking, and multi-threaded programming
Strong debugging and performance analysis skills (profiling, tracing, flame graphs, lock contention, tail latency)
Hands-on experience building or operating large-scale distributed systems (high QPS, high concurrency, strict SLO/SLA)
Clear and logical thinking, coupled with a product-oriented mindset, self-driven initiative, and strong project management skills

Nice to have

3+ years in database internals/storage engine/cache system development, or equivalent large-scale infrastructure experience
Familiarity with or contributions to systems such as Redis, Tair, MemoryDB, RocksDB, pika, TiDB, etc.
Strong knowledge of distributed consensus algorithms, with experience in database kernel development
Experience with Linux kernel-level performance tuning, networking stack optimization, or IO subsystem optimization
Familiarity with RDMA, CXL, ZNS SSD, or modern storage hardware
Interest or experience in applying AI techniques to database systems (e.g., cost modeling, workload prediction, auto-tuning)

What the JD emphasized

strict requirements on availability, latency, throughput, global deployment, and cost efficiency
low latency, high throughput, and high availability
planet-scale reliability
compute/storage efficiency improvements
automated orchestration operations
monitoring systems
large-scale distributed systems (high QPS, high concurrency, strict SLO/SLA)
proven ability to improve stability, performance, and cost


About the Team

Join ByteDance's KV caching and storage systems team, where we build and own mission-critical distributed KV caching and storage products powering ByteDance's global infrastructure. Our portfolio includes Redis-compatible services, next-generation shared-storage engines, and performance/cost optimization components, along with a full ecosystem of operational automation, observability, data movement, and recovery capabilities. We serve ByteDance's core business scenarios at massive scale (recommendation, search, ads, e-commerce, messaging, live streaming, and collaboration suites) with strict requirements on availability, latency, throughput, global deployment, and cost efficiency.

Responsibilities

Design and develop core KV caching and storage systems, including distributed caching systems and Redis-compatible KV storage systems, with a focus on low latency, high throughput, and high availability.
Build planet-scale reliability, leading or contributing to HA architecture, failure isolation, multi-AZ/multi-region disaster recovery, and large-scale stability engineering for always-on business workloads.
Drive compute/storage efficiency improvements (CPU, memory, IO, network), including cache hierarchy designs (memory/SSD), read/write amplification reductions, and capacity planning for billion-level request traffic.
Build a production-grade ecosystem, including automated orchestration operations (provisioning, scaling, placement, scheduling) and monitoring systems (tracing, profiling, incident response runbooks).
Implement and evolve capabilities such as Bulkload, backup & restore, point-in-time recovery, tiered storage, and integration with upstream/downstream data systems to enrich data ecosystems.
Research new hardware and technologies; evaluate and land improvements in production using ZNS SSD, io_uring, RDMA/CXL, and "AI+DB" directions.

Requirements

Minimum Qualifications:

BS or a higher degree in Computer Science or related fields, or equivalent practical experience.
Proficiency in one or more programming languages (C, C++, Java, Go, Python, Rust) with strong coding skills in a Linux environment.
Solid fundamentals in distributed systems, database/storage principles, networking, and multi-threaded programming; strong debugging and performance analysis skills (profiling, tracing, flame graphs, lock contention, tail latency).
Hands-on experience building or operating large-scale distributed systems (high QPS, high concurrency, strict SLO/SLA), with proven ability to improve stability, performance, and cost.
Clear and logical thinking, coupled with a product-oriented mindset, self-driven initiative, and strong project management skills.

Preferred Qualifications:

3+ years in database internals/storage engine/cache system development, or equivalent large-scale infrastructure experience.
Familiarity with or contributions to systems such as Redis, Tair, MemoryDB, RocksDB, pika, TiDB, etc.
Strong knowledge of distributed consensus algorithms, with experience in database kernel development.
Experience with Linux kernel-level performance tuning, networking stack optimization, or IO subsystem optimization.
Familiarity with RDMA, CXL, ZNS SSD, or modern storage hardware.
Interest or experience in applying AI techniques to database systems (e.g., cost modeling, workload prediction, auto-tuning).
