Software Development Engineer - Distributed KV Caching and Storage Systems

ByteDance · Big Tech · Seattle, WA · Infrastructure

Software Development Engineer focused on building and optimizing distributed KV caching and storage systems for ByteDance's global infrastructure, serving core business scenarios with strict availability, latency, and cost requirements. Responsibilities include designing and developing these systems, ensuring their reliability and efficiency, building the surrounding operational ecosystem, and exploring AI applications in database systems.

What you'd actually do

  1. Design and develop core KV caching and storage systems, including distributed caching systems and Redis-compatible KV storage systems, with a focus on low latency, high throughput, and high availability.
  2. Build planet-scale reliability, leading or contributing to HA architecture, failure isolation, multi-AZ/multi-region disaster recovery, and large-scale stability engineering for always-on business workloads.
  3. Drive compute/storage efficiency improvements (CPU, memory, IO, network), including cache hierarchy designs (memory/SSD), read/write amplification reductions, and capacity planning for billion-level request traffic.
  4. Build a production-grade ecosystem, including automated orchestration operations (provisioning, scaling, placement, scheduling) and monitoring systems (tracing, profiling, incident response runbooks).
  5. Implement and evolve capabilities such as bulk loading, backup & restore, point-in-time recovery, tiered storage, and integration with upstream/downstream data systems to enrich the data ecosystem.

Skills

Required

  • BS or a higher degree in Computer Science or related fields, or equivalent practical experience
  • Proficiency in one or more programming languages (C, C++, Java, Go, Python, Rust) with strong coding skills in a Linux environment
  • Solid fundamentals in distributed systems, database/storage principles, networking, and multi-threaded programming
  • Strong debugging and performance analysis skills (profiling, tracing, flame graphs, lock contention, tail latency)
  • Hands-on experience building or operating large-scale distributed systems (high QPS, high concurrency, strict SLO/SLA), with proven ability to improve stability, performance, and cost
  • Clear and logical thinking, coupled with a product-oriented mindset, self-driven initiative, and strong project management skills

Nice to have

  • 3+ years in database internals/storage engine/cache system development, or equivalent large-scale infrastructure experience
  • Familiarity with or contributions to systems such as Redis, Tair, MemoryDB, RocksDB, pika, TiDB, etc.
  • Strong knowledge of distributed consensus algorithms, with experience in database kernel development
  • Experience with Linux kernel-level performance tuning, networking stack optimization, or IO subsystem
  • Familiarity with RDMA, CXL, ZNS SSD, or modern storage hardware
  • Interest or experience in applying AI techniques to database systems (e.g., cost modeling, workload prediction, auto-tuning)

What the JD emphasized

  • mission-critical
  • massive scale
  • strict requirements on availability, latency, throughput, global deployment, and cost efficiency
  • low latency, high throughput, and high availability
  • planet-scale reliability
  • large-scale stability engineering
  • billion-level request traffic
  • production-grade ecosystem
  • always-on business workloads
  • strict SLO/SLA
  • proven ability to improve stability, performance, and cost