Research Engineer / Scientist - Storage for LLM

ByteDance · Big Tech · San Jose, CA · Infrastructure

Research Engineer/Scientist focused on designing and implementing a high-performance KV cache layer for LLM inference to improve latency, throughput, and cost-efficiency in transformer-based model serving.

What you'd actually do

  1. Design and implement a distributed KV cache system to store and retrieve intermediate states (e.g., attention keys/values) for transformer-based LLMs across GPUs or nodes.
  2. Optimize low-latency access and eviction policies for caching long-context LLM inputs, token streams, and reused embeddings.
  3. Collaborate with inference and serving teams to integrate the cache with token streaming pipelines, batched decoding, and model parallelism.
  4. Develop cache consistency and synchronization protocols for multi-tenant, multi-request environments.
  5. Implement memory-aware sharding, eviction (e.g., windowed LRU, TTL), and replication strategies across GPUs or distributed memory backends.
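The eviction policies named in item 5 can be sketched at toy scale. A minimal illustration, assuming hypothetical names (`capacity`, `ttl_s`); a production cache would manage GPU/device memory and concurrent requests rather than Python objects:

```python
import time
from collections import OrderedDict

class WindowedLRUCache:
    """Toy sketch of windowed-LRU eviction plus TTL expiry for a KV cache.

    Illustrative only: a real system would track GPU memory blocks,
    refcounts, and cross-node replicas instead of host-side values.
    """

    def __init__(self, capacity: int, ttl_s: float):
        self.capacity = capacity  # window size: max entries kept
        self.ttl_s = ttl_s        # time-to-live per entry, in seconds
        self._store = OrderedDict()  # key -> (value, insert_time)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, ts = entry
        if time.monotonic() - ts > self.ttl_s:  # TTL expiry
            del self._store[key]
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = (value, time.monotonic())
        while len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used
```

With `capacity=2`, inserting three entries evicts the least recently used one, while a stale entry is dropped on read once its TTL elapses.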

Skills

Required

  • PhD in Computer Science, Applied Mathematics, Electrical Engineering, or a related technical field
  • Strong understanding of transformer-based model internals and how KV caching affects autoregressive decoding
  • Experience with distributed systems, memory management, and low-latency serving (RPC, gRPC, CUDA-aware networking)
  • Familiarity with high-performance compute environments (NVIDIA GPUs, TensorRT, Triton Inference Server)
  • Proficiency in languages like C++, Rust, Go, or CUDA for systems-level development
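The "KV caching" understanding asked for above comes down to one idea: each autoregressive decode step appends the new token's key/value to a cache and attends over it, instead of recomputing keys/values for the entire prefix. A toy single-head, pure-Python sketch (shapes and names are illustrative assumptions, not any specific framework's API):

```python
import math

def attention_step(q, k_cache, v_cache, k_new, v_new):
    """One decode step with a KV cache.

    q, k_new, v_new: vectors (lists of floats) for the current token.
    k_cache, v_cache: lists of past key/value vectors, mutated in place
    by appending the new token's entries.
    Returns the attention output vector for this step.
    """
    k_cache.append(k_new)  # reuse all prior keys/values; add one entry
    v_cache.append(v_new)
    d = len(q)
    # Scaled dot-product scores against every cached key.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
              for k in k_cache]
    # Numerically stable softmax over the scores.
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    # Weighted sum of cached values.
    return [sum(w * v[i] for w, v in zip(weights, v_cache))
            for i in range(d)]
```

Each step does O(t) work against the cache rather than O(t^2) recomputation over the prefix, which is exactly why cache placement, eviction, and transfer cost dominate serving economics.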

Nice to have

  • Prior experience building inference-serving systems for LLMs (e.g., vLLM, SGLang, FasterTransformer, DeepSpeed, Hugging Face Text Generation Inference)
  • Experience with memory hierarchy optimization (HBM, NUMA, NVLink) and GPU-to-GPU communication (NCCL, GDR, GDS, InfiniBand)
  • Exposure to cache-aware scheduling, batching, and prefetching strategies in model serving

What the JD emphasized

  • deep expertise in large-scale distributed storage and caching infrastructure
  • high-performance KV cache layer for large language model (LLM) inference
  • improving latency, throughput, and cost-efficiency in transformer-based model serving
  • optimizing the reuse of attention key-value states and prompt embeddings
  • systems researcher or engineer
  • low-latency access and eviction policies for caching long-context LLM inputs, token streams, and reused embeddings
  • cache consistency and synchronization protocols for multi-tenant, multi-request environments
  • memory-aware sharding, eviction (e.g., windowed LRU, TTL), and replication strategies across GPUs or distributed memory backends
  • Evaluate and, where needed, extend open-source KV stores or build custom GPU-aware caching layers (e.g., CUDA, Triton, shared memory, RDMA)
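The emphasized "reuse of attention key-value states and prompt embeddings" is usually served by a longest-prefix lookup: requests sharing a prompt prefix reuse the cached KV blocks for that prefix. A hypothetical sketch (class and method names are assumptions; real systems such as paged-attention caches track blocks, refcounts, and GPU memory):

```python
class PrefixKVIndex:
    """Toy longest-prefix index for reusing cached prompt KV states.

    Maps a tuple of token IDs to an opaque handle standing in for the
    cached KV blocks of that prefix. Illustrative only.
    """

    def __init__(self):
        self._entries = {}  # tuple(token_ids) -> handle

    def insert(self, token_ids, handle):
        self._entries[tuple(token_ids)] = handle

    def longest_prefix(self, token_ids):
        # Scan from the full sequence down to length 1, returning the
        # longest cached prefix and its handle; ((), None) on a miss.
        for n in range(len(token_ids), 0, -1):
            key = tuple(token_ids[:n])
            if key in self._entries:
                return key, self._entries[key]
        return (), None
```

A request whose prompt extends a cached prefix then decodes only the suffix, which is the latency and cost win the JD is after.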

Other signals

  • LLM inference KV caches
  • transformer-based model serving
  • distributed storage and caching infrastructure