Senior Research Engineer / Scientist - Storage for LLM

ByteDance · Big Tech · Seattle, WA · Infrastructure

A Senior Research Engineer/Scientist role focused on designing and implementing a high-performance KV cache layer for LLM inference to improve latency, throughput, and cost-efficiency. The work involves optimizing caching for transformer-based models, collaborating with inference teams, and potentially extending open-source KV stores or building custom GPU-aware caching layers.

What you'd actually do

  1. Design and implement a distributed KV cache system to store and retrieve intermediate states (e.g., attention keys/values) for transformer-based LLMs across GPUs or nodes (see the sharded-lookup sketch after this list).
  2. Optimize low-latency access and eviction policies for caching long-context LLM inputs, token streams, and reused embeddings (an eviction sketch follows the list).
  3. Collaborate with inference and serving teams to integrate the cache with token streaming pipelines, batched decoding, and model parallelism.
  4. Develop cache consistency and synchronization protocols for multi-tenant, multi-request environments (see the pinning sketch below).
  5. Implement memory-aware sharding, eviction, and replication strategies across GPUs or distributed memory backends (the eviction sketch below covers the windowed-LRU and TTL policies).
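
To make item 1 concrete, below is a minimal Python sketch of a sharded KV cache that routes (request_id, layer) keys to partitions by stable hashing. Everything here is an illustrative assumption rather than a real interface: KVCacheClient, KVShard, NUM_SHARDS, and the put/get API are made up, and NumPy arrays stand in for GPU-resident tensors.

```python
# Minimal sketch of a sharded KV cache for per-layer attention states.
# All names (KVCacheClient, KVShard, NUM_SHARDS) are illustrative, not a real API.
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

import numpy as np

NUM_SHARDS = 4  # assumed: one shard per GPU/node


@dataclass
class KVShard:
    """In-memory stand-in for one GPU/node's cache partition."""
    store: Dict[Tuple[str, int], Tuple[np.ndarray, np.ndarray]] = field(default_factory=dict)


class KVCacheClient:
    """Routes (request_id, layer) keys to shards by stable hashing."""

    def __init__(self, num_shards: int = NUM_SHARDS) -> None:
        self.shards = [KVShard() for _ in range(num_shards)]

    def _shard_for(self, request_id: str, layer: int) -> KVShard:
        # Stable hash so the same request/layer always lands on the same shard.
        digest = hashlib.sha1(f"{request_id}:{layer}".encode()).digest()
        return self.shards[int.from_bytes(digest[:4], "big") % len(self.shards)]

    def put(self, request_id: str, layer: int, keys: np.ndarray, values: np.ndarray) -> None:
        self._shard_for(request_id, layer).store[(request_id, layer)] = (keys, values)

    def get(self, request_id: str, layer: int) -> Optional[Tuple[np.ndarray, np.ndarray]]:
        return self._shard_for(request_id, layer).store.get((request_id, layer))


# Usage: cache one layer's K/V for a request, then read it back.
cache = KVCacheClient()
k = np.zeros((128, 64), dtype=np.float16)  # (seq_len, head_dim)
v = np.zeros((128, 64), dtype=np.float16)
cache.put("req-42", 0, k, v)
assert cache.get("req-42", 0) is not None
```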
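
For the eviction policies named in items 2 and 5, here is a minimal sketch of a windowed LRU bounded by entry count, plus a TTL so stale long-context entries expire even when the cache is not full. WindowedLRUCache and its capacity and TTL defaults are assumptions chosen for illustration.

```python
# Illustrative windowed-LRU + TTL eviction; names and thresholds are assumptions.
import time
from collections import OrderedDict
from typing import Any, Hashable, Optional


class WindowedLRUCache:
    """LRU bounded by entry count, with a TTL so stale long-context
    entries are dropped even when the cache is not full."""

    def __init__(self, max_entries: int = 1024, ttl_seconds: float = 300.0) -> None:
        self.max_entries = max_entries
        self.ttl = ttl_seconds
        self._entries: "OrderedDict[Hashable, tuple[float, Any]]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        item = self._entries.get(key)
        if item is None:
            return None
        inserted_at, value = item
        if time.monotonic() - inserted_at > self.ttl:
            del self._entries[key]          # expired: evict lazily on access
            return None
        self._entries.move_to_end(key)      # refresh LRU position
        return value

    def put(self, key: Hashable, value: Any) -> None:
        self._entries[key] = (time.monotonic(), value)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least-recently used
```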
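
For item 4, one simple consistency primitive is reference-counted pinning: an entry being read by one tenant's in-flight decode step cannot be evicted on behalf of another request. PinnedKVStore and its acquire/release/try_evict names are hypothetical; a production design would also need cross-node coordination, which this single-process sketch omits.

```python
# Sketch of pin/unpin reference counting so one tenant's eviction cannot
# free K/V entries another request is still reading. Names are illustrative.
import threading
from typing import Any, Dict, Hashable, Optional


class PinnedKVStore:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._values: Dict[Hashable, Any] = {}
        self._pins: Dict[Hashable, int] = {}

    def put(self, key: Hashable, value: Any) -> None:
        with self._lock:
            self._values[key] = value
            self._pins.setdefault(key, 0)

    def acquire(self, key: Hashable) -> Optional[Any]:
        """Pin an entry for the duration of a decode step."""
        with self._lock:
            if key not in self._values:
                return None
            self._pins[key] += 1
            return self._values[key]

    def release(self, key: Hashable) -> None:
        with self._lock:
            self._pins[key] -= 1

    def try_evict(self, key: Hashable) -> bool:
        """Evict only if no in-flight request holds a pin."""
        with self._lock:
            if self._pins.get(key, 0) > 0:
                return False
            self._values.pop(key, None)
            self._pins.pop(key, None)
            return True
```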

Skills

Required

  • PhD in Computer Science, Applied Mathematics, Electrical Engineering, or a related technical field
  • Strong understanding of transformer-based model internals and how KV caching affects autoregressive decoding (a toy decoding-step sketch follows this list)
  • Experience with distributed systems, memory management, and low-latency serving (RPC, gRPC, CUDA-aware networking)
  • Familiarity with high-performance compute environments (NVIDIA GPUs, TensorRT, Triton Inference Server)
  • Proficiency in languages like C++, Rust, Go, or CUDA for systems-level development
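
To illustrate the decoding point above: in autoregressive generation, each new token attends over every previous token, so caching the key/value projections of the prefix avoids recomputing them at every step. The toy single-head sketch below (NumPy, with made-up shapes and random weights) appends one row to the cache per token; without k_cache and v_cache, step t would redo all t projections, making the projection work quadratic in sequence length.

```python
# Toy single-head attention step showing why a KV cache helps autoregressive
# decoding: each new token attends over cached K/V instead of recomputing them.
# Shapes and projection weights are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d = 64                                    # head dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
k_cache, v_cache = [], []                 # grows by one row per decoded token


def decode_step(x_t: np.ndarray) -> np.ndarray:
    """x_t: (d,) hidden state of the newest token only."""
    q = x_t @ Wq
    k_cache.append(x_t @ Wk)              # cache instead of recomputing history
    v_cache.append(x_t @ Wv)
    K = np.stack(k_cache)                 # (t, d): all tokens so far
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)           # attend over the full prefix
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                    # (d,) attention output


for _ in range(8):                        # one projection per step, not t of them
    out = decode_step(rng.standard_normal(d))
```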

Nice to have

  • Prior experience building inference-serving systems for LLMs (e.g., vLLM, SGLang, FasterTransformer, DeepSpeed, Hugging Face Text Generation Inference)
  • Experience with memory hierarchy optimization (HBM, NUMA, NVLink) and GPU-to-GPU communication (NCCL, GPUDirect RDMA/GDR, GPUDirect Storage/GDS, InfiniBand)
  • Exposure to cache-aware scheduling, batching, and prefetching strategies in model serving

What the JD emphasized

  • high-performance KV cache layer for large language model (LLM) inference
  • improving latency, throughput, and cost-efficiency in transformer-based model serving
  • optimizing the reuse of attention key-value states and prompt embeddings
  • low-latency access and eviction policies for caching long-context LLM inputs, token streams, and reused embeddings
  • integrate the cache with token streaming pipelines, batched decoding, and model parallelism
  • GPU-aware caching layers

Other signals

  • LLM inference optimization
  • distributed KV cache system
  • low-latency serving