Research Engineer / Scientist - Storage for LLM

ByteDance · Big Tech · Seattle, WA · Infrastructure

A Research Engineer/Scientist role focused on designing and implementing a high-performance KV cache layer for LLM inference to improve latency, throughput, and cost-efficiency. The work involves optimizing storage and retrieval of intermediate states for transformer-based LLMs, collaborating with inference and serving teams, and potentially extending open-source KV stores or building custom GPU-aware caching layers.

What you'd actually do

  1. Design and implement a distributed KV cache system to store and retrieve intermediate states (e.g., attention keys/values) for transformer-based LLMs across GPUs or nodes.
  2. Optimize low-latency access and eviction policies for caching long-context LLM inputs, token streams, and reused embeddings.
  3. Collaborate with inference and serving teams to integrate the cache with token streaming pipelines, batched decoding, and model parallelism.
  4. Develop cache consistency and synchronization protocols for multi-tenant, multi-request environments.
  5. Implement memory-aware sharding, eviction (e.g., windowed LRU, TTL), and replication strategies across GPUs or distributed memory backends (a minimal eviction sketch follows this list).
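
To make items 2 and 5 concrete, here is a minimal single-node sketch of a KV cache with LRU-plus-TTL eviction, keyed by (request_id, layer). The class and method names (`KVCache`, `put`, `get`) and the capacity/TTL knobs are hypothetical illustrations, not ByteDance's actual system; a production version would add GPU memory pools, sharding, and replication.

```python
import time
from collections import OrderedDict

import numpy as np


class KVCache:
    """Hypothetical single-node cache for attention keys/values.

    Entries are keyed by (request_id, layer) and evicted by a
    windowed-LRU policy (capacity bound) plus a TTL (staleness bound).
    """

    def __init__(self, capacity_entries: int, ttl_seconds: float):
        self.capacity = capacity_entries
        self.ttl = ttl_seconds
        # OrderedDict preserves access order for LRU bookkeeping.
        self._store: "OrderedDict[tuple, tuple]" = OrderedDict()

    def put(self, request_id: str, layer: int,
            keys: np.ndarray, values: np.ndarray) -> None:
        self._evict_expired()
        entry = (request_id, layer)
        self._store[entry] = (keys, values, time.monotonic())
        self._store.move_to_end(entry)        # mark as most recently used
        while len(self._store) > self.capacity:
            self._store.popitem(last=False)   # evict least recently used

    def get(self, request_id: str, layer: int):
        self._evict_expired()
        entry = (request_id, layer)
        if entry not in self._store:
            return None                       # cache miss: caller recomputes K/V
        keys, values, _ = self._store[entry]
        self._store[entry] = (keys, values, time.monotonic())
        self._store.move_to_end(entry)
        return keys, values

    def _evict_expired(self) -> None:
        now = time.monotonic()
        expired = [k for k, (_, _, t) in self._store.items() if now - t > self.ttl]
        for k in expired:
            del self._store[k]


if __name__ == "__main__":
    cache = KVCache(capacity_entries=2, ttl_seconds=60.0)
    k = np.zeros((1, 8, 64), dtype=np.float16)   # (tokens, heads, head_dim)
    v = np.zeros((1, 8, 64), dtype=np.float16)
    cache.put("req-1", layer=0, keys=k, values=v)
    print(cache.get("req-1", 0) is not None)     # True: hit
    print(cache.get("req-2", 0) is None)         # True: miss
```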

Skills

Required

  • Distributed systems
  • Memory management
  • Low-latency serving
  • Transformer model internals
  • C++
  • Rust
  • Go
  • CUDA

Nice to have

  • LLM inference serving systems (vLLM, SGLang, FasterTransformer, DeepSpeed, Hugging Face Text Generation Inference)
  • Memory hierarchy optimization (HBM, NUMA, NVLink)
  • GPU-to-GPU communication (NCCL, GPUDirect RDMA (GDR), GPUDirect Storage (GDS), InfiniBand)
  • Cache-aware scheduling
  • Batching
  • Prefetching strategies in model serving

What the JD emphasized

  • PhD in Computer Science, Applied Mathematics, Electrical Engineering, or a related technical field.
  • Strong understanding of transformer-based model internals and how KV caching affects autoregressive decoding (illustrated in the sketch after this list).
  • Experience with distributed systems, memory management, and low-latency serving (RPC, gRPC, CUDA-aware networking).
  • Familiarity with high-performance compute environments (NVIDIA GPUs, TensorRT, Triton Inference Server).
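
A minimal sketch of how KV caching affects autoregressive decoding, using single-head attention in NumPy with illustrative shapes: each decode step projects only the new token's key/value and appends them to the cache, so the prefix's K/V are never recomputed. The weights, dimensions, and the loop that feeds attention output back in as the next token are simplifications for illustration only, not a real model.

```python
import numpy as np

d = 64                      # head dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))


def decode_step(x_new, k_cache, v_cache):
    """One autoregressive step with a KV cache.

    x_new: (1, d) embedding of the newly generated token.
    k_cache, v_cache: (t, d) keys/values for all previous tokens.
    Only the new token's projections are computed; the prefix is reused.
    """
    q = x_new @ Wq                                   # (1, d)
    k_cache = np.concatenate([k_cache, x_new @ Wk])  # append new key
    v_cache = np.concatenate([v_cache, x_new @ Wv])  # append new value
    scores = (q @ k_cache.T) / np.sqrt(d)            # (1, t+1): attend over full prefix
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    out = weights @ v_cache                          # (1, d)
    return out, k_cache, v_cache


# Prefill a 5-token prompt once, then decode 3 tokens reusing the cache.
# (Feeding the attention output back as the next embedding is only a
# stand-in for running the rest of the model and embedding the sampled token.)
prompt = rng.standard_normal((5, d))
k_cache, v_cache = prompt @ Wk, prompt @ Wv
x = rng.standard_normal((1, d))
for _ in range(3):
    x, k_cache, v_cache = decode_step(x, k_cache, v_cache)
print(k_cache.shape)  # (8, 64): the cache grows by one row per decoded token
```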

Other signals

  • LLM inference optimization
  • KV cache layer
  • Low-latency serving
  • Distributed systems