Senior Solutions Architect - KV Cache and AI Storage

NVIDIA · Semiconductors · Beijing, China

Senior Solutions Architect focused on building LLM inference platforms using NVIDIA GPUs, KV cache, and tiered memory solutions. The role involves technical exploration with customers, performance analysis, and translating customer needs into product roadmaps.

What you'd actually do

  1. Lead technical exploration with customer architects to understand models, frameworks, SLOs, and KV cache usage patterns.
  2. Build end-to-end KV cache solutions using tiered memory and NVIDIA's modern networking technologies.
  3. Analyze performance profiles, identify bottlenecks, and drive PoCs and benchmarks to validate improvements.
  4. Translate customer difficulties into clear feature requests and roadmap input for NVIDIA products.
  5. Build reference architectures and best-practice guides, and deliver tech talks to support our field teams and customers.

Skills

Required

  • Bachelor's degree or higher in Computer Science or a related field, with a strong systems or storage background.
  • 5+ years of relevant experience, including 2+ years focused on KV stores/caches or storage backends.
  • Hands‑on experience with distributed storage, caching, or large‑scale backend systems.
  • Solid understanding of Transformer / LLM inference and KV cache concepts, plus experience with at least one LLM serving stack (for example, vLLM, TensorRT‑LLM, or SGLang).
  • Strong knowledge of NVMe SSDs, KV SSDs, and modern storage servers, including controller/firmware behavior and I/O characteristics.
  • Practical experience with tiered memory and KV cache optimizations such as offloading (HBM → DRAM → NVMe), eviction/selection strategies, compression/quantization, or attention‑level optimizations.
  • Familiarity with at least one large‑scale storage or caching system (such as Ceph, Redis, Cassandra, RocksDB‑based KV, object storage, or distributed logs).

Nice to have

  • Experience building or running LLM inference platforms or large‑scale online services in cloud or internet companies (multi‑tenant, quota, cost control).
  • Development experience with KV cache subsystems in file systems, user‑space storage engines, or memory/cache managers, or building custom KV stores/cache layers optimized for AI/LLM.
  • Exposure to NVIDIA technologies such as Triton Inference Server, TensorRT‑LLM, NeMo, Dynamo/KVBM, BlueField / DOCA, GPUDirect Storage, Spectrum‑X, or CMX.
  • Public talks, papers, blogs, or open‑source work in LLM inference, KV cache, or storage systems.

What the JD emphasized

  • KV cache usage patterns
  • KV cache solutions
  • KV cache concepts
  • KV cache optimizations
  • KV cache subsystems
  • KV stores/caches

Other signals

  • LLM inference platforms
  • KV cache solutions
  • NVIDIA GPUs
  • tiered memory