Senior Product Manager, AI Inference - Dynamo

NVIDIA · Semiconductors · Santa Clara, CA (+4 locations) · Remote

Product Manager for NVIDIA Dynamo, a distributed inference framework for LLMs and Generative AI. The role centers on defining the roadmap for high-scale serving, driving hardware-software co-design, and developing agentic inference capabilities, while collaborating with engineering, open-source communities, and customers to integrate model evaluation into end-to-end workflows.

What you'd actually do

  1. Core Dynamo Architecture: Drive the product strategy for Dynamo’s modular components, including the KV-aware Router, KV Block Manager (KVBM), and communication planes.
  2. Inference Orchestration: Define requirements for routing logic that minimizes redundant prefill and optimizes Time to First Token (TTFT) across large GPU clusters (see the routing sketch after this list).
  3. Memory & KV Cache Management: Define the strategy for multi-tier KV cache offloading that enables long-context windows and high-concurrency serving without compromising user experience (sketched together with cache pinning after this list).
  4. Hardware-Software Co-Design: Collaborate with engineering to ensure Dynamo extracts maximum performance from NVIDIA hardware.
  5. Agentic Inference: Develop Agent-first capabilities (e.g., priority, output length, cache pinning) to support sophisticated, multi-turn reasoning.
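
The posting doesn't describe the Router's internals, but the idea in item 2 is concrete enough to sketch. Below is a minimal, purely illustrative Python model of prefix-cache-aware routing; the block size, worker fields, and scoring heuristic are assumptions, not Dynamo's actual design.

```python
import hashlib
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block; hypothetical, real systems vary


def block_hashes(token_ids: list[int]) -> list[str]:
    """Hash each prefix-aligned block so identical prefixes map to identical keys."""
    hashes, prefix = [], b""
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK_SIZE, BLOCK_SIZE):
        prefix += bytes(str(token_ids[i:i + BLOCK_SIZE]), "utf-8")
        hashes.append(hashlib.sha256(prefix).hexdigest())
    return hashes


@dataclass
class Worker:
    name: str
    cached: set[str] = field(default_factory=set)  # block hashes resident here
    active_requests: int = 0


def route(workers: list[Worker], token_ids: list[int]) -> Worker:
    """Pick the worker holding the longest cached prefix, breaking ties by load.

    Reusing cached prefix blocks skips redundant prefill compute, which is
    exactly what drives the TTFT win the JD describes.
    """
    hashes = block_hashes(token_ids)

    def cached_prefix_len(w: Worker) -> int:
        n = 0
        for h in hashes:
            if h not in w.cached:
                break
            n += 1
        return n

    return max(workers, key=lambda w: (cached_prefix_len(w), -w.active_requests))
```

Workers that recently served a conversation naturally accumulate its blocks, so multi-turn traffic keeps landing where its prefix already lives.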
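
Items 3 and 5 can be pictured the same way: a toy multi-tier block manager where cold KV blocks spill from GPU memory to cheaper tiers instead of being dropped, and agent workloads can pin blocks they will need again. The tier names, capacities, and eviction policy here are placeholder choices, not how KVBM actually works.

```python
from collections import OrderedDict

# Tier sizes in blocks; illustrative numbers only.
TIERS = [("gpu_hbm", 4), ("cpu_dram", 8), ("local_ssd", 64)]


class TieredKVCache:
    """Toy multi-tier KV block manager: hot blocks stay on the GPU, cold
    blocks spill to cheaper tiers, so long-context and high-concurrency
    workloads keep their prefixes reusable."""

    def __init__(self) -> None:
        self.order = [name for name, _ in TIERS]
        self.capacity = dict(TIERS)
        self.tiers = {name: OrderedDict() for name in self.order}
        self.pinned: set[str] = set()  # blocks eviction must never touch

    def pin(self, block_id: str) -> None:
        """Agent-first hint: keep a block resident across turns, e.g. a
        long-lived system prompt in a multi-turn reasoning loop."""
        self.pinned.add(block_id)

    def put(self, block_id: str, payload: bytes, tier: str = "gpu_hbm") -> None:
        blocks = self.tiers[tier]
        nxt = self.order.index(tier) + 1
        while len(blocks) >= self.capacity[tier]:
            victim = next((b for b in blocks if b not in self.pinned), None)
            if victim is None:
                # Every resident block is pinned: demote the incoming block instead.
                if nxt < len(self.order):
                    self.put(block_id, payload, self.order[nxt])
                return
            evicted = blocks.pop(victim)
            if nxt < len(self.order):
                self.put(victim, evicted, self.order[nxt])  # offload, don't drop
            # On the last tier the victim is simply discarded.
        blocks[block_id] = payload

    def get(self, block_id: str) -> bytes | None:
        """Fetch a block, promoting it back to the GPU tier on a hit."""
        for name in self.order:
            if block_id in self.tiers[name]:
                payload = self.tiers[name].pop(block_id)
                self.put(block_id, payload)  # re-warm on gpu_hbm
                return payload
        return None
```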

Skills

Required

  • Product management
  • AI inference
  • Distributed systems
  • GPU-accelerated computing
  • LLM inference lifecycle
  • KV cache mechanics
  • Distributed serving techniques
  • Ability to translate low-level technical capabilities into high-level business value
  • Teamwork and influencing skills
  • Empathy and deep care for your customers
  • Pragmatic and data-driven project management skills

Nice to have

  • Agentic frameworks (LangChain, NeMo Agents)
  • Multi-turn, stateful AI applications
  • LLM and Generative AI trends
  • Responsible AI
  • MLOps
  • Technical background and hands-on experience building AI (and LLM) solutions as an engineer
  • Intuition for ML model and systems evaluation
  • Ability to read relevant research papers

What the JD emphasized

  • Proven experience in AI inference, distributed systems, and GPU-accelerated computing.
  • Deep understanding of the LLM inference lifecycle (Prefill vs. Decode), KV cache mechanics, and distributed serving techniques such as Disaggregated Serving (see the sketch after this list).
  • Proven track record working with Agentic frameworks (LangChain, NeMo Agents) or building multi-turn, stateful AI applications.
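
The Prefill-vs-Decode split behind Disaggregated Serving is easy to miniaturize. In the sketch below, two workers and an in-process queue stand in for separate GPU pools and a KV-cache transfer plane; none of the names come from Dynamo's API.

```python
import queue
import threading

# Toy disaggregated serving: prefill and decode run as separate workers,
# mirroring the compute-bound prefill / memory-bound decode split. In a real
# deployment these would be distinct GPU pools shipping KV cache over a fast
# interconnect; here a queue stands in for that transfer.

handoff: queue.Queue = queue.Queue()


def prefill_worker(prompts: list[str]) -> None:
    for prompt in prompts:
        # Prefill: one compute-heavy pass over the whole prompt builds the KV
        # cache and emits the first token -- the part TTFT measures.
        kv_cache = [f"kv({tok})" for tok in prompt.split()]
        handoff.put((prompt, kv_cache, "<tok0>"))
    handoff.put(None)  # sentinel: no more requests


def decode_worker() -> None:
    while (item := handoff.get()) is not None:
        prompt, kv_cache, token = item
        # Decode: many cheap autoregressive steps, each appending one KV entry.
        for step in range(1, 4):  # tiny fixed budget for the demo
            kv_cache.append(f"kv(step{step})")
            token = f"<tok{step}>"
        print(f"{prompt!r} finished at {token} with {len(kv_cache)} KV entries")


threading.Thread(target=prefill_worker, args=(["hello world", "tell me a story"],)).start()
decode_worker()
```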

Other signals

  • Define the roadmap for high-scale LLM and Generative AI serving
  • Bridging the gap between cutting-edge hardware (Vera Rubin, GPUs, and NVLink) and software optimizations
  • Incorporate model evaluation into end-to-end LLM workflows (see the sketch at the end of this section)
  • Develop Agent-first capabilities
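
As a rough illustration of the model-evaluation signal above, here is a minimal Python eval gate; the golden set, substring-match metric, and threshold are invented for the example.

```python
# Toy evaluation gate: score a model callable on a small golden set and only
# promote the serving config if it clears a pass-rate bar.

GOLDEN_SET = [
    {"prompt": "2 + 2 =", "expected": "4"},
    {"prompt": "Capital of France?", "expected": "Paris"},
]


def eval_gate(generate, threshold: float = 0.9) -> bool:
    hits = sum(
        ex["expected"].lower() in generate(ex["prompt"]).lower()
        for ex in GOLDEN_SET
    )
    pass_rate = hits / len(GOLDEN_SET)
    print(f"pass rate: {pass_rate:.2f}")
    return pass_rate >= threshold


# Usage with any callable, e.g. a stub standing in for a deployed endpoint:
assert eval_gate(lambda p: "4" if "2 + 2" in p else "Paris")
```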