Senior Deep Learning Architect, LLM Inference

NVIDIA · Semiconductors · Santa Clara, CA

Senior Deep Learning Architect focused on LLM inference performance optimization, benchmarking, and contributing to deep learning software projects like PyTorch, TRT-LLM, vLLM, and SGLang. Requires strong knowledge of deep learning inference serving, PyTorch, profiling, and GPU microarchitecture.

What you'd actually do

  1. Characterize the workloads of the latest LLMs and inference servers such as vLLM, SGLang, and TRT-LLM to ensure NVIDIA maintains its leadership position.
  2. Join forces with the performance marketing team to build engaging content, including blog posts and updates to InferenceX, that highlights NVIDIA's outstanding inference achievements.
  3. Collaborate with engineers from AI startup companies to establish standard benchmarking methodologies.
  4. Develop and maintain a continuously updated website of inference performance results.
  5. Invent end-to-end (E2E) profiling and analysis tools and use them to keep up with the rapid pace of Generative AI.

Skills

Required

  • Master's or PhD degree in Computer Science, Computer Engineering, or a related field, or equivalent experience.
  • 6+ years of relevant software development experience.
  • Detailed knowledge of deep learning inference serving, PyTorch programming, profiling, and compiler optimizations.
  • Experience developing client-server LLM applications with the OpenAI API or MCP and identifying performance bottlenecks.
  • Solid understanding of CPU and GPU microarchitecture and performance characteristics.
  • Experience with complex software projects like frameworks, compilers, or operating systems.
  • Demonstrated proficiency with the latest AI coding agents like Claude Code, Codex, and Cursor.

Nice to have

  • Experience with databases and visualization tools.

What the JD emphasized

  • LLM Inference
  • inference server performance optimization
  • LLMs
  • inference achievements
  • inference performance data results
  • inference serving

Other signals

  • LLM inference performance optimization
  • benchmarking
  • GPU hardware and software performance