Deep Learning Architect, LLM Inference - New College Grad 2026

NVIDIA · Semiconductors · Santa Clara, CA

The role focuses on optimizing LLM inference server performance, workload characterization, and benchmarking for NVIDIA's GPUs. It involves collaborating with AI startups, developing performance tools, contributing to deep learning software projects, and guiding inference serving direction.

What you'd actually do

  1. Characterize workloads for the latest LLMs and inference servers such as vLLM, SGLang, and TRT-LLM to ensure NVIDIA maintains its leadership position.
  2. Join forces with the performance marketing team to build engaging content, including blog posts and updates to InferenceX to highlight NVIDIA's outstanding inference achievements.
  3. Collaborate with engineers from AI startup companies to establish standard benchmarking methodologies.
  4. Develop a constantly evolving website of inference performance results.
  5. Invent E2E profiling and analysis tools that you will use to keep up with the rapid pace of Generative AI.

Skills

Required

  • Master's or PhD degree in Computer Science, Computer Engineering, related fields, or equivalent experience.
  • Relevant software development experience.
  • Detailed knowledge of deep learning inference serving, PyTorch programming, profiling, and compiler optimizations.
  • Experience developing client-server LLM applications with the OpenAI API or MCP and identifying performance bottlenecks.
  • Solid understanding of CPU and GPU microarchitecture and performance characteristics.
  • Experience with complex software projects like frameworks, compilers, or operating systems.
  • Demonstrated proficiency with the latest AI coding agents like Claude Code, Codex, and Cursor.
  • Excellent written and verbal communication skills and the ability to work independently and collaboratively in a fast-paced environment.

Nice to have

  • Novel use cases for agentic AI tools in the workplace.
  • Experience with databases and visualization tools.

What the JD emphasized

  • deep learning inference serving
  • OpenAI API
  • AI coding agents

Other signals

  • LLM inference performance optimization
  • benchmarking
  • GPU hardware and software performance