Senior Performance Architect, Nemotron

NVIDIA · Semiconductors · Santa Clara, CA +2

NVIDIA is seeking a Senior Performance Architect for Nemotron to focus on deep model-system-hardware co-design. The role involves developing high-fidelity performance models to evaluate architectural choices, predict deployment efficiency, and ensure Pareto-optimal trade-offs for future Nemotron models. This position will guide future software and hardware roadmaps by modeling the end-to-end performance impact of GenAI workflows and collaborating with research, framework, compiler, and hardware teams.

What you'd actually do

  1. Develop high-fidelity analytical performance models to prototype emerging algorithmic techniques and hardware optimizations, driving model-hardware co-design for the Nemotron family of models
  2. Prioritize features to guide future software and hardware roadmaps based on detailed performance modeling and analysis
  3. Model the end-to-end performance impact of emerging GenAI workflows, such as speculative decoding, agentic pipelines, inference-time compute scaling, and RL, to understand future datacenter needs

Skills

Required

  • Master's degree (or equivalent experience) in Computer Science, Electrical Engineering or related fields
  • Strong background in computer architecture, roofline modeling, queuing theory and statistical performance analysis techniques
  • Solid understanding of ML fundamentals, model parallelism and inference serving techniques
  • Proficiency in Python (and optionally C++) for simulator design and data analysis
  • 3+ years of hands-on experience in system evaluation of AI/ML workloads or performance analysis, modeling and optimizations for AI
  • Comfortable defining metrics, designing experiments and visualizing large performance datasets to identify resource bottlenecks
  • Experience with deep learning frameworks such as PyTorch, TRT-LLM, vLLM, and SGLang

Nice to have

  • Proven track record of working in multi-functional teams, spanning algorithms, software and hardware architecture
  • Ability to distill complex analyses into clear recommendations for both technical and non-technical collaborators
  • Experience with GPU computing (CUDA)

What the JD emphasized

  • performance modeling
  • model-hardware co-design
  • inference serving techniques

Other signals

  • performance modeling
  • model-system-hardware co-design
  • inference serving
  • algorithmic techniques
  • hardware optimizations