Principal High-Performance LLM Training Engineer

NVIDIA · Semiconductors · Santa Clara, CA

NVIDIA is seeking a Principal Engineer to lead performance analysis and optimization of large-scale AI training and post-training workloads on NVIDIA's hardware and software stack. The role involves deep technical analysis across compute, memory, communication, and frameworks to improve efficiency and influence future roadmaps.

What you'd actually do

  1. Lead end-to-end performance analysis and optimization of innovative LLM pre-training and post-training workloads on the latest NVIDIA hardware and software platforms.
  2. Drive workloads closer to speed-of-light performance by identifying and removing bottlenecks across compute, memory, communication, scheduling, parallelism strategy, kernel efficiency, framework overhead, and system-level scaling.
  3. Develop production-quality software, tools, models, benchmarks, and analysis infrastructure that improve training performance, efficiency, and developer velocity across NVIDIA’s AI software stack.
  4. Build and refine performance models, workload characterizations, and simulation methodologies to guide future GPU, networking, system, and software architecture decisions.
  5. Serve as a technical authority for AI training performance, partnering closely with teams across GPU architecture, systems, CUDA libraries, compilers, networking, frameworks, product management, and applied AI.

Skills

Required

  • MS or PhD (or equivalent experience) in Computer Science, Electrical Engineering, Computer Engineering, or a related field, with 12+ years of relevant work or research experience.
  • Demonstrated principal-level technical impact in one or more of the following areas: large-scale AI training systems, GPU performance optimization, distributed systems, high-performance computing, ML frameworks, compilers/runtimes, or hardware/software co-design.
  • Deep hands-on experience analyzing and optimizing performance of large-scale deep learning workloads, especially transformer-based models, LLM pre-training, reinforcement learning, fine-tuning, or other post-training workloads.
  • Strong understanding of GPU and AI accelerator architecture from individual accelerators to datacenter-scale systems.
  • Experience with distributed training techniques such as data parallelism, tensor parallelism, pipeline parallelism, expert parallelism, sequence parallelism, activation checkpointing, mixed precision training, and communication/computation overlap.
  • A strong track record of using profiling, tracing, benchmarking, and performance modeling tools to diagnose complex bottlenecks and drive measurable improvements.
  • Excellent communication and technical leadership skills, with the ability to influence architecture and software decisions across multiple teams without relying on direct authority.

Nice to have

  • Experience with PyTorch, JAX, NeMo, and NeMo RL
  • CUDA libraries
  • networking
  • memory systems

What the JD emphasized

  • principal-level technical impact
  • Deep hands-on experience analyzing and optimizing performance of large-scale deep learning workloads
  • strong track record of using profiling, tracing, benchmarking, and performance modeling tools
  • Excellent communication and technical leadership skills, with the ability to influence architecture and software decisions across multiple teams without relying on direct authority

Other signals

  • performance optimization
  • large-scale training
  • GPU architecture
  • distributed systems
  • deep learning frameworks