Principal Developer, AI Networking

NVIDIA · Semiconductors · Santa Clara, CA +3 · Remote

This role focuses on optimizing AI workloads, specifically LLM training and inference, on large-scale GPU and CPU clusters. The core responsibility is to profile, analyze, and optimize the performance of distributed systems with a strong emphasis on high-performance networking and communication libraries. The engineer will develop tools for performance analysis and collaborate across hardware and software teams to identify and resolve bottlenecks.

What you'd actually do

  1. Characterizing AI workloads and deep learning models aimed at large-scale LLM training and inference on NVIDIA supercomputers. The role centers on distributed systems with a focus on high-performance networking and NVIDIA communication libraries.
  2. Benchmarking, profiling, and analyzing performance to find bottlenecks and identify optimization opportunities, with a strong emphasis on networking.
  3. Developing a PyTorch trace-based toolset for profiling, analysis, and replay to aid in benchmarking, debugging, and co-designing network systems for LLM workloads.
  4. Collaborating with multiple teams from hardware to software to provide performance analysis insights.
  5. Defining performance test plans, setting performance expectations for new technologies and solutions, and working to achieve performance targets.

Skills

Required

  • B.Sc in Computer Science or Software Engineering or equivalent experience
  • 15+ years of experience with high-performance networking (RDMA, MPI, NCCL, SHARP)
  • Demonstrated ability in performance evaluation techniques and approaches
  • Experience with NVIDIA GPUs and the CUDA library
  • Knowledge of deep learning frameworks like TensorFlow or PyTorch
  • Expertise in networking collective communication libraries such as NCCL and protocols like RoCE and RDMA
  • Fast self-learner with strong analytical and problem-solving skills
  • Proficiency in programming languages: Python, Bash, and C++
  • Experience with a container-based development environment
  • Great teammate who communicates clearly and works well with others

Nice to have

  • Extensive understanding and hands-on experience with AI workloads and benchmarking for distributed LLM training
  • Knowledge of the PyTorch, CUDA, and NCCL libraries
  • Comprehensive system knowledge and understanding (Intel / AMD / ARM CPUs, NVIDIA GPUs, HCA, Memory, PCI)
  • Strong capabilities in performance evaluation methods using contemporary tools

What the JD emphasized

  • 15+ years of experience with high-performance networking (RDMA, MPI, NCCL, SHARP)
  • Expertise in networking collective communication libraries such as NCCL and protocols like RoCE and RDMA
  • Extensive understanding and hands-on experience with AI workloads and benchmarking for distributed LLM training

Other signals

  • LLM training and inference
  • distributed systems
  • high-performance networking
  • GPU and CPU clusters
  • performance analysis tools