Senior Software Engineer, AI Networking

NVIDIA · Semiconductors · Santa Clara, CA +1

Senior Software Engineer role building and productizing ML tools that optimize AI workloads (LLM training/inference) across GPU/CPU clusters, with a focus on networking and system resource utilization. Involves distributed deep learning, ML-based optimization techniques, and performance analysis.

What you'd actually do

  1. Design and implement resource allocation and combinatorial optimization techniques (e.g., reinforcement learning, LLM agents for DSE, Bayesian optimization, and other multi-objective optimization techniques) to optimize LLMs at datacenter scale.
  2. Research, develop, and deploy AI/ML techniques to optimize large-scale Deep Learning (LLM) training and inference on NVIDIA supercomputers and distributed systems. This includes a focus on high-performance networking and NVIDIA communication libraries.
  3. Build and productionize ML-based tools for performance prediction and optimization, with a strong emphasis on networking aspects.
  4. Develop and deploy a scalable, reliable data curation pipeline capable of handling complex data types, such as time series and PyTorch model graphs, to effectively support the training of high-performance Machine Learning models.
  5. Collaborate across hardware and software teams to deliver valuable performance analysis insights.

Skills

Required

  • PhD or Master's degree in Computer Science, Software Engineering, or equivalent experience
  • 4+ years of experience applying machine learning techniques to computer architecture and system optimization problems
  • Hands-on experience developing and deploying various learning algorithms (e.g., reinforcement learning, offline RL, supervised learning)
  • Proficiency in building and using ML models with leading frameworks such as PyTorch, TensorFlow, or JAX
  • Proven ability to apply GNN- or transformer-based optimization to PyTorch model graphs and Kineto execution traces
  • Expertise combining knowledge of NVIDIA GPUs, the CUDA library, and deep learning frameworks (TensorFlow/PyTorch) with networking concepts, including collective communication libraries (like NCCL) and protocols (such as RoCE and RDMA)
  • Strong programming capabilities in Python, Bash, and C++

Nice to have

  • In-depth knowledge and experience with machine learning/reinforcement learning and frameworks
  • Comprehensive understanding of computer architecture, system architecture and networking
  • Extensive experience in applying machine learning techniques such as GNNs or related graph-based models
  • Knowledge of the PyTorch, CUDA, and NCCL libraries
  • Proven software engineering/development skills
  • Strong passion for collective communication and networking

What the JD emphasized

  • 4+ years of experience applying machine learning techniques to computer architecture and system optimization problems
  • Bringing ML to bear at the intersection of at least two of the following areas: HPC, networking, and AI applications
  • Hands-on experience developing and deploying various learning algorithms (e.g., reinforcement learning, offline RL, supervised learning) to tackle optimization challenges within computer architecture, system design, or networking domains
  • Proven ability to apply GNN- or transformer-based optimization to PyTorch model graphs and Kineto execution traces
  • Expertise combining knowledge of NVIDIA GPUs, the CUDA library, and deep learning frameworks (TensorFlow/PyTorch) with networking concepts, including collective communication libraries (like NCCL) and protocols (such as RoCE and RDMA)

Other signals

  • optimizing AI workloads
  • LLM training and inference stacks
  • ML-based tools for performance prediction and optimization
  • large-scale AI systems