Senior Software Engineer, DGX Cloud AI Infrastructure

NVIDIA · Semiconductors · Santa Clara, CA +4 · Remote

This Senior Software Engineer role leads the bring-up, triage, benchmarking, analysis, and optimization of distributed training and inference workloads across NVIDIA GPU platforms at scale. It involves setting technical direction for communication libraries, model frameworks, and inference/training stacks; leading performance and reliability investigations; defining benchmarking and qualification processes; and building resilience capabilities for large clusters.

What you'd actually do

  1. Lead bring-up, validation, and debugging of large-scale AI clusters, infrastructure, and end-to-end workloads, setting the standard for how the team operates.
  2. Bring up, tune, and benchmark AI pre-training, post-training, and inference workloads using PyTorch, NeMo / Megatron, TensorRT-LLM, and adjacent NVIDIA AI software stacks.
  3. Profile and optimize end-to-end workload performance across compute, memory, networking, and communication layers using tools such as Nsight Systems, NCCL tests, and custom microbenchmarks.
  4. Analyze scaling efficiency for distributed LLM workloads using data, tensor, pipeline, and expert parallelism across modern GPU clusters, and translate findings into concrete tuning guidance.
  5. Own root-cause analysis of complex failures, such as hangs, performance regressions, and topology sensitivity, in large distributed environments.

Skills

Required

  • software infrastructure for large-scale AI or HPC systems
  • debugging and triaging AI applications
  • NCCL, CUDA-aware distributed execution
  • multi-GPU and multi-node workloads
  • architecting, debugging, and scaling large-scale distributed systems
  • Python
  • C/C++
  • operating workloads in scheduled, containerized cluster environments
  • analytical, debugging, and communication skills

Nice to have

  • Deep familiarity with the RDMA software stack (NCCL, IB verbs, UCX, libfabric)
  • Strong knowledge of GPU cluster fabrics and topology, including NVLink, NVSwitch, PCIe, RoCE, and InfiniBand
  • Experience building acceptance tests, benchmark harnesses, regression gates, or cluster qualification tooling for AI platforms
  • Experience building resilience, fault-detection, or failure-attribution systems for datacenter-scale infrastructure

What the JD emphasized

  • Track record of technical leadership
  • Expertise debugging and triaging AI applications across the full stack
  • Deep hands-on experience with NCCL, CUDA-aware distributed execution, and debugging multi-GPU and multi-node workloads at scale
  • Proven track record of architecting, debugging, and scaling large-scale distributed systems
  • Expert-level Python and C/C++ programming skills
  • Demonstrated experience debugging and optimizing AI workloads at large scale
  • Experience building acceptance tests, benchmark harnesses, regression gates, or cluster qualification tooling for AI platforms
  • Experience building resilience, fault-detection, or failure-attribution systems for datacenter-scale infrastructure

Other signals

  • large-scale AI infrastructure
  • distributed training and inference workloads
  • GPU platforms
  • LLM workloads
  • performance and reliability investigations
  • multi-GPU and multi-node deployments
  • benchmarking and qualification
  • resilience and failure-attribution