Senior Software Engineer, AI Resiliency

NVIDIA · Semiconductors · Redmond, WA +1

This role leads the development of AI software resiliency for large-scale AI supercomputers (100,000+ GPUs), focusing on features such as fast checkpoint-recovery, error detection and isolation, and straggler/hang detection to drive cluster downtime toward zero. The work is hands-on: writing production C++ and Python, debugging, building fault tolerance, and collaborating with AI researchers and hardware/software teams to integrate resiliency into AI frameworks such as PyTorch and JAX/XLA. Experience with distributed systems, fault tolerance, AI frameworks, and debugging tools is required; experience with model training, CUDA/NCCL/MPI, checkpointing strategies, and large-scale AI clusters/HPC is preferred.
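The summary leads with fast checkpoint-recovery. A minimal, framework-free sketch of the core idea follows: write checkpoints atomically so a crash mid-write never corrupts the last good state, and resume the training loop from that state on restart. This is illustrative only; the names (`save_checkpoint`, `train_step` stand-in, `CKPT`) are invented for the example, and a real system would use a framework's distributed checkpoint API rather than `pickle`.

```python
import os
import pickle
import tempfile

def save_checkpoint(state, path):
    """Write atomically: dump to a temp file, then rename over the target,
    so a crash mid-write never leaves a truncated checkpoint behind."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, path)  # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp)
        raise

def load_checkpoint(path):
    """Return the last saved state, or None for a cold start."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)

# Training loop that resumes from the last checkpoint after a failure.
CKPT = os.path.join(tempfile.gettempdir(), "demo.ckpt")
state = load_checkpoint(CKPT) or {"step": 0, "loss": float("inf")}
for step in range(state["step"], 10):
    state = {"step": step + 1, "loss": 1.0 / (step + 1)}  # stand-in for a real train_step()
    if (step + 1) % 5 == 0:  # checkpoint every 5 steps
        save_checkpoint(state, CKPT)
```

At 100,000+ GPU scale the same atomic-rename discipline applies, but checkpoints are sharded across ranks and written asynchronously so the save does not stall training.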

What you'd actually do

  1. Develop AI Software Resiliency Features: Implement and optimize software features that improve AI system reliability at a massive scale, such as fast checkpoint-recovery, error detection, error isolation, and straggler/hang detection.
  2. Hands-On Coding & Optimization: Contribute to large-scale distributed systems with high-quality, production-level C++ and Python code. Enhance performance for AI workloads running on thousands of GPUs.
  3. Fault Tolerance & Debugging: Work on AI system error handling, implementing techniques to detect silent data corruption (SDC) and other failure scenarios. Assist in developing monitoring tools for proactive failure mitigation.
  4. Collaborate Across Teams: Work closely with senior engineers, AI researchers, and hardware/software teams to integrate resiliency features into AI frameworks like PyTorch and JAX/XLA.
  5. Testing & Automation: Develop and implement tests to ensure robustness, scalability, and efficiency of resiliency mechanisms. Contribute to CI/CD pipelines to automate validation of AI workloads.
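The straggler/hang detection named in items 1 and 3 above often reduces to a heartbeat watchdog: each rank reports liveness, and a monitor flags any rank whose heartbeats stop. A toy single-process sketch, with the class name `HangWatchdog` and the timing constants invented for illustration (a real system would kill and reschedule the stuck rank rather than just record it):

```python
import threading
import time

class HangWatchdog:
    """Fire a callback if no heartbeat arrives within `timeout` seconds.
    Toy version of straggler/hang detection; real deployments would
    restart or fence the stalled rank instead of merely reporting it."""
    def __init__(self, timeout, on_hang):
        self.timeout = timeout
        self.on_hang = on_hang
        self._last = time.monotonic()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._watch, daemon=True)
        self._thread.start()

    def heartbeat(self):
        self._last = time.monotonic()

    def _watch(self):
        # Poll every 50 ms; fire once, then shut down.
        while not self._stop.wait(0.05):
            if time.monotonic() - self._last > self.timeout:
                self.on_hang()
                self._stop.set()

    def stop(self):
        self._stop.set()
        self._thread.join()

hangs = []
wd = HangWatchdog(timeout=0.2, on_hang=lambda: hangs.append("rank stalled"))
for _ in range(3):   # healthy phase: regular heartbeats
    time.sleep(0.05)
    wd.heartbeat()
time.sleep(0.5)      # simulated hang: heartbeats stop
wd.stop()
```

The same pattern scales up by having every rank heartbeat to a central (or hierarchical) monitor, which is also where error isolation starts: the first rank to miss its deadline identifies where to look.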

Skills

Required

  • Bachelor’s, Master’s, or PhD in Computer Science, Electrical Engineering, or a related field, or equivalent experience
  • Proficiency in C++ and Python
  • 6+ years of relevant experience
  • Strong understanding of distributed systems concepts, parallel programming, and fault tolerance in large-scale computing environments
  • Familiarity with AI frameworks such as PyTorch, JAX/XLA, TensorFlow, or similar
  • Experience with debugging and profiling tools (e.g., gdb, perf, valgrind, NVIDIA Nsight)
  • Excellent problem-solving skills

Nice to have

  • Hands-on experience in training models or working with model training teams
  • Hands-on experience with CUDA, NCCL, or MPI for GPU-accelerated computing, especially at extreme-scale
  • Knowledge of checkpointing strategies, error mitigation, or fault-tolerant computing in AI training
  • Experience working with large-scale AI clusters, HPC environments, or cloud-based AI workloads
  • Strong systems programming skills and experience with low-level performance tuning

What the JD emphasized

  • AI Software Resiliency
  • AI supercomputers
  • 100,000+ GPUs
  • cluster downtime towards zero
  • fast checkpoint-recovery
  • error detection
  • error isolation
  • straggler/hang detection
  • large-scale distributed systems
  • fault tolerance
  • silent data corruption (SDC)
  • AI frameworks like PyTorch and JAX/XLA
  • large-scale AI clusters
  • HPC environments

Other signals

  • monitoring tools
  • CI/CD pipelines
  • AI training and inference workloads
  • CUDA, NCCL, or MPI
  • checkpointing strategies
  • fault-tolerant computing in AI training