Senior System Architect, Infrastructure Reliability

NVIDIA · Semiconductors · Santa Clara, CA +4

This role focuses on architecting and building a failure attribution framework for large-scale heterogeneous EDA (Electronic Design Automation) workloads. The goal is to automate the identification of root causes for job failures by analyzing telemetry from CPU and GPU clusters, distinguishing between hardware, infrastructure, and software issues, and enabling proactive measures to prevent failures. While machine learning is mentioned for failure classification, the core of the role is in distributed systems, reliability engineering, and infrastructure, not in developing AI models as a primary deliverable.

What you'd actually do

  1. Architect Failure Attribution Frameworks: Build a scalable "flight recorder" for EDA jobs that captures high-fidelity state across the CPU, GPU, and Fabric at the moment of failure.
  2. Build automated diagnostics that correlate GPU XID errors, PCIe bus failures, and CUDA memory exceptions. Connect these errors with system-level events such as OOM kills or NUMA-related hangs.
  3. Distributed Logging & Tracing: Implement low-overhead tracing mechanisms (using tracing tools or custom agents) that provide visibility into job execution across multi-node Slurm or Kubernetes clusters.
  4. Root Cause Automation: Develop heuristics and machine-learning models that classify failures as "Hardware Fault," "Software Bug," or "Environment Issue," reducing the Mean Time to Identify (MTTI) for R&D teams.
  5. Resiliency Engineering: Work closely with hardware and infrastructure teams to define "signals of impending failure," enabling proactive job migration or checkpointing before a crash occurs.
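
To make the attribution idea in items 2 and 4 concrete, here is a minimal Python sketch of a rule-based classifier. It is an illustration under stated assumptions, not NVIDIA's implementation: the `Event` record, the event-kind labels, the 30-second window, and the rule ordering are all hypothetical. In particular, treating every GPU XID as a hardware signal is a deliberate simplification, since some XIDs are triggered by application bugs.

```python
from dataclasses import dataclass

# Hypothetical, simplified telemetry record. A real pipeline would carry
# far more context (job ID, device index, raw error payload).
@dataclass(frozen=True)
class Event:
    ts: float   # seconds since epoch
    node: str   # hostname of the node that emitted the event
    kind: str   # illustrative label, e.g. "gpu_xid", "oom_kill"

# Assumed mapping from event kind to failure class (illustration only).
HARDWARE_KINDS = {"gpu_xid", "pcie_error"}
ENVIRONMENT_KINDS = {"oom_kill", "numa_hang"}
SOFTWARE_KINDS = {"cuda_mem_exception"}


def events_near_failure(events, node, failure_ts, window_s=30.0):
    """Return events on `node` from the `window_s` seconds before the failure."""
    return [
        e for e in events
        if e.node == node and 0 <= failure_ts - e.ts <= window_s
    ]


def classify_failure(events, node, failure_ts, window_s=30.0):
    """Classify a job failure from correlated events, hardware signals first."""
    kinds = {e.kind for e in events_near_failure(events, node, failure_ts, window_s)}
    if kinds & HARDWARE_KINDS:
        return "Hardware Fault"
    if kinds & ENVIRONMENT_KINDS:
        return "Environment Issue"
    if kinds & SOFTWARE_KINDS:
        return "Software Bug"
    return "Unclassified"


if __name__ == "__main__":
    telemetry = [
        Event(100.0, "n1", "pcie_error"),
        Event(105.0, "n1", "cuda_mem_exception"),
        Event(50.0, "n2", "oom_kill"),
    ]
    print(classify_failure(telemetry, "n1", failure_ts=110.0))
    print(classify_failure(telemetry, "n2", failure_ts=60.0))
```

The fixed precedence (hardware, then environment, then software) reflects the assumption that a hardware signal near the failure time is the more likely root cause when several events co-occur. The machine-learning models the posting mentions would presumably replace or refine such hand-written rules.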

Skills

Required

  • Distributed Systems Mastery
  • Experience building automated RCA (Root Cause Analysis) pipelines for HPC or cloud-scale environments.
  • CPU Architecture Deep-Dive: Expert knowledge of x86/ARM node-level metrics: IPC (Instructions Per Cycle), cache contention, NUMA imbalance, and hardware interrupts.
  • Strong C++ and Python skills
  • Familiarity with cluster resource managers (Slurm, LSF, or Kubernetes)

Nice to have

  • Low-Level Diagnostics: Expert knowledge of the Linux kernel and its error-reporting interfaces (/dev/mcelog, dmesg, journald).
  • GPU Infrastructure Proficiency: Deep experience with NVIDIA DCGM (Data Center GPU Manager) and the NVIDIA Management Library (NVML).
  • Experience with tools for non-intrusive monitoring of application health and syscall-level failure patterns.
  • Experience with checkpoint/restore technologies (like CRIU)

What the JD emphasized

  • automated RCA (Root Cause Analysis) pipelines
  • machine learning