Distinguished Resiliency and Safety Architect, GPU Diagnostics

NVIDIA · Semiconductors · Santa Clara, CA

The role focuses on designing, developing, and maintaining diagnostic software for NVIDIA GPUs and SoCs, specifically for resiliency in datacenters and functional safety in autonomous vehicles and robots. It involves identifying hardware defects, addressing coverage gaps, studying failure mechanisms, and ensuring compliance with safety standards like ISO 26262. The position requires deep understanding of hardware/software boundaries, high-performance computing systems, and proficiency in C/C++, CUDA, and Python.

What you'd actually do

  1. Design, develop, and maintain a diagnostics software suite that efficiently stress tests NVIDIA GPUs and SoCs to identify hardware defects, including defects that cause silent data corruption.
  2. Address coverage gaps in the NVIDIA diagnostic suite exposed by silicon failures on customer workloads or test suites.
  3. Develop tests for GPUs in automotive functional safety contexts, including low-level routines that exercise instruction sets, memory subsystems, and interrupt mechanisms, in compliance with ISO 26262 and related safety standards.
  4. Study silent data corruption, intermittent faults, and hard-to-reproduce failures in the field, including customer returns (RMAs), to establish root causes and improve detection by diagnostics.
  5. Support deployment of diagnostics in pre-production qualification environments as well as in large-scale production use.

Skills

Required

  • Master’s or PhD degree in Computer Science, Computer Engineering, Electrical Engineering, or a closely related field, or equivalent experience.
  • 15+ years of relevant experience.
  • Ability to reason across hardware/software boundaries to debug complex system-level issues.
  • In-depth understanding of the architecture and micro-architecture of high-performance computing systems.
  • Strong knowledge of hardware failure mechanisms that can result in incorrect computation.
  • Proficiency in C/C++ and CUDA programming.
  • Scripting and automation with Python or similar.
  • Understanding of the software development life cycle, from requirements to testing closure and maintenance, including creating customer releases and documentation.
  • Excellent interpersonal skills and ability to collaborate with on-site and remote teams.
  • Strong debugging and analytical skills.
  • Self-driven and results-oriented.

Nice to have

  • Familiarity with GPU and SoC architectures and with machine learning/deep learning concepts.
  • Understanding of the factors that cause silent data corruption in hardware.
  • Ability to use high-performance libraries and write hand-crafted kernels where necessary to create stress conditions that induce hardware failures.
  • Experience in embedded software development.

What the JD emphasized

  • ISO 26262
  • silent data corruption
  • functional safety