Senior System Failure Analysis Engineer

NVIDIA · Semiconductors · Yokneam, Israel

This role focuses on system failure analysis, investigating complex product failures across hardware, software, and firmware. The engineer will lead investigations, synthesize data, and utilize AI-driven methodologies to identify root causes and systemic risks, supporting customer quality and guiding future product robustness.

What you'd actually do

  1. Hands-on Lab Investigation: You are active in the lab environment. You perform advanced debugging, characterize system behavior, reproduce failures, and use sophisticated lab equipment to validate hypotheses, bridging the gap between high-level data and physical hardware reality.
  2. Multidisciplinary Failure Analysis: Lead deep-dive investigations into system-level failures, understand and analyze how customers use the product, and diagnose how software execution, firmware logic, and hardware components interact to cause specific failure modes.
  3. Root Cause Ownership: Drive the investigation lifecycle from initial symptom to final physics-of-failure or logic-error identification.
  4. Task Force Leadership: Orchestrate and lead cross-organizational technical task forces at the company level. You align experts from HW, SW, Mechanical, and NPI teams to solve high-priority technical problems.
  5. Advanced Data & AI Integration: Define and utilize sophisticated data analysis tools and AI-driven methodologies. You correlate customer failure patterns with production telemetry and RMA history to identify hidden trends and systemic risks.

Skills

Required

  • Expert-level experience with lab equipment
  • B.Sc./B.Tech. in Electrical Engineering or a related technical field
  • 5+ years of experience in Product Development, System-Level Debugging, or Architecture
  • Proven ability to troubleshoot issues at the interface of hardware, software, and firmware
  • Experience using data analysis tools
  • Strong interest in applying AI/Machine Learning to automate and scale failure analysis processes
  • The ability to lead technical teams through high-pressure investigations and clearly communicate findings to both engineering and quality stakeholders

Nice to have

  • Experience in Board Design combined with SW or Firmware development
  • A track record of solving technical problems during the transition from prototype to high-volume manufacturing
  • Experience building custom Python scripts or SQL dashboards to visualize and analyze global product failure distributions
  • Ability to provide technical feedback to R&D teams based on FA findings to improve the robustness of future products

What the JD emphasized

  • Expert-level experience with lab equipment
  • 5+ years of experience in Product Development, System-Level Debugging, or Architecture
  • Proven ability to troubleshoot issues at the interface of hardware, software, and firmware
  • Strong interest in applying AI/Machine Learning to automate and scale failure analysis processes