Senior RAS and Power Management Firmware Architect

NVIDIA · Semiconductors · Yokneam, Israel +2

NVIDIA is seeking a Senior RAS and Power Management Firmware Architect to define, implement, and guide firmware architecture for reliability, availability, serviceability, and power management across next-generation NVIDIA Networking products and platforms. The role involves working with cross-functional teams to build robust, diagnosable, and power-efficient systems for large-scale deployments.

What you'd actually do

  1. Define platform-level firmware architecture for RAS and power management across SoCs, accelerators, DPUs, servers, embedded systems, and data center platforms.
  2. Own error detection, classification, containment, recovery, escalation, and reporting architecture.
  3. Define firmware architecture for power sequencing, power states, reset flows, thermal and power fault handling, idle management, and recovery from power-related failures.
  4. Create firmware specifications for hardware error handling, health monitoring, crash capture, telemetry, diagnostics, debug data, and field serviceability.
  5. Define interfaces and contracts between firmware, hardware, operating systems, BMCs, management controllers, platform software, and cloud/service infrastructure.

Skills

Required

  • BSc, MS, or PhD in Electrical Engineering, Computer Science, Computer Engineering, or equivalent experience.
  • 7+ years of relevant experience in firmware, platform architecture, embedded systems, or low-level systems software.
  • Deep understanding of RAS principles, fault modeling, error containment, recovery policies, diagnosability, and serviceability requirements.
  • Experience architecting firmware for complex hardware platforms such as SoCs, accelerators, DPUs, servers, networking devices, or embedded systems.
  • Strong knowledge of power management concepts, including power sequencing, reset architecture, thermal and power fault handling, power state transitions, and platform recovery flows.
  • Familiarity with boot firmware, UEFI/BIOS, BMC, embedded controllers, RTOS, embedded Linux, or platform management stacks.
  • Strong understanding of hardware/software interfaces, registers, interrupts, telemetry paths, debug infrastructure, and firmware-to-hardware contracts.
  • Programming and debugging fundamentals across languages such as C/C++, Python/Perl scripting, Verilog, or assembly (including RISC-V).
  • Ability to lead cross-functional architecture discussions and drive alignment across hardware, firmware, software, validation, product, and customer-facing teams.
  • Excellent communication skills, strong technical leadership, and a real passion for working collaboratively.

Nice to have

  • Experience with PCIe AER, CXL RAS, memory RAS, ECC/parity, accelerator RAS, networking RAS, high-availability systems, or large-scale data center platforms.
  • Knowledge of ACPI, SMBIOS, UEFI, PLDM, MCTP, Redfish, IPMI, or cloud telemetry systems.
  • Experience with power/thermal fault handling, dynamic power management, platform power sequencing, low-power states, or autonomous recovery mechanisms.
  • Background in silicon bring-up, platform validation, production diagnostics, or customer failure analysis.
  • Prior technical leadership experience as a firmware architect, principal engineer, platform lead, or domain owner.
