Principal Architect, System Software - Orbital Data Center

NVIDIA · Semiconductors · Santa Clara, CA +1 · Remote

NVIDIA is seeking a Principal Architect to lead the system software architecture for its Orbital Data Center (ODC) modules, specifically Space-1. The role involves designing and implementing a resilient, production-ready inference platform for the harsh environment of low-Earth orbit, covering the full stack from firmware to AI workloads. The architect will collaborate with hardware teams, drive customer use cases, and ensure the platform operates reliably for 5-year missions, enabling AI adoption in space.

What you'd actually do

  1. Own the system architecture for the inference stack and other applications running on this class of products, and make it resilient to any fault that occurs in space.
  2. Co-architect with the orbital hardware system architecture team to define interfaces, partitioning, and trade-offs across silicon, board, firmware, OS, and AI workload layers for 5-year LEO missions.
  3. Own end-to-end system software architecture for Space-1 and successor Orbital Data Center modules — covering data center stack, BMC firmware, BIOS, host OS, GPU/CPU drivers, CUDA, DCGM, and manageability telemetry as a single integrated stack.
  4. Define the manageability architecture for an unreachable, autonomous data center: remote bring-up, in-orbit firmware update, dual-module redundancy, fault containment, recovery from SEU/SEFI events, and telemetry for fleets ranging from tens to millions of nodes.
  5. Architect rad-tolerant system software behaviors — ECC handling, memory scrubbing, latch-up mitigation, deterministic recovery, and graceful degradation through 5 years and up to ~8,000 thermal cycles in dawn–dusk sun-synchronous orbit.

Skills

Required

  • 15+ years of relevant experience in server/platform system software — spanning compute libraries, BMC firmware, BIOS, host OS, drivers, and manageability
  • BS, MS, or PhD in EE/CS or a related field (or equivalent experience).
  • Working experience in building AI infrastructure and systems in space.
  • Proven record of architecting and delivering platform software for large-scale data centers or mission-critical embedded systems.
  • Strong knowledge of server architecture, data center manageability, and full-stack integration of firmware with OS and accelerator software.
  • Hands-on experience with data center health management workflows, telemetry, and fault management at scale.
  • Solid understanding of hardware management interfaces (USB, SMBus/I2C, PCIe) and proficiency with modern management protocols including Redfish, MCTP, and PLDM.
  • Strong and demonstrable skill in C/C++ and Python.
  • Experience programming and debugging server platforms, including pre-silicon and platform bring-up environments.
  • Experience with source control management (e.g., Git, Perforce) and project management tools such as Jira.
  • Excellent written and oral communication skills, a strong work ethic, a strong sense of teamwork, a love of producing quality work, and a commitment to finishing your tasks every day.
  • You are a self-starter who loves to find creative solutions to complicated problems and is hands-on with coding.

Nice to have

  • Experience architecting platform software for space, aerospace, defense, or other radiation-, thermal-, and vibration-constrained environments, including SEU/SEFI mitigation, ECC strategy, TID/SEE qualification, and rad-hard design partitioning.
  • Experience as part of a startup or initiative directly related to space data centers.
  • Hands-on experience with autonomous, remote, or unreachable data center operations — in-orbit or in-field firmware update, dual-module redundancy, and recovery without physical access.
  • Hands-on experience with x86 or ARM (Grace/Vera) system architecture and the NVIDIA AI software stack (CUDA, DCGM, DOCA/OFED, GPU drivers, DGX OS).
  • Familiarity with NSA PHIPs security, post-quantum networking, and aerospace standards (VPX, MIL-STD shock/vibe, NASA EEE-INST-002).
  • Proven technical leadership driving large complex programs with 50+ engineers across firmware, OS, driver, and AI stack teams.
  • Skilled in reviewing hardware schematics and PCB layouts for debugging, design verification, and collaboration with hardware engineers.

What the JD emphasized

  • architecting and delivering platform software for large-scale data centers or mission-critical embedded systems
  • Working experience in building AI infrastructure and systems in space
  • architect platform software for space, aerospace, defense, or other radiation, thermal, and vibration-constrained environments
  • Hands-on experience with autonomous, remote, or unreachable data center operations

Other signals

  • architecting inference platform for space
  • system software architecture for orbital data center
  • AI adoption in low-Earth orbit