System Software Engineer, First-party Hardware

OpenAI · AI Frontier · San Francisco, CA · Scaling

This role focuses on designing, building, and validating low-level system software for OpenAI's first-party AI hardware. It involves working with BMC, Linux, firmware, and various hardware interfaces to ensure the manageability and health of AI hardware systems. The role also includes owning the acceptance path for partner-delivered software, building automation for testing, and debugging complex issues across hardware and software boundaries. While the hardware is for AI workloads, the role itself is system software engineering, not direct AI model development.

What you'd actually do

  1. Design, develop, and maintain low-level firmware and system software for first-party AI hardware manageability, including BMC software, Redfish services, gNMI telemetry, firmware update and recovery flows, BIOS/UEFI interactions, platform drivers, and hardware diagnostics.
  2. Own integration and acceptance of partner and vendor software releases, including requirements, code and artifact review, reproducible builds, CI, regression monitoring, version tracking, acceptance criteria, and launch-readiness evidence.
  3. Build and maintain automation and CI infrastructure for testing and managing systems in our lab.
  4. Define and debug hardware management protocols across accelerators, host systems, management controllers, firmware, and platform services, including interfaces such as I2C, SMBus, PMBus, PCIe, Ethernet, GPIO, UART, and JTAG.
  5. Build system health monitoring, telemetry, remote diagnostics, and recovery paths that make hardware failures diagnosable in the lab, at manufacturing partners, and in production data centers.

Skills

Required

  • C, C++, Rust, or similar systems languages
  • Linux-based hardware platforms
  • embedded Linux
  • OpenBMC
  • Redfish
  • IPMI boundaries
  • BIOS/UEFI
  • bootloaders
  • firmware update systems
  • kernel drivers
  • RTOS
  • fleet management software
  • I2C, SMBus, PMBus, SPI, PCIe, Ethernet, USB, UART, GPIO, JTAG
  • power controllers
  • board-level debug tools
  • protocol analyzers
  • debugging live hardware using logs, packet captures, firmware traces, bus captures, BMC journals, and Linux tooling
  • hardware bring-up
  • manufacturing or qualification testing
  • system diagnostics
  • release validation
  • deployment of high-performance compute, accelerator, server, networking, storage, or embedded platforms
  • working with external vendors, manufacturing partners, or partner engineering teams

Nice to have

  • platform security topics such as secure boot, firmware signing, device provisioning, attestation, certificate handling, trusted update flows, or access-control design

What the JD emphasized

  • low-level system software
  • firmware
  • BMC software
  • hardware diagnostics
  • partner-delivered system software
  • debug issues across hardware and software boundaries
  • build infra and automation to test and manage devices in lab
  • guide partner deliverables
  • build validation evidence
  • carry platforms from bring-up through production deployment
  • low-level firmware and system software
  • Redfish services
  • gNMI telemetry
  • firmware update and recovery flows
  • BIOS/UEFI interactions
  • platform drivers
  • integration and acceptance of partner and vendor software releases
  • requirements
  • code and artifact review
  • reproducible builds
  • CI
  • regression monitoring
  • version tracking
  • acceptance criteria
  • launch-readiness evidence
  • automation and CI infra
  • testing and managing systems in our lab
  • hardware management protocols
  • accelerators
  • host systems
  • management controllers
  • platform services
  • I2C
  • SMBus
  • PMBus
  • PCIe
  • Ethernet
  • GPIO
  • UART
  • JTAG
  • system health monitoring
  • telemetry
  • remote diagnostics
  • recovery paths
  • hardware failures diagnosable
  • manufacturing partners
  • production data centers
  • validation and test automation
  • board bring-up
  • rack bring-up
  • qualification
  • manufacturing readiness
  • deployment readiness
  • long-term reliability
  • engineering releases into manufacturing-ready software recipes
  • images
  • versions
  • logs
  • limits
  • remediation mapping
  • provisioning hooks
  • secure artifact handling
  • traceable data export
  • complex production issues
  • hardware signals
  • BMC firmware
  • BIOS/UEFI
  • kernel drivers
  • network topology
  • PCIe behavior
  • power
  • thermals
  • boot
  • provisioning
  • manufacturing test
  • partner engineering teams
  • define software contracts
  • unblock bring-up
  • drive issues to closure
  • durable architecture notes
  • runbooks
  • validation records
  • decision documents
  • reproduce
  • operate
  • improve the platform
  • 7+ years of hands-on experience
  • exceptional accomplishments demonstrating equivalent expertise
  • embedded software
  • platform software
  • device drivers
  • C, C++, Rust
  • reliable software for real hardware
  • Linux-based hardware platforms
  • embedded Linux
  • OpenBMC
  • Redfish
  • BMCWeb
  • IPMI boundaries
  • bootloaders
  • firmware update systems
  • RTOS
  • fleet management software
  • hardware/software interfaces
  • I2C
  • SMBus
  • PMBus
  • SPI
  • PCIe
  • Ethernet
  • USB
  • UART
  • GPIO
  • JTAG
  • power controllers
  • board-level debug tools
  • protocol analyzers
  • debug live hardware
  • packet captures
  • firmware traces
  • bus captures
  • lab hosts
  • BMC journals
  • Linux tooling
  • carefully controlled experiments
  • hardware bring-up
  • manufacturing or qualification testing
  • system diagnostics
  • release validation
  • deployment of high-performance compute
  • accelerator
  • server
  • networking
  • storage
  • embedded platforms
  • reason across software, firmware, hardware, manufacturing, and operations boundaries
  • turn ambiguous problems into clear requirements, designs, tests, and decisions
  • working with external vendors
  • define deliverables
  • review technical work
  • platform security topics
  • secure boot
  • firmware signing
  • device provisioning
  • attestation
  • certificate handling
  • trusted update flows
  • access-control design