Software Engineer, Fleet Hardware Health

OpenAI · AI Frontier · San Francisco, CA · Scaling

Software Engineer responsible for the reliability and uptime of OpenAI's compute fleet, focusing on minimizing hardware failures in large-scale supercomputing infrastructure through automation and system-level investigations.

What you'd actually do

  1. Build and maintain automation systems for provisioning and managing server fleets.
  2. Develop tools to monitor server health, performance, and lifecycle events.
  3. Collaborate with clusters, networking, and infrastructure teams.
  4. Partner with external operators to ensure a high level of quality.
  5. Identify and fix performance bottlenecks and inefficiencies.

Skills

Required

  • Python
  • Go
  • Linux
  • networking
  • server hardware knowledge
  • SQL
  • PromQL
  • Pandas

Nice to have

  • low-level details of hardware components, protocols, and associated Linux tooling
  • PCIe
  • InfiniBand
  • power management
  • kernel performance tuning
  • IPMI
  • Redfish
  • High-performance computing (HPC)
  • distributed systems
  • hardware management protocols
  • Prometheus
  • Grafana

What the JD emphasized

  • reliability
  • uptime
  • hardware failure
  • supercomputing infrastructure
  • automation
  • comprehensive system-level investigations
  • automated solutions
  • go deep on problems
  • investigate as thoroughly as possible
  • build automation for detection and remediation at scale