Software Engineer, GPU Infrastructure - HPC

OpenAI · AI Frontier · San Francisco, CA · Scaling

Software Engineer on OpenAI's Fleet team, focused on the reliability and uptime of the company's large-scale compute fleet (data centers, GPUs, networking). The role involves building and maintaining automation systems that provision and manage servers and monitor their health, performance, and lifecycle events. It requires deep system-level investigations, building automated detection and remediation, and identifying performance bottlenecks. Prior hardware expertise is not required; bonus skills include familiarity with low-level hardware details, hardware management protocols, HPC, and monitoring tools.

What you'd actually do

  1. Build and maintain automation systems for provisioning and managing server fleets.
  2. Develop tools to monitor server health, performance, and lifecycle events.
  3. Collaborate with clusters, networking, and infrastructure teams.
  4. Partner with external operators to ensure a high level of quality.
  5. Identify and fix performance bottlenecks and inefficiencies.

Skills

Required

  • Experience managing large-scale server environments
  • Proficiency in Python, Go, or similar languages
  • Strong Linux, networking, and server hardware knowledge
  • Comfort digging into noisy data with SQL, PromQL, pandas, or any other tool

Nice to have

  • Experience with low-level details of hardware components, protocols, and associated Linux tooling (e.g., PCIe, InfiniBand, networking, power management, kernel perf tuning)
  • Knowledge of hardware management protocols (e.g., IPMI, Redfish)
  • High-performance computing (HPC) or distributed systems experience
  • Prior experience developing, managing, or designing hardware
  • Familiarity with monitoring tools (e.g., Prometheus, Grafana)

What the JD emphasized

  • reliability and uptime
  • minimizing hardware failure
  • troubleshooting these state-of-the-art systems at scale
  • keen focus on system-level comprehensive investigations
  • build automation for detection and remediation at scale