Site Reliability Engineer, Frontier Systems Infrastructure

OpenAI · AI Frontier · San Francisco, CA · Scaling

This role builds and operates the infrastructure for large-scale AI model training. It involves managing and scaling Kubernetes clusters, automating bare-metal bring-up, and keeping the compute clusters that power frontier AI research reliable and efficient. The work blends distributed systems engineering with hands-on infrastructure work, centered on the software layer that abstracts away the complexity of massive node counts across multiple data centers.

What you'd actually do

  1. Spin up and scale large Kubernetes clusters, including automation for provisioning, bootstrapping, and cluster lifecycle management
  2. Build software abstractions that unify multiple clusters and present a seamless interface to training workloads
  3. Own node bring-up from bare metal through firmware upgrades, ensuring fast, repeatable deployment at massive scale
  4. Improve operational metrics, such as cutting cluster restart times (e.g., from hours to minutes) and shortening firmware or OS upgrade cycles
  5. Integrate networking and hardware health systems to deliver end-to-end reliability across servers, switches, and data center infrastructure

Skills

Required

  • operating or scaling Kubernetes clusters or similar container orchestration systems
  • programming or scripting skills (Python, Go, or similar)
  • Infrastructure-as-Code tools such as Terraform or CloudFormation
  • bare-metal Linux environments
  • GPU hardware
  • large-scale networking
  • experience as an infrastructure, systems, or distributed systems engineer in large-scale or high-availability environments
  • Kubernetes internals, cluster scaling patterns, and containerized workloads
  • cloud infrastructure concepts (compute, networking, storage, security)
  • automating cluster or data center operations

Nice to have

  • GPU workloads
  • firmware management
  • high-performance computing

What the JD emphasized

  • largest supercomputers
  • largest datacenters
  • massive scale
  • magnitude of nodes
  • on fire
  • high-growth or hyperscale environments
  • bare-metal Linux environments
  • large-scale networking
  • fast-moving, high-impact operational problems
  • mission-critical systems