Software Engineer, Frontier Clusters Infrastructure

OpenAI · AI Frontier · San Francisco, CA · Scaling

This role focuses on building and operating the infrastructure for large-scale AI model training. It involves managing and scaling Kubernetes clusters, automating bare-metal bring-up, and developing software abstractions for compute clusters. The goal is to ensure the reliability and efficiency of hyperscale supercomputers used for frontier model training.

What you'd actually do

  1. Spin up and scale large Kubernetes clusters, including automation for provisioning, bootstrapping, and cluster lifecycle management
  2. Build software abstractions that unify multiple clusters and present a seamless interface to training workloads
  3. Own node bring-up from bare metal through firmware upgrades, ensuring fast, repeatable deployment at massive scale
  4. Improve operational metrics: reduce cluster restart times (e.g., from hours to minutes) and accelerate firmware or OS upgrade cycles
  5. Integrate networking and hardware health systems to deliver end-to-end reliability across servers, switches, and data center infrastructure

Skills

Required

  • experience operating or scaling Kubernetes clusters or similar container orchestration systems in high-growth or hyperscale environments
  • strong programming or scripting skills (Python, Go, or similar)
  • familiarity with Infrastructure-as-Code tools such as Terraform or CloudFormation
  • bare-metal Linux environments
  • GPU hardware
  • large-scale networking
  • solving fast-moving, high-impact operational problems
  • building automation to eliminate manual work
  • balancing careful engineering with the urgency of keeping mission-critical systems running
  • experience as an infrastructure, systems, or distributed systems engineer in large-scale or high-availability environments
  • strong knowledge of Kubernetes internals, cluster scaling patterns, and containerized workloads
  • proficiency in cloud infrastructure concepts (compute, networking, storage, security)
  • automating cluster or data center operations

Nice to have

  • GPU workloads
  • firmware management
  • high-performance computing

What the JD emphasized

  • largest supercomputers
  • cutting edge model training
  • hyperscale supercomputers
  • frontier models
  • next generation of compute clusters
  • frontier research
  • massive scale
  • magnitude of nodes
  • extreme load
  • on fire