Software Engineer, System Enablement

OpenAI · AI Frontier · San Francisco, CA · Scaling

The Software Engineer, System Enablement role at OpenAI focuses on building and maintaining the infrastructure backbone for deploying and operating AI models. The work spans end-to-end system bring-up, provisioning automation, fleet management, and integrating new hardware into schedulable capacity. It requires expertise in Kubernetes, Infrastructure-as-Code (IaC), provisioning, networking, and automation scripting, with a focus on system stability, observability, and readiness for internal customers.

What you'd actually do

  1. Own the end-to-end bring-up and bootstrap path for new systems and compute nodes, from _bare metal/early access in lab or production/cloud environments_ to _schedulable fleet capacity_: image build, user-data/config, cluster join, and readiness gates.
  2. Build and maintain “first-class” golden image + provisioning workflows across lab and production environments, including working with partner-provided base images and reconciling OS/version requirements.
  3. Work with partner teams to integrate nodes into our fleet infrastructure and IaC pipelines (Terraform, Chef, etc.), ensuring cloud resources map cleanly onto our internal lifecycle expectations (e.g., VMSS/instance pools, image references).
  4. Partner with scheduling and platform owners to ensure new hardware is reachable and scheduled (pool definitions, network/WAN connectivity/routing, admission controls, platform-specific quirks), including cases where new SKUs require changes for scheduling integration.
  5. Drive registration and inventory correctness (e.g., systems that track nodes and their metadata), including hands-on support to get nodes registered and visible end-to-end.

Skills

Required

  • 5+ years of experience in systems software development and building/operating Linux-based infrastructure in production or pre-production environments.
  • Kubernetes cluster operations (node lifecycle, bootstrap/join, debugging control-plane connectivity)
  • Infrastructure-as-Code / config management (Terraform, Chef/Ansible, etc.)
  • Provisioning and imaging (PXE/iPXE, golden images, cloud-init/user-data)
  • Networking fundamentals (L2/L3, routing, DNS, firewalling; comfort debugging reachability)
  • Proven ability to write automation in Python/Go/Bash and ship operational tooling/runbooks.

Nice to have

  • Experience bringing up new hardware platforms (early silicon/servers/NICs) in a lab setting and turning them into stable fleet capacity.
  • Multi-cloud operational experience (Azure/GCP/AWS/OCI), especially with compute pools (e.g., VMSS / instance pools).
  • Experience building telemetry/health pipelines (agent-based metrics/logging, health rollups, readiness criteria).
  • Familiarity with WAN, peering, and multi-site network concepts for cluster deployments.

What the JD emphasized

  • early access
  • bare metal
  • new hardware platforms
  • early silicon/servers/NICs
  • new SKUs