Systems Engineer (Network / Storage / Systems)

OpenAI · AI Frontier · San Francisco, CA · Scaling

The Stargate team at OpenAI builds and operates the physical and logical infrastructure for large-scale AI systems, focusing on compute environments for frontier model training and inference. This role involves architecting, validating, and operationalizing core infrastructure systems across networking, storage, system bring-up, and hardware debugging. The engineer will partner with various teams and vendors to ensure efficient deployment and reliable operation of new compute platforms.

What you'd actually do

  1. Own system engineering workstreams across one or more critical domains including networking, storage, system validation, or bring-up.
  2. Design and improve top-of-network architectures spanning frontend, WAN, OOB, firewall, and adjacent infrastructure layers.
  3. Define storage architectures across in-rack, in-pod, cluster, and cloud tiers with focus on performance, lifecycle, and cost efficiency.
  4. Lead system bring-up for new hardware platforms including imaging, provisioning, validation, and readiness for production deployment.
  5. Build tools and automation that improve lab operations, SKU onboarding, fleet readiness, and deployment velocity.

Skills

Required

  • systems engineering
  • infrastructure engineering
  • hardware platforms
  • large-scale compute environments
  • networking
  • storage systems
  • server platforms
  • firmware
  • Linux systems
  • distributed infrastructure
  • bringing up new hardware systems or clusters
  • debugging low-level hardware/software issues
  • cross-functional RCA efforts
  • hyperscale infrastructure
  • AI clusters
  • HPC environments
  • data center systems
  • working with OEM, ODM, JDM, or hardware vendors
  • Python
  • Go
  • Bash
  • scripting or software skills
  • fast-moving environments
  • high ownership
  • evolving technical requirements

Nice to have

  • GPU clusters or accelerator-based infrastructure
  • cluster management
  • provisioning
  • fleet lifecycle tooling
  • network automation
  • storage optimization
  • systems observability
  • working across hardware and software engineering organizations
  • scaling greenfield infrastructure deployments
  • rapid expansion programs

What the JD emphasized

  • 7+ years of experience
  • Strong technical depth in one or more areas: networking, storage systems, server platforms, firmware, Linux systems, or distributed infrastructure.
  • Experience bringing up new hardware systems or clusters in lab or production environments.
  • Experience debugging low-level hardware/software issues and driving cross-functional RCA efforts.