Hardware Development Infrastructure Engineer

OpenAI · AI Frontier · San Francisco, CA · Scaling

This role builds and runs the infrastructure for hardware development at OpenAI, which is creating AI-native silicon. The engineer will work on regression systems, CI/CD pipelines, cloud and cluster platforms, and data foundations that support the hardware development lifecycle. The role requires strong infrastructure fundamentals, cloud platform experience, CI/CD experience, and programming skills; familiarity with chip development workflows is a plus.

What you'd actually do

  1. Partner with hardware teams on workflows and tooling: Embed with teams across DV, PD, emulation, formal, and software to understand development flows, identify failure modes, and deliver tooling (CLIs, services, APIs) that reduces manual work and accelerates iteration.
  2. Build and operate regression systems at scale: Own regressions end-to-end, from definition and scheduling to execution, results ingestion, triage, and reporting, while improving throughput and reproducibility and reducing flakes.
  3. Own CI/CD for infrastructure and tooling: Design and operate pipelines for infrastructure-as-code, services, images, and cluster configuration changes, including testing, gated deploys, staged rollouts, and safe rollback.
  4. Run cloud and HPC platforms: Design, provision, and operate cloud infrastructure (Azure preferred) and HPC/HTC clusters (e.g., Slurm), tuning scheduling policies, autoscaling, node lifecycles, and cost-performance tradeoffs.
  5. Build data foundations and visibility: Develop ETL pipelines to ingest metrics, logs, and results; operate databases for workflow metadata and outcomes; and build dashboards that surface efficiency, utilization, and reliability trends.

Skills

Required

  • Strong infrastructure fundamentals: cloud platforms, networking, security, performance, and automation
  • Experience operating cloud environments (Azure preferred; AWS, GCP, or OCI acceptable)
  • Strong infrastructure-as-code practices (e.g., Terraform, Bicep; configuration management tools a plus)
  • Strong programming skills (Python preferred)
  • Solid software engineering and scripting practices
  • Experience building and operating CI/CD systems (e.g., Jenkins, Buildkite, GitHub Actions), including testing and release workflows
  • Database experience (e.g., Postgres or MySQL), including schema design, migrations, indexing, and operational safety
  • Clear communicator with strong judgment—able to explain tradeoffs, propose pragmatic solutions, and articulate a realistic vision for scalable infrastructure

Nice to have

  • Familiarity with chip development workflows
  • Depth in at least one EDA domain (e.g., DV, PD, emulation, or formal verification)
  • Experience operating Slurm or other large-scale cluster schedulers
  • Experience with enterprise authentication and directory services (e.g., Entra ID, LDAP, FreeIPA, SSSD)
  • Experience building or operating backend and middleware systems such as message queues, caches, artifact stores, or internal service platforms
  • Familiarity with high-performance storage architectures and data movement optimization
  • Experience running and monitoring license servers for expensive or capacity-constrained toolchains

What the JD emphasized

  • AI workloads
  • AI-native silicon
  • AI models
  • AI