Security Reliability Engineer

OpenAI · AI Frontier · San Francisco, CA · Security

OpenAI is seeking a Security Reliability Engineer to join its Infrastructure Engineering team. This role focuses on building, operating, and scaling reliable, secure infrastructure for identity, access, endpoint, and shared platform services. The engineer will own infrastructure systems end to end, applying SRE principles, implementing infrastructure as code, and driving automation for leverage and safety. The role requires significant experience operating and architecting mission-critical infrastructure and influencing cross-functional partners.

What you'd actually do

  1. Define and evolve infrastructure patterns for on-prem and hybrid environments, including self-hosted platforms, vendor-supported systems, and lab environments.
  2. Own the full lifecycle of infrastructure systems, including deployment, upgrades, patching, recovery, and ongoing operations.
  3. Identify high-leverage automation opportunities that eliminate manual toil and reduce operational risk across infrastructure and access-related systems.
  4. Work closely with Security, Identity, Network, Client Platform, and Platform Engineering teams to operate secure, policy-enforced infrastructure.
  5. Bring 10 or more years of experience operating and architecting mission-critical infrastructure in high-reliability environments.

Skills

Required

  • Infrastructure as Code (Terraform, Chef, Ansible)
  • Site Reliability Engineering (SRE) principles
  • Security principles
  • Identity and Access Management (IAM) systems
  • Monitoring and Alerting
  • Incident Response
  • Containerization (Kubernetes, Docker)
  • Git workflows
  • Automation

Nice to have

  • Experience operating infrastructure for R&D or specialized labs
  • Fleet, endpoint, or virtual desktop platforms (FleetDM, Chef, Azure Virtual Desktop)
  • Partnering with identity or security engineering teams on hardened, policy-enforced infrastructure

What the JD emphasized

  • 10 or more years of experience operating and architecting mission-critical infrastructure in high-reliability environments
  • Led the design and maturation of complex on-prem, hybrid, or cloud-integrated systems
  • Apply Site Reliability Engineering principles at scale
  • Operate comfortably in ambiguity
  • Influence cross-functional partners across security, identity, network, and platform teams to land reliability improvements without direct authority