Datacenter Incident Program Manager

OpenAI · AI Frontier · United States · Remote · Scaling

This role designs, operates, and improves the incident management lifecycle for mission-critical data center environments that support AI infrastructure. It covers establishing standards, leading incident response, driving post-incident reviews, and implementing tooling to keep those environments performant and reliable.

What you'd actually do

  1. Define and maintain incident severity levels (SEV definitions), classification criteria, and escalation thresholds.
  2. Establish end-to-end incident response standards: protocols, lifecycle stages (declare → stabilize → mitigate → recover → close), and operating cadence.
  3. Build and maintain governance artifacts: runbooks, war room formats, reporting templates, and decision/communication standards.
  4. Create and operationalize notification trees, stakeholder comms templates (initial, periodic updates, recovery/closure), and executive escalation criteria.
  5. Define clear RACI across Facilities, Hardware Ops, Network, Security, and vendor/partner teams, including handoffs and accountability paths.
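
For illustration only, the lifecycle stages named in item 2 (declare → stabilize → mitigate → recover → close) could be modeled as an ordered enumeration. This is a minimal sketch, not something the posting specifies: the `Stage` and `advance` names are hypothetical, and the strictly linear progression is an assumption (a real process might allow an incident to move back to an earlier stage).

```python
from enum import IntEnum


class Stage(IntEnum):
    """Lifecycle stages, in the order the posting lists them (hypothetical model)."""
    DECLARE = 1
    STABILIZE = 2
    MITIGATE = 3
    RECOVER = 4
    CLOSE = 5


def advance(current: Stage) -> Stage:
    """Return the stage that follows `current`.

    Assumption: stages only move forward, one step at a time.
    A closed incident has no next stage, so that case raises ValueError.
    """
    if current is Stage.CLOSE:
        raise ValueError("incident is already closed")
    return Stage(current + 1)
```

Calling `advance(Stage.DECLARE)` returns `Stage.STABILIZE`. Keeping the order in one place means reporting templates and escalation rules can refer to stages by name rather than by free text.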

Skills

Required

  • 7+ years in mission-critical infrastructure, data center operations, or reliability engineering
  • Direct experience leading major incidents (P1/P0 equivalent)
  • Strong familiarity with facilities systems, hardware operations, or network infrastructure
  • Demonstrated experience running war rooms and executive updates
  • Experience conducting root cause analysis and corrective action tracking
  • Ability to remain calm and decisive under high-pressure conditions

Nice to have

  • Experience in hyperscale or high-density AI compute environments
  • Background in facilities commissioning, facility operations, hardware operations, or network reliability
  • Familiarity with ISO-based quality systems or structured operational documentation frameworks
  • Experience implementing incident tooling (PagerDuty, ServiceNow, Jira, etc.)

What the JD emphasized

  • mission-critical data center environments
  • major incidents (P1/P0 equivalent)
  • high-density AI compute environments