What you'd actually do

Owns 100% uptime operations for a portfolio of very large/complex data center sites, ensuring consistent execution of shift coverage, operational handoffs, and standardized runbooks.

Defines the enterprise strategy for real-time monitoring and operational health across the portfolio (BMS/EPMS/SCADA/telemetry), aligning KPIs to uptime, reliability, safety, and customer outcomes.

Governs standards for event triage, incident command, escalation, stakeholder communications, and customer-impacting notifications.

Oversees evaluation of power, cooling, physical space, network/support infrastructure, and security capacity, ensuring readiness for load growth and peak conditions.

Drives adoption of automation for alarm correlation, workflow orchestration, remote operations, and predictive analytics to reduce human error and improve response times.

Skills

Required

Data center operations leadership
Performance monitoring and governance
Critical infrastructure management (power, cooling, controls, life safety, security)
Capacity planning and readiness assessment
Automation and telemetry adoption
Predictive maintenance implementation
Crisis management and incident response
Continuous improvement methodologies
Asset lifecycle management
Vendor performance management
Financial management and investment governance
Risk assessment and mitigation
Compliance and audit readiness
Team building and development
Cross-functional collaboration

Nice to have

Experience with BMS/EPMS/SCADA systems
Knowledge of LOTO and energized work policies
Familiarity with specific data center technologies (e.g., vSphere, Kubernetes, networking protocols)

What the JD emphasized

100% uptime operations

Mission Critical Operations (MCO)

high-severity incidents

real-time monitoring

operational health

uptime

reliability

safety

customer outcomes

MTTR/MTBF

repeat events

risk posture

preventive and predictive maintenance

MOP/SOP/EOP quality

change control

operational compliance

event triage

incident command

escalation

stakeholder communications

customer-impacting notifications

post-incident reviews

root cause analysis (RCA)

corrective/preventive actions (CAPA)

executive escalation point

complex incidents

cross-regional reliability risks

power, cooling, physical space, network/support infrastructure, and security capacity

load growth

peak conditions

resiliency standards

redundancy

maintenance windows

failover testing

generator/UPS readiness

fuel strategy

operational risk assessments

audit-ready

applicable standards

internal controls

automation

alarm correlation

workflow orchestration

remote operations

predictive analytics

human error

response times

data quality

instrumentation

high-confidence operational decision-making

expansions/new builds/site launches

Day-0/Day-1 readiness

staffing

training

spares

procedures

turnover acceptance criteria

operability

maintainability

safety into design and commissioning

lifecycle strategy

critical infrastructure

supporting hardware assets

installation

maintenance

spares

logistics

inventory

decommissioning

vendor performance

SLAs

service quality

compliance

performance gaps

multi-million dollar investments

upgrades

capacity expansion

reliability improvements

risk remediation

strategic oversight

mission-critical operational initiatives

reliability risk

customer impact

compliance needs

engineering, construction, security, network/IT, program management, and business stakeholders

reliable 24/7 delivery

complex operational/technical issues

disciplined, data-driven resolution

prevention of recurrence

operational excellence

training programs

certifications

drills

sustained improvement roadmap

availability

risk reduction

high-performing 24/7 operations organization

shift leaders

incident commanders

regional operations management

24/7/365 environment

incident and team management across all shifts

life safety

safe work practices

LOTO

energized work policies

Leads enterprise-wide performance monitoring and real-time operational governance, ensuring standardized processes for shift operations, event management, escalation, incident command, and communications. Oversees capacity and readiness for critical infrastructure (power, cooling, controls, life safety, and physical security), ensuring sites are resilient, compliant, and audit-ready.

Partners with executive leadership on multi-year operational, reliability, and financial targets; drives adoption of automation, telemetry, and predictive maintenance to reduce risk and improve mean time to restore (MTTR). Establishes crisis management standards, continuous improvement mechanisms, and a culture of operational excellence, knowledge sharing, and accountability.

Leads major expansion and transformation initiatives impacting operational readiness, serves as senior liaison across regions, and oversees the full lifecycle of critical infrastructure and hardware assets—including install, maintenance strategy, spares, vendor performance, and investment governance—to optimize reliability, security, and scalability.

Key Responsibilities

24/7 Mission Critical Operations Leadership

Owns 100% uptime operations for a portfolio of very large/complex data center sites, ensuring consistent execution of shift coverage, operational handoffs, and standardized runbooks.
Establishes and governs the Mission Critical Operations (MCO) operating model: command structure, on-call rotations, escalation paths, and service-impacting event response.
Ensures operational readiness for high-severity incidents through drills/tabletops, incident commander training, and continuous improvement of response playbooks.

Performance Monitoring, Controls, and Reliability

Defines the enterprise strategy for real-time monitoring and operational health across the portfolio (BMS/EPMS/SCADA/telemetry), aligning KPIs to uptime, reliability, safety, and customer outcomes.
Drives operating rhythms for reviewing availability, MTTR/MTBF, alarm quality, repeat events, maintenance effectiveness, and risk posture.
Establishes standards for preventive and predictive maintenance, MOP/SOP/EOP quality, change control, and operational compliance.

Incident, Problem, and Crisis Management

Governs standards for event triage, incident command, escalation, stakeholder communications, and customer-impacting notifications.
Leads post-incident reviews for P1/P0 events, ensuring root cause analysis (RCA) quality, corrective/preventive actions (CAPA), and verified closure.
Operates as the executive escalation point for highly complex incidents and cross-regional reliability risks.

Capacity, Resiliency, and Site Readiness

Oversees evaluation of power, cooling, physical space, network/support infrastructure, and security capacity, ensuring readiness for load growth and peak conditions.
Ensures resiliency standards are met (redundancy, maintenance windows, failover testing, generator/UPS readiness, fuel strategy as applicable).
Directs operational risk assessments and ensures sites remain audit-ready and compliant with applicable standards and internal controls.

Automation and Operational Tooling

Drives adoption of automation for alarm correlation, workflow orchestration, remote operations, and predictive analytics to reduce human error and improve response times.
Standardizes data quality and instrumentation required for high-confidence operational decision-making.

Expansion, Launch, and Transformation (Operational Readiness Focus)

Leads operational support for expansions/new builds/site launches, ensuring Day-0/Day-1 readiness, staffing, training, spares, procedures, and turnover acceptance criteria.
Partners with engineering and construction to embed operability, maintainability, and safety into design and commissioning.

Asset Lifecycle, Vendors, and Investment Governance

Oversees lifecycle strategy for critical infrastructure and supporting hardware assets: installation, maintenance, spares, logistics, inventory, and decommissioning.
Establishes enterprise standards for vendor performance, SLAs, service quality, and compliance; drives corrective actions where performance gaps exist.
Approves and manages multi-million dollar investments in upgrades, capacity expansion, reliability improvements, and risk remediation.

Core Leadership Responsibilities (aligned to 24/7 operations)

Planning & Execution

Provides strategic oversight for mission-critical operational initiatives, ensuring priorities reflect reliability risk, customer impact, and compliance needs.

Collaboration & Partnership

Sets direction and builds strong partnerships with engineering, construction, security, network/IT, program management, and business stakeholders to ensure reliable 24/7 delivery.

Problem Solving

Serves as the escalation point for complex operational/technical issues; drives disciplined, data-driven resolution and prevention of recurrence.

Continuous Learning / Improvement

Champions operational excellence through training programs, certifications, drills, and a sustained improvement roadmap aligned to availability and risk reduction.

Performance and Development

Builds and develops a high-performing 24/7 operations organization, including shift leaders, incident commanders, and regional operations management.
This role supports a 24/7/365 environment and requires participation in, and management of, incidents and teams across all shifts.
Safety emphasis: holds explicit accountability for life safety and safe work practices (LOTO and energized work policies, as applicable).

Disclaimer:

Certain US customer or client-facing roles may be required to comply with applicable requirements, such as immunization and occupational health mandates.

Range and benefit information provided in this posting is specific to the stated locations only.

US: Hiring Range in USD from: $139,400 to $291,800 per annum. May be eligible for bonus, equity, and compensation deferral.

Oracle maintains broad salary ranges for its roles in order to account for variations in knowledge, skills, experience, market conditions and locations, as well as reflect Oracle's differing products, industries and lines of business. Candidates are typically placed into the range based on the preceding factors as well as internal peer equity.

Oracle US offers a comprehensive benefits package which includes the following:

Medical, dental, and vision insurance, including expert medical opinion
Short-term disability and long-term disability
Life insurance and AD&D
Supplemental life insurance (Employee/Spouse/Child)
Health care and dependent care Flexible Spending Accounts
Pre-tax commuter and parking benefits
401(k) Savings and Investment Plan with company match
Paid time off: Flexible Vacation is provided to all eligible employees assigned to a salaried (non-overtime eligible) position. Accrued Vacation is provided to all other employees eligible for vacation benefits. For employees working at least 35 hours per week, the vacation accrual rate is 13 days annually for the first three years of employment and 18 days annually for subsequent years of employment. Vacation accrual is prorated for employees working between 20 and 34 hours per week. Employees working fewer than 20 hours per week are not eligible for vacation.
11 paid holidays
Paid sick leave: 72 hours of paid sick leave upon date of hire. Refreshes each calendar year. Unused balance will carry over each year up to a maximum cap of 112 hours.
Paid parental leave
Adoption assistance
Employee Stock Purchase Plan
Financial planning and group legal
Voluntary benefits including auto, homeowner and pet insurance

The role will generally accept applications for at least three calendar days from the posting date or as long as the job remains posted.

Career Level - M4