Data Center Reliability Engineer

Oracle · Enterprise · Abilene, TX +1

Reliability Engineer focused on improving availability and reducing risk in mission-critical data center facility systems. The role uses data-driven analysis to identify failure patterns, drive corrective actions, and build tooling and metrics. Responsibilities include monitoring operational telemetry, defining reliability KPIs, supporting RCAs, partnering with operations and engineering teams on preventive strategies, and developing analytics tools using Python and SQL.

What you'd actually do

  1. Monitor and analyze operational telemetry, alarms, and performance trends to identify emerging risks and reliability degradation.
  2. Define and track reliability KPIs; deliver concise analysis and recommendations that drive operational and engineering decisions.
  3. Support and/or lead RCAs and corrective action tracking for recurring or high-impact issues, ensuring follow-through and verification.
  4. Partner with operations and engineering teams to improve preventive strategies, automation opportunities, and compliance execution.
  5. Develop and maintain analytics and reporting tools using Python, SQL, and/or DCIM/BMS/SCADA data sources.
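
As a rough illustration of the KPI and tooling work listed above, here is a minimal Python sketch that derives availability, MTBF, and MTTR for a single asset from outage records. The posting does not name specific KPIs or data schemas, so the function name `reliability_kpis`, the choice of these three metrics, and the input format (a list of outage start/end timestamps) are all assumptions for illustration only.

```python
from datetime import datetime

def reliability_kpis(outages, window_start, window_end):
    """Return (availability, mtbf_hours, mttr_hours) for one asset.

    Hypothetical input: `outages` is a list of (start, end) datetime
    pairs that do not overlap and fall inside the reporting window.
    """
    window_h = (window_end - window_start).total_seconds() / 3600.0
    downtime_h = sum((end - start).total_seconds() for start, end in outages) / 3600.0
    uptime_h = window_h - downtime_h
    count = len(outages)

    availability = uptime_h / window_h
    # MTBF and MTTR are undefined when no failure occurred in the window.
    mtbf_h = uptime_h / count if count else None
    mttr_h = downtime_h / count if count else None
    return availability, mtbf_h, mttr_h

# Example: a 30-day window with a 2-hour and a 4-hour outage.
start = datetime(2024, 1, 1)
end = datetime(2024, 1, 31)
outages = [
    (datetime(2024, 1, 5, 10), datetime(2024, 1, 5, 12)),
    (datetime(2024, 1, 20, 0), datetime(2024, 1, 20, 4)),
]
availability, mtbf_h, mttr_h = reliability_kpis(outages, start, end)
print(f"availability={availability:.4f} mtbf={mtbf_h}h mttr={mttr_h}h")
```

In practice the outage records would more likely come from a SQL query against DCIM/BMS/SCADA event data than from a hand-written list.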

Skills

Required

  • Data center infrastructure systems
  • Power distribution (UPS, generators, switchgear, PDUs)
  • Cooling systems (CRAC/CRAH, chillers, cooling towers)
  • Building Management Systems (BMS) / DCIM tools
  • Reliability or systems analysis experience in data centers or other uptime-critical environments
  • Python
  • SQL

Nice to have

  • engineering degree or equivalent applied experience
  • comfort with data and tooling
  • analytical and visualization skills
  • disciplined technical documentation
  • ability to influence outcomes through evidence, clarity, and structured thinking