Director, Data Center Reliability Engineering

Oracle · Enterprise · Nashville, TN +1

The Director of Data Center Reliability Engineering at Oracle leads reliability engineering and analytics teams, standardizes methodologies such as failure mode and effects analysis (FMEA) and root cause analysis (RCA), oversees monitoring and automation tools, defines and tracks KPIs, and develops engineers. The role requires senior experience in reliability or maintenance engineering in uptime-critical environments, along with a strong background in analytics and reliability frameworks. The work has global impact across hyperscale cloud infrastructure, with a focus on operational excellence, process rigor, and continuous improvement.

What you'd actually do

  1. Lead reliability engineering and analytics teams across multiple sites.
  2. Standardize and enforce FMEA, RCA, and continuous improvement methodologies.
  3. Oversee deployment of monitoring, analytics, and automation tools supporting reliability programs.
  4. Define, track, and report reliability KPIs to executive and global operations leadership.
  5. Ensure corrective actions are implemented, verified, and sustained.

Skills

Required

  • technical leadership
  • stakeholder influence
  • reliability engineering
  • maintenance engineering
  • uptime-critical environments
  • analytics
  • RCA rigor
  • reliability frameworks

Nice to have

  • translating analysis into executive-level decisions

What the JD emphasized

  • reliability engineering
  • FMEA
  • RCA
  • continuous improvement
  • monitoring
  • analytics
  • automation tools
  • KPIs
  • data-driven problem solving
  • uptime-critical environments
  • hyperscale cloud infrastructure
  • operational excellence
  • process rigor
  • performance monitoring
  • capacity analysis
  • issue management
  • incident management
  • crisis management
  • root cause analysis
  • data center expansion
  • installations
  • maintenance
  • component replacements
  • upgrades
  • proactive maintenance
  • lifecycle management
  • efficiency
  • stability
  • planning
  • execution
  • collaboration
  • partnership
  • problem solving
  • continuous learning