Manager, Reliability Operations

Expedia · Hospitality · Prague, Czech Republic

Manage an AI-focused reliability operations team responsible for monitoring, triage, and remediation across IT operations, integrating AI/ML solutions to improve incident response and system stability.

What you'd actually do

  1. Lead a 24/7 global reliability operations function that monitors, supports, and improves production systems, ensuring high availability, resilience, and rapid incident response across multiple services and domains.
  2. Own and mature incident management practices, including detection, triage, escalation, communication, and post‑incident review processes, driving reduction in mean time to detect (MTTD) and mean time to resolve (MTTR).
  3. Partner closely with engineering, SRE, and product teams to define and evolve operational standards, runbooks, and readiness criteria, including low‑level system design (LLD), API integration considerations, and data modeling that support reliable operations.
  4. Develop and manage observability strategies (monitoring, alerting, logging, and dashboards) to proactively identify reliability risks and drive data‑driven improvements to system stability and performance.
  5. Build, coach, and mentor a high‑performing reliability operations team, fostering a culture of continuous improvement, operational excellence, and accountability across multiple technical domains and platforms.
  6. Safely integrate and operate AI/ML‑enabled solutions that improve incident detection, noise reduction, capacity forecasting, and operational workflows, drawing on familiarity with AI‑driven systems, tools, or workflows and applying AI/ML concepts to real‑world products.

Skills

Required

  • Bachelor’s degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience in operating large‑scale, customer‑facing systems.
  • Substantial experience in reliability operations, SRE, production support, or related fields, including leading 24/7 operational teams and owning reliability for multiple services or a broad technical domain.
  • Proven track record implementing and operating incident management, on‑call, and observability practices (monitoring, alerting, logging, dashboards) for distributed systems, including collaboration with engineering on low‑level system design (LLD), API integration, and data modeling.
  • Demonstrated ability to use operational and performance data to drive decisions, prioritize reliability improvements, and manage trade‑offs between stability, velocity, and cost at scale.
  • Hands‑on familiarity with AI‑driven or automation‑focused operational tools (for example, intelligent alerting, anomaly detection, or automated remediation) and ability to ensure they are integrated and operated safely in production.
  • Experience with automation tools and at least one programming or scripting language (Python preferred).
  • Experience with monitoring and observability tools such as Datadog, Splunk, Catchpoint, PagerDuty, or similar platforms.
  • Strong incident response mindset, including the ability to analyze outages, identify root causes, and proactively recommend and implement automation-driven solutions to prevent recurrence.

Nice to have

  • Experience leading reliability operations for complex, high‑traffic, globally distributed systems, including coordination across multiple engineering and product teams and ownership of multi‑service or multi‑domain reliability outcomes.
  • Demonstrated success defining and evolving operational architectures and runbooks in partnership with engineering, including low‑level system design, API design for operability, and data models that support effective monitoring, alerting, and incident analysis.
  • Strong track record driving operational excellence: improving incident response processes, leading blameless post‑incident reviews, reducing recurring incidents, and implementing long‑term reliability improvements grounded in data.
  • Experience scaling AI‑ or ML‑enabled capabilities within reliability operations, such as intelligent incident triage, predictive capacity and reliability modeling, or AI‑assisted runbooks, with clear governance and safety controls.
  • Depth in using AI‑driven observability or AIOps platforms to correlate signals across logs, metrics, and traces, and to continuously refine alerting and automation strategies that improve reliability outcomes.

What the JD emphasized

  • AI-focused automation analysts
  • agentic AI responders
  • AI/ML-enabled solutions
  • AI-driven systems, tools, or workflows
  • AI/ML concepts
  • AI-driven or automation-focused operational tools
  • AI/ML-enabled capabilities
  • AI-assisted runbooks
  • AI-driven observability or AIOps platforms

Other signals

  • AI Resiliency Centre (ARC)
  • orchestrating human and agentic AI responders
  • integrate and operate AI/ML-enabled solutions
  • applying AI/ML concepts to real world products
  • AI/ML-enabled capabilities within reliability operations