Manager, Reliability Operations

Expedia · Hospitality · Prague, Czech Republic

Manager for an AI-focused reliability operations team responsible for monitoring, triage, and remediation of IT operations, integrating AI/ML solutions to improve incident response and system stability.

What you'd actually do

Lead a 24/7 global reliability operations function that monitors, supports, and improves production systems, ensuring high availability, resilience, and rapid incident response across multiple services and domains.
Own and mature incident management practices, including detection, triage, escalation, communication, and post‑incident review processes, driving reduction in mean time to detect (MTTD) and mean time to resolve (MTTR).
Partner closely with engineering, SRE, and product teams to define and evolve operational standards, runbooks, and readiness criteria, including low‑level system design (LLD), API integration considerations, and data modeling that support reliable operations.
Develop and manage observability strategies (monitoring, alerting, logging, and dashboards) to proactively identify reliability risks and drive data‑driven improvements to system stability and performance.
Build, coach, and mentor a high‑performing reliability operations team, fostering a culture of continuous improvement, operational excellence, and accountability across multiple technical domains and platforms.
Safely integrate and operate AI/ML‑enabled solutions that improve incident detection, noise reduction, capacity forecasting, and operational workflows, drawing on familiarity with AI‑driven systems, tools, or workflows and on experience applying AI/ML concepts to real-world products.

Skills

Required

Bachelor’s degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience in operating large‑scale, customer‑facing systems.
Substantial experience in reliability operations, SRE, production support, or related fields, including leading 24/7 operational teams and owning reliability for multiple services or a broad technical domain.
Proven track record implementing and operating incident management, on‑call, and observability practices (monitoring, alerting, logging, dashboards) for distributed systems, including collaboration with engineering on low‑level system design (LLD), API integration, and data modeling.
Demonstrated ability to use operational and performance data to drive decisions, prioritize reliability improvements, and manage trade‑offs between stability, velocity, and cost at scale.
Hands‑on familiarity with AI‑driven or automation‑focused operational tools (for example, intelligent alerting, anomaly detection, or automated remediation) and ability to ensure they are integrated and operated safely in production.
Experience with automation tools and at least one programming or scripting language (Python preferred).
Experience with monitoring and observability tools such as Datadog, Splunk, Catchpoint, PagerDuty, or similar platforms.
Strong incident response mindset, including the ability to analyze outages, identify root causes, and proactively recommend and implement automation-driven solutions to prevent recurrence.

Nice to have

Experience leading reliability operations for complex, high‑traffic, globally distributed systems, including coordination across multiple engineering and product teams and ownership of multi‑service or multi‑domain reliability outcomes.
Demonstrated success defining and evolving operational architectures and runbooks in partnership with engineering, including low‑level system design, API design for operability, and data models that support effective monitoring, alerting, and incident analysis.
Strong track record driving operational excellence: improving incident response processes, leading blameless post‑incident reviews, reducing recurring incidents, and implementing long‑term reliability improvements grounded in data.
Experience scaling AI‑ or ML‑enabled capabilities within reliability operations, such as intelligent incident triage, predictive capacity and reliability modeling, or AI‑assisted runbooks, with clear governance and safety controls.
Depth in using AI‑driven observability or AIOps platforms to correlate signals across logs, metrics, and traces, and to continuously refine alerting and automation strategies that improve reliability outcomes.

What the JD emphasized

AI-focused automation analysts
agentic AI responders
AI/ML-enabled solutions
AI-driven systems, tools, or workflows
AI/ML concepts
AI-driven or automation-focused operational tools
AI/ML-enabled capabilities
AI-assisted runbooks
AI-driven observability or AIOps platforms

Other signals

AI-focused automation analysts
AI Resiliency Centre (ARC)
orchestrating human and agentic AI responders
integrate and operate AI/ML-enabled solutions
AI-driven systems, tools, or workflows
applying AI/ML concepts to real world products
AI-driven or automation-focused operational tools
AI/ML-enabled capabilities within reliability operations
AI-assisted runbooks
AI-driven observability or AIOps platforms

Full job description

Expedia Group brands power global travel for everyone, everywhere. We design cutting-edge tech to make travel smoother and more memorable, and we create groundbreaking solutions for our partners. Our diverse, vibrant, and welcoming community is essential in driving our success.

Why Join Us?

To shape the future of travel, people must come first. Guided by our Values and Leadership Agreements, we foster an open culture where everyone belongs and differences are celebrated, and we know that when one of us wins, we all win.

We provide a full benefits package, including exciting travel perks, generous time-off, parental leave, a flexible work model (with some pretty cool offices), and career development resources, all to fuel our employees' passion for travel and ensure a rewarding career journey. We’re building a more open world. Join us.

Role Summary

A people leader who oversees a team of AI-focused automation analysts within Expedia Group’s AI Resiliency Centre (ARC), the central hub for global IT operations, providing always-on monitoring, triage, and remediation across eCommerce and corporate services. This manager builds part of the follow-the-sun capability across hubs, orchestrating human and agentic AI responders to reduce noise, cut Mean Time to Detect/Restore (MTTD/MTTR), and prevent customer-impacting incidents before they occur.

Accommodation requests

If you need assistance with any part of the application or recruiting process due to a disability, or other physical or mental health conditions, please reach out to our Recruiting Accommodations Team through the Accommodation Request.

We are proud to be named as a Best Place to Work on Glassdoor in 2024 and be recognized for award-winning culture by organizations like Forbes, TIME, Disability:IN, and others.

Expedia Group's family of brands includes: Brand Expedia®, Hotels.com®, Expedia® Partner Solutions, Vrbo®, trivago®, Orbitz®, Travelocity®, Hotwire®, Wotif®, ebookers®, CheapTickets®, Expedia Group™ Media Solutions, Expedia Local Expert®, CarRentals.com™, and Expedia Cruises™. © 2024 Expedia, Inc. All rights reserved. Trademarks and logos are the property of their respective owners. CST: 2029030-50

Employment opportunities and job offers at Expedia Group will always come from Expedia Group’s Talent Acquisition and hiring teams. Never provide sensitive, personal information to someone unless you’re confident who the recipient is. Expedia Group does not extend job offers via email or any other messaging tools to individuals with whom we have not made prior contact. Our email domain is @expediagroup.com. The official website to find and apply for job openings at Expedia Group is careers.expediagroup.com/jobs.

Expedia is committed to creating an inclusive work environment with a diverse workforce. All qualified applicants will receive consideration for employment without regard to race, religion, gender, sexual orientation, national origin, disability or age.