Principal Manager, Incident Management … at Microsoft

Skills

Required

Incident management
Crisis response
Service reliability
Operational excellence
Reliability engineering
Telemetry
Alerting
Automation
Predictive monitoring
Cross-organizational partnerships
Team leadership
People management
Critical environment experience
Network engineering
Service engineering
Systems engineering
Industrial controls

Nice to have

Data Center Operations
Mission critical facilities
Semiconductor environment experience
BAS, BMS and EPMS systems
Large-scale cloud or distributed systems
Single line diagrams
Fault tree analysis
Trade certification in electrical/mechanical/controls

Overview

Microsoft Cloud Infrastructure and Operations (CO+I) is the engine that powers Microsoft's cloud services. The group is responsible for designing, building, and operating Microsoft’s global datacenters; managing the programmatic delivery of our critical infrastructure design, equipment procurement, construction delivery, infrastructure innovation, demand planning, and capacity utilization of our unified infrastructure; and carrying out all operations needed to run the physical infrastructure. We focus on smart growth with an emphasis on automation, data-driven engineering, cost-effectiveness, and environmental sustainability. We deliver the core infrastructure and foundational technologies for Microsoft's 200+ online businesses including Azure, Office 365, Bing, Xbox Live, Skype, and OneDrive. Our portfolio is built and managed by a team of subject matter experts working 24x7x365 to support services for more than 1 billion customers and 20 million businesses in over 90 countries worldwide.

Within CO+I, the Data Center Incident Management Team (DCIM) is responsible for 24x7x365 incident management for Microsoft data centers worldwide. Within the DCIM Team, we are seeking a highly motivated and experienced Principal Manager, Incident Management - AMER to join our team. If you are a strategic thinker with a passion for driving business success, we encourage you to apply for this exciting opportunity. This role will require participation in an on-call rotation, including availability during evenings, weekends, and/or holidays to support business needs.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Responsibilities

Lead end-to-end incident management and crisis response at scale, orchestrating complex, multi-team mitigation efforts, driving rapid restoration, and ensuring clear, timely communication with stakeholders and leadership.
Drive service reliability and operational excellence, holding teams accountable to SLOs, improving Time to Detect (TTD) and Time to Mitigate (TTM), and embedding best-in-class incident, problem management, and post-incident review practices.
Define and execute reliability engineering strategy, advancing telemetry, alerting, automation, and predictive monitoring capabilities to proactively identify issues, reduce noise, and improve system resilience.
Build and scale cross-organizational partnerships and capabilities, developing deep technical expertise, standardizing processes, and enabling consistent, high-quality incident response across services and regions.
Lead and develop high-performing teams, fostering a culture of accountability, continuous improvement, and inclusion while coaching engineers and leaders to deliver measurable reliability and customer impact.
This role will require participation in an on-call rotation, including availability during evenings, weekends, and/or holidays to support business needs.
Embody our culture and values.

Qualifications

Required qualifications

Bachelor's Degree in Mechanical Engineering, Electrical Engineering, Information Technology, Facilities Management, Aerospace Engineering, or related field AND 6+ years technical experience in critical environment, network engineering, service engineering, systems engineering, or industrial controls OR equivalent experience.

Other Requirements:

The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:

Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Preferred Qualifications:

Master's Degree in Mechanical Engineering, Electrical Engineering, Information Technology, Facilities Management, Aerospace Engineering, or related field AND 8+ years technical experience in critical environment, network engineering, service engineering, systems engineering, or industrial controls OR Bachelor's Degree in Mechanical Engineering, Electrical Engineering, Information Technology, Facilities Management, Aerospace Engineering, or related field AND 12+ years technical experience in critical environment, network engineering, service engineering, systems engineering, or industrial controls OR equivalent experience.
Experience in data center operations, mission critical facilities, or semiconductor environments.
Working knowledge of BAS, BMS, and EPMS.
5+ years technical experience working with large-scale cloud or distributed systems.
Ability to read single-line diagrams and perform fault tree analysis from drawings.
Trade certification in electrical/mechanical/controls.
5+ years people management experience.


Service Engineering M5 - The typical base pay range for this role across the U.S. is USD $142,800.00 - $274,800.00 per year. A different range applies to specific work locations within the San Francisco Bay Area and the New York City metropolitan area; the base pay range for this role in those locations is USD $188,000.00 - $304,200.00 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay

This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.