Principal Manager, Incident Management - Amer

Microsoft · Big Tech · Phoenix, AZ +4 · Service Engineering

This role is for a Principal Manager of Incident Management within Microsoft's Cloud Operations + Innovation (CO+I) division. The primary focus is leading end-to-end incident management and crisis response for Microsoft's global data centers, driving service reliability, and improving operational metrics such as Time to Detect (TTD) and Time to Mitigate (TTM). The role also involves defining and executing reliability engineering strategy, building cross-organizational partnerships, and leading high-performing teams. It requires participation in an on-call rotation and carries specific security screening requirements.

What you'd actually do

  1. Lead end-to-end incident management and crisis response at scale, orchestrating complex, multi-team mitigation efforts, driving rapid restoration, and ensuring clear, timely communication with stakeholders and leadership.
  2. Drive service reliability and operational excellence, holding teams accountable to SLOs, improving Time to Detect (TTD) and Time to Mitigate (TTM), and embedding best-in-class incident, problem management, and post-incident review practices.
  3. Define and execute reliability engineering strategy, advancing telemetry, alerting, automation, and predictive monitoring capabilities to proactively identify issues, reduce noise, and improve system resilience.
  4. Build and scale cross-organizational partnerships and capabilities, developing deep technical expertise, standardizing processes, and enabling consistent, high-quality incident response across services and regions.
  5. Lead and develop high-performing teams, fostering a culture of accountability, continuous improvement, and inclusion while coaching engineers and leaders to deliver measurable reliability and customer impact.

Skills

Required

  • Incident management
  • Crisis response
  • Service reliability
  • Operational excellence
  • Reliability engineering
  • Telemetry
  • Alerting
  • Automation
  • Predictive monitoring
  • Cross-organizational partnerships
  • Team leadership
  • People management
  • Critical environment experience
  • Network engineering
  • Service engineering
  • Systems engineering
  • Industrial controls

Nice to have

  • Data Center Operations
  • Mission critical facilities
  • Semiconductor environment experience
  • BAS, BMS, and EPMS
  • Large-scale cloud or distributed systems
  • Single-line diagrams
  • Fault tree analysis
  • Trade certification in electrical/mechanical/controls

What the JD emphasized

  • 6+ years technical experience in critical environment, network engineering, service engineering, systems engineering, or industrial controls
  • 5+ years people management experience