Senior Site Reliability Engineer

Microsoft · Big Tech · Perth, WA +3 · Site Reliability Engineering

Senior SRE role focused on leading incident response for Microsoft M365 Substrate Core services, ensuring consistent and predictable handling of high‑severity incidents to minimize customer impact and accelerate learning. The role involves incident command, operational governance, and driving improvements in incident handling practices.

What you'd actually do

  1. Own execution quality for Substrate high‑severity incidents, ensuring clear command, decisive leadership, and forward momentum during high‑impact events.
  2. Act as the senior incident leader or sponsor for long‑running, high‑stakes, or cross‑service incidents, ensuring alignment on impact, risk, and recovery priorities.
  3. Partner closely with Incident Managers, Subject Matter Experts, and service leaders to ensure effective diagnosis, escalation, and mitigation when ownership is unclear or action is blocked.
  4. Ensure high‑quality post‑incident reviews and drive accountability for repair items that reduce recurrence and systemic risk.
  5. Ensure consistent application of severity and priority models, outage declaration criteria, and executive escalation paths.
  6. Communicate clearly and credibly with senior leadership during customer‑impacting events.

Skills

Required

  • Proven experience leading teams through high‑severity production incidents in large, distributed systems.
  • Solid understanding of incident management, reliability engineering, and live‑site operations at scale.
  • Ability to drive clarity, accountability, and results in ambiguous, time‑critical situations.

Nice to have

  • Experience building or scaling incident response programs at organizational or enterprise scope.
  • Background in SRE, production engineering, or platform reliability roles.
  • Track record of reducing customer impact through improved incident handling, tooling, or prevention.
  • Experience operating in follow‑the‑sun or globally distributed incident response models.

What the JD emphasized

  • high‑severity incidents
  • customer‑impacting
  • incident command
  • operational governance
  • live-site risk
  • incident response maturity