Principal Site Reliability Engineer

Microsoft · Big Tech · Perth, WA +3 · Site Reliability Engineering

This Principal SRE role focuses on leading initiatives for durable, high-quality handling of high-severity incidents across Microsoft M365 Substrate Core services. The role ensures consistent and predictable incident management, minimizes customer impact, and accelerates recovery and learning. It involves leadership in reliability, incident command, and operational governance, partnering with Incident Managers and Service Owners to set standards and drive the evolution of incident handling practices.

What you'd actually do

  1. Own execution quality for Substrate high‑severity incidents, ensuring clear command, decisive leadership, and forward momentum during high‑impact events.
  2. Act as the senior incident leader or sponsor for long‑running, high‑stakes, or cross‑service incidents, ensuring alignment on impact, risk, and recovery priorities.
  3. Partner closely with Incident Managers, Subject Matter Experts, and service leaders to ensure effective diagnosis, escalation, and mitigation when ownership is unclear or action is blocked.
  4. Ensure high‑quality post‑incident reviews and drive accountability for repair items that reduce recurrence and systemic risk. Ensure consistent application of severity and priority models, outage declaration criteria, and executive escalation paths.
  5. Coach and help develop a team of Site Reliability Engineers serving as incident responders.

Skills

Required

  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience.
  • Proven experience leading teams through high‑severity production incidents in large, distributed systems.
  • Strong understanding of incident management, reliability engineering, and live‑site operations at scale.
  • Ability to drive clarity, accountability, and results in ambiguous, time‑critical situations.

Nice to have

  • Doctorate Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration OR Master's Degree in Computer Science, Information Technology, or related field AND 8+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 12+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience.
  • 7+ years technical experience working with large-scale cloud or distributed systems.
  • Experience building or scaling incident response programs at organizational or enterprise scope.
  • Background in SRE, production engineering, or platform reliability roles.
  • Track record of reducing customer impact through improved incident handling, tooling, or prevention.
  • Experience operating in follow‑the‑sun or globally distributed incident response models.

What the JD emphasized

  • high severity incidents
  • incident command
  • operational governance
  • live-site risk
  • incident response maturity