Principal Site Reliability Engineering Manager - CTJ - Secret (Cleared Environments)

Microsoft · Big Tech · Redmond, WA +1 · Site Reliability Engineering

Principal SRE Manager for the foundational cloud platform powering M365 Copilot and Exchange Online in highly regulated environments. Focus on operational excellence, reliability, security, and compliance through strong software engineering fundamentals and team leadership.

What you'd actually do

  1. Lead and develop a team of Site Reliability Engineer ICs, providing clear expectations, regular coaching, and career guidance across senior and principal levels. Foster a culture of accountability, learning, and inclusion.
  2. Own the operational health and reliability posture of Substrate services running in regulated environments, ensuring strong availability, incident readiness, and recovery practices.
  3. Drive change and influence across the org by establishing SLOs, SLIs, and operational metrics, using data and learning loops to continuously improve reliability, diagnosability, and customer experience.
  4. Lead effective incident management and post-incident reviews, emphasizing systemic fixes, automation, and long-term resilience rather than short-term remediation.
  5. Serve as an actively engaged on-call engineer (OCE) in the on-call rotation for Substrate services in regulated environments, providing hands-on leadership during incidents and driving high-quality post-incident reviews that result in durable engineering improvements.

Skills

Required

  • Site Reliability Engineering
  • Software Engineering
  • Cloud Platform Operations
  • Incident Management
  • SLOs/SLIs
  • Automation
  • Team Leadership
  • Security Clearance

Nice to have

  • AI-assisted techniques

What the JD emphasized

  • highly regulated environments
  • strong software engineering fundamentals
  • strong software engineering practices
  • durable, compliant, and auditable reliability outcomes