Sr. Service Engineer - CTJ - Poly

Microsoft · Big Tech · Redmond, WA +3 · Service Engineering

This role focuses on operating and improving Microsoft 365 services within sovereign cloud environments, emphasizing availability, compliance, and security. Responsibilities include incident response, monitoring telemetry, implementing scalable solutions, and collaborating across teams to ensure service health and customer satisfaction. The role supports the delivery of M365 services, including Copilot, in highly regulated environments.

What you'd actually do

  1. Responds to incidents during regular on-call rotations, including complex incidents with major customer or business impact: identifies the level of impact, troubleshoots, contributes to difficult decisions based on business impact, deploys appropriate fixes to resolve root cause(s), and implements automations to prevent recurring incidents. Coordinates the resources required for incident resolution, which may include product teams, owners, leadership, other engineering teams, and/or subject matter experts. Escalates resolution of highly complex, ambiguous, and impactful incidents as needed. Contributes to postmortems and shares details of incidents and their resolution through postmortem reports and regular review meetings. Provides expert incident response assistance to other Service Engineers as needed, and develops incident response and resolution guidance.
  2. Creates, monitors, and takes action on telemetry data and influences telemetry analytics to better identify patterns that reveal errors and unexpected problems that are affecting the system's availability, reliability, performance, and/or efficiency. Develops scripts and/or automation and leverages an understanding of solutions to define, develop, measure, track, change, and improve the quality of telemetry pipelines that support automated monitoring and incident response. Identifies and develops telemetry collaborations that result in better-together services.
  3. Independently implements reliable, scalable, and high-performance solutions across teams. Contributes to design documents. Owns implementation and rollback plans. Maintains the quality checklist and related documentation. Quantifies and ensures the health and compliance of a service according to Engineering and industry standards.
  4. Leverages advanced technical expertise, judgment, and decision making to coordinate multiple work streams and resources in crisis situations, driving a mitigation plan to resolve, reduce, or mitigate the impact of a crisis by engaging necessary teams and escalating to appropriate stakeholders. Independently conducts root cause analyses and participates in post-incident reviews of incidents and crises for the purpose of leading continuous improvement. Applies diagnostic expertise. Provides guidance to other engineers working to mitigate and resolve issues. Communicates customer impact and other relevant information to key stakeholders, leadership, and customers. Develops and drives projects and programs to improve crisis response by creating standard practices (e.g., processes, standard operating procedures) for consistent response across engineering teams. Fosters increased service stability. Reduces future noise by participating in optimization of telemetry and alarming.
  5. Collaborates within and across teams (e.g., within Service Engineering, across a service) by proactively and syst

Skills

Required

  • Distributed systems
  • Scalable services
  • Automation
  • Incident response
  • Telemetry monitoring
  • Troubleshooting
  • Root cause analysis
  • Service operations
  • Cloud environments
  • Security
  • Compliance

Nice to have

  • M365 stack
  • Copilot
  • Exchange Online Protection
  • Purview Platform
  • Microsoft Defender for Office 365
  • Azure Silver
  • Microsoft Sovereign Clouds

What the JD emphasized

  • availability and compliance are essential
  • highest standards
  • strictest regulatory and compliance requirements
  • highest standards of security, compliance, and operational excellence