Senior Site Reliability Engineering at Microsoft

Overview

Microsoft is a company where passionate innovators come to collaborate, envision what can be, and take their careers further. This is a world of more possibilities, more innovation, more openness, and sky-is-the-limit thinking in a cloud-enabled world.

Microsoft’s Azure Data engineering team is leading the transformation of analytics in the world of data with products like databases, data integration, big data analytics, messaging & real-time analytics, and business intelligence. The products in our portfolio include Microsoft Fabric, Azure SQL DB, Azure Cosmos DB, Azure PostgreSQL, Azure Data Factory, Azure Synapse Analytics, Azure Service Bus, Azure Event Grid, and Power BI. Our mission is to build the data platform for the age of AI, powering a new class of data-first applications and driving a data culture.

Within Azure Data, the data integration team builds data gravity on the Microsoft Cloud. Massive volumes of data are generated – not just from transactional systems of record, but also from the world around us. Our data integration products, Azure Data Factory and Power Query, make it easy for customers to bring in, clean, shape, and join data to extract intelligence.

Power Query powers data connectivity and transformation across Microsoft's data platform, including Fabric, Dataflows, Power BI, Excel, and more. The CDI organization owns Live site operations, incident management, and quality engineering across this stack. You will work alongside engineers building cloud-scale data infrastructure used by millions of customers worldwide.

You are joining a team that is replacing a primarily manual operation with an automated, engineered solution. The incident management function is critical to maintaining the stability of Live site services.

As your automation matures and the operational load decreases, you will have the opportunity to expand your scope into CDI's broader engineering products and services by contributing directly to the systems you've been keeping reliable.

We do not just value differences or different perspectives. We seek them out and invite them in so we can tap into the collective power of everyone in the company. As a result, our customers are better served.

Responsibilities

Incident triage and first-line response: Provide on-call coverage for incoming incidents across CDI services. Perform initial investigation, severity assessment, and routing to owning engineering teams.
Agentic triage system development: Build and extend AI-driven agents that ingest ICM alerts, correlate with recent deployments and feature flag rollouts, check known-issue databases, and produce initial assessments with suggested severity and owning team.
TSG and known-issue matching: Develop automation that matches incoming incidents to relevant Troubleshooting Guides (TSGs) and known issues across Fabric and Power Platform, reducing investigation time and enabling faster resolution.
Auto-routing and classification: Configure and extend ICM routing rules and build intelligent classification systems based on service tree, alert signatures, and historical patterns.
Incident lifecycle automation: Build agents for incident summarization, customer communications drafting, postmortem generation, and reporting, replacing manual authoring with AI-assisted workflows requiring human judgment only for high-severity incidents.
Embody our culture and values

Qualifications

Required/Minimum Qualifications

Master's or Bachelor's Degree in Computer Science, Information Technology, or a related field AND 7+ years of technical experience in software engineering, network engineering, or systems administration.
4+ years of software engineering experience in site reliability, Live site operations, or incident management for cloud services.
Experience with incident management systems and workflows (ICM, PagerDuty, ServiceNow, or similar).
Experience with monitoring, alerting, and observability systems (Kusto, Geneva, Grafana, or similar).

Preferred/Additional Qualifications

Strong programming skills in one or more of: C#, PowerShell, Python, KQL/Kusto.
Ability to work in an on-call rotation across time zones in a geographically distributed team.
Strong communication skills to interface with engineers, leadership, support, and customers.

Job Requirements: Other & Additional

Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening:

Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Equal Opportunity Employer (EOE)

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or reasonable accommodation due to a disability during the application or the recruiting process, please send a request via the Accommodation request form (Accessibility | Microsoft Careers).

Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.

#azdat #azuredata #dataintegration #livesite #powerquery

This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about **requesting accommodations.**