Senior Software Engineer - Live Site Re… at Microsoft

Overview

Microsoft is a place where passionate innovators collaborate to solve complex challenges and create technology that empowers people and organizations around the world.

The Azure Data engineering team is responsible for building and operating Microsoft’s data platform, including services such as Microsoft Fabric, Azure SQL Database, Azure Cosmos DB, Azure Data Factory, Azure Synapse Analytics, Azure Event Grid, and Power BI. These services enable customers to ingest, process, and analyze data at scale.

The Data Integration team builds capabilities that enable customers to move, transform, and prepare data efficiently across systems. This role focuses on live site reliability, operational excellence, and automation within the Customer Data Integration (CDI) organization.

We are hiring a Senior Software Engineer to design and build systems that improve incident response, automate operational workflows, and enhance the reliability of data integration services used by millions of customers.

Responsibilities

Provide on-call support for customer-facing services, including monitoring, investigation, severity assessment, and coordination with engineering teams for incident resolution
Design and develop automation systems to support incident triage, including correlation of alerts with deployments, feature changes, and known issues
Build tools for incident lifecycle management, including summarization, reporting, and documentation to improve operational efficiency
Develop and maintain classification and routing systems for incoming incidents using telemetry, service metadata, and historical patterns
Analyze operational metrics such as time-to-triage and incident resolution effectiveness; identify trends and drive improvements through automation and process enhancements
Partner with cross-functional engineering teams to improve reliability, reduce operational overhead, and enhance service quality
Contribute to the design, development, and improvement of distributed systems and cloud services as part of the broader CDI engineering scope
Demonstrate Microsoft’s culture and values in day-to-day work and collaboration

Qualifications

Required/minimum qualifications

Bachelor's Degree in Computer Science or a related technical field AND 4+ years of technical engineering experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python; OR equivalent experience.
Hands-on experience managing live site operations, including log analysis, incident response, and telemetry-based diagnostics

Additional or preferred qualifications

Master's Degree in Computer Science or a related technical field AND 6+ years of technical engineering experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python; OR Bachelor's Degree in Computer Science or a related technical field AND 8+ years of such experience; OR equivalent experience.
Experience working with Live Site operations, including incident triage, monitoring, alerting systems, and production support for large-scale services
Experience building automation systems using AI/ML techniques or large language models (LLMs)
Experience with incident management, telemetry analysis, and operational monitoring systems
Experience with Microsoft Azure, Power BI, Microsoft Fabric, or related cloud services
Experience authoring troubleshooting guides (TSGs) or analyzing incident patterns
Understanding of service-level agreements (SLAs), escalation processes, and customer communication practices
Experience collaborating with globally distributed engineering teams
Experience participating in an on-call rotation supporting production services

Software Engineering IC4 - The typical base pay range for this role across Canada is CAD $114,400.00 - CAD $203,900.00 per year.

Find additional pay information here: https://careers.microsoft.com/v2/global/en/canada-pay-information.html

This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about **requesting accommodations.**