Senior Software Engineer - Live Site Reliability

Microsoft · Big Tech · Vancouver, BC +2 · Software Engineering

Senior Software Engineer role focused on live site reliability, operational excellence, and automation within Microsoft's Azure Data engineering team. Responsibilities include on-call support, incident response, designing automation systems for triage and incident lifecycle management, and improving the reliability of data integration services.

What you'd actually do

  1. Provide on-call support for customer-facing services, including monitoring, investigation, severity assessment, and coordination with engineering teams for incident resolution
  2. Design and develop automation systems to support incident triage, including correlation of alerts with deployments, feature changes, and known issues
  3. Build tools for incident lifecycle management, including summarization, reporting, and documentation to improve operational efficiency
  4. Develop and maintain classification and routing systems for incoming incidents using telemetry, service metadata, and historical patterns
  5. Analyze operational metrics such as time-to-triage and incident resolution effectiveness; identify trends and drive improvements through automation and process enhancements

Skills

Required

  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Hands-on experience managing live site operations, including log analysis, incident response, and telemetry-based diagnostics

Nice to have

  • Master's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python; OR Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python; OR equivalent experience.
  • Experience working with live site operations, including incident triage, monitoring, alerting systems, and production support for large-scale services
  • Experience building automation systems using AI/ML techniques or large language models (LLMs)
  • Experience with incident management, telemetry analysis, and operational monitoring systems
  • Experience with Microsoft Azure, Power BI, Microsoft Fabric, or related cloud services
  • Experience authoring troubleshooting guides (TSGs) or analyzing incident patterns
  • Understanding of service-level agreements (SLAs), escalation processes, and customer communication practices
  • Experience collaborating with globally distributed engineering teams
  • Experience participating in an on-call rotation supporting production services

What the JD emphasized

  • Hands-on experience managing live site operations, including log analysis, incident response, and telemetry-based diagnostics