Senior Site Reliability Engineering

Microsoft · Big Tech · Bengaluru, KA, IN +1 · Site Reliability Engineering

This role focuses on building and extending AI-driven agents for incident triage, classification, and automation within Microsoft's Azure Data engineering team. The goal is to replace manual operations with engineered solutions, improving the stability and efficiency of cloud-scale data infrastructure.

What you'd actually do

  1. Incident triage and first-line response: Provide on-call coverage for incoming incidents across CDI services. Perform initial investigation, severity assessment, and routing to owning engineering teams.
  2. Agentic triage system development: Build and extend AI-driven agents that ingest ICM alerts, correlate with recent deployments and feature flag rollouts, check known-issue databases, and produce initial assessments with suggested severity and owning team.
  3. TSG and known-issue matching: Develop automation that matches incoming incidents to relevant Troubleshooting Guides (TSGs) and known issues across Fabric and Power Platform, reducing investigation time and enabling faster resolution.
  4. Auto-routing and classification: Configure and extend ICM routing rules and build intelligent classification systems based on service tree, alert signatures, and historical patterns.
  5. Incident lifecycle automation: Build agents for incident summarization, customer communications drafting, postmortem generation, and reporting, replacing manual authoring with AI-assisted workflows that require human judgment only for high-severity incidents.

Skills

Required

  • Master's or Bachelor's degree in Computer Science, Information Technology, or a related field AND 7+ years of technical experience in software engineering, network engineering, or systems administration.
  • 4+ years of software engineering experience in site reliability, live-site operations, or incident management for cloud services.
  • Experience with incident management systems and workflows (ICM, PagerDuty, ServiceNow, or similar).
  • Experience with monitoring, alerting, and observability systems (Kusto, Geneva, Grafana, or similar).

Nice to have

  • Strong programming skills in one or more of: C#, PowerShell, Python, KQL/Kusto.
  • Ability to work in an on-call rotation across time zones in a geographically distributed team.
  • Strong communication skills to interface with engineers, leadership, support, and customers.

What the JD emphasized

  • AI-driven agents
  • incident management
  • automation
  • AI-assisted workflows

Other signals

  • AI-driven agents
  • incident management automation
  • AI-assisted workflows