Site Reliability Engineer II

Microsoft · Big Tech · Bengaluru, KA, IN · Site Reliability Engineering

This is a Site Reliability Engineer II role on Microsoft's Azure Data engineering team, which focuses on messaging and real-time analytics services such as Azure Service Bus, Event Hubs, Event Grid, and Fabric RTI Eventstreams. The role involves scaling and operating Fabric Event Stream as a globally distributed, highly reliable platform, with a primary focus on region build-out, deployment, and SRE practices. Responsibilities include onboarding new regions, driving deployment automation, ensuring service reliability, improving availability, monitoring, and incident response, and building observability and telemetry.

What you'd actually do

  1. Own the end-to-end readiness of Event Stream across Azure regions, including onboarding new regions, driving deployment automation, and ensuring consistent, secure, and compliant service rollout.
  2. Play a key role in advancing the team's reliability posture by improving availability, monitoring, and incident response across regions.
  3. Build strong observability, telemetry, and automated recovery mechanisms to meet high availability and SLA targets.
  4. Onboard new regions, drive deployment automation, and ensure consistent service configuration.
  5. Improve availability, resiliency, and incident response; own service health across regions.

Skills

Required

  • Master's Degree in Computer Science, Information Technology, or related field AND 3+ years of technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years of technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • Ability to meet Microsoft, customer, and/or government security screening requirements
  • Microsoft Cloud Background Check

Nice to have

  • Solid understanding of concurrency, scalability, and fault tolerance
  • Hands-on experience with cloud platforms (Azure preferred), including service deployment, region onboarding, or infrastructure automation
  • Experience with streaming or messaging systems (e.g., Azure Event Hubs, Kafka, Service Bus, or similar), including understanding of throughput, latency, and reliability trade-offs
  • Experience in automation and deployment pipelines, including CI/CD, safe rollout practices, and multi-region configuration management
  • Proven ability to debug complex production issues and drive fixes across distributed systems

What the JD emphasized

  • highly reliable platform
  • milliseconds latency
  • massive throughput
  • 99.99% service availability
  • petabytes of data per day
  • high availability and SLA targets