Site Reliability Engineer II at Microsoft

What you'd actually do

Own the end-to-end readiness of Event Stream across Azure regions, including onboarding new regions, driving deployment automation, and ensuring consistent, secure, and compliant service rollout.

Play a key role in advancing our reliability posture by improving availability, monitoring, and incident response across regions.

Build strong observability, telemetry, and automated recovery mechanisms to meet high availability and SLA targets.

Onboard new regions, drive deployment automation, and ensure consistent service configuration

Improve availability, resiliency, and incident response; own service health across regions

Skills

Required

Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience.
Ability to meet Microsoft, customer and/or government security screening requirements
Microsoft Cloud Background Check

Nice to have

Solid understanding of concurrency, scalability, and fault tolerance
Hands-on experience with cloud platforms (Azure preferred), including service deployment, region onboarding, or infrastructure automation
Experience with streaming or messaging systems (e.g., Azure Event Hubs, Kafka, Service Bus, or similar), including understanding of throughput, latency, and reliability trade-offs
Experience in automation and deployment pipelines, including CI/CD, safe rollout practices, and multi-region configuration management
Proven ability to debug complex production issues and drive fixes across distributed systems

Overview

Microsoft is a company where passionate innovators come to collaborate, envision what can be, and take their careers further. This is a world of more possibilities, more innovation, more openness, and sky's-the-limit thinking in a cloud-enabled world.

Microsoft’s Azure Data engineering team is leading the transformation of analytics in the world of data with products like databases, data integration, big data analytics, messaging & real-time analytics, and business intelligence. The products include Microsoft Fabric, Azure SQL DB, Azure Cosmos DB, Azure PostgreSQL, Azure Data Factory, Azure Synapse Analytics, Azure Service Bus, Azure Event Grid, and Power BI. Our mission is to build the data platform for the age of AI, powering a new class of data-first applications, and driving a data culture.

Within Azure Data, the messaging and real-time analytics team provides comprehensive solutions and a robust platform that enables users to ingest high-granularity signals (real-time and observability) and complex data, converting them into a competitive advantage in real time for both end users and modern applications.

We’re Azure Messaging, a rapidly growing group of around 40 engineers, and we’re experts at moving hundreds of millions of small packets of information into and out of the cloud every second. We work on the cutting edge of distributed messaging systems, where millisecond latency, massive throughput, and 99.99% service availability aren’t tradeoffs: they’re all necessary. Our infrastructure needs to be resilient enough for financial transactions, rapid enough for streaming and gaming applications, and still nimble enough to move many petabytes of data per day.

We build the Azure Service Bus (http://aka.ms/servicebus), Azure Event Hubs (http://aka.ms/eventhub), Azure Event Grid (http://aka.ms/azureeventgrid), and Fabric RTI Eventstreams (http://aka.ms/eventstream) services, which help power Microsoft SaaS applications like Office 365, Xbox Live, Halo, Application Insights (and many, many more), as well as thousands of external Microsoft customers. Our group fosters a diverse, inclusive, and collaborative work culture that prioritizes people at all times.

We are looking for a Site Reliability Engineer II to help scale and operate Fabric Event Stream as a globally distributed, highly reliable platform, with a primary focus on region build-out, deployment, and site reliability engineering (SRE).

We do not just value differences or different perspectives. We seek them out and invite them in so we can tap into the collective power of everyone in the company. As a result, our customers are better served.

Responsibilities

In this role, you will own the end-to-end readiness of Event Stream across Azure regions, including onboarding new regions, driving deployment automation, and ensuring consistent, secure, and compliant service rollout. You will work closely with platform, infrastructure, and partner teams (e.g., Event Hubs, Kusto, Fabric platform) to deliver resilient, low-latency streaming experiences on a global scale.

You will also play a key role in advancing our reliability posture, improving availability, monitoring, and incident response across regions. This includes building strong observability, telemetry, and automated recovery mechanisms to meet high availability and SLA targets. This is a high-impact role at the intersection of distributed systems, cloud infrastructure, and operational excellence, where you will directly influence how Event Stream scales to support enterprise customers worldwide.

**Key Focus Areas**

Region Build-out & Deployment: Onboard new regions, drive deployment automation, and ensure consistent service configuration
Reliability & SRE: Improve availability, resiliency, and incident response; own service health across regions
Observability & Operations: Enhance telemetry, monitoring, alerting, and troubleshooting capabilities
Cross-team Collaboration: Partner with platform and infra teams to unblock dependencies and ensure smooth rollout
Production Excellence: Drive root-cause analysis, repair items, and continuous improvement on service reliability

Embody our culture and values

Qualifications

**Required/Minimum Qualifications**

Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration OR

Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience.

**Job Requirements: Other & Additional**

The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check:

This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

**Preferred/Additional Qualifications**

Solid understanding of concurrency, scalability, and fault tolerance
Hands-on experience with cloud platforms (Azure preferred), including service deployment, region onboarding, or infrastructure automation
Experience with streaming or messaging systems (e.g., Azure Event Hubs, Kafka, Service Bus, or similar), including understanding of throughput, latency, and reliability trade-offs
Experience in automation and deployment pipelines, including CI/CD, safe rollout practices, and multi-region configuration management
Proven ability to debug complex production issues and drive fixes across distributed components
Demonstrated ability to work across teams (platform, infra, partner services) to deliver end-to-end solutions and unblock dependencies
Strong problem-solving skills with the ability to navigate ambiguity, design clear solutions, and deliver incrementally at scale

This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about **requesting accommodations.**