Senior Site Reliability Engineer

Microsoft · Big Tech · Redmond, WA +1 · Site Reliability Engineering

Senior Site Reliability Engineer on the Microsoft Teams Core Services Infrastructure and Security team, focusing on building and operating mission-critical distributed systems for real-time communication. Responsibilities include network security, gateway infrastructure, routing, load balancing, DNS, CDN, monitoring, and observability. The role involves hands-on implementation, design discussions, incident response, and continuous improvement through automation and, potentially, AI-enabled tooling.

What you'd actually do

  1. Contribute to the design, implementation, and operation of secure, reliable network and infrastructure services supporting Microsoft Teams’ microservices environment.
  2. Improve reliability by developing and refining monitoring, alerting, dashboards, and automated recovery mechanisms across critical control‑plane and data‑plane systems.
  3. Serve as a Designated Responsible Individual (DRI) on a rotational basis: triaging incidents, driving mitigations, documenting learnings, and helping improve live‑site processes.
  4. Work closely with partner engineering teams (security, networking, microservices, compliance, governance) to deliver integrated improvements across shared infrastructure layers.
  5. Identify opportunities for automation using scripts, pipelines, policy‑driven guardrails, or AI‑enabled tooling to reduce manual toil and increase engineering productivity.

Skills

Required

  • Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience.

Nice to have

  • Experience with core networking concepts, including TCP/IP fundamentals, routing, load balancing, CDN, ACL/firewalling, TLS, and certificate lifecycle management.
  • Ability to diagnose and remediate performance or availability issues using logs, metrics, traces, and standard network troubleshooting tools.
  • Hands‑on experience operating services in a cloud environment (Azure or equivalent).
  • 3+ years technical experience working with large-scale cloud or distributed systems.
  • Experience with network security, cloud security controls, identity‑driven security policies, and certificate management at scale.
  • Familiarity with large‑scale cloud infrastructure (IaaS), microservices patterns, API gateways, and global routing architectures.
  • Experience leveraging automation frameworks, scripting, GitOps, Infrastructure‑as‑Code, or AI‑driven tooling to improve reliability and reduce operational load.
  • Demonstrated ability to partner across teams, influence without authority, and contribute to multi‑team initiatives or modernization efforts.

What the JD emphasized

  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role.