Site Reliability Engineer - CTJ - Poly

Microsoft · Big Tech · Reston, VA +3 · Site Reliability Engineering

Site Reliability Engineer (SRE) role on the Azure Data Transfer (ADT) team, focused on the reliability, availability, performance, and efficiency of distributed systems serving highly regulated industries. Responsibilities include owning reliability architecture, defining SLIs/SLOs, implementing automation, improving observability, and leading incident response. The role requires experience with cloud reliability practices, secure-by-design principles, and compliance requirements.

What you'd actually do

  1. Owns reliability architecture and end-to-end service understanding (dependencies, failure modes, and customer journeys) for distributed systems at scale. Defines and improves service health via SLIs/SLOs, error budgets, and well-defined operational readiness criteria. Drives cross-team reliability reviews and recommends design changes, runbooks, and safe rollout/rollback strategies that improve availability, latency, performance, and efficiency while managing cost.
  2. Maintains deep, current expertise in cloud reliability practices and the evolving technology landscape. Drives adoption of new platform capabilities and operational patterns (e.g., progressive delivery, resilience testing, chaos engineering where appropriate). Mentors engineers through design reviews, incident walkthroughs, and knowledge sharing to raise the reliability bar across related services.
  3. Implements reliable, scalable, and high-performance changes using SRE practices (progressive delivery, feature flags where applicable, safe rollouts/rollbacks). Owns implementation and rollback plans, validates operational readiness, and reduces toil through automation, self-healing, and standardized playbooks.
  4. Leverages telemetry and production signals to identify reliability risks and recurring failure patterns, then ships configuration changes, code fixes, or automation to address root causes. Expands infrastructure-as-code and operational tooling so teams can manage platforms and services safely and repeatably through code and policy.
  5. Builds and improves observability (metrics, logs, traces, dashboards, alerts) and uses it to detect, diagnose, and prevent incidents. Defines actionable alerting, reduces noise, and ensures instrumentation supports SLO reporting and rapid troubleshooting. Develops automation to validate telemetry pipelines and to enable automated mitigation and safer incident response.

Skills

Required

  • Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience.
  • Experience with SRE principles (availability, latency, performance, efficiency, change management, incident response)
  • Experience with distributed systems at scale
  • Experience defining and improving service health via SLIs/SLOs and error budgets
  • Experience with observability (logs, metrics, traces)
  • Experience with automation and infrastructure-as-code
  • Experience with secure-by-design and compliance requirements
  • Ability to obtain and maintain a U.S. Government Top Secret Clearance with SCI and Polygraph

Nice to have

  • Experience with progressive delivery, resilience testing, chaos engineering
  • Experience with cloud reliability practices
  • Experience with data transmission across clouds

What the JD emphasized

  • highly regulated industries
  • strict security policy and assurance requirements
  • highly secured and regulated environments
  • Security Clearance Requirements
  • active U.S. Government Top Secret Clearance with access to Sensitive Compartmented Information (SCI) based on a Single Scope Background Investigation (SSBI) with Polygraph