Site Reliability Engineer - CTJ - Poly

Microsoft · Big Tech · Reston, VA +3 · Site Reliability Engineering

Site Reliability Engineer for M365 sovereign cloud operations, focusing on building tools and systems for faster, smarter, and more reliable service delivery and maintenance. The role involves writing code for operational challenges, internal platforms, automation, and agentic workflows, while ensuring security, compliance, and operational excellence for government and sovereign cloud customers.

What you'd actually do

  1. Creates and implements code for a product, service, or feature, reusing code as applicable with minimal supervision. Writes and learns to create code that is extensible and maintainable. Considers diagnosability, reliability, and maintainability with few defects, and understands when the code is ready to be shared and delivered. Applies coding patterns and best practices to write code (e.g., leveraging state-of-the-art generative artificial intelligence [GenAI], approaches to source code organization, naming conventions).
  2. Acts as a designated responsible individual (DRI), working on-call to monitor a system/product feature/service for degradation, downtime, or interruptions. Alerts stakeholders as to the status and gains approval to restore system/product/service for simple problems. Responds within service level agreement (SLA) timeframe. Escalates issues to appropriate owners.
  3. Maintains operations of live site service on a rotational, on-call basis, following security best practices and using the minimum required permissions when responding quickly to mitigate issues that arise. Identifies solutions and mitigations to simple issues, and to complex issues when applicable, that impact performance or functionality of live site services, and escalates appropriately. With minimal supervision, improves troubleshooting guides (TSGs), wikis, tests, and telemetry to make on-call better, and recommends user-facing support documentation and additional test coverage to reduce likelihood of future user-initiated incidents.
  4. Contributes to identifying dependencies, and incorporates them into the development of design documents for a product area with little oversight. Helps to actively identify other teams and technologies to leverage, how they interact, and where their own system or team can support others. Understands downstream interactions between systems.
  5. Contributes to the identification of requirements for, and development of, automation within production and deployment of a complex product feature, targeting zero-touch deployment when possible. Runs code in simulated or other non-production environments to confirm functionality and error-free runtime for products with little to no oversight.

Skills

Required

  • Distributed systems
  • Scalable services
  • Coding
  • Automation
  • On-call support
  • Troubleshooting
  • System design
  • Security best practices

Nice to have

  • Generative artificial intelligence (GenAI)