Site Reliability Engineer

Microsoft · Big Tech · United States · Site Reliability Engineering

Site Reliability Engineer (SRE) role focused on Azure Specialized workloads, involving service architecture, datacenter networking, monitoring, security, and backend infrastructure. The role requires managing and monitoring services, developing automation, ensuring security and compliance, and staying current with reliability best practices. It involves on-call duties and incident response.

What you'd actually do

  1. Acts as a Designated Responsible Individual (DRI) working on call to monitor the service for degradation, downtime, or interruptions. Alerts stakeholders to the status and gains approval to restore the system/product/service for simple problems. Responds within the Service Level Agreement (SLA) timeframe. Escalates issues to appropriate owners.
  2. Contributes to efforts to collect, classify, and analyze data with little oversight on a range of metrics (e.g., health of the system, where bugs might be occurring). Contributes to the refinement of product features by escalating findings from analyses to inform decisions regarding the engineering of products.
  3. Contributes to the development of automation within production and deployment of a complex product feature. Runs code in simulated or other non-production environments, with little to no oversight, to confirm functionality and error-free runtime for products.
  4. Contributes to efforts to ensure the correct processes are followed to achieve a high degree of security, privacy, safety, and accessibility. Checks for visible evidence to demonstrate compliance for product areas. Develops and holds an understanding of the implications of onboarding new technologies following expectations of compliance at Microsoft.
  5. Remains current in skills by investing time and effort into staying abreast of current developments that will improve the availability, reliability, efficiency, observability, and performance of products while also driving consistency in monitoring and operations at scale.

Skills

Required

  • computer science
  • information technology
  • software engineering
  • network engineering
  • systems administration
  • physical infrastructure management
  • security screening

Nice to have

  • service architecture
  • datacenter networking
  • monitoring
  • security
  • backend infrastructure development
  • automation
  • compliance
  • observability

What the JD emphasized

  • AI infrastructure
  • cloud services
  • security
  • control and data plane enablement
  • monitoring
  • reliability
  • performance
  • security screening requirements