Senior Site Reliability Engineer

Microsoft · Big Tech · United States · Site Reliability Engineering

Senior Site Reliability Engineer role in Azure Specialized, focusing on AI infrastructure, cloud services, and security. Responsibilities include monitoring service health, contributing to automation, ensuring security and compliance, and maintaining live services. Requires experience supporting physical infrastructure and GPU and/or InfiniBand.

What you'd actually do

  1. Acts as a Designated Responsible Individual (DRI) working on call to monitor service for degradation, downtime, or interruptions. Alerts stakeholders to the status and gains approval to restore the system, product, or service for simple problems. Responds within the Service Level Agreement (SLA) timeframe. Escalates issues to appropriate owners.
  2. Contributes to efforts to collect, classify, and analyze data with little oversight on a range of metrics (e.g., health of the system, where bugs might be occurring). Contributes to the refinement of product features by escalating findings from analyses to inform decisions regarding the engineering of products.
  3. Contributes to the development of automation within production and deployment of a complex product feature. Runs code in simulated or other non-production environments, with little to no oversight, to confirm functionality and error-free runtime for products.
  4. Contributes to efforts to ensure the correct processes are followed to achieve a high degree of security, privacy, safety, and accessibility. Checks for visible evidence to demonstrate compliance for product areas. Develops and holds an understanding of the implications of onboarding new technologies following expectations of compliance at Microsoft.
  5. Remains current in skills by investing time and effort into staying abreast of current developments that will improve the availability, reliability, efficiency, observability, and performance of products while also driving consistency in monitoring and operations at scale.

Skills

Required

  • support of physical infrastructure
  • GPU and/or InfiniBand support
  • Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience.

Nice to have

  • 7+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering