Site Reliability Engineer - CTJ - Poly

Microsoft · Big Tech · Redmond, WA +2 · Site Reliability Engineering

Site Reliability Engineer for Azure Local services in sovereign clouds, focusing on dependability, customer support, and operational excellence. Responsibilities include maintaining service reliability, contributing to design and development by analyzing telemetry and suggesting improvements, and driving operational excellence through automation and AI/ML insights for performance and resource optimization. The role involves incident response, developing alerts, and maintaining telemetry pipelines.

What you'd actually do

  1. Support customer deployments and use of Azure Local and Azure Local disconnected operations.
  2. Maintain Azure service reliability, including deployment, availability, security, performance, and customer satisfaction, for sovereign environments.
  3. Leverage technical expertise in cloud technologies and specific products, along with objective insights from production telemetry analysis, to suggest changes or add-ons to product features or automation that improve the availability, security, quality, observability, reliability, efficiency, and performance of the product components or features your team supports.
  4. Leverage technical expertise and telemetry analysis, alongside advanced artificial intelligence (AI) and machine learning (ML) algorithms, across a range of components and features to identify patterns and opportunities to implement configuration and data changes for one or more platforms, systems, or products in production using code, tooling, and automation.
  5. Independently write code or scripts that automate scalable operations processes (e.g., monitoring, alerting, deploying products and updates) across components and features of products operating at scale.

Skills

Required

  • site reliability engineering
  • cloud technologies
  • production telemetry analysis
  • automation
  • incident response
  • monitoring and alerting
  • performance optimization
  • resource management
  • scripting
  • code reviews
  • design reviews

Nice to have

  • AI/ML algorithms
  • sovereign cloud solutions