Principal Site Reliability Engineer

Microsoft · Big Tech · United States · Site Reliability Engineering

The Principal Site Reliability Engineer at Microsoft is responsible for the reliability, performance, scalability, and availability of systems.

What you'd actually do

Drive the adoption of reliability best practices across engineering teams.
Design, implement, and manage scalable and highly available systems.
Develop and maintain automation for deployment, monitoring, and incident response.
Troubleshoot and resolve complex production issues to minimize downtime.
Mentor junior engineers and contribute to the team's technical growth.

Skills

Required

Deep understanding of SRE principles and practices
Experience with cloud platforms (Azure, AWS, GCP)
Proficiency in at least one programming language (e.g., Python, Go, C#)
Strong knowledge of containerization and orchestration technologies (Docker, Kubernetes)
Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, Datadog)

Nice to have

Experience with CI/CD pipelines
Familiarity with infrastructure as code (Terraform, Ansible)
Knowledge of database systems (SQL, NoSQL)
Experience in a regulated environment