Principal Site Reliability Engineer
Principal Site Reliability Engineer at Microsoft, responsible for the reliability, performance, scalability, and availability of systems.
What you'd actually do
- Drive the adoption of reliability best practices across engineering teams.
- Design, implement, and manage scalable and highly available systems.
- Develop and maintain automation for deployment, monitoring, and incident response.
- Troubleshoot and resolve complex production issues while minimizing downtime.
- Mentor junior engineers and contribute to the team's technical growth.
Skills
Required
- Deep understanding of SRE principles and practices
- Experience with cloud platforms (Azure, AWS, GCP)
- Proficiency in at least one programming language (e.g., Python, Go, C#)
- Strong knowledge of containerization and orchestration technologies (Docker, Kubernetes)
- Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, Datadog)
Nice to have
- Experience with CI/CD pipelines
- Familiarity with infrastructure as code (Terraform, Ansible)
- Knowledge of database systems (SQL, NoSQL)
- Experience in a regulated environment