Principal Site Reliability Engineer

Microsoft · Big Tech · United States · Site Reliability Engineering

Principal Site Reliability Engineer at Microsoft, responsible for the reliability, performance, scalability, and availability of systems.

What you'd actually do

  1. Drive the adoption of reliability best practices across engineering teams.
  2. Design, implement, and manage scalable and highly available systems.
  3. Develop and maintain automation for deployment, monitoring, and incident response.
  4. Troubleshoot and resolve complex production issues, ensuring minimal downtime.
  5. Mentor junior engineers and contribute to the team's technical growth.

Skills

Required

  • Deep understanding of SRE principles and practices
  • Experience with cloud platforms (Azure, AWS, GCP)
  • Proficiency in at least one programming language (e.g., Python, Go, C#)
  • Strong knowledge of containerization and orchestration technologies (Docker, Kubernetes)
  • Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, Datadog)

Nice to have

  • Experience with CI/CD pipelines
  • Familiarity with infrastructure as code (Terraform, Ansible)
  • Knowledge of database systems (SQL, NoSQL)
  • Experience in a regulated environment