Senior Site Reliability Engineer

Glean · Enterprise · Engineering

Senior Site Reliability Engineer role at an AI-powered knowledge management platform, focused on the reliability, availability, and performance of cloud-based services and applications. The role involves managing the challenges of scale and fast growth in a hybrid cloud environment, eliminating manual work through automation, and maintaining robust, scalable, and highly available cloud infrastructure.

What you'd actually do

  1. Play a key role in driving technical excellence and fostering a culture of reliability across engineering teams.
  2. Implement and maintain resilient cloud architectures, monitor system performance, and proactively identify and resolve potential bottlenecks or points of failure.
  3. Participate in the primary on-call rotation; cultivate technical curiosity, a growth mindset, and a blameless postmortem culture within the team.
  4. Develop and maintain automation scripts, tools, and processes to streamline system deployment, monitoring, and management tasks.
  5. Optimize cloud infrastructure and applications for performance, scalability, and cost-effectiveness.

Skills

Required

  • Site Reliability Engineering
  • cloud-based services and infrastructure management
  • software development
  • Google Cloud Platform, AWS, or Azure
  • Docker
  • Kubernetes
  • Terraform
  • networking
  • security principles
  • SRE practices
  • monitoring and alerting tools

Nice to have

  • technical leadership and mentorship
  • incident management
  • performance optimization
  • automation
  • security and compliance collaboration
  • system design reviews
  • launch reviews

What the JD emphasized

  • 8+ years of experience in a senior-level Site Reliability Engineering or similar role, particularly in managing cloud-based services and infrastructure.
  • 5+ years of experience with software development in one or more programming languages.
  • 2+ years of experience managing people or teams, leading projects, and designing, analyzing, and troubleshooting distributed systems running in the cloud.