Senior Site Reliability Engineer

Glean · Enterprise · Engineering

This role is for a Senior Site Reliability Engineer (SRE) at Glean, an enterprise AI company focused on knowledge work. The SRE will ensure the reliability, availability, and performance of cloud-based services and applications, working closely with engineering teams to build and maintain robust, scalable, and highly available cloud infrastructure. The role involves technical leadership, incident management, automation, performance optimization, security, monitoring, and consultation throughout the software development lifecycle. Although the company is AI-focused, the SRE role itself is primarily infrastructure and reliability engineering, not direct AI/ML model development.

What you'd actually do

  1. Technical Leadership and Mentorship
  2. Ensure High Availability
  3. Incident Management
  4. Automation and Tooling
  5. Performance Optimization

Skills

Required

  • Site Reliability Engineering
  • Cloud-based services and infrastructure management
  • Software development (multiple languages)
  • People/team management
  • Distributed systems design, analysis, and troubleshooting
  • Google Cloud Platform, AWS, or Azure
  • Docker
  • Kubernetes
  • Terraform
  • Networking principles
  • Security principles
  • SRE best practices
  • Security best practices
  • Monitoring and alerting tools

Nice to have

  • Technical Leadership
  • Mentorship
  • Incident Management
  • Automation and Tooling development
  • Performance Optimization
  • Security and Compliance collaboration

What the JD emphasized

  • ensure the reliability, availability, and performance of our cloud-based services and applications
  • building infrastructure to scale our operations
  • eliminating work through automation
  • manage the complex challenges of scale and fast growth
  • keep Glean applications up and running
  • ensure our customers have the best and most reliable experience possible
  • Play a key role in driving technical excellence and fostering a culture of reliability
  • setting best practices for incident management, performance optimization, and automation
  • Influence best practices, drive cross-team collaborations
  • shaping architectural decisions and ensuring the delivery of high-quality, reliable systems
  • Implement and maintain resilient cloud architectures
  • monitor system performance
  • proactively identify and resolve potential bottlenecks or points of failure
  • Participate in primary oncall rotation
  • Continuously optimize the on-call process for sustainability and efficiency
  • Develop and maintain automation scripts, tools, and processes to streamline system deployment, monitoring, and management tasks
  • efficiently scaling cloud operations
  • Optimize cloud infrastructure and applications for performance, scalability, and cost-effectiveness
  • Collaborate with security engineers to implement best practices and ensure compliance with security standards and policies
  • Design and configure advanced monitoring systems to gain insights into system behavior
  • set up alerts, and respond proactively to potential issues
  • Create and maintain comprehensive dashboards and playbooks for production on-call
  • Engage actively in the entire software development lifecycle
  • Participate in system design reviews and provide valuable SRE insights during launch reviews
  • influencing and enhancing system architecture
  • 8+ years of experience in a senior-level role within Site Reliability Engineering or similar role, particularly in managing cloud-based services and infrastructure
  • 5+ years of experience with software development in one or more programming languages
  • 2+ years of experience managing people or teams, leading projects, and designing, analyzing, and troubleshooting distributed systems running in Cloud
  • Strong knowledge of cloud platforms such as Google Cloud Platform, AWS, or Azure
  • Practical experience with containerization technologies, including Docker and Kubernetes
  • Familiarity with infrastructure as code tools like Terraform is essential
  • Solid understanding of networking, security principles, and best SRE and security practices
  • Proficiency in using monitoring and alerting tools to detect and respond to potential issues effectively