Senior Site Reliability Engineer

ClickHouse · Data AI · Product & Engineering

The Senior Site Reliability Engineer builds and leads the processes that ensure the reliability, availability, scalability, and performance of ClickHouse's cloud infrastructure. This includes collaborating with engineering teams, establishing SLOs and SLAs, ensuring monitoring and alerting coverage, improving incident response, planning chaos initiatives, and managing on-call processes. The role applies software engineering expertise to develop platforms and tools that improve operational and engineering efficiency.

What you'd actually do

  1. Collaborate with engineering teams across ClickHouse to design and implement scalable, secure, and highly available systems.
  2. Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud.
  3. Ensure all infrastructure components in ClickHouse Cloud (including Dataplane, Control Plane, and ClickHouse Core) have monitoring and alerting in place so that incidents are detected and resolved promptly.
  4. Enhance and refine incident response processes and post-mortem analysis for any outages in ClickHouse Cloud, including working with the support team to communicate with impacted customers.
  5. Continuously improve the reliability and performance of ClickHouse services.

Skills

Required

  • Site Reliability Engineering
  • Go
  • Python
  • AWS
  • Azure
  • Google Cloud Platform
  • Kubernetes
  • Docker Swarm
  • Ansible
  • Terraform
  • Puppet
  • distributed databases
  • SQL
  • production debugging

Nice to have

  • ClickHouse
  • ClickHouse Cloud

What the JD emphasized

  • building and leading processes
  • ensure the reliability, availability, scalability, and performance
  • design and implement scalable, secure, highly available and fault-tolerant distributed systems
  • incident management and response
  • post-mortem analysis
  • continuous improvement
  • operational and engineering efficiencies
  • elastic, limitless scale, high-performance, serverless ClickHouse Cloud
  • design and implement scalable, secure, and highly available systems
  • Establish and manage service level objectives (SLOs) and service level agreements (SLAs)
  • Ensure all the infrastructure components ... have monitoring and alerting
  • Enhance and refine incident response processes and post-mortem analysis
  • Continuously improve the reliability and performance
  • Plan, enable, and drive Chaos initiatives
  • Manage on-call processes
  • At least 8 years of experience in Site Reliability Engineering or a related field
  • Strong knowledge of cloud computing platforms
  • Hands-on experience with container orchestration tools
  • Strong experience with automation and configuration management tools
  • strong problem-solver
  • solid production debugging skills
  • high level of responsibility, ownership, and accountability