Senior Site Reliability Engineer - Remote

ClickHouse · Data AI · Canada +1 · Engineering

The Senior Site Reliability Engineer is responsible for building and leading the processes that ensure the reliability, availability, scalability, and performance of ClickHouse's cloud infrastructure. This involves collaborating with engineering teams, establishing SLOs and SLAs, ensuring monitoring and alerting coverage, enhancing incident response, and driving chaos engineering initiatives. The role draws on software engineering expertise to develop platforms and tools that improve operational efficiency.

What you'd actually do

  1. Collaborate with engineering teams across ClickHouse to design and implement scalable, secure, and highly available systems.
  2. Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud.
  3. Ensure every infrastructure component in ClickHouse Cloud (including Data Plane, Control Plane, ClickHouse Core, etc.) has monitoring and alerting in place for timely detection and resolution of incidents.
  4. Enhance and refine incident response processes and post-mortem analysis for outages in ClickHouse Cloud, including working with the support team to communicate with impacted customers.
  5. Continuously improve the reliability and performance of our ClickHouse services.

Skills

Required

  • Go
  • Python
  • AWS
  • Azure
  • Google Cloud Platform
  • Kubernetes
  • Docker Swarm
  • Ansible
  • Terraform
  • Puppet
  • Production debugging

Nice to have

  • ClickHouse
  • distributed databases
  • SQL

What the JD emphasized

  • At least 8 years of experience in Site Reliability Engineering or a related field.