Senior Site Reliability Engineer - Remote

ClickHouse · Data AI · Canada +1 · Engineering

ClickHouse is seeking a Senior Site Reliability Engineer to join their central SRE team. The role focuses on building and leading processes to ensure the reliability, availability, scalability, and performance of their cloud infrastructure, which supports AI workloads. Responsibilities include collaborating with engineering teams, establishing SLOs/SLAs, ensuring monitoring and alerting, enhancing incident response, and driving chaos initiatives. The ideal candidate has a strong background in SRE, cloud platforms, container orchestration, and automation tools, with experience in distributed databases being a plus.

What you'd actually do

  1. Collaborate with engineering teams across ClickHouse to design and implement scalable, secure, and highly available systems.
  2. Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud.
  3. Ensure all infrastructure components in ClickHouse Cloud (including Data Plane, Control Plane, ClickHouse Core, etc.) have monitoring and alerting in place so that incidents are detected and resolved promptly.
  4. Enhance and refine incident response processes and post-mortem analysis for any outages in ClickHouse Cloud, including working with the support team to communicate with impacted customers.
  5. Continuously improve the reliability and performance of our ClickHouse services.

Skills

Required

  • Go and/or Python
  • AWS, Azure, or Google Cloud Platform
  • Kubernetes or Docker Swarm
  • Ansible, Terraform, or Puppet
  • Production debugging skills

Nice to have

  • ClickHouse
  • distributed databases
  • SQL

What the JD emphasized

  • At least 8 years of experience in Site Reliability Engineering or a related field.