Staff Site Reliability Engineer - Incident Management & Reliability (remote - Canada)

Confluent · Data AI · ON +1 · Remote · Engineering

Staff Site Reliability Engineer focused on incident management and reliability for Confluent's cloud platform. The role involves analyzing failure patterns, building automation, improving tooling, defining SLOs, and evolving incident response practices. It requires deep expertise in distributed systems, cloud environments (AWS, GCP, Azure), and incident management tooling.

What you'd actually do

  1. Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence
  2. Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
  3. Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
  4. Own standards, practices, and continuous improvement of incident response across engineering
  5. Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity

Skills

Required

  • SRE
  • incident management
  • reliability engineering
  • AWS
  • GCP
  • Azure
  • distributed systems
  • incident management tooling
  • observability
  • Kubernetes
  • container orchestration
  • CI/CD pipelines
  • release processes
  • written communication

Nice to have

  • Kafka
  • event streaming

What the JD emphasized

  • 10+ years of relevant experience in SRE, incident management, or reliability engineering
  • Experience navigating reliability/incident programs at organizations with 500+ engineers