Site Reliability Engineer

Supabase · Data AI · Remote · Engineering

Supabase is seeking a Site Reliability Engineer to join its Service Operations team. The role focuses on improving reliability across engineering teams by establishing practices, frameworks, and feedback loops. Responsibilities include defining SLIs/SLOs, owning the Operational Readiness Review process, strengthening the incident-to-improvement pipeline, acting as a reliability expert, identifying and automating toil, and helping teams design sustainable on-call practices. The ideal candidate has 7+ years of SRE experience, a software engineering mindset, and experience with SLOs/SLIs, incident response, and cloud infrastructure.

What you'd actually do

  1. Partner with service teams to define meaningful SLIs and SLOs grounded in customer experience, and build the error budget policies that turn them into engineering decisions
  2. Own and evolve the Operational Readiness Review (ORR) process — conducting reviews for new services and major changes across observability, alerting, runbooks, capacity, and graceful degradation
  3. Strengthen the incident-to-improvement pipeline: connecting postmortem findings to operational readiness gaps, identifying repeat failure patterns, and driving systemic fixes
  4. Act as the reliability expert teams pull in for architecture reviews, failure mode analysis, dependency mapping, and resilience design
  5. Identify and quantify operational toil across the org, and build or advocate for automation that eliminates it

Skills

Required

  • SRE
  • production engineering
  • reliability-focused roles
  • shaping SRE practices and driving adoption across engineering teams
  • software engineering mindset
  • writing code
  • building tools
  • defining and operationalizing SLOs/SLIs at scale
  • error budget policies
  • incident response
  • postmortem facilitation
  • turning incident learnings into systemic improvements
  • cloud infrastructure
  • clear and persuasive communication
  • influencing without authority
  • async or globally distributed teams

Nice to have

  • managed database platforms
  • Postgres
  • Pulumi
  • Terraform
  • CDK
  • Kubernetes-based platform operations
  • OpenTelemetry
  • VictoriaMetrics
  • Grafana
  • developer-facing reliability tooling

What the JD emphasized

  • shaping SRE practices and driving adoption across engineering teams