Staff Site Reliability Engineer

Replit · Enterprise · Remote · Engineering

A Staff Site Reliability Engineer role at Replit focused on the reliability, scalability, and performance of the platform infrastructure. Responsibilities include architecting observability solutions, defining reliability standards, leading incident response, driving automation, optimizing Kubernetes performance on GCP, debugging distributed systems, and mentoring engineers. The role requires 8-10 years of SRE experience, strong programming skills in Python or Go, and a deep understanding of distributed systems, Kubernetes, and observability tools.

What you'd actually do

  1. Design, build, and lead the implementation of comprehensive monitoring, logging, and tracing solutions.
  2. Work with product and engineering teams to define, implement, and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
  3. Act as a senior leader during high-impact incidents, guiding the team to rapid resolution.
  4. Architect, build, and improve automation to eliminate toil and operational work.
  5. Collaborate with core infrastructure and product teams to performance-tune and optimize our large-scale cloud deployments, with a deep focus on Kubernetes, Docker, and GCP.

Skills

Required

  • 8-10 years of experience in Site Reliability Engineering or similar roles
  • Strong programming skills in languages like Python or Go
  • Deep understanding of distributed systems
  • Deep experience with container orchestration platforms, specifically Kubernetes, and cloud-native technologies
  • Proven track record of designing, implementing, and maintaining sophisticated monitoring and observability solutions
  • Strong incident management skills
  • Experience with infrastructure as code (e.g., Terraform, Pulumi)
  • Excellent written and verbal communication skills
  • Strong interpersonal skills
  • A willingness to dive into understanding, debugging, and improving any layer of the stack

Nice to have

  • Deep experience with Google Cloud Platform (GCP) services and tools
  • Expert-level knowledge of modern observability platforms (e.g., Prometheus, Grafana, Datadog, OpenTelemetry)
  • Experience designing and building reliable systems capable of handling high throughput and low latency
  • Significant experience with Go and Terraform
  • Familiarity with working in rapid-growth startup environments
  • Experience writing company-facing blog posts and training materials

What the JD emphasized

  • reliability
  • scalability
  • performance
  • automation
  • observability
  • incident response
  • Kubernetes
  • GCP
  • Python
  • Go
  • distributed systems