Site Reliability Engineer

Replit · Enterprise · Remote · Engineering

Replit is hiring a Site Reliability Engineer to ensure the reliability, scalability, and performance of its infrastructure. Responsibilities include designing and implementing observability solutions, driving automation and infrastructure as code, establishing SLOs and SLIs, managing incidents, and optimizing performance.

What you'd actually do

  1. Design and Implement Observability Solutions
  2. Drive Automation and Infrastructure as Code
  3. Establish SLOs and SLIs
  4. Incident Management and Response
  5. Performance Optimization

Skills

Required

  • Python
  • Go
  • distributed systems
  • Kubernetes
  • Terraform
  • Ansible
  • Pulumi
  • CI/CD
  • monitoring
  • alerting
  • logging
  • incident response
  • SLOs
  • SLIs

Nice to have

  • GCP
  • Prometheus
  • Grafana
  • Datadog

What the JD emphasized

  • 4-8 years of experience in Site Reliability Engineering or similar roles (DevOps, Systems Engineering, Infrastructure Engineering)
  • Strong programming skills in languages commonly used for automation (Python, Go, or similar)
  • Deep understanding of distributed systems
  • Experience with container orchestration platforms (Kubernetes) and cloud-native technologies
  • Proven track record of implementing and maintaining monitoring/observability solutions
  • Strong incident management skills with experience leading incident response
  • Experience with infrastructure as code and configuration management tools