Staff Site Reliability Engineer - Site Experience

Reddit · Consumer · Germany · Remote · Site Reliability

This Staff Site Reliability Engineer role at Reddit focuses on user-facing systems, with responsibility for their availability, latency, scalability, and operational excellence. The role involves leading reliability initiatives, architecting for scale, reducing operational risk, driving automation, and managing incidents for critical user experiences.

What you'd actually do

  1. Lead Reliability Engineering for User Experience
  2. Architect for Scale
  3. Reduce Operational Risk
  4. Drive Automation
  5. Incident Management

Skills

Required

  • Site Reliability Engineering
  • Infrastructure Engineering
  • distributed systems
  • networking
  • Linux systems
  • cloud-native architectures
  • highly available systems
  • Go
  • Python
  • observability systems
  • metrics
  • logging
  • tracing
  • alerting
  • SLOs
  • automation
  • incident management
  • performance optimization

Nice to have

  • Kubernetes
  • containers
  • cloud infrastructure
  • modern deployment platforms
  • Prometheus
  • Grafana
  • OpenTelemetry
  • Envoy
  • Kafka
  • ClickHouse
  • Cassandra
  • Redis
  • CDN optimization
  • edge reliability
  • traffic engineering
  • global infrastructure
  • open source software
  • technical communities
  • large-scale incident response
  • operational transformation initiatives

What the JD emphasized

  • operating large-scale distributed systems
  • high-traffic, user-facing production environments
  • highly available systems
  • troubleshoot complex issues across applications, infrastructure, networking, and services
  • internet-scale traffic volumes