Staff Site Reliability Engineer - Site Experience

Reddit · Consumer · Germany · Remote · Site Reliability

Staff Site Reliability Engineer role focused on user-facing systems at internet scale, responsible for availability, latency, scalability, and operational excellence. The role involves partnering with product and infrastructure teams, leading reliability initiatives, and influencing engineering standards.

What you'd actually do

  1. Lead Reliability Engineering for User Experience
  2. Architect for Scale
  3. Reduce Operational Risk
  4. Drive Automation
  5. Incident Management

Skills

Required

  • Site Reliability Engineering
  • Infrastructure Engineering
  • distributed systems
  • networking
  • Linux systems
  • cloud native architectures
  • highly available systems
  • Go
  • Python
  • observability systems
  • metrics
  • logging
  • tracing
  • alerting
  • SLOs
  • automation
  • incident management
  • performance optimization
  • troubleshooting complex issues

Nice to have

  • Kubernetes
  • containers
  • cloud infrastructure
  • modern deployment platforms
  • Prometheus
  • Grafana
  • OpenTelemetry
  • Envoy
  • Kafka
  • ClickHouse
  • Cassandra
  • Redis
  • CDN optimization
  • edge reliability
  • traffic engineering
  • global infrastructure
  • open source software
  • incident response
  • operational transformation

What the JD emphasized

  • operating large scale distributed systems
  • high traffic, user facing production environments
  • highly available systems
  • complex issues across applications, infrastructure, networking, and services