Site Reliability Engineer

PostHog · Enterprise · Remote · Engineering

PostHog is seeking a Site Reliability Engineer to manage and automate its AWS infrastructure, centered on EKS clusters running Karpenter, Cilium, and ArgoCD. The role covers the reliability and scalability of a large-scale, stateful system handling petabytes of data, with an emphasis on reducing operational stress and building automation for traffic-heavy workloads. PostHog has an AI product, but this SRE role does not build AI models; it supports the infrastructure that runs them.

What you'd actually do

  1. Operating EKS clusters across several environments with Karpenter autoscaling, Cilium networking, and ArgoCD-driven GitOps deployments

Skills

Required

  • AWS
  • Kubernetes
  • EKS
  • Karpenter
  • Cilium
  • ArgoCD
  • GitOps
  • automation
  • stateful infrastructure
  • reliability engineering
  • system design
  • networking

Nice to have

  • VMs

What the JD emphasized

  • deep ownership of production systems
  • working with stateful infrastructure
  • AWS, VMs, automation
  • making messy systems reliable
  • fully own projects
  • scaling
  • shipping complex products
  • handling a stream of support requests
  • trying to ship something that touches multiple teams
  • turning a fast-growing, stateful system into a predictable, well-automated platform
  • reducing operational stress
  • designing safe automation for traffic-heavy workloads
  • building the tooling and patterns that let the system scale without scaling human effort
  • petabytes of data
  • thousands of cores
  • constant ingestion
  • multi-region, multi-account AWS platform
  • many services on Kubernetes