Staff Site Reliability Engineer - Observability

Okta · Enterprise · San Francisco, CA · Tech Ops-610

Okta is seeking a Staff Site Reliability Engineer specializing in Observability on Google Cloud to expand its Observability ecosystem. The role involves designing, building, and maintaining scalable observability infrastructure using Terraform and coding in Go, Python, or Ruby. Responsibilities include optimizing data collection, processing, and storage for Splunk and Grafana, participating in incident response, and automating the deployment of agents and collectors.

What you'd actually do

  1. Design, build, and maintain scalable observability infrastructure using tools like Terraform.
  2. Optimize the collection, processing, and storage of observability data to ensure high reliability and low latency of our Splunk and Grafana services.
  3. Participate in on-call rotations and lead post-incident reviews to drive systemic improvements and "observability-driven development."
  4. Eliminate "toil" by automating the deployment and scaling of observability agents and collectors.

Skills

Required

  • GKE
  • Splunk
  • Grafana
  • SRE
  • DevOps
  • Systems Engineering
  • Python
  • Go
  • Linux
  • TCP/IP
  • DNS
  • Load Balancing
  • Kubernetes

Nice to have

  • OpenTelemetry
  • Vector
  • Grafana Loki
  • AWS

What the JD emphasized

  • Minimum 5+ years of experience scaling and managing observability on Google Cloud Platform.
  • Expertise in creating intuitive, actionable Splunk or Grafana dashboards that correlate data across multiple sources.
  • Minimum 3+ years of experience in an SRE, DevOps, or Systems Engineering role with a focus on high-availability systems.
  • Strong coding skills in Python or Go for building internal tools and automating workflows.