Staff Site Reliability Engineer - Observability

Okta · Enterprise · Bellevue, WA · Tech Ops-610

The Staff Site Reliability Engineer, with a specialty in Splunk, owns and evolves Okta's Splunk ecosystem to deliver a world-class, comprehensive, scalable Observability Platform. The role treats infrastructure as code using Terraform and involves coding in Go, Python, or Ruby to automate agent and collector deployment across distributed systems. Key responsibilities include designing and maintaining observability infrastructure, optimizing Splunk for reliability and low latency, participating in incident response, and automating tasks to eliminate toil.

What you'd actually do

  1. Automated Infrastructure: Design, build, and maintain scalable observability infrastructure using tools like Terraform.
  2. Splunk Engineering: Optimize the collection, processing, and storage of log data to ensure high reliability and low latency of our Splunk services.
  3. Incident Response: Participate in on-call rotations and lead post-incident reviews to drive systemic improvements and "observability-driven development."
  4. Automation: Eliminate "toil" by automating the deployment and scaling of observability agents and collectors.

Skills

Required

  • Splunk Cloud at scale (1000+ SVCs)
  • Workload Management (WLM)
  • HEC optimization
  • Splunk dashboards
  • SRE
  • DevOps
  • Systems Engineering
  • high-availability systems
  • SPL
  • Go
  • Linux internals
  • TCP/IP
  • DNS
  • Load Balancing
  • Kubernetes/EKS
  • debugging complex, cross-service performance bottlenecks

Nice to have

  • OpenTelemetry (OTel)
  • Vector
  • instrumenting applications
  • Splunk charge-back app
  • AWS
  • GCP

What the JD emphasized

  • Splunk Cloud at scale (1000+ SVCs)
  • HEC optimization
  • correlate data across multiple sources
  • high-availability systems
  • cross-service performance bottlenecks