Staff Site Reliability Engineer

Okta · Enterprise · Bangalore, India · Tech Ops-610

Okta is seeking a Staff Site Reliability Engineer to build and operate highly scalable, reliable, and secure infrastructure for its production systems, with a focus on supporting AI initiatives. The role involves leading reliability and modernization efforts, managing Kubernetes and cloud infrastructure (AWS and GCP), and implementing infrastructure as code. The engineer will partner with development teams, drive observability improvements, champion SRE best practices, and mentor other engineers. Experience with Kubernetes, cloud-native distributed systems, IaC, CI/CD, and observability tools is required.

What you'd actually do

  1. Design, build, and operate highly scalable, reliable, and secure infrastructure powering our production systems across AWS and GCP.
  2. Lead major reliability and modernization initiatives, including container platform migrations (e.g., ECS to EKS/GKE) and microservice enablement across multi-cloud environments.
  3. Serve as a technical authority in Kubernetes (EKS and GKE), cloud infrastructure (AWS and GCP), and modern CI/CD practices (GitOps, automation pipelines).
  4. Partner with development teams to architect and enable microservice-based applications, ensuring production readiness, scalability, and observability.
  5. Implement and manage infrastructure as code (Terraform, Ansible) to automate provisioning, scaling, and configuration management across multiple cloud providers.

Skills

Required

  • Kubernetes (EKS and GKE)
  • AWS
  • GCP
  • Terraform
  • Ansible
  • Python
  • Go
  • Shell scripting
  • CI/CD
  • Linux systems
  • Networking fundamentals
  • Redis
  • Observability tools (Prometheus, Grafana, ELK, Loki, OpenTelemetry, Google Cloud Operations)
  • Container security
  • Secrets management (HashiCorp Vault, AWS Secrets Manager, Google Secret Manager)
  • SRE best practices (SLOs/SLIs, incident response)
  • Infrastructure as Code

Nice to have

  • ECS to EKS/GKE migrations
  • Microservice enablement
  • SaaS experience
  • High-scale, cloud-native environments

What the JD emphasized

  • Kubernetes (EKS and GKE)
  • AWS and GCP
  • Microservice enablement
  • Terraform
  • Python, Go, or Shell
  • Redis (must have)
  • ECS to EKS/GKE migrations