Staff Site Reliability Engineer

Okta · Enterprise · Bangalore, India · Tech Ops-610

Okta is seeking a Staff Site Reliability Engineer to join its Emerging Products Group (EPG). The role focuses on building and operating highly reliable, scalable, and secure cloud services, with an emphasis on automation and operational excellence. The engineer will lead reliability initiatives, mentor other engineers, and explore AI-assisted engineering techniques to improve operational efficiency and productivity. The tech stack includes Kubernetes, Terraform, Go, Python, and Datadog.

What you'd actually do

  1. Design, build, and operate large-scale cloud infrastructure and production services.
  2. Participate in an on-call rotation supporting highly available customer-facing systems.
  3. Lead incident response efforts and drive post-incident reviews focused on systemic improvements.
  4. Define, measure, and improve Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
  5. Partner with engineering teams to improve service availability, scalability, performance, and resilience.

Skills

Required

  • operating large-scale production services in AWS and/or GCP
  • Kubernetes in production environments
  • Infrastructure as Code technologies such as Terraform and Helm
  • software engineering skills in Golang and/or Python
  • building automation and internal engineering platforms
  • operating and troubleshooting distributed data platforms
  • cloud networking fundamentals
  • observability platforms, monitoring strategies, and production telemetry
  • leading incident response
  • reliability engineering concepts including SLIs, SLOs, error budgets, and capacity planning
  • CI/CD pipelines, deployment strategies, and automation-first operational practices

Nice to have

  • experience with, or strong interest in, AI-assisted engineering and operational automation

What the JD emphasized

  • AI-assisted engineering techniques
  • operational efficiency
  • incident response
  • troubleshooting
  • automation