Staff Site Reliability Engineer at Okta

What you'd actually do

Design, build, and operate highly scalable, reliable, and secure infrastructure powering our production systems across AWS and GCP.

Lead major reliability and modernization initiatives, including container platform migrations (e.g., ECS to EKS/GKE) and microservice enablement across multi-cloud environments.

Serve as a technical authority in Kubernetes (EKS and GKE), cloud infrastructure (AWS and GCP), and modern CI/CD practices (GitOps, automation pipelines).

Partner with development teams to architect and enable microservice-based applications, ensuring production readiness, scalability, and observability.

Implement and manage infrastructure as code (Terraform, Ansible) to automate provisioning, scaling, and configuration management across multiple cloud providers.

Skills

Required

Kubernetes (EKS and GKE)
AWS
GCP
Terraform
Ansible
Python
Go
Shell scripting
CI/CD
Linux systems
Networking fundamentals
Redis
Observability tools (Prometheus, Grafana, ELK, Loki, OpenTelemetry, Google Cloud Operations)
Container security
Secrets management (HashiCorp Vault, AWS Secrets Manager, Google Secret Manager)
SRE best practices (SLOs/SLIs, incident response)
Infrastructure as Code

Nice to have

ECS to EKS/GKE migrations
Microservice enablement
SaaS experience
High-scale, cloud-native environments

**Secure Every Identity, from AI to Human**

Identity is the key to unlocking the potential of AI. Okta secures AI by building the trusted, neutral infrastructure that enables organizations to safely embrace this new era. This work requires a relentless drive to solve complex challenges with real-world stakes. We are looking for builders and owners who operate with speed and urgency and execute with excellence.

This is an opportunity to do career-defining work. We're all in on this mission. If you are too, let's talk.

What You’ll Be Doing

Design, build, and operate highly scalable, reliable, and secure infrastructure powering our production systems across AWS and GCP.
Lead major reliability and modernization initiatives, including container platform migrations (e.g., ECS to EKS/GKE) and microservice enablement across multi-cloud environments.
Serve as a technical authority in Kubernetes (EKS and GKE), cloud infrastructure (AWS and GCP), and modern CI/CD practices (GitOps, automation pipelines).
Partner with development teams to architect and enable microservice-based applications, ensuring production readiness, scalability, and observability.
Implement and manage infrastructure as code (Terraform, Ansible) to automate provisioning, scaling, and configuration management across multiple cloud providers.
Drive improvements in observability, performance, and cost efficiency through robust monitoring, logging, and alerting systems that span AWS and GCP.
Champion SRE best practices — defining SLOs/SLIs, conducting blameless postmortems, and continuously improving incident response.
Lead complex technical projects from conception to completion, managing timelines and technical dependencies across teams.
Mentor engineers across teams, fostering a culture of reliability, automation, and continuous learning.
Collaborate with security and compliance partners to ensure infrastructure adheres to best practices and standards (e.g., IAM Federation, Workload Identity).
Participate in the on-call rotation, using incidents as learning opportunities to enhance systems and processes.

What You’ll Bring to the Role:

Strong hands-on experience architecting and operating cloud-native distributed systems (AWS and GCP).
Deep expertise with Kubernetes (EKS and GKE) — design, provisioning, scaling, and advanced troubleshooting in production.
Proven experience leading ECS to EKS/GKE migrations and driving microservice enablement initiatives at scale.
Proficiency with Infrastructure as Code tools such as Terraform (multi-provider), Ansible, or CloudFormation.
Solid coding and scripting ability in Python, Go, or Shell, with a focus on automation, tooling, and operational excellence.
Advanced understanding of CI/CD pipelines (ArgoCD, GitLab CI, Spinnaker), Linux systems, networking fundamentals (Direct Connect/Interconnect, DNS, routing, load balancing), and Redis (must have).
Experience managing databases and caching systems (e.g., RDS/Cloud SQL, Redis/Memorystore, PostgreSQL, MySQL) in cloud environments.
Hands-on experience with observability tools (Prometheus, Grafana, ELK, Loki, OpenTelemetry, Google Cloud Operations) for performance and reliability insights.
Working knowledge of container security, secrets management (HashiCorp Vault, AWS Secrets Manager, Google Secret Manager), and compliance in production environments.
Strong communication and problem-solving skills, with demonstrated success leading cross-team projects and mentoring peers.

Experience:

8+ years in SRE, DevOps, or Infrastructure Engineering roles.
3–5 years of experience with Kubernetes (EKS/GKE) and related ecosystem tools (Helm, Karpenter, etc.) in production.
3–5 years of experience with AWS and GCP.
3–5 years using Terraform to manage multi-cloud infrastructure.
5+ years of coding experience in Python, Go, or similar languages.
Proven track record leading high-impact projects, specifically migration projects (ECS → EKS/GKE) and enabling microservice architectures.
Experience implementing SLOs/SLIs, performing root cause analyses, and improving operational resilience.
Prior work in SaaS or high-scale, cloud-native environments is a strong plus.
Strong Linux and security fundamentals.
Bachelor’s degree in Computer Science or equivalent hands-on experience.

P25021_3418720

#LI-Hybrid

**The Okta Experience**

Supporting Your Well-Being
Driving Social Impact
Developing Talent and Fostering Connection + Community

We are intentional about connection. Our global community, spanning over 20 offices worldwide, is united by a drive to innovate. Your journey begins with an immersive, in-person onboarding experience designed to accelerate your impact and connect you to our mission and team from day one.

Okta is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, ancestry, marital status, age, physical or mental disability, or status as a protected veteran. We also consider for employment qualified applicants with arrest and conviction records, consistent with applicable laws.

If reasonable accommodation is needed to complete any part of the job application, interview process, or onboarding, please use this Form to request an accommodation.

Notice for New York City Applicants & Employees: Okta may use Automated Employment Decision Tools (AEDT), as defined by New York City Local Law 144, that use artificial intelligence, machine learning, or other automated processes to assist in our recruitment and hiring process. In accordance with NYC Local Law 144, if you are an applicant or employee residing in New York City, please click here to view our full NYC AEDT Notice.

Okta is committed to complying with applicable data privacy and security laws and regulations. For more information, please see our Personnel and Job Candidate Privacy Notice at https://www.okta.com/legal/personnel-policy/.
