Site Reliability Engineer (Auth0)

Okta · Enterprise · Spain · Remote · Tech Ops-610

Okta is seeking a mid-level Site Reliability Engineer to ensure the reliability, resilience, and scalability of its production systems, with a focus on securing AI infrastructure. The role involves designing and building custom software in Go, partnering with engineering teams, and contributing to SRE tooling and processes.

What you'd actually do

  1. Design and build custom software in Go to enhance the platform's reliability, resiliency, and redundancy.
  2. Partner with engineering teams to embed reliability principles, improving the availability, performance, and observability of our services.
  3. Use your deep understanding of infrastructure and observability principles to identify opportunities for improvement within the product and implement solutions.
  4. Contribute to our on-call rotation, providing rapid, effective response to critical incidents and using your expertise to troubleshoot, mitigate, or accurately escalate production issues.
  5. Develop and refine our SRE tooling and processes, focusing on automation and operational efficiency.

Skills

Required

  • Go programming language
  • Infrastructure as code (Terraform)
  • Container orchestration (Kubernetes, Docker)
  • GitOps (ArgoCD)
  • Major cloud provider (Azure, AWS, or GCP)
  • Microservices architecture
  • Databases (SQL, NoSQL)
  • Networking fundamentals
  • SRE principles (SLIs, SLOs, error budgets)
  • On-call rotation experience
  • Communication and collaboration skills

Nice to have

  • Custom software development
  • Observability principles
  • Automation
  • Operational efficiency

What the JD emphasized

  • career-defining work
  • relentless drive to solve complex challenges
  • speed and urgency
  • execute with excellence
  • mission
  • exponential growth
  • directly contributing to the platform's core resiliency and robustness
  • hands-on builder
  • high degree of ownership
  • high degree of autonomy
  • custom applications, not just scripts
  • major cloud provider
  • microservices architecture
  • networking fundamentals
  • custom code can solve platform-level issues
  • core SRE principles
  • on-call rotation for a 24/7 cloud-based environment
  • exceptional communication and collaboration skills
  • remote, distributed team
  • self-driven
  • massive scale
  • curious and motivated engineer
  • building reliability directly into the platform