Manager, Site Reliability Engineering

Okta · Enterprise · Bangalore, India · SW Eng - Infrastructure-672

Manager of Site Reliability Engineering at Okta, focusing on building and maintaining reliable and performant infrastructure, with a specific emphasis on securing AI systems. The role involves mentoring SRE teams, ensuring security best practices, responding to production incidents, and collaborating with stakeholders to balance reliability, security, and delivery velocity.

What you'd actually do

  1. Mentoring, managing, and leading a team of SREs with a broad range of expertise and experience.
  2. Being an evangelist and advocate for security best practices, leading initiatives and projects to strengthen our security posture for our most critical infrastructure.
  3. Responding to production incidents, driving remediation as quickly as possible, and determining how to prevent recurrences.
  4. Triaging and troubleshooting complex production issues to ensure reliability and performance.
  5. Working closely with stakeholders across the organization to ensure new capabilities are aligned with the competing constraints of reliability, security, and delivery velocity.

Skills

Required

  • Experience managing SRE or SWE teams
  • Leadership skills
  • Communication skills
  • Project management skills
  • Strong security background
  • Experience running large-scale production Java/Tomcat and containerized services in AWS or other cloud providers
  • CI/CD principles
  • Linux fundamentals
  • OS hardening
  • Networking concepts
  • IP protocols

Nice to have

  • Experience in a cloud native environment
  • Experience with ECS, KMS, Kinesis, RDS

What the JD emphasized

  • Secure Every Identity, from AI to Human
  • builders and owners who operate with speed and urgency and execute with excellence
  • Always On
  • If you have to do something more than once, automate it
  • rapidly self-educate on new concepts and tools
  • security best practices
  • strengthen our security posture
  • production incidents
  • complex production issues
  • reliability and performance
  • reliability, security, and delivery velocity
  • vulnerability scanning and security posture
  • 24x7 online environment
  • large-scale production Java/Tomcat and containerized services in AWS
  • CI/CD principles
  • Linux fundamentals
  • OS hardening
  • networking concepts
  • IP protocols
  • 4+ years of experience managing SRE or SWE teams
  • 13+ years of experience, with strong leadership, communication, and project management skills
  • Strong security background and knowledge