Staff Site Reliability Engineer, Edge - TS/SCI

Okta · Enterprise · Washington, DC · Tech Ops-610

Okta is hiring a Staff Site Reliability Engineer (SRE) to lead the evolution of its large-scale production systems, ensuring uncompromising reliability and performance while supporting critical national security missions in secure, restricted environments. The role focuses on infrastructure leadership, incident engineering, strategic automation, and system resilience, with a strong emphasis on FedRAMP and DoD IL6 compliance.

What you'd actually do

  1. Design, build, and oversee Okta’s production infrastructure, ensuring architectural integrity and peak performance.
  2. Act as a senior escalation point for production incidents, conducting deep-dive root cause analysis and implementing permanent, automated preventive solutions.
  3. Eliminate manual toil by developing sophisticated automation frameworks, evolving monitoring tools, and establishing rigorous technical documentation.
  4. Optimize a highly available, large-scale environment, ensuring "Always On" service delivery across complex network topologies.
  5. Provide technical guidance to the engineering organization, championing SRE best practices and a culture of self-education.

Skills

Required

  • Active TS/SCI clearance with polygraph
  • Deep professional experience with FedRAMP and DoD IL6 frameworks
  • Mastery of AWS networking and security, including Transit Gateways, VPCs, Route Tables, ELBs, and NACLs
  • Advanced experience automating enterprise-scale infrastructure via Terraform or CloudFormation
  • Expert-level Linux systems administration
  • Proficiency in Go, Python, Bash, or Ruby
  • Proven success managing Docker containers and Java-based stacks (Apache/Tomcat) in high-security production environments
  • Solid understanding of networking concepts, IP protocols, and multi-cloud infrastructure

Nice to have

  • B.S. in Computer Science or equivalent technical experience

What the JD emphasized

  • TS/SCI with Polygraph
  • FedRAMP
  • DoD IL6