Senior Site Reliability Engineer

Oracle · Enterprise · United States

Senior Site Reliability Engineer role at Oracle focused on supporting mission-critical services, with a strong emphasis on OCI tooling, distributed systems, production operations, automation, observability, and incident response. The role involves integrating AI-driven tools into observability platforms and incident management workflows, and partnering with AI Ops initiatives.

What you'd actually do

  1. Own and improve service reliability, availability, and performance (SLO/SLA)
  2. Lead and participate in incident response, root cause analysis, and postmortems
  3. Develop and implement automation to reduce operational toil and improve efficiency
  4. Build and enhance monitoring, alerting, and observability frameworks
  5. Partner with OCI engineering teams to improve system design, scalability, and resilience

Skills

Required

  • 5–8+ years of experience in SRE, DevOps, or production engineering
  • Strong hands-on experience with Oracle Cloud Infrastructure (OCI)
  • Experience operating large-scale distributed systems in production
  • Proficiency in one or more programming/scripting languages (Python, Go, Bash, etc.)
  • Experience with monitoring, observability, and incident management practices
  • Strong understanding of Linux systems and networking fundamentals
  • Experience with CI/CD pipelines and infrastructure as code
  • Knowledge of or experience with troubleshooting and managing databases (Oracle preferred)
  • United States citizen currently residing in the United States

Nice to have

  • Experience supporting high-availability, customer-facing cloud services
  • Background in regulated or government cloud environments
  • Familiarity with FedRAMP, ILx, or similar compliance standards
  • Experience with FedRAMP and 3PAO audit procedures and requirements
  • Experience driving automation and reliability engineering best practices
  • Experience with Shepherd or similar
  • Experience with Kubernetes
  • Experience with M&O stack: Grafana, Prometheus, or similar
  • Familiarity with the construction and engineering industry
  • Familiarity with SRE principles, including incident response, SLIs/SLOs, and resilience engineering
  • Proven track record in building automation solutions for cloud operations or DevOps processes

What the JD emphasized

  • regulated, high-compliance environment
  • FedRAMP, ILx, or similar compliance standards
  • FedRAMP and 3PAO audit procedures and requirements

Other signals

  • integrates AI-driven tools into observability platforms
  • partner with the SRE team to align AI Ops initiatives with reliability goals
  • build dashboards and visualizations to present AI-driven insights