Senior Site Reliability Engineer

Oracle · Enterprise · United States

Senior Site Reliability Engineer role focused on incident response, availability, and automation for Oracle Cloud Infrastructure (OCI). Responsibilities include detecting, triaging, and mitigating service-impacting events, coordinating SMEs, and improving incident management processes through automation and documentation. Requires experience in public cloud operations, major incident management, and software engineering best practices.

What you'd actually do

  1. Solve complex problems related to infrastructure cloud services and automate common tasks to ensure continuous availability with minimal human intervention.
  2. Command and coordinate SMEs and service leaders to restore services as quickly as possible during major incidents, while maintaining accurate and timely records of each incident's progress.
  3. Utilize a deep understanding of cloud computing design patterns and their dependencies to mitigate complex major incidents.
  4. Apply a methodical approach to troubleshooting the large, complex, interconnected systems used in incident detection and orchestration.
  5. Document pertinent information related to incidents that aids process improvement, identifies deviations, and enables the creation of an incident knowledge base.

Skills

Required

  • Site Reliability Engineering
  • DevOps
  • System Engineering
  • Public cloud operations experience (e.g., AWS, Azure, GCP, OCI)
  • Major Incident Management in a cloud-based environment
  • Automation and orchestration principles
  • Modern object-oriented programming language
  • Professional software engineering standard methodologies (Agile, coding standards, code reviews, source control, build processes, testing, operations)
  • Infrastructure automation tools (Chef, Ansible, Jenkins, Terraform)
  • Infrastructure-as-a-Service
  • CI/CD systems
  • Docker
  • RESTful APIs
  • Log analysis tools
  • Debugging tools

What the JD emphasized

  • Must have public cloud operations experience
  • Extensive experience with Major Incident Management in a cloud-based environment