Senior Site Reliability Engineer

Oracle · Enterprise · Nashville, TN +1

Senior Site Reliability Engineer role focused on incident response and maintaining high availability of Oracle's cloud services. Responsibilities include incident management, automation of tasks, troubleshooting complex systems, and defining technical architecture for distributed systems. Requires public cloud operations experience and expertise in SRE/DevOps principles.

What you'd actually do

  1. Solve complex problems related to infrastructure cloud services and automate common tasks to ensure continuous availability with minimal human intervention.
  2. Command and coordinate SMEs and service leaders to restore services as quickly as possible during major incidents, while maintaining accurate, timely records of each incident's progress.
  3. Utilize a deep understanding of cloud computing design patterns and their dependencies to mitigate complex major incidents.
  4. Apply a methodical approach to troubleshooting the large, complex, interconnected systems used in incident detection and orchestration.
  5. Document pertinent information related to incidents that aids process improvement, identifies deviations, and enables the creation of an incident knowledge base.

Skills

Required

  • Site Reliability Engineering
  • DevOps
  • System Engineering
  • public cloud operations experience (e.g., AWS, Azure, GCP, OCI)
  • Major Incident Management in a cloud-based environment
  • automation and orchestration principles
  • modern object-oriented programming language
  • Agile project management
  • coding standards
  • code reviews
  • source control management
  • build processes
  • testing
  • operations
  • infrastructure automation tools such as Chef, Ansible, Jenkins, Terraform
  • Infrastructure-as-a-Service
  • CI/CD systems
  • Docker
  • RESTful APIs
  • log analysis tools
  • debugging tools

What the JD emphasized

  • public cloud operations experience
  • Major Incident Management