Principal Site Reliability Engineer

Oracle · Enterprise · Seattle, WA +1

Principal Site Reliability Engineer responsible for designing, architecting, and maintaining reliable, scalable infrastructure and services. The role focuses on capacity planning, incident response, automation, and performance reporting, and works with development teams to ensure robust systems. It also involves experimenting with new tools and keeping current with SRE trends.

What you'd actually do

  1. Designs and architects infrastructure and/or services to meet reliability and functionality requirements.
  2. Forecasts demands for infrastructure and responds to capacity needs, ensuring systems have sufficient resources to handle current and future workloads and identifying resource gaps.
  3. Collaborates with the software development team to develop infrastructure, ensuring features are reliable and scalable according to deployment requirements.
  4. Exercises judgment when performing data collection, triage, technical analysis, and redirection to maintain and optimize operations and infrastructure reliability.
  5. Leverages advanced knowledge to perform incident response, root cause analyses, and/or maintenance on assigned services (e.g., software installs, version upgrades, security updates, backup and recovery).

Skills

Required

  • infrastructure design
  • service architecture
  • capacity planning
  • incident response
  • root cause analysis
  • automation development
  • performance monitoring
  • technical communication
  • troubleshooting
  • site reliability engineering trends

Nice to have

  • prototyping
  • on-call support
  • scripting
  • SLO management