Site Reliability Developer 3

Oracle · Enterprise · Bengaluru, Karnataka, India

This role focuses on Site Reliability Engineering (SRE) for Oracle Cloud Infrastructure (OCI) Compute services. The primary responsibilities include ensuring the reliability, scalability, performance, and operational efficiency of large-scale production environments. The role involves handling customer incidents, supporting deployments, troubleshooting complex infrastructure issues, conducting root cause analysis, and driving service reliability improvements. A key aspect is leveraging AIOps and intelligent automation tools for monitoring, anomaly detection, event correlation, predictive alerting, and remediation workflows to reduce operational toil and enhance incident response. The role also requires strong programming/scripting skills for automation and tooling, deep expertise in distributed systems, and experience with Enterprise Linux in production environments.

What you'd actually do

  1. Install, monitor, maintain, support, and optimize all production server hardware and software.
  2. Provide escalated technical support for complex technical issues, which may include leading problem management cases and reporting status to management.
  3. Coordinate escalated support cases, leading the appropriate internal technical resources and/or third-party vendors to resolution, and coordinate the storage infrastructure of Oracle systems and database appliances.
  4. Take responsibility for Oracle production environments: assist with server operating system and application upgrades, bug fixes, patching, and deployment activities, and work on standardization projects for both hardware and software under the Oracle technology stack, while providing the consistent system uptime expected in a Cloud environment.
  5. Leverage AIOps, observability platforms, telemetry analytics, intelligent automation, event correlation, predictive alerting, and automated remediation workflows to improve operational efficiency, incident response, and service reliability, and to reduce operational toil.

Skills

Required

  • Enterprise Linux operating systems in large-scale production environments
  • Incident Management
  • Support and troubleshooting of Staging/Production environments
  • On-Call rotations
  • high availability
  • scalability
  • reliability
  • operational excellence
  • Root Cause Analysis (RCA)
  • automated operational processes
  • deployment tools
  • CI/CD pipelines
  • operational procedures
  • automation frameworks
  • zero-downtime deployments
  • infrastructure upgrades
  • patching
  • change management
  • production rollout activities
  • AIOps
  • observability platforms
  • telemetry analytics
  • intelligent automation
  • event correlation
  • anomaly detection
  • predictive alerting
  • automated remediation workflows
  • operational efficiency
  • incident response
  • scalable operational solutions
  • infrastructure
  • cloud migration
  • distributed systems operations
  • code-level analysis
  • service dependencies
  • production security posture
  • operational compliance
  • infrastructure reliability
  • capacity planning
  • performance optimization
  • system tuning
  • scalability initiatives

Nice to have

  • Python
  • Java
  • Go

What the JD emphasized

  • critical customer incidents
  • complex production issues
  • complex technical issues
  • complex infrastructure issues
  • complex issues requiring code-level analysis