Site Reliability Developer 5

Oracle · Enterprise · Japan

This role focuses on the reliability, availability, and operational strategy of Oracle's Japan Sovereign Cloud platform. It involves leading large-scale reliability initiatives, influencing architecture, developing automation, and driving operational excellence. The position requires expertise in distributed systems, cloud infrastructure, and software engineering, with participation in a 24x7 support model and acting as a key escalation point for incidents.

What you'd actually do

  1. Provide technical leadership for the reliability, availability, and operational strategy of OCI's Japan Sovereign Cloud platform.
  2. Lead large-scale reliability initiatives, influence architecture decisions, develop advanced automation frameworks, and drive operational excellence across multiple cloud services.
  3. Define the Site Reliability Developer (SRD) reliability strategy, operational standards, and improvement roadmaps that connect business requirements with technical execution across alerting, incident response, availability, and reliability.
  4. Align operational practices with JP Sovereign Cloud, EU Sovereign Cloud, and global OCI reliability teams, and drive standardization where it improves service resiliency.
  5. Participate in a 24x7 operational support model while serving as a key escalation point for high-severity incidents and strategic reliability improvements.

Skills

Required

  • Site Reliability Engineering
  • Cloud Infrastructure Engineering
  • Software Development
  • Large-scale distributed systems operations
  • Designing, operating, and improving highly available cloud platforms
  • Mission-critical service operations
  • Software development and automation (Java, Go, Python, or similar)
  • Distributed systems architecture
  • Networking
  • Storage
  • Observability
  • Service resiliency principles
  • Major incident response
  • Reliability program leadership
  • Leadership of cross-organizational technical initiatives
  • Influencing architecture decisions
  • Operational standards definition
  • Influencing engineering best practices
  • 24x7 shift rotation participation
  • Senior escalation resource for critical production events
  • Reliability strategy definition
  • Operational standards definition across multiple teams/services
  • Translating business and operational requirements into prioritized reliability roadmaps
  • Cross-sovereign collaboration
  • Shared operational practices
  • Tooling alignment

Nice to have

  • Native-level Japanese language proficiency
  • Business-level English communication skills

What the JD emphasized

  • Native-level Japanese language proficiency
  • 8+ years of experience in Site Reliability Engineering, Cloud Infrastructure Engineering, Software Development, or large-scale distributed systems operations
  • Extensive experience designing, operating, and improving highly available cloud platforms and mission-critical services
  • Expert-level proficiency in software development and automation using languages such as Java, Go, Python, or similar
  • Deep understanding of distributed systems architecture, networking, storage, observability, and service resiliency principles
  • Proven track record leading major incident response efforts, reliability programs, and cross-organizational technical initiatives
  • Ability to influence architecture, operational standards, and engineering best practices across multiple teams
  • Willingness to participate in a 24x7 shift rotation and act as a senior escalation resource for critical production events
  • Proven ability to define reliability strategy and operational standards across multiple teams or services
  • Experience translating business and operational requirements into prioritized reliability roadmaps with measurable outcomes
  • Ability to drive cross-sovereign collaboration, including shared operational practices and tooling alignment across JP Sovereign Cloud and EU Sovereign Cloud