Site Reliability Developer 5

This role focuses on the reliability, availability, and operational strategy of Oracle's Japan Sovereign Cloud platform. It involves leading large-scale reliability initiatives, influencing architecture, developing automation, and driving operational excellence. The position requires expertise in distributed systems, cloud infrastructure, and software engineering, and includes participating in a 24x7 support model and serving as a key escalation point for incidents.

What you'd actually do

Provide technical leadership for the reliability, availability, and operational strategy of OCI's Japan Sovereign Cloud platform.
Lead large-scale reliability initiatives, influence architecture decisions, develop advanced automation frameworks, and drive operational excellence across multiple cloud services.
Define Site Reliability Developer (SRD) reliability strategy, operational standards, and improvement roadmaps that connect business requirements with technical execution across Alert, Incident Response, Availability, and Reliability.
Align operational practices with JP Sovereign Cloud, EU Sovereign Cloud, and global OCI reliability teams, and drive standardization where it improves service resiliency.
Participate in a 24x7 operational support model while serving as a key escalation point for high-severity incidents and strategic reliability improvements.

Skills

Required

Site Reliability Engineering
Cloud Infrastructure Engineering
Software Development
Large-scale distributed systems operations
Designing, operating, and improving highly available cloud platforms
Mission-critical services operations
Software development and automation (Java, Go, Python, or similar)
Distributed systems architecture
Networking
Storage
Observability
Service resiliency principles
Major incident response
Reliability programs leadership
Cross-organizational technical initiatives leadership
Architecture influence
Operational standards definition
Engineering best practices influence
24x7 shift rotation participation
Senior escalation resource for critical production events
Reliability strategy definition
Operational standards definition across multiple teams/services
Translating business and operational requirements into prioritized reliability roadmaps
Cross-sovereign collaboration
Shared operational practices
Tooling alignment
Native-level Japanese language proficiency
Business-level English communication skills

What the JD emphasized

Native-level Japanese language proficiency
8+ years of experience in Site Reliability Engineering, Cloud Infrastructure Engineering, Software Development, or large-scale distributed systems operations
Extensive experience designing, operating, and improving highly available cloud platforms and mission-critical services
Expert-level proficiency in software development and automation using languages such as Java, Go, Python, or similar
Deep understanding of distributed systems architecture, networking, storage, observability, and service resiliency principles
Proven track record leading major incident response efforts, reliability programs, and cross-organizational technical initiatives
Ability to influence architecture, operational standards, and engineering best practices across multiple teams
Willingness to participate in a 24x7 shift rotation and act as a senior escalation resource for critical production events
Proven ability to define reliability strategy and operational standards across multiple teams or services
Experience translating business and operational requirements into prioritized reliability roadmaps with measurable outcomes
Ability to drive cross-sovereign collaboration, including shared operational practices and tooling alignment across JP Sovereign Cloud and EU Sovereign Cloud

Full job description

As a Principal Site Reliability Developer (IC5), you will provide technical leadership for the reliability, availability, and operational strategy of OCI's Japan Sovereign Cloud platform. You will lead large-scale reliability initiatives, influence architecture decisions, develop advanced automation frameworks, and drive operational excellence across multiple cloud services. You will define SRD reliability strategy, operational standards, and improvement roadmaps that connect business requirements with technical execution across Alert, Incident Response, Availability, and Reliability.

This position requires deep expertise in distributed systems, cloud infrastructure, and software engineering, combined with the ability to collaborate effectively with senior engineering leaders across global OCI organizations. You will align operational practices with JP Sovereign Cloud, EU Sovereign Cloud, and global OCI reliability teams, and drive standardization where it improves service resiliency. The role includes participation in a 24x7 operational support model while serving as a key escalation point for high-severity incidents and strategic reliability improvements. You will sponsor improvements raised from shift operations, ensure recurring issues are addressed through durable fixes, and mentor senior engineers on Plan + Execution ownership.

Qualifications

Native-level Japanese language proficiency and business-level English communication skills
8+ years of experience in Site Reliability Engineering, Cloud Infrastructure Engineering, Software Development, or large-scale distributed systems operations
Extensive experience designing, operating, and improving highly available cloud platforms and mission-critical services
Expert-level proficiency in software development and automation using languages such as Java, Go, Python, or similar
Deep understanding of distributed systems architecture, networking, storage, observability, and service resiliency principles
Proven track record leading major incident response efforts, reliability programs, and cross-organizational technical initiatives
Ability to influence architecture, operational standards, and engineering best practices across multiple teams
Willingness to participate in a 24x7 shift rotation and act as a senior escalation resource for critical production events
Proven ability to define reliability strategy and operational standards across multiple teams or services
Experience translating business and operational requirements into prioritized reliability roadmaps with measurable outcomes
Ability to drive cross-sovereign collaboration, including shared operational practices and tooling alignment across JP Sovereign Cloud and EU Sovereign Cloud

Career Level - IC5
