Remote - Site Reliability Developer 3 (… at Oracle

What you'd actually do

Own the end-to-end reliability, scalability, and operability of shared data platforms

Define platform standards, architectural direction, and operational guardrails

Establish capacity models, scaling strategies, and operational best practices

Operate and evolve stateful distributed systems where data placement, replication, and recovery are critical

Design and evolve an Ansible- and Terraform-driven automation framework

Skills

Required

Operating large-scale, customer-facing distributed platforms
HDFS, YARN, HBase, Kafka, Storm, or similar systems
Linux
Networking
Distributed system troubleshooting
Ansible
Terraform
Python
Ruby
Bash
Kerberized environments
Technical architecture documentation
Platform ownership
Observability
Capacity modeling
Computer Science fundamentals

Nice to have

HealtheIntent

What the JD emphasized

U.S. Citizenship required and eligibility for a Federal Security Clearance

4+ years operating large-scale, customer-facing distributed platforms

Deep experience with HDFS, YARN, HBase, Kafka, Storm, or similar systems

Strong background in Linux, networking, and distributed system troubleshooting

Infrastructure-as-Code using Ansible and Terraform

Scripting and automation using Python, Ruby, and Bash

Hands-on experience operating Kerberized environments

Proven ability to define and document technical architecture for complex systems

Demonstrated ownership of shared platforms with broad blast radius and multiple downstream consumers

Experience designing observability and capacity models for distributed platforms

BS or MS in Computer Science, or equivalent

Our Team

Building off our Cloud momentum, Oracle has formed a new organization: Oracle Health Data & Analytics Platform. This team will focus on product development and product strategy for Oracle Health while building out a complete platform supporting modernized, automated healthcare. This is a net new line of business, constructed with an entrepreneurial spirit that promotes an energetic and creative environment. We are unencumbered and will need your contribution to make it a world-class engineering center with a focus on excellence.

Oracle Health Data & Analytics Platform has a rare opportunity to play a critical role in how Oracle Health products impact and disrupt the healthcare industry by transforming how healthcare and technology intersect.

You will have the opportunity to:

Reach billions of people with our products & services
Create technology that truly impacts the world
Have an immediate impact on developing technology
Enjoy unlimited growth potential with inspiring work
Work with the best minds in the industry
Enjoy working in an open, diverse, and productive environment

About The Job

This role supports the core data platforms behind Oracle Health’s Data & Analytics Platform. As a Senior Site Reliability Engineer (SRE), you will own shared, mission-critical systems used by multiple products and teams.

You will work on the design and operation of large-scale, stateful distributed platforms, including Hadoop ecosystem components (HDFS, YARN, HBase) deployed on Oracle Big Data Service (BDS), Kafka, and Storm. These multi-tenant platforms are deployed and operated through Ansible- and Terraform-based automation and require strong architectural ownership to manage scale, change, and broad blast radius.

What You'll Do

Platform Ownership & Technical Leadership

Own the end-to-end reliability, scalability, and operability of shared data platforms
Define platform standards, architectural direction, and operational guardrails
Influence cross-team technical decisions and long-term platform strategy
Drive long-term platform evolution and influence reliability strategy across the data ecosystem

Architecture & Design

Clearly articulate system behavior, dependencies, and failure modes
Make principled trade-offs between reliability, performance, cost, and complexity
Provide guidance and guardrails that enable downstream teams to use platforms safely and effectively

Operations Engineering

Establish capacity models, scaling strategies, and operational best practices
Design platforms that behave predictably under load, failure, and change
Own platform lifecycle events: upgrades, expansions, decommissioning, and recovery

Distributed Systems Expertise

Operate and evolve stateful distributed systems where data placement, replication, and recovery are critical
Reason about failure modes such as backpressure, rebalancing, region movement, replication lag, and rolling upgrades

Security

Operate and maintain Kerberized platforms, including authentication, authorization, and secure service-to-service communication
Treat security as a first-class architectural concern

Automation

Design and evolve an Ansible- and Terraform-driven automation framework
Treat automation as production software: versioned, reviewed, tested, and improved
Eliminate operational toil by encoding reliability and safety into the platform

Incident Leadership & Prevention

Serve as the ultimate escalation point for complex or ambiguous incidents
Focus on eliminating entire classes of failure, not just resolving individual issues

Representation

Represent SRE and platform engineering in high-visibility and sensitive forums
Communicate clearly with engineering leadership and partner teams

Responsibilities

The team operates within the Oracle Health Data & Analytics Platform, supporting one of Oracle Health’s core products, HealtheIntent. We operate the big data and streaming infrastructure that enables downstream teams to deliver reliable customer-facing solutions at scale, while continuously improving operability and efficiency.

Required Experience

4+ years operating large-scale, customer-facing distributed platforms
Deep experience with HDFS, YARN, HBase, Kafka, Storm, or similar systems
Strong background in Linux, networking, and distributed system troubleshooting
Infrastructure-as-Code using Ansible and Terraform
Scripting and automation using Python, Ruby, and Bash
Hands-on experience operating Kerberized environments
Proven ability to define and document technical architecture for complex systems
Demonstrated ownership of shared platforms with broad blast radius and multiple downstream consumers
Experience designing observability and capacity models for distributed platforms

Required Qualifications

U.S. Citizenship and eligibility for a Federal Security Clearance
5+ years of technical experience relevant to this position
Ability to communicate effectively and build rapport with team members
BS or MS in Computer Science, or equivalent

Disclaimer:

Certain US customer or client-facing roles may be required to comply with applicable requirements, such as immunization and occupational health mandates.

Range and benefit information provided in this posting is specific to the stated locations only.

US: Hiring range in USD: $79,100 to $158,200 per annum. May be eligible for bonus and equity.

Oracle maintains broad salary ranges for its roles in order to account for variations in knowledge, skills, experience, market conditions and locations, as well as reflect Oracle's differing products, industries and lines of business. Candidates are typically placed into the range based on the preceding factors as well as internal peer equity.

Oracle US offers a comprehensive benefits package which includes the following:

Medical, dental, and vision insurance, including expert medical opinion
Short-term disability and long-term disability
Life insurance and AD&D
Supplemental life insurance (Employee/Spouse/Child)
Health care and dependent care Flexible Spending Accounts
Pre-tax commuter and parking benefits
401(k) Savings and Investment Plan with company match
Paid time off: Flexible Vacation is provided to all eligible employees assigned to a salaried (non-overtime eligible) position. Accrued Vacation is provided to all other employees eligible for vacation benefits. For employees working at least 35 hours per week, the vacation accrual rate is 13 days annually for the first three years of employment and 18 days annually for subsequent years of employment. Vacation accrual is prorated for employees working between 20 and 34 hours per week. Employees working fewer than 20 hours per week are not eligible for vacation.
11 paid holidays
Paid sick leave: 72 hours of paid sick leave upon date of hire. Refreshes each calendar year. Unused balance will carry over each year up to a maximum cap of 112 hours.
Paid parental leave
Adoption assistance
Employee Stock Purchase Plan
Financial planning and group legal
Voluntary benefits including auto, homeowner and pet insurance

The role will generally accept applications for at least three calendar days from the posting date or as long as the job remains posted.

Career Level - IC3