Principal Big Data Site Reliability Developer (U.S. Citizenship Required) US Remote

Oracle · Enterprise · United States

Principal Site Reliability Engineer for Oracle Health's Data & Analytics Platform, focusing on the reliability, scalability, and operability of large-scale, stateful distributed platforms (Hadoop ecosystem, Kafka, Storm) using automation (Ansible, Terraform). Requires strong Linux, networking, and distributed systems troubleshooting skills, with experience in Kerberized environments and defining technical architecture for complex systems.

What you'd actually do

  1. Own the end-to-end reliability, scalability, and operability of shared data platforms
  2. Define platform standards, architectural direction, and operational guardrails
  3. Lead platform architecture and design reviews
  4. Operate and evolve stateful distributed systems where data placement, replication, and recovery are critical
  5. Serve as the ultimate escalation point for complex or ambiguous incidents

Skills

Required

  • 8+ years operating large-scale, customer-facing distributed platforms
  • Deep experience with HDFS, YARN, HBase, Kafka, Storm, or similar systems
  • Strong background in Linux, networking, and distributed systems troubleshooting
  • Infrastructure-as-Code using Ansible and Terraform
  • Scripting and automation using Python, Ruby, and Bash
  • Hands-on experience operating Kerberized environments
  • Proven ability to define and document technical architecture for complex systems
  • Demonstrated ownership of shared platforms with broad blast radius and multiple downstream consumers
  • Experience designing observability and capacity models for distributed platforms
  • U.S. Citizenship and eligibility for a Federal Security Clearance
  • 10+ years of technical experience relevant to this position

Nice to have

  • BS or MS in Computer Science, or equivalent

What the JD emphasized

  • U.S. Citizenship and eligibility for a Federal Security Clearance
  • own shared, mission-critical systems
  • broad blast radius
  • multiple downstream consumers