[remote] Principal Site Reliability Developer - USC Required

Oracle · Enterprise · United States

This role focuses on Site Reliability Engineering for Oracle Health's AI-powered healthcare infrastructure, specifically the Clinical AI Assistant platform. The principal engineer will lead reliability efforts for large-scale cloud-native systems, design and operate distributed systems supporting AI services, build automation and self-healing capabilities, and develop AIOps features. The role requires strong experience in SRE, distributed systems, Kubernetes, and automation; experience with AI/ML infrastructure and regulated environments is a plus.

What you'd actually do

  1. Lead reliability engineering efforts for large-scale cloud-native healthcare platforms
  2. Design and operate highly available distributed systems supporting AI-driven services
  3. Build automation, self-healing systems, and intelligent operational tooling
  4. Drive improvements across scalability, observability, deployment safety, and incident response
  5. Lead complex production investigations and engineer durable long-term fixes

Skills

Required

  • 7–10+ years of experience in Site Reliability Engineering, DevOps, Production Engineering, or related infrastructure roles
  • Strong experience operating large-scale production systems with high availability requirements
  • Deep understanding of distributed systems, reliability engineering, and cloud infrastructure
  • Hands-on experience with Kubernetes and containerized workloads
  • Strong automation and software engineering skills
  • Experience improving operational excellence through tooling and engineering rigor
  • Strong troubleshooting and performance optimization skills in Linux-based environments
  • Experience with observability systems, monitoring, tracing, and alerting
  • Ability to lead technical initiatives and drive cross-team reliability improvements
  • U.S. citizenship required
  • Ability to obtain and maintain a federal security clearance

Nice to have

  • AI/ML or LLM infrastructure in production
  • AIOps or intelligent operational automation
  • Experience in healthcare or other regulated environments
  • High-throughput, low-latency distributed systems
  • Experience with Java, Python, C++, or similar languages

What the JD emphasized

  • AI-powered healthcare infrastructure
  • large-scale AI systems
  • AI-driven services
  • AIOps capabilities
  • AI/ML or LLM infrastructure in production
  • AIOps or intelligent operational automation
  • regulated environments

Other signals

  • Clinical AI Assistant platform