Senior Site Reliability Engineer

Oracle · Enterprise · United States

Senior Site Reliability Engineer for an AI-first Electronic Health Record platform, focusing on building and operating reliable, scalable infrastructure and data pipelines. The role involves advancing automation, observability, and AI-assisted reliability practices, including exploring Generative AI and intelligent automation for incident response and system resilience. Responsibilities include designing and implementing AI-driven solutions for operational efficiency, infrastructure lifecycle management, and autonomous operations, while also supporting data technologies and cloud environments.

What you'd actually do

  1. Design, build, and operate highly reliable, scalable infrastructure and data pipelines that power mission-critical analytics globally.
  2. Advance automation, observability, and AI-assisted reliability practices.
  3. Explore the use of Generative AI and intelligent automation to improve incident response, system resilience, and operational efficiency.
  4. Design and implement GenAI-powered or agent-based solutions for observability and anomaly detection, incident triage and remediation, and infrastructure provisioning and lifecycle management.
  5. Build tools and frameworks that enable self-service and autonomous operations.

Skills

Required

  • Experience building and operating high-availability, fault-tolerant systems
  • Strong understanding of distributed systems, performance monitoring, and resiliency patterns
  • Experience with incident response, root-cause analysis, and production troubleshooting
  • Hands-on experience applying Generative AI or Agentic AI (e.g., LangChain, AutoGPT, custom agents) to infrastructure lifecycle management, observability and anomaly detection, and incident response and remediation automation
  • Ability to design or integrate AI-driven workflows for operational efficiency and reliability
  • Familiarity with building or integrating autonomous agents for DevOps/SRE use cases
  • Strong experience with multi-cloud environments (OCI, AWS/Azure)
  • Deep understanding of cloud infrastructure design, deployment, and resource optimization
  • Experience managing hybrid or cross-cloud architectures
  • Advanced competency in CI/CD pipelines (Jenkins, Kubernetes)
  • Infrastructure as Code (Terraform)
  • Observability tools (Prometheus, Grafana)
  • Strong focus on automation-first operations
  • Proficiency in Data Warehousing platforms (e.g., Vertica, Snowflake)
  • Experience with ETL frameworks and large-scale data processing
  • Understanding of columnar storage systems
  • Experience supporting or integrating BI tools (Tableau, Power BI, Oracle Analytics)
  • Strong proficiency in Python, Java, or Go
  • Experience with Docker, Kubernetes, and shell scripting
  • Strong troubleshooting skills with ability to perform root-cause analysis
  • Experience resolving complex production issues in distributed systems
  • Implement and optimize infrastructure for Oracle HDI Analytics Platform
  • Ensure system uptime, reliability, and scalability
  • Build and optimize scalable data pipelines using Vertica and ETL frameworks
  • Apply DevOps/SRE practices to automate deployments and operations
  • Enhance observability using Prometheus/Grafana and AI-driven insights
  • Support multi-cloud initiatives across OCI, AWS, and Azure
  • Optimize cost, performance, and compliance across environments
  • Participate in on-call rotations
  • Implement preventative and automated remediation solutions
  • Work closely with engineers to execute technical roadmaps
  • Contribute to code reviews and infrastructure improvements
  • 4+ years of software engineering, cloud infrastructure, SRE, or DevOps experience
  • Proven ownership of production system reliability in cloud environments
  • Cloud infrastructure design and automation
  • Distributed systems and performance optimization
  • Data warehousing and ETL frameworks
  • Demonstrated experience applying GenAI / LLMs / agentic frameworks to infrastructure or operations
  • Experience building or integrating AI-powered automation for DevOps/SRE workflows
  • Familiarity with tools like LangChain, AutoGPT, or custom AI agents
  • Terraform, Docker, Kubernetes
  • Observability stacks (Prometheus, Grafana)
  • Python, Java, or Go
  • Strong problem-solving mindset with a focus on automation and scalability
  • Experience improving system reliability through intelligent automation

Nice to have

  • Experience in healthcare or regulated environments (HIPAA, compliance frameworks)
  • Experience working in environments requiring security clearance
  • Experience building self-healing or autonomous infrastructure systems

What the JD emphasized

  • U.S. citizenship is required for this position
  • Hands-on experience applying Generative AI or Agentic AI
  • Ability to design or integrate AI-driven workflows
  • Familiarity with building or integrating autonomous agents
  • AI-Native Engineering (NEW)
  • AI-Driven Automation (NEW)
  • Demonstrated experience applying GenAI / LLMs / agentic frameworks to infrastructure or operations
  • Experience building or integrating AI-powered automation for DevOps/SRE workflows
  • Familiarity with tools like LangChain, AutoGPT, or custom AI agents

Other signals

  • AI-first Electronic Health Record platform
  • Generative AI and intelligent automation to improve incident response, system resilience, and operational efficiency
  • AI-driven automation for observability, anomaly detection, incident triage, and remediation
  • Applying Generative AI or Agentic AI to infrastructure lifecycle management, observability and anomaly detection, and incident response and remediation automation