Senior Infrastructure & Reliability Engineer

Oracle · Enterprise · United States

Senior Infrastructure & Reliability Engineer on Oracle's Health Data Intelligence team, focused on SRE for large-scale healthcare analytics platforms. The role covers designing, building, and operating reliable infrastructure and data pipelines, with a strong emphasis on automation, observability, and AI-assisted reliability practices. That includes applying Generative AI and intelligent automation to incident response, system resilience, and operational efficiency, and designing and implementing GenAI-powered or agent-based solutions for observability, incident triage, and infrastructure management.

What you'd actually do

  1. Design, build, and operate reliable, scalable, and secure infrastructure supporting large-scale analytics workloads
  2. Improve system reliability through automation, monitoring, and performance optimization
  3. Contribute to the adoption of AI-assisted approaches for operations, including:
       - Enhancing observability and alerting
       - Supporting automated incident detection and remediation
       - Exploring intelligent automation for infrastructure lifecycle management
  4. Partner with development teams to enhance service architecture, scalability, and operability
  5. Participate in on-call rotations and act as an escalation point for complex production issues

Skills

Required

  • Experience building and operating high-availability, fault-tolerant systems
  • Strong understanding of distributed systems, performance monitoring, and resiliency patterns
  • Experience with incident response, root-cause analysis, and production troubleshooting
  • Hands-on experience applying Generative AI or Agentic AI (e.g., LangChain, AutoGPT, custom agents) to:
       - Infrastructure lifecycle management
       - Observability and anomaly detection
       - Incident response and remediation automation
  • Ability to design or integrate AI-driven workflows for operational efficiency and reliability
  • Familiarity with building or integrating autonomous agents for DevOps/SRE use cases
  • Strong experience with multi-cloud environments (OCI, AWS/Azure)
  • Deep understanding of cloud infrastructure design, deployment, and resource optimization
  • Experience managing hybrid or cross-cloud architectures
  • Advanced competency in CI/CD pipelines (Jenkins, Kubernetes)
  • Infrastructure as Code (Terraform)
  • Observability tools (Prometheus, Grafana)
  • Strong focus on automation-first operations
  • Proficiency in Data Warehousing platforms (e.g., Vertica, Snowflake)
  • Experience with ETL frameworks and large-scale data processing
  • Understanding of columnar storage systems
  • Experience supporting or integrating BI tools (Tableau, Power BI, Oracle Analytics)
  • Strong proficiency in Python, Java, or Go
  • Experience with Docker, Kubernetes, and shell scripting
  • Strong troubleshooting skills with ability to perform root-cause analysis
  • Experience resolving complex production issues in distributed systems
  • 8+ years of software engineering experience, with 5+ years in cloud infrastructure, SRE, or DevOps
  • Proven ownership of production system reliability in cloud environments
  • Cloud infrastructure design and automation
  • Distributed systems and performance optimization
  • Data warehousing and ETL frameworks
  • Demonstrated experience applying GenAI / LLMs / agentic frameworks to infrastructure or operations
  • Experience building or integrating AI-powered automation for DevOps/SRE workflows
  • Familiarity with tools like LangChain, AutoGPT, or custom AI agents
  • Terraform, Docker, Kubernetes
  • Observability stacks (Prometheus, Grafana)
  • Python, Java, or Go

Nice to have

  • AI-assisted reliability practices

What the JD emphasized

  • U.S. citizenship is required for this position, as the successful candidate will be required to obtain (and maintain) a U.S. government security clearance after hire.
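  • Hands-on experience applying Generative AI or Agentic AI (e.g., LangChain, AutoGPT, custom agents) to infrastructure lifecycle management, observability and anomaly detection, and incident response and remediation automation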
  • Ability to design or integrate AI-driven workflows for operational efficiency and reliability
  • Familiarity with building or integrating autonomous agents for DevOps/SRE use cases
  • Demonstrated experience applying GenAI / LLMs / agentic frameworks to infrastructure or operations
  • Experience building or integrating AI-powered automation for DevOps/SRE workflows

Other signals

  • AI-Native Engineering
  • AI-Driven Automation
  • Generative AI
  • Agentic AI