Principal Infrastructure & Reliability Engineer

Oracle · Enterprise · United States

This role focuses on Site Reliability Engineering for large-scale healthcare analytics platforms, with a significant emphasis on applying Generative AI and agentic AI to improve infrastructure management, observability, incident response, and operational efficiency. The candidate will design, build, and operate reliable infrastructure and data pipelines, and explore AI-assisted reliability practices.

What you'd actually do

  1. Design, build, and operate reliable, scalable, and secure infrastructure supporting large-scale analytics workloads
  2. Improve system reliability through automation, monitoring, and performance optimization
  3. Contribute to the adoption of AI-assisted approaches for operations, including enhancing observability and alerting, supporting automated incident detection and remediation, and exploring intelligent automation for infrastructure lifecycle management
  4. Partner with development teams to enhance service architecture, scalability, and operability
  5. Participate in on-call rotations and act as an escalation point for complex production issues

Skills

Required

  • Experience building and operating high-availability, fault-tolerant systems
  • Strong understanding of distributed systems, performance monitoring, and resiliency patterns
  • Experience with incident response, root-cause analysis, and production troubleshooting
  • Hands-on experience applying Generative AI or Agentic AI (e.g., LangChain, AutoGPT, custom agents) to infrastructure lifecycle management, observability and anomaly detection, and incident response and remediation automation
  • Ability to design or integrate AI-driven workflows for operational efficiency and reliability
  • Familiarity with building or integrating autonomous agents for DevOps/SRE use cases
  • Strong experience with multi-cloud environments (OCI, AWS/Azure)
  • Deep understanding of cloud infrastructure design, deployment, and resource optimization
  • Experience managing hybrid or cross-cloud architectures
  • Advanced competency in CI/CD pipelines (Jenkins, Kubernetes)
  • Infrastructure as Code (Terraform)
  • Observability tools (Prometheus, Grafana)
  • Strong focus on automation-first operations
  • Proficiency in Data Warehousing platforms (e.g., Vertica, Snowflake)
  • Experience with ETL frameworks and large-scale data processing
  • Understanding of columnar storage systems
  • Experience supporting or integrating BI tools (Tableau, Power BI, Oracle Analytics)
  • Strong proficiency in Python, Java, or Go
  • Experience with Docker, Kubernetes, and shell scripting
  • Strong troubleshooting skills with ability to perform root-cause analysis
  • Experience resolving complex production issues in distributed systems
  • 10+ years of software engineering experience, with 8+ years in cloud infrastructure, SRE, or DevOps
  • Proven ownership of production system reliability in cloud environments
  • Cloud infrastructure design and automation
  • Distributed systems and performance optimization
  • Data warehousing and ETL frameworks
  • Demonstrated experience applying GenAI / LLMs / agentic frameworks to infrastructure or operations
  • Experience building or integrating AI-powered automation for DevOps/SRE workflows
  • Familiarity with tools like LangChain, AutoGPT, or custom AI agents
  • Terraform, Docker, Kubernetes
  • Observability stacks (Prometheus, Grafana)
  • Python, Java, or Go

Nice to have

  • AI-assisted reliability practices

What the JD emphasized

  • U.S. citizenship is required for this position, as the successful candidate will be required to obtain (and maintain) a U.S. government security clearance after hire.
  • Hands-on experience applying Generative AI or Agentic AI (e.g., LangChain, AutoGPT, custom agents) to infrastructure lifecycle management, observability and anomaly detection, and incident response and remediation automation
  • Ability to design or integrate AI-driven workflows for operational efficiency and reliability
  • Familiarity with building or integrating autonomous agents for DevOps/SRE use cases

Other signals

  • Applying Generative AI or Agentic AI to infrastructure lifecycle management
  • Designing or integrating AI-driven workflows for operational efficiency
  • Building or integrating autonomous agents for DevOps/SRE use cases