Principal AI Site Reliability Engineer (US Remote)

Oracle · Enterprise · United States

This role focuses on Site Reliability Engineering for large-scale healthcare analytics platforms, with a significant emphasis on applying Generative AI and agentic AI to improve infrastructure lifecycle management, observability, incident response, and operational efficiency. The role involves designing, building, and operating reliable infrastructure, advancing automation, and exploring AI-driven solutions for DevOps/SRE use cases within multi-cloud environments.

What you'd actually do

  1. Design, build, and operate reliable, scalable, and secure infrastructure supporting large-scale analytics workloads
  2. Improve system reliability through automation, monitoring, and performance optimization
  3. Contribute to the adoption of AI-assisted approaches for operations, including enhancing observability and alerting, supporting automated incident detection and remediation, and exploring intelligent automation for infrastructure lifecycle management
  4. Partner with development teams to enhance service architecture, scalability, and operability
  5. Perform root cause analysis and implement long-term fixes to prevent recurrence

Skills

Required

  • Experience building and operating high-availability, fault-tolerant systems
  • Strong understanding of distributed systems, performance monitoring, and resiliency patterns
  • Experience with incident response, root-cause analysis, and production troubleshooting
  • Hands-on experience applying Generative AI or Agentic AI (e.g., LangChain, AutoGPT, custom agents) to infrastructure lifecycle management, observability and anomaly detection, and incident response and remediation automation
  • Ability to design or integrate AI-driven workflows for operational efficiency and reliability
  • Familiarity with building or integrating autonomous agents for DevOps/SRE use cases
  • Strong experience with multi-cloud environments (OCI, AWS/Azure)
  • Deep understanding of cloud infrastructure design, deployment, and resource optimization
  • Experience managing hybrid or cross-cloud architectures
  • Advanced competency in CI/CD pipelines (Jenkins, Kubernetes)
  • Infrastructure as Code (Terraform)
  • Observability tools (Prometheus, Grafana)
  • Strong focus on automation-first operations
  • Proficiency in Data Warehousing platforms (e.g., Vertica, Snowflake)
  • Experience with ETL frameworks and large-scale data processing
  • Understanding of columnar storage systems
  • Experience supporting or integrating BI tools (Tableau, Power BI, Oracle Analytics)
  • Strong proficiency in Python, Java, or Go
  • Experience with Docker, Kubernetes, and shell scripting
  • Strong troubleshooting skills with the ability to perform root-cause analysis
  • Experience resolving complex production issues in distributed systems
  • 10+ years of software engineering experience, with 8+ years in cloud infrastructure, SRE, or DevOps
  • Proven ownership of production system reliability in cloud environments
  • Cloud infrastructure design and automation
  • Distributed systems and performance optimization
  • Data warehousing and ETL frameworks
  • Demonstrated experience applying GenAI / LLMs / agentic frameworks to infrastructure or operations
  • Experience building or integrating AI-powered automation for DevOps/SRE workflows
  • Familiarity with tools like LangChain, AutoGPT, or custom AI agents
  • Terraform, Docker, Kubernetes
  • Observability stacks (Prometheus, Grafana)
  • Python, Java, or Go

What the JD emphasized

  • U.S. citizenship is required for this position, as the successful candidate will be required to obtain (and maintain) a U.S. government security clearance after hire.
  • Hands-on experience applying Generative AI or Agentic AI (e.g., LangChain, AutoGPT, custom agents) to infrastructure lifecycle management, observability and anomaly detection, and incident response and remediation automation
  • Ability to design or integrate AI-driven workflows for operational efficiency and reliability
  • Familiarity with building or integrating autonomous agents for DevOps/SRE use cases
  • Demonstrated experience applying GenAI / LLMs / agentic frameworks to infrastructure or operations
  • Experience building or integrating AI-powered automation for DevOps/SRE workflows

Other signals

  • AI-assisted reliability practices
  • Generative AI and intelligent automation
  • AI-driven automation