Senior AI Site Reliability Developer 3

Oracle · Enterprise · United States

This role focuses on building and operating the infrastructure for an AI-first Electronic Health Record platform. The Senior AI Site Reliability Developer will design, build, and operate reliable, scalable infrastructure and data pipelines, with a strong emphasis on advancing automation, observability, and AI-assisted reliability practices. This includes exploring and applying Generative AI and agentic AI for incident response, system resilience, and operational efficiency, as well as managing cloud environments and data technologies.

What you'd actually do

  1. Design, build, and operate highly reliable, scalable infrastructure and data pipelines that power mission-critical analytics globally.
  2. Explore the use of Generative AI and intelligent automation to improve incident response, system resilience, and operational efficiency.
  3. Design and implement GenAI-powered or agent-based solutions for observability and anomaly detection, incident triage and remediation, and infrastructure provisioning and lifecycle management.
  4. Build and optimize scalable data pipelines using Vertica and ETL frameworks.
  5. Apply DevOps/SRE practices to automate deployments and operations.

Skills

Required

  • Experience building and operating high-availability, fault-tolerant systems
  • Strong understanding of distributed systems, performance monitoring, and resiliency patterns
  • Experience with incident response, root-cause analysis, and production troubleshooting
  • Hands-on experience applying Generative AI or agentic AI (e.g., LangChain, AutoGPT, custom agents) to infrastructure lifecycle management, observability and anomaly detection, and incident response and remediation automation
  • Ability to design or integrate AI-driven workflows for operational efficiency and reliability
  • Familiarity with building or integrating autonomous agents for DevOps/SRE use cases
  • Strong experience with multi-cloud environments (OCI, AWS/Azure)
  • Deep understanding of cloud infrastructure design, deployment, and resource optimization
  • Experience managing hybrid or cross-cloud architectures
  • Advanced competency in CI/CD pipelines (Jenkins, Kubernetes)
  • Infrastructure as Code (Terraform)
  • Observability tools (Prometheus, Grafana)
  • Strong focus on automation-first operations
  • Proficiency in Data Warehousing platforms (e.g., Vertica, Snowflake)
  • Experience with ETL frameworks and large-scale data processing
  • Understanding of columnar storage systems
  • Experience supporting or integrating BI tools (Tableau, Power BI, Oracle Analytics)
  • Strong proficiency in Python, Java, or Go
  • Experience with Docker, Kubernetes, and shell scripting
  • Strong troubleshooting skills, with the ability to perform root-cause analysis
  • Experience resolving complex production issues in distributed systems
  • Implement and optimize infrastructure for Oracle HDI Analytics Platform
  • Ensure system uptime, reliability, and scalability
  • Build tools and frameworks that enable self-service and autonomous operations
  • Enhance observability using Prometheus/Grafana and AI-driven insights
  • Support multi-cloud initiatives across OCI, AWS, and Azure
  • Optimize cost, performance, and compliance across environments
  • Participate in on-call rotations
  • Implement preventative and automated remediation solutions
  • Work closely with engineers to execute technical roadmaps
  • Contribute to code reviews and infrastructure improvements
  • 4+ years of software engineering, cloud infrastructure, SRE, or DevOps experience
  • Proven ownership of production system reliability in cloud environments
  • Cloud infrastructure design and automation
  • Distributed systems and performance optimization
  • Data warehousing and ETL frameworks
  • Demonstrated experience applying GenAI / LLMs / agentic frameworks to infrastructure or operations
  • Experience building or integrating AI-powered automation for DevOps/SRE workflows
  • Familiarity with tools like LangChain, AutoGPT, or custom AI agents
  • Terraform, Docker, Kubernetes
  • Observability stacks (Prometheus, Grafana)
  • Python, Java, or Go

Nice to have

  • Experience in healthcare or regulated environments (HIPAA, compliance frameworks)
  • Experience working in environments requiring security clearance
  • Experience building self-healing or autonomous infrastructure systems

What the JD emphasized

  • U.S. citizenship is required for this position, as the successful candidate will be required to obtain (and maintain) a U.S. government security clearance after hire.

Other signals

  • AI-assisted reliability practices
  • Generative AI and intelligent automation
  • AI-driven automation