Principal Site Reliability Engineer

Disney · Media · Bay Lake, FL +1

Principal Site Reliability Engineer responsible for shaping SRE culture, advocating for service level management, and designing, building, and supporting products and platforms. The role leads the adoption of AI/LLM-assisted reliability engineering, architects and operates AI-enabled capabilities, and optimizes system reliability through advanced analytics. It requires expertise in evaluating GPT model families, engineering RAG systems, and automating infrastructure and operations.

What you'd actually do

  1. Lead and shape the SRE culture, mentoring others on applying it to drive observability and reduce toil
  2. Advocate for accelerated adoption of service level management, including advancing the adoption and tracking of SLIs, SLOs, and SLAs for all systems and applications in the assigned portfolio
  3. Elevate and lead the design, build, and support of products and platforms by serving as a thought leader who considers which greenfield products should be used and evaluates build vs. buy decisions
  4. Lead the adoption of AI/LLM-assisted reliability engineering by building and governing secure, production-grade workflows while also architecting and operating AI-enabled reliability capabilities
  5. Drive development pipelines, automate infrastructure and operations, create telemetry for monitoring, engineer for high reliability, and reinforce best practices to secure company data

Skills

Required

  • Site Reliability Engineering
  • observability
  • DevOps toolset
  • cloud-agnostic solutions
  • AWS
  • Azure
  • GCP
  • GPT model families
  • RAG systems
  • configuration management
  • orchestration tools
  • Terraform
  • CloudFormation
  • Ansible
  • Chef
  • AI for system reliability
  • Linux
  • Windows

Nice to have

  • high-demand releases
  • PCI audit and standards experience

What the JD emphasized

  • Minimum 10 years of related work experience
  • Demonstrated experience in advancing the maturity of Site Reliability Engineering in an enterprise-scale environment
  • Expertise in defining and implementing industry-leading observability strategies across diverse and highly complex distributed systems, ensuring optimal performance and reliability
  • Comprehensive knowledge of, and hands-on experience with, a DevOps toolset for source control management, continuous integration/continuous deployment, orchestration, containerization, application performance management, observability, and reliability testing
  • Demonstrated experience in engineering cloud-agnostic solutions using multiple Cloud service providers including AWS, Azure, and GCP
  • Demonstrated experience evaluating and applying multiple GPT model families across cost/latency/quality tradeoffs, and engineering scalable MCP toolchains plus high-signal RAG systems to improve operational outcomes
  • Mastery of architecting and managing highly available, scalable, and automated infrastructure using configuration management and orchestration tools such as Terraform, CloudFormation, Ansible, and Chef, contributing to the organization's strategic objectives and competitive advantage
  • Expertise in using AI to optimize system reliability through advanced analytics and prescriptive recommendations

Other signals

  • AI/LLM-assisted reliability engineering
  • architecting and operating AI-enabled reliability capabilities
  • evaluating and applying multiple GPT model families
  • engineering scalable MCP toolchains plus high-signal RAG systems
  • using AI to optimize system reliability