Sr Principal Site Reliability Engineer

Disney · Media · San Francisco, CA +1

Sr Principal Site Reliability Engineer on Disney's Media Engineering team, focused on 99.999% incident-free uptime for content distribution platforms. Responsibilities include driving platform stability, developing instrumentation and alerting, managing redundancy and resiliency, leading incident response, partnering with Infrastructure, Operations, Product, and Development teams, and driving automation for releases and efficiency. Requires extensive engineering leadership experience, knowledge of large-scale distributed platforms, and experience with media streaming technologies.

What you'd actually do

  1. Accountable for platform stability and uptime, from processing platform and content supply chain through CDN delivery and playback.
  2. Develop a solid understanding of all critical data flows and ensure proper instrumentation and alerting practices.
  3. Drive redundancy and resiliency strategy across thousands of servers and network links in both datacenter and cloud environments.
  4. Responsible for Media Engineering’s Incident Response process, including follow-ups and proactive actions to avoid service incidents.
  5. Partner with Infrastructure, Operations, Product, and Development teams to ensure best practices, conducting audits and reviews across each domain.

Skills

Required

  • Minimum of 12 years of engineering leadership experience, including managing and influencing teams directly and indirectly
  • Bachelor's or higher degree in Engineering or a related field, or equivalent experience.
  • Experience working across complex, globally connected teams with a variety of stakeholders
  • Experience with large-scale globally distributed platforms including content preparation, distribution, playback, operations, and infrastructure
  • Possess a vision for exceptional escalation management and engineering excellence.
  • Ability to develop, implement, and socialize strategies and tactics to drive improvement in stability, system performance, team capability, and operational efficiency.
  • Knowledge of how to use data to understand and improve business performance.
  • Track record of developing strong cross-functional and cross-regional relationships.

Nice to have

  • Direct experience with major Content Delivery Network integrations
  • Experience in media streaming technologies, especially media processing workflows and tooling, media players and devices, and content delivery strategies.
  • Experience with both high-scale back-end services (cloud and datacenter) and client development on a variety of devices (mobile, web, living room devices).
  • Experience with Media Operations and/or Infrastructure management

What the JD emphasized

  • 99.999% incident-free uptime
  • thousands of servers
  • datacenter and cloud environments
  • Incident Response process
  • automation strategy
  • rapid safe releases
  • tight content SLAs
  • operational efficiency