Site Reliability Engineer II

Disney · Media · New York, NY +1

A Site Reliability Engineer II role at Disney focused on improving the performance, resiliency, and operational excellence of backend services. Responsibilities include building automation, enhancing observability, collaborating with engineering teams on SRE principles, and participating in incident response for critical systems supporting Disney's media products.

What you'd actually do

  1. Contribute to the design, implementation, and improvement of systems to enhance reliability, scalability, and performance.
  2. Build and maintain automation for deployment, monitoring, alerting, and operational workflows.
  3. Collaborate with software engineering teams to implement SRE best practices, including SLIs, SLOs, error budgets, and automated remediation.
  4. Develop tools, dashboards, and instrumentation to improve observability across metrics, logs, and distributed tracing.
  5. Participate in incident response, root cause analysis (RCA), and corrective actions to prevent recurrence.

Skills

Required

  • 3+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, or related discipline.
  • Hands-on experience with cloud platforms – AWS (preferred), GCP, Azure.
  • Proficiency in Python, Go, JavaScript, Bash, or equivalent scripting languages.
  • Working knowledge of Linux or Unix-based systems.
  • Experience with CI/CD systems (e.g., GitHub Actions, GitLab CI, Jenkins).
  • Familiarity with Infrastructure-as-Code (Terraform, CloudFormation, etc.).
  • Experience with containerization technologies such as Docker and Kubernetes.
  • Understanding of networking fundamentals, distributed systems, and system design basics.
  • Strong analytical and troubleshooting skills, including the ability to diagnose complex system issues.
  • Ability to work both independently and collaboratively.
  • Strong communication skills and the ability to collaborate effectively with cross-functional teams.

Nice to have

  • Hands-on experience with observability stacks (Prometheus, Grafana, ELK/EFK, Datadog, Splunk, New Relic).
  • Exposure to GitOps tooling (Argo CD, Flux).
  • Experience contributing to SLO/SLI frameworks and implementing error budgets.
  • Knowledge of service mesh architectures (Istio, Linkerd).
  • Familiarity with performance testing and load testing tools.
  • Experience with message queues, event-driven systems, or distributed data platforms.
  • Cloud or DevOps-related certifications (AWS Associate/Specialty, GCP Profe

What the JD emphasized

  • Fostering innovation is a critical component to success here at Disney Entertainment and ESPN Product & Technology. Therefore, the ideal candidate will also need to be highly adaptable to changes and be able to pivot when required.