Site Reliability Engineer, iCloud

Apple · Big Tech · London, United Kingdom +1 · Software and Services

Site Reliability Engineer for Apple Services (iCloud, Photos, Mail, Drive, Backup) focusing on building and supporting highly available, scalable, customer-facing services. Responsibilities include infrastructure design, operation, monitoring, automation, and incident response for large-scale distributed systems.

What you'd actually do

  1. Engage with our product teams to understand requirements, then design and implement resilient, scalable infrastructure solutions.
  2. Operate, monitor, and triage all aspects of our production and non-production environments.
  3. Collaborate on code, infrastructure, design reviews, and process enhancements.
  4. Evaluate and integrate new technologies to improve system reliability, security, and performance.
  5. Develop and implement automation to provision, configure, deploy, and monitor Apple services.

Skills

Required

  • managing and scaling distributed systems
  • deploying, supporting and supervising services, platforms, and application stacks
  • observability platforms (Splunk, Grafana, Prometheus)
  • Java, Python, or Go
  • Kubernetes, Nginx, Envoy, Prometheus, Docker

Nice to have

  • networking protocols (HTTP, DNS, TCP/IP, etc.)
  • Linux Operating System internals
  • iOS app development (Xcode, Swift)
  • OpenTelemetry Standards / distributed tracing (Jaeger)

What the JD emphasized

  • 5+ years of experience managing and scaling distributed systems in a public, private, or hybrid cloud environment.
  • Strong experience deploying, supporting, and supervising new and existing services, platforms, and application stacks.
  • Experience with observability platforms such as Splunk, Grafana, and Prometheus.
  • Demonstrable fluency in at least one of the following languages: Java, Python, or Go.
  • Experience with Kubernetes, Nginx, Envoy, Prometheus, and/or Docker.