Site Reliability Engineer, Dublin

Apple · Big Tech · Dublin, Ireland · Software and Services

Site Reliability Engineer role at Apple Services Engineering (ASE), focused on building and enhancing massive clusters for virtual machines, containers, and associated infrastructure. The role involves designing and developing tooling, frameworks, and automation for reliability, scalability, and operational efficiency; defining and implementing SLOs/SLIs; leading incident response; and contributing to platform architecture. It requires expertise in cloud operations and infrastructure-as-a-service, plus strong software development skills in Go and Java.

What you'd actually do

  1. Design and develop tooling, frameworks, and automation in Go and Java to improve reliability, scalability, and operational efficiency of compute infrastructure (VMs, containers, orchestration).
  2. Define and implement SLOs/SLIs for compute services and build the observability pipelines (metrics, logging, tracing) to measure and enforce them.
  3. Lead incident response for compute infrastructure, driving triage, root cause analysis, and postmortem corrective actions.
  4. Develop and maintain infrastructure-as-code and CI/CD pipelines, ensuring reproducibility, automated testing, and staged rollouts across the fleet.
  5. Contribute to compute platform architecture through design reviews, technical design documents, production readiness reviews, capacity planning, and disaster recovery exercises.

Skills

Required

  • cloud operations
  • infrastructure-as-a-service (compute, storage, and network virtualization)
  • Go
  • Java
  • building production services, tools or automation frameworks
  • software development lifecycle practices (version control, code review, CI/CD, automated testing)
  • operating and engineering large-scale, multi-tenant infrastructure as a managed service
  • articulating complex technical concepts to both technical and non-technical stakeholders

Nice to have

  • infrastructure-as-a-service orchestration tools (OpenStack, CloudStack, etc.)
  • Linux system virtualization (libvirt, KVM, QEMU, etc.) and the associated APIs
  • implementing and coordinating telemetry with monitoring and observability tools (Splunk, Grafana, Prometheus)
  • building internal platforms or developer tooling
  • distributed systems concepts

What the JD emphasized

  • Must be an expert and have in-depth professional experience with cloud operations, with a focus on “infrastructure-as-a-service” (compute, storage, and network virtualization).