Site Reliability Engineer, iCloud at Apple

What you'd actually do

Engage with our product teams to understand requirements, and design and implement resilient and scalable infrastructure solutions.

Operate, monitor, and triage all aspects of our production and non-production environments.

Collaborate on code, infrastructure, design reviews, and process enhancements.

Evaluate and integrate new technologies to improve system reliability, security, and performance.

Develop and implement automation to provision, configure, deploy, and monitor Apple services.

Skills

Required

managing and scaling distributed systems
deploying, supporting and supervising services, platforms, and application stacks
observability platforms (Splunk, Grafana, Prometheus)
Java, Python, or Go
Kubernetes, Nginx, Envoy, Prometheus, Docker

Nice to have

networking protocols (HTTP, DNS, TCP/IP, etc.)
Linux Operating System internals
iOS app development (Xcode, Swift)
OpenTelemetry Standards / distributed tracing (Jaeger)

What the JD emphasized

5+ years of experience managing and scaling distributed systems in a public, private, or hybrid cloud environment.

Strong experience deploying, supporting, and supervising new and existing services, platforms, and application stacks.

Experience with observability platforms such as Splunk, Grafana, and Prometheus.

Demonstrable fluency in at least one of the following languages: Java, Python, or Go.

Experience with Kubernetes, Nginx, Envoy, Prometheus, and/or Docker.

People at Apple don’t just build products — they craft experiences our customers love and depend on. Apple Services Engineering (ASE) builds and supports the systems that make many of these daily experiences possible. If you’ve used Apple products, you’ve likely interacted with us. Apple Services Site Reliability Engineering (SRE) teams are responsible for the systems and services that directly support those customers and their experiences. We are looking for an SRE with experience in building and supporting highly available customer-facing services.

Description

Apple Services’ scale is BIG. Operating at our scale, across multiple geographies and serving hundreds of millions of users, presents unique challenges. As a Software Developer in SRE at Apple, you'll need to solve these problems using data, teamwork, and your own expertise. ASE Products Site Reliability teams are responsible for the reliability and performance of the server software stack that powers products like iCloud Photos, Mail, Drive, Backup, and many more. We do that by focusing on reliability best practices from service inception to production, collaborating deeply with product development teams to deliver a superlative product and shared vision while leveraging data and automation as first principles. We run a mix of open-source, vendor-licensed, and internally developed tools to manage the end-to-end SDLC of our products. You'll learn these tools and have opportunities to improve them.

Responsibilities

Engage with our product teams to understand requirements, and design and implement resilient and scalable infrastructure solutions.

Operate, monitor, and triage all aspects of our production and non-production environments.

Collaborate on code, infrastructure, design reviews, and process enhancements.

Evaluate and integrate new technologies to improve system reliability, security, and performance.

Develop and implement automation to provision, configure, deploy, and monitor Apple services.

Participate in an on-call rotation providing hands-on technical expertise during service-impacting events.

Contribute to capacity planning, scale testing, and disaster recovery exercises.

Approach operational problems with a software engineering mindset.

Minimum Qualifications

Strong sense of ownership, customer service, and integrity proven through clear communication.

BS in Computer Science or related field, or equivalent employment.

5+ years of experience managing and scaling distributed systems in a public, private, or hybrid cloud environment.

Strong experience deploying, supporting, and supervising new and existing services, platforms, and application stacks.

Experience with scale testing, disaster recovery, and capacity planning.

Experience with observability platforms such as Splunk, Grafana, and Prometheus.

Demonstrable fluency in at least one of the following languages: Java, Python, or Go.

Experience with Kubernetes, Nginx, Envoy, Prometheus, and/or Docker.

Preferred Qualifications

Understanding of standard networking protocols and components such as HTTP, DNS, ECMP, TCP/IP, ICMP, the OSI model, subnetting, and load balancing strategies.

Understanding of the Linux operating system, including the kernel, memory, processes, threads, static and shared libraries, IPC, and signals.

Experience developing iOS apps using Xcode and Swift.

Experience with OpenTelemetry standards and distributed tracing tools such as Jaeger.

At Apple, we're not all the same. And that's our greatest strength. We draw on the differences in who we are, what we've experienced and how we think. Because to create products that serve everyone, we believe in including everyone. Therefore, we are committed to treating all applicants fairly and equally. As a registered Disability Confident employer, we will work with applicants to make any reasonable accommodations. Apple will consider for employment all qualified applicants with criminal backgrounds in a manner consistent with applicable law.

At Apple, we believe accessibility is a fundamental human right. You’ll find that idea reflected in everything here — in our culture, our benefits and our digital tools. By welcoming as many perspectives as possible, we help you build a career where you feel like you belong.