What you'd actually do

Provide technical leadership and guidance to the SRE team by applying hands-on skills and continuous learning. Build and mentor a world-class engineering team that partners closely with platform teams to design scalable, reliable systems, while contributing actively to both platform and application code.

Manage Infrastructure as Code (IaC) and develop tooling to enhance engineering productivity. Lead initiatives for cost optimization and operational efficiency at scale.

Actively participate in on-call rotations and resolve critical production issues. Lead response efforts during major incidents and serve as the primary escalation point for complex problems.

Perform root cause investigations and ensure follow-up with actionable postmortems and infrastructure hardening initiatives. Implement fixes—in code, infrastructure, or processes—to prevent recurrence.

Partner closely with engineering teams to troubleshoot issues, deploy fixes, and enhance system reliability. Champion operational excellence through direct technical contributions.

Skills

Required

Site Reliability Engineering (SRE)
people-management
cloud or hybrid environments
cloud-native services
ETL frameworks (e.g., Apache Spark, Flink)
messaging systems (e.g., Kafka)
cloud infrastructure & services (e.g., AWS, GCP, Kubernetes)
Observability tools (e.g., Prometheus, Grafana, CloudWatch)
Python
Java
Scala
incident response
root cause analysis
system reliability improvements
Bachelor’s degree or equivalent

Nice to have

enterprise data systems on distributed architectures
data visualization tools such as Tableau, Business Objects, or ThoughtSpot
modern & distributed databases such as Snowflake, Cassandra, SingleStore, or SAP HANA
Generative AI or automation tools for issue detection, alerting, or remediation
system design
data structures
incident management best practices

Apple is where individual imaginations gather together, committing to the values that lead to great work. Every new product we build, service we create, is the result of us making each other’s ideas stronger. That happens because every one of us shares a belief that we can make something wonderful and share it with the world, changing lives for the better. It’s the diversity of our people and their thinking that inspires the innovation that runs through everything we do. When we bring everybody in, we can do the best work of our lives. Here, you’ll do more than join something — you’ll add something.

Do you want to help build some of the largest and most consequential enterprise and customer technology systems in the world? Join Apple’s Information Systems and Technology (IS&T) organization. IS&T is the engine behind everything Apple does for customers and for the people who build for them. It’s Apple’s central nervous system. Supporting 2.5 billion active Apple devices, processing billions of secure transactions, and keeping the technology that defines modern life running flawlessly, IS&T makes the impossible feel effortless.

Do you love building solutions to handle global complexity and immense scale? Imagine what you could do here.

AI & Data Platforms (AiDP) is IS&T's engine for AI-powered innovation. The team brings together data, application development, and machine learning — including generative AI — along with data services and customer success functions, to help IS&T build solutions more efficiently and streamline the adoption and embedding of generative AI across Apple.

Description

We are seeking an experienced Site Reliability Engineering (SRE) Manager to support scalable and resilient distributed systems that power Apple's data pipelines and analytics platforms. Our Enterprise Data Warehouse landscape caters to a wide variety of real-time, near real-time and batch analytical solutions. These solutions are an integral part of business functions like Sales, Operations, Finance, AppleCare, Marketing and Internet Services, enabling business drivers to make critical decisions. We use proprietary and open-source technologies such as Kafka, Spark, Iceberg, Airflow, and others to build these solutions. If you are passionate about addressing infrastructure challenges at scale, both on-premises and in the cloud, and focused on optimizing scalable solutions by prioritizing ease of use and maintenance, you will discover exciting opportunities in AiDP.

As a hands-on SRE Manager, you’ll lead by example—actively driving operational excellence, contributing to code, and ensuring system reliability. You will be deeply involved in incident response across complex, distributed data platforms designed to support data exploration, analytics, and reporting solutions. These platforms operate at the unique intersection of high data volume and hybrid infrastructure, spanning both cloud and on-premise environments.

Responsibilities

Lead by Example: Provide technical leadership and guidance to the SRE team by applying hands-on skills and continuous learning. Build and mentor a world-class engineering team that partners closely with platform teams to design scalable, reliable systems, while contributing actively to both platform and application code.

Drive Automation for Data Platforms and Infrastructure: Manage Infrastructure as Code (IaC) and develop tooling to enhance engineering productivity. Lead initiatives for cost optimization and operational efficiency at scale.

Incident Response and On-Call Engagement: Actively participate in on-call rotations and resolve critical production issues. Lead response efforts during major incidents and serve as the primary escalation point for complex problems.

Drive Post-Incident Analysis: Perform root cause investigations and ensure follow-up with actionable postmortems and infrastructure hardening initiatives. Implement fixes in code, infrastructure, or processes to prevent recurrence.

Active Collaboration with Cross-Functional Teams: Partner closely with engineering teams to troubleshoot issues, deploy fixes, and enhance system reliability. Champion operational excellence through direct technical contributions.

Establish Production Readiness Standards: Take ownership of Application Security, Disaster Recovery, and Application Documentation, keeping documentation current with the latest system architecture and configurations.

Minimum Qualifications

10+ years of experience in Site Reliability Engineering (SRE) or a related domain.
2+ years of direct people-management experience, including leading, hiring, developing, and building engineering teams.
Hands-on experience supporting and maintaining applications in cloud or hybrid environments.
Expertise in cloud-native services, including ETL frameworks (e.g., Apache Spark, Flink) and messaging systems (e.g., Kafka).
Strong knowledge of cloud infrastructure & services (e.g., AWS, GCP, Kubernetes).
Experience with Observability tools (e.g., Prometheus, Grafana, CloudWatch).
Programming experience in Python, Java, or Scala.
Proven ability to lead incident response, perform root cause analysis, and drive system reliability improvements.
Bachelor’s degree or equivalent.

Preferred Qualifications

Hands-on experience supporting enterprise data systems on distributed architectures.
Exposure to data visualization tools such as Tableau, Business Objects, or ThoughtSpot, with experience supporting and troubleshooting related issues.
Experience with modern & distributed databases such as Snowflake, Cassandra, SingleStore, or SAP HANA.
Experience using Generative AI or automation tools for issue detection, alerting, or remediation.
Solid understanding of system design, data structures, and incident management best practices.

At Apple, we believe accessibility is a fundamental human right. You’ll find that idea reflected in everything here — in our culture, our benefits and our digital tools. By welcoming as many perspectives as possible, we help you build a career where you feel like you belong.

Learn about accessibility in Apple’s workplace

Site Reliability Engineering Manager
