Site Reliability Engineering Manager

Apple · Big Tech · Bengaluru, Karnataka, India +1 · Software and Services

This role is for a Site Reliability Engineering Manager at Apple, supporting scalable and resilient distributed systems for data pipelines and analytics platforms. The role involves leading an SRE team and driving operational excellence, incident response, and automation for data platforms. Although the team sits within the AI & Data Platforms (AiDP) organization and the listing mentions generative AI, the core responsibilities focus on SRE for data infrastructure, not direct AI/ML model development or research.

What you'd actually do

  1. Provide technical leadership and guidance to the SRE team by applying hands-on skills and continuous learning. Build and mentor a world-class engineering team that partners closely with platform teams to design scalable, reliable systems, while contributing actively to both platform and application code.
  2. Manage Infrastructure as Code (IaC) and develop tooling to enhance engineering productivity. Lead initiatives for cost optimization and operational efficiency at scale.
  3. Actively participate in on-call rotations and resolve critical production issues. Lead response efforts during major incidents and serve as the primary escalation point for complex problems.
  4. Perform root cause investigations and ensure follow-up with actionable postmortems and infrastructure hardening initiatives. Implement fixes—in code, infrastructure, or processes—to prevent recurrence.
  5. Partner closely with engineering teams to troubleshoot issues, deploy fixes, and enhance system reliability. Champion operational excellence through direct technical contributions.

Skills

Required

  • Site Reliability Engineering (SRE)
  • people management
  • cloud or hybrid environments
  • cloud-native services
  • ETL frameworks (e.g., Apache Spark, Flink)
  • messaging systems (e.g., Kafka)
  • cloud infrastructure & services (e.g., AWS, GCP, Kubernetes)
  • Observability tools (e.g., Prometheus, Grafana, CloudWatch)
  • Python
  • Java
  • Scala
  • incident response
  • root cause analysis
  • system reliability improvements
  • Bachelor’s degree or equivalent

Nice to have

  • enterprise data systems on distributed architectures
  • data visualization tools such as Tableau, Business Objects, or ThoughtSpot
  • modern & distributed databases such as Snowflake, Cassandra, SingleStore, or SAP HANA
  • Generative AI or automation tools for issue detection, alerting, or remediation
  • system design
  • data structures
  • incident management best practices

What the JD emphasized

  • 10+ years of experience in Site Reliability Engineering (SRE) or a related domain.
  • 2+ years of direct people-management experience, including leading, hiring, developing, and building engineering teams.