Site Reliability Engineer, Enterprise Technology Services

Apple · Big Tech · Sunnyvale, CA · Software and Services

Site Reliability Engineer for Apple's Identity Management Services team, focusing on designing, building, and supporting critical platform services at scale. Responsibilities include ensuring high availability, reliability, and performance of authentication, authorization, and provisioning services; managing infrastructure, capacity planning, and disaster recovery; and driving automation. The role also involves monitoring, incident management, and applying ML to anomaly detection and GenAI to alert engineering.

What you'd actually do

  1. Drive Platform Reliability & SRE Standards: Lead the optimization of a large-scale Identity Management Platform, ensuring ultra-high availability, reliability, and performance for critical authentication, authorization, and provisioning services. Define and implement robust Service Level Indicators (SLIs), Objectives (SLOs), and Agreements (SLAs) to guide engineering teams toward ambitious reliability and observability goals.
  2. Architect & Engineer Resilient Systems: Design, build, and manage robust, distributed systems across cloud and on-premise infrastructure. Develop advanced capacity planning, disaster recovery, auto-failover, and data consistency mechanisms. Innovate by creating reusable tooling, automation frameworks, and advanced reliability platforms covering observability, alerting, chaos testing, auto-scaling, and failover strategies.
  3. Lead Operational Excellence & Incident Management: Drive comprehensive operational excellence through advanced observability (tracing, logging, metrics, alerting) and next-generation telemetry, leveraging Machine Learning for anomaly detection and exploring GenAI for alert engineering. Lead technical response during major incidents, conducting deep post-mortems, driving systemic improvements, and embedding preventive architectural controls.
  4. Champion Automation & Resilience Engineering: Develop and implement large-scale automation solutions, internal tooling, and frameworks to enhance reliability, cost-efficiency, and operational visibility. Advance resilience engineering by integrating automation pipelines, CI/CD, canary releases, and chaos engineering principles into core development and deployment workflows. Drive initiatives to eliminate toil and contribute to multi-cloud strategy.
  5. Ensure Security & Compliance: Maintain the highest security posture, implementing fraud prevention at the perimeter, and ensuring strict adherence to industry compliance standards (e.g., ISO-27001, PCI). Uphold all architectural and operational practices to rigorously meet security standards, compliance requirements, and audit readiness protocols.

Skills

Required

  • Java
  • Python
  • Go
  • Bash
  • Ansible
  • Prometheus
  • Grafana
  • Datadog
  • OpenTelemetry
  • ELK
  • Splunk
  • ISO-27001
  • PCI
  • BS degree in computer science or equivalent field with 7+ years of experience
  • MS degree in computer science or equivalent field with 5+ years of experience
  • 5+ years of experience in Site Reliability Engineering
  • building, scaling, and operating large-scale distributed platform services
  • designing, analyzing, and troubleshooting distributed systems
  • designing observability stacks

Nice to have

  • Machine Learning tools
  • Generative AI enhancements
  • Open Source technologies designed for large-scale data processing
  • error budgeting
  • service reliability metrics (SLA, SLO, SLI)
  • CI/CD

What the JD emphasized

  • strong software development skills
  • deep systems expertise
  • solid understanding of SRE principles
  • high availability
  • reliability
  • security
  • data consistency
  • disaster recovery
  • auto-failover mechanisms
  • monitoring infrastructure
  • application services
  • incident management
  • system bottlenecks
  • architectural challenges
  • automation solutions
  • large-scale platform service needs
  • alert engineering
  • anomaly detection
  • Machine Learning tools
  • Generative AI enhancements
  • device-related issues
  • debugging relevant logs
  • full system lifecycle
  • configuration and code deployment
  • user acceptance test
  • production environments
  • ultra-high availability
  • performance
  • Service Level Indicators (SLIs)
  • Service Level Objectives (SLOs)
  • Service Level Agreements (SLAs)
  • distributed systems
  • capacity planning
  • reusable tooling
  • automation frameworks
  • observability
  • alerting
  • chaos testing
  • auto-scaling
  • failover strategies
  • operational excellence
  • next-generation telemetry
  • technical response
  • major incidents
  • deep post-mortems
  • systemic improvements
  • preventive architectural controls
  • cost-efficiency
  • operational visibility
  • automation pipelines
  • CI/CD
  • canary releases
  • chaos engineering principles
  • development and deployment workflows
  • eliminate toil
  • multi-cloud strategy
  • highest security posture
  • fraud prevention
  • perimeter
  • industry compliance standards (e.g., ISO-27001, PCI)
  • architectural and operational practices
  • security standards
  • compliance requirements
  • audit readiness protocols
  • Cross-Functional Collaboration
  • engineering
  • production support
  • QA teams
  • seamless service delivery
  • DevOps culture
  • technical insights
  • log analysis
  • system debugging
  • 5+ years of experience in Site Reliability Engineering
  • building, scaling, and operating large-scale distributed platform services
  • Java
  • BS degree in computer science or equivalent field with 7+ years of experience
  • MS degree in computer science or equivalent field with 5+ years of experience
  • Open Source technologies
  • large-scale data processing
  • designing, analyzing, and troubleshooting distributed systems
  • Python, Java, Go, Bash, Ansible
  • observability stacks (Prometheus, Grafana, Datadog, OpenTelemetry, ELK, etc.)
  • troubleshooting
  • problem-solving skills
  • monitoring and logging tools (e.g., Prometheus, Splunk, Grafana, OpenTelemetry)
  • error budgeting
  • service reliability metrics (SLA, SLO, SLI)
  • Rele