Site Reliability Engineer - Kafka

Apple · Big Tech · Seattle, WA · Software and Services

Site Reliability Engineer (SRE) with experience in distributed systems, Kafka infrastructure, and production environments. Focus on building and managing scalable, reliable, and fast data infrastructure services, including monitoring, alerting, and automation. Requires proficiency in Java, Go, or Python, and experience with Kubernetes, AWS, GCP, and IaC.

What you'd actually do

  1. Understanding of core SRE concepts: monitoring, alerting, and incident management
  2. Deep and wide performance engineering (design concepts, profile-guided optimization)
  3. Service lifecycle management across bare metal, virtualized (EC2), and Kubernetes platforms
  4. Prepare alert-handling procedures and runbooks, and collaborate with other SRE team members.
  5. Excellent communication and a high degree of customer focus when engaging with internal platform customers

Skills

Required

  • 5 or more years of experience supporting internet-facing production services and distributed systems through deployments, on-call, and incident management.
  • 5 or more years of experience running large-scale infrastructure with a heavy reliance on automation tooling
  • 5 or more years of experience in troubleshooting and deep-dive performance analysis
  • Real operational experience managing services at scale on Kubernetes
  • Proficient in one or more of the following programming languages: Java, Go (golang), Python
  • Operational experience deploying to and running on datacenter and cloud architectures (networking topologies, host placement strategies, and failure modes), including design of multi-datacenter systems, failure domains, and wide-area networking.
  • Self-motivated and inquisitive, with an aptitude for learning new technologies quickly and effectively.
  • Demonstrated expertise developing and troubleshooting distributed systems and database storage engines.
  • Experience developing critical internet services and/or platform infrastructure.
  • Experience with AWS, GCP, and IaC tools such as Terraform

Nice to have

  • Experience managing messaging services such as Kafka or other data services
  • Proficient in Java, Go (golang), and Python

What the JD emphasized

  • Prior experience with development or maintenance of Kafka infrastructure or a similar data service is highly recommended