Site Reliability Manager, Site Reliability Engineering

Google · Big Tech · Mountain View, CA +1

This Site Reliability Manager role at Google focuses on the reliability, uptime, and performance of large-scale distributed systems, particularly the Search front-end. It involves leading a team of engineers, automating responses to service conditions, designing and delivering software that improves service availability and efficiency, and managing on-call rotations. Although the posting mentions "Experience Search as we move forward into AI" and lists "Experience with Artificial Intelligence" as a preferred qualification, the core responsibilities and qualifications center on traditional SRE principles for infrastructure and systems engineering; building or directly managing AI/ML models or systems is not the primary function.

What you'd actually do

  1. Lead a team of Software/Systems Engineers on projects for users and be responsible for uptime.
  2. Own end-to-end availability and performance of key services and build automation to prevent problem recurrence. Automate response to all non-exceptional service conditions.
  3. Lead by example, mentor the team and establish credibility through quality technical execution.
  4. Manage on-call rotations across continents, using a follow-the-sun model.
  5. Design, write and deliver software to improve the availability, scalability, latency and efficiency of Google's services.

Skills

Required

  • programming in one or more languages
  • people management
  • leading projects
  • administration (e.g., filesystems, inodes, system calls)
  • networking (e.g., TCP/IP, routing, network topologies and hardware, SDN)
  • developing infrastructure
  • distributed systems
  • system design

Nice to have

  • building scalable, reliable and highly performant web applications
  • programming in Java or C++
  • Artificial Intelligence
  • experimental design (e.g., A/B, multivariate) and incremental analysis

What the JD emphasized

  • large-scale systems
  • massively distributed
  • fault-tolerant systems
  • reliability
  • uptime
  • capacity and performance
  • automation
  • system design
  • availability
  • scalability
  • latency
  • efficiency