Staff Site Reliability Engineer

Google · Big Tech · San Jose, CA +1

Staff Site Reliability Engineer for Google Cloud's AI infrastructure, focusing on the reliability and supportability of data intelligence systems (Woodshed and Napa) that underpin Google's AI initiatives. The role involves leading the SRE team, improving system design, automation, and capacity management for large-scale, distributed systems, with a specific emphasis on production ML systems.

What you'd actually do

  1. Lead the team in our top 2026 challenge: reducing the support cost of the products via correct provisioning, intelligent alerting, and system design and deployment improvements.
  2. Grow the Site Reliability Engineering (SRE) team from trained on-callers and incident responders to system partners.
  3. Build trust with and influence over key stakeholders to drive successful scaling of the supportability of complex systems.
  4. Identify problems and pain points for the team, dev partner teams, and customers, and drive solutions that balance short-term and long-term needs.
  5. Work with critical customers to give them the reliability they need for their key user journeys.

Skills

Required

  • Bachelor's degree in Computer Science, a related technical field, or equivalent practical experience.
  • 8 years of experience building and developing infrastructure or distributed systems.
  • 5 years of experience in troubleshooting and debugging.
  • 5 years of experience building and architecting production quality Machine Learning (ML) systems.
  • 5 years of experience programming in C++, Go, or Python.

Nice to have

  • Master's degree in Computer Science or a related technical field.
  • Experience in Site Reliability Engineering.
  • Experience in troubleshooting and supporting applications like web services, data storage, databases, data pipelines, commerce engines, with Linux/Unix or other operating systems.

What the JD emphasized

  • production quality Machine Learning (ML) systems

Other signals

  • Google Cloud's services
  • data intelligence systems underlying Google's AI push
  • Technical Infrastructure team
  • production quality Machine Learning (ML) systems