Staff Site Reliability Engineer, Cloud Reliability Intelligence

Google · Big Tech · Sunnyvale, CA +1

Staff Site Reliability Engineer for Google Cloud's Reliability Intelligence team, focusing on building an evergreen intelligence platform. The role involves integrating Generative AI and LLM-driven workflows to move from reactive tracking to a predictive system for failure detection and automated risk mitigation. Responsibilities include owning the technical roadmap, designing scalable backend pipelines and UIs, and prototyping and productionizing LLM features for incident data parsing and reliability fixes.

What you'd actually do

  1. Own the technical roadmap and long-term architecture for the Evergreen platform, including a unified data model for promise delivery across GCP.
  2. Design and scale high-performance backend pipelines (Go, Java) and data-rich user interfaces (TypeScript, Angular) used by more than 10,000 Google engineers.
  3. Prototype and productionize LLM-based features to parse unstructured incident data, automatically file risk tickets, and suggest reliability fixes.
  4. Partner closely with Product Management, Data Science, and leadership to align multiple organizations on a unified approach to policy measurement and enforcement.

Skills

Required

  • data structures
  • algorithms
  • distributed systems
  • technical leadership
  • full-stack architectures
  • backend data automation
  • frontend engineering

Nice to have

  • applying LLMs or Generative AI to automate workflows
  • large-scale reliability analysis
  • policy conformance frameworks

What the JD emphasized

  • Prototype and productionize LLM-based features

Other signals

  • integrating Generative AI
  • LLM-driven workflows
  • predictive system
  • automated risk mitigation
  • Prototype and productionize LLM-based features