What you'd actually do

Own the technical roadmap and long-term architecture for the Evergreen platform, including a unified data model for promise delivery across GCP.

Design and scale high-performance backend pipelines (Go, Java) and data-rich user interfaces (TypeScript, Angular) used by more than 10,000 Google engineers.

Prototype and productionize LLM-based features to parse unstructured incident data, automatically file risk tickets, and suggest reliability fixes.

Partner closely with Product Management, Data Science, and leadership to align multiple organizations on a unified approach to policy measurement and enforcement.

Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google Cloud's services (both our internally critical and our externally visible systems) have reliability and uptime appropriate to customers' needs, and a fast rate of improvement. Additionally, SREs keep an ever-watchful eye on our systems' capacity and performance. Much of our software development focuses on optimizing existing systems, building infrastructure, and eliminating work through automation. On the SRE team, you'll have the opportunity to manage the complex challenges of scale that are unique to Google Cloud, while using your expertise in coding, algorithms, complexity analysis, and large-scale system design. SRE's culture of intellectual curiosity, problem solving, and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences, and perspectives. We encourage them to collaborate, think big, and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to create an environment that provides the support and mentorship needed to learn and grow.

The Reliability Outcome Enablement team develops the products, core infrastructure, and datasets that drive and sustain Google Cloud Platform's (GCP's) reliability promises. We build the Evergreen intelligence platform, the core system that automates resilience across the GCP ecosystem. Every product team at Google (from BigQuery to Spanner) relies on our infrastructure and integrated data lake to keep their services bulletproof.

We are currently expanding our platform to integrate Generative AI and LLM-driven workflows, moving from reactive tracking to a predictive system that catches failures and automates risk mitigation.

Behind everything our users see online is the architecture built by the Technical Infrastructure team to keep it running. From developing and maintaining our data centers to building the next generation of Google platforms, we make Google's product portfolio possible. We're proud to be our engineers' engineers and love voiding warranties by taking things apart so we can rebuild them. We keep our networks up and running, ensuring our users have the best and fastest experience possible.

Individual pay is determined by factors including job-related skills, experience, and relevant education or training.

US: $207,000 - $301,000 (USD) + 20% bonus target + equity + benefits

Learn more about benefits at Google.

Responsibilities

Own the technical roadmap and long-term architecture for the Evergreen platform, including a unified data model for promise delivery across GCP.
Design and scale high-performance backend pipelines (Go, Java) and data-rich user interfaces (TypeScript, Angular) used by more than 10,000 Google engineers.
Prototype and productionize LLM-based features to parse unstructured incident data, automatically file risk tickets, and suggest reliability fixes.
Partner closely with Product Management, Data Science, and leadership to align multiple organizations on a unified approach to policy measurement and enforcement.

Qualifications

Minimum qualifications:

Bachelor's degree in Computer Science or a related technical field, or equivalent practical experience.
8 years of experience with data structures and algorithms.
3 years of experience leading projects and designing, analyzing, and troubleshooting distributed systems.
3 years of experience in a technical leadership role overseeing projects.
Experience overseeing full-stack architectures, ensuring cohesion between backend data automation layers and engineering frontend.

Preferred qualifications:

Experience in applying LLMs or Generative AI to automate workflows.
Familiarity with large-scale reliability analysis or policy conformance frameworks.

Staff Site Reliability Engineer, Cloud Reliability Intelligence
