Senior Site Reliability Engineer

Spotify · Consumer · New York, NY · Platform

Senior Site Reliability Engineer for Spotify's Backstage team, focusing on building and operating AI-native infrastructure for a developer platform and internal coding agents. The role involves ensuring reliability, scalability, and security for cloud infrastructure (GCP/AWS) supporting LLM-driven agent workflows and non-deterministic AI workloads, with an emphasis on incident management, operational excellence, and mentoring.

What you'd actually do

  1. Own fleet reliability. Lead the reliability, security, and scalability strategy for Portal’s SaaS infrastructure, including the runtime environments that power our platform and LLM-driven agent workflows. Define SLOs, drive capacity planning, and ensure our systems meet the demands of a rapidly growing product.
  2. Architect for the agentic era. Design and evolve infrastructure on GCP and AWS using Terraform and infrastructure-from-code patterns. Shape how we structure environments for non-deterministic AI workloads — including sandboxing, resource isolation, cost governance, and security boundaries.
  3. Drive operational excellence. Evolve our incident management, on-call, and postmortem practices. Leverage AI assistants to accelerate root cause analysis and build increasingly self-healing capabilities into our production systems.
  4. Lead fullstack reliability. Operate across a modern web stack (TypeScript, React, Python). While not frontend-heavy, you’ll diagnose and resolve issues across the stack and drive reliability improvements end-to-end.
  5. Mentor and multiply. Raise the reliability IQ of the broader engineering team. Establish SRE best practices, conduct production-readiness reviews, and mentor engineers on operational thinking.

Skills

Required

  • 5+ years of hands-on experience operating cloud infrastructure (GCP and/or AWS)
  • experience using Terraform and Kubernetes to run production systems at scale
  • practical experience with, or a strong demonstrated interest in, operating LLM-based systems, RAG pipelines, or agentic workloads
  • understanding of the reliability challenges of non-deterministic systems
  • proficiency in at least one modern language (TypeScript, Java, Go, or Python)
  • comfort navigating large, heterogeneous codebases
  • ability to build automation and improve systems

Nice to have

  • experience with AI-assisted coding tools

What the JD emphasized

  • operating LLM-based systems
  • agentic workloads
  • non-deterministic AI workloads

Other signals

  • AI-native workflows
  • agentic production systems
  • AI-native engineering
  • agentic developer tooling
  • LLM-driven agent workflows
  • AI assistants
  • generative AI features