Site Reliability Engineer - Backstage

Spotify · Consumer · New York, NY · Platform

Site Reliability Engineer for Spotify's Backstage platform, focusing on building and operating AI-native infrastructure for developer tools, including background coding agents and SaaS infrastructure supporting LLM-powered workflows. The role involves modern Infrastructure as Code on GCP/AWS with Terraform, working in a fullstack environment, and ensuring reliability for agentic production systems.

What you'd actually do

  1. Maintain and improve Portal’s SaaS infrastructure for reliability, security, and scalability. This covers the runtime environments supporting the platform and workflows powered by large language models.
  2. Collaborate with senior engineers to build infrastructure on GCP and AWS using Terraform and emerging infrastructure-from-code patterns where agents assist in defining the stack.
  3. Operate in a modern web stack environment (TypeScript, React, Python). While this isn’t a frontend-heavy role, comfort with debugging fullstack systems and web infrastructure is key.
  4. Participate in on-call rotations to ensure systems meet reliability and availability goals, employing AI assistants to accelerate root cause analysis and incident resolution.
  5. Participate in the planning and delivery of technical projects, defining how infrastructure evolves to support the next wave of generative AI features.

Skills

Required

  • Cloud infrastructure (GCP or AWS)
  • IaC tools like Terraform
  • Distributed systems principles
  • Operating distributed systems reliably at scale
  • Modern programming language (e.g., TypeScript, Java, Go, Python)
  • Debugging fullstack systems
  • Web infrastructure

Nice to have

  • LLMs
  • RAG
  • Agents in an operational context
  • Non-deterministic AI workloads
  • Open-source projects
  • Building "coding assistant" bots

What the JD emphasized

  • AI-native workflows
  • background coding agents
  • agentic production environment
  • AI assistants to accelerate root cause analysis
  • AI-generated PRs are the norm

Other signals

  • infrastructure evolves to support the next wave of generative AI features