Senior Site Reliability Engineer – Observability

Adobe · Enterprise · Bucharest, Romania

Senior Site Reliability Engineer focused on observability at Adobe, responsible for building and maintaining critical observability services. The role involves architecting and implementing large-scale observability platforms, optimizing ingestion costs, and developing AI agents that analyze log data and automate user interactions.

What you'd actually do

  1. Operate internally hosted logging systems such as Splunk, ClickHouse, Loki, and Elastic, assisting clients and improving environment performance and stability
  2. Drive ingestion cost optimization through data-driven analysis, pipeline guardrails, and direct engagement with customer engineering teams to reduce unnecessary log volume
  3. Work with OpenTelemetry, including collector configuration, pipelines, and instrumentation, a core requirement given Adobe's OTel-native observability strategy
  4. Develop AI agents and integrate AI workflows into large-scale deployments, building AI-assisted workflows that surface actionable insights from large log datasets and automate routine user interactions
  5. Architect distributed environments with thousands of users

Skills

Required

  • production-level experience with distributed applications at scale in public and/or private cloud
  • architecting and implementing large-scale Observability platforms
  • internally hosted logging systems such as Splunk, ClickHouse, Loki, and Elastic
  • ingestion cost optimization
  • OpenTelemetry
  • AI agent development
  • integrating AI workflows into large-scale deployments
  • AI-assisted workflows
  • large log datasets
  • automate routine user interactions
  • architecting distributed environments
  • Programming experience with languages such as Go and Python
  • building integrations and applications for large-scale Observability environments
  • designing and implementing systems for fault tolerance, scalability and stability
  • developing, deploying and running distributed applications on cloud platforms
  • container and orchestration technologies (Docker, Kubernetes)
  • owning on-call coverage
  • triage and resolve issues across platforms
  • highest level of uptime and Quality of Service (QoS)
  • defining service level objectives (SLOs) and service level indicators (SLIs)
  • cloud deployments
  • Collaborate with SRE and Engineering/Product teams
  • designing and maintaining production monitoring systems
  • solving performance and stability issues
  • Excellent communicator
  • driving projects to completion
  • contribution to technical direction and strategic decisions

Nice to have

  • evaluating and prototyping alternative storage/processing backends (e.g., ClickHouse, Loki)
  • Experience with other Observability tooling like Grafana, Cortex, and Tempo
  • Promote the DevOps/SRE approach

What the JD emphasized

  • AI agent development
  • integrating AI workflows
  • AI-assisted workflows
  • large log datasets
  • automate routine user interactions
  • OpenTelemetry
