Senior Observability Infrastructure Engineer

Adobe · Enterprise · Bucharest, Romania

Senior Observability Infrastructure Engineer at Adobe, responsible for building and finding best-of-breed tools for critical Observability services. The role involves crafting new tools, maintaining large-scale logging deployments, and driving ingestion cost optimization. A key aspect is integrating AI agent development and AI workflows into large-scale deployments, building on OpenTelemetry, to surface insights from log data and automate routine user interactions.

What you'd actually do

  1. Run internally hosted logging systems such as Splunk, ClickHouse, Loki, and Elastic, assisting clients and improving environment performance and stability
  2. Drive ingestion cost optimization through data-driven analysis, pipeline guardrails, and direct engagement with customer engineering teams to reduce unnecessary log volume
  3. Work with OpenTelemetry, including collector configuration, pipelines, and instrumentation, in line with Adobe's OTel-native observability strategy
  4. Develop AI agents and integrate AI workflows into large-scale deployments, building AI-assisted workflows that surface actionable insights from large log datasets and automate routine user interactions
  5. Architect distributed environments serving thousands of users

Skills

Required

  • 7 to 10+ years of production-level experience with distributed applications at scale in public and/or private cloud
  • Experience architecting and implementing large-scale Observability platforms
  • Experience with internally hosted logging systems such as Splunk, ClickHouse, Loki, and Elastic, assisting clients and improving environment performance and stability
  • Demonstrated ability to drive ingestion cost optimization through data-driven analysis, pipeline guardrails, and direct engagement with customer engineering teams to reduce unnecessary log volume
  • Experience with OpenTelemetry — including collector configuration, pipelines, and instrumentation
  • AI agent development and experience integrating AI workflows into large-scale deployments; ability to build AI-assisted workflows to surface actionable insights from large log datasets and automate routine user interactions
  • Experience architecting distributed environments with thousands of users
  • Programming experience with languages such as Go and Python; experience building integrations and applications for large-scale Observability environments
  • Experience designing and implementing systems for fault tolerance, scalability and stability
  • Experience developing, deploying and running distributed applications on cloud platforms; experience with container and orchestration technologies (Docker, Kubernetes)
  • Comfortable owning on-call coverage across a multi-tool observability stack, with the ability to triage and resolve issues across platforms beyond primary area of expertise
  • Ensure the highest level of up-time and Quality of Service (QoS) to Adobe's customers through operational excellence
  • Knowledge of defining service level objectives (SLOs) and service level indicators (SLIs) to represent and measure service quality
  • Knowledge of (public and/or private) cloud deployments
  • Collaborate with SRE and Engineering/Product teams in driving critical initiatives
  • Experience in designing and maintaining production monitoring systems
  • Experience in solving performance and stability issues using a wide variety of tools
  • Excellent communicator within and across teams, driving projects to completion
  • Ability to impact the organization by contributing to technical direction and strategic decisions

Nice to have

  • Experience evaluating and prototyping alternative storage/processing backends (e.g., ClickHouse, Loki)
  • Experience with other Observability tooling like Grafana, Cortex, and Tempo
  • Promote the DevOps/SRE approach

What the JD emphasized

  • AI agent development
  • integrating AI workflows
  • AI-assisted workflows
  • large log datasets
  • OpenTelemetry
