Senior Site Reliability Engineer, Observability

Webflow · Enterprise · CA · Remote · Engineering

Webflow is seeking a Senior Site Reliability Engineer for its Observability team to improve the reliability and stability of its customer-facing production infrastructure. The role involves owning and evolving the observability stack (OpenTelemetry, Datadog), debugging the main application, driving adoption of observability practices such as SLOs and distributed tracing, and building AI-powered agents that surface insights faster and reduce alert fatigue. The engineer will also guide other teams on instrumentation, improve on-call processes, and automate common workflows.

What you'd actually do

  1. Own and evolve Webflow's observability stack, including OpenTelemetry and Datadog, to provide reliable, actionable metrics, traces, and logs across our services.
  2. Build and maintain AI-powered agents and automation that help engineers surface insights faster, reduce alert fatigue, and accelerate incident resolution.
  3. Continuously raise the bar on observability practices by driving adoption of SLOs, distributed tracing, and structured logging throughout engineering.
  4. Participate in and continuously improve on-call and incident response processes, with a focus on making observability data the foundation of faster, more effective responses.
  5. Reduce toil by automating common observability workflows to keep the rest of engineering working smoothly with fewer interruptions.

Skills

Required

  • BS/BA degree or relevant experience
  • Business-level fluency in English
  • 5+ years of experience building, maintaining, and debugging distributed systems in a customer-facing environment
  • Hands-on experience with observability platforms and tooling (Datadog, Grafana, Prometheus, Elasticsearch, or similar)
  • Experience with OpenTelemetry or similar instrumentation frameworks
  • Experience defining and operationalizing SLOs/SLIs at scale
  • Experience with multi-tier cloud environments (AWS or GCP)
  • Experience with container-centric architectures (Docker, Kubernetes, ECS)
  • Experience with infrastructure-as-code tools (Terraform, Pulumi)

Nice to have

  • Experience building or operating AI agents that interact with observability data
  • Experience with OpenTelemetry, Kubernetes, and Pulumi specifically
  • Experience improving on-call and incident response processes for Engineering

What the JD emphasized

  • AI-powered agents and automation