Enterprise Observability Architect

Merck · Pharma · Central Bohemian Region, Czech Republic

Seeking an Enterprise Observability Architect to define and drive a company-wide observability strategy, architecting a unified, full-stack platform to improve service reliability and performance. This role involves strategic leadership, standardization, and evangelizing best practices across engineering teams, with a focus on SRE principles and AIOps.

What you'd actually do

  1. Create, document, and champion a unified, long-term observability strategy for the entire company, covering all telemetry types, including metrics, traces, logs, and profiles
  2. Design a cohesive, full-stack observability solution using best-in-class tools and practices. Ensure our architecture promotes a product-oriented approach with a strong focus on self-service capabilities
  3. Actively identify and help decommission redundant or overlapping tools, driving the organization towards a standardized, cost-effective, and easy-to-manage observability service
  4. Partner with the Site Reliability Engineering (SRE) team and promote its principles, such as defining, measuring, and managing Service Level Objectives (SLOs) and error budgets, and show how critical observability is to them
  5. Clearly and persuasively communicate complex observability topics, benefits, and strategic decisions to senior leadership and diverse stakeholders across IT and business domains

Skills

Required

  • Observability
  • Software engineering
  • SRE
  • Platform engineering
  • Architecting and implementing large-scale technical solutions
  • Grafana
  • Prometheus
  • Dynatrace
  • BigPanda
  • xMatters
  • Technical strategy
  • Site Reliability Engineering principles
  • SLOs
  • error budgets
  • OpenTelemetry
  • Multi-Cloud Experience (AWS, Azure, GCP)
  • Communication
  • Influence

Nice to have

  • Tool Rationalization
  • FinOps Knowledge
  • eBPF Knowledge
  • Product Mindset

What the JD emphasized

  • 10+ years in Observability, software engineering, SRE, or platform engineering, with a proven track record of architecting and implementing large-scale technical solutions
  • Subject matter expert (SME) on observability practices and the collection of all types of telemetry, and on the tools that support them (e.g., Grafana, Prometheus, Dynatrace, BigPanda, xMatters)
  • Demonstrable experience creating and executing a technical strategy across multiple teams or an entire organization
  • Deep, practical knowledge of Site Reliability Engineering principles, with hands-on experience defining, implementing, and managing SLOs and error budgets
  • Strong understanding and practical experience with the OpenTelemetry standard for instrumentation and telemetry collection
  • Exceptional Communication & Influence: World-class ability to explain highly complex technical concepts to senior executives and non-technical audiences. Proven ability to lead by influence and drive consensus in a large organization