Principal Site Reliability Engineering Expert Director

BCG · Consulting · London, United Kingdom · Technology and Engineering

This role is for a Principal Site Reliability Engineering Expert Director at Boston Consulting Group. It focuses on shaping reliability, automation, and operational excellence across traditional infrastructure, cloud, network, identity, security, and AI-driven operations. The primary goal is to design scalable systems, reusable engineering patterns, and standardized controls that reduce toil, improve resilience, and embed reliability and compliance into delivery pipelines and operational platforms. A key part of the role is driving organizational change toward automation-first practices and building reusable CI/CD and Terraform modules, engineering guardrails, observability patterns, and automation frameworks. The role also carries an enablement function for less technical areas: translating complex engineering practices into practical standards and improving governance through engineering controls. The ideal candidate is a systems thinker who can build automation-first, reliability-focused operating models.

What you'd actually do

  1. Design and evolve reliability patterns across cloud, network, identity, and security domains.
  2. Lead the design of automation frameworks that eliminate manual operational tasks across multiple domains.
  3. Contribute to AI-driven operational use cases, including event correlation, anomaly detection, noise reduction, operational insights, and automated remediation.
  4. Define standards for telemetry, monitoring, alerting, and operational visibility across all critical systems.
  5. Provide technical leadership across teams, influencing standards, architecture, and engineering practices.

Skills

Required

  • Site Reliability Engineering (SRE)
  • Cloud Engineering (AWS, Azure, GCP)
  • Network Operations
  • Identity and Access Management (IAM)
  • Security Engineering
  • AI-driven operations / AIOps
  • Automation Frameworks
  • CI/CD pipelines
  • Terraform
  • Observability (monitoring, alerting, logging)
  • Systems Thinking
  • Incident Management
  • Technical Leadership

Nice to have

  • Experience in consulting or professional services
  • Experience enabling less technical teams
  • Knowledge of Zero Trust principles

What the JD emphasized

  • AI-driven operations
  • AIOps capabilities
  • automation-first
  • reliability engineering
  • operational excellence
  • guardrails
  • observability patterns
  • automation frameworks

Other signals

  • AI-driven operations
  • AIOps capabilities
  • event correlation
  • anomaly detection
  • noise reduction
  • operational insights
  • automated remediation