Senior Site Reliability Engineer

BCG · Consulting · Gurgaon, Haryana, India · Technology and Engineering

Senior Site Reliability Engineer responsible for running and improving reliability engineering systems, including automation, pipelines, and observability. The role involves designing and implementing solutions to reduce operational toil, embedding reliability into workflows, shaping engineering standards, and leading incident response. It also includes mentoring junior engineers and collaborating with various teams to ensure reliability and governance. The ideal candidate has extensive experience in SRE, platform engineering, cloud platforms (AWS/Azure), Infrastructure-as-Code, CI/CD, and observability tools.

What you'd actually do

  1. Run and continuously improve the reliability engineering systems within scope, including automation, pipelines, observability, and operational tooling.
  2. Design and implement engineering solutions that eliminate operational toil at scale and embed reliability into delivery workflows.
  3. Help shape engineering standards, patterns, and reusable frameworks across the SRE practice.
  4. Lead the engineering response to complex incidents within scope, drive systemic remediation, and contribute to post-incident learning.
  5. Mentor and coach less senior engineers across reliability engineering, automation, observability, and SRE principles.

Skills

Required

  • Site Reliability Engineering
  • Platform Engineering
  • Cloud (AWS or Azure)
  • Automation
  • Observability
  • CI/CD
  • Infrastructure-as-Code (Terraform)
  • Scripting (Python)
  • Incident Response
  • Stakeholder Engagement
  • Technical Communication
  • Enterprise Observability Platforms (Splunk, Datadog)
  • Telemetry Pipelines
  • SLIs/SLOs
  • Cloud Infrastructure Operations (AWS, Azure, GCP, Alibaba Cloud)
  • IaC Patterns
  • Cloud Networking
  • Identity Primitives
  • Policy Enforcement
  • Identity Platforms (Entra ID)
  • Secrets Management (HashiCorp Vault)
  • OIDC
  • Workload Identity
  • Dynamic Credential Patterns
  • Zero Trust
  • Least-Privilege Adoption
  • Security Tooling in CI/CD
  • Policy-as-Code
  • Secure-by-Default Patterns
  • Hybrid and Cloud Network Architectures
  • Automated Network Controls
  • Zero Trust Segmentation
  • Network Observability

Nice to have

  • Federated, multi-cloud, or large enterprise environment experience
  • Containerisation (Docker)
  • Orchestration (Kubernetes)
  • Secrets management tooling (HashiCorp Vault)
  • Cloud certification (professional level)
  • Policy-as-code tooling (OPA, Sentinel)
  • Contributing to engineering communities of practice
  • AIOps
  • Noise reduction
  • Event correlation
  • Event-driven ops automation platforms (ServiceNow, PagerDuty)

What the JD emphasized

  • 5–8 years of experience in Site Reliability Engineering, Platform Engineering, or related operational engineering disciplines.
  • Strong hands-on experience across multiple SRE domains, including cloud, automation, observability, and CI/CD.
  • Demonstrated experience designing and implementing automation and reliability solutions at scale.
  • Deep knowledge of at least one cloud platform (AWS or Azure), including networking, identity, and observability primitives.
  • Experience with Infrastructure-as-Code (e.g. Terraform) and CI/CD pipelines.
  • Strong scripting experience (e.g. Python).
  • Experience leading incident response and driving systemic improvement.
  • Strong stakeholder engagement and technical communication skills.
  • Deep hands-on experience with one or more enterprise observability platforms (e.g. Splunk, Datadog).
  • Proven experience designing and operating telemetry pipelines, ingestion controls, and observability cost management.
  • Proven experience designing signals (SLIs, SLOs, synthetic checks, alerts) and ops automation triggered from those signals.
  • Experience driving SLO/SLI practices across multiple teams.
  • Deep hands-on experience operating cloud infrastructure across at least two of AWS, Azure, GCP, or Alibaba Cloud.
  • Proven experience designing reusable IaC patterns and landing zone components across cloud providers.
  • Strong working knowledge of cloud networking, account management, identity primitives, and policy enforcement across providers.
  • Experience driving cloud platform engineering standards and governance across multiple teams.
  • Deep hands-on experience with identity platforms (e.g. Entra ID) and secrets management (e.g. HashiCorp Vault).
  • Proven experience designing OIDC, workload identity, and dynamic credential patterns.
  • Experience driving Zero Trust and least-privilege adoption across multiple teams.
  • Deep hands-on experience with security tooling embedded in CI/CD pipelines.
  • Proven experience designing policy-as-code controls and secure-by-default patterns.
  • Experience driving secure engineering adoption across multiple teams.
  • Deep hands-on experience with hybrid and cloud network architectures.
  • Proven experience designing automated network controls through IaC.
  • Experience driving Zero Trust segmentation and network observability adoption.