Intermediate Site Reliability Engineer, Environment Automation

GitLab · Enterprise · Canada +1 · Platforms Engineering

GitLab is seeking an Intermediate Site Reliability Engineer for its Environment Automation team. This role focuses on ensuring the reliability, scalability, security, and consistency of hundreds of isolated GitLab environments for customers. The engineer will work with infrastructure as code, deployment packages, and Kubernetes, contributing to automation across the entire lifecycle from provisioning to operations. Responsibilities include defining, deploying, and maintaining environments, debugging production issues, building automation for upgrades and configuration changes, and supporting an observability stack. The role emphasizes treating everything as code and managing many tenant environments in parallel.

What you'd actually do

  1. Contribute to automating operational tasks across many GitLab environments, from initial provisioning and configuration updates to upgrades and routine maintenance, helping reduce manual work and improve reliability at scale under the guidance of senior team members.
  2. Help build and refine the observability stack for multi-tenant GitLab environments so we monitor the right signals across Kubernetes, cloud services, and GitLab applications, supporting early issue detection and basic capacity tracking.
  3. Assist in responding to platform alerts and incidents, collaborating with Environment Automation SREs and engineering teams to troubleshoot production issues across multiple tenants and document findings.
  4. Support planning and implementation of infrastructure changes, capacity expansions, and new service rollouts for Dedicated and other managed GitLab environments, contributing to efforts that improve resource efficiency and environment isolation.
  5. Develop and maintain scripts, automation tools, and infrastructure-as-code workflows that manage parts of the GitLab environment lifecycle, enabling more repeatable, self-service operations over time.

Skills

Required

  • Experience working as an SRE or in a similar role operating production infrastructure
  • Hands-on experience with backend programming languages such as Golang, with the ability to read, understand, and modify infrastructure tools
  • Hands-on experience running Kubernetes-based workloads in production, including basic understanding of deployments, rollouts, and debugging common issues like crash loops, failed health checks, and scheduling problems
  • Familiarity with infrastructure as code tools like Terraform or Ansible
  • Experience with cloud platforms (AWS, GCP, Azure)
  • Understanding of CI/CD principles and tools

Nice to have

  • Experience with Helm Charts
  • Familiarity with GitLab CI/CD
  • Experience with observability tools (Prometheus, Grafana, ELK stack)
  • Understanding of multi-tenant architectures

What the JD emphasized

  • operating production infrastructure
  • automating the lifecycle of many environments or tenants in parallel
  • running Kubernetes-based workloads in production