Senior Site Reliability Engineer, Tenant Services: Geo

GitLab · Enterprise · India · Platforms Engineering

GitLab is seeking a Senior Site Reliability Engineer for its Tenant Services: Geo team. The role focuses on keeping user-facing services and production systems running smoothly, with particular emphasis on GitLab Geo, a feature that replicates data for disaster recovery and migrations. The engineer will execute end-to-end migrations, operate and improve the Geo operational surface, design and build automation and tooling, and contribute to infrastructure improvements. The role involves on-call rotations, incident response, and direct customer interaction during migrations. While the company embraces AI for productivity, this role is centered on SRE principles and operational excellence for the Geo product.

What you'd actually do

  1. Execute Dedicated Geo migrations and cutovers end-to-end, including planning, pre-cutover validation, execution, and post-cutover verification and cleanup.
  2. Join the team’s shift and weekend coverage rotation for Dedicated cutovers across EMEA and US hours, and participate in the SaaS Site Reliability Engineering (SRE) on-call rotation to respond to incidents that impact GitLab.com availability.
  3. Operate and improve the Geo operational surface for Dedicated, including:
  4. Design, build, and maintain automation, tooling, and runbooks that make migrations, cutovers, and Geo escalations as “boring” and repeatable as possible.
  5. Run our infrastructure with tools such as Ansible, Chef, Terraform, GitLab CI/CD, and Kubernetes; contribute improvements back to GitLab’s product and infrastructure where appropriate.

Skills

Required

  • Experience operating highly-available distributed systems at scale, ideally in a SaaS environment with customer-facing SLAs.
  • Hands-on experience with at least one major cloud provider (e.g., Google Cloud Platform or Amazon Web Services), including networking, storage, and managed services.
  • Experience with Kubernetes and its ecosystem (e.g., Helm), including deploying and troubleshooting workloads.
  • Experience with infrastructure as code and configuration management tools such as Terraform, Ansible, or Chef.
  • Strong programming skills in at least one general-purpose language (preferably Go or Ruby) and proficiency with scripting (e.g., Shell, Python).
  • Experience with observability systems (e.g., Prometheus, Grafana, logging stacks) and using metrics and logs to troubleshoot performance and reliability issues.
  • Practical exposure to data replication, backup/restore, or migration scenarios (e.g., database replication, storage replication, or Geo-like technologies) where data integrity and downtime risk must be carefully managed.
  • Comfort participating in an on-call rotation, investigating incidents across the stack, and driving follow-through on corrective actions.
  • Ability to engage directly with enterprise customers during migrations and incidents, including on live calls and through clear written updates.
  • Ability to clearly define problems, propose options, and think beyond immediate fixes to improve systems and processes over time.
  • Ability to be a “manager of one”: self-directed, organized, and able to drive work to completion in a remote, asynchronous environment.
  • Strong written and verbal communication skills.

Nice to have

  • Prior experience with Geo or Gitaly.
  • Familiarity with disaster recovery technologies or GitLab itself.

What the JD emphasized

  • customer-facing SLAs
  • customer migrations
  • customer
  • enterprise customers