Senior Site Reliability Engineer, Tenan… at GitLab

GitLab is the intelligent orchestration platform for DevSecOps. GitLab enables organizations to increase developer productivity, improve operational efficiency, reduce security and compliance risk, and accelerate digital transformation. More than 50 million registered users and more than 50% of the Fortune 100* trust GitLab to ship better, more secure software faster.

The same principles built into our products are reflected in how our team works: we embrace AI as a core productivity multiplier, with all team members expected to incorporate AI into their daily workflows to drive efficiency, innovation, and impact. GitLab is where careers accelerate, innovation flourishes, and every voice is valued. Our high-performance culture is driven by our values and continuous knowledge exchange, enabling our team members to reach their full potential while collaborating with industry leaders to solve complex problems. Co-create the future with us as we build technology that transforms how the world develops software.

*Fortune 500® is a registered trademark of Fortune Media IP Limited, used under license. Claim based on GitLab data. Fortune 100 refers to the top 20% ranked companies in the 2025 Fortune 500 list, published in June 2025. Fortune and Fortune Media IP Limited are not affiliated with, and do not endorse products or services of GitLab.

An overview of this role

Site Reliability Engineers (SREs) are responsible for keeping all user-facing services and other GitLab production systems running smoothly. SREs are a blend of pragmatic operators and software craftspeople who apply sound engineering principles, operational discipline, and mature automation to our operating environments and the GitLab codebase.

In this role, you will join the Tenant Services, Geo team. Geo is a feature that replicates data from a GitLab instance to a warm-standby, and is used both for data migrations and disaster recovery. The Tenant Services, Geo team is responsible for supporting GitLab Dedicated customer migrations and Geo-related escalations across GitLab Dedicated (excluding FedRAMP environments).

We don’t expect you to have prior experience with Geo or Gitaly; familiarity with disaster recovery technologies or GitLab itself is sufficient, and we’ll support you in learning our specific stack.

The team’s mandate spans the full Geo operational surface: pre- and post-cutover data hygiene, migration execution, and non-migration Geo escalations. The team works closely with the core Geo team, Dedicated migrations, and Support. You will help evolve a reliable, low-risk cutover model for Dedicated migrations while improving tooling, automation, and observability so migrations become faster, safer, and more predictable over time.

What you’ll do

Execute Dedicated Geo migrations and cutovers end-to-end, including planning, pre-cutover validation, execution, and post-cutover verification and cleanup.
Join the team’s shift and weekend coverage rotation for Dedicated cutovers across EMEA and US hours, and participate in the SaaS Site Reliability Engineering (SRE) on-call rotation to respond to incidents that impact GitLab.com availability.
Operate and improve the Geo operational surface for Dedicated, including:
Design, build, and maintain automation, tooling, and runbooks that make migrations, cutovers, and Geo escalations as “boring” and repeatable as possible.
Run our infrastructure with tools such as Ansible, Chef, Terraform, GitLab CI/CD, and Kubernetes; contribute improvements back to GitLab’s product and infrastructure where appropriate.
Build and maintain monitoring, alerting, and dashboards that:
Collaborate closely with:
Contribute to readiness reviews, incident reviews, and root cause analyses, turning learnings into changes in automation, process, or product.
Document every action, including runbooks, architecture decisions, and post-incident reviews, so your findings turn into repeatable practices and automation.
Proactively identify and reduce toil by automating repetitive operational work and simplifying migration workflows.

What you’ll bring

Experience operating highly available distributed systems at scale, ideally in a SaaS environment with customer-facing SLAs.
Hands-on experience with at least one major cloud provider (e.g., Google Cloud Platform or Amazon Web Services), including networking, storage, and managed services.
Experience with Kubernetes and its ecosystem (e.g., Helm), including deploying and troubleshooting workloads.
Experience with infrastructure as code and configuration management tools such as Terraform, Ansible, or Chef.
Strong programming skills in at least one general-purpose language (preferably Go or Ruby) and proficiency with scripting (e.g., Shell, Python).
Experience with observability systems (e.g., Prometheus, Grafana, logging stacks) and using metrics and logs to troubleshoot performance and reliability issues.
Practical exposure to data replication, backup/restore, or migration scenarios (e.g., database replication, storage replication, or Geo-like technologies) where data integrity and downtime risk must be carefully managed.
Comfort participating in an on-call rotation, investigating incidents across the stack, and driving follow-through on corrective actions.
Ability to engage directly with enterprise customers during migrations and incidents, including on live calls and through clear written updates.
Ability to clearly define problems, propose options, and think beyond immediate fixes to improve systems and processes over time.
Ability to be a “manager of one”: self-directed, organized, and able to drive work to completion in a remote, asynchronous environment.
Strong written and verbal communication skills, with a bias toward clear, asynchronous documentation and collaboration.
Alignment with our company values and a commitment to working in accordance with those values.

It’s a plus if you have

Experience working with disaster recovery technologies.
Experience with managed/hosted environments similar to GitLab Dedicated, including regulated or compliance-sensitive customers (e.g., SOC2, ISO).
Prior work on large-scale data migrations or cutovers where customer data integrity, performance, and downtime risk had to be carefully balanced.
Hands-on experience designing and operating database replication, backup/restore, and cutover workflows (for example, PostgreSQL or cloud-managed equivalents such as AWS RDS), including planning and executing low-risk migrations for large datasets.
Experience with multi-tenant architectures, sharding, or routing strategies in high-traffic SaaS platforms.
Familiarity with GitLab (self-managed or SaaS), and/or contributions to open source projects.

How GitLab Supports Full-Time Employees

Benefits to support your health, finances, and well-being
Flexible Paid Time Off
Team Member Resource Groups
Equity Compensation & Employee Stock Purchase Plan
Growth and Development Fund
Parental Leave
Home Office Support

Please note that we welcome interest from candidates with varying levels of experience; many successful candidates do not meet every single requirement. Additionally, studies have shown that people from underrepresented groups are less likely to apply to a job unless they meet every single qualification. If you're excited about this role, please apply and allow our recruiters to assess your application.

**Country Hiring Guidelines:** GitLab hires new team members in countries around the world. All of our roles are remote; however, some roles may carry specific location-based eligibility requirements. Our Talent Acquisition team can help answer any questions about location after you start the recruiting process.

**Privacy Policy:** Please review our Recruitment Privacy Policy. Your privacy is important to us.

GitLab is proud to be an equal opportunity workplace and is an affirmative action employer. GitLab’s policies and practices relating to recruitment, employment, career development and advancement, promotion, and retirement are based solely on merit, regardless of race, color, religion, ancestry, sex (including pregnancy, lactation, sexual orientation, gender identity, or gender expression), national origin, age, citizenship, marital status, mental or physical disability, genetic information (including family medical history), discharge status from the military, protected veteran status (which includes disabled veterans, recently separated veterans, active duty wartime or campaign badge veterans, and Armed Forces service medal veterans), or any other basis protected by law. GitLab will not tolerate discrimination or harassment based on any of these characteristics. See also GitLab’s EEO Policy and EEO is the Law. If you have a disability or special need that requires accommodation, please let us know during the recruiting process.

Senior Site Reliability Engineer, Tenant Services: Geo
