Senior Backend Engineer (Ruby or Golang), Tenant Scale; Cells Infrastructure

GitLab · Enterprise · APAC +3 · Remote · Platforms Engineering

This role is for a Senior Backend Engineer on the Cells Infrastructure team at GitLab. The primary focus is building and operating foundational services for GitLab's Cells architecture, specifically the edge routing services and the Topology Service. The goal is to enable GitLab.com to scale horizontally by routing traffic reliably and with low latency across a fleet of independent Cell clusters. The role involves designing, implementing, and operating these systems, collaborating with other teams, and contributing to documentation and operational readiness. Although the company describes AI as a productivity multiplier and incorporates it into daily workflows, this role's core responsibilities are backend engineering for infrastructure scaling, not building or directly managing AI/ML models or systems.

What you'd actually do

  1. Design and implement edge traffic routing that directs requests to the correct Cell in a way that's transparent to users.
  2. Build and evolve the Topology Service that serves as the authoritative source of cluster state for routing, resource assignment, and Cell lifecycle decisions.
  3. Collaborate with feature teams across the product to make features and data models in the GitLab Rails monolith and supporting services Cell-aware.
  4. Operate and improve the routing and topology systems you build by participating in tier-2 on-call, responding to escalated incidents, and strengthening observability and operational tooling.
  5. Author Architecture Decision Records (ADRs), operational runbooks, and documentation so other teams can understand, adopt, and extend the Cells platform.

Skills

Required

  • Experience building observable, resilient production services using Golang or Ruby on Rails
  • Background delivering and operating production systems in high-scale environments, including incident response and operational ownership
  • Ability to reason about distributed systems, including consistency models, partitioning strategies, failure modes, and operational tradeoffs
  • Experience building high-throughput networking services
  • Familiarity with working in large, multi-team codebases and coordinating changes across teams and services, including making features and data models Cell-aware
  • Knowledge of observability practices such as metrics, tracing, and alerting, with an approach focused on building systems you'd be confident operating on-call
  • Strong written communication skills for an async-first, globally distributed team, including documenting decisions (for example, architecture decision records) and runbooks
  • Experience working with relational databases in production, including schema design, migrations, and query performance tuning

Nice to have

  • TypeScript experience
  • gRPC and protocol buffers knowledge
  • PostgreSQL experience