Site Reliability Engineer (SRE)

xAI · AI Frontier · London, United Kingdom · Infrastructure

Site Reliability Engineer responsible for backend services powering products like grok.com and the API, focusing on highly scalable and reliable services hosted on Kubernetes clusters.

What you'd actually do

  1. You will work on the team that is responsible for the backend services that power our products such as grok.com and the API.
  2. We focus on writing and maintaining highly scalable and reliable services that can efficiently process tens of thousands of queries per second.
  3. The services are hosted on a number of Kubernetes clusters (on-prem & cloud).

Skills

Required

  • Kubernetes
  • Buildkite
  • ArgoCD
  • Prometheus
  • Grafana
  • PagerDuty
  • Pulumi
  • Terraform
  • Rust
  • C++
  • Go
  • nginx
  • Envoy

What the JD emphasized

  • Expert knowledge of Kubernetes
  • Expert knowledge of continuous deployment systems such as Buildkite and ArgoCD
  • Expert knowledge of monitoring technologies such as Prometheus, Grafana, and PagerDuty
  • Expert knowledge of infrastructure as code technologies such as Pulumi or Terraform