Infrastructure Engineer - Observability (APAC)

Supabase · Data AI · APAC · Engineering

Infrastructure Engineer focused on observability, owning Kubernetes infrastructure for logging, tracing, and metrics systems. Responsibilities include collaborating on telemetry practices, ensuring high uptime, and scaling systems like VictoriaMetrics and OpenTelemetry Collector.

What you'd actually do

  1. Collaborate closely with our infrastructure and product teams to establish and enforce org-wide practices for emitting and collecting telemetry across a wide range of internal and external-facing services. This includes contributing to shared documentation and advocating for best practices.
  2. Own and operate the Observability team's Kubernetes infrastructure. You will help define the documentation, operational flows, and engineering standards that ensure high uptime across the logging, tracing, and metrics systems used by internal and external stakeholders.
  3. Work within the Observability team to ensure industry-standard deployment and reliability practices are used, and develop industry-leading reliability software so that our observability systems never go down for our customers.
  4. Orchestrate and scale systems such as VictoriaMetrics, OpenTelemetry Collector, and Vector.

Skills

Required

  • 5+ years of experience in a Site Reliability Engineering role
  • Experience operating and supporting clustered applications in production environments
  • Hands-on experience deploying and managing applications in Kubernetes (k8s) environments
  • Working knowledge of PostgreSQL, including administration, performance tuning, and troubleshooting
  • Proficiency with at least one Infrastructure as Code (IaC) tool (e.g., Terraform, Pulumi, OpenTofu, or equivalent)
  • Experience with telemetry tooling such as OpenTelemetry, VictoriaMetrics, Grafana, and Prometheus

Nice to have

  • Experience with AWS services
  • Strong documentation and communication skills

What the JD emphasized

  • high throughput data services
  • high availability
  • systems reliability
  • high volume data pipeline environments
  • high uptime
  • never go down