Senior Site Reliability Engineer, AI Research

Algolia · Enterprise · Australia · R&D

We are looking for a Senior Site Reliability Engineer to join the AI Research team, focusing on the stability, scalability, and operability of the infrastructure that supports AI research and product engineering. The role is classic SRE work on cloud-first, service-oriented architectures on GCP, with no AI/ML experience required. Responsibilities include platform reliability, embedded collaboration with researchers and engineers, and cloud infrastructure operations.

What you'd actually do

  1. Support and evolve the reliability of platforms used by the AI Research team.
  2. Ensure production services meet expectations for availability, latency, and operational readiness, particularly for systems that sit on customer-critical paths.
  3. Design infrastructure and operational patterns that prioritize iteration speed while maintaining appropriate safeguards for production systems.
  4. Work closely with researchers and engineers in a cross-functional setting, acting as an advisor on infrastructure, reliability, and operational concerns.
  5. Build and maintain Kubernetes-based services on GCP using infrastructure-as-code and GitOps (Terraform, ArgoCD).

Skills

Required

  • Strong experience operating cloud-first infrastructure
  • Hands-on experience running production services on Kubernetes
  • Proficiency with infrastructure-as-code (Terraform) and CI/CD systems
  • Experience supporting production services written in Go
  • Solid grounding in service reliability, incident response, and operational best practices
  • Comfort working in environments with ambiguity, where problems are not always well-defined upfront

Nice to have

  • Python experience
  • Experience supporting mission-critical internal platforms
  • Exposure to research or experimentation-heavy environments
  • Familiarity working alongside researchers or highly specialized domain experts

What the JD emphasized

  • AI or ML experience is not required
  • Strong SRE fundamentals
  • Experience operating production services
  • Comfort working in an environment with ambiguity and high ownership
  • Cloud-first infrastructure
  • Production services on Kubernetes
  • Infrastructure-as-code (Terraform)
  • CI/CD systems
  • Go
  • Service reliability, incident response, and operational best practices
  • Ambiguity, where problems are not always well-defined upfront

Explicitly not required

  • AI, ML, or deep learning experience
  • Model training, tuning, or ML framework expertise