Site Reliability Engineer

Databricks · Data AI · Costa Rica · Infrastructure

Site Reliability Engineer role focused on building and maintaining IT infrastructure, including cloud platforms, CI/CD pipelines, and observability. The role involves architecting and automating infrastructure using IaC, optimizing system performance, and ensuring high availability. A key responsibility is building internal AI plugins and automation scripts to improve developer workflows and operational efficiency, while also focusing on incident response and cross-functional collaboration.

What you'd actually do

  1. Architect and Automate: Design and deploy production-grade infrastructure on cloud platforms (AWS/Azure) using Infrastructure as Code (IaC) tools like Terraform or Pulumi.
  2. Reliability and Performance Engineering: Optimize system performance, architecture, and scaling to ensure maximum uptime and minimal latency for critical IT services.
  3. CI/CD Excellence: Architect robust deployment pipelines (e.g., GitHub Actions), managing both hosted and self-hosted runners for specialized build requirements.
  4. Observable by Default: Create underlying infrastructure to ensure new internal applications are secure and have logging, metrics and alerts enabled by default.
  5. Agentic Tooling: Build internal AI plugins and automation scripts to streamline developer workflows and enhance operational efficiency.
  6. Incident Response: Focus on incident management workflows, subsequent data usage, and the dashboards needed to maintain service health. Participate in a shared on-call rotation, leading rapid incident response and technical troubleshooting for production outages. Facilitate blameless post-mortems to identify root causes and implement permanent preventive engineering solutions.
  7. Partner Cross-Functionally: Collaborate with Security, Engineering, and Support teams to deliver real business outcomes.

Skills

Required

  • Python
  • Terraform
  • AWS
  • Azure
  • GCP
  • Kubernetes
  • Docker
  • Datadog
  • Prometheus
  • ELK
  • Kafka
  • GitHub Actions
  • Infrastructure as Code (IaC)
  • CI/CD
  • Distributed Systems
  • Observability

Nice to have

  • Pulumi

What the JD emphasized

  • Python (non-negotiable)
  • Terraform (modules, state management)
  • AWS, Azure, or GCP
  • Kubernetes, Docker
  • Datadog, Prometheus, or ELK
  • Kafka or messaging queues
  • GitHub Actions
  • 5+ years of production-level experience