Site Reliability Engineer

Airbyte · Data AI · San Francisco, CA · Engineering

Airbyte is seeking a Software Engineer, Platform to own the infrastructure and reliability of its Data Replication platform, which powers data movement for AI applications. The role involves managing Kubernetes, CI/CD, observability, and tooling, with an emphasis on using AI tools to automate tasks and improve efficiency.

What you'd actually do

  1. Own the infrastructure underpinning the Data Replication platform - Kubernetes clusters, CI/CD pipelines, secrets management, networking, and cloud resource configuration across AWS and GCP.
  2. Partner with product engineers to reliably integrate product features with infrastructure.
  3. Maintain and enhance observability, alerting, and anomaly detection with an eye towards LLM automation.
  4. Maintain and enhance AI-augmented release and internal tooling: canary deployments, progressive rollouts, automated release qualification, and rollback automation - with an eye towards LLM automation.
  5. Set the infrastructure bar for the team - build self-serve tooling, write runbooks, and coach engineers to own more of their stack.

Skills

Required

  • 7+ years in infrastructure, platform engineering, SRE, or DevOps.
  • Hands-on ownership of Kubernetes, Helm, and Terraform in production environments.
  • Deep experience with observability stacks (Prometheus, Grafana, Datadog) and on-call operations.
  • Experience with CI/CD pipeline ownership and developer tooling.
  • Ability and willingness to read backend code to understand how systems break and instrument them correctly.
  • Fluency with AI tools (LLMs and agentic frameworks) to automate work, debug faster, and reduce toil.

Nice to have

  • Data pipelines, replication systems, or ETL/ELT platforms.
  • Control plane / data plane architectures or internal developer platforms.
  • Experience with Airbyte, CDKs, or connector-based architectures.

What the JD emphasized

  • AI tools
  • LLM automation