Principal Engineer, Core Infrastructure

Klaviyo · Enterprise · Boston, MA · Engineering

This Principal Engineer, Core Infrastructure role at Klaviyo focuses on architecting and operating the core cloud platform (Kubernetes, service mesh, networking, storage, CI/CD, observability) and on embedding AI into the developer experience to improve shipping speed and safety. It requires significant experience in cloud platform engineering and a hands-on approach to integrating AI tools for developer productivity, AIOps, and operational excellence.

What you'd actually do

  1. Architect and evolve the Kubernetes platform, service mesh, networking, storage, and CI/CD pipelines; ship golden paths and IaC modules.
  2. Define platform SLOs; use error budgets to guide reliability vs. velocity trade‑offs; drive incident learning and readiness reviews.
  3. Improve developer velocity (build/deploy times, flaky tests, local dev ergonomics) with measurable results.
  4. Lead capacity planning and commitments; build guardrails for cost, security, and compliance with Security/FinOps partners.
  5. Write high‑impact code, automation, and tooling; mentor across teams and raise the bar on operational excellence.

Skills

Required

  • Kubernetes
  • service mesh
  • Terraform/IaC
  • CI/CD
  • production observability
  • cloud platforms
  • multi-region HA
  • SLO rigor
  • databases
  • storage systems
  • SQL
  • NoSQL
  • object storage
  • block storage
  • file storage
  • AI tools & automation
  • AIOps
  • incident triage
  • anomaly detection
  • runbook automation
  • security
  • cost boundaries
  • design reviews
  • incident excellence
  • SLO/error-budget tradeoffs

Nice to have

  • enterprise governance
  • compliance
  • audit requirements
  • GDPR
  • data privacy

What the JD emphasized

  • 10+ years building and operating cloud platforms
  • Deep in Kubernetes, service mesh, Terraform/IaC, CI/CD, and production observability
  • You’ve brought AI into platform engineering.
  • You lead via design reviews, incident excellence, and SLO/error‑budget tradeoffs communicated in business terms.
  • You’re hands‑on with AI tools and help teams adopt them responsibly.

Other signals

  • Embed AI in the developer experience
  • AI tools & automation
  • AI fluency