Software Engineer II, Reliability

Klaviyo · Enterprise · Dublin, Ireland · Engineering

This Software Engineer II, Reliability role focuses on ensuring the reliability, scalability, and sustainability of Klaviyo's critical platforms. Responsibilities include building and operating production systems, automating operational tasks, improving observability, and participating in incident response. The role requires experience with cloud-native production systems, the ability to write code for automation, an understanding of failure modes in distributed systems, and familiarity with SRE concepts. Although it is not core to the role, candidates are encouraged to have experimented with AI tools to work more efficiently.

What you'd actually do

  1. Build, operate, and improve production systems with a focus on reliability, scalability, and performance
  2. Apply software engineering principles to automate operational tasks and reduce manual toil
  3. Contribute to the design and implementation of systems using established SRE best practices
  4. Help define and measure SLIs and SLOs for services you support
  5. Improve observability through metrics, dashboards, logging, and tracing

Skills

Required

  • Experience operating cloud-native production systems and services
  • Ability to write production-quality code (e.g. Python, Go, or similar) to automate operations and improve reliability
  • Understanding of common failure modes in distributed systems
  • Experience working with containerized workloads and platforms (e.g. Kubernetes) in production environments
  • Experience participating in on-call rotations and diagnosing straightforward production issues
  • Experience using observability tools and responding to alerts
  • Familiarity with SRE concepts such as SLIs, SLOs, and error budgets
  • Hands-on experience with infrastructure as code or declarative configuration (e.g. Terraform, Kubernetes manifests)
  • Ability to follow incident response processes and contribute meaningfully during outages

Nice to have

  • Experience supporting security-sensitive systems or internal platforms
  • Familiarity with AWS or other cloud providers
  • Exposure to messaging or asynchronous systems (e.g. Kafka, RabbitMQ, Celery)
  • Interest in performance testing, capacity planning, or resilience work
  • Practical experience with algorithms and data structures

What the JD emphasized

  • reliability
  • scalability
  • performance
  • automation
  • observability
  • incident response
  • SLIs
  • SLOs