Senior Software Engineer, Reliability

Klaviyo · Enterprise · Dublin, Ireland · Engineering

This Senior Software Engineer, Reliability role focuses on building and operating critical platforms and keeping them reliable, scalable, and sustainable. It involves applying software engineering principles and SRE best practices to automate infrastructure, reduce operational toil, and improve system reliability at scale. Key responsibilities include defining SLIs and SLOs, improving observability, participating in on-call rotations, and performing quantitative analysis. The ideal candidate is a cloud-native, platform-focused SRE with experience in distributed systems, Kubernetes, and observability tools. Experience with AI tools and workflows is a plus but not core to the role.

What you'd actually do

  1. Build and operate foundational, security-critical services with a strong emphasis on availability, scalability, latency, and fault tolerance
  2. Apply software engineering principles to automate infrastructure, reduce operational toil, and improve system reliability at scale
  3. Design, implement, and evolve systems using SRE best practices
  4. Define and refine SLIs, SLOs, and error budgets to guide engineering decisions
  5. Improve observability, alerting, and incident response to reduce mean time to detection and recovery

Skills

Required

  • Python
  • Go
  • distributed systems
  • Kubernetes
  • observability
  • SLIs
  • SLOs
  • error budgets
  • infrastructure as code
  • Terraform
  • capacity planning
  • load testing
  • performance analysis
  • post-incident reviews

Nice to have

  • security-critical platforms
  • internal security tooling
  • identity and access management
  • secrets management
  • policy enforcement systems
  • AWS
  • resilience testing
  • fault injection
  • chaos engineering
  • algorithms
  • data structures

What the JD emphasized

  • cloud-native
  • platform-focused
  • production-quality code
  • distributed, cloud-native systems
  • containerized workloads and platforms
  • observability systems
  • SLIs, SLOs, error budgets
  • infrastructure as code
  • capacity planning, load testing, and performance analysis
  • post-incident reviews
  • technical designs, platform APIs, operational runbooks, and system documentation