Lead Software Engineer, Reliability

Klaviyo · Enterprise · Dublin, Ireland · Engineering

The Lead Software Engineer, Reliability at Klaviyo sets technical direction and leads reliability strategy for critical platforms, treating reliability as a core product feature so that systems stay reliable, scalable, and sustainable. The role is hands-on: designing, building, and operating the foundational infrastructure and critical services that underpin reliability and operational excellence, defining reliability standards, reducing operational toil through automation, and improving systems based on production learnings. It also covers driving adoption of SRE best practices, identifying and addressing reliability risks, owning observability, leading incident response, guiding on-call strategy, performing quantitative analysis, partnering with leaders, and mentoring engineers. The posting also mentions experimenting with AI tools and workflows to work more efficiently.

What you'd actually do

  1. Set the technical vision and long-term strategy for reliability, availability, and operational excellence across critical platforms
  2. Lead the design, implementation, and evolution of foundational, security-critical services with strong guarantees around availability, scalability, latency, and fault tolerance
  3. Drive adoption of SRE best practices across engineering teams, including SLIs, SLOs, error budgets, and reliability-based decision making
  4. Identify systemic reliability risks and architectural bottlenecks, and lead cross-team initiatives to address them with durable, preventative solutions
  5. Apply software engineering principles to automate infrastructure, eliminate operational toil, and improve system reliability at scale

Skills

Required

  • Cloud-native, platform-focused SRE
  • Software engineering principles
  • Design and operation of distributed, cloud-native systems
  • Experience operating containerized workloads and platforms (e.g. Kubernetes)
  • Owning on-call strategy and participating in escalation for complex production incidents
  • Designing and evolving observability platforms and alerting strategies
  • Applying SRE concepts such as SLIs, SLOs, error budgets, and burn-rate–based alerting
  • Hands-on experience with infrastructure as code and declarative configuration (e.g. Terraform, Kubernetes manifests, policy-as-code)
  • Leading capacity planning, load testing, and performance analysis efforts for large-scale distributed systems
  • Driving high-quality post-incident reviews
  • Leading technical discussions, influencing architecture, and providing clear guidance across multiple teams
  • Writing production-quality code in Python, Go, or a similar language

Nice to have

  • Leading or supporting critical platforms or internal tooling
  • Familiarity with identity, access management, secrets management, or policy enforcement systems
  • Operating systems at scale in cloud environments (AWS preferred)
  • Resilience testing, fault injection, or chaos engineering
  • Strong understanding of algorithms and data structures as they apply to large-scale systems

What the JD emphasized

  • reliability, availability, and operational excellence
  • availability, scalability, latency, and fault tolerance
  • SLIs, SLOs, error budgets, and reliability-based decision making
  • systemic reliability risks and architectural bottlenecks
  • automate infrastructure, eliminate operational toil, and improve system reliability at scale
  • observability, alerting, and incident response practices
  • on-call strategy and operational processes
  • capacity planning, scaling limits, and performance characteristics
  • system architecture
  • incident response
  • cloud-native, platform-focused SRE
  • highly reliable production systems at scale
  • production-quality code
  • distributed, cloud-native systems
  • failure modes
  • containerized workloads and platforms
  • observability platforms and alerting strategies
  • SLIs, SLOs, error budgets, and burn-rate–based alerting
  • infrastructure as code and declarative configuration
  • capacity planning, load testing, and performance analysis
  • post-incident reviews
  • technical discussions, influencing architecture
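
For readers less familiar with the error-budget and burn-rate terms that recur in the lists above, here is a minimal sketch of how they relate. It is an illustration only, not anything taken from the posting: the function names are hypothetical, and the 14.4 paging threshold is an assumption borrowed from a commonly cited multi-window alerting convention, not a value the posting specifies.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail under the SLO (e.g. 0.001 for 99.9%)."""
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """How fast the error budget is being consumed in a window.

    1.0 means the budget would be spent exactly at the end of the SLO period;
    values above 1.0 mean it would run out early.
    """
    if total == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    observed_error_ratio = failed / total
    return observed_error_ratio / error_budget(slo_target)


def should_page(short_window_rate: float, long_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Multi-window check: page only if both windows exceed the threshold.

    The 14.4 default is an assumed example value, not one from the posting.
    Requiring both windows reduces pages caused by brief spikes.
    """
    return short_window_rate >= threshold and long_window_rate >= threshold


if __name__ == "__main__":
    # Hypothetical numbers: 99.9% SLO, 5-minute and 1-hour windows.
    short = burn_rate(failed=30, total=1_000, slo_target=0.999)
    long_ = burn_rate(failed=900, total=50_000, slo_target=0.999)
    print(f"short-window burn rate: {short:.1f}")
    print(f"long-window burn rate:  {long_:.1f}")
    print("page on-call:", should_page(short, long_))
```

Here the SLI is the ratio of failed to total requests, the SLO is the target it is measured against, and the burn rate expresses how quickly the remaining tolerance is being used up.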