Senior Software Engineer, Asynchronous Processing

Klaviyo · Enterprise · Boston, MA · Engineering

Senior Platform Engineer on the Asynchronous Processing team responsible for architecting, building, and operating a high-scale, event-driven backbone using technologies like Golang, Python, Apache Pulsar, Kafka, SQS, AWS, and Kubernetes. The role focuses on creating self-service platforms for queueing and background processing, ensuring reliability, scalability, low latency, and observability for product teams.

What you'd actually do

  1. Build a deep understanding of engineering needs across the organization, and guide the design and development of queueing platform primitives that align with the platform's vision and practically empower product teams.
  2. Design, develop, and deliver software to dramatically improve the availability, scalability, latency, and efficiency of Klaviyo's asynchronous and queueing services.
  3. Design and develop systems and processes that make services highly available and scalable, with a focus on asynchronous processing.
  4. Leverage technology such as Python, Golang, AWS, and Kubernetes to advance Klaviyo's platform, with a deep focus on Apache Pulsar, SQS, and Kafka.
  5. Champion best practices by actively collaborating with other teams in a culture that values technical design review.

Skills

Required

  • Golang
  • Python
  • Apache Pulsar
  • Kafka
  • SQS
  • AWS
  • Kubernetes
  • distributed systems
  • queueing systems
  • background processing
  • observability
  • SLOs
  • incident management
  • root cause analysis

Nice to have

  • Terraform
  • AI tools and workflows

What the JD emphasized

  • highly available, full-stack SaaS products at scale
  • operational health of the systems you build, including performance, reliability, and observability
  • defining and upholding SLOs
  • participating in on-call
  • driving follow-through on incidents and RCAs
  • Expertise and hands-on experience with asynchronous processing and queueing systems like SQS, Kafka, or Apache Pulsar
  • handle yourself and complex systems in outage situations
  • drive failures to root cause analysis and prevention of future issues