Senior Site Reliability Engineer (SRE & Platform Reliability)

Affirm · Fintech · Poland, Spain · Remote · Infrastructure Platform Eng

This Senior Site Reliability Engineer (SRE & Platform Reliability) role at Affirm focuses on operating applications, building tooling, and providing training. Responsibilities include providing data and visibility on application performance, guiding SLO development, driving incident and change management, engaging in service and architectural conversations, and recommending observability and alerting configurations. The role requires experience in infrastructure, platform, distributed systems, capacity management, automation, observability, and configuration management, as well as development and product experience. It also involves owning and delivering quarterly goals, leading engineers, supporting stakeholders, identifying technical solutions for incident readiness, supporting operations and availability, fostering a culture of quality, and developing talent.

What you'd actually do

  1. You will be responsible for owning and delivering quarterly goals for your team, leading engineers on your team through ambiguity to solve open-ended problems, and ensuring that everyone is supported throughout delivery.
  2. You will support your peers and stakeholders in the product development lifecycle by collaborating with infrastructure, product management, developer experience, and analytics: participating in ideation, articulating technical constraints, and partnering on decisions that properly consider risks and trade-offs.
  3. You will proactively identify technical solutions and operational processes that strengthen incident readiness, response, and post-incident analysis.
  4. You will support the operations and availability of your team’s artifacts by creating and monitoring metrics, escalating when needed, and supporting “keep the lights on” & on-call efforts.
  5. You will foster a culture of quality and ownership on your team by setting or improving code review and design standards for your team, and advocating for them beyond your team through your writing and tech talks.

Skills

Required

  • Designing, developing and launching backend systems at scale
  • Scripting and development languages like Bash, Python or Kotlin
  • Developing highly available distributed systems
  • AWS
  • MySQL
  • Kubernetes
  • Contributing to or driving parts of the Incident Lifecycle process
  • Site Reliability or Production Engineering experience
  • Defining a technical plan for the delivery of a significant feature or system component
  • Writing high quality code
  • Making impactful changes in a large code base
  • Developing a suite of tools and practices
  • Ownership of growth
  • Proactively seeking feedback
  • Strong verbal and written communication skills

Nice to have

  • Capacity management
  • Load and chaos testing
  • Automation
  • Observability
  • Configuration management
  • Development and product experience

What the JD emphasized

  • 4+ years of experience designing, developing and launching backend systems at scale using scripting and development languages like Bash, Python or Kotlin.
  • You have a track record of developing highly available distributed systems using technologies like AWS, MySQL and Kubernetes.
  • You have meaningful experience contributing to or driving parts of the Incident Lifecycle process, enabling actionable insights that improve the quality culture, reliability, resilience, and system performance.
  • You have 4+ years working in a Site Reliability or Production Engineering team.