Senior Site Reliability Engineer (SRE & Platform Reliability)

Affirm · Fintech · Poland, Spain · Remote · Infrastructure Platform Eng

This Senior Site Reliability Engineer (SRE) role at Affirm focuses on operating applications, building tooling, and providing training for engineering partners. Responsibilities include ensuring application performance, guiding SLO development, driving incident management, steering change management and deployment practices, recommending observability and alerting configurations, and engaging in service and architectural conversations. The role requires experience in infrastructure, platform, distributed systems, capacity management, automation, observability, configuration management, and backend systems. The candidate will own and deliver quarterly goals, support stakeholders, identify technical solutions for incident readiness, support operations and availability through metrics and on-call efforts, and foster a culture of quality and ownership.

What you'd actually do

  1. You will be responsible for owning and delivering quarterly goals for your team, leading engineers on your team through ambiguity to solve open-ended problems, and ensuring that everyone is supported throughout delivery.
  2. You will support your peers and stakeholders in the product development lifecycle by collaborating with infrastructure, product management, developer experience, and analytics: participating in ideation, articulating technical constraints, and partnering on decisions that properly consider risks and trade-offs.
  3. You will proactively identify technical solutions and operational processes that strengthen incident readiness, response, and post-incident analysis.
  4. You will support the operations and availability of your team’s artifacts by creating and monitoring metrics, escalating when needed, and supporting “keep the lights on” & on-call efforts.
  5. You will foster a culture of quality and ownership on your team by setting or improving code review and design standards for your team, and advocating for them beyond your team through your writing and tech talks.

Skills

Required

  • Designing, developing and launching backend systems at scale
  • Scripting and development languages like Bash, Python or Kotlin
  • Developing highly available distributed systems
  • AWS
  • MySQL
  • Kubernetes
  • Contributing to or driving parts of the Incident Lifecycle process
  • Site Reliability or Production Engineering team experience
  • Defining a technical plan for the delivery of a significant feature or system component
  • Making impactful changes in a large code base
  • Developing a suite of tools and practices that enable you and your team to make impactful changes in a large code base safely
  • Strong verbal and written communication skills

Nice to have

  • Experience in capacity management, load and chaos testing
  • Experience in automation, observability, and configuration management
  • Development and product experience

What the JD emphasized

  • 4+ years of experience designing, developing and launching backend systems at scale using scripting and development languages like Bash, Python or Kotlin.
  • 4+ years working in a Site Reliability or Production Engineering team.
  • On-call rotation: participating in an on-call rotation is a requirement for this role.