Staff Site Reliability Engineer, Release Engineering

Plaid · Fintech · New York, NY · All Departments

This is a Staff Site Reliability Engineer role focused on Release Engineering at Plaid, a fintech company. The role involves defining and scaling reliability practices, architecting SLO and error-budget programs, driving the adoption of progressive delivery, and ensuring new products are production-ready. A key aspect is preparing for AI-driven development by scaling safety nets to handle an increased volume and frequency of code changes.

What you'd actually do

  1. Lead the expansion of reliability standards across product engineering, converting foundational infrastructure into lasting operational habits and tooling.
  2. Architect and manage the SLO and error-budget framework so that teams can use reliability data to make product and release decisions.
  3. Promote widespread use of progressive delivery and automated safety gates, ensuring high velocity without compromising production stability.
  4. Guide emerging product teams toward production readiness through expertise in observability, incident response, and scalable deployment health.
  5. Collaborate with SRE, Platform, and Infrastructure teams to transform complex production requirements into intuitive, self-service platform features.

Skills

Required

  • Over 8 years of professional experience in backend systems, SRE, or platform engineering roles.
  • Proven track record of designing reliability programs—such as service maturity models or SLI frameworks—that achieved cross-team adoption.
  • Direct experience building or operating canary rollout systems, metric-gated analysis, or automated rollback infrastructure.
  • Technical proficiency in software development, with a preference for Go or similar systems languages.
  • Ability to drive organizational change and influence engineering culture without formal authority.
  • Sound technical judgment in high-stakes production scenarios, balancing user impact with developer velocity.

Nice to have

  • Prior exposure to Kubernetes, service mesh technologies, Prometheus, or ArgoCD.

What the JD emphasized

  • define and scale Plaid's reliability practices
  • architect our SLO and error-budget programs
  • drive the adoption of progressive delivery
  • ensure new products are production-ready
  • scaling our safety nets to handle an increased volume and frequency of code changes