Manager, Software Engineering (Resilience Engineering)

Affirm · Fintech · Canada, United States · Remote · Infrastructure Platform Eng

Manager of the Resilience Engineering team, which ensures the safety and reliability of production systems through proactive validation techniques such as load testing and chaos engineering. The role involves leading a team that builds platforms and tooling for safe production experimentation, defining the team's vision, and establishing best practices for testing system limits and failure scenarios.

What you'd actually do

  1. Define and drive the vision for resilience engineering at Affirm, with a focus on production load testing and chaos engineering as first-class engineering practices.
  2. Lead and mentor a team of engineers building platforms and tooling for safe production experimentation.
  3. Partner with infrastructure, product, and security leadership to embed resilience validation into the software development lifecycle.
  4. Establish best practices for safely testing system limits and failure scenarios in production.
  5. Own the design and evolution of platforms that enable safe, controlled production load testing and fault injection.

Skills

Required

  • Proven experience leading engineering teams in reliability, infrastructure, or distributed systems.
  • Hands-on experience with production load testing, chaos engineering, or large-scale system validation.
  • Experience working with a chaos engineering vendor such as Gremlin, Harness, or a similar provider.
  • Strong understanding of failure modes in distributed systems, including latency, partial failure, and cascading outages.
  • Experience building or operating systems with strong safety guarantees (isolation, rate limiting, guardrails, auditability).
  • Familiarity with cloud-native environments (AWS, Kubernetes) and observability tooling.
  • Strong programming background (e.g., Python, Kotlin, Java, or similar).
  • Excellent problem-solving skills.
  • Strong communication and leadership skills.

What the JD emphasized

  • production load testing
  • chaos engineering
  • production experimentation
  • safely test system behavior under stress and failure conditions in production
  • strong safety guarantees