Manager, Software Engineering (Resilience Engineering)

Affirm · Fintech · Canada, United States · Remote · Infrastructure Platform Eng

Manager for a Resilience Engineering team focused on production load testing and chaos engineering to ensure the safety and reliability of Affirm's financial systems. The role involves leading a team, defining strategy, building platforms for safe production experimentation, and ensuring strong safeguards and observability for resilience experiments.

What you'd actually do

  1. Define and drive the vision for resilience engineering at Affirm, with a focus on production load testing and chaos engineering as first-class engineering practices.
  2. Lead and mentor a team of engineers building platforms and tooling for safe production experimentation.
  3. Partner with infrastructure, product, and security leadership to embed resilience validation into the software development lifecycle.
  4. Own the design and evolution of platforms that enable safe, controlled production load testing and fault injection.
  5. Build systems that provide end-to-end observability, traceability, and auditability for all resilience experiments.

Skills

Required

  • Proven experience leading engineering teams in reliability, infrastructure, or distributed systems.
  • Hands-on experience with production load testing, chaos engineering, or large-scale system validation.
  • Experience using a chaos engineering vendor such as Gremlin or Harness, or a similar tool.
  • Strong understanding of failure modes in distributed systems, including latency, partial failure, and cascading outages.
  • Experience building or operating systems with strong safety guarantees (isolation, rate limiting, guardrails, auditability).
  • Familiarity with cloud-native environments (AWS, Kubernetes) and observability tooling.
  • Strong programming background (e.g., Python, Kotlin, Java, or similar).
  • Excellent problem-solving skills.
  • Strong communication and leadership skills.

Nice to have

  • guardrails

What the JD emphasized

  • production load testing
  • chaos engineering
  • chaos experiments
  • production load tests