Lead Software Engineer – Cloud AWS Resiliency

JPMorgan Chase · Banking · Plano, TX +1 · Consumer & Community Banking

Lead Software Engineer focused on cloud AWS resiliency for the Deposits Platform at JPMorgan Chase. Responsibilities include leading DR tests, developing resiliency frameworks, defining RTO/RPO, automating DR provisioning with IaC, driving chaos engineering, building detection capabilities, establishing DR playbooks, mentoring engineers, and reporting on resilience maturity. The role also emphasizes driving team adoption of enterprise-authorized AI-assisted engineering practices for code quality, delivery speed, and operational outcomes, with a strong focus on responsible AI use, data sensitivity, and secure handling of inputs/outputs.

What you'd actually do

  1. Lead and coordinate end-to-end DR tests and real-time failover events across all applications within the product portfolio, ensuring smooth cross-team collaboration.
  2. Develop and maintain overarching resiliency frameworks that development and infrastructure teams can consume via automated product offerings and repeatable patterns.
  3. Define and monitor Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) for all product components, tracking metrics to identify architectural or procedural gaps.
  4. Partner with application teams to automate disaster recovery provisioning, scaling, configuration, and monitoring using Infrastructure as Code (IaC) tools like Terraform.
  5. Drive resiliency testing scenarios and chaos engineering using native tools like AWS Fault Injection Service (FIS) and AWS Resilience Hub to identify vulnerabilities before they impact production.

Skills

Required

  • Formal training or certification in software engineering concepts and 5+ years of applied experience.
  • Hands-on practical experience delivering system design, application development, testing, and operational stability.
  • Hands-on experience with AWS cloud architectures, specifically DR-enabling services like AWS Elastic Disaster Recovery, AWS Backup, and multi-AZ/multi-region deployments.
  • Proficiency in at least one modern language (Python, Go, or Bash) and familiarity with Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
  • 5+ years in Site Reliability Engineering (SRE), Disaster Recovery Planning, or Distributed Systems Engineering.
  • Demonstrated experience leading effective use of approved AI-assisted software development tools (e.g., for coding, code review, test acceleration, troubleshooting) with the ability to set team expectations for validating AI outputs for correctness, performance, and security.
  • Strong understanding of responsible AI use in engineering workflows, including data sensitivity considerations, secure handling of inputs/outputs, and adherence to resiliency and security expectations; experience coaching engineers on safe, compliant adoption within delivery practices.
  • Advanced understanding of agile methodologies and practices such as CI/CD, application resiliency, and security.
  • Demonstrated proficiency in software applications and technical processes within a technical discipline (e.g., cloud, artificial intelligence, machine learning, mobile, etc.)

Nice to have

  • In-depth knowledge of the financial services industry and its IT systems.
  • Practical cloud-native experience.

What the JD emphasized

  • Demonstrated experience leading effective use of approved AI-assisted software development tools (e.g., for coding, code review, test acceleration, troubleshooting) with the ability to set team expectations for validating AI outputs for correctness, performance, and security.
  • Strong understanding of responsible AI use in engineering workflows, including data sensitivity considerations, secure handling of inputs/outputs, and adherence to resiliency and security expectations; experience coaching engineers on safe, compliant adoption within delivery practices.