Lead Site Reliability Engineer

JPMorgan Chase · Banking · New York, NY +1 · Consumer & Community Banking

Lead Site Reliability Engineer for the Enterprise technology, liquidity risk team at JPMorgan Chase. This role focuses on owning non-functional requirements, driving improvements in customer experience, resiliency, security, scalability, monitoring, instrumentation, and automation. A key aspect involves leveraging and leading the adoption of enterprise-authorized AI capabilities to enhance SRE workflows, including incident triage, troubleshooting, and SDLC practices, while ensuring security and data sensitivity.

What you'd actually do

Lead SRE practices that balance delivery speed, efficiency, and system stability
Partner with engineering peers and senior stakeholders to drive strong, shared outcomes
Scale SRE adoption across application and platform teams
Set reliability expectations and show progress through stability and reliability metrics
Run blameless, data-driven post-incident reviews and regular debriefs to turn lessons into improvements

Skills

Required

Formal training or certification in software engineering concepts, plus 5+ years of applied experience
Advanced knowledge of SRE principles
Track record of implementing SRE across application and platform teams
Experience leading technologists to manage and resolve complex technology issues
Ability to influence team culture by championing innovation and driving change
Experience hiring, developing, and recognizing talent
Proficiency in at least one programming language (preferred: JavaScript, Go, Python)
Hands-on experience with CI/CD tools (e.g., Jenkins, GitLab, Terraform)
Experience with containers and orchestration (e.g., Docker, Kubernetes, ECS)
Demonstrated experience using enterprise-authorized AI capabilities within the work environment to improve SRE workflows
Ability to evaluate AI-assisted operational recommendations for correctness and risk, define appropriate guardrails for team usage, and ensure outcomes align to resiliency and security expectations.

Nice to have

Ability to code, troubleshoot, and demonstrate strong data fluency
Strong troubleshooting skills across common networking technologies and issues
Working knowledge of modern service and integration patterns, including GraphQL fundamentals, event-driven architecture (Kafka or equivalent), and observability/telemetry with OpenTelemetry

What the JD emphasized

Use enterprise-authorized AI capabilities within the work environment to accelerate major-incident triage, troubleshooting, and post-incident analysis, validating outputs and handling operational data according to sensitivity and security requirements.
Lead reuse-first adoption of AI-assisted reliability workflows across SDLC/toolchain practices (e.g., CI/CD quality checks, test/validation automation, and operational readiness), ensuring traceability/auditability, resiliency, and security controls.
Demonstrated experience using enterprise-authorized AI capabilities within the work environment to improve SRE workflows (e.g., incident investigation support and knowledge capture) with strong validation habits and awareness of data sensitivity.
Ability to evaluate AI-assisted operational recommendations for correctness and risk, define appropriate guardrails for team usage, and ensure outcomes align to resiliency and security expectations.


As a Site Reliability Engineer at JPMorgan Chase within the Enterprise technology, liquidity risk team, you are the non-functional requirement owner and champion for the applications in your remit. You are a key influencer in your team’s strategic planning, driving continual improvement in customer experience, resiliency, security, scalability, monitoring, instrumentation, and automation of the software in your area. You act in a blameless, data-driven manner and navigate difficult situations with composure and tact.

Job responsibilities

Lead SRE practices that balance delivery speed, efficiency, and system stability
Partner with engineering peers and senior stakeholders to drive strong, shared outcomes
Scale SRE adoption across application and platform teams
Set reliability expectations and show progress through stability and reliability metrics
Run blameless, data-driven post-incident reviews and regular debriefs to turn lessons into improvements
Build a continuous-improvement culture by gathering feedback and improving the customer experience
Coach entry- to mid-level engineers and promote knowledge sharing through internal forums and communities
Use enterprise-authorized AI capabilities within the work environment to accelerate major-incident triage, troubleshooting, and post-incident analysis, validating outputs and handling operational data according to sensitivity and security requirements.
Lead reuse-first adoption of AI-assisted reliability workflows across SDLC/toolchain practices (e.g., CI/CD quality checks, test/validation automation, and operational readiness), ensuring traceability/auditability, resiliency, and security controls.

Required qualifications, capabilities, and skills

Formal training or certification in software engineering concepts plus 5+ years of applied experience
Advanced knowledge of SRE principles and a track record of implementing SRE across application and platform teams while avoiding common pitfalls
Experience leading technologists to manage and resolve complex technology issues at a firmwide level
Ability to influence team culture by championing innovation and driving change
Experience hiring, developing, and recognizing talent
Proficiency in at least one programming language (preferred: JavaScript, Go, Python)
Hands-on experience with CI/CD tools (e.g., Jenkins, GitLab, Terraform)
Experience with containers and orchestration (e.g., Docker, Kubernetes, ECS)
Demonstrated experience using enterprise-authorized AI capabilities within the work environment to improve SRE workflows (e.g., incident investigation support and knowledge capture) with strong validation habits and awareness of data sensitivity.
Ability to evaluate AI-assisted operational recommendations for correctness and risk, define appropriate guardrails for team usage, and ensure outcomes align to resiliency and security expectations.

Preferred qualifications, capabilities, and skills

Ability to code, troubleshoot, and demonstrate strong data fluency
Strong troubleshooting skills across common networking technologies and issues
Working knowledge of modern service and integration patterns, including GraphQL fundamentals, event-driven architecture (Kafka or equivalent), and observability/telemetry with OpenTelemetry
