Lead Site Reliability Engineer

JPMorgan Chase · Banking · New York, NY +1 · Consumer & Community Banking

Lead Site Reliability Engineer on the Enterprise Technology liquidity risk team at JPMorgan Chase. The role owns non-functional requirements and drives improvements in customer experience, resiliency, security, scalability, monitoring, instrumentation, and automation. A key part of the job is using, and leading the adoption of, enterprise-authorized AI capabilities to improve SRE workflows such as incident triage, troubleshooting, and SDLC practices, while respecting security and data-sensitivity requirements.

What you'd actually do

  1. Lead SRE practices that balance delivery speed, efficiency, and system stability
  2. Partner with engineering peers and senior stakeholders to drive strong, shared outcomes
  3. Scale SRE adoption across application and platform teams
  4. Set reliability expectations and show progress through stability and reliability metrics
  5. Run blameless, data-driven post-incident reviews and regular debriefs to turn lessons into improvements

Skills

Required

  • 5+ years of applied experience in software engineering concepts
  • Advanced knowledge of SRE principles
  • Track record of implementing SRE across application and platform teams
  • Experience leading technologists to manage and resolve complex technology issues
  • Ability to influence team culture by championing innovation and driving change
  • Experience hiring, developing, and recognizing talent
  • Proficiency in at least one programming language (JavaScript, Go, Python)
  • Hands-on experience with CI/CD and infrastructure-as-code tools (e.g., Jenkins, GitLab, Terraform)
  • Experience with containers and orchestration (e.g., Docker, Kubernetes, ECS)
  • Demonstrated experience using enterprise-authorized AI capabilities within the work environment to improve SRE workflows
  • Ability to evaluate AI-assisted operational recommendations for correctness and risk, define appropriate guardrails for team usage, and ensure outcomes align to resiliency and security expectations

Nice to have

  • Ability to code, troubleshoot, and demonstrate strong data fluency
  • Strong troubleshooting skills across common networking technologies and issues
  • Working knowledge of modern service and integration patterns
  • GraphQL fundamentals
  • Event-driven architecture (Kafka or equivalent)
  • Observability/telemetry with OpenTelemetry

What the JD emphasized

  • Uses enterprise-authorized AI capabilities within the work environment to accelerate major-incident triage, troubleshooting, and post-incident analysis, validating outputs and handling operational data according to sensitivity and security requirements.
  • Leads reuse-first adoption of AI-assisted reliability workflows across SDLC/toolchain practices (e.g., CI/CD quality checks, test/validation automation, and operational readiness), ensuring traceability/auditability, resiliency, and security controls.
  • Demonstrated experience using enterprise-authorized AI capabilities within the work environment to improve SRE workflows (e.g., incident investigation support and knowledge capture) with strong validation habits and awareness of data sensitivity.
  • Ability to evaluate AI-assisted operational recommendations for correctness and risk, define appropriate guardrails for team usage, and ensure outcomes align to resiliency and security expectations.