What you'd actually do

Creates and delivers high-quality designs, roadmaps, and program charters, while designing and developing robust software solutions, CI/CD pipelines, and infrastructure automation to optimize system reliability, scalability, and performance

Acts as a key resource and mentor for technologists, fostering a culture of site reliability, inclusion, and engineering excellence while guiding teams on best practices across cloud infrastructure, automation, and operational readiness

Collaborates with stakeholders to design and implement observability, alerting, and reliability solutions, including SLOs/SLIs, monitoring frameworks, and incident response processes that ensure stable, scalable, and high-performing systems

Uses enterprise-authorized AI capabilities within the work environment to accelerate reliability design and operational decisioning (e.g., incident/post-incident analysis and requirements traceability), validating outputs and handling operational data according to sensitivity and security requirements, while also leveraging modern tooling to optimize CI/CD and operational workflows.

Drives evolution, debugging, and performance optimization of critical systems by managing cloud-native infrastructure (AWS), container platforms (Docker/Kubernetes/EKS/ECS), and understanding application dependencies and system limitations

Skills

Required

Formal training or certification on site reliability engineering concepts and 5+ years applied experience
Advanced understanding of site reliability culture and principles and a track record of demonstrating how to implement site reliability within an application or platform
Advanced knowledge and experience in observability such as white and black box monitoring, service level objectives, alerting, and telemetry collection, along with hands-on experience with monitoring tools such as Grafana, Dynatrace, Prometheus, Datadog, or Splunk
Demonstrated experience using enterprise-authorized AI capabilities within the work environment to improve reliability engineering workflows with strong validation habits and awareness of data sensitivity, along with experience leveraging automation and modern DevOps practices.
Ability to set team practices for safe AI usage in operations (e.g., review/approval expectations and escalation paths) while maintaining resiliency, security, and auditability outcomes, including governance of secure cloud and automation practices.
Advanced knowledge of software applications and technical processes with considerable depth in one or more technical disciplines, including AWS services (IAM, VPC, EC2, S3, RDS/Aurora, CloudWatch, EKS/ECS, Lambda, Route 53) and CI/CD tooling (GitHub Actions, Jenkins, GitLab CI, Azure DevOps)
Demonstrated ability to communicate data-based solutions with complex reporting and visualization methods
Strong communication skills and a desire to mentor and educate others on site reliability engineering principles and practices

Nice to have

Familiarity with modern front-end technologies
Experience with large-scale distributed systems
Knowledge of networking and cloud security best practices
Strong collaboration, communication, and stakeholder management skills
Proactive, innovative mindset with a passion for continuous learning

What the JD emphasized

Leads reuse-first adoption of AI-assisted reliability workflows across SDLC/toolchain practices (e.g., testing/validation automation and production readiness), ensuring traceability/auditability, resiliency, and security controls, while enforcing governance, security best practices (IAM, secrets management), and reliability-focused automation.

Elevate your engineering prowess to unprecedented levels by joining a team of exceptionally gifted professionals and position yourself among the top echelon in site reliability.

As a Senior Lead Site Reliability Engineer at JPMorgan Chase within the Corporate Technology, Compliance Technology team, you work with your fellow stakeholders to define non-functional requirements (NFRs) and availability targets for the services in your application and product lines. You will ensure those NFRs are accounted for in your products’ design and test phases, that your service level indicators are effectively measuring customer experience, and that service level objectives are defined with stakeholders and implemented in production.

Job Responsibilities

Creates and delivers high-quality designs, roadmaps, and program charters, while designing and developing robust software solutions, CI/CD pipelines, and infrastructure automation to optimize system reliability, scalability, and performance
Acts as a key resource and mentor for technologists, fostering a culture of site reliability, inclusion, and engineering excellence while guiding teams on best practices across cloud infrastructure, automation, and operational readiness
Collaborates with stakeholders to design and implement observability, alerting, and reliability solutions, including SLOs/SLIs, monitoring frameworks, and incident response processes that ensure stable, scalable, and high-performing systems
Uses enterprise-authorized AI capabilities within the work environment to accelerate reliability design and operational decisioning (e.g., incident/post-incident analysis and requirements traceability), validating outputs and handling operational data according to sensitivity and security requirements, while also leveraging modern tooling to optimize CI/CD and operational workflows.
Drives evolution, debugging, and performance optimization of critical systems by managing cloud-native infrastructure (AWS), container platforms (Docker/Kubernetes/EKS/ECS), and understanding application dependencies and system limitations
Provides ongoing guidance, tools, and automated solutions including infrastructure as code (Terraform/CloudFormation/CDK), environment standardization, configuration management, patching, backups, and cost optimization strategies
Makes significant contributions to JPMorganChase’s SRE community while supporting release management, change control, on-call rotations, and continuous improvement through post-incident reviews and operational excellence practices
Leads reuse-first adoption of AI-assisted reliability workflows across SDLC/toolchain practices (e.g., testing/validation automation and production readiness), ensuring traceability/auditability, resiliency, and security controls, while enforcing governance, security best practices (IAM, secrets management), and reliability-focused automation.

Required qualifications, capabilities, and skills

Formal training or certification on site reliability engineering concepts and 5+ years applied experience
Brings an advanced understanding of site reliability culture and principles and a track record of demonstrating how to implement site reliability within an application or platform
Advanced knowledge and experience in observability such as white and black box monitoring, service level objectives, alerting, and telemetry collection, along with hands-on experience with monitoring tools such as Grafana, Dynatrace, Prometheus, Datadog, or Splunk
Demonstrated experience using enterprise-authorized AI capabilities within the work environment to improve reliability engineering workflows with strong validation habits and awareness of data sensitivity, along with experience leveraging automation and modern DevOps practices.
Ability to set team practices for safe AI usage in operations (e.g., review/approval expectations and escalation paths) while maintaining resiliency, security, and auditability outcomes, including governance of secure cloud and automation practices.
Advanced knowledge of software applications and technical processes with considerable depth in one or more technical disciplines, including AWS services (IAM, VPC, EC2, S3, RDS/Aurora, CloudWatch, EKS/ECS, Lambda, Route 53) and CI/CD tooling (GitHub Actions, Jenkins, GitLab CI, Azure DevOps)
Demonstrated ability to communicate data-based solutions with complex reporting and visualization methods
Recognized as an active contributor to the engineering community
Strong communication skills and a desire to mentor and educate others on site reliability engineering principles and practices

Preferred qualifications, skills, and capabilities

Familiarity with modern front-end technologies
Experience with large-scale distributed systems
Knowledge of networking and cloud security best practices
Strong collaboration, communication, and stakeholder management skills
Proactive, innovative mindset with a passion for continuous learning
