Elevate your engineering prowess to unprecedented levels by joining a team of exceptionally gifted professionals and position yourself among the top echelon in site reliability.

As a Senior Lead Site Reliability Engineer at JPMorgan Chase within the Enterprise Technology, Electronic Trading Services team, you will work with your fellow stakeholders to define non-functional requirements (NFRs) and availability targets for the services in your application and product lines. You will ensure those NFRs are accounted for in your products’ design and test phases, that your service level indicators (SLIs) effectively measure customer experience, and that service level objectives (SLOs) are defined with stakeholders and implemented in production.

Job Responsibilities

Creates and delivers high quality designs, roadmaps, and program charters alongside the engineering team
Acts as a key resource and mentor for technologists in your area seeking advice on technical and business issues, and serves as a culture carrier and site reliability adoption champion for your team
Collaborates with others to create and implement observability and reliability designs for complex systems that are robust and stable and do not incur additional toil or technical debt
Uses enterprise-authorized AI capabilities within the work environment to accelerate reliability design and operational decision-making (e.g., incident and post-incident analysis, requirements traceability), validating outputs and handling operational data according to sensitivity and security requirements
Applies technical expertise and problem-solving methodologies to projects of moderate scope, and drives workstreams or projects involving one or more infrastructure engineering technologies
Provides comprehensive and ongoing guidance, tools, and solutions to support the firm’s growth
Makes significant contributions to JPMorganChase’s site reliability community via internal forums, communities of practice, guilds, and conferences
Leads reuse-first adoption of AI-assisted reliability workflows across SDLC and toolchain practices (e.g., testing and validation automation, production readiness), ensuring traceability, auditability, resiliency, and security controls
Executes creative solutions for design, development, and technical troubleshooting of problems of moderate complexity
Considers upstream and downstream data, systems, and technical implications, and advises on mitigation actions
Collaborates with other platforms to architect and implement changes that resolve issues and modernize technology processes

Required qualifications, capabilities, and skills

Formal training or certification in site reliability engineering concepts and 5+ years of applied experience (NAMR/APAC – India/LATAM/Hong Kong)
Advanced understanding of site reliability culture and principles, with a track record of demonstrating how to implement site reliability within an application or platform
Deep knowledge of one or more areas of infrastructure engineering, such as hardware, networking terminology, databases, storage engineering, deployment practices, integration, automation, scaling, resilience, or performance assessments
Expertise in a specific infrastructure technology and in scripting languages (e.g., Python)
Advanced knowledge of and experience in observability, such as white-box and black-box monitoring, service level objectives, alerting, and telemetry collection
Demonstrated experience using enterprise-authorized AI capabilities within the work environment to improve reliability engineering workflows, with strong validation habits and awareness of data sensitivity
Ability to set team practices for safe AI usage in operations (e.g., review and approval expectations, escalation paths) while maintaining resiliency, security, and auditability
Commitment to developing technical and cross-functional knowledge beyond your product area
Advanced knowledge of software applications and technical processes, with considerable depth in one or more technical disciplines
Demonstrated ability to communicate data-based solutions using complex reporting and visualization methods
Recognized as an active contributor to the engineering community
Strong communication skills and a desire to mentor and educate others on site reliability engineering principles and practices

Preferred qualifications, capabilities, and skills

Experience with Arista, Cisco, F5, and Fortinet devices
Familiarity with network automation tools and techniques, such as Ansible
Experience with Corvil and Wireshark