Lead Site Reliability Engineer

JPMorgan Chase · Banking · Singapore · Corporate Sector

Lead Site Reliability Engineer at JPMorgan Chase focusing on improving application and platform reliability using data-driven analytics and AI-assisted workflows. Responsibilities include leading SRE initiatives, driving collaboration on service levels, using enterprise AI for incident management, and mentoring engineers. Requires a Bachelor's degree, 5+ years of SRE experience, proficiency in programming languages like Python, and experience with observability, CI/CD, and container orchestration. Experience with AI capabilities for SRE workflows, including evaluating AI recommendations and defining guardrails, is essential.

What you'd actually do

  1. Consistently models and champions site reliability culture and practices, and documents and shares knowledge within your organization via internal forums and communities of practice
  2. Leads initiatives to improve the reliability and stability of your team’s applications and platforms using data-driven analytics to improve service levels, proactively identifying and solving technology-related bottlenecks in areas of expertise
  3. Drives collaboration with your team to identify comprehensive service level indicators, and works with stakeholder partners to establish reasonable service level objectives and error budgets with your customers
  4. Uses enterprise-authorized AI capabilities within the work environment to accelerate major-incident triage, troubleshooting, and post-incident analysis, validating outputs and handling operational data according to sensitivity and security requirements
  5. Serves as the main point of contact during major incidents for your application and has the skills to identify and solve the issue quickly to avoid financial loss to the business

Skills

Required

  • Bachelor’s Degree in Computer Science, Cybersecurity, Data Science, or related disciplines
  • Formal training or certification in software engineering or site reliability engineering and 5+ years of applied experience
  • Demonstrated proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices
  • Fluent in at least one programming language, such as Python, Java/Spring Boot, or .NET
  • Demonstrated experience using enterprise-authorized AI capabilities within the work environment to improve SRE workflows (e.g., incident investigation support and knowledge capture) with strong validation habits and awareness of data sensitivity.
  • Ability to evaluate AI-assisted operational recommendations for correctness and risk, define appropriate guardrails for team usage, and ensure outcomes align to resiliency and security expectations.
  • Proficient knowledge of and experience in observability, such as white-box and black-box monitoring, service level objective alerting, and telemetry collection
  • Proficient with continuous integration and continuous delivery practices and tooling
  • Proficient with containers and container orchestration
  • Experience with troubleshooting common networking technologies and issues
  • Advanced knowledge of software applications and technical processes, with emerging depth in one or more technical disciplines and active self-education to evaluate and recommend suitable new technologies

Nice to have

  • Terraform
  • AWS
  • Python
  • Ansible

What the JD emphasized

  • Demonstrated experience using enterprise-authorized AI capabilities within the work environment to improve SRE workflows (e.g., incident investigation support and knowledge capture) with strong validation habits and awareness of data sensitivity.
  • Ability to evaluate AI-assisted operational recommendations for correctness and risk, define appropriate guardrails for team usage, and ensure outcomes align to resiliency and security expectations.

Other signals

  • Uses enterprise-authorized AI capabilities within the work environment to accelerate major-incident triage, troubleshooting, and post-incident analysis
  • Leads reuse-first adoption of AI-assisted reliability workflows across SDLC/toolchain practices
  • Demonstrated experience using enterprise-authorized AI capabilities within the work environment to improve SRE workflows
  • Ability to evaluate AI-assisted operational recommendations for correctness and risk, define appropriate guardrails for team usage