Lead Site Reliability Engineer

JPMorgan Chase · Banking · Hyderabad, Telangana, India · Consumer & Community Banking

Lead Site Reliability Engineer at JPMorgan Chase focused on improving application and platform reliability using data-driven analytics and AI capabilities. The role involves leading initiatives, mentoring engineers, and ensuring the stability and performance of systems, with a strong emphasis on adopting AI-assisted reliability workflows.

What you'd actually do

  1. Demonstrates and champions site reliability culture and practices and exerts technical influence throughout your team
  2. Leads initiatives to improve the reliability and stability of your team’s applications and platforms using data-driven analytics to improve service levels
  3. Collaborates with team members to identify comprehensive service level indicators, and with stakeholders to establish reasonable service level objectives and error budgets with customers
  4. Uses enterprise-authorized AI capabilities within the work environment to accelerate major-incident triage, troubleshooting, and post-incident analysis, validating outputs and handling operational data according to sensitivity and security requirements
  5. Demonstrates a high level of technical expertise within one or more technical domains and proactively identifies and solves technology-related bottlenecks in your areas of expertise

Skills

Required

  • site reliability best practices
  • scalability
  • performance
  • security
  • enterprise system architecture
  • toil reduction
  • Python
  • Java Spring Boot
  • observability
  • Grafana
  • Dynatrace
  • Prometheus
  • Splunk
  • continuous integration
  • continuous delivery
  • Jenkins
  • GitLab
  • Terraform
  • container orchestration
  • Kubernetes
  • Docker
  • networking technologies
  • complex data structures
  • algorithms

Nice to have

  • public cloud platforms (AWS or equivalent)
  • infrastructure automation tools
  • capacity planning
  • DevOps
  • SRE adoption
  • distributed systems
  • resilient systems

What the JD emphasized

  • enterprise-authorized AI capabilities
  • AI-assisted reliability workflows
  • evaluate AI-assisted operational recommendations
  • define appropriate guardrails

Other signals

  • Uses enterprise-authorized AI capabilities within the work environment to accelerate major-incident triage, troubleshooting, and post-incident analysis
  • Leads reuse-first adoption of AI-assisted reliability workflows across SDLC/toolchain practices
  • Demonstrated experience using enterprise-authorized AI capabilities within the work environment to improve SRE workflows
  • Ability to evaluate AI-assisted operational recommendations for correctness and risk, define appropriate guardrails for team usage