Lead SRE - Azure & GCP

JPMorgan Chase · Banking · Glasgow, Lanarkshire, United Kingdom · Corporate Sector

Lead Site Reliability Engineer (SRE) for Google Cloud environments at JPMorgan Chase, focusing on implementing SRE frameworks, ensuring high SLOs, and leveraging enterprise-authorized AI capabilities to enhance SRE workflows like incident triage and troubleshooting. Requires expertise in cloud platforms (Azure, GCP), container technologies, programming, monitoring tools, and AI-assisted operational recommendations.

What you'd actually do

  1. Lead and implement SRE frameworks to support global Google Cloud environments and ensure the highest level of SLOs through operational excellence
  2. Mastery of application, data, infrastructure, and Agentic AI disciplines
  3. Keen understanding of financial control and budget management, with experience working in partnership with colleagues throughout the firm and leading collaborative teams to achieve common goals
  4. Use enterprise-authorized AI capabilities within the work environment to accelerate major-incident triage, troubleshooting, and post-incident analysis, validating outputs and handling operational data according to sensitivity and security requirements.
  5. Provide support to develop & improve the quality of technical engineering documentation
  6. Provide technical supervision, oversight and problem resolution for engineering activities
  7. Champion a DevOps model so that services are automated and elastic across all platforms

Skills

Required

  • Google Cloud and Azure expertise in a mission-critical production environment
  • Strong understanding of container technologies such as Docker, Kubernetes, GKE, and Helm
  • Experience programming in one of the following languages: Python, shell scripting, or Go, along with a good understanding of REST APIs
  • Hands-on experience with cloud-based technologies and tools, especially in deployment, monitoring, and operations, such as Google Observability, Azure Monitor, Datadog, Prometheus, Splunk, Elasticsearch, and Grafana.
  • Demonstrated experience using enterprise-authorized AI capabilities within the work environment to improve SRE workflows (e.g., incident investigation support and knowledge capture) with strong validation habits and awareness of data sensitivity.
  • Ability to evaluate AI-assisted operational recommendations for correctness and risk, define appropriate guardrails for team usage, and ensure outcomes align to resiliency and security expectations.
  • Strong understanding of Google Cloud governance, compliance, and cost management
  • Strong working knowledge of modern development technologies and tools such as Agile, CI/CD, Git, Infrastructure as Code, Terraform, and Jenkins.
  • Google Cloud certification or equivalent technical experience in the Public Cloud.
  • Good understanding of Agentic AI SDKs and GitHub Copilot Skills.

Nice to have

  • Good understanding of operating systems such as Windows and Linux (Red Hat / Ubuntu)
  • Good understanding of LLMs and other AI/ML frameworks that can be used in AIOps

What the JD emphasized

  • Mastery of application, data, infrastructure, and Agentic AI disciplines
  • Uses enterprise-authorized AI capabilities within the work environment to accelerate major-incident triage, troubleshooting, and post-incident analysis
  • Demonstrated experience using enterprise-authorized AI capabilities within the work environment to improve SRE workflows
  • Ability to evaluate AI-assisted operational recommendations for correctness and risk, define appropriate guardrails for team usage