Lead Site Reliability Engineer - Network

JPMorgan Chase · Banking · Palo Alto, CA +1 · Corporate Sector

Lead Site Reliability Engineer focused on network infrastructure, responsible for defining and implementing reliability principles, driving adoption of best practices and observability, and providing Tier-3 support. The role involves using and evaluating enterprise-authorized AI capabilities to enhance SRE workflows, including incident response and analysis, while ensuring security and data sensitivity.

What you'd actually do

  1. Applies network reliability principles (Permit to Operate, FMEA, operational readiness), balancing feature delivery, efficiency, and stability.
  2. Partners with network engineering domains (Datacenter, Firewall, Proxies, DMZ, Load Balancing, etc.) and Lines of Business to align goals and outcomes.
  3. Drives adoption of reliability best practices and observability, demonstrating impact through stability/reliability metrics. Bridges Engineering, Operations, DevOps, and customers to build resilient, scalable, and secure network services.
  4. Provides Tier-3 network support, leading major incident response, rapid restoration, RCA, and follow-through on corrective actions.
  5. Leads reliability and stability initiatives using data-driven analysis to improve service levels and reduce recurring failure modes.

Skills

Required

  • Formal training or certification in network engineering concepts and 5+ years of applied experience.
  • 10+ years of experience leading technologists to manage and solve complex technical items within your domain of expertise.
  • Advanced proficiency in network reliability engineering, including Permit to Operate, FMEA, and operational readiness processes.
  • Demonstrated experience using enterprise-authorized AI capabilities within the work environment to improve SRE workflows (e.g., incident investigation support and knowledge capture) with strong validation habits and awareness of data sensitivity.
  • Ability to evaluate AI-assisted operational recommendations for correctness and risk, define appropriate guardrails for team usage, and ensure outcomes align to resiliency and security expectations.
  • Experience leading technologists to manage and solve complex network issues at a firmwide level.
  • Ability to influence team culture by championing innovation and change for success.
  • Proficiency in SD-WAN, cloud platforms (AWS, Azure, etc.), and major network technologies (Palo Alto, Juniper, F5, Broadcom, Arista, Cisco, etc.).
  • Proficiency in observability and monitoring tools such as Grafana, SevOne, Prometheus, Kibana, ThousandEyes, and Splunk.

Nice to have

  • CCIE or cloud certifications, plus experience with load balancing, SD-WAN, observability tools, and eBPF.
  • Demonstrated proficiency in troubleshooting and supporting complex networking environments, including Tier-3 operational support for major incidents.
  • Experience with continuous integration and delivery tools (e.g., Jenkins, GitLab, Terraform, etc.).
  • Experience in scalable networking design, including high availability, redundancy, failover, and load balancing.
  • Experience troubleshooting networking protocols such as TCP/IP, HTTPS, and BGP.
  • Experience in customer-facing migration, including service discovery, assessment, planning, execution, and operations.

What the JD emphasized

  • Demonstrated experience using enterprise-authorized AI capabilities within the work environment to improve SRE workflows (e.g., incident investigation support and knowledge capture) with strong validation habits and awareness of data sensitivity.
  • Ability to evaluate AI-assisted operational recommendations for correctness and risk, define appropriate guardrails for team usage, and ensure outcomes align to resiliency and security expectations.
  • Uses enterprise-authorized AI capabilities within the work environment to accelerate major-incident triage, troubleshooting, and post-incident analysis, validating outputs and handling operational data according to sensitivity and security requirements.
  • Leads reuse-first adoption of AI-assisted reliability workflows across SDLC/toolchain practices (e.g., CI/CD quality checks, test/validation automation, and operational readiness), ensuring traceability/auditability, resiliency, and security controls.

Other signals

  • Uses AI capabilities to improve SRE workflows
  • Evaluates AI-assisted operational recommendations
  • Defines guardrails for AI usage
  • Leverages AI for incident triage and analysis