Senior Lead Site Reliability Engineer

JPMorgan Chase · Banking · Columbus, OH +1 · Corporate Sector

This Senior Lead Site Reliability Engineer role focuses on defining and implementing non-functional requirements, availability targets, and observability solutions for services. It involves designing and developing robust software solutions, CI/CD pipelines, and infrastructure automation, with a strong emphasis on using enterprise-authorized AI capabilities to improve reliability design, operational decisioning, and AI-assisted reliability workflows. The role also covers mentoring teams, managing cloud-native infrastructure, and maintaining security and governance controls for AI usage in operations.

What you'd actually do

  1. Creates and delivers high-quality designs, roadmaps, and program charters, while designing and developing robust software solutions, CI/CD pipelines, and infrastructure automation to optimize system reliability, scalability, and performance
  2. Acts as a key resource and mentor for technologists, fostering a culture of site reliability, inclusion, and engineering excellence while guiding teams on best practices across cloud infrastructure, automation, and operational readiness
  3. Collaborates with stakeholders to design and implement observability, alerting, and reliability solutions, including SLOs/SLIs, monitoring frameworks, and incident response processes that ensure stable, scalable, and high-performing systems
  4. Uses enterprise-authorized AI capabilities within the work environment to accelerate reliability design and operational decisioning (e.g., incident/post-incident analysis and requirements traceability), validating outputs and handling operational data according to sensitivity and security requirements, while also leveraging modern tooling to optimize CI/CD and operational workflows
  5. Drives evolution, debugging, and performance optimization of critical systems by managing cloud-native infrastructure (AWS) and container platforms (Docker/Kubernetes/EKS/ECS), and by understanding application dependencies and system limitations

Skills

Required

  • Formal training or certification in site reliability engineering concepts and 5+ years of applied experience
  • Advanced understanding of site reliability culture and principles and a track record of demonstrating how to implement site reliability within an application or platform
  • Advanced knowledge and experience in observability, such as white-box and black-box monitoring, service level objectives, alerting, and telemetry collection, along with hands-on experience with monitoring tools such as Grafana, Dynatrace, Prometheus, Datadog, or Splunk
  • Demonstrated experience using enterprise-authorized AI capabilities within the work environment to improve reliability engineering workflows, with strong validation habits and awareness of data sensitivity, along with experience leveraging automation and modern DevOps practices
  • Ability to set team practices for safe AI usage in operations (e.g., review/approval expectations and escalation paths) while maintaining resiliency, security, and auditability outcomes, including governance of secure cloud and automation practices
  • Advanced knowledge of software applications and technical processes with considerable depth in one or more technical disciplines, including AWS services (IAM, VPC, EC2, S3, RDS/Aurora, CloudWatch, EKS/ECS, Lambda, Route 53) and CI/CD tooling (GitHub Actions, Jenkins, GitLab CI, Azure DevOps)
  • Demonstrated ability to communicate data-based solutions with complex reporting and visualization methods
  • Strong communication skills and a desire to mentor and educate others on site reliability engineering principles and practices

Nice to have

  • Familiarity with modern front-end technologies
  • Experience with large-scale distributed systems
  • Knowledge of networking and cloud security best practices
  • Strong collaboration, communication, and stakeholder management skills
  • Proactive, innovative mindset with a passion for continuous learning

What the JD emphasized

  • Leads reuse-first adoption of AI-assisted reliability workflows across SDLC/toolchain practices (e.g., testing/validation automation and production readiness), ensuring traceability/auditability, resiliency, and security controls, while enforcing governance, security best practices (IAM, secrets management), and reliability-focused automation