Senior Lead Site Reliability Engineer

JPMorgan Chase · Banking · Jersey City, NJ +1 · Consumer & Community Banking

Senior Lead Site Reliability Engineer role focused on building and optimizing AI/ML platforms and products, including infrastructure, serving, and agentic AI solutions for SRE functions. Requires deep SRE experience applied to AI/ML workloads and LLM inference.

What you'd actually do

  1. Create high-quality designs, roadmaps, and program charters that are delivered by you or the engineers under your guidance
  2. Advise and mentor other engineers, and act as a key resource for technologists seeking guidance on technical and business-related issues
  3. Demonstrate site reliability principles and practices every day and champion the adoption of site reliability throughout your team
  4. Collaborate with others to create and implement observability and reliability designs for complex systems that are robust, stable, and do not incur additional toil or technical debt
  5. Identify application patterns and analytics that support better service level objectives

Skills

Required

  • 16+ years of software engineering experience
  • 5+ years of Site Reliability Engineering experience
  • Advanced knowledge in site reliability culture and principles
  • 2+ years of hands-on experience in architecting, scaling, and providing SRE support for AI/ML platforms and products
  • Databricks
  • GPU clusters
  • Model Serving frameworks
  • Feature Stores
  • Vector Databases
  • LLM inference pipelines
  • core SRE fundamentals, including reliability patterns, capacity planning, incident management, performance tuning, and toil reduction
  • SLOs/SLIs tailored to AI/ML workloads
  • Agentic AI-based solutions
  • AI Agents
  • Skills
  • Context Management
  • Retrieval-Augmented Generation (RAG)
  • tool-use patterns
  • Agentic AI frameworks to automate and augment core SRE functions
  • governance and controls of AI usage
  • Advanced knowledge and experience in observability
  • white and black box monitoring
  • service level objectives
  • alerting
  • telemetry collection
  • Grafana
  • Dynatrace
  • Prometheus
  • Datadog
  • Splunk

Nice to have

  • cloud-based data and analytics architecture
  • AWS storage
  • Snowflake
  • Kubernetes (EKS)
  • event-driven architectures
  • streaming services
  • batch jobs
  • ETL pipelines
  • Apache Kafka
  • Apache Spark
  • Strong communication skills
  • mentor and educate others on site reliability principles and practices
  • Recognized as an active contributor to the engineering community

What the JD emphasized

  • 5+ years of Site Reliability Engineering experience
  • At least 2 years of hands-on experience in architecting, scaling, and providing SRE support for AI/ML platforms and products
  • Proven hands-on experience in designing and implementing Agentic AI-based solutions to deliver SRE capabilities at scale

Other signals

  • AI/ML platforms and products
  • AI Agents
  • LLM inference pipelines
  • Agentic AI-based solutions to deliver SRE capabilities