Site Reliability Engineer III

JPMorgan Chase · Banking · Jersey City, NJ +1 · Consumer & Community Banking

This Site Reliability Engineer III role supports AI/ML platforms and products within a financial institution. Responsibilities include reliability, capacity planning, and incident response for AI/ML infrastructure components such as Databricks, Vector Databases, and Model Serving endpoints. The role also applies Agentic AI concepts (AI Agents, RAG, automation frameworks) to SRE functions such as intelligent incident triage and self-healing workflows.

What you'd actually do

  1. Support SRE practices for AI/ML platforms and products by contributing to reliability, capacity planning, and incident response for AI/ML infrastructure components such as Databricks, Vector Databases, Model Serving endpoints, and ML pipelines.
  2. Leverage Agentic AI concepts and tools (including AI Agents, RAG, and automation frameworks) to assist in SRE functions such as intelligent incident triage, alert enrichment, runbook automation, and self-healing workflows.
  3. Collaborate with engineering teams to define, validate, and enforce Non-Functional Requirements (NFRs), including performance, scalability, availability, latency, and disaster recovery, for applications and data platforms, ensuring they meet production readiness standards before go-live.
  4. Define, implement, and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets for services in your remit.
  5. Drive observability maturity across services through white-box and black-box monitoring, structured logging, distributed tracing, and intelligent alerting.
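
For readers less familiar with item 4, the relationship between an SLI, an SLO, and an Error Budget can be sketched in a few lines of Python. This is only an illustrative sketch, not something taken from the job description: the class name ErrorBudget, its fields, and the request-based availability SLI are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical request-based error budget for one SLO window."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests observed in the window
    failed_requests: int  # requests that violated the SLI

    @property
    def sli(self) -> float:
        """Measured success ratio (the SLI) over the window."""
        if self.total_requests == 0:
            return 1.0  # no traffic: treat the window as fully compliant
        return 1 - self.failed_requests / self.total_requests

    @property
    def allowed_failures(self) -> float:
        """Number of failures the SLO tolerates in this window."""
        return (1 - self.slo_target) * self.total_requests

    @property
    def budget_remaining(self) -> float:
        """Fraction of the error budget left; negative means overspent."""
        if self.allowed_failures == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1 - self.failed_requests / self.allowed_failures

    @property
    def slo_met(self) -> bool:
        """True when the measured SLI is at or above the SLO target."""
        return self.sli >= self.slo_target
```

As a usage example under the same assumptions, ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250) tolerates roughly 1,000 failures, so about 75% of the budget remains and the SLO is met. A team might use a figure like this to decide whether to keep shipping changes or to prioritize reliability work.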

Skills

Required

  • software engineering experience
  • Site Reliability Engineering experience
  • Data Warehousing
  • Oracle
  • performance
  • scalability
  • availability
  • latency
  • disaster recovery
  • Production Readiness Reviews (PRRs)
  • Service Level Indicators (SLIs)
  • Service Level Objectives (SLOs)
  • Error Budgets
  • infrastructure-as-code
  • CI/CD pipelines
  • automated deployment strategies
  • monitoring
  • structured logging
  • distributed tracing
  • alerting
  • dashboards
  • runbooks
  • operational playbooks
  • incident response
  • Root Cause Analysis (RCA)
  • Chaos Engineering
  • Game Day exercises
  • AI/ML platforms
  • AI/ML products
  • Databricks
  • Vector Databases
  • Model Serving endpoints
  • ML pipelines
  • Agentic AI concepts
  • AI Agents
  • RAG
  • automation frameworks
  • operational toil reduction
  • automation of repetitive tasks
  • self-healing infrastructure patterns
  • performance tuning

Nice to have

  • SRE culture and best practices

What the JD emphasized

  • AI/ML platforms and products
  • Databricks
  • Vector Databases
  • Model Serving endpoints
  • ML pipelines
  • Agentic AI concepts
  • AI Agents
  • RAG
  • automation frameworks
  • Production Readiness Reviews (PRRs)
  • Service Level Indicators (SLIs)
  • Service Level Objectives (SLOs)
  • Error Budgets
  • Chaos Engineering
  • Game Day exercises
  • operational toil

Other signals

  • AI/ML platforms and products
  • Databricks
  • Vector Databases
  • Model Serving endpoints
  • ML pipelines
  • Agentic AI concepts
  • AI Agents
  • RAG
  • automation frameworks