Senior Lead Site Reliability Engineer

JPMorgan Chase · Banking · Palo Alto, CA +1 · Corporate Sector

Senior Lead Site Reliability Engineer role focused on building AI-powered automation systems, intelligent monitoring, and next-generation reliability platforms. The role involves designing and building AI Agents for autonomous operations, integrating various data stores including vector databases, and developing automation scripts. Requires expertise in SRE principles, observability, and AI frameworks like LangChain.

What you'd actually do

  1. Creates high-quality designs, roadmaps, and program charters for AI-powered automation systems, intelligent monitoring solutions, and next-generation reliability platforms that are delivered by you or the engineers under your guidance
  2. Provides advice and mentoring to other engineers and acts as a key resource for technologists with questions on technical and business-related issues, particularly at the intersection of SRE and AI/ML technologies
  3. Demonstrates site reliability principles and practices every day and champions the adoption of site reliability throughout your team
  4. Collaborates with others to create and implement observability and reliability designs for complex systems that are robust and stable and do not incur additional toil or technical debt, including comprehensive logging pipelines and systems that export, analyze, and visualize observability metrics and traces across distributed systems
  5. Designs and builds AI Agents and MCP (Model Context Protocol) Servers for autonomous operations including incident detection, root cause analysis, and auto-remediation, while architecting solutions that integrate multiple data stores including graph databases, vector databases, transactional databases, analytical databases, and big data platforms

Skills

Required

  • Site Reliability Engineering
  • DevOps
  • Software Engineering
  • Java
  • Go (Golang)
  • Python
  • Terraform
  • distributed systems
  • microservices architecture
  • cloud-native technologies
  • AI Agents
  • autonomous systems
  • LangChain
  • LangGraph
  • AutoGen
  • CrewAI
  • GitHub Copilot
  • Claude
  • logging pipelines
  • Fluentd
  • Logstash
  • Vector
  • metrics collection
  • distributed tracing
  • RESTful APIs
  • message queue architectures
  • Kafka
  • RabbitMQ
  • SQS
  • graph databases
  • Neo4j
  • TigerGraph
  • vector databases
  • Pinecone
  • Weaviate
  • Chroma
  • containerization
  • Docker
  • Kubernetes
  • CI/CD pipelines
  • GitOps workflows

Nice to have

  • MCP (Model Context Protocol) Servers
  • agent frameworks
  • LLM integration
  • prompt engineering
  • RAG (Retrieval-Augmented Generation)
  • AI/ML model building, deployment, and lifecycle management
  • TensorFlow
  • PyTorch
  • scikit-learn
  • big data technologies
  • Hadoop
  • Spark
  • Flink
  • analytical databases
  • NoSQL databases
  • MongoDB
  • Cassandra
  • DynamoDB
  • time-series databases
  • InfluxDB
  • TimescaleDB
  • security best practices
  • compliance requirements
  • chaos engineering tools
  • Chaos Monkey
  • Gremlin
  • LitmusChaos
  • GameDay exercises
  • open-source projects
  • cloud platforms
  • AWS
  • Azure
  • GCP

What the JD emphasized

  • AI-powered automation systems
  • AI Agents
  • autonomous operations
  • vector databases
  • AI frameworks (LangChain, LangGraph, AutoGen, CrewAI)

Other signals

  • AI-powered infrastructure automation
  • AI Agents and autonomous systems
  • intelligent monitoring solutions
  • next-generation reliability platforms