Senior Lead Software Engineer - SRE

JPMorgan Chase · Banking · Jersey City, NJ +1 · Corporate Sector

This role is for a Senior Lead Site Reliability Engineer on the AI/ML Data Platforms team, focusing on building scalable and resilient data solutions. Responsibilities include incident management, root cause analysis, mentoring, and developing AI/ML solutions for troubleshooting. The role requires proficiency in SRE principles, observability tools, Python/PySpark for AI/ML, and automation to reduce toil.

What you'd actually do

  1. Apply expertise in application development and support across multiple technologies such as Databricks, Snowflake, AWS, and Kubernetes.
  2. Coordinate incident management coverage to ensure effective resolution of application issues.
  3. Collaborate with cross-functional teams to perform root cause analysis and implement production changes.
  4. Mentor and guide team members to foster innovation and strategic change.
  5. Develop and support AI/ML solutions for troubleshooting and incident resolution.

Skills

Required

  • site reliability culture and principles
  • running production incident calls and managing incident resolution
  • observability, including white-box and black-box monitoring, service level objective alerting, and telemetry collection
  • SLI/SLO/SLA and Error Budgets
  • Python or PySpark for AI/ML modeling
  • reducing toil by building new tools to automate repeated tasks
  • system design, resiliency, testing, operational stability, and disaster recovery
  • network topologies, load balancing, and content delivery networks
  • risk controls and compliance with departmental and company-wide standards
  • working collaboratively in teams and building meaningful relationships

Nice to have

  • 10+ years in an SRE or production support role with AWS Cloud, Databricks, Snowflake, or similar technologies
  • AWS, Snowflake, or Databricks certifications

What the JD emphasized

  • Must be able to reduce toil by building new tools to automate repeated tasks.