Lead Software Engineer - SRE (Grafana, Dynatrace, SLO/SLI)

JPMorgan Chase · Banking · Hyderabad, Telangana, India · Corporate Sector

Lead Software Engineer for AI/ML Data Platforms focusing on Site Reliability Engineering (SRE). Responsibilities include building scalable data solutions, incident management, root cause analysis, mentoring, and driving adoption of AI-assisted engineering practices for code quality, delivery speed, and operational outcomes. Requires expertise in SRE principles, observability tools, Python/PySpark for AI/ML, and system design. Must automate tasks and understand responsible AI use.

What you'd actually do

  1. Develop and support AI/ML solutions for troubleshooting and incident resolution.
  2. Drive team adoption of enterprise-authorized AI-assisted engineering practices within the work environment to improve code quality, delivery speed, and operational outcomes (e.g., AI-assisted code review/refactoring, test strategy acceleration, incident/root-cause analysis support), while establishing consistent validation standards (secure coding, peer review, automated testing) and promoting reuse of effective patterns across the team.
  3. Apply knowledge of tools within the Software Development Life Cycle toolchain, including enterprise-authorized AI-assisted development and automation capabilities, to improve the value realized by automation.
  4. Coordinate incident management coverage to ensure effective resolution of application issues.
  5. Mentor and guide team members to foster innovation and strategic change.

Skills

Required

  • Formal training or certification on software engineering concepts and 5+ years applied experience
  • Proficiency in site reliability culture and principles, with expertise in running production incident calls and managing incident resolution.
  • Experience in observability, such as white-box and black-box monitoring, service level objective alerting, and telemetry collection, using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and others
  • Strong understanding of SLI/SLO/SLA and Error Budgets
  • Proficiency in Python or PySpark for AI/ML modeling.
  • Must be able to reduce toil by building new tools to automate repeated tasks.
  • Hands-on experience in system design, resiliency, testing, operational stability, and disaster recovery
  • Understanding of network topologies, load balancing, and content delivery networks.
  • Awareness of risk controls and compliance with departmental and company-wide standards.
  • Demonstrated experience leading effective use of approved AI-assisted software development tools (e.g., for coding, code review, test acceleration, troubleshooting) with the ability to set team expectations for validating AI outputs for correctness, performance, and security.
  • Strong understanding of responsible AI use in engineering workflows, including data sensitivity considerations, secure handling of inputs/outputs, and adherence to resiliency and security expectations; experience coaching engineers on safe, compliant adoption within delivery practices

Nice to have

  • Hands-on experience in an SRE or production support role with AWS Cloud, Databricks, Snowflake, or similar technologies.
  • AWS, Snowflake or Databricks certifications.
  • Familiarity with implementing site reliability practices within an application or platform

What the JD emphasized

  • Must be able to reduce toil by building new tools to automate repeated tasks.
  • Demonstrated experience leading effective use of approved AI-assisted software development tools (e.g., for coding, code review, test acceleration, troubleshooting) with the ability to set team expectations for validating AI outputs for correctness, performance, and security.
  • Strong understanding of responsible AI use in engineering workflows, including data sensitivity considerations, secure handling of inputs/outputs, and adherence to resiliency and security expectations; experience coaching engineers on safe, compliant adoption within delivery practices

Other signals

  • AI-assisted engineering practices
  • AI/ML Data Platforms
  • AI-assisted code review/refactoring
  • AI-assisted development and automation capabilities
  • AI-assisted software development tools
  • responsible AI use