Software Engineer III - SRE

JPMorgan Chase · Banking · Bengaluru, Karnataka, India · Asset & Wealth Management

Software Engineer III - AI Reliability Engineer at JPMorgan Chase within the Asset and Wealth Management Technology team, focused on enhancing the reliability and resilience of AI systems, particularly large language model serving and training systems. Responsibilities include developing SLOs for AI systems, implementing monitoring, designing high-availability serving infrastructure, championing site reliability culture, developing automated failover and recovery systems, creating AI Incident Response playbooks, leading incident response for critical AI services, building cost optimization systems, engineering for scale and security, and collaborating with ML engineers. The role requires formal training or certification in software engineering, proficiency in reliability best practices, observability tools, CI/CD, and container orchestration, and an understanding of AI infrastructure challenges. Preferred qualifications include experience with AI-specific observability tools, AI incident response strategies, AI-centric SLOs/SLAs, and continuous evaluation processes.

What you'd actually do

  1. Develop and refine Service Level Objectives (including metrics such as accuracy, fairness, latency, drift targets, TTFT (Time To First Token), and TPOT (Time Per Output Token)) for large language model serving and training systems, balancing availability and latency with development velocity
  2. Design, implement, and continuously improve monitoring systems covering availability, latency, and other salient metrics
  3. Collaborate in the design and implementation of high-availability language model serving infrastructure capable of handling the needs of high-traffic internal workloads
  4. Champion site reliability culture and practices, providing technical leadership and influence across teams to foster a culture of reliability and resilience
  5. Develop and manage automated failover and recovery systems for model serving deployments across multiple regions and cloud providers

Skills

Required

  • Formal training or certification on software engineering concepts and 3+ years applied experience
  • Demonstrated proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices
  • Proficient knowledge and experience in observability such as white and black box monitoring, service level objective alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and others
  • Proficient with continuous integration and continuous delivery tools like Jenkins, GitLab, or Terraform
  • Proficient with containers and container orchestration (ECS, Kubernetes, Docker)
  • Experience with troubleshooting common networking technologies and issues
  • Understanding of the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines
  • Proven experience implementing and maintaining SLO/SLA frameworks for business-critical services
  • Comfortable working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence)
  • Able to bridge the gap between ML engineers and infrastructure teams effectively
  • Excellent communication skills

Nice to have

  • Experience with AI-specific observability tools and platforms, such as OpenTelemetry and OpenInference.
  • Familiarity with AI incident response strategies, including automated rollbacks and AI circuit breakers.
  • Knowledge of AI-centric SLOs/SLAs, including metrics like accuracy, fairness, drift targets, TTFT (Time To First Token), and TPOT (Time Per Output Token).
  • Expertise in engineering for scale and security, including load balancing, caching, optimized GPU scheduling, and AI Gateways.
  • Experience with continuous evaluation processes, including pre-deployment, pre-release, and post-deployment monitoring for drift and degradation.

What the JD emphasized

  • AI observability
  • incident response
  • high-availability language model serving infrastructure
  • AI Incident Response playbooks
  • Continuous Evaluation
