AI/ML Lead Data Engineer - Automation/Image Processing

JPMorgan Chase · Banking · Tampa, FL +1 · Commercial & Investment Bank

Lead Data Engineer responsible for designing, building, and maintaining scalable data pipelines and infrastructure for processing scanned document images, integrating OCR and computer vision models, and managing data storage on AWS. The role involves collaborating with data scientists and ML engineers, ensuring data quality and governance, and leading a team of data engineers.

What you'd actually do

  1. Design, build, and maintain scalable, high-performance data pipelines and infrastructure to support ingestion, processing, and storage of large volumes of scanned document images across enterprise-wide workflows
  2. Architect end-to-end data solutions on AWS cloud services to enable seamless flow of scanned images from source systems through OCR processing, model inference, and downstream data extraction and categorization pipelines
  3. Develop robust image preprocessing and OCR integration pipelines that handle TIF/PNG format conversion, normalization, resolution enhancement, noise reduction, and batching to prepare scanned documents for downstream computer vision and OCR models
  4. Build and optimize data pipelines that integrate OCR engine outputs, extracting structured text and metadata from scanned images and routing them into databases and analytics platforms for further processing
  5. Design and manage data storage architectures and containerized deployments, using Oracle databases and AWS-native stores (S3, EFS) to efficiently catalog, index, and retrieve extracted text, classification labels, and metadata from processed document images

Skills

Required

  • Data Engineering concepts
  • Java
  • Groovy
  • Python
  • image file handling
  • TIF/PNG format processing
  • multi-page document splitting
  • format conversion
  • integration with OCR and computer vision pipelines
  • AWS S3
  • AWS Lambda
  • AWS Step Functions
  • AWS CloudWatch
  • AWS EKS
  • Docker
  • Kubernetes
  • Oracle databases
  • PL/SQL
  • performance tuning
  • partitioning strategies
  • data modeling
  • OCR technologies
  • dataset preparation
  • annotation management
  • feature store integration
  • CI/CD pipelines (Jenkins)
  • infrastructure-as-code tools (Terraform, CloudFormation)
  • data governance
  • data quality frameworks
  • metadata management
  • data cataloging
  • leadership
  • communication
  • stakeholder management

Nice to have

  • Domain expertise in the healthcare industry

What the JD emphasized

  • 5+ years applied experience
  • Hands-on experience with image file handling, particularly TIF/PNG format processing, multi-page document splitting, format conversion, and integration with OCR and computer vision pipelines
  • Deep hands-on experience with AWS cloud services including S3 (for image storage), Lambda, Step Functions, and CloudWatch for building and monitoring scalable data workflows
  • Expertise in AWS EKS (Elastic Kubernetes Service) for deploying and managing containerized image processing, OCR, and data pipeline services using Docker and Kubernetes
  • Advanced knowledge of Oracle databases including PL/SQL, performance tuning, partitioning strategies, and data modeling for storing and querying large volumes of extracted document data and classification results
  • Understanding of data requirements for training deep learning models including dataset preparation, annotation management, and feature store integration
  • Strong understanding of data governance, data quality frameworks, metadata management, and data cataloging, particularly in the context of document-centric and image-heavy data ecosystems
  • Partner with security and compliance teams to ensure that scanned document data, extracted PII/PHI, and sensitive content are handled in accordance with regulatory requirements, encryption standards, and access controls

Other signals

  • design and build scalable data pipelines
  • architect end-to-end data solutions
  • collaborate with data scientists and ML engineers
  • evaluate and integrate emerging data technologies
  • establish and enforce data quality, lineage, governance, and security frameworks