Lead Data Engineer

Honeywell · Industrial · Atlanta, GA

Lead Data Engineer on the Industrial AI & Data Platforms team, responsible for architecting and owning the data foundations that enable physical AI at scale, spanning IoT sensor telemetry and Generative AI pipelines. The role combines technical leadership, building AI-ready data products such as vector stores and RAG workflows, and mentoring engineers.

What you'd actually do

  1. Architect end-to-end data pipelines processing terabytes of IoT telemetry on Azure Databricks (PySpark, Delta Live Tables, Lakeflow) using a medallion lakehouse architecture.
  2. Design and optimize real-time ingestion pipelines from Azure Event Hubs and Apache Kafka for high-volume industrial IoT telemetry.
  3. Build fault-tolerant, idempotent streaming architectures handling schema evolution, backpressure, and latency SLAs.
  4. Lead architecture reviews, set engineering standards, and drive decisions on data modeling, pipeline design, and platform evolution.
  5. Define technical direction for AI-ready data products including vector stores, embedding pipelines, and RAG-ready structured/unstructured data.
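
The Bronze → Silver → Gold layering in step 1 can be sketched in plain Python. Field names (`device_id`, `ts`, `temp_c`) are hypothetical; a production pipeline would express these layers as PySpark / Delta Live Tables transformations rather than list comprehensions:

```python
# Minimal sketch of medallion layering over IoT telemetry, in plain Python.
# Bronze keeps raw records; Silver deduplicates and validates; Gold aggregates.
from collections import defaultdict

# Bronze: raw records as received (duplicates, nulls, schema drift allowed).
bronze = [
    {"device_id": "pump-1", "ts": 1, "temp_c": 71.0},
    {"device_id": "pump-1", "ts": 1, "temp_c": 71.0},   # duplicate event
    {"device_id": "pump-1", "ts": 2, "temp_c": None},   # bad reading
    {"device_id": "pump-2", "ts": 1, "temp_c": 64.5, "fw": "2.1"},  # new field
]

def to_silver(records):
    """Silver: deduplicate on (device_id, ts) and drop invalid readings.
    Keying on the natural event identity keeps the layer idempotent under
    replayed input - reprocessing the same Bronze data yields the same Silver."""
    seen, out = set(), []
    for r in records:
        key = (r["device_id"], r["ts"])
        if key in seen or r.get("temp_c") is None:
            continue
        seen.add(key)
        # Project onto the contracted columns; unknown fields are tolerated
        # on ingest but not propagated (one simple schema-evolution policy).
        out.append({"device_id": r["device_id"], "ts": r["ts"],
                    "temp_c": r["temp_c"]})
    return out

def to_gold(records):
    """Gold: business-level aggregate - mean temperature per device."""
    sums = defaultdict(lambda: [0.0, 0])
    for r in records:
        sums[r["device_id"]][0] += r["temp_c"]
        sums[r["device_id"]][1] += 1
    return {dev: round(s / n, 2) for dev, (s, n) in sums.items()}

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'pump-1': 71.0, 'pump-2': 64.5}
```

The idempotent dedup in `to_silver` is the same property point 3 asks for at streaming scale, where it would be handled by watermarking and checkpointed state rather than an in-memory set.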

Skills

Required

  • 8+ years of data engineering experience, including at least 2 years in a lead or senior role
  • Building and operating medallion lakehouse architectures (Bronze / Silver / Gold)
  • Apache Spark / PySpark and Azure Databricks
  • Streaming platforms: Apache Kafka and/or Azure Event Hubs
  • Cloud data architecture (Azure preferred)
  • Data modeling and schema design
  • Building data pipelines for GenAI or ML applications: RAG systems, embedding pipelines, and document ingestion
  • MLOps familiarity, including model versioning, feature stores, and monitoring/observability for data and ML systems
  • Leading technical design reviews, mentoring engineers, and driving architectural decisions with stakeholder buy-in
  • CI/CD with GitHub Actions

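The RAG requirement (document ingestion → embeddings → vector retrieval) can be illustrated end to end with a toy sketch. The hashed bag-of-words "embedding" below is a stand-in for a real encoder model, and the in-memory list stands in for a managed vector store (e.g. Databricks Vector Search); only the pipeline shape is the point:

```python
# Toy RAG-style pipeline: ingest documents, embed them, retrieve by similarity.
import math

DIM = 64

def embed(text: str) -> list[float]:
    """Hashed bag-of-words vector - a placeholder for a learned embedding."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[hash(token) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Vectors are already unit-normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

# Ingestion: chunk documents and index (chunk, embedding) pairs.
docs = [
    "vibration sensor telemetry from the compressor line",
    "quarterly finance report for the industrial division",
    "maintenance manual for pump bearing replacement",
]
index = [(d, embed(d)) for d in docs]

def retrieve(query: str, k: int = 1):
    """Return the top-k chunks by cosine similarity to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

print(retrieve("compressor vibration telemetry"))
```

In production the same three stages (ingest/chunk, embed, index/retrieve) survive; what changes is the encoder, the store, and the fact that ingestion itself becomes a governed data pipeline feeding the Gold layer.
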
Nice to have

  • LangChain, LangGraph, or other agentic AI orchestration frameworks
  • Real-time processing frameworks (Apache Spark Streaming, Structured Streaming)
  • MLOps practices
  • Time-series databases and IoT data modeling patterns
  • Containerization (Docker) and orchestration (Kubernetes)
  • Data quality implementation for AI training data
  • Distributed teams and cross-functional collaboration
  • Data security and governance practices for AI systems
  • Agile and Scrum methodologies
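
"Data quality implementation for AI training data" in practice means expectation-style checks gating what reaches a training set. A minimal sketch in plain Python (field names and thresholds are hypothetical; in a Databricks stack this role is typically played by Delta constraints, DLT expectations, or a library like Great Expectations):

```python
# Rule-based data-quality gate for a batch of telemetry destined for training.
def check_batch(rows, max_null_rate=0.1, temp_range=(-40.0, 150.0)):
    """Return (passed, report) for a batch of telemetry rows."""
    n = len(rows)
    nulls = sum(1 for r in rows if r.get("temp_c") is None)
    out_of_range = sum(
        1 for r in rows
        if r.get("temp_c") is not None
        and not (temp_range[0] <= r["temp_c"] <= temp_range[1])
    )
    report = {
        "rows": n,
        "null_rate": nulls / n if n else 0.0,
        "out_of_range": out_of_range,
    }
    passed = report["null_rate"] <= max_null_rate and out_of_range == 0
    return passed, report

good = [{"temp_c": 20.0}, {"temp_c": 21.5}]
bad = [{"temp_c": 20.0}, {"temp_c": None}, {"temp_c": 999.0}]
print(check_batch(good)[0], check_batch(bad)[0])  # True False
```

The report dict is the piece worth keeping even in a richer framework: emitted as metrics, it becomes the monitoring/observability signal the Required list asks for.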

What the JD emphasized

  • production experience on Azure Databricks at scale
  • real-time IoT data
  • building data pipelines for GenAI or ML applications: RAG systems, embedding pipelines, and document ingestion
  • monitoring/observability for data and ML systems
  • mentor engineers

Other signals

  • architect and own the data foundations that enable physical AI at scale
  • production-grade Generative AI pipelines
  • AI-ready data products
  • intersection of modern data engineering and applied GenAI