Scientific Lead - Scientific Data Engineer

Eli Lilly · Pharma · San Francisco, CA

This role focuses on building the data infrastructure and semantic layer that make scientific data accessible to AI systems, specifically for drug discovery research. It involves designing and building data architectures, ETL/ELT pipelines, and AI-ready data products, including vector embedding pipelines for RAG. The role bridges data infrastructure and generative AI engineering, aiming to convert early deployments into repeatable system standards and evaluation practices.

What you'd actually do

  1. Design and build the data architecture that transforms raw and processed omics data into harmonized, AI-consumable layers
  2. Build and optimize ETL/ELT pipelines that produce denormalized views, pre-computed aggregations, embedding-ready text representations, and feature stores optimized for AI system consumption
  3. Design and maintain a semantic layer over Lilly’s multi-omics databases that enables AI systems
  4. Build and manage vector embedding pipelines for scientific documents, study metadata, and structured data descriptions to power RAG-based retrieval
  5. Develop data products that serve multiple consumption patterns: direct SQL access for computational biologists, structured feeds for ML training pipelines, and semantic interfaces for LLM-powered tools

Skills

Required

  • Bachelor's degree in Computer Science, Data Engineering, Bioinformatics, or a related field and 8 years of data engineering experience, OR a Master's degree and 5 years of data engineering experience
  • Demonstrated expertise in building data pipelines, ETL/ELT workflows, and data products that serve downstream AI/ML systems
  • Strong SQL skills and experience with complex relational database schemas
  • Proficiency in Python for data processing, scripting, and pipeline development
  • Experience with cloud data platforms (AWS preferred: Redshift, Athena, Glue, S3, or similar)

Nice to have

  • PhD in data or a related field
  • Experience with modern data platform technologies, including at least one of: Databricks, Snowflake, or equivalent lakehouse platforms
  • Experience with modern data engineering tools: dbt, Spark, Airflow, or similar orchestration and transformation frameworks
  • Familiarity with at least one of: vector databases, embedding pipelines, or semantic layer tooling
  • Strong communication skills
  • Experience with biomedical or scientific data: omics datasets (RNA-seq, proteomics, GWAS), clinical data, or laboratory information management systems
  • Experience in pharmaceutical, biotech, or life sciences environments
  • Familiarity with biomedical ontologies and controlled vocabularies (Gene Ontology, MeSH, ChEBI, HGNC) and their application to data integration
  • Experience building data products that serve AI/ML systems — feature stores, training datasets, evaluation benchmarks, or semantic annotations for text-to-SQL
  • Knowledge of data governance practices in regulated industries: data lineage, access controls, versioning, and auditability
  • Experience with knowledge graph technologies (Neo

What the JD emphasized

  • AI-consumable layers
  • AI system consumption
  • AI-ready data products
  • AI-accessible from the point of ingestion
  • AI systems
  • LLM-powered tools
  • regulated environment

Other signals

  • AI foundation
  • AI-ready data products
  • semantic layer
  • data harmonization infrastructure
  • lakehouse architecture
  • natural language interfaces
  • automated analysis workflows
  • intelligent search
  • vector embedding pipelines
  • RAG-based retrieval
  • LLM-powered tools