Principal Data Scientist - Oncology

Johnson & Johnson · Pharma · Spring House, PA +7

The Principal Data Scientist - Oncology will standardize and connect biomedical and clinical data using semantic technologies, ontologies, and graph data modeling to power analytics, search, and AI across Johnson & Johnson Innovative Medicine. The role involves building a scalable knowledge graph infrastructure, curating ontologies, developing ingestion pipelines, and enabling NLP/RAG over graphs.

What you'd actually do

  1. Be a key contributor to the design and implementation of a scalable knowledge graph infrastructure for data standardization and interoperability, with a focus on Oncology R&D data.
  2. Apply graph-based data modeling to organize, integrate, and retrieve Oncology R&D data efficiently, ensuring system flexibility and long-term maintainability.
  3. Work with a larger community of Data Scientists, Clinical Scientists, and Discovery Scientists to standardize, curate and create AI-Ready datasets.
  4. Curate and extend ontologies so they map clearly onto established biomedical ontologies and controlled terminologies, using Resource Description Framework (RDF) standards.
  5. Work with SPARQL/GraphQL/REST services; develop ingestion and curation pipelines to ingest, normalize and map concepts across data sources.

Skills

Required

  • semantic technologies
  • ontology
  • graph data modeling
  • SPARQL
  • RDF
  • life sciences domain
  • data standardization
  • data interoperability

Nice to have

  • Ph.D. or Master's degree in bioengineering, computer science, IT, bioinformatics, physics, mathematics, or a related field, with an emphasis on semantic technologies for biomedical applications
  • 5+ years professional experience in health informatics
  • large-scale knowledge graph construction
  • integration work in pharmaceutical or healthcare domains
  • parser combinators
  • natural language processing
  • linked data (RDF Triple Stores and property graphs)
  • OWL
  • graph databases (Neo4j, Amazon Neptune)
  • complex biomedical datasets (e.g. clinical, genomics, proteomics)
  • SQL, key-value, column, document, and graph stores
  • taxonomies
  • CI/CD implementations
  • git usage
  • CI/CD stacks (Jenkins, GitLab, Azure DevOps)
  • DevOps tools
  • metrics/monitoring
  • containerization technologies (Docker, Singularity)
  • stakeholder management
  • requirements gathering
  • business analysis
  • planning
  • manage numerous projects simultaneously
  • prioritize work
  • organizational skills
  • flexibility

What the JD emphasized

  • strong familiarity with the life sciences domain
  • trusted, interoperable knowledge powers analytics, search, and AI
  • AI-Ready datasets
  • NLP/RAG over graphs

Other signals

  • knowledge graph
  • ontology
  • semantic technologies
  • interoperability
  • AI-Ready datasets