CURRENT ADVANCES TO BRIDGE THE USABILITY-EXPRESSIVITY GAP IN BIOMEDICAL SEMANTIC SEARCH (AND VISUALIZING LINKED DATA)

Maulik R. Kamdar
Biomedical Informatics PhD Program

3rd April 2015

QUERYING HETEROGENEOUS DATASETS ON THE LINKED DATA WEB

André Freitas, Edward Curry, João Gabriel Oliveira and Seán O'Riain

Internet Computing February 2012

EVALUATING THE USABILITY OF NATURAL LANGUAGE QUERY LANGUAGES AND INTERFACES TO SEMANTIC WEB KNOWLEDGE BASES

Esther Kaufmann and Abraham Bernstein

Journal of Web Semantics November 2010

INTRODUCTION

• Opportunities
  - Builds on existing Web infrastructure (URIs and HTTP) and Semantic Web standards (RDF, RDFS, vocabularies).
  - Reduces barriers to data publication, consumption, reuse and availability, adding a fine-grained structure.
  - Exposes previously siloed databases as data graphs (D2R, Google Refine) to be interlinked and integrated with other datasets to create a global-scale interlinked dataspace.

• Challenges
  - Awareness of which exposed datasets potentially contain the data users want, their location and their data model.
  - Syntax of structured query languages like SPARQL.
  - Heterogeneous, loosely connected (yet!) and distributed data sources, with different descriptors for the same entity.

USABILITY-EXPRESSIVITY GAP

EXISTING APPROACHES

• Information Retrieval Approaches
  - Entity-centric Search (SWSE, Sindice)
  - Structure Search (Semplore) – use of inverted indexes and user feedback strategies

• Natural Language Queries
  - Question Answering (PowerAqua, FREyA)
  - Difficult to expand across domains
  - Best-effort Natural Language Interfaces (Treo)
  - Habitability Problem - users need guidance and support
  - WordNet/Wikipedia semantic approximation techniques

• Structured SPARQL Queries

CHALLENGE DIMENSIONS

• Query expressivity
  - Query datasets by referencing elements in the data model, and operate over the data (aggregate results, express conditional statements).

• Usability
  - An easy-to-operate, intuitive, and task-efficient query interface.

• Vocabulary-level semantic matching
  - Semantically match query terms to dataset vocabulary-level terms.

• Entity reconciliation
  - Match entities expressed in the query to semantically equivalent dataset entities.

• Semantic tractability mechanisms
  - Answer queries not supported by explicit dataset statements (for example, “Is Natalie Portman an Actress?” can be supported by the statement “Natalie Portman starred Star Wars”).
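As a rough illustration of the last dimension, the Python sketch below answers the class-membership question from the example using a hand-written rule that treats the subject of a `starred` statement as an actress. The toy triple store, the rule table `IMPLIED_CLASS` and the function name `is_instance_of` are all assumptions made for illustration; none of them come from the systems discussed in this talk.

```python
# Toy triple store: (subject, predicate, object). Illustrative data only.
TRIPLES = {
    ("Natalie Portman", "starred", "Star Wars"),
}

# Hypothetical rule table: using a predicate implies a class for its subject.
IMPLIED_CLASS = {
    "starred": "Actress",
}


def is_instance_of(entity, cls):
    """Return True if the class is stated explicitly or implied by a rule."""
    for subject, predicate, obj in TRIPLES:
        if subject != entity:
            continue
        # Explicit statement, e.g. (entity, "type", "Actress").
        if predicate == "type" and obj == cls:
            return True
        # Implicit support through a predicate-to-class rule.
        if IMPLIED_CLASS.get(predicate) == cls:
            return True
    return False


print(is_instance_of("Natalie Portman", "Actress"))
```

A real mechanism would need far richer rules or reasoning; the point here is only that the answer is derived from a statement that never mentions the class.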

GOOGLE KNOWLEDGE GRAPH

BIOMEDICAL MOTIVATION

[Figure: compound funnel with the counts ~300 000 compounds, ~300 interesting compounds, ~10 interesting compounds and ~5 compounds. Stage labels: Literature, Virtual Screening, Query databases, Hypothesis Generation, (Linked) Data.]

“Are there Drugs with molecular weight under 400 tested against ‘Colon Cancer’?”

“Do any Publications refer to assays using ‘Aspirin’ as the primary Drug in treatment of ‘Prostate Cancer’?”

REVEALD: A USER-DRIVEN DOMAIN-SPECIFIC INTERACTIVE SEARCH PLATFORM FOR BIOMEDICAL RESEARCH

Maulik R. Kamdar, Dimitris Zeginis, Ali Hasnain, Stefan Decker and Helena F. Deus

Journal of Biomedical Informatics February 2014

CHALLENGES

• Awareness of which exposed datasets potentially contain the data users want, and of their data model.

• Large, heterogeneous biomedical data sources, which are too dynamic for reliable data centralization.

• The assembly of SPARQL queries to create the aggregated information for bioinformatics analysis still poses a high cognitive entry barrier.

• Human-readable, and more specifically, domain-specific representation of query results is required.

• None of the previous systems were tested in biomedical domains, except DistilBio, VIQUEN and Cuebee.

• Trade-off between expressivity and usability.

BACKGROUND: CANCO DOMAIN-SPECIFIC MODEL

Zeginis, Dimitris, et al. "A collaborative methodology for developing a semantic model for interlinking Cancer Chemoprevention linked-data sources." Semantic Web 5.2 (2014): 127-142.

LIFE SCIENCES LINKED OPEN DATA CLOUD

Life Sciences: 53 datasets, ~3 billion triples

Cyganiak, R. and Jentzsch, A. (2014) The Linking Open Data cloud diagram. http://lod-cloud.net/ [Accessed: March 23, 2013]

BACKGROUND: CATALOGUING & LINKING

1248 concepts and 1255 properties were harvested from more than 53 Linked Biomedical Data Sources (LBDS) into the Life Sciences Linked Open Data (LSLOD) catalogue and linked to the CanCO Query Elements.

Hasnain, Ali, et al. "Cataloguing and linking life sciences LOD cloud." 1st International Workshop on Ontology Engineering in a Data-driven World (OEDW 2012).

BACKGROUND: ENTITY RECONCILIATION

BACKGROUND: FEDERATED ARCHITECTURE

Class links in the LSLOD Catalogue:

Chebi:Compound void-ext:subClassOf Granatum:Molecule
Pubchem:Compound void-ext:subClassOf Granatum:Molecule

Query pattern over the domain model:

?molec a Granatum:Molecule

Transformed query patterns:

?molec a Chebi:Compound
?molec a Pubchem:Compound

[Figure: federated architecture. Components: SPARQL Query, Query Engine, Query Logging, Transformation, Transformed Query (one per source), Rule Templates, Saved Queries, CanCO, LSLOD Catalogue, Cataloguing & Links Creation, RDFization, Experimental Datasets, Social Collaborative Workspace. Life Sciences Linked Open Data (LSLOD) sources: Chebi, DrugBank, UniProt, Others.]

Hasnain, Ali, et al. "A Roadmap for navigating the Life Sciences Linked Open Data Cloud." International Semantic Technology (JIST2014) conference. 2014.
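The transformation step on this slide can be sketched in a few lines of Python. The snippet rewrites a type pattern over a domain-model class into one pattern per linked source class, using the two `void-ext:subClassOf` links shown on the slide. The function name `transform_pattern` and the list-of-pairs representation are assumptions for illustration; the actual query engine is not described in enough detail here to reproduce it.

```python
from typing import List

# Catalogue links copied from the slide: (source class, domain-model class).
SUBCLASS_LINKS = [
    ("Chebi:Compound", "Granatum:Molecule"),
    ("Pubchem:Compound", "Granatum:Molecule"),
]


def transform_pattern(variable: str, domain_class: str) -> List[str]:
    """Rewrite '?var a <domain_class>' into one pattern per linked source class."""
    return [
        f"{variable} a {source_class}"
        for source_class, target_class in SUBCLASS_LINKS
        if target_class == domain_class
    ]


for pattern in transform_pattern("?molec", "Granatum:Molecule"):
    print(pattern)
```

Each resulting pattern could then be sent to the endpoint that hosts the corresponding source class, which is presumably what the per-source "Transformed Query" boxes in the figure stand for.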

BACKGROUND: FEDERATED ARCHITECTURE

• Non-intuitive
• SPARQL, RDF, Schema knowledge required
• Domain-specific visualization of results is not possible

REVEALD SEARCH PLATFORM

• ReVeaLD (Real-Time Visual Explorer and Aggregator of Linked Data) is a user-driven domain-specific search platform.

• Intuitively formulate advanced search queries using a click-input-select mechanism.

• Visualize the results in a domain-suitable format.

• Entity-centric and Visual Query Search System.

• Assembly of the query is governed by a Domain-specific Language (DSL), which in this case is the Cancer Chemoprevention Ontology (CanCO).

Demo: https://www.youtube.com/watch?v=6HHK4ASIkJM&hd=1

DSL VISUAL REPRESENTATION

• Concept Map Visualization

VISUAL QUERY BUILDER

CanCO DSL

VISUAL QUERY MODEL

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX granatum: <http://chem.deri.ie/granatum/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT DISTINCT * WHERE {
  ?x0_Assay a granatum:Assay ;
    granatum:hasInput ?x1_Target ;
    granatum:identify ?x2_ChemopreventiveAgent ;
    granatum:outcome_method ?x3_outcome_method .
  ?x1_Target granatum:title ?x4_title .
  ?x2_ChemopreventiveAgent granatum:molecularWeight ?x10_molecularWeight ;
    granatum:SMILESnotation ?x9_SMILESnotation ;
    granatum:hasFormula ?x7_hasFormula ;
    granatum:HBD ?x5_Hydrogen_Bond_Donors ;
    granatum:HBA ?x6_Hydrogen_Bond_Acceptors ;
    granatum:TPSA ?x8_Topological_Polar_Surface_Area .
  FILTER regex(xsd:string(?x4_title), "estrogen receptor", "is")
  FILTER ( xsd:double(?x10_molecularWeight) < 300 )
}
LIMIT 100

[Figure: visual query model and its SPARQL translation; data source labels: Pubchem, ChEBI, Uniprot]

All Assays which target Estrogen Receptors present in Human (Organism), and which identify potential Chemopreventive Agents with Molecular Weight < 300

http://srvgal78.deri.ie:8080/explorer?type=sampleQuery&nodes=17-1-30-33-73-78-91-81-82-92-98-63&links=17.1-17.30-1.33-17.73-17.78-1.91-30.81-30.82-30.92-30.98-33.63&filters=1.91.c.estrogen%20receptor|30.98.lt.300|33.63.c.human&flexible=1
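The sample-query URL appears to serialize the visual query model as node ids, links between nodes, and filters. The Python sketch below decodes it and renders each filter in the same two FILTER forms used in the SPARQL on this slide. The reading of each filter as `<source node>.<target node>.<operator>.<value>`, the meaning of `c` (contains) and `lt` (less than), and the placeholder variable names are assumptions inferred from the URL and the query, not documented behaviour of the platform.

```python
from urllib.parse import parse_qs, urlsplit

# Sample-query URL from the slide, with the line-wrap spaces removed.
URL = (
    "http://srvgal78.deri.ie:8080/explorer?type=sampleQuery"
    "&nodes=17-1-30-33-73-78-91-81-82-92-98-63"
    "&links=17.1-17.30-1.33-17.73-17.78-1.91-30.81-30.82-30.92-30.98-33.63"
    "&filters=1.91.c.estrogen%20receptor|30.98.lt.300|33.63.c.human&flexible=1"
)


def parse_visual_query(url):
    """Split the URL into node ids, (source, target) links and filter tuples."""
    params = parse_qs(urlsplit(url).query)
    nodes = [int(n) for n in params["nodes"][0].split("-")]
    links = [
        tuple(int(i) for i in link.split("."))
        for link in params["links"][0].split("-")
    ]
    filters = []
    for spec in params["filters"][0].split("|"):
        # Assumed layout: <source node>.<target node>.<operator>.<value>
        source, target, operator, value = spec.split(".", 3)
        filters.append((int(source), int(target), operator, value))
    return nodes, links, filters


def filter_to_sparql(operator, variable, value):
    """Render one filter, mirroring the two FILTER forms in the slide's query."""
    if operator == "c":  # assumed to mean "contains"
        return f'FILTER regex(xsd:string({variable}), "{value}", "is")'
    if operator == "lt":  # assumed to mean "less than"
        return f"FILTER ( xsd:double({variable}) < {value} )"
    raise ValueError(f"unknown operator: {operator}")


nodes, links, filters = parse_visual_query(URL)
for source, target, operator, value in filters:
    # The variable name is a placeholder built from the target node id.
    print(filter_to_sparql(operator, f"?x{target}", value))
```

Note that the URL carries a third filter (`33.63.c.human`) that has no counterpart in the SPARQL text shown on the slide.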

REVEALD DATA BROWSER

GRAPHIC RULES

• Query: SELECT * WHERE { <clickedURI> ?p ?o }

• Results are subjected to a set of Graphic Rules, which follow the Event-Condition-Action (ECA) paradigm and provide visual representations using the Fresnel Display Vocabulary.

• Example:
  - Event: each triple retrieved as a query execution result, such as
    <http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/844> <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/pdbIdPage> “http://www.pdb.org/pdb/explore/explore.do?structureId=1IVO”
  - Condition: sdf_file or pdbIdPage (Predicate) + http (Object)
  - Action: HTTP GET and invoke a specific Resource Renderer
  - Resource Renderer: GLMol Molecular Viewer
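The Event-Condition-Action example above can be sketched as a small Python function. It checks the condition from the slide (predicate ending in `sdf_file` or `pdbIdPage`, object starting with `http`) and names the renderer to invoke. The function names, the tuple representation of a triple, and the plain-text fallback for unmatched triples are illustrative assumptions; the platform itself expresses its rules with the Fresnel Display Vocabulary, which this sketch does not model.

```python
from typing import Optional, Tuple

Triple = Tuple[str, str, str]

# Predicate suffixes named in the example condition.
RENDERABLE_PREDICATES = ("sdf_file", "pdbIdPage")


def choose_renderer(triple: Triple) -> Optional[str]:
    """Event: one retrieved triple. Returns the renderer to invoke, or None."""
    _subject, predicate, obj = triple
    # Condition: predicate is sdf_file or pdbIdPage, and the object is an HTTP URL.
    if predicate.endswith(RENDERABLE_PREDICATES) and obj.startswith("http"):
        # Action: a caller would HTTP GET the object and pass it to this renderer.
        return "GLMol Molecular Viewer"
    return None


def display(triple: Triple) -> str:
    """Assumed fallback: show the triple as plain text when no rule matches."""
    renderer = choose_renderer(triple)
    if renderer is None:
        return " ".join(triple)
    return f"{renderer} <- {triple[2]}"


event = (
    "http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/844",
    "http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/pdbIdPage",
    "http://www.pdb.org/pdb/explore/explore.do?structureId=1IVO",
)
print(display(event))
```

The sketch performs no network request; it only decides which renderer a matching triple would be handed to.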

SINGLE ENTITY SEARCH

EVALUATION

• Tracking Real-time User Experience Methodology (TRUE) - widely used in the HCI community to evaluate computer games

• Game-based evaluation where domain users are given tasks to complete, and time and interactions are tracked using Google Analytics

• Subjectivistic evaluation where users were asked to fill out a survey.

• The evaluation focused on two usability concerns:
  - Does familiarity of the users with the DSL affect the time needed to formulate the query?
  - Does a constrained (smaller) DSL lead to less time needed for query formulation?

EVALUATION RESULTS

OTHER IMPLEMENTATIONS: LINKED TCGA

Saleem, M., Kamdar, M. R., et al. (2014). Big linked cancer data: Integrating linked TCGA and PubMed. Web Semantics: Science, Services and Agents on the World Wide Web, 27, 34-41.

http://srvgal78.deri.ie/tcga-pubmed/

OTHER IMPLEMENTATIONS: LINKEDPPI

Kazemzadeh, L., Kamdar, M. R., et al. LinkedPPI: Enabling Intuitive, Integrative Protein-Protein Interaction Discovery. Linked Science, 48.

DISCUSSION

• DSL Incrementation Mechanism
  - Extend the current model represented in the Visual Query Builder by adding new concepts and properties.
  - Use or merge publicly available extensions of the DSL.

• No reliance on the Federated Query Engine, SPARQL Endpoint, underlying DSL and Graphic Rules.

• Corrupt Graphic Rules result in the textual representation of the relevant triple.

• Domain-specific Languages increase usability and enable abstraction of underlying data models.

Query expressivity: Medium (SELECT, FILTER, OPTIONAL)
Usability: Medium (Entity-centric Search, VQS)
Vocabulary-level semantic matching: Low (Indexed Term URI to Concept)
Entity reconciliation: Low (owl:sameAs for same unique keys)
Semantic tractability mechanisms: None

FUTURE WORK

• Ontologies, indexed term labels and catalogue as elements in a Controlled Natural Language to increase usability

• Results pipelined to any Problem-solving method (like Autodock Vina, visualization, ML algorithm etc.)

• Faceted Search, Related Entity Recognition based on Feature-based Similarity Measures

• Allowing users of the platform to provide their own DSL, data sources, and graphic rules.

• SPARQL Endpoint availability and latency

• Ontology Reuse instead of Ontology Alignment!

Thank You!
