Comprehensive Self-Service Life Science Data Federation with SADI Semantic Web Services and HYDRA


COMPREHENSIVE SELF-SERVICE LIFE SCIENCE DATA FEDERATION WITH SADI SEMANTIC WEB SERVICES AND HYDRA

Alexandre Riazanov, CTO, IPSNP Computing Inc

Oslo University, Sep 23, 2015

WHO WE ARE

• IPSNP Computing Inc -- a Canadian startup, building on and commercializing prior academic research on SADI.

• Founded to develop an industrial-strength query tool for SADI, to supersede a research proof-of-concept prototype.

• Looking for customers/partners and investors.

BIOMEDICAL RESEARCHERS AND CLINICIANS USE DATA FROM MULTIPLE SOURCES

• Online and in-house databases, spreadsheets.

• Web services, e.g., literature search, etc.

• Nomenclatures, ontologies, controlled vocabularies.

• Web sites, scientific publications, patents, etc.

• Algorithms, e.g., BLAST, molecular structure prediction, various text mining programs, etc.

BIG VISION: FEDERATED QUERYING OF HETEROGENEOUS AND DISTRIBUTED DATA SOURCES

• We want to query 1000s of data sources as a single database.

• We want more agility than data warehousing can provide: e.g., just-in-time algorithm execution, plug-and-play data source addition, live data querying.

• We want to use simple and declarative queries, not to program workflow scripts.

IS THIS SCI-FI?

WE CAN ACTUALLY DO THIS WITH SEMANTIC WEB SERVICES

Here is how our data federation engine HYDRA works:

HOW IS THIS ALL POSSIBLE?

• Key ingredient: the SADI framework for Semantic Web services (Semantic Automated Discovery and Integration).

• SADI services are:

– RESTful services,

– consuming and producing one format -- RDF,

– with semantic descriptions (in OWL) fully defining their functionality.

PLAN OF THE TALK

• What are SADI services?

• Automatic service discovery and invocation in query engines (HYDRA).

• Self-service querying vision.

• Query composition with HYDRA GUI.

• An overview of Bioinformatics and Clinical Intelligence case studies.

Tons of screenshots!

SADI SERVICE I/O

• Input: RDF description of an input object.

• Output: another RDF graph providing more (computed or retrieved) info about the input object or linking it to other objects.

• Since all SADI services “talk the same language” (RDF), they are 100% syntactically interoperable:

– output of one SADI service can be directly consumed by any other SADI service.

“Describe your input, and I will tell you something else about it.”
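The I/O contract above can be illustrated with a minimal sketch. It models RDF triples as plain Python tuples instead of using an RDF library, and every name in it (the `ex:` properties, the two toy services, the document identifiers and URLs) is a hypothetical placeholder, not part of SADI.

```python
# Toy model of the SADI I/O contract: a service receives triples describing
# input nodes and returns new triples about those same nodes.
# All names below (ex:..., toy services, URLs) are hypothetical.

Triple = tuple[str, str, str]  # (subject, predicate, object)


def toy_search_service(graph: set[Triple]) -> set[Triple]:
    """Link every node that has ex:hasKeyword to (made-up) matching documents."""
    fake_index = {"dehalogenase": ["ex:doc1", "ex:doc2"]}
    out: set[Triple] = set()
    for s, p, o in graph:
        if p == "ex:hasKeyword":
            for doc in fake_index.get(o, []):
                out.add((s, "ex:matchesDocument", doc))
                out.add((doc, "rdf:type", "ex:Document"))
    return out


def toy_pdf_service(graph: set[Triple]) -> set[Triple]:
    """Attach a (made-up) PDF URL to every node typed as ex:Document."""
    out: set[Triple] = set()
    for s, p, o in graph:
        if p == "rdf:type" and o == "ex:Document":
            local_name = s.split(":", 1)[1]
            out.add((s, "ex:hasPdfUrl", "http://example.org/pdf/" + local_name))
    return out


# Because both services speak the same triple format, the output of the
# first can be passed to the second without any conversion step.
query_node = {("ex:q1", "ex:hasKeyword", "dehalogenase")}
found = toy_search_service(query_node)
pdfs = toy_pdf_service(found)
```

The point of the sketch is only the shape of the exchange: each service adds statements about the nodes it was given, and chaining needs no adapter code.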

COMPLETE SEMANTIC DESCRIPTIONS OF SERVICE FUNCTIONALITY

• SADI services carry semantic descriptions of their I/O that completely define what the service expects and can accept as input, and what RDF assertions the service can output.

• Unique and extremely powerful property: it facilitates completely automatic discovery and orchestration of services.

HYDRA QUERY ENGINE

● Given a SPARQL query, HYDRA analyses it by using an intelligent logic-based algorithm (proprietary, unlike SADI itself).

● HYDRA requests descriptions of potentially useful services from available SADI service registries.

● HYDRA processes the descriptions and figures out which services have to be invoked, on what data and in what order.

SPARQL is a W3C standard semantic query language -- much more intuitive than SQL.
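As a rough illustration of the discovery step, the sketch below matches the predicates used in a query against service descriptions held in a registry. HYDRA's actual algorithm is proprietary and logic-based, so this naive matcher is not it; the `ServiceDescription` fields, the `discover` function and all `ex:` names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDescription:
    """Hypothetical, heavily simplified stand-in for an OWL service description."""
    name: str
    input_class: str              # class of nodes the service accepts
    output_predicates: frozenset  # predicates the service can assert


def discover(patterns, registry):
    """Return names of services able to produce a predicate used in the query."""
    wanted = {p for (_s, p, _o) in patterns}
    return [svc.name for svc in registry if svc.output_predicates & wanted]


registry = [
    ServiceDescription("pubmed_search", "ex:KeywordQuery",
                       frozenset({"ex:matchesDocument"})),
    ServiceDescription("pdf_retrieval", "ex:Document",
                       frozenset({"ex:hasPdfUrl"})),
    ServiceDescription("drug_lookup", "ex:Ingredient",
                       frozenset({"ex:containedIn"})),
]

# Triple patterns of a query, written as (subject, predicate, object).
patterns = [
    ("?q", "ex:matchesDocument", "?doc"),
    ("?doc", "ex:hasPdfUrl", "?url"),
]

useful = discover(patterns, registry)
```

A real engine also has to reason over class hierarchies and decide the invocation order; this sketch only shows why machine-readable descriptions make the lookup possible at all.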

QUERY EXAMPLE

• Find documents mentioning "haloalkane dehalogenase activity", extract information about mutations and visualise the mutations on 3D protein structure images.

• HYDRA automatically finds and orchestrates 5 services from our registry:

– PubMed search: keyword query ⟶ document PubMed IDs

– PDF retrieval: PubMed ID ⟶ PDF file URL

– ASCII extraction: PDF file ⟶ ASCII text

– Text mining: ASCII text ⟶ mutation info

– Visualisation: mutation & protein ⟶ 3D image (Jmol)
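The ordering of the five services can be sketched as a simple forward-chaining loop over their input and output types. This is an illustration under simplifying assumptions, not HYDRA's method: each service is reduced to one input type and one output type (so "PDF file URL" and "PDF file" are merged, and Visualisation takes only mutation info), and the `plan` function is hypothetical.

```python
# Each entry: (service name, type it consumes, type it produces).
# Deliberately listed out of order to show that the planner sorts them.
SERVICES = [
    ("Visualisation", "mutation info", "3D image"),
    ("Text mining", "ASCII text", "mutation info"),
    ("PubMed search", "keyword query", "PubMed ID"),
    ("ASCII extraction", "PDF file", "ASCII text"),
    ("PDF retrieval", "PubMed ID", "PDF file"),
]


def plan(start, goal, services):
    """Order services so that each one runs only after its input type exists."""
    available = {start}
    order = []
    progress = True
    while goal not in available and progress:
        progress = False
        for name, needs, gives in services:
            if needs in available and gives not in available:
                available.add(gives)
                order.append(name)
                progress = True
    if goal not in available:
        raise ValueError("no chain of services leads from start to goal")
    return order


chain = plan("keyword query", "3D image", SERVICES)
```

With the registry above, `chain` lists the services in the same order as the slide: search, PDF retrieval, ASCII extraction, text mining, visualisation.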

RESULTS

Deploying mutation impact text-mining software with the SADI Semantic Web Services framework
http://www.biomedcentral.com/qc/1471-2105/12/S4/S6

WHAT IS SO COOL ABOUT IT?

• Data federation at its best:

– independent, heterogeneous data sources (PubMed doc search, PubMed Central for PDFs);

– not only data is integrated: ASCII extraction, text mining and 3D visualisation are algorithms!

• Execution is completely automatic: HYDRA finds and invokes the services without any help from the user.

MORE QUERY EXAMPLES

• Find drug products that contain active ingredient X.

• Find drugs that have been studied in clinical trials targeting infections caused by bacteria X.

• Annotate a DNA sequence X with molecular functions of proteins produced by the corresponding gene.

• Find patients with precondition X diagnosed with infections Y resulting from procedure Z.

• Many, many other questions that life scientists and clinicians ask on a daily basis.

IT’S ONLY ½ OF THE STORY

REMEMBER THE BIG VISION?

HERE IS AN EVEN BIGGER VISION: Self-service ad hoc querying of federated data.

HYDRA IMPLEMENTS SEMANTIC QUERYING

• Users need not know how the source data is organised or accessed.

• They just need to know the terminology of their subject domain.

• Queries are completely declarative: specify what you want to find, not how.

HYDRA ALSO SUPPORTS CONCEPT HIERARCHIES AND RULES

● Some queries would be too complex if we could not exploit generality:

– a query concerning all antibiotics requires generalisation, otherwise all types of antibiotics would have to be enumerated in the query.

● A much better way to do this is to import a classification of drugs and use it in query execution.

● HYDRA facilitates such reasoning and even more complex reasoning with rules.
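The antibiotics example can be sketched with a toy classification. The snippet assumes a single-parent hierarchy stored in a dictionary and a hypothetical prescription list; a real drug classification would be an imported ontology, and this is not how HYDRA performs reasoning.

```python
# Toy drug classification (child -> parent), assumed for illustration only.
SUBCLASS_OF = {
    "amoxicillin": "penicillin",
    "penicillin": "beta-lactam antibiotic",
    "beta-lactam antibiotic": "antibiotic",
    "doxycycline": "tetracycline antibiotic",
    "tetracycline antibiotic": "antibiotic",
    "ibuprofen": "NSAID",
}


def is_a(term, ancestor):
    """Walk up the hierarchy from term and report whether ancestor is reached."""
    while term is not None:
        if term == ancestor:
            return True
        term = SUBCLASS_OF.get(term)
    return False


# Hypothetical data: (patient, prescribed drug).
prescriptions = [
    ("patient1", "amoxicillin"),
    ("patient2", "ibuprofen"),
    ("patient3", "doxycycline"),
]

# The query names only the general concept; the hierarchy does the rest.
on_antibiotics = [p for p, drug in prescriptions if is_a(drug, "antibiotic")]
```

The query mentions "antibiotic" once, and adding a new antibiotic to the classification requires no change to the query.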

THERE ARE NO OBSTACLES IN PRINCIPLE TO SELF-SERVICE QUERYING

We just need an adequate user interface for building queries.

HYDRA QUERY TOOL = ENGINE + GUI

QUERY COMPOSITION

Queries built based on entry of “Google-like” keyphrases:

Keyphrase: “document mentions protein “P22607””

A QUERY GRAPH IS GENERATED FOR THE KEYPHRASE

“document mentions protein “P22607””

ADDING ANOTHER KEYPHRASE

Keyphrase: “has pubmed id”

QUERY GRAPH IS EXTENDED WITH NODES CORRESPONDING TO THE SECOND KEYPHRASE

Keyphrase: “has pubmed id”

Keyphrase: “document mentions protein “P22607””

OPTION 2: MANUALLY ADD/DELETE CLASSES, INCOMING AND OUTGOING PROPERTIES

MANUALLY ADDED PROPERTY

FINISHED QUERY: FIND PUBMED IDS OF DOCUMENTS MENTIONING PROTEIN P22607 AND CO-MENTIONED PROTEINS

SERVICES IN THE REGISTRY

SPARQL GENERATION
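A minimal sketch of how a query graph could be serialised as SPARQL follows. The edge list mirrors the finished query about protein P22607, but the `ex:` prefix, the predicate names and the `to_sparql` function are assumptions for illustration; the vocabulary and output of the actual HYDRA GUI are not shown in these slides.

```python
def to_sparql(edges, select):
    """Serialise query-graph edges (subject, predicate, object) as a SELECT query."""
    lines = [
        "PREFIX ex: <http://example.org/>",
        "SELECT " + " ".join("?" + v for v in select),
        "WHERE {",
    ]
    for s, p, o in edges:
        lines.append(f"  {s} {p} {o} .")
    lines.append("}")
    return "\n".join(lines)


# Query graph: documents mentioning P22607, their PubMed IDs,
# and the proteins mentioned in the same documents.
edges = [
    ("?doc", "ex:mentionsProtein", "ex:P22607"),
    ("?doc", "ex:mentionsProtein", "?other"),
    ("?doc", "ex:hasPubmedId", "?pmid"),
]

query = to_sparql(edges, ["pmid", "other"])
print(query)
```

Each edge of the graph becomes one triple pattern, which is why a graph editor is a natural front end for a declarative query language.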

QUERY EXECUTION WITH THE HYDRA ENGINE

EXPORTED RESULTS IN AN EXCEL SPREADSHEET

SADI AND HYDRA QUERY TOOL AT WORK

BIOINFORMATICS AND CHEMINFORMATICS CASE STUDIES AND PILOTS WITH SADI AND HYDRA

• Integrating genomics text mining results with online biomedical data and visualisation algorithms.

• Integrating programs for lipid molecule structural analysis and classification.

• Interpreting toxicity experiment data by discovering relevant info in online databases.

• Large-scale retrieval of toxicity information from publications.

INTERPRETING TOXICITY EXPERIMENT DATA

• Partner: university lab studying effects of environmental pollutants.

• Querying needs: finding relevant prior experiments, gene annotation, protein domain annotation, etc.

• Data sources: ArrayExpress, BLAST, HMMER3, RefSeq, Pfam, ORFPredictor, GO, UniProt, NCBI Taxonomy -- all queried as a single DB!

SUBTASK: DNA MICROARRAY ANNOTATION

• Toxicity experiments with microarrays: which DNA sequences are under/overexpressed after organism’s exposure to toxin X?

• Interpretation requires knowing affected protein functions and domains.

• HYDRA virtually implements this workflow:

RETRIEVAL OF TOXICITY DATA FROM PUBLICATIONS

• Customer: government agency (Canada).

• Querying needs: online publication search by organism and chemical types, text-mining for toxicity data.

• Data sources: NCBI Taxonomy and ChEBI with free-text search, PubMed search, electronic libraries, journal Web sites, Google Scholar, specialised text-mining algorithm, text utilities.

Apparent value: some queries save a postdoc many weeks of work.

CLASSIFYING NEW LIPID MOLECULES

• One of the early experiments with SADI.

• A group in Carleton U. had a program for identifying functional groups in a molecule structure.

• A group in U. of New Brunswick had a classifier estimating lipid classes based on presence/absence of functional groups.

• Publishing the prototypes as SADI services allowed us to integrate them with each other and relevant external resources.

CLINICAL IT CASE STUDIES AND PILOTS WITH SADI AND HYDRA

• Ad hoc querying of clinical data for hospital-acquired infection surveillance and research (with UNB, McGill SoM and Ottawa H.)

• On-going pilot with a US hospital.

• Looking for pilot opportunities for Clinical Trial Cohort selection:

– trial eligibility criteria can be implemented as queries over heterogeneous and distributed clinical data;

– benefits: cost reduction and timely alerts.

THANK YOU!

Further materials/services are available on request:

• Live and recorded demos.

• Publications on previous (academic) case studies.

• Training/consulting.

• http://ipsnp.com/ (Canada) and http://ipsnp.co/ (UK)