45
APPLIED LINKED DATA AND SEMANTIC TECHNOLOGY Expanding a Neurobiology Dataset

Applied semantic technology and linked data

Embed Size (px)

DESCRIPTION

Mapping a human brain generates petabytes of gene listings and the corresponding locations of these genes throughout the human brain.  Due to the large dataset a prototype Semantic Web application was created with the unique ability to link new datasets from similar fields of research, and present these new models to an online community.  The resulting application presents a large set of  gene to location mappings and provides new information about diseases, drugs, and side effects in relation to the genes and areas of the human brain. In this presentation we will discuss the normalization processes and tools for adding new datasets, the user experience throughout the publishing process, the underlying technologies behind the application, and demonstrate the preliminary use cases of the project.

Citation preview

Page 1: Applied semantic technology and linked data

APPLIED LINKED DATA AND SEMANTIC TECHNOLOGYExpanding a Neurobiology Dataset

Page 2: Applied semantic technology and linked data

Today we are discussing…• What is the use case and who requested it?• How do you import and normalize thousands of RDF

triples worth of gene data?• How do we enrich the normalized gene data with parallel

research data sets?• Creating instance pages without knowing exactly what will

be displayed on them.• Demonstration of the initial use cases• Question and answer session

Page 3: Applied semantic technology and linked data

Why?• Prototype: How do we assemble the data mine and refine

the authoring tools?

How do we expand this to the research community?

• How do we expand ownership of the data to research professionals?

• How do we build systems in a way that research professionals can author and link the data?

• How do we publish these new relationships to the wider research community?

Page 4: Applied semantic technology and linked data

What is the Allen Institute for Brain Science?

• Launched in 2003 with seed funding from founder and philanthropist Paul G. Allen.

• Serving the scientific community is at the center of our mission to accelerate progress toward understanding the brain and neurological systems.

• The Allen Institute's multidisciplinary staff includes neuroscientists, molecular biologists, informaticists, and engineers.

“The Allen Institute for Brain Science is an independent 501(c)(3) nonprofit medical

research organization dedicated to accelerating the understanding of how the human brain

works.”

Page 5: Applied semantic technology and linked data

Human Brain Map

• Open, public online access• A detailed, interactive three-dimensional

anatomic atlas of the "normal" human brain

• Data from multiple human brains• Genomic analysis of every brain

structure, providing a quantitative inventory of which genes are turned on where

• High-resolution atlases of key brain structures, pinpointing where selected genes are expressed down to the cellular level

• Navigation and analysis tools for accessing and mining the data

Page 6: Applied semantic technology and linked data

Biological Linked Data Map

• Open, public online access• Data from multiple RDF data stores• Complete import pipeline using

LDIF framework• Outlines of each imported instance

embedding inline wiki properties and providing views of imported properties from original RDF datasets

• Charting tools that ‘pivot’ SPARQL queries providing several views of each query

• Navigation and composition tools for accessing and mining the data

Page 7: Applied semantic technology and linked data

Where did we get the data?

• KEGG : Kyoto Encyclopedia of Genes and Genomes• “KEGG GENES is a collection of gene catalogs for all complete genomes

generated from publicly available resources, mostly NCBI RefSeq.”

• Diseasome• “The Diseasome website is a disease/disorder relationships explorer and a

sample of an innovative map-oriented scientific work. Built by a team of researchers and engineers, it uses the Human Disease Network data set.”

• DrugBank• “The DrugBank database is a unique bioinformatics and cheminformatics

resource that combines detailed drug data with comprehensive drug target information.”

• SIDER• “SIDER contains information on marketed medicines and their recorded adverse

drug reactions. The information is extracted from public documents and package inserts.”

Page 8: Applied semantic technology and linked data

New ontology map for import• Genes

• DrugBank : 4,553• Diseasome : 3,919• KEGG : 9,841

• Diseases• Diseasome : 4,213• KEGG : 459

• Drugs• DrugBank : 4,772• KEGG : 2,482• SIDER : 924

• Effects• SIDER : 1,737

• Pathways• KEGG : 28,442

We chose to intentionally simplify the ontology due to disagreements between

researchers about entity relationships and subclasses.

Page 9: Applied semantic technology and linked data

Importing and mapping the Linked Data

• R2R

• 32,900 instances were converted to the wiki ontology.

• 583,746 properties mapped

• Pathways were ignored for wiki ontology import, but are available within the triple store KEGG Pathway graph.

• SIEVE

• 20,849 instances available in wiki ontology after SILK normalization

• Instance merging effected drugs, genes, and diseases across datasets.

• Triple Store SPARQL Update

Page 10: Applied semantic technology and linked data

Importing and mapping the Linked Data

• R2R

• 32,900 instances were converted to the wiki ontology.

• 583,746 properties mapped

• Pathways were ignored for wiki ontology import, but are available within the triple store KEGG Pathway graph.

• SIEVE

• 20,849 instances available in wiki ontology after SILK normalization

• Instance merging effected drugs, genes, and diseases across datasets.

• Triple Store SPARQL Update

Page 11: Applied semantic technology and linked data

LDIF: LINKED DATA INTEGRATION FRAMEWORKExpanding a Neurobiology Dataset

Page 12: Applied semantic technology and linked data

Linked Data challenges• Data sources that overlap in content may:

• Use a wide range of different RDF vocabularies• Use different identifiers for the same real-world entity• Provide conflicting values for the same properties

• Implications• Queries become hand crafted for a specific RDF data set – no

different than using a proprietary API.• Individual, improvised and manual merging techniques for data

sets.

• Integrating public datasets with internal databases poses the same problems

Page 13: Applied semantic technology and linked data

Linked Data Integration Framework• LDIF normalizes the Linked Data from multiple sources

into a clean, local target representation while keeping track of data provenance.

1

2

3

4

5

Collect data: Managed download and update

Translate data into a single, target vocabulary

Resolve identifier aliases into local target URIs

Cleanse data and resolve conflicting values

Output to local file system or triple store

Page 14: Applied semantic technology and linked data

LDIF Pipeline

1

2

3

4

5

Collect data

Translate data

Resolve identities

Cleanse data

Output data

Component Stack

Supported Data Formats

• RDF Files (Multiple Formats• SPARQL Endpoints• Crawling Linked Data

Page 15: Applied semantic technology and linked data

LDIF Pipeline

1

2

3

4

5

Collect data

Translate data

Resolve identities

Cleanse data

Output data

Component Stack

Sources use a wide range of different RDF vocabularies

Page 16: Applied semantic technology and linked data

LDIF Pipeline

1

2

3

4

5

Collect data

Translate data

Resolve identities

Cleanse data

Output data

Component Stack

Sources use different identifiers for the same entity

Page 17: Applied semantic technology and linked data

LDIF Pipeline

1

2

3

4

5

Collect data

Translate data

Resolve identities

Cleanse data

Output data

Component Stack

Sources provide different values for the same property

Page 18: Applied semantic technology and linked data

LDIF Pipeline

1

2

3

4

5

Collect data

Translate data

Resolve identities

Cleanse data

Output data

Component Stack

Supported Output Formats

• N-Quads• N-Triples• SPARQL Update Stream

Provenance tracking using Named Graphs

Page 19: Applied semantic technology and linked data

LDIF Architecture

Page 20: Applied semantic technology and linked data

Normalized Linked Data is not always pretty.

Not very

helpful.

Page 21: Applied semantic technology and linked data

Normalized Linked Data is not always pretty.

Still not

giving me

much.

Page 22: Applied semantic technology and linked data

SELECT DISTINCT ?group1 ?item1 ?group2 ?item2 { GRAPH ?G {

?target drugbank:geneName "{{{1}}}" ; drugbank:geneName ?geneName ; . ?drug drugbank:target ?target ; drugbank:genericName ?item2 ; drugbank:affectedOrganism ?group2 ; .

}GRAPH ?G1 {

?siderDrug sider:drugName ?item2 ;rdfs:label ?group1 ;sider:sideEffect ?effect;.?effect rdfs:label ?item1 .

}}

Less than

helpful.

Page 23: Applied semantic technology and linked data

Semantic MediaWiki

Semantic MediaWiki is a full-fledged framework, in conjunction with many spinoff extensions, that can turn a wiki into a powerful and flexible knowledge management system. All data created within SMW can easily be published via the Semantic Web, allowing other systems to use this data seamlessly.

Can you

expand on

this?

Page 24: Applied semantic technology and linked data

Four initial templates for each instance by category

1. Custom infobox within outline template

• Visible inline properties

2. Outline template providing instance information

3. Widget template displaying dynamic charts or third party services

• Donut charts and AIBS gene feed

4. Broad table SPARQL queries showing instance relationships

5. Hidden inline properties for other extensions

Page 25: Applied semantic technology and linked data

Creating instance wiki pages• The Triple Store now contained tens of

thousands of recognized category instances. Creating the pages require a bot.

1. Fetch the RDF dumps from an active D2R server

2. Use regex to fetch the rdf:label property that was mapped by R2R as an instance name

3. Open category specific text file of wiki markup (page of template includes)

4. Contact Neurowiki and request a new page from the list of names with the category content

Page 26: Applied semantic technology and linked data

Final application stack

Page 27: Applied semantic technology and linked data

NEUROWIKIExpanding a Neurobiology Dataset

Page 28: Applied semantic technology and linked data

How are base entities like Calcium represented?

1. The wiki page and corresponding template components are rendered.

2. Relations are pulled from the normalized data store of linked data.

3. The JavaScript components are populated via a data feed

Page 29: Applied semantic technology and linked data

How are base entities like Calcium represented?

• Because so many organisms contain calcium the mappings to affected species were never created to conserve space in the data store.

Drug and Disease Class Ratios of CalciumInner Circle: Drugs by Affected Species, Outer Circle: Disease Ratios by Class

Page 30: Applied semantic technology and linked data

What are the dangers of Propofol?

1. Propofol DrugBank relations are rendered in corresponding JavaScript components.

2. The Diseasome disease relations show classes of illness Propofol affects.

3. An aggregate of SIDER side effects are rendered in relation to Propofol and disease classes.

Page 31: Applied semantic technology and linked data

What are the dangers of Propofol?

Page 32: Applied semantic technology and linked data

What are the dangers of Propofol?

Page 33: Applied semantic technology and linked data

What are the dangers of Propofol?

Page 34: Applied semantic technology and linked data

Which drugs are used in Chemotherapy?

1. Diseasome disease relations normalized by LDIF.

2. DrugBank and AIBS relations to genes affected by both the disease and drug.

3. SIDER side effects related to the gene, disease, and drug.

4. DrugBank drug glossary definition specifying various forms of Cancer treatment.

Page 35: Applied semantic technology and linked data

Which Drugs are used in Chemotherapy?

Page 36: Applied semantic technology and linked data

Which drugs are used in Chemotherapy?

Drug and Disease Class Ratios of ARInner Circle: Drugs by Affected Species, Outer Circle: Disease Ratios by Class

Page 37: Applied semantic technology and linked data

Which drugs are used in Chemotherapy?

Drug and Side Effect Ratios of ARInner Circle: Drugs by Affected Species, Outer Circle: Side Effect Ratios of Drugs

Page 38: Applied semantic technology and linked data

Which drugs are used in Chemotherapy?

Page 39: Applied semantic technology and linked data

Which drugs are used in Chemotherapy?

Drug and Disease Class Ratios of NilutamideInner Circle: Drugs by Affected Species, Outer Circle: Disease Ratios by Class

Page 40: Applied semantic technology and linked data

Which drugs are used in Chemotherapy?

Drug and Disease Class Ratios of BicalutamideInner Circle: Drugs by Affected Species, Outer Circle: Disease Ratios by Class

Page 41: Applied semantic technology and linked data

Which drugs are used in Chemotherapy?

Page 42: Applied semantic technology and linked data

Expanding the Prototype• Semantic MediaWiki query construction

• Could this be done in SPARQL?

• Authoring SILK / R2R mappings for the LDIF Pipeline• Extremely difficult and the editors are not intuitive

• How do you get data owners to fuse the sets and create the data store themselves?• Tested with Aura Wiki prototype

• Expand authoring provenance• How do we ensure new data / links comes from an authoritative

source?

Page 43: Applied semantic technology and linked data

Today we discussed…• The Allen Institute for Brain Science (AIBS)• Four similar research data sets to interlink with the AIBS

data set• An import pipeline named Link Data Integration

Framework (LDIF)• The interlinking process for 5 concurrent research data

sets (AIBS, DrugBank, Diseasome, KEGG, SIDER)• A prototype neurobiology authoring platform.• Creating instance pages to display the new connections.• Demonstration of the initial use cases.

Page 44: Applied semantic technology and linked data

QUESTIONS? COMMENTS?Expanding a Neurobiology Dataset

Page 45: Applied semantic technology and linked data

THANK YOU.Expanding a Neurobiology Dataset