NBDC / DBCLS presents BioHackathon 2014 Standardization and utilization of human genome information with Semantic Web technologies Toshiaki Katayama <[email protected]> http://jp.linkedin.com/in/toshiakikatayama Database Center for Life Science (DBCLS), Research Organization of Information and Systems (ROIS), Japan 2014/11/9 @ Tohoku Medical Megabank, Sendai, Japan

Introduction to BioHackathon 2014


DESCRIPTION

Slide presented at the BioHackathon 2014 symposium http://2014.biohackathon.org/symposium



Excursion...

Symposium

Hackathon

Mission of NBDC/DBCLS

• Biomedical domain
  • So many databases, so many publications
• Integration of life science databases
  • To accelerate data driven science
• Standardization and interoperability
  • Semantic Web and Linked Open Data
  • Software development
  • ...
• Right technology + collaborative community
  • BioHackathon = Bio + Hack + Marathon = effective innovation

BioHackathon 2014 - the 7th NBDC/DBCLS BioHackathon
http://2014.biohackathon.org

• BioHackathon 2008 in Tokyo
  • Towards integrated Web service in life science with Open Bio* libraries
  • http://hackathon.dbcls.jp
• BioHackathon 2009 in Okinawa
  • Integration of Web services in bioinformatics applications
  • http://hackathon2.dbcls.jp
• BioHackathon 2010 in Tokyo
  • Integration and interpretation of biological knowledge with the Semantic Web technologies
  • http://hackathon3.dbcls.jp
• BioHackathon 2011 in Kyoto
  • Creation and utilization of Linked Data in life sciences
  • http://2011.biohackathon.org
• BioHackathon 2012 in Toyama
  • Biomedical applications based on the Semantic Web technologies
  • http://2012.biohackathon.org
• BioHackathon 2013 in Tokyo
  • Semantic interoperability and standardization of bioinformatics data and Web services
  • http://2013.biohackathon.org

BioHackathon publications

BioHackathon thematic series

Linked Open Data

• Use URIs as names for things

• Use HTTP URIs
  • so that people can look up those names
• When someone looks up a URI
  • provide useful information
  • using the standards (RDF*, SPARQL)
• Include links to other URIs
  • so that they can discover more things

• Genome annotation / Protein annotation / Biomedical ontologies / URIs
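The "look up a URI, get RDF back" principle is usually implemented with HTTP content negotiation. The following is a minimal sketch under stated assumptions, not part of the original slides: `build_rdf_request` is a hypothetical helper, `http://example.org/gene/geneA` is a placeholder URI, and the request is only prepared, never sent.

```python
from urllib.request import Request


def build_rdf_request(uri: str, media_type: str = "text/turtle") -> Request:
    """Prepare (but do not send) an HTTP GET that asks the server for RDF."""
    # Linked Data names are expected to be HTTP URIs so that they can be looked up.
    if not uri.startswith(("http://", "https://")):
        raise ValueError("Linked Data names should be HTTP(S) URIs")
    # The Accept header tells the server which RDF serialization we would like.
    return Request(uri, headers={"Accept": media_type})


# Placeholder URI, used for illustration only.
req = build_rdf_request("http://example.org/gene/geneA")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Passing the prepared request to `urllib.request.urlopen` would perform the actual lookup; whether a given server answers with Turtle depends on that server.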

TogoGenome: RDF-based genome DB
http://togogenome.org

Accumulate annotations in RDF

Genome: Regulatory region, Protein coding gene, rRNA gene

<gene> rdfs:subClassOf obo:SO_0000704 ;  # ← Sequence ontology types
    faldo:location [ ... ] ;             # ← FALDO locations
    rdfs:label "geneA" ;                 # ← Label of annotations
    rdfs:seeAlso <UniProt> .             # ← Link to external resources

<exon> rdfs:subClassOf obo:SO_0000147 .

Annotation w/ in-house developed ontologies

Accumulate annotations in RDF + in-house developed ontologies
• MEO (environment)
• MPO (phenotype)
• GMO (growth medium)
• MCCV (culture collection)
• PDO (infectious disease)

↓ Stored in triple store

↓ SPARQL query

↓ TogoGenome / TogoStanza
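The store-then-query flow above can be illustrated with a toy in-memory triple list. This is only a sketch under stated assumptions: it is not how TogoGenome is implemented, it does not use SPARQL, and `GENE_TRIPLES` and `match` are hypothetical names. The triples are the ones from the gene annotation slide, written as plain strings, with `<UniProt>` kept as the slide's placeholder.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]

# Triples copied from the gene annotation slide, kept as plain strings.
GENE_TRIPLES = [
    ("<gene>", "rdfs:subClassOf", "obo:SO_0000704"),
    ("<gene>", "rdfs:label", '"geneA"'),
    ("<gene>", "rdfs:seeAlso", "<UniProt>"),
    ("<exon>", "rdfs:subClassOf", "obo:SO_0000147"),
]


def match(store: Iterable[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples matching a pattern; None acts like a query variable."""
    for triple in store:
        if all(want is None or want == got
               for want, got in zip((s, p, o), triple)):
            yield triple


# Roughly: SELECT ?s ?type WHERE { ?s rdfs:subClassOf ?type }
for subject, _, so_type in match(GENE_TRIPLES, p="rdfs:subClassOf"):
    print(subject, so_type)
```

A real triple store adds indexing, joins across several patterns, and a SPARQL endpoint; the single-pattern match here only shows the basic idea of querying by triple pattern.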

Genome sequences

NCBI: BioProject/RefSeq -- existing reference seqs
DDBJ: Annotation pipeline/GTPS -- newly sequenced

Ontologies

NCBO: BioPortal, OBO (GO, SO, ...)
DBCLS: FALDO, MEO, MPO, GMO, MCCV, PDO ...
DDBJ: INSDC, Taxonomy, ...
Titech: PDO, ...
GOLD: Environmental metadata

Samples and metadata

INSDC, NCBI: SRA, GEO
DBCLS: RefEx, Kusarinoko
Bulk data: Literature, Images, ...

Annotations

UniProt: Protein functions and links
Formats: GFF3, GTF, GVF, DAS, BED, ...
Tools: Cufflinks, BLAST, InterPro, ...

TogoGenome faceted search & modular reports
http://togogenome.org

<gene> rdf:type insdc:Gene ; so:so_part_of <chromosome> .

<mRNA> rdf:type insdc:Messenger_RNA ; sio:is-transcribed-from <gene> ; sio:has-ordered-part <p1>, <p2>, ... .

<p1> sio:has-value "1"^^xsd:integer ; sio:refers-to <exon1> .

<p2> sio:has-value "2"^^xsd:integer ; sio:refers-to <exon2> .

<exon1> rdf:type insdc:Exon ; faldo:location <region1> .

<region1> rdf:type faldo:Region ; faldo:begin <position1> ; faldo:end <position2> .

<position1> rdf:type faldo:ExactPosition, faldo:ForwardStrandPosition ; faldo:position 12345 ; faldo:reference <chromosome> .
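To make the shape of this model concrete, the sketch below walks it in plain Python: from the mRNA to its ordered exons, then from an exon through its FALDO region to a coordinate. It is an illustration under stated assumptions, not the TogoGenome implementation. The triples are transcribed from the slide (the `rdf:type` statements are omitted for brevity), and `MODEL_TRIPLES`, `objects`, `first`, `ordered_exons`, and `begin_coordinate` are hypothetical names. `<exon2>` and `<position2>` are mentioned on the slide but not described there, so they stay undescribed here too.

```python
# Triples transcribed from the slide; integers stand in for xsd:integer literals.
MODEL_TRIPLES = [
    ("<mRNA>", "sio:is-transcribed-from", "<gene>"),
    ("<mRNA>", "sio:has-ordered-part", "<p1>"),
    ("<mRNA>", "sio:has-ordered-part", "<p2>"),
    ("<p1>", "sio:has-value", 1),
    ("<p1>", "sio:refers-to", "<exon1>"),
    ("<p2>", "sio:has-value", 2),
    ("<p2>", "sio:refers-to", "<exon2>"),
    ("<exon1>", "faldo:location", "<region1>"),
    ("<region1>", "faldo:begin", "<position1>"),
    ("<region1>", "faldo:end", "<position2>"),
    ("<position1>", "faldo:position", 12345),
    ("<position1>", "faldo:reference", "<chromosome>"),
]


def objects(subject, predicate):
    """Return all objects of triples with the given subject and predicate."""
    return [o for s, p, o in MODEL_TRIPLES if s == subject and p == predicate]


def first(subject, predicate):
    """Return the first matching object, or None if there is none."""
    found = objects(subject, predicate)
    return found[0] if found else None


def ordered_exons(mrna):
    """List the exons of an mRNA, sorted by the sio:has-value rank of each part."""
    parts = sorted(objects(mrna, "sio:has-ordered-part"),
                   key=lambda part: first(part, "sio:has-value"))
    return [first(part, "sio:refers-to") for part in parts]


def begin_coordinate(feature):
    """Follow feature -> faldo:location -> faldo:begin -> faldo:position."""
    region = first(feature, "faldo:location")
    position = first(region, "faldo:begin") if region else None
    return first(position, "faldo:position") if position else None


print(ordered_exons("<mRNA>"))      # ['<exon1>', '<exon2>']
print(begin_coordinate("<exon1>"))  # 12345
print(begin_coordinate("<exon2>"))  # None: the slide gives no location for <exon2>
```

The ordered-part indirection (`<p1>`, `<p2>`) is what lets the model record exon order explicitly instead of relying on coordinates alone.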


INSDC/RefSeq/Ensembl RDF:

RDF summit, May 17-20, 2014
Standardization of RDF models for genomics

Standardization of INSDC nucleotide annotations in RDF

INSDC

Ontology for locations of annotations

Common URIs to be shared

Common RDF model for genomes

Transcriptomes and regulations

Personal/Japanese genomes

Global Alliance for Genomics and Health
http://genomicsandhealth.org/

The greatest need was a common framework of international standards designed to enable and oversee how genomic and clinical data are shared in an effective, responsible, and interpretable manner.

to develop this common framework, enabling learning from data while protecting participant autonomy and privacy.

Over 180 organizations worldwide (2013-14) + Google (since 2014/2/28)

To enable secure sharing of genomic and clinical data

18,487

March 11, 2011

http://www9.nhk.or.jp/311shogen/map/

Tohoku Medical Megabank Organization

• As one of the reconstruction plans, Tohoku Medical Megabank Organization was founded for rebuilding the community medical system by developing a biobank that combines medical and genome information for supporting health and welfare in the Tohoku area.

SELECT ?question
WHERE {
  ?question :bb|^:b{2} ?question .
}

Questions?

[Figure: graph of the query pattern, with node labels ?question and ?x and edge labels :bb, :b, :b]