10th Course in Bioinformatics and Systems Biology for Molecular Biologists, Schloss Hohenkammer, Hohenkammer, Germany, March 15, 2010.
Integration of heterogeneous data
Lars Juhl Jensen
data mining
text mining
interaction networks
Kuhn et al., Nucleic Acids Research, 2010
parts lists
630 genomes
2.5 million proteins
~74,000 small molecules
many databases
different formats
model organism databases
Ensembl
RefSeq
PubChem
genomic context
gene fusion
Korbel et al., Nature Biotechnology, 2004
conserved neighborhood
operons
Korbel et al., Nature Biotechnology, 2004
bidirectional promoters
Korbel et al., Nature Biotechnology, 2004
phylogenetic profiles
Korbel et al., Nature Biotechnology, 2004
experimental data
gene coexpression
protein interactions
Jensen & Bork, Science, 2008
genetic interactions
Beyer et al., Nature Reviews Genetics, 2007
small molecule interactions
in vitro binding assays
cellular activity assays
many databases
GEO (Gene Expression Omnibus)
BIND (Biomolecular Interaction Network Database)
BioGRID (General Repository for Interaction Datasets)
DIP (Database of Interacting Proteins)
IntAct
MINT (Molecular Interactions Database)
HPRD (Human Protein Reference Database)
PDB (Protein Data Bank)
BindingDB
CTD (Comparative Toxicogenomics Database)
DrugBank
GLIDA (GPCR-Ligand Database)
MATADOR
PDSP Ki (Psychoactive Drug Screening Program)
PharmGKB (Pharmacogenomics Knowledge Base)
different formats
different identifiers
partially redundant
Campillos & Kuhn et al., Science, 2008
curated knowledge
complexes
pathways
Letunic & Bork, Trends in Biochemical Sciences, 2008
many databases
Gene Ontology
MIPS (Munich Information Center for Protein Sequences)
KEGG (Kyoto Encyclopedia of Genes and Genomes)
MetaCyc
Reactome
PID (NCI-Nature Pathway Interaction Database)
high confidence
different formats
different identifiers
partially redundant
literature mining
>10 km
human readable
not computer readable
different names
text corpus
MEDLINE
SGD (Saccharomyces Genome Database)
The Interactive Fly
OMIM (Online Mendelian Inheritance in Man)
thesaurus
co-mentioning
statistical methods
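Co-mentioning can be illustrated with a toy sketch: count how often two entities are named in the same text, using a thesaurus to map the different names to one identifier. The function name `comention_counts`, the tokenization rule, and the thesaurus layout are assumptions made for illustration; the slides do not specify how the counts are computed or which statistic is applied to them afterwards.

```python
import re
from collections import Counter
from itertools import combinations


def comention_counts(documents, thesaurus):
    """Count documents in which each pair of entities is mentioned together.

    documents: iterable of text strings (for example, abstracts).
    thesaurus: dict mapping a lowercase name to a canonical identifier
               (hypothetical layout, chosen for this sketch).
    """
    counts = Counter()
    for text in documents:
        # Naive tokenization: lowercase alphanumeric runs only.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        # Each entity counts once per document, however often it is named.
        entities = {thesaurus[token] for token in tokens if token in thesaurus}
        for pair in combinations(sorted(entities), 2):
            counts[pair] += 1
    return counts
```

Raw counts like these favor frequently mentioned entities, which is why a statistical score is usually layered on top of them.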
NLP (Natural Language Processing)
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
[nxgene The GAL4 gene]
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
restricted access
Reflect
augmented browsing
Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009
integration
the easy problems
many databases
different formats
different identifiers
partially redundant
parsers
thesaurus
bookkeeping
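The "easy problems" amount to parsing each source, mapping its identifiers through a thesaurus, and keeping track of which source said what. A minimal sketch of the last two steps follows, assuming the parsers have already produced `(source, id_a, id_b)` tuples. All names here (`merge_interactions`, `alias_to_canonical`, the record layout) are hypothetical and not taken from the STRING or STITCH code.

```python
def merge_interactions(records, alias_to_canonical):
    """Merge partially redundant interaction records from several sources.

    records: iterable of (source, id_a, id_b) tuples, as a parser might emit.
    alias_to_canonical: dict mapping any known identifier to one canonical
                        identifier (the thesaurus; hypothetical layout).
    Returns (merged, unmapped): merged maps a sorted canonical pair to the
    set of sources reporting it; unmapped lists records that could not be
    translated, so that nothing is dropped silently.
    """
    merged = {}
    unmapped = []
    for source, id_a, id_b in records:
        canonical_a = alias_to_canonical.get(id_a)
        canonical_b = alias_to_canonical.get(id_b)
        if canonical_a is None or canonical_b is None:
            unmapped.append((source, id_a, id_b))
            continue
        # Sort so that A-B and B-A collapse into the same key.
        pair = tuple(sorted((canonical_a, canonical_b)))
        merged.setdefault(pair, set()).add(source)
    return merged, unmapped
```

Keeping the set of sources per pair is the bookkeeping part: it records redundancy between databases instead of discarding it.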
the hard problems
many data types
not comparable
variable quality
raw quality scores
intergenic distances
Korbel et al., Nature Biotechnology, 2004
correlations
reproducibility
von Mering et al., Nucleic Acids Research, 2005
score calibration
gold standard
von Mering et al., Nucleic Acids Research, 2005
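One way to picture score calibration against a gold standard: rank the pairs by raw score, cut the ranking into bins, and record what fraction of each bin the gold standard marks as true. This is a sketch under assumptions, not the procedure from the cited paper; `calibration_table`, `gold_labels`, and the equal-size binning are illustrative choices.

```python
def calibration_table(scored_pairs, gold_labels, bin_size):
    """Relate raw scores to the fraction of gold-standard positives.

    scored_pairs: iterable of (pair, raw_score).
    gold_labels: dict mapping a pair to True (known positive) or False
                 (known negative); pairs absent from it are skipped.
    bin_size: number of benchmarked pairs per bin (must be positive).
    Returns a list of (mean_raw_score, fraction_positive), best bin first.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    # Only pairs covered by the gold standard can be used for calibration.
    benchmarked = [(p, s) for p, s in scored_pairs if p in gold_labels]
    ranked = sorted(benchmarked, key=lambda item: item[1], reverse=True)
    table = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        mean_score = sum(score for _, score in chunk) / len(chunk)
        positives = sum(1 for pair, _ in chunk if gold_labels[pair])
        table.append((mean_score, positives / len(chunk)))
    return table
```

A curve fitted through such a table would turn any raw score, whether an intergenic distance or a correlation, into a comparable probability-like value.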
spread over 630 genomes
transfer by orthology
von Mering et al., Nucleic Acids Research, 2005
two modes
COG mode
von Mering et al., Nucleic Acids Research, 2005
protein mode
von Mering et al., Nucleic Acids Research, 2005
combine all evidence
P = 1-(1-P1)(1-P2)(1-P3) …
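The formula above treats the evidence types as independent: the combined score is the probability that at least one of them is right. A small sketch, with `combine_scores` as a hypothetical helper name:

```python
def combine_scores(probabilities):
    """Combine per-evidence probabilities as P = 1 - (1-P1)(1-P2)(1-P3)...

    Assumes the evidence types are independent, as the formula does.
    """
    all_wrong = 1.0
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        # Probability that this and every earlier evidence type is wrong.
        all_wrong *= 1.0 - p
    return 1.0 - all_wrong
```

For example, two evidence types that each score 0.5 combine to 0.75, so the combined score is never lower than the strongest single score.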
visualize
Kuhn et al., Nucleic Acids Research, 2010
access
access for humans
web interfaces
access for computers
web services
REST (Representational State Transfer)
SOAP (Simple Object Access Protocol)
Acknowledgments
STITCH
– Michael Kuhn
– Damian Szklarczyk
– Andrea Franceschini
– Monica Campillos
– Christian von Mering
– Lars Juhl Jensen
– Andreas Beyer
– Peer Bork
Reflect
– Sean O’Donoghue
– Heiko Horn
– Sune Frankild
– Evangelos Pafilis
– Michael Kuhn
– Nigel Brown
– Reinhardt Schneider
STRING
– Christian von Mering
– Michael Kuhn
– Manuel Stark
– Samuel Chaffron
– Chris Creevey
– Jean Muller
– Tobias Doerks
– Philippe Julien
– Alexander Roth
– Milan Simonovic
– Jan Korbel
– Berend Snel
– Martijn Huynen
– Peer Bork