Large-scale integration of data and text
Lars Juhl Jensen
association networks
text mining
localization and disease
me
promoter analysis
Jensen & Knudsen, Bioinformatics, 2000
function prediction
Jensen, Gupta et al., Journal of Molecular Biology, 2002
protein networks
de Lichtenberg, Jensen et al., Science, 2005
chemoinformatics
Campillos, Kuhn et al., Science, 2008
data mining
text mining
electronic health records
association networks
guilt by association
STRING
~2.6 million proteins
Szklarczyk, Franceschini et al., Nucleic Acids Research, 2011
STITCH
~300,000 small molecules
Kuhn et al., Nucleic Acids Research, 2012
genomic context
gene fusion
Korbel et al., Nature Biotechnology, 2004
operons
Korbel et al., Nature Biotechnology, 2004
bidirectional promoters
Korbel et al., Nature Biotechnology, 2004
metagenome neighborhood
Harrington et al., PNAS, 2007
phylogenetic profiles
Korbel et al., Nature Biotechnology, 2004
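A minimal sketch of the phylogenetic-profile idea: genes that are present and absent in the same genomes are candidates for a functional link. The profiles, the gene names, and the choice of Jaccard similarity are illustrative assumptions, not the method used in the cited work.

```python
def profile_similarity(profile_a, profile_b):
    """Jaccard similarity of two presence/absence profiles.

    Each profile is a sequence of 0/1 flags, one per genome,
    listed in the same genome order for both genes.
    """
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


# Hypothetical profiles across five genomes (1 = gene present).
gene_x = [1, 1, 0, 1, 0]
gene_y = [1, 1, 0, 0, 0]
gene_z = [0, 0, 1, 0, 1]

similar = profile_similarity(gene_x, gene_y)    # shared in 2 of 3 genomes
unrelated = profile_similarity(gene_x, gene_z)  # never co-occur
```

Pairs with high similarity across many genomes would be the ones worth a closer look; a real analysis would also have to account for how closely related the genomes are.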
a real example
Cell
Cellulosomes
Cellulose
experimental data
gene coexpression
protein interactions
Jensen & Bork, Science, 2008
curated knowledge
drug targets
complexes
pathways
Letunic & Bork, Trends in Biochemical Sciences, 2008
many databases
different formats
different identifiers
variable quality
not comparable
hard work
quality scores
von Mering et al., Nucleic Acids Research, 2005
calibrate vs. gold standard
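One way to picture calibration against a gold standard: rank the pairs by raw score, cut the ranking into bins, and record which fraction of each bin is confirmed by the gold standard. The sketch below assumes that simple binning scheme; the function name, bin size, and toy data are hypothetical, and the slides do not spell out the actual procedure.

```python
def calibrate(scored_pairs, gold_positives, bin_size):
    """Return (lowest raw score in bin, fraction in gold standard) per bin.

    scored_pairs:   list of (pair, raw_score) tuples
    gold_positives: set of pairs known to be true
    bin_size:       number of pairs per bin (the last bin may be smaller)
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    table = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        hits = sum(1 for pair, _ in chunk if pair in gold_positives)
        table.append((chunk[-1][1], hits / len(chunk)))
    return table


# Toy data: raw scores from one evidence type and a tiny gold standard.
scored = [(("a", "b"), 0.9), (("a", "c"), 0.8),
          (("b", "c"), 0.3), (("c", "d"), 0.1)]
gold = {("a", "b"), ("a", "c"), ("c", "d")}

lookup = calibrate(scored, gold, bin_size=2)
```

Because every evidence type is mapped onto the same scale (fraction confirmed by the gold standard), scores from different databases become comparable.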
missing most of the data
text mining
>10 km
too much to read
computer
as smart as a dog
teach it specific tricks
named entity recognition
comprehensive lexicon
cyclin dependent kinase 1
CDK1
CDC2
flexible matching
spaces and hyphens
cyclin dependent kinase 1
cyclin-dependent kinase 1
orthographic variation
CDC2
hCdc2
“black list”
SDS
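The lexicon, flexible matching, and black list can be sketched together as a small dictionary-based tagger. This is an illustration under stated assumptions, not the tagger behind the resources in this talk: the lexicon entries, the identifier strings, the optional "h" prefix rule, and the function names are all hypothetical.

```python
import re

# Hypothetical mini-lexicon: name -> illustrative identifier.
LEXICON = {
    "cyclin dependent kinase 1": "CDK1",
    "CDK1": "CDK1",
    "CDC2": "CDK1",
    "SDS": "SOME_GENE",  # ambiguous name, suppressed by the black list
}
# Names that should never be tagged, even though the lexicon has them.
BLACK_LIST = {"SDS"}


def name_to_pattern(name):
    """Regex tolerating space/hyphen variation, case, and an 'h' prefix."""
    tokens = re.split(r"[\s-]+", name)
    body = r"[\s-]?".join(re.escape(token) for token in tokens)
    return re.compile(r"\bh?" + body + r"\b", re.IGNORECASE)


def tag(text):
    """Return sorted (start, end, matched text, identifier) tuples."""
    hits = []
    for name, identifier in LEXICON.items():
        if name in BLACK_LIST:
            continue
        for match in name_to_pattern(name).finditer(text):
            hits.append((match.start(), match.end(),
                         match.group(0), identifier))
    return sorted(hits)


sentence = "Cyclin-dependent kinase 1 (hCdc2) was run on an SDS gel."
mentions = tag(sentence)
```

Here the hyphenated long name and the "hCdc2" spelling both resolve to the same identifier, while "SDS" is left untagged. Looping over every lexicon name is fine for a toy example but would be far too slow for a comprehensive lexicon.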
information extraction
count co-mentioning
within documents
within paragraphs
within sentences
scoring scheme
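Counting co-mentions at the three levels can be sketched as a weighted sum, where a pair mentioned in the same sentence also counts as sharing a paragraph and a document. The weights and the plain additive score are assumptions chosen for illustration; the slides name a scoring scheme but do not give its form.

```python
from collections import Counter
from itertools import combinations

# Illustrative weights (assumed): closer co-mentions count for more.
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 4.0}


def comention_scores(documents):
    """Score entity pairs by weighted co-mention counts.

    documents: list of documents; each document is a list of paragraphs;
    each paragraph is a list of sentences; each sentence is a collection
    of entity names that have already been tagged.
    """
    scores = Counter()
    for document in documents:
        doc_entities = set()
        for paragraph in document:
            par_entities = set()
            for sentence in paragraph:
                for pair in combinations(sorted(set(sentence)), 2):
                    scores[pair] += WEIGHTS["sentence"]
                par_entities.update(sentence)
            for pair in combinations(sorted(par_entities), 2):
                scores[pair] += WEIGHTS["paragraph"]
            doc_entities.update(par_entities)
        for pair in combinations(sorted(doc_entities), 2):
            scores[pair] += WEIGHTS["document"]
    return scores


# One toy document with two paragraphs; entity names are placeholders.
corpus = [[[{"CDK1", "geneB"}, {"geneC"}], [{"CDK1"}]]]
scores = comention_scores(corpus)
```

In this toy corpus the pair sharing a sentence scores 7.0 (4 + 2 + 1), while pairs that only share a paragraph score 3.0 (2 + 1). A real scheme would also need to normalize for how often each entity is mentioned overall.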
corpora
~22 million abstracts
no access
~4 million full-text articles
augmented browsing
Reflect
browser add-on
real-time text mining
Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009
O’Donoghue et al., Journal of Web Semantics, 2010
localization and disease
small molecules
proteins
compartments
tissues
diseases
organisms
environments
suite of web resources
common backend database
jensenlab.org
text mining
curated knowledge
experimental data
computational predictions
quality scores
web-centric databases
DISEASES
visualization
COMPARTMENTS
compartments.jensenlab.org
TISSUES
tissues.jensenlab.org
project onto networks
Szklarczyk, Franceschini et al., Nucleic Acids Research, 2011
compartments.jensenlab.org
tissues.jensenlab.org
diseases.jensenlab.org
summary
bioinformatics
more than alignment
data/text mining
save you much time
Acknowledgments
STRING/STITCH
Christian von Mering
Damian Szklarczyk
Michael Kuhn
Manuel Stark
Samuel Chaffron
Chris Creevey
Jean Muller
Tobias Doerks
Philippe Julien
Alexander Roth
Milan Simonovic
Jan Korbel
Berend Snel
Martijn Huynen
Peer Bork
Literature mining
Sune Frankild
Evangelos Pafilis
Janos Binder
Kalliopi Tsafou
Alberto Santos
Heiko Horn
Michael Kuhn
Nigel Brown
Reinhardt Schneider
Sean O’Donoghue
Questions?