DESCRIPTION
Keynote presentation at the ISMB Bio-ontologies SIG (Vienna, Austria) on July 15, 2011. (Apologies, I occasionally use animations that obscure some slide content, so feel free to download the PowerPoint version to see what's underneath...)
Cultivating and mining the Gene Wiki for crowdsourced gene annotation
ISMB Bio-Ontologies SIG
July 14, 2011
Andrew Su, Ph.D.
Few genes are well annotated…
[Figure: counts per gene (Gene Ontology and PubMed) for 23,278 protein-coding genes, with genes sorted by decreasing counts. Chart labels: 38%, 59%.]
Top genes: TP53, TNF, APOE, MTHFR, IL6, HLA-DRB1, VEGFA, EGFR, TGFB1, ACE
Data: NCBI gene2pubmed, August 2010
[Figure: number of PubMed-indexed articles per year, 1979 to 2009; y-axis from 0 to 1,000,000]
… because the literature is sparsely curated?
[Figure: average capacity of a human scientist (number of articles read by a typical scientist)]
311,696 articles (1.5% of PubMed) have been cited by GO annotations
Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
The Long Tail is a prolific source of content
[Figure: content produced vs. contributors (sorted), split into a Short Head and a Long Tail]
Category: Short Head / Long Tail
Publishing: Newspapers / Blogs
Video: TV/Hollywood / YouTube
Product reviews: Consumer reports / Amazon reviews
Food reviews: Food critics / Yelp
Judging: Olympics / American Idol
Wikipedia is reasonably accurate
Wikipedia has breadth and depth
[Table: Wikipedia vs. Britannica Online, compared by articles, words (millions), and words per article]
Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
We can harness the Long Tail of scientists to directly participate in the gene annotation process.
10,000 gene “stubs” within Wikipedia
Protein structure
Symbols and identifiers
Tissue expression pattern
Gene Ontology annotations
Links to structured databases
Gene summary
Protein interactions
Linked references
Wiki success depends on a positive feedback loop
[Figure: feedback cycle linking Gene Wiki page utility, number of users, and number of contributors]
Filtering, extracting, and summarizing PubMed
Documents
Concepts
A review article for every gene is powerful
Hyperlinks to related concepts
References to the literature
Reelin: 68 editors, 543 edits since July 2002
Heparin: 175 editors, 320 edits since June 2003
AMPK: 44 editors, 84 edits since March 2004
RNAi: 232 editors, 708 edits since October 2002
Gene Wiki has a diverse critical mass of readers
Utility
Users
Contributors
Rank 1-10: General society
Insulin, Titin, Human chorionic gonadotropin, Vasopressin, ANKH, CLOCK, Catalase, Erythropoietin, Glucagon, Parathyroid hormone
Rank 101-110: Scientists
Tau protein, Interleukin 10, APC, C-Met, Factor V, Interleukin 8, CD44, Histamine H1 receptor, Kappa Opioid receptor, Dihydrofolate reductase
Rank 1001-1010: Specialists
CSDA, CNTNAP2, IGSF8, Adenosine A3 receptor, RYR1, ETV6, Small heterodimer partner, 5-HT1D receptor, TRPC6, Interleukin-6 receptor
Total: 5.0 million views / month
Readership is poised to grow
The Gene Wiki has a critical mass of editors
In Jan – Jun 2010 …
… 7474 edits were made by 2109 unique users
… total increase in text ≈ 20 PLoS Biology research articles
[Figure: editor count (Editors) and edit count (Edits)]
Making the Gene Wiki more reliable
The company name is derived from old Greek, and means "destroyer of birds".
Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), …
http://www.wikitrust.net/
High-trust author: 36211 total edits
Low-trust author: 36 total edits
Making the Gene Wiki more computable
Structured annotations
!
Free text
Example text from 5-HT1A receptor
Snippet from article on 5-HT1A receptor:
“…5-HT1A receptor agonists decrease blood pressure and heart rate or cause hypotension via a central mechanism, by inducing peripheral vasodilation, and by stimulating the vagus nerve…”
[Figure: 5-HT1A receptor shown with the concepts Receptor, Agonists, Heart rate, Blood pressure, Hypotension, Vagus nerve, Vasodilation]
Re-discovering common knowledge
Wikilink
GO exact synonym
Gene Wiki mapping
NCBI Entrez Gene: 3362
GO:0004993
Candidate assertion
Mining the most recent literature
Wikilink
GO related concept
Gene Wiki mapping
NCBI Entrez Gene: 57620
GO:0030154
Candidate assertion
Filling the gaps in gene annotation
Wikilink
GO exact match
Gene Wiki mapping
NCBI Entrez Gene: 334
GO:0006897
Candidate assertion
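The three slides above combine a wikilink, a GO term lookup, and the Gene Wiki page-to-gene mapping into a candidate assertion. A minimal sketch of that join follows. The lookup tables, their keys, and the function name are hypothetical placeholders; only the two identifiers are taken from the slides, and the slides do not describe the actual implementation.

```python
from typing import List, NamedTuple


class CandidateAssertion(NamedTuple):
    entrez_gene_id: str
    go_id: str


# Hypothetical lookup tables. The keys are placeholders invented for this
# sketch; the identifier values are the ones shown on the slide.
PAGE_TO_ENTREZ = {"Example gene page": "3362"}   # Gene Wiki mapping
GO_LABELS = {"example go term": "GO:0004993"}    # GO names and synonyms


def candidate_assertions(page_title: str, wikilinks: List[str]) -> List[CandidateAssertion]:
    """Pair the page's gene with every wikilink that matches a GO label."""
    gene_id = PAGE_TO_ENTREZ.get(page_title)
    if gene_id is None:
        return []  # page is not a mapped Gene Wiki article
    results = []
    for link in wikilinks:
        go_id = GO_LABELS.get(link.strip().lower())
        if go_id is not None:
            results.append(CandidateAssertion(gene_id, go_id))
    return results


print(candidate_assertions("Example gene page", ["Example GO term", "Unrelated link"]))
```

Links that do not match a GO label are simply skipped, so the output contains only gene and term pairs that a curator could review.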
Disease associations mined from the Gene Wiki
Pipeline: Gene Wiki Articles (10,271), Filter out seeded text, NCBO Annotator, Matched Disease Ontology terms (2983), Compare to DO database, 2147 candidate annotations
Comparison to DO database:
70% have no match
23% exact match
5% match parent
2% match child
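The comparison step sorts each mined term into exact match, match parent, match child, or no match. A minimal sketch of such a classifier follows, over a toy hierarchy with made-up term IDs. It assumes that "match child" means the mined term is more specific than an existing annotation and "match parent" means it is less specific; the slide does not define the categories. That reading is at least roughly consistent with the counts shown, since 2147 of 2983 is about 72%, close to the 70% "no match" plus 2% "match child".

```python
# Toy is_a hierarchy (child -> parent). All term IDs are invented for
# illustration and do not come from the Disease Ontology.
PARENT = {"T:leaf": "T:mid", "T:mid": "T:root"}


def ancestors(term):
    """Return every ancestor of `term` in the toy hierarchy."""
    seen = set()
    while term in PARENT:
        term = PARENT[term]
        seen.add(term)
    return seen


def classify(mined, existing):
    """Compare one mined term with the set of terms already annotated to the gene."""
    if mined in existing:
        return "exact match"
    if ancestors(mined) & set(existing):
        # Assumption: the mined term is more specific than a known annotation.
        return "match child"
    if any(mined in ancestors(term) for term in existing):
        # Assumption: the mined term is less specific than a known annotation.
        return "match parent"
    return "no match"


for mined, existing in [("T:mid", {"T:mid"}), ("T:leaf", {"T:mid"}),
                        ("T:root", {"T:leaf"}), ("T:leaf", set())]:
    print(mined, sorted(existing), classify(mined, existing))
```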
Disease associations mined from the Gene Wiki
Expert curation
Correct Maybe Incorrect
86%
4%
10%
Overall specificity: 90-93%
GO associations mined from the Gene Wiki
Pipeline: Gene Wiki Articles (10,271), Filter out seeded text, NCBO Annotator, Matched Gene Ontology terms (11,022), Compare to GO database, 6319 candidate annotations
Comparison to GO database:
55% have no match
17% exact match
26% match parent
2% match child
GO associations mined from the Gene Wiki
Expert curation
Correct Maybe Incorrect
60%
26%
14%
Overall specificity: 48-64%
Common sources of error in GO associations
OR2F1: “Olfactory receptors … are responsible for the recognition and G protein-mediated transduction of odorant signals.”
1) Incorrect concept recognition
Transduction (GO:0009293)
The transfer of genetic information to a bacterium from a bacteriophage or between bacterial or yeast cells mediated by a phage vector.
Signal transduction (GO:0007165)
The cellular process in which a signal is conveyed to trigger a change in the activity or state of a cell. Signal transduction begins with reception of a signal, e.g. a ligand binding to a receptor or receptor activation by a stimulus such as light, and ends with regulation of a downstream cellular process…
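The failure mode on this slide can be reproduced with a naive dictionary matcher. The sketch below is not the NCBO Annotator's algorithm. It is a hypothetical longest-label-first string matcher, using only the two GO labels shown above, meant to illustrate why "transduction of odorant signals" is tagged with the phage-related term.

```python
import re

# The two labels and IDs shown on the slide.
LABELS = {
    "transduction": "GO:0009293",
    "signal transduction": "GO:0007165",
}


def recognize(text):
    """Return (label, GO ID) pairs found in `text`, preferring longer labels."""
    lowered = text.lower()
    covered = [False] * len(lowered)
    found = []
    for label in sorted(LABELS, key=len, reverse=True):
        pattern = r"\b" + re.escape(label) + r"\b"
        for match in re.finditer(pattern, lowered):
            span = range(match.start(), match.end())
            if any(covered[i] for i in span):
                continue  # already claimed by a longer label
            for i in span:
                covered[i] = True
            found.append((label, LABELS[label]))
    return found


# The words "signal" and "transduction" are not adjacent here, so only the
# shorter label matches.
print(recognize("G protein-mediated transduction of odorant signals."))
print(recognize("Signal transduction begins with reception of a signal."))
```

Because the matcher only sees surface strings, word order alone decides which term is assigned.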
Common sources of error in GO associations
MEF2C: “Several post translational modifications have been identified including phosphorylation on serine-59 …”
2) Incorrect sentence context
Dephosphorylation, Excretion, Gene expression, Glycosylation, Localization, Methylation, Proteolysis, Secretion, Transport, Transcription, Translation
MEF2C
Myelination
Phosphorylation
Neurogenesis
Is 48-64% specificity useful?
[Diagram labels: GO term, Gene list, PubMed abstracts + Gene Wiki, Concept recognition, Enrichment analysis]
Example: muscle contraction (GO:0006936), 87 genes
5449 articles
87 articles
Linked genes by PubMed only: P = 1.0
Linked genes by PubMed + Gene Wiki: P = 1.22E-09
GO associations improve enrichment analyses
[Figure: p-value (PubMed only) vs. p-value (PubMed + Gene Wiki); labeled point: Muscle contraction]
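The slides do not say which statistical test produced these p-values. One common choice for enrichment analysis is the one-sided hypergeometric test, sketched below under that assumption. The function name and the numbers in the usage lines are invented for illustration and are not the data behind the slide.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, overlap):
    """P(X >= overlap) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` are linked to the term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Made-up numbers: 10 genes in total, 5 linked to the term, 5 in the gene list.
print(hypergeom_enrichment_p(10, 5, 5, 5))  # every listed gene is linked
print(hypergeom_enrichment_p(10, 5, 5, 0))  # no overlap required
```

Under this test, adding gene-term links mined from the Gene Wiki changes `annotated` and `overlap`, which is how the extra associations can shift a term from non-significant to significant.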
“Like the image of the [mammoth] hairball, it is equally unhelpful in understanding the object’s properties. You can guess that the network is large and its connectivity is complex, but not more. At best, the visualization is merely decorative.”
- Martin Krzywinski
http://mkweb.bcgsc.ca/linnet/talks/linnet-informatics2010.pdf
TOP 100 GENES
Mapping to many biomedical semantic groups
✓ Semantic representation
From text mining to a Semantic Gene Wiki
Columns: Community contributions / Semantics / Semantic querying
Gene Wiki / Wikipedia: ✓ ✗
Semantic Gene Wiki: – ✓ ✓
Home-grown wiki: ✗ ✓ ✓
?
Semantic Wiki Links
Gene Wiki (based on MediaWiki)
Wikitext: [[apoptosis]]
Rendered text: apoptosis
Semantic Gene Wiki (based on Semantic MediaWiki (SMW))
Wikitext: [[promote::apoptosis]], [[repress::apoptosis]], [[modulate::apoptosis]]
Rendered text: apoptosis
Template: {{SWL|target=apoptosis|type=promotes}}
Mirror and translate
Semantic queries, RDF, etc
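The "mirror and translate" step can be pictured as a text rewrite from the SWL template to SMW property syntax. The sketch below is an assumption about what such a translation might look like, not the project's actual code. It copies the `type` value verbatim as the property name (the slide shows slightly different spellings, `promotes` vs. `promote`), and the regular expression only handles the two-parameter form shown on the slide.

```python
import re

# Matches the template form shown on the slide: {{SWL|target=...|type=...}}
SWL = re.compile(r"\{\{SWL\|target=([^|}]+)\|type=([^|}]+)\}\}")


def swl_to_smw(wikitext):
    """Rewrite each SWL template as an SMW annotation [[type::target]]."""
    return SWL.sub(
        lambda m: "[[{}::{}]]".format(m.group(2).strip(), m.group(1).strip()),
        wikitext,
    )


def swl_to_plain_link(wikitext):
    """Rewrite each SWL template as an ordinary wikilink [[target]]."""
    return SWL.sub(lambda m: "[[{}]]".format(m.group(1).strip()), wikitext)


source = "This protein {{SWL|target=apoptosis|type=promotes}} in some cells."
print(swl_to_smw(source))
print(swl_to_plain_link(source))
```

Keeping the relation inside a template means the same source text can render as a plain link on one wiki and as a typed, queryable link on the other.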
For community-based science, data is king
Data without structure is valuable, but structure without data is not.
Domain expert
Information scientist
Task: Wikipedia resource
Copy-editing: WP:MCB, Boghog
Figures: Artists and illustrators
Structure: Wiki links, infoboxes
Citations: DOI bot, CitationBot
Provenance: WikiTrust
The Gene Wiki successfully harnesses the Long Tail of scientists for community annotation of gene function
Group members
Erik Clarke, Ben Good (*), Ian Macleod, Chunlei Wu
Collaborators
Doug Howe, ZFIN
Salvatore Loguercio (*), TU Dresden
John Hogenesch, U Penn
Jon Huss, GNF
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project
WikiTrust (UCSC)
Luca de Alfaro, Bo Adler, Ian Pye
Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820)
ISMB travel support
Contact
http://sulab.org
[email protected]
@andrewsu
+Andrew Su
(*) See talk on SNPedia mashup at 1:55 PM