Presentation on "Microtask crowdsourcing for annotating diseases in PubMed abstracts" at ASHG14 session on "Cloudy with a chance of big data".
Microtask crowdsourcing for
annotating diseases in
PubMed abstracts
Andrew Su, Ph.D. @andrewsu
http://sulab.org
October 20, 2014
ASHG
Slides: slideshare.net/andrewsu
Potential conflicts of interest
• Novartis
• Assay Depot
• Avera Health
[Figure: Genome-scale profiling (RNA-seq, exome seq, whole genome seq, proteomics, genotyping, copy-number analysis, ChIP-seq, methylation, functional genomics) of Condition A vs. Condition B yields candidate genes/proteins.]
[Figure: Candidate genes/proteins linked to related diseases, related drugs, and related pathways.]
Databases are fragmented and incomplete
[Figure: Venn diagram of disease links for Apolipoprotein E in KEGG (4), OMIM (6), PharmGKB (10), and HuGE Navigator (517).]
[Figure: Number of new PubMed-indexed articles per year, 1983 to 2013 (y-axis 0 to 1,200,000).]
http://www.flickr.com/photos/portland_mike/6140660504/
Harnessing
the crowd…
… to organize
information
http://www.flickr.com/photos/45697441@N00/6629580443
Information extraction for a Network of BioThings
1. Find mentions of high level concepts in text
2. Map mentions to specific terms in ontologies
3. Identify relationships between concepts
[Diagram: Genes/proteins, Diseases, Drugs, Pathways]
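Steps 1 and 2 above can be illustrated with a toy dictionary lookup in Python. This is only a sketch under stated assumptions, not the approach taken in the talk: the lexicon, the `EX:` term IDs, and the function name `find_and_map` are all hypothetical placeholders, and step 3 (relationships) is not covered.

```python
import re

# Hypothetical lexicon: surface form -> placeholder ontology term ID.
# The "EX:" identifiers are invented for illustration only.
LEXICON = {
    "huntington disease": "EX:0001",
    "hd": "EX:0001",
    "breast cancer": "EX:0002",
}


def find_and_map(text, lexicon=LEXICON):
    """Step 1: find mentions in text; step 2: map each to a term ID.

    Returns a sorted list of (start, end, matched_text, term_id).
    """
    hits = []
    for surface, term_id in lexicon.items():
        # Whole-word, case-insensitive match of the surface form
        pattern = r"\b" + re.escape(surface) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.end(), m.group(0), term_id))
    return sorted(hits)


text = "Mutations are associated with Huntington disease (HD)."
hits = find_and_map(text)
for hit in hits:
    print(hit)
```

A fixed lexicon like this misses unseen variants and abbreviations, which is one reason human annotation of mentions is useful.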
The NCBI Disease corpus
• 793 PubMed abstracts
• 12 expert annotators (2 annotate each abstract)
• 6,900 “disease” mentions
Doğan, Rezarta, and Zhiyong Lu. Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.
Question: Can a group of non-scientists collectively perform concept recognition in biomedical texts?
Experimental design
Task: Identify the disease mentions in the PubMed abstracts from the NCBI Disease corpus
– 5 non-scientists annotate each abstract
– The details:
• Recruit workers using Amazon Mechanical Turk
• Pay $0.066 per Human Intelligence Task (HIT)
• HIT = annotate one abstract from PubMed
Instructions to workers
• Highlight all diseases and disease abbreviations
– “...are associated with Huntington disease ( HD )... HD patients received...”
– “The Wiskott-Aldrich syndrome ( WAS ) , an X-linked immunodeficiency…”
• Highlight the longest span of text specific to a disease
– “... contains the insulin-dependent diabetes mellitus locus …”
• Highlight disease conjunctions as single, long spans
– “... a significant fraction of familial breast and ovarian cancer , but undergoes…”
• Highlight symptoms (physical results of having a disease)
– “XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”
Aggregation function based on simple voting
[Figure: four copies of the example abstract below, each highlighting the spans that received at least K worker votes: 1 or more votes (K=1), K=2, K=3, K=4.]
This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.
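The voting scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study. It assumes each worker's highlights are stored as (start, end) character offsets and that two highlights count as the same vote only when their offsets match exactly. The name `aggregate_spans` and the sample offsets are hypothetical.

```python
from collections import Counter


def aggregate_spans(worker_spans, k):
    """Keep every span highlighted by at least k distinct workers.

    worker_spans: one list of (start, end) character offsets per worker.
    Assumption: highlights are the same vote only on an exact offset match.
    """
    votes = Counter()
    for spans in worker_spans:
        # set() ensures each worker votes at most once per span
        votes.update(set(spans))
    return sorted(span for span, n in votes.items() if n >= k)


# Hypothetical highlights from five workers on one abstract
workers = [
    [(10, 16), (40, 48)],
    [(10, 16), (40, 48)],
    [(10, 16)],
    [(10, 16), (60, 70)],
    [(40, 48)],
]

for k in range(1, 5):
    print(k, aggregate_spans(workers, k))
```

Raising K keeps fewer spans, so precision tends to rise while recall tends to fall.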
Comparison to gold standard
[Figure: precision and recall against the gold standard; F score = 0.81.]
[Figure: results for N = 3, 6, 9, 12, 15, 18 at vote thresholds k = 1 to 8; Max F = 0.69, 0.79, 0.82, 0.85, 0.85, 0.85.]
F = 0.76 – score of single Ph.D. annotator
F = 0.87 – agreement between multiple Ph.D. annotators
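For reference, an F score of this kind combines precision and recall. The sketch below shows how it could be computed for one abstract, assuming the balanced F1 measure and exact-match comparison of spans; the slides do not state the matching criterion used in the study. The name `f_score` and the sample spans are hypothetical.

```python
def f_score(predicted, gold):
    """Return (precision, recall, F1) for two collections of spans.

    Assumption: a predicted span is correct only if it matches a gold
    span exactly.
    """
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold-standard and crowd-aggregated spans for one abstract
gold = [(10, 16), (40, 48), (80, 92)]
predicted = [(10, 16), (40, 48)]

p, r, f = f_score(predicted, gold)
print(round(p, 2), round(r, 2), round(f, 2))
```

Here both predicted spans are correct but one gold span is missed, so precision is perfect and recall is two thirds.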
Crowd-based biocuration
• 7 days
• 17 workers
• $192.90
Professional biocuration
• Many months
• 12 experts
• $150,000+
In aggregate, our worker ensemble is faster, cheaper and as accurate as a single expert annotator for disease concept recognition.
Information extraction for a Network of BioThings
1. Find mentions of high level concepts in text
2. Map mentions to specific terms in ontologies
3. Identify relationships between concepts
[Diagram: Genes/proteins, Diseases, Drugs, Pathways]
Vision-based Citizen Science
• Galaxy Zoo (galaxy classification; 110M+ classifications, 300k+ volunteers)
• Foldit (protein folding; 350k+ players)
• Eterna (RNA folding; 80k players)
• Eyewire (3D neuron structure determination; 130k volunteers)
• Phylo (multiple sequence alignment; 30k+ players, 285k alignments)
• …
Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820, DA036134)
The Su Lab
Chunlei Wu
Ben Good
Salvatore Loguercio
Max Nanis
Louis Gioia
Ramya Gamini
Greg Stupp
Ginger Tsueng
Erick Scott
Vyshakh Babji
Karthik Gangavarapu
Adam Mark
Key Alumni
Katie Fisch
Tobias Meissner
Key Collaborators
Andra Waagmeester
Lynn Schriml
Peter Robinson
Contact
http://sulab.org
@andrewsu
+Andrew Su
We are recruiting programmers, postdocs, and awesome people of all kinds!
bit.ly/SuLabJobs
We are hosting a hackathon Nov 7-9 for the Network of BioThings
bit.ly/hackNoB