Accomplishments and Challenges in Literature Data Mining for Biology
L. Hirschman et al.
Presented by Jing Jiang
CS491CXZ Spring, 2004
Outline
Accomplishments
– Natural Language Processing Perspective
– Biomedical Applications
Challenges
– Organizing A Challenge Evaluation
– Sample Challenge Problems:
  Extraction of Biological Pathways
  Automated Database Curation and Ontology Development
Early Work: to Identify Protein Names
Fukuda et al. (1998)
Challenges encountered:
– Long compound names
– Different names for the same protein
– Common English words as protein names
Solutions proposed:
– Uppercase letters (Src homology 2 domains)
– Numerals (p54 SAP kinase)
– Special endings (EGF receptor)
Recent Work: to Recognize Interactions between Proteins and Other Molecules
Statistical Approach
– Stapley & Benoit (2000): co-occurrences of gene names to predict connections
– Ding et al. (2002): co-occurrences when the unit is an abstract, a sentence, or a phrase
NLP Approach
– Ng & Wong (1999): templates with linguistic structures to recognize interactions
– Others: extended Ng & Wong’s work
– All based on grammars
NLP in Biological Applications
To capture specific relations in databases
– To learn ontological relations
– To extract biological pathways
To improve retrieval and clustering in searching large collections
– Homology search using sequence similarity
– Clustering MEDLINE abstracts
For classification
Problem I
Researchers | Precision/Specificity | Recall/Sensitivity | Data Set | Extracted Results
Yakushiji et al. (2001) | 60 – 80% | / | MEDLINE abstracts | argument structures
Friedman et al. (2001) | 96% | 63% | 8000-word article from Cell | broad set of biological relations
Pustejovsky & Castaño (2002) | 90% | 57% | MEDLINE | the “inhibit” relations
How to compare different approaches?
Problem II
How well does a system have to perform to be useful?
– What does 90% specificity at 57% sensitivity mean to the user?
– Need user-centered evaluations.
Challenge Evaluation
[Diagram of the components of a challenge evaluation: Identification of Challenge Problem, Task Definition, Training Data, Test Data, Evaluation Methodology, Evaluation, Building System, Participants, Evaluator, Funding]
Sample Challenge Problem I: Extraction of Biological Pathways
What are biological pathways?
A network of interactions and events between proteins, drugs, and other molecules.
E.g. the Glycolytic Pathway
Challenge Problem
Three layers of challenges:
– To recognize names of proteins, drugs, and other molecules
– To recognize basic interaction events between molecules
– To recognize the relationships between the basic interaction events
Task Definition
db = {(t1, F1), (t2, F2), …, (tm, Fm)}: set of records
ti: texts (sentences, abstracts, or whole articles)
Fi = {fi,1, fi,2, …, fi,ni}: set of expected facts (short sentences in highly standardized forms, e.g. “P1 activate P2”)
Evaluation Methodology
recall(E) = TP(E)/[TP(E) + FN(E)]
precision(E) = TP(E)/[TP(E) + FP(E)]
E: information extractor
TP: true positive
FN: false negative
FP: false positive
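The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function names and the example counts are hypothetical, and returning 0.0 for an empty denominator is an assumption, not something the slides specify.

```python
def recall(tp: int, fn: int) -> float:
    """recall(E) = TP(E) / [TP(E) + FN(E)]; 0.0 if the denominator is zero (assumed)."""
    total = tp + fn
    return tp / total if total else 0.0


def precision(tp: int, fp: int) -> float:
    """precision(E) = TP(E) / [TP(E) + FP(E)]; 0.0 if the denominator is zero (assumed)."""
    total = tp + fp
    return tp / total if total else 0.0


# Hypothetical counts for an extractor E, chosen only for illustration.
tp, fn, fp = 57, 43, 6
print(f"recall    = {recall(tp, fn):.2f}")
print(f"precision = {precision(tp, fp):.2f}")
```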
Evaluation Methodology
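At the record level:
$TP(E) = \sum_{(t,F) \in db} |E(t) \cap F|$
$FN(E) = \left(\sum_{(t,F) \in db} |F|\right) - TP(E)$
$FP(E) = \left(\sum_{(t,F) \in db} |E(t)|\right) - TP(E)$

At the database level:
$TP(E) = \left|\bigcup_{(t,F) \in db} \left(E(t) \cap F\right)\right|$
$FN(E) = \left|\bigcup_{(t,F) \in db} F\right| - TP(E)$
$FP(E) = \left|\bigcup_{(t,F) \in db} E(t)\right| - TP(E)$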
Question: which one is the more effective measure?
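To make the question concrete, here is a small sketch of one reading of the two counting schemes. It assumes that record-level counting sums the overlap between extracted and expected facts record by record, while database-level counting first pools all facts into sets, so a fact repeated in several records is counted once. The `Record` type, the function names, and the toy records are all hypothetical.

```python
from typing import List, Set, Tuple

# Each record pairs the facts extracted from a text, E(t), with the expected facts F.
Record = Tuple[Set[str], Set[str]]


def record_level_counts(records: List[Record]) -> Tuple[int, int, int]:
    """Sum the counts record by record (assumed reading of 'record level')."""
    tp = sum(len(extracted & expected) for extracted, expected in records)
    fn = sum(len(expected) for _, expected in records) - tp
    fp = sum(len(extracted) for extracted, _ in records) - tp
    return tp, fn, fp


def database_level_counts(records: List[Record]) -> Tuple[int, int, int]:
    """Pool facts over the whole database before counting (assumed reading)."""
    tp = len(set().union(*(extracted & expected for extracted, expected in records)))
    fn = len(set().union(*(expected for _, expected in records))) - tp
    fp = len(set().union(*(extracted for extracted, _ in records))) - tp
    return tp, fn, fp


# Toy database: the fact "P1 activate P2" is stated in two different texts.
records = [
    ({"P1 activate P2", "P3 inhibit P4"}, {"P1 activate P2"}),
    ({"P1 activate P2"}, {"P1 activate P2", "P5 activate P6"}),
]
print(record_level_counts(records))    # (2, 1, 1)
print(database_level_counts(records))  # (1, 1, 1)
```

In this toy case the record-level count rewards the extractor twice for the same fact, whereas the database-level count rewards it once.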
Test Data
Appendix of Kohn (1999)
– 200 statements of interaction events
– Sentences of a fairly complex form
MEDLINE abstracts on “Topoisomerase inhibitors”
– 150 – 200 new abstracts each year
– Less than 1000 names and less than 200 interaction events each year
Sample Challenge Problem II: Automated Database Curation and Ontology Development
Importance:
– Proteins are referred to by names
The nomenclature problem for proteins:
– A newly discovered protein may be named based on its functions, sequence features, gene name, cellular location, molecular weight, etc.
NLP technologies in information extraction, classification, and ontology induction can be applied here
An Example
3 fields from the entry for Appl+P130kD in FlyBase:
(1) Protein size (kD): 130 (Luo et al., 1990)
(2) Cell location: axon (Luo et al., 1990)
(3) Expression pattern (Luo et al., 1990):
Stage | Tissue/Position
Embryo | Embryonic Central Nervous System
Embryo | Peripheral Nervous System
The abstract of Luo et al. (1990):
(1) APPL … is converted to a 130-kDa secreted form …
(2) APPL … was observed in … axonal tracts, …
(3) In the embryo, APPL proteins are expressed exclusively in the CNS and PNS neurons …
Knowledge Discovery and Data Mining Challenge Cup 2002
Participants are given
– A collection of journal articles
– Each labeled with genes mentioned in the article
Participants are required to answer
– Does the article contain any experimental results about gene expression that should be put in the database?
– If so, for each gene in the article, is there experimental evidence for any transcripts (RNA), protein, or polypeptide products of that gene?
Evaluation of Ontologies
Challenging:
– No established metric for measuring knowledge in terms of content or value
Two levels:
– Intrinsic: compare terms and ontological relations discovered by the system against those found by humans
– Extrinsic: evaluate ontology’s usefulness in manual query expansion
Summary
Contributions of this paper:
– Summarized the work done so far in the field of literature data mining for biology
– Identified the important ingredients for a successful evaluation
– Gave concrete evaluation examples
Identifying Protein Names from Biological Papers (Fukuda et al.)
Capital letters, numerical figures, and special symbols (core-terms)
– Src homology (SH) 2 and SH3 domains
– p54 SAP kinase
Key-words (feature-terms)
– EGF receptor
– Ras GTPase-activating protein (GAP)
IE system:
– Core-term extraction from tokenized texts
– Concatenation of core-terms and f-terms
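The two steps of the IE system can be illustrated with a rough sketch. This is not Fukuda et al.'s actual algorithm: the tokenizer, the `FEATURE_TERMS` list, and the rule that a candidate name must contain at least one core-term are all simplifying assumptions made for illustration.

```python
import re
from typing import List

# Illustrative feature-terms only; the real keyword list is not given in the slides.
FEATURE_TERMS = {"receptor", "protein", "kinase", "domain", "domains"}


def is_core_term(token: str) -> bool:
    """Treat a token as a core-term if it contains an uppercase letter or a digit."""
    return any(ch.isupper() or ch.isdigit() for ch in token)


def candidate_names(sentence: str) -> List[str]:
    """Concatenate adjacent core-terms and feature-terms into candidate names."""
    tokens = re.findall(r"[A-Za-z0-9\-]+", sentence)
    names: List[str] = []
    current: List[str] = []
    for token in tokens + [""]:  # the empty sentinel flushes the final run
        if token and (is_core_term(token) or token.lower() in FEATURE_TERMS):
            current.append(token)
            continue
        # Keep a run only if it contains at least one core-term (assumed rule).
        if any(is_core_term(t) for t in current):
            names.append(" ".join(current))
        current = []
    return names


print(candidate_names("the EGF receptor binds p54 SAP kinase"))
# ['EGF receptor', 'p54 SAP kinase']
```

A sketch this simple also flags capitalized sentence-initial words and acronyms that are not proteins, which is one reason a real system needs further filtering.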
Toward Routine Automatic Pathway Discovery from On-line Scientific Text Abstracts (Ng & Wong)
Key function words:
– Inhibitor: {inhibit, suppress, negatively regulate}
– Activator: {activate, transactivate, induce, upregulate, positively regulate}
Pattern matching rules:
– <A> … <fn> … <B>
– <A> … <fn> of … <B>
– <A> … <fn> by … <B>
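The first rule, <A> … <fn> … <B>, can be sketched with a regular expression. This is an illustrative approximation, not Ng & Wong's implementation: it assumes protein names are already known, allows simple inflections by accepting trailing word characters after a function word, reports the two names in textual order without deciding which is the agent, and does not treat the "of" and "by" variants separately. All identifiers are hypothetical.

```python
import re
from typing import List, Tuple

INHIBITOR = ["inhibit", "suppress", "negatively regulate"]
# "upregulate" is assumed here as the intended activator word.
ACTIVATOR = ["activate", "transactivate", "induce", "upregulate", "positively regulate"]


def _alternation(words: List[str]) -> str:
    """Build a regex alternation, longest alternatives first."""
    return "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))


def find_interactions(sentence: str, names: List[str]) -> List[Tuple[str, str, str]]:
    """Return (A, kind, B) triples matching the pattern <A> ... <fn> ... <B>."""
    if not names:
        return []
    name_alt = _alternation(names)
    results = []
    for kind, words in (("inhibitor", INHIBITOR), ("activator", ACTIVATOR)):
        pattern = re.compile(
            rf"\b(?P<a>{name_alt})\b.*?\b(?:{_alternation(words)})\w*\b.*?\b(?P<b>{name_alt})\b"
        )
        match = pattern.search(sentence)
        if match:
            results.append((match.group("a"), kind, match.group("b")))
    return results


print(find_interactions("Raf activates MEK in vitro", ["Raf", "MEK"]))
# [('Raf', 'activator', 'MEK')]
```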
Evaluation Methodology
Simple Matching Coefficient (SMC)
– SMC(E) = TP(E)/[TP(E) + FN(E) + FP(E)]
Satisfies two conditions:
– To distinguish the ideal information extractor from the worst one
– To show a gradual monotonic change in value when the information extractor is changed from the worst to the best
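A short sketch shows how the coefficient behaves at the two extremes. The function name and the counts are hypothetical, and returning 0.0 when all three counts are zero is an assumption.

```python
def smc(tp: int, fn: int, fp: int) -> float:
    """SMC(E) = TP(E) / [TP(E) + FN(E) + FP(E)]; 0.0 if all counts are zero (assumed)."""
    total = tp + fn + fp
    return tp / total if total else 0.0


# Ideal extractor: every expected fact found, nothing spurious.
print(smc(tp=10, fn=0, fp=0))  # 1.0
# Worst extractor: no expected fact found.
print(smc(tp=0, fn=5, fp=5))   # 0.0
# An extractor in between.
print(smc(tp=5, fn=3, fp=2))   # 0.5
```

Because a missed fact and a spurious fact both enlarge the denominator, a single number moves from 0 toward 1 as the extractor improves on either kind of error.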
Three Tasks
To recognize names: obvious
To recognize interaction events: grammar
PosEvent ::= P phosphorylate P [on T] [at L]
           | P dephosphorylate P [on T] [at L] …
Event ::= PosEvent [mediated-by P+] [independent-of P+] …
To recognize relationships: grammar
Relationship ::= Event [is-caused-by Event+] [provided Event+] …