Accomplishments and Challenges in Literature Data Mining for Biology L. Hirschman et al. Presented by Jing Jiang CS491CXZ Spring, 2004


Outline

Accomplishments
– Natural Language Processing Perspective
– Biomedical Applications

Challenges
– Organizing a Challenge Evaluation
– Sample Challenge Problems:
  Extraction of Biological Pathways
  Automated Database Curation and Ontology Development

Early Work: to Identify Protein Names

Fukuda et al. (1998)

Challenges encountered:
– Long compound names
– Different names for the same protein
– Common English words as protein names

Solutions proposed:
– Uppercase letters (Src homology 2 domains)
– Numerals (p54 SAP kinase)
– Special endings (EGF receptor)

Recent Work: to Recognize Interactions between Proteins and Other Molecules

Statistical Approach
– Stapley & Benoit (2000): co-occurrences of gene names to predict connections
– Ding et al. (2002): co-occurrences when the unit is an abstract, a sentence, or a phrase

NLP Approach
– Ng & Wong (1999): templates with linguistic structures to recognize interactions
– Others: extended Ng & Wong’s work
– All based on grammars
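The statistical approach above can be illustrated with a minimal sketch that counts how often pairs of gene names co-occur, using either the sentence or the whole abstract as the unit. This is not the method of Stapley & Benoit or Ding et al.; the function name, the naive substring matching, and the placeholder gene names are all assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_counts(abstracts, names, unit="sentence"):
    """Count co-occurring name pairs, per sentence or per whole abstract.

    Assumption: a name "occurs" in a unit if it appears as a substring.
    Real systems need proper name recognition and normalization.
    """
    counts = Counter()
    for abstract in abstracts:
        if unit == "sentence":
            # Naive sentence splitter: break after ., ! or ? followed by whitespace.
            units = re.split(r"(?<=[.!?])\s+", abstract)
        else:
            units = [abstract]
        for text in units:
            present = sorted(name for name in names if name in text)
            counts.update(combinations(present, 2))
    return counts


# Placeholder names and toy abstracts, invented for this example.
abstracts = [
    "geneA binds geneB. geneC is discussed separately.",
    "geneB and geneA colocalize.",
]
names = ["geneA", "geneB", "geneC"]

by_sentence = cooccurrence_counts(abstracts, names, unit="sentence")
by_abstract = cooccurrence_counts(abstracts, names, unit="abstract")
```

With the abstract as the unit, geneA and geneC count as co-occurring; with the sentence as the unit they do not, which is the kind of difference the choice of unit introduces.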

NLP in Biological Applications

To capture specific relations in databases
– To learn ontological relations
– To extract biological pathways

To improve retrieval and clustering in searching large collections
– Homology search using sequence similarity
– Clustering MEDLINE abstracts

For classification

Problem I

| Researchers | Precision (Specificity) / Recall (Sensitivity) | Data Set | Extracted Results |
|---|---|---|---|
| Yakushiji et al. (2001) | 60 – 80% / | MEDLINE abstracts | argument structures |
| Friedman et al. (2001) | 96% / 63% | 8000-word article from Cell | broad set of biological relations |
| Pustejovsky & Castaño (2002) | 90% / 57% | MEDLINE | the “inhibit” relations |

How to compare different approaches?

Problem II

How well does a system have to perform to be useful?
– What does 90% specificity at 57% sensitivity mean to the user?
– Need user-centered evaluations.

Challenge Evaluation

[Diagram: components of a challenge evaluation: Identification of Challenge Problem, Task Definition, Training Data, Test Data, Evaluation Methodology, Building System, Evaluation, Participants, Evaluator, Funding]

Sample Challenge Problem I: Extraction of Biological Pathways

What are biological pathways?

A network of interactions and events between proteins, drugs, and other molecules.

E.g. the Glycolytic Pathway

Challenge Problem

Three layers of challenges:
– To recognize names of proteins, drugs, and other molecules
– To recognize basic interaction events between molecules
– To recognize the relationships between the basic interaction events

Task Definition

$db = \{(t_1, F_1), (t_2, F_2), \ldots, (t_m, F_m)\}$: set of records

$t_i$: texts (sentences, abstracts, or whole articles)

$F_i = \{f_{i,1}, f_{i,2}, \ldots, f_{i,n_i}\}$: set of expected facts (short sentences in highly standardized forms, e.g. “P1 activate P2”)

Evaluation Methodology

recall(E) = TP(E)/[TP(E) + FN(E)]

precision(E) = TP(E)/[TP(E) + FP(E)]

E: information extractor

TP: true positive

FN: false negative

FP: false positive


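At the record level:

$$
\begin{aligned}
TP(E) &= \sum_{(t,F) \in db} |E(t) \cap F| \\
FN(E) &= \Big(\sum_{(t,F) \in db} |F|\Big) - TP(E) \\
FP(E) &= \Big(\sum_{(t,F) \in db} |E(t)|\Big) - TP(E)
\end{aligned}
$$

At the database level:

$$
\begin{aligned}
TP(E) &= \Big|\bigcup_{(t,F) \in db} \big(E(t) \cap F\big)\Big| \\
FN(E) &= \Big|\bigcup_{(t,F) \in db} F\Big| - TP(E) \\
FP(E) &= \Big|\bigcup_{(t,F) \in db} E(t)\Big| - TP(E)
\end{aligned}
$$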

Question: which one is the more effective measure?
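To make the question concrete, here is a minimal sketch that computes TP, FN, and FP both ways and then applies the recall and precision formulas from this slide. It assumes that record-level counts are summed record by record, while database-level counts pool distinct facts over the whole database. The toy extractor, its outputs, and all function names are invented for illustration.

```python
def record_level_counts(db, extract):
    """Sum matches record by record; a fact repeated across records counts each time."""
    tp = sum(len(extract(t) & facts) for t, facts in db)
    fn = sum(len(facts) for _, facts in db) - tp
    fp = sum(len(extract(t)) for t, _ in db) - tp
    return tp, fn, fp


def database_level_counts(db, extract):
    """Pool facts over the whole database; each distinct fact counts once."""
    correct = set().union(*(extract(t) & facts for t, facts in db))
    expected = set().union(*(facts for _, facts in db))
    extracted = set().union(*(extract(t) for t, _ in db))
    tp = len(correct)
    return tp, len(expected) - tp, len(extracted) - tp


def recall(tp, fn, fp):
    return tp / (tp + fn)


def precision(tp, fn, fp):
    return tp / (tp + fp)


# Toy database of (text, expected facts) records.
db = [
    ("text one", {"P1 activate P2"}),
    ("text two", {"P1 activate P2", "P2 inhibit P3"}),
]

# Hypothetical extractor output, hard-coded per text for the example.
toy_output = {
    "text one": {"P1 activate P2"},
    "text two": {"P1 activate P2", "P3 activate P4"},
}


def toy_extract(text):
    return toy_output[text]


record_counts = record_level_counts(db, toy_extract)
database_counts = database_level_counts(db, toy_extract)
```

In this toy case the same correct fact is found in two records, so it is rewarded twice at the record level (precision 2/3) but only once at the database level (precision 1/2).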

Test Data

Appendix of Kohn (1999)
– 200 statements of interaction events
– Sentences of a fairly complex form

MEDLINE abstracts on “Topoisomerase inhibitors”
– 150 – 200 new abstracts each year
– Fewer than 1000 names and fewer than 200 interaction events each year

Sample Challenge Problem II: Automated Database Curation and Ontology Development

Importance:
– Proteins are referred to by names

The nomenclature problem for proteins:
– A newly discovered protein may be named based on its functions, sequence features, gene name, cellular location, molecular weight, etc.

NLP technologies in information extraction, classification, and ontology induction can be applied here

An Example

3 fields from the entry for Appl+P130kD in FlyBase:
(1) Protein size (kD): Luo et al, 1990: 130
(2) Cell location: Luo et al, 1990: axon
(3) Expression pattern: Luo et al, 1990:
    Stage: Embryo; Tissue/Position: Embryonic Central Nervous System
    Stage: Embryo; Tissue/Position: Peripheral Nervous System

The abstract of Luo et al. (1990)
(1) APPL … is converted to a 130-kDa secreted form …
(2) APPL … was observed in … axonal tracts, …
(3) In the embryo, APPL proteins are expressed exclusively in the CNS and PNS neurons …

Knowledge Discovery and Data Mining Challenge Cup 2002

Participants are given
– A collection of journal articles
– Each labeled with genes mentioned in the article

Participants are required to answer
– Does the article contain any experimental results about gene expression that should be put in the database?
– If so, for each gene in the article, is there experimental evidence for any transcripts (RNA), protein, or polypeptide products of that gene?

Protein Knowledge Base

Evaluation of Ontologies

Challenging:
– No established metric for measuring knowledge in terms of content or value

Two levels:
– Intrinsic: compare terms and ontological relations discovered by the system against those found by humans
– Extrinsic: evaluate ontology’s usefulness in manual query expansion

Summary

Contributions of this paper:
– Summarized the work done so far in the field of literature data mining for biology
– Identified the important ingredients for a successful evaluation
– Gave concrete evaluation examples

End of the Talk

Identifying Protein Names from Biological Papers (Fukuda et al.)

Capital letters, numerical figures, and special symbols (core-terms)
– Src homology (SH) 2 and SH3 domains
– p54 SAP kinase

Key-words (feature-terms)
– EGF receptor
– Ras GTPase-activating protein (GAP)

IE system:
– Core-term extraction from tokenized texts
– Concatenation of core-terms and feature-terms (f-terms)
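The two steps of the IE system can be sketched roughly as follows. This is a simplified illustration, not Fukuda et al.'s actual rules: the feature-term list, the tokenizer, and the core-term test (a digit, or an uppercase letter after the first character) are assumptions chosen to fit the slide's examples.

```python
import re

# Hypothetical feature-term list; the slide only shows "receptor" and "protein" in use.
FEATURE_TERMS = {"receptor", "protein", "kinase", "domain", "domains"}


def is_core_term(token):
    """Core-term test (assumed): contains a digit, or an uppercase letter
    after the first character. Ignoring the first character avoids flagging
    ordinary sentence-initial words, at the cost of missing names like "Src"."""
    return any(c.isdigit() for c in token) or any(c.isupper() for c in token[1:])


def candidate_names(text):
    """Extract core-terms from tokenized text, then concatenate each run of
    adjacent core-terms with any feature-terms that follow it."""
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)
    names, current = [], []
    for token in tokens:
        starts_or_extends = is_core_term(token)
        extends_only = bool(current) and token.lower() in FEATURE_TERMS
        if starts_or_extends or extends_only:
            current.append(token)
        elif current:
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


found = candidate_names("The EGF receptor binds p54 SAP kinase in vitro.")
```

Here `found` holds "EGF receptor" and "p54 SAP kinase". The sketch also shows why the slide lists common English words and long compound names as difficulties: neither is caught by surface clues alone.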

Toward Routine Automatic Pathway Discovery from On-line Scientific Text Abstracts (Ng & Wong)

Key function words:
– Inhibitor: {inhibit, suppress, negatively regulate}
– Activator: {activate, transactivate, induce, upregulate, positively regulate}

Pattern matching rules:
– <A> … <fn> … <B>
– <A> … <fn> of … <B>
– <A> … <fn> by … <B>
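The pattern matching rules above can be approximated with regular expressions. The sketch below is an illustration under stated assumptions, not Ng & Wong's implementation: the word stems used to cover inflected forms, the function name, and the placeholder names geneA and geneB are invented, and the names A and B are assumed to be already recognized.

```python
import re

# Assumed stems so that inflected forms ("inhibits", "activated") also match.
STEMS = {
    "inhibitor": ["inhibit", "suppress", "negatively regulat"],
    "activator": ["activat", "transactivat", "induc", "upregulat", "positively regulat"],
}


def match_interaction(sentence, a, b):
    """Return (role, pattern) if `a ... <fn> [of|by] ... b` occurs, else None."""
    for role, stems in STEMS.items():
        fn = "(?:" + "|".join(stems) + r")\w*"
        regex = rf"{re.escape(a)}\b.*?\b{fn}(?:\s+(of|by))?\b.*?{re.escape(b)}"
        m = re.search(regex, sentence, flags=re.IGNORECASE)
        if m:
            prep = m.group(1)
            if prep:
                return role, f"<A> ... <fn> {prep.lower()} ... <B>"
            return role, "<A> ... <fn> ... <B>"
    return None


plain = match_interaction("geneA inhibits geneB", "geneA", "geneB")
passive = match_interaction("geneA is activated by geneB", "geneA", "geneB")
nominal = match_interaction("geneA induction of geneB", "geneA", "geneB")
nothing = match_interaction("geneA binds geneB", "geneA", "geneB")
```

The sketch only reports which rule fired. Deciding which molecule acts on which, for example in the "by" pattern, is left out because the slide does not specify it.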


Simple Matching Coefficient (SMC)
– SMC(E) = TP(E)/[TP(E) + FN(E) + FP(E)]

Satisfies two conditions:
– To distinguish the ideal information extractor from the worst one
– To show a gradual monotonic change in value when the information extractor is changed from the worst to the best
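A short numeric check of the two conditions, using the SMC formula from this slide. The sequence of extractors below is invented: each one finds one more of 10 expected facts and makes one fewer spurious extraction than the previous one. Raising an error when all three counts are zero is a choice made for this sketch, since the formula is undefined there.

```python
def smc(tp, fn, fp):
    """SMC(E) = TP(E) / [TP(E) + FN(E) + FP(E)], as defined on the slide."""
    total = tp + fn + fp
    if total == 0:
        # Assumption: the slide does not say what to do in this case.
        raise ValueError("SMC is undefined when TP + FN + FP = 0")
    return tp / total


ideal = smc(tp=10, fn=0, fp=0)   # every expected fact found, nothing spurious
worst = smc(tp=0, fn=10, fp=7)   # nothing correct

# Hypothetical extractors improving step by step from the worst to the best.
values = [smc(tp=k, fn=10 - k, fp=10 - k) for k in range(11)]
```

`ideal` is 1.0 and `worst` is 0.0, and `values` rises steadily from 0.0 to 1.0, which is the behavior the two conditions ask for.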

Three Tasks

To recognize names: obvious

To recognize interaction events: grammar

PosEvent ::= P phosphorylate P [on T] [at L]
           | P dephosphorylate P [on T] [at L] …

Event ::= PosEvent [mediated-by P+] [independent-of P+] …

To recognize relationships: grammar

Relationship ::= Event [is-caused-by Event+] [provided Event+] …
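As a rough illustration of how the event grammar could be checked mechanically, the sketch below parses whitespace-separated statements that follow the PosEvent and Event rules shown above. It covers only the alternatives written out on the slide (the elided "…" parts are not filled in), treats P, T, and L as opaque tokens, and uses class and function names invented for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

VERBS = {"phosphorylate", "dephosphorylate"}
# (keyword, attribute name, accepts one or more arguments), in grammar order.
CLAUSES = [
    ("on", "on", False),
    ("at", "at", False),
    ("mediated-by", "mediated_by", True),
    ("independent-of", "independent_of", True),
]
KEYWORDS = {keyword for keyword, _, _ in CLAUSES}


@dataclass
class Event:
    agent: str
    verb: str
    target: str
    on: Optional[str] = None
    at: Optional[str] = None
    mediated_by: List[str] = field(default_factory=list)
    independent_of: List[str] = field(default_factory=list)


def parse_event(statement):
    """Parse `P verb P [on T] [at L] [mediated-by P+] [independent-of P+]`."""
    tokens = statement.split()
    if len(tokens) < 3 or tokens[1] not in VERBS:
        raise ValueError(f"not a PosEvent: {statement!r}")
    event = Event(agent=tokens[0], verb=tokens[1], target=tokens[2])
    pos = 3
    for keyword, attr, many in CLAUSES:
        if pos < len(tokens) and tokens[pos] == keyword:
            end = pos + 1
            while end < len(tokens) and tokens[end] not in KEYWORDS:
                end += 1
            args = tokens[pos + 1:end]
            if not args or (not many and len(args) != 1):
                raise ValueError(f"bad arguments for {keyword!r}")
            setattr(event, attr, args if many else args[0])
            pos = end
    if pos != len(tokens):
        raise ValueError(f"unexpected tokens: {tokens[pos:]}")
    return event


parsed = parse_event("P1 phosphorylate P2 on T1 at L1 mediated-by P3 P4")
```

The Relationship rule could be handled the same way, one level up, with whole events as the arguments of is-caused-by and provided.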
