27
BioContrasts: Extracting and Exploiting Protein-protein Co ntrastive Relations from Biom edical Literature Jung-jae Kim 1 , Zhuo Zhang 2 , Jong C. Park 1 and See-Kiong Ng 2 1 Computer Science Division & AlTrc, K orea Advanced Institute of Science a nd Technology 2 Knowledge Discovery Department, Inst itute for Infocomm Research, Singapo re (Bioinformatics, Vol. 22, 5, 2006, p. 597-6

BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

Embed Size (px)

Citation preview

Page 1: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature

Jung-jae Kim1, Zhuo Zhang2, Jong C. Park1 and See-Kiong Ng2

1Computer Science Division & AlTrc, Korea Advanced Institute of Science and Technology

2Knowledge Discovery Department, Institute for Infocomm Research, Singapore

(Bioinformatics, Vol. 22, 5, 2006, p. 597-605)

Page 2: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

2/27

Abstract Contrastive information between proteins can reveal

what similarities, divergences and relations there are of the two proteins.

BioContrasts extracts protein–protein contrastive information from MEDLINE abstracts and presents the information to biologists in a web-application for exploitation.

A total of 799,169 pairs of contrastive expressions were extracted from 2.5 million MEDLINE abstracts.

Using grounding of contrastive protein names to Swiss-Prot entries, they produce 41,471 pieces of contrasts between Swiss-Prot protein entries.

Page 3: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

3/27

1. Introduction Biologists have expended much efforts to co

mpile various protein and interaction databases (e.g., Swiss-Prot, BIND, DIP and KEGG) from experimental data and the published literature. But they are typically positive facts such as “protein A binds to B”.

BioContrasts identifies contrastive expressions in the MEDLINE abstracts that contain typical contrastive negation patterns such as “A but not B”.

They also link proteins to Swiss-Prot entries.

Page 4: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

4/27

2. Background (1/3) “not” often encodes contrasts betwe

en biomedical objects. (1) NAT1 binds eIF4A but not eIF4E and inhibits bo

th cap-dependent and cap-independent translation.

(2) Truncated N-terminal mutant huntingtin repressed transcription, whereas the corresponding wild-type fragment did not repress transcription.

Knowledge representation a. NAT1 binds to X b. X=eIF4A, XeIF4E.

Page 5: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

5/27

2. Background (2/3) Each piece of contrastive information i

s made up of two parts: (i) A contrastive pair of two or more object

s that are so contrasted (e.g. {eIF4A, eIF4E}, {wild-type huntingtin, mutant huntingtin}), called focused objects.

(ii) A biological property or process that the contrast is based on (e.g. binding to NAT1, transcription repression), called presupposed property.

Page 6: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

6/27

2. Background (3/3) For contrasts to be meaningful, there

should also exist a certain level of implicit similarity between the contrasted objects in addition to the explicit differences being communicated.

For example, the focused objects of the contrast encoded in (1) are all eukaryotic translation initiation factors.

A contrast is highly informative that not only explicitly represents a difference between its focused objects but also implicitly indicates that the focused objects are semantically similar.

Page 7: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

7/27

3. Methodology First, contrastive expressions are identified from the

text based on a two-step approach: (i) Matching a subclausal coordination pattern (e.g. ‘A but

not B’) to an input sentence. (ii) identifying parallelism between a negated clause and an

affirmative clause in the sentence or adjacent sentences. Then, identifying protein–protein contrasts from the

contrastive information to create a useful database of contrastive relationships between Swiss-Prot protein entries.

Last, the extracted information are presented via a user-friendly interactive web-portal for exploitation.

Page 8: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

8/27

Page 9: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

9/27

3.1 Identifying subclausal coordinations (1/6)

Given a sentence that contains a ‘not’, the system first tries to identify contrastive expressions using subclausal coordination patterns as defined in Table 1.

Page 10: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

10/27

3.1 Identifying subclausal coordinations (2/6) Authors developed (a) a novel POS tagger th

at assigns to each word its most frequent POS tag by looking up a manually curated POS dictionary together with some domain specific correction rules and (b) a novel noun phrase recognizer that looks for noun phrases that begin or end at predetermined words or POS, since the NP variables in the patterns are adjacent to a predetermined word (e.g. ‘but’, ‘not’) or a POS (e.g. PREP, V).

Page 11: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

11/27

3.1 Identifying subclausal coordinations (3/6) The system then analyzes semantic

similarity between the coordinated expressions by analyzing word-level similarity between the expressions.

E.g. ‘not V NP but V NP’: In contrast, IFN-gamma priming did not

affect the expression of p105 transcripts but enhanced the expression of p65 mRNA (2-fold).

Page 12: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

12/27

3.1 Identifying subclausal coordinations (4/6)

Word-level similarity: checking whether the variable-matching phrases are semantically identical or at least in a subsumption relation. Verb and adjective: Utilize the synonymy and hyper

nymy relations in WordNet and also in a hand-made resource (constructed by examining 166 ‘not’-containing abstracts, e.g. Table 2).

Page 13: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

13/27

3.1 Identifying subclausal coordinations (5/6)

Noun phrase: Analyze the similarity between their headwords.

(a) Whether the two nouns denote the same kind of biological objects (e.g. ‘eIF4A’ and ‘eIF4E’ in example (1) are protein names) by looking up well-known biomedical databases such as Swiss-Prot.

(b) Whether they are semantically identical [e.g. ‘transcription’ in example (2)] by utilizing WordNet and their resource for semantic similarity.

E.g. “affect” is a hypernym of “enhance”, and the two NPs have the same headword “expression”.

Page 14: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

14/27

3.1 Identifying subclausal coordinations (6/6)

Next, the system determines the presupposed property for the focused proteins by extracting the subject phrase and the verb whose object phrases correspond to the focused proteins (i.e. ‘NAT1 binds CONTRAST_OBJ’, where CONTRAST_OBJ indicates the variable for focused objects.).

Page 15: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

15/27

3.2 Identifying clause-level parallelisms (1/3) If none of the subclausal coordination patte

rns can be matched to the input sentences in the previous step, the system then attempts to identify any parallelism expressed with the sentences.

Parallelism for a contrast refers to a pair of a negated clause and an affirmative clause that are either semantically identical or at least in a subsumption relation .

Page 16: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

16/27

Page 17: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

17/27

3.2 Identifying clause-level parallelisms (3/3) (2) Truncated N-terminal mutant huntingtin repres

sed transcription, whereas the corresponding wild-type fragment did not repress transcription.

First locate verbs “repress” in the subordinate clause which is negated by “not” and the main clause.

Next identify subject phrases and object phrases in the two clauses.

Then check their semantic identity.

Page 18: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

18/27

3.3 Extracting contrastive protein names The two preceding steps produce con

trastive expressions that match the variables for focused objects.

Now, the system checks whether the contrastive expressions can be resolved in Swiss-Prot protein entries. Here, the system extract individual protein names as well coordinations between protein names.

Page 19: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

19/27

3.4 Grounding protein names (1/2)

Manually constructed pattern dealing with variations of protein names concerning special characters, indices, plurals, discardable words and acronyms.

Page 20: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

20/27

3.4 Grounding protein names (2/2)

For ambiguous protein names Match the full names. Discard protein names that are easily con

fused with other terms such as normal English words of lower case (e.g. mass, stellate) and two character words with one digit (e.g. S1, T1).

Page 21: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

21/27

4. Implementation (1/2) A corpus of 2.5 million ‘not’-containing

MEDLINE abstracts to extract protein–protein contrasts reported in the literature.

Extract 799,169 pairs of contrastive expressions from this corpus.

11,284 pairs of contrastive protein names. 41,471 contrasts between Swiss-Prot (S–P)

entries—the larger number in the resulting protein–protein contrasts owes to the fact that a protein name may be grounded with multiple Swiss-Prot entries.

Page 22: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

22/27

4. Implementation (2/2)

Page 23: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

23/27

5. Evaluation (1/2) The performance in extracting contrastive p

roteins: Examine a set of 100 pairs of contrastive protein

s randomly selected from the BioContrasts database.

Correctly extract 97 pairs (97.0% precision) [cf. the performance of the earlier system: 85.7% precision and 61.5% recall).

In terms of the system’s pattern usage, the main contrastive pattern ‘A but not B’ was used to extract 91 pairs (all correct), while the parallelism patterns were used 5 times (2/5 correct or 40% precision) in this evaluation.

Page 24: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

24/27

5. Evaluation (2/2) The performance in grounding the

contrastive proteins with Swiss-Prot entries: Out of the 200 focused protein names in the test

protein contrasts, 182 names were expressed with adequate clarifying descriptions in their respective abstracts for grounding.

Among these 182 protein names, the system correctly grounded 164 (90.1% precision) and linked 6 additional protein names with partially correct entries (3.3% partial precision).

Page 25: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

25/27

6. Applications (1/2) Refine pathway roles of similar

proteins These information can be used for

distinguishing between the finer biological roles of seemingly similar proteins.

Page 26: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

26/27

6. Applications (2/2) Identify implicitly similar proteins

Such implicit similarity information can be useful in understanding the interactome (e.g. functional annotation of proteins).

E.g. Out of the 5407 protein contrasts that involved functionally annotated proteins, there are 4948 contrasts (91.7%) between proteins that shared at least one GO code at level 2. This suggested that there are implicit functional similarities between the focused proteins being contrasted.

Page 27: BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and

27/27

7. Conclusions BioContrasts system captures useful contrastive and n

onpositive biological relationships between proteins by extracting contrastive expressions in the MEDLINE abstracts containing the contrastive negation patterns.

Future work: Extract contrastive information expressed in linguistic structu

res other than ‘A but not B’. Other ways for exploiting the extracted contrasts for biologica

l knowledge discovery. E.g., it may be possible that the presupposed properties (biological activities) of the same pairs of focused proteins are also related to each other.