BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical...

Preview:

Citation preview

BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature

Jung-jae Kim1, Zhuo Zhang2, Jong C. Park1 and See-Kiong Ng2

1Computer Science Division & AlTrc, Korea Advanced Institute of Science and Technology

2Knowledge Discovery Department, Institute for Infocomm Research, Singapore

(Bioinformatics, Vol. 22, 5, 2006, p. 597-605)

2/27

Abstract Contrastive information between proteins can reveal

what similarities, divergences and relations there are of the two proteins.

BioContrasts extracts protein–protein contrastive information from MEDLINE abstracts and presents the information to biologists in a web-application for exploitation.

A total of 799,169 pairs of contrastive expressions were extracted from 2.5 million MEDLINE abstracts.

Using grounding of contrastive protein names to Swiss-Prot entries, they produce 41,471 pieces of contrasts between Swiss-Prot protein entries.

3/27

1. Introduction Biologists have expended much efforts to co

mpile various protein and interaction databases (e.g., Swiss-Prot, BIND, DIP and KEGG) from experimental data and the published literature. But they are typically positive facts such as “protein A binds to B”.

BioContrasts identifies contrastive expressions in the MEDLINE abstracts that contain typical contrastive negation patterns such as “A but not B”.

They also link proteins to Swiss-Prot entries.

4/27

2. Background (1/3) “not” often encodes contrasts betwe

en biomedical objects. (1) NAT1 binds eIF4A but not eIF4E and inhibits bo

th cap-dependent and cap-independent translation.

(2) Truncated N-terminal mutant huntingtin repressed transcription, whereas the corresponding wild-type fragment did not repress transcription.

Knowledge representation a. NAT1 binds to X b. X=eIF4A, XeIF4E.

5/27

2. Background (2/3) Each piece of contrastive information i

s made up of two parts: (i) A contrastive pair of two or more object

s that are so contrasted (e.g. {eIF4A, eIF4E}, {wild-type huntingtin, mutant huntingtin}), called focused objects.

(ii) A biological property or process that the contrast is based on (e.g. binding to NAT1, transcription repression), called presupposed property.

6/27

2. Background (3/3) For contrasts to be meaningful, there

should also exist a certain level of implicit similarity between the contrasted objects in addition to the explicit differences being communicated.

For example, the focused objects of the contrast encoded in (1) are all eukaryotic translation initiation factors.

A contrast is highly informative that not only explicitly represents a difference between its focused objects but also implicitly indicates that the focused objects are semantically similar.

7/27

3. Methodology First, contrastive expressions are identified from the

text based on a two-step approach: (i) Matching a subclausal coordination pattern (e.g. ‘A but

not B’) to an input sentence. (ii) identifying parallelism between a negated clause and an

affirmative clause in the sentence or adjacent sentences. Then, identifying protein–protein contrasts from the

contrastive information to create a useful database of contrastive relationships between Swiss-Prot protein entries.

Last, the extracted information are presented via a user-friendly interactive web-portal for exploitation.

8/27

9/27

3.1 Identifying subclausal coordinations (1/6)

Given a sentence that contains a ‘not’, the system first tries to identify contrastive expressions using subclausal coordination patterns as defined in Table 1.

10/27

3.1 Identifying subclausal coordinations (2/6) Authors developed (a) a novel POS tagger th

at assigns to each word its most frequent POS tag by looking up a manually curated POS dictionary together with some domain specific correction rules and (b) a novel noun phrase recognizer that looks for noun phrases that begin or end at predetermined words or POS, since the NP variables in the patterns are adjacent to a predetermined word (e.g. ‘but’, ‘not’) or a POS (e.g. PREP, V).

11/27

3.1 Identifying subclausal coordinations (3/6) The system then analyzes semantic

similarity between the coordinated expressions by analyzing word-level similarity between the expressions.

E.g. ‘not V NP but V NP’: In contrast, IFN-gamma priming did not

affect the expression of p105 transcripts but enhanced the expression of p65 mRNA (2-fold).

12/27

3.1 Identifying subclausal coordinations (4/6)

Word-level similarity: checking whether the variable-matching phrases are semantically identical or at least in a subsumption relation. Verb and adjective: Utilize the synonymy and hyper

nymy relations in WordNet and also in a hand-made resource (constructed by examining 166 ‘not’-containing abstracts, e.g. Table 2).

13/27

3.1 Identifying subclausal coordinations (5/6)

Noun phrase: Analyze the similarity between their headwords.

(a) Whether the two nouns denote the same kind of biological objects (e.g. ‘eIF4A’ and ‘eIF4E’ in example (1) are protein names) by looking up well-known biomedical databases such as Swiss-Prot.

(b) Whether they are semantically identical [e.g. ‘transcription’ in example (2)] by utilizing WordNet and their resource for semantic similarity.

E.g. “affect” is a hypernym of “enhance”, and the two NPs have the same headword “expression”.

14/27

3.1 Identifying subclausal coordinations (6/6)

Next, the system determines the presupposed property for the focused proteins by extracting the subject phrase and the verb whose object phrases correspond to the focused proteins (i.e. ‘NAT1 binds CONTRAST_OBJ’, where CONTRAST_OBJ indicates the variable for focused objects.).

15/27

3.2 Identifying clause-level parallelisms (1/3) If none of the subclausal coordination patte

rns can be matched to the input sentences in the previous step, the system then attempts to identify any parallelism expressed with the sentences.

Parallelism for a contrast refers to a pair of a negated clause and an affirmative clause that are either semantically identical or at least in a subsumption relation .

16/27

17/27

3.2 Identifying clause-level parallelisms (3/3) (2) Truncated N-terminal mutant huntingtin repres

sed transcription, whereas the corresponding wild-type fragment did not repress transcription.

First locate verbs “repress” in the subordinate clause which is negated by “not” and the main clause.

Next identify subject phrases and object phrases in the two clauses.

Then check their semantic identity.

18/27

3.3 Extracting contrastive protein names The two preceding steps produce con

trastive expressions that match the variables for focused objects.

Now, the system checks whether the contrastive expressions can be resolved in Swiss-Prot protein entries. Here, the system extract individual protein names as well coordinations between protein names.

19/27

3.4 Grounding protein names (1/2)

Manually constructed pattern dealing with variations of protein names concerning special characters, indices, plurals, discardable words and acronyms.

20/27

3.4 Grounding protein names (2/2)

For ambiguous protein names Match the full names. Discard protein names that are easily con

fused with other terms such as normal English words of lower case (e.g. mass, stellate) and two character words with one digit (e.g. S1, T1).

21/27

4. Implementation (1/2) A corpus of 2.5 million ‘not’-containing

MEDLINE abstracts to extract protein–protein contrasts reported in the literature.

Extract 799,169 pairs of contrastive expressions from this corpus.

11,284 pairs of contrastive protein names. 41,471 contrasts between Swiss-Prot (S–P)

entries—the larger number in the resulting protein–protein contrasts owes to the fact that a protein name may be grounded with multiple Swiss-Prot entries.

22/27

4. Implementation (2/2)

23/27

5. Evaluation (1/2) The performance in extracting contrastive p

roteins: Examine a set of 100 pairs of contrastive protein

s randomly selected from the BioContrasts database.

Correctly extract 97 pairs (97.0% precision) [cf. the performance of the earlier system: 85.7% precision and 61.5% recall).

In terms of the system’s pattern usage, the main contrastive pattern ‘A but not B’ was used to extract 91 pairs (all correct), while the parallelism patterns were used 5 times (2/5 correct or 40% precision) in this evaluation.

24/27

5. Evaluation (2/2) The performance in grounding the

contrastive proteins with Swiss-Prot entries: Out of the 200 focused protein names in the test

protein contrasts, 182 names were expressed with adequate clarifying descriptions in their respective abstracts for grounding.

Among these 182 protein names, the system correctly grounded 164 (90.1% precision) and linked 6 additional protein names with partially correct entries (3.3% partial precision).

25/27

6. Applications (1/2) Refine pathway roles of similar

proteins These information can be used for

distinguishing between the finer biological roles of seemingly similar proteins.

26/27

6. Applications (2/2) Identify implicitly similar proteins

Such implicit similarity information can be useful in understanding the interactome (e.g. functional annotation of proteins).

E.g. Out of the 5407 protein contrasts that involved functionally annotated proteins, there are 4948 contrasts (91.7%) between proteins that shared at least one GO code at level 2. This suggested that there are implicit functional similarities between the focused proteins being contrasted.

27/27

7. Conclusions BioContrasts system captures useful contrastive and n

onpositive biological relationships between proteins by extracting contrastive expressions in the MEDLINE abstracts containing the contrastive negation patterns.

Future work: Extract contrastive information expressed in linguistic structu

res other than ‘A but not B’. Other ways for exploiting the extracted contrasts for biologica

l knowledge discovery. E.g., it may be possible that the presupposed properties (biological activities) of the same pairs of focused proteins are also related to each other.

Recommended