
An IR-Aided Machine Learning Framework for the BioCreative II.5 Challenge

Yonggang Cao, Zuofeng Li, Feifan Liu, Shashank Agarwal, Qing Zhang, and Hong Yu

Abstract—The team at the University of Wisconsin-Milwaukee developed an information retrieval and machine learning framework. Our framework requires only the standardized training data and depends upon minimal external knowledge resources and minimal parsing. Within the framework, we built our text mining systems and participated for the first time in all three BioCreative II.5 Challenge tasks. The results show that our systems performed among the top five teams for raw F1 scores in all three tasks and came in third place for the homonym ortholog F1 scores for the INT task. The results demonstrated that our IR-based framework is efficient, robust, and potentially scalable.

Index Terms—Bioinformatics (genome or protein) databases, information search and retrieval, systems and software, text mining.


1 INTRODUCTION

It is well recognized that automatically extracting knowledge from a tremendously increasing amount of biomedical literature is an important task. Over the past decade, there has been significant research development in the field of biomedical text mining, including document classification [1], named entity recognition [2], [3], and relation identification [4].

One of the text mining tasks for relation identification is protein-protein interaction (PPI), which is widely considered to be critical for biological knowledge mining [5], [6]. Toward this end, many different approaches (e.g., [4], [7], [8], [9], [10], [11], [12]) have been developed, and evaluation is an important component of the task. The focus of the Critical Assessment of Information Extraction systems in Biology (BioCreAtIvE) challenge is the development and evaluation of biological information extraction systems on a standardized data set.

In previous challenges, BioCreative I [13] mainly focused on identifying gene and protein names (gene/protein mention), normalizing them, and associating them with their functions. BioCreative II [14] added the task of extracting protein-protein interactions from full texts.

The BioCreative II.5 competition consisted of three tasks: an article categorization task (ACT), an interactor normalization task (INT), and an interaction pair task (IPT). The ACT required a binary classification of whether an article contained a protein-protein interaction. The INT required extracting a ranked list of interacting proteins from each article and mapping them to their primary UniProt IDs. The IPT required recognizing all pairs of interacting proteins.

BioCreative II.5 posed two main challenges. First, it required the identification of species information, whereas previous tasks focused on predefined organism databases, viz., fly, mouse, and yeast. Second, it required distinguishing interacting proteins from noninteracting ones.

We participated in all three BioCreative II.5 tasks. We built our text mining systems using an information-retrieval-driven machine learning framework. We completed our system development within a short time, and our system yielded competitive performance among the participating teams with respect to F1 scores. The evaluation results demonstrate that our framework is efficient, robust, and potentially scalable.

2 RELATED WORK

There has been over a decade of research on biomedical text mining, spanning named entity recognition, relation extraction, information retrieval and document classification, summarization, question answering, and work that goes beyond text to include images.

Different approaches have been developed for extracting protein-protein interactions from the biological literature [8], [15]. Existing methods can be roughly divided into four categories: co-occurrence, rule-based, natural language processing (NLP), and machine learning. Co-occurrence approaches [7], [16], [17] define a protein-protein interaction as text in which both proteins are mentioned together in the same sentence, with the assumption that textual proximity implies biological association [8]. Rule-based approaches [9], [18], [19], [20] apply prespecified rules and pattern matching. NLP approaches [10], [21], [22], [23] explore syntactic parsing. Finally, supervised machine learning approaches include decision trees [11], maximum entropy [12], naïve Bayes [24], and support vector machines [4], [25].

Each of these methods has its own weakness. Co-occurrence methods miss newly discovered PPI events; rule-based approaches usually demand laborious human effort to compile patterns and rules; NLP-enhanced systems depend on parsing performance, which is a challenging task in and of itself due to domain variances; and supervised learning methods depend upon annotated data, which are frequently not available.

Approaches have been explored in other areas related to PPI, including named entity recognition using conditional random field (CRF) models [26], [27], name tokenization, and context-based disambiguation [14], [28], [29]. Various knowledge-based, linguistic, heuristic, and statistical machine learning approaches, as well as hybrid systems, have been developed. Here, we explore a new framework that is entirely IR-driven and machine learning based.

3 METHODS

Fig. 1 shows an overview of our systems for the BioCreative II.5 tasks. In the following, we first describe the training data and then describe each system component.

Fig. 1. An overview of the UWM BioCreative II.5 system.

3.1 The BioCreative II.5 Training Data

Our systems were trained on the 619 annotated articles released by the BioCreative II.5 organizers. Of this collection of articles, 61 incorporate PPI events. We refer to these 61 articles as IPTpositive and the remaining 558 articles as IPTnegative. The IPTpositive articles comprise a total of 248 proteins and 233 protein-protein interaction pairs. Each protein in an interaction pair is assigned its UniProtKB ID.

For data preprocessing, we split sentences with a regular-expression-based sentence boundary detector that we built. We replaced all Greek characters with their English names (for example, “α” was replaced with “alpha”), and all Roman numerals were replaced with their Arabic equivalents (for example, “VII” was replaced with “7”).
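The following minimal Python sketch illustrates this kind of preprocessing; the lookup tables and regular expressions are illustrative stand-ins rather than the exact rules used in the system.

```python
import re

# Illustrative lookup tables; the full mappings used by the system are not listed here.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta", "κ": "kappa"}
ROMAN = {"I": "1", "II": "2", "III": "3", "IV": "4", "V": "5", "VI": "6", "VII": "7"}

def normalize(text):
    """Replace Greek letters with English words and whole-token Roman numerals
    with Arabic digits (crude: a standalone pronoun "I" would also be replaced)."""
    for greek, word in GREEK.items():
        text = text.replace(greek, word)
    return re.sub(r"\b(VII|VI|IV|V|III|II|I)\b",
                  lambda m: ROMAN[m.group(1)], text)

def split_sentences(text):
    """Simple regular-expression sentence boundary detector: split after
    ., ! or ? when followed by whitespace and an uppercase letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

print(normalize("Factor VII binds the α subunit."))
# -> "Factor 7 binds the alpha subunit."
```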

3.2 The Article Categorization Task

We explored supervised machine learning approaches for identifying whether an article incorporates protein-protein interaction. We trained on the 61 IPTpositive and 558 IPTnegative articles. We explored different supervised machine learning algorithms, including multinomial naïve Bayes, logistic regression, naïve Bayes, decision tree, AdaBoost, and support vector machines. The features we explored include words and n-grams. We also experimented with feature selection using both information gain and chi-square analysis. We found that the best performing system was a multinomial naïve Bayes classifier with feature selection based on information gain.
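The toolkit used for these classifiers is not named. As one possible realization, the sketch below builds an equivalent pipeline in scikit-learn, using mutual information (mutual_info_classif) as a stand-in for information-gain feature selection; the feature count of 1,000 is an illustrative default.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def build_act_classifier(k_features=1000):
    """Multinomial naive Bayes over word n-gram counts with
    information-gain-style (mutual information) feature selection."""
    return Pipeline([
        ("vect", CountVectorizer(ngram_range=(1, 2), lowercase=True)),
        ("select", SelectKBest(score_func=mutual_info_classif, k=k_features)),
        ("clf", MultinomialNB()),
    ])

# Usage (train_docs: article texts; train_labels: 1 = contains PPI, 0 = does not):
# clf = build_act_classifier()
# clf.fit(train_docs, train_labels)
# predictions = clf.predict(test_docs)
```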

3.3 The Interactor Normalization Task

Our INT system was built upon the IR and machine learning framework.

3.3.1 Building Resources

We built resources from protein names and their profiles. To build a UniProt ID profile, we first processed the UniProtKB knowledge resource (version 14.8) and assigned each UniProt ID its corresponding protein name, source, and species, which led to a profile resource, LISTprotein (http://www.askhermes.org/biocreative/). As part of the process for building the UniProt ID profile, we generated a list of protein names from LISTprotein. Because protein names appear as different variants, we enriched LISTprotein with two additional resources by ID mapping: Entrez Gene (http://jura.wi.mit.edu/entrez_gene/) and the BioLexicon [30].

To overcome the problem of term variation [31], [32], we indexed LISTprotein using the open-source tool Lucene (http://lucene.apache.org/) and employed a domain-specific multifield IR framework, through which we can obtain candidate UniProt IDs given a protein name query. In particular, we divided a term into parts wherever there is a change among upper case, lower case, and digit, with the exception of the first character; for example, Hrd1p was tokenized as “Hrd,” “1,” and “p.” We created a field for the head term to unite different terms that may have the same biological meaning, for example, “Hrd1p” and “Hrd2p.” We also ignored hyphens and spaces within names; for example, we treated “TSC-22,” “TSC22,” and “TSC 22” as equal. Each protein name is associated with its corresponding UniProt ID(s).
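A minimal Python sketch of the tokenization and normalization rules just described; the exact analyzer configured in Lucene is not shown, so this is only an approximation of its behavior.

```python
import re

def split_term(term):
    """Split a term wherever the character class changes among upper case,
    lower case, and digit, excepting the first character:
    "Hrd1p" -> ["Hrd", "1", "p"]."""
    if not term:
        return []
    parts, current = [], term[0]
    for i in range(1, len(term)):
        prev, ch = term[i - 1], term[i]
        change = (prev.isdigit() != ch.isdigit()) or \
                 (prev.isalpha() and ch.isalpha() and prev.isupper() != ch.isupper())
        if change and i > 1:            # i == 1 is the first-character exception
            parts.append(current)
            current = ch
        else:
            current += ch
    parts.append(current)
    return parts

def normalize_name(term):
    """Drop hyphens and spaces (and case-fold) so that "TSC-22",
    "TSC22", and "TSC 22" compare as equal."""
    return re.sub(r"[-\s]", "", term).lower()

print(split_term("Hrd1p"))                                    # ['Hrd', '1', 'p']
print(normalize_name("TSC-22") == normalize_name("TSC 22"))   # True
```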

3.3.2 IR-Based NER and Similarity-Based ID Mapping

Named entity recognition (NER) has proven to be one of the most important biomedical text mining tasks. Biomedical NER is challenging, mainly due to the problems of term abundance, irregularity, and variation. We developed an IR framework for NER. Such a framework enabled us to develop a high-recall system with a minimum of training data.

Given a document, our system attempted to match every term that appeared in LISTprotein. We then used IR to return a list of UniProt IDs. Such an approach maximized recall but resulted in a large number of candidate IDs. Thus, ranking became a crucial task for ensuring precision.

We explored different methods for ID ranking and selected a ranking procedure built on a linear combination of the following three similarity metrics:

1. We compute the longest common substring (LCS) ratio [33] between a term in the document and its candidate protein name. It is computed as the length of the longest common substring between the two strings divided by the sum of their lengths. LCS has many advantages, including prefix or suffix variation detection. For example, with LCS, our system can match the following two varied forms of the same protein: “Thi3p” and “Thi3,” and this match ranks higher than the match between “Thi3p” and “Thip,” which are two different proteins.


2. We compute the Jaro-Winkler similarity [34], which is a character-based similarity metric for measuring the overlap of characters between a term and its candidate protein name. The metric ranks the match between “hnRNPk” and “hnRNPk” higher than the match between “hnRNPk” and “HNRNPk.”

3. We compute the cosine similarity between a term and its candidate protein names. For example, “TSC 22” is ranked higher than “TS C22” for matching “TSC-22.”
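The sketch below implements the three metrics in plain Python and combines them with equal weights; the actual weights of the linear combination are not reported, and the character n-gram representation used for the cosine metric here is an assumption.

```python
from collections import Counter
from math import sqrt

def lcs_ratio(a, b):
    """Longest common substring length divided by the summed string lengths,
    computed by dynamic programming over character pairs."""
    best, prev = 0, [0] * (len(b) + 1)
    for ca in a:
        row = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            if ca == cb:
                row[j] = prev[j - 1] + 1
                best = max(best, row[j])
        prev = row
    return best / (len(a) + len(b)) if a and b else 0.0

def jaro_winkler(a, b, p=0.1):
    """Standard Jaro-Winkler similarity with the usual prefix scale p = 0.1."""
    if a == b:
        return 1.0
    window = max(len(a), len(b)) // 2 - 1
    matched_b = [False] * len(b)
    matches_a = []
    for i, ca in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not matched_b[j] and b[j] == ca:
                matched_b[j] = True
                matches_a.append(ca)
                break
    matches_b = [b[j] for j in range(len(b)) if matched_b[j]]
    m = len(matches_a)
    if m == 0:
        return 0.0
    transpositions = sum(x != y for x, y in zip(matches_a, matches_b)) / 2
    jaro = (m / len(a) + m / len(b) + (m - transpositions) / m) / 3
    prefix = 0
    for ca, cb in zip(a, b):
        if ca != cb or prefix == 4:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)

def cosine_char_ngrams(a, b, n=3):
    """Cosine similarity over character n-gram count vectors."""
    grams = lambda s: Counter(s[i:i + n] for i in range(max(1, len(s) - n + 1)))
    va, vb = grams(a.lower()), grams(b.lower())
    dot = sum(va[g] * vb[g] for g in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def combined_score(term, candidate):
    # Equal weights here; the weights actually used are not reported.
    return (lcs_ratio(term, candidate) +
            jaro_winkler(term, candidate) +
            cosine_char_ngrams(term, candidate)) / 3
```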

3.3.3 Knowledge-Based Reranking

Our IR-based NER, described above, was entirely term-based and incorporated minimal knowledge. In an attempt to improve precision, we reranked the ID candidates with domain-specific knowledge. As described in Section 3.3.1, our LISTprotein aggregated protein names from three different knowledge resources. However, the knowledge resources have different levels of authority. Specifically, Swiss-Prot is expert-annotated data, and therefore, we assigned Swiss-Prot proteins higher weight than proteins extracted from the other two resources. We also assigned the ID candidates multifield weights, in which a mapped ID's weight was increased if the corresponding protein name appeared in the keyword, title, figure, or abstract sections. In addition, we associated each protein name with its prior probability, which we calculated from the three databases.

Each protein is associated with a corresponding species, and thus it is very important to distinguish between different species during ID mapping. First, we built a lexicon of species names (SpeciesDict) by aggregating the Latin scientific names, English common names, and synonyms for each source species of proteins in UniProtKB. With this lexicon, a cosine similarity score was computed between all the alternative species names of the corresponding ID candidate and the current article in which the protein name was mentioned. Second, we incorporated species popularity information from the MINT database [35] by counting the number of molecular interactions for each species name, which could reasonably filter out entries from rare species.
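A rough sketch of how this evidence could be folded into one reranking score follows. The weight values and the multiplicative combination are purely illustrative assumptions; the actual formula is not reported.

```python
# Illustrative weights; the values actually used are not reported.
SOURCE_WEIGHT = {"swissprot": 1.0, "entrez_gene": 0.6, "biolexicon": 0.6}
FIELD_WEIGHT = {"title": 2.0, "keyword": 2.0, "figure": 1.5, "abstract": 1.5, "body": 1.0}

def rerank_score(base_similarity, source, fields_seen, prior, species_sim, species_popularity):
    """Combine term-level similarity with knowledge-based evidence: source
    authority, the best field the name appears in, the name's prior probability
    over the databases, species similarity to the article, and normalized
    species popularity in MINT."""
    field_boost = max((FIELD_WEIGHT.get(f, 1.0) for f in fields_seen), default=1.0)
    return (base_similarity
            * SOURCE_WEIGHT.get(source, 0.5)
            * field_boost
            * (0.5 + prior)
            * (0.5 + species_sim)
            * (0.5 + species_popularity))
```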

3.3.4 Context-Based Disambiguation

Homonyms are abundant in the biomedical literature [36], [37]. To identify the sense of a term and to associate it with a unique UniProt ID, we adapted a context-based word sense disambiguation approach in our system. Specifically, we matched the current article's context against the candidate ID's rich context information in UniProtKB using the cosine similarity between them.
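A minimal sketch of this kind of context matching with TF-IDF vectors and cosine similarity (scikit-learn is assumed here for brevity; each candidate context would be the UniProtKB description, comments, and keywords concatenated into one string).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def disambiguate(article_text, candidates):
    """Rank candidate UniProt IDs by cosine similarity between the article
    and each candidate's UniProtKB context.

    candidates: list of (uniprot_id, context_text) pairs."""
    ids, contexts = zip(*candidates)
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform([article_text, *contexts])
    scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
    return sorted(zip(ids, scores), key=lambda x: x[1], reverse=True)
```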

3.3.5 Reranking UniProt IDs by Detecting Interacting Events

Through the steps above, we selected the top five candidate UniProt IDs as the final candidates and reranked them by the interacting events described below.

BioCreative II.5 required each team to identify only proteins that are interacting and to ignore those that are noninteracting. We developed the following approaches for identifying proteins that participate in PPI events.

SVM-based interacting sentence detection. We first built an SVM-based sentence classifier for identifying whether a sentence incorporates a PPI event. The training sentences were extracted from the 61 IPTpositive articles.

From the IPTpositive articles, we identified a total of 8,087 sentences that incorporate at least one UniProt ID. However, not every sentence incorporating a UniProt ID represents a PPI event. To obtain a collection of PPI sentences, one domain expert (Li) performed manual annotation. To minimize the annotation effort, Li manually identified a maximum of 10 sentences that incorporated a PPI event from each document. The final selection of 600 sentences (a few documents do not have 10 sentences describing protein-protein interaction) was used as the positive training data. We then randomly selected 600 sentences from the remaining 7,487 sentences to use as the negative training data.

We explored words and n-grams as features. To overcome the problem of data sparseness, we applied a semantic backoff smoothing model that replaces every biological named entity instance with the name of its entity type (e.g., “protein,” “gene,” and “tissue”). The named entities were recognized with an NER tagger that we trained on the GENIA corpus [38] using ABNER [39].
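A small sketch of the backoff step, assuming the tagger returns token-level entity spans; the example sentence and entity spans are hypothetical.

```python
def semantic_backoff(tokens, entities):
    """Replace recognized named-entity spans with their entity type so that,
    e.g., 'MDM2 binds p53' and 'BRCA1 binds RAD51' back off to the same
    pattern 'PROTEIN binds PROTEIN'.

    entities: list of (start_index, end_index_exclusive, entity_type) spans
    over tokens, e.g. as produced by an ABNER-style tagger."""
    out, i = [], 0
    spans = {start: (end, etype) for start, end, etype in entities}
    while i < len(tokens):
        if i in spans:
            end, etype = spans[i]
            out.append(etype.upper())   # e.g. PROTEIN, GENE, TISSUE
            i = end
        else:
            out.append(tokens[i])
            i += 1
    return out

print(semantic_backoff(["MDM2", "binds", "p53"], [(0, 1, "protein"), (2, 3, "protein")]))
# ['PROTEIN', 'binds', 'PROTEIN']
```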

We performed 10-fold cross-validation, and our results showed a performance of 79.5 percent (F score) when using the top 1,000 features. Our final classifier was trained on the entire set of 1,200 sentences, and the binary prediction result was used to further rerank the UniProt IDs.

CRF-based interacting protein detection. In addition to the sentence-level classification discussed above, we built a CRF-based classifier to identify whether a candidate protein name is an interacting protein. For this task, we trained the CRF models on all 8,087 sentences (described above). Similar to the SVM classifier, we explored words and n-grams (capturing word dependencies) as features.

3.3.6 Heuristic Filtering

Finally, we employed a set of heuristic rules to identify whether a protein name is part of a protein-protein interacting event. One heuristic was to rank an ID higher if the ID was unique to a protein name. We also incorporated term frequency, because we speculated that a frequently mentioned protein is more likely to be the focus of an interaction-containing article, and therefore an interacting protein, than other proteins.

In addition to using term frequency to indicate the importance of a protein name, we speculated that term variations could also be useful. If a protein name is mentioned in the same article with multiple variations, the protein is more likely than proteins with fewer variations to be the focus of the article and, as a result, an interacting protein. For example, we found that one article (10.1016_j.febslet.2008.08.031.noSDA.utf8) has a protein with a large number of variations: “hnRNP K,” “heterogeneous nuclear ribonucleoprotein K,” “hnRNPs,” and “hnRNP-K,” and indeed the protein corresponds to UniProt ID “P61978.”
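A minimal sketch of the variation-counting heuristic; the mention list and the second accession ("EXAMPLE_ID") are hypothetical, while P61978 is the hnRNP K example from the text.

```python
from collections import defaultdict

def variation_counts(mentions):
    """Count how many distinct surface variants map to each candidate UniProt ID
    in an article; IDs whose names appear under more variants are ranked higher.

    mentions: list of (surface_form, uniprot_id) pairs found in the article."""
    variants = defaultdict(set)
    for surface, uid in mentions:
        variants[uid].add(surface.lower())
    return {uid: len(forms) for uid, forms in variants.items()}

mentions = [("hnRNP K", "P61978"), ("heterogeneous nuclear ribonucleoprotein K", "P61978"),
            ("hnRNP-K", "P61978"), ("hnRNPs", "P61978"), ("TSC22", "EXAMPLE_ID")]
print(variation_counts(mentions))   # {'P61978': 4, 'EXAMPLE_ID': 1}
```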

3.4 The Interaction Pair Task

The IPT task was built upon the INT task. As stated before, we built supervised machine learning classifiers to test whether a sentence incorporates an interacting protein; in this task, we developed a binary classifier to judge whether two proteins appearing in a sentence interact with each other. Note that this task in the BioCreative challenge does not distinguish interactions reported by the authors of an article from those referenced from other articles.

We obtained the training data automatically by selecting every sentence in the 61 IPTpositive articles that incorporated at least two of the interacting protein names. We replaced the identified protein names in the articles with their corresponding UniProt IDs and built an SVM classifier. The features we explored included words and the word-level distance between two protein names. We hypothesized that two proteins were more likely to be interacting if they shared the same distribution pattern in the document, and, as such, we added two additional features: the difference in term frequency within the article between the two proteins and the pointwise mutual information (MI) of the two proteins. We calculated MI using each sentence as a co-occurrence window.
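A sketch of the MI and frequency features, using each sentence as the co-occurrence window as described; the UniProt accessions in the usage example are hypothetical placeholders.

```python
from collections import Counter
from itertools import combinations
from math import log2

def pmi_features(sentences):
    """Pointwise mutual information between protein IDs, with each sentence
    as the co-occurrence window, plus per-article term frequencies.

    sentences: list of lists of UniProt IDs mentioned in each sentence."""
    n = len(sentences)
    single, pair = Counter(), Counter()
    for ids in sentences:
        present = set(ids)
        single.update(present)
        pair.update(frozenset(p) for p in combinations(sorted(present), 2))
    def pmi(a, b):
        joint = pair[frozenset((a, b))] / n
        if joint == 0:
            return float("-inf")
        return log2(joint / ((single[a] / n) * (single[b] / n)))
    return single, pmi

# Example: two hypothetical IDs co-occur in two of three sentences.
freq, pmi = pmi_features([["P12345", "P67890"], ["P12345", "P67890", "P11111"], ["P11111"]])
print(round(pmi("P12345", "P67890"), 3), abs(freq["P12345"] - freq["P67890"]))
# 0.585 0
```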

4 RESULTS

4.1 Evaluation Metrics

There are several metrics used in the BioCreative II.5 challenge. In this paper, we mainly focus on the F1 score (F1 = 2 × precision × recall / (precision + recall)), which is widely used in information extraction and retrieval tasks. Precision is the number of correctly identified ACT articles, INT protein IDs, or IPT interaction pairs divided by the total number identified by the system, and recall is the number of correctly identified items divided by the total number in the gold standard annotations. The “raw” scores refer to the original, nonmapped, and nonfiltered scores for the INT and IPT tasks. We report the official raw scores in the following sections unless specified otherwise.

4.2 Performance on the ACT Task

We submitted six runs for this task using the multinomial naïve Bayes classifier. Our 10-fold cross-validation results showed that our multinomial naïve Bayes classifiers performed better than other classifiers, including logistic regression and SVM. Optimal feature selection was performed based on information gain before applying our model to the standard test set.

Based on the official test results in the BioCreative II.5 challenge, we found that a unigram model (using individual words as features) integrating interaction-type information provided by the organizers achieved better performance (F1 score of 0.571) than a trigram model (F1 score of 0.333) that considers the dependency of the current word on the preceding two words when deriving features. However, the latter achieved a very high precision of 0.867, which showed that a trigram model tends to overfit during training and thus has a poorer ability to generalize on blind test data, resulting in low recall.

Fig. 2 shows the performance of our system's best run compared with the best systems from other teams. We can see that our system obtained a balanced precision and recall that resulted in an F1 score of 0.571, which placed third among the eight participating teams. Compared to the best two systems, our system performed slightly lower in terms of recall (0.540 versus 0.587 and 0.698). Note that our system used a lightweight classifier based on only the standardized training data, while the team with the best performance (t9) utilized information from the citation network of the relevant literature as well as additional data.

Fig. 2. Performance of our system (t31_UWM) among different teams on the ACT task.

4.3 Performance of Our System on the INT Task

As described earlier, we proposed IR-driven INT systems to address different types of variance in identifying proteins and mapping them onto UniProtKB IDs. Prior to the development of our IR-driven INT systems, we compared different strategies to ensure good recall during the development stage. The first one we tried was a carefully designed regular expression matcher with some cue patterns, but it attained a performance of only approximately 49.19 percent, identifying only about 300 IDs for each article due to the myriad variations in protein names, as discussed in Section 3.3.1. We also tried a fuzzy match approach using Google's fuzzy matcher, which matched up to 200,000 IDs for each article; however, we found it too slow to be applied in our system. The IR-driven INT systems we developed proved to be more effective than both of the approaches described above. The first thousand hits from the IR system had a coverage of over 95 percent, which, despite its trade-off with precision, laid a foundation for subsequent ranking and filtering to improve precision.

Fig. 3 shows the performance curve of our system on the training data, after several ranking and filtering strategies, as the number of output IDs for each article increases. We can see that we achieved 89.4 percent recall using the top 40 ranked candidate UniProtKB IDs for each article (93 percent when we used the top 500 results), and we obtained our best precision (37 percent) when using only the top ID for each article. For the official test, we chose the top 10 IDs as output in the hope of identifying more candidates for IPT.

Fig. 3. Performance on development data with top N selection.

Previous work has shown that figure legends are important features for biomedical text mining (e.g., [40], [41]). We therefore explored different feature sets: 1) figure legend and keywords, 2) figure legend, keywords, and the abstract, and 3) the full text. We found that the full text achieved a slightly better performance (microaveraged F score of 0.206) than the other two (both with a microaveraged F1 score of 0.202), and this difference is most likely due to the problem of data sparseness. Since the training data are already sparse, limiting features to the figure legend only, rather than the full-text article, would naturally decrease the performance. In addition, we conducted post hoc experiments to evaluate the three similarity metrics described in Section 3.3.2; microaveraged results are shown in Table 1. We can see that the combined strategy we used for the submission did not perform as well as expected, and LCS by itself achieved the best performance, with an F-score of 0.215 (precision of 0.153 and recall of 0.36). This suggests that LCS and Jaro-Winkler address the language variance in protein names better than cosine similarity does.

TABLE 1. Performance Comparison Among Different Similarity Metrics.

Fig. 4 shows the best microaverage scores (macroaverage scores show similar trends) of our system on the INT task in BioCreative II.5 compared with the best systems of other teams. Our system (t31_UWM) is one of the top five systems among 10 teams, with an F1 score of 0.224, which is 23 percent above the mean performance of 0.182.

Fig. 4. Performance of our system (t31_UWM) among different teams on the INT task.

Although it did not perform as well as some of the other systems in the competition, our system has some advantages that point to its potential. First, our system is computationally much cheaper than other systems and uses readily available features, since it does not depend on syntactic parsing, which is computationally expensive and requires more training effort for supervised systems. We observed that some systems (such as t37 and t14) applied syntactic parsing; however, the results show that such parsing does not always yield reliable performance. Second, our system is built only on the standardized training data and depends on minimal external knowledge, while the systems of many other teams (e.g., t42) were trained with additional corpora and/or external knowledge resources. Third, our system incorporates minimal human effort, while other systems required much human effort to compile specific patterns (e.g., t42).

When homonym ortholog matching and organism filtering were considered in the evaluation, our best system attained a microaveraged F1 score of 0.426, reaching third position among the 10 teams. The results suggest that our system performs even better at identifying interacting protein IDs, although there is much room for improvement on disambiguation among species.

4.4 Identification Performance for the IPT Task

Fig. 5 shows the performance of our system on the IPT task in comparison with the work of other teams. Our system achieved an F-score of 0.103, which places it among the top five teams, and its precision of 0.184 was the second highest in the competition.

Fig. 5. Performance of our system (t31_UWM) among different teams on the IPT task.

The examination of different feature sets after the official test shows that the difference in frequency between the proteins of a candidate interaction pair is a very indicative feature: according to 10-fold cross-validation on the training set, it achieves an F-score of up to 0.694 on its own. Adding the top five lexical features that occur between candidate pairs boosted the cross-validation performance to 0.711, which is much better than the 0.57 score achieved by using the full feature set including lexical features, frequency, distance, and mutual information. How to choose an optimal feature set for reliable performance still requires more investigation.

As shown in Fig. 5, most of the systems did not perform very well, except for t18, which utilized the MINT database. This suggests that the IPT task is very challenging and relies not only on the approach applied in this task itself, but also on the performance of the preceding tasks, that is, ACT and INT. Error propagation is an inherent problem in a pipeline system.

5 DISCUSSION/ERROR ANALYSIS

We explored a supervised machine learning approach for the ACT task, and our cross-validation results on the training data are much higher than the official test results. Considering that the full-text articles of the BioCreative II.5 competition are from the journal FEBS Letters, whose subject area covers a wide range of the life sciences, we speculate that the performance difference is due to the small training data set, which limits how representative its features and feature relations are.

To relate UniProt IDs to their appropriate protein names, we developed similarity models to rank the ID list returned by our IR module. Our models employ a threshold to filter out an ID if it has low similarity with the corresponding protein name. Such a threshold involves a trade-off between recall and precision. For example, when the threshold was 0.6, the top 500 IDs achieved a recall of 93 percent; however, the recall decreased significantly to 70 percent when the threshold was increased to 0.7.

Although not reported in the methods, we also explored one knowledge-rich approach to associating a protein name with a UniProtKB ID. Specifically, since a PPI event might be supported by site-directed mutagenesis experiments, we attempted to use the sequence information from mutagenesis for the ID mapping. We found that for articles with mutagenesis experiments, the precision of this approach was very high. However, based on our experiments on the training data, incorporating this approach did not yield a noticeable gain. We speculate that one reason is that the proportion of mutagenesis experiments in the current data is very low, which limited the potential of this approach to improve the overall system performance.

In the INT task, our results show that when the identification of organisms was ignored, the homonym ortholog INT performance increased for all systems. But our system benefited even more than the norm, as one of our runs was among the top three. The results indicate that improved species disambiguation would significantly increase our system's performance, but for now that must remain future work.

The INT task is a key task because it is more challenging than the other two tasks, and its performance directly affects the IPT task. We therefore systematically analyzed the errors that occurred in the INT task. As shown in Fig. 1, our INT system incorporates a cascade of mapping and reranking subcomponents, each of which contributed errors.

Specifically, we found that more than half of the errors were produced by the similarity threshold of 0.6 that we empirically applied for this competition. As we discussed previously, deciding on an appropriate threshold is difficult, and the performance of the system is quite sensitive to different thresholds. Further study is needed on this issue.

In two other major error categories, about 35 percent of the gold standard IDs were excluded either in the reranking process, in which the top five IDs were selected for each protein name (Sections 3.3.3 and 3.3.4), or in the top N (N = 10) selection process, in which the top N IDs were chosen as the final output for each article. These empirical thresholds could be tuned to achieve better performance with more development data.

About 5 percent of the gold standard IDs were excluded in the heuristic filtering step. These IDs had a low rank, and only one variant of each of them was detected in the corresponding article. The rule seems reasonable; however, more experiments need to be performed to verify the gain from this rule in comparison with a system that does not use it.

Our error analysis has also shown a limitation of our system: although it is a lightweight framework that can generalize to any semantic mapping task, a significant number of thresholds need to be tuned for optimal performance.

6 CONCLUSION

We proposed an IR-aided framework in this paper and applied the framework to all three BioCreative II.5 challenge tasks. We achieved good performance even though our systems were built upon limited training data and minimal external knowledge (although the systems were allowed to incorporate external knowledge resources). The results indicate that our IR framework is efficient, robust, and scalable, and we speculate that the framework has the potential to benefit many other text mining tasks, including large-scale information retrieval, summarization, and question answering.

ACKNOWLEDGMENTS

The authors acknowledge the BioCreative organizers for training and test data preparation, task definition, and coordination, which enabled us to compete in this challenge. We thank Balaji Polepalli Ramesh for his investigation and experiments on Annotation Server-related issues and Lamont Antieau for his editing. This work was supported by grants from the NIH: 1R01LM010125, 5R01LM009836, 5R21RR024933, and 5U54DA021519. The first three authors contributed equally to this paper. The corresponding author is Hong Yu.

REFERENCES

[1] D. Chen, H.M. Muller, and P.W. Sternberg, “Automatic Document Classification of Biological Literature,” BMC Bioinformatics, vol. 7, p. 370, 2006.

[2] D. Hanisch, K. Fundel, H.T. Mevissen, R. Zimmer, and J. Fluck, “ProMiner: Rule-Based Protein and Gene Entity Recognition,” BMC Bioinformatics, vol. 6, pp. S14-S22, 2005.

[3] K.J. Lee, Y.S. Hwang, S. Kim, and H.C. Rim, “Biomedical Named Entity Recognition Using Two-Phase Model Based on SVMs,” J. Biomedical Informatics, vol. 37, pp. 436-447, 2004.


[4] R. Sætre and K. Sagae, “Syntactic Features for Protein-Protein Interaction Extraction,” Proc. Int'l Symp. Languages in Biology and Medicine, 2007.

[5] A. Rzhetsky, I. Iossifov, T. Koike, M. Krauthammer, P. Kra, M. Morris, H. Yu, P.A. Duboue, W. Weng, W.J. Wilbur, V. Hatzivassiloglou, and C. Friedman, “GeneWays: A System for Extracting, Analyzing, Visualizing, and Integrating Molecular Pathway Data,” J. Biomedical Informatics, vol. 37, pp. 43-53, Feb. 2004.

[6] M. Krauthammer, C.A. Kaufmann, T.C. Gilliam, and A. Rzhetsky, “Molecular Triangulation: Bridging Linkage and Molecular-Network Information for Identifying Candidate Genes in Alzheimer's Disease,” Proc. Nat'l Academy of Sciences USA, vol. 101, pp. 15148-15153, Oct. 2004.

[7] B.J. Stapley and G. Benoit, “Biobibliometrics: Information Retrieval and Visualization from Co-Occurrences of Gene Names in Medline Abstracts,” Proc. Pacific Symp. Biocomputing, pp. 529-540, 2000.

[8] J. Bandy, D. Milward, and S. McQuay, “Mining Protein-Protein Interactions from Published Literature Using Linguamatics I2E,” Methods in Molecular Biology, vol. 563, pp. 3-13, 2009.

[9] T. Sekimizu, H. Park, and J. Tsujii, “Identifying the Interaction between Genes and Gene Products Based on Frequently Seen Verbs in Medline Abstracts,” Proc. Workshop Genome Informatics, vol. 9, pp. 62-71, 1998.

[10] S.T. Ahmed, D. Chidambaram, H. Davulcu, and C. Baral, “Intex: A Syntactic Role Driven Protein-Protein Interaction Extractor for Bio-Medical Text,” Proc. ISMB BioLINK Special Interest Group on Text Data Mining and the ACL Workshop Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics, pp. 54-61, 2005.

[11] J. Chiang, H. Yu, and H. Hsu, “GIS: A Biomedical Text-Mining System for Gene Information Discovery,” Bioinformatics, vol. 20, pp. 120-121, Jan. 2004.

[12] J. Xiao, J. Su, G.D. Zhou, and C.L. Tan, “Protein-Protein Interaction Extraction: A Supervised Learning Approach,” Proc. Symp. Semantic Mining in Biomedicine, pp. 51-59, 2005.

[13] L. Hirschman, A. Yeh, C. Blaschke, and A. Valencia, “Overview of BioCreAtIvE: Critical Assessment of Information Extraction for Biology,” BMC Bioinformatics, vol. 6, suppl. 1, pp. S1-S10, 2005.

[14] M. Krallinger, F. Leitner, C. Rodriguez-Penagos, and A. Valencia, “Overview of the Protein-Protein Interaction Annotation Extraction Task of BioCreative II,” Genome Biology, vol. 9, suppl. 2, pp. S4-S22, 2008.

[15] Y. Niu, D. Otasek, and I. Jurisica, “Evaluation of Linguistic Features Useful in Extraction of Interactions from PubMed; Application to Annotating Known, High-Throughput and Predicted Interactions in I2D,” Bioinformatics, vol. 26, pp. 111-119, Jan. 2010.

[16] B.J. Stapley, L.A. Kelley, and M.J. Sternberg, “Predicting the Sub-Cellular Location of Proteins from Text Using Support Vector Machines,” Proc. Pacific Symp. Biocomputing, 2002.

[17] H. Shatkay and R. Feldman, “Mining the Biomedical Literature in the Genomic Era: An Overview,” J. Computational Biology, vol. 10, pp. 821-855, 2003.

[18] C. Blaschke, M.A. Andrade, C. Ouzounis, and A. Valencia, “Automatic Extraction of Biological Information from Scientific Text: Protein-Protein Interactions,” Proc. Int'l Conf. Intelligent Systems for Molecular Biology, pp. 60-67, 1999.

[19] L. Wong, “A Protein Interaction Extraction System,” Proc. Pacific Symp. Biocomputing, 2001.

[20] U. Pieper, N. Eswar, H. Braberg, M.S. Madhusudhan, F.P. Davis, A.C. Stuart, N. Mirkovic, A. Rossi, M.A. Marti-Renom, A. Fiser, B. Webb, D. Greenblatt, C.C. Huang, T.E. Ferrin, and A. Sali, “MODBASE, a Database of Annotated Comparative Protein Structure Models, and Associated Resources,” Nucleic Acids Research, vol. 32, pp. D217-D222, 2004.

[21] J. Thomas, D. Milward, C. Ouzounis, S. Pulman, and M. Carroll, “Automatic Extraction of Protein Interactions from Scientific Abstracts,” Proc. Pacific Symp. Biocomputing, pp. 541-552, 2000.

[22] C. Friedman, P. Kra, H. Yu, M. Krauthammer, and A. Rzhetsky, “GENIES: A Natural-Language Processing System for the Extraction of Molecular Pathways from Journal Articles,” Bioinformatics, vol. 17, pp. 74-82, 2001.

[23] N. Daraselia, A. Yuryev, S. Egorov, S. Novichkova, A. Nikitin, and I. Mazo, “Extracting Human Protein Interactions from MEDLINE Using a Full-Sentence Parser,” Bioinformatics, vol. 20, pp. 604-611, Mar. 2004.

[24] D.R. Rhodes, S.A. Tomlins, S. Varambally, V. Mahavisno, T. Barrette, S. Kalyana-Sundaram, D. Ghosh, A. Pandey, and A.M. Chinnaiyan, “Probabilistic Model of the Human Protein-Protein Interaction Network,” Nature Biotechnology, vol. 23, pp. 951-959, 2005.

[25] A. Koike and T. Takagi, “Prediction of Protein-Protein Interaction Sites Using Support Vector Machines,” Protein Eng., Design and Selection, vol. 17, pp. 165-173, Feb. 2004.

[26] R. McDonald and F. Pereira, “Identifying Gene and Protein Mentions in Text Using Conditional Random Fields,” BMC Bioinformatics, vol. 6, suppl. 1, pp. S6-S12, 2005.

[27] T. Sandler, A.I. Schein, and L.H. Ungar, “Automatic Term List Generation for Entity Tagging,” Bioinformatics, vol. 22, pp. 651-657, 2006.

[28] A. Morgan, Z. Lu, X. Wang, A. Cohen, J. Fluck, P. Ruch, A. Divoli, K. Fundel, R. Leaman, J. Hakenberg, C. Sun, H. Liu, R. Torres, M. Krauthammer, W. Lau, H. Liu, C. Hsu, M. Schuemie, K.B. Cohen, and L. Hirschman, “Overview of BioCreative II Gene Normalization,” Genome Biology, vol. 9, pp. S3-S21, 2008.

[29] L. Smith, L. Tanabe, R. Ando, C. Kuo, I. Chung, C. Hsu, Y. Lin, R. Klinger, C. Friedrich, K. Ganchev, M. Torii, H. Liu, B. Haddow, C. Struble, R. Povinelli, A. Vlachos, W. Baumgartner, L. Hunter, B. Carpenter, R. Tsai, H. Dai, F. Liu, Y. Chen, C. Sun, S. Katrenko, P. Adriaans, C. Blaschke, R. Torres, M. Neves, P. Nakov, A. Divoli, M. Mana-Lopez, J. Mata, and W.J. Wilbur, “Overview of BioCreative II Gene Mention Recognition,” Genome Biology, vol. 9, pp. S2-S20, 2008.

[30] Y. Sasaki, S. Montemagni, P. Pezik, D. Schuhman, J. McNaught, and S. Ananiadou, “BioLexicon: A Lexical Resource for the Biology Domain,” Proc. Third Int'l Symp. Semantic Mining in Biomedicine (SMBM '08), pp. 109-116, 2008.

[31] H. Yu, G. Hripcsak, and C. Friedman, “Mapping Abbreviations to Full Forms in Biomedical Articles,” J. Am. Medical Informatics Assoc., vol. 9, pp. 262-272, May 2002.

[32] H. Yu and E. Agichtein, “Extracting Synonymous Gene and Protein Terms from Biological Literature,” Bioinformatics, vol. 19, suppl. 1, pp. i340-i349, 2003.

[33] D.S. Hirschberg, “Algorithms for the Longest Common Subsequence Problem,” J. ACM, vol. 24, pp. 664-675, 1977.

[34] W.E. Winkler, “The State of Record Linkage and Current Research Problems,” Technical Report RR99-04, Statistical Research Division, United States Census Bureau, 1999.

[35] A. Zanzoni, L. Montecchi-Palazzi, M. Quondam, G. Ausiello, M. Helmer-Citterich, and G. Cesareni, “MINT: A Molecular INTeraction Database,” FEBS Letters, vol. 513, pp. 135-140, Feb. 2002.

[36] H. Yu, W. Kim, V. Hatzivassiloglou, and J. Wilbur, “A Large Scale, Corpus-Based Approach for Automatically Disambiguating Biomedical Abbreviations,” ACM Trans. Information Systems, vol. 24, pp. 380-404, 2006.

[37] H. Yu, W. Kim, V. Hatzivassiloglou, and W.J. Wilbur, “Using MEDLINE as a Knowledge Source for Disambiguating Abbreviations and Acronyms in Full-Text Biomedical Journal Articles,” J. Biomedical Informatics, vol. 40, pp. 150-159, 2007.

[38] J. Kim, T. Ohta, Y. Tateisi, and J. Tsujii, “GENIA Corpus—Semantically Annotated Corpus for Bio-Textmining,” Bioinformatics, vol. 19, suppl. 1, pp. i180-i182, 2003.

[39] B. Settles, “ABNER: An Open Source Tool for Automatically Tagging Genes, Proteins and Other Entity Names in Text,” Bioinformatics, vol. 21, pp. 3191-3192, July 2005.

[40] Y. Regev, M. Finkelstein-Landau, R. Feldman, M. Gorodetsky, X. Zheng, S. Levy, R. Charlab, C. Lawrence, R.A. Lippert, Q. Zhang, and H. Shatkay, “Rule-Based Extraction of Experimental Evidence in the Biomedical Domain: The KDD Cup 2002 (Task 1),” ACM SIGKDD Explorations Newsletter, vol. 4, pp. 90-92, 2002.

[41] H. Yu and M. Lee, “Accessing Bioscience Images from Abstract Sentences,” Bioinformatics, vol. 22, pp. e547-e556, 2006.

Yonggang Cao received the PhD degree in computer science from Beihang University. He worked for Microsoft Research Asia and then at the University of Wisconsin-Milwaukee as an associate researcher and research associate. He is now working for Amazon. His research interests include text mining, question answering, and information retrieval and extraction.


Zuofeng Li received the BS degree in clinical medicine from Qingdao Medical College, the MS degree in pathophysiology from Shanghai Second Medical University, and the PhD degree in bioinformatics from Fudan University in China. He worked at the Shanghai Center for Bioinformation Technology in biomedical informatics in 2006. Since 2009, he has been with Professor Hong Yu's group at the University of Wisconsin-Milwaukee with an NCIBI Building Bridges Postdoctoral Fellowship. His research interests include biomedical text mining and translational medicine.

Feifan Liu received the PhD degree in pattern recognition and intelligent systems from the Chinese Academy of Sciences, Beijing. He is currently an associate scientist (research faculty) at the University of Wisconsin-Milwaukee. His main research interests are natural language processing and biomedical text mining.

Shashank Agarwal received the MS degree in biotechnology from the Australian National University, Canberra. He is currently working toward the PhD degree in medical informatics at the University of Wisconsin-Milwaukee. His research interests are in biomedical text mining and natural language processing.

Qing Zhang received the bachelor's degree in mechanical engineering and the master's degree in software engineering from the Beijing University of Aeronautics and Astronautics, China. She is currently working toward the PhD degree in the Computer Science Program at the University of Wisconsin-Milwaukee. Her research interests include text mining and information retrieval. She is a student member of the AMIA.

Hong Yu received the PhD degree in biomedical informatics from Columbia University. She is currently an associate professor at the University of Wisconsin-Milwaukee. Her research interests are in biomedical informatics, with a focus on biomedical text mining.
