Towards Improving Classification of Real World Biomedical Articles

Kostas Fragos

TEI of Athens

[email protected]

Christos Skourlas

TEI of Athens

[email protected]

Summary

• We propose a method to improve performance in biomedical article classification.

• We use Naïve Bayes and Maximum Entropy classifiers to classify real-world biomedical articles drawn from the dataset of the BC2.5 (BioCreative II.5) article classification competition task.

• To improve classification performance, we use two merging operators, Max and Harmonic Mean, to combine the results of the two classifiers.

• The results show that merging the two classifiers improves classification performance on real-world biomedical data.

Introduction

• From the biomedical point of view there are many challenges in classifying biomedical information [3].

• Even the most sophisticated of solutions often overfit to the training data and do not perform as well on real-world data [4].

• In this paper we try to devise a method which makes real world biomedical data classification more robust.

• First, we parse the documents and apply a keyword extraction algorithm to extract keywords from the full text. Second, we apply a chi-square feature selection strategy to identify the most relevant keywords. Finally, we classify the documents with Naïve Bayes and Maximum Entropy classifiers and combine their results using two merging operators to improve performance.

THE CLASSIFICATION METHOD

• Naïve Bayes Classifiers:

• A text classifier can be defined as a function that maps a document d of n words (features), d = (x1, x2, x3, ..., xn), to a confidence that d belongs to a text category.

• The Naïve Bayes classifier [1] is often used to estimate the probability of each category.

• Bayes' theorem can be used to estimate the probabilities: Pr(c|d) = Pr(d|c) × Pr(c) / Pr(d) [6]
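The use of Bayes' theorem above can be illustrated with a minimal sketch. It assumes a multinomial event model with add-one (Laplace) smoothing, which the slides do not specify; the function names and the toy documents are hypothetical and serve only as illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Estimate log Pr(c) and smoothed log Pr(w|c) from tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        # Add-one smoothing (an assumption) keeps unseen words from zeroing Pr(d|c).
        total = sum(word_counts[c].values()) + len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + 1) / total) for w in vocab}
    return log_prior, log_like, vocab

def posterior(tokens, log_prior, log_like, vocab):
    """Return Pr(c|d) for every class c; out-of-vocabulary words are ignored."""
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in tokens if w in vocab)
        for c in log_prior
    }
    # Dividing by Pr(d) is done by normalising the scores (log-sum-exp trick).
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    return {c: math.exp(s - m) / z for c, s in scores.items()}

# Toy data, purely illustrative.
docs = [["protein", "interaction", "binding"], ["protein", "binding"],
        ["patient", "trial"], ["trial", "cohort"]]
labels = ["relevant", "relevant", "irrelevant", "irrelevant"]
model = train_naive_bayes(docs, labels)
probs = posterior(["protein", "binding"], *model)
```

The value probs["relevant"] plays the role of the confidence NBC(d) that a document belongs to the relevant category.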

THE CLASSIFICATION METHOD

• Maximum Entropy Classifiers:

• Entropy was introduced by Shannon (1948) in communication theory. The entropy H measures the average uncertainty of a single random variable X: H(p) = H(X) = -Σ p(x) log2 p(x), summed over all values x of X [2]

• The maximum entropy model can be adapted specifically to text classification.

• This can be done using the improved iterative scaling (IIS) algorithm and a hill-climbing algorithm to estimate the parameters of the maximum entropy model [6]
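The entropy formula above can be checked numerically with a short sketch. The function name is hypothetical; the example only evaluates H for given distributions and does not implement IIS or parameter estimation.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution given as probabilities."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Terms with p(x) = 0 contribute nothing, by the convention 0 * log 0 = 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = entropy([0.5, 0.5])   # maximal uncertainty for two outcomes
skewed = entropy([0.9, 0.1])    # less uncertain than the uniform case
certain = entropy([1.0, 0.0])   # no uncertainty at all
```

The uniform distribution has the highest entropy of the three. This is the intuition behind maximum entropy modelling: among all distributions that satisfy the constraints derived from the training data, the most uniform one is preferred.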

Merging Classifiers

We use two operators, the Maximum and the Harmonic Mean, to combine the results of the Naïve Bayes Classifier (NBC) and the Maximum Entropy Classifier (MEC) and improve classification performance:

• MaxC(d) = Max{NBC(d), MEC(d)}

• HarmC(d) = 2.0 × NBC(d) × MEC(d) / (NBC(d) + MEC(d))

• The MaxC(d) operator chooses the larger of the two classifiers' results. The HarmC(d) operator computes the Harmonic Mean of the two results.
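The two merging operators can be written as a minimal sketch. It assumes that NBC(d) and MEC(d) are confidence scores in [0, 1]; the function names and the guard against a zero denominator are additions for illustration and are not taken from the slides.

```python
def max_c(nbc, mec):
    """MaxC(d): the larger of the two classifier confidences."""
    return max(nbc, mec)

def harm_c(nbc, mec):
    """HarmC(d): harmonic mean of the two classifier confidences."""
    if nbc + mec == 0:
        # Assumption: both classifiers returning 0 yields a merged score of 0.
        return 0.0
    return 2.0 * nbc * mec / (nbc + mec)

# Hypothetical confidences for a single document d.
nbc_score, mec_score = 0.4, 0.8
merged_max = max_c(nbc_score, mec_score)
merged_harm = harm_c(nbc_score, mec_score)
```

The harmonic mean never exceeds the maximum and is pulled towards the lower score, so HarmC is the more conservative of the two operators when the classifiers disagree.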

BioCreAtIvE challenge

• Description [2004-01-02]

• The BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) challenge evaluation consists of a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain.

• http://www.biocreative.org/about/background/description/

BioCreative II.5 challenge

• Evaluation library [2009-12-17]

• This is the current version of the BioCreative evaluation library, including a command line tool to use it. Current official version: 3.2 (use the command line option --version to see the version of the script you have installed: bc-evaluate --version). If you have reason to believe that there is a bug in the tool or the library, or have any other questions related to it, please contact the author, Florian Leitner.

• http://www.biocreative.org/resources/biocreative-ii5/evaluation-library/

BioCreative II.5 challenge

• Task 2: Protein-Protein Interactions [2006-04-01]

• This task is organized as a collaboration between the IntAct and MINT protein interaction databases and the CNIO Structural Bioinformatics and Biocomputing group.

• Background

• Introduction

• Task description

• Data

• Resources

• http://www.biocreative.org/tasks/biocreative-ii/task-2-protein-protein-interac/

Preparing the Data

• For experimentation purposes we used the data used in the article classification competition task BC2.5 [4].

• This classification task was based on a training data set of 61 full-text articles relevant to protein-protein interaction and 558 irrelevant ones.

• For training we chose the first 60 relevant articles and randomly sampled 60 irrelevant articles. For testing we used the BioCreative II.5 test data set, which consists of 63 full-text articles relevant to protein-protein interaction and 532 irrelevant ones.
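The construction of the balanced training set can be sketched as follows. The article identifiers, the function name and the fixed random seed are hypothetical; only the class sizes (61 relevant, 558 irrelevant, 60 taken from each) come from the description above.

```python
import random

def build_training_set(relevant_ids, irrelevant_ids, n_per_class=60, seed=0):
    """Take the first n relevant articles and a random sample of n irrelevant ones."""
    rng = random.Random(seed)  # fixed seed (an assumption) makes the sample repeatable
    relevant = list(relevant_ids)[:n_per_class]
    irrelevant = rng.sample(list(irrelevant_ids), n_per_class)
    # Label 1 marks relevant articles, label 0 irrelevant ones.
    return [(i, 1) for i in relevant] + [(i, 0) for i in irrelevant]

# Placeholder identifiers standing in for the BC2.5 training articles.
relevant_ids = [f"rel-{i}" for i in range(61)]
irrelevant_ids = [f"irr-{i}" for i in range(558)]
train = build_training_set(relevant_ids, irrelevant_ids)
```

Balancing the classes in this way keeps the classifiers from being dominated by the much larger irrelevant class during training, while the test set keeps its original, skewed class distribution.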

Preparing the Data

• Before using the data for training and testing, we pre-processed all articles by filtering out stop words and applying Porter stemming to the remaining words/keywords.

• Finally, we ranked the keywords extracted from the BC2.5 training articles with the chi-square scoring formula to identify the most relevant keywords [6].
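The chi-square ranking step can be sketched as below. The slides do not give the exact scoring formula, so this sketch assumes the common 2×2 contingency form based on document frequencies; the function names and toy documents are hypothetical.

```python
from collections import Counter

def chi_square_scores(docs, labels, positive=1):
    """Chi-square score of every term with respect to the positive class."""
    n = len(docs)
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = n - n_pos
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # Count each term once per document (document frequency).
        (df_pos if y == positive else df_neg).update(set(tokens))
    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]   # relevant docs containing the term
        b = df_neg[term]   # irrelevant docs containing the term
        c = n_pos - a      # relevant docs without the term
        d = n_neg - b      # irrelevant docs without the term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores

def top_keywords(scores, k):
    """The k highest-scoring terms; ties are broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Toy stemmed documents, purely illustrative.
docs = [["protein", "bind", "cell"], ["protein", "bind"],
        ["patient", "cell"], ["patient", "trial"]]
labels = [1, 1, 0, 0]
scores = chi_square_scores(docs, labels)
best = top_keywords(scores, 3)
```

Note that the score is symmetric: a term that occurs only in irrelevant articles ("patient" in the toy data) ranks as high as one that occurs only in relevant articles, because both separate the classes equally well.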

Experiments

• The experiments consist of the following phases:

• First, we collect five sets of top-ranked relevant keywords using the chi-square feature selection strategy. Second, we compare the performance of the two classifiers, Naïve Bayes and Maximum Entropy, on each set of word features. Third, we use the merging operators to combine the results of the two classifiers and improve performance.

• In each experiment we calculate Precision, Recall, True Negative Rate and Accuracy measures.
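The four measures can be computed from the confusion counts of a binary classifier, as in the following minimal sketch. The function name and the example counts are hypothetical and do not reproduce the experiments reported here.

```python
def evaluation_measures(tp, fp, tn, fn):
    """Precision, Recall, True Negative Rate and Accuracy from confusion counts."""
    def ratio(num, den):
        # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
        return num / den if den else 0.0
    return {
        "precision": ratio(tp, tp + fp),           # correct among predicted relevant
        "recall": ratio(tp, tp + fn),              # found among truly relevant
        "true_negative_rate": ratio(tn, tn + fp),  # rejected among truly irrelevant
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }

# Hypothetical counts: 13 relevant and 87 irrelevant test articles.
measures = evaluation_measures(tp=8, fp=2, tn=85, fn=5)
```

On a test set with far more irrelevant than relevant articles, Precision reacts strongly to false positives, which is why it is reported separately from Accuracy.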

Results

• The Maximum Entropy classifier gives the best Precision, Recall and Accuracy (0.186, 0.857 and 0.589, respectively) at the 500 top-ranked keywords, and the best True Negative Rate (0.565) at the 700 top-ranked keywords.

• We combine the results of the two classifiers using the two merging operators described above to improve performance, especially the Recall rate.

• The merging operators do improve performance: Precision 0.189, Recall 0.873, True Negative Rate 0.560 and Accuracy 0.591.

Conclusion

• The results show that the Maximum Entropy classifier performs better, with its best results at the 500 top-ranked relevant keywords.

• By combining the results of the two classifiers, we can improve classification performance on real-world biomedical data.

References

1. Fuhr, N., Hartmann, S., Lustig, G., Schwantner, M., and Tzeras, K. 1991. Air/X – A rule-based multi-stage indexing system for large subject fields. RIAO'91, pp. 606-623.

2. Galathiya, A. S., Ganatra, A. P., and Bhensdadia, K. C. 2012. An improved decision tree induction algorithm with feature selection, cross validation, model complexity & reduced error pruning. IJSCIT, March 2012.

3. Feldman, R., and Sanger, J. 2006. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press.

4. Krallinger, M., et al. 2009. The BioCreative II.5 challenge overview. In: Proc. of the BioCreative II.5 Workshop 2009 on Digital Annotations, pp. 7–9.

5. Fragos, K., and Maistros, I. 2006. A Goodness of Fit Test Approach in Information Retrieval. Information Retrieval, Springer, Volume 9, Number 3, pp. 331–342.

6. McCallum, A., and Nigam, K. 1998. A comparison of event models for naive Bayes text classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization.

7. Fragos, K., Maistros, I., Skourlas, C. 2005. A X2-Weighted Maximum Entropy Model for Text Classification. 2nd International Conference On N.L.U.C.S, Miami, Florida.

Questions…