
    Abbreviations

    Abbreviation Description

    IR Information Retrieval

    TC Text Classification

    ATC Arabic Text Classification

    WWW World Wide Web

    MMM Multinomial Mixture Model

    KNN K-Nearest Neighbor

NB Naïve Bayes

    SVM Support Vector Machine

    D Document

    C Class

    HTML Hyper Text Markup Language

    SGML Standard Generalized Markup Language

XML Extensible Markup Language

Re Recall

    Pr Precision

FSS Feature Subset Selection

VMF Von Mises-Fisher

    BPSO Binary Particle Swarm Optimisation

    LDA Latent Dirichlet Allocation


4.5. Results of the Naïve Bayes algorithm (MMM) with 5070 documents

4.6. Summary

4.7. Conclusion and Future Work

4.8. References


    Acknowledgements

    I would like to express my sincerest gratitude to my supervisor, Prof. Ghassan

    Kanaan, who has been exceptionally patient and understanding with me during

    my studies. Without his kind words of encouragement and advice this work

    would not have been possible.

I am extremely grateful to all the staff who have assisted me in the Department of Computer Sciences and Informatics, especially Prof. Alaa Al-Hamami.

Thanks also to all of my other colleagues in Computer Sciences and Informatics for making my time here an enjoyable experience.

I would like to thank the Libyan Embassy in Amman for taking care of me and supporting my studies.

The support of my family and friends has been much appreciated. Most importantly, I would like to thank my husband, Ali, and my children, to whom I am indebted for all of the moral and loving support they have given me during this time.


    Abstract

Text Classification (TC) assigns documents to one or more predefined categories based on their contents. This project focuses on comparing three automatic TC techniques, Rocchio, K-Nearest Neighbor (KNN) and the Naïve Bayes (NB) classifier using a multinomial mixture model (MMM), on the Arabic language. To evaluate these techniques, an Arabic TC corpus is used that consists of 1445 Arabic documents classified into nine categories: Computer, Economics, Education, Sport, Politics, Engineering, Medicine, Law, and Religion. The main goal of this project is to compare automatic text classification techniques using a multinomial mixture model on the Arabic language. The classification effectiveness is compared with the SVM model, which was applied in another project that used the same traditional classifiers and the same collection. The experimental results are presented in terms of macro-averaged precision, macro-averaged recall and macro-averaged F1 measures. The results reveal that naïve Bayes using the MMM works best for Arabic TC tasks and outperforms the k-NN and Rocchio classifiers.


    1. Chapter one: Introduction


1.1. Introduction

With the rapid development of the Internet, a large amount of Arabic information has become available online; this motivates researchers to develop tools that help people classify this huge volume of Arabic information.

1.1.1. Information retrieval

It is necessary to clarify exactly what is meant by an Information Retrieval (IR) system. An IR system is designed to analyse, process and store sources of information, and to retrieve those that match a particular user's requirements. In other words, an IR system calculates similarity scores between a query and a set of documents, and ranks the relevant documents based on those scores. There are two main issues in IR systems. The first is that the characterization of the user's information need is not always clear and must be transformed so that it can be understood by the IR system; this transformed need is known as the query, a short document containing a few words (Hasan, 2009). The second is the structure of the information: there are no standards or rules that control this structure, especially on the World Wide Web (WWW), and each language has its own characteristics and semantics. In addition, users need to find high-quality information that suits their requirements, and response time must be taken into account so that information is found quickly. These issues point to a very important topic: Text Classification (TC).

Information retrieval (IR) is a branch of computer science. The main objective of IR is to provide effective methods for satisfying information needs. Information that satisfies an information need is called relevant.


Figure 1.1: IR system components (Alnobani, 2008)

Three essential components can be used to represent an IR system:

Input: the set of available documents and the requested information (the query). The problem here is that all of this information must be converted into a form that is suitable for the computer to use.

The processor: this part of the retrieval system is concerned with the retrieval process. A retrieval algorithm is given a query, created by a user, that represents their information need. In the case of text, this query consists of a series of words, along with possibly a set of relations between them. When all the inputs are ready, the processor compares the query with the documents to satisfy the user's request exactly, or at least to retrieve the nearest results. The information to be found resides in a collection that consists of a set of documents.

Output: in response to the requested information (query), the retrieval algorithm scores the documents in the collection, ranking them according to some measure of how well the query terms and relations are matched by information in the document. For text, the relations most often used between terms are co-occurrence or proximity constraints. Traditional relevance also relies on the frequency with which terms occur in a document, and on how unusual terms are in the collection (Collins-Thompson and Adviser-Callan, 2008).
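To make the scoring step concrete, the following is a minimal sketch (not the thesis's implementation) of how a retrieval algorithm might rank a collection against a query using cosine similarity over term-count vectors; the toy documents and query are invented for illustration.

```python
import math
from collections import Counter

def cosine_score(query_terms, doc_terms):
    """Score a document against a query by the cosine of their term-count vectors."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

# Toy collection: each document is a list of tokens (illustrative only).
docs = {
    "d1": "information retrieval ranks relevant documents".split(),
    "d2": "arabic text classification assigns categories".split(),
}
query = "relevant documents retrieval".split()

# Rank the collection by descending score, as the output component does.
ranking = sorted(docs, key=lambda name: cosine_score(query, docs[name]), reverse=True)
print(ranking)  # d1 first: it shares more query terms
```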

It is necessary to understand how the retrieval process works. There are two main approaches that can be used to build a system that retrieves documents for a given query: an ad-hoc method, and a more principled, model-based method.

Ad-hoc information retrieval: in ad-hoc retrieval, the documents in the collection remain relatively static while new queries are submitted to the system. For example, when a query is compared with a set of documents, the documents that contain the query terms are retrieved. To improve upon this in an ad-hoc method, we could also decide to factor in the number of times a term appears in a document. Ad-hoc information retrieval has several benefits: an ad-hoc retrieval system is quick to build, and many ad-hoc retrieval methods are also very fast, needing only to look at the occurrences of query terms in the documents at query time. On the other hand, the weak points are that no model is built at all, and every change made to an ad-hoc retrieval system affects the retrieval in unpredictable ways. Additionally, it is very hard to understand exactly what is happening in these systems, and therefore what should be done to improve them. For this reason, researchers prefer to build models for information retrieval (Alnobani, 2008).

Model-based information retrieval: unlike the ad-hoc approach, here a retrieval model is built. If the model is built correctly, it captures the important aspects of a query and the documents needed for retrieval. Moreover, the benefit of building a model is that the system can be understood and controlled.


1.1.1.1. Information retrieval models

Several information retrieval models have been applied. The Boolean model is one of the earliest IR models; it offers clear formalism and simplicity. However, a major problem with this kind of model is its binary decision criterion, without any notion of a grading scale, together with the difficulty of translating a query into Boolean expressions. In addition, it is difficult to control the number of documents retrieved, because all matched documents are returned. Another popular retrieval model is the vector model. It has advantages such as a term-weighting scheme that improves retrieval performance by sorting documents according to their degree of similarity to the query, and a partial-matching strategy that approximates the query conditions. Despite the advantages mentioned above, the vector model suffers from drawbacks such as a lack of clean formalism and simplicity. The third information retrieval model is the probabilistic model, one of the most common frameworks for building principled information retrieval models. Probabilistic models assume the existence of a query and then model the generation of documents that are relevant and irrelevant to that query. Given a probabilistic generative model for the corpus, a probabilistic model must retrieve and rank documents. On the other hand, the probabilistic model has weak points: an initial definition of the relevant documents has to be assumed, and the weights ignore term frequency.

1.1.1.2. Evaluating information retrieval

Finally, the performance of any system is proven by the results it achieves. The most common measures of system efficiency are time and space: a shorter response time and smaller space usage indicate a better system. In addition, effectiveness is a measure of the ability of the system to retrieve relevant documents while at the same time holding back non-relevant ones; it can be measured by recall and precision (recall and precision are explained in more detail in chapter 3).

1.1.2. Text classification

A TC system classifies documents into a fixed number of predefined categories based on their content. Text classification may be either single-label, where exactly one category must be assigned to each document, or multi-label, where one or more categories can be assigned to each document. The main objective of using TC is to make IR results better than they would be without TC (Ghwanmeh et al., 2007). These advantages have led to the development of automatic text and document classification systems, which are capable of automatically organizing and classifying documents (Duwairi, 2007b).

The classification process can be done manually or automatically. It is interesting to note that manual categorization is a difficult and complex task, especially with huge amounts of information, because documents are classified one by one by human experts; completing this mission therefore takes a great deal of time. On the other hand, with the speedy growth of online text documents, automatic text categorization (TC) has become an essential tool for handling text documents efficiently and effectively (Wang et al.).

Text classification is the task of classifying a document under a predefined category. More formally, if d is a document from the entire set of documents D, and C = {c1, c2, ..., cn} is the set of all categories, then text classification assigns one category ci ∈ C to a document d.

Given the increasing amount of Arabic information on the Internet, classifying documents manually is not practical; automatic text classification has become an essential task that saves the human effort of manual classification. The optimal approach is therefore automatic classification, which makes use of the sciences of Arabic grammar and morphology, as well as thesauri and dictionaries. With these, the system can understand the main topics in a document. This is done using statistical methods that study the repetition of words within a document and then determine its context, which aids the search operation.

There are three essential stages in a TC system: document indexing, classifier learning, and classifier evaluation.

Document indexing: this is one of the most substantial issues in TC, and includes document representation and a term-weighting scheme. The bag-of-words model is the most common way to represent the content of text. This approach is simple because it records only the frequency of each word in a document. Moreover, for each predefined category, the synonyms and prefix words of the category are found, which helps assign a document to that category based on the synonyms or prefixes of a term. Some term-weighting schemes are described in detail in chapter 3.

Classifier learning: several machine learning algorithms have been applied to automatic text classification by supervised learning (Ko et al., 2004). A supervised learning algorithm finds a representation or judgment rule from an example set of labelled documents for each class (Ko and Seo, 2009). This can be illustrated briefly by Naïve Bayes (NB) (Chen et al., 2009, Noaman et al., Zhang and Gao), the Support Vector Machine (SVM) (Moraes et al., Wang and Chiang, Mesleh and Kanaan, 2008), k-Nearest Neighbour (k-NN) (Wan et al., Jiang et al.), Decision Trees (DT), Rocchio (Ko et al., 2004), voting, etc.
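To illustrate supervised classifier learning, here is a minimal multinomial naïve Bayes sketch with Laplace smoothing; the multinomial model is the same family as the MMM used later in this thesis, but this code is a simplification with an invented two-document training set, not the thesis's implementation.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(labelled_docs):
    """labelled_docs: list of (tokens, class). Returns class priors and smoothed term probabilities."""
    class_docs, class_terms, vocab = defaultdict(int), defaultdict(Counter), set()
    for tokens, c in labelled_docs:
        class_docs[c] += 1
        class_terms[c].update(tokens)
        vocab.update(tokens)
    priors = {c: n / len(labelled_docs) for c, n in class_docs.items()}
    cond = {}
    for c, counts in class_terms.items():
        total = sum(counts.values())
        # Laplace (add-one) smoothing so unseen terms do not zero out a class.
        cond[c] = {t: (counts[t] + 1) / (total + len(vocab)) for t in vocab}
    return priors, cond

def classify(tokens, priors, cond):
    """Assign the class with the maximum posterior log-probability."""
    def log_posterior(c):
        return math.log(priors[c]) + sum(math.log(cond[c][t]) for t in tokens if t in cond[c])
    return max(priors, key=log_posterior)

train = [("match goal team".split(), "Sport"), ("court law judge".split(), "Law")]
priors, cond = train_multinomial_nb(train)
print(classify("goal team win".split(), priors, cond))  # -> Sport
```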


Classifier evaluation: the effectiveness of a classifier is judged according to the results it achieves. Standard evaluation measures such as recall, precision and the F1-measure are used to evaluate the different classifiers.
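For concreteness, here is a small sketch of how recall, precision and F1 can be computed per category and then macro-averaged, as is done in the experiments later in the thesis; the confusion counts below are invented.

```python
def macro_scores(per_class_counts):
    """per_class_counts: {class: (true_positives, false_positives, false_negatives)}."""
    ps, rs, f1s = [], [], []
    for tp, fp, fn in per_class_counts.values():
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
        ps.append(p)
        rs.append(r)
    n = len(per_class_counts)
    # Macro-averaging gives every category equal weight, regardless of its size.
    return sum(ps) / n, sum(rs) / n, sum(f1s) / n

counts = {"Sport": (90, 10, 5), "Law": (40, 20, 15)}  # illustrative counts
print(macro_scores(counts))
```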

1.1.2.1. Classification Based on Supervised Learning

The goal of classification methods is to assign class labels to unlabelled text documents from a fixed number of known categories. Each document can belong to multiple categories, to exactly one, or to no category at all.

Supervised machine learning methods prescribe the input and output format. The input to these methods is a set of objects (the training data), and the output is the classes to which these objects belong.

The key advantage of supervised learning methods over unsupervised methods is that, by having clear knowledge of the classes that the different objects belong to, these algorithms can perform effective feature selection when that leads to better prediction accuracy.

Automatic text classification is treated as a supervised learning task. The goal of this task is to evaluate a Boolean function that determines whether a given document belongs to a category or not by looking at the synonyms or prefixes of that category (Deisy et al., 2010).

Sentiment classification can obviously be formulated as a supervised learning problem with two class labels (positive and negative). The training and testing data used in existing research are mostly product reviews, which is not surprising given the above assumption. Since each review at a typical review site already has a reviewer-assigned rating (e.g., 1-5 stars), training and testing data are readily available: typically, a review with 4-5 stars is considered a positive review (thumbs-up), and a review with 1-2 stars is considered a negative review (thumbs-down).

Sentiment classification is similar to, but also different from, classic topic-based text classification, which classifies documents into predefined topic classes (politics, science, sports, etc.). In topic-based classification, topic-related words are important. In sentiment classification, by contrast, topic-related words are unimportant; instead, sentiment or opinion words that indicate positive or negative opinions are important (e.g., great, excellent, amazing, horrible, bad, worst, etc.).

Existing supervised learning methods, such as naïve Bayes and support vector machines (SVM), can be readily applied to sentiment classification (Pang et al., 2002). This approach can be used to classify movie reviews into two classes, positive and negative. It was shown that using unigrams (a bag of individual words) as features in classification performed well with both naïve Bayes and SVM. Neutral reviews were not used in this work, making the problem easier. The features used here are data attributes used in machine learning, not the object features referred to in the previous section.

Subsequent research used many more kinds of features and techniques in learning. As in most machine learning applications, the main task of sentiment classification is to find a suitable set of features. Some example features used in research, and possibly in practice, are the following (Pang and Lee, 2008).

Terms and their frequency: these features are individual words or word n-grams and their frequency counts. Sometimes word positions may also be considered, and the TF-IDF weighting scheme from information retrieval may be applied too. These features are also commonly used in traditional topic-based text classification, and they have been shown to be quite effective in sentiment classification as well.

Part-of-speech tags: much early research found that adjectives are important indicators of subjectivity and opinion. Therefore, adjectives have been treated as special features.

    Opinion words and phrases: Opinion words are words that are commonly used

    to express positive or negative sentiments. For example, beautiful, wonderful,

    good, and amazing are positive opinion words, and bad, poor, and terrible are

    negative opinion words. Although many opinion words are adjectives and

    adverbs, nouns (rubbish, junk, crap, etc.) and verbs (hate, and like) can also

    indicate opinions. In addition to opinion words, there are also opinion phrases

    and idioms (cost someone an arm and a leg). Opinion words and phrases are

    helpful to sentiment analysis.

Syntactic dependency: word-dependency features generated from parsing or dependency trees have also been tried by several researchers.

Negation: clearly, negation words are important, since their appearance often changes the opinion orientation. For example, the sentence "I don't like this camera" is negative. Negation words must be handled with care because not all occurrences of such words mean negation; for example, "not" in "not only ... but also" does not change the orientation direction.
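As a rough sketch of how such features might be extracted, the snippet below builds unigram counts and marks a term that follows a negation word; the negation list and the NOT_ prefix convention are illustrative assumptions, not a method prescribed by the literature cited here.

```python
NEGATIONS = {"not", "don't", "never", "no"}  # illustrative negation list

def unigram_features(tokens):
    """Bag-of-words features; a token right after a negation word gets a NOT_ prefix."""
    features, negate = {}, False
    for tok in tokens:
        if tok.lower() in NEGATIONS:
            negate = True
            continue
        name = ("NOT_" if negate else "") + tok.lower()
        features[name] = features.get(name, 0) + 1
        negate = False  # this sketch negates only the immediately following token
    return features

print(unigram_features("I don't like this camera".split()))
# {'i': 1, 'NOT_like': 1, 'this': 1, 'camera': 1}
```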

Research has also predicted rating scores (Pang et al., 2002); in this case, the problem is formulated as a regression problem, since the rating scores are ordinal. Another research direction that has been investigated is transfer learning, or domain adaptation. As has been shown, sentiment classification is highly sensitive to the domain from which the training data are extracted: a classifier trained using opinionated texts from one domain often performs poorly when it is applied or tested on opinionated texts from another domain. The reason is that words, and even language constructs, used to express opinions in different domains can be substantially different. Sometimes the same word means positive in one domain but negative in another (Turney, 2002). For example, the adjective "unpredictable" may have a negative orientation in a car review ("unpredictable steering"), but a positive orientation in a movie review ("unpredictable plot"). Therefore, domain adaptation is needed. Existing research has used labelled data from one domain, unlabelled data from the target domain, and general opinion words as features for adaptation (Gamon and Aue, 2005).

1.1.2.2. Classification Based on Unsupervised Learning

Opinion words and phrases are the dominating indicators for sentiment classification, so using unsupervised learning based on such words and phrases is quite natural. The method used in (Turney, 2002) performs classification based on some fixed syntactic phrases that are likely to be used to express opinions. The algorithm consists of three steps.

Step 1: it extracts phrases containing adjectives or adverbs. The reason for doing this is that research has shown that adjectives and adverbs are good indicators of subjectivity and opinion. Although an isolated adjective may indicate subjectivity, there may be insufficient context to determine its opinion orientation. Thus, the algorithm extracts two consecutive words, where one member of the pair is an adjective or adverb and the other is a context word. For example, in the sentence "This camera produces beautiful pictures", the phrase "beautiful pictures" will be extracted, as it satisfies the first pattern.

Step 2: it estimates the orientation of the extracted phrases using the Pointwise Mutual Information (PMI) measure given in equation (1.1).


PMI(term1, term2) = log2 [ P(term1 & term2) / ( P(term1) · P(term2) ) ]        (1.1)

P(term1 & term2) is the co-occurrence probability of term1 and term2, and P(term1) · P(term2) is the probability that the two terms co-occur if they are statistically independent. The ratio between P(term1 & term2) and P(term1) · P(term2) is a measure of the degree of statistical dependence between them. The log of this ratio is the amount of information that we acquire about the presence of one of the words when the other is observed.

The semantic orientation (SO) of a phrase is computed based on its association with the positive reference word "excellent" and its association with the negative reference word "poor":

SO(phrase) = PMI(phrase, "excellent") − PMI(phrase, "poor")        (1.2)

The probabilities are calculated by issuing queries to a search engine and collecting the number of hits. For each search query, a search engine usually gives the number of documents relevant to the query, which is the number of hits. Thus, by searching for the two terms together and separately, we can estimate the probabilities in equation (1.1). (Turney, 2002) used the AltaVista search engine because it has a NEAR operator, which constrains the search to documents that contain the words within ten words of one another, in either order. Let hits(query) be the number of hits returned. Equation (1.2) can then be rewritten as:


SO(phrase) = log2 [ ( hits(phrase NEAR "excellent") · hits("poor") ) / ( hits(phrase NEAR "poor") · hits("excellent") ) ]        (1.3)

Step 3: given a review, the algorithm computes the average SO of all phrases in the review, and classifies the review as recommended if the average SO is positive, and as not recommended otherwise.
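A compact sketch of steps 2 and 3 follows. Live AltaVista NEAR queries are no longer possible, so the hits function here is a hypothetical stub with invented numbers; in practice it would be replaced by real corpus or search-engine counts.

```python
import math

def so_from_hits(hits, phrase):
    """Equation (1.3): semantic orientation of a phrase from NEAR-query hit counts."""
    num = hits(phrase, near="excellent") * hits("poor")
    den = hits(phrase, near="poor") * hits("excellent")
    return math.log2(num / den) if num and den else 0.0

def classify_review(phrases, hits):
    """Step 3: recommended if the average SO of the extracted phrases is positive."""
    avg = sum(so_from_hits(hits, p) for p in phrases) / len(phrases)
    return "recommended" if avg > 0 else "not recommended"

def fake_hits(term, near=None):
    """Stub standing in for search-engine hit counts (illustrative numbers only)."""
    table = {("beautiful pictures", "excellent"): 300, ("beautiful pictures", "poor"): 20,
             ("excellent", None): 10000, ("poor", None): 9000}
    return table.get((term, near), 1)

print(classify_review(["beautiful pictures"], fake_hits))  # -> recommended
```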

In this project, text classification is compared across three different classifiers, KNN, Rocchio and naïve Bayes, using MMMs. Building a TC system for the Arabic language is not an easy task compared with English, because the Arabic language has a very complex morphology (Albalooshi et al.). Moreover, testing the system is the most important stage of any IR system: it is used to determine the efficiency of the system and helps establish which system is better than another. The major goal of an IR system is to retrieve all the documents that are relevant to a user query while retrieving as few non-relevant documents as possible. The evaluation can be achieved with the recall and precision measures.

1.1.3. Arabic language

Applying text classification systems to the Arabic language is a challenging task because Arabic has a very complex morphology (Albalooshi et al.). The Arabic alphabet consists of 28 letters.

The characters ا، و and ي function as long vowels, and the rest of the letters are consonants. Arabic letters can be written in different forms, depending on the position of the letter in the word (beginning, middle, or end). For example, the letter ط has several shapes: at the beginning, as in طريق (which means road in English); in the middle, as in سطح (which means surface in English); and at the end, as in مطاط (which means rubber in English). Furthermore, the Arabic language contains diacritics, which are placed above or below the letters. The diacritics (fatha, kasra, damma, sukun, double fatha, double kasra, double damma and shadda) are used to clarify the meaning of words (Duwairi, 2007c). On top of that, when diacritics are not explicitly written, an Arabic text can have several meanings; this ambiguity negatively affects text classification. To avoid these problems, pre-processing can be applied to the Arabic language.

The Arabic language has a more complex morphology than English. Arabic is written from right to left. Arabic words have two genders, feminine and masculine; three numbers, singular, dual, and plural; and three grammatical cases, nominative, accusative, and genitive. A noun takes the nominative case when it is the subject, the accusative case when it is the object of a verb, and the genitive case when it is the object of a preposition. In addition, Arabic words are divided into three parts of speech: noun, verb, and particle. Noun and verb stems are derived from a few thousand roots by infixing; for example, words like حاسوب (computer), يحسب (he calculates) and نحسب (we calculate) are created from the root حسب (Duwairi, 2006). A noun is a name or a word that describes a person, thing, or idea.

Arabic verbs are similar to English verbs in that they are classified into perfect and imperfect. The perfect tense denotes completed actions, while the imperfect denotes incomplete actions. The imperfect tense has four moods: indicative, subjunctive, jussive, and imperative (Abboud and McCarus, 1983).

    Arabic particles include prepositions, adverbs, conjunctions, interrogative

    particles, exceptions, and interjections.


Most Arabic words are derived from the pattern فعل; all words following the same pattern have common properties and states. For example, the pattern فاعل indicates the subject of the verb, while the pattern مفعول represents the object of the verb.

An Arabic adjective can also have many variants. When an adjective modifies a noun in a phrase, the adjective agrees with the noun in gender, number, case, and definiteness. An adjective has a masculine singular form such as جديد (new), a feminine singular form such as جديدة, a masculine plural form such as جدد, and a feminine plural form such as جديدات (Chen and Gey, 2002).

In addition to the different forms of the Arabic word that result from the derivational process, most connectors, conjunctions, prepositions, pronouns, and possession markers are attached to the Arabic surface form as prefixes and suffixes. For instance, definite nouns are formed by attaching the article ال (the) to the immediate front of the noun. The conjunction و (and) is often attached to the following word. The letters ب، ك، ل and و can be added to the front of a word as prepositions. The suffix ة is attached to represent the feminine gender of a word, and some suffixes are added to represent possessive pronouns, such as ها (her), ي (my), and هم (their) (Chen and Gey, 2002; Zrigui et al., 2012).

In addition, Arabic has two kinds of plurals: sound plurals and broken plurals. Sound plurals are formed by adding plural suffixes to singular nouns. The plural suffix is ات for feminine nouns in all three grammatical cases, ون for masculine nouns in the nominative case, and ين for masculine nouns in the genitive and accusative cases. The formation of broken plurals is more complex and often irregular, and is therefore difficult to predict; furthermore, broken plurals are very common in Arabic. For example, the plural form of the noun طفل (child) is أطفال (children), which is formed by attaching the prefix أ and inserting the infix ا. The plural form of the noun كتاب (book) is كتب (books), which is formed by deleting the infix ا. The plural form of امرأة (woman) is نساء (women); here the plural form is completely different from the singular form (Chen and Gey, 2002).

1.2. The statement of the problem

IR systems have been widely used to assist users in discovering useful information on the Internet. Current IR systems are based on the similarity and term frequencies between the query (the user's requirement) and the information available on the Internet. However, IR ignores important semantic relationships between them; this makes the search operation slow and wastes a lot of time. In addition, the retrieved documents may not be useful, and there is a particular problem when a word has a double meaning. To overcome these problems, text categorization (classification) is a solution.

Text classification techniques have been applied to the Arabic language only a little compared to other languages (Al-Harbi et al., 2008). Unfortunately, there is no perfect technique for classifying text, so researchers have been encouraged to develop TC techniques using many different models and methods.

In this project, the multinomial mixture model (MMM) is suggested and applied to classify Arabic documents. In addition, this experiment is compared with other classifiers, in order to clarify which model performs better than the others.


1.3. Thesis Objective

Arabic text can be considered completely different from English text, and it has a complex morphology. In this thesis, the multinomial mixture model (MMM) is recommended and applied to classify Arabic documents. Moreover, three different techniques are examined on Arabic text: the Rocchio algorithm, traditional k-NN, and naïve Bayes.

The text classification system with these techniques is evaluated using the standard measures: recall, precision and F-measure. The effectiveness of each classifier is decided according to the results achieved. Finally, the results of the MMM are compared with those of the other two algorithms to determine the best information retrieval system for the Arabic language.

1.4. Summary

This chapter gives a short introduction to information retrieval (IR) systems. It also focuses on text categorization (TC) and describes the most important tasks of a text categorization system. After the short introduction, some interesting text categorization systems and the Arabic language are described briefly. Moreover, the thesis's problem statement is presented. Finally, the multinomial mixture model is adopted as the thesis objective.


2. Chapter two: Literature Review


2.1. Literature Review

Text classification is defined as assigning new documents to a set of predefined categories based on classification patterns (Al Zamil and Can, Uğuz). In recent years there has been an increasing amount of literature on the TC topic, and researchers have shown increased interest in continuing this research and developing it on the basis of previous work.

2.1.1. Text classification

Text classification techniques have been investigated and used in many application areas, and many researchers have studied text classification using different techniques.

The study (Guiying et al.) reviewed the key techniques for building a text classification system, including text models, feature selection methods and text classification algorithms. In addition, a text classification system based on mutual information, the K-Nearest Neighbour algorithm and the Support Vector Machine was implemented. The data set was created from the famous Reuters-21578 text classification collection. The experimental results showed a classification accuracy rate of 91.1%, better performance than with no feature selection, and an improved classification rate. Moreover, the SVM classifier attained higher performance than the KNN classifier.

In (Zhang and Gao) a new feature selection method (Auxiliary Feature) was used, and an improvement in the performance of naïve Bayes for text classification was demonstrated. An auxiliary-feature method was proposed that first determines features by an existing feature selection method and then selects an auxiliary feature that can reclassify the text space with respect to the chosen features. To evaluate this experiment, the data set chosen was 30000 junk mails and 10000 normal mails from CCERT. The results of this study show that the proposed method indeed improves the performance of the naïve Bayes classifier.

Feature sub-set selection (FSS) is an important step for effective text classification (TC) systems, since it may have a great effect on the accuracy of the classifier (Karabulut et al., Mesleh). There are many valuable studies that have investigated FSS metrics for English TC tasks, using different classifiers and many TC corpora (Uğuz, Al-Ani et al., Khushaba et al.). For Arabic TC tasks, there are some works that handle the FSS problem. In recent years, there has been an increasing amount of literature on an empirical comparison of seventeen FSS metrics (Chi, Gss, Gsss, Ngl, Or, Mi, Ig, Bns, Df, Pwr, Acc, Acc2, F1, Pr, Re, Fo and Er) for Arabic TC tasks using an SVM classifier; the evaluation used an Arabic corpus consisting of 7842 documents independently classified into ten categories. The results of the experiment proved that the Chi-square and Fallout FSS metrics work best for Arabic TC tasks (Mesleh).

Another study addressing the FSS problem used two-stage feature selection and feature extraction to improve the performance of text categorization (Uğuz, 2011).

An improved KNN algorithm for text classification was proposed, which builds the classification model by combining a constrained one-pass clustering algorithm with KNN text categorization. Although KNN is a simple and effective method for text classification, it has three drawbacks: first, the complexity of its sample-similarity computation is huge; second, its performance is easily affected by single training samples; and third, KNN is a lazy learner and does not build a classification model. To overcome these drawbacks, the improved KNN algorithm was implemented, using the Vector Space Model (VSM) to represent documents. The results show that the INNTC classifier is much more effective and efficient than KNN (Jiang et al.).

In (Zhang et al., 2013) a novel projected-prototype-based classifier for text classification was implemented. The basic idea behind the algorithm is that each document category is modelled by a set of prototypes and their individual term subspaces of the document category. The classifier was tested using two English data sets, and its performance was compared with five other classifiers: SVM, three-prototype, KNN, KNN-model and centroid classifiers. The experimental results show that the projected-prototype-based classifier achieved higher classification accuracy at a lower computation cost than the traditional prototype-based classifier, especially for data that includes interfering document classes.

2.1.2. Arabic Text Classification

The studies carried out on Arabic text classification are very few compared to other languages (like English), because the Arabic language has an extremely rich morphology and complex orthography. However, some related work has been proposed to classify Arabic documents.

(Duwairi, 2007a) implemented three classifiers for Arabic TC: KNN, NB and a distance-based classifier. Every category was represented as a vector of keywords in the distance-based and KNN classifiers; with NB, on the other hand, the vectors were bags of words. The Dice measure was used to calculate the similarity between them. The accuracy of the classifiers was tested using an Arabic text corpus collected from online magazines and newspapers. According to the results, the NB classifier does better than the other two classifiers.

In 2008 (Mesleh and Kanaan, 2008), the researchers applied the SVM algorithm to Arabic text classification. The paper pointed out that the SVM classifier achieved better results than other classifiers such as naïve Bayes and KNN. In addition, light stemming for Arabic TC tasks was evaluated with the SVM classifier; as a result, light stemming did not enhance the performance of the Arabic SVM text classifier. On the other hand, Feature Subset Selection (FSS) was implemented and improved the performance of the Arabic SVM text classifier. All of the feature subset selection methods (Chi-square, GSS, NGL, OR, IG and MI) achieved better recall and better F1-measure, and the best results were achieved with two of these methods (Chi-square and NGL). Finally, a new Ant-Colony-based FSS algorithm (ACO) was applied and achieved the greatest TC effectiveness of the six FSS methods.

The main objective of (Kanaan et al., 2009a) was to compare automatic text classification using the kNN, Rocchio and NB classifiers on the Arabic language. The system was tested using a corpus of 1445 Arabic text documents. Two models were used: the first was the vector space model, used to implement the KNN and Rocchio classifiers, in which each document is represented as a vector of terms; the second was a probabilistic model, used to implement the NB classifier. In the probabilistic model, the probability of a document belonging to each class is calculated, and the document is assigned to the class with the maximum probability. The experiments showed that naïve Bayes is the best performer, followed by kNN and Rocchio.

The paper (McCallum and Nigam, 1998a) reported a comparison between two probabilistic classifiers. The researchers found that the multinomial model gives better results than the multivariate Bernoulli model at large vocabulary sizes; in contrast, when the vocabulary size is smaller, the multivariate Bernoulli model outperforms the multinomial model. The results were tested on five real-world corpora, and the evaluation of their experiments proved that the multinomial model reduced error by an average of 27%, and sometimes by more than 50%.

(Ueda and Saito, 2002) implemented probabilistic generative models called parametric mixture models (PMMs). The main goal of PMMs was to handle multiclass, multi-labelled text categorization problems. The PMMs achieved good results compared with binary classification, because PMMs can simultaneously detect multiple categories of a text instead of depending on binary judgments. The PMM approach was applied to World Wide Web pages and shown to be efficient.

(Zhong and Ghosh, 2003) demonstrated a comparative study of generative models for document clustering that used the multinomial model. The model was compared with two other probabilistic models, the multivariate Bernoulli and the von Mises-Fisher (vMF) model, by applying them to clustering. The Bernoulli model was the worst for text clustering, while the vMF model produced better clustering results than both the Bernoulli and multinomial models.


As mentioned in the literature (Li and Zhang, 2008), a novel mixture-model method for text clustering was named the multinomial mixture model with feature selection (M3FS). The M3FS method used the MMM instead of Gaussian mixtures to improve text clustering tasks. Prior studies noted that, with no labels available, feature selection is a hard problem in unsupervised text clustering; to overcome this problem, M3FS was proposed for text clustering. The results demonstrate that the M3FS method has good clustering performance and feature selection capability.

(Bouguila et al., 2012) discussed two problems. The first is that many irrelevant features may affect the speed, and also compromise the accuracy, of the learning algorithm used; the second is the presence of outliers, which affects the parameters of the resulting models. For this reason, the researchers suggested an algorithm that partitions a given data set without a priori information about the number of clusters, together with a novel statistical mixture model, based on the Gamma distribution, which makes explicit what data or features have to be ignored and what information has to be retained. The performance of this finite mixture model was evaluated on different applications involving data analysis, real data and object-shape clustering. The experiments proved that the approach has excellent modelling capabilities and that feature selection combined with outlier detection significantly influences clustering performance.

(McCallum and Nigam, 1998b, Lewis, 1998) discussed the history of naïve Bayes in information retrieval and presented a theoretical comparison of the multinomial and the multivariate Bernoulli models (the latter also called the binary independence model).

Compared to Indo-European languages (like English), the Arabic language has an extremely rich morphology and a complex orthography. This is one of the main reasons (El-Halees, 2007, Duwairi, 2006, MESLEH, 2007) behind the lack of research in the field of Arabic text classification. Nevertheless, many machine learning approaches have been proposed to classify Arabic documents: the Support Vector Machine (SVM) classifier with the Chi-square feature extraction method (MESLEH, 2007), the naïve Bayesian method (2004), k-Nearest Neighbours (Al-Shalabi et al., 2006), distance-based classifiers, and the Rocchio algorithm (Syiam et al., 2006).

Sawaf, Zaplo and Ney (Sawaf et al., 2001) used the maximum entropy method for Arabic document clustering. Initially, documents were randomly assigned to clusters; in subsequent iterations, documents were shifted from one cluster to another if an improvement was gained, and the algorithm terminated when no further improvement could be achieved. Their text classification method is based on unsupervised learning.

El-Kourdi, Bensaid, and Rachidi (2004) used a naïve Bayesian classifier to classify an in-house collection of Arabic documents. They concluded that there is some indication that the performance of the naïve Bayesian algorithm in classifying Arabic documents is not sensitive to the Arabic root extraction algorithm. In addition to their own root extraction algorithm, they used other root extraction algorithms, such as those suggested by (Baeza-Yates and Ribeiro-Neto, 1999, Al-Shalabi and Evens, 1998).


Duwairi (Duwairi, 2006) proposed a distance-based classifier for Arabic TC tasks, where the Dice measure was used as a similarity measure. In this work, each category was represented as a vector of words. In the training phase, the text classifier scanned training documents to extract features that best capture inherent category-specific properties. Documents were then classified on the basis of their closeness to the categories' feature vectors.

El-Halees (El-Halees, 2007) implemented a maximum-entropy-based classifier to classify Arabic documents. Compared with other text classification systems (such as those of El-Kourdi et al. and Sawaf et al.), the overall performance of the system was good (in these comparisons, El-Halees used the results as recorded in the published papers mentioned above).

Hmeidi, Hawashin and El-Qawasmeh (Hmeidi et al., 2008) reported a comparative study of SVM and K-Nearest Neighbour (KNN) classifiers on Arabic text classification tasks. They concluded that the SVM classifier shows a better micro-averaged F1-measure.

Al-Saleem (Alsaleem, 2011) proposed automated Arabic text classification using the SVM and NB classification methods. These methods were investigated on different Arabic datasets, and several text evaluation measures were used. The experimental results on the different Arabic text categorization datasets showed that the SVM algorithm outperforms NB with regard to all measures (recall, precision and F-measure): the F-measure of SVM was 77.8%, versus 74% for NB.

Al-Diabat et al. (Al-diabat, 2012) investigated the problem of Arabic Text Classification (ATC) using rule-based classification approaches.


Since feature selection is a key factor in the accuracy and effectiveness of the resulting classification, the authors in (Chantar and Corne, 2011) proposed Binary Particle Swarm Optimisation (BPSO) for feature selection in Arabic text classification. The aim of applying BPSO/KNN is to find a good subset of features to facilitate the task of Arabic text categorization. SVM, naïve Bayes and the C4.5 decision tree were applied as classification algorithms. The suggested method was effective and achieved satisfactory classification accuracy.

Particle swarm optimization has also been used to achieve excellent feature selection in (Al-Saleem, 2010).

In 2013 (Abuaiadah, 2013), it was reported that multiword features were implemented to improve Arabic information retrieval. Multiword features are represented as a mixture of words appearing within windows of varying size. Multiword features were applied with two similarity functions, the Dice similarity function and the cosine similarity function, to improve the outcome of Arabic text classification. According to the results achieved, the Dice function performs better than the cosine function. With the Dice similarity function, the frequencies of the features in the document are ignored and only their existence is recognized.

In (Alsaleem, 2011) the investigator concentrated on single-label assignment only. The goal of that paper was to present and compare results obtained on a Saudi newspaper Arabic text collection using the SVM algorithm and the NB algorithm. The experiments show that the SVM classifier achieved better results than the NB classifier.

In (Zrigui et al., 2012) Latent Dirichlet Allocation (LDA) was proposed as a text feature. LDA was used to index and represent Arabic texts. The main idea behind LDA is that documents are represented as random mixtures over latent topics, where each topic is described by a distribution over words. SVM was used for the classification task. The LDA-SVM algorithm achieved high effectiveness for Arabic text classification, exceeding SVM without LDA, naïve Bayes and KNN classifiers.

2.2. Summary

In chapter 2, different text classification algorithms were described. Some of the literature covers text classification in general, while other work addresses the Arabic language specifically. Finally, some papers on the multinomial mixture model were reviewed.


    3. Chapter three: Methodology


3.1. Introduction

There are many approaches that can be used in text classification. Here, KNN, Rocchio and naïve Bayes using the MMM model have been implemented, and these algorithms have been applied to the same datasets.

The main aim of applying TC to the Arabic language is to improve performance over information retrieval without TC. Many steps are needed to implement the TC task; all phases are explained in section 3.2.

An IR process starts with the submission of a query, which describes a user's topic, and finishes with a set of ranked results estimated by the IR system's ranking scheme to be the most relevant to the query (Ajayi et al.).

Recall and precision are well-known measures that can be used to evaluate any IR system, and the efficiency of the system can be determined using them.

This chapter is divided into three main sections. Section 3.1 gives an overview of the project. Section 3.2 presents the main text classification system architecture. Section 3.3 gives a short summary of chapter three.

3.2. System Architecture

The text classification technique is implemented by passing through several phases, executed sequentially to facilitate the TC task. Uncategorized documents are pre-processed by removing punctuation marks and stopwords. Every document is then represented either as a vector of words only, or as a vector of words together with their frequencies and the number of documents in which these words appear (inverse document frequency). Stemming is used to decrease the dimensionality of the documents' feature vectors. The accuracy of the classifier is computed using recall, precision and F-measure.


3.2.2. Pre-processing

Pre-processing can be defined as the process of filtering out words that may not give any meaning to a text and that may not be useful in information retrieval systems; such words are called stopwords (Al-Maimani et al.). The purpose of applying pre-processing is to transform documents into a representation suitable for the classification task. In addition, it reduces the size of the information, which may make the search operation faster. Pre-processing proceeds as follows:

- Documents in formats such as HTML, SGML and XML are converted to plain text format.

- Digits and punctuation marks are removed from each document.

Tokenization: tokenization divides the document into a set of tokens (words).

Stopword removal: there are two kinds of terms in any document. The first kind, called stopwords, occur commonly in all documents and may not give any meaning to the document; the second kind can be described as keywords or features. Stopwords (punctuation marks, formatting tags, prepositions, pronouns, conjunctions and auxiliary verbs) are removed to reduce the text size and save processing time. Removing these high-frequency words is essential because they may cause documents to be misclassified (Uğuz).

Normalization: this is an essential phase that reduces the many forms in which words with the same meaning can be written. Arabic suffers from a very common problem in which a single word is written in many forms, like بدأ، بدا and بدآ (which mean start in English). Table 3.1 shows some letters that are normalized.
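A minimal sketch of this pre-processing chain (digit/punctuation removal, tokenization, stopword removal, and letter normalization) is shown below. The normalization rules used (alef variants to bare alef, teh marbuta to heh, alef maqsura to yeh) are common choices in Arabic IR and stand in here for the original table 3.1; the stopword list is a tiny illustrative sample.

```python
import re

# Assumed normalization rules in the spirit of table 3.1.
NORMALIZE = {"أ": "ا", "إ": "ا", "آ": "ا", "ة": "ه", "ى": "ي"}
STOPWORDS = {"في", "من", "على", "و"}  # tiny illustrative stopword list

def preprocess(text):
    """Strip digits/punctuation, tokenize, remove stopwords, and normalize letters."""
    text = re.sub(r"[\d\W]+", " ", text)  # digits and punctuation become spaces
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return ["".join(NORMALIZE.get(ch, ch) for ch in tok) for tok in tokens]

print(preprocess("بدأ اللاعب في المباراة 2013."))  # ['بدا', 'اللاعب', 'المباراه']
```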


MIDF is a term frequency normalized over the collection, which provides the correct terms for learning; MIDF performs better than existing term-weighting schemes such as TF.IDF and WIDF. Indexing a document is the method of characterizing its content for the purpose of making subsequent retrieval from document storage easy. The index terms of information retrieval systems are word stems automatically derived from a document and weighted according to their distribution in a document collection. Automatic indexing is the process of producing the descriptors (index terms) of a text automatically (Lahtinen, 2000). Automatically indexing an information source saves time, since most of the precise human effort can be performed by a machine. In this indexing approach, the order of terms in the vector is ignored. In information retrieval systems, index terms are usually weighted according to their importance for describing documents, and typically the weighting schemes are based on detecting word frequencies across the document collection (Obaseki).

The vector of words, also called a vector of weighted terms, consists of all distinct terms that appear in all training documents. It consists of the term frequency, which measures the number of times term i appears in document j, and the Inverse Document Frequency (IDF), which is based on the number of documents in the collection in which term i appears.

Term Frequency-Inverse Document Frequency (TF-IDF) is used in this work as one of the most popular weighting schemes. It considers not only term frequencies in a document, but also the frequencies of a term in the entire collection of documents (Moraes et al.). The classic TF-IDF scheme assigns to term t a weight in document d as:

\( \mathrm{TFIDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i \)    (3.1)

Thus, TF-IDF weighting assigns a high degree of importance to terms occurring frequently in only a few documents of a collection. The Inverse Document Frequency (IDF) for term T_i is calculated as follows:

\( \mathrm{IDF}_i = \log \frac{N}{\mathrm{DF}_i} \)    (3.2)

where N is the total number of documents in the collection and DF_i (the document frequency of term T_i) is the number of documents in which T_i occurs.

Automatic indexing typically relies on word frequencies. If a word occurs frequently in one document but does not occur in many other documents, it is probably an appropriate document descriptor, and it should be weighted highly by the indexer, as in the sketch below.
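As an illustration, the following minimal sketch computes TF-IDF weights for a small tokenized collection according to equations 3.1 and 3.2; it assumes raw in-document counts are used as the TF component, which is only one of several common variants.

    import math
    from collections import Counter

    def tf_idf(docs):
        """TF-IDF weights (equations 3.1 and 3.2) for a list of token lists.

        Returns one {term: weight} dictionary per document.
        """
        n = len(docs)
        # DF_i: the number of documents in which term T_i occurs.
        df = Counter(term for doc in docs for term in set(doc))
        weights = []
        for doc in docs:
            tf = Counter(doc)  # raw term counts as TF
            weights.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
        return weights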

Feature selection:
Feature sub-set selection (FSS) is one of the important pre-processing steps of machine learning and an essential task for text classification. Feature selection methods study how to choose a subset of attributes that are used to construct models describing the data (Khushaba et al.). Many FSS methods have been applied to Arabic text (Al-Ani et al., Mesleh and Kanaan, 2008).

According to previous related work, the FSS approach has been proven to provide several advantages for a text classification system: it is very effective in reducing dimensionality, removing irrelevant and redundant terms from documents and decreasing computational complexity. In addition, FSS increases learning accuracy and improves classification efficiency and scalability by making the classifier simpler and faster to build. On the other hand, FSS may decrease the classifier's accuracy (Mesleh, Singh et al., Khushaba et al., Al-Ani et al.).
Since the number of features in a text classification task is huge and redundant, it is important to examine how to select the best features in order to achieve better efficiency.

Many FSS algorithms have been tested and compared in text classification systems. For example, Chi-square and fallout achieved satisfactory results in Arabic TC tasks, and Ant Colony Optimization (ACO), an optimization algorithm derived from the study of real ant colonies, is one of the promising approaches to better feature selection.
To classify a new document, it is pre-processed by removing punctuation marks and stopwords, followed by extracting the roots of the remaining keywords. The feature vector of the new document is then compared with the feature vectors of all categories, and the document is ultimately assigned to the category with the maximum similarity.

3.2.3. Classifiers
Many types of classifiers have been applied and evaluated in the text classification area, and the results differ considerably from one classifier to another, since every classifier has its own specific algorithm. In the following, several kinds of classifiers are explained, with their advantages and drawbacks according to the achieved results.

3.2.3.1. Support vector machine (SVM)
The support vector machine (SVM) classifier has been widely applied in the text classification area (Alsaleem, 2011, Zrigui et al., 2012, Mesleh and Kanaan, 2008).


The KNN has advantages such as simplicity, being non-parametric, and showing very good performance on text categorization tasks for the Arabic language. On the other hand, KNN has drawbacks: it is difficult to find the optimal value of k, and classification time is long because the distance from each query instance to all training samples has to be computed. In addition, this classifier is called a lazy learning system, because it does not involve a true training phase (Wan et al.).

The major steps in applying the k-nearest neighbor classifier are:
-Pre-process the documents in the training set.
-Choose the parameter value K, that is, the number of nearest neighbors of d in the training data.
-Determine the distance between the testing document (d) and the training documents (previous classes).
-Sort the distances and determine the neighbors based on the minimum of the k distances.

To classify an unknown document, the KNN classifier ranks the document's neighbors among the training documents and uses the class labels of the k most similar neighbors. The similarity score of each nearest-neighbor document to the test document is used as a class weight for that neighbor. If a specific category is shared by more than one of the k nearest neighbors, then the sum of the similarity scores of those neighbors gives the weight of that particular shared category (Mitra et al., 2007), as in the sketch below.
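A minimal sketch of this similarity-weighted voting in Python, assuming cosine similarity over sparse term-weight vectors (function and parameter names are illustrative):

    import math
    from collections import defaultdict

    def cosine(u, v):
        """Cosine similarity between two sparse {term: weight} vectors."""
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def knn_classify(test_vec, training, k):
        """training: list of (vector, class label) pairs.

        Each of the k nearest neighbours votes for its own class, weighted
        by its similarity to the test document; the best class wins.
        """
        sims = sorted(((cosine(test_vec, vec), label) for vec, label in training),
                      key=lambda s: s[0], reverse=True)[:k]
        scores = defaultdict(float)
        for sim, label in sims:
            scores[label] += sim
        return max(scores, key=scores.get)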

An example of KNN classification is shown in figure 3.2.a. The document X is assumed to be the test sample, which should be classified either to the first category of white circles or to the second category of black circles. If k = 1, the document X will be classified to the white category, because its single nearest neighbor is a white circle.


to represent each document and class. The vector representing a class (c) is called the prototype or centroid (Ko et al., 2004, Kanaan et al., 2009a). The prototype for each class is calculated by subtracting the average of all documents that do not appear in class C from the average of all documents that appear in class C:

\( \vec{c}_j \;=\; \beta \, \frac{1}{|C_j|} \sum_{\vec{d} \in C_j} \vec{d} \;-\; \gamma \, \frac{1}{|D - C_j|} \sum_{\vec{d} \in D - C_j} \vec{d} \)    (3.4)

where β and γ are parameters that adjust the relative impact of the positive and negative training examples.

Practically, in text classification Rocchio calculates the similarity between the test document and each of the prototype vectors. The test document is then assigned to the category with the maximum similarity score, as in the sketch below.
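A minimal sketch of the prototype computation of equation 3.4 in Python; the beta and gamma defaults are illustrative values, not the settings used in the experiments, and both document sets are assumed non-empty:

    from collections import defaultdict

    def rocchio_prototype(docs, in_class, beta=16.0, gamma=4.0):
        """Prototype (centroid) vector of one class, following equation 3.4.

        docs: sparse {term: weight} vectors; in_class: parallel booleans
        marking membership in class C.
        """
        pos = [d for d, member in zip(docs, in_class) if member]
        neg = [d for d, member in zip(docs, in_class) if not member]
        prototype = defaultdict(float)
        for d in pos:  # positive examples, weighted by beta
            for term, w in d.items():
                prototype[term] += beta * w / len(pos)
        for d in neg:  # negative examples, weighted by gamma
            for term, w in d.items():
                prototype[term] -= gamma * w / len(neg)
        return prototype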

3.2.3.4. Naïve Bayes
The Naïve Bayes classifier uses a probabilistic model of text and achieves good performance on the TC task for Arabic text (Kanaan et al., 2009b). It has been noted (Noaman et al., Zhang and Gao) that NB is a simple probabilistic classifier based on applying Bayes' theorem; the conditional probability P(c|d) for each class can be computed as:

\( p(c \mid d) = \frac{P(c) \, P(d \mid c)}{P(d)} \)    (3.5)

where P(c) is the prior probability of a document occurring in class c. Frequently, each document d in text classification is represented as a vector of words (v_1, v_2, ..., v_n); the above equation then becomes:


\( p(c \mid d) = \frac{P(c) \prod_{k=1}^{n} P(v_k \mid c)}{P(d)} \)    (3.6)

P(d) is constant for all categories.

\( P(v_k \mid c) = \frac{f}{F} \)    (3.7)

where f is the frequency of word v_k in the test document and F is the number of documents in which word v_k appears. Note that, to avoid zero probabilities, add-one (Laplace) smoothing is applied, and the equation becomes:

\( P(v_k \mid c) = \frac{f + 1}{F + w} \)    (3.8)

where w equals the number of training documents in category c. The Bayes classifier computes the posterior probability of document d falling into each class separately and assigns the document to the class with the highest probability, that is:

\( c^{*} = \underset{1 \le j \le |C|}{\arg\max} \; p(c_j \mid d) \)    (3.9)

where |C| is the total number of classes (Duwairi, 2007a).


Naïve Bayes is frequently used for categorization in text classification due to its speed and simplicity. There are two event models of Naïve Bayes: the multinomial model and the Bernoulli model (Prasad). In the Bernoulli model, a test document is classified using binary occurrence information, and the number of occurrences is ignored, whereas the multinomial model keeps track of multiple occurrences (Zhong and Ghosh, 2003, McCallum and Nigam, 1998b).

3.2.3.5. Multinomial Mixture Model
It is necessary to clarify exactly what is meant by MMM. It models the distribution of words in a document as a multinomial: a document is treated as a sequence of words, and it is assumed that each word position is generated independently of every other (Rennie et al., 2003). In text classification, the use of class-conditional multinomial mixtures can be seen as a generalization of the Naive Bayes text classifier that relaxes its (class-conditional feature) independence assumption (Civera and Juan, 2005). When a test document is classified, an MMM keeps track of multiple occurrences, in contrast with a model such as the Bernoulli model (Zhong and Ghosh, 2003), which uses binary occurrence information and ignores the number of occurrences. Since an MMM keeps the occurrence information of all words (frequency, position), it makes the classification task easier.

\( p(c \mid d) = \frac{P(c) \, P(d \mid c)}{P(d)} \)    (3.10)

\( P(c_j) = \frac{n_j + 1}{n + L} \)    (3.11)


where n_j is the number of documents in class c_j, n is the number of documents in the training set D, and L is the number of classes (Chen et al., 2009).

\( P(d \mid c) = P(|d|) \, |d|! \, \prod_{k} \frac{P(w_k \mid c)^{n_k}}{n_k!} \)    (3.12)

where n_k in equation 3.12 is the number of occurrences of word w_k in document d; this factorial factor is the same for every class, so it cancels when classes are compared. The word probabilities are estimated with add-one smoothing:

\( P(w_k \mid c_j) = \frac{1 + n_k}{n + n_j} \)    (3.13)

where n_k in equation 3.13 is the number of documents in class c_j that contain word w_k. A sketch of training and classification with this model follows.
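A minimal sketch of training and classification with such a multinomial model in Python, in the spirit of equations 3.9 to 3.13; it works in log space, drops the class-independent factorial factor of equation 3.12 (which does not affect the arg max), and uses add-one smoothing over the vocabulary, a common variant of equation 3.13:

    import math
    from collections import Counter

    def train_mmm(docs, labels):
        """Estimate class priors (eq. 3.11) and smoothed word probabilities."""
        classes = sorted(set(labels))
        vocab = {t for doc in docs for t in doc}
        priors, word_probs = {}, {}
        for c in classes:
            class_docs = [d for d, l in zip(docs, labels) if l == c]
            priors[c] = (len(class_docs) + 1) / (len(docs) + len(classes))
            counts = Counter(t for d in class_docs for t in d)
            total = sum(counts.values())
            # Add-one smoothing over the vocabulary (a variant of eq. 3.13).
            word_probs[c] = {t: (counts[t] + 1) / (total + len(vocab))
                             for t in vocab}
        return priors, word_probs

    def classify_mmm(doc, priors, word_probs):
        """Return the class with the highest posterior (equation 3.9)."""
        def log_posterior(c):
            probs = word_probs[c]
            return math.log(priors[c]) + sum(math.log(probs[t])
                                             for t in doc if t in probs)
        return max(priors, key=log_posterior)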

3.2.4. Evaluation
There are many retrieval systems on the market, but which one is the best? That depends on the results each of them produces. An important issue for information retrieval systems is the notion of relevance: the purpose of an information retrieval system is to retrieve all the relevant documents (recall) and no non-relevant documents (precision). Recall and precision are defined as:

Precision: the ability to retrieve top-ranked documents that are mostly relevant.

\( \mathrm{Precision} = \frac{\text{number of relevant documents retrieved}}{\text{total number of documents retrieved}} \)    (3.14)

The maximum (and optimal) precision value is 100%; the worst possible precision, 0%, is obtained when not a single relevant document is found.

Recall: the ability of the search to find all of the relevant items in the corpus.

\( \mathrm{Recall} = \frac{\text{number of relevant documents retrieved}}{\text{total number of relevant documents}} \)    (3.15)

One substantial aspect of the results is how many of the relevant documents in a collection have been found. Recall shows how many of the relevant documents a user could possibly come across when reading all documents in the result set; therefore, a higher level of recall indicates a better system. Note that while the number of relevant items retrieved and the total number of items retrieved are both available to the recall measure, the total number of relevant items is usually not available.

The most essential averages are the micro-average, which counts each document as equally important, and the macro-average, which counts each category as equally important (see section 4.3 for further details). A perfect information retrieval system is achieved when both recall and precision equal one.

F1-measure: a measure of effectiveness that combines the contributions of precision and recall. The well-known F1 measure function is used to test the performance of information retrieval systems and is defined as:

\( F_1 = \frac{2 \cdot \mathrm{Pr} \cdot \mathrm{Re}}{\mathrm{Pr} + \mathrm{Re}} \)    (3.16)

Fallout: another measure that can be used to evaluate information retrieval systems. Although recall and precision are considered good evaluation measures, they do not take into account the number of irrelevant documents in the collection; as a result, recall is undefined when there is no relevant document in the collection, and precision is undefined when no document is retrieved. Fallout, in contrast, takes the number of irrelevant documents in the collection into account. It can be seen as the counterpart of recall computed over the irrelevant documents, which means that a good system should have high recall and low fallout.
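Under its standard definition, fallout is the fraction of the irrelevant documents in the collection that the system nevertheless retrieves:

\( \mathrm{Fallout} = \frac{\text{number of irrelevant documents retrieved}}{\text{total number of irrelevant documents in the collection}} \)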

3.3. Summary
This chapter gave a short introduction to information retrieval and described the common tasks of a TC system. Using the multinomial mixture model as a machine learning algorithm is nowadays a popular approach. In the rest of the chapter, three interesting kinds of TC algorithms were described briefly.


    4. Chapter four: Experiments and Evaluation


4.1. Introduction
Automatic text classification is defined as classifying unlabelled documents into predefined categories based on their contents. It has become an important topic due to the increased number of documents on the internet that people have to deal with daily, which in itself has led to an urgent need to organize them. In this chapter, the experiments are carried out and the performance of the Rocchio algorithm, the traditional k-NN and Naïve Bayes using the MMM classifier is documented. These classifiers are evaluated with several measures in order to determine whether Naïve Bayes using the MMM outperforms the other classifiers. The rest of this chapter is organized as follows: section 4.2 discusses the preparation of the data set for evaluation, section 4.3 lists the performance measures, section 4.4 discusses the evaluation results, section 4.5 discusses the results of the MMM with 5070 documents, section 4.6 gives the summary, section 4.7 presents the conclusion and future work, and section 4.8 lists the references.

4.2. Data set preparation
The corpus was downloaded from (SAAD, 2010). The documents are classified into nine categories; the categories and the number of documents in each of them appear in table 4.1. The total number of documents is 1445, and the documents vary in length. The nine categories are: Computer, Economics, Education, Sport, Politics, Engineer, Medicine, Law, and Religion. After the pre-processing was applied to all the documents, a copy of the pre-processed documents was converted into the Attribute-Relation File Format (ARFF) in order to be suitable for the Weka tool; a minimal sketch of such a conversion is given below.
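The sketch assumes one string attribute per document (real Weka pipelines typically apply the StringToWordVector filter to such a file afterwards); the file name and attribute names are illustrative:

    def to_arff(docs, labels, categories, path="corpus.arff"):
        """Write pre-processed documents to a minimal ARFF file for Weka.

        docs: token lists; labels: parallel class names; categories: the
        full set of class names used to declare the nominal class attribute.
        """
        with open(path, "w", encoding="utf-8") as f:
            f.write("@RELATION arabic_corpus\n\n")
            f.write("@ATTRIBUTE text STRING\n")
            f.write("@ATTRIBUTE class {%s}\n\n" % ",".join(categories))
            f.write("@DATA\n")
            for tokens, label in zip(docs, labels):
                text = " ".join(tokens).replace("'", " ")  # avoid quote clashes
                f.write("'%s',%s\n" % (text, label))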


NO | Category | Number
1 | Medicine | 3
2 | Economics | 2
3 | Religion |
4 | Sport | 3
5 | Politics | 41
6 | Engineer | 44
7 | Law |
8 | Computer | 27
9 | Education | 8

Table 4.1: the number of documents for each category

4.3. Performance measures
The performance of a text classification algorithm is understood as its computational efficiency and classification effectiveness. When a large number of documents is categorized into many categories, the efficiency of text classification must be taken into account. The effectiveness of text classification is measured by precision and recall (Kanaan et al., 2009a). Precision and recall are defined as follows:


\( \mathrm{Recall} = \frac{tp}{tp + fp}, \quad tp + fp > 0 \)    (4.1)

\( \mathrm{Precision} = \frac{tp}{tp + fn}, \quad tp + fn > 0 \)    (4.2)

where tp counts the documents that the classifier assigned correctly, fn counts the documents that the classifier assigned incorrectly, fp counts the not assigned but incorrect cases, and tn counts the not assigned and correct cases, as shown in table 4.2.

                    | Correct decision by expert
Classifier decision | YES is correct | NO is incorrect
Assigned (YES)      | tp             | fn
Not assigned (NO)   | fp             | tn

Table 4.2: confusion matrix for the performance measures

Precision is the fraction of retrieved instances that are relevant, as in equation 4.2, while recall is the fraction of relevant instances that are retrieved, as in equation 4.1. Both precision and recall are therefore based on an understanding and measure of relevance. Precision and recall values often depend on parameter tuning; that means there is a trade-off between precision and recall. This is why another measure that combines both precision and recall is used: the F-measure, which is defined as follows:


\( F\text{-}measure = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \)    (4.3)

To evaluate the performance across categories, the F-measure is averaged. There are two kinds of averaged values, namely the micro average and the macro average (MESLEH, 2007). For obtaining estimates of precision and recall relative to the whole category set, two different methods may be adopted:

Category set C = {c_1, ..., c_|C|} | Expert judgments: YES | Expert judgments: NO
Classifier judgments: YES | TP = \( \sum_{i=1}^{|C|} TP_i \) | FN = \( \sum_{i=1}^{|C|} FN_i \)
Classifier judgments: NO  | FP = \( \sum_{i=1}^{|C|} FP_i \) | TN = \( \sum_{i=1}^{|C|} TN_i \)

Table 4.3: The global contingency table

Macro-averaging: precision and recall are first evaluated locally for each category, and then globally by averaging over the results of the different categories:

\( \mathrm{Pr}^{M} = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{TP_i}{TP_i + FN_i} \qquad \mathrm{Re}^{M} = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{TP_i}{TP_i + FP_i} \)

Table 4.4: Macro-averaged precision and recall

Micro-averaging: precision and recall are obtained by globally summing over all individual decisions. For this, the global contingency table of table 4.3, obtained by summing over all category-specific contingency tables, is needed. A sketch of both averaging schemes follows.
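A minimal sketch of both averaging schemes in Python, following the tp/fn/fp layout of table 4.2 and assuming every class has at least one assigned document:

    def macro_micro(per_class):
        """Macro- and micro-averaged precision and recall.

        per_class: {category: (tp, fn, fp)} in the notation of table 4.2,
        where precision = tp / (tp + fn) and recall = tp / (tp + fp).
        """
        n = len(per_class)
        macro_p = sum(tp / (tp + fn) for tp, fn, fp in per_class.values()) / n
        macro_r = sum(tp / (tp + fp) for tp, fn, fp in per_class.values()) / n
        # Micro-averaging first sums the per-category tables (table 4.3).
        TP = sum(tp for tp, fn, fp in per_class.values())
        FN = sum(fn for tp, fn, fp in per_class.values())
        FP = sum(fp for tp, fn, fp in per_class.values())
        return macro_p, macro_r, TP / (TP + FN), TP / (TP + FP)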


In single-label classification, as implemented in these experiments, micro-averaged precision equals micro-averaged recall (Rocchio, 1971) and both equal F1, so only micro-F1 is reported for the micro-averaged results.

4.4. Evaluation Results
The results were obtained for each of the k-nearest neighbor, Rocchio and Naïve Bayes using MMM classifiers, as follows:

4.4.1. Naïve Bayes algorithm using the MMM
Table 4.6 shows the confusion matrix for the Naïve Bayes using MMM algorithm. The number reported in an entry of the confusion matrix corresponds to the number of documents that are known to actually belong to the category given by the row header of the matrix, but that were assigned by NB using the MMM to the category given by the column header.

As shown in table 4.6, 67 documents of the category Computer are classified correctly into the Computer category, while 3 documents of Computer are classified incorrectly: 2 of these 3 documents are classified as Education and 1 as Law. The best classification is for the category Sport, where 231 documents of this category are classified correctly. The lowest value of correctly classified documents is for the Education category, where 56 documents are classified correctly and 12 documents are classified incorrectly.


Table 4.6: Confusion matrix results for the NB using MMM algorithm

Figure 4.1 shows the recall, precision and f-measure for every category when the Naïve Bayes classifier was used. The precision reaches its highest value (1) for the Sport and Computer categories, while the lowest value of precision (0.812) is for the Education category. Recall reaches its highest value (0.996) for the Sport category and its lowest value (0.804) for the Law category. The f-measure reaches its highest value (0.998) for the Sport category and its lowest value (0.818) for the Education category. The rest of the figure is self-explanatory.


Table 4.8 shows the average of the above values over all categories for the MMM algorithm; the overall f-measure is 0.908, which is considered high.

Naïve Bayes using MMM | Precision | Recall | F-measure
Weighted average | 0.911 | 0.907 | 0.908

Table 4.8: NB using MMM classifier weighted average for the nine categories

4.4.2. Comparison of the MMM with other techniques and discussion of results
First, a comparison was made between the k-NN, Rocchio and Naïve Bayes classifiers. All the results for KNN and Rocchio are taken from (Kanaan et al., 2009a). A summary of the recall, precision and F1 measures is shown in table 4.9. Naïve Bayes gave the best F-measures with MiF1 = 0.9185 and MaF1 = 0.908, followed by kNN widf with MiF1 = 0.7970 and MaF1 = 0.7871, closely followed by Rocchio tf.idf with MiF1 = 0.7314 and MaF1 = 0.7882. A comparison of the MiF1 and MaF1 values is shown in figure 4.2.

Method | maP | maR | maF1 | miF1
kNN tf | 0.7100 | 0.5359 | 0.6100 | 0.5711
kNN tfidf | 0.8363 | 0.6902 | 0.7562 | 0.7272
kNN widf | 0.8094 | 0.7662 | 0.7871 | 0.7970
Rocchio tf | 0.5727 | 0.4501 | 0.5022 | 0.4427
Rocchio tfidf | 0.8515 | 0.7337 | 0.7882 | 0.7314
Rocchio widf | 0.7796 | 0.7199 | 0.7484 | 0.6968
Naïve Bayes | 0.911 | 0.907 | 0.908 | 0.9185

Table 4.9: Classifier comparison


Figure 4.2 shows the maF1 and miF1 values for all the classifiers (KNN, Rocchio and Naïve Bayes). From the figure, we can see that Naïve Bayes using the MMM obtained the highest value for both measures (maF1 and miF1).

    Figure 4.2: maF1, miF1 comparison for classifiers

Figure 4.3 shows the macro precision for all the classifiers. The highest value is for Naive Bayes using the MMM, followed by Rocchio, with KNN tf.idf not far behind Rocchio.

    Figure 4.3: maP comparison for classifiers


Figure 4.4 shows the macro recall for all the classifiers. The highest value is for Naive Bayes using the MMM, followed by KNN, with Rocchio not far behind KNN.

    Figure 4.4: maR Comparison for Classifiers

It is clear that the Naive Bayes classifier has the highest values for the three measures, and the KNN classifier comes in second place; the worst values for the three measures were for Rocchio. It can also be observed that there is a disproportion among the precision, recall and f-measure values for k-NN, which reaches a high value (0.83) for the precision measure and a very low one (0.53) for recall. As also shown, the precision, recall and f-measure values for the other two classifiers, Rocchio and Naïve Bayes, are more stable.


Figure 4.5: precision, recall and f-measure for the three classifiers

4.5. Results of the Naïve Bayes algorithm (MMM) with 5070 documents
Another experiment was conducted. The collected corpus, shown in table 4.10, contains 5070 documents that vary in length (SAAD, 2010). These documents fall into six categories: Business, Entertainment, Middle East news, Sport, World news, and Science and Technology.

NO | Category | Number
1 | Business | 836
2 | Entertainment | 474
3 | Middle East news | 1462
4 | Sport | 762
5 | World news | 1010
6 | Science and Technology | 526

Table 4.10: categories and their distributions in the corpus (5070 documents)


Table 4.11 shows the confusion matrix for the Naïve Bayes using MMM algorithm. The lowest value of correctly classified documents is for the Entertainment category, where 400 documents are classified correctly and 74 documents are classified incorrectly.

Table 4.11: Confusion matrix results for the NB algorithm in the corpus (5070 documents)

Figure 4.6 shows the recall, precision and f-measure for every category when the Naïve Bayes classifier was used. The precision reaches its highest value (0.991) for the Sport category, while the lowest value of precision (0.746) is for the Entertainment category. Recall reaches its highest value (0.979) for the Sport category and its lowest value (0.832) for the Middle East news category. The f-measure reaches its highest value (0.985) for the Sport category and its lowest value (0.792) for the Entertainment category. The rest of the figure is self-explanatory.


Figure 4.6 presents the precision and recall for all the categories classified using Naïve Bayes with the MMM.

    Figure 4.6: Result of the Naive Bayes classification algorithm

Table 4.13 shows the average of the above values over all categories for the NB algorithm; the overall f-measure is 0.884.


Naïve Bayes using MMM | Precision | Recall | F-measure
Weighted average | 0.882 | 0.890 | 0.884

Table 4.13: NB using MMM classifier weighted average for the six categories in the corpus (5070 documents)

Comparing the overall results from tables 4.8 and 4.13 shows a slight degradation in the precision, recall and F-measure performance. This is because the test set was still evaluated with 4-fold cross validation; with a percentage split instead, the results would be different, since the classifier would learn from more documents.

4.6. Summary
The Naive Bayes classifier using the MMM was evaluated against KNN and Rocchio, and it outperformed both the k-NN and Rocchio classifiers. The Naive Bayes (MMM) classifier achieved the best precision, with the other techniques coming after it.


4.7. Conclusion and Future Work
Text classification for the Arabic language has been investigated in this project. Three classifiers were compared: KNN, Rocchio and Naive Bayes using the Multinomial Mixture Model (MMM). Unclassified documents were pre-processed by removing stopwords and punctuation marks; the remaining words were stemmed and stored in feature vectors, so every test document has its own feature vector. Finally, each document is classified into the best class according to the classifier technique.

The accuracy of the classifiers was measured using recall, precision and the F-measure. For the project experiments, the classifiers were tested using 1445 documents. The results show that NB using the multinomial model outperformed the other two classifiers.
As future work, we plan to continue working on Arabic text categorization, since this area is not widely explored in the literature, and to try the classifiers on a larger collection. In particular, we plan to:
-Apply an auxiliary feature method with the multinomial model in order to improve classification accuracy.
-Compare the Naïve Bayes MMM model with different models, such as the multivariate Bernoulli model (Zhang and Gao).
-Evaluate BPSO feature selection with the multinomial classifier using the same Arabic database mentioned in (Chantar and Corne, 2011), and then compare the two sets of results.


20. CHANTAR, H. K. & CORNE, D. W. 2011. Feature subset selection for Arabic document categorization using BPSO-KNN. Nature and Biologically Inspired Computing (NaBIC), 2011 Third World Congress on. IEEE, 546-551.
21. CHEN, A. & GEY, F. C. 2002. Building an Arabic Stemmer for Information Retrieval. TREC.
22. CHEN, J., HUANG, H., TIAN, S. & QU, Y. 2009. Feature selection for text classification with Naïve Bayes. Expert Systems with Applications, 36, 5432-5435.
23. CIVERA, J. & JUAN, A. 2005. Multinomial Mixture Modelling for Bilingual Text Classification. Technical report DSIC-II/10/05, UPV.
24. COLLINS-THOMPSON, K. & ADVISER-CALLAN, J. 2008. Robust model estimation methods for information retrieval. Carnegie Mellon University.
25. DEISY, C., GOWRI, M., BASKAR, S., KALAIARASI, S. & RAMRAJ, N. 2010. A novel term weighting scheme MIDF for Text Categorization. Journal of Engineering Science and Technology, 5, 94-107.
26. DUWAIRI, R. 2007a. Arabic text categorization. The International Arab Journal of Information Technology, 7.
27. DUWAIRI, R. 2007b. Arabic Text Categorization. International Arab Journal on Information Technology, 4.
28. DUWAIRI, R. M. 2006. Machine learning for Arabic text categorization. Journal of the American Society for Information Science and Technology, 57, 1005-1010.
29. DUWAIRI, R. M. 2007c. Arabic Text Categorization. Int. Arab J. Inf. Technol., 4, 125-132.
30. EL-HALEES, A. 2007. Arabic text classification using maximum entropy. The Islamic University Journal (Series of Natural Studies and Engineering), 15, 157-167.
31. GAMON, M. & AUE, A. 2005. Automatic identification of sentiment vocabulary: exploiting low association with known sentiment terms. Proceedings of the ACL Workshop on Feature Engineering for Machine Learning in Natural Language Processing. Association for Computational Linguistics, 57-64.
32. GHWANMEH, S., KANAAN, G., AL-SHALABI, R. & ABABNEH, A. 2007. Enhanced Arabic Information Retrieval System based on Arabic Text Classification. Innovations in Information Technology, 2007. IIT '07. 4th International Conference on, 18-20 Nov. 2007, 461-465.
33. GOUDJIL, M., KOUDIL, M., HAMMAMI, N., BEDDA, M. & ALRUILY, M. Arabic text categorization using SVM active learning technique: An overview. Computer and Information Techno