
    Abbreviations

    Abbreviation Description

    IR Information Retrieval

    TC Text Classification

    ATC Arabic Text Classification

    WWW World Wide Web

    MMM Multinomial Mixture Model

    KNN K-Nearest Neighbor

NB Naïve Bayes

    SVM Support Vector Machine

    D Document

    C Class

    HTML Hyper Text Markup Language

    SGML Standard Generalized Markup Language

XML Extensible Markup Language

Re Recall

    Pr Precision

FSS Feature Subset Selection

VMF Von Mises-Fisher

    BPSO Binary Particle Swarm Optimisation

    LDA Latent Dirichlet Allocation


4.5. Results of the Naïve Bayes algorithm (MMM) with 5070 documents

4.6. Summary

4.7. Conclusion and Future Work

4.8. References


    Acknowledgements

    I would like to express my sincerest gratitude to my supervisor, Prof. Ghassan

    Kanaan, who has been exceptionally patient and understanding with me during

    my studies. Without his kind words of encouragement and advice this work

    would not have been possible.

I am extremely grateful to all the staff who have assisted me in the Department of Computer Sciences and Informatics, especially Prof. Alaa Al-Hamami.

Thanks also to all of my other colleagues in Computer Sciences and Informatics for making my time here an enjoyable experience.

I would like to thank the Libyan Embassy in Amman for taking care of me and supporting my studies.

The support of my family and friends has been much appreciated. Most importantly, I would like to thank my husband, Ali, and my children, to whom I am indebted for all of the moral and loving support they have given me during this time.


    Abstract

Text Classification (TC) assigns documents to one or more predefined categories based on their contents. This project focuses on comparing three automatic TC techniques, Rocchio, K-Nearest Neighbor (KNN) and the Naïve Bayes (NB) classifier using a multinomial mixture model (MMM), on the Arabic language. To evaluate these techniques, an Arabic TC corpus is used that consists of 1445 Arabic documents classified into nine categories: Computer, Economics, Education, Sport, Politics, Engineering, Medicine, Law, and Religion. The main goal of this project is to compare automatic text classification techniques using a multinomial mixture model on the Arabic language. The classification effectiveness is compared with the SVM model, which was applied in another project that used the same traditional classifiers and the same collection. The experimental results are presented in terms of macro-averaged precision, macro-averaged recall and macro-averaged F1 measures. The results reveal that naïve Bayes using the MMM works best for Arabic TC tasks and outperforms the k-NN and Rocchio classifiers.


    1. Chapter one: Introduction


1.1. Introduction

With the rapid development of the Internet, a large amount of Arabic information has become available online; this motivates researchers to develop tools that help people classify this huge volume of Arabic information.

1.1.1. Information retrieval

It is necessary to clarify exactly what is meant by an Information Retrieval (IR) system. An IR system is designed to analyse, process and store sources of information, and to retrieve those that match a particular user's requirements. In other words, an IR system calculates similarity scores between a query and a set of documents, and ranks the relevant documents based on those scores. There are two main issues in IR systems. The first is that the characterization of the user's information need is not always clear and must be transformed so that it can be understood by the IR system; this transformed need is known as the query, a short document containing a few words (Hasan, 2009). The second is the structure of the information: there are no standards or rules that control this structure, especially on the World Wide Web (WWW), and each language has its own characteristics and semantics. In addition, users need to find high-quality information that suits their requirements, and response time must be taken into account so that information is found quickly. These issues point to a very important topic: Text Classification (TC).

Information retrieval (IR) is a branch of computer science. The main objective of IR is to provide effective methods for satisfying information needs. Information that satisfies an information need is called relevant.


Figure 1.1: IR system components (Alnobani, 2008)

Three essential components can be used to represent an IR system:

Input: the set of available documents and the requested information (the query). The problem here is that all of this information must be converted into a form that is suitable for the computer to use.

The processor: this part of the retrieval system is concerned with the retrieval process. A retrieval algorithm is given a query, created by a user, that represents their information need. In the case of text, this query consists of a series of words, along with possibly a set of relations between them. When all the inputs are ready, the processor compares the query with the documents to satisfy the user's request exactly, or at least to retrieve the nearest results. The information to be found resides in a collection that consists of a set of documents.

Output: in response to the requested information (query), the retrieval algorithm scores the documents in the collection, ranking them according to some measure of how well the query terms and relations are matched by information in the document. For text, the relations most often used between terms are co-occurrence or proximity constraints. Traditional relevance also relies on the frequency with which terms occur in a document, and on how unusual terms are in the collection (Collins-Thompson and Adviser-Callan, 2008).
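To make the scoring step concrete, the following is a minimal sketch (not the thesis's implementation) of how a retrieval algorithm might rank a collection against a query using cosine similarity over term-count vectors; the toy documents and query are invented for illustration.

```python
import math
from collections import Counter

def cosine_score(query_terms, doc_terms):
    """Score a document against a query by the cosine of their term-count vectors."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

# Toy collection: each document is a list of tokens (illustrative only).
docs = {
    "d1": "information retrieval ranks relevant documents".split(),
    "d2": "arabic text classification assigns categories".split(),
}
query = "relevant documents retrieval".split()

# Rank the collection by descending score, as the output component does.
ranking = sorted(docs, key=lambda name: cosine_score(query, docs[name]), reverse=True)
print(ranking)  # d1 first: it shares more query terms
```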

It is necessary to understand how the retrieval process works. There are two main approaches that can be used to build a system that retrieves documents for a given query: an ad-hoc method, and a more principled, model-based method.

Ad-hoc information retrieval: in ad-hoc retrieval, the documents in the collection remain relatively static while new queries are submitted to the system. For example, when a query is compared with a set of documents, the documents that contain the query terms are retrieved. To improve upon this in an ad-hoc method, we could also decide to factor in the number of times a term appears in a document. Ad-hoc information retrieval has several benefits: an ad-hoc retrieval system is quick to build, and many ad-hoc retrieval methods are also very fast, needing only to look at the occurrences of query terms in the documents at query time. On the other hand, the weak points are that no model is built at all, and every change made to an ad-hoc retrieval system affects the retrieval in unpredictable ways. Additionally, it is very hard to understand exactly what is happening in these systems, and therefore what should be done to improve them. For this reason, researchers prefer to build models for information retrieval (Alnobani, 2008).

Model-based information retrieval: unlike the ad-hoc approach, here a retrieval model is built. If the model is built correctly, it captures the important aspects of a query and the documents needed for retrieval. Moreover, the benefit of building a model is that the system can be understood and controlled.


1.1.1.1. Information retrieval models

Several information retrieval models have been applied. The Boolean model is one of the earliest IR models; it offers clear formalism and simplicity. However, a major problem with this kind of model is its binary decision criterion, without any notion of a grading scale, together with the difficulty of translating a query into Boolean expressions. In addition, it is difficult to control the number of documents retrieved, because all matched documents are returned. Another popular retrieval model is the vector model. It has advantages such as a term-weighting scheme that improves retrieval performance by sorting documents according to their degree of similarity to the query, and a partial-matching strategy that approximates the query conditions. Despite the advantages mentioned above, the vector model suffers from drawbacks such as a lack of clean formalism and simplicity. The third information retrieval model is the probabilistic model, one of the most common frameworks for building principled information retrieval models. Probabilistic models assume the existence of a query and then model the generation of documents that are relevant and irrelevant to that query. Given a probabilistic generative model for the corpus, a probabilistic model must retrieve and rank documents. On the other hand, the probabilistic model has weak points: an initial definition of the relevant documents has to be assumed, and the weights ignore term frequency.

1.1.1.2. Evaluating information retrieval

Finally, the performance of any system is proven by the results it achieves. The most common measures of system efficiency are time and space: a shorter response time and smaller space usage indicate a better system. In addition, effectiveness is a measure of the ability of the system to retrieve relevant documents while at the same time holding back non-relevant ones; it can be measured by recall and precision (recall and precision are explained in more detail in chapter 3).

1.1.2. Text classification

A TC system classifies documents into a fixed number of predefined categories based on their content. Text classification may be either single-label, where exactly one category must be assigned to each document, or multi-label, where one or more categories can be assigned to each document. The main objective of using TC is to make IR results better than they would be without TC (Ghwanmeh et al., 2007). These advantages have led to the development of automatic text and document classification systems, which are capable of automatically organizing and classifying documents (Duwairi, 2007b).

The classification process can be done manually or automatically. It is interesting to note that manual categorization is a difficult and complex task, especially with huge amounts of information, because documents are classified one by one by human experts; completing this mission therefore takes a great deal of time. On the other hand, with the speedy growth of online text documents, automatic text categorization (TC) has become an essential tool for handling text documents efficiently and effectively (Wang et al.).

Text classification is the task of classifying a document under a predefined category. More formally, if d is a document from the entire set of documents D, and C = {c1, c2, ..., cn} is the set of all categories, then text classification assigns one category ci ∈ C to a document d.

Given the increasing amount of Arabic information on the Internet, classifying documents manually is not practical; automatic text classification has become an essential task that saves the human effort of manual classification. The optimal approach is therefore automatic classification, which makes use of the sciences of Arabic grammar and morphology, as well as thesauri and dictionaries. With these, the system can understand the main topics in a document. This is done using statistical methods that study the repetition of words within a document and then determine its context, which aids the search operation.

There are three essential stages in a TC system: document indexing, classifier learning, and classifier evaluation.

Document indexing: this is one of the most substantial issues in TC, and includes document representation and a term-weighting scheme. The bag-of-words model is the most common way to represent the content of text. This approach is simple because it records only the frequency of each word in a document. Moreover, for each predefined category, the synonyms and prefix words of the category are found, which helps assign a document to that category based on the synonyms or prefixes of a term. Some term-weighting schemes are described in detail in chapter 3.

Classifier learning: several machine learning algorithms have been applied to automatic text classification by supervised learning (Ko et al., 2004). A supervised learning algorithm finds a representation or judgment rule from an example set of labelled documents for each class (Ko and Seo, 2009). This can be illustrated briefly by Naïve Bayes (NB) (Chen et al., 2009, Noaman et al., Zhang and Gao), the Support Vector Machine (SVM) (Moraes et al., Wang and Chiang, Mesleh and Kanaan, 2008), k-Nearest Neighbour (k-NN) (Wan et al., Jiang et al.), Decision Trees (DT), Rocchio (Ko et al., 2004), voting, etc.
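To illustrate supervised classifier learning, here is a minimal multinomial naïve Bayes sketch with Laplace smoothing; the multinomial model is the same family as the MMM used later in this thesis, but this code is a simplification with an invented two-document training set, not the thesis's implementation.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(labelled_docs):
    """labelled_docs: list of (tokens, class). Returns class priors and smoothed term probabilities."""
    class_docs, class_terms, vocab = defaultdict(int), defaultdict(Counter), set()
    for tokens, c in labelled_docs:
        class_docs[c] += 1
        class_terms[c].update(tokens)
        vocab.update(tokens)
    priors = {c: n / len(labelled_docs) for c, n in class_docs.items()}
    cond = {}
    for c, counts in class_terms.items():
        total = sum(counts.values())
        # Laplace (add-one) smoothing so unseen terms do not zero out a class.
        cond[c] = {t: (counts[t] + 1) / (total + len(vocab)) for t in vocab}
    return priors, cond

def classify(tokens, priors, cond):
    """Assign the class with the maximum posterior log-probability."""
    def log_posterior(c):
        return math.log(priors[c]) + sum(math.log(cond[c][t]) for t in tokens if t in cond[c])
    return max(priors, key=log_posterior)

train = [("match goal team".split(), "Sport"), ("court law judge".split(), "Law")]
priors, cond = train_multinomial_nb(train)
print(classify("goal team win".split(), priors, cond))  # -> Sport
```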


Classifier evaluation: the effectiveness of a classifier is judged according to the results it achieves. Standard evaluation measures such as recall, precision and the F1-measure are used to evaluate the different classifiers.
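For concreteness, here is a small sketch of how recall, precision and F1 can be computed per category and then macro-averaged, as is done in the experiments later in the thesis; the confusion counts below are invented.

```python
def macro_scores(per_class_counts):
    """per_class_counts: {class: (true_positives, false_positives, false_negatives)}."""
    ps, rs, f1s = [], [], []
    for tp, fp, fn in per_class_counts.values():
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
        ps.append(p)
        rs.append(r)
    n = len(per_class_counts)
    # Macro-averaging gives every category equal weight, regardless of its size.
    return sum(ps) / n, sum(rs) / n, sum(f1s) / n

counts = {"Sport": (90, 10, 5), "Law": (40, 20, 15)}  # illustrative counts
print(macro_scores(counts))
```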

1.1.2.1. Classification Based on Supervised Learning

The goal of classification methods is to assign class labels to unlabelled text documents from a fixed number of known categories. Each document can belong to multiple categories, to exactly one, or to no category at all.

Supervised machine learning methods prescribe the input and output format. The input to these methods is a set of objects (the training data), and the output is the classes to which these objects belong.

The key advantage of supervised learning methods over unsupervised methods is that, by having clear knowledge of the classes that the different objects belong to, these algorithms can perform effective feature selection when that leads to better prediction accuracy.

Automatic text classification is treated as a supervised learning task. The goal of this task is to evaluate a Boolean function that determines whether a given document belongs to a category or not by looking at the synonyms or prefixes of that category (Deisy et al., 2010).

Sentiment classification can obviously be formulated as a supervised learning problem with two class labels (positive and negative). The training and testing data used in existing research are mostly product reviews, which is not surprising given the above assumption. Since each review at a typical review site already has a reviewer-assigned rating (e.g., 1-5 stars), training and testing data are readily available: typically, a review with 4-5 stars is considered a positive review (thumbs-up), and a review with 1-2 stars is considered a negative review (thumbs-down).

Sentiment classification is similar to, but also different from, classic topic-based text classification, which classifies documents into predefined topic classes (politics, science, sports, etc.). In topic-based classification, topic-related words are important. In sentiment classification, by contrast, topic-related words are unimportant; instead, sentiment or opinion words that indicate positive or negative opinions are important (e.g., great, excellent, amazing, horrible, bad, worst, etc.).

Existing supervised learning methods, such as naïve Bayes and support vector machines (SVM), can be readily applied to sentiment classification (Pang et al., 2002). This approach can be used to classify movie reviews into two classes, positive and negative. It was shown that using unigrams (a bag of individual words) as features in classification performed well with both naïve Bayes and SVM. Neutral reviews were not used in this work, making the problem easier. The features used here are data attributes used in machine learning, not the object features referred to in the previous section.

Subsequent research used many more kinds of features and techniques in learning. As in most machine learning applications, the main task of sentiment classification is to find a suitable set of features. Some example features used in research, and possibly in practice, are the following (Pang and Lee, 2008).

Terms and their frequency: these features are individual words or word n-grams and their frequency counts. Sometimes word positions may also be considered, and the TF-IDF weighting scheme from information retrieval may be applied too. These features are also commonly used in traditional topic-based text classification, and they have been shown to be quite effective in sentiment classification as well.

Part-of-speech tags: much early research found that adjectives are important indicators of subjectivity and opinion. Therefore, adjectives have been treated as special features.

    Opinion words and phrases: Opinion words are words that are commonly used

    to express positive or negative sentiments. For example, beautiful, wonderful,

    good, and amazing are positive opinion words, and bad, poor, and terrible are

    negative opinion words. Although many opinion words are adjectives and

    adverbs, nouns (rubbish, junk, crap, etc.) and verbs (hate, and like) can also

    indicate opinions. In addition to opinion words, there are also opinion phrases

    and idioms (cost someone an arm and a leg). Opinion words and phrases are

    helpful to sentiment analysis.

Syntactic dependency: word-dependency features generated from parsing or dependency trees have also been tried by several researchers.

Negation: clearly, negation words are important, since their appearance often changes the opinion orientation. For example, the sentence "I don't like this camera" is negative. Negation words must be handled with care because not all occurrences of such words mean negation; for example, "not" in "not only ... but also" does not change the orientation direction.
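As a rough sketch of how such features might be extracted, the snippet below builds unigram counts and marks a term that follows a negation word; the negation list and the NOT_ prefix convention are illustrative assumptions, not a method prescribed by the literature cited here.

```python
NEGATIONS = {"not", "don't", "never", "no"}  # illustrative negation list

def unigram_features(tokens):
    """Bag-of-words features; a token right after a negation word gets a NOT_ prefix."""
    features, negate = {}, False
    for tok in tokens:
        if tok.lower() in NEGATIONS:
            negate = True
            continue
        name = ("NOT_" if negate else "") + tok.lower()
        features[name] = features.get(name, 0) + 1
        negate = False  # this sketch negates only the immediately following token
    return features

print(unigram_features("I don't like this camera".split()))
# {'i': 1, 'NOT_like': 1, 'this': 1, 'camera': 1}
```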

Research has also predicted rating scores (Pang et al., 2002); in this case, the problem is formulated as a regression problem, since the rating scores are ordinal. Another research direction that has been investigated is transfer learning, or domain adaptation. As has been shown, sentiment classification is highly sensitive to the domain from which the training data are extracted: a classifier trained using opinionated texts from one domain often performs poorly when it is applied or tested on opinionated texts from another domain. The reason is that words, and even language constructs, used to express opinions in different domains can be substantially different. Sometimes the same word means positive in one domain but negative in another (Turney, 2002). For example, the adjective "unpredictable" may have a negative orientation in a car review ("unpredictable steering"), but a positive orientation in a movie review ("unpredictable plot"). Therefore, domain adaptation is needed. Existing research has used labelled data from one domain, unlabelled data from the target domain, and general opinion words as features for adaptation (Gamon and Aue, 2005).

1.1.2.2. Classification Based on Unsupervised Learning

Opinion words and phrases are the dominating indicators for sentiment classification, so using unsupervised learning based on such words and phrases is quite natural. The method used in (Turney, 2002) performs classification based on some fixed syntactic phrases that are likely to be used to express opinions. The algorithm consists of three steps.

Step 1: it extracts phrases containing adjectives or adverbs. The reason for doing this is that research has shown that adjectives and adverbs are good indicators of subjectivity and opinion. Although an isolated adjective may indicate subjectivity, there may be insufficient context to determine its opinion orientation. Thus, the algorithm extracts two consecutive words, where one member of the pair is an adjective or adverb and the other is a context word. For example, in the sentence "This camera produces beautiful pictures", the phrase "beautiful pictures" will be extracted, as it satisfies the first pattern.

Step 2: it estimates the orientation of the extracted phrases using the Pointwise Mutual Information (PMI) measure given in equation (1.1).


PMI(term1, term2) = log2 [ P(term1 & term2) / ( P(term1) · P(term2) ) ]        (1.1)

P(term1 & term2) is the co-occurrence probability of term1 and term2, and P(term1) · P(term2) is the probability that the two terms co-occur if they are statistically independent. The ratio between P(term1 & term2) and P(term1) · P(term2) is a measure of the degree of statistical dependence between them. The log of this ratio is the amount of information that we acquire about the presence of one of the words when the other is observed.

The semantic orientation (SO) of a phrase is computed based on its association with the positive reference word "excellent" and its association with the negative reference word "poor":

SO(phrase) = PMI(phrase, "excellent") − PMI(phrase, "poor")        (1.2)

The probabilities are calculated by issuing queries to a search engine and collecting the number of hits. For each search query, a search engine usually gives the number of documents relevant to the query, which is the number of hits. Thus, by searching for the two terms together and separately, we can estimate the probabilities in equation (1.1). (Turney, 2002) used the AltaVista search engine because it has a NEAR operator, which constrains the search to documents that contain the words within ten words of one another, in either order. Let hits(query) be the number of hits returned. Equation (1.2) can then be rewritten as:


SO(phrase) = log2 [ ( hits(phrase NEAR "excellent") · hits("poor") ) / ( hits(phrase NEAR "poor") · hits("excellent") ) ]        (1.3)

Step 3: given a review, the algorithm computes the average SO of all phrases in the review, and classifies the review as recommended if the average SO is positive, and as not recommended otherwise.
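A compact sketch of steps 2 and 3 follows. Live AltaVista NEAR queries are no longer possible, so the hits function here is a hypothetical stub with invented numbers; in practice it would be replaced by real corpus or search-engine counts.

```python
import math

def so_from_hits(hits, phrase):
    """Equation (1.3): semantic orientation of a phrase from NEAR-query hit counts."""
    num = hits(phrase, near="excellent") * hits("poor")
    den = hits(phrase, near="poor") * hits("excellent")
    return math.log2(num / den) if num and den else 0.0

def classify_review(phrases, hits):
    """Step 3: recommended if the average SO of the extracted phrases is positive."""
    avg = sum(so_from_hits(hits, p) for p in phrases) / len(phrases)
    return "recommended" if avg > 0 else "not recommended"

def fake_hits(term, near=None):
    """Stub standing in for search-engine hit counts (illustrative numbers only)."""
    table = {("beautiful pictures", "excellent"): 300, ("beautiful pictures", "poor"): 20,
             ("excellent", None): 10000, ("poor", None): 9000}
    return table.get((term, near), 1)

print(classify_review(["beautiful pictures"], fake_hits))  # -> recommended
```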

In this project, text classification is compared across three different classifiers, KNN, Rocchio and naïve Bayes, using MMMs. Building a TC system for the Arabic language is not an easy task compared with English, because the Arabic language has a very complex morphology (Albalooshi et al.). Moreover, testing the system is the most important stage of any IR system: it is used to determine the efficiency of the system and helps establish which system is better than another. The major goal of an IR system is to retrieve all the documents that are relevant to a user query while retrieving as few non-relevant documents as possible. The evaluation can be achieved with the recall and precision measures.

1.1.3. Arabic language

Applying text classification systems to the Arabic language is a challenging task because Arabic has a very complex morphology (Albalooshi et al.). The Arabic alphabet consists of 28 letters.

The characters ا، و and ي function as long vowels, and the rest of the letters are consonants. Arabic letters can be written in different forms, depending on the position of the letter in the word (beginning, middle, or end). For example, the letter ط has several shapes: at the beginning, as in طريق (which means road in English); in the middle, as in سطح (which means surface in English); and at the end, as in مطاط (which means rubber in English). Furthermore, the Arabic language contains diacritics, which are placed above or below the letters. The diacritics (fatha, kasra, damma, sukun, double fatha, double kasra, double damma and shadda) are used to clarify the meaning of words (Duwairi, 2007c). On top of that, when diacritics are not explicitly written, an Arabic text can have several meanings; this ambiguity negatively affects text classification. To avoid these problems, pre-processing can be applied to the Arabic language.

The Arabic language has a more complex morphology than English. Arabic is written from right to left. Arabic words have two genders, feminine and masculine; three numbers, singular, dual, and plural; and three grammatical cases, nominative, accusative, and genitive. A noun takes the nominative case when it is the subject, the accusative case when it is the object of a verb, and the genitive case when it is the object of a preposition. In addition, Arabic words are divided into three parts of speech: noun, verb, and particle. Noun and verb stems are derived from a few thousand roots by infixing; for example, words like حاسوب (computer), يحسب (he calculates) and نحسب (we calculate) are created from the root حسب (Duwairi, 2006). A noun is a name or a word that describes a person, thing, or idea.

Arabic verbs are similar to English verbs in that they are classified into perfect and imperfect. The perfect tense denotes completed actions, while the imperfect denotes incomplete actions. The imperfect tense has four moods: indicative, subjunctive, jussive, and imperative (Abboud and McCarus, 1983).

    Arabic particles include prepositions, adverbs, conjunctions, interrogative

    particles, exceptions, and interjections.


Most Arabic words are derived from the pattern فعل; all words following the same pattern have common properties and states. For example, the pattern فاعل indicates the subject of the verb, while the pattern مفعول represents the object of the verb.

An Arabic adjective can also have many variants. When an adjective modifies a noun in a phrase, the adjective agrees with the noun in gender, number, case, and definiteness. An adjective has a masculine singular form such as جديد (new), a feminine singular form such as جديدة, a masculine plural form such as جدد, and a feminine plural form such as جديدات (Chen and Gey, 2002).

In addition to the different forms of the Arabic word that result from the derivational process, most connectors, conjunctions, prepositions, pronouns, and possession markers are attached to the Arabic surface form as prefixes and suffixes. For instance, definite nouns are formed by attaching the article ال (the) to the immediate front of the noun. The conjunction و (and) is often attached to the following word. The letters ب، ك، ل and و can be added to the front of a word as prepositions. The suffix ة is attached to represent the feminine gender of a word, and some suffixes are added to represent possessive pronouns, such as ها (her), ي (my), and هم (their) (Chen and Gey, 2002; Zrigui et al., 2012).

In addition, Arabic has two kinds of plurals: sound plurals and broken plurals. Sound plurals are formed by adding plural suffixes to singular nouns. The plural suffix is ات for feminine nouns in all three grammatical cases, ون for masculine nouns in the nominative case, and ين for masculine nouns in the genitive and accusative cases. The formation of broken plurals is more complex and often irregular, and is therefore difficult to predict; furthermore, broken plurals are very common in Arabic. For example, the plural form of the noun طفل (child) is أطفال (children), which is formed by attaching the prefix أ and inserting the infix ا. The plural form of the noun كتاب (book) is كتب (books), which is formed by deleting the infix ا. The plural form of امرأة (woman) is نساء (women); here the plural form is completely different from the singular form (Chen and Gey, 2002).

1.2. The statement of the problem

IR systems have been widely used to assist users in discovering useful information on the Internet. Current IR systems are based on the similarity and term frequencies between the query (the user's requirement) and the information available on the Internet. However, IR ignores important semantic relationships between them; this makes the search operation slow and wastes a lot of time. In addition, the retrieved documents may not be useful, and there is a particular problem when a word has a double meaning. To overcome these problems, text categorization (classification) is a solution.

Text classification techniques have been applied to the Arabic language only a little compared to other languages (Al-Harbi et al., 2008). Unfortunately, there is no perfect technique for classifying text, so researchers have been encouraged to develop TC techniques using many different models and methods.

In this project, the multinomial mixture model (MMM) is suggested and applied to classify Arabic documents. In addition, this experiment is compared with other classifiers, in order to clarify which model performs better than the others.


1.3. Thesis Objective

Arabic text can be considered completely different from English text, and it has a complex morphology. In this thesis, the multinomial mixture model (MMM) is recommended and applied to classify Arabic documents. Moreover, three different techniques are examined on Arabic text: the Rocchio algorithm, traditional k-NN, and naïve Bayes.

The text classification system with these techniques is evaluated using the standard measures: recall, precision and F-measure. The effectiveness of each classifier is decided according to the results achieved. Finally, the results of the MMM are compared with those of the other two algorithms to determine the best information retrieval system for the Arabic language.

1.4. Summary

This chapter gives a short introduction to information retrieval (IR) systems. It also focuses on text categorization (TC) and describes the most important tasks of a text categorization system. After the short introduction, some interesting text categorization systems and the Arabic language are described briefly. Moreover, the thesis's problem statement is presented. Finally, the multinomial mixture model is adopted as the thesis objective.


2. Chapter two: Literature Review


2.1. Literature Review

Text classification is defined as assigning new documents to a set of predefined categories based on classification patterns (Al Zamil and Can, Uğuz). In recent years there has been an increasing amount of literature on the TC topic, and researchers have shown increased interest in continuing this research and developing it on the basis of previous work.

2.1.1. Text classification

Text classification techniques have been investigated and used in many application areas, and many researchers have studied text classification using different techniques.

The study (Guiying et al.) reviewed the key techniques for building a text classification system, including text models, feature selection methods and text classification algorithms. In addition, a text classification system based on mutual information, the K-Nearest Neighbour algorithm and the Support Vector Machine was implemented. The data set was created from the famous Reuters-21578 text classification collection. The experimental results showed a classification accuracy rate of 91.1%, better performance than with no feature selection, and an improved classification rate. Moreover, the SVM classifier attained higher performance than the KNN classifier.

In (Zhang and Gao) a new feature selection method (Auxiliary Feature) was used, and an improvement in the performance of naïve Bayes for text classification was demonstrated. An auxiliary-feature method was proposed that first determines features by an existing feature selection method and then selects an auxiliary feature that can reclassify the text space with respect to the chosen features. To evaluate this experiment, the data set chosen was 30000 junk mails and 10000 normal mails from CCERT. The results of this study show that the proposed method indeed improves the performance of the naïve Bayes classifier.

Feature sub-set selection (FSS) is an important step for effective text classification (TC) systems, since it may have a great effect on the accuracy of the classifier (Karabulut et al., Mesleh). There are many valuable studies that have investigated FSS metrics for English TC tasks, using different classifiers and many TC corpora (Uğuz, Al-Ani et al., Khushaba et al.). For Arabic TC tasks, there are some works that handle the FSS problem. In recent years, there has been an increasing amount of literature on an empirical comparison of seventeen FSS metrics (Chi, Gss, Gsss, Ngl, Or, Mi, Ig, Bns, Df, Pwr, Acc, Acc2, F1, Pr, Re, Fo and Er) for Arabic TC tasks using an SVM classifier; the evaluation used an Arabic corpus consisting of 7842 documents independently classified into ten categories. The results of the experiment proved that the Chi-square and Fallout FSS metrics work best for Arabic TC tasks (Mesleh).

Another study addressing the FSS problem used two-stage feature selection and feature extraction to improve the performance of text categorization (Uğuz, 2011).

An improved KNN algorithm for text classification was proposed, which builds the classification model by combining a constrained one-pass clustering algorithm with KNN text categorization. Although KNN is a simple and effective method for text classification, it has three drawbacks: first, the complexity of its sample-similarity computation is huge; second, its performance is easily affected by single training samples; and third, KNN is a lazy learner and does not build a classification model. To overcome these drawbacks, the improved KNN algorithm was implemented, using the Vector Space Model (VSM) to represent documents. The results show that the INNTC classifier is much more effective and efficient than KNN (Jiang et al.).

In (Zhang et al., 2013) a novel projected-prototype-based classifier for text classification was implemented. The basic idea behind the algorithm is that each document category is modelled by a set of prototypes and their individual term subspaces of the document category. The classifier was tested using two English data sets, and its performance was compared with five other classifiers: SVM, three-prototype, KNN, KNN-model and centroid classifiers. The experimental results show that the projected-prototype-based classifier achieved higher classification accuracy at a lower computation cost than the traditional prototype-based classifier, especially for data that includes interfering document classes.

2.1.2. Arabic Text Classification

The studies carried out on Arabic text classification are very few compared to other languages (like English), because the Arabic language has an extremely rich morphology and complex orthography. However, some related work has been proposed to classify Arabic documents.

(Duwairi, 2007a) implemented three classifiers for Arabic TC: KNN, NB and a distance-based classifier. Every category was represented as a vector of keywords in the distance-based and KNN classifiers; with NB, on the other hand, the vectors were bags of words. The Dice measure was used to calculate the similarity between them. The accuracy of the classifiers was tested using an Arabic text corpus collected from online magazines and newspapers. According to the results, the NB classifier does better than the other two classifiers.

In 2008 (Mesleh and Kanaan, 2008), the researchers applied the SVM algorithm to Arabic text classification. The paper pointed out that the SVM classifier achieved better results than other classifiers such as naïve Bayes and KNN. In addition, light stemming for Arabic TC tasks was evaluated with the SVM classifier; as a result, light stemming did not enhance the performance of the Arabic SVM text classifier. On the other hand, Feature Subset Selection (FSS) was implemented and improved the performance of the Arabic SVM text classifier. All of the feature subset selection methods (Chi-square, GSS, NGL, OR, IG and MI) achieved better recall and better F1-measure, and the best results were achieved with two of these methods (Chi-square and NGL). Finally, a new Ant-Colony-based FSS algorithm (ACO) was applied and achieved the greatest TC effectiveness of the six FSS methods.

The main objective of (Kanaan et al., 2009a) was to compare automatic text classification using the kNN, Rocchio and NB classifiers on the Arabic language. The system was tested using a corpus of 1445 Arabic text documents. Two models were used: the first was the vector space model, used to implement the KNN and Rocchio classifiers, in which each document is represented as a vector of terms; the second was a probabilistic model, used to implement the NB classifier. In the probabilistic model, the probability of a document belonging to each class is calculated, and the document is assigned to the class with the maximum probability. The experiments showed that naïve Bayes is the best performer, followed by kNN and Rocchio.

The paper (McCallum and Nigam, 1998a) reported a comparison between two probabilistic classifiers. The researchers found that the multinomial model gives better results than the multivariate Bernoulli model at large vocabulary sizes; in contrast, when the vocabulary size is smaller, the multivariate Bernoulli model outperforms the multinomial model. The results were tested on five real-world corpora, and the evaluation of their experiments proved that the multinomial model reduced error by an average of 27%, and sometimes by more than 50%.

(Ueda and Saito, 2002) implemented probabilistic generative models called parametric mixture models (PMMs). The main goal of PMMs was to handle multiclass, multi-labelled text categorization problems. The PMMs achieved good results compared with binary classification, because PMMs can simultaneously detect multiple categories of a text instead of depending on binary judgments. The PMM approach was applied to World Wide Web pages and shown to be efficient.

(Zhong and Ghosh, 2003) demonstrated a comparative study of generative models for document clustering that used the multinomial model. The model was compared with two other probabilistic models, the multivariate Bernoulli and the von Mises-Fisher (vMF) model, by applying them to clustering. The Bernoulli model was the worst for text clustering, while the vMF model produced better clustering results than both the Bernoulli and multinomial models.


As mentioned in the literature (Li and Zhang, 2008), a novel mixture-model method for text clustering was named the multinomial mixture model with feature selection (M3FS). The M3FS method used the MMM instead of Gaussian mixtures to improve text clustering tasks. Prior studies noted that, with no labels available, feature selection is a hard problem in unsupervised text clustering; to overcome this problem, M3FS was proposed for text clustering. The results demonstrate that the M3FS method has good clustering performance and feature selection capability.

(Bouguila et al., 2012) discussed two problems. The first is that many irrelevant features may affect the speed, and also compromise the accuracy, of the learning algorithm used; the second is the presence of outliers, which affects the parameters of the resulting models. For this reason, the researchers suggested an algorithm that partitions a given data set without a priori information about the number of clusters, together with a novel statistical mixture model, based on the Gamma distribution, which makes explicit what data or features have to be ignored and what information has to be retained. The performance of this finite mixture model was evaluated on different applications involving data analysis, real data and object-shape clustering. The experiments proved that the approach has excellent modelling capabilities and that feature selection combined with outlier detection significantly influences clustering performance.

(McCallum and Nigam, 1998b, Lewis, 1998) discussed the history of naïve Bayes in information retrieval and presented a theoretical comparison of the multinomial and the multivariate Bernoulli models (the latter also called the binary independence model).

Compared to Indo-European languages (like English), the Arabic language has an extremely rich morphology and a complex orthography. This is one of the main reasons (El-Halees, 2007, Duwairi, 2006, MESLEH, 2007) behind the lack of research in the field of Arabic text classification. Nevertheless, many machine learning approaches have been proposed to classify Arabic documents: the Support Vector Machine (SVM) classifier with the Chi-square feature extraction method (MESLEH, 2007), the naïve Bayesian method (2004), k-Nearest Neighbours (Al-Shalabi et al., 2006), distance-based classifiers, and the Rocchio algorithm (Syiam et al., 2006).

Sawaf, Zaplo and Ney (Sawaf et al., 2001) used the maximum entropy method for Arabic document clustering. Initially, documents were randomly assigned to clusters; in subsequent iterations, documents were shifted from one cluster to another if an improvement was gained, and the algorithm terminated when no further improvement could be achieved. Their text classification method is based on unsupervised learning.

El-Kourdi, Bensaid, and Rachidi (2004) used a naïve Bayesian classifier to classify an in-house collection of Arabic documents. They concluded that there is some indication that the performance of the naïve Bayesian algorithm in classifying Arabic documents is not sensitive to the Arabic root extraction algorithm. In addition to their own root extraction algorithm, they used other root extraction algorithms, such as those suggested by (Baeza-Yates and Ribeiro-Neto, 1999, Al-Shalabi and Evens, 1998).


Duwairi (Duwairi, 2006) proposed a distance-based classifier for Arabic TC tasks, where the Dice measure was used as a similarity measure. In this work, each category was represented as a vector of words. In the training phase, the text classifier scanned training documents to extract features that best capture inherent category-specific properties. Documents were then classified on the basis of their closeness to the categories' feature vectors.

El-Halees (El-Halees, 2007) implemented a maximum-entropy-based classifier to classify Arabic documents. Compared with other text classification systems (such as those of El-Kourdi et al. and Sawaf et al.), the overall performance of the system was good (in these comparisons, El-Halees used the results as recorded in the published papers mentioned above).

Hmeidi, Hawashin and El-Qawasmeh (Hmeidi et al., 2008) reported a comparative study of SVM and K-Nearest Neighbour (KNN) classifiers on Arabic text classification tasks. They concluded that the SVM classifier shows a better micro-averaged F1-measure.

Al-Saleem (Alsaleem, 2011) proposed automated Arabic text classification using the SVM and NB classification methods. These methods were investigated on different Arabic datasets, and several text evaluation measures were used. The experimental results on the different Arabic text categorization datasets showed that the SVM algorithm outperforms NB with regard to all measures (recall, precision and F-measure): the F-measure of SVM was 77.8%, versus 74% for NB.

Al-Diabat et al. (Al-diabat, 2012) investigated the problem of Arabic Text Classification (ATC) using rule-based classification approaches.


Since feature selection is a key factor in the accuracy and effectiveness of the resulting classification, the authors in (Chantar and Corne, 2011) proposed Binary Particle Swarm Optimisation (BPSO) for feature selection in Arabic text classification. The aim of applying BPSO/KNN is to find a good subset of features to facilitate the task of Arabic text categorization. SVM, naïve Bayes and the C4.5 decision tree were applied as classification algorithms. The suggested method was effective and achieved satisfactory classification accuracy.

Particle swarm optimization has also been used to achieve excellent feature selection in (Al-Saleem, 2010).

In 2013 (Abuaiadah, 2013), it was reported that multiword features were implemented to improve Arabic information retrieval. Multiword features are represented as a mixture of words appearing within windows of varying size. Multiword features were applied with two similarity functions, the Dice similarity function and the cosine similarity function, to improve the outcome of Arabic text classification. According to the results achieved, the Dice function performs better than the cosine function. With the Dice similarity function, the frequencies of the features in the document are ignored and only their existence is recognized.

In (Alsaleem, 2011) the investigator concentrated on single-label assignment only. The goal of that paper was to present and compare results obtained on a Saudi newspaper Arabic text collection using the SVM algorithm and the NB algorithm. The experiments show that the SVM classifier achieved better results than the NB classifier.

In (Zrigui et al., 2012) Latent Dirichlet Allocation (LDA) was proposed as a text feature. LDA was used to index and represent Arabic texts. The main idea behind LDA is that documents are represented as random mixtures over latent topics, where each topic is described by a distribution over words. SVM was used for the classification task. The LDA-SVM algorithm achieved high effectiveness for Arabic text classification, exceeding SVM without LDA, naïve Bayes and KNN classifiers.

2.2. Summary

In chapter 2, different text classification algorithms were described. Some of the literature covers text classification in general, while other work addresses the Arabic language specifically. Finally, some papers on the multinomial mixture model were reviewed.


    3. Chapter three: Methodology


3.1. Introduction

There are many approaches that can be used in text classification. Here, KNN, Rocchio and naïve Bayes using the MMM model have been implemented, and these algorithms have been applied to the same datasets.

The main aim of applying TC to the Arabic language is to improve performance over information retrieval without TC. Many steps are needed to implement the TC task; all phases are explained in section 3.2.

An IR process starts with the submission of a query, which describes a user's topic, and finishes with a set of ranked results estimated by the IR system's ranking scheme to be the most relevant to the query (Ajayi et al.).

Recall and precision are well-known measures that can be used to evaluate any IR system, and the efficiency of the system can be determined using them.

This chapter is divided into three main sections. Section 3.1 gives an overview of the project. Section 3.2 presents the main text classification system architecture. Section 3.3 gives a short summary of chapter three.

3.2. System Architecture

The text classification technique is implemented by passing through several phases, executed sequentially to facilitate the TC task. Uncategorized documents are pre-processed by removing punctuation marks and stopwords. Every document is then represented either as a vector of words only, or as a vector of words together with their frequencies and the number of documents in which these words appear (inverse document frequency). Stemming is used to decrease the dimensionality of the documents' feature vectors. The accuracy of the classifier is computed using recall, precision and F-measure.


3.2.2. Pre-processing

Pre-processing can be defined as the process of filtering out words that may not give any meaning to a text and that may not be useful in information retrieval systems; such words are called stopwords (Al-Maimani et al.). The purpose of applying pre-processing is to transform documents into a representation suitable for the classification task. In addition, it reduces the size of the information, which may make the search operation faster. Pre-processing proceeds as follows:

- Documents in formats such as HTML, SGML and XML are converted to plain text format.

- Digits and punctuation marks are removed from each document.

Tokenization: tokenization divides the document into a set of tokens (words).

Stopword removal: there are two kinds of terms in any document. The first kind, called stopwords, occur commonly in all documents and may not give any meaning to the document; the second kind can be described as keywords or features. Stopwords (punctuation marks, formatting tags, prepositions, pronouns, conjunctions and auxiliary verbs) are removed to reduce the text size and save processing time. Removing these high-frequency words is essential because they may cause documents to be misclassified (Uğuz).

Normalization: this is an essential phase that reduces the many forms in which words with the same meaning can be written. Arabic suffers from a very common problem in which a single word is written in many forms, like بدأ، بدا and بدآ (which mean start in English). Table 3.1 shows some letters that are normalized.
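A minimal sketch of this pre-processing chain (digit/punctuation removal, tokenization, stopword removal, and letter normalization) is shown below. The normalization rules used (alef variants to bare alef, teh marbuta to heh, alef maqsura to yeh) are common choices in Arabic IR and stand in here for the original table 3.1; the stopword list is a tiny illustrative sample.

```python
import re

# Assumed normalization rules in the spirit of table 3.1.
NORMALIZE = {"أ": "ا", "إ": "ا", "آ": "ا", "ة": "ه", "ى": "ي"}
STOPWORDS = {"في", "من", "على", "و"}  # tiny illustrative stopword list

def preprocess(text):
    """Strip digits/punctuation, tokenize, remove stopwords, and normalize letters."""
    text = re.sub(r"[\d\W]+", " ", text)  # digits and punctuation become spaces
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return ["".join(NORMALIZE.get(ch, ch) for ch in tok) for tok in tokens]

print(preprocess("بدأ اللاعب في المباراة 2013."))  # ['بدا', 'اللاعب', 'المباراه']
```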


MIDF is a term frequency normalized over the collection, which provides the correct terms for learning; MIDF performs better than existing term-weighting schemes such as TF.IDF and WIDF. Indexing a document is the method of characterizing its content for the purpose of making subsequent retrieval from document storage easy. The index terms of information retrieval systems are word stems automatically derived from a document and weighted according to their distribution in a document collection. Automatic indexing is the process of producing the descriptors (index terms) of a text automatically (Lahtinen, 2000). Automatically indexing an information source saves time, since most of the precise human effort can be performed by a machine. In this indexing approach, the order of terms in the vector is ignored. In information retrieval systems, index terms are usually weighted according to their importance for describing documents, and typically the weighting schemes are based on detecting word frequencies across the document collection (Obaseki).

The vector of words, also called a vector of weighted terms, consists of all distinct terms that appear in all training documents. It consists of the term frequency, which measures the number of times term i appears in document j, and the Inverse Document Frequency (IDF), which is based on the number of documents in the collection in which term i appears.

Term Frequency-Inverse Document Frequency (TF-IDF) is used in this work as one of the most popular weighting schemes. It considers not only term frequencies in a document, but also the frequencies of a term in the entire collection of documents (Moraes et al.). The classic TF-IDF scheme assigns to term t a weight in document d as:

\( \mathrm{TFIDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i \)    (3.1)

Thus, TF-IDF weighting assigns a high degree of importance to terms occurring frequently in only a few documents of a collection. The Inverse Document Frequency (IDF) for term T_i is calculated as follows:

\( \mathrm{IDF}_i = \log \frac{N}{\mathrm{DF}_i} \)    (3.2)

where N is the total number of documents in the collection and DF_i (the document frequency of term T_i) is the number of documents in which T_i occurs.

Automatic indexing typically relies on word frequencies. If a word occurs frequently in one document but does not occur in many other documents, it is probably an appropriate document descriptor, and it should be weighted highly by the indexer, as in the sketch below.
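As an illustration, the following minimal sketch computes TF-IDF weights for a small tokenized collection according to equations 3.1 and 3.2; it assumes raw in-document counts are used as the TF component, which is only one of several common variants.

    import math
    from collections import Counter

    def tf_idf(docs):
        """TF-IDF weights (equations 3.1 and 3.2) for a list of token lists.

        Returns one {term: weight} dictionary per document.
        """
        n = len(docs)
        # DF_i: the number of documents in which term T_i occurs.
        df = Counter(term for doc in docs for term in set(doc))
        weights = []
        for doc in docs:
            tf = Counter(doc)  # raw term counts as TF
            weights.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
        return weights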

Feature selection:
Feature sub-set selection (FSS) is one of the important pre-processing steps of machine learning and an essential task for text classification. Feature selection methods study how to choose a subset of attributes that are used to construct models describing the data (Khushaba et al.). Many FSS methods have been applied to Arabic text (Al-Ani et al., Mesleh and Kanaan, 2008).

According to previous related work, the FSS approach has been proven to provide several advantages for a text classification system: it is very effective in reducing dimensionality, removing irrelevant and redundant terms from documents and decreasing computational complexity. In addition, FSS increases learning accuracy and improves classification efficiency and scalability by making the classifier simpler and faster to build. On the other hand, FSS may decrease the classifier's accuracy (Mesleh, Singh et al., Khushaba et al., Al-Ani et al.).
Since the number of features in a text classification task is huge and redundant, it is important to examine how to select the best features in order to achieve better efficiency.

Many FSS algorithms have been tested and compared in text classification systems. For example, Chi-square and fallout achieved satisfactory results in Arabic TC tasks, and Ant Colony Optimization (ACO), an optimization algorithm derived from the study of real ant colonies, is one of the promising approaches to better feature selection.
To classify a new document, it is pre-processed by removing punctuation marks and stopwords, followed by extracting the roots of the remaining keywords. The feature vector of the new document is then compared with the feature vectors of all categories, and the document is ultimately assigned to the category with the maximum similarity.

3.2.3. Classifiers
Many types of classifiers have been applied and evaluated in the text classification area, and the results differ considerably from one classifier to another, since every classifier has its own specific algorithm. In the following, several kinds of classifiers are explained, with their advantages and drawbacks according to the achieved results.

3.2.3.1. Support vector machine (SVM)
The support vector machine (SVM) classifier has been widely applied in the text classification area (Alsaleem, 2011, Zrigui et al., 2012, Mesleh and Kanaan, 2008).


The KNN has advantages such as simplicity, being non-parametric, and showing very good performance on text categorization tasks for the Arabic language. On the other hand, KNN has drawbacks: it is difficult to find the optimal value of k, and classification time is long because the distance from each query instance to all training samples has to be computed. In addition, this classifier is called a lazy learning system, because it does not involve a true training phase (Wan et al.).

The major steps in applying the k-nearest neighbor classifier are:
-Pre-process the documents in the training set.
-Choose the parameter value K, that is, the number of nearest neighbors of d in the training data.
-Determine the distance between the testing document (d) and the training documents (previous classes).
-Sort the distances and determine the neighbors based on the minimum of the k distances.

To classify an unknown document, the KNN classifier ranks the document's neighbors among the training documents and uses the class labels of the k most similar neighbors. The similarity score of each nearest-neighbor document to the test document is used as a class weight for that neighbor. If a specific category is shared by more than one of the k nearest neighbors, then the sum of the similarity scores of those neighbors gives the weight of that particular shared category (Mitra et al., 2007), as in the sketch below.
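A minimal sketch of this similarity-weighted voting in Python, assuming cosine similarity over sparse term-weight vectors (function and parameter names are illustrative):

    import math
    from collections import defaultdict

    def cosine(u, v):
        """Cosine similarity between two sparse {term: weight} vectors."""
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def knn_classify(test_vec, training, k):
        """training: list of (vector, class label) pairs.

        Each of the k nearest neighbours votes for its own class, weighted
        by its similarity to the test document; the best class wins.
        """
        sims = sorted(((cosine(test_vec, vec), label) for vec, label in training),
                      key=lambda s: s[0], reverse=True)[:k]
        scores = defaultdict(float)
        for sim, label in sims:
            scores[label] += sim
        return max(scores, key=scores.get)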

An example of KNN classification is shown in figure 3.2.a. The document X is assumed to be the test sample, which should be classified either to the first category of white circles or to the second category of black circles. If k = 1, the document X will be classified to the white category, because its single nearest neighbor is a white circle.


to represent each document and class. The vector representing a class (c) is called the prototype or centroid (Ko et al., 2004, Kanaan et al., 2009a). The prototype for each class is calculated by subtracting the average of all documents that do not appear in class C from the average of all documents that appear in class C:

\( \vec{c}_j \;=\; \beta \, \frac{1}{|C_j|} \sum_{\vec{d} \in C_j} \vec{d} \;-\; \gamma \, \frac{1}{|D - C_j|} \sum_{\vec{d} \in D - C_j} \vec{d} \)    (3.4)

where β and γ are parameters that adjust the relative impact of the positive and negative training examples.

Practically, in text classification Rocchio calculates the similarity between the test document and each of the prototype vectors. The test document is then assigned to the category with the maximum similarity score, as in the sketch below.
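A minimal sketch of the prototype computation of equation 3.4 in Python; the beta and gamma defaults are illustrative values, not the settings used in the experiments, and both document sets are assumed non-empty:

    from collections import defaultdict

    def rocchio_prototype(docs, in_class, beta=16.0, gamma=4.0):
        """Prototype (centroid) vector of one class, following equation 3.4.

        docs: sparse {term: weight} vectors; in_class: parallel booleans
        marking membership in class C.
        """
        pos = [d for d, member in zip(docs, in_class) if member]
        neg = [d for d, member in zip(docs, in_class) if not member]
        prototype = defaultdict(float)
        for d in pos:  # positive examples, weighted by beta
            for term, w in d.items():
                prototype[term] += beta * w / len(pos)
        for d in neg:  # negative examples, weighted by gamma
            for term, w in d.items():
                prototype[term] -= gamma * w / len(neg)
        return prototype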

3.2.3.4. Naïve Bayes
The Naïve Bayes classifier uses a probabilistic model of text and achieves good performance on the TC task for Arabic text (Kanaan et al., 2009b). It has been noted (Noaman et al., Zhang and Gao) that NB is a simple probabilistic classifier based on applying Bayes' theorem; the conditional probability P(c|d) for each class can be computed as:

\( p(c \mid d) = \frac{P(c) \, P(d \mid c)}{P(d)} \)    (3.5)

where P(c) is the prior probability of a document occurring in class c. Frequently, each document d in text classification is represented as a vector of words (v_1, v_2, ..., v_n); the above equation then becomes:


\( p(c \mid d) = \frac{P(c) \prod_{k=1}^{n} P(v_k \mid c)}{P(d)} \)    (3.6)

P(d) is constant for all categories.

\( P(v_k \mid c) = \frac{f}{F} \)    (3.7)

where f is the frequency of word v_k in the test document and F is the number of documents in which word v_k appears. Note that, to avoid zero probabilities, add-one (Laplace) smoothing is applied, and the equation becomes:

\( P(v_k \mid c) = \frac{f + 1}{F + w} \)    (3.8)

where w equals the number of training documents in category c. The Bayes classifier computes the posterior probability of document d falling into each class separately and assigns the document to the class with the highest probability, that is:

\( c^{*} = \underset{1 \le j \le |C|}{\arg\max} \; p(c_j \mid d) \)    (3.9)

where |C| is the total number of classes (Duwairi, 2007a).


Naïve Bayes is frequently used for categorization in text classification due to its speed and simplicity. There are two event models of Naïve Bayes: the multinomial model and the Bernoulli model (Prasad). In the Bernoulli model, a test document is classified using binary occurrence information, and the number of occurrences is ignored, whereas the multinomial model keeps track of multiple occurrences (Zhong and Ghosh, 2003, McCallum and Nigam, 1998b).

3.2.3.5. Multinomial Mixture Model
It is necessary to clarify exactly what is meant by MMM. It models the distribution of words in a document as a multinomial: a document is treated as a sequence of words, and it is assumed that each word position is generated independently of every other (Rennie et al., 2003). In text classification, the use of class-conditional multinomial mixtures can be seen as a generalization of the Naive Bayes text classifier that relaxes its (class-conditional feature) independence assumption (Civera and Juan, 2005). When a test document is classified, an MMM keeps track of multiple occurrences, in contrast with a model such as the Bernoulli model (Zhong and Ghosh, 2003), which uses binary occurrence information and ignores the number of occurrences. Since an MMM keeps the occurrence information of all words (frequency, position), it makes the classification task easier.

\( p(c \mid d) = \frac{P(c) \, P(d \mid c)}{P(d)} \)    (3.10)

\( P(c_j) = \frac{n_j + 1}{n + L} \)    (3.11)


where n_j is the number of documents in class c_j, n is the number of documents in the training set D, and L is the number of classes (Chen et al., 2009).

\( P(d \mid c) = P(|d|) \, |d|! \, \prod_{k} \frac{P(w_k \mid c)^{n_k}}{n_k!} \)    (3.12)

where n_k in equation 3.12 is the number of occurrences of word w_k in document d; this factorial factor is the same for every class, so it cancels when classes are compared. The word probabilities are estimated with add-one smoothing:

\( P(w_k \mid c_j) = \frac{1 + n_k}{n + n_j} \)    (3.13)

where n_k in equation 3.13 is the number of documents in class c_j that contain word w_k. A sketch of training and classification with this model follows.
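A minimal sketch of training and classification with such a multinomial model in Python, in the spirit of equations 3.9 to 3.13; it works in log space, drops the class-independent factorial factor of equation 3.12 (which does not affect the arg max), and uses add-one smoothing over the vocabulary, a common variant of equation 3.13:

    import math
    from collections import Counter

    def train_mmm(docs, labels):
        """Estimate class priors (eq. 3.11) and smoothed word probabilities."""
        classes = sorted(set(labels))
        vocab = {t for doc in docs for t in doc}
        priors, word_probs = {}, {}
        for c in classes:
            class_docs = [d for d, l in zip(docs, labels) if l == c]
            priors[c] = (len(class_docs) + 1) / (len(docs) + len(classes))
            counts = Counter(t for d in class_docs for t in d)
            total = sum(counts.values())
            # Add-one smoothing over the vocabulary (a variant of eq. 3.13).
            word_probs[c] = {t: (counts[t] + 1) / (total + len(vocab))
                             for t in vocab}
        return priors, word_probs

    def classify_mmm(doc, priors, word_probs):
        """Return the class with the highest posterior (equation 3.9)."""
        def log_posterior(c):
            probs = word_probs[c]
            return math.log(priors[c]) + sum(math.log(probs[t])
                                             for t in doc if t in probs)
        return max(priors, key=log_posterior)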

3.2.4. Evaluation
There are many retrieval systems on the market, but which one is the best? That depends on the results each of them produces. An important issue for information retrieval systems is the notion of relevance: the purpose of an information retrieval system is to retrieve all the relevant documents (recall) and no non-relevant documents (precision). Recall and precision are defined as:

Precision: the ability to retrieve top-ranked documents that are mostly relevant.

\( \mathrm{Precision} = \frac{\text{number of relevant documents retrieved}}{\text{total number of documents retrieved}} \)    (3.14)

The maximum (and optimal) precision value is 100%; the worst possible precision, 0%, is obtained when not a single relevant document is found.

Recall: the ability of the search to find all of the relevant items in the corpus.

\( \mathrm{Recall} = \frac{\text{number of relevant documents retrieved}}{\text{total number of relevant documents}} \)    (3.15)

One substantial aspect of the results is how many of the relevant documents in a collection have been found. Recall shows how many of the relevant documents a user could possibly come across when reading all documents in the result set; therefore, a higher level of recall indicates a better system. Note that while the number of relevant items retrieved and the total number of items retrieved are both available to the recall measure, the total number of relevant items is usually not available.

The most essential averages are the micro-average, which counts each document as equally important, and the macro-average, which counts each category as equally important (see section 4.3 for further details). A perfect information retrieval system is achieved when both recall and precision equal one.

F1-measure: a measure of effectiveness that combines the contributions of precision and recall. The well-known F1 measure function is used to test the performance of information retrieval systems and is defined as:

\( F_1 = \frac{2 \cdot \mathrm{Pr} \cdot \mathrm{Re}}{\mathrm{Pr} + \mathrm{Re}} \)    (3.16)

Fallout: another measure that can be used to evaluate information retrieval systems. Although recall and precision are considered good evaluation measures, they do not take into account the number of irrelevant documents in the collection; as a result, recall is undefined when there is no relevant document in the collection, and precision is undefined when no document is retrieved. Fallout, in contrast, takes the number of irrelevant documents in the collection into account. It can be seen as the counterpart of recall computed over the irrelevant documents, which means that a good system should have high recall and low fallout.
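Under its standard definition, fallout is the fraction of the irrelevant documents in the collection that the system nevertheless retrieves:

\( \mathrm{Fallout} = \frac{\text{number of irrelevant documents retrieved}}{\text{total number of irrelevant documents in the collection}} \)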

3.3. Summary
This chapter gave a short introduction to information retrieval and described the common tasks of a TC system. Using the multinomial mixture model as a machine learning algorithm is nowadays a popular approach. In the rest of the chapter, three interesting kinds of TC algorithms were described briefly.


    4. Chapter four: Experiments and Evaluation


4.1. Introduction
Automatic text classification is defined as classifying unlabelled documents into predefined categories based on their contents. It has become an important topic due to the increased number of documents on the internet that people have to deal with daily, which in itself has led to an urgent need to organize them. In this chapter, the experiments are carried out and the performance of the Rocchio algorithm, the traditional k-NN and Naïve Bayes using the MMM classifier is documented. These classifiers are evaluated with several measures in order to determine whether Naïve Bayes using the MMM outperforms the other classifiers. The rest of this chapter is organized as follows: section 4.2 discusses the preparation of the data set for evaluation, section 4.3 lists the performance measures, section 4.4 discusses the evaluation results, section 4.5 discusses the results of the MMM with 5070 documents, section 4.6 gives the summary, section 4.7 presents the conclusion and future work, and section 4.8 lists the references.

4.2. Data set preparation
The corpus was downloaded from (SAAD, 2010). The documents are classified into nine categories; the categories and the number of documents in each of them appear in table 4.1. The total number of documents is 1445, and the documents vary in length. The nine categories are: Computer, Economics, Education, Sport, Politics, Engineer, Medicine, Law, and Religion. After the pre-processing was applied to all the documents, a copy of the pre-processed documents was converted into the Attribute-Relation File Format (ARFF) in order to be suitable for the Weka tool; a minimal sketch of such a conversion is given below.
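The sketch assumes one string attribute per document (real Weka pipelines typically apply the StringToWordVector filter to such a file afterwards); the file name and attribute names are illustrative:

    def to_arff(docs, labels, categories, path="corpus.arff"):
        """Write pre-processed documents to a minimal ARFF file for Weka.

        docs: token lists; labels: parallel class names; categories: the
        full set of class names used to declare the nominal class attribute.
        """
        with open(path, "w", encoding="utf-8") as f:
            f.write("@RELATION arabic_corpus\n\n")
            f.write("@ATTRIBUTE text STRING\n")
            f.write("@ATTRIBUTE class {%s}\n\n" % ",".join(categories))
            f.write("@DATA\n")
            for tokens, label in zip(docs, labels):
                text = " ".join(tokens).replace("'", " ")  # avoid quote clashes
                f.write("'%s',%s\n" % (text, label))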


NO | Category | Number
1 | Medicine | 3
2 | Economics | 2
3 | Religion |
4 | Sport | 3
5 | Politics | 41
6 | Engineer | 44
7 | Law |
8 | Computer | 27
9 | Education | 8

Table 4.1: the number of documents for each category

4.3. Performance measures
The performance of a text classification algorithm is understood as its computational efficiency and classification effectiveness. When a large number of documents is categorized into many categories, the efficiency of text classification must be taken into account. The effectiveness of text classification is measured by precision and recall (Kanaan et al., 2009a). Precision and recall are defined as follows:


\( \mathrm{Recall} = \frac{tp}{tp + fp}, \quad tp + fp > 0 \)    (4.1)

\( \mathrm{Precision} = \frac{tp}{tp + fn}, \quad tp + fn > 0 \)    (4.2)

where tp counts the documents that the classifier assigned correctly, fn counts the documents that the classifier assigned incorrectly, fp counts the not assigned but incorrect cases, and tn counts the not assigned and correct cases, as shown in table 4.2.

                    | Correct decision by expert
Classifier decision | YES is correct | NO is incorrect
Assigned (YES)      | tp             | fn
Not assigned (NO)   | fp             | tn

Table 4.2: confusion matrix for the performance measures

Precision is the fraction of retrieved instances that are relevant, as in equation 4.2, while recall is the fraction of relevant instances that are retrieved, as in equation 4.1. Both precision and recall are therefore based on an understanding and measure of relevance. Precision and recall values often depend on parameter tuning; that means there is a trade-off between precision and recall. This is why another measure that combines both precision and recall is used: the F-measure, which is defined as follows:


\( F\text{-}measure = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \)    (4.3)

To evaluate the performance across categories, the F-measure is averaged. There are two kinds of averaged values, namely the micro average and the macro average (MESLEH, 2007). For obtaining estimates of precision and recall relative to the whole category set, two different methods may be adopted:

Category set C = {c_1, ..., c_|C|} | Expert judgments: YES | Expert judgments: NO
Classifier judgments: YES | TP = \( \sum_{i=1}^{|C|} TP_i \) | FN = \( \sum_{i=1}^{|C|} FN_i \)
Classifier judgments: NO  | FP = \( \sum_{i=1}^{|C|} FP_i \) | TN = \( \sum_{i=1}^{|C|} TN_i \)

Table 4.3: The global contingency table

Macro-averaging: precision and recall are first evaluated locally for each category, and then globally by averaging over the results of the different categories:

\( \mathrm{Pr}^{M} = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{TP_i}{TP_i + FN_i} \qquad \mathrm{Re}^{M} = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{TP_i}{TP_i + FP_i} \)

Table 4.4: Macro-averaged precision and recall

Micro-averaging: precision and recall are obtained by globally summing over all individual decisions. For this, the global contingency table of table 4.3, obtained by summing over all category-specific contingency tables, is needed. A sketch of both averaging schemes follows.
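A minimal sketch of both averaging schemes in Python, following the tp/fn/fp layout of table 4.2 and assuming every class has at least one assigned document:

    def macro_micro(per_class):
        """Macro- and micro-averaged precision and recall.

        per_class: {category: (tp, fn, fp)} in the notation of table 4.2,
        where precision = tp / (tp + fn) and recall = tp / (tp + fp).
        """
        n = len(per_class)
        macro_p = sum(tp / (tp + fn) for tp, fn, fp in per_class.values()) / n
        macro_r = sum(tp / (tp + fp) for tp, fn, fp in per_class.values()) / n
        # Micro-averaging first sums the per-category tables (table 4.3).
        TP = sum(tp for tp, fn, fp in per_class.values())
        FN = sum(fn for tp, fn, fp in per_class.values())
        FP = sum(fp for tp, fn, fp in per_class.values())
        return macro_p, macro_r, TP / (TP + FN), TP / (TP + FP)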


In single-label classification, as implemented in these experiments, micro-averaged precision equals micro-averaged recall (Rocchio, 1971) and both equal F1, so only micro-F1 is reported for the micro-averaged results.

4.4. Evaluation Results
The results were obtained for each of the k-nearest neighbor, Rocchio and Naïve Bayes using MMM classifiers, as follows:

4.4.1. Naïve Bayes algorithm using the MMM
Table 4.6 shows the confusion matrix for the Naïve Bayes using MMM algorithm. The number reported in an entry of the confusion matrix corresponds to the number of documents that are known to actually belong to the category given by the row header of the matrix, but that were assigned by NB using the MMM to the category given by the column header.

As shown in table 4.6, 67 documents of the category Computer are classified correctly into the Computer category, while 3 documents of Computer are classified incorrectly: 2 of these 3 documents are classified as Education and 1 as Law. The best classification is for the category Sport, where 231 documents of this category are classified correctly. The lowest value of correctly classified documents is for the Education category, where 56 documents are classified correctly and 12 documents are classified incorrectly.


Table 4.6: Confusion matrix results for the NB using MMM algorithm

Figure 4.1 shows the recall, precision and f-measure for every category when the Naïve Bayes classifier was used. The precision reaches its highest value (1) for the Sport and Computer categories, while the lowest value of precision (0.812) is for the Education category. Recall reaches its highest value (0.996) for the Sport category and its lowest value (0.804) for the Law category. The f-measure reaches its highest value (0.998) for the Sport category and its lowest value (0.818) for the Education category. The rest of the figure is self-explanatory.


Table 4.8 shows the average of the above values over all categories for the MMM algorithm; the overall f-measure is 0.908, which is considered high.

Naïve Bayes using MMM | Precision | Recall | F-measure
Weighted average | 0.911 | 0.907 | 0.908

Table 4.8: NB using MMM classifier weighted average for the nine categories

4.4.2. Comparison of the MMM with other techniques and discussion of results
First, a comparison was made between the k-NN, Rocchio and Naïve Bayes classifiers. All the results for KNN and Rocchio are taken from (Kanaan et al., 2009a). A summary of the recall, precision and F1 measures is shown in table 4.9. Naïve Bayes gave the best F-measures with MiF1 = 0.9185 and MaF1 = 0.908, followed by kNN widf with MiF1 = 0.7970 and MaF1 = 0.7871, closely followed by Rocchio tf.idf with MiF1 = 0.7314 and MaF1 = 0.7882. A comparison of the MiF1 and MaF1 values is shown in figure 4.2.

Method | maP | maR | maF1 | miF1
kNN tf | 0.7100 | 0.5359 | 0.6100 | 0.5711
kNN tfidf | 0.8363 | 0.6902 | 0.7562 | 0.7272
kNN widf | 0.8094 | 0.7662 | 0.7871 | 0.7970
Rocchio tf | 0.5727 | 0.4501 | 0.5022 | 0.4427
Rocchio tfidf | 0.8515 | 0.7337 | 0.7882 | 0.7314
Rocchio widf | 0.7796 | 0.7199 | 0.7484 | 0.6968
Naïve Bayes | 0.911 | 0.907 | 0.908 | 0.9185

Table 4.9: Classifier comparison


Figure 4.2 shows the maF1 and miF1 values for all the classifiers (KNN, Rocchio and Naïve Bayes). From the figure, we can see that Naïve Bayes using the MMM obtained the highest value for both measures (maF1 and miF1).

    Figure 4.2: maF1, miF1 comparison for classifiers

Figure 4.3 shows the macro precision for all the classifiers. The highest value is for Naive Bayes using the MMM, followed by Rocchio, with KNN tf.idf not far behind Rocchio.

    Figure 4.3: maP comparison for classifiers


Figure 4.4 shows the macro recall for all the classifiers. The highest value is for Naive Bayes using the MMM, followed by KNN, with Rocchio not far behind KNN.

    Figure 4.4: maR Comparison for Classifiers

It is clear that the Naive Bayes classifier has the highest values for the three measures, and the KNN classifier comes in second place; the worst values for the three measures were for Rocchio. It can also be observed that there is a disproportion among the precision, recall and f-measure values for k-NN, which reaches a high value (0.83) for the precision measure and a very low one (0.53) for recall. As also shown, the precision, recall and f-measure values for the other two classifiers, Rocchio and Naïve Bayes, are more stable.


Figure 4.5: precision, recall and f-measure for the three classifiers

4.5. Results of the Naïve Bayes algorithm (MMM) with 5070 documents
Another experiment was conducted. The collected corpus, shown in table 4.10, contains 5070 documents that vary in length (SAAD, 2010). These documents fall into six categories: Business, Entertainment, Middle East news, Sport, World news, and Science and Technology.

NO | Category | Number
1 | Business | 836
2 | Entertainment | 474
3 | Middle East news | 1462
4 | Sport | 762
5 | World news | 1010
6 | Science and Technology | 526

Table 4.10: categories and their distributions in the corpus (5070 documents)


Table 4.11 shows the confusion matrix for the Naïve Bayes using MMM algorithm. The lowest value of correctly classified documents is for the Entertainment category, where 400 documents are classified correctly and 74 documents are classified incorrectly.

Table 4.11: Confusion matrix results for the NB algorithm in the corpus (5070 documents)

Figure 4.6 shows the recall, precision and f-measure for every category when the Naïve Bayes classifier was used. The precision reaches its highest value (0.991) for the Sport category, while the lowest value of precision (0.746) is for the Entertainment category. Recall reaches its highest value (0.979) for the Sport category and its lowest value (0.832) for the Middle East news category. The f-measure reaches its highest value (0.985) for the Sport category and its lowest value (0.792) for the Entertainment category. The rest of the figure is self-explanatory.


Figure 4.6 presents the precision and recall for all the categories classified using Naïve Bayes with the MMM.

    Figure 4.6: Result of the Naive Bayes classification algorithm

Table 4.13 shows the average of the above values over all categories for the NB algorithm; the overall f-measure is 0.884.


Naïve Bayes using MMM | Precision | Recall | F-measure
Weighted average | 0.882 | 0.890 | 0.884

Table 4.13: NB using MMM classifier weighted average for the six categories in the corpus (5070 documents)

Comparing the overall results from tables 4.8 and 4.13 shows a slight degradation in the precision, recall and F-measure performance. This is because the test set was still evaluated with 4-fold cross validation; with a percentage split instead, the results would be different, since the classifier would learn from more documents.

4.6. Summary
The Naive Bayes classifier using the MMM was evaluated against KNN and Rocchio, and it outperformed both the k-NN and Rocchio classifiers. The Naive Bayes (MMM) classifier achieved the best precision, with the other techniques coming after it.


4.7. Conclusion and Future Work
Text classification for the Arabic language has been investigated in this project. Three classifiers were compared: KNN, Rocchio and Naive Bayes using the Multinomial Mixture Model (MMM). Unclassified documents were pre-processed by removing stopwords and punctuation marks; the remaining words were stemmed and stored in feature vectors, so every test document has its own feature vector. Finally, each document is classified into the best class according to the classifier technique.

The accuracy of the classifiers was measured using recall, precision and the F-measure. For the project experiments, the classifiers were tested using 1445 documents. The results show that NB using the multinomial model outperformed the other two classifiers.
As future work, we plan to continue working on Arabic text categorization, since this area is not widely explored in the literature, and to try the classifiers on a larger collection. In particular, we plan to:
-Apply an auxiliary feature method with the multinomial model in order to improve classification accuracy.
-Compare the Naïve Bayes MMM model with different models, such as the multivariate Bernoulli model (Zhang and Gao).
-Evaluate BPSO feature selection with the multinomial classifier using the same Arabic database mentioned in (Chantar and Corne, 2011), and then compare the two sets of results.


20. CHANTAR, H. K. & CORNE, D. W. 2011. Feature subset selection for Arabic document categorization using BPSO-KNN. Nature and Biologically Inspired Computing (NaBIC), 2011 Third World Congress on. IEEE, 546-551.
21. CHEN, A. & GEY, F. C. 2002. Building an Arabic Stemmer for Information Retrieval. TREC.
22. CHEN, J., HUANG, H., TIAN, S. & QU, Y. 2009. Feature selection for text classification with Naïve Bayes. Expert Systems with Applications, 36, 5432-5435.
23. CIVERA, J. & JUAN, A. 2005. Multinomial Mixture Modelling for Bilingual Text Classification. Technical report DSIC-II/10/05, UPV.
24. COLLINS-THOMPSON, K. & ADVISER-CALLAN, J. 2008. Robust model estimation methods for information retrieval. Carnegie Mellon University.
25. DEISY, C., GOWRI, M., BASKAR, S., KALAIARASI, S. & RAMRAJ, N. 2010. A novel term weighting scheme MIDF for Text Categorization. Journal of Engineering Science and Technology, 5, 94-107.
26. DUWAIRI, R. 2007a. Arabic text categorization. The International Arab Journal of Information Technology, 7.
27. DUWAIRI, R. 2007b. Arabic Text Categorization. International Arab Journal on Information Technology, 4.
28. DUWAIRI, R. M. 2006. Machine learning for Arabic text categorization. Journal of the American Society for Information Science and Technology, 57, 1005-1010.
29. DUWAIRI, R. M. 2007c. Arabic Text Categorization. Int. Arab J. Inf. Technol., 4, 125-132.
30. EL-HALEES, A. 2007. Arabic text classification using maximum entropy. The Islamic University Journal (Series of Natural Studies and Engineering), 15, 157-167.
31. GAMON, M. & AUE, A. 2005. Automatic identification of sentiment vocabulary: exploiting low association with known sentiment terms. Proceedings of the ACL Workshop on Feature Engineering for Machine Learning in Natural Language Processing. Association for Computational Linguistics, 57-64.
32. GHWANMEH, S., KANAAN, G., AL-SHALABI, R. & ABABNEH, A. 2007. Enhanced Arabic Information Retrieval System based on Arabic Text Classification. Innovations in Information Technology, 2007. IIT '07. 4th International Conference on, 18-20 Nov. 2007, 461-465.
33. GOUDJIL, M., KOUDIL, M., HAMMAMI, N., BEDDA, M. & ALRUILY, M. Arabic text categorization using SVM active learning technique: An overview. Computer and Information Techno