Bar Ilan University
The Department of Computer Science
Text Categorization for Large
Multi-class Taxonomy
by
Chaya Liebeskind
Submitted in partial fulfillment of the requirements for the Master's
Degree in the Department of Computer Science, Bar-Ilan University
Ramat Gan, Israel November 2009, Cheshvan 5770
This work was carried out under the supervision of
Prof. Ido Dagan and Prof. Moshe Koppel
The Department of Computer Science
Bar-Ilan University
Israel
Abstract
This thesis investigates Keyword-based Text Categorization (TC) using only a topical
taxonomy of category names as input. The TC task is most commonly addressed as a
supervised learning task. However, the supervised setting requires a substantial amount
of manually labeled documents, which is often impractical in real-life settings.
In Keyword-based TC methods the knowledge about the classes of interest is provided in the form of a few keywords per class. A few keywords can typically be generated more quickly and easily than even a small number of labeled texts. However, the keyword-based approach still requires nonnegligible manual work in creating a representative keyword list per category. Our research is based on a new approach, first proposed in (Gliozzo et al., 2005), which eliminates this requirement by using the category names alone as the initial keyword list.
We adopted the scheme of (Barak et al., 2009), which combines two types of similarity. One type regards words which refer specifically to the category name's meaning (Reference), while the other type captures typical context words for the category which do not necessarily imply its specific meaning (Context).
This thesis is a part of the Negev Consortium (Next Generation Personalized
Video Content Service), within the content recommendation task. Therefore, our first
step was creating a taxonomy for video content along with video dataset construction
and annotation. We then focused on the adaptation of the above scheme to our specific
classification task.
Classification into a large real-world taxonomy raises different issues than
classification for an artificial taxonomy created specifically for a certain academic
dataset. This study describes a proposed classification and evaluation scheme for such a
taxonomy and particularly for our IMDB (Internet Movie Database) taxonomy.
We utilized statistical correlation measured over the target IMDB corpus for improving both the reference model and the context model, aiming to improve the state-of-the-art method proposed by (Barak et al., 2009). We propose a simpler context model based on the Dice coefficient (Mandala et al., 1999), which is a measure of statistical correlation, along with a new statistical Lexical Reference (LR) resource which is based on the Dice coefficient as well.
Furthermore, we offer a different classification and evaluation scheme based on
the assumption that tuning a parameter for each category is an acceptable demand under
the industrial circumstances of the Negev Consortium. We adopt the multi-class classification scheme, since many of the documents in our dataset are classified to more than one category in the gold standard, while many others are not classified to any of the taxonomy categories. We measure how much recall can be achieved at a certain precision level, and then select our precision level according to the desired recall-precision trade-off.
Positive empirical results are presented for our complete method, which indeed
shows higher performance than the previous state-of-the-art method of (Barak et al.,
2009). Our analysis reveals that the reference requirement as the basis for the TC score
helps to classify documents according to the topic they actually discuss, as opposed to
using context models alone, which only reveal the documents' broader context.
Acknowledgements
I would like to take this opportunity to thank the people whose joint efforts assisted me
in writing this thesis.
First and foremost, my greatest thanks go to Prof. Ido Dagan for introducing me to the
wonderful world of Natural Language Processing, and for supervising this research. His
constant support, thorough guidance, and great patience enabled this work.
I wish to thank my supervisor, Prof. Moshe Koppel for providing advice and sharing his
experience.
My gratitude goes also to all my NLP lab members for sharing with me their time and
moral support. I especially want to express my appreciation to Eyal Shnarch, Idan Szpektor, Jonathan Berant, Lili Kotlerman, Roy Bar-Haim and Shachar Mirkin for
sharing with me their words of wisdom, experience and advice when needed.
I would like to thank Naomi Zeichner for her assistance in the taxonomy creation and
the corpus annotation.
I wish to thank Libby Barak for setting up the groundwork for this research, providing
me with her text categorization system and for her guidance at the beginning of this
work.
I want to thank my parents for encouraging me to pursue my academic goals and
dreams, and for giving me the special kind of support only family can provide. I would
also like to thank my husband for his unique support, understanding and faith in me, which encouraged me greatly throughout this work, and my children for simply loving me.
This thesis was partly supported by the Negev Consortium (www.negevinitiative.org),
funded by the Israeli Ministry of Industry, Trade and Labor.
Contents
Introduction
Background
2.1 Unsupervised keyword-based text categorization
2.2 Categorization based on category name
2.3 Lexical Reference
2.4 Query expansion
State-of-the-art performance on the IMDB dataset
3.1 The IMDB dataset
3.2 The limited performance of previous state-of-the-art methods
3.2.1 Unsupervised single-class classification
3.2.2 Bootstrapping
3.3 Applying a state-of-the-art query expansion method
Algorithm improvements
4.1 Utilizing statistical correlation
4.1.1 Dice-based context model
4.1.2 Dice expansions resource
4.2 Combined scoring
A Classification and evaluation scheme for a large real-world taxonomy
5.1 Multi-class classification scheme
5.2 Evaluation measures
5.2.1 Recall-Precision curves
5.2.2 Mean Average Precision (MAP)
Results and Analysis
6.1 Results
6.2 Contribution of Our Method Components
6.2.1 Component Ablation Tests
6.2.2 Resources Ablation Tests
6.3 Further Analysis
6.3.1 Recall-Precision Curves Comparison
6.3.2 Error Analysis
6.4 Bootstrapping results
Conclusion and future work
Appendix A: Our complete IMDB taxonomy
Appendix B: The annotation guidelines
List of Figures
3.1: A part of the IMDB taxonomy
3.2: An example of the problem with the cosine similarity function
5.1: A typical recall-precision graph
6.1: R@P averaged curves methods comparison
6.2: Comparison of R@P average curves of ablation tests
6.3: Comparison of R@P average curves of resources ablation tests
6.4: Recall-precision curve approaches comparison
List of Tables
3.1: Single-class classification results for the IMDB dataset
3.2: Document samples for the passing reference phenomenon
3.3: Document samples for the ambiguity phenomenon
3.4: Missing expanding terms
3.5: Incorrect or ambiguous expanding terms
3.6: Final bootstrapping results
3.7: Query expansion results
4.1: Dice expansions resource marginal contribution
5.1: Contingency Table for one category
6.1: MAP values methods comparison
Chapter 1
Introduction
Topical Text categorization (TC – also known as text classification) is the task of
automatically classifying a set of documents into categories (or classes, or topics) from
a predefined set.
With the rapid growth of online information, text categorization has become one
of the key techniques for handling and organizing text data. Text categorization
techniques are used to classify news stories, to find interesting information on the web
and to guide a user’s search through hypertext browsing. Since building text classifiers
by hand is difficult and time consuming, it is advantageous to learn classifiers
automatically.
The classical supervised learning paradigm requires many hand-labeled
examples to learn accurately. Manually categorizing unlabeled documents for creating
training documents is difficult due to the amount of human labor it requires. Therefore, some recent research has focused on unsupervised learning algorithms with a bootstrapping technique. These algorithms require only unlabeled text collections, which in general are easily available.
Keyword-based TC methods aim at a more practical setting. Each category is
represented by a list of characteristic keywords, which should capture the category
meaning. Classification is then based on measuring similarity between the category
keywords and the classified documents, typically followed by a bootstrapping step. The
manual effort is thus reduced to providing a keyword list per category (McCallum and
Nigam, 1999). (Ko and Seo, 2004; Liu et al., 2004) even partly automated this step,
using clustering to generate candidate keywords. Nevertheless, the method still requires
manual specification as part of the classification process.
(Gliozzo et al., 2005) succeeded in eliminating the requirement for manual
specification of keywords by using the category name alone as the initial keyword, yet
obtaining superior performance within the keyword-based approach. This was achieved
by measuring similarity between category names and documents in Latent Semantic Analysis (LSA) space (Deerwester et al., 1990), which implicitly captures contextual similarities for the category name through unsupervised dimensionality reduction. They generated an initial similarity-based classification that assigns the single most similar category to each document, with the similarity measure typically being the cosine between the
corresponding vectors. This initial unsupervised classification is used, in the subsequent
bootstrapping step, to train a standard supervised classifier (either with single or multi-
class labels per document), yielding the eventual classifier for the category set.
Requiring only category names as user input seems very appealing, particularly when
labeled training data is too costly, while modest performance (relative to supervised
methods) is still useful.
(Barak et al., 2009) offered a novel taxonomy-based approach for keyword-
based TC, which bases its similarity measure on a Lexical Reference (LR) measure
instead of a context measure only. LR, suggested by (Glickman et al., 2006), defines a
more accurate semantic relation, which aims to identify whether the meaning of a
certain term is referenced by some text. This measure aims at a more appropriate
relation to base the TC assumption on, since it requires the actual reference to the
category topic in the text rather than general context similarity. In order to identify
whether the topic is addressed by the text as the main topic and not as a marginal
("passing") reference, they integrate the LSA context model in their overall framework.
Once a reference to the category topic is recognized in a text, they also measure its
context similarity to the category topic. Using this novel integrated framework they
achieve a complementary semantic measure that quantifies the topics mentioned and the
contextual relevancy at the same time. In addition, they use the automatic integrated
measure to create an initial set of classified documents that are then used as input for a
supervised learner in a bootstrapping procedure in order to acquire a final classification.
They utilize relations that are likely to correspond to lexical reference from two
resources: the WordNet (Fellbaum, 1998) semantic relation ontology and the online
encyclopedia Wikipedia. The two resources are complementary by nature and, as
expected, they contribute to different types of categories and relations. Their context-
based method is based on the co-occurrence-based method used in (Gliozzo et al.,
2005), utilizing a Latent Semantic Analysis (LSA) method to represent the context
similarity of documents and categories.
Classification by a large real-world taxonomy is a difficult task. It raises
different issues than classification for an artificial taxonomy created specifically for a
certain academic dataset. This study describes a proposed classification and evaluation
scheme for such a taxonomy and particularly for the IMDB taxonomy.
In this thesis we adopt the approach of (Barak et al., 2009), which combines the
reference similarity score with the context similarity score. Aiming to improve their
method, we utilized statistical correlation for improving both the reference model and
the context model.
We propose a simpler context model based on Dice coefficient (Mandala et al.,
1999), which is a measure of statistical correlation. We expand each category name by
the top-k co-occurring terms with the highest Dice score and calculate the cosine
similarity score between the expanded vector and the document vector. This score is
used as our context model score. Combining our context model with the LSA context model yields a performance improvement. We also found that our simple Dice-based context model alone is comparable to the useful but complex LSA context model.
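To make this concrete, the following is a minimal sketch of such a Dice-based expansion and scoring step, assuming precomputed document-frequency and co-occurrence counts; all names are illustrative, and the actual system may weight terms differently (e.g., tf-idf):

```python
import math
from collections import Counter

def dice(cooc, df, a, b):
    """Dice coefficient from document-level counts: df[t] is the number of
    documents containing t, cooc[(a, b)] the number containing both."""
    denom = df.get(a, 0) + df.get(b, 0)
    return 2.0 * cooc.get((a, b), 0) / denom if denom else 0.0

def expand_category(name, vocabulary, cooc, df, k=20):
    """Expand a category name by its top-k co-occurring terms by Dice score."""
    scored = sorted(((dice(cooc, df, name, t), t) for t in vocabulary if t != name),
                    reverse=True)
    return [name] + [t for s, t in scored[:k] if s > 0]

def context_score(category_terms, doc_tokens):
    """Cosine similarity between the (binary) expanded category vector and
    the document's bag-of-words vector."""
    doc = Counter(doc_tokens)
    dot = sum(doc[t] for t in category_terms)
    norms = math.sqrt(len(category_terms)) * math.sqrt(sum(f * f for f in doc.values()))
    return dot / norms if norms else 0.0
```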
Furthermore, we utilized a new statistical LR resource, overcoming the problem
of WordNet and Wikipedia, which sometimes find good references that do not appear in
the corpus. We used the Dice coefficient measure for this purpose as well. We filtered
the top-k co-occurring terms, reduced their noise and achieved relatively precise LR
lists.
We also found that it is better to avoid the single-class classification scheme
suggested by (Barak et al., 2009), since we address a large real-world taxonomy. In a
real-world taxonomy, a portion of the documents may not be classified into any of the
categories, while many documents can be classified into multiple categories. On the one hand, single-class classification forces a classification for each document; on the other hand, it removes correct classifications, since only the category with the maximal classification score is selected for each document. We therefore adopt a multi-class
classification scheme, where each document may be classified to zero, one or more
categories.
In this thesis we offer a different classification and evaluation scheme, based on the assumption that tuning a parameter for each category is an acceptable demand under certain (particularly industrial) circumstances. We measure how much recall can be achieved at a certain precision level and select our precision level according to the desired recall-precision trade-off. Classifications that maintain precision greater than the given precision level are considered valid.
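One plausible way to compute this recall-at-precision measure for a single category, given scored classification decisions against the gold standard, is sketched below (the exact per-category tuning procedure is described in Chapter 5):

```python
def recall_at_precision(decisions, num_gold_positives, min_precision):
    """decisions: (score, is_correct) pairs for one category, over all documents.
    Returns the maximal recall reachable while precision stays >= min_precision."""
    best_recall, tp = 0.0, 0
    for rank, (_, correct) in enumerate(sorted(decisions, reverse=True), start=1):
        tp += int(correct)                      # true positives so far
        if tp / rank >= min_precision and num_gold_positives:
            best_recall = max(best_recall, tp / num_gold_positives)
    return best_recall
```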
Positive empirical results are presented for our complete method, which indeed
achieves better results than the state-of-the-art method suggested by (Barak et al.,
Our results support the hypothesis that the LR-based approach is more accurate than the
context-based approach alone. The results reveal that our classification and evaluation
scheme contributes to the performance improvement as well.
In Section 2 we provide some background on recent works and the resources
used for our method. Section 3 describes the IMDB dataset and analyses the state-of-
the-art performance on it. We describe our new context and reference models in
Sections 4.1.1 and 4.1.2. Section 5 discusses our different classification and evaluation
schemes. Results and analysis are presented in Section 6.
We show that using an initial reference method as the basis for the classification
decision provides promising results, which are restricted mostly by the recall of the LR
resource in use.
Our proposed method achieves higher precision results, suggesting that the
reference assumption along with the context verification is indeed more suitable to the
needs of the TC task. With the ongoing development of promising LR resources and
different context models, it is expected that TC methods based on the combined
approach can attain results showing further improvement.
Chapter 2
Background
The goal of Text Categorization (TC) is to classify texts into a number of predefined
categories. Supervised systems for TC require a large number of labeled training texts.
While it is easy to collect unlabeled texts, it is not so easy to manually categorize them
for creating training texts. Unsupervised Text Categorization enables building classifiers from unlabeled texts, thereby saving substantial human labor. This section describes related work and provides motivation for our method. Unsupervised keyword-based text categorization is first presented (Section 2.1), and then categorization based on category name is described and the framework and motivation of the method we employ are presented (Section 2.2). Next, background on the lexical reference framework
and resources and the motivation to use it are explained (Section 2.3). Finally, query
expansion methods are described and their relevancy to TC is explained (Section 2.4).
2.1 Unsupervised keyword-based text categorization
This study focuses on unsupervised keyword-based TC. In Unsupervised Text
Categorization, the knowledge about the classes of interest is provided in the form of a
few keywords per class. A few keywords can typically be generated more quickly and easily than even a small number of labeled texts.
One approach is to apply a bootstrapping procedure starting from a few describing keywords per class (McCallum and Nigam, 1999). The approach follows these steps: (a) based on keyword matching, a rule-based classifier categorizes the unlabeled examples, (b) the labeled data is then used to train a Naïve Bayes (NB) classifier using an Expectation Maximization (EM) algorithm, (c) the EM step is performed until the likelihood function converges.
A more recent approach based on the vector-space model of information
retrieval (Liu et al., 2004) was implemented by the following steps: (a) a clustering
algorithm was applied to find a list of candidate keywords, (b) a lexicographer chose
from that list a set of words for each category, (c) the unlabeled examples were
categorized using the highest similarity score defined by similarity metrics in the Vector
Space Model (VSM) (Salton and McGill, 1983), (d) an NB classifier was trained with the automatically labeled data, and (e) the whole collection was classified with the obtained classifier following the EM schema. This approach achieved slightly lower results than a supervised NB classifier on the same task.
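A minimal hard-EM sketch of this kind of keyword-based bootstrapping is shown below; the cited works use soft EM over class posteriors, and the keyword-matching rule here is a simplified assumption:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def keyword_bootstrap(docs, keywords, iterations=5):
    """docs: list of raw texts; keywords: {category: [keyword, ...]}.
    Step (a): label by keyword matching; then train NB and re-label (hard EM)."""
    labels = [max(keywords, key=lambda c: sum(d.count(w) for w in keywords[c]))
              for d in docs]
    X = CountVectorizer().fit_transform(docs)   # bag-of-words features
    nb = MultinomialNB()
    for _ in range(iterations):
        nb.fit(X, labels)                # retrain on the current labeling
        labels = nb.predict(X).tolist()  # re-label the whole collection
    return nb, labels
```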
2.2 Categorization based on category name
TC approaches that use only the category name as the input and require no manual
effort during the classification process have been attempted rather rarely in the
literature.
One approach was introduced by (Gliozzo et al., 2005). They obtained their best
performance using only the category name as the input for the bootstrapping algorithm.
Their algorithm includes the following steps: (a) expanding the category names using
Latent Semantic Space (Deerwester et al., 1990), such that the categories are
represented in LSA space, (b) separating relevant and non-relevant category information
using statistics from unlabeled examples by a Gaussian Mixture algorithm, (c)
classifying each unlabeled example to the most probable category and (d) training a
SVM classifier on the set of labeled examples resulting from the previous step. They
reported results on two datasets – 20 Newsgroups [1] and Reuters-10 (the 10 most frequent categories [2] in Reuters-21578 [3]), showing improvement relative to earlier
keyword-based methods.
1 The collection is available at www.ai.mit.edu/people/jrennie/20Newsgroups.
2 The first 10 categories are: Earn, Acquisition, Money-fx, Grain, Crude, Trade, Interest, Ship, Wheat and Corn.
3 available at http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
(Downey et al., 2009) introduced the Monotonic Feature (MF) abstraction, where
the probability of class membership increases monotonically with the MF’s value. In
document classification, the name of the class is a natural MF; the more frequently it is
repeated in a document, all other factors being equal, the greater the likelihood that the
document belongs to the class. They extended the experiments of (Gliozzo et al., 2005),
presenting theoretical and empirical results, showing that even relatively weak MFs can
be used to induce a noisy labeling over examples, and these examples can then be used
to train effective classifiers utilizing existing supervised or semi-supervised techniques.
They proved that the Monotonic Feature (MF) structure guarantees PAC learnability
using only unlabeled data, and that MFs are distinct from and complementary to
standard biases used in semi-supervised learning, including the manifold and cluster
assumptions.
The most recent approach has been reported by (Barak et al., 2009). They
proposed a novel scheme that models separately two types of similarity. One type
regards words that refer specifically to the category name’s meaning, such as pitcher
and yankees for the category baseball, while the other type regards typical context
words for the category that do not necessarily imply its specific meaning, like stadium
and field for the category baseball.
They were mostly inspired by (Glickman et al., 2006), who coined the term lexical
reference to denote concrete references in text to the specific meaning of a given term,
and assumed that a relevant document for a category typically includes concrete terms
that refer specifically to the category name’s meaning. Referring terms were collected
from WordNet and Wikipedia by utilizing relations that are likely to correspond to
lexical reference.
Referring terms were found in WordNet starting from relevant senses of the
category name. A category name sense was first expanded by its synonyms and
derivations, all of which were then expanded by their hyponyms. When a term had no
hyponyms it was expanded by its meronyms instead, since they observed that in such
cases meronyms often specify unique components that imply the holonym’s meaning,
such as Egypt for Middle East. However, when a term is not a leaf in the hyponymy
hierarchy, then its meronyms often refer to generic sub-parts, such as door for car.
Finally, the hyponyms and meronyms were expanded by their derivations. As a
common heuristic, they considered only the most frequent senses (top four) of referring
terms, avoiding low-ranked (rare) senses that are likely to introduce noise, when used
for expansion.
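A rough sketch of this expansion scheme over WordNet, using NLTK, is given below; it simplifies the sense-filtering heuristic and the manual choice of relevant category-name senses, so it should be read as an illustration rather than the exact procedure:

```python
from nltk.corpus import wordnet as wn  # requires NLTK with the WordNet data installed

def wordnet_referring_terms(category_name, max_senses=4):
    """Synonyms and derivations of the category name, expanded by hyponyms
    (or meronyms at hyponymy leaves), plus the derivations of those."""
    terms = set()
    for synset in wn.synsets(category_name)[:max_senses]:
        lemmas = list(synset.lemmas())
        # leaves of the hyponymy hierarchy are expanded by meronyms instead
        for expansion in synset.hyponyms() or synset.part_meronyms():
            lemmas.extend(expansion.lemmas())
        for lemma in lemmas:
            terms.add(lemma.name().replace('_', ' '))
            for derived in lemma.derivationally_related_forms():
                terms.add(derived.name().replace('_', ' '))
    return terms
```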
Additional referring terms were extracted from Wikipedia. For each category
name they extracted referring terms of two types, capturing hyponyms and synonyms.
Terms of the first type are Wikipedia page titles for which the first definition sentence
includes a syntactic “is-a” pattern whose complement is the category name, such as
Chevrolet for the category Autos. Terms of the second type are extracted from
Wikipedia’s redirect links, which capture synonyms such as x11 for X-Windows.
The reference vector for a category consists of the category name and all its
referring terms, equally weighted. The documents are vectors in term space, and the
cosine similarity function measures the category-document similarity. This similarity
result is their Reference model score:

$sim_{ref}(c,d) = \cos(\vec{r}_c, \vec{d}) = \frac{\vec{r}_c \cdot \vec{d}}{\|\vec{r}_c\| \, \|\vec{d}\|}$

where $\vec{r}_c$ is the reference vector of category c and $\vec{d}$ is the document vector in term space.
Classifying by the Reference model may yield false positive classifications in two cases:
(a) inappropriate sense of an ambiguous referring term, e.g., the narcotic sense of drug
should not yield classification to Medicine; (b) a passing reference, e.g., an analogy to
cars in a software document, should not yield classification to Autos. In both these cases the overall context in the document is expected to be atypical for the triggered category. They therefore measure the contextual similarity between a category and a document utilizing LSA space, replicating the method in (Gliozzo et al., 2005). Both the category
names and the documents are represented in the latent space and the LSA similarity
score between them is obtained by calculating the cosine similarity. This similarity
result is their Context model score:

$sim_{con}(c,d) = \cos(\vec{c}_{LSA}, \vec{d}_{LSA})$

where $\vec{c}_{LSA}$ and $\vec{d}_{LSA}$ are the LSA vectors of the category name and the document, respectively.
To combine the scores obtained by these two models (termed the Combined model), they used multiplication. Multiplication reduces the score of documents that contain referring terms but relate to irrelevant contexts. Moreover, when the score obtained by the reference scoring method is equal to zero, the integrated score is also zero. Ideally, given perfect reference knowledge, this means that when the text does not refer to the category topic, it would not be classified to that category even if it involves a related context.
The overall similarity score is defined as:

$sim(c,d) = sim_{ref}(c,d) \times sim_{con}(c,d)$
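A sketch of this combined scoring, with sparse vectors represented as term-to-weight dicts and the LSA vectors assumed to be precomputed (helper names here are illustrative, not from the thesis):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as {feature: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def combined_score(ref_vec, doc_vec, lsa_cat, lsa_doc):
    """Reference score times Context score: a document with no reference
    to the category (reference score 0) can never be classified to it."""
    return cosine(ref_vec, doc_vec) * cosine(lsa_cat, lsa_doc)
```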
The similarity scores obtained by this Combined measure were used to produce
an initial labeled set of documents for training a supervised classifier. They used the
initial labeled set, in which each document is considered as classified only to the best
scoring category, to train a SVM classifier for each category. They used the default
setting for SVM-light, apart from the j parameter, which was set to the number of
categories in each data set, as suggested by (Morik et al., 1999). For Reuters-10,
classification was determined independently by the classifier for each category,
allowing multiple classes per document. For 20-NewsGroups, the category that yielded
the highest classification score was chosen (one-versus-all), fitting the single-class
setting of this corpus. They experimented with two document representations for the
supervised step: either as vectors in tf-idf weighted term space, or as vectors in LSA
space.
They tested their method on the two corpora used in (Gliozzo et al., 2005). The
Reference model achieves much better precision than the Context model from (Gliozzo
et al., 2005) alone. Combining reference and context yields some improvement for
Reuters-10, but not for 20-NewsGroups. They noticed though that the realistic accuracy
of their method on 20-NewsGroups is notably higher than when measured relative to the
gold standard, due to its single-class scheme: in many cases, a document should truly
belong to more than one category, while that chosen by their algorithm was counted as a
false positive.
In this thesis we base our method on the keyword-based approach, and in
particular the approach described in (Barak et al., 2009), by creating a two-phase
method: (1) automatically creating category representations to acquire an initial set of
labeled documents based on a similarity score between the categories and the document
representations, (2) classifying the unlabeled documents based on the initial categorized
set using an SVM-based classifier. We expand the integrated model based on a reference requirement and context fitness. Next we will describe the lexical reference framework
and the lexical semantic relations resource used to acquire lexical reference expansions
(rules).
2.3 Lexical Reference
The Lexical Reference (LR) notion was defined in (Glickman et al., 2006) to
denote in-text references to the specific meaning of a target term. They further analyzed
the dataset of the First Recognizing Textual Entailment Challenge (Dagan et al., 2006),
which includes examples drawn from seven different application scenarios. It was found
that an entailing text indeed includes a concrete reference to practically every term in
the entailed (inferred) sentence.
The LR relation between two terms may be viewed as a lexical inference rule, denoted
LHS => RHS. This rule indicates that the left-hand-side term would generate a
reference, in some contexts, to a possible meaning of the right-hand-side term, e.g.
Jaguar => luxury car. In this example the LHS is a hyponym of the RHS. Indeed, the
commonly used hyponymy, synonymy and some cases of the meronymy relations are
special cases of lexical reference. However, lexical reference is a broader relation. For
instance, the Lexical Reference rule physician => medicine may be useful to infer the
topic medicine in a TC setting. To integrate the LR rules in the TC scheme described
above, the initial seeds based on the category name are expanded with referring terms
extracted from the LR rules. For each rule in which the RHS of the rule is one of the
seed terms for a specific category, the LHS term of this rule is added to the seed terms
of this category to create the set of representing keywords for the category. Below we
describe the external resources used by our method to extract LR rules.
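Before turning to those resources, the seed-expansion step itself is straightforward; a sketch, with rules given as (LHS, RHS) pairs:

```python
def expand_seeds(seeds, lr_rules):
    """Add the LHS of every LR rule whose RHS is one of the category's seed
    terms, e.g. the rule ('physician', 'medicine') expands the seed 'medicine'."""
    keywords = set(seeds)
    keywords.update(lhs for lhs, rhs in lr_rules if rhs in seeds)
    return keywords
```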
2.3.1 Lexical Reference Resources
Lexical-semantic resources, which provide the knowledge needed for lexical inference,
are commonly utilized by applied inference systems (Giampiccolo et al., 2007) and
applications such as Information Retrieval, Question Answering and Text
Categorization (Shah and Croft, 2004; Pasca and Harabagiu, 2001; Scott and Matwin,
1999). We based our LR rules extraction methods on external resources available
online. The resources utilized for this purpose are a lexical resource, the WordNet
lexical ontology, and a textual resource, Wikipedia, the online encyclopedia. Given the
different nature of the two resources, the method applied to each of them is quite
different. Below we provide a short description of each resource and its characteristics.
WordNet WordNet [4] is a large lexical database of English (Fellbaum, 1998),
initially developed under the direction of George A. Miller. Nouns, verbs, adjectives
and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a
distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical
relations.
Every synset contains a group of synonymous words; different senses of a word appear
in different synsets. The meaning of the synsets is further clarified with short defining
glosses (definitions and/or example sentences). A typical example synset with a gloss is:
good, right, ripe (most suitable or right for a particular purpose; “a good time to plant
tomatoes”; “the right time to act”; “the time is ripe for great sociological changes”).
Most synsets are connected to other synsets via a number of semantic relations. Among
the semantic relations WordNet consists of are hyponyms (is-a relation) and meronyms
(is-part-of relation). While semantic relations apply to all members of a synset because
they share a meaning and are all mutually synonymous, words can also be connected to
other words through lexical relations, including antonyms, or derivational relations.
The TC task is one of various NLP tasks for which WordNet is exploited as a
source for lexical expansion. WordNet was used as a source for synonyms and
hypernyms to enhance feature data for TC methods in several works. (de Buenaga
Rodriguez et al., 1997) utilized WordNet as a source for synonyms based on the
assumption that the name of the category can be a good predictor of its occurrence.
They used WordNet synsets to perform category name expansion, similar to query
expansion in search, using the category name synonyms. This information was added to
labeled training examples as the input of supervised learning algorithms. The integrated
algorithm achieved an improvement of 20 points in precision and was found to be
extremely helpful for low-frequency categories, which have a lower number of training
examples.
4 We used version 3.0 of WordNet available at http://WordNet.princeton.edu/obtain
Another study that combined WordNet information with labeled training data is
that of (Scott and Matwin, 1999) who used WordNet as a source for synonyms and
hypernyms, which were added to the representation of each document.
A more recent study that combined WordNet information as described earlier in
Section 2.2 is (Barak et al., 2009), which used WordNet as a source for derivations,
synonyms, hyponyms and meronyms. Our method uses their extraction method
described in Section 2.2 to acquire LR rules from WordNet knowledge.
Wikipedia Wikipedia [5] is a collaborative online encyclopedia that covers a wide
variety of domains. Wikipedia is constantly growing and evolving based on the
contribution of online users, and had more than 1,700,000 articles on the English
version as of March 2007 (Kazama and Torisawa, 2007). (Giles, 2005) shows that the quality of Wikipedia articles is comparable to that of the Britannica internet encyclopedia.
(Shnarch et al., 2009) developed a Wikipedia-based LR resource. Each
Wikipedia article provides a definition for the concept denoted by the title of the article.
As a starting point, they examine the potential of definition sentences as a source for LR
rules (Ide and Véronis, 1993; Chodorow et al., 1985; Moldovan and Rus, 2001). When
writing a concept definition, the aim is to formulate a concise text that includes the most
characteristic aspects of the defined concept. Therefore, a definition is a promising
source for LR relations between the defined concept and the definition terms. In
addition, they extract LR rules from Wikipedia redirect and hyperlink relations. As a
guideline, they focused on developing simple extraction methods that may be applicable
for other Web knowledge resources, rather than focusing on Wikipedia-specific
attributes. Overall, their rule base contains some eight million candidate lexical
references.
(Barak et al., 2009) used this Wikipedia-based LR resource to extract referring
terms of two types: Wikipedia page titles for which the first definition sentence includes a syntactic “is-a” pattern whose complement is the category name (Yamaha SR500 => motorcycle), and terms extracted from Wikipedia’s redirect links.

5 We used the English version from February 2007 available at www.ukp.tudarmstadt.de/software/JWPL
In our research we adopted a better extraction method, as reported by (Shnarch et al., 2009). We used a Wikipedia-based LR resource to extract more types of referring terms, while filtering rules that tend to relate terms that are rather unlikely to occur together.
The extraction types we used were as follows:
Be-Comp The Be-Comp extraction method identifies the is-a pattern in the
definition sentence by extracting nominal complements of the verb “be,” taking them as
the RHS of a rule whose LHS is the article title.
All-N The Be-Comp extraction method yields mostly hypernym relations,
which do not exploit the full range of lexical references within the concept definition.
Therefore, the All-N extraction method creates rules for all head nouns and base noun
phrases within the definition.
Title Parenthesis A common convention in Wikipedia to disambiguate ambiguous
titles is adding a descriptive term in parentheses at the end of the title, as in The Siren (musical), The Siren (sculpture) and siren (amphibian). From such titles the Title
Parenthesis extraction method extracts rules in which the descriptive term inside the
parenthesis is the RHS and the rest of the title is the LHS.
Redirect Like any dictionary or encyclopedia, Wikipedia contains redirect links
that direct different search queries to the same article, which has a canonical title. For
instance, there are 86 different queries that redirect the user to United States (e.g.
U.S.A., America, Yankee land). Redirect links are hand-coded, specifying that both
terms refer to the same concept. The method therefore generates a bidirectional
entailment rule for each redirect link.
Link Wikipedia texts contain hyperlinks to articles. For each link a rule is generated,
whose LHS is the linking text and RHS is the title of the linked article. In this case the
Link extraction method generates a directional rule since links do not necessarily
connect semantically equivalent entities.
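As an illustration of the simpler of these extraction methods, here is a sketch of the Title Parenthesis pattern; the regex is a hypothetical reconstruction, not the resource's actual extractor:

```python
import re

TITLE_PAREN = re.compile(r'^(?P<lhs>.+?)\s*\((?P<rhs>[^()]+)\)$')

def title_parenthesis_rule(title):
    """'The Siren (musical)' -> ('The Siren', 'musical'); None if no match."""
    match = TITLE_PAREN.match(title.strip())
    return (match.group('lhs'), match.group('rhs')) if match else None
```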
Based on the rule filtering method proposed in (Shnarch et al., 2009), we filtered
rules which tend to relate terms that are rather unlikely to occur in combination. The
authors recognized such rules by their co-occurrence statistics within Wikipedia, using
the common Dice coefficient:

$Dice(LHS, RHS) = \frac{2 \cdot C(LHS, RHS)}{C(LHS) + C(RHS)}$

where C(x) is the number of articles in Wikipedia in which all words of x appear, and C(x, y) is the number of articles in which both x and y appear.
They also adjust the Dice equation for rules whose RHS is also part of a larger noun phrase (NP). The LR resource enables extracting rules with their Dice score, where rule filtering is done by setting a threshold on the Dice rule score. We tuned the threshold parameter (which was set to 0.01) on our development dataset, described in Section 3.
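A sketch of this score-and-filter step, assuming the article counts are available (the NP adjustment mentioned above is omitted):

```python
def dice_score(c_lhs, c_rhs, c_both):
    """Dice coefficient over Wikipedia article counts, as defined above."""
    return 2.0 * c_both / (c_lhs + c_rhs) if (c_lhs + c_rhs) else 0.0

def filter_rules(rules, counts, cooc_counts, threshold=0.01):
    """Keep only (lhs, rhs) rules whose Dice score clears the tuned threshold."""
    return [(l, r) for l, r in rules
            if dice_score(counts[l], counts[r], cooc_counts.get((l, r), 0)) >= threshold]
```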
As an encyclopedic resource containing cultural and day-to-day terms, by its
nature Wikipedia is complementary to the type of rules extracted from the WordNet
resource, which provides the typical terms similar to terms found in a dictionary.
2.4 Query expansion
A major problem in Information Retrieval (IR) is that relevant documents may contain
words that differ from those which appear in the user formulated query, although their
meaning is the same. One way to solve this problem is through automatic query
expansion.
Query Expansion (QE) is known to improve IR performance (Xu et al., 1996).
By expanding the query, the number of returned documents increases, and we expect to
retrieve a large set of relevant documents and to improve recall. On the other hand, by
increasing the number of retrieved documents, the chance of returning non-relevant
documents increases, too, and can decrease precision, as expansion may add noise to the
retrieved set, since the query includes terms which do not contribute to relevance
(Manning et al., 2008).
Keyword-based TC and QE are analogous tasks. The category names in TC are
analogous to the queries in QE. Both of the tasks expand the seeds with other related
terms in order to increase recall. Therefore, in this section we describe several QE
methods.
The methods for automatic Query Expansion fall into two major classes: global methods and local methods.
Global methods: In order for an IR engine to perform automatic query expansion, it
would need a large resource that could supply good expansion terms for a variety of
query words. Examples of such resources are WordNet and Wikipedia.
Another possible source for query expansion is a distributional similarity
algorithm, such as in (Lin, 1998). In this case, a query term would be expanded with
words that appear in similar contexts.
Another type of resource is based on co-occurrences of terms in the same
document, as opposed to distributional similarity, which is based on having similar
contexts across documents.
In our keyword-based TC method we utilized large global resources as our LR
resources.
Local methods: Local methods for query expansion reduce the source of expanding terms to a partial collection. These methods adjust a query relative to the documents that initially appear to match the query.
Local techniques such as pseudo-relevance feedback (PRF) require two passes over the query. PRF specifies the process of automatically examining the top-ranked documents in an IR system's ranking, and using information from these documents to improve the ranking.
This is done by assuming that the top-ranked documents are relevant, and using
information from this ‘pseudo-relevant set’ to improve the accuracy of the ranking by
expanding on the initial query and re-weighting the query terms.
The Rocchio algorithm is the classic algorithm for implementing relevance feedback (RF). It models a way of incorporating relevance feedback information into the vector space model. Its underlying theory is to find a query vector that maximizes similarity with relevant documents while minimizing similarity with non-relevant documents. However, it was shown that better results are obtained for routing by using only documents close to the query of interest rather than all documents (Schutze et al., 1995). The Rocchio re-weighting formula is:

$\vec{q}_m = \alpha \vec{q}_0 + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j$

where $D_r$ is the subset of the collection that is considered relevant to the query, $D_{nr}$ is the non-relevant subset, α is the importance of the original query (between 0 and 1), β is the importance of the relevant documents and γ is the importance of the non-relevant documents.
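A sketch of Rocchio re-weighting over dense term vectors; the weights shown are common textbook defaults, not values from this thesis:

```python
import numpy as np

def rocchio(query_vec, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the centroid of the relevant documents and away
    from the centroid of the non-relevant ones."""
    q = alpha * np.asarray(query_vec, dtype=float)
    if len(relevant):
        q += beta * np.mean(np.asarray(relevant, dtype=float), axis=0)
    if len(nonrelevant):
        q -= gamma * np.mean(np.asarray(nonrelevant, dtype=float), axis=0)
    return np.maximum(q, 0.0)  # negative term weights are usually clipped to zero
```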
(Perez-Aguera et al., 2008) adopted a different approach to query expansion,
which is based on studying the difference between the term distribution in the whole
collection and in the subsets of documents that can be relevant for the query. One would
expect that terms with little informative content have a similar distribution in any
document of the collection. On the contrary, terms closely related to those of the
original query are expected to be more frequent in the top-ranked set of documents
retrieved with the original query than in other subsets of the collection.
Chapter 3
State-of-the-art performance on the IMDB dataset
The focus of this research is keyword-based Text Categorization for a large real-world taxonomy. Classification by such a taxonomy raises different difficulties and is much more complex than classification for an artificial taxonomy created specifically for a certain academic dataset.
In this section we describe the dataset which was built at Bar-Ilan University in cooperation with COMVERSE. Its construction and annotation, along with the taxonomy creation, are first described (Section 3.1). Next, the poor performance of state-of-the-art methods, along with an error analysis, is detailed (Section 3.2). Finally, the performance of a state-of-the-art query expansion method, along with a comparison to other state-of-the-art methods, is presented (Section 3.3).
3.1 The IMDB dataset
The Internet Movie Database (IMDB) [6] is an online database of information related to
films, television programs, etc. In many cases the information goes beyond simple title and crew credits, and also includes data such as plot summaries and reviews.
The IMDB dataset which we created for our research is a collection of 130,000 movie descriptions downloaded from the IMDB website. Each movie description (termed document) contains the movie title and plot summary information. These documents' topics are unknown. The IMDB dataset is thus a large collection of movie descriptions which was not labeled with predefined topics.

6 www.imdb.com
The IMDB taxonomy creation and corpus annotation were done at Bar-Ilan
University. We researched the IMDB database, its structure and content to see which
information can be useful for building the taxonomy. Browsing the internet, we found media taxonomies, which we compared and combined with annotated IMDB keywords to create a new optimized taxonomy. Figure 3.1 shows a part of the taxonomy. Appendix A includes our complete IMDB taxonomy. Our taxonomy includes 97 topical categories organized in a three-level hierarchical structure, where each classification to a daughter
category is considered as a classification to all its ancestors as well. For example, a
document whose category is baseball is considered by the gold standard as a sport
document too, since baseball is a daughter of the sport category.
Figure 3.1: A part of the IMDB taxonomy
We manually annotated 1,970 movie descriptions with topic (category) labels from the taxonomy. While selecting the documents to be annotated we had to make sure that the set of categories is a representative sample of the actual database. The major issues which were taken into consideration are the distribution of genres in the IMDB database and the fact that some genres are better suited for the classification task than others. We selected 2/3 of the dataset from the group of genres better suited for topical classification (Biography, Documentary, History, Music, Sport, and War) and 1/3 from the rest of the genres. Appendix B includes the annotation guidelines. A couple of iterations were required to stabilize the annotation. We randomly split the annotated set into development (50%) and test (50%) subsets.
The gold standard obtained for the collection is multi-class; hence each document may be classified to zero, one or more categories.
Although we filtered out descriptions with fewer than 150 characters, many of the descriptions in the collection are still short. Given that 30% of the descriptions in the annotated set are not classified to any of the taxonomy categories, and given the large number of categories in the taxonomy (97), the IMDB classification task becomes even more challenging.
3.2 The limited performance of previous state-of-the-art
methods
We replicated the method in (Barak et al., 2009), described in section 2.2, as
representing a state-of-the-art classifier, including both its unsupervised and
bootstrapping steps. (Barak et al., 2009) proposed a novel scheme that models
separately two types of similarity. One type regards words which refer specifically to
the category name’s meaning (Reference). While the other type is typical context words
for the category which do not necessarily imply its specific meaning (Context). For one,
it identifies words that are likely to refer specifically to the category name’s meaning
(Glickman et al.,2006), based on certain relations in WordNet and Wikipedia. In
tandem, they assess the general contextual fit of the category topic using an LSA
context model to overcome lexical ambiguity and passing references (as described in
section 2.2). The similarity scores obtained by their combined measure (Combined)
were used to produce an initial labeled set of documents which was then used to train a
supervised classifier in a bootstrapping step.
3.2.1 Unsupervised single-class classification
We tested both components of the scoring method in (Barak et al. 2009) (Combined),
the Reference model and the Context model.
The reference model represents each category by its seed terms along with the referring expansion terms for the seeds (where category names are used as the seeds), and obtains a reference cosine similarity score between the two vectors of each document-category pair. The referring terms are collected from WordNet and Wikipedia as detailed in Section 2.3.
The context model from (Barak et al. 2009) is a replication of the method in
(Gliozzo et al., 2005). That original method includes a Gaussian Mixture rescaling step
for the context model, which (Barak et al. 2009) didn’t find helpful. We created
representing vectors for each category - the category name was represented using Latent
Semantic Analysis (LSA), in which documents and categories are represented in a latent
semantic space. LSA is a dimensionality reduction method which decreases the number
of dimensions in the document-by-term matrix. It converts the co-occurrence data
represented in the matrix to a representation of implicit semantic concepts in the latent
space. The LSA similarity score between documents and categories is obtained by
calculating the cosine similarity between their representing LSA vectors. We used the LSA toolkit created by Idan Szpektor and Jacob Goldberg at Bar-Ilan to generate the LSA vectors from the IMDB corpus. We set the LSA dimension to 300.
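The thesis used this in-house LSA toolkit; purely for illustration, an equivalent pipeline can be sketched with scikit-learn's truncated SVD (an assumption, not the original implementation):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def lsa_context_scores(documents, category_names, dim=300):
    """Project documents and category names into a dim-dimensional latent
    space and score every category-document pair by cosine similarity."""
    vectorizer = TfidfVectorizer()
    doc_terms = vectorizer.fit_transform(documents)
    svd = TruncatedSVD(n_components=dim)
    doc_lsa = svd.fit_transform(doc_terms)
    cat_lsa = svd.transform(vectorizer.transform(category_names))
    return cosine_similarity(cat_lsa, doc_lsa)  # shape: categories x documents
```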
The combined scoring method was obtained by multiplication of the reference
score with the context score.
We also examined the baseline of including only the category name in the
reference vector (Cat-Name).
This unsupervised step of the algorithm classifies each document to a single
category, the category with the highest similarity to the document.
Table 3.1 presents the relatively poor classification results obtained for these methods.
Scoring method   Recall   Precision   F1
Cat-Name         0.29     0.45        0.35
Reference        0.33     0.30        0.31
Context          0.29     0.26        0.28
Combined         0.37     0.35        0.36
Table 3.1: Single-class classification results for the IMDB dataset.
A comparison of the results on the IMDB dataset with the results on the standard datasets 20 Newsgroups and Reuters-10, which were used in (Barak et al., 2009), shows that the scoring method performs much better on the standard datasets (the 20 Newsgroups F1 score was 0.41, while the Reuters-10 F1 score was 0.76). This might be due to the artificial structure of these standard academic datasets. Both of these datasets have attributes which don't exist in our real-world dataset. The 20 Newsgroups documents are partitioned (nearly) evenly across the 20 categories, while Reuters-10 is only a sub-corpus of the Reuters-21578 collection, constructed from the 10 most frequent categories in the Reuters taxonomy. In addition, the Reuters categories are domain specific, and are all relevant to economic topics.
Moreover, the IMDB documents were written by ordinary users of the IMDB website. Nonprofessional writers tend to add more unnecessary details such as actor
names, use anecdotal descriptions and sometimes even leave incomplete descriptions.
This makes the IMDB classification task much harder.
Error Analysis
Several error cases that were detected and categorized are detailed below.
1. Frequent passing reference: A dominant phenomenon which causes misclassification is passing references. A passing reference occurs when the topic name or any partial group of its characteristic terms appears in a document but does not refer to the main topic of the document. This phenomenon is relevant to all types of topics, including named entities such as company names which are commonly mentioned, general topics which may be discussed as an allegory, or an object which is referred to widely in the corpus. Table 3.2 shows several examples of documents which contain a passing reference to one of the IMDB collection topics.
No. | Gold Standard Category | Method's Classification | Document Example
1 | Political History | Cinema | “Jack Nicholson's portrait of Union leader... The film follows Hoffa through his countless battles with the RTA and President Roosevelt...”
2 | Medicine | College/University | "A medical student... West moves to Miskatonic University to continue his research..."
3 | Baseball | Weather | “...when it is winter Ben can spend every waking hour with Lindsey... Lyndsey gets hit by a line drive foul ball off of Baltimore Orioles' Miguel Tejada, and the Sox begin to loose...”
4 | None | Arts | "Colin's a sad-eyed British artist (Firth) holed up in a rundown hotel in small-town Vermont after being dumped by his fiancee...”
Table 3.2: Document samples for the passing reference phenomenon. The problematic
terms are bolded
The first example (no. 1) in Table 3.2 is an example of a term which is referred to widely in the IMDB corpus. This term is a good expansion for the category cinema, but since IMDB is a movies domain it causes a passing reference. This phenomenon mostly happens with the crime and cinema categories. Examples no. 2-4 correspond to terms that do not refer to the main topic of the document. They describe a certain place (no. 2), profession (no. 4) or time (no. 3) which is insignificant in the document. This phenomenon is relevant for all types of categories.
(Barak et al., 2009) used two mechanisms to identify the passing reference
phenomenon. The first one is the lexical reference expansion of the category
characteristic terms, which results in higher scores for documents that contain multiple
occurrences of referring terms, and the second is the use of context models. When a
term which refers to a certain topic appears out of context, a context model should give
a lower score to the document since its context is irrelevant for this topic. In the IMDB
Corpus, the second mechanism is important since short documents often don't contain
multiple occurrences of referring terms. In many of the cases the currently used context
model failed to recognize context irrelevancy. When dealing with documents that aren't
classified to any of the topics the situation is even more problematic since any
classification in this case corresponds to a false positive classification.
2. Ambiguity of expanding terms: Ambiguity of the topic name within the collection
is rare since it is typically chosen to be a very precise term which captures the full
meaning behind the topic. However, by using reference expansions as part of the
method, terms are being added to the seed term to represent the category. One of the
reasons for wrong classification is ambiguity of the expanding terms. Table 3.3 shows
several examples of documents which were classified incorrectly due to ambiguity of
the expanding terms.
No. | Gold Standard Category | Method's Classification | Document Example
1 | Crime | Space | “...an escape plan that involves reinforcing two of the mall’s shuttle buses to transport the group to a nearby marina where Steve has a boat docked...”
2 | Airplanes | Advertising | "...The pilots there deliver mail over a dangerous and usually foggy mountain pass. Geoff Carter, the lead flyer, seems distant and cold as Bonnie tries to get closer to him...”
3 | Literature | Christianity | “A dashing officer of the guard and romantic poet... Christian, who is also in love...”
4 | Medicine | Shooting | "...Once called Father Frank for his efforts to rescue lives, Frank sees the ghosts of those he failed to save around every turn. He has tried everything he can to get fired, calling in sick...”
Table 3.3: Document samples for the ambiguity phenomenon. Ambiguous terms are
bolded.
Example no. 3 in Table 3.3 illustrates a common proper name with an additional sense, while all the other examples (no. 1-2, no. 4) are terms which appear in a different sense than the one which corresponds to the category topic. The context model was
supposed to recognize that the overall context in these documents is not typical for the
triggered categories and avoid these classifications, but it failed to overcome this
problem too.
3. Limitations of lexical reference resources:
Referring terms were collected from WordNet and Wikipedia, by utilizing relations that
are likely to correspond to lexical reference. WordNet provides mostly referring terms
of general terminology while Wikipedia provides more specific terms. Both resources
were described in section 2.3.
Several limitations of the currently used resources are detailed below.
Lack of expanding terms: Some of the documents were not classified to the correct
category due to a lack of correct expansions. Table 3.4 shows examples of such missing
expanding terms.
Category | Expansions
Medicine | cancer, HIV
Disability | blind, deaf
Mythology | Aphrodite, Oedipus

Table 3.4: Missing expanding terms
Occasionally, a document requires deeper text understanding, since the
correct category isn't expressed by any typical word; for example, a crime document
which discusses planting a virus inside a computer.
Incorrect or ambiguous expanding terms: Both WordNet and Wikipedia added terms
which are only correct as expansions for very infrequent senses, causing false
classifications (false-positive errors). This is in contrast to the ambiguity described in
the previous section, where the ambiguous terms didn't correspond to a rare sense; here
we present ambiguous terms in infrequent senses. Sometimes the term sense is so
rare that it even seems to be an incorrect expansion. Table 3.5 shows several examples of
such expansions.
Lexical Resource | Category | Expansion
WordNet - Meronyms | Advertising | promote
WordNet - Hyponyms | Business | house, partnership
WordNet - Derivations | Terror | terrified
Wikipedia | Pop/Rock | machine, mix

Table 3.5: Incorrect or ambiguous expanding terms
4. Topically close categories: Topically close categories are mostly sister terms at the
same level of the topical taxonomy hierarchy. In the IMDB collection, for instance,
topically close categories exist as sister terms in the music group of topics, such as
opera and classical music. Topically close categories also exist as topics in different
branches of the taxonomy, such as the military topic in the interests branch, which is
highly related to the war topic in the miscellaneous branch. Most of the classification
errors were between close topics in different branches.
Considering the taxonomy structure, the main problem is that we do not use
the daughter terms when classifying to the parent category. When classifying to the
crime category, for example, we might find only the term murder; but if we also
considered the expansions of its daughter category mafia, we could find mob as well.
Without this evidence, assuming the true category is indeed crime, we might miss it.
Sometimes there is not enough evidence to classify the document to one of the
category's daughters, but combining evidence from all daughter categories when
classifying to the parent category would yield a higher score and improve its chances of
being selected.
5. Limitations of classification scheme:
The cosine similarity function: The cosine similarity function normalizes the
inner product of the document and category vectors by the length of both.
Consequently, categories with fewer expanding terms are preferred. Often, even
when more terms in the document match one category, another category is selected
because its expansion vector is shorter.
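Concretely, the score for a document vector d and a category vector c is

$$\cos(d, c) = \frac{d \cdot c}{\|d\|\,\|c\|}$$

so a category with a long expansion list pays a larger $\|c\|$ penalty in the denominator, even when more of its terms actually match the document.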
Figure 3.2: An example of the problem with the cosine similarity function
Figure 3.2 presents an example of the cosine normalization problem. The
document's true category is sport: there are four occurrences of terms which belong to
the sport category, and only one term from the category motorcycle. However, the
sport vector includes 105 expanding terms while the motorcycle vector consists of only
35 terms, so the motorcycle score is normalized by a much shorter vector. Consequently,
the algorithm classified this document to the wrong category motorcycle.
Single-class classification: Classifying each document to a single class (termed single-
class classification) has two major disadvantages. (i) It "forces" classification: each
document is classified to the category with the highest similarity score, so even when
the classification scores are low, one of the categories will be selected. In the IMDB
corpus, many documents are not classified to any category and will be misclassified
due to the single-class classification scheme. (ii) It "discards" classifications, since only
the category with the maximal classification score is selected. In the IMDB collection
many documents are truly classified to multiple categories, and single-class
classification loses these classifications.
3.2.2 Bootstrapping
The bootstrapping step suggested by (Barak et al., 2009) and others (Ko and Seo, 2004;
Gliozzo et al., 2005) consists of training a supervised classifier with an initial labeled
set which was created by a previous unsupervised step.
The similarity scores obtained by the combined scoring method presented in
Table 3.1 were used to produce an initial labeled set of documents to train a supervised
classifier. Replicating (Barak et al., 2009), we used the initial labeled set, in which each
document is considered as classified only to the highest scoring category, to train an
SVM classifier for each category. For this purpose, we used SVMlight (Joachims, 1999),7 a
state-of-the-art SVM classifier, representing the input vectors in tf-idf weighted term
space. Our initial automatically labeled set contained about 120,000 documents from the
IMDB corpus. The vectors were fed to the classifier using its default settings.
Classification was determined independently by the classifier for each category,
allowing multiple classes per document. Results are detailed in Table 3.6 below.
Scoring method | Recall | Precision | F1
Bootstrapping | 0.024 | 0.047 | 0.032

Table 3.6: Final bootstrapping results
The results in the table show that in the IMDB case bootstrapping is problematic, yielding
lower performance than the unsupervised classification which constitutes its input
training set, as reported in Table 3.1. There are two possible reasons for these poor
results. First, the IMDB documents are too short and their quality is low. Second, the
way we selected our training set, where each document which wasn't classified to a
category is considered a negative example for it, may be wrong.
(Barak et al., 2009) set the j parameter of SVMlight to the number of
categories in the data set, as suggested by (Morik et al., 1999). The suggestion of
(Morik et al., 1999) was to set j to the ratio between the number of negative examples
and the number of positive examples, which equals the number of categories only
under a uniform distribution of the categories. We also tried setting the j parameter to
the number of categories, even though the IMDB distribution isn't uniform; indeed, no
better results were achieved. More details about applying the bootstrapping process
over the IMDB dataset can be found in Section 6.

7 Available at http://svmlight.joachims.org
3.3 Applying a state-of-the-art query expansion method
Reformulating user queries is a common technique in information retrieval (IR) to
bridge the gap between the original user query and the underlying information need. The
most common technique for query reformulation is query expansion (QE), where the
original user query is expanded with new terms extracted from different sources.
Queries submitted by users are usually very short, and query expansion can complete
the information need of the users. Different types of query expansion methods were
described in section 2.4.
Relevance feedback helps the IR system compute a better representation of
the information need and extract better expansions. Pseudo relevance feedback
methods create the feedback automatically, assuming that the k top-ranked retrieved
documents are relevant, thus avoiding manual involvement at the cost of the confidence
that all k top-ranked retrieved documents are relevant. Nevertheless, this automatic
technique has been found to improve performance (Buckley et al., 1995).
Keyword-based TC and QE are analogous tasks: the category keywords in TC
are analogous to the queries in QE. We tested several pseudo relevance feedback
methods, while trying to optimize our algorithm for keyword-based TC. All of the
methods used the whole IMDB corpus for selecting expansions and searched the
expanded query in the annotated IMDB test set. We used the Lucene8 IR system for the
QE process. The k parameter of the pseudo relevance feedback was set to 10, and each
of the queries was expanded by the 25 top-ranked terms.
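To make the setup concrete, here is a minimal sketch of the pseudo relevance feedback loop; the retrieve callback (standing in for the Lucene search) and the frequency-based term scoring are our own simplifications, whereas the actual runs used Rocchio, KLD and BO1 term weighting:

```python
from collections import Counter

def prf_expand(query_terms, retrieve, k=10, n_expansions=25):
    """Pseudo relevance feedback: assume the k top-ranked retrieved
    documents are relevant and expand the query with terms from them."""
    feedback_docs = retrieve(query_terms)[:k]  # pseudo-relevant documents
    counts = Counter(term for doc in feedback_docs for term in doc
                     if term not in query_terms)
    # Expansion terms are scored by raw frequency here for simplicity;
    # Rocchio/KLD/BO1 replace this scoring step.
    return list(query_terms) + [t for t, _ in counts.most_common(n_expansions)]
```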
We chose the Rocchio relevance feedback method as our state-of-the-art baseline, since
it performed better than other methods which we tried, such as KLD and BO1 (Perez-
Aguera et al., 2008), which are based on the probability distribution of terms in the
collection and in the top-ranked retrieved documents.
8 lucene.apache.org
The Rocchio algorithm models a way of incorporating relevance feedback
information into the vector space model (VSM). Its underlying idea is to find a query
vector that maximizes similarity with relevant documents while minimizing similarity
with non-relevant documents. More details can be found in section 2.4.
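For reference, the classic Rocchio update takes the textbook form (α, β and γ are tuning weights; in pseudo relevance feedback the non-relevant term is typically dropped):

$$\vec{q}_{new} = \alpha\,\vec{q}_{orig} + \frac{\beta}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j - \frac{\gamma}{|D_{nr}|}\sum_{\vec{d}_j \in D_{nr}}\vec{d}_j$$

where $D_r$ and $D_{nr}$ are the sets of relevant and non-relevant document vectors, respectively.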
Scoring method | Recall | Precision | F1
Cat-Name | 0.29 | 0.45 | 0.35
Rocchio | 0.33 | 0.28 | 0.30
Combined | 0.37 | 0.35 | 0.36

Table 3.7: Query expansion results
Table 3.7 shows a comparison between the Rocchio method and two other state-of-the-
art methods from Table 3.1: the Cat-Name method, which doesn't expand the category
names at all, and the Combined method described in section 2.2. The obtained Rocchio
results are lower. The low results are mainly due to noisy expansion lists; for example,
the category baseball was expanded with play baseball and baseball team, but also with
ball, feature and documentary. The reason for these noisy expansion lists is that they are
built from frequent terms in the category documents. These frequent terms don't
necessarily characterize the meaning of the category, and thus do not correspond to
lexical references to the category name.
Chapter 4
Algorithm improvements
Our research is based on the approach of (Barak et al., 2009) (described in section 2.2)
for keyword-based text categorization (TC), which bases its similarity measure on a
Lexical Reference (LR) measure instead of a context measure only. Their method
consists of the following steps:
1. Initiating each category vector by the category seed terms, which
correspond to the category name.
2. Representing categories in vector space, each category by its seed terms
along with the referring terms for the seeds, and calculating the cosine
similarity score (termed Reference score) between the vectors of each
document-category pair.
3. Representing each category and document by a co-occurrence based
vector, and computing a cosine similarity (termed Context) score for
each document-category pair.
4. Combining the reference score and context score, by multiplication, into a
single categorization score for each document-category pair.
5. Finally, labeling an initial document set by the scores obtained in the previous
step, and using the initial labeled set to train a supervised classifier.
We focused on the second and third steps above, aiming to improve the poor algorithm
performance on the IMDB corpus, as shown in Section 3.
In this section we first describe the utilization of statistical correlations from the
IMDB corpus (section 4.1). We then show how these statistical correlations are used for
building a new context model (section 4.1.1) and for inducing a new lexical reference
expansions resource (section 4.1.2). Finally, we offer two combination schemes for
expansion resources and a global reference-context combination scheme.
4.1 Utilizing statistical correlation
Co-occurrence based methods are based on the assumption that words that occur
frequently together in the same document are related to the same topic. Therefore word
co-occurrence information can be used to identify topical semantic relationships
between words.
Various metrics can be used for measuring co-occurrence strength. We tested
three common metrics: the Dice coefficient, Pointwise Mutual Information (PMI) and a
probabilistic metric described in (Glickman et al., 2005), which attempts to grade the
lexical entailment relationship between two terms. For two words x and y from a
vocabulary V and a set of documents D, these metrics all measure the strength of the co-
occurrence relationship between the two words, based on the frequencies of their
independent and co-occurring appearances in the corpus.
The Dice coefficient normalizes the frequency of co-occurrence, i.e. the intersection
of the document sets of the two terms, by dividing it by the sum of the individual terms'
frequencies and multiplying it by two, so that we get a measure between 0 and 1, with 1
indicating complete co-occurrence:

$$Dice(x, y) = \frac{2\,|D_x \cap D_y|}{|D_x| + |D_y|}$$

where $D_x$ is the document set in which the term x appears and $D_y$ is the document set
in which the term y appears.
The PMI metric measures the degree of dependence between two terms based on
their probabilities:

$$PMI(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)}$$

The resulting scale is between -∞ and ∞, where complete independence of the
two terms gives a score of 0. Complete dependence between x and y gives a
score that varies according to their individual frequencies.
The above two metrics are symmetric in x and y. The probabilistic lexical
entailment measure presented by (Glickman et al., 2005), on the other hand, measures
to what degree y is entailed by x, based on the conditional probability of y given x as
estimated from their document co-occurrences.
Given a term x (corresponding to a seed term in a category vector), for each of
the above metrics we can expand x with the vocabulary terms that get the highest
scores according to the metric.
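A minimal sketch of these computations over document sets; doc_sets, a mapping from each term to the set of ids of documents containing it, is our own hypothetical data structure:

```python
import math

def dice(dx, dy):
    """Dice coefficient: twice the co-occurrence count over the sum of
    the two document frequencies (range 0..1)."""
    denom = len(dx) + len(dy)
    return 2.0 * len(dx & dy) / denom if denom else 0.0

def pmi(dx, dy, n_docs):
    """Document-level PMI: log of the joint probability over the product
    of the marginals; -inf when the terms never co-occur."""
    p_xy = len(dx & dy) / n_docs
    if p_xy == 0.0:
        return float("-inf")
    return math.log(p_xy / ((len(dx) / n_docs) * (len(dy) / n_docs)))

def top_k_expansions(seed, doc_sets, k=100):
    """Expand a seed term with the k terms that co-occur most strongly
    with it, ranked here by the Dice coefficient."""
    scores = {t: dice(doc_sets[seed], d)
              for t, d in doc_sets.items() if t != seed}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```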
We sampled 20 category names and manually compared their top-50 co-
occurring extracted terms for each of the above metrics. The best co-occurring terms
were obtained when using the Dice coefficient metric. (Sachs, 2008) computed co-
occurrence-based word similarity based on the Reuters Collection Disk-1, using the
same metrics and reported similar results on the query expansion task, favoring the Dice
metric.
4.1.1 Dice-based context model
As described in Section 2.2, the overall context of the document should be typical for
the category topic. This is needed to assure that the referring terms for that category
appear (i) as part of the main topic of the text rather than as a passing reference, and (ii)
not in a different sense than the one referring to the category name. This requirement
can be captured by a set of terms which correspond to typical category contexts, even
though they do not necessarily concretely refer to the category. Such terms frequently
appear in the category context and therefore tend to co-occur with the category's seed
terms. Occurrence of such terms implies that the text might be related to the category.
For example, the terms ball and game don't refer to the category baseball, as they can
appear within the context of several other sport categories. However, the presence of a
significant amount of such context words in a document increases the likelihood that
this document may be related to the baseball topic. On the other hand, the lack of any
context word in a document decreases the likelihood that this document is relevant to
the category's topic. For that purpose, we need to use context models based on co-
occurrence data of terms.
(Barak et al., 2009) utilized a Latent Semantic Analysis (LSA) method to
represent the context similarity of documents and categories. LSA is a dimensionality
reduction method which maps similar terms, by means of co-occurrence data, to a lower
dimensional space in which terms and documents are represented by new dimensions
that may be perceived as "concepts". Those "concepts" aim to capture the context
similarity of the data. LSA has the advantage of modeling both first order and second
order similarity, and by that offers a powerful context-similarity measure. It measures
not only the likelihood of terms to appear in the same document as standard co-
occurrence based methods, but it also captures the likelihood of terms to co-occur with
other common terms by their joint mapping to the same LSA "concepts".
LSA is useful, but it uses an implicit representation, and therefore its behavior is hard
to analyze or predict. LSA is somewhat crude and has difficulty distinguishing between
topically close categories; moreover, LSA is complex to implement and
computationally expensive.
We suggest a different, simpler context model based on the Dice coefficient
metric. We expand each category name by the top-k (k=100 in our case) co-occurring
terms with the highest Dice score and calculate the cosine similarity score between the
expanded vector and the document vector. This score is used as our context model
score. Like (Barak et al., 2009), we used multiplication as the integration method of the
reference and context scoring methods, to reduce the score of documents which contain
referring terms but relate to an irrelevant context. Moreover, when the score obtained by
the reference scoring method is equal to zero, the integrated score is also zero.
However, when the score obtained by the context scoring method is equal to zero, we
used a smoothing factor so the integrated score would be low but not zero; in this way
the context score actually re-ranks the reference score according to the context
likelihood.
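A sketch of this context scoring, reusing dice and top_k_expansions from the sketch in Section 4.1; the value of the smoothing constant is our own assumption, as it is not stated above:

```python
import math

def cosine(u, v):
    """Cosine similarity between sparse vectors given as term->weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def dice_context_score(doc_vec, seed, doc_sets, k=100):
    """Cosine between the document and the category name expanded by its
    top-k Dice co-occurring terms (all weights set to 1)."""
    ctx_vec = {t: 1.0 for t in top_k_expansions(seed, doc_sets, k)}
    ctx_vec[seed] = 1.0
    return cosine(doc_vec, ctx_vec)

def combine(reference, context, smoothing=0.01):  # smoothing value assumed
    """Reference times context; a zero reference score stays zero, while
    a zero context score is smoothed so it demotes but does not cancel."""
    if reference == 0.0:
        return 0.0
    return reference * (context if context > 0.0 else smoothing)
```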
Looking at the expanded vectors, we observe that the Dice-based context model
indeed captures typical category contexts. For example, the category baseball was
expanded by the following unigrams: pitcher, league, bat, Yankees, and the following
bigrams: baseball player/team, major league, Jackie Robinson, Ted Williams. Notice
that some of these terms are actually referring terms for baseball, while the others are
only related context terms.
4.1.2 Dice expansions resource
Analyzing our Dice-based context model, we realized that many referring terms which
are not covered by WordNet or Wikipedia can be revealed from the lists of co-
occurring terms, as they are extracted from the IMDB corpus itself.
We thus created a new resource, the Dice expansions resource, which uses the huge
amount of available unlabeled data in the IMDB corpus to extract statistical correlations.
This overcomes a problem of WordNet and Wikipedia, which sometimes find good
expansions that don't appear in the IMDB corpus at all.
Taking the top-100 co-occurring words, as we did in our Dice-based context
model, is too noisy: it captures both lexical references (LR) and general context terms.
Below we describe the filtering factors which we used, aiming to reduce noise and get
relatively precise LR lists. We used the annotated development set for parameter
tuning.
Weight filtering: We used the Dice coefficient score for term weighting. We filtered
terms whose weight with a given category name is lower than a threshold, which was
set to 0.05.

Seed filtering: Often one category name appears as an expansion of another. We
filtered these expansions, since the seeds were chosen to be very precise and to capture
the full meaning behind the topic, and mostly fit only their original category.

Frequent term filtering: Some terms are referred to widely in the corpus and can't be
used to distinguish between categories. We filtered these expansions by setting a
threshold on the term frequency in the corpus: terms which appear in more than 4% of
the documents in the corpus are omitted from the category expansions list.

Multiple expansions filtering: Some terms expand more than one category and are
therefore less distinctive. We attribute a term t only to the category which gets the
highest Dice coefficient score with the term:

$$cat(t) = \operatorname*{arg\,max}_{c \in C} Dice(t, c)$$
This filtering is very important since assigning a term to more than one category
produces a lot of noise, for example, the term mob boss originally fits both the mafia
and drugs categories, but might increase confusion if assigned to drugs.
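Putting the four filters together, in the same sketch style (dice and top_k_expansions as in Section 4.1; the thresholds are the ones stated above):

```python
def build_dice_lr_resource(categories, doc_sets, n_docs,
                           min_weight=0.05, max_doc_freq=0.04, k=100):
    """Turn raw top-k co-occurrence lists into a filtered Dice LR resource."""
    raw = {c: {t: dice(doc_sets[c], doc_sets[t])
               for t in top_k_expansions(c, doc_sets, k)}
           for c in categories}
    resource = {c: [] for c in categories}
    for c, weighted in raw.items():
        for t, w in weighted.items():
            if w < min_weight:                            # weight filtering
                continue
            if t in categories:                           # seed filtering
                continue
            if len(doc_sets[t]) / n_docs > max_doc_freq:  # frequent term filtering
                continue
            # multiple expansions filtering: keep t only for its argmax category
            if max(raw, key=lambda c2: raw[c2].get(t, 0.0)) == c:
                resource[c].append(t)
    return resource
```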
Category Name | Expanding terms
Buddhism | dalai lama, Karmapa, Kisaeng Hwang, lama
Gambling | bookie, gambling casino/debt, illegal gambling, poker
Karate | black belt, karate kid, kenpo, Miyagi Daniel
Military | colonel, commander, troops, weapon
Wrestling | Ric Flair, Roddy Piper, Vince McMahon, wcw, wwf superstars, wwf

Table 4.1: Dice expansions resource marginal contribution
Table 4.1 shows correct references which were found by our Dice lexical
reference resource. These expansions were found neither by WordNet nor by
Wikipedia. Using a statistical LR resource, we found more proper names of known
personalities in the category areas and more concepts which are strongly related to the
category names.
Statistical correlation extraction has two important advantages. (i) The number
of appearances of the category name in the corpus is a direct measure of the quality of
its expansion list, since the more documents we have for collecting statistics, the more
accurate and reliable an expansion list we can get. (ii) As a result of this estimation
ability, we can add more documents for categories with a small number of documents
by crawling the web or using some other corpus. We suggest that this line of research
may be investigated further to enrich and optimize the Dice LR resource in order to
exploit additional reference knowledge.
4.2 Combined scoring
We now have three expansion resources of referring terms: WordNet, Wikipedia and
Dice, which have to be combined. We propose two different combination schemes:

Union: Referring terms are collected from all the resources. The term lists for each
category are unified into a single list, which is then used to represent the category
vector. The cosine similarity measure is then used to measure the similarity between
document vectors and category vectors. This type of combination was applied by
(Barak et al., 2009), as described in section 2.2. The weights of the seed terms and
their referring terms in the category's vector are all equal, set to 1.
Geometric Mean: Another option is to consider each resource separately: for each
resource we represent the category vectors with its own term lists and calculate the
cosine similarity score between these category vectors and the documents. The
category names are treated as a separate resource as well. Then we combine the
resources' cosine similarity scores using the Geometric Mean (GM):

$$Sim'_x = \begin{cases} Sim_x & \text{if } Sim_x > 0 \\ \lambda & \text{if } Sim_x = 0 \end{cases} \qquad
Sim_{gm} = \Big(\prod_{x \in X} Sim'_x\Big)^{1/n}, \qquad
X = \{cat\_name,\ wn,\ wiki,\ dice\}$$

where $Sim_x$ is the similarity score of resource x, n is the number of combined resources
(n = 4 = |X|), and λ is a smoothing factor, which we have set to 0.0001.
The GM is lower when there is a large difference between the averaged numbers
and higher when this difference is small. This mathematical property of the GM might
be beneficial for the classification task, as agreement between resources leads to a
higher similarity score, and a referring term that is supported by more than one resource
obtains a higher similarity score.
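The GM combination itself is a one-liner in this style (math.prod requires Python 3.8 or later):

```python
import math

def gm_combine(resource_scores, lam=1e-4):
    """Geometric mean of the per-resource similarity scores for one
    document-category pair; zero scores are replaced by the smoothing
    factor lambda."""
    scores = [s if s > 0.0 else lam for s in resource_scores.values()]
    return math.prod(scores) ** (1.0 / len(scores))

# e.g. gm_combine({"cat_name": 0.4, "wn": 0.2, "wiki": 0.0, "dice": 0.3})
```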
These combination schemes deal only with the combination of reference resources;
both of them thus return the similarity score of the Reference model. We still have to
combine the Reference model with at least one of our context models.
Aiming to maximize our performance, we combined both of the context models
in the following way: we first combined the Reference model with the LSA context
model using multiplication, as (Barak et al., 2009) did, and then combined the resulting
score with the Dice context model by multiplication with a smoothing factor, as
described earlier in this section (4.1.1).
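The resulting end-to-end score for a document-category pair can thus be sketched as follows, with combine being the smoothed multiplication from Section 4.1.1:

```python
def final_score(reference, lsa_context, dice_context, smoothing=0.01):
    """Primary configuration: the Reference score (union of the three LR
    resources) times the LSA context score, then re-ranked by the
    smoothed Dice context score."""
    return combine(reference * lsa_context, dice_context, smoothing)
```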
Overall, our primary method configuration, which is evaluated in Section 6,
contains three LR resources (WordNet, Wikipedia and Dice), combined by the union
scheme, with two context models, the Dice-based and LSA context models.
More details about the empirical contribution of each of the algorithm
components and combination schemes can be found in section 6.
Chapter 5
A classification and evaluation scheme for a large real-world taxonomy
Classification by a large real-world taxonomy is a difficult task. It raises different issues
than classification for an artificial taxonomy created specifically for a certain academic
dataset. This section describes a proposed classification and evaluation scheme for such
a taxonomy and particularly for the IMDB taxonomy. First, a multi-class classification
scheme is presented (5.1), and then corresponding evaluation measures are described
(5.2).
5.1 Multi-class classification scheme
Categorization of documents can be done according to two different approaches:
classifying each document to a single category, referred to here as single-class
classification, or classifying each document to several categories, referred to here as
multi-class classification. The unsupervised step of the (Barak et al., 2009) algorithm,
as described in Section 2.2, used the single-class classification approach.
Single-class classification has two major disadvantages: (i) It "forces" a
classification for each document; even when the classification scores are low, one of the
categories will be selected. In the IMDB corpus, described in Section 3, many
documents are not classified into any category and will be misclassified as a result of
the single-class classification scheme. (ii) It discards classifications, since only the
category with the maximal classification score for each document is selected. In the
IMDB collection, many documents are truly classified into multiple categories, and
single-class classification misses these classifications.
Multi-class classification is a ranking-oriented task. A ranked list of documents
is created for each category. The documents are sorted in descending order of their
categorization score and the top-ranked classifications are selected as positive. There
are two types of possible thresholds on the selected classifications: a threshold on the
classification score value, or a threshold on the percentage of the top-ranked
classifications (top-k%). Ranking the documents aims at achieving better precision at
the top of the sorted list, i.e. ranking true category documents at the top of the list
while ranking irrelevant documents at the bottom. The ranking task allows evaluating
the quality of the scoring method for each category.
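Schematically, the per-category selection looks like this (scores maps each document id to its categorization score for one category; which threshold is applied is a design choice):

```python
def select_classifications(scores, top_percent=None, min_score=None):
    """Rank one category's documents by score and keep the top-k% of
    classifications and/or those above a score threshold."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if top_percent is not None:
        ranked = ranked[:max(1, int(len(ranked) * top_percent))]
    if min_score is not None:
        ranked = [(d, s) for d, s in ranked if s >= min_score]
    return [d for d, _ in ranked]
```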
Preliminary experiments on the IMDB dataset showed that single-class
classification using both the Reference and Context models as described in section 2.2
achieved lower results than multi-class classification using only the category names.
Considering the drawbacks of single-class classification, we decided to adopt the multi-
class classification scheme.
In the following section we will describe different evaluation measures suitable for our
new classification scheme and the rationale behind them.
5.2 Evaluation measures
Two basic evaluation measures, given the gold standard of the collection, are precision
and recall.
Category i | Gold standard: TRUE | Gold standard: FALSE
Classifier judgement: TRUE | TP_i | FP_i
Classifier judgement: FALSE | FN_i | TN_i

Table 5.1: Contingency table for one category
$$P_i = \frac{TP_i}{TP_i + FP_i} \qquad R_i = \frac{TP_i}{TP_i + FN_i}$$

where $P_i$ is the precision of category i and $R_i$ is the recall of category i. Both precision
and recall have a fixed range: 0.0 to 1.0 (or 0% to 100%).
Recall and precision are measures for the entire list of documents classified to a
category. They do not account for the quality of ranking the documents in the document
list. We assume that users would want the classified documents to be ranked according
to their relevance to the category, instead of just being presented with an unordered
document set.
5.2.1 Recall-Precision curves
A common way to depict the degradation of precision with the increase of recall, as one
traverses the ranked document list, is to plot interpolated precision numbers against
percentage recall. A percentage recall of, say, 50% is the position in the document list at
which 50% of the relevant documents in the collection have been retrieved. It is a
measure of the number of documents one has to examine before seeing a certain
percentage of the relevant documents. The same plot expresses the notion of recall at
precision too, referring to the percentage of relevant documents which can be found at a
certain precision level.
Figure 5.1 shows a typical recall-precision graph. The graph shows the trade-off
between precision and recall. Trying to increase recall typically introduces more
incorrectly classified documents, which do not belong to the target category, into the
documents list, thereby reducing precision (i.e., moving to the right along the curve).
Trying to increase precision typically reduces recall by removing some good documents
from the document list (i.e., moving left along the curve). An ideal goal for a classifier
is to increase both precision and recall by making improvements to the classification
algorithm, i.e., the entire curve must move up and out to the right so that both recall and
precision are higher at every point along the curve.
Figure 5.1: A typical recall-precision graph
A recall-precision curve can be drawn for each of the categories separately.
However, when measuring overall system performance, averaging the points is
necessary.
5.2.1.1 Macro averaging vs. Micro averaging
There are two conventional methods of calculating the performance of classification and
retrieval systems based on precision and recall: the first is called micro-averaging,
the second macro-averaging. Micro-averaged values are calculated by
constructing a global contingency table (summing the per-category tables shown above
for a single category) and then calculating precision and recall from these sums. In
contrast, macro-averaged scores are calculated by first calculating precision and recall
for each category separately and then taking the average of these. The notable difference
between these two calculations is that micro-averaging gives equal weight to every
document while macro-averaging gives equal weight to every category.
In this way, micro- and macro-averaged precision and recall are calculated as in the
formulas below, which are based on the definitions given in Table 5.1 at the
beginning of section 5.2:

$$P_{micro} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FP_i)} \qquad R_{micro} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FN_i)}$$

$$P_{macro} = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{TP_i}{TP_i + FP_i} \qquad R_{macro} = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{TP_i}{TP_i + FN_i}$$
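Both averages are straightforward to compute from the per-category contingency counts; a sketch, where counts holds one (TP, FP, FN) triple per category:

```python
def micro_macro(counts):
    """Micro- and macro-averaged precision and recall from per-category
    (TP, FP, FN) triples."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    p_micro = tp / (tp + fp) if tp + fp else 0.0
    r_micro = tp / (tp + fn) if tp + fn else 0.0
    p_macro = sum(c[0] / (c[0] + c[1]) if c[0] + c[1] else 0.0
                  for c in counts) / len(counts)
    r_macro = sum(c[0] / (c[0] + c[2]) if c[0] + c[2] else 0.0
                  for c in counts) / len(counts)
    return p_micro, r_micro, p_macro, r_macro
```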
Typically, micro-averaging is used in text categorization, since the evaluation
measure should reflect the system's performance on the most common categories and
shouldn't be influenced by rare categories. This is in contrast to information retrieval,
where all the queries are equally weighted and macro-averaging is commonly used.
Moreover, in the IMDB annotated test set there are many categories with only
a few documents or none at all. Giving these categories the same weight as the others
would give a misleading view of the system performance.
Thus, we adopted the micro-averaging method; all the recall-precision
points in the graphs presented in Section 6 were calculated in this manner.
In Section 5.2.2 we describe another evaluation measure, termed Average
Precision. Average Precision is a macro-averaged measure; therefore, our two
evaluation schemes cover both the micro and macro averaging aspects.
5.2.1.2 R@P average curve
The multi-class classification scheme requires setting a cut-off point in the ranked
document list for each category. Typically, the cut-off might be the top-k percent of
ranked classifications or a certain threshold on the classification score.

In this thesis we propose a different cut-off scheme, which better fits an
industrial classification setting for a large real-world taxonomy. Since this thesis is part
of the Negev Consortium (Next Generation Personalized Video Content Service), the
classification setting should satisfy the industry demand for reasonable precision.
Manual tuning of parameters is considered acceptable in an industrial classification
setting, as tuning parameters is much cheaper than creating a training dataset for a
supervised classifier. Under these circumstances, our assumptions are as follows: (i) the
user would not want to go below a certain predefined precision level; (ii) the threshold
for each category will be tuned separately, such that a different score threshold which
fits the desired precision is set for each category; (iii) the desired precision level is the
same for all the categories.
We propose an evaluation measure which better fits this industrial classification
setting: the Recall at Precision average curve (R@P curve). The R@P curve is an
averaged recall-precision curve where each cut-off point corresponds to a certain
precision level. The precision levels are presented in 1/k intervals, where k is the
number of cut-off points. For each category, we calculate the number of correct
classifications in the ranked document list that maintain precision greater than the given
precision level. We then sum the numbers of these correct classifications over all the
categories and divide by the total number of classifications in the gold standard,
obtaining the recall at that precision level. The R@P curve thus illustrates how much
recall the classifier can provide under a certain precision level.
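A sketch of one averaged R@P point; ranked_gold_per_category holds, for each category, the ranked list of 0/1 gold judgements, and total_gold is the number of gold-standard classifications over all categories:

```python
def correct_at_precision(ranked_gold, p_level):
    """Correct classifications in the longest prefix of the ranked list
    whose precision stays at or above p_level."""
    best = correct = 0
    for i, gold in enumerate(ranked_gold, start=1):
        correct += gold
        if correct / i >= p_level:
            best = correct
    return best

def rp_point(ranked_gold_per_category, total_gold, p_level):
    """Recall at the given precision level: qualifying correct
    classifications summed over categories, over the gold total."""
    return sum(correct_at_precision(rg, p_level)
               for rg in ranked_gold_per_category) / total_gold
```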
Since the R@P curve measure better fits the industrial classification setting, we
use the R@P curve as our main evaluation measure. In Section 6 we also compare it
with other methods for setting cut-off points in the list of ranked documents.
5.2.2 Mean Average Precision (MAP)
Average precision is a common evaluation measure for system rankings, and is
computed as the average of the system's precision values at all points in the ranked list
where recall increases (Voorhees and Harman, 1999). More formally, it can be written as
follows:

$$AP = \frac{1}{R} \sum_{i=1}^{n} E(i) \cdot \frac{\sum_{j=1}^{i} E(j)}{i}$$
where n is the number of documents classified by the system to a specific category in
the test set, R is the total number of correct classifications in the test set gold standard
for this category, E(i) is 1 if the i-th document is classified to this category according to
the gold standard and 0 otherwise, and i ranges over the documents, ordered by their
ranking. The score calculated by the average precision measure ranges between 0 and 1,
where 1 stands for a perfect ranking which places all the category documents before the
non-category ones. This value corresponds to the area under the non-interpolated recall-
precision curve for the target category. Mean Average Precision (MAP) is defined as the
mean of the average precision values over all the categories. We average only categories
which contain at least one document in the test set gold standard (84 categories out of
97), since ranking categories with no documents is meaningless.
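Computationally, AP and MAP reduce to a few lines (a sketch; ranked_gold is again a category's ranked 0/1 gold list and R its number of gold documents):

```python
def average_precision(ranked_gold, R):
    """Mean of the precision values at each rank where a correct document
    appears, i.e. at each point where recall increases."""
    correct, total = 0, 0.0
    for i, gold in enumerate(ranked_gold, start=1):
        if gold:
            correct += 1
            total += correct / i
    return total / R if R else 0.0

def mean_average_precision(per_category):
    """MAP over categories with at least one gold document;
    per_category is a list of (ranked_gold, R) pairs."""
    values = [average_precision(rg, R) for rg, R in per_category if R > 0]
    return sum(values) / len(values)
```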
In the next section we evaluate our improved algorithm (Section 4) using the
evaluation measures described in this Section.
Chapter 6
Results and Analysis
We evaluated the classification results of our improved scoring method described in
Section 4, using Recall at Precision average curve (R@P curve) with our new cut-off
scheme presented in Section 5. Our aim was to allow the user to set her precision
constraints and choose the desired recall-precision trade-off.
We also evaluated the ranking quality of our scoring method, using the MAP
measure. The MAP measure averages only categories that contain at least one document
in the test set gold standard (84 categories out of 97), as explained in Section 5.
The results and analysis are presented in this section. We first compare our
scoring method to three other baselines (section 6.1). Then we present ablation test
results aimed at testing the contribution of each component of our scoring method
(section 6.2). We further analyze our results with a detailed error analysis in Section 6.3.
Finally, we describe our experiment with a bootstrapping scheme and its results
(Section 6.4).
6.1 Results
We compared our scoring method explained in Section 4 to three baselines. The first is
the single-class Combined method described in Section 2.2, which was used in (Barak
et al., 2009) for the unsupervised categorization step. The second baseline is the multi-
class Combined method, on which (Barak, 2008) reported her MAP. The multi-class
Combined method ranks documents that contain at least a single occurrence of a
referring term. The single-class Combined method classifies each document to a single
category, the category with the highest similarity to the document, while the multi-class
Combined method classifies each document to all the categories whose referring term
are mentioned in it. The multi-class Combined method was used by (Barak, 2008) only
for ranking evaluation.
Finally, we applied a baseline from an information retrieval query expansion
algorithm: the Rocchio pseudo-relevance feedback described in Section 2.4.
[Figure: R@P curves (recall vs. precision) for the single-class Combined, multi-class Combined, Rocchio and our methods]
Figure 6.1: R@P average curves, methods comparison
Figure 6.1 presents the R@P average curves obtained by using the following
similarity scores: ours, single-class Combined, multi-class Combined and Rocchio. It
shows that our scoring method consistently outperforms the other methods by several
points. The recall of our method is higher, since we utilized a new additional statistical
LR resource. In addition, the accuracy of the statistical LR resource together with the
additional Dice-based context model caused the precision to increase as well. A
comparison between the curve denoting the single-class Combined scoring and the
other curves, which were obtained by multi-class classification, shows that the single-
class classification scheme is limited in recall, since it selects only the best category for
each of the documents. This implies that there are more good, highly ranked
classifications that the single-class classification scheme ignores.
A comparison between the different multi-class classification scoring methods
(ours, multi-class Combined and Rocchio) shows that Rocchio consistently performs
worse than the other methods. Its recall is limited at low-precision cut-off points, while
at high-precision cut-off points its recall is even lower than the recall of the single-class
scoring method.
A comparison between the two curves denoting our scoring and the multi-class
Combined scoring suggested by (Barak et al., 2009) shows that integrating the statistical
knowledge of the Dice-based context model in the multi-class Combined scheme
achieves higher recall, showing an average recall improvement of 6.8 points.
To complete the ranking evaluation, Table 6.1 presents the MAP values of
the scoring methods. Ranking by our score achieves a higher MAP value than all
other methods; in particular, it achieves a MAP value seven points higher than ranking
according to the multi-class Combined score.
Method | MAP
Single-class Combined | 0.35
Multi-class Combined | 0.50
Rocchio | 0.41
Our method | 0.57

Table 6.1: MAP values, methods comparison
We checked the statistical significance of our results, aiming to assess whether our
method is indeed better than the multi-class Combined one. We used the Wilcoxon
signed rank sum test, which tests the null hypothesis that the median of a distribution is
equal to some value or, in the case of paired data, that the median difference is equal to
zero.

We compared the differences between the average precision values of each of
the categories in both methods, ignoring cases where the paired difference is 0. The
number of pairs was large enough, so we used a normal approximation. The
approximation gave a two-sided p-value of p=0.0004. This confirms that our results are
statistically significant and provides strong evidence that our improved method is
better than the multi-class Combined method.
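The test itself is available off the shelf; a minimal sketch with SciPy, taking the paired per-category average precision lists as input (zero differences are dropped, matching the procedure above):

```python
from scipy.stats import wilcoxon

def ap_significance(ap_ours, ap_baseline):
    """Two-sided Wilcoxon signed rank test over paired per-category
    average precision values; zero-difference pairs are discarded."""
    statistic, p_value = wilcoxon(ap_ours, ap_baseline, zero_method="wilcox")
    return p_value
```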
6.2 Contribution of Our Method Components
As described in Chapter 4, our method adds two components to the multi-class
Combined scoring method suggested by (Barak et al., 2009): (i) a co-occurrence-based
Dice context model and (ii) a Dice lexical reference (LR) resource. After analyzing the
results of using all components together, we wished to assess the contribution of each
component individually. This was done by ablation tests. For the ablation tests, the
starting point is the results obtained from using all components; for each component,
we assessed the influence on performance when this component is removed from our
scoring method. Furthermore, we compared the existing union-based resource
combination scheme to the new resource combination scheme based on the geometric
mean, which we presented in Section 4.2.
6.2.1 Component Ablation Tests
The results of comparing the R@P average curves of each of the ablation tests are
shown in Figure 6.2.
[Figure: R@P curves for the ablation configurations: All; Ab LSA context model; Ab dice-based context model; GM combination scheme; Ab dice expansions resource]
Figure 6.2: Comparison of R@P average curves of ablation tests
These results show that context verification is important. Combining different
context models is beneficial, since each of the context models has some additional
information relative to the others.
We compared our dice-based context model to the LSA context model, aiming
to measure the quality of our new dice-based context model. Figure 6.2 illustrates that
both of the context models show quite similar behavior, except for one cut-off point, of
0.7 precision, where the LSA performs better. The MAP of the LSA context model is
0.56, while the dice-based context model’s MAP is 0.55. We checked whether the
method with the LSA context model (omitting the dice context model) is better than the
method with the dice-based context model (omitting the LSA context model) using the
Wilcoxon significance test. The results showed that the difference between the context
models is not statistically significant. We conclude that our new simple dice-based
context model is comparable to the much more complex LSA context models.
A comparison of the two resource combination schemes shows that
using the geometric mean (GM), described in Section 4.2, is no better than the resource
union suggested by (Barak et al., 2009). The rationale of the GM combination scheme is
interesting, but it is ineffective in the IMDB case.
The contribution of the Dice LR resource is important when recall is considered:
starting from the 0.6 precision cut-off point, the recall of our scoring method increases
due to the Dice LR resource.
6.2.2 Resources Ablation Tests
In order to evaluate the contribution of each of the reference expansion resources used,
we did another series of ablation tests. The resources which are included in our scoring
method are WordNet, Wikipedia and Dice. We wished to assess the contribution of each
resource individually. For these ablation tests, the starting point is the results obtained
from using all resources. For each resource, we assessed the influence on performance
when it is removed from our scoring method. In addition, we checked the influence of
removing all the LR resources and of using only the dice LR resource. The context
models were left in the system in all of these ablation tests.
Figure 6.3 presents R@P average curves for each of the resources ablation tests.
[Figure: R@P curves for the resource ablations: our method; Ab dice; Ab WordNet; Ab Wikipedia; No resources; Dice only]
Figure 6.3: Comparison of R@P average curves of resources ablation tests
Analyzing our results, we can readily conclude that using LR resources is
valuable. Context models are obviously not enough for the TC task. LR defines a more
accurate semantic relation, which aims to identify whether the meaning of a certain
category name is referenced by another text. This measure aims at a more appropriate
relation to base the TC assumption on, since it requires the actual reference to the
category topic in the text, rather than general context similarity.
Wikipedia has been a potentially good resource for LR, providing the typical
knowledge found in an encyclopedia. However, there is an overlap between the Dice
and Wikipedia resources: when we use the Dice and WordNet resources, the
contribution of Wikipedia is insignificant, since it increases the recall only where the
precision is low.
WordNet is a different type of resource that has more impact on the
classification results. Typical knowledge that can be found in a dictionary tends to occur
less frequently in co-occurrence statistics collected from a corpus. WordNet expansions
improve the documents' ranking and increase the recall while maintaining a high rate of
precision.
The Dice expansions resource increases the recall where the precision is below 0.6,
but when we use only the Dice LR resource, the recall in the same range is not maximized.
The Dice LR resource's performance is reasonable even as a single resource. It would be
especially useful in cases where no additional resources are available, such as in
a different language. However, combining the Dice LR resource with other
external resources, such as WordNet, leads to better performance.
6.3 Further Analysis
In this section we further analyze our scoring method presented in Section 4 and our
evaluation methodology described in Section 5.
We first compare our R@P average curve to standard average curves (Section 6.3.1).
Then we present a deeper error analysis of our scoring method (Section 6.3.2).
6.3.1 Recall-Precision Curves Comparison
In Section 5 we described our R@P average curve, which differs from other standard
recall-precision average curves in the cut-off criterion of its points. In our R@P average
curve each cut-off point is a certain precision level (from 1.0 to 0.1), while a standard
cut-off criterion might be a threshold on the score or a certain percentage of the top-
ranked classifications.
[Figure: recall-precision curves for three cut-off approaches: percentage cut-off, score cut-off and R@P cut-off]
Figure 6.4: Recall-precision curve approaches comparison
Figure 6.4 shows a comparison between our cut-off approach and two other
typical cut-off approaches: (i) similarity score cut-off, where the threshold is set on
the classification scores, and (ii) classification percentage cut-off, where the threshold is
set on the percentage of the top-ranked classifications (top-k%) to be taken. Both the
percentage and the score cut-off approaches do not require manual tuning of each
category separately: we just have to select a certain cut-off point according to the
desired global recall-precision trade-off. The cut-off point itself defines a threshold on
the number of classifications or on the classification score. For each category, we
calculate the number of correct classifications that meet the threshold condition; we
then sum these correct classifications over all the categories and divide by the total
number of classifications in the gold standard, to obtain the recall level for this
threshold. For each category we also calculate the total number of classifications that
meet the threshold condition; dividing the sum of correct classifications over all the
categories by the sum of total classifications over all the categories gives the precision
level for this threshold.
However, since tuning parameters before product marketing is acceptable in the
industry, as described in Section 5.2, our proposed cut-off scheme is applicable in the
IMDB case.
Figure 6.4 shows that our more expensive cut-off scheme indeed consistently
outperforms the other standard cut-off schemes by several percentage points. The
performance of our cut-off scheme is better since the cut-off at a certain precision level
defines different thresholds for different categories. The figure also illustrates that by
setting our desired precision level to 0.6, for example, we can achieve 0.56 recall and an
F1 score of 0.58, which is comparable to previous results on academic datasets.
6.3.2 Error Analysis
The multi-class classification scheme classifies each document into numerous
categories; any random mention of a referring term yields a classification with a
positive score. The main issue is whether the scoring method succeeds in ranking the
category's documents properly. Ranking the documents according to their score
revealed interesting characteristics of the scoring methods and of topical
categorization in general. Below is a detailed error analysis explaining some of the
reasons for the remaining errors and suggesting how our method could be improved in
future work.
1. Passing reference: A passing reference occurs when the topic name, or any partial
subset of its characteristic reference terms, appears in a document in its required sense
but does not refer to the main topic of the document.
In the IMDB collection this phenomenon is mostly relevant to the following types of
terms: (i) professionals, such as model (fashion), photographer, artist (arts),
student (school), actor (theater); (ii) places, such as school, college/university,
island (beach), ocean (ships); (iii) adjectives that are strongly related to the categories,
such as wealthy (business), free (prison), ancient (history); and (iv) terms that are part of
a noun phrase, such as "Motor Pool crew" (motoring, swimming) or "architecture
student" (arts).
The additional Dice-based context model is a mechanism that should decrease
the score obtained by documents with a passing reference. When a term that refers to a
certain topic appears out of context, the context model gives a lower score to the
document, since its context is irrelevant for that topic.
We address the passing reference phenomenon by combining two context
models, which are responsible for demoting irrelevant documents that obtained a high
reference score, by means of a low context score.
2. Ambiguity of expanding terms: Using reference expansions as part of our method,
terms are added to the seed term to represent the category. Often some of these
expanding terms are ambiguous. Sometimes the ambiguity is due to two relatively
frequent senses of a term; for example, the term rebound is an expansion of the seed
basketball but also appears in the senses of a movement back from an impact, or a
reaction to a crisis, setback or frustration. Ambiguity is mostly a problem of a specific
term in a document, while the document's context is usually unambiguous. Therefore,
combining context models addresses this phenomenon too.
3. Lexical reference resources limitations: Referring terms were collected from
WordNet, Wikipedia and Dice by utilizing relations that are likely to correspond to
lexical reference. WordNet provides mostly referring terms of general terminology,
while Wikipedia provides more specific terms; both resources were described in
Section 2.3. Dice provides statistical correlations that include both types of referring
concepts as well as more specific terms. Several limitations of the currently used
resources are detailed below.
Lack of expanding terms: Our additional Dice LR resource partly solves the lack of
expanding terms. There are still missing terms not covered by any of our lexical
reference resources; category names that appear rarely in the corpus suffer from this
problem most.
Incorrect or ambiguous expanding terms: Ambiguity might be caused by expanding
the category name with a term in an infrequent sense, such as expanding the category
athletics with the term meet, which refers to a meeting at which a number of athletic
contests are held. Many terms that refer to the corresponding category only in an
infrequent sense were added as expanding terms.
The WordNet and Wikipedia LR resources include many frequent ambiguous
terms, such as house, life and union for the business category, while the Dice LR
resource added many terms that aren't LRs at all, such as roommate, boyfriend and
fraternity for the college/university category. The Dice LR resource still added context
terms, even though we filtered many of them, as explained in Section 4. The quality of
the Dice expansion lists for category names that appear frequently in the IMDB corpus
is relatively high; therefore, we suggest adding documents for categories with a small
number of documents in the IMDB corpus by crawling the web or using some other
corpus. This line of research may be investigated further to enrich and optimize the
Dice LR resource.
The problem of incorrect or ambiguous expanding terms is destructive when
there are multiple ambiguous or wrong terms in a document. In these cases irrelevant
documents get a high ranking and the performance of our scoring method is harmed.
For example, a sentence such as "They deal with the life challenges of finding women to
love and be loved by, committing to a relationship, and getting past their childhood
dreams and desires to deal with reality and appreciate life", where life, relationship and
deal are referring terms for the category business, will cause the document to be
misclassified with a high classification score.
4. Topically close categories: Topically close categories are problematic, since both of
the context models have difficulty distinguishing between close categories.
Unfortunately, there are still misclassifications between close categories such as
fashion and arts, or war and military, since these documents tend to contain multiple LR
terms from both of the close categories and neither of the context models can
distinguish between them.
5. Unclassified documents: Many of the documents in the IMDB corpus are not
classified to any of the categories in the taxonomy. Our scoring method succeeded in
ranking short unclassified documents lower, since even when an LR term was found
there were no other context terms. However, longer unclassified documents sometimes
did get a higher ranking.
6. Category characteristics: We examined two interesting issues concerning
category characteristics: (i) whether categories with more documents in the test set are
ranked better than categories with fewer documents, and (ii) whether estimating the size
of the category and taking classifications from the ranked list relative to the frequency
of the category in the corpus might be beneficial. Unfortunately, the answer to both of
these questions is negative.
The quality of a category's ranking does not depend on the category size, but
rather on the specificity of the category name. Category names that express their
specific topic meaning, such as football, buddhism and motorcycle, were ranked much
better than category names that express a more general meaning of the topic, such as
history, disability, environment and travel.
Estimating the size of the categories is also problematic, since the
frequency of the category name in the whole corpus does not provide an accurate
estimate. For example, the category name showBiz or show business is rather rare in
the corpus, but this category contains a relatively high number of documents.
In our research we did not make any adjustments to the category names, but
simply set the seeds to be the category names as they were given in the taxonomy. The
reason for this policy was that we wanted our results to be replicable, so we did not use
any prior knowledge of the resources' behavior. Prior knowledge about seeds that get
more effective expansions, such as cooking vs. cookery in Wikipedia or computer vs.
computing in the Dice LR resource, might be very helpful. By the manual effort of
tuning the category names, which would be acceptable in the industry, further
improvement can easily be achieved.
6.4 Bootstrapping results
The bootstrapping step suggested by (Barak et al., 2009) and others (Ko and Seo, 2004;
Gliozzo et al., 2005) consists of training a supervised classifier with an initial labeled
set created by a previous unsupervised step. In Section 3 we showed that bootstrapping
performance on the IMDB corpus by the method of (Barak et al., 2009) was very poor,
yielding lower performance than the unsupervised classification that constitutes its input
training set. (Barak et al., 2009) used the similarity scores obtained by the combined
scoring method presented in Section 2.2 to produce an initial labeled set of documents
to train a supervised classifier. In their initial labeled set, each document was considered
as classified only to the best category, and this set was used to train an SVM classifier
for each category. Classification was determined independently by the classifier for
each category, allowing multiple classes per document. Their first unsupervised step
was based on single-class classification of each document in the unlabeled set.
In Section 5 we introduced our multi-class unsupervised classification scheme
where each document may be classified to zero, one or more categories. Consequently,
in this section we present a different approach for producing an initial labeled set of
documents using a multi-class classification scheme as our first step.
We also introduced (in Section 5) a different evaluation methodology, which
suggests drawing average recall-precision curves, where each cut-off point
corresponds to a certain precision level. Ideally, we would have liked to set a high
precision level and take the documents that meet this requirement as positive training
examples for the supervised classifier. However, we lacked the human resources for
manually tuning each of the categories within this thesis framework. We therefore had to
adopt a standard global cut-off scheme. We selected the percentage cut-off scheme,
in which a top percentage of the highly ranked documents of each category is selected as
positive examples for the category classifier.
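To make the notion of measuring the recall achievable at a certain precision level concrete, the following is a minimal sketch in Python. The function name and the input format are our own illustrations, not the evaluation code used in this thesis.

def recall_at_precision(ranked_docs, gold_positives, min_precision):
    """Walk down a category's ranked document list and return the best
    recall achievable while precision stays >= min_precision.

    ranked_docs    -- document ids sorted by descending classification score
    gold_positives -- set of document ids labeled with this category
    min_precision  -- the desired precision level (e.g. 0.8)
    """
    if not gold_positives:
        return 0.0
    true_pos = 0
    best_recall = 0.0
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in gold_positives:
            true_pos += 1
        precision = true_pos / rank
        if precision >= min_precision:
            # This cut-off point meets the precision requirement;
            # record the recall achieved so far.
            best_recall = max(best_recall, true_pos / len(gold_positives))
    return best_recall

# Hypothetical usage: documents d3 and d1 are the gold positives.
print(recall_at_precision(["d3", "d1", "d7", "d2"], {"d3", "d1"}, 0.8))  # 1.0

The cut-off point for each category can then be chosen according to the desired recall-precision trade-off, as described above.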
Good negative examples for the supervised training process need to have two
properties: (i) high confidence that they are indeed negative examples and (ii) at least
some of them should be close enough to the positive examples, containing passing
references or sharing similar contexts. Otherwise the classifier might simply classify
according to the appearance of the category names in the documents, rather than trying
to learn more features for the categories.
We used our Dice-based context model for selecting the negative examples.
Analyzing our Dice-based context model, we found that 99% of the classifications
that were ranked relatively low in the classification list of a given document were
inappropriate. However, as we had hoped, some of these documents did have something
in common with the inappropriate category. Therefore, we sorted the classification list
of each document and selected the document as a negative example for the categories
that were ranked below a certain rank in the sorted list.
The last issue was selecting the ratio between the positive and negative
examples. Since we were unable to estimate the real proportion of a category in the corpus,
we selected the same number of examples for the negatives as for the positives of
each category.
We manually tuned both of our parameters. The percentage classification
threshold for the positive examples was set to 0.3, with the aim of obtaining enough
training documents out of the 120,000 unlabeled documents in the IMDB corpus. The
Dice-based context rank was set to 5, based on experiments on the development set. We
represented the input example vectors for the SVM classifier in tf-idf
weighted term space and used a common feature selection method suggested by (Forman,
2003), which removes the least common features in the corpus, i.e., features that appear
fewer than 3 times.
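The following sketch illustrates how such an initial training set could be assembled under the scheme just described. It is a simplified illustration under stated assumptions: the data structures, the helper names, and the use of scikit-learn's TfidfVectorizer and LinearSVC are our own choices rather than the original implementation, and we interpret the 0.3 threshold as selecting the top-ranked fraction of each category's list.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def build_training_ids(ranked_by_category, ranked_cats_by_doc,
                       top_fraction=0.3, negative_rank=5):
    """Select positive and negative document ids per category.

    ranked_by_category -- {category: [doc ids, best classification first]}
    ranked_cats_by_doc -- {doc id: [categories, best Dice-based context
                           score first]}
    """
    training = {}
    for category, ranked_docs in ranked_by_category.items():
        # Positives: the top fraction of the category's ranked list.
        n_pos = max(1, int(len(ranked_docs) * top_fraction))
        positives = ranked_docs[:n_pos]
        # Negatives: a document is a negative example for every category
        # ranked below the cut-off rank in its own classification list.
        negatives = [doc for doc, cats in ranked_cats_by_doc.items()
                     if category in cats[negative_rank:]]
        # Keep a 1:1 positive/negative ratio, since the real category
        # proportions in the corpus are unknown.
        training[category] = (positives, negatives[:n_pos])
    return training

def train_category_svm(pos_texts, neg_texts):
    """Train one per-category SVM in tf-idf weighted term space.

    min_df=3 approximates the (Forman, 2003) style feature selection
    described above, dropping the least common features; note that it
    counts document frequency rather than raw corpus frequency.
    """
    vectorizer = TfidfVectorizer(min_df=3)
    X = vectorizer.fit_transform(pos_texts + neg_texts)
    y = [1] * len(pos_texts) + [0] * len(neg_texts)
    return vectorizer, LinearSVC().fit(X, y)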
However, we did not obtain any reasonable results: both recall and precision
were lower than 0.1. These experimental results lead us to believe that the problem lies
not in the bootstrapping scheme, but rather in the IMDB collection. The IMDB documents
are too short and their quality is low; further efforts to collect better textual information
on the IMDB movies are needed in the future.
Moreover, the bootstrapping step actually contradicts the rationale of LR-based
approaches. LR specifies an accurate semantic relation, which aims to identify whether
the meaning of a certain category name is referenced by the document text. This
measure aims at a more appropriate relation on which to base the TC decision, since it
requires an actual reference to the category topic in the text, rather than general context
similarity. In contrast, in the bootstrapping step a supervised classifier is used to
perform the final categorization step on the test corpus. The supervised classifier does
not capture the exact semantic relation needed to assess the classification decision; it
might model the broader context of the text rather than the specific topic it discusses.
Therefore, we conclude that when LR-based approaches are applied, it is desirable to
avoid the bootstrapping step.
Chapter 7
Conclusion and future work
In this work we investigated the keyword-based TC approach, which is based on the
integration of reference models and context models. The proposed method integrates a
new LR resource and a new context model into the scoring method proposed by (Barak et al.,
2009). Our research focused on a multi-class classification scheme with a novel
evaluation approach, revealing a new perspective on the classification results.
Our investigation highlights several important conclusions about the integration
of the two models, about each of the new models, and about the classification and
evaluation scheme:
1. Indeed, as (Barak et al., 2009) reported, our analysis reveals that the reference
requirement as the basis for the TC score helps to classify documents according to the
topic they actually discuss, as opposed to using context models, which only reveal the
documents' broader context.
2. Utilizing statistical correlation from a corpus of the target domain can be useful for
both context representation and lexical reference extraction. Word co-occurrence
information captures topical semantic relationships between words; these relationships
include both referring terms and other context-related terms.
3. Our Dice-based context model is much simpler than the LSA context model, which is
complex to implement and computationally expensive. Furthermore, the Dice-based
context model uses a direct representation of word co-occurrence that is easy to analyze,
while the LSA representation is implicit. Nevertheless, it is comparable in performance
to the LSA context model (a minimal sketch of the underlying Dice computation appears
after this list).
4. Combining different LR resources provides a more complete perspective and better
expansion abilities. However, when no external resources are available, for example
in a different language, statistical LR resources can be very beneficial.
5. When a small degree of manual intervention is possible, such as in an industrial
classification setting for a large real-world taxonomy, our new classification and
evaluation methodology is more suitable, and increases performance significantly.
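As a concrete illustration of conclusion 3, the following is a minimal sketch of the Dice coefficient over document-level word co-occurrence. The function name and the document-level counting are our own assumptions for illustration, not the exact implementation used in this thesis.

from collections import Counter
from itertools import combinations

def dice_scores(documents):
    """Compute Dice(w1, w2) = 2 * df(w1, w2) / (df(w1) + df(w2)),
    where df counts the number of documents containing the word(s).

    documents -- iterable of token lists, one list per document
    """
    word_df = Counter()   # documents containing each word
    pair_df = Counter()   # documents containing each word pair
    for tokens in documents:
        vocab = set(tokens)
        word_df.update(vocab)
        pair_df.update(combinations(sorted(vocab), 2))
    return {pair: 2 * n / (word_df[pair[0]] + word_df[pair[1]])
            for pair, n in pair_df.items()}

# Hypothetical toy corpus: "football" and "goal" co-occur in 2 of 3 docs.
docs = [["football", "goal", "match"],
        ["football", "goal"],
        ["football", "league"]]
print(dice_scores(docs)[("football", "goal")])  # 2*2 / (3+2) = 0.8

Unlike an LSA dimension, every such score traces back to explicit co-occurrence counts, which is what makes the representation easy to analyze.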
We will now describe several promising research directions observed during the course
of this study.
1. One of the advantages of a statistical LR resource is that more documents can be
added for categories with a small number of documents, by crawling the web or using
an additional corpus. We suggest that this line of research be investigated further, in
order to enrich and optimize the Dice LR resource and exploit additional reference
knowledge.
2. In our research we did not adjust the category names, but simply set the seeds to
be the category names as given in the taxonomy. Future work should
investigate the contribution of human involvement in seed selection.
3. Obviously, the recall of our method can be improved by utilizing further reference
knowledge resources. (Kotlerman et al., 2009) defined novel directional statistical
measures of semantic similarity. Their measure is based on the Distributional
Similarity Hypothesis, which suggests that words occurring within similar contexts are
semantically similar (Harris, 1968); however, their measure is asymmetric. The
directionality is based on Distributional Inclusion, which assumes that the prominent
semantic traits of an expanding word should co-occur with the expanded word as well.
We suggest improving the keyword-based TC task by utilizing Directional
Distributional Similarity expansion resources, which might be based on the above
measure (a simplified illustration follows).
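To illustrate the intuition behind such a directional measure, here is a minimal sketch of a distributional-inclusion-style score. This is a deliberately simplified stand-in (weighted feature coverage), not the actual measure of (Kotlerman et al., 2009), and the names and inputs are hypothetical.

def inclusion_score(narrow_features, broad_features):
    """Directional score: how well the context features of the narrower
    (expanding) word are covered by those of the broader (expanded) word.

    Both arguments map context words to association weights. The score
    is asymmetric: inclusion_score(u, v) != inclusion_score(v, u).
    """
    total = sum(narrow_features.values())
    if total == 0:
        return 0.0
    covered = sum(w for f, w in narrow_features.items()
                  if f in broad_features)
    return covered / total

# Hypothetical feature vectors: the contexts of "karate" are largely
# covered by those of "martial arts", but not vice versa.
karate = {"fight": 0.9, "belt": 0.7, "dojo": 0.6}
martial_arts = {"fight": 0.8, "belt": 0.5, "dojo": 0.4, "discipline": 0.6}
print(inclusion_score(karate, martial_arts))   # 1.0: karate -> martial arts
print(inclusion_score(martial_arts, karate))   # lower in the other direction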
Appendix A
Our complete IMDB taxonomy
Categories

1 Religion
  1.1 Buddhism
  1.2 Hinduism
  1.3 Christianity
    1.3.1 christmas
  1.4 Islam
  1.5 Judaism

2 Sport
  2.1 Bicycle
  2.2 Boxing
  2.3 Fishing
  2.4 Football
  2.5 Golf
  2.6 Hockey
  2.7 martial-arts
    2.7.1 karate
  2.8 Athletics
  2.9 Running
  2.10 shooting
  2.11 Skiing
  2.12 soccer
  2.13 water sports
    2.13.1 surfing
    2.13.2 swimming
  2.14 Tennis
  2.15 Baseball
  2.16 Wrestling
  2.17 basketball
  2.18 Horseracing
  2.19 Olympic games

3 Interests (NON-CAT)
  3.1 Beach
  3.2 Outdoor
  3.3 Gardening
  3.4 Pets
  3.5 Fitness
  3.6 Cookery
  3.7 Fashion
  3.8 Computing
  3.9 Travel
  3.10 Motoring
    3.10.1 cars
    3.10.2 Motorcycle
  3.11 Trains
  3.12 Airplanes
  3.13 Ships
  3.14 Radio
  3.15 Business
  3.16 Nature
    3.16.1 Animals
  3.17 outer Space
  3.18 the environment
  3.19 Showbiz
  3.20 Traditions
  3.21 Infants
  3.22 Military
  3.23 Weather

4 Arts
  4.1 Cinema
  4.2 Advertising
  4.3 Theater
  4.4 Music
    4.4.1 Opera
    4.4.2 classical music
    4.4.3 Jazz
    4.4.4 Pop/rock
    4.4.5 country music
    4.4.6 Hip Hop
  4.5 Dance
    4.5.1 Ballet

5 Science
  5.1 Medicine
    5.1.1 disability
  5.2 Technology
  5.3 Psychology

6 Education
  6.1 School
  6.2 College/University

7 Miscellaneous (NON-CAT)
  7.1 crime (NON-CAT)
    7.1.1 prison
    7.1.2 mafia
    7.1.3 drugs
    7.1.4 fraud
    7.1.5 gambling
    7.1.6 terrorism
  7.2 Literature
  7.3 History
  7.4 Political
  7.5 Social (NON-CAT)
    7.5.1 racism
  7.6 Legal
  7.7 Communism
  7.8 War
    7.8.1 World war 1
    7.8.2 World war 2
  7.9 Aliens
  7.10 comic-book
  7.11 journalism
  7.12 mythology
Appendix B
The annotation guidelines
You are given a list of films with their plot description and a taxonomy of film categories.
The taxonomy
The taxonomy is made up of film subject matters and is arranged in hierarchical order, so that if a sub-category is marked, its ancestors are also relevant. This is true in all cases except when a category is present only in order to group similar subjects together, in which case it is marked with the text
(NON-CAT) next to it.
For example:
A film categorized as dealing with 'cars' will also be relevant to 'motoring' but not to 'interests' as it is not a category.
3 Interests (NON-CAT)
3.9 Travel
3.10 Motoring
3.10.1 cars
3.10.2 Motorcycle
* Note - the taxonomy is not exhaustive, you may find that there is no category in the taxonomy which accurately fits the film even though you can think of a subject matter that does. If a broader category is present choose it, otherwise choose none.
For each film, you must decide which categories (if any) out of the taxonomy are relevant to it. You can choose as many or as few categories as you see fit, or none.
* Note – if you find more than one category, please put each category in a separate line (insert lines if necessary).
You must categorize according to the following guidelines:
1. Is the background story prominent – not just a passing reference? Examples:
The following film should be categorized as relevant to 'crime': "Jessie is an ageing career criminal who has been in more jails, fights, schemes, and lineups than just about anyone else. His son Vito, while currently on the straight and narrow, has had a fairly shady past and is indeed no stranger to illegal activity. They both have great hope for Adam, Vito's son and Jessie's grandson, who is bright, good-looking, and without a criminal past. So when Adam approaches Jessie with a scheme for a burglary he's shocked, but not necessarily disinterested...."
The following film should be categorized as relevant to 'animals': "Farmer Hoggett wins a runt piglet at a local fair and young Babe, as the piglet decides to call himself, befriends and learns about all the other creatures on the farm. He becomes special friends with one of the sheepdogs, Fly. With Fly's help, and Farmer Hoggett's intuition, Babe embarks on a career in sheepherding with some surprising and spectacular results. Babe is a little pig who doesn't quite know his place in the world. With a bunch of odd friends, like Ferdinand the duck who thinks he is a rooster and Fly the dog he calls mom, Babe realizes that he has the makings to become the greatest sheep pig of all time, and Farmer Hogget Knows it. With the help of the sheep dogs Babe learns that a pig can be anything that he wants to be."
The following film should not be categorized as relevant to 'basketball': "This gritty drama follows two high school acquaintances, Hancock, a basketball star, and Danny, a geek turned drifter, after they graduate. The first film commissioned by the Sundance Film Festival, it portrays the other half of the American dream, as Hancock and his cheerleader girlfriend Mary wander to a middle-class mediocrity itself out of reach for Danny and his psychotic wife Bev."
2. You must not base your decision on prior knowledge of the film, only on information provided in the plot.
Bibliography
1. Buckley, C., Singhal, A. and Mitra, M. "New retrieval approaches using
SMART: TREC 4". In Proc. TREC, 1995.
2. Chade-Meng Tan, Yuan-Fang Wang, Chan-Do Lee: The Effectiveness of
Bigrams in Automated Text Categorization. ICMLA 2002: 275-281
3. Chirag Shah and Bruce W. Croft. 2004. Evaluating high accuracy retrieval
techniques. In Proceedings of SIGIR.
4. D. Downey and O. Etzioni. Look ma, no hands: Analyzing the monotonic
feature abstraction for text classification. In Advances in Neural Information
Processing Systems (NIPS) 21, 2009, January 2009.
5. Dan Moldovan and Vasile Rus. 2001. Logic form transformation of wordnet and
its applicability to question answering. In Proceedings of ACL.
6. Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The
third pascal recognizing textual entailment challenge. In Proceedings of ACL-
WTEP Workshop.
7. Deerwester, S., S. Dumais, G. Furnas, T. Landauer, and R. Harshman. 1990.
Indexing by latent semantic analysis. Journal of the American Society of
Information Science.
8. Ephraim Sachs, Semantic Aspects of Information Retrieval, MSc thesis, Hebrew
University, Israel, 2008.
9. Eyal Shnarch, Libby Barak, Ido Dagan. Extracting Lexical Reference Rules
from Wikipedia. In Proceedings of ACL 2009.
10. Fellbaum, C., editor. 1998. WordNet: An Electronic Lexical Database
(Language, Speech and Communication). The MIT Press.
11. George Forman, An extensive empirical study of feature selection metrics for
text classification, The Journal of Machine Learning Research, 3, 2003.
12. Giampiccolo, D., B. Magnini, I. Dagan, and B. Dolan. 2007. The third PASCAL
recognizing textual entailment challenge. In Proceedings of ACL-WTEP
Workshop.
13. Gliozzo, A., C. Strapparava, and I. Dagan. 2005. Investigating unsupervised
learning for text categorization bootstrapping. In Proc. of the Joint Conference
on Human Language Technology / Empirical Methods in Natural Language
Processing (HLT/EMNLP), Vancouver.
14. Ido Dagan, Oren Glickman, and Bernardo Magnini, editors. 2006. The PASCAL
Recognising Textual Entailment Challenge, volume 3944. Lecture Notes in
Computer Science.
15. Joachims, T. 1999. Making large-scale SVM learning practical. In B. Scholkopf,
C. Burges, and A. Smola, editors, Advances in kernel methods: support vector
learning. MIT Press, Cambridge, MA, USA, chapter 11, pages 169-184.
16. Jun'ichi Kazama and Kentaro Torisawa. 2007. Exploiting Wikipedia as external
knowledge for named entity recognition. In Proceedings of EMNLP-CoNLL.
17. K. Morik, P. Brockhausen, and T. Joachims. 1999. Combining statistical
learning with a knowledge-based approach - A case study in intensive care
monitoring. Proc. 16th Int'l Conf. on Machine Learning (ICML-99).
18. Ko, Y. and J. Seo. 2004. Learning with unlabeled data for text categorization
using bootstrapping and feature projection techniques. In Proc. of ACL-04,
Barcelona, Spain, 2004.
19. Libby Barak, Ido Dagan, Eyal Shnarch. Text Categorization from Category
Name via Lexical Reference. In Proceedings of the North American Chapter of the
Association for Computational Linguistics - Human Language Technologies
(NAACL HLT), 2009.
20. Libby Barak, Keyword-based Text Categorization, MSc thesis, Bar-Ilan
University, Israel, 2008.
21. Lili Kotlerman, Ido Dagan, Idan Szpektor and Maayan Zhitomirsky-Geffet.
2009. Directional Distributional Similarity for Lexical Expansion. ACL-IJCNLP
2009 (short paper).
22. Lin, Dekang. "Automatic Retrieval and Clustering of Similar Words."
COLING-ACL '98. Montreal, Canada, 1998.
23. Liu, B., X. Li, W. S. Lee, and P. S. Yu. 2004. Text classification by labeling
words. In Proc. of AAAI-04, San Jose, July.
24. M. de Buenaga, J.M. Gomez, and B. Diaz. 1997. Using wordnet to complement
training information in text categorization. In Recent Advances in Natural
Language Processing II: Selected Papers from RANLP'97, volume 189 of
Current Issues in Linguistic Theory (CILT), pages 353-364. John Benjamins,
2000.
25. Mandala, Rila, Takenobu Tokunaga, and Hozumi Tanaka. "Combining Multiple
Evidence from Different Types of Thesaurus for Query Expansion." SIGIR.
Berkeley, CA, 1999.
26. Manning, C.D., Raghavan P. and Schutze H. "Introduction to Information
Retrieval." Cambridge University Press, 2008.
27. Marius Pasca and Sanda M. Harabagiu. 2001. The informative role of wordnet
in open-domain question answering. In Proceedings of NAACL Workshop on
WordNet and Other Lexical Resources.
28. Martin S. Chodorow, Roy J. Byrd, and George E. Heidorn. 1985. Extracting
semantic hierarchies from a large on-line dictionary. In Proceedings of ACL.
29. McCallum, A. and K. Nigam. 1999. Text classification by bootstrapping with
keywords, EM and shrinkage. In ACL99 – Workshop for Unsupervised Learning
in Natural Language Processing.
30. Nancy Ide and Jean Véronis. 1993. Extracting knowledge bases from machine-
readable dictionaries: Have we wasted our time? In Proceedings of KB & KS
Workshop.
31. Oren Glickman, Ido Dagan and Eyal Shnarch. Lexical Reference: a Semantic
Matching Subtask. In Proceedings of EMNLP 2006, 22-23 Jul 2006, Sydney,
Australia.
32. Perez-Aguera, J.R. and Araujo, L. "Comparing and Combining Methods for
Automatic Query Expansion." In Advances in Natural Language Processing and
Applications. Madrid, Spain, 2007.
33. S. Scott and S. Matwin. (1999). Feature engineering for text classification. Proc. of
16th International Conference on Machine Learning, Bled, Slovenia.
34. Salton, G. and M. H. McGill. 1983. Introduction to modern information
retrieval. McGraw-Hill, New York.
35. Schutze, H., Hull, D. and Pedersen, J.O. A comparison of classifiers and
document representations for the routing problem. In SIGIR '95: Proceedings of
the 18th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, 229-279, 1995.
36. Schutze, H. and Pedersen, J.O. "A cooccurrence-based thesaurus and two
applications to information retrieval." Information Processing and
Management, Vol. 33, No. 3, 1997.
37. Xu, J., W.B. Croft, “Query Expansion using Local and Global Document
Analysis”, ACM SIGIR 1996, ACM, 1996.
Abstract (Hebrew)

This thesis presents research in the field of text categorization. The classification is keyword-based, and its only input is a topical taxonomy.

The prevailing research approach to text categorization is the supervised approach. The main drawback of this approach is that it requires a substantial amount of manually labeled documents, which are often unavailable and impractical to produce. In keyword-based text categorization, a few keywords are provided for each category; these are easier to produce than even a few labeled documents. Nevertheless, this approach still requires nonnegligible manual work in creating a keyword list for each category. The research in this thesis is based on a new approach, first proposed by (Gliozzo et al., 2005), which eliminates the requirement for manual treatment of each category by using the category names alone as the initial keyword list.

In this thesis we adopted the approach of (Barak et al., 2009), which combines two types of similarity between words. One type is similarity to words that contain a specific reference to the meaning of the category name (Reference), while the second type captures words that tend to appear in contexts typical of the category name but do not necessarily imply its specific meaning (Context).

This thesis is part of the Negev consortium (the next generation of personalized video content services), within the content recommendation task. Therefore, as a first step we created a taxonomy for video content, built a video corpus, and annotated it. We then focused on adapting the above approach to our classification task.

Classification into a large taxonomy representing the real world raises different issues than classification into an artificial taxonomy created specifically for a particular academic dataset. This study proposes a classification and evaluation scheme suitable for a large taxonomy, and in particular for the IMDB (Internet Movie Database) taxonomy.

By measuring statistical correlations over the IMDB corpus, we improve both the reference model and the context model, aiming to improve the method proposed by (Barak et al., 2009). We propose a simple context model based on co-occurrence statistics of words in the corpus documents, in particular on the Dice coefficient (Mandala et al., 1999), together with a Lexical Reference resource that is also based on the same statistical measure.

In addition, in this thesis we propose a different approach to classification and to the evaluation of the results, based on the assumption that tuning a parameter for each category is an acceptable demand under industrial circumstances such as those of the Negev consortium. We adopted the classification scheme that allows classifying each document to one category, to several, or to none, since many of the documents in our corpus in fact belong to more than one category, while many others do not fit any of the taxonomy categories. We measured how many correctly classified documents are obtained at a given precision level; the algorithm allows choosing the precision level according to the desired recall-precision trade-off.

Positive empirical results are presented for our complete improved method, which indeed performs better than the method proposed by (Barak et al., 2009). Our analysis reveals that the requirement of a specific reference to the category name, as the basis for the classification score, helps the topical classification of documents, in contrast to using context models alone, which reveal only the broader context of the documents.
This work was carried out under the supervision of Prof. Ido Dagan and Prof. Moshe Koppel
of the Department of Computer Science,
Bar-Ilan University.
Bar-Ilan University
The Department of Computer Science

Text Categorization for a Multi-class Taxonomy

Chaya Liebeskind

This work is submitted in partial fulfillment of the requirements for the Master's
Degree in the Department of Computer Science, Bar-Ilan University.
Ramat-Gan, Israel, November 2009, Cheshvan 5770