Bar Ilan University
The Department of Computer Science
Text Categorization for Large
Multi-class Taxonomy
by
Chaya Liebeskind
Submitted in partial fulfillment of the requirements for the Master's
Degree in the Department of Computer Science, Bar-Ilan University
Ramat Gan, Israel November 2009, Cheshvan 5770
This work was carried out under the supervision of
Prof. Ido Dagan and Prof. Moshe Koppel
The Department of Computer Science
Bar-Ilan University
Israel
Abstract
This thesis investigates Keyword-based Text Categorization (TC) using only a topical
taxonomy of category names as input. The TC task is most commonly addressed as a
supervised learning task. However, the supervised setting requires a substantial amount
of manually labeled documents, which is often impractical in real-life settings.
In Keyword-based TC methods the knowledge about the classes of interest is provided in the form of a few keywords per class. A few keywords can typically be generated more quickly and easily than even a small number of labeled texts. However, the keyword-based approach still requires nonnegligible manual work in creating a representative keyword list per category. Our research is based on a new approach, first proposed in (Gliozzo et al., 2005), which eliminates this requirement by using the category names alone as the initial keyword list.
We adopted the scheme of (Barak et al., 2009), which combines two types of similarity. One type regards words which refer specifically to the category name's meaning (Reference), while the other type captures typical context words for the category which do not necessarily imply its specific meaning (Context).
This thesis is a part of the Negev Consortium (Next Generation Personalized
Video Content Service), within the content recommendation task. Therefore, our first
step was creating a taxonomy for video content along with video dataset construction
and annotation. We then focused on the adaptation of the above scheme to our specific
classification task.
Classification into a large real-world taxonomy raises different issues than
classification for an artificial taxonomy created specifically for a certain academic
dataset. This study describes a proposed classification and evaluation scheme for such a
taxonomy and particularly for our IMDB (Internet Movie Database) taxonomy.
We utilized statistical correlation measured over the target IMDB corpus for improving both the reference model and the context model, aiming to improve the state-of-the-art method proposed by (Barak et al., 2009). We propose a simpler context model based on the Dice coefficient (Mandala et al., 1999), which is a measure of statistical correlation, along with a new statistical Lexical Reference (LR) resource which is based on the Dice coefficient as well.
Furthermore, we offer a different classification and evaluation scheme based on
the assumption that tuning a parameter for each category is an acceptable demand under
the industrial circumstances of the Negev Consortium. We adopt the multi-class classification scheme, since many of the documents in our dataset are classified to more than one category in the gold standard, while many others are not classified to any of the taxonomy categories. We measure how much recall can be achieved at a certain precision level, and then select our precision level according to the desired recall-precision trade-off.
Positive empirical results are presented for our complete method, which indeed
shows higher performance than the previous state-of-the-art method of (Barak et al.,
2009). Our analysis reveals that the reference requirement as the basis for the TC score
helps to classify documents according to the topic they actually discuss, as opposed to
using context models alone, which only reveal the documents' broader context.
Acknowledgements
I would like to take this opportunity to thank the people whose joint efforts assisted me
in writing this thesis.
First and foremost, my greatest thanks go to Prof. Ido Dagan for introducing me to the
wonderful world of Natural Language Processing, and for supervising this research. His
constant support, thorough guidance, and great patience enabled this work.
I wish to thank my supervisor, Prof. Moshe Koppel for providing advice and sharing his
experience.
My gratitude goes also to all my NLP lab members for sharing with me their time and
moral support. I especially want to express my appreciation to Eyal Shnarch, Idan Szpektor, Jonathan Berant, Lili Kotlerman, Roy Bar-Haim and Shachar Mirkin for
sharing with me their words of wisdom, experience and advice when needed.
I would like to thank Naomi Zeichner for her assistance in the taxonomy creation and
the corpus annotation.
I wish to thank Libby Barak for setting up the groundwork for this research, providing
me with her text categorization system and for her guidance at the beginning of this
work.
I want to thank my parents for encouraging me to pursue my academic goals and
dreams, and for giving me the special kind of support only family can provide. I would
also like to thank my husband for his unique support, understanding and faith in me, which encouraged me greatly throughout this work, and my children for simply loving me.
This thesis was partly supported by the Negev Consortium (www.negevinitiative.org),
funded by the Israeli Ministry of Industry, Trade and Labor.
Contents
Introduction
Background
2.1 Unsupervised keyword-based text categorization
2.2 Categorization based on category name
2.3 Lexical Reference
2.4 Query expansion
State-of-the-art performance on the IMDB dataset
3.1 The IMDB dataset
3.2 The limited performance of previous state-of-the-art methods
3.2.1 Unsupervised single-class classification
3.2.2 Bootstrapping
3.3 Applying a state-of-the-art query expansion method
Algorithm improvements
4.1 Utilizing statistical correlation
4.1.1 Dice-based context model
4.1.2 Dice expansions resource
4.2 Combined scoring
A Classification and evaluation scheme for a large real-world taxonomy
5.1 Multi-class classification scheme
5.2 Evaluation measures
5.2.1 Recall-Precision curves
5.2.2 Mean Average Precision (MAP)
Results and Analysis
6.1 Results
6.2 Contribution of Our Method Components
6.2.1 Component Ablation Tests
6.2.2 Resources Ablation Tests
6.3 Further Analysis
6.3.1 Recall-Precision Curves Comparison
6.3.2 Error Analysis
6.4 Bootstrapping results
Conclusion and future work
Appendix A: Our complete IMDB taxonomy
Appendix B: The annotation guidelines
List of Figures
3.1: A part of the IMDB taxonomy
3.2: An example of the problem with the cosine similarity function
5.1: A typical recall-precision graph
6.1: R@P averaged curves methods comparison
6.2: Comparison of R@P average curves of ablation tests
6.3: Comparison of R@P average curves of resources ablation tests
6.4: Recall-precision curve approaches comparison
List of Tables
3.1: Single-class classification results for the IMDB dataset
3.2: Document samples for the passing reference phenomenon
3.3: Document samples for the ambiguity phenomenon
3.4: Missing expanding terms
3.5: Incorrect or ambiguous expanding terms
3.6: Final bootstrapping results
3.7: Query expansion results
4.1: Dice expansions resource marginal contribution
5.1: Contingency Table for one category
6.1: MAP values methods comparison
Chapter 1
Introduction
Topical Text categorization (TC – also known as text classification) is the task of
automatically classifying a set of documents into categories (or classes, or topics) from
a predefined set.
With the rapid growth of online information, text categorization has become one
of the key techniques for handling and organizing text data. Text categorization
techniques are used to classify news stories, to find interesting information on the web
and to guide a user’s search through hypertext browsing. Since building text classifiers
by hand is difficult and time consuming, it is advantageous to learn classifiers
automatically.
The classical supervised learning paradigm requires many hand-labeled
examples to learn accurately. Manually categorizing unlabeled documents for creating
training documents is difficult due to the amount of human labor it requires. Therefore, some recent research has focused on unsupervised learning algorithms with a bootstrapping technique. These algorithms require only unlabeled text collections, which in general are easily available.
Keyword-based TC methods aim at a more practical setting. Each category is
represented by a list of characteristic keywords, which should capture the category
meaning. Classification is then based on measuring similarity between the category
keywords and the classified documents, typically followed by a bootstrapping step. The
manual effort is thus reduced to providing a keyword list per category (McCallum and
Nigam, 1999). (Ko and Seo, 2004; Liu et al., 2004) even partly automated this step,
using clustering to generate candidate keywords. Nevertheless, the method still requires
manual specification as part of the classification process.
(Gliozzo et al., 2005) succeeded in eliminating the requirement for manual
specification of keywords by using the category name alone as the initial keyword, yet
obtaining superior performance within the keyword-based approach. This was achieved
by measuring similarity between category names and documents in Latent Semantic Analysis (LSA) space (Deerwester et al., 1990), which implicitly captures contextual similarities for the category name through unsupervised dimensionality reduction. They generated an initial similarity-based classification that assigns the single most similar category to each document, with the similarity measure typically being the cosine between the
corresponding vectors. This initial unsupervised classification is used, in the subsequent
bootstrapping step, to train a standard supervised classifier (either with single or multi-
class labels per document), yielding the eventual classifier for the category set.
Requiring only category names as user input seems very appealing, particularly when
labeled training data is too costly, while modest performance (relative to supervised
methods) is still useful.
(Barak et al., 2009) offered a novel taxonomy-based approach for keyword-
based TC, which bases its similarity measure on a Lexical Reference (LR) measure
instead of a context measure only. LR, suggested by (Glickman et al., 2006), defines a
more accurate semantic relation, which aims to identify whether the meaning of a
certain term is referenced by some text. This measure aims at a more appropriate
relation to base the TC assumption on, since it requires the actual reference to the
category topic in the text rather than general context similarity. In order to identify
whether the topic is addressed by the text as the main topic and not as a marginal
("passing") reference, they integrate the LSA context model in their overall framework.
Once a reference to the category topic is recognized in a text, they also measure its
context similarity to the category topic. Using this novel integrated framework they
achieve a complementary semantic measure that quantifies the topics mentioned and the
contextual relevancy at the same time. In addition, they use the automatic integrated
measure to create an initial set of classified documents that are then used as input for a
supervised learner in a bootstrapping procedure in order to acquire a final classification.
They utilize relations that are likely to correspond to lexical reference from two
resources: the WordNet (Fellbaum, 1998) semantic relation ontology and the online
encyclopedia Wikipedia. The two resources are complementary by nature and, as
expected, they contribute to different types of categories and relations. Their context-
based method is based on the co-occurrence-based method used in (Gliozzo et al.,
2005), utilizing a Latent Semantic Analysis (LSA) method to represent the context
similarity of documents and categories.
Classification by a large real-world taxonomy is a difficult task. It raises
different issues than classification for an artificial taxonomy created specifically for a
certain academic dataset. This study describes a proposed classification and evaluation
scheme for such a taxonomy and particularly for the IMDB taxonomy.
In this thesis we adopt the approach of (Barak et al., 2009), which combines the
reference similarity score with the context similarity score. Aiming to improve their
method, we utilized statistical correlation for improving both the reference model and
the context model.
We propose a simpler context model based on Dice coefficient (Mandala et al.,
1999), which is a measure of statistical correlation. We expand each category name by
the top-k co-occurring terms with the highest Dice score and calculate the cosine
similarity score between the expanded vector and the document vector. This score is
used as our context model score. Combining our context model with the LSA context model yields a performance improvement. We also found that our simple Dice-based context model alone is comparable to the useful but complex LSA context model.
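To make this concrete, the following is a minimal sketch of such a Dice-based expansion and scoring step, assuming precomputed document-frequency and co-occurrence counts; all names are illustrative, and the actual system may weight terms differently (e.g., tf-idf):

```python
import math
from collections import Counter

def dice(cooc, df, a, b):
    """Dice coefficient from document-level counts: df[t] is the number of
    documents containing t, cooc[(a, b)] the number containing both."""
    denom = df.get(a, 0) + df.get(b, 0)
    return 2.0 * cooc.get((a, b), 0) / denom if denom else 0.0

def expand_category(name, vocabulary, cooc, df, k=20):
    """Expand a category name by its top-k co-occurring terms by Dice score."""
    scored = sorted(((dice(cooc, df, name, t), t) for t in vocabulary if t != name),
                    reverse=True)
    return [name] + [t for s, t in scored[:k] if s > 0]

def context_score(category_terms, doc_tokens):
    """Cosine similarity between the (binary) expanded category vector and
    the document's bag-of-words vector."""
    doc = Counter(doc_tokens)
    dot = sum(doc[t] for t in category_terms)
    norms = math.sqrt(len(category_terms)) * math.sqrt(sum(f * f for f in doc.values()))
    return dot / norms if norms else 0.0
```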
Furthermore, we utilized a new statistical LR resource, overcoming the problem
of WordNet and Wikipedia, which sometimes find good references that do not appear in
the corpus. We used the Dice coefficient measure for this purpose as well. We filtered
the top-k co-occurring terms, reduced their noise and achieved relatively precise LR
lists.
We also found that it is better to avoid the single-class classification scheme
suggested by (Barak et al., 2009), since we address a large real-world taxonomy. In a
real-world taxonomy, a portion of the documents may not be classified into any of the
categories, while many documents can be classified into multiple categories. On the one hand, single-class classification forces a classification for each document; on the other hand, it removes correct classifications, since only the category with the maximal classification score is selected for each document. We therefore adopt a multi-class
classification scheme, where each document may be classified to zero, one or more
categories.
In this thesis we offer a different classification and evaluation scheme, based on the assumption that tuning a parameter for each category is an acceptable demand under certain (particularly industrial) circumstances. We measure how much recall can be achieved at a certain precision level and select our precision level according to the desired recall-precision trade-off. Classifications that maintain precision greater than the given precision level are considered valid.
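One plausible way to compute this recall-at-precision measure for a single category, given scored classification decisions against the gold standard, is sketched below (the exact per-category tuning procedure is described in Chapter 5):

```python
def recall_at_precision(decisions, num_gold_positives, min_precision):
    """decisions: (score, is_correct) pairs for one category, over all documents.
    Returns the maximal recall reachable while precision stays >= min_precision."""
    best_recall, tp = 0.0, 0
    for rank, (_, correct) in enumerate(sorted(decisions, reverse=True), start=1):
        tp += int(correct)                      # true positives so far
        if tp / rank >= min_precision and num_gold_positives:
            best_recall = max(best_recall, tp / num_gold_positives)
    return best_recall
```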
Positive empirical results are presented for our complete method, which indeed
achieves better results than the state-of-the-art method suggested by (Barak et al.,
Our results support the hypothesis that the LR-based approach is more accurate than the
context-based approach alone. The results reveal that our classification and evaluation
scheme contributes to the performance improvement as well.
In Section 2 we provide some background on recent works and the resources
used for our method. Section 3 describes the IMDB dataset and analyses the state-of-
the-art performance on it. We describe our new context and reference models in
Sections 4.1.1 and 4.1.2. Section 5 discusses our different classification and evaluation
schemes. Results and analysis are presented in Section 6.
We show that using an initial reference method as the basis for the classification
decision provides promising results, which are restricted mostly by the recall of the LR
resource in use.
Our proposed method achieves higher precision results, suggesting that the
reference assumption along with the context verification is indeed more suitable to the
needs of the TC task. With the ongoing development of promising LR resources and
different context models, it is expected that TC methods based on the combined
approach can attain results showing further improvement.
Chapter 2
Background
The goal of Text Categorization (TC) is to classify texts into a number of predefined
categories. Supervised systems for TC require a large number of labeled training texts.
While it is easy to collect unlabeled texts, it is not so easy to manually categorize them
for creating training texts. Unsupervised Text Categorization enables building classifiers from unlabeled texts, thereby saving substantial human labor. This section describes related work and provides motivation for our method. Unsupervised keyword-based text categorization is first presented (Section 2.1), and then categorization based on category name is described and the framework and motivation of the method we employ are presented (Section 2.2). Next, background on the lexical reference framework
and resources and the motivation to use it are explained (Section 2.3). Finally, query
expansion methods are described and their relevancy to TC is explained (Section 2.4).
2.1 Unsupervised keyword-based text categorization
This study focuses on unsupervised keyword-based TC. In Unsupervised Text
Categorization, the knowledge about the classes of interest is provided in the form of a
few keywords per class. A few keywords can typically be generated more quickly and easily than even a small number of labeled texts.
One approach is to apply a bootstrapping procedure starting from a few describing keywords per class (McCallum and Nigam, 1999). The approach follows these steps: (a) based on keyword matching, a rule-based classifier categorizes the unlabeled examples, (b) the labeled data is then used to train a Naïve Bayes (NB) classifier using an Expectation Maximization (EM) algorithm, (c) the EM step is performed until the likelihood function converges.
A more recent approach based on the vector-space model of information
retrieval (Liu et al., 2004) was implemented by the following steps: (a) a clustering
algorithm was applied to find a list of candidate keywords, (b) a lexicographer chose
from that list a set of words for each category, (c) the unlabeled examples were
categorized using the highest similarity score defined by similarity metrics in the Vector
Space Model (VSM) (Salton and McGill, 1983), (d) an NB classifier was trained with the automatically labeled data, and (e) the whole collection was classified with the obtained classifier following the EM schema. This approach achieved slightly lower results than a supervised NB classifier on the same task.
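A minimal hard-EM sketch of this kind of keyword-based bootstrapping is shown below; the cited works use soft EM over class posteriors, and the keyword-matching rule here is a simplified assumption:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def keyword_bootstrap(docs, keywords, iterations=5):
    """docs: list of raw texts; keywords: {category: [keyword, ...]}.
    Step (a): label by keyword matching; then train NB and re-label (hard EM)."""
    labels = [max(keywords, key=lambda c: sum(d.count(w) for w in keywords[c]))
              for d in docs]
    X = CountVectorizer().fit_transform(docs)   # bag-of-words features
    nb = MultinomialNB()
    for _ in range(iterations):
        nb.fit(X, labels)                # retrain on the current labeling
        labels = nb.predict(X).tolist()  # re-label the whole collection
    return nb, labels
```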
2.2 Categorization based on category name
TC approaches that use only the category name as the input and require no manual
effort during the classification process have been attempted rather rarely in the
literature.
One approach was introduced by (Gliozzo et al., 2005). They obtained their best
performance using only the category name as the input for the bootstrapping algorithm.
Their algorithm includes the following steps: (a) expanding the category names using
Latent Semantic Space (Deerwester et al., 1990), such that the categories are
represented in LSA space, (b) separating relevant and non-relevant category information
using statistics from unlabeled examples by a Gaussian Mixture algorithm, (c)
classifying each unlabeled example to the most probable category and (d) training a
SVM classifier on the set of labeled examples resulting from the previous step. They
reported results on two datasets – 20 Newsgroups [1] and Reuters-10 (the 10 most frequent categories [2] in Reuters-21578 [3]), showing improvement relative to earlier
keyword-based methods.
1 The collection is available at www.ai.mit.edu/people/jrennie/20Newsgroups.
2 The first 10 categories are: Earn, Acquisition, Money-fx, Grain, Crude, Trade, Interest, Ship, Wheat and Corn.
3 available at http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
(Downey et al., 2009) introduced the Monotonic Feature (MF) abstraction, where
the probability of class membership increases monotonically with the MF’s value. In
document classification, the name of the class is a natural MF; the more frequently it is
repeated in a document, all other factors being equal, the greater the likelihood that the
document belongs to the class. They extended the experiments of (Gliozzo et al., 2005),
presenting theoretical and empirical results, showing that even relatively weak MFs can
be used to induce a noisy labeling over examples, and these examples can then be used
to train effective classifiers utilizing existing supervised or semi-supervised techniques.
They proved that the Monotonic Feature (MF) structure guarantees PAC learnability
using only unlabeled data, and that MFs are distinct from and complementary to
standard biases used in semi-supervised learning, including the manifold and cluster
assumptions.
The most recent approach has been reported by (Barak et al., 2009). They
proposed a novel scheme that models separately two types of similarity. One type
regards words that refer specifically to the category name’s meaning, such as pitcher
and yankees for the category baseball, while the other type regards typical context
words for the category that do not necessarily imply its specific meaning, like stadium
and field for the category baseball.
They were mostly inspired by (Glickman et al., 2006), who coined the term lexical
reference to denote concrete references in text to the specific meaning of a given term,
and assumed that a relevant document for a category typically includes concrete terms
that refer specifically to the category name’s meaning. Referring terms were collected
from WordNet and Wikipedia by utilizing relations that are likely to correspond to
lexical reference.
Referring terms were found in WordNet starting from relevant senses of the
category name. A category name sense was first expanded by its synonyms and
derivations, all of which were then expanded by their hyponyms. When a term had no
hyponyms it was expanded by its meronyms instead, since they observed that in such
cases meronyms often specify unique components that imply the holonym’s meaning,
such as Egypt for Middle East. However, when a term is not a leaf in the hyponymy
hierarchy, then its meronyms often refer to generic sub-parts, such as door for car.
Finally, the hyponyms and meronyms were expanded by their derivations. As a
common heuristic, they considered only the most frequent senses (top four) of referring
terms, avoiding low-ranked (rare) senses that are likely to introduce noise, when used
for expansion.
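A rough sketch of this expansion scheme over WordNet, using NLTK, is given below; it simplifies the sense-filtering heuristic and the manual choice of relevant category-name senses, so it should be read as an illustration rather than the exact procedure:

```python
from nltk.corpus import wordnet as wn  # requires NLTK with the WordNet data installed

def wordnet_referring_terms(category_name, max_senses=4):
    """Synonyms and derivations of the category name, expanded by hyponyms
    (or meronyms at hyponymy leaves), plus the derivations of those."""
    terms = set()
    for synset in wn.synsets(category_name)[:max_senses]:
        lemmas = list(synset.lemmas())
        # leaves of the hyponymy hierarchy are expanded by meronyms instead
        for expansion in synset.hyponyms() or synset.part_meronyms():
            lemmas.extend(expansion.lemmas())
        for lemma in lemmas:
            terms.add(lemma.name().replace('_', ' '))
            for derived in lemma.derivationally_related_forms():
                terms.add(derived.name().replace('_', ' '))
    return terms
```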
Additional referring terms were extracted from Wikipedia. For each category
name they extracted referring terms of two types, capturing hyponyms and synonyms.
Terms of the first type are Wikipedia page titles for which the first definition sentence
includes a syntactic “is-a” pattern whose complement is the category name, such as
Chevrolet for the category Autos. Terms of the second type are extracted from
Wikipedia’s redirect links, which capture synonyms such as x11 for X-Windows.
The reference vector for a category consists of the category name and all its
referring terms, equally weighted. The documents are vectors in term space, and the
cosine similarity function measures the category-document similarity. This similarity
result is their Reference model score:

$sim_{ref}(c,d) = \cos(\vec{r}_c, \vec{d}) = \frac{\vec{r}_c \cdot \vec{d}}{\|\vec{r}_c\| \, \|\vec{d}\|}$

where $\vec{r}_c$ is the reference vector of category c and $\vec{d}$ is the document vector in term space.
Classifying by the Reference model may yield false positive classifications in two cases:
(a) inappropriate sense of an ambiguous referring term, e.g., the narcotic sense of drug
should not yield classification to Medicine; (b) a passing reference, e.g., an analogy to
cars in a software document, should not yield classification to Autos. In both these cases the overall context in the document is expected to be atypical for the triggered category. They therefore measure the contextual similarity between a category and a document utilizing LSA space, replicating the method in (Gliozzo et al., 2005). Both the category
names and the documents are represented in the latent space and the LSA similarity
score between them is obtained by calculating the cosine similarity. This similarity
result is their Context model score:

$sim_{con}(c,d) = \cos(\vec{c}_{LSA}, \vec{d}_{LSA})$

where $\vec{c}_{LSA}$ and $\vec{d}_{LSA}$ are the LSA vectors of the category name and the document, respectively.
To combine the scores obtained by these two models (termed the Combined model), they used multiplication. Multiplication reduces the score of documents that contain referring terms but relate to irrelevant contexts. Moreover, when the score obtained by the reference scoring method is equal to zero, the integrated score is also zero. Ideally, given perfect reference knowledge, this means that when the text does not refer to the category topic, it would not be classified to that category even if it involves a related context.
The overall similarity score is defined as:

$sim(c,d) = sim_{ref}(c,d) \times sim_{con}(c,d)$
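A sketch of this combined scoring, with sparse vectors represented as term-to-weight dicts and the LSA vectors assumed to be precomputed (helper names here are illustrative, not from the thesis):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as {feature: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def combined_score(ref_vec, doc_vec, lsa_cat, lsa_doc):
    """Reference score times Context score: a document with no reference
    to the category (reference score 0) can never be classified to it."""
    return cosine(ref_vec, doc_vec) * cosine(lsa_cat, lsa_doc)
```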
The similarity scores obtained by this Combined measure were used to produce
an initial labeled set of documents for training a supervised classifier. They used the
initial labeled set, in which each document is considered as classified only to the best
scoring category, to train a SVM classifier for each category. They used the default
setting for SVM-light, apart from the j parameter, which was set to the number of
categories in each data set, as suggested by (Morik et al., 1999). For Reuters-10,
classification was determined independently by the classifier for each category,
allowing multiple classes per document. For 20-NewsGroups, the category that yielded
the highest classification score was chosen (one-versus-all), fitting the single-class
setting of this corpus. They experimented with two document representations for the
supervised step: either as vectors in tf-idf weighted term space, or as vectors in LSA
space.
They tested their method on the two corpora used in (Gliozzo et al., 2005). The
Reference model achieves much better precision than the Context model from (Gliozzo
et al., 2005) alone. Combining reference and context yields some improvement for
Reuters-10, but not for 20-NewsGroups. They noticed though that the realistic accuracy
of their method on 20-NewsGroups is notably higher than when measured relative to the
gold standard, due to its single-class scheme: in many cases, a document should truly
belong to more than one category, while that chosen by their algorithm was counted as a
false positive.
In this thesis we base our method on the keyword-based approach, and in
particular the approach described in (Barak et al., 2009), by creating a two-phase
method: (1) automatically creating category representations to acquire an initial set of
labeled documents based on a similarity score between the categories and the document
representations, (2) classifying the unlabeled documents based on the initial categorized
set using an SVM-based classifier. We expand the integrated model based on a reference requirement and context fitness. Next we will describe the lexical reference framework
and the lexical semantic relations resource used to acquire lexical reference expansions
(rules).
2.3 Lexical Reference
The Lexical Reference (LR) notion was defined in (Glickman et al., 2006) to
denote in-text references to the specific meaning of a target term. They further analyzed
the dataset of the First Recognizing Textual Entailment Challenge (Dagan et al., 2006),
which includes examples drawn from seven different application scenarios. It was found
that an entailing text indeed includes a concrete reference to practically every term in
the entailed (inferred) sentence.
The LR relation between two terms may be viewed as a lexical inference rule, denoted
LHS => RHS. This rule indicates that the left-hand-side term would generate a
reference, in some contexts, to a possible meaning of the right-hand-side term, e.g.
Jaguar => luxury car. In this example the LHS is a hyponym of the RHS. Indeed, the
commonly used hyponymy, synonymy and some cases of the meronymy relations are
special cases of lexical reference. However, lexical reference is a broader relation. For
instance, the Lexical Reference rule physician => medicine may be useful to infer the
topic medicine in a TC setting. To integrate the LR rules in the TC scheme described
above, the initial seeds based on the category name are expanded with referring terms
extracted from the LR rules. For each rule in which the RHS of the rule is one of the
seed terms for a specific category, the LHS term of this rule is added to the seed terms
of this category to create the set of representing keywords for the category. Below we
describe the external resources used by our method to extract LR rules.
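Before turning to those resources, the seed-expansion step itself is straightforward; a sketch, with rules given as (LHS, RHS) pairs:

```python
def expand_seeds(seeds, lr_rules):
    """Add the LHS of every LR rule whose RHS is one of the category's seed
    terms, e.g. the rule ('physician', 'medicine') expands the seed 'medicine'."""
    keywords = set(seeds)
    keywords.update(lhs for lhs, rhs in lr_rules if rhs in seeds)
    return keywords
```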
2.3.1 Lexical Reference Resources
Lexical-semantic resources, which provide the knowledge needed for lexical inference,
are commonly utilized by applied inference systems (Giampiccolo et al., 2007) and
applications such as Information Retrieval, Question Answering and Text
Categorization (Shah and Croft, 2004; Pasca and Harabagiu, 2001; Scott and Matwin,
1999). We based our LR rules extraction methods on external resources available
online. The resources utilized for this purpose are a lexical resource, the WordNet
lexical ontology, and a textual resource, Wikipedia, the online encyclopedia. Given the
different nature of the two resources, the method applied to each of them is quite
different. Below we provide a short description of each resource and its characteristics.
WordNet WordNet [4] is a large lexical database of English (Fellbaum, 1998),
initially developed under the direction of George A. Miller. Nouns, verbs, adjectives
and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a
distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical
relations.
Every synset contains a group of synonymous words; different senses of a word appear
in different synsets. The meaning of the synsets is further clarified with short defining
glosses (definitions and/or example sentences). A typical example synset with a gloss is:
good, right, ripe (most suitable or right for a particular purpose; “a good time to plant
tomatoes”; “the right time to act”; “the time is ripe for great sociological changes”).
Most synsets are connected to other synsets via a number of semantic relations. Among
the semantic relations WordNet consists of are hyponyms (is-a relation) and meronyms
(is-part-of relation). While semantic relations apply to all members of a synset because
they share a meaning and are all mutually synonymous, words can also be connected to
other words through lexical relations, including antonyms, or derivational relations.
The TC task is one of various NLP tasks for which WordNet is exploited as a
source for lexical expansion. WordNet was used as a source for synonyms and
hypernyms to enhance feature data for TC methods in several works. (de Buenaga
Rodriguez et al., 1997) utilized WordNet as a source for synonyms based on the
assumption that the name of the category can be a good predictor of its occurrence.
They used WordNet synsets to perform category name expansion, similar to query
expansion in search, using the category name synonyms. This information was added to
labeled training examples as the input of supervised learning algorithms. The integrated
algorithm achieved an improvement of 20 points in precision and was found to be
extremely helpful for low-frequency categories, which have a lower number of training
examples.
4 We used version 3.0 of WordNet available at http://WordNet.princeton.edu/obtain
Another study that combined WordNet information with labeled training data is
that of (Scott and Matwin, 1999) who used WordNet as a source for synonyms and
hypernyms, which were added to the representation of each document.
A more recent study that combined WordNet information as described earlier in
Section 2.2 is (Barak et al., 2009), which used WordNet as a source for derivations,
synonyms, hyponyms and meronyms. Our method uses their extraction method
described in Section 2.2 to acquire LR rules from WordNet knowledge.
Wikipedia Wikipedia [5] is a collaborative online encyclopedia that covers a wide
variety of domains. Wikipedia is constantly growing and evolving based on the
contribution of online users, and had more than 1,700,000 articles on the English
version as of March 2007 (Kazama and Torisawa, 2007). (Giles, 2005) shows that the quality of Wikipedia articles is comparable to that of the Britannica internet encyclopedia.
(Shnarch et al., 2009) developed a Wikipedia-based LR resource. Each
Wikipedia article provides a definition for the concept denoted by the title of the article.
As a starting point, they examine the potential of definition sentences as a source for LR
rules (Ide and Véronis, 1993; Chodorow et al., 1985; Moldovan and Rus, 2001). When
writing a concept definition, the aim is to formulate a concise text that includes the most
characteristic aspects of the defined concept. Therefore, a definition is a promising
source for LR relations between the defined concept and the definition terms. In
addition, they extract LR rules from Wikipedia redirect and hyperlink relations. As a
guideline, they focused on developing simple extraction methods that may be applicable
for other Web knowledge resources, rather than focusing on Wikipedia-specific
attributes. Overall, their rule base contains some eight million candidate lexical
references.
(Barak et al., 2009) used this Wikipedia-based LR resource to extract referring
terms of two types: Wikipedia page titles for which the first definition sentence includes a syntactic “is-a” pattern whose complement is the category name (Yamaha SR500 => motorcycle), and terms extracted from Wikipedia’s redirect links.

5 We used the English version from February 2007 available at www.ukp.tudarmstadt.de/software/JWPL
In our research we adopted a better extraction method, as reported by (Shnarch et al., 2009). We used a Wikipedia-based LR resource to extract more types of referring terms, while filtering rules that tend to relate terms that are rather unlikely to occur together.
The extraction types we used were as follows:
Be-Comp The Be-Comp extraction method identifies the is-a pattern in the
definition sentence by extracting nominal complements of the verb “be,” taking them as
the RHS of a rule whose LHS is the article title.
All-N The Be-Comp extraction method yields mostly hypernym relations,
which do not exploit the full range of lexical references within the concept definition.
Therefore, the All-N extraction method creates rules for all head nouns and base noun
phrases within the definition.
Title Parenthesis A common convention in Wikipedia to disambiguate ambiguous
titles is adding a descriptive term in parentheses at the end of the title, as in The Siren (musical), The Siren (sculpture) and siren (amphibian). From such titles the Title
Parenthesis extraction method extracts rules in which the descriptive term inside the
parenthesis is the RHS and the rest of the title is the LHS.
Redirect Like any dictionary or encyclopedia, Wikipedia contains redirect links
that direct different search queries to the same article, which has a canonical title. For
instance, there are 86 different queries that redirect the user to United States (e.g.
U.S.A., America, Yankee land). Redirect links are hand-coded, specifying that both
terms refer to the same concept. The method therefore generates a bidirectional
entailment rule for each redirect link.
Link Wikipedia texts contain hyperlinks to articles. For each link a rule is generated,
whose LHS is the linking text and RHS is the title of the linked article. In this case the
Link extraction method generates a directional rule since links do not necessarily
connect semantically equivalent entities.
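As an illustration of the simpler of these extraction methods, here is a sketch of the Title Parenthesis pattern; the regex is a hypothetical reconstruction, not the resource's actual extractor:

```python
import re

TITLE_PAREN = re.compile(r'^(?P<lhs>.+?)\s*\((?P<rhs>[^()]+)\)$')

def title_parenthesis_rule(title):
    """'The Siren (musical)' -> ('The Siren', 'musical'); None if no match."""
    match = TITLE_PAREN.match(title.strip())
    return (match.group('lhs'), match.group('rhs')) if match else None
```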
Based on the rule filtering method proposed in (Shnarch et al., 2009), we filtered
rules which tend to relate terms that are rather unlikely to occur in combination. The
authors recognized such rules by their co-occurrence statistics within Wikipedia, using
the common Dice coefficient:

$Dice(LHS, RHS) = \frac{2 \cdot C(LHS, RHS)}{C(LHS) + C(RHS)}$

where C(x) is the number of articles in Wikipedia in which all words of x appear, and C(x, y) is the number of articles in which both x and y appear.
They also adjust the Dice equation for rules whose RHS is also part of a larger noun phrase (NP). The LR resource enables extracting rules with their Dice score, where rule filtering is done by setting a threshold on the Dice rule score. We tuned the threshold parameter (which was set to 0.01) on our development dataset, described in Section 3.
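A sketch of this score-and-filter step, assuming the article counts are available (the NP adjustment mentioned above is omitted):

```python
def dice_score(c_lhs, c_rhs, c_both):
    """Dice coefficient over Wikipedia article counts, as defined above."""
    return 2.0 * c_both / (c_lhs + c_rhs) if (c_lhs + c_rhs) else 0.0

def filter_rules(rules, counts, cooc_counts, threshold=0.01):
    """Keep only (lhs, rhs) rules whose Dice score clears the tuned threshold."""
    return [(l, r) for l, r in rules
            if dice_score(counts[l], counts[r], cooc_counts.get((l, r), 0)) >= threshold]
```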
As an encyclopedic resource containing cultural and day-to-day terms, by its
nature Wikipedia is complementary to the type of rules extracted from the WordNet
resource, which provides the typical terms similar to terms found in a dictionary.
2.4 Query expansion
A major problem in Information Retrieval (IR) is that relevant documents may contain
words that differ from those which appear in the user formulated query, although their
meaning is the same. One way to solve this problem is through automatic query
expansion.
Query Expansion (QE) is known to improve IR performance (Xu et al., 1996).
By expanding the query, the number of returned documents increases, and we expect to
retrieve a large set of relevant documents and to improve recall. On the other hand, by
increasing the number of retrieved documents, the chance of returning non-relevant
documents increases, too, and can decrease precision, as expansion may add noise to the
retrieved set, since the query includes terms which do not contribute to relevance
(Manning et al., 2008).
Keyword-based TC and QE are analogous tasks. The category names in TC are
analogous to the queries in QE. Both of the tasks expand the seeds with other related
terms in order to increase recall. Therefore, in this section we describe several QE
methods.
The methods for automatic Query Expansion fall into two major classes: global methods and local methods.
Global methods: In order for an IR engine to perform automatic query expansion, it
would need a large resource that could supply good expansion terms for a variety of
query words. Examples of such resources are WordNet and Wikipedia.
Another possible source for query expansion is a distributional similarity
algorithm, such as in (Lin, 1998). In this case, a query term would be expanded with
words that appear in similar contexts.
Another type of resource is based on co-occurrences of terms in the same
document, as opposed to distributional similarity, which is based on having similar
contexts across documents.
In our keyword-based TC method we utilized large global resources as our LR
resources.
Local methods: Local methods for query expansion reduce the source of expanding terms to a partial collection. These methods adjust a query relative to the documents that initially appear to match the query.
Local techniques such as pseudo-relevance feedback (PRF) require two passes over the query. PRF specifies the process of automatically examining the top-ranked documents in an IR system's ranking, and using information from these documents to improve the ranking.
This is done by assuming that the top-ranked documents are relevant, and using
information from this ‘pseudo-relevant set’ to improve the accuracy of the ranking by
expanding on the initial query and re-weighting the query terms.
The Rocchio algorithm is the classic algorithm for implementing relevance feedback (RF). It models a way of incorporating relevance feedback information into the vector space model. Its underlying theory is to find a query vector that maximizes similarity with relevant documents while minimizing similarity with non-relevant documents. However, it was shown that better results are obtained for routing by using only documents close to the query of interest rather than all documents (Schutze et al., 1995). The Rocchio re-weighting formula is:

$\vec{q}_m = \alpha \vec{q}_0 + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j$

where $D_r$ is the subset of the collection that is considered relevant to the query, $D_{nr}$ is the non-relevant subset, α is the importance of the original query (between 0 and 1), β is the importance of the relevant documents and γ is the importance of the non-relevant documents.
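A sketch of Rocchio re-weighting over dense term vectors; the weights shown are common textbook defaults, not values from this thesis:

```python
import numpy as np

def rocchio(query_vec, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the centroid of the relevant documents and away
    from the centroid of the non-relevant ones."""
    q = alpha * np.asarray(query_vec, dtype=float)
    if len(relevant):
        q += beta * np.mean(np.asarray(relevant, dtype=float), axis=0)
    if len(nonrelevant):
        q -= gamma * np.mean(np.asarray(nonrelevant, dtype=float), axis=0)
    return np.maximum(q, 0.0)  # negative term weights are usually clipped to zero
```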
(Perez-Aguera et al., 2008) adopted a different approach to query expansion,
which is based on studying the difference between the term distribution in the whole
collection and in the subsets of documents that can be relevant for the query. One would
expect that terms with little informative content have a similar distribution in any
document of the collection. On the contrary, terms closely related to those of the
original query are expected to be more frequent in the top-ranked set of documents
retrieved with the original query than in other subsets of the collection.
Chapter 3
State-of-the-art performance on the IMDB dataset
The focus of this research is keyword-based Text Categorization for a large real-world taxonomy. Classification by such a taxonomy raises different difficulties and is much more complex than classification for an artificial taxonomy created specifically for a certain academic dataset.
In this section we describe the dataset which was built at Bar-Ilan University in cooperation with COMVERSE. Its construction and annotation, along with the taxonomy creation, are first described (Section 3.1). Next, the poor performance of state-of-the-art methods, along with an error analysis, is detailed (Section 3.2). Finally, the performance of a state-of-the-art query expansion method, along with a comparison to other state-of-the-art methods, is presented (Section 3.3).
3.1 The IMDB dataset
The Internet Movie Database (IMDB) [6] is an online database of information related to
films, television programs, etc. In many cases the information goes beyond simple title and crew credits, and also includes data such as plot summaries and reviews.
The IMDB dataset which we created for our research is a collection of 130,000 movie descriptions downloaded from the IMDB website. Each movie description (termed document) contains the movie title and plot summary information. These documents' topics are unknown. The IMDB dataset is thus a large collection of movie descriptions which was not labeled with predefined topics.

6 www.imdb.com
The IMDB taxonomy creation and corpus annotation were done at Bar-Ilan
University. We researched the IMDB database, its structure and content to see which
information can be useful for building the taxonomy. Browsing the internet, we found media taxonomies, which we compared and combined with annotated IMDB keywords to create a new optimized taxonomy. Figure 3.1 shows a part of the taxonomy. Appendix A includes our complete IMDB taxonomy. Our taxonomy includes 97 topical categories organized in a three-level hierarchical structure, where each classification to a daughter
category is considered as a classification to all its ancestors as well. For example, a
document whose category is baseball is considered by the gold standard as a sport
document too, since baseball is a daughter of the sport category.
Figure 3.1: A part of the IMDB taxonomy
We manually annotated 1,970 movie descriptions with topic (category) labels from the taxonomy. While selecting the documents to be annotated we had to make sure that the set of categories is a representative sample of the actual database. The major issues which were taken into consideration are the distribution of genres in the IMDB database and the fact that some genres are better suited for the classification task than others. We selected 2/3 of the dataset from the group of genres better suited for topical classification (Biography, Documentary, History, Music, Sport, and War) and 1/3 from the rest of the genres. Appendix B includes the annotation guidelines. A couple of iterations were required to stabilize the annotation. We randomly split the annotated set into development (50%) and test (50%) subsets.
The gold standard obtained for the collection is multi-class; hence each document may be classified to zero, one or more categories.
Although we filtered out descriptions with fewer than 150 characters, many of the descriptions in the collection are still short. Given that 30% of the descriptions in the annotated set are not classified to any of the taxonomy categories, and given the large number of categories in the taxonomy (97), the IMDB classification task becomes even more challenging.
3.2 The limited performance of previous state-of-the-art
methods
We replicated the method in (Barak et al., 2009), described in section 2.2, as
representing a state-of-the-art classifier, including both its unsupervised and
bootstrapping steps. (Barak et al., 2009) proposed a novel scheme that models
separately two types of similarity. One type regards words which refer specifically to
the category name’s meaning (Reference). While the other type is typical context words
for the category which do not necessarily imply its specific meaning (Context). For one,
it identifies words that are likely to refer specifically to the category name’s meaning
(Glickman et al.,2006), based on certain relations in WordNet and Wikipedia. In
tandem, they assess the general contextual fit of the category topic using an LSA
context model to overcome lexical ambiguity and passing references (as described in
section 2.2). The similarity scores obtained by their combined measure (Combined)
were used to produce an initial labeled set of documents which was then used to train a
supervised classifier in a bootstrapping step.
3.2.1 Unsupervised single-class classification
We tested both components of the scoring method in (Barak et al. 2009) (Combined),
the Reference model and the Context model.
The reference model represents each category by its seed terms along with the referring expansion terms for the seeds (where category names are used as the seeds), and obtains a reference cosine similarity score between the two vectors of each document-category pair. The referring terms are collected from WordNet and Wikipedia as detailed in Section 2.3.
The context model from (Barak et al. 2009) is a replication of the method in
(Gliozzo et al., 2005). That original method includes a Gaussian Mixture rescaling step
for the context model, which (Barak et al. 2009) didn’t find helpful. We created
representing vectors for each category - the category name was represented using Latent
Semantic Analysis (LSA), in which documents and categories are represented in a latent
semantic space. LSA is a dimensionality reduction method which decreases the number
of dimensions in the document-by-term matrix. It converts the co-occurrence data
represented in the matrix to a representation of implicit semantic concepts in the latent
space. The LSA similarity score between documents and categories is obtained by
calculating the cosine similarity between their representing LSA vectors. We used the LSA toolkit created by Idan Szpektor and Jacob Goldberg at Bar-Ilan to generate the LSA vectors from the IMDB corpus. We set the LSA dimension to 300.
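The thesis used this in-house LSA toolkit; purely for illustration, an equivalent pipeline can be sketched with scikit-learn's truncated SVD (an assumption, not the original implementation):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def lsa_context_scores(documents, category_names, dim=300):
    """Project documents and category names into a dim-dimensional latent
    space and score every category-document pair by cosine similarity."""
    vectorizer = TfidfVectorizer()
    doc_terms = vectorizer.fit_transform(documents)
    svd = TruncatedSVD(n_components=dim)
    doc_lsa = svd.fit_transform(doc_terms)
    cat_lsa = svd.transform(vectorizer.transform(category_names))
    return cosine_similarity(cat_lsa, doc_lsa)  # shape: categories x documents
```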
The combined scoring method was obtained by multiplication of the reference
score with the context score.
We also examined the baseline of including only the category name in the
reference vector (Cat-Name).
This unsupervised step of the algorithm classifies each document to a single
category, the category with the highest similarity to the document.
Table 3.1 presents the relatively poor classification results obtained for these methods.
Scoring method   Recall   Precision   F1
Cat-Name         0.29     0.45        0.35
Reference        0.33     0.30        0.31
Context          0.29     0.26        0.28
Combined         0.37     0.35        0.36
Table 3.1: Single-class classification results for the IMDB dataset.
A comparison of the results on the IMDB dataset with the results on the standard datasets 20 Newsgroups and Reuters-10, which were used in (Barak et al., 2009), shows that the scoring method performs much better on the standard datasets (the 20 Newsgroups F1 score was 0.41, while the Reuters-10 F1 score was 0.76). This might be due to the artificial structure of these standard academic datasets. Both of these datasets have attributes which don't exist in our real-world dataset. The 20 Newsgroups documents are partitioned (nearly) evenly across the 20 categories, while Reuters-10 is only a sub-corpus of the Reuters-21578 collection, constructed from the 10 most frequent categories in the Reuters taxonomy. In addition, the Reuters categories are domain specific, and are all relevant to economic topics.
Moreover, the IMDB documents were written by ordinary users of the IMDB website. Nonprofessional writers tend to add more unnecessary details such as actor
names, use anecdotal descriptions and sometimes even leave incomplete descriptions.
This makes the IMDB classification task much harder.
Error Analysis
Several error cases that were detected and categorized are detailed below.
1. Frequent passing reference: A dominant phenomenon which causes misclassification is passing references. A passing reference occurs when the topic name or any partial group of its characteristic terms appears in a document but does not refer to the main topic of the document. This phenomenon is relevant to all types of topics, including named entities such as company names which are commonly mentioned, general topics which may be discussed as an allegory, or an object which is referred to widely in the corpus. Table 3.2 shows several examples of documents which contain a passing reference to one of the IMDB collection topics.
No. | Gold Standard Category | Method's Classification | Document Example
1 | Political History | Cinema | “Jack Nicholson's portrait of Union leader... The film follows Hoffa through his countless battles with the RTA and President Roosevelt...”
2 | Medicine | College/University | "A medical student... West moves to Miskatonic University to continue his research..."
3 | Baseball | Weather | “...when it is winter Ben can spend every waking hour with Lindsey... Lyndsey gets hit by a line drive foul ball off of Baltimore Orioles' Miguel Tejada, and the Sox begin to loose...”
4 | None | Arts | "Colin's a sad-eyed British artist (Firth) holed up in a rundown hotel in small-town Vermont after being dumped by his fiancee...”
Table 3.2: Document samples for the passing reference phenomenon. The problematic
terms are bolded
The first example (no. 1) in Table 3.2 is an example of a term which is referred to widely in the IMDB corpus. This term is a good expansion for the category cinema, but since IMDB is a movies domain it causes a passing reference. This phenomenon mostly happens with the crime and cinema categories. Examples no. 2-4 correspond to terms that do not refer to the main topic of the document. They describe a certain place (no. 2), profession (no. 4) or time (no. 3) which is insignificant in the document. This phenomenon is relevant for all types of categories.
(Barak et al., 2009) used two mechanisms to identify the passing reference
phenomenon. The first one is the lexical reference expansion of the category
characteristic terms, which results in higher scores for documents that contain multiple
occurrences of referring terms, and the second is the use of context models. When a
term which refers to a certain topic appears out of context, a context model should give
a lower score to the document since its context is irrelevant for this topic. In the IMDB
Corpus, the second mechanism is important since short documents often don't contain
multiple occurrences of referring terms. In many of the cases the currently used context
model failed to recognize context irrelevancy. When dealing with documents that aren't
classified to any of the topics the situation is even more problematic since any
classification in this case corresponds to a false positive classification.
2. Ambiguity of expanding terms: Ambiguity of the topic name within the collection
is rare since it is typically chosen to be a very precise term which captures the full
meaning behind the topic. However, by using reference expansions as part of the
method, terms are being added to the seed term to represent the category. One of the
reasons for wrong classification is ambiguity of the expanding terms. Table 3.3 shows
several examples of documents which were classified incorrectly due to ambiguity of
the expanding terms.
No. | Gold Standard Category | Method's Classification | Document Example
1 | Crime | Space | “...an escape plan that involves reinforcing two of the mall’s shuttle buses to transport the group to a nearby marina where Steve has a boat docked...”
2 | Airplanes | Advertising | "...The pilots there deliver mail over a dangerous and usually foggy mountain pass. Geoff Carter, the lead flyer, seems distant and cold as Bonnie tries to get closer to him...”
3 | Literature | Christianity | “A dashing officer of the guard and romantic poet... Christian, who is also in love...”
4 | Medicine | Shooting | "...Once called Father Frank for his efforts to rescue lives, Frank sees the ghosts of those he failed to save around every turn. He has tried everything he can to get fired, calling in sick...”
Table 3.3: Document samples for the ambiguity phenomenon. Ambiguous terms are
bolded.
Example no. 3 in Table 3.3 illustrates a common proper name with an additional sense, while all the other examples (no. 1-2, no. 4) are terms which appear in a different sense than the one which corresponds to the category topic. The context model was
supposed to recognize that the overall context in these documents is not typical for the
triggered categories and avoid these classifications, but it failed to overcome this
problem too.
3. Limitations of lexical reference resources:
Referring terms were collected from WordNet and Wikipedia, by utilizing relations that
are likely to correspond to lexical reference. WordNet provides mostly referring terms
of general terminology while Wikipedia provides more specific terms. Both resources
were described in section 2.3.
Several limitations of the currently used resources are detailed below.
Lack of expanding terms: Some of the documents were not classified to the correct
category due to a lack of correct expansions. Table 3.4 shows examples of such missing
expanding terms.
Category | Expansions
Medicine | cancer, HIV
Disability | blind, deaf
Mythology | Aphrodite, Oedipus

Table 3.4: Missing expanding terms
Occasionally, a document requires deeper text understanding, since the
correct category isn't expressed by any typical word; for example, a crime document
which discusses planting a virus inside a computer.
Incorrect or ambiguous expanding terms: Both WordNet and Wikipedia added terms
which are only correct as expansions for very infrequent senses, causing false
classifications (false-positive errors). This is in contrast to the ambiguity described in
the previous section, where the ambiguous terms didn't correspond to a rare sense; here
we present ambiguous terms in infrequent senses. Sometimes the term sense is so
rare that it even seems to be an incorrect expansion. Table 3.5 shows several examples of
such expansions.
Lexical Resource | Category | Expansion
WordNet - Meronyms | Advertising | promote
WordNet - Hyponyms | Business | house, partnership
WordNet - Derivations | Terror | terrified
Wikipedia | Pop/Rock | machine, mix

Table 3.5: Incorrect or ambiguous expanding terms
4. Topically close categories: Topically close categories are mostly sister terms at the
same level of the topical taxonomy hierarchy. In the IMDB collection, for instance,
topically close categories exist as sister terms in the music group of topics, such as
opera and classical music. Topically close categories also exist as topics in different
branches of the taxonomy, such as the military topic in the interests branch, which is
highly related to the war topic in the miscellaneous branch. Most of the classification
errors were between close topics in different branches.
Considering the taxonomy structure, the main problem is that we do not use
the daughter terms when classifying to the parent category. When classifying to the
crime category, for example, we might find only the term murder; but if we also
considered the expansions of its daughter category mafia, we could find mob as well.
Without this evidence, assuming the true category is indeed crime, we might miss it.
Sometimes there is not enough evidence to classify the document to one of the
category's daughters, but combining evidence from all daughter categories when
classifying to the parent category would yield a higher score and improve its chances of
being selected.
5. Limitations of classification scheme:
The cosine similarity function: The cosine similarity function normalizes the
inner product of the document and category vectors by the length of both.
Consequently, categories with fewer expanding terms are preferred. Often, even
when more terms in the document match one category, another category is selected
because its expansion vector is shorter.
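Concretely, the score for a document vector d and a category vector c is

$$\cos(d, c) = \frac{d \cdot c}{\|d\|\,\|c\|}$$

so a category with a long expansion list pays a larger $\|c\|$ penalty in the denominator, even when more of its terms actually match the document.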
Figure 3.2: An example of the problem with the cosine similarity function
Figure 3.2 presents an example of the cosine normalization problem. The
document's true category is sport: there are four occurrences of terms which belong to
the sport category, and only one term from the category motorcycle. However, the
sport vector includes 105 expanding terms while the motorcycle vector consists of only
35 terms, so the motorcycle score is normalized by a much shorter vector. Consequently,
the algorithm classified this document to the wrong category motorcycle.
Single-class classification: Classifying each document to a single class (termed single-
class classification) has two major disadvantages. (i) It "forces" classification: each
document is classified to the category with the highest similarity score, so even when
the classification scores are low, one of the categories will be selected. In the IMDB
corpus, many documents are not classified to any category and will be misclassified
due to the single-class classification scheme. (ii) It "discards" classifications, since only
the category with the maximal classification score is selected. In the IMDB collection
many documents are truly classified to multiple categories, and single-class
classification loses these classifications.
3.2.2 Bootstrapping
The bootstrapping step suggested by (Barak et al., 2009) and others (Ko and Seo, 2004;
Gliozzo et al., 2005) consists of training a supervised classifier with an initial labeled
set which was created by a previous unsupervised step.
The similarity scores obtained by the combined scoring method presented in
Table 3.1 were used to produce an initial labeled set of documents to train a supervised
classifier. Replicating (Barak et al., 2009), we used the initial labeled set, in which each
document is considered as classified only to the highest scoring category, to train an
SVM classifier for each category. For this purpose, we used SVMlight (Joachims, 1999),7 a
state-of-the-art SVM classifier, representing the input vectors in tf-idf weighted term
space. Our initial automatically labeled set contained about 120,000 documents from the
IMDB corpus. The vectors were fed to the classifier using its default settings.
Classification was determined independently by the classifier for each category,
allowing multiple classes per document. Results are detailed in Table 3.6 below.
Scoring method | Recall | Precision | F1
Bootstrapping | 0.024 | 0.047 | 0.032

Table 3.6: Final bootstrapping results
The results in the table show that in the IMDB case bootstrapping is problematic, yielding
lower performance than the unsupervised classification which constitutes its input
training set, as reported in Table 3.1. There are two possible reasons for these poor
results. First, the IMDB documents are too short and their quality is low. Second, the
way we selected our training set, where each document which wasn't classified to a
category is considered a negative example for it, may be wrong.
(Barak et al., 2009) set the j parameter of SVMlight to the number of
categories in the data set, as suggested by (Morik et al., 1999). The suggestion of
(Morik et al., 1999) was to set j to the ratio between the number of negative examples
and the number of positive examples, which equals the number of categories only
under a uniform distribution of the categories. We also tried setting the j parameter to
the number of categories, even though the IMDB distribution isn't uniform; indeed, no
better results were achieved. More details about applying the bootstrapping process
over the IMDB dataset can be found in Section 6.

7 Available at http://svmlight.joachims.org
3.3 Applying a state-of-the-art query expansion method
Reformulating user queries is a common technique in information retrieval (IR) to
bridge the gap between the original user query and the underlying information need. The
most common technique for query reformulation is query expansion (QE), where the
original user query is expanded with new terms extracted from different sources.
Queries submitted by users are usually very short, and query expansion can complete
the information need of the users. Different types of query expansion methods were
described in section 2.4.
Relevance feedback helps the IR system compute a better representation of
the information need and extract better expansions. Pseudo relevance feedback
methods create the feedback automatically, assuming that the k top-ranked retrieved
documents are relevant, thus avoiding manual involvement at the cost of the confidence
that all k top-ranked retrieved documents are relevant. Nevertheless, this automatic
technique has been found to improve performance (Buckley et al., 1995).
Keyword-based TC and QE are analogous tasks: the category keywords in TC
are analogous to the queries in QE. We tested several pseudo relevance feedback
methods, while trying to optimize our algorithm for keyword-based TC. All of the
methods used the whole IMDB corpus for selecting expansions and searched the
expanded query in the annotated IMDB test set. We used the Lucene8 IR system for the
QE process. The k parameter of the pseudo relevance feedback was set to 10, and each
of the queries was expanded by the 25 top-ranked terms.
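To make the setup concrete, here is a minimal sketch of the pseudo relevance feedback loop; the retrieve callback (standing in for the Lucene search) and the frequency-based term scoring are our own simplifications, whereas the actual runs used Rocchio, KLD and BO1 term weighting:

```python
from collections import Counter

def prf_expand(query_terms, retrieve, k=10, n_expansions=25):
    """Pseudo relevance feedback: assume the k top-ranked retrieved
    documents are relevant and expand the query with terms from them."""
    feedback_docs = retrieve(query_terms)[:k]  # pseudo-relevant documents
    counts = Counter(term for doc in feedback_docs for term in doc
                     if term not in query_terms)
    # Expansion terms are scored by raw frequency here for simplicity;
    # Rocchio/KLD/BO1 replace this scoring step.
    return list(query_terms) + [t for t, _ in counts.most_common(n_expansions)]
```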
We chose the Rocchio relevance feedback method as our state-of-the-art baseline, since
it performed better than other methods which we tried, such as KLD and BO1 (Perez-
Aguera et al., 2008), which are based on the probability distribution of terms in the
collection and in the top-ranked retrieved documents.
8 lucene.apache.org
The Rocchio algorithm models a way of incorporating relevance feedback
information into the vector space model (VSM). Its underlying idea is to find a query
vector that maximizes similarity with relevant documents while minimizing similarity
with non-relevant documents. More details can be found in section 2.4.
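For reference, the classic Rocchio update takes the textbook form (α, β and γ are tuning weights; in pseudo relevance feedback the non-relevant term is typically dropped):

$$\vec{q}_{new} = \alpha\,\vec{q}_{orig} + \frac{\beta}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j - \frac{\gamma}{|D_{nr}|}\sum_{\vec{d}_j \in D_{nr}}\vec{d}_j$$

where $D_r$ and $D_{nr}$ are the sets of relevant and non-relevant document vectors, respectively.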
Scoring method | Recall | Precision | F1
Cat-Name | 0.29 | 0.45 | 0.35
Rocchio | 0.33 | 0.28 | 0.30
Combined | 0.37 | 0.35 | 0.36

Table 3.7: Query expansion results
Table 3.7 shows a comparison between the Rocchio method and two other state-of-the-
art methods from Table 3.1: the Cat-Name method, which doesn't expand the category
names at all, and the Combined method described in section 2.2. The obtained Rocchio
results are lower. The low results are mainly due to noisy expansion lists; for example,
the category baseball was expanded with play baseball and baseball team, but also with
ball, feature and documentary. The reason for these noisy expansion lists is that they are
built from frequent terms in the category documents. These frequent terms don't
necessarily characterize the meaning of the category, and thus do not correspond to
lexical references to the category name.
Chapter 4
Algorithm improvements
Our research is based on the approach of (Barak et al., 2009) (described in section 2.2)
for keyword-based text categorization (TC), which bases its similarity measure on a
Lexical Reference (LR) measure instead of a context measure only. Their method
consists of the following steps:
1. Initiating each category vector by the category seed terms, which
correspond to the category name.
2. Representing categories in vector space, each category by its seed terms
along with the referring terms for the seeds, and calculating the cosine
similarity score (termed Reference score) between the vectors of each
document-category pair.
3. Representing each category and document by a co-occurrence based
vector, and computing a cosine similarity (termed Context) score for
each document-category pair.
4. Combining the reference score and context score, by multiplication, into a
single categorization score for each document-category pair.
5. Finally, labeling an initial document set by the scores obtained in the previous
step, and using the initial labeled set to train a supervised classifier.
We focused on the second and third steps above, aiming to improve the poor algorithm
performance on the IMDB corpus, as shown in Section 3.
In this section we first describe the utilization of statistical correlations from the
IMDB corpus (section 4.1). We then show how these statistical correlations are used for
building a new context model (section 4.1.1) and for inducing a new lexical reference
expansions resource (section 4.1.2). Finally, we offer two combination schemes for
expansion resources and a global reference-context combination scheme.
4.1 Utilizing statistical correlation
Co-occurrence based methods are based on the assumption that words that occur
frequently together in the same document are related to the same topic. Therefore word
co-occurrence information can be used to identify topical semantic relationships
between words.
Various metrics can be used for measuring co-occurrence strength. We tested
three common metrics: the Dice coefficient, Pointwise Mutual Information (PMI) and a
probabilistic metric described in (Glickman et al., 2005), which attempts to grade the
lexical entailment relationship between two terms. For two words x and y from a
vocabulary V and a set of documents D, these metrics all measure the strength of the co-
occurrence relationship between the two words, based on the frequencies of their
independent and co-occurring appearances in the corpus.
The Dice coefficient normalizes the frequency of co-occurrence, i.e. the intersection
of the document sets of the two terms, by dividing it by the sum of the individual terms'
frequencies and multiplying it by two, so that we get a measure between 0 and 1, with 1
indicating complete co-occurrence:

$$Dice(x, y) = \frac{2\,|D_x \cap D_y|}{|D_x| + |D_y|}$$

where $D_x$ is the document set in which the term x appears and $D_y$ is the document set
in which the term y appears.
The PMI metric measures the degree of dependence between two terms based on
their probabilities:

$$PMI(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)}$$

The resulting scale is between -∞ and ∞, where complete independence of the
two terms gives a score of 0. Complete dependence between x and y gives a
score that varies according to their individual frequencies.
The above two metrics are symmetric in x and y. The probabilistic lexical
entailment measure presented by (Glickman et al., 2005), on the other hand, measures
to what degree y is entailed by x, based on the conditional probability of y given x as
estimated from their document co-occurrences.
Given a term x (corresponding to a seed term in a category vector), for each of
the above metrics we can expand x with the vocabulary terms that get the highest
scores according to the metric.
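A minimal sketch of these computations over document sets; doc_sets, a mapping from each term to the set of ids of documents containing it, is our own hypothetical data structure:

```python
import math

def dice(dx, dy):
    """Dice coefficient: twice the co-occurrence count over the sum of
    the two document frequencies (range 0..1)."""
    denom = len(dx) + len(dy)
    return 2.0 * len(dx & dy) / denom if denom else 0.0

def pmi(dx, dy, n_docs):
    """Document-level PMI: log of the joint probability over the product
    of the marginals; -inf when the terms never co-occur."""
    p_xy = len(dx & dy) / n_docs
    if p_xy == 0.0:
        return float("-inf")
    return math.log(p_xy / ((len(dx) / n_docs) * (len(dy) / n_docs)))

def top_k_expansions(seed, doc_sets, k=100):
    """Expand a seed term with the k terms that co-occur most strongly
    with it, ranked here by the Dice coefficient."""
    scores = {t: dice(doc_sets[seed], d)
              for t, d in doc_sets.items() if t != seed}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```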
We sampled 20 category names and manually compared their top-50 co-
occurring extracted terms for each of the above metrics. The best co-occurring terms
were obtained when using the Dice coefficient metric. (Sachs, 2008) computed co-
occurrence-based word similarity based on the Reuters Collection Disk-1, using the
same metrics and reported similar results on the query expansion task, favoring the Dice
metric.
4.1.1 Dice-based context model
As described in Section 2.2, the overall context of the document should be typical for
the category topic. This is needed to assure that the referring terms for that category
appear (i) as part of the main topic of the text rather than as a passing reference, and (ii)
not in a different sense than the one referring to the category name. This requirement
can be captured by a set of terms which correspond to typical category contexts, even
though they do not necessarily concretely refer to the category. Such terms frequently
appear in the category context and therefore tend to co-occur with the category's seed
terms. Occurrence of such terms implies that the text might be related to the category.
For example, the terms ball and game don't refer to the category baseball, as they can
appear within the context of several other sport categories. However, the presence of a
significant amount of such context words in a document increases the likelihood that
this document may be related to the baseball topic. On the other hand, the lack of any
context word in a document decreases the likelihood that this document is relevant to
the category's topic. For that purpose, we need to use context models based on co-
occurrence data of terms.
(Barak et al., 2009) utilized a Latent Semantic Analysis (LSA) method to
represent the context similarity of documents and categories. LSA is a dimensionality
reduction method which maps similar terms, by means of co-occurrence data, to a lower
dimensional space in which terms and documents are represented by new dimensions
that may be perceived as "concepts". Those "concepts" aim to capture the context
similarity of the data. LSA has the advantage of modeling both first order and second
order similarity, and by that offers a powerful context-similarity measure. It measures
not only the likelihood of terms to appear in the same document as standard co-
occurrence based methods, but it also captures the likelihood of terms to co-occur with
other common terms by their joint mapping to the same LSA "concepts".
LSA is useful, but it uses an implicit representation, and therefore its behavior is hard
to analyze or predict. LSA is somewhat crude and has difficulty distinguishing between
topically close categories; moreover, LSA is complex to implement and
computationally expensive.
We suggest a different, simpler context model based on the Dice coefficient
metric. We expand each category name by the top-k (k=100 in our case) co-occurring
terms with the highest Dice score and calculate the cosine similarity score between the
expanded vector and the document vector. This score is used as our context model
score. Like (Barak et al., 2009), we used multiplication as the integration method of the
reference and context scoring methods, to reduce the score of documents which contain
referring terms but relate to an irrelevant context. Moreover, when the score obtained by
the reference scoring method is equal to zero, the integrated score is also zero.
However, when the score obtained by the context scoring method is equal to zero, we
used a smoothing factor so the integrated score would be low but not zero; in this way
the context score actually re-ranks the reference score according to the context
likelihood.
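A sketch of this context scoring, reusing dice and top_k_expansions from the sketch in Section 4.1; the value of the smoothing constant is our own assumption, as it is not stated above:

```python
import math

def cosine(u, v):
    """Cosine similarity between sparse vectors given as term->weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def dice_context_score(doc_vec, seed, doc_sets, k=100):
    """Cosine between the document and the category name expanded by its
    top-k Dice co-occurring terms (all weights set to 1)."""
    ctx_vec = {t: 1.0 for t in top_k_expansions(seed, doc_sets, k)}
    ctx_vec[seed] = 1.0
    return cosine(doc_vec, ctx_vec)

def combine(reference, context, smoothing=0.01):  # smoothing value assumed
    """Reference times context; a zero reference score stays zero, while
    a zero context score is smoothed so it demotes but does not cancel."""
    if reference == 0.0:
        return 0.0
    return reference * (context if context > 0.0 else smoothing)
```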
Looking at the expanded vectors, we observe that the Dice-based context model
indeed captures typical category contexts. For example, the category baseball was
expanded by the following unigrams: pitcher, league, bat, Yankees, and the following
bigrams: baseball player/team, major league, Jackie Robinson, Ted Williams. Notice
that some of these terms are actually referring terms for baseball, while the others are
only related context terms.
4.1.2 Dice expansions resource
Analyzing our Dice-based context model, we realized that many referring terms which
are not covered by WordNet or Wikipedia can be revealed from the lists of co-
occurring terms, as they are extracted from the IMDB corpus itself.
We thus created a new resource, the Dice expansions resource, which uses the huge
amount of available unlabeled data in the IMDB corpus to extract statistical correlations.
This overcomes a problem of WordNet and Wikipedia, which sometimes find good
expansions that don't appear in the IMDB corpus at all.
Taking the top-100 co-occurring words, as we did in our Dice-based context
model, is too noisy: it captures both lexical references (LR) and general context terms.
Below we describe the filtering factors which we used, aiming to reduce noise and get
relatively precise LR lists. We used the annotated development set for parameter
tuning.
Weight filtering: We used the Dice coefficient score for term weighting. We filtered
terms whose weight with a given category name is lower than a threshold, which was
set to 0.05.

Seed filtering: Often one category name appears as an expansion of another. We
filtered these expansions, since the seeds were chosen to be very precise and to capture
the full meaning behind the topic, and mostly fit only their original category.

Frequent term filtering: Some terms are referred to widely in the corpus and can't be
used to distinguish between categories. We filtered these expansions by setting a
threshold on the term frequency in the corpus: terms which appear in more than 4% of
the documents in the corpus are omitted from the category expansions list.

Multiple expansions filtering: Some terms expand more than one category and are
therefore less distinctive. We attribute a term t only to the category which gets the
highest Dice coefficient score with the term:

$$cat(t) = \operatorname*{arg\,max}_{c \in C} Dice(t, c)$$
This filtering is very important since assigning a term to more than one category
produces a lot of noise, for example, the term mob boss originally fits both the mafia
and drugs categories, but might increase confusion if assigned to drugs.
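Putting the four filters together, in the same sketch style (dice and top_k_expansions as in Section 4.1; the thresholds are the ones stated above):

```python
def build_dice_lr_resource(categories, doc_sets, n_docs,
                           min_weight=0.05, max_doc_freq=0.04, k=100):
    """Turn raw top-k co-occurrence lists into a filtered Dice LR resource."""
    raw = {c: {t: dice(doc_sets[c], doc_sets[t])
               for t in top_k_expansions(c, doc_sets, k)}
           for c in categories}
    resource = {c: [] for c in categories}
    for c, weighted in raw.items():
        for t, w in weighted.items():
            if w < min_weight:                            # weight filtering
                continue
            if t in categories:                           # seed filtering
                continue
            if len(doc_sets[t]) / n_docs > max_doc_freq:  # frequent term filtering
                continue
            # multiple expansions filtering: keep t only for its argmax category
            if max(raw, key=lambda c2: raw[c2].get(t, 0.0)) == c:
                resource[c].append(t)
    return resource
```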
Category Name | Expanding terms
Buddhism | dalai lama, Karmapa, Kisaeng Hwang, lama
Gambling | bookie, gambling casino/debt, illegal gambling, poker
Karate | black belt, karate kid, kenpo, Miyagi Daniel
Military | colonel, commander, troops, weapon
Wrestling | Ric Flair, Roddy Piper, Vince McMahon, wcw, wwf superstars, wwf

Table 4.1: Dice expansions resource marginal contribution
Table 4.1 shows correct references which were found by our Dice lexical
reference resource. These expansions were found neither by WordNet nor by
Wikipedia. Using a statistical LR resource, we found more proper names of known
personalities in the category areas and more concepts which are strongly related to the
category names.
Statistical correlation extraction has two important advantages. (i) The number
of appearances of the category name in the corpus is a direct measure of the quality of
its expansion list, since the more documents we have for collecting statistics, the more
accurate and reliable an expansion list we can get. (ii) As a result of this estimation
ability, we can add more documents for categories with a small number of documents
by crawling the web or using some other corpus. We suggest that this line of research
may be investigated further to enrich and optimize the Dice LR resource in order to
exploit additional reference knowledge.
4.2 Combined scoring
We now have three expansion resources of referring terms: WordNet, Wikipedia and
Dice, which have to be combined. We propose two different combination schemes:

Union: Referring terms are collected from all the resources. The term lists for each
category are unified into a single list, which is then used to represent the category
vector. The cosine similarity measure is then used to measure the similarity between
document vectors and category vectors. This type of combination was applied by
(Barak et al., 2009), as described in section 2.2. The weights of the seed terms and
their referring terms in the category's vector are all equal, set to 1.
Geometric Mean: Another option is to consider each resource separately: for each
resource we represent the category vectors with its own term lists and calculate the
cosine similarity score between these category vectors and the documents. The
category names are treated as a separate resource as well. Then we combine the
resources' cosine similarity scores using the Geometric Mean (GM):

$$Sim'_x = \begin{cases} Sim_x & \text{if } Sim_x > 0 \\ \lambda & \text{if } Sim_x = 0 \end{cases} \qquad
Sim_{gm} = \Big(\prod_{x \in X} Sim'_x\Big)^{1/n}, \qquad
X = \{cat\_name,\ wn,\ wiki,\ dice\}$$

where $Sim_x$ is the similarity score of resource x, n is the number of combined resources
(n = 4 = |X|), and λ is a smoothing factor, which we have set to 0.0001.
The GM is lower when there is a large difference between the averaged numbers
and higher when this difference is small. This mathematical property of the GM might
be beneficial for the classification task, as agreement between resources leads to a
higher similarity score, and a referring term that is supported by more than one resource
obtains a higher similarity score.
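The GM combination itself is a one-liner in this style (math.prod requires Python 3.8 or later):

```python
import math

def gm_combine(resource_scores, lam=1e-4):
    """Geometric mean of the per-resource similarity scores for one
    document-category pair; zero scores are replaced by the smoothing
    factor lambda."""
    scores = [s if s > 0.0 else lam for s in resource_scores.values()]
    return math.prod(scores) ** (1.0 / len(scores))

# e.g. gm_combine({"cat_name": 0.4, "wn": 0.2, "wiki": 0.0, "dice": 0.3})
```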
These combination schemes deal only with the combination of reference resources;
both of them thus return the similarity score of the Reference model. We still have to
combine the Reference model with at least one of our context models.
Aiming to maximize our performance, we combined both of the context models
in the following way: we first combined the Reference model with the LSA context
model using multiplication, as (Barak et al., 2009) did, and then combined the resulting
score with the Dice context model by multiplication with a smoothing factor, as
described earlier in this section (4.1.1).
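The resulting end-to-end score for a document-category pair can thus be sketched as follows, with combine being the smoothed multiplication from Section 4.1.1:

```python
def final_score(reference, lsa_context, dice_context, smoothing=0.01):
    """Primary configuration: the Reference score (union of the three LR
    resources) times the LSA context score, then re-ranked by the
    smoothed Dice context score."""
    return combine(reference * lsa_context, dice_context, smoothing)
```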
Overall, our primary method configuration, which is evaluated in Section 6,
contains three LR resources (WordNet, Wikipedia and Dice), combined by the union
scheme, with two context models, the Dice-based and LSA context models.
More details about the empirical contribution of each of the algorithm
components and combination schemes can be found in section 6.
Chapter 5
A classification and evaluation scheme for a large real-world taxonomy
Classification by a large real-world taxonomy is a difficult task. It raises different issues
than classification for an artificial taxonomy created specifically for a certain academic
dataset. This section describes a proposed classification and evaluation scheme for such
a taxonomy and particularly for the IMDB taxonomy. First, a multi-class classification
scheme is presented (5.1), and then corresponding evaluation measures are described
(5.2).
5.1 Multi-class classification scheme
Categorization of documents can be done according to two different approaches:
classifying each document to a single category, referred to here as single-class
classification, or classifying each document to several categories, referred to here as
multi-class classification. The unsupervised step of the (Barak et al., 2009) algorithm,
as described in Section 2.2, used the single-class classification approach.
Single-class classification has two major disadvantages: (i) It "forces" a
classification for each document; even when the classification scores are low, one of the
categories will be selected. In the IMDB corpus, described in Section 3, many
documents are not classified into any category and will be misclassified as a result of
the single-class classification scheme. (ii) It discards classifications, since only the
category with the maximal classification score for each document is selected. In the
IMDB collection, many documents are truly classified into multiple categories, and
single-class classification misses these classifications.
Multi-class classification is a ranking-oriented task. A ranked list of documents
is created for each category. The documents are sorted in descending order of their
categorization score and the top-ranked classifications are selected as positive. There
are two types of possible thresholds on the selected classifications: a threshold on the
classification score value, or a threshold on the percentage of the top-ranked
classifications (top-k%). Ranking the documents aims at achieving better precision at
the top of the sorted list, i.e. ranking true category documents at the top of the list
while ranking irrelevant documents at the bottom. The ranking task allows evaluating
the quality of the scoring method for each category.
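Schematically, the per-category selection looks like this (scores maps each document id to its categorization score for one category; which threshold is applied is a design choice):

```python
def select_classifications(scores, top_percent=None, min_score=None):
    """Rank one category's documents by score and keep the top-k% of
    classifications and/or those above a score threshold."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if top_percent is not None:
        ranked = ranked[:max(1, int(len(ranked) * top_percent))]
    if min_score is not None:
        ranked = [(d, s) for d, s in ranked if s >= min_score]
    return [d for d, _ in ranked]
```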
Preliminary experiments on the IMDB dataset showed that single-class
classification using both the Reference and Context models as described in section 2.2
achieved lower results than multi-class classification using only the category names.
Considering the drawbacks of single-class classification, we decided to adopt the multi-
class classification scheme.
In the following section we will describe different evaluation measures suitable for our
new classification scheme and the rationale behind them.
5.2 Evaluation measures
Two basic evaluation measures, given the gold standard of the collection, are precision
and recall.
Category i | Gold standard: TRUE | Gold standard: FALSE
Classifier judgement: TRUE | TP_i | FP_i
Classifier judgement: FALSE | FN_i | TN_i

Table 5.1: Contingency table for one category
$$P_i = \frac{TP_i}{TP_i + FP_i} \qquad R_i = \frac{TP_i}{TP_i + FN_i}$$

where $P_i$ is the precision of category i and $R_i$ is the recall of category i. Both precision
and recall have a fixed range: 0.0 to 1.0 (or 0% to 100%).
Recall and precision are measures for the entire list of documents classified to a
category. They do not account for the quality of ranking the documents in the document
list. We assume that users would want the classified documents to be ranked according
to their relevance to the category, instead of just being presented with an unordered
document set.
5.2.1 Recall-Precision curves
A common way to depict the degradation of precision with the increase of recall, as one
traverses the ranked document list, is to plot interpolated precision numbers against
percentage recall. A percentage recall of, say, 50% is the position in the document list at
which 50% of the relevant documents in the collection have been retrieved. It is a
measure of the number of documents one has to examine before seeing a certain
percentage of the relevant documents. The same plot expresses the notion of recall at
precision too, referring to the percentage of relevant documents which can be found at a
certain precision level.
Figure 5.1 shows a typical recall-precision graph. The graph shows the trade-off
between precision and recall. Trying to increase recall typically introduces more
incorrectly classified documents, which do not belong to the target category, into the
documents list, thereby reducing precision (i.e., moving to the right along the curve).
Trying to increase precision typically reduces recall by removing some good documents
from the document list (i.e., moving left along the curve). An ideal goal for a classifier
is to increase both precision and recall by making improvements to the classification
algorithm, i.e., the entire curve must move up and out to the right so that both recall and
precision are higher at every point along the curve.
Figure 5.1: A typical recall-precision graph
A recall-precision curve can be drawn for each of the categories separately.
However, when measuring overall system performance, averaging the points is
necessary.
5.2.1.1 Macro averaging vs. Micro averaging
There are two conventional methods of calculating the performance of classification and
retrieval systems based on precision and recall: the first is called micro-averaging,
the second macro-averaging. Micro-averaged values are calculated by
constructing a global contingency table (summing the per-category tables shown above
for a single category) and then calculating precision and recall from these sums. In
contrast, macro-averaged scores are calculated by first calculating precision and recall
for each category separately and then taking the average of these. The notable difference
between these two calculations is that micro-averaging gives equal weight to every
document while macro-averaging gives equal weight to every category.
In this way, micro- and macro-averaged precision and recall are calculated as in the
formulas below, which are based on the definitions given in Table 5.1 at the
beginning of section 5.2:

$$P_{micro} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FP_i)} \qquad R_{micro} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FN_i)}$$

$$P_{macro} = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{TP_i}{TP_i + FP_i} \qquad R_{macro} = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{TP_i}{TP_i + FN_i}$$
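Both averages are straightforward to compute from the per-category contingency counts; a sketch, where counts holds one (TP, FP, FN) triple per category:

```python
def micro_macro(counts):
    """Micro- and macro-averaged precision and recall from per-category
    (TP, FP, FN) triples."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    p_micro = tp / (tp + fp) if tp + fp else 0.0
    r_micro = tp / (tp + fn) if tp + fn else 0.0
    p_macro = sum(c[0] / (c[0] + c[1]) if c[0] + c[1] else 0.0
                  for c in counts) / len(counts)
    r_macro = sum(c[0] / (c[0] + c[2]) if c[0] + c[2] else 0.0
                  for c in counts) / len(counts)
    return p_micro, r_micro, p_macro, r_macro
```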
Typically, micro-averaging is used in text categorization, since the evaluation
measure should reflect the system's performance on the most common categories and
shouldn't be influenced by rare categories. This is in contrast to information retrieval,
where all the queries are equally weighted and macro-averaging is commonly used.
Moreover, in the IMDB annotated test set there are many categories with only
a few documents or none at all. Giving these categories the same weight as the others
would give a misleading view of the system performance.
Thus, we adopted the micro-averaging method; all the recall-precision
points in the graphs presented in Section 6 were calculated in this manner.
In Section 5.2.2 we describe another evaluation measure, termed Average
Precision. Average Precision is a macro-averaged measure; therefore, our two
evaluation schemes cover both the micro and macro averaging aspects.
5.2.1.2 R@P average curve
The multi-class classification scheme requires setting a cut-off point in the ranked
document list for each category. Typically, the cut-off might be the top-k percent of
ranked classifications or a certain threshold on the classification score.

In this thesis we propose a different cut-off scheme, which better fits an
industrial classification setting for a large real-world taxonomy. Since this thesis is part
of the Negev Consortium (Next Generation Personalized Video Content Service), the
classification setting should satisfy the industry demand for reasonable precision.
Manual tuning of parameters is considered acceptable in an industrial classification
setting, as tuning parameters is much cheaper than creating a training dataset for a
supervised classifier. Under these circumstances, our assumptions are as follows: (i) the
user would not want to go below a certain predefined precision level; (ii) the threshold
for each category will be tuned separately, such that a different score threshold which
fits the desired precision is set for each category; (iii) the desired precision level is the
same for all the categories.
We propose an evaluation measure which better fits this industrial classification
setting: the Recall at Precision average curve (R@P curve). The R@P curve is an
averaged recall-precision curve where each cut-off point corresponds to a certain
precision level. The precision levels are presented in 1/k intervals, where k is the
number of cut-off points. For each category, we calculate the number of correct
classifications in the ranked document list that maintain precision greater than the given
precision level. We then sum the numbers of these correct classifications over all the
categories and divide by the total number of classifications in the gold standard,
obtaining the recall at that precision level. The R@P curve thus illustrates how much
recall the classifier can provide under a certain precision level.
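A sketch of one averaged R@P point; ranked_gold_per_category holds, for each category, the ranked list of 0/1 gold judgements, and total_gold is the number of gold-standard classifications over all categories:

```python
def correct_at_precision(ranked_gold, p_level):
    """Correct classifications in the longest prefix of the ranked list
    whose precision stays at or above p_level."""
    best = correct = 0
    for i, gold in enumerate(ranked_gold, start=1):
        correct += gold
        if correct / i >= p_level:
            best = correct
    return best

def rp_point(ranked_gold_per_category, total_gold, p_level):
    """Recall at the given precision level: qualifying correct
    classifications summed over categories, over the gold total."""
    return sum(correct_at_precision(rg, p_level)
               for rg in ranked_gold_per_category) / total_gold
```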
Since the R@P curve measure better fits the industrial classification setting, we
use the R@P curve as our main evaluation measure. In Section 6 we also compare it
with other methods for setting cut-off points in the list of ranked documents.
5.2.2 Mean Average Precision (MAP)
Average precision is a common evaluation measure for system rankings, and is
computed as the average of the system's precision values at all points in the ranked list
where recall increases (Voorhees and Harman, 1999). More formally, it can be written as
follows:

$$AP = \frac{1}{R} \sum_{i=1}^{n} E(i) \cdot \frac{\sum_{j=1}^{i} E(j)}{i}$$
where n is the number of documents classified by the system to a specific category in
the test set, R is the total number of correct classifications in the test set gold standard
for this category, E(i) is 1 if the i-th document is classified to this category according to
the gold standard and 0 otherwise, and i ranges over the documents, ordered by their
ranking. The score calculated by the average precision measure ranges between 0 and 1,
where 1 stands for a perfect ranking which places all the category documents before the
non-category ones. This value corresponds to the area under the non-interpolated recall-
precision curve for the target category. Mean Average Precision (MAP) is defined as the
mean of the average precision values over all the categories. We average only categories
which contain at least one document in the test set gold standard (84 categories out of
97), since ranking categories with no documents is meaningless.
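Computationally, AP and MAP reduce to a few lines (a sketch; ranked_gold is again a category's ranked 0/1 gold list and R its number of gold documents):

```python
def average_precision(ranked_gold, R):
    """Mean of the precision values at each rank where a correct document
    appears, i.e. at each point where recall increases."""
    correct, total = 0, 0.0
    for i, gold in enumerate(ranked_gold, start=1):
        if gold:
            correct += 1
            total += correct / i
    return total / R if R else 0.0

def mean_average_precision(per_category):
    """MAP over categories with at least one gold document;
    per_category is a list of (ranked_gold, R) pairs."""
    values = [average_precision(rg, R) for rg, R in per_category if R > 0]
    return sum(values) / len(values)
```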
In the next section we evaluate our improved algorithm (Section 4) using the
evaluation measures described in this Section.
Chapter 6
Results and Analysis
We evaluated the classification results of our improved scoring method described in
Section 4, using Recall at Precision average curve (R@P curve) with our new cut-off
scheme presented in Section 5. Our aim was to allow the user to set her precision
constraints and choose the desired recall-precision trade-off.
We also evaluated the ranking quality of our scoring method, using the MAP
measure. The MAP measure averages only categories that contain at least one document
in the test set gold standard (84 categories out of 97), as explained in Section 5.
The results and analysis are presented in this section. We first compare our
scoring method to three other baselines (section 6.1). Then we present ablation test
results aimed at testing the contribution of each component of our scoring method
(section 6.2). We further analyze our results with a detailed error analysis in Section 6.3.
Finally, we describe our experiment with a bootstrapping scheme and its results
(Section 6.4).
6.1 Results
We compared our scoring method explained in Section 4 to three baselines. The first is
the single-class Combined method described in Section 2.2, which was used in (Barak
et al., 2009) for the unsupervised categorization step. The second baseline is the multi-
class Combined method, on which (Barak, 2008) reported her MAP. The multi-class
Combined method ranks documents that contain at least a single occurrence of a
referring term. The single-class Combined method classifies each document to a single
category, the category with the highest similarity to the document, while the multi-class
Combined method classifies each document to all the categories whose referring term
are mentioned in it. The multi-class Combined method was used by (Barak, 2008) only
for ranking evaluation.
Finally, we applied a baseline from an information retrieval query expansion
algorithm: the Rocchio pseudo-relevance feedback described in Section 2.4.
[Figure: R@P curves (recall vs. precision) for the single-class Combined, multi-class Combined, Rocchio and our methods]
Figure 6.1: R@P average curves, methods comparison
Figure 6.1 presents the R@P average curves obtained by using the following
similarity scores: ours, single-class Combined, multi-class Combined and Rocchio. It
shows that our scoring method consistently outperforms the other methods by several
points. The recall of our method is higher, since we utilized a new additional statistical
LR resource. In addition, the accuracy of the statistical LR resource together with the
additional Dice-based context model caused the precision to increase as well. A
comparison between the curve denoting the single-class Combined scoring and the
other curves, which were obtained by multi-class classification, shows that the single-
class classification scheme is limited in recall, since it selects only the best category for
each of the documents. This implies that there are more good, highly ranked
classifications that the single-class classification scheme ignores.
A comparison between the different multi-class classification scoring methods
(ours, multi-class Combined and Rocchio) shows that Rocchio consistently performs
worse than the other methods. Its recall is limited at low-precision cut-off points, while
at high-precision cut-off points its recall is even lower than the recall of the single-class
scoring method.
A comparison between the two curves denoting our scoring and the multi-class
Combined scoring suggested by (Barak et al., 2009) shows that integrating the statistical
knowledge of the Dice-based context model in the multi-class Combined scheme
achieves higher recall, showing an average recall improvement of 6.8 points.
To complete the ranking evaluation, Table 6.1 presents the MAP values of
the scoring methods. Ranking by our score achieves a higher MAP value than all
other methods; in particular, it achieves a MAP value seven points higher than ranking
according to the multi-class Combined score.
Method | MAP
Single-class Combined | 0.35
Multi-class Combined | 0.50
Rocchio | 0.41
Our method | 0.57

Table 6.1: MAP values, methods comparison
We checked the statistical significance of our results, aiming to assess whether our
method is indeed better than the multi-class Combined one. We used the Wilcoxon
signed rank sum test, which tests the null hypothesis that the median of a distribution is
equal to some value or, in the case of paired data, that the median difference is equal to
zero.

We compared the differences between the average precision values of each of
the categories in both methods, ignoring cases where the paired difference is 0. The
number of pairs was large enough, so we used a normal approximation. The
approximation gave a two-sided p-value of p=0.0004. This confirms that our results are
statistically significant and provides strong evidence that our improved method is
better than the multi-class Combined method.
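The test itself is available off the shelf; a minimal sketch with SciPy, taking the paired per-category average precision lists as input (zero differences are dropped, matching the procedure above):

```python
from scipy.stats import wilcoxon

def ap_significance(ap_ours, ap_baseline):
    """Two-sided Wilcoxon signed rank test over paired per-category
    average precision values; zero-difference pairs are discarded."""
    statistic, p_value = wilcoxon(ap_ours, ap_baseline, zero_method="wilcox")
    return p_value
```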
6.2 Contribution of Our Method Components
As described in Chapter 4, our method adds two components to the multi-class
Combined scoring method suggested by (Barak et al., 2009): (i) a co-occurrence-based
Dice context model and (ii) a Dice lexical reference (LR) resource. After analyzing the
results of using all components together, we wished to assess the contribution of each
component individually. This was done by ablation tests. For the ablation tests, the
starting point is the results obtained from using all components; for each component,
we assessed the influence on performance when this component is removed from our
scoring method. Furthermore, we compared the existing union-based resource
combination scheme to the new resource combination scheme based on the geometric
mean, which we presented in Section 4.2.
6.2.1 Component Ablation Tests
The results of comparing the R@P average curves of each of the ablation tests are
shown in Figure 6.2.
[Figure: R@P curves for the ablation configurations: All; Ab LSA context model; Ab dice-based context model; GM combination scheme; Ab dice expansions resource]
Figure 6.2: Comparison of R@P average curves of ablation tests
These results show that context verification is important. Combining different
context models is beneficial, since each of the context models has some additional
information relative to the others.
We compared our dice-based context model to the LSA context model, aiming
to measure the quality of our new dice-based context model. Figure 6.2 illustrates that
both of the context models show quite similar behavior, except for one cut-off point, of
0.7 precision, where the LSA performs better. The MAP of the LSA context model is
0.56, while the dice-based context model’s MAP is 0.55. We checked whether the
method with the LSA context model (omitting the dice context model) is better than the
method with the dice-based context model (omitting the LSA context model) using the
Wilcoxon significance test. The results showed that the difference between the context
models is not statistically significant. We conclude that our new simple dice-based
context model is comparable to the much more complex LSA context models.
A comparison of the two resource combination schemes shows that
using the geometric mean (GM), described in Section 4.2, is no better than the resource
union suggested by (Barak et al., 2009). The rationale of the GM combination scheme is
interesting, but it is ineffective in the IMDB case.
The contribution of the Dice LR resource is important when recall is considered:
starting from the 0.6 precision cut-off point, the recall of our scoring method increases
due to the Dice LR resource.
6.2.2 Resources Ablation Tests
In order to evaluate the contribution of each of the reference expansion resources used,
we did another series of ablation tests. The resources which are included in our scoring
method are WordNet, Wikipedia and Dice. We wished to assess the contribution of each
resource individually. For these ablation tests, the starting point is the results obtained
from using all resources. For each resource, we assessed the influence on performance
when it is removed from our scoring method. In addition, we checked the influence of
removing all the LR resources and of using only the dice LR resource. The context
models were left in the system in all of these ablation tests.
Figure 6.3 presents R@P average curves for each of the resources ablation tests.
[Figure: R@P curves for the resource ablations: our method; Ab dice; Ab WordNet; Ab Wikipedia; No resources; Dice only]
Figure 6.3: Comparison of R@P average curves of resources ablation tests
Analyzing our results, we can readily conclude that using LR resources is
valuable. Context models are obviously not enough for the TC task. LR defines a more
accurate semantic relation, which aims to identify whether the meaning of a certain
category name is referenced by another text. This measure aims at a more appropriate
relation to base the TC assumption on, since it requires the actual reference to the
category topic in the text, rather than general context similarity.
Wikipedia has been a potentially good resource for LR, providing the typical
knowledge found in an encyclopedia. However, there is an overlap between the Dice
and Wikipedia resources: when we use the Dice and WordNet resources, the
contribution of Wikipedia is insignificant, since it increases the recall only where the
precision is low.
WordNet is a different type of resource that has more impact on the
classification results. Typical knowledge that can be found in a dictionary tends to occur
less frequently in co-occurrence statistics collected from a corpus. WordNet expansions
improve the documents' ranking and increase the recall while maintaining a high rate of
precision.
The Dice expansions resource increases the recall where the precision is below 0.6,
but when we use only the Dice LR resource, the recall in the same range is not maximized.
The Dice LR resource's performance is reasonable even as a single resource. It would be
especially useful in cases where no additional resources are available, such as in
a different language. However, combining the Dice LR resource with other
external resources, such as WordNet, leads to better performance.
6.3 Further Analysis
In this section we further analyze our scoring method presented in Section 4 and our
evaluation methodology described in Section 5.
We first compare our R@P average curve to standard average curves (Section 6.3.1).
Then we present a deeper error analysis of our scoring method (Section 6.3.2).
6.3.1 Recall-Precision Curves Comparison
In Section 5 we described our R@P average curve, which differs from other standard
recall-precision average curves in the cut-off criterion of its points. In our R@P average
curve each cut-off point is a certain precision level (from 1.0 to 0.1), while a standard
cut-off criterion might be a threshold on the score or a certain percentage of the top-
ranked classifications.
[Figure: recall-precision curves for three cut-off approaches: percentage cut-off, score cut-off and R@P cut-off]
Figure 6.4: Recall-precision curve approaches comparison
Figure 6.4 shows a comparison between our cut-off approach and two other
typical cut-off approaches: (i) similarity score cut-off, where the threshold is set on
the classification scores, and (ii) classification percentage cut-off, where the threshold is
set on the percentage of the top-ranked classifications (top-k%) to be taken. Both the
percentage and the score cut-off approaches do not require manual tuning of each
category separately: we just have to select a certain cut-off point according to the
desired global recall-precision trade-off. The cut-off point itself defines a threshold on
the number of classifications or on the classification score. For each category, we
calculate the number of correct classifications that meet the threshold condition; we
then sum these correct classifications over all the categories and divide by the total
number of classifications in the gold standard, to obtain the recall level for this
threshold. For each category we also calculate the total number of classifications that
meet the threshold condition; dividing the sum of correct classifications over all the
categories by the sum of total classifications over all the categories gives the precision
level for this threshold.
However, since tuning parameters before product marketing is acceptable in the
industry, as described in Section 5.2, our proposed cut-off scheme is applicable in the
IMDB case.
Figure 6.4 shows that our more expensive cut-off scheme indeed consistently
outperforms the other standard cut-off schemes by several percentage points. The
performance of our cut-off scheme is better since the cut-off at a certain precision level
defines different thresholds for different categories. The figure also illustrates that by
setting our desired precision level to 0.6, for example, we can achieve 0.56 recall and an
F1 score of 0.58, which is comparable to previous results on academic datasets.
6.3.2 Error Analysis
The multi-class classification scheme classifies each document into numerous
categories; any random mention of a referring term yields a classification with a
positive score. The main issue is whether the scoring method succeeds in ranking the
category's documents properly. Ranking the documents according to their score
revealed interesting characteristics of the scoring methods and of topical
categorization in general. Below is a detailed error analysis explaining some of the
reasons for the remaining errors and suggesting how our method could be improved in
future work.
1. Passing reference: A passing reference occurs when the topic name, or any partial
subset of its characteristic reference terms, appears in a document in its required sense
but does not refer to the main topic of the document.
In the IMDB collection this phenomenon is mostly relevant to the following types of
terms: (i) professionals, such as model (fashion), photographer, artist (arts),
student (school), actor (theater); (ii) places, such as school, college/university,
island (beach), ocean (ships); (iii) adjectives that are strongly related to the categories,
such as wealthy (business), free (prison), ancient (history); and (iv) terms that are part of
a noun phrase, such as "Motor Pool crew" (motoring, swimming) or "architecture
student" (arts).
The additional Dice-based context model is a mechanism that should decrease
the score obtained by documents with a passing reference. When a term that refers to a
certain topic appears out of context, the context model gives a lower score to the
document, since its context is irrelevant for that topic.
We address the passing reference phenomenon by combining two context
models, which are responsible for demoting irrelevant documents that obtained a high
reference score, by means of a low context score.
2. Ambiguity of expanding terms: Using reference expansions as part of our method,
terms are added to the seed term to represent the category. Often some of these
expanding terms are ambiguous. Sometimes the ambiguity is due to two relatively
frequent senses of a term; for example, the term rebound is an expansion of the seed
basketball but also appears in the senses of a movement back from an impact, or a
reaction to a crisis, setback or frustration. Ambiguity is mostly a problem of a specific
term in a document, while the document's context is usually unambiguous. Therefore,
combining context models addresses this phenomenon too.
3. Lexical reference resources limitations: Referring terms were collected from
WordNet, Wikipedia and Dice by utilizing relations that are likely to correspond to
lexical reference. WordNet provides mostly referring terms of general terminology,
while Wikipedia provides more specific terms; both resources were described in
Section 2.3. Dice provides statistical correlations that include both types of referring
concepts as well as more specific terms. Several limitations of the currently used
resources are detailed below.
Lack of expanding terms: Our additional Dice LR resource partly solves the lack of
expanding terms. There are still missing terms not covered by any of our lexical
reference resources; category names that appear rarely in the corpus suffer from this
problem most.
Incorrect or ambiguous expanding terms: Ambiguity might be caused by expanding
the category name with a term in an infrequent sense, such as expanding the category
athletics with the term meet, which refers to a meeting at which a number of athletic
contests are held. Many terms that refer to the corresponding category only in an
infrequent sense were added as expanding terms.
The WordNet and Wikipedia LR resources include many frequent ambiguous
terms, such as house, life and union for the business category, while the Dice LR
resource added many terms that aren't LRs at all, such as roommate, boyfriend and
fraternity for the college/university category. The Dice LR resource still added context
terms, even though we filtered many of them, as explained in Section 4. The quality of
the Dice expansion lists for category names that appear frequently in the IMDB corpus
is relatively high; therefore, we suggest adding documents for categories with a small
number of documents in the IMDB corpus by crawling the web or using some other
corpus. This line of research may be investigated further to enrich and optimize the
Dice LR resource.
The problem of incorrect or ambiguous expanding terms is destructive when
there are multiple ambiguous or wrong terms in a document. In these cases irrelevant
documents get a high ranking and the performance of our scoring method is harmed.
For example, a sentence such as "They deal with the life challenges of finding women to
love and be loved by, committing to a relationship, and getting past their childhood
dreams and desires to deal with reality and appreciate life", where life, relationship and
deal are referring terms for the category business, will cause the document to be
misclassified with a high classification score.
4. Topically close categories: Topically close categories are problematic, since both of
the context models have difficulty distinguishing between close categories.
Unfortunately, there are still misclassifications between close categories such as
fashion and arts, or war and military, since these documents tend to contain multiple LR
terms from both of the close categories and neither of the context models can
distinguish between them.
5. Unclassified documents: Many of the documents in the IMDB corpus are not
classified to any of the categories in the taxonomy. Our scoring method succeeded in
ranking short unclassified documents lower, since even when an LR term was found
there were no other context terms. However, longer unclassified documents sometimes
did get a higher ranking.
6. Category characteristics: We examined two interesting issues concerning
category characteristics: (i) whether categories with more documents in the test set are
ranked better than categories with fewer documents, and (ii) whether estimating the size
of the category and taking classifications from the ranked list relative to the frequency
of the category in the corpus might be beneficial. Unfortunately, the answer to both of
these questions is negative.
The quality of a category's ranking does not depend on the category size, but
rather on the specificity of the category name. Category names that express their
specific topic meaning, such as football, buddhism and motorcycle, were ranked much
better than category names that express a more general meaning of the topic, such as
history, disability, environment and travel.
Estimating the size of the categories is also problematic, since the
frequency of the category name in the whole corpus does not provide an accurate
estimate. For example, the category name showBiz or show business is rather rare in
the corpus, but this category contains a relatively high number of documents.
In our research we did not make any adjustments to the category names, but
simply set the seeds to be the category names as they were given in the taxonomy. The
reason for this policy was that we wanted our results to be replicable, so we did not use
any prior knowledge of the resources' behavior. Prior knowledge about seeds that get
more effective expansions, such as cooking vs. cookery in Wikipedia or computer vs.
computing in the Dice LR resource, might be very helpful. By the manual effort of
tuning the category names, which would be acceptable in the industry, further
improvement can easily be achieved.
6.4 Bootstrapping results
The bootstrapping step suggested by (Barak et al., 2009) and others (Ko and Seo, 2004;
Gliozzo et al., 2005) consists of training a supervised classifier with an initial labeled
set created by a previous unsupervised step. In Section 3 we showed that bootstrapping
performance on the IMDB corpus by the method of (Barak et al., 2009) was very poor,
yielding lower performance than the unsupervised classification that constitutes its input
training set. (Barak et al., 2009) used the similarity scores obtained by the combined
scoring method presented in Section 2.2 to produce an initial labeled set of documents
to train a supervised classifier. In their initial labeled set, each document was considered
as classified only to the best category, and this set was used to train an SVM classifier
for each category. Classification was determined independently by the classifier for
each category, allowing multiple classes per document. Their first unsupervised step
was based on single-class classification of each document in the unlabeled set.
In Section 5 we introduced our multi-class unsupervised classification scheme
where each document may be classified to zero, one or more categories. Consequently,
in this section we present a different approach for producing an initial labeled set of
documents using a multi-class classification scheme as our first step.
We also introduced (in Section 5) a different evaluation methodology, which
suggests drawing average recall-precision curves, where each cut-off point
corresponds to a certain precision level. Ideally, we would have liked to set a high
precision level and take the documents that meet this requirement as positive training
examples for the supervised classifier. However, we lacked the human resources for
manually tuning each of the categories within this thesis framework. We therefore had to
adopt a standard global cut-off scheme. We selected the percentage cut-off scheme,
in which a top percentage of the highly ranked documents of each category is selected as
positive examples for the category classifier.
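To make the notion of measuring the recall achievable at a certain precision level concrete, the following is a minimal sketch in Python. The function name and the input format are our own illustrations, not the evaluation code used in this thesis.

def recall_at_precision(ranked_docs, gold_positives, min_precision):
    """Walk down a category's ranked document list and return the best
    recall achievable while precision stays >= min_precision.

    ranked_docs    -- document ids sorted by descending classification score
    gold_positives -- set of document ids labeled with this category
    min_precision  -- the desired precision level (e.g. 0.8)
    """
    if not gold_positives:
        return 0.0
    true_pos = 0
    best_recall = 0.0
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in gold_positives:
            true_pos += 1
        precision = true_pos / rank
        if precision >= min_precision:
            # This cut-off point meets the precision requirement;
            # record the recall achieved so far.
            best_recall = max(best_recall, true_pos / len(gold_positives))
    return best_recall

# Hypothetical usage: documents d3 and d1 are the gold positives.
print(recall_at_precision(["d3", "d1", "d7", "d2"], {"d3", "d1"}, 0.8))  # 1.0

The cut-off point for each category can then be chosen according to the desired recall-precision trade-off, as described above.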
Good negative examples for the supervised training process need to have two
properties: (i) high confidence that they are indeed negative examples and (ii) at least
some of them should be close enough to the positive examples, containing passing
references or sharing similar contexts. Otherwise the classifier might simply classify
according to the appearance of the category names in the documents, rather than trying
to learn more features for the categories.
We used our Dice-based context model for selecting the negative examples.
Analyzing our Dice-based context model, we found that 99% of the classifications
that were ranked relatively low in the classification list of a given document were
inappropriate. However, as we had hoped, some of these documents did have something
in common with the inappropriate category. Therefore, we sorted the classification list
of each document and selected the document as a negative example for the categories
that were ranked below a certain rank in the sorted list.
The last issue was selecting the ratio between the positive and negative
examples. Since we were unable to estimate the real proportion of a category in the corpus,
we selected the same number of examples for the negatives as for the positives of
each category.
We manually tuned both of our parameters. The percentage classification
threshold for the positive examples was set to 0.3, with the aim of obtaining enough
training documents out of the 120,000 unlabeled documents in the IMDB corpus. The
Dice-based context rank was set to 5, based on experiments on the development set. We
represented the input example vectors for the SVM classifier in tf-idf
weighted term space and used a common feature selection method suggested by (Forman,
2003), which removes the least common features in the corpus, i.e., features that appear
fewer than 3 times.
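The following sketch illustrates how such an initial training set could be assembled under the scheme just described. It is a simplified illustration under stated assumptions: the data structures, the helper names, and the use of scikit-learn's TfidfVectorizer and LinearSVC are our own choices rather than the original implementation, and we interpret the 0.3 threshold as selecting the top-ranked fraction of each category's list.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def build_training_ids(ranked_by_category, ranked_cats_by_doc,
                       top_fraction=0.3, negative_rank=5):
    """Select positive and negative document ids per category.

    ranked_by_category -- {category: [doc ids, best classification first]}
    ranked_cats_by_doc -- {doc id: [categories, best Dice-based context
                           score first]}
    """
    training = {}
    for category, ranked_docs in ranked_by_category.items():
        # Positives: the top fraction of the category's ranked list.
        n_pos = max(1, int(len(ranked_docs) * top_fraction))
        positives = ranked_docs[:n_pos]
        # Negatives: a document is a negative example for every category
        # ranked below the cut-off rank in its own classification list.
        negatives = [doc for doc, cats in ranked_cats_by_doc.items()
                     if category in cats[negative_rank:]]
        # Keep a 1:1 positive/negative ratio, since the real category
        # proportions in the corpus are unknown.
        training[category] = (positives, negatives[:n_pos])
    return training

def train_category_svm(pos_texts, neg_texts):
    """Train one per-category SVM in tf-idf weighted term space.

    min_df=3 approximates the (Forman, 2003) style feature selection
    described above, dropping the least common features; note that it
    counts document frequency rather than raw corpus frequency.
    """
    vectorizer = TfidfVectorizer(min_df=3)
    X = vectorizer.fit_transform(pos_texts + neg_texts)
    y = [1] * len(pos_texts) + [0] * len(neg_texts)
    return vectorizer, LinearSVC().fit(X, y)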
However, we did not obtain any reasonable results: both recall and precision
were lower than 0.1. These experimental results lead us to believe that the problem lies
not in the bootstrapping scheme, but rather in the IMDB collection. The IMDB documents
are too short and their quality is low; further efforts to collect better textual information
on the IMDB movies are needed in the future.
Moreover, the bootstrapping step actually contradicts the rationale of LR-based
approaches. LR specifies an accurate semantic relation, which aims to identify whether
the meaning of a certain category name is referenced by the document text. This
measure aims at a more appropriate relation on which to base the TC decision, since it
requires an actual reference to the category topic in the text, rather than general context
similarity. In contrast, in the bootstrapping step a supervised classifier is used to
perform the final categorization step on the test corpus. The supervised classifier does
not capture the exact semantic relation needed to assess the classification decision; it
might model the broader context of the text rather than the specific topic it discusses.
Therefore, we conclude that when LR-based approaches are applied, it is desirable to
avoid the bootstrapping step.
Chapter 7
Conclusion and future work
In this work we investigated the keyword-based TC approach, which is based on the
integration of reference models and context models. The proposed method integrates a
new LR resource and a new context model into the scoring method proposed by (Barak et al.,
2009). Our research focused on a multi-class classification scheme with a novel
evaluation approach, revealing a new perspective on the classification results.
Our investigation highlights several important conclusions about the integration
of the two models, about each of the new models, and about the classification and
evaluation scheme:
1. Indeed, as (Barak et al., 2009) reported, our analysis reveals that the reference
requirement as the basis for the TC score helps to classify documents according to the
topic they actually discuss, as opposed to using context models, which only reveal the
documents' broader context.
2. Utilizing statistical correlation from a corpus of the target domain can be useful for
both context representation and lexical reference extraction. Word co-occurrence
information captures topical semantic relationships between words; these relationships
include both referring terms and other context-related terms.
3. Our Dice-based context model is much simpler than the LSA context model, which is
complex to implement and computationally expensive. Furthermore, the Dice-based
context model uses a direct representation of word co-occurrence that is easy to analyze,
while the LSA representation is implicit. Nevertheless, it is comparable in performance
to the LSA context model (a minimal sketch of the underlying Dice computation appears
after this list).
4. Combining different LR resources provides a more complete perspective and better
expansion abilities. However, when no external resources are available, for example
in a different language, statistical LR resources can be very beneficial.
5. When a small degree of manual intervention is possible, such as in an industrial
classification setting for a large real-world taxonomy, our new classification and
evaluation methodology is more suitable, and increases performance significantly.
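As a concrete illustration of conclusion 3, the following is a minimal sketch of the Dice coefficient over document-level word co-occurrence. The function name and the document-level counting are our own assumptions for illustration, not the exact implementation used in this thesis.

from collections import Counter
from itertools import combinations

def dice_scores(documents):
    """Compute Dice(w1, w2) = 2 * df(w1, w2) / (df(w1) + df(w2)),
    where df counts the number of documents containing the word(s).

    documents -- iterable of token lists, one list per document
    """
    word_df = Counter()   # documents containing each word
    pair_df = Counter()   # documents containing each word pair
    for tokens in documents:
        vocab = set(tokens)
        word_df.update(vocab)
        pair_df.update(combinations(sorted(vocab), 2))
    return {pair: 2 * n / (word_df[pair[0]] + word_df[pair[1]])
            for pair, n in pair_df.items()}

# Hypothetical toy corpus: "football" and "goal" co-occur in 2 of 3 docs.
docs = [["football", "goal", "match"],
        ["football", "goal"],
        ["football", "league"]]
print(dice_scores(docs)[("football", "goal")])  # 2*2 / (3+2) = 0.8

Unlike an LSA dimension, every such score traces back to explicit co-occurrence counts, which is what makes the representation easy to analyze.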
We will now describe several promising research directions observed during the course
of this study.
1. One of the advantages of a statistical LR resource is that more documents can be
added for categories with a small number of documents, by crawling the web or using
an additional corpus. We suggest that this line of research be investigated further, in
order to enrich and optimize the Dice LR resource and exploit additional reference
knowledge.
2. In our research we did not adjust the category names, but simply set the seeds to
be the category names as given in the taxonomy. Future work should
investigate the contribution of human involvement in seed selection.
3. Obviously, the recall of our method can be improved by utilizing further reference
knowledge resources. (Kotlerman et al., 2009) defined novel directional statistical
measures of semantic similarity. Their measure is based on the Distributional
Similarity Hypothesis, which suggests that words occurring within similar contexts are
semantically similar (Harris, 1968); however, their measure is asymmetric. The
directionality is based on Distributional Inclusion, which assumes that the prominent
semantic traits of an expanding word should co-occur with the expanded word as well.
We suggest improving the keyword-based TC task by utilizing Directional
Distributional Similarity expansion resources, which might be based on the above
measure (a simplified illustration follows).
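To illustrate the intuition behind such a directional measure, here is a minimal sketch of a distributional-inclusion-style score. This is a deliberately simplified stand-in (weighted feature coverage), not the actual measure of (Kotlerman et al., 2009), and the names and inputs are hypothetical.

def inclusion_score(narrow_features, broad_features):
    """Directional score: how well the context features of the narrower
    (expanding) word are covered by those of the broader (expanded) word.

    Both arguments map context words to association weights. The score
    is asymmetric: inclusion_score(u, v) != inclusion_score(v, u).
    """
    total = sum(narrow_features.values())
    if total == 0:
        return 0.0
    covered = sum(w for f, w in narrow_features.items()
                  if f in broad_features)
    return covered / total

# Hypothetical feature vectors: the contexts of "karate" are largely
# covered by those of "martial arts", but not vice versa.
karate = {"fight": 0.9, "belt": 0.7, "dojo": 0.6}
martial_arts = {"fight": 0.8, "belt": 0.5, "dojo": 0.4, "discipline": 0.6}
print(inclusion_score(karate, martial_arts))   # 1.0: karate -> martial arts
print(inclusion_score(martial_arts, karate))   # lower in the other direction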
Appendix A
Our complete IMDB taxonomy
Categories

1 Religion
  1.1 Buddhism
  1.2 Hinduism
  1.3 Christianity
    1.3.1 christmas
  1.4 Islam
  1.5 Judaism

2 Sport
  2.1 Bicycle
  2.2 Boxing
  2.3 Fishing
  2.4 Football
  2.5 Golf
  2.6 Hockey
  2.7 martial-arts
    2.7.1 karate
  2.8 Athletics
  2.9 Running
  2.10 shooting
  2.11 Skiing
  2.12 soccer
  2.13 water sports
    2.13.1 surfing
    2.13.2 swimming
  2.14 Tennis
  2.15 Baseball
  2.16 Wrestling
  2.17 basketball
  2.18 Horseracing
  2.19 Olympic games

3 Interests (NON-CAT)
  3.1 Beach
  3.2 Outdoor
  3.3 Gardening
  3.4 Pets
  3.5 Fitness
  3.6 Cookery
  3.7 Fashion
  3.8 Computing
  3.9 Travel
  3.10 Motoring
    3.10.1 cars
    3.10.2 Motorcycle
  3.11 Trains
  3.12 Airplanes
  3.13 Ships
  3.14 Radio
  3.15 Business
  3.16 Nature
    3.16.1 Animals
  3.17 outer Space
  3.18 the environment
  3.19 Showbiz
  3.20 Traditions
  3.21 Infants
  3.22 Military
  3.23 Weather

4 Arts
  4.1 Cinema
  4.2 Advertising
  4.3 Theater
  4.4 Music
    4.4.1 Opera
    4.4.2 classical music
    4.4.3 Jazz
    4.4.4 Pop/rock
    4.4.5 country music
    4.4.6 Hip Hop
  4.5 Dance
    4.5.1 Ballet

5 Science
  5.1 Medicine
    5.1.1 disability
  5.2 Technology
  5.3 Psychology

6 Education
  6.1 School
  6.2 College/University

7 Miscellaneous (NON-CAT)
  7.1 crime (NON-CAT)
    7.1.1 prison
    7.1.2 mafia
    7.1.3 drugs
    7.1.4 fraud
    7.1.5 gambling
    7.1.6 terrorism
  7.2 Literature
  7.3 History
  7.4 Political
  7.5 Social (NON-CAT)
    7.5.1 racism
  7.6 Legal
  7.7 Communism
  7.8 War
    7.8.1 World war 1
    7.8.2 World war 2
  7.9 Aliens
  7.10 comic-book
  7.11 journalism
  7.12 mythology
Appendix B
The annotation guidelines
You are given a list of films with their plot description and a taxonomy of film categories.
The taxonomy
The taxonomy is made up of film subject matters and is arranged in hierarchical order, so that if a sub-category is marked, its ancestors are also relevant. This is true in all cases except when a category is present only in order to group similar subjects together, in which case it is marked with the text
(NON-CAT) next to it.
For example:
A film categorized as dealing with 'cars' will also be relevant to 'motoring' but not to 'interests' as it is not a category.
3 Interests (NON-CAT)
3.9 Travel
3.10 Motoring
3.10.1 cars
3.10.2 Motorcycle
* Note - the taxonomy is not exhaustive, you may find that there is no category in the taxonomy which accurately fits the film even though you can think of a subject matter that does. If a broader category is present choose it, otherwise choose none.
For each film, you must decide which categories (if any) out of the taxonomy are relevant to it. You can choose as many or as few categories as you see fit, or none.
* Note – if you find more than one category, please put each category in a separate line (insert lines if necessary).
You must categorize according to the following guidelines:
1. Is the background story prominent – not just a passing reference? Examples:
The following film should be categorized as relevant to 'crime': "Jessie is an ageing career criminal who has been in more jails, fights, schemes, and lineups than just about anyone else. His son Vito, while currently on the straight and narrow, has had a fairly shady past and is indeed no stranger to illegal activity. They both have great hope for Adam, Vito's son and Jessie's grandson, who is bright, good-looking, and without a criminal past. So when Adam approaches Jessie with a scheme for a burglary he's shocked, but not necessarily disinterested...."
The following film should be categorized as relevant to 'animals': "Farmer Hoggett wins a runt piglet at a local fair and young Babe, as the piglet decides to call himself, befriends and learns about all the other creatures on the farm. He becomes special friends with one of the sheepdogs, Fly. With Fly's help, and Farmer Hoggett's intuition, Babe embarks on a career in sheepherding with some surprising and spectacular results. Babe is a little pig who doesn't quite know his place in the world. With a bunch of odd friends, like Ferdinand the duck who thinks he is a rooster and Fly the dog he calls mom, Babe realizes that he has the makings to become the greatest sheep pig of all time, and Farmer Hogget Knows it. With the help of the sheep dogs Babe learns that a pig can be anything that he wants to be."
The following film should not be categorized as relevant to 'basketball': "This gritty drama follows two high school acquaintances, Hancock, a basketball star, and Danny, a geek turned drifter, after they graduate. The first film commissioned by the Sundance Film Festival, it portrays the other half of the American dream, as Hancock and his cheerleader girlfriend Mary wander to a middle-class mediocrity itself out of reach for Danny and his psychotic wife Bev."
2. You must not base your decision on prior knowledge of the film, only on information provided in the plot.
Bibliography
1. Buckley, C., Singhal, A. and Mitra, M. "New retrieval approaches using
SMART: TREC 4". In Proc. TREC, 1995.
2. Chade-Meng Tan, Yuan-Fang Wang, Chan-Do Lee: The Effectiveness of
Bigrams in Automated Text Categorization. ICMLA 2002: 275-281
3. Chirag Shah and Bruce W. Croft. 2004. Evaluating high accuracy retrieval
techniques. In Proceedings of SIGIR.
4. D. Downey and O. Etzioni. Look ma, no hands: Analyzing the monotonic
feature abstraction for text classification. In Advances in Neural Information
Processing Systems (NIPS) 21, 2009, January 2009.
5. Dan Moldovan and Vasile Rus. 2001. Logic form transformation of wordnet and
its applicability to question answering. In Proceedings of ACL.
6. Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The
third pascal recognizing textual entailment challenge. In Proceedings of ACL-
WTEP Workshop.
7. Deerwester, S., S. Dumais, G. Furnas, T. Landauer, and R. Harshman. 1990.
Indexing by latent semantic analysis. Journal of the American Society of
Information Science.
8. Ephraim Sachs, Semantic Aspects of Information Retrieval, MSc thesis, Hebrew
University, Israel, 2008.
9. Eyal Shnarch, Libby Barak, Ido Dagan. Extracting Lexical Reference Rules
from Wikipedia. In Proceedings of ACL 2009.
10. Fellbaum, C., editor. 1998. WordNet: An Electronic Lexical Database
(Language, Speech and Communication). The MIT Press.
11. George Forman, An extensive empirical study of feature selection metrics for
text classification, The Journal of Machine Learning Research, 3, 2003.
12. Giampiccolo, D., B. Magnini, I. Dagan, and B. Dolan. 2007. The third PASCAL
recognizing textual entailment challenge. In Proceedings of ACL-WTEP
Workshop.
13. Gliozzo, A., C. Strapparava, and I. Dagan. 2005. Investigating unsupervised
learning for text categorization bootstrapping. In Proc. of the Joint Conference
on Human Language Technology / Empirical Methods in Natural Language
Processing (HLT/EMNLP), Vancouver.
14. Ido Dagan, Oren Glickman, and Bernardo Magnini, editors. 2006. The PASCAL
Recognising Textual Entailment Challenge, volume 3944. Lecture Notes in
Computer Science.
15. Joachims, T. 1999. Making large-scale SVM learning practical. In B. Scholkopf,
C. Burges, and A. Smola, editors, Advances in kernel methods: support vector
learning. MIT Press, Cambridge, MA, USA, chapter 11, pages 169-184.
16. Jun'ichi Kazama and Kentaro Torisawa. 2007. Exploiting Wikipedia as external
knowledge for named entity recognition. In Proceedings of EMNLP-CoNLL.
17. K. Morik, P. Brockhausen, and T. Joachims. 1999. Combining statistical
learning with a knowledge-based approach - A case study in intensive care
monitoring. Proc. 16th Int'l Conf. on Machine Learning (ICML-99).
18. Ko, Y. and J. Seo. 2004. Learning with unlabeled data for text categorization
using bootstrapping and feature projection techniques. In Proc. of ACL-04,
Barcelona, Spain, 2004.
19. Libby Barak, Ido Dagan, Eyal Shnarch. Text Categorization from Category
Name via Lexical Reference. In Proceedings of the North American Chapter of the
Association for Computational Linguistics - Human Language Technologies
(NAACL HLT), 2009.
20. Libby Barak, Keyword-based Text Categorization, MSc thesis, Bar-Ilan
University, Israel, 2008.
21. Lili Kotlerman, Ido Dagan, Idan Szpektor and Maayan Zhitomirsky-Geffet.
2009. Directional Distributional Similarity for Lexical Expansion. ACL-IJCNLP
2009 (short paper).
22. Lin, Dekang. "Automatic Retrieval and Clustering of Similar Words."
COLING-ACL '98. Montreal, Canada, 1998.
23. Liu, B., X. Li, W. S. Lee, and P. S. Yu. 2004. Text classification by labeling
words. In Proc. of AAAI-04, San Jose, July.
24. M. de Buenaga, J.M. Gomez, and B. Diaz. 1997. Using wordnet to complement
training information in text categorization. In Recent Advances in Natural
Language Processing II: Selected Papers from RANLP'97, volume 189 of
Current Issues in Linguistic Theory (CILT), pages 353-364. John Benjamins,
2000.
25. Mandala, Rila, Takenobu Tokunaga, and Hozumi Tanaka. "Combining Multiple
Evidence from Different Types of Thesaurus for Query Expansion." SIGIR.
Berkeley, CA, 1999.
26. Manning, C.D., Raghavan P. and Schutze H. "Introduction to Information
Retrieval." Cambridge University Press, 2008.
27. Marius Pasca and Sanda M. Harabagiu. 2001. The informative role of wordnet
in open-domain question answering. In Proceedings of NAACL Workshop on
WordNet and Other Lexical Resources.
28. Martin S. Chodorow, Roy J. Byrd, and George E. Heidorn. 1985. Extracting
semantic hierarchies from a large on-line dictionary. In Proceedings of ACL.
29. McCallum, A. and K. Nigam. 1999. Text classification by bootstrapping with
keywords, EM and shrinkage. In ACL99 – Workshop for Unsupervised Learning
in Natural Language Processing.
30. Nancy Ide and Jean Véronis. 1993. Extracting knowledge bases from machine-
readable dictionaries: Have we wasted our time? In Proceedings of KB & KS
Workshop.
31. Oren Glickman, Ido Dagan and Eyal Shnarch. Lexical Reference: a Semantic
Matching Subtask. In Proceedings of EMNLP 2006, 22-23 Jul 2006, Sydney,
Australia.
32. Perez-Aguera, J.R. and Araujo, L. "Comparing and Combining Methods for
Automatic Query Expansion." In Advances in Natural Language Processing and
Applications. Madrid, Spain, 2007.
33. S. Scott and S. Matwin. (1999). Feature engineering for text classification. Proc. of
16th International Conference on Machine Learning, Bled, Slovenia.
34. Salton, G. and M. H. McGill. 1983. Introduction to modern information
retrieval. McGraw-Hill, New York.
35. Schutze, H., Hull, D. and Pedersen, J.O. A comparison of classifiers and
document representations for the routing problem. In SIGIR '95: Proceedings of
the 18th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, 229-279, 1995.
36. Schutze, H. and Pedersen, J.O. "A cooccurrence-based thesaurus and two
applications to information retrieval." Information Processing and
Management, Vol. 33, No. 3, 1997.
37. Xu, J., W.B. Croft, “Query Expansion using Local and Global Document
Analysis”, ACM SIGIR 1996, ACM, 1996.
Abstract (Hebrew)

This thesis presents research in the field of text categorization. The classification is keyword-based, and its only input is a topical taxonomy.

The prevailing research approach to text categorization is the supervised approach. The main drawback of this approach is that it requires a substantial amount of manually labeled documents, which are often unavailable and impractical to produce. In keyword-based text categorization, a few keywords are provided for each category; these are easier to produce than even a few labeled documents. Nevertheless, this approach still requires nonnegligible manual work in creating a keyword list for each category. The research in this thesis is based on a new approach, first proposed by (Gliozzo et al., 2005), which eliminates the requirement for manual treatment of each category by using the category names alone as the initial keyword list.

In this thesis we adopted the approach of (Barak et al., 2009), which combines two types of similarity between words. One type is similarity to words that contain a specific reference to the meaning of the category name (Reference), while the second type captures words that tend to appear in contexts typical of the category name but do not necessarily imply its specific meaning (Context).

This thesis is part of the Negev consortium (the next generation of personalized video content services), within the content recommendation task. Therefore, as a first step we created a taxonomy for video content, built a video corpus, and annotated it. We then focused on adapting the above approach to our classification task.

Classification into a large taxonomy representing the real world raises different issues than classification into an artificial taxonomy created specifically for a particular academic dataset. This study proposes a classification and evaluation scheme suitable for a large taxonomy, and in particular for the IMDB (Internet Movie Database) taxonomy.

By measuring statistical correlations over the IMDB corpus, we improve both the reference model and the context model, aiming to improve the method proposed by (Barak et al., 2009). We propose a simple context model based on co-occurrence statistics of words in the corpus documents, in particular on the Dice coefficient (Mandala et al., 1999), together with a Lexical Reference resource that is also based on the same statistical measure.

In addition, in this thesis we propose a different approach to classification and to the evaluation of the results, based on the assumption that tuning a parameter for each category is an acceptable demand under industrial circumstances such as those of the Negev consortium. We adopted the classification scheme that allows classifying each document to one category, to several, or to none, since many of the documents in our corpus in fact belong to more than one category, while many others do not fit any of the taxonomy categories. We measured how many correctly classified documents are obtained at a given precision level; the algorithm allows choosing the precision level according to the desired recall-precision trade-off.

Positive empirical results are presented for our complete improved method, which indeed performs better than the method proposed by (Barak et al., 2009). Our analysis reveals that the requirement of a specific reference to the category name, as the basis for the classification score, helps the topical classification of documents, in contrast to using context models alone, which reveal only the broader context of the documents.
This work was carried out under the supervision of Prof. Ido Dagan and Prof. Moshe Koppel
of the Department of Computer Science,
Bar-Ilan University.
Bar-Ilan University
The Department of Computer Science

Text Categorization for a Multi-class Taxonomy

Chaya Liebeskind

This work is submitted in partial fulfillment of the requirements for the Master's
Degree in the Department of Computer Science, Bar-Ilan University.
Ramat-Gan, Israel, November 2009, Cheshvan 5770