Ontologies and Query expansion
Agissilaos Andreou
Master of Science
School of Informatics
University of Edinburgh
2005
Abstract
This master's thesis will explore the use of ontologies in information retrieval and in
query expansion in particular. Ontologies are usually huge, hand-coded repositories of
concepts and relations between them so using them in information retrieval seems to
be a reasonable goal. We feel that the use of ontologies for query expansion in par-
ticular has been overlooked in contemporary literature, as the main related papers date
before 2000. In this thesis we will attempt to present a query expansion method using
ontologies that outperforms non-ontological query expansion methods. Note, however,
that the presented approach is not purely ontological but is rather a hybrid approach
as it uses non-ontological steps. We also propose a method for purely probabilistic
query expansion that outperforms all methods tested. Finally we explore word sense
disambiguation based on ontologies as that is a prerequisite step for ontological query
expansion. The ontology used was WordNet. The results of our experiments were
based on standard TREC conference data and showed that an ontological approach
can yield improvements over non-ontological methods.
Acknowledgements
To my Greek professors that taught me how to think and my British professors that
taught me how to actually work.
Declaration
I declare that this thesis was composed by myself, that the work contained herein is
my own except where explicitly stated otherwise in the text, and that this work has not
been submitted for any other degree or professional qualification except as specified.
(Agissilaos Andreou)
Table of Contents
1 Introduction
2 Background
2.1 Ontologies
2.2 Query Expansion
2.2.1 Probabilistic Query Expansion
2.2.2 Ontological Query Expansion
2.3 Ideal query
2.4 Semantic similarity measures for ontologies
2.5 Ontology based Word Sense Disambiguation
3 Methodology
3.1 Probabilistic query expansion
3.2 Ontology based Word Sense Disambiguation
3.3 Re-ranking of expansion terms based on ontologies
3.3.1 Boosting based on relation to query concepts
3.3.2 Boosting based on importance measure drawn from hierarchies
3.3.3 Boosting based on network importance measure
4 Implementation
4.1 Modules used
4.2 Interactive version
4.3 Batch processing version
4.4 Various visualisation tools
5 Evaluation and Results
5.1 TREC tracks
5.2 Ontology based word sense disambiguation
5.3 Query expansion
5.3.1 Probabilistic Query expansion
5.3.2 Ontological Query expansion
5.3.3 Hybrid Query expansion
6 Discussion and Conclusions
6.1 Ideal query
6.2 Probabilistic methods
6.3 Pure ontological query expansion
6.4 Hybrid query expansion
6.4.1 Boosting based on relation to query concepts
6.4.2 Boosting based on network importance measure
6.4.3 The effect of the probabilistic method
6.5 A note on our adapted version of Pagerank
7 Summary
Bibliography
Chapter 1
Introduction
In recent years the growth of the World Wide Web, both in content and in users, and the
vast improvement in search engine technology have radically changed the way knowledge
and information are collected and shared. Gathering information has never been
so easy and open to such a wide audience as it is today. However, there is still a
significant number of cases where the results obtained through a search engine
contain many irrelevant documents. Ordinary web users often simply do
not know how to create effective queries, and even more experienced users usually
cannot create good queries when moving to an unfamiliar domain.
An alternative approach to keyword based information retrieval (IR) for the web is
the so called Semantic Web (SW). The Semantic Web uses Ontologies as a structured
representation of knowledge to improve information retrieval and to assist both humans
and machines to better find information in web pages. However, despite the significant
effort during the last years to make the Semantic Web a reality, several issues are
preventing its growth. One of the main drawbacks of the Semantic Web as an IR system
is that it requires the semantic annotation of all the documents it can use. This process
has proven to be a significant bottleneck on the deployment of SW and although several
semiautomatic methods have been proposed this is thought to prevent the growth of the
SW in the near future.
In our research we will focus on using ontologies with the standard web. More specifically,
we are going to focus on using ontologies for query expansion. Query expansion is
the process of augmenting the user’s query with additional terms in order to improve
results. For example, given the query “mad cow disease”, the terms “Creutzfeldt Jakob”
might be automatically added so that pages containing these additional terms along
with the original terms are ranked higher. Although this thesis is focused on query
expansion, we also describe related uses of ontologies, namely using ontologies for
evaluating semantic similarity and word sense disambiguation.
There is a significant and successful non-ontological literature on query expansion
which we will attempt to take into account. The probabilistic methods have proven
to be the predominant approach for query expansion in the most important IR confer-
ences, namely SIGIR (http://www.acm.org/sigir) and TREC (http://trec.nist.gov).
The main motivation for query expansion is, needless to say, to improve results by
including terms that would lead to retrieving more relevant documents. There is an
issue, however, as to what constitutes a good expansion term. Terms that are similar
and relevant to query terms are usually considered as good terms for expansion. How-
ever, note that this is not always the case, as we describe later in this thesis. A proposed
probabilistic method that uses a criterion other than relatedness to query terms performs
better than methods attempting to detect relations to query terms. Moreover, by
analysing the ideal queries we found that optimal terms tend to form semantic clus-
ters, however, sometimes these clusters are not related to query terms. More precisely,
they are related to the query but only under the particular context of the specific query
and this relation cannot be captured by a general semantic similarity and relatedness
measures. For example in the query “mad cow disease” the terms “britain, british, eu-
ropean, france” form a semantic cluster and are very good for expansion. Nevertheless,
they are not semantically close to “mad cow disease” and their relation to the query
is difficult to capture. Is it that “mad cow disease is a disease, diseases break out in
specific locations, and these are the locations”, or is it that “mad cow disease is about cows,
but which cows? british and european”?
There are two main strategies to find expansion terms: the first is to add related
terms based on some automatic relatedness measure and the second is based on rele-
vance feedback. Relevance feedback involves identifying which documents are rele-
vant and then selecting the terms that lead to a query that best distinguishes relevant
from irrelevant documents. Because relevance feedback requires the user to select
which documents are relevant, it is quite common to use pseudo-relevance feedback.
Pseudo-relevance feedback does not involve the user and assumes that all top-n doc-
uments retrieved by an initial query are relevant. There is a variety of methods based
on which relevance metric is used and how the ideal query is extracted from pseudo-relevance
feedback data. We will review some of these methods in the background
section. In the rest of the dissertation we will focus on a novel hybrid method that uses
both pseudo-relevance feedback and relatedness drawn from the ontology.
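The pseudo-relevance feedback strategy just described can be sketched in a few lines. The `pseudo_relevance_expand` function and the toy documents below are illustrative inventions, not the system used in this thesis:

```python
from collections import Counter

def pseudo_relevance_expand(query, ranked_docs, n_top=3, n_terms=2):
    """Assume the top-n documents of the initial ranking are relevant
    and pick the most frequent non-query terms as expansion terms."""
    counts = Counter()
    for doc in ranked_docs[:n_top]:      # the pseudo-relevant set
        counts.update(doc.split())
    for term in query.split():           # never re-suggest query terms
        counts.pop(term, None)
    return [t for t, _ in counts.most_common(n_terms)]

docs = ["mad cow disease outbreak britain",
        "mad cow disease britain beef ban",
        "cow milk farming"]
print(pseudo_relevance_expand("mad cow disease", docs, n_top=2))
# -> ['britain', 'outbreak']
```

A real implementation would weight terms by tf*idf rather than raw counts, but the structure of the feedback loop is the same.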
Query expansion has some inherent dangers. The main ones are related to a phenomenon
named query drift, that is, moving the query in a direction away from the user’s intention.
This happens frequently when the query is ambiguous. For example, the query
“windows” might be about actual windows in houses or the Microsoft Windows operating
system. A system might choose an interpretation different from the user’s intention
and augment the query with terms related to the wrong interpretation. This kind
of query drift is quite common in ontological methods and stresses the importance of
disambiguation of query terms and the query in general. In fact, most ontological
methods include a disambiguation preprocessing step. In this thesis we will describe
some methods for disambiguating query terms using ontologies.
A specific kind of query drift is called outweighting and is well described by Mahler
(2003). Outweighting refers to the phenomenon where the expansion terms are
strongly related to individual query terms but not to the overall query. For example, the
query “dogs training” might be augmented with terms such as “Poodle, Retriever, Setter,
jogging, weights” instead of “obedience, sit, heel, leash, reward”. Ontological query
expansion methods are prone to this kind of error, but the phenomenon can be observed
in statistical methods too.
An issue specific to ontological methods is that specific types of relations in the
ontology direct the query in specific directions. For example, most ontologies include
is-a and part-of relations. Thus, if a query about “car accidents” is expanded using
is-a relations, the terms “vehicle” and “event” could be added, since “car is-a vehicle” and
“accident is-a event”, but this would lead the query in a direction that could include train
accidents or car breakdowns. Similarly, expanding based on part-of relations makes
the query focus on the structure of the things discussed in the query.
Despite these problems, query expansion actually works and significantly improves
the average performance of IR systems. This finding is well documented in the literature
and is apparent from the results of this research. However, because of these dangers
query expansion degrades performance in unpredictable ways on some queries, and
many IR systems do not deploy query expansion at all, or use very cautious expansion
approaches.
In the rest of the thesis we will test the following hypotheses:
H1: A hybrid query expansion method that re-ranks query expansion terms suggested by a probabilistic method based on relatedness drawn from the ontology outperforms the original probabilistic method.

H2: Terms with near uniform distribution (high entropy) of term frequencies in the top documents returned by an initial query are good expansion terms.

H3: The correct senses of query terms will be more important in a network which has as nodes the terms extracted from a probabilistic method and as edges the semantic similarity between those terms.
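Hypothesis H2 can be made concrete with a small sketch. The `term_entropy` function and the tokenised toy documents are hypothetical illustrations: a term whose occurrences are spread evenly across the top documents has high entropy, while a term concentrated in one document has low entropy.

```python
import math

def term_entropy(term, docs):
    """Entropy of the term-frequency distribution of `term` over the
    top documents; higher values mean a more uniform spread."""
    freqs = [doc.count(term) for doc in docs]
    total = sum(freqs)
    if total == 0:
        return 0.0
    probs = [f / total for f in freqs if f > 0]
    return -sum(p * math.log2(p) for p in probs)

docs = [["disease", "cow", "britain"],
        ["cow", "britain", "beef"],
        ["britain", "export", "ban"]]
# "britain" appears once in every document -> maximal entropy log2(3);
# "beef" has all its mass in one document -> entropy 0.
print(term_entropy("britain", docs))
print(term_entropy("beef", docs))
```

Under H2, "britain" would therefore be preferred over "beef" as an expansion term.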
Our main focus will be testing H1. The remaining hypotheses are used in our
exploration of H1 and their examination in this thesis is neither thorough nor com-
plete. The inclusion of H3 is justified on the grounds that the disambiguation approach
greatly affects the performance of ontological methods. H3 is contrasted in this the-
sis to other disambiguation approaches that use the same semantic similarity measures
but only mutually disambiguate query terms and do not take into account informa-
tion extracted from the actual documents returned by the query. Using networks and
importance measures on them represents an attempt to incorporate this
additional information in the disambiguation process.
For the evaluation we used data from the TREC-2003 HARD track using Lucene
as an Information Retrieval engine (http://lucene.apache.org/).
The rest of the thesis is organised as follows. In the background chapter we review
some of the dominant query expansion methods and cover some issues used in the rest
of the thesis. In the methodology chapter we describe our proposed methodologies for
query expansion and disambiguation. In the implementation chapter we give the details
of how we implemented the methodologies. In the evaluation and results chapter we
describe how we evaluated the methodologies and present the results. In the discussion
and conclusions chapter we comment on the results. Finally we include a summary of
this thesis as a last chapter.
Chapter 2
Background
2.1 Ontologies
Ontologies provide a structured way of describing knowledge. According to Gruber
(1993) an ontology is a “shared specification of a conceptualisation”. Philosophically
speaking ontology is the “metaphysical study of the nature of being and existence”
(WordNet). Practically speaking, ontologies can be seen as special kinds of graphs
describing the entities that exist in a domain, their properties and the relations between
them. The basic building blocks of ontologies are concepts and relationships.
Concepts (or classes or categories or types) can be thought of as sets and appear
as nodes in the ontology graph. Concepts in the ontologies usually have a textual
description defining them, although some ontologies include a formal definition in
some kind of logic as well. In almost every ontology concepts are described by one
or more terms. Note that each concept might have more than one term describing it
and that a term need not match only one concept. For example, to describe the concept
of bicycle the terms “bicycle” and “bike” can be used. However, the term “bike”
might also refer to the concept of motorcycle. Usually, ontologies include a single and
unambiguous term for each concept. This might be more appropriate for specifying
and sharing knowledge, however, it is not usually good for detecting concepts in text
because in real text the same concepts are usually referred to with many different terms.
Furnas et al. (1987) describe an experiment showing that people use the same term to
describe the same concept less than 20% of the time. Mapping a term found in text to
a unique ontology concept is one of our main goals in this thesis.
Relationships are usually of a specific type and connect two or more concepts.
Most ontologies include is-a (or subclass, or hyper/hyponymic) relationships between
concepts, e.g. “car is-a vehicle”. Many ontologies include a part-of (or holo/meronymic)
relationship, e.g. “Earth is-part-of the Solar-System”. Usually ontologies include other
types of relationships as well, but we will focus on these two because they can be found
in almost any ontology and can be used to create hierarchies, which we will use in our
approach. Note that there are some issues in creating hierarchies from part-of relations
regarding the transitivity implied by hierarchies. For example, “my foot is part-of me”
and “I am part-of a committee”, thus we are led to the rather strange conclusion that
“my foot is part-of a committee”. This phenomenon is caused when different types of
part-of relationships are mixed, as is well described in (Winston et al., 1987).
Throughout this thesis we will use WordNet (http://wordnet.princeton.edu/) as our
ontology. However, we specifically avoided the use of any ontology-specific features
so that our approach can be easily applied to other ontologies. Concepts in WordNet
are called synsets, that is, synonym sets. Usually in the context of WordNet concepts
are referred to as senses. The terms describing each concept are the synonyms contained
in the synset. For example, the synset of bicycle is “bicycle, bike”, thus both
terms represent the same concept. The definition of a concept in WordNet is
called a gloss. The main relations in WordNet are is-a relations and part-of relations.
Other relations exist as well, such as domain, pertains-to, similar, see-also etc.; however,
they are quite sparse and not worth dealing with independently. Parents in an is-a
relationship, such as “vehicle” in “car is-a vehicle”, are called hypernyms and children are
called hyponyms, i.e. vehicle is a hypernym of car and car a hyponym of vehicle. Parents
in a part-of relationship, such as “car” in “car has-an engine”, are called holonyms
and the children meronyms. WordNet roughly distinguishes between different types of
the part-of relation and thus is suitable for creating hierarchies. The types of part-of
relations used in WordNet are:
• member sense (hmem/mmem): professor is-a-member-of staff
• substance sense (hsub/msub): tears are-made-of water
• all other senses (hprt/mprt): China is-part-of Asia, an amusement park has rides,
etc.
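The way these relations are used to build hierarchies can be illustrated with a toy graph. The concepts and edges below are invented for illustration and are not taken from WordNet:

```python
# Toy ontology: each concept maps to its is-a parents.
IS_A = {
    "car": ["vehicle"],
    "train": ["vehicle"],
    "vehicle": ["artifact"],
    "accident": ["event"],
}

def hypernym_chain(concept, graph=IS_A):
    """Walk is-a edges upwards, collecting all ancestors of a concept."""
    ancestors = []
    stack = list(graph.get(concept, []))
    while stack:
        parent = stack.pop()
        ancestors.append(parent)
        stack.extend(graph.get(parent, []))
    return ancestors

print(hypernym_chain("car"))   # -> ['vehicle', 'artifact']
```

A part-of hierarchy would be built the same way, although, as noted above, mixing different part-of subtypes in one hierarchy can produce spurious transitive conclusions.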
2.2 Query Expansion
There are two main approaches to query expansion covered in the literature. The dominant
one is probabilistic query expansion, which is usually based on calculating
co-occurrences of terms in documents and selecting the terms that are most related to
the query terms. Ontological methods suggest an alternative approach, which uses
semantic relations drawn from the ontology to select terms. In this section we will
compare the probabilistic and ontological approaches and then present some methods
from which we have drawn ideas.
2.2.1 Probabilistic Query Expansion
An excellent review of early probabilistic methods can be found in the introduction
section of (Xu and Croft, 2000), in the related work section of (Hang et al., 2002) and
in section 2 of (Carpineto et al., 2001). Here we are going to provide a summary of
those reviews and introduce some more recent methods. Most probabilistic methods
can be categorised as global or local. Global techniques extract their co-occurrence
statistics from the whole document collection and can afford to be resource intensive,
as the calculations can be performed off-line. Local techniques extract their statistics from the
top-n documents returned by an initial query; they might use some corpus-wide statistics
such as the inverse document frequency, but they must be fast because they delay the
response of the system. All calculations for local methods are done on-line, just
after the user supplies the query and before the results are presented to the user.
One of the first successful global analysis techniques was term clustering (Jones,
1971). Term clustering is based on the association hypothesis, namely that terms related
in some corpus tend to co-occur in the documents of that corpus. Using this hypothesis,
terms were clustered based on their co-occurrences, and expansion terms
were selected from the clusters which contained the query terms. Other well-known
global techniques include Latent Semantic Indexing (Deerwester et al., 1990), and
Phrasefinder (Jing and Croft, 1994). These techniques use different methods to build a
similarity matrix of terms and select terms that are most related to the query terms in
that matrix.
Local analysis can be traced at least back to (Attar and Fraenkel, 1977) which used
a similar approach to term clustering to select expansion terms but, being a local method,
created the clusters from the terms of the top-n results of
an initial query. Local techniques are based on the hypothesis that the top-n documents
are relevant to the query. This assumption is called pseudo-relevance feedback and has
proven to be a simple but effective assumption to make. However, it can cause a
significant variance in performance depending on whether the documents retrieved by
the initial query were actually relevant.
Most local analysis methods use the notion of Rocchio’s ideal query (Rocchio, 1971)
as a starting point. This method is discussed in more detail later in this section
and could be described as a method to find the query that has maximum similarity to
relevant documents and minimum similarity to irrelevant documents.
Several methods have been proposed which differ in how they select the terms
from the top-n documents and in how they attempt to minimise the effect of irrelevant
documents returned by the initial query (Mitra et al., 1998; Lu et al., 1997; Buckley et al.,
1998). However, the most successful local analysis method of this kind is Local Context
Analysis (Xu and Croft, 2000), which we present in more detail as it is one of
the most successful query expansion methods and we evaluate it in this
thesis.
2.2.1.1 Local Context Analysis
Local context analysis (LCA) (Xu and Croft, 2000) is a local technique but uses a
method for selecting terms which is more similar to that found in global techniques
(actually Phrasefinder). More specifically, expansion terms are selected not based on
their frequencies in the top-ranked documents but rather on their co-occurrences with
query terms. Alternatively this can be seen as a method to implicitly cluster the top
ranked documents and select terms that appear in the most relevant cluster. The rele-
vance of a cluster is measured by the term frequency of the query terms in that cluster.
Consider a single-term query: if a term appears with term frequency tf1 in document
d1 and another term appears with the same term frequency tf1 in document d2, but the
query term frequency is higher in d1 than in d2, then the first term will get a higher
score, as it appears in a presumably more relevant document “cluster”. In this way,
local context analysis overcomes the problem of irrelevant initial documents to some
extent and produces better results.
In the term scoring formula this is expressed as the replacement of the standard
tf*idf (term frequency) measure with a measure of co-occurrence degree. A simplified
version of the LCA formula is the following:
\[
\prod_{w_i \in Q} \frac{\sum_{d \in S} tf(term, d) \cdot tf(w_i, d)}{N} \cdot idf(term) \tag{2.1}
\]
where d is a document, S is the set of top-n documents returned by an initial query, N
is the number of documents in S (a normalising factor with no effect on the ranking),
w_i is the i-th term of the query, and Q is the query. Comparing this formula to standard
tf*idf term selection, as expressed in the formula:
\[
\frac{\sum_{d \in S} tf(term, d)}{N} \cdot idf(term) \tag{2.2}
\]
shows that LCA weights the term frequencies by the frequency of query terms, so that
terms that appear with higher frequencies in documents where the query term frequencies
are high get a better score. Moreover, there is an attempt to favour terms that co-occur
with all query terms at the same time.
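The contrast between formulas 2.1 and 2.2 can be sketched as follows. The documents, query, and idf values are invented, and the functions are simplified stand-ins rather than the full LCA implementation (which, among other things, works on fixed-length passages):

```python
def tf(term, doc):
    """Raw term frequency of `term` in a tokenised document."""
    return doc.count(term)

def lca_score(term, query, docs, idf):
    """Simplified LCA score (eq. 2.1): per query word, term frequencies
    are weighted by the co-occurring query-term frequencies."""
    n = len(docs)
    score = 1.0
    for w in query:
        score *= sum(tf(term, d) * tf(w, d) for d in docs) / n
    return score * idf.get(term, 1.0)

def tfidf_score(term, docs, idf):
    """Plain tf*idf score (eq. 2.2)."""
    return sum(tf(term, d) for d in docs) / len(docs) * idf.get(term, 1.0)

docs = [["cow", "disease", "britain", "britain"],
        ["cow", "disease", "export"],
        ["farm", "export"]]
query = ["cow", "disease"]
idf = {"britain": 2.0, "export": 2.0}
# "britain" co-occurs with both query terms in the same documents,
# "export" only partly, so LCA separates them:
print(lca_score("britain", query, docs, idf))
print(lca_score("export", query, docs, idf))
```

Note that in this toy collection plain tf*idf (eq. 2.2) gives both candidate terms the same score, while the simplified LCA score (eq. 2.1) ranks "britain" higher, which is exactly the weighting effect described above.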
There is an inherent problem with methods based on term frequency, and it is more
apparent with LCA because of the weighting process: these methods are biased towards
terms contained in small documents. Usually a term will occur once or twice in a
document, so the actual value of the term frequency depends on the size of the document.
Small documents will have high term frequencies for all terms contained in them, and
thus terms contained in small documents get an unreasonably high score.
In (Xu and Croft, 2000) this is addressed by using fixed length passages instead of
documents for the scoring function.
LCA also has some parameters that affect performance, and these are:
• the number of top documents (passages) used
• the number of terms selected and
• the weighting scheme of the selected terms.
In the original paper the top 100 documents were used, the 70 top-scoring expansion
terms were included for expansion and there was a weighting scheme where each ex-
pansion term had a different weight based on its rank according to the score; lower
ranked expansion terms get lower weights.
In general, LCA is probably one of the most successful and well-established
query expansion methods. Results in experiments show an improvement in average
precision of more than 20%.
2.2.1.2 Other probabilistic methods
Another very successful query expansion method that uses information from query
logs is described in (Hang et al., 2002). Using query logs is very attractive because
they can be used to train parameters of any model. Nevertheless, access to query logs
is required and thus such approaches cannot be used from the initial deployment of a
system but could rather be used to adjust its performance as the system is being used.
An alternative approach that attempts to use information theoretic measures for
query expansion is described in (Carpineto et al., 2001). The main hypothesis of this
method is that the difference between the distribution of terms in a set of relevant
documents and the distribution of the same terms in the overall document collection
reveals the semantic relatedness of those terms to the query. More specifically, the
frequency of appropriate terms is expected to be higher in the relevant documents than in
the whole collection. Kullback-Leibler divergence is used to measure the difference
between the distributions, and the reported results are very good and comparable to those of LCA.
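The core idea can be sketched as a per-term contribution to the KL divergence. The counts below are invented and `kl_term_score` is a simplified illustration, not the scoring function of Carpineto et al.:

```python
import math

def kl_term_score(term, rel_counts, coll_counts):
    """Per-term contribution p_R(t) * log(p_R(t)/p_C(t)) to the KL
    divergence between the pseudo-relevant-set distribution and the
    whole-collection distribution."""
    p_rel = rel_counts.get(term, 0) / sum(rel_counts.values())
    p_coll = coll_counts.get(term, 0) / sum(coll_counts.values())
    if p_rel == 0 or p_coll == 0:
        return 0.0
    return p_rel * math.log2(p_rel / p_coll)

rel = {"cow": 5, "disease": 5, "britain": 4, "the": 10}
coll = {"cow": 50, "disease": 40, "britain": 20, "the": 10000}
# "britain" is far more frequent in the relevant set than in the
# collection, so it outscores the ubiquitous word "the":
print(kl_term_score("britain", rel, coll))
print(kl_term_score("the", rel, coll))
```

Common words like "the" have nearly identical distributions in both sets and score at or below zero, which is what makes the divergence criterion attractive for term selection.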
Finally, the last probabilistic approach we will present is Holistic Query Expansion
(Mahler, 2003). This method was actually developed to answer relation questions.
Unlike the other methods, which build a similarity matrix based on co-occurrences
in documents, this method uses an explicit similarity measure to build a
graph of terms and selects terms from the resulting graph. Several similarity measures
were tested, as were several methods for selecting terms from the graph.
The results reported for this method are not comparable to the other methods as it was
tested on a different corpus. However, it is important to note that this approach uses
an explicit similarity measure and graphs for selecting terms, which makes it perhaps
closer to ontological methods.
2.2.2 Ontological Query Expansion
Probabilistic methods are attractive because they are effective and the relations are easily
generated from the document collection. However, there is a significant number of
large, manually edited repositories of relations between concepts stored in ontologies,
and using these data for query expansion is covered in the literature. Most approaches
use large lexical ontologies (usually WordNet or Cyc [http://www.cyc.com]) because
they are not domain specific and because their relations are not sparse.
Using ontologies for query expansion can be dated at least back to (Voorhees, 1994).
In her paper, Ellen Voorhees outlines a method for using ontologies for query expansion
that has been adopted by most subsequent research:
• First the query terms must be disambiguated so that they map to a unique ontol-
ogy concept
• Then terms related in the ontology to the disambiguated concepts are added to
the query.
Usually in the literature this is followed by an analysis of the effect of specific ontological
relations on the results.
2.2.2.1 Disambiguation
As we have already noted, concepts in ontologies need not be described by a
single term. Usually each concept is described by several synonyms. In some cases
the converse is also true: a single term (word) might be used to describe more than one
concept. In such cases the system must disambiguate the term so that it matches
a unique ontology concept. (Voorhees, 1994) manually disambiguated the concepts,
as the main goal was to prove that an ontological expansion method would be helpful
in the first place. Automatic disambiguation methods are suggested by more recent
papers such as (Navigli and Velardi, 2003). The issue of the importance of disambigua-
tion for ontological query expansion methods and information retrieval effectiveness in
general is discussed in great detail in the literature. (Sanderson, 1994) and (Gonzalo et al.,
1998) used different evaluation approaches, but both agreed that in order to achieve any
improvement a WSD error rate of less than 10% is required. However, this result
was questioned by more recent research (Stokoe et al., 2003; Navigli and Velardi, 2003)
on the grounds that better strategies for disambiguation and better expansion
methods can be used. We will return to this issue in the discussion section.
2.2.2.2 Term Selection
After disambiguating the terms, most methods go on to select terms that are related
to the disambiguated concepts by direct relations in the ontology. Usually specific
kinds of relations are tested (“synonyms”, “synonyms and hyponyms”, “synonyms,
hyponyms and hypernyms”, “meronyms”, etc.), along with a method that mixes the
various relations.
Note, however, that we came across no attempt to actually verify that a relation holds
when multiple options exist. In some cases the same term maps to concepts that have
more than one relation to the query terms. For example, in WordNet “human” is both a
sibling of “animal” under “organism” and a hyponym of “animal”. The ontological
methods we encountered do not distinguish such cases: if one sense of a term is a
hyponym of a query concept, the term is used for expansion just as unambiguous concepts are.
Moreover, we came across no attempt to combine relations; all approaches perform a
per-relation analysis.
In general the conclusion drawn by most ontological expansion research is well
stated by Voorhees (1994):
The most useful relations for query expansion are idiosyncratic to the particular query in the context of the particular document collection
Query expansion terms selected by all of these methods cause a smaller improvement
than the one achieved by the previously mentioned probabilistic methods. Navigli and
Velardi (2003) propose a method of expanding with terms appearing in the definition
of the disambiguated concepts and report an improvement comparable to that of the
probabilistic methods. However, many artifact relations find their way into the query.
For example, in a query about “uniforms in public schools”, where the definition of
public school is “a free school supported by taxes and controlled by a school board”,
the word “tax” finds its way into the query, producing irrelevant results. This finding
was also confirmed by our experiments, which showed that although this kind of
expansion improves average performance, it is very unstable.
2.3 Ideal query
Since the goal of query expansion is to improve the query, it is useful to know what
the ideal query would look like. Moreover, it helps to set an upper bound on the
expected performance. To determine the ideal query we use Rocchio’s query expansion
(Rocchio, 1971) with the actual relevance judgements given by the TREC conference.
Rocchio’s query expansion is a method for detecting the ideal query. The ideal
query is the one that has maximal similarity with relevant documents and minimal
similarity with the irrelevant ones. Assuming a vector space retrieval model this query
Q is given by the following formula:
Q = (1/|Dr|) ∑_{dr ∈ Dr} dr − (1/|Di|) ∑_{di ∈ Di} di        (2.3)
Dr is the set of relevant documents and Di the set of irrelevant documents. In other
words, Rocchio’s query expansion finds the average term frequency in relevant docu-
ments and the average term frequency in irrelevant documents, subtracts the latter from
the former, and thus calculates a per-term weight. In this way terms that appear with
high frequencies in relevant documents and low frequencies in irrelevant documents
get higher weights.
Usually this could end up in a very large weighted query, so pruning to the top-n
best terms can be used.
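As an illustration, equation 2.3 can be computed directly from term-frequency vectors; the sketch below (our own function and variable names, with documents represented as plain term-to-frequency dictionaries) also shows the top-n pruning:

```python
from collections import Counter

def rocchio_ideal_query(relevant_docs, irrelevant_docs, top_n=None):
    """Per-term weights of the 'ideal' query (equation 2.3): average term
    frequency in relevant documents minus the average in irrelevant ones."""
    weights = Counter()
    for doc in relevant_docs:            # doc: dict mapping term -> frequency
        for term, tf in doc.items():
            weights[term] += tf / len(relevant_docs)
    for doc in irrelevant_docs:
        for term, tf in doc.items():
            weights[term] -= tf / len(irrelevant_docs)
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n] if top_n else ranked
```

Terms frequent in relevant documents and rare in irrelevant ones end up with the highest weights, exactly as the formula prescribes.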
Note, however, that this method for extracting good terms has an inherent flaw
as it overfits on the documents and does not use actual similarity measures or any
background information. If a rare term or a spelling error just happens to appear in the
relevant documents then it would be a good expansion term according to this method
but will presumably not generalise well in a new document collection.
An alternative way we followed to detect the ideal query was the following: for
each probabilistic method we extracted the top-n suggested terms and randomly re-
ranked those terms to form a large number of queries. We then issued these queries on
our system and selected the best performing query as the “ideal” query.
This was done in order to have more than one independent way of defining the
ideal query and thus presumably minimise the bias of our analysis.
Another important use of the ideal queries created by this method is that they give
some insight into what is feasible by only re-ranking the top-n terms. This was useful
for setting an upper bound on our term re-ranking process. Moreover, by considering
not only the best but all the randomly generated queries and taking their average, we
set a baseline for our term re-ranking process.
2.4 Semantic similarity measures for ontologies
In our approach we use semantic similarity measures to disambiguate the terms, so
we dedicate this section to them. A similarity measure can be seen as a symmetric
function that takes two concepts as arguments and returns a similarity score.
There are two kinds of semantic similarity measures: one derives similarity from
the distribution of concepts in documents, the other evaluates similarity over ontolo-
gies. A good review of distributional semantic similarity measures can be found in
(Manning and Schütze, 1999). Here we will focus on semantic
similarity measures created for ontologies, and especially WordNet, as that is the
ontology that we are going to use. An excellent starting point for this kind of method
is the WordNet::Similarity Perl library. Note that although this package uses WordNet,
the same algorithms can be used with any ontology.
There are two main categories of ontological methods. The first type of methods,
which we will call “structural”, attempts to extract a similarity measure from the
structure of the ontology when seen as a graph; in other words, these methods calculate
the similarity score based on the properties of the paths that connect the concepts in the ontology. The
second type of methods, which we will call “gloss-based”, calculates semantic similarity
based on the overlap of the definitions of the terms; in other words, these methods rely
on the hypothesis that similar terms have similar definitions. Needless to say, the
most effective methods combine these two approaches.
A good review of structural methods can be found in (Maki et al., 2004). The basic
notion in structural methods is that of a connecting path. A connecting path is a path
in the ontology that connects the two concepts whose similarity we wish to evaluate.
The path consists of a series of relation edges and concept nodes. Structural
similarity measures differ in how they derive the similarity score from such paths;
several alternatives exist: the number of paths, the length of the paths, the kinds of
relations appearing in the path, the kinds of nodes in the path, etc.
The basic notion in “gloss-based” methods is the notion of overlap. The definitions
of terms are checked for common words or phrases and the semantic similarity score is
determined by the number of common words or phrases. Definitional similarity
measures differ mostly in how they weight the overlapping terms and how much they
favour phrases.
Next we will present a simple structural approach, a simple definitional approach,
and two hybrid approaches.
Probably the simplest structural approach uses only taxonomic (is-a) relations and
calculates similarity based on the length of the path. For example assume that “car is-a
vehicle” and “bus is-a vehicle”. In this case a path of length 1 (1 intermediate node)
exists, “car-vehicle-bus”, so the score according to this method would be 1/1 = 1. By
contrast, the score for “cat” and “mouse” would be 1/4 because “cat is-a feline”, “feline
is-a carnivore”, “carnivore is-a placental mammal”, “rodent is-a placental mammal”,
“mouse is-a rodent” (4 intermediate nodes).
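A minimal sketch of this measure (our own implementation over a toy list of is-a edges; the score is the reciprocal of the number of intermediate nodes on the shortest connecting path):

```python
from collections import deque

def path_similarity(is_a, a, b):
    """Score = 1 / (number of intermediate nodes) on the shortest path
    through the undirected is-a graph; 0 if the concepts are unconnected."""
    adj = {}
    for child, parent in is_a:           # build undirected adjacency
        adj.setdefault(child, set()).add(parent)
        adj.setdefault(parent, set()).add(child)
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:                      # breadth-first search from a
        node, dist = frontier.popleft()
        if node == b:
            return 1.0 / max(dist - 1, 1)   # dist - 1 intermediate nodes
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return 0.0
```

On the examples above, “car”/“bus” score 1 and “cat”/“mouse” score 1/4.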
Probably the simplest “gloss-based” approach is described in (Lesk, 1986). This
method calculates the similarity of two concepts simply by counting the number of
common words between the definitions of the concepts and assigns that count as a
score.
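A sketch of this counting scheme (the stop-word list here is only a placeholder, not Lesk’s original one):

```python
def lesk_overlap(gloss_a, gloss_b,
                 stopwords=frozenset({"a", "an", "the", "of", "by", "and", "with"})):
    """Simple Lesk score: number of distinct non-stop-words shared by
    the two glosses."""
    words_a = set(gloss_a.lower().split()) - stopwords
    words_b = set(gloss_b.lower().split()) - stopwords
    return len(words_a & words_b)
```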
The papers describing these approaches report reasonable results; however, they
are outperformed by methods using a hybrid approach.

Figure 2.1: Simple taxonomic relationships in WordNet

One approach which is basically structural is described in (Navigli and Velardi, 2003).
In their paper they describe
augmenting WordNet with an explicit “gloss” relation. This relation is created for
each non-stop-word appearing in the definition of a concept. For example, since the
gloss of “car” is “four wheel motor vehicle, usually propelled by an internal combustion
engine”, explicit relations of type “gloss” are added starting from “car” and ending on
“wheel”, “vehicle”, etc. Note, however, that the target terms need to be disambiguated
before such a relation can be added. After augmenting the ontology this
method scores terms based on standard connection-path measures (actually the number
of connecting paths is used). Although the results reported are very promising,
the exact process of the gloss disambiguation step is rather unclear in the paper, so we
could not reproduce the results.
An alternative method, which does not need to disambiguate the terms of the glosses,
is described in (Banerjee and Pedersen, 2003); it is based on (Lesk, 1986) and has been
successfully tested on standard SENSEVAL conference data. This method slightly
changes the overlap score to strongly favour phrases and collocations. Instead of sim-
ply counting the number of overlapping words, if an overlapping phrase (more than
one contiguous word) is detected then the similarity score is much higher. More pre-
cisely, when an overlap between the definitions is detected the Lesk score adds 1 to
the overlap score, but the extended overlap score adds n² where n is the length of the
overlap. No attempt is made, however, to use a language model.
The most important difference of this method compared to Lesk is that it takes into
account the overlap of related concepts as well. More specifically, the formula calcu-
lating the relatedness of two concepts is:

relatedness(A, B) = score(gloss(A), gloss(B)) + score(hype(A), hype(B)) + score(hypo(A), hypo(B)) + score(gloss(A), hype(B)) + score(hype(A), gloss(B))

where A and B are the concepts whose relatedness is being measured, score(A, B)
is a function that returns the overlap score of two strings, gloss(X) returns the defini-
tion of X, hype(X) returns a concatenated string of the definitions of all hypernyms of X,
and hypo(X) does the same for hyponyms.
Note that other combinations of relations might be used, but the experiments described
in the paper concluded that this combination produced the best results.
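A sketch of the extended overlap scoring and of the relatedness formula above (our own implementation; `gloss`, `hype` and `hypo` are caller-supplied functions, as in the formula):

```python
def phrase_overlap_score(gloss_a, gloss_b):
    """Extended-Lesk scoring: each maximal shared phrase of n consecutive
    words contributes n**2 instead of n."""
    a, b = gloss_a.lower().split(), gloss_b.lower().split()
    score = 0
    while True:
        best = None                       # (length, start in a, start in b)
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                    k += 1
                if k and (best is None or k > best[0]):
                    best = (k, i, j)
        if best is None:
            return score
        n, i, j = best
        score += n * n
        a = a[:i] + a[i + n:]             # remove the matched phrase so it
        b = b[:j] + b[j + n:]             # is not counted twice

def extended_lesk(concept_a, concept_b, gloss, hype, hypo):
    """relatedness(A, B) as in the formula above; hype/hypo return the
    concatenated definitions of all hypernyms/hyponyms."""
    return (phrase_overlap_score(gloss(concept_a), gloss(concept_b))
            + phrase_overlap_score(hype(concept_a), hype(concept_b))
            + phrase_overlap_score(hypo(concept_a), hypo(concept_b))
            + phrase_overlap_score(gloss(concept_a), hype(concept_b))
            + phrase_overlap_score(hype(concept_a), gloss(concept_b)))
```

A three-word shared phrase thus contributes 9 rather than 3, which is what makes collocations dominate the score.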
2.5 Ontology based Word Sense Disambiguation
Word sense disambiguation (WSD) refers to the process of selecting the correct sense
of a word from a set of possible senses or, in terms of ontologies, of mapping a term to
the correct unique concept. Several state-of-the-art algorithms can be found in the
SENSEVAL conference (http://www.senseval.org).
One category of WSD algorithms uses semantic similarity measures such as the
ones described in the previous section. Indeed, an algorithm for WSD was the motivation
for calculating semantic similarity in (Banerjee and Pedersen, 2003).
In (Banerjee and Pedersen, 2003), a window around the target word is selected, and
for each word in that window a set of candidate senses is identified. The algorithm is
outlined as follows:

1. For each CANDIDATESENSE of the target word, set SENSESCORE[CANDIDATESENSE] = 0
2. For each CANDIDATESENSE of the target word:
   2.1 For each CONTEXTWORD in the window:
       2.1.1 For each CONTEXTWORD_SENSE of CONTEXTWORD:
           2.1.1.1 SENSESCORE[CANDIDATESENSE] += score(CANDIDATESENSE, CONTEXTWORD_SENSE)
3. Select the sense with the maximum SENSESCORE
Note that each sense decision is taken independently. The complexity of this
algorithm is O(n·m²), where n is the number of words considered and m is the max-
imum number of senses per word. Although the complexity is polynomial in both
terms, this algorithm is very slow when a large context window is used (on our system
we managed to process only a few queries with 100 context words per day).
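The steps above can be sketched as follows (our own function names; `senses_of` and `score` stand in for the sense inventory and the semantic similarity measure):

```python
def disambiguate(target_senses, context, senses_of, score):
    """Pick the sense of the target word that is most related, in total,
    to all senses of the surrounding context words."""
    best_sense, best_score = None, float("-inf")
    for cand in target_senses:                      # steps 1-2
        total = 0.0
        for word in context:                        # step 2.1
            for ctx_sense in senses_of(word):       # step 2.1.1
                total += score(cand, ctx_sense)     # step 2.1.1.1
        if total > best_score:
            best_sense, best_score = cand, total
    return best_sense                               # step 3
```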
An alternative approach for WSD using ontologies (WordNet) is described in (Mihalcea
et al., 2004). In their approach they treat the ontology as a graph (network)
and use Pagerank (Page et al., 1998) to disambiguate senses over that network. The
Pagerank algorithm was originally designed to perform link analysis in web pages and
detect the most important pages. The basic idea behind Pagerank is that, if there is a
link from page A to page B, then the author of A is implicitly conferring some impor-
tance to page B. More specifically, A confers some of its own importance to B; thus
if A is important then B will also become important, but if A is not so important
then B will only slightly benefit from the link from A. Thus importance is defined
recursively and the algorithm runs in several iterations until convergence. Initially all
pages have the same importance but after each iteration importance is concentrated in
specific pages. An alternative way to view Pagerank is that it roughly expresses the
probability of a random web surfer being at a specific page.
Pagerank has proven to be very successful when applied to web pages; whether the
analogy carries over to the concepts of an ontology for disambiguation purposes
is examined in the referenced paper. To successfully apply Pagerank to WordNet some
relationships were pruned and a few more were added. Moreover, the outputs of Pagerank
are then mixed with the outputs of the Lesk algorithm described earlier. Although the
results reported are very good, it seems that this approach is tailored to WordNet, and
extending it to other ontologies might require pruning/adding relations.
In the methodology chapter we describe a very similar approach. However, to avoid
tailoring the ontology to the needs of disambiguation we do not use the original rela-
tions as edges of the graph. We instead create a fully connected graph where the edges
are weighted based on a similarity measure as in (Banerjee and Pedersen, 2003). This
way we incorporate the extended Lesk measure before running Pagerank rather than
doing sophisticated ranking merging after Pagerank. Moreover, this method is less
sensitive to the density of relations in the original ontology. Our approach does,
however, require an adaptation of Pagerank to handle weighted edges (links) and prior
probabilities. We based that adaptation on (Haveliwala, 2002) and used the JUNG
Java library (http://jung.sourceforge.net) in our implementation.
Chapter 3
Methodology
3.1 Probabilistic query expansion
A prerequisite step for the methods described later is the use of a probabilistic method
to extract representative terms from documents. We used Local Context Analysis
(LCA) but found that it has two arguably unwanted properties. First, LCA does not
take into account the number of documents in which a term appears; thus, if there is
good evidence of correlation to the query terms in only a few documents, LCA will
include that term. Secondly, as we show in the evaluation and results chapter, the
quality of the terms suggested by LCA drops significantly as we consider lower-ranked
concepts. This is successfully dealt
with within LCA by gradually lowering the weight of lower ranked terms. However,
this weighting scheme seems to be appropriate when terms are used for expansion
but perhaps a different weighting scheme would be more appropriate when terms are
used for disambiguation. In this thesis, we decided not to incorporate any weighting
scheme in subsequent steps. Perhaps this is a flaw of our approach; however, selecting
an appropriate weighting scheme for each usage of the terms is a difficult task, and we
felt that it is part of an optimisation and fine-tuning process, whereas in this thesis we
explored whether the approaches are useful in the first place.
To compensate for these probably unwanted properties of LCA and to create a
diverse set of probabilistic methods for subsequent steps we explored several other
methods. More specifically, we propose a method that has entirely different properties
Chapter 3. Methodology 22
from LCA. The hypothesis behind the proposed methods is the following:
H2: “Terms with a near-uniform distribution (high entropy) of term frequencies in the top documents returned by an initial query are good expansion terms.”
The motivation behind the proposed hypothesis is that terms that appear with a uniform
distribution across the documents will be good expansion terms, or at least their
inclusion will not seriously hurt the query. Using this method, terms that constantly
appear with low frequencies are also considered for expansion, although they are usu-
ally overlooked by standard tf·idf measures. Moreover, because entropy is used,
the terms are less likely to create a query drift; they will give no specific direction to
the query but rather simply make the subject of the documents retrieved by the initial
query more dominant.
To test this hypothesis, and to obtain some diverse probabilistic methods for the next
steps, we developed three methods which we called “ENT”, “TST” and “MIX”.
ENT sorts the terms according to the entropy of their frequency distribution. More
specifically, the frequency distribution vector [tf_1, tf_2, ..., tf_n] (where tf_i is the
term frequency of the term in document i) is normalised, and each entry then represents
an estimate of P(document|term). P(document|term) expresses the probability of getting
this document as the first document if we searched by that term within the documents
returned by the initial query. This method attempts to find terms that are less
discriminative and thus have roughly the same probability (= 1/number_of_documents)
for most documents.
As we show in the evaluation and results chapter, this method produced much better
results than (unweighted) LCA in almost any setup. However, the terms selected are
not always semantically related to the queries. The term “said” appeared in the top
places in almost any query when searching the AQUAINT corpus (mostly newspaper
articles) and the terms “home”, “site”, “search”, “contact” etc. appeared in almost any
query when searching the web. This effect is magnified as more documents are used.
We could treat these words as stop words and thus hand-code their exclusion from
expansion; however, we felt that they reveal an inherent property of the method, so we
experimented with two other methods as well. Nevertheless, note that it could be
argued that there is no need to actually exclude those terms: although they are not
actually related to the query, they do not harm it either.
TST multiplies the score of ENT by the standard tf·idf measure, leading to the
equation

TSTscore(term, docs) = ENTscore(term, docs) · tf(term, docs) · idf(term)        (3.1)
MIX combines the scores of LCA and ENT. After conducting some experiments we
settled on a value of 0.01 for the mixing factor; thus the final formula for MIX is:

MIXscore(term, docs) = LCAscore(term, docs)^(1−0.01) · ENTscore(term, docs)^0.01        (3.2)
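The three scores can be sketched as follows (our own function names; we additionally assume that tf(term, docs) in equation 3.1 denotes the total term frequency over the top documents, which is an assumption on our part):

```python
import math

def ent_score(tf_vector):
    """Entropy of the normalised per-document frequency distribution of a
    term; a uniform spread across documents scores highest."""
    total = sum(tf_vector)
    if total == 0:
        return 0.0
    probs = [tf / total for tf in tf_vector if tf > 0]
    return -sum(p * math.log(p) for p in probs)

def tst_score(tf_vector, idf):
    # TSTscore = ENTscore * tf * idf (equation 3.1)
    return ent_score(tf_vector) * sum(tf_vector) * idf

def mix_score(lca, ent, mix=0.01):
    # MIXscore = LCAscore^(1 - mix) * ENTscore^mix (equation 3.2)
    return (lca ** (1 - mix)) * (ent ** mix)
```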
We should also mention that, following the experiments described in the evaluation
and results chapter (figure 5.2), we settled on using the top 80 documents returned by
the initial query and expanding with 20 terms.
3.2 Ontology based Word Sense Disambiguation
To disambiguate the query terms we relied on the following hypothesis:

H3: “The correct senses of query terms will be more important in a network which has as nodes the terms extracted by a probabilistic method and as edges the semantic similarity between those terms.”
The network we used contained as nodes all the possible concepts of the top-n terms
returned by a probabilistic method.
Note that we did not actually add all the possible senses of each of the top-n terms but
conducted some pruning. Each concept in the ontology is assigned a list of compatible
parts of speech for its terms; thus we add a possible sense for a term only if the
term appeared in the documents with a compatible part of speech. For example, the
term “are” maps to both the verb “are” and the noun “are” (the area measure). If the
term “are” was not encountered as a noun in the initial documents then the concept of
the noun “are” is not included in the network. In a small experiment we found that this
pruning did not affect the results in any significant way; nevertheless, it improved the
time efficiency of the calculations.
The graph is fully connected because we add edges for each pair of concepts.
The edges were weighted according to the extended Lesk semantic similarity measure
(Banerjee and Pedersen, 2003).
The actual value of n (the number of top terms used) was determined from experiments
described in the evaluation and results chapter. Note, however, that we conducted no
experiments with more than 100 terms because evaluating the similarity of each possible
pair took a significant amount of time: only about 2 similarity evaluations per second
were possible on our system. Considering 100 terms for each query usually corre-
sponded to about 120 concepts, and thus 7,140 similarity calculations per query, which
translated to 3,570 seconds, i.e. almost an hour per query. Needless to say, this is not
an acceptable amount of time for any on-line system, but these calculations can
be performed off-line. In our implementation we used caching, and that significantly
improved performance.
To measure the importance of nodes we used an adapted version of Pagerank that
works with weighted edges and priors. The edges were weighted as described before
with the extended Lesk measure. We did not assign uniform priors, and we experimented
with several values of the prior weight (the beta parameter in JUNG library
terms). The beta parameter expresses the percentage of the final importance that is
determined by the prior distribution: beta = 100% means that the final importance
measure is determined entirely by the prior importance, while beta = 0% means that
the prior importance has no effect on the final importance measure.
If we assigned a very high beta, near 100%, then only the query terms would be taken
into account: only mutual disambiguation of the query terms would occur and the rest
of the terms would not contribute at all. If we assigned a uniform prior (or beta = 0%)
then the main importance weight might shift away from the terms we want to disambiguate.
Note that a uniform prior is preferable to a low beta because the latter
might prevent the algorithm from converging. Although we would expect the main
share of importance to be focused on the query terms regardless of the beta factor,
this proved not to be the case, especially when few of the top-n terms are used. For
example, in the query “animal protection”, when Pagerank is used with a low beta the
most important concept turns out to be “city”. That was because several cities were
mentioned, each with a strong semantic relation to “city” and no other strong links;
thus “city” gathered most of their weight. We describe the effect of the beta factor and
of the number of terms in the evaluation and results chapter.
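A plain-Python illustration of Pagerank with weighted edges and a prior distribution (this is a sketch in the spirit of the JUNG adaptation, not its actual implementation; the beta default is a placeholder):

```python
def weighted_pagerank(nodes, weight, prior, beta=0.3, iters=100):
    """Pagerank over a weighted graph with a prior (personalisation)
    distribution. beta is the share of the final importance taken from
    the prior; weight(u, v) returns the edge weight from u to v."""
    rank = {n: 1.0 / len(nodes) for n in nodes}
    # total outgoing weight of each node, used to normalise the flow
    out = {u: sum(weight(u, v) for v in nodes if v != u) for u in nodes}
    for _ in range(iters):
        new = {}
        for v in nodes:
            # importance flowing into v from every other node
            flow = sum(rank[u] * weight(u, v) / out[u]
                       for u in nodes if u != v and out[u] > 0)
            new[v] = beta * prior.get(v, 0.0) + (1 - beta) * flow
        rank = new
    return rank
```

With beta near 1 the prior dominates; with beta near 0 the ranking is driven almost entirely by the weighted edges, which is where the “city” effect described above appears.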
An important issue we came across was dealing with words that did not appear in
the ontology. We could choose not to include those terms in the network. However, in
some cases some of the query terms were absent from the ontology. Thus we decided
to actually put these terms in the network; to measure similarity for them we used the
cosine similarity of their term frequencies in the documents. More specifically, for each
term we created a vector [tf_1, tf_2, ..., tf_n] where tf_i is the term frequency of the
term in document i. The documents used were the same top-ranked documents used
for the probabilistic methods. We then calculated the cosine of each pair of vectors and
assigned that as a semantic similarity score (a cosine of 1 means identical distribution
and thus maximum similarity). We had no scaling issues because Pagerank normalises
the weights per concept.
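A sketch of this fallback similarity (a cosine of 1 meaning identical distribution across the top-ranked documents):

```python
import math

def cosine(u, v):
    """Cosine of two term-frequency vectors; 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```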
Another issue we came across was that the terms describing some concepts were
actually phrases consisting of more than one word. This was not frequent, so we used
a simple approach to tackle it: we simply checked whether any of the top-10 most
similar terms according to the cosine similarity measure formed a multi-word term
describing an existing concept in the ontology. We did not formally evaluate this
approach; however, from a short inspection it proved to have high recall but lower
precision: it detected all the terms we could manually spot plus some we had missed,
but it also included some irrelevant concepts (this was expected, as the actual locations
of terms in the documents were not taken into account). These irrelevant collocations,
though, had no significant effect on the performance of the system.
3.3 Re-ranking of expansion terms based on ontologies

The final step of our suggested method is to re-rank the terms suggested by a proba-
bilistic method using information drawn from the ontologies, according to the following
hypothesis:

H1: “A hybrid query expansion method that re-ranks query expansion terms suggested by a probabilistic method based on relatedness drawn from the ontology outperforms the original probabilistic method.”
We implemented the re-ranking by calculating a boosting score from the ontology and
mixing that score into the probabilistic score according to the formula:

score(term) = probabilisticScore(term)^(1−mix) · boost(term)^mix        (3.3)
We did not consider all terms for re-ranking, but rather used only the top-n terms
proposed by the probabilistic method. We describe the results for the various values of
n in the evaluation and results chapter.
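Equation 3.3 applied to the top-n probabilistically suggested terms can be sketched as follows (the mix value shown is a placeholder, not a trained parameter):

```python
def rerank(prob_scores, boost, mix=0.2, top_n=None):
    """Re-rank terms by mixing the probabilistic score with an
    ontology-derived boosting factor (equation 3.3). Only the top-n
    probabilistic terms are considered, as described above."""
    items = sorted(prob_scores.items(), key=lambda kv: kv[1], reverse=True)
    if top_n:
        items = items[:top_n]
    rescored = {t: (s ** (1 - mix)) * (boost(t) ** mix) for t, s in items}
    return sorted(rescored, key=rescored.get, reverse=True)
```

With mix = 0 the original probabilistic ranking is returned unchanged; larger values let the ontology reorder the list.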
To derive the boosting factor we used three methods:
• the first method derives the boosting factor of a concept according to the relation
of that concept to the query concepts in the various ontology hierarchies. Specific
types of relations, namely children, parents and siblings, get different boosting factors,
and these boosting factors were trained on a set of training queries.
• the second method derives the boosting factor of a concept based on the importance
of its position in the various hierarchies. In a nutshell, if, in the context of the
specific query, there is an indication of a concentration of concepts under a specific
node of a hierarchy compared to the concentration under that node in the context of
the whole document collection, then all children of that node are boosted; otherwise
they are penalised.
• the final method does not use hierarchies but is based on creating a network similar
to the one proposed for disambiguation: a fully connected network where concepts
are nodes and edges are weighted according to semantic similarity drawn from the
ontology. Pagerank is run to determine the boosting score of each concept; important
concepts get a higher boosting score.
In all methods we have to deal with two specific issues. The first issue concerns
terms that do not map to an ontology concept, and the second concerns how to handle
terms when more than one term maps to a single ontology concept.
The first issue was quite common when the probabilistic method used was LCA,
because the latter tends to select rare terms; rare person and location names are com-
monly encountered in the results of LCA. However, concepts not covered in the on-
tology are quite common for all methods. We tested several methods for deriving a
boosting score for unknown concepts. The obvious method is to select a boosting fac-
tor of 1 for such concepts; however, this led to a bias towards these terms because
the boosting factors are usually less than 1. A less biased method is therefore to assign
unknown terms the average boosting factor assigned to terms that appear in the ontol-
ogy. An alternative method is to assign those terms the minimum boosting factor of
the known concepts. This causes a deliberate bias towards ontology concepts, which
might be desired in some cases. In the first approach we trained this score, that is,
all ontology concepts and non-ontology concepts get a prior score simply based on
whether the term maps to an ontology concept or not; we describe how we trained this
score in the next section. In the rest of the methods we used the intermediate approach
of assigning unknown concepts a boosting factor of (boosting_average + boosting_minimum) / 2.
The second issue was less common but proved to significantly affect the results.
By using ontologies we can derive a boosting factor for a concept; however, at some
point we must use that score to derive a boosting factor for actual terms. The issue
here is that more than one term might map to the same concept, and a term might map
to more than one concept. When a term maps to more than one concept we assign it
the sum of the boosting factors in the first approach, and the maximum boosting factor
in the remaining approaches. More sophisticated methods could be used to favour terms
that express more than one concept or to penalise terms that express penalised
concepts; however, in this thesis we followed the simple approaches of summing and
taking the maximum.
When a concept is expressed by more than one term (synonyms) we followed the
simple approach of assigning the concept’s boosting factor to the first of those terms
(according to the probabilistic method) and the minimum boosting factor to the rest.
This might not be the desired method in some cases because expanding with synonyms
is one of the most commonly used expansion methodologies. However, note that there
is a distinction between expanding with synonyms of the query terms and expanding
with synonyms of the expansion terms. Although the former method makes the query
terms more dominant in the query the latter method simply makes some expansion
terms more dominant. Actually, in our setting, where only 20, not weighted expan-
sion n terms are used, the expansion concepts might become even more dominant than
Chapter 3. Methodology 28
actual query terms in the final query. Thus by assigning the concept’s boosting factor
only to the first term and minimum factor to the rest of the terms we get more diverse
expansion terms.
3.3.1 Boosting based on relation to query concepts
For this method we used the relations of concepts to the query concepts in hierarchies
drawn from the ontology. We used only relations that form hierarchies because they
allow the definition of specific types of relations.
The types of relations to query terms we used were:
• synonyms: terms that match to the same concepts as query terms.
• parents: parents of query terms in the hierarchy
• children: terms that match to direct children of query concepts.
• children-subtree: terms that match to concepts in the sub-trees under children.
• siblings: terms that match to concepts that share the same parent with query
concepts.
• siblings-subtree: terms that match to concepts in the sub-trees under siblings.
The hierarchies in which we search for these relations are the is-a and part-of hierar-
chies, although other relations forming hierarchies might be used as well. Note that
we did not use the general part-of (i.e. “meronym”) relation in WordNet, as it mixes
the different types of part-of. Thus for WordNet we used the following hierarchies:
• is-a (holo/mero): taxonomic hierarchy
• part-of (member sense) hmem/mmem: professor is-a-member-of staff
• part-of (substance sense) hsub/msub: tears are-made-of water
• part-of (all other senses) hprt/mprt: China is-part-of Asia, an amusement park has
rides, etc.
Note that these hierarchies are not trees because ontologies allow multiple inheritance;
indeed, multiple parents are encountered frequently in WordNet.
For each hierarchy and each relation we estimated the appropriate boosting factor
using Rocchio’s ideal query. That is, we traversed the index of all documents and for
each query and each term we used equation 2.3 to calculate the weighting factor
for that particular term in the ideal query. Note that this weighting factor can be
negative if the term is encountered more frequently in irrelevant documents. The
relevance judgements used were the actual relevance judgements used to evaluate the
queries. To estimate the boosting factor for a specific relation in a specific hierarchy
we use the average weight of all terms having that specific relation to the query terms.
Note that we used manually disambiguated query terms to make sure that the relation
to the correct query concept is considered. The trained parameters are summarised in
the following table:
                     parents     children    children-subtree    siblings    siblings-subtree
is-a                 0.024772    0.031324    0.045891            0.010591    0.019223
part-of (member)    -0.013287    0.711712    0.018197            0.005665
part-of (substance)
part-of (rest)       0.089481    0.473094    0.001387            0.064584   -0.001692

OTHER                synonyms    Mapping to ontology concept     Not in ontology
                     0.571086   -0.002474                        -0.002785
Note that the missing cells were not considered significant because of low counts.
Also note that, along with the average weights for specific relations, we evaluated
the average weight of terms that mapped to ontology concepts and of terms that did
not map to ontology concepts. In this method we used those parameters to determine
the boosting factor of concepts not covered in the ontology.
To derive the boosting factor for each concept we use the equation:

boost1(term) = ∑_{c ∈ concepts(term)} ∑_{q ∈ Q} ∑_{h ∈ H} ∑_{r ∈ relations(h,c,q)} score(h, r)        (3.4)
where concepts(term) returns all the concepts mapping to the term, Q is the set of
disambiguated query concepts, H is the set of hierarchies used, and relations(h, c, q)
returns all relations of concept c to query concept q in hierarchy h, that is, parents if c
is a parent of q in h, children if c is a child of q in h, etc. Finally, score(h, r) returns the
value of the corresponding cell in the table presented above, or zero if that cell is
empty.
boost2(term) adds to boost1(term) the appropriate values from the last rows of the
table: the score for synonyms if the term maps to a query concept, and the score for
either mapping or not mapping to an ontology concept, depending on whether the term
actually matches any ontology concept.
To ensure that the boosting factor is greater than zero, we find the term with the
minimum boosting factor and subtract its value from boost2(term). Thus, supposing
that the minimum boosting factor is minBoost2, the final equation is:

boost(term) = boost2(term) − minBoost2        (3.5)
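A sketch of equations 3.4 and 3.5 using the trained table (only the is-a row is included here for brevity; all function and variable names are our own, and `relations(h, c, q)` is assumed to return relation names such as “parents”):

```python
# Trained cell values from the table above (is-a row only; the other
# hierarchies would be filled in the same way).
SCORE = {
    ("is-a", "parents"): 0.024772,
    ("is-a", "children"): 0.031324,
    ("is-a", "children-subtree"): 0.045891,
    ("is-a", "siblings"): 0.010591,
    ("is-a", "siblings-subtree"): 0.019223,
}
OTHER = {"synonyms": 0.571086, "in-ontology": -0.002474,
         "not-in-ontology": -0.002785}

def boost_terms(terms, concepts_of, relations, query_concepts, hierarchies):
    """Equations 3.4-3.5: sum the trained relation scores over every concept
    of a term (boost1), add the synonym/coverage scores (boost2), then shift
    everything so the minimum boosting factor becomes zero."""
    def boost2(term):
        cs = concepts_of(term)
        total = sum(SCORE.get((h, r), 0.0)            # boost1; empty cells = 0
                    for c in cs for q in query_concepts
                    for h in hierarchies for r in relations(h, c, q))
        if any(c in query_concepts for c in cs):      # synonym of a query concept
            total += OTHER["synonyms"]
        total += OTHER["in-ontology"] if cs else OTHER["not-in-ontology"]
        return total
    raw = {t: boost2(t) for t in terms}
    min_b = min(raw.values())
    return {t: b - min_b for t, b in raw.items()}     # equation 3.5
```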
3.3.2 Boosting based on importance measure drawn from hierarchies
We used two methods to derive boosting factors from hierarchies and combined them to
a final boosting score for each concept. The first method focuses on detecting important
paths and the second focuses on discovering important nodes according to the density
of the subtree under a specific node. Both methods use hierarchies drawn from the
ontology. The same hierarchies as described in the previous section were used.
Note that, as described in the evaluation and results chapter, this method produced
very poor results. That was because of disambiguation errors (we did not disambiguate
the terms before adding them to the hierarchy). Nevertheless, the justification for
including this method in this thesis is, apart from commenting on its failure as a query
expansion method, its success in detecting how wrong senses of words can combine
in unpredictable ways. This property makes it suitable for building an error analysis
tool, as we explain in the implementation chapter.
In the hierarchies used we define two notions, which we name “path” and “density”.
For each concept in the hierarchy we define as “path” any path consisting of the concept
and all its parents up to the root of the hierarchy. “Density” is defined in relation to
a list of concepts L: the “density” of a concept is the number of times the concept or
one of its children appears in the list, divided by the length of the list.
Using the notion of density we can weight each node in a hierarchy. In fact, both
methods rely on comparing two weighted versions of the hierarchies. The first
version, which we will refer to with the subscript “prior”, is weighted according to
counts gathered from the whole document collection (the density is defined in relation
to a concept list L created by concatenating the concepts that appear in all documents).
The second, which we will refer to with the subscript “query”, is weighted according
to the concepts returned by a probabilistic expansion method.
The two methods for deriving importance from the hierarchies express the follow-
ing:
• if a higher than expected concentration of concepts is detected under a concept X
in the context of the query, then all children of X are boosted; otherwise all children
of X are penalised. For example, if in the context of the query we encounter
many specific animals, then the concept of animal is probably important and all
its children (animals) are boosted.

• if in the context of the query we encounter a concept and also encounter its
parents, then the concept is boosted; otherwise it is penalised. The amount of
boosting or penalising depends on the probability of encountering the parent
in the whole document collection: if the parent of the encountered concept is
frequent in the whole collection, then the absence of that concept in the context of
the query leads to greater penalisation. For example, if in the context of the
query we encounter many animals but do not encounter the term “animal”, then
according to this criterion all animals are penalised (although according to the
previous criterion they were boosted). If, on the other hand, we encounter a single
specific animal and the term “animal”, then that specific animal is boosted.
A concept might appear in more than one location in the hierarchies, and each location
is defined by a unique path. Thus there are several ways to derive a unique boosting
factor for a concept by combining the boosting factors of its paths. We simply selected
the maximum boosting factor.
3.3.2.1 Density Measure
The main goal of this measure is to detect higher than expected concentrations of
concepts in specific areas of the hierarchies. For each node in the ontology we use the
notion of the probability of a child given the parent, P(child|parent). This probability
determines a density distribution over the children of the specific parent. The expected
density of a child is:

Pprior(child|parent) = densityprior(child) / ∑_{c ∈ children(parent)} densityprior(c)   (3.6)

where children(X) is a function returning the set of children of concept X in the
hierarchy and densityprior(X) returns the number of times X or one of its children is
encountered in the whole document collection. The distribution in the context of the
query is defined by:

Pquery(child|parent) = densityquery(child) / ∑_{c ∈ children(parent)} densityquery(c)   (3.7)

Note that for estimating Pprior(child|parent) we did not consider the probability of
encountering a child given that we have encountered the parent in the same document,
but used the probability of encountering a child in any document. Estimating the
distribution only from documents containing the parent would presumably be more
appropriate; however, in our rather small document collection this led to very sparse
counts.
To estimate the boosting factor we consider the difference between the distributions
and define the boosting factor of a child as:

boostdensity(child|parent) = 1 + Pquery(child|parent) − Pprior(child|parent)   (3.8)

To estimate the boosting factor of a path we simply take the average of the boosting
factors of the concepts in the path:

boostdensity([root, c1, c2, ..., concept]) = avg([boost(c1|root), ..., boost(ci|ci−1), ..., boost(concept|cn−1)])   (3.9)

where [...] defines a list, [root, c1, c2, ..., concept] is the path considered, avg returns
the average of a list, and boost(X|Y) is as defined in the previous equation.
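The density measure of equations 3.6-3.9 can be sketched as follows, assuming the prior and query densities are supplied as plain count functions; all names are illustrative rather than the thesis's actual code:

```python
def p_child_given_parent(child, parent, children, density):
    """Equations 3.6/3.7: the child's share of the density mass under its parent."""
    total = sum(density(c) for c in children(parent))
    return density(child) / total if total else 0.0

def boost_density_node(child, parent, children, d_query, d_prior):
    """Equation 3.8: 1 plus the shift between the query and prior distributions."""
    return 1.0 + (p_child_given_parent(child, parent, children, d_query)
                  - p_child_given_parent(child, parent, children, d_prior))

def boost_density_path(path, children, d_query, d_prior):
    """Equation 3.9: average node boost along the path [root, c1, ..., concept]."""
    boosts = [boost_density_node(c, p, children, d_query, d_prior)
              for p, c in zip(path, path[1:])]
    return sum(boosts) / len(boosts)
```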
3.3.2.2 Path importance measure
The main goal of this measure is to derive a boosting factor for a path that estimates
the significance of that path. If the parent is encountered in the context of the query,
the boosting factor should be greater than one; otherwise it should be less than one.
The amount of boosting should be smaller if the parent concept is common, and the
amount of penalising should be greater if the parent concept is common.

To derive this measure we use the quantity:

Pprior(concept) = density(concept) − ∑_{child ∈ children(concept)} density(child)   (3.10)

which expresses the probability of encountering the specific concept (and not any of
its children). If we encounter a specific concept in the context of the query, that
concept gets a boosting factor of boostpath(concept) = 2 − Pprior(concept). If we do
not encounter a concept in the context of the query, that concept gets a boosting
factor of boostpath(concept) = 1 − Pprior(concept).

To estimate the boosting factor of a path we simply multiply the boosting factors
of the concepts in that path.
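As a sketch, the path importance measure can be written as follows; the function names are illustrative assumptions:

```python
def p_prior_specific(concept, children, density):
    """Equation 3.10: density mass of the concept itself, with the mass
    of its children subtracted."""
    return density(concept) - sum(density(c) for c in children(concept))

def boost_path(path, in_query, children, density):
    """Multiply per-concept factors along the path: 2 - P for concepts
    encountered in the query context, 1 - P for concepts that are not."""
    factor = 1.0
    for concept in path:
        p = p_prior_specific(concept, children, density)
        factor *= (2.0 - p) if in_query(concept) else (1.0 - p)
    return factor
```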
3.3.2.3 Combined hierarchy measure
To get the boosting factor of a child in the hierarchy we simply multiply the path
measure by the density measure.
3.3.3 Boosting based on network importance measure
Alternatively, to derive the boosting score from the ontology we used the importance
as measured by the Pagerank algorithm described in the disambiguation section; for
selecting terms we ran Pagerank twice.

In the first run, terms are disambiguated and we obtain a P(j, i) for each possible sense
j of term i. Recall that in Pagerank P(j, i) roughly expresses the probability of a web
surfer staying at a specific node-page (j, i), and that this probability is used to measure
the importance of sense j in the network. In our context of semantic-similarity-based
networks it expresses the probability of stopping at a specific sense while randomly
moving through the senses.
In the previous section we were concerned with disambiguation, so we selected
the most important sense of each term, that is, sense_i = argmax_j P(j, i). In this section
we do not need to make a crisp discrimination between senses because, as we show
later in the evaluation, more than one sense is often equally appropriate (inter-
annotator agreement of only about 58%). Thus we use an amount proportional
to P(j, i) as the prior probability for a second run of Pagerank.
We did not use P1(j, i) directly but weighted it appropriately so that each term
gets equal prior weight in the final network. Intuitively this expresses that if a query
term is less important than another query term in the network created by the terms
suggested by the probabilistic method, then that query term is probably overlooked
by the suggested terms. In other words, an outweighting is probably going to occur
and we should probably focus more on the overlooked term. For example, in a query
about “history of skateboarding” most terms in the graph were about skateboarding
and very few of them about history, so expanding with the suggested terms caused the
results to move towards skateboarding shops, contests, etc. Thus we attempt to balance
the importance of terms by assigning different priors for the second run of Pagerank
(more for “history” and less for “skateboarding” in the previous example). Mathematically
this is expressed by assigning prior probabilities according to the equation:

PriorP2(j, i) = P1(j, i) × (∑_j ∑_k P1(j, k)) / (∑_j P1(j, i) × numOfQueryTerms)   (3.11)
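A minimal sketch of this rebalancing, assuming the first run's scores are stored as a dict keyed by (sense, term); the names are illustrative:

```python
def rebalanced_priors(P1, query_terms):
    """Equation 3.11: rescale the first run's scores so every query term
    carries equal total prior mass into the second Pagerank run."""
    total_mass = sum(P1.values())
    n = len(query_terms)
    priors = {}
    for term in query_terms:
        term_mass = sum(p for (sense, t), p in P1.items() if t == term)
        if not term_mass:  # term absent from the network: nothing to rescale
            continue
        for (sense, t), p in P1.items():
            if t == term:
                priors[(sense, t)] = p * total_mass / (term_mass * n)
    return priors
```

After rescaling, each query term's senses sum to the same prior mass (1/n of the total), so a term dominated in the first run, like “history” above, regains weight in the second.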
The second run of Pagerank calculates the actual boosting score for each term.
Note that for each run of Pagerank we can tune how significant the prior probabilities
are through the parameter beta. In the evaluation we describe the results for various
values of beta. We will call the beta parameter of the first run of Pagerank
disambBeta and the same parameter of the second run rankBeta.
Chapter 4
Implementation
In this section we describe which modules were used, how we implemented the
methodologies described in the previous chapter and, finally, the details of some
analysis tools we developed. Note that we implemented the methodologies so that the
system can be used in two ways:
• Interactive version: This version provides a web page. In this web page the actual
documents returned by an initial query and the expansion terms proposed by the
several methodologies are displayed. This version was developed to examine
the feasibility of actual deployment of the proposed methods in terms of time
efficiency and most importantly to create a framework for exploratory evaluation
using any queries.
• Batch processing version: This version automates the procedure of evaluating
the various expansion methodologies. The input of this version is a topics file
describing the queries and a file containing relevance judgements. The output of
this version consists of files describing the performance of the system.
4.1 Modules used
As a search engine we mainly used the open-source, Java-based Lucene search engine
(http://lucene.apache.org), but we designed the system in a modular fashion so that
any other system can be used as well; we actually used the Google API for exploratory
research and for confirmation of our results.
Our ontology was WordNet (http://WordNet.princeton.edu) version 1.7.1. To derive
similarity measures we used the WordNet::Similarity Perl package
(http://wn-similarity.sourceforge.net/). Other ontologies can be used with this package,
but note that WordNet::QueryData (a simple package reading WordNet files) would
need to be reimplemented for the new ontology; the rest of the algorithms will then
work with the new ontology.
For network importance algorithms we used the JUNG Java library (http://jung.sourceforge.net).
From this library we also used some graph visualisation tools for analysing and debug-
ging our methodologies.
The glue connecting those modules together and most of our code is written in
Python.
Finally, as a part-of-speech tagger we used TnT (http://www.coli.uni-saarland.de/~thorsten/tnt/)
trained on Wall Street Journal data.
4.2 Interactive version
The system architecture of the interactive version is sketched in figure 4.1.
In this figure the main modules are presented along with a process description
specifying the order by which the modules are used.
The usage scenario is the following:
• The user enters a web page similar to those of standard web search engines and
issues a query
• (1) search.py conducts an actual search using the appropriate API and the appro-
priate collection
• (2) the resulting documents of the initial query are immediately presented to the
user
• (3) At the same time the documents are passed to indexer.py. indexer.py translates
HTML pages to text using the Python SGML parser. In some cases of ill-formed
HTML documents this process might fail, so indexer.py passes the HTML document
through the W3C utility “tidy”, which rewrites the HTML file as a syntactically
correct XHTML document. Moreover, indexer.py tokenizes the text extracted from
the HTML documents and passes the tokens through a part-of-speech tagger, namely
TnT.
• (4) The tokenized documents are passed to lca.py, which, although named “lca.py”,
implements all probabilistic query expansion methods described in this thesis.
The output of lca.py is a list of the terms encountered in the documents along
with their scores according to the various probabilistic expansion methods.
• (5) The list of terms is passed to ontology.py, which implements the various
ontology-based re-ranking methods described in this thesis.
• (6) The suggested expanded queries are presented to the user for inspection and
the user might click on the query proposed by a specific method to see its results.
4.3 Batch processing version
To process large numbers of queries (TREC collections) we include a batch operating
mode, in which several instances of the batches can run on different machines to speed
up evaluation. The process consists of the following stages, implemented by
different Python modules which must be run in the specified order:
• search.py: reads topics (queries) from files in TREC format and creates
.search files containing the top-100 documents returned by those queries, sorted
by their ranking, using the predefined search engine (either Google on the web or
Lucene on the AQUAINT corpus).

• index.py: reads the .search files and for each distinct document creates a .index
file containing the terms appearing in the document, their frequencies, their parts
of speech and their locations in the document.
• expand1.py: for each query it reads the top-n documents and creates a METHOD.
NUM_OF_DOCS.expand file containing the terms and their scores according to
the probabilistic method used (currently LCA, MIX, TST and ENT), sorted by
score. NUM_OF_DOCS refers to the number of top-n documents used.
• preparesimilarities.py: reads all .expand files (excluding .hybrid.expand files), gets
all concepts corresponding to those terms and calculates the semantic similarity of
each pair of terms. The results are stored in NUMBER.similarity files, where NUMBER
is an arbitrary number; each file contains one line per pair of concepts plus its similarity.
Note that this step takes an enormous amount of time to compute and that the semantic
similarity files are not specific to the queries, so they are stored in a different directory
and should not be deleted.
• disambiguate.py: does two things. First, for each query it creates a small .query
file containing one line per query term; each line includes the term along with all
its possible concepts. Second, for each METHOD.NUM_OF_DOCS.expand file it
creates a METHOD.NUM_OF_DOCS.NUM_OF_TERMS.rank file containing all
concepts ordered by their importance according to Pagerank. NUM_OF_TERMS
refers to how many of the terms in METHOD.NUM_OF_DOCS.expand were
actually used.
• expand2.py: for each METHOD.NUM_OF_DOCS.NUM_OF_TERMS.rank file
it creates a METHOD.NUM_OF_DOCS.NUM_OF_TERMS.hybrid.expand file
containing the concepts along with their scores as re-ranked by the ontology.
Information from the .expand file is used in this step as well.
• search2.py: creates a .METHOD.search file for each .expand file. This file con-
tains the top-100 documents returned by the expanded queries sorted by their
ranking.
• eval.py: reads document relevance information (qrels) in standard TREC format
and creates a tab-delimited text file containing the results of each expansion method.
Note that some documents might not be included in the qrels file; in that case the
documents are considered irrelevant and a .missing file is created containing one
line for each missing relevance judgement.
This whole process is automated by batch.py which, as stated before, can run on
several machines at the same time.
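The stage order above can be sketched as a simple driver; the module names follow the text, but the command lines are an assumption, not the actual batch.py:

```python
import subprocess

# Stage order of the batch pipeline, with the files each stage produces.
PIPELINE = [
    "search.py",               # topics -> .search (top-100 documents)
    "index.py",                # .search -> .index (terms, counts, POS, positions)
    "expand1.py",              # .index -> METHOD.NUM_OF_DOCS.expand
    "preparesimilarities.py",  # .expand -> NUMBER.similarity (cached, keep!)
    "disambiguate.py",         # .expand -> .query and .rank files
    "expand2.py",              # .rank -> .hybrid.expand (ontology re-ranking)
    "search2.py",              # .expand -> .METHOD.search (expanded retrieval)
    "eval.py",                 # qrels -> tab-delimited results (+ .missing)
]

def run_batch(python="python"):
    """Run every stage in order, stopping on the first failure."""
    for module in PIPELINE:
        subprocess.run([python, module], check=True)
```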
4.4 Various visualisation tools
In the context of this thesis we needed some case-based analysis of how the various
disambiguation and query expansion methods performed. To conduct this per-query
analysis we needed visualisation tools, and we developed two:
• Network visualisation tool: This tool is based on modules provided by the JUNG
Java library. Nodes are visualised as circles and edges as lines connecting these
circles. An important issue for network visualisation is the layout of nodes. We
left the exact layout algorithm used as a parameter of our system; any of the
JUNG layout algorithms can be used. We found no satisfactory way to visualise
weights of edges. However, visualising importance of nodes was already imple-
mented in JUNG. The diameter of the circle describing the node is proportional
to the importance of the node.
• Hierarchy visualisation tool: This tool was developed from scratch; its output
is an HTML page visualising a tree, similar to the trees displayed by most
browsers for XML documents. However, the standard XSL stylesheet used by
browsers restricted the parameters of the visualisation, so we implemented a
custom XSL stylesheet. This stylesheet takes an XML file and visualises it as an
HTML page with the following useful visualisation properties: not all attributes
of an XML node are displayed; the children of each node are sorted according to
the value of a specific attribute; and the color of the text describing a node is a
shade of gray proportional to the actual value of an attribute. Thus, to visualise
a hierarchy we created an XML file for that hierarchy. This XML contained the
name of each concept, the terms mapped to that concept and the value of the
boosting factor as calculated by the methodology described in the “Boosting
based on importance measure drawn from hierarchies” section of the methodology
chapter. Using our stylesheet, children were sorted according to the boosting score
and the font color visualising a concept was proportional to its boosting factor.
Thus concepts with a high boosting factor were visualised in an almost black
font color while concepts with a low boosting factor were visualised in an almost
white color.
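A sketch of emitting such a hierarchy XML file with ElementTree; the schema (one element per concept, with terms and boost as attributes) and the accessor callbacks are assumptions, since the actual format is not specified in the text:

```python
import xml.etree.ElementTree as ET

def hierarchy_to_xml(concept, get_children, get_terms, get_boost, parent=None):
    """Emit one element per concept, carrying its mapped terms and its
    boosting factor; children are ordered by descending boost, as the
    stylesheet displays them. Real concept names may need sanitising to
    be valid XML tag names."""
    attrs = {"terms": " ".join(get_terms(concept)),
             "boost": "%.4f" % get_boost(concept)}
    node = (ET.Element(concept, attrs) if parent is None
            else ET.SubElement(parent, concept, attrs))
    for child in sorted(get_children(concept), key=get_boost, reverse=True):
        hierarchy_to_xml(child, get_children, get_terms, get_boost, node)
    return node
```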
The network visualisation tool proved not to be very useful: after including more
than 20 nodes in the network it is very difficult to understand what is going on. The
hierarchy visualisation tool, on the other hand, proved to be surprisingly effective and
useful.
Chapter 5
Evaluation and Results
Our evaluation is mainly focused on evaluating query expansion methodologies. How-
ever, because performance of query expansion for ontological methods depends on
word sense disambiguation performance we also evaluated the performance of WSD
algorithms.
To evaluate the query expansion results we used the standard measure of relevance
in the top-n retrieved documents; that is, we counted how many of the top-n pages
retrieved by the query were actually relevant. The baseline system was querying
without query expansion, and the tested systems were the various query expansion
methodologies described in this thesis. As the value of n we used 20, that is, we evaluate
relevance in the top-20 documents. As an upper bound for expansion methods
we use Rocchio's methodology to derive the ideal query from the sets of relevant and
irrelevant documents as specified by the TREC query relevance data (the same data we
used for evaluating the score).
Along with this standard measure we used the non-standard but rather informative
measure of counting the number of queries for which there was a degrade in perfor-
mance after the expansion. That is, we calculated relevance at top-20 documents for
the unexpanded and the expanded query and considered the percentage of queries that
had a lower score when expanded. Needless to say, less is better for this score.
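The two measures can be sketched as follows; the function names and data layout are illustrative:

```python
def precision_at(ranked_docs, relevant, n=20):
    """Fraction of the top-n retrieved documents that are relevant."""
    return sum(1 for d in ranked_docs[:n] if d in relevant) / float(n)

def degrade_percentage(baseline_runs, expanded_runs, relevant_sets, n=20):
    """Percentage of queries whose top-n precision drops after expansion."""
    degraded = sum(
        1 for base, exp, rel in zip(baseline_runs, expanded_runs, relevant_sets)
        if precision_at(exp, rel, n) < precision_at(base, rel, n))
    return 100.0 * degraded / len(baseline_runs)
```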
To evaluate WSD performance, we simply calculated the number of query terms
that were correctly disambiguated. Note that we excluded unambiguous terms; we
calculated the performance of the disambiguation algorithms by considering only
ambiguous terms. This explains to some extent the low scores compared to those
reported in the literature, and our decision is justified as follows. The frequency of
unambiguous terms is an important measure of its own, and should perhaps be reported
independently. Including both ambiguous and unambiguous terms presumably gives
a better picture of the number of terms that should be expected to be correctly
disambiguated; excluding unambiguous terms presumably gives a more objective
picture of actual WSD performance.
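This scoring choice can be sketched as follows, with illustrative names; single-sense terms are dropped from both numerator and denominator:

```python
def wsd_accuracy(predictions, gold, num_senses):
    """Accuracy over ambiguous query terms only; unambiguous terms
    (a single sense in the ontology) are excluded entirely."""
    ambiguous = [t for t in gold if num_senses(t) > 1]
    correct = sum(1 for t in ambiguous if predictions.get(t) == gold[t])
    return correct / float(len(ambiguous))
```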
Our baseline for WSD is random assignment of senses to words, and the methods
tested are the one described in (Banerjee and Pedersen, 2003) and our proposed
method. To find the correct senses we asked 3 users to choose the appropriate
sense of each query term in the context of the query. The upper bound for WSD
is defined by the inter-annotator agreement, that is, the percentage of terms for which
all three users selected the same sense.
Both WSD and query expansion proved to be very sensitive to the parameters with
which we ran each method: choosing a very close value for a parameter could give
very different results. Because of this we preferred to display the results in
three-dimensional tables, where the x and y axes are parameters of the method and the
z axis is the performance. To visualise the z axis we used different shades of gray to
color the appropriate cell of the table. In all tables darker means better.
5.1 TREC tracks
For our experiments we used the queries and data of the TREC conference. Every
year several information retrieval tracks run at the TREC conference. Initially, a track
of ad-hoc information retrieval over large collections or snapshots of the web was run;
ad-hoc retrieval corresponds to general search over those large collections. However,
after 2001 the ad-hoc track was replaced by more specialised tracks. The terabyte
track focuses on the extension of IR techniques to huge collections. The HARD track
focuses on the extraction of passages and on targeted interaction with the user.
The web track focused on finding homepages (not the actual documents containing the
information but rather one-jump links to that information) and was also discontinued
in 2004. Finally, the robust track focuses on difficult queries.
From these options we chose to focus on the HARD track data. This decision
was driven by corpus availability issues but proved to be a good choice, because the
queries it contains are diverse and quite difficult without being as difficult as those of
the robust track. Moreover, the data collection of this track is neither huge (as in the
terabyte track) nor structured (as in the web track, which specialises in using the
structure of sites to find homepages). The data collection consists of newspaper and
magazine articles from the year 1999; from this data we used only the AQUAINT corpus.
The whole collection contains about 300,000 documents, which can be considered
relatively small. As for the queries, we must mention that the queries supplied for the
HARD track, which we used as a baseline, are far better than those supplied by
ordinary users, and query expansion is well known to work better when the queries are
poor. Thus our results can be characterised as rather pessimistic, which explains to
some extent why the actual improvements we obtained for the various expansion
methods are far smaller than those reported in the papers from which we derived the
approaches.
5.2 Ontology based word sense disambiguation
To evaluate WSD performance, we counted the number of query terms that were
correctly disambiguated, considering only ambiguous query terms. Our baseline is
random assignment of senses to words, and the methods tested are that of (Banerjee
and Pedersen, 2003) and our proposed method. The results are summarised in
figure 5.1.

To identify the correct senses of the terms we manually disambiguated the query
terms: we asked 3 users to provide the correct sense and selected it by majority vote.
This also defines an upper bound for our methods, namely the inter-annotator
agreement.
From figure 5.1 we can see that when no context terms are used and only mutual
disambiguation takes place, our proposed method is about 1% better than (Banerjee
and Pedersen, 2003), although the same similarity measures are used. This improvement
is to be attributed to not making independent decisions for each sense. In fact,
as the disambBeta parameter decreases and the effect of the prior importance lessens,
the performance improves. The performance when some context terms are used
depends on the probabilistic method that proposed the terms. Note that all methods
show a significant decrease in performance when only 10 context terms are used
and the disambBeta parameter is low, so that the prior importance is less significant and
the importance weight is free to move through the terms. Performance is restored for
the various methods after adding 30-40 terms and reaches its peak (for the measured
values) at 50 terms.
Each method seems to have a distinct preference for the value of the disambBeta
parameter. When context words are extracted from “ENT”, disambBeta = 0.5 seems
to produce better results. When context words are extracted from “LCA” or “TST”,
disambBeta = 0.25 seems more appropriate, while “MIX” performs better for
disambBeta = 0.75.
5.3 Query expansion
For query expansion performance we used two measures: average precision in the top
20 documents and percentage of queries experiencing degrade after expansion.
5.3.1 Probabilistic Query expansion
In this section we present the results of the experiments related to probabilistic
expansion methods. We present the effect of the number of documents returned by the
initial query and the effect of the number of query terms used in combined tables in
figures 5.2 and 5.3. In both figures darker means better.

As baseline performance we use the unexpanded query, which corresponds to an
average precision of 42.3% at top-20 documents. As an upper bound on performance
we use Rocchio's (Rocchio, 1971) ideal query derived from the documents as classified
by the TREC relevance judgements; that is, we used equation 2.3 to derive the
query that best distinguishes the relevant from the irrelevant documents. Using this
method we get a performance of 77.1%.
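A minimal sketch of deriving such an ideal query, assuming the standard Rocchio formulation (equation 2.3 is not reproduced in this chapter, so the weights alpha/beta/gamma and the bag-of-words vectors are conventional assumptions):

```python
def rocchio(query_vec, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Ideal query: add the centroid of the relevant documents to the
    original query vector and subtract the centroid of the irrelevant ones."""
    terms = set(query_vec) | {t for d in relevant + irrelevant for t in d}
    ideal = {}
    for t in terms:
        pos = sum(d.get(t, 0.0) for d in relevant) / max(len(relevant), 1)
        neg = sum(d.get(t, 0.0) for d in irrelevant) / max(len(irrelevant), 1)
        ideal[t] = alpha * query_vec.get(t, 0.0) + beta * pos - gamma * neg
    return ideal
```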
Figure 5.3: Percentage of queries experiencing degrade in performance
From figure 5.2 we can see that LCA has an average performance of 40.9%-47.3%.
The performance of LCA improves significantly as more documents are considered.
Using LCA with fewer than 50 documents can produce a degrade in performance (the
baseline average was 42.3%), but beyond 70 documents LCA averages about 44%,
giving an approximate 2% improvement over unexpanded queries. The same conclusions
can be drawn from the “percentage of queries experiencing degrade in performance”
metric: the percentage of queries experiencing degrade after using LCA for expansion
ranges from 22.0% to 36.0%, but when LCA is used with over 70 documents it rarely
causes a degrade in more than 28.0% of the queries.
ENT has an average performance of 42.4%-49.7%. The performance of ENT
reaches its peak when about 30 documents are used; using more documents can
decrease performance, but never below the baseline. The number of terms used in the
query is also important for this method: the best results occur when 40 or fewer terms
are used, in which case ENT always gives above 47%, a 5% improvement over the
unexpanded query. Using the “percentage of queries experiencing degrade in
performance” metric we can see that ENT can cause a degrade in performance in 18.0%
to 42.0% of the queries. However, percentages over 35% are found only when very
few (fewer than 30) documents and many (more than 50) terms are used. When used
with more than 10 documents and fewer than 40 terms it rarely causes a degrade in
more than 24% of the queries.
The results of MIX and TST are more balanced and similar to each other. Their
average performance ranges from about 42% to about 49.6%, and when used properly
they usually achieve more than 46%, which corresponds to a 4% improvement. The
effect of the number of terms is less apparent for TST, while MIX performs better with
fewer documents. The “percentage of queries experiencing degrade in performance”
metric reveals an important difference between these methods: similarly to ENT, MIX
can cause a decrease in performance in a large percentage of the queries (more than
30%) when used with more than 30 terms, whereas TST improves on this metric with
more documents.
In total we could say that, when properly used on our document collection, all
methods yield an average improvement of more than 4% over unexpanded queries and
cause a degrade in performance in about 18%-25% of the queries. Note, however, that
these results are rather pessimistic, as we used the HARD queries over a rather small
collection. In a web setup, where users enter average queries and the document
collection is much larger, significantly better results should be expected.

For all subsequent steps we used a value of 80 for the number of documents from
the initial query and a value of 20 for the number of query terms.
5.3.2 Ontological Query expansion
We did not conduct any experiments on pure ontological query expansion. Expanding
based on the specific type of relations used is well covered in the literature (Voorhees,
1994; Navigli and Velardi, 2003). A summary of the results of (Voorhees, 1994)
and (Navigli and Velardi, 2003) is that the most effective expansion method is that
with synonyms plus descendants. Expanding with hypernyms causes little if any
improvement, and expanding with any related concepts (not taking into account the
type of relation) likewise causes little if any improvement. (Navigli and Velardi, 2003)
also report a remarkable improvement when expanding with words in glosses.
Although we did not expand based on specific relations, we used a different measure
to estimate the quality of expanding with specific kinds of relations and found
results consistent with the ones reported in the literature. As mentioned in the
methodology section, we trained the boosting factor for one of the proposed hybrid
methods. The results of the training procedure are summarised in the methodology
chapter and are repeated here for convenience:
                     parents    children   children-subtree  siblings   siblings-subtree
is-a                 0.024772   0.031324   0.045891          0.010591   0.019223
part-of (member)     -0.013287  0.711712   0.018197          0.005665
part-of (substance)
part-of (rest)       0.089481   0.473094   0.001387          0.064584   -0.001692

OTHER                synonyms   Mapping to ontology concept  Not in ontology
                     0.571086   -0.002474                    -0.002785
Chapter 5. Evaluation and Results 51
Each cell in the table expresses the average weight of the specific kind of terms in the ideal query. That is, we used Rocchio's ideal query methodology to derive the weight of each term in the ideal query. Next, for each hierarchy and each kind of relation we found all the concepts related to the query concepts in the specific hierarchy with the specific relation. In this table we report the average weight of those terms.
Although it is difficult to predict the exact performance of expanding with the specific terms from the reported weight, this weight is a very useful measure for comparing the usefulness of the relations in the hierarchies.
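As a reminder of the computation behind the table, Rocchio's ideal-query weight for a term is (roughly) its average weight in the relevant documents minus its average weight in the non-relevant ones. The sketch below uses invented document vectors and omits the query-vector term and tuning coefficients of the full Rocchio formula:

```python
def rocchio_ideal_weights(relevant, nonrelevant):
    """Rocchio's ideal query: for each term, its average weight in the
    relevant documents minus its average weight in the non-relevant ones.

    relevant, nonrelevant: lists of {term: weight} document vectors.
    """
    terms = set()
    for doc in relevant + nonrelevant:
        terms.update(doc)
    return {
        t: sum(d.get(t, 0.0) for d in relevant) / len(relevant)
           - sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        for t in terms
    }

# Invented toy document vectors for illustration only.
rel = [{"aid": 0.9, "relief": 0.7}, {"aid": 0.5, "cross": 0.4}]
nonrel = [{"said": 0.8, "aid": 0.1}]
weights = rocchio_ideal_weights(rel, nonrel)
```

Terms frequent in relevant documents get positive weights; terms frequent only in non-relevant documents get negative ones, which is exactly why good expansion terms surface near the top of the ideal query.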
5.3.3 Hybrid Query expansion
In this section we present the performance of our proposed approach. The parameters
of our approach are:
• the probabilistic method used to extract the initial terms, namely “LCA”, “ENT”, “MIX” or “TST”
• the number of initial terms considered for re-ranking, which we will refer to as “numOfTerms”
• the mixing factor, which we will refer to as “mix” (lowercase). A mixing factor of 0 means pure probabilistic score; a mixing factor of 1 means pure ontological re-ranking score.
• the method used to derive the boosting factor, namely “Boosting based on relation to query concepts”, “Boosting based on importance measure drawn from hierarchies” and “Boosting based on network importance measure”.
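The effect of the mixing factor can be pictured with a small sketch (the term names and scores below are invented, and the sketch assumes both scores are already on a comparable scale):

```python
def combined_score(prob_score, onto_score, mix):
    """Linear mix of the probabilistic and ontological re-ranking scores.

    mix = 0 -> pure probabilistic score; mix = 1 -> pure ontological
    re-ranking score. Assumes the two scores use a comparable scale.
    """
    return (1.0 - mix) * prob_score + mix * onto_score

# Invented candidate terms with (probabilistic, ontological) score pairs.
candidates = {"aid": (0.8, 0.3), "relief": (0.5, 0.9), "said": (0.6, 0.1)}
reranked = sorted(candidates,
                  key=lambda t: combined_score(*candidates[t], 0.5),
                  reverse=True)
print(reranked)
```

At intermediate values of mix, a term that both methods consider reasonable can overtake a term that only one of them favours, which is the behaviour the experiments below examine.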
Since the goal of using ontologies is to re-rank the terms suggested by a probabilistic method, we used the performance of the original probabilistic method as a baseline. Moreover, we provide a baseline of random re-ranking of terms. This baseline is useful to illustrate the quality of the terms proposed by the probabilistic method. As an upper bound of performance we use the score of the best performing query after running 100 randomly re-ranked queries. This upper bound is described in figures 5.4 and 5.5.
Figure 5.4: Average precision at top-20 documents of the top randomly re-ranked query
Figure 5.5: Percentage of queries experiencing a degradation in performance using the top randomly re-ranked queries
Note that performance on both measures should improve as a greater numOfTerms is used. More terms give more options, and the best query available when re-ranking numOfTerms1 terms is still available when re-ranking numOfTerms2 terms if numOfTerms1 < numOfTerms2. However, this is not captured in our method. It seems that our decision to use 100 random re-rankings proved to be a very low number for discovering the optimal query. Nevertheless, it illustrates an important issue: when more terms are used, finding the optimal query becomes harder, as the re-ranking algorithm has to choose from a larger set of re-ranking options.
Another interesting finding is that even when the best queries discovered are chosen, there is still a significant percentage of queries experiencing a degradation in performance. This illustrates that for some queries it is better not to expand the query at all; regardless of the expansion method used, performance will degrade.
5.3.3.1 Boosting based on relation to query concepts
In this section we present the results of ontological re-ranking when the boosting score is derived from the relation of the candidate term to query concepts in the ontology. Figure 5.6 shows the average performance after re-ranking based on this criterion for the various values of the parameter mix. Recall that mix = 0 means no re-ranking and mix = 1 corresponds to full ontological re-ranking (the score assigned to the term by the original probabilistic method is not taken into account). In figure 5.7 the percentage of queries experiencing a degradation in performance is shown. In both figures random re-ranking is included as a baseline and darker means better.
From figure 5.6 we can see that although full re-ranking causes improvement in only a small number of parameter settings, when the re-ranking score is mixed with the original score from the probabilistic method the results are much better. A 2% to 3% improvement over the performance of the original probabilistic method can be observed, and this improvement translates to almost doubling the improvement over the unexpanded query. The same picture of significant improvement can be drawn from the percentage of queries experiencing a degradation in performance in figure 5.7. Using this re-ranking method we get a decrease of about 2% in the percentage of queries that experience a degradation in performance after expansion.
Figure 5.6: Average precision at top-20 documents when boosting based on relation to query concepts
Figure 5.7: Percentage of queries experiencing a degradation in performance when boosting based on relation to query concepts
5.3.3.2 Boosting based on importance measure drawn from hierarchies
This measure performed surprisingly badly. Results were worse than random re-ranking. Nevertheless, areas of high concentration were successfully detected in the hierarchies. However, because the terms were not disambiguated, strange combinations of senses were captured and boosted by this method. Perhaps one of the most successful cases for this method was the inclusion of the term “reading” in the query “Alexandria’s library”; however, this happened for the wrong reasons. “Reading” was boosted because a high concentration of concepts was discovered under the concept “city”: “Alexandria”, “Cairo” and the British town of “Reading” were detected.
To correctly evaluate this method as a query expansion method, perhaps manually disambiguated terms should be used. Note, however, that manually disambiguating all top-150 terms suggested by each probabilistic method for each query is a considerable task on its own. Perhaps the solution would be to query a sense-tagged document collection.
Nevertheless, because of this extreme sensitivity to WSD errors we do not discuss
this method as a method for query expansion any further.
5.3.3.3 Boosting based on network importance measure
In this section we present the results of semantic similarity based re-ranking, where semantic similarity is calculated by the extended Lesk measure (Banerjee and Pedersen, 2003) and the boosting score is derived from the importance of the concepts as measured by Pagerank with priors. In all results reported in this section we used a value of 0.5 for the parameter disambBeta. Figure 5.8 illustrates the average performance without mixing the score with the original probabilistic method for the various values of rankBeta. For all subsequent steps we used rankBeta = 0.99.
Figure 5.9 illustrates the average performance after re-ranking based on this criterion for the various values of the parameter mix. In figure 5.10 the percentage of queries experiencing a degradation in performance is displayed. In both figures random re-ranking is included as a baseline and darker means better.
Figure 5.8: Average precision at top-20 documents when boosting based on network importance (mix = 1)
Figure 5.9: Average precision at top-20 documents when boosting based on network importance (rankBeta = 0.99)
Figure 5.10: Percentage of queries experiencing a degradation in performance when boosting based on network importance
From figure 5.9 we can observe that the performance of this re-ranking method strongly depends on the probabilistic method used. For LCA, re-ranking causes a significant decrease in performance. For all other methods performance increases as more
terms are considered. Unfortunately, performance does not seem to have reached its peak in the specific figures. Probably, when considering more terms, the improvement would be more apparent. Nevertheless, performance is always better than random re-ranking. When MIX, TST or ENT is used and more than 80 terms are considered, there is an increase in performance over the original probabilistic method which ranges from 0 to 1%.
Chapter 6
Discussion and Conclusions
In this section we comment on the results and present the conclusions that can be drawn from our research regarding the properties of the ideal query, query expansion in general and ontological query expansion in particular.
6.1 Ideal query
As described earlier in this thesis, we explored the notion of the ideal query using two methods. The first was using Rocchio’s equation and the second was randomly re-ranking terms proposed by probabilistic query expansion methods and selecting the best performing query. We manually clustered the top-15 terms of the ideal query based on semantic similarity. A surprising finding was that in a significant number of cases the original query terms were not included in the top-15 terms of the ideal query. This illustrates the query-document word mismatch described in the literature, even in well-formed queries. For example, in the query “red cross activities” the term “activities” does not appear in the list but instead “aid” and “relief” do appear. This was a consistent finding: when query terms do not appear in the top query terms, a significant number of terms semantically similar to the missing term do appear. Another consistent finding was that the top terms of the ideal query tend to form semantically related clusters.
These two findings were confirmed by the inspection of the ideal queries proposed by the best randomly re-ranked method for deriving the ideal query.
A surprising finding exposed by the latter method of deriving the ideal query was that remarkable improvement can be achieved just by re-ranking the terms suggested by the probabilistic methods. We used 20-term queries, and by using the top-20 terms suggested by probabilistic methods we got results of no more than 49% (a 5% improvement over the unexpanded query). The actual performance depended on the probabilistic method used. Nevertheless, regardless of the probabilistic method used, we can get a 60% performance (a 16% improvement) just by re-ranking the top-30 terms. When re-ranking the top-150 terms, performance reaches the upper bound of Rocchio’s method. In other words, probabilistic methods have a high recall but low precision in locating good expansion terms. They detect very good expansion terms but also include some worse ones, and mixing them causes a far smaller improvement than possible.
The first two findings were the main motivation of our attempt to use semantic similarity measures, so that such semantically related clusters are detected and boosted. The third finding justifies our decision to deploy ontologies (an independent method) to re-rank the terms suggested by the probabilistic methods.
6.2 Probabilistic methods
The probabilistic methods for query expansion tested were the well-established Local Context Analysis and a proposed entropy based method. Local Context Analysis performed much worse in our setting than expected, causing an improvement of only about 3%, which is significantly lower than the more than 20% claimed by the original paper describing the method. The proposed entropy method performed better than LCA in almost any setting. However, although the performance of LCA improves as more documents are considered, we tested LCA with fewer documents than needed for it to reach its peak. The entropy based method reaches its peak quicker, when only a few documents are considered. This property might be desirable in real applications where resource efficiency is an important issue. However, further exploration of this method under different settings would be required before such deployment. In the context of this thesis, ENT and its descendants (MIX and TST) were used simply to create a diverse set of methods for subsequent steps, and the variance in performance of the methods discussed later illustrates that although the performance of the methods is quite close, the suggested terms are diverse.
6.3 Pure ontological query expansion
For ontological query expansion in particular, the conclusion of (Voorhees, 1994) was verified by this research:
“The most useful relations for query expansion are idiosyncratic to the particular query in the context of the particular document collection.”
The contribution of this research for pure ontological query expansion is perhaps an attempt to explain this idiosyncratic effect based on the analysis of the particular queries in our test set.
Advocates of ontologies prize them because they make explicit ontological commitments. The simplest ontological commitment is perhaps the selection of the term to describe a concept, and more complex ontological commitments are related to how to define a concept and place it in taxonomies. Some ontologies might partition the concept “human” into “male” and “female” while others might choose to partition the same concept into “child” and “adult”. These are some decisions expressing ontological commitments, but note that they express a specific choice from a set of options. Explicit ontological commitments simplify knowledge sharing, promote consistency and allow automatic usage of knowledge. Nevertheless, these merits come with a cost.
Ontology mismatch and the need for sophisticated mapping between ontologies are two issues due to different ontologies making different ontological commitments, and they are discussed in great detail in the ontological literature. In the context of our research the important issue is that there are many alternative options for a specific ontological commitment, and a single ontology usually makes an explicit commitment to a single approach. People deploy what seems to be an endless variety, and what proves to be a surprisingly effective set, of heuristics to select the appropriate commitments in the specific context. Automatic methods, on the other hand, use single ontologies that either make single decisions for each ontological commitment (as in the commitment to a single taxonomy when using WordNet) or use ontologies that allow multiple options for each ontological commitment (as in the commitment to multiple terms describing a single concept in WordNet). Presumably, using multiple options at the same time is more appropriate, but note that this adds the complexity of selecting the appropriate commitment in the context of the query, which is usually a very difficult problem. The performance of WSD algorithms is a good indication of this difficulty.
Thus we feel that the idiosyncratic effect is not related to the type of relation used, but rather to whether the ontological commitments implied by the query and the documents actually match the ontological commitments made by the specific ontology used.
It is always good to expand using hyponyms, but the question is: in which taxonomy? Consider the query “animal protection”. In the context of this query, protection of antelopes, lions, birds, fish etc. is relevant. But what about humans? The concept “human” was consistently and significantly boosted in all ontological approaches tested for this query, and that was presumably correct, but the reasons behind this boosting were wrong.
The query seems to imply an exclusion of humans from animals. It seems that in the context of the query, human is a sibling of animal, perhaps under the common parent concept “organism, being”, and thus protection of humans is not related to the query. However, one could argue that, under different circumstances, “human” would be implied by the query to be a kind of animal and thus protection of humans would be relevant. As far as the query and the ontology are concerned this is not an irrational assumption; it is our prior knowledge and the documents that make this assumption inappropriate. The concept “human” is widely referred to in both the actually relevant documents and the documents returned by the initial query. Nevertheless, humans are not mentioned as a kind of animal but rather as agents of protection; the documents are about human activities for animal protection rather than protection of humans. The documents imply a taxonomy where the dominant location of humans is under agent and not animal. Thus the concept “human” is and should be boosted, but because of the relation of “human” to “protection” and not because of its relation to “animal”.
Expanding with the pure ontological approaches as described in the literature makes no attempt to detect whether a relation of terms is actually valid in the context of the specific query. Even if that were attempted, although it is not easy to see clearly how, the best possible outcome would be to correctly filter the ontological commitments made in the ontology to the current context and use only those that are appropriate. Thus, for such a method to work, the appropriate commitment should exist in the ontology. Finally, even if all the commitments are there, the inter-annotator agreement for WSD is an indication of the precision to be expected from such a method.
This is not to say that pure ontological expansion is not worth exploring further. It is rather to explain the idiosyncratic effect, stress the difficulty of the problem and justify the more cautious usage of ontologies in this thesis.
6.4 Hybrid query expansion
There is a significant difference between using a relation and looking for an indication of a relation.
Using a relation requires understanding the exact properties of that relation. For example, expanding with terms that map to hyponyms of query terms requires correctly disambiguating the query terms and probably deciding whether the proposed term is actually a hyponym in the context of the query and the documents used. That is, to use a relation we must decide what is related, how it is related and whether that relation seems to hold in the specific context. The failure of our proposed “Boosting based on importance measure drawn from hierarchies” was mostly due to not even attempting to answer these questions. All possible concepts that mapped to the detected terms were added and all possible positions in the hierarchies were considered for these concepts. Areas of higher density were discovered, but they usually contained wrongly disambiguated or misplaced concepts.
Looking for an indication of a relation is the approach followed by the more successful methods described in this thesis. In “Boosting based on relation to query concepts” we attempt to favour terms that seem to be related to query terms. Note that instead of starting from ontology relations, this approach starts from the suggested terms and only re-ranks them if there is some indication of a relation. In “Boosting based on network importance measure” we attempt to derive from the ontology a presumably more abstract similarity measure. Thus, instead of using the specific relations, we use those relations to get an estimation of similarity and use that similarity as an indication of a relation.
6.4.1 Boosting based on relation to query concepts
In figure 5.6 we present the performance when the score of each term is fully determined by the probabilistic method (mix = 0.00), when the score is fully determined by the ontological method (mix = 1.00) and several cases where the score is mixed. The exact value of mixing when 0.00 < mix < 1.00 is not of great importance as it depends on scaling issues of the two original scores. Nevertheless, it is clear that when mixing the scores we almost always get better performance than fully determining the score by either of the original methods.
The parameters of the ontological method (how much to boost each specific kind of relation) should be considered near optimal as they were trained from the actual queries. Nevertheless, the actual construction of the query should not be considered optimal. Using only this boosting score we simply expand using a specific order of relations. The parameter values as trained form an ordering of relations: synonyms are used first; if they are not enough to complete the 20-term query, then children in part-of hierarchies are used; if those are not enough either, then the rest of the relations are considered. Perhaps this method is not optimal, and that justifies the low performance when only the ontological boosting score is used.
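The ordering implied by the trained weights amounts to a simple fill-up procedure, which can be sketched as follows (the tiers and terms below are invented for illustration, not the actual trained configuration):

```python
def expand_by_relation_order(tiers, query_len=20):
    """Fill the expanded query tier by tier: terms from the highest-weighted
    kind of relation first, then the next tier, until query_len is reached."""
    query = []
    for tier in tiers:                       # tiers ordered by trained weight
        for term in tier:
            if len(query) == query_len:
                return query
            if term not in query:
                query.append(term)
    return query

# Invented example tiers for a hypothetical query.
tiers = [["help", "assistance"],             # synonyms
         ["first aid", "medical aid"],       # children in part-of hierarchies
         ["relief", "charity", "rescue"]]    # remaining relations
print(expand_by_relation_order(tiers, query_len=4))
```

Because each tier is exhausted before the next one is touched, a weak term of a high-weighted relation can crowd out a strong term of a lower-weighted one, which is one plausible reason why using the boosting score alone performs poorly.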
When the score is mixed with the original probabilistic score, a significant improvement is observed. However, depending on the probabilistic method used, the best performance is reached for different numbers of terms considered. For “ENT” and “TST” performance improves regardless of the number of terms used and reaches its peak at about 100 terms. “MIX” and “LCA”, on the other hand, exhibit a more unstable performance.
6.4.2 Boosting based on network importance measure
An alternative approach to using ontologies without using specific relations was explored in this thesis. This approach can be seen as an attempt to use the hierarchies in the ontology in a more cautious way and to break through hierarchy boundaries through the association of keywords and key phrases with each ontology concept. Consider the case where each concept in the ontology is associated with a set of tags (key words and phrases). In such an ontology we could find concepts related to query terms based on whether the concepts share the same tags. Few ontologies use tags, and augmenting every concept in an existing ontology with tags would be a difficult and time-consuming task.
Most ontologies include a definition for each concept, so the definition can be used to approximate the assignment of key words and phrases to each concept. (Banerjee and Pedersen, 2003) calculate semantic similarity based on definition overlap and location in taxonomies, thus combining these two approaches.
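The core intuition of definition overlap can be sketched very simply (the glosses below are invented, and the actual extended Lesk measure also scores multi-word overlaps and the glosses of related concepts, which this toy version omits):

```python
def gloss_overlap(gloss_a, gloss_b,
                  stopwords=frozenset({"a", "an", "the", "of", "to", "for"})):
    """Crude relatedness score: number of content words shared by two glosses."""
    tokens_a = {w for w in gloss_a.lower().split() if w not in stopwords}
    tokens_b = {w for w in gloss_b.lower().split() if w not in stopwords}
    return len(tokens_a & tokens_b)

# Invented glosses for two senses of "bank" and for "deposit".
bank_financial = "a financial institution that accepts deposits"
bank_river = "sloping land beside a body of water"
deposit = "money given to a financial institution for safekeeping"

print(gloss_overlap(bank_financial, deposit))  # shares "financial", "institution"
print(gloss_overlap(bank_river, deposit))      # no content words in common
```

The sense whose gloss shares more content words with the context scores higher, which is how overlap-based measures relate concepts across hierarchy boundaries.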
However, not using specific relations requires a new paradigm for selecting terms. The simplest approach would be to select the concepts most similar to the query concepts. However, motivated by the absence of query terms in some ideal queries, we explored a method that also considers concepts not immediately related to query terms if there is a high concentration of such concepts in a specific semantic area. This was implemented by building a network where nodes are concepts and edges are weighted according to semantic similarity and, finally, using Pagerank with priors to detect important nodes in that network. The control parameter which determined the degree to which independent areas were considered was the rankBeta parameter. A very high rankBeta assigns significant prior importance to query concepts. A very low rankBeta minimises the prior importance of query concepts and allows importance weight to move freely through the network.
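The network step can be sketched as a small power-iteration version of Pagerank with priors (an illustrative reconstruction, not the exact implementation used in the thesis; here rank_beta plays the role of rankBeta, i.e. the probability of jumping back to the prior distribution concentrated on query concepts):

```python
import numpy as np

def pagerank_with_priors(weights, prior, rank_beta=0.99, iters=100):
    """Power iteration for Pagerank with priors on a weighted network.

    weights  : (n, n) matrix of semantic-similarity edge weights
    prior    : length-n prior importance vector (e.g. mass on query concepts)
    rank_beta: probability of jumping back to the prior distribution;
               high values keep importance close to the query concepts.
    """
    w = np.asarray(weights, dtype=float)
    out = w.sum(axis=1, keepdims=True)
    out[out == 0.0] = 1.0                  # guard for nodes with no edges
    transition = w / out                   # row-normalise into a random walk
    p = np.asarray(prior, dtype=float)
    p = p / p.sum()
    r = p.copy()
    for _ in range(iters):
        r = rank_beta * p + (1.0 - rank_beta) * transition.T @ r
    return r
```

With rank_beta close to 1, nearly all importance stays with concepts strongly connected to the query concepts; with rank_beta close to 0, importance diffuses freely through the similarity network, which is exactly the behaviour discussed next.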
As shown in the results section, when low prior importance is used the performance is rather unstable. Some excellent results were accomplished, but a small change in parameter values caused a dramatic change in performance. This instability was expected because of a fundamental fallacy in our initial motivation. Terms in the ideal query tend to form semantically related clusters, but that does not mean that the clusters formed by terms extracted by a probabilistic query expansion method will be the same clusters as those of the ideal query. In other words, when importance is allowed to move freely, the clusters might or might not be the correct ones. Actually, the only indication available in this model that they are related and correct is their relation to query terms. To express the need for such indication we need a high rankBeta.
When used with a very high rankBeta, the results are generally much better and more consistent. Actually, there seems to be a consistent improvement when more terms from the initial probabilistic method are considered. Unfortunately, due to time and resource limitations, we were not able to push this approach to its limits. The main time-consuming process was that of calculating semantic similarity using the (Banerjee and Pedersen, 2003) similarity measure. To tackle this issue we used caching; however, a significant amount of time and a significant amount of space was required for this cache. In real-world deployment of this method this would not be a problem, as the cache needs to be built only once and can be built off-line. What might be a problem with this approach would be the on-line calculations needed, namely running Pagerank on the network of the top-n concepts. Further exploration of this method regarding its time-efficiency is required before actually deploying it.
Note that selecting concepts with a very high rankBeta is not equivalent to simply selecting the concepts most related to query concepts. Using Pagerank adds two important properties: firstly, concepts related to all query concepts are favoured and, secondly, inter-concept relations are considered. If two concepts are equally related to query concepts but one of them has strong relations to many non-query concepts, then that concept will be preferred.
In general, this approach produced significant and consistent improvement over the original probabilistic methods when many terms are used, and a desirable property is its resistance to disambiguation errors. Nevertheless, the initial probabilistic method used to extract terms has a great impact on performance.
Another issue deserving further exploration for this method is whether the inclusion of more terms improves performance because there is a greater variety of terms to select from or because the Pagerank algorithm performs better with more terms.
6.4.3 The effect of the probabilistic method
Both methods for ontology based boosting proved to be very sensitive to the probabilistic method used to extract the initial terms. For both methods ENT produces the best and most stable results. LCA, on the other hand, produces unstable improvement, if any. For some parameter settings we get an improvement while for others we do not. It is difficult to settle on parameters that could be generally used when LCA is the probabilistic method.
To explain this effect we provided the baseline of random re-ranking of terms. Using this measure we can see that LCA performs much worse than all other methods when its top-n terms are randomly re-ranked. This illustrates that the quality of LCA suggested terms gets lower as we move towards lower ranked terms. Presumably this indicates a good ranking method and justifies the use of a weighting scheme within LCA. Nevertheless, in the context of our method this means that the top-n terms are already well ranked, and thus gaining from re-ranking those terms is more difficult. Another issue with LCA is that it detects rare terms strongly related to query terms. Because the terms are rare, quite often they are not included in the ontology, and that minimises the information that can be used for re-ranking.
Finally, compared to ENT, the terms suggested by LCA are actually more semantically related to the query. That narrows the space of possible improvement from semantically based re-ranking. As mentioned before, many terms suggested by ENT are not actually relevant to the query (“home”, “site”, “said”, “reports”, etc.). The ontology based method can penalise these terms for ENT and thus produce a better query. When LCA is used, actually improving through re-ranking requires more sophisticated reasoning; a rough “looks related” measure would fail on terms suggested by LCA because all terms already “look related” or we do not know whether they are related (rare names and places).
6.5 A note on our adapted version of Pagerank
Throughout this thesis we used an adapted version of Pagerank. This version works with weighted edges and priors. We found this version to be a surprisingly expressive model. Just to name a few examples, it was very easy to express the following:
• do a mutual disambiguation of query terms unless the rest of the terms strongly
suggest otherwise.
• do not use a single correct sense for each query term but rather use all possible
senses at the same time, but weight them by their probability according to a
disambiguation method.
• detect if there is a bias towards specific query terms and attempt to compensate
for this bias.
• select concepts for expansion that are more related to query concepts but also
consider independent important clusters.
The ease with which many complex features and parameters get incorporated into Pagerank makes it very attractive. Nevertheless, we found it to have two undesired properties in our setting:
• it is difficult to understand why something goes wrong when the model does not behave as expected.
• there is no indication of whether all the features and parameters we added to the model are combined optimally.
Perhaps the first property would not be so important if we had some indication that the parameters were optimally combined. In the absence of such indication, in numerous cases we were tempted to slightly change some weight so that a specific case worked as expected, and usually that led to a decrease in average performance.
We feel that this is a significant disadvantage of Pagerank, at least for our multi-featured model. If we had more time, we would attempt to take the features which proved to be useful through the use of Pagerank and use them to train and test other models. Note, however, that expressing some features outside Pagerank is a difficult task.
Chapter 7
Summary
In this thesis, we explored several query expansion methods. We reviewed some of the most important probabilistic methods and proposed a simple but effective entropy based probabilistic method. We also explored the use of ontologies for query expansion and word sense disambiguation. Finally, we explored the notion of the ideal query and attempted to discover the properties of that query.
The results of our experiments showed that the hypotheses posed in the introduction of this thesis hold, but not to the extent that we expected. H2 was about an entropy based probabilistic method and actually proved to hold significantly. H3 was about disambiguating query terms using network importance algorithms and additional terms extracted from a probabilistic expansion method. H3 proved to hold to some extent; however, the additional information used by this method seems not to be optimally taken advantage of.
Our main hypothesis H1 was that a hybrid query expansion method that uses ontologies to re-rank terms suggested by a probabilistic method would outperform the original probabilistic method. That is, ontology based re-ranking would add some additional gain to the query expansion's performance.
H1 proved to hold, but not under all settings. We implemented three re-ranking methods: one focusing on the relation to query terms, one focusing on the concentrations of concepts in the hierarchy and one using semantic similarity and network importance algorithms. The first method trained the weights for each relation from the actual queries. The second method compared concentrations in the context of the query to the same concentrations in the context of the whole document collection. The third method did not use the ontology directly; it only used it to derive a semantic similarity score and used only that measure in subsequent steps.
Of these methods only the first and third caused improvement. This improvement is significant (up to doubling the gain of the expansion process compared to using only the probabilistic method). However, this depends on the probabilistic method; for some methods we do not get a stable improvement. Our proposed entropy based probabilistic method performed very well on its own, and re-ranking its terms added a stable and significant additional gain. When LCA was used as the probabilistic method, the gain from re-ranking was not significant and very unstable.
Bibliography
Attar, R. and Fraenkel, A. (1977). Local feedback in full-text retrieval systems. Journal
of the ACM, 24(3):397–417.
Banerjee, S. and Pedersen, T. (2003). Extended gloss overlaps as a measure of semantic
relatedness. In Proceedings of the Eighteenth International Joint Conference on
Artificial Intelligence (IJCAI-03), pages 805–810.
Buckley, C., Mitra, M., Walz, J., and Cardie, C. (1998). Using clustering and supercon-
cepts within SMART. In Voorhees, E., editor, Proceedings of the 6th Text Retrieval
Conference (TREC-6), pages 107–124.
Carpineto, C., Mori, R. D., Romano, G., and Bigi, B. (2001). An information theoretic
approach to automatic query expansion. ACM Transactions on Information Systems,
19(1):1–27.
Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. (1990). Index-
ing by latent semantic analysis. Journal of the American Society for Information
Science, 41(6):391–407.
Furnas, G., Landauer, T., Gomez, L., and Dumais, S. (1987). The vocabulary prob-
lem in human-system communication. Communications of the ACM, 30(11):964–971.
Gonzalo, J., Verdejo, F., Chugur, I., and Cigarran, J. (1998). Indexing with WordNet
synsets can improve text retrieval. In Proceedings of the COLING/ACL '98 Work-
shop on Usage of WordNet for NLP.
Gruber, T. R. (1993). Towards principles for the design of ontologies used for knowl-
edge sharing. In Guarino, N. and Poli, R., editors, Formal Ontology in Conceptual
Analysis and Knowledge Representation, Deventer, The Netherlands. Kluwer Aca-
demic Publishers.
Cui, H., Wen, J.-R., Nie, J.-Y., and Ma, W.-Y. (2002). Probabilistic query expan-
sion using query logs. In Proceedings of the Eleventh International Conference on
World Wide Web, pages 325–332. ACM Press.
Haveliwala, T. H. (2002). Topic-sensitive PageRank. In Proceedings of the Eleventh
International World Wide Web Conference.
Jing, Y. and Croft, W. (1994). An association thesaurus for information retrieval. In
Proceedings of the Intelligent Multimedia Information Retrieval Systems (RIAO 94,
New York, NY), pages 146–160.
Jones, K. S. (1971). Automatic Keyword Classification for Information Retrieval. But-
terworths, London, UK.
Kruschwitz, U. and Al-Bakour, H. (2004). Users want more sophisticated search as-
sistants - results of a task-based evaluation. Journal of the American Society for
Information Science and Technology (JASIST).
Lesk, M. (1986). Automatic sense disambiguation using machine readable dictionar-
ies: how to tell a pine cone from an ice cream cone. In Proceedings of the 5th
Annual International Conference on Systems Documentation, pages 24–26.
Lu, A., Ayoub, M., and Dong, J. (1997). Ad hoc experiments using Eureka. In
Proceedings of the 5th Text Retrieval Conference, pages 229–240.
Mahler, D. (2003). New Directions in Question Answering, chapter 24: Holistic Query
Expansion Using Graphical Models.
Maki, W., McKinley, L., and Thompson, A. (2004). Semantic distance norms com-
puted from an electronic dictionary (WordNet). Behavior Research Methods, Instru-
ments, & Computers, 36:421–431.
Manning, C. and Schütze, H. (1999). Foundations of Statistical Natural Language Pro-
cessing, pages 294–307. The MIT Press.
Mihalcea, R., Tarau, P., and Figa, E. (2004). PageRank on semantic networks, with ap-
plication to word sense disambiguation. In Proceedings of the 20th International
Conference on Computational Linguistics (COLING 2004).
Mitra, M., Singhal, A., and Buckley, C. (1998). Improving automatic query expansion.
In Proceedings of the 21st Annual International ACM SIGIR Conference on Re-
search and Development in Information Retrieval (SIGIR 98, Melbourne, Australia,
Aug. 24–28).
Navigli, R. and Velardi, P. (2003). An analysis of ontology-based query expansion
strategies. In Workshop on Adaptive Text Extraction and Mining (ATEM 2003), in
the 14th European Conference on Machine Learning (ECML 2003).
Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank citation
ranking: Bringing order to the web. Stanford Digital Libraries Working Paper.
Rocchio, J. J. (1971). Relevance feedback in information retrieval. In The SMART
Retrieval System: Experiments in Automatic Document Processing, pages 313–323.
Sanderson, M. (1994). Word sense disambiguation and information retrieval. In 17th
International Conference on Research and Development in Information Retrieval.
Stokoe, C., Oakes, M., and Tait, J. (2003). Word sense disambiguation in information
retrieval revisited. In Proceedings of the 26th ACM SIGIR Conference, pages 159–
166.
Voorhees, E. (1994). Query expansion using lexical-semantic relations. In Proceedings
of the 17th Annual International ACM SIGIR Conference on Research and Develop-
ment in Information Retrieval (Dublin, Ireland), pages 61–69.
Winston, M., Chaffin, R., and Hermann, D. (1987). A taxonomy of part-whole rela-
tions. Cognitive Science, 11:417–444.
Xu, J. and Croft, W. (2000). Improving the effectiveness of information retrieval with
local context analysis. ACM Transactions on Information Systems (TOIS), 18(1):79–
112.