A FRAMEWORK FOR INFORMATION
RETRIEVAL BASED ON BAYESIAN
NETWORKS
by
Maria Indrawan
B.Comp.(Hons), MACS
School of Computer Science and Software Engineering
Monash University
Thesis Submitted for Examination
for the Degree of
Doctor of Philosophy
1998
Declaration
I declare that the thesis contains no material which has been accepted for the
award of any degree or diploma in any university and that, to the best of my
knowledge, the thesis contains no material previously published or written by any
other person except where due reference is made in the text.
Signed
Date
School of Computer Science and Software Engineering,
Monash University,
Caulfield, Victoria, 3168
1998
Table of Contents
TITLE............................................................................................................... i
ABSTRACT .................................................................................................... ii
DECLARATION............................................................................................ iv
ACKNOWLEDGEMENT .............................................................................. v
CHAPTER 1
INTRODUCTION
1.1 Background and Motivation .................................................................. 1
1.2 Uncertainty and Artificial Intelligence .................................................... 2
1.3 Previous Work Using Network Models for Information Retrieval .......... 6
1.4 Contribution of the Thesis...................................................................... 9
1.5 Research Methodology........................................................................ 12
1.6 Thesis Overview.................................................................................. 13
CHAPTER 2
AUTOMATIC INFORMATION RETRIEVAL
2.1 Introduction......................................................................................... 16
2.2 Information Retrieval Model................................................................ 16
2.3 Document and Query Indexing ............................................................ 21
2.3.1 Indexing Problems........................................................................ 22
2.3.2 Indexing Language....................................................................... 24
2.4 Matching Functions ............................................................................. 27
2.4.1 Boolean Model ............................................................................ 28
2.4.2 Vector Space Model .................................................................... 29
2.4.3 Probabilistic Model ...................................................................... 32
2.4.3.1 Binary Independence Model.................................................... 33
2.4.3.2 Unified Model......................................................................... 37
2.4.3.3 Retrieval with Probabilistic Indexing (RPI) Model................... 39
2.5 Increasing Retrieval Performance......................................................... 40
2.5.1 Stop List ...................................................................................... 41
2.5.2 Term Weighting ........................................................................... 42
2.5.3 Thesaurus .................................................................................... 43
2.5.4 Relevance Feedback.................................................................... 44
2.6 Summary............................................................................................ 47
CHAPTER 3
THEORY IN BAYESIAN NETWORKS
3.1 Introduction........................................................................................ 49
3.2 Bayes Theorem................................................................................... 50
3.3 Bayesian vs Classical Probability Theory............................ 57
3.4 The Bayesian Network as a Knowledge Base...................................... 61
3.4.1 Bayesian Network Structure........................................................ 63
3.4.2 Conditional Independence ........................................................... 65
3.5 Probabilistic Inference in Bayesian Networks ...................................... 69
3.5.1 Pearl's Inference Algorithm ......................................... 71
3.5.2 Handling Loops in the Network .................................................. 75
3.6 Summary............................................................................................ 76
CHAPTER 4
A SEMANTICALLY CORRECT BAYESIAN NETWORK MODEL
FOR INFORMATION RETRIEVAL
4.1 Introduction........................................................................................ 78
4.2 The Bayesian Network Model............................................................. 83
4.2.1 Probability Space ........................................................................ 84
4.2.2 The Document Network.............................................................. 87
4.2.3 The Query Network .................................................................... 89
4.2.4 Prior Probability.......................................................................... 90
4.3 Probabilistic Inference in Information Retrieval................................... 93
4.3.1 Link Matrices.............................................................................. 95
4.3.1.1 OR-link matrix..................................................................... 96
4.3.1.2 AND-link matrix.................................................................. 96
4.3.1.3 WEIGHTED-SUM link matrix ............................................ 97
4.4 Directionality of the Inference............................................................. 98
4.5 Comparison with Other Models ........................................................ 105
4.5.1 Simulating the Boolean Model .................................................. 106
4.5.2 Simulating the Probabilistic Retrieval Model.............................. 108
4.5.3 Inference Network .................................................................... 110
4.6 Summary.......................................................................................... 116
CHAPTER 5
HANDLING LARGE BAYESIAN NETWORKS
5.1 Introduction...................................................................................... 118
5.2 An Illustration of an Exact Algorithm ............................................... 119
5.3 Reducing the Computational Complexity .......................................... 126
5.3.1 Node and Link Deletion ............................................................ 126
5.3.2 Layer Reduction........................................................................ 128
5.3.3 Adding a Virtual Layer.............................................................. 130
5.3.3.1 Clustering the Parent Nodes .............................................. 135
5.4 Handling Indirect Loops ................................................................... 139
5.4.1 Clustering ................................................................................. 141
5.4.2 Conditioning ............................................................................. 144
5.4.3 Sampling and Simulation........................................................... 145
5.5 Dealing with a Loop Using Intelligent Nodes .................................... 147
5.5.1 Example of the Feedback Process Using Intelligent Nodes ........ 151
5.6 Summary.......................................................................................... 152
CHAPTER 6
MODEL PERFORMANCE EVALUATION
6.1 Introduction...................................................................................... 155
6.2 The Relevance Judgement Set........................................................... 160
6.3 Performance of the Basic Model ....................................................... 164
6.4 Estimating the Probabilities............................................................... 166
6.4.1 Estimating P(ti|Q=true)............................................................. 166
6.4.2 Dependence of Documents on Index Terms............................... 170
6.4.2.1 Estimating the tf and idf Components ................................ 170
6.4.2.2 Estimating the Combination of tf and idf Components........ 171
6.4.3 Estimating the Virtual Layer Distribution .................................. 176
6.5 Performance Comparison with Existing Models ................................ 181
6.5.1 Comparative Performance for the ADI ....................................... 183
6.5.2 Comparative Performance for the MEDLINE............................. 186
6.5.3 Comparative Performance for the CACM................................... 187
6.6 Summary........................................................................................... 192
CHAPTER 7
MEASURING THE EFFECTIVENESS OF THE VIRTUAL LAYER MODEL
7.1 Introduction....................................................................................... 195
7.2 Minimum Message Length................................................................. 196
7.2.1 Encoding Real Valued Parameters.............................................. 198
7.3 Measuring the Effectiveness of the Virtual Layer Model with MML ........ 200
7.4 Illustration of MML Calculation for Index Term Clusters ................... 202
7.5 Summary........................................................................................... 208
CHAPTER 8
CONCLUSION AND FUTURE RESEARCH
8.1 Conclusion ........................................................................................ 210
8.2 Future Work...................................................................................... 212
8.2.1 Phrases and Thesaurus ............................................................... 213
8.2.2 Retrieval Fusion ......................................................................... 213
8.2.3 Index Term Clustering................................................................ 214
8.2.4 Comparison Model for Bayesian Networks ................................ 214
REFERENCES ............................................................................................ 215
APPENDIX A ............................................................................................. 234
APPENDIX B .............................................................................................. 264
APPENDIX C .............................................................................................. 269
Chapter 1
Introduction
1.1 Background and Motivation
Information is a vital resource for all organisations. The efficient management
and retrieval of information is therefore an important organisational function. It
has been suggested that the quantity of new information produced in the western
world is growing at a rate of 13 percent each year [Freimuth89]. With the
development of the internet and other global networks this figure is expected to
increase markedly. As a result, people who need information are frequently
overwhelmed by the sheer amount of information available and finding useful
information requires enormous effort.
In early information retrieval systems, such as library catalogues, searching
is achieved through a catalogue in which documents are represented by several
fixed categories such as author, title and subject. The assignment of
categories is done manually by domain experts. With the present explosion in
the amount of information available, handling information effectively and
efficiently in this way is very difficult, if not impossible.
if not impossible. In addition to the problem of information volume, the current
format of the information has also introduced another dimension to the
information retrieval task. Most information is currently delivered in electronic
format which lacks the well-structured form of books. Such ill-structured
documents include articles, WEB pages, medical records, patent records, legal
case records, software libraries and manuals. Compared with traditional library
systems, a different kind of computer-based search strategy is required for these
electronic documents. The search strategy needs to be able to retrieve documents
based on the ‘content’ of the items directly. The objective of modern information
retrieval systems is to provide such types of search.
The automation of search and retrieval by content is not straightforward. Most of
the information available is written in natural language such as English and, to
date, information systems have not been able to process and ‘understand’ the
natural language as competently as human beings, despite extensive efforts by
natural language researchers [Carmody66, Schank77, Allen87, Boguraev87,
Mel’cuk89, Amsler89, Brent91, Kupiec92]. Thus, a major problem inherent in
information retrieval systems may be seen as the ‘uncertainty’ in understanding
user's information needs and the content of documents. In the last two decades,
researchers in computer science have actively investigated the possibility of
applying Artificial Intelligence (AI) techniques to handle uncertainty. In the next
section we will present a summary of different AI techniques used to handle
uncertainty in information retrieval.
1.2 Uncertainty and Artificial Intelligence
In the recent past, one focus in Artificial Intelligence has been the problem of
"Approximate Reasoning". This problem deals with the decision making and
reasoning processes in situations where information is not fully reliable, the
representation language is inherently imprecise and information from multiple
sources is conflicting. In information retrieval, documents and queries are
represented by index terms. The precision in representing document and query
content relies on the effectiveness of the text analysis methods in 'understanding'
the natural language. As stated in the previous section, existing natural language
processing models have not been able to process and 'understand' the natural
language as competently as human beings. As a result, document and query
representation cannot be represented precisely, or in other words, it may be seen
as a problem that requires an "approximate reasoning". Thus, an AI approach
may be considered as a solution to this uncertainty problem inherent in
information retrieval task.
Representation techniques for uncertain or imprecise information can be
classified into numeric and non-numeric (also known as symbolic) techniques.
In the numeric context, the approximation can be viewed as a value with a known
error margin such as in Bayesian models, Evidence Theory [Shafer76] and
Fuzzy theory [Zadeh78]. The Bayesian belief network was introduced in the
eighties as an extension to the traditional Bayesian models. It incorporates graph
theory into the Bayesian model to enrich the semantic representation.
The symbolic representation approach at first concentrated on the use of
logic or, more specifically, first order predicate calculus. This classical symbolic
logic failed to produce consistent representations due to its lack of tools for
describing how to devise a formal theory to deal with inconsistencies caused by
new information [Bhatnagar86]. A modification of symbolic logic, namely
non-monotonic logic [McDermott85], was introduced to overcome the problems of
first order logic.
In addition, there are other AI methods such as neural networks, genetic
algorithms and hidden Markov models. The first two methods have attracted a
considerable number of researchers, especially in recent years with the
increase in computational power. The hidden Markov model is supported by
rigorous mathematical theory and is mainly used in the areas of speech and
character recognition [Hansen95].
There have been many conflicting views regarding the merits of particular
models [Cheeseman85, Cheeseman91, Zadeh86]. Each model exhibits
comparative advantages depending on the domain and application being
considered. In information retrieval research, probabilistic methods, which can
be categorised as numeric AI techniques, are well accepted and have shown
promising results [Robertson76, Rijsbergen79, Turtle91, Ghazfan94]. However,
in the last few years there has been a surge in research adopting symbolic AI
techniques, in particular non-classical logic approaches, for information
retrieval [Rijsbergen86, Rijsbergen89, Crestani94, Chevallet96]. The results
have not been widely reported due to the computational complexity of these
models [Crestani95].
We will adopt the probabilistic approach, more specifically that of
Bayesian networks, in our information retrieval model. A Bayesian network is a
directed acyclic graph where the nodes represent events or propositions and the
arcs represent causal relations between those propositions represented in the
nodes. The support of explicit relations between the propositions in the Bayesian
network can overcome the following problems experienced by other probabilistic
retrieval models:
1. The traditional probabilistic model, such as those of Maron and Kuhn
[Maron60], Robertson and Sparck-Jones [Robertson76], Fuhr [Fuhr89]
and Rijsbergen [Rijsbergen79], uses two different models to produce the
initial ranking and to handle relevance feedback. The initial ranking is
usually produced using some ad-hoc probability estimations and the
relevance feedback is handled using some learning models.
2. Relevance feedback is confined to relevance information gathered from
documents, although Fuhr [Fuhr92] showed that relevance feedback
gathered from queries can also be used to improve retrieval
performance.
3. Multiple representations of documents and queries are not possible,
although Turtle [Turtle90] showed that an information need represented
by different query representations generates different ranked outputs,
and that combining these outputs may increase retrieval performance.
4. Thesauri, citations and synonyms are added on top of the retrieval
model instead of forming part of the retrieval model itself.
Our proposed Bayesian network model for information retrieval addresses the
problems inherent in the traditional probabilistic retrieval model in the following
ways:
1. The probabilistic inference in the Bayesian network retains the sound
theoretical basis of the traditional probabilistic models, but also
incorporates a common method for producing initial rankings of
documents and for handling relevance feedback.
2. Relevance feedback fits naturally into the model. The probabilistic
inference approach provides an automatic mechanism for learning.
3. The probabilistic inference approach allows us to incorporate relevance
information from other queries into the model by using a separate network
representation for the query and exploiting the use of multiple query
network representations for a single information need.
4. Documents in the collection may be represented as a complex object with
multilevel representations, not merely as a collection of index terms.
5. Dependencies between documents are built implicitly in the model by
using the conditional independence principle of Bayesian networks, which
allows the retrieval of documents that do not share common index terms
with the query. Citation or nearest-neighbour links can be easily
incorporated because of the graphical nature of the model.
6. Synonyms and a thesaurus can be easily implemented as part of the
network. Any index terms that are synonyms can be linked, so the system
can use all those synonyms during retrieval.
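The structure outlined above (a directed acyclic graph whose nodes are propositions and whose arcs carry causal dependence quantified by conditional probabilities) can be sketched minimally in code. The three-node query → term → document chain and all probability values below are illustrative assumptions, not figures from this thesis:

```python
# A minimal Bayesian network sketch: binary propositions in a DAG, each node
# carrying a conditional probability table (CPT) keyed on its parents' values.
# Structure and numbers are illustrative only.
from itertools import product

# Each CPT maps (tuple of parent values) -> P(node = True | parents).
network = {
    "query":    {"parents": [],        "cpt": {(): 0.5}},
    "term":     {"parents": ["query"], "cpt": {(True,): 0.8, (False,): 0.1}},
    "document": {"parents": ["term"],  "cpt": {(True,): 0.7, (False,): 0.05}},
}

def joint_probability(assignment):
    """P(assignment) by the chain rule: the product over all nodes of
    P(node | parents), following the arcs of the DAG."""
    p = 1.0
    for name, node in network.items():
        parent_vals = tuple(assignment[par] for par in node["parents"])
        p_true = node["cpt"][parent_vals]
        p *= p_true if assignment[name] else 1.0 - p_true
    return p

# The joint distribution sums to 1 over all 2^3 assignments.
total = sum(
    joint_probability(dict(zip(network, values)))
    for values in product([True, False], repeat=len(network))
)
print(round(total, 10))  # → 1.0
```

The chain-rule factorisation is what makes the network compact: each node needs only a table over its parents, not over every other node in the graph.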
Graph and network structures have been widely used in information retrieval.
Salton [Salton68] made early use of tree and graph models in information
retrieval, describing the implementation of many basic retrieval structures in
graph-theoretic terms. However, their use in combination with a
formal inference technique is still a current topic of research.
1.3 Previous Work Using Network Models for
Information Retrieval
Salton's [Salton68] early use of tree and graph models for information
retrieval provides a starting point for much of the information retrieval
research that uses tree or graph models. A number of current information
retrieval models use a network representation. These models can be loosely
categorised according to whether they support clustering, rule-based
inference, browsing, spreading activation, or connections.
Clustering. In the clustering approach, the network structure is derived
naturally from the representation of document and term clusters. Sparck-Jones
[Sparck-Jones71] investigated the term clustering technique and later used it to
develop the automatic indexing technique [Sparck-Jones74]. Croft [Croft80]
describes a retrieval model incorporating document and term clusters. Croft and
Parenty [Croft85] compare the performance of cluster based network
representation with a conventional database implementation. A survey of
document clustering techniques, especially those for hierarchic clustering, is
presented by Willet [Willet88]. All the different approaches to clustering have
one common feature, namely that they assume there is a natural similarity
between index terms or documents and these similarities can be exploited to
increase the retrieval performance.
Rule-based inference. The rule-based inference method in RUBRIC
systems [Tong83, Tong85] represents queries as a set of rules in an evaluation
tree that specifies how individual document features can be combined to estimate
the certainty that a document matches the query. One of the objectives of the
RUBRIC design was to allow the comparison of different uncertainty calculi
models [Tong86]. Recently, the RUBRIC system included the inference network
approach [Fung90a]. Rule-based inference using network structures has also been
used with the construction of automatic thesauri [Croft87b, Shoval85].
Browsing. A network representation is essential in information retrieval
systems that support a browsing capability. Hypertext systems are a typical
example of browsing systems and are also common in thesaurus based systems.
The THOMAS system [Oddy77] uses a method that allows browsing in a simple
network of documents and terms. A more complex network model for browsing is
investigated by Croft and Thompson [Croft87b] using the I3R system. Croft and
Turtle [Croft89a] and Frisse and Cousins [Frisse89] describe a retrieval model for
hypertext networks. A survey of hypertext retrieval research can be found in
Coombs [Coombs90].
Spreading activation. Spreading activation is a search technique in
which the query is used to activate a set of nodes in the representation network,
which in turn activates the neighbouring nodes. The rank of the retrieved
documents is generated by the pattern of activation in the network. The variation
between such models usually arises due to different halting conditions and
weighting functions. Jones and Furnas [Jones87] present a representative
spreading activation model which is compared to the conventional retrieval
models by Salton [Salton88]. Croft [Croft89b] used spreading activation in a
network based on document clustering. Cohen and Kjeldsen [Cohen87] used
spreading activation in a more complex representation network with typed edges.
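The spreading activation search described above can be sketched as follows. The term-document links, the weights, and the fixed-step halting condition are illustrative assumptions rather than any particular published model:

```python
# Toy spreading activation over a term-document network: query terms receive
# an initial activation, which propagates along weighted links to neighbouring
# document nodes; documents are ranked by accumulated activation.
# Links: term -> [(document, weight), ...]; all values illustrative.
links = {
    "bayes":     [("doc1", 0.9), ("doc2", 0.4)],
    "network":   [("doc1", 0.6), ("doc3", 0.8)],
    "retrieval": [("doc2", 0.7)],
}

def spread(query_terms, steps=1):
    activation = {}
    frontier = {t: 1.0 for t in query_terms}   # initial activation on query nodes
    for _ in range(steps):                     # halting condition: fixed step count
        next_frontier = {}
        for node, act in frontier.items():
            for neighbour, weight in links.get(node, []):
                next_frontier[neighbour] = next_frontier.get(neighbour, 0.0) + act * weight
        for node, act in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + act
        frontier = next_frontier
    # Rank by activation, highest first.
    return sorted(activation.items(), key=lambda kv: kv[1], reverse=True)

print(spread(["bayes", "network"]))  # doc1 ranked first (0.9 + 0.6 activation)
```

Variations between published models then amount to different choices of weighting function and halting condition, as the text notes.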
Connections approach. The connectionist approaches are similar to
spreading activation. However, the connectionist approach does not include
the clear semantic interpretation of the network links that the spreading
activation approach provides. The weights associated with the links are
learned from training samples or through user guidance. Croft and Thompson
[Croft84] used a connectionist network in an attempt to learn and select a query
strategy. Brachman and McGuinness [Brachman88] used a connectionist approach
to retrieve facts from knowledge bases about programming languages. Belew
[Belew89] and Kwok [Kwok89] describe other connectionist approaches to
information retrieval. Lewis [Lewis90] further explores the relationship between
information retrieval and machine learning.
All the network approaches discussed in this section lack one major
feature required to produce a good information retrieval model, namely that of a
strong mathematical foundation. In this thesis, we introduce a new formal model
based on a Bayesian network that provides a strong mathematical foundation.
1.4 Contribution of the Thesis
Recent information retrieval research has suggested that significant
improvements in retrieval performance will require techniques that, in some
sense, ‘understand’ the content of documents and queries [Rijsbergen86,
Croft87a], in order to infer probable relationships between documents and
queries.
The idea that the retrieval process is an inference or evidential reasoning
process is not new. Cooper’s logical relevance approach [Cooper71] is based on
deductive relationships between representations of documents and information
needs. Wilson [Wilson73] used situational relevance to extend Cooper’s logical
relevance by incorporating inductive inference.
In the research described in this thesis we present a semantically sound
Bayesian network model as a formal model of information retrieval. This thesis
contains two areas of contribution, namely to information retrieval modeling and
to Bayesian network inference theory. In detail, the thesis contains the following
contributions:
• We formally define a new model for information retrieval based on a
Bayesian network. The model provides a strong mathematical
foundation to model uncertainty in information retrieval. The new
model can be used as a general framework for information retrieval
because it can represent different existing information retrieval
models, such as the Boolean and probabilistic models, by using
appropriate network representations. With this framework, the
decision to adopt a specific retrieval model can be postponed until
the implementation level.
• We introduce a specific implementation of the above model to
perform probabilistic retrieval. The probability model presented
includes probability estimations that produce better performance
than other well-known information retrieval systems, such as the
vector space model [Salton83] and Turtle and Croft's [Turtle90]
network model. The performance tests were carried out on three
well-studied test collections, namely ADI, MEDLINE and CACM.
Moreover, the adoption of a graph, which captures the connectivity
between the index terms and the documents, enables our proposed
model to produce higher recall than the information retrieval
models previously mentioned.
• We provide a framework within the Bayesian network model to
support both evidential and dependency alteration relevance feedback.
Existing information retrieval models have failed to provide a
common model for both approaches to relevance feedback, although
the two approaches have been shown to benefit different retrieval
situations. The evidential feedback is suited for modeling the situation
where we perceive that the probability distribution has been correctly
modeled; hence the data received from relevance feedback is
treated as new evidence for this probability distribution. Altering the
dependencies, on the other hand, is best used when we perceive the
probability distribution to be incorrect and the data gathered from the
relevance feedback process should be used to correct this probability
distribution. As we can see, the two relevance feedback approaches
each have their own place in information retrieval applications.
Therefore the ability to support both approaches in a single framework
is essential to information retrieval.
• Cooper [Cooper90] proved that exact inference in Bayesian networks
is NP-hard. It is common for
information retrieval systems to deal with large document collections.
Therefore, we see the importance of adopting some approximation
methods to reduce the inference complexity in the Bayesian network
model for information retrieval. We introduce some heuristics to
reduce the complexity of the inference in Bayesian networks.
• Finally, we present an evaluation model that can be used to measure
the complexity of the heuristics proposed in the previous point. The
model is based on the idea of Minimum Message Length
[Wallace68]. The best approximation or heuristic is the one
that produces the shortest coding in describing
the probability distribution. This evaluation model will enable us to
evaluate the efficiency of a given approximation model without
performing extensive retrieval tests.
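As a toy illustration of the Minimum Message Length idea behind this evaluation model: each candidate description of a distribution is scored by a two-part message (first state the model, then encode the data under it), and the shortest total message wins. The candidate set, the data, and the fixed parameter precision below are illustrative assumptions, not the scheme developed in chapter 7:

```python
# Two-part MML scoring for candidate Bernoulli models of binary data:
# message length = cost of stating the model + cost of encoding the data
# under that model (-log2 likelihood, in bits). All values illustrative.
from math import log2

data = [1, 1, 1, 0, 1, 1, 0, 1]   # binary observations (sample mean 0.75)

def message_length(p, data, precision_bits=4):
    model_cost = precision_bits                              # part 1: state p
    data_cost = sum(-log2(p if x else 1.0 - p) for x in data)  # part 2: data | p
    return model_cost + data_cost

candidates = [i / 16 for i in range(1, 16)]   # Bernoulli parameters at 4-bit precision
best = min(candidates, key=lambda p: message_length(p, data))
print(best)  # → 0.75
```

The shortest message here picks the candidate matching the sample mean; richer model classes trade extra statement cost against shorter data encoding in the same way.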
1.5 Research Methodology
In information retrieval research, experiments are performed using test
collections. Recall and precision levels are used to measure the performance of
the system. The recall measures the ability of the system to retrieve all the
relevant documents. The precision measures the ability of the system to
discriminate between the relevant and non-relevant documents. A test collection
in information retrieval experiments comprises:
• A set of documents – current test collections generally contain
information from the original documents such as title, author, date and
an abstract.
• A set of queries – These queries are often taken from actual queries
submitted by the users. They can be expressed either in natural
language or in some formal query language such as Boolean
expressions.
• A set of relevance judgements – For each query in the query set, a set
of relevant documents is identified. The identification process can be
done manually by a human expert or by using pooling methods for
results from several information retrieval systems.
The interaction of these sets in an information retrieval experiment is
depicted in Figure 1-1.
[Figure: standard queries from the test collection are run through the
retrieval model to produce a document ranking; comparison with the relevance
judgement yields recall and precision levels.]
Figure 1-1 Model for experiments in information retrieval systems.
Using the standard queries in the test collection, the retrieval system under
evaluation is used to perform a search in the document set. The result of the
search is a list of document identifiers with the documents assumed most relevant
being ranked first. This ranked list is then compared with the relevance
judgements. The relevance judgement itself does not include any ranking; it
only contains the identifiers of documents judged relevant to the query.
Using the recall and precision formulae, the recall and precision levels are then
measured.
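The comparison step can be sketched as a small recall/precision computation at a chosen rank cut-off. The document identifiers and the cut-off below are illustrative:

```python
# Recall and precision for a ranked retrieval run: the system's ranked list
# is compared against the unranked set of relevance judgements at a cut-off.
def recall_precision(ranking, relevant, cutoff):
    retrieved = set(ranking[:cutoff])
    hits = len(retrieved & relevant)
    recall = hits / len(relevant)   # fraction of all relevant documents found
    precision = hits / cutoff       # fraction of retrieved documents that are relevant
    return recall, precision

ranking = ["d3", "d7", "d1", "d9", "d4"]   # system output, most relevant first
relevant = {"d3", "d1", "d5"}              # judged relevant (no ranking)

print(recall_precision(ranking, relevant, cutoff=3))  # recall 2/3, precision 2/3
```

Evaluating at several cut-offs yields the recall-precision curves conventionally reported for test-collection experiments.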
1.6 Thesis Overview
Researchers have adopted artificial intelligence to solve the problem of
uncertainty across different knowledge domains. We have adopted one particular
artificial intelligence technique, namely that of the Bayesian network, to solve the
problem of uncertainty in information retrieval.
In the next chapter, Chapter 2, a summary of the current state of
information retrieval is given. The research problem introduced in the current
chapter will be discussed further there. We also present a
comparison of the two major existing retrieval models: the vector space and the
probabilistic models.
Chapter 3 describes the development of Bayesian network theory. The use
of inference in the Bayesian network is also discussed in this chapter.
Based on the information discussed in chapter 2 and chapter 3, we present
a semantically sound Bayesian network model for information retrieval in chapter
4. We show that our model provides a correct semantic interpretation of the
retrieval model and also provides a general model for information retrieval
through its effectiveness to simulate existing models using appropriate network
representations.
One major consideration in implementing a Bayesian network for
information retrieval is the computational complexity inherent within the
network. Chapter 5 investigates the possibilities of adopting an approximation
model that can reduce the computational complexity in the network in order to
make the implementation practical.
We report the results of our experiments in chapter 6. Different
probability estimations and their effect on the performance of the system are
tested and reported in this chapter. We also compare the performance of our
network model with other well-known retrieval models.
Chapter 7 introduces an evaluation model that can be used to measure the
effectiveness of approximation models introduced in chapter 5. The evaluation
model enables us to choose the optimal approximation without performing
extensive retrieval performance tests. Finally, we provide the conclusion of our
research and possible future direction in chapter 8.
Chapter 2
Automatic Information Retrieval
2.1 Introduction
Information retrieval systems are designed to help people extract useful or
interesting information from document collections. Information or document
retrieval systems are not recent innovations. They have existed since the first libraries,
in the form of manual library catalogue systems. Since that time, information
retrieval systems have changed rapidly due to the growth in the amount of textual
information available in both digital and paper format. This dramatic increase in
available information has driven the need for the development of automatic
information retrieval.
In this chapter, we present models of the retrieval systems and their
associated problems. We organise this chapter into three major parts. The first
section defines the information retrieval models and their problem domain. The
second provides detailed explanations of those parts that constitute the
information retrieval models. The third and last section examines some methods
that can be used for improving the performance of information retrieval systems.
2.2 Information Retrieval Model
An information retrieval system involves three major tasks (figure 2-1), namely
document indexing, query formulation and the use of a matching function. The
document indexing task involves building and organising representations for each
document involved in the collection. Query formulation is a similar task to that of
document indexing, translating the user’s information needs to a format which can
be understood by a matching function. Document and query indexing are
discussed in detail in section 2.3.
[Figure: query formulation and document indexing produce query and document representations; the matching function compares them to return relevant documents, which the user then reads]
Figure 2-1 Information retrieval task model.
Once the two representations are built, the matching function will use both
the document and query representations to find those documents judged to be
relevant by the system. However, the documents returned by the system may not
necessarily be relevant from the user’s point of view. The two main factors that
influence the disparity between the set of documents judged relevant by the system
and those perceived to be relevant by the user to their original query are natural
language ambiguity and the possible limited background knowledge of users on
the query subject.
The first problem of natural language ambiguity results from the fact that a
concept may be expressed in many ways. For example, consider the word
windows. A user may use this word to search for documents explaining windows
based operating systems or for documents explaining how to classify different types
of architecture by looking at the shape of windows. The formulation of methods
to overcome the problem of the ambiguity in natural languages is a major
objective of information retrieval research.
The second problem, that of limited background knowledge, from the
point of view of information retrieval research, may not be completely eliminated
since it is partially the responsibility of the users. Upon the delivery of the
documents judged relevant by the retrieval system, the users may read the
documents to expand their knowledge (we refer to this as the reading process).
As the users’ knowledge of the subject expands, the query submitted to the system
can be refined using the new knowledge learnt. Therefore, the responsibility of the
system lies with providing means of refining and resubmitting the query that
reflects the additional knowledge learnt. This facility is known as relevance
feedback (section 2.5.4 discusses relevance feedback in more detail).
The reading process we discussed in the previous paragraph plays an
important role in solving the problem of users’ limited background knowledge
concerning the topic of a query. Indeed, the emphasis on the role of the reading
process differentiates information retrieval systems from other information systems
like data or knowledge retrieval systems. Although most of the literature to date
uses the terms information retrieval system and data or knowledge retrieval
system interchangeably, further investigation of the area reveals that information
retrieval systems are in fact a broader generalisation of data or knowledge retrieval
systems. In essence, the three systems differ in the nature of the queries involved
and in the expected result of the queries.
Data retrieval systems provide users with the ability to retrieve specific
data. Thus, the queries in data retrieval systems are necessarily very precise in
nature. Aside from the query, the data are usually also organised in a well defined
structure. An example of a data retrieval system is a relational database.
Knowledge retrieval systems provide users with the ability to find answers to
specific questions. Unlike data retrieval systems, knowledge retrieval systems’
data may not necessarily be well structured. However, in both knowledge retrieval
and data retrieval systems queries on the data are very specific and precise.
On the other hand, a query in an information retrieval system, as we have
stated previously, may involve ambiguity or uncertainty. The user of an
information retrieval system does not search for specific data as in data retrieval,
nor search for direct answers to a question as in knowledge retrieval. The
information or knowledge is acquired by the user through the reading of the
documents. For example, a user may want to get information on the topic cheap
production methods for assembled electronic goods. This does not necessarily
imply that the user wants a specific answer to the specific question, What are the
cheap methods? or How do the cheap and expensive methods differ? Even in the
situation whereby one has some specific questions in mind, the aim is to acquire
overall information such that not only those questions but also others suggested by
reading the documents can be answered.
As well as increasing the user’s knowledge of the subject behind the
query, the reading process may clarify the relationship between the user’s needs
and those documents perceived to be relevant by the users since the relationship
between those needs and what information meets them is not necessarily obvious.
For instance, the user’s query on cheap production methods for assembled
electronic goods may be met by the article entitled “Assembly line workers in
third world countries: human rights vs national income”. This article may not
specifically discuss how to cheaply produce electronic goods, but it makes the related
assertion that cheap labour in third world countries can be a way of reducing
production costs. There is, therefore, a link in terms of relevance between the
user’s query and the article. However, if the information retrieval system in use
based its matching function solely on matching keywords, as do traditional
systems, the above article may not be retrieved because the article may not
actually contain the words cheap, production or electronic goods.
As we have stated, traditional information retrieval systems performed
matching at the keyword level. We have shown through the above example that
this approach may miss relevant articles which match the query at the concept
level. As a result, adding a knowledge base into the systems has become necessary
in order to provide better retrieval. Current research in information retrieval
systems aims to perform matching at the concept level. Section 2.4 discusses
different methods used in defining the matching functions. In the next section, we
investigate the two other tasks involved in the information retrieval, namely
document and query indexing.
2.3 Document and Query Indexing
Document and query indexing are very important tasks in any information retrieval
system. However, document and query indexing are also considered to be the
most difficult tasks to carry out successfully. The indexing task is considered
difficult to implement because natural language ambiguity introduces uncertainty
in the text analysis process of indexing documents and queries. Salton [Salton88]
suggests the indexing process is not required if the collection is considered small.
In a small collection, a full text scanning method will be more efficient in
retrieving the documents from the collection than using matching function on
document and query indexes.
Today’s document databases are large due to the amount of information
available in digital form and this volume of information will only increase with
time. Full text scanning methods are impractical for such databases given the
capability of the current computer technology. In other words, document indexing
has to be performed regardless of the problem of uncertainty in text analysis
during the indexing process. As a result, reducing uncertainty becomes part of the
problem domain of both the document and query indexing tasks in information
retrieval systems.
2.3.1 Indexing Problems
Three main factors contribute to the problem of uncertainty in document and
query indexing. Firstly, there is the problem posed by the variability in the ways
that a concept may be expressed [Fuhr86]. One word may have different
interpretations in different contexts. This is partly a matter of language.
Considering the same query example introduced earlier in this chapter, cheap
production methods for assembled electronic goods, the word assembled may be
interpreted as a unit of construction or as how the assembly will be carried out, e.g.
machine-made.
The second problem may occur due to underspecification of the request.
Sometimes a user does not provide enough details or specifications in the query.
This produces a vague request, such as the qualification of cheap methods in the
example query. Does this mean cheap in the sense of economical production or
cheap as in low quality? Request underspecification can also occur when the
request itself is incomplete. For example, considering our example query, the user
may want information not only about the production method of assembled
electronic goods, but also about design aspects of the goods. However, it is
unlikely that the system will retrieve documents containing design aspects of cheap
electronic goods since design and production cannot be generalised into
“production method”. Both vague and incomplete requests contribute to the
request underspecification, the difference between them being that in the first case,
a vague request, the user may not realise the inherent ambiguity in the query,
whereas in the second case, that of an incomplete request, the user has failed to
include sufficient detail in the query. Request underspecification is less obvious
than the variability problems. Nevertheless, it still contributes to uncertainty in the
information retrieval process. Both request underspecification and variability
problems follow from the user’s ignorance before the reading process is
undertaken.
The third problem is that of document descriptor reduction. The following
example illustrates this problem: in the article “Assembly line workers in the third
world countries: human rights vs national income”, the term national income has
a narrower meaning than may be expected. The article actually describes national
income, but in the narrower sense of national income generated from export. In
this case, the reduction of a document description by the author led to an indirect
description, a generalisation of export-generated income to national income.
This problem can never be completely avoided - the author of a document always
leaves much unsaid on a subject - nor is it always harmful.
descriptions of document contents may seem to increase ambiguity, but it can
increase both the efficiency of matching and the effectiveness of document
classifying.
Information retrieval can thus be seen to impose conflicting demands on
text descriptors. It requires that they be generalising but accurate, as well as
discriminating and summarising. Meeting these demands becomes the fundamental
goal of an indexing language, a language that is required to perform the indexing
process [Lewis96].
2.3.2 Indexing Language
In the previous section, we have examined problems associated with the indexing
process. Since human beings have the capability to handle ambiguity in natural
language, the obvious solution to the indexing problems would seem to be
manual indexing. In fact, the indexing process in early information retrieval
systems was carried out manually by human experts in the subject domain. To
date, manual indexing is still considered superior to automatic indexing in its
capacity to handle uncertainty. However, manual indexing suffers from high
operational costs and would be almost impossible to perform in today’s document
databases due to their size. Automatic indexing has become an active area of
information retrieval research. To perform automatic indexing, an indexing
language needs to be defined. The indexing language consists of a term
vocabulary and methods of constructing representation.
An indexing language’s term vocabulary can be either derived from the
text of the document described or may be arrived at independently from the text.
The use of elements of vocabulary derived from the text itself is called the natural
language approach. The other approach, which uses terms independent of the text
in the vocabulary is known as the controlled vocabulary method.
There are many representation construction methods in text retrieval
systems [Milstead89]. However, they all share the common goal of indexing, that
is, to create document and query representations which are both summarising
and discriminating. To achieve this goal, the index construction methods perform
the following steps:
1. Eliminate common terms from the document or query which are bad
discriminators. The systems usually have a list of common terms which
are kept in a stop word list (refer to section 2.5.1).
2. Break down the document and query into individual terms.
3. Eliminate suffixes and prefixes from the terms.
4. Assign weights to the terms to identify those significant in the
collection.
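The four steps above can be sketched as a small indexing routine (the stop word list and the crude suffix rule below are illustrative stand-ins for a real stop list and stemmer):

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "for", "and", "is"}  # illustrative stop word list

def index_text(text):
    # Steps 1-2: break the text into individual terms and drop common terms.
    terms = [t for t in text.lower().split() if t not in STOP_WORDS]
    # Step 3: crude suffix stripping; a real system would use a proper stemmer.
    stems = [t[:-1] if t.endswith("s") else t for t in terms]
    # Step 4: assign a weight to each term; here simply its term frequency.
    return Counter(stems)

index = index_text("cheap production methods for assembled electronic goods")
# "for" is removed as a stop word; "methods" and "goods" lose their suffix.
print(index)
```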
One of the common methods for assigning weights to the indexed terms uses
statistical methods, with each term given a weight according to its importance to
the collection. The first such weighting scheme was introduced by Luhn [Luhn58].
He proposed the use of a term frequency (tf) to measure the term’s significance in
the document. In fact it provides a local weight calculation for each term and can
be formulated as:
x_{i,k} = f_{i,k}    (2.1)
where
xi,k is the weight of term i in document k
fi,k is the frequency occurrence of term i in document k
This idea was developed further by Sparck-Jones [Sparck-Jones72] who added an
inverse document frequency(idf) to the weighting scheme as a global weight
which can be formulated as follows:
y_i = log(N / f_i)    (2.2)
where
yi is term i inverse document frequency.
N is number of documents in the collection
fi is number of documents in which term i appears.
Global weighting is important for discriminating terms because very high
frequency words cannot be considered to be good discriminators if they appear in
most of the documents in the collection. By taking into consideration the number
of documents containing a given term, this problem can be tackled.
Combining the term frequency (equation 2.1) and inverse document
frequency (equation 2.2), the final weight of a term in the collection can be
calculated as
w_{i,k} = x_{i,k} × y_i    (2.3)
where
wi,k = weight of term i in document k
xi,k = term i's term frequency in document k
yi = term i's inverse document frequency
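As an illustration, the tf-idf weighting of equations 2.1-2.3 can be computed for a toy collection (the documents and the function name are invented for this sketch):

```python
import math

def tfidf_weights(docs):
    """Compute w_{i,k} = x_{i,k} * y_i (equations 2.1-2.3) for a toy collection.

    docs: list of documents, each given as a list of terms.
    Returns a dict mapping (term, document index) to its weight.
    """
    N = len(docs)
    # f_i: number of documents in which term i appears.
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = {}
    for k, doc in enumerate(docs):
        for term in set(doc):
            tf = doc.count(term)          # x_{i,k}, equation 2.1
            idf = math.log(N / df[term])  # y_i, equation 2.2
            weights[(term, k)] = tf * idf # w_{i,k}, equation 2.3
    return weights

docs = [["cheap", "production", "cheap"], ["production", "cost"]]
w = tfidf_weights(docs)
# "production" occurs in every document, so its idf (and hence weight) is zero.
print(w)
```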
There have been some objections from natural language processing
researchers (namely [Strzalkowski93] and [Lewis96]) to the use of pure statistical
methods for estimating the terms’ weight in a collection. Their objection concerns
mostly their doubt in the ability of such a weighting scheme to handle phrases. It
becomes more difficult to justify the assertion of term independence in a collection
once phrases are introduced. For example, consider the phrase take over. Weights
used to discriminate documents containing this phrase may not be successful
because the individual words take and over are common words (i.e. they may have
a small global weighting). Although the idea of using phrases is quite attractive, it
has not been shown for all cases that this form of retrieval exhibits advantages
over non-phrase supporting information retrieval systems.
Indexing languages may be classified as pre-coordinate or post-coordinate
according to the time at which they choose to organise and use the terms resulting
from the indexing process. In pre-coordinate indexing, the terms are coordinated
at the time of indexing by logically combining any index terms as a label which
identifies a class of documents. In post-coordinate indexing, the same class of
documents would be identified at search time by combining classes of documents
labeled with the individual terms.
In the next section, we examine how matching functions use the document
and query representations resulting from the indexing process.
2.4 Matching Functions
Matching functions are the main engine of information retrieval systems. Once
representations for documents and queries are built, these representations are used
by the matching function to achieve the three following related tasks:
1. To locate or identify items related to a user query.
2. To identify both related and distinct documents in the collection.
3. To predict the relevance of a document to the user’s information request
through the use of index terms with well defined scope and meaning.
Many matching functions have been proposed over the years by researchers in the
information retrieval area. In this section we examine three different matching
function models, namely the Boolean, vector space and probabilistic models.
2.4.1 Boolean Model
The Boolean model is considered to be the simplest matching function in
information retrieval. Relationships or similarities between individual documents
are not utilised, neither are any relationships between query terms. In systems
which use the Boolean model, the users’ query is represented only as
combinations of terms that a relevant document is expected to contain. For
example, one may require all documents which contain the two terms (design and
production) or the three terms (cheap, electronics and good). The query Q can be
formulated as
Q = (design AND production) OR (cheap AND electronics AND good)
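A minimal sketch of how this Boolean query could be evaluated against a document's set of index terms (the example documents are invented for illustration):

```python
def matches(doc_terms):
    """Evaluate Q = (design AND production) OR (cheap AND electronics AND good).

    doc_terms: set of index terms present in the document.
    Returns a binary (True/False) retrieval decision, as in the Boolean model.
    """
    return (("design" in doc_terms and "production" in doc_terms)
            or {"cheap", "electronics", "good"} <= doc_terms)

print(matches({"design", "production", "line"}))  # True
print(matches({"cheap", "electronics"}))          # False
```

Note that every matching document gets the same (binary) score, which is exactly the drawback discussed next.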
Simplicity of implementation is the main advantage of the Boolean model. The
documents’ similarity to the query is calculated solely on the basis of a binary
decision as to whether the query terms exist in the document representation. As a
result, documents retrieved by the Boolean model are weighted equally against the
users’ query. Thus the first document retrieved is not necessarily the most relevant
document. This drawback of the Boolean model due to the binary nature of its
retrieval decision function is frequently cited [Croft86, Salton83, Salton88,
Losee88].
The solution to the problem of equally weighted documents exhibited in the
Boolean model has become a goal of research in information retrieval. This
research has concentrated on building a retrieval model that has the ability to
weight the relevance of the documents against the query. In the following
sections, 2.4.2 and 2.4.3, we examine two other models of retrieval, the vector
space and probabilistic models, which do produce a ranked output.
2.4.2 Vector Space Model
The vector space model represents both the documents and queries as a vector of
terms. Both document and query representations are described as points in T
dimensional space, where T is the number of unique terms in the document
collection. Figure 2-2 shows an example of a vector space model representation
for a system with three terms.
[Figure: document vectors D1 and D2 and query vector Q plotted in a three-dimensional term space; α is the angle between D1 and D2, β the angle between D1 and Q]
Figure 2-2 Three dimensional vector space.
Each axis in the space corresponds to a different term. The position of each
document vector in the space is determined by the magnitude (weight) of the
terms in that vector. A similarity computation measuring the similarity between a
particular document vector and a particular query vector as a function of the
magnitudes of the matching terms in the respective vectors may be used to identify
the relevant documents. The simplest such scheme to calculate the similarity is to
assume that the document containing the most terms from the query will be the
most relevant. Thus the similarity between a query Q and the kth document, Dk,
can be calculated as an inner product of term vectors in Q and Dk. Formally it can
be represented as
sim(Q, D_k) = Σ_{i=1}^{n} q_i t_{ik}    (2.4)
where
Q is the query vector
Dk is the kth document vector in the collection
qi is the term i in the query Q
tik is the term i in the document Dk
n is the total number of query terms.
Besides the inner product approach, another well understood (and more
widely accepted in information retrieval systems) vector similarity measure is the
cosine correlation function. In the cosine correlation function, the angle between
documents or documents and a query measures the similarity between the vectors
that represents them. Consider the situation depicted in figure 2-2. The similarity
between D1 and D2 would be measured by the angle α. The similarity between
documents D1 to query Q is measured by angle β. The cosine correlation
function is shown in Table 2-1 (which also includes some other common vector
space similarity measures).
Similarity measure      Formula for sim(Q, D_k)

Inner product           Σ_{i=1}^{n} q_i t_{ik}

Cosine correlation      Σ_{i=1}^{n} q_i t_{ik} / [ (Σ_{i=1}^{n} q_i^2)^{1/2} (Σ_{i=1}^{n} t_{ik}^2)^{1/2} ]

Dice measure            2 Σ_{i=1}^{n} q_i t_{ik} / ( Σ_{i=1}^{n} q_i^2 + Σ_{i=1}^{n} t_{ik}^2 )

Jaccard measure         Σ_{i=1}^{n} q_i t_{ik} / ( Σ_{i=1}^{n} q_i^2 + Σ_{i=1}^{n} t_{ik}^2 − Σ_{i=1}^{n} q_i t_{ik} )

Table 2-1 Similarity measures.
We note that the numerator of the cosine formula gives the sum of the
product of the matching terms between query Q and document Dk. That is, when
binary indexing is used, the numerator is the total number of matching terms in
query Q and document Dk. When the indexing is not binary, the numerator
represents the sum of the products of term weights for the matching terms in Q
and Dk. The denominator in the cosine similarity function acts as a normalising
factor because it takes into consideration the number of terms contained in a
document. The longer the documents, that is the more terms used to describe the
documents, the smaller the cosine similarities. Thus, unlike the inner product, the
cosine measure takes into consideration the effect of a document’s length. Inner
product measures always discriminate against short documents because short
documents always produce a shorter term vector sum compared with long
documents.
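The similarity measures of table 2-1 can be sketched directly from their formulas (the example term vectors are invented; binary or weighted vectors both work):

```python
import math

def inner(q, d):
    # Inner product: sum of products of matching term weights.
    return sum(qi * di for qi, di in zip(q, d))

def cosine(q, d):
    # Cosine correlation: inner product normalised by the vector lengths.
    return inner(q, d) / math.sqrt(inner(q, q) * inner(d, d))

def dice(q, d):
    return 2 * inner(q, d) / (inner(q, q) + inner(d, d))

def jaccard(q, d):
    return inner(q, d) / (inner(q, q) + inner(d, d) - inner(q, d))

q = [1, 1, 0]  # query vector over three terms
d = [2, 0, 1]  # document vector over the same terms
# cosine(q, d) normalises away document length; inner(q, d) does not.
print(inner(q, d), cosine(q, d))
```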
Using such similarity functions, the vector space model can produce a
ranked output. The capability to produce ranked document output gives the
vector space model an advantage over the Boolean model. However, the lack of
formal methods to support the vector space model in handling uncertainty has
driven research in information retrieval towards seeking models that can support
uncertainty. In the next section, we analyse the probabilistic matching function
model which does provide more formal support to handle uncertainty.
2.4.3 Probabilistic Model
The probabilistic model attempts to address the uncertainty problem in
information retrieval through the formal methods of probability theory. Unlike in
the vector space model, in this model the document ranking is based on the
probability of the relevance of documents to the query submitted by the user.
This has been formalised and is known as the Probability Ranking Principle
[Robertson77]. There are three different models of probabilistic retrieval: binary
independence [Robertson76, Rijsbergen79], the unified model [Robertson82] and
retrieval with probabilistic indexing (RPI) [Fuhr89]. The models differ in their
treatment of and assumptions behind the probability of relevance. In this section,
we analyse the formulation of these probabilistic models and state the assumptions
associated with them.
2.4.3.1 Binary Independence Model
As the name implies, this model assumes that the index terms exist independently
in the documents and we can then assign binary values to these index terms. For a
further illustration of this model, consider a document Dk in a collection,
represented by a binary vector t = (t1, t2, t3, …, tu), where u represents the total
number of terms in the collection, ti = 1 indicates the presence of the ith index term and
ti=0 indicates its absence. A decision rule can be formulated by which any
document can be assigned to either the relevant or non-relevant set of documents
for a particular query. The obvious rule is to assign a document to the relevant set
if the probability of the document being relevant given the document
representation is greater than the probability of document being non relevant, that
is, if:
P(relevant|t) > P(non-relevant|t) (2.5)
Using Bayes’s theorem, equation 2.5 can be rewritten as:
P(t|relevant) > P(t|non-relevant)    (2.6)
This decision rule, when expressed as a weighting function g(t), becomes:
g(t) = log P(t|relevant) − log P(t|non-relevant)    (2.7)
This means we can now use the weighting function g(t) to rank the documents
according to their g(·) value, such that the more highly ranked a document is, the
more likely it is to be relevant to the query.
Since the calculation of the probabilities P(t|relevant) and P(t|non-relevant) is
difficult, we have to assume that the index terms occur independently in the
relevant and non-relevant documents, so that we can calculate P(t|relevant) as:
P(t|relevant)=P(t1|relevant)P(t2|relevant)…P(tn|relevant) (2.8)
and similarly for P(t|non-relevant).
Now let:
pi=P(ti=1|relevant) (2.9)
qi=P(ti=1|non-relevant) (2.10)
So pi and qi are the probabilities that an index term occurs in the relevant or non-
relevant document sets respectively. Then
P(t|relevant) = Π_{i=1}^{n} p_i^{t_i} (1 − p_i)^{1−t_i}    (2.11)

P(t|non-relevant) = Π_{i=1}^{n} q_i^{t_i} (1 − q_i)^{1−t_i}    (2.12)
Substituting 2.11 and 2.12 into 2.7, we have

g(t) = Σ_{i=1}^{n} t_i log[ p_i(1 − q_i) / (q_i(1 − p_i)) ] + Σ_{i=1}^{n} log[ (1 − p_i) / (1 − q_i) ]    (2.13)
The second summation in equation 2.13 is constant for a given query and does not
affect the ranking of documents. Since probabilistic models assume the relevant
and non-relevant sets can only be calculated for a single query, this second
summation can be omitted from the calculation. However, it can be interpreted as
a cut-off value to the retrieval function. That is, only documents that have a
relevance value greater than this constant value are retrieved as relevant
documents. This capability in fact gives the probabilistic model an advantage over
the vector space model. In the vector space model, such a cut-off value has to be
found through trial and error, because its mathematical model does not provide
support for it.
Omitting the second part of equation 2.13, the weighting function g(t) can be
formulated as

g(t) = Σ_{i=1}^{n} t_i log[ p_i(1 − q_i) / (q_i(1 − p_i)) ]    (2.14)
Observation of equation 2.14 shows that g(t) is equivalent to a simple matching
function between query and document where query term i has the weight
log[ p_i(1 − q_i) / (q_i(1 − p_i)) ]. This weighting scheme was first introduced by
Robertson and Sparck-Jones [Robertson76].
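The ranking function of equation 2.14 can be sketched as follows (the probabilities p_i and q_i below are invented for illustration; in practice they must be estimated from relevance information, as discussed next):

```python
import math

def rsj_weight(p, q):
    """Robertson/Sparck-Jones term weight log[p(1-q) / (q(1-p))] (equation 2.14)."""
    return math.log((p * (1 - q)) / (q * (1 - p)))

def g(doc_vector, p, q):
    """Ranking score g(t) = sum_i t_i * rsj_weight(p_i, q_i) over the query terms.

    doc_vector: binary term vector t for the document (1 = term present).
    p, q      : per-term probabilities p_i and q_i (equations 2.9 and 2.10).
    """
    return sum(t * rsj_weight(pi, qi) for t, pi, qi in zip(doc_vector, p, q))

# Illustrative probabilities for two query terms:
p = [0.8, 0.5]   # p_i = P(t_i = 1 | relevant)
q = [0.2, 0.5]   # q_i = P(t_i = 1 | non-relevant)
t = [1, 1]       # both terms present in the document
# The second term is equally likely in both sets, so it contributes nothing.
print(g(t, p, q))
```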
As in any probabilistic model, a prior probability needs to be defined
before any inference can be calculated. Therefore, probabilistic models rely on the
major assumption that relevance information is available in the collection to define
the prior probability; that is, that some or all of the relevant and non-relevant
documents have been identified. In reality, this assumption is very difficult
to satisfy because the relevance information is not easy to obtain at the early stage
of a search. One way of overcoming this problem is to use an interactive search at
an early stage of a search. An interactive search can be used to provide the
information retrieval systems with relevance information. The users’ judgement of
the document ranking in this search is then used as relevance information in the
next search [Sparck-Jones79].
In the situation where there is no relevance information available or in the
case of a non-interactive search, a combination of similarity measures shown in
table 2.1 with the inverse document frequency can be used to define prior
probability [Croft79]. Consider the inner product similarity measure. The
combined measure using the inverse document frequency in this case can be
formulated as:
sim(Q, D_k) = Σ_{i=1}^{n} f_{iq} f_{ik} log(N / f_i)    (2.15)
where
fiq is term i's term frequency in query Q.
fik is term i's term frequency in document Dk.
N is the total number of documents in the collection.
fi is the total number of documents in which term i exists.
n is the total number of query terms.
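Equation 2.15 can be sketched directly (the toy frequencies below are invented for illustration):

```python
import math

def sim(query_tf, doc_tf, df, N):
    """Equation 2.15: sum over query terms of f_iq * f_ik * log(N / f_i).

    query_tf : {term: frequency of the term in the query}
    doc_tf   : {term: frequency of the term in the document}
    df       : {term: number of documents containing the term}
    N        : total number of documents in the collection
    """
    return sum(fq * doc_tf.get(term, 0) * math.log(N / df[term])
               for term, fq in query_tf.items())

query_tf = {"cheap": 1, "production": 1}
doc_tf = {"cheap": 2, "cost": 1}
df = {"cheap": 10, "production": 50}
# Only "cheap" matches the document, so only it contributes to the score.
print(sim(query_tf, doc_tf, df, N=1000))
```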
We have stated above that the probabilistic model assumes that the terms
in the document are distributed independently. However, Rijsbergen [Harper78]
argues that this assumption is often made as a matter of mathematical
convenience, although it is generally agreed that exploitation of associations
between items of information retrieval systems, such as index terms or documents
will improve the effectiveness of retrieval. In our studies we analyse the possibility
of exploiting these associations in order to improve retrieval performance. We use
the Bayesian belief network model, which will be explained in detail in chapter 4.
2.4.3.2 Unified Model
The Unified model exists as a combination of Maron-Kuhn’s model [Maron60]
and Cooper’s model [Cooper78] with the binary independence model. The Maron-
Kuhn model differs from the binary independence model in the assumption of
document relevancy. In their model, a record of the number of times each query is
submitted and of which documents are judged relevant or non-relevant to each
query is kept. This information is then used to determine the frequency with which
a document has been judged relevant to each query submitted. This frequency
in turn is used to estimate the probability of relevance and documents are ranked
accordingly.
Thus, this model combines the judgements of multiple users in order to
compute the probability of relevance with respect to a set of equivalent queries.
This differs from the binary independence model, which views the association
between a document and its index terms as fixed by the collection and independent
of the use of the index terms in the queries. In other words, the relevance
judgment in the binary independence model does not come from the association of
the index terms in the query and the documents. To illustrate the difference in
more detail, consider the following, letting:
Q be the set of all (past and future) queries of the retrieval system.
D be the set of all (past and present) documents in the system.
QS be the set of queries that use the same query terms.1
DS be the set of all the documents to which the same index terms have
been applied.
qm be an individual query (qm ∈ QS).
dk be an individual document (dk ∈ DS).
R be the event of relevance.
1 It is assumed that the same query terms may represent different information needs. The same applies to documents: documents represented by the same index terms may contain different information.
The set consisting of all pairs (dk,qm) represents the event space. The
relevance R is a subset of this event space. Using the above notation, the methods
of calculating relevance according to the Maron-Kuhn, binary independence and
unified models, respectively, are as follows:
Maron-Kuhn model: P(R|QS,dk).
Binary Independence: P(R|qm,DS).
Unified: P(R|dk,qm).
The unified model combines the estimates provided by the Maron-Kuhn model and
the binary independence model to derive the relevance judgment of an
individual document for a query. The unified model attempts to generalise the two
models such that it reduces to the Maron-Kuhn model when only query history is
available, and to the binary independence model when only document
representation data is available. When both query and document representation
data are available, the full unified model is used. However, the combination of the two
models provided by the unified model does not solve the problems of probability
estimation inherent in the two individual models [Fuhr92]. Thus, a better model,
one that incorporates good probability estimation when no initial statistical data is
available, is still required.
2.4.3.3 Retrieval with Probabilistic Indexing (RPI) Model
The RPI model is a generalisation of the binary independence model. This model
makes a more detailed assumption about the relevancy of the index term
assignment to the document than the binary independence model does.
To illustrate the model, consider the following. Let:
dk represent a document in the collection,
ti be the binary vector (t1,t2,t3,…,tn) of index terms in document dk,
qm be a query,
C denote the event of correctness.
Unlike the binary independence and unified models, which calculate the
probability of relevance over document-query pairs (dk,qm), the RPI model measures the correctness of
the assignment of ti to dk by assigning a value to C. The probability is now measured
as P(C|ti,dk,qm). The decision as to whether the assignment of ti to dk is correct or
not can be specified in various ways, for example, by comparison with the results
of manual indexing, or by comparing retrieval results. Thus, parameter
estimation, or more specifically the estimation of the correctness of the index term
assignment, still relies on ad-hoc estimation.
We have in this section discussed several information retrieval matching
function models, namely the Boolean, vector space, binary independence, unified
and RPI models. Each has its own drawbacks. The vector space
model lacks mathematical support for the handling of uncertainty. The
probabilistic approach attempts to provide models with strong mathematical
foundations but falls short due to the need for some ad-hoc probability estimations.
In chapter 4 we will introduce a probabilistic retrieval model, based on Bayesian
networks, which overcomes these problems of probability estimation.
2.5 Increasing Retrieval Performance
Regardless of the limitations of the retrieval models discussed in the previous
section, there are several methods available to improve the retrieval performance
of such information retrieval systems. These methods are usually not considered to
be part of the retrieval model as such, but rather as additional components of the
retrieval system. Before we discuss these methods further, we present a
common definition of performance measurement in information retrieval systems.
An information retrieval system finds documents that are intended to be
relevant to the user’s query. In a very real sense only the user knows exactly what
is relevant to his or her information needs. An information retrieval system can only
suggest “relevant” documents for the user to read. In this situation, providing
a uniform performance evaluation can be difficult. Research in this area, however,
has provided common performance measurements, based on recall-precision on a
standard test collection. The recall level describes the completeness of the
retrieval; precision represents the accuracy of the retrieval. These can be defined
formally as follows:
recall = r / R    (2.16)

precision = r / N    (2.17)
where
r is the number of relevant documents retrieved for a given query.
R is the number of relevant documents in the collection for a given query.
N is the number of documents retrieved for a given query.
Both high recall and high precision are desirable in information retrieval systems.
However, they are very difficult to achieve simultaneously: high recall usually
comes at the cost of poor precision. Chapter 6 discusses performance
measurement in information retrieval systems in detail.
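As a minimal sketch, equations 2.16 and 2.17 can be computed directly from the sets of retrieved and relevant documents for a query; the document identifiers below are hypothetical:

```python
def recall_precision(retrieved, relevant):
    """Recall (eq. 2.16) and precision (eq. 2.17) for a single query."""
    r = len(retrieved & relevant)                         # relevant documents retrieved
    recall = r / len(relevant) if relevant else 0.0       # r / R
    precision = r / len(retrieved) if retrieved else 0.0  # r / N
    return recall, precision

# 3 of the 4 relevant documents appear among the 5 retrieved.
rec, prec = recall_precision({1, 2, 3, 4, 5}, {2, 3, 5, 9})
# rec = 0.75, prec = 0.6
```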
The following sub-sections 2.5.1-2.5.4 analyse different methods that can
be used to improve the performance of all the retrieval models discussed in section
2.4. These methods may be combined to achieve optimal retrieval.
2.5.1 Stop List
Every word in a language has a meaning. However, not all words have the
ability to distinguish one document from another. For example, the word “the”
will never provide such information. Many retrieval systems therefore provide a stop list, a
list containing such words that have no discrimination capacity. The stop
list is used during document indexing and query formulation: any word within the
list that appears in the document or query is discarded. A word may have
discrimination capacity in one domain and not in another. Thus, different
knowledge domains may employ different stop lists, but caution is required when
adding words to a stop list. A very specific stop list can result in low recall because
the query becomes too specific.
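A minimal sketch of stop-list filtering during indexing; the word list here is purely illustrative, not a recommended stop list:

```python
# Illustrative stop list: words with no discrimination capacity.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def index_terms(text, stop_words=STOP_WORDS):
    """Tokenise a text and discard any word that appears in the stop list."""
    return [w for w in text.lower().split() if w not in stop_words]

index_terms("The retrieval of the documents in a collection")
# -> ['retrieval', 'documents', 'collection']
```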
2.5.2 Term Weighting
A document can be described by the presence or absence of index terms; that is, any
document can be represented by a binary vector. For example, if document dk belongs
to a collection whose vocabulary contains six terms, and dk contains
terms t1, t3, t4 and t5 but not t2 and t6, it can be represented as
dk = (1,0,1,1,1,0)
Every term in the index is treated equally. One may argue that this does
not reflect the real-life situation, where one term or word may be more
important than others. Indeed, many information retrieval systems employ
term weighting to capture the importance of individual terms in the collection, as
we discussed in section 2.3.2.
The term frequency (tf) within a document can indicate the importance of
the terms in the document. In other words, the term frequency can be used to
summarise the contents of a document. However, using within-document
frequency alone is not enough, because it cannot be used to discriminate between
documents in the collection effectively [Sparck-Jones79]. Consider the following
case: the word computer may have a very high frequency in a document belonging
to the Communications of the ACM collection. However, almost every document in that
collection has a high frequency of the word computer, because the collection's domain is
computer theory and application. This shows that the word computer
has no ability to discriminate between the documents. The more documents
represented by a particular term, the less important this term is for
distinguishing one document from another. As we have explained in section 2.3.1,
a good document representation has to be able to summarise and discriminate the
documents at the same time. The inverse document frequency (idf) may be introduced
into the term weighting as a discriminator. The combination of tf and idf is usually
used as in equation 2.3.
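As a sketch, one common tf-idf combination multiplies the within-document frequency by log(N/fi), in the notation of section 2.4; the exact form of equation 2.3 may differ, and the frequencies below are invented:

```python
import math

def tf_idf(f_ik, N, f_i):
    """Weight of term i in document k: within-document frequency f_ik
    multiplied by the inverse document frequency log(N / f_i)."""
    return f_ik * math.log(N / f_i)

# A term that occurs in almost every document gets almost no weight,
# even with a high within-document frequency; a rarer term dominates.
w_common = tf_idf(5, 1000, 990)   # frequent word, present in 990 of 1000 documents
w_rare = tf_idf(2, 1000, 10)      # less frequent word, present in 10 documents
```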
2.5.3 Thesaurus
One obvious problem with query formulation is that there are often many ways to
say the same thing. Introducing a thesaurus to match synonyms and closely
related words is one solution to this problem. It can be used to expand the user’s
query by adding the synonyms or related words to the initial query submitted.
The thesaurus can be generated automatically from the text in the
collection by means of calculating similarity amongst the terms in the collection.
Given the matrix of document-term relations:

      T1   T2   …   Tm
D1   w11  w12   …  w1m
D2   w21  w22   …  w2m
D3   w31  w32   …  w3m
…      …    …   …    …
DN   wN1  wN2   …  wNm

The similarity measure between term Tj and term Tm can be calculated by

sim(Tj, Tm) = Σ_{i=1}^{N} wij wim    (2.18)

where
N is the number of documents in the collection
wij is the weight of term j in document i
Once the similarities for all the terms are computed, a term can be put into
a group if its similarity with at least one member of the cluster, or with all
members of the cluster, exceeds a stated threshold. The first situation is
called single-link classification; the second is called complete-link classification. It
has been claimed that an automatic thesaurus generated from the text of the
domain in which it is used can increase recall by up to 20% [Salton71, Croft88].
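Equation 2.18 and single-link grouping can be sketched as follows; the small matrix and the threshold are illustrative only:

```python
def term_similarity(W, j, m):
    """sim(Tj, Tm) = sum over documents i of w_ij * w_im (equation 2.18).
    W is the document-term matrix: W[i][j] is the weight of term j
    in document i."""
    return sum(row[j] * row[m] for row in W)

def single_link_groups(W, threshold):
    """Single-link classification: a term joins an existing group if its
    similarity with at least one member exceeds the threshold."""
    groups = []
    for t in range(len(W[0])):
        for g in groups:
            if any(term_similarity(W, t, m) > threshold for m in g):
                g.append(t)
                break
        else:                      # no group was similar enough: start a new one
            groups.append([t])
    return groups

W = [[1, 1, 0],                    # terms 0 and 1 co-occur in two documents
     [1, 1, 0],
     [0, 0, 1]]
single_link_groups(W, 1)           # -> [[0, 1], [2]]
```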
2.5.4 Relevance Feedback
Retrieving all relevant documents, that is, achieving a 100% level of recall, has not been
accomplished by existing information retrieval systems. The problem of limited
recall has been recognised as a major difficulty in information retrieval systems
[Lancaster69]. More recently, van Rijsbergen spoke of the limits of providing
increasingly better ranked results based solely on the initial query. He indicated a
need to modify the initial query to enable increased performance after a certain
level of recall is reached [Rijsbergen86].
For many years researchers have suggested relevance feedback as a
solution for query modification, because a user may give a vague or incomplete
initial request, as we discussed for the indexing problems in section 2.3.1. The
feedback given by the user can be used to re-weight the query terms and/or
to expand the query by adding new terms.
In the vector space model, relevance feedback is achieved by
merging the relevant document vectors with the initial query vector. This
automatically re-weights the query terms, adding weight to the initial query
terms for any query terms existing in the relevant documents and subtracting the
weights of those query terms occurring in non-relevant documents. Ide (1971)
formulated this as:
Q1 = Q0 + Σ_{k=1}^{x} Rk − Σ_{k=1}^{y} Sk    (2.19)

where
Q1 is the modified query
Q0 is the original query
Rk is the vector for relevant document k
Sk is the vector for non-relevant document k
x is the number of relevant documents
y is the number of non-relevant documents
The query is also automatically expanded by adding all terms not in the
original query that occur in the relevant and non-relevant documents.
These terms are added with positive or negative values according to whether they
come from relevant or non-relevant vectors respectively. Although the new
query includes new terms from non-relevant documents, the fact that such terms
carry negative weight means that they only contribute to determining the weights
of the new terms introduced by relevant documents.
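Equation 2.19 can be sketched on plain term-weight lists; the query and document vectors below are hypothetical:

```python
def ide_feedback(q0, relevant_docs, nonrelevant_docs):
    """Ide (1971) query modification (equation 2.19):
    Q1 = Q0 + sum of relevant document vectors - sum of non-relevant ones."""
    q1 = list(q0)
    for vec in relevant_docs:               # add weights from relevant documents
        q1 = [q + v for q, v in zip(q1, vec)]
    for vec in nonrelevant_docs:            # subtract weights from non-relevant ones
        q1 = [q - v for q, v in zip(q1, vec)]
    return q1

# Terms present only in a relevant document enter the query with positive
# weight; terms from a non-relevant document enter with negative weight.
ide_feedback([1, 0, 0], [[1, 1, 0]], [[0, 0, 1]])
# -> [2, 1, -1]
```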
Probabilistic retrieval treats the relevant and non-relevant sets equally in re-
weighting the query. Harman [Harman92] suggests that this particular treatment
may cause the poor performance of the probabilistic model in relevance feedback.
Her study has shown that the performance of relevance feedback in probabilistic
retrieval varies from one collection to another.
Another issue that the probabilistic models have to address in
incorporating relevance feedback in the model is that of probability estimations.
The fact that most of the probabilistic models use different probabilistic
estimations for producing the initial document ranking and relevance feedback2
may contribute to the inconsistent ability of probabilistic models to handle
relevance feedback. A model that has a common method for estimating the
probabilities for the initial document ranking and for relevance feedback is required.
A promising solution may be presented by considering an inference model.
Inference models are known to have learning capability. Relevance feedback is, in
fact, a learning process, since the document ranking may change due to new
knowledge learnt from the previous retrieval. Bayesian networks are one such
inference model, and may thus be used as an effective tool for incorporating
relevance feedback into an information retrieval system.
2.6 Summary
We have analysed and discussed the problems inherent in models for information
retrieval systems. The major problem faced by information retrieval systems is the
uncertainty involved due to the ambiguity of natural language. This ambiguity
itself cannot be totally eliminated, which makes the reading process important in
information retrieval and also makes information retrieval systems different from
data retrieval or question-answering systems. Regardless of the ambiguity inherent in
natural language, information retrieval systems must find those documents
assumed relevant to a user's given information need.
We have discussed several approaches to information retrieval models in
this chapter, namely the Boolean, vector space and probabilistic models. With the
2 See section 2.4.3; the estimation of the prior probability used to produce the initial document ranking is derived from ad-hoc estimation [Sparck-Jones79, Croft79].
limitations of these models in mind, there are several methods, usually not
considered part of the model, that can be used to further improve retrieval
performance. These methods include the stop list, thesaurus, term
weighting and relevance feedback. The probabilistic approach may be considered
the best of the approaches because it is based on well-established mathematical
theory for handling uncertainty. However, it still requires improvement in terms
of the development of built-in methods for estimating the probabilities for the initial
ranking and relevance feedback. We will introduce a model based on a Bayesian
network that can overcome this problem in chapter 4.
In the next chapter, we review probability theory in detail, in particular that
of Bayesian belief networks. Bayesian networks are a good candidate for a
framework that can provide the retrieval model with a common probability
estimation method for the initial document ranking and relevance feedback
through their support for inference.
CHAPTER 3
THEORY IN BAYESIAN NETWORKS
3.1 Introduction
Over the last few decades, interest in artificial intelligence research has been
growing rapidly, especially in the area of knowledge based systems. The phrases
knowledge based system and expert system are usually employed to denote
computer systems which incorporate some symbolic representation of human
knowledge. This symbolic representation of knowledge is used in turn by the
computer system to make decisions as if they had been made by a human expert.
By studying the many knowledge based systems developed for
different problem domains, artificial intelligence researchers have found that the
knowledge required for the decision process often cannot be precisely defined. In
fact, many real-life problem domains are fraught with uncertainty. Chapter 2 has
shown that information retrieval systems are not spared uncertainty in their
problem domain. The challenge in the research of building knowledge based
systems can be seen as that of modeling a human expert’s capability for handling
uncertainty. Human experts in particular problem domains are able to form
judgements and take decisions based on uncertain, incomplete or even
contradictory information. Therefore, a good knowledge based system, to be of
practical use, has to perform at least as well as a human expert
in handling uncertainty in a given problem domain.
In this chapter, we introduce a formalism for representing uncertainty
using Bayesian networks and associated algorithms for manipulating uncertain
information. There are many other formalisms, including rule based systems and
fuzzy logic. However, Bayesian networks have been accepted by a large
population of artificial intelligence researchers due to their powerful formalism
for representing domain knowledge and its associated uncertainty. In section 3.2
we recap classical probability theory and Bayes theory. In this section we provide
the formal development of Bayes theorem from that of the classical probability
theory. Section 3.3 reviews the difference between the Bayesian and classical
approaches to probability theory. Section 3.4 discusses the use of Bayesian
networks in knowledge based systems. This section includes a discussion of
the Bayesian network formalism and its properties, including the
implementation of conditional independence.
Any knowledge based system needs to be able to adapt to additional
knowledge that arrives at the system as evidence. The procedure to perform this
operation is known as the inference process. Section 3.5 looks at inference
processes in the Bayesian network. We conclude the chapter with a summary in
section 3.6.
3.2 Bayes Theorem
To understand Bayesian networks, it is important to understand the Bayesian
approach to probability and statistics. In this section, we contrast the Bayesian
view of probability with the classical view. We also present the main
theorem on which Bayesian probability and statistics are based, that is, Bayes
theorem. First, we present the development of the Bayes theorem.
The following derivation follows that of Neapolitan [Neapolitan90].
According to Laplace [Neapolitan90, pp28], probability is defined as:
The theory of chance consists in reducing all the events of some kind to a
certain number of cases equally possible, that is to say, such as we may be
equally undecided about in regard to their existence, and in determining the
number of cases favorable to the event whose probability is sought. The
ratio of this number to that of all the cases possible is the measure of the
probability.
Laplace's definition gives the framework of the classical approach, which states
that every possible outcome of an experiment has an equal chance. We discuss
the meaning of this definition in more detail below. First, we define the
sample space from which the possible outcomes are derived. During
this discussion we use the example of picking a card from a 52 card deck.
Definition 3.1. Let an experiment which has a set of mutually exclusive and
exhaustive outcomes be given. That set of outcomes is called the sample
space and is denoted by Ω.
In our experiment of picking a card from a deck, the sample space Ω is the set of 52
different outcomes. Next, we define an event in a sample space.
Definition 3.2 Let ℑ be the set of subsets of Ω such that
1. Ω ∈ ℑ
2. E1 and E2 ∈ ℑ implies E1 ∪ E2 ∈ ℑ
3. E ∈ ℑ implies Ē ∈ ℑ, where Ē denotes the complement of E
Then ℑ is called a set of events relative to Ω.
According to this definition, an event is simply a proposition with a
corresponding set of possible outcomes in the sample space Ω. For example, if an
event E is the proposition of getting a king from the deck of cards, then there are
4 corresponding possible outcomes in the sample space Ω, namely the king of spades,
king of hearts, king of diamonds and king of clubs. Next, we define the meaning of
a probability value for an event.
Definition 3.3 For each event E ∈ ℑ, there is a corresponding real number P(E),
called the probability of E. This number is obtained by dividing the
number of equipossible alternatives favorable to E by the total number of
equipossible alternatives or outcomes.
According to definition 3.3, the probability of the event that a king turns up is 4/52.
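Definition 3.3 can be checked mechanically by enumerating the equipossible outcomes of the card experiment; this sketch simply counts favorable outcomes:

```python
from fractions import Fraction

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["spades", "hearts", "diamonds", "clubs"]
deck = [(rank, suit) for rank in RANKS for suit in SUITS]   # sample space, 52 outcomes

def prob(event, sample_space):
    """Classical probability: favorable equipossible outcomes / total outcomes."""
    favorable = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(favorable, len(sample_space))

prob(lambda card: card[0] == "K", deck)     # 4/52 = Fraction(1, 13)
```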
Using the above definitions we can now prove some properties of probability
theory and of conditional probability [Neapolitan90].
Theorem 3.1 Let Ω be a finite set of sample points, ℑ a set of events relative to
Ω, and, for each E ∈ ℑ, let P(E) be the probability of event E according to the
classical definition of probability in definition 3.3. Then
1. P(E) ≥ 0 for E ∈ ℑ
2. P(Ω) = 1
3. If E1 and E2 are disjoint events in ℑ, then P(E1∪E2) = P(E1) + P(E2)
Proof. Let n be the number of equipossible outcomes in Ω.
1. If k is the number of equipossible outcomes in E, then, according to
definition 3.3,
P(E) = k/n ≥ 0
2. Following definition 3.3,
P(Ω) = n/n = 1
3. Let E1 and E2 be disjoint events, let k be the number of equipossible
outcomes in E1, and let m be the number of equipossible outcomes in E2.
Then, since E1 and E2 are disjoint, k+m is the number of equipossible
outcomes in E1∪E2. Thus, following definition 3.3,
P(E1∪E2) = (k+m)/n = k/n + m/n = P(E1) + P(E2)
Definition 3.4. Let Ω be the set of sample points, ℑ a set of events relative to Ω,
and P a function that assigns a unique real number to each E ∈ ℑ. Suppose P
satisfies the properties defined by theorem 3.1. Then (Ω,ℑ,P) is called a
probability space and P is called a probability measure on Ω.
Defining a probability space is very important for measuring any
probability value. Laplace [Neapolitan90] states that there is no absolute
probability value: any probability space exists relative to partial information or
knowledge, and different knowledge will generate different probability spaces. For
example, Natalie, a sneaky girl, peeks at the top of the card deck before a card is
drawn from it. She sees that the top card is a king but does not know to which suit
that king belongs. By doing this, Natalie has changed her probability space from
52 possible outcomes to 4 possible outcomes. The probability of the drawn card
being the king of hearts now becomes 1/4 instead of 1/52. This example illustrates
the importance of conditioning the probability on known knowledge or
information. Now, we define the meaning of conditional probability.
Theorem 3.2. Let (Ω,ℑ,P) be the probability space created according to the
classical definition of probability. Suppose E1 ∈ ℑ is nonempty and therefore
has positive probability. Then, assuming that the alternatives in E1
remain equipossible when it is known for certain that E1 has occurred, the
probability of E2 given that E1 has occurred is equal to
P(E1∩E2) / P(E1)
Proof. Let n, m and k be the number of sample points in Ω, E1 and E1 ∩ E2,
respectively. Then the number of equipossible alternatives based on the
information that E1 has occurred is equal to m, while the number of these
alternatives which are favorable to E2 is equal to k. Therefore the
probability of E2 given that E1 has occurred is equal to
k/m = (k/n)/(m/n) = P(E1∩E2)/P(E1)
Definition 3.5. Let (Ω,ℑ,P) be a probability space and E1 ∈ ℑ such that P(E1) >
0. Then for E2 ∈ ℑ, the conditional probability of E2, given E1, which is
denoted by P(E2|E1), is defined as follows:
P(E2|E1) = P(E1∩E2) / P(E1)
Definition 3.6. Let (Ω,ℑ,P) be a probability space and E1,E2,...,En be a set of
events such that for i≠j
Ei ∩ Ej = ∅
and
∪_{i=1}^{n} Ei = Ω
Then the events E1,E2,...,En are said to be mutually exclusive and
exhaustive.
Lemma 3.1. Let (Ω,ℑ,P) be a probability space and E1,E2,...,En be a set of
mutually exclusive and exhaustive events in ℑ such that for 1 ≤ i ≤ n,
P(Ei) > 0. Then for any E ∈ ℑ,
P(E) = Σ_{i=1}^{n} P(E|Ei)P(Ei)
Proof. Since the Ei's are exhaustive, we have that
E = (E∩E1) ∪ (E∩E2) ∪ … ∪ (E∩En)
Therefore, since the Ei's are mutually exclusive, by definition 3.4 we have that
P(E) = P(E∩E1) + P(E∩E2) + … + P(E∩En)
From definition 3.5,
P(E) = P(E|E1)P(E1) + P(E|E2)P(E2) + … + P(E|En)P(En)
Definition 3.5 is known as the classical or traditional view of conditional
probability. The following discussion illustrates the development of Bayes theorem,
which provides a different view of conditional probability.
Theorem 3.3. Bayes Theorem. Let (Ω,ℑ,P) be a probability space and
E1,E2,...,En be a set of mutually exclusive and exhaustive events in ℑ
such that for 1 ≤ i ≤ n, P(Ei) > 0. Then for any E ∈ ℑ such that P(E) > 0,
we have for 1 ≤ j ≤ n
P(Ej|E) = P(E|Ej)P(Ej) / Σ_{i=1}^{n} P(E|Ei)P(Ei)
Proof. Let E1,E2,...,En be a set of mutually exclusive and exhaustive events in ℑ
such that for 1 ≤ i ≤ n, P(Ei) > 0.
It follows from definition 3.5 that
P(Ej|E) = P(Ej∩E) / P(E)
or
P(Ej|E) = P(E|Ej)P(Ej) / P(E)
From Lemma 3.1 we have
P(E) = Σ_{i=1}^{n} P(E|Ei)P(Ei).
If E and E' are any two events such that P(E) and P(E') are both positive, then the
following equality follows directly from definition 3.5:
P(E|E') = P(E'|E)P(E) / P(E')
Notice that in Bayes theorem, the conditional probability is not represented in
terms of joint events as in classical conditional probability. This different treatment of
conditional probability leads to several philosophical differences between the
Bayesian and classical approaches. Section 3.3 compares and discusses these two
approaches towards probability in detail.
In our study, we use Bayes theorem as a diagnostic tool. A typical
diagnostic process consists of a hypothesis that has been postulated and some
evidence that can be used to verify the hypothesis. For a diagnostic process,
Bayes theorem as in theorem 3.3 can be written as
P(H|e) = P(e|H) P(H) / P(e)    (3.1)
P(H|e) represents the belief that we hold in a hypothesis H upon obtaining
evidence e. This belief can be calculated by multiplying our previous belief P(H)
by the likelihood P(e|H), that is, the probability that e will materialise if hypothesis H is true. P(H)
is sometimes called the prior probability and P(H|e) the posterior probability. The
denominator P(e) hardly enters into the calculation, because it is a normalising
constant. We will use this form of Bayes theorem in the rest of the discussion in
this thesis, unless we feel it necessary to return to the general form as in
theorem 3.3.
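A small numerical sketch of equation 3.1, computing P(e) via Lemma 3.1 over the exhaustive pair of hypotheses H and not-H; all the probability values are invented for illustration:

```python
def posterior(prior, likelihood, p_evidence):
    """Equation 3.1: P(H|e) = P(e|H) * P(H) / P(e)."""
    return likelihood * prior / p_evidence

p_h = 0.01                       # prior belief P(H) in the hypothesis
p_e_h, p_e_not_h = 0.9, 0.1      # likelihoods P(e|H) and P(e|not H)
p_e = p_e_h * p_h + p_e_not_h * (1 - p_h)   # P(e) by Lemma 3.1
belief = posterior(p_h, p_e_h, p_e)         # P(H|e), roughly 0.083
```

Observing the evidence raises the belief in H well above the prior, even though the hypothesis remains unlikely in absolute terms.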
3.3 Bayesian vs Classical Probability Theory
We compare the Bayesian and classical views of probability on two
important aspects, namely the meaning of probability and the
meaning of conditional independence. First, we discuss the different
meanings of probability according to the Bayesian and classical views
respectively.
The Bayesian approach views probability as a person's degree of belief in
an event x occurring, given the information available to that person. A probability
of 1 corresponds to belief in the absolute truth of a proposition, a probability
of 0 to belief in the proposition's negation, and the intervening values to
partial belief or knowledge.
Classical probability theory considers the probability of an event x as the
physical probability of the event x occurring. The probability values are acquired
through a number of repeated experiments. The larger the number of experiments
performed, the more accurate the value of the probability. Thus, the classical
approach relies on the existence of the experiments and is not willing to attach
any probability value to an event that is not a member of a repeatable sequence of
events. The Bayesian approach, on the other hand, considers a probability as a
person's degree of belief, so a belief can be assigned to unique events that are not
members of any repeatable sequence of events. For example, consider assigning
a probability to the belief that the Australian team will win the Ashes in 1997,
although the matches have not yet taken place. Although the Bayesian approach is
willing to assign a probability value to this event, the assignment of this
subjective probability should be considered carefully. It must be based on all the
information available to the individual who makes the prediction. This
information may include items that are known to be true, items deducible in a
logical sense, and empirical frequency information. For example, in predicting the
probability of the Australian team winning the Ashes, information about the
current form of all the Australian and England players, the Australian team's past
experience of playing in England, and the weather pattern in England
during summer may be used.
The second main difference between the Bayesian and classical
approaches is their treatment of conditional independence. We define the
conditional independence as follows:
Definition 3.7. Let (Ω,ℑ,P) be a probability space and H and e events in ℑ such
that one of the following is true:
1. P(H)=0 or P(e)=0
2. P(H|e)=P(H)
Then H is said to be independent of e.
Based on this definition, classical probability introduced the following theorem.
Theorem 3.4. Let (Ω,ℑ,P) be the probability space and H and e be arbitrary
events in ℑ. Then H and e are independent if and only if
P(H∩e) = P(H)P(e)
Proof. Let H and e be independent events in ℑ and P(H) > 0.
It follows from definition 3.5 that P(H∩e) = P(H|e)P(e).
From definition 3.7 we have P(H|e) = P(H) for H independent of e.
Combining the two, we have P(H∩e) = P(H)P(e).
Theorem 3.4 shows that the classical probability formalism checks
independence through the equality of the joint probability of the
events and the product of the probabilities of the individual events. The problem with this check
is that the result of the joint probability calculation does not provide
any psychological meaning to the user or developer of the knowledge-based system
about the dependency between the events. Humans cannot easily attach numerical
values to an event, but can easily determine whether two events are independent
by looking at the cause-effect relationship between the events involved. The
Bayesian approach, on the other hand, bases its conditional independence concept
on the human reasoning process.
The Bayesian approach sees the conditional relationship as more basic
than that of joint events. According to this approach, conditional probability
should reflect the organisation of human knowledge, which consists of sets of
evidence e that serve as pointers to a context or frame of knowledge H. In other
words, H|e stands for an event H in the context specified by e. Consequently,
empirical knowledge will invariably be encoded in
conditional statements, while belief in joint events, if ever needed, will be
computed from those statements via the product
P(H,e)=P(H|e)P(e) (3.2)
Therefore, the Bayesian approach states conditional independence in terms of
conditional probabilities, for example P(H|e), which specifies the belief in
hypothesis H under the assumption that evidence e is known with absolute
certainty. If P(H|e) = P(H), H and e are said to be independent.
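The two formulations can be contrasted in a few lines; the probabilities below come from the card-drawing example used earlier in the chapter:

```python
def independent_classical(p_joint, p_h, p_e, tol=1e-12):
    """Theorem 3.4: H and e are independent iff P(H ∩ e) = P(H) P(e)."""
    return abs(p_joint - p_h * p_e) <= tol

def independent_bayesian(p_h_given_e, p_h, tol=1e-12):
    """Definition 3.7: H is independent of e iff P(H|e) = P(H)."""
    return abs(p_h_given_e - p_h) <= tol

# Drawing a king and drawing a heart: P(king ∩ heart) = 1/52 and
# P(king|heart) = 1/13 = P(king), so the two tests agree.
independent_classical(1 / 52, 4 / 52, 13 / 52)   # True
independent_bayesian(1 / 13, 4 / 52)             # True
```

The Bayesian test asks the question in the form a human reasoner would: does learning e change the belief in H?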
Treating conditional independence using conditional probabilities rather
than joint probabilities not only mirrors the human reasoning process but also
provides the capability for knowledge based systems to use the recursive and
incremental updating of belief values. Consider the following situation. Let H
denote a hypothesis, en = e1,e2,…,en denote a sequence of data observed in the
past, and e denote a new fact. A brute-force way of calculating the belief in H
would be to add the new datum e to the past data en and perform a global
computation of the impact on H of the entire set en+1 = en,e. In other words, the
system needs to compute the joint probability of H, en and e. To calculate this
joint probability, the entire stream of past data needs to be stored and made
available for subsequent computation. In practice, this can be time and storage
consuming. Using Bayes theorem, to include the new datum e, we have
P(H|en,e) = P(e|en,H) P(H|en) / P(e|en)    (3.3)
In the above equation, P(H|en) is the old belief in the hypothesis H given the
past data en; thus P(H|en) can be considered a summary of past experience. An
update of the belief due to the new datum can then be calculated by multiplying
this past experience by
the likelihood function P(e|en,H). Thus, the calculation of the new belief for a
hypothesis given a new datum does not require the memory of the past data
values. It can always be performed as a recursive and incremental computation.
In this section we have presented the background of Bayes theorem and its
advantages over the traditional approach to probability. Bayes theorem gives us
a way to quantify the probability model of a situation by a method close to the
human reasoning process; however, this purely numerical representation lacks
psychological meaningfulness. The numerical
model can produce coherent probability measures for all propositional sentences,
but often leads to computations that a human reasoner would not use. As a result,
the process leading from the premises to the conclusions cannot be followed,
tested, or justified by the users, or even the designer of the reasoning system. An
extension of the numerical representation is needed to provide psychological
meaningfulness of the reasoning system. Such an extension of the numerical
representation of the Bayes theorem is provided by the Bayesian network.
3.4 The Bayesian Network as a Knowledge Base
We have mentioned in the previous section that a purely numerical representation
is inadequate in representing the human reasoning process. For that reason, many
researchers in AI consider probability theory to be epistemologically inadequate.
Due to this perceived inadequacy of the probabilistic approach to AI, some
researchers have looked into representing qualitative reasoning through a
symbolic reasoning approach. This includes non-monotonic logic [Reiter87],
fuzzy logic [Zadeh78], certainty factors [Shortlife75] and Dempster-Shafer belief
functions [Gordon85]. Further investigation of the probabilistic approach to AI,
however, shows that exploiting the conditional independence assumptions
implicit in the qualitative structure of expert knowledge provides a rich way of
representing knowledge within the probabilistic approach. This qualitative
structure of the expert knowledge can be represented by a graph. Using this
graph, we can capture and exploit the human ability to easily detect
dependencies between events or propositions without knowing precisely the
numerical estimates of their probabilities.
Consider the following situation. A person may be reluctant to estimate
the probability of a third world war breaking out at the end of the century, or of
winning the Lotto jackpot in the next draw. However, this person can nevertheless state
with ease whether these two events are dependent, that is, whether knowing the
truth of one event or proposition will alter the belief in the other. Evidently, the
notions of relevance and dependence between propositions are far more basic to
human reasoning than are the numerical values attached to the probability
judgements.
A knowledge based system that models the human expert reasoning
process therefore needs to use a language to represent probabilistic information
that allows assertions about dependency relationships to be expressed
qualitatively, directly and explicitly [Pearl88]. One way of providing qualitative
dependence relationships in the probability model is by the use of graph theory.
The nodes in graphs can be used to represent proposition variables, and the arcs
can be used to represent direct dependencies between them; the absence of an
arc encodes conditional independence.
There are several graph models used in AI, and they can be
classified into two main groups. The first group uses undirected graphs; falling
into this category are the Markov networks [Lauritzen88]. The second group uses
directed graphs in order to represent causal dependencies between propositions
explicitly. The Bayesian network falls into the second group. Pearl [Pearl88]
suggests that the directed graph is a closer representation of the human reasoning
process and a semantically richer model than the undirected graph. Because of
this richness in capturing diagnostic reasoning processes, we use the Bayesian
network model in our study.
3.4.1 Bayesian Network Structure
A Bayesian network is a directed acyclic graph (DAG) whereby a node represents
a proposition or an event and an arc represents a direct cause-effect dependency
between two propositions or events. Consider the following situation1: an office
worker called Mr Goody lives in the outer suburbs of Melbourne and works in
the central business district of Melbourne. His boss, Ms Habib, has noticed
recently that he comes to the office late most of the time. She does not like the
situation, but she wants to give Mr Goody another chance because he is a good
worker and it is only recently that he has often been coming late to the office.
One day, Mr Goody has a very important meeting with a client and he is late.
Ms Habib, currently doing Mr Goody's performance evaluation, needs to decide
whether to give him a good or bad evaluation. She needs to know whether
Mr Goody is late because of his carelessness or whether he is just an innocent
person caught in bad traffic. Ms Habib's decision process can be described by
figure 3-1.
1 The names of the characters in this example are copyright of the BBC programme "The Thin Blue Line".
While she is waiting for Mr Goody, Ms Habib listens to the radio and
learns that there has been an accident on the freeway Mr Goody takes to work
every day. Since the freeway is undergoing repairs and only one lane is open,
the traffic is almost at a standstill. With the arrival of this new knowledge,
Ms Habib concludes that Mr Goody is caught in bad traffic and is therefore
innocent; her belief that Mr Goody is late because of carelessly sleeping in has
decreased. In this situation we can say that the two events, the traffic being
heavy and Mr Goody sleeping in, are dependent given the new evidence that
Mr Goody is late.
[Figure: nodes A (traffic is heavy) and B (Mr Goody sleeps in) each point to C (Mr Goody is late), with Pr(A)=0.01, Pr(B)=0.001, Pr(C|A,B)=0.99, Pr(C|A,¬B)=0.9, Pr(C|¬A,B)=0.5, Pr(C|¬A,¬B)=0.01.]
Figure 3-1 An example of a Bayesian network.
The situation depicted through the Bayesian network model in the
previous paragraph shows how a directed graph can be used to explain the
qualitative part of a decision process: it clearly shows the dependency or
cause-effect relations between events or propositions. If the same situation were
described using only the probability distribution, the dependency between the
events Mr Goody sleeps in and traffic is heavy would have to be checked
through numerous probability computations, which are not only time consuming
but also sometimes difficult to interpret. The example clearly shows that the
richness the directed graph gives the Bayesian network makes it a powerful tool
for building knowledge-based systems.
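The "explaining away" effect Ms Habib relies on can be verified numerically from the probabilities in figure 3-1 (a sketch by enumeration over the joint distribution; not part of the thesis):

```python
# Numerical check of the "explaining away" pattern, using the probabilities of
# figure 3-1: A = traffic is heavy, B = Mr Goody sleeps in, C = Mr Goody is late.
p_a, p_b = 0.01, 0.001
p_c = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.5, (False, False): 0.01}    # P(C=true | A, B)

def joint(a, b, c):
    pr = (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)
    return pr * (p_c[(a, b)] if c else 1 - p_c[(a, b)])

# Belief that Mr Goody slept in, given only that he is late:
num = sum(joint(a, True, True) for a in (True, False))
den = sum(joint(a, b, True) for a in (True, False) for b in (True, False))
p_b_given_c = num / den                              # roughly 0.026

# ... and after also learning that the traffic is heavy:
p_b_given_c_a = joint(True, True, True) / sum(joint(True, b, True)
                                              for b in (True, False))
# much smaller: the accident explains the lateness, lowering belief in B
```

Learning that the traffic is heavy sharply reduces the belief that Mr Goody slept in, just as the narrative describes.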
3.4.2 Conditional Independence
The example and discussion in section 3.4.1 have shown that the semantics of the
Bayesian network demand a clear correspondence between the topology of a
DAG and the dependence relationships portrayed by it. Conditional
independence of events in a Bayesian network can be determined by checking
d-separation.
Definition 3.8 If X, Y and Z are three disjoint subsets of nodes in a DAG D, then Z is
said to d-separate X from Y, denoted <X|Z|Y>D, if there is no path between a
node in X and a node in Y along which the following conditions hold:
1. Every node with converging arrows is in Z or has a descendant in Z.
2. Every other node is outside Z.
Any path satisfying the above conditions is said to be active, and blocked
otherwise. Consider the diagnostic procedure for a metastatic cancer patient
illustrated by figure 3-2. A patient diagnosed with metastatic cancer may show
two different consequences, namely increased total serum calcium or a brain
tumor. Either of these may cause the patient to fall into a coma, and a coma
lasting a long period of time will damage the brain cells.
[Figure: node A (metastatic cancer) points to B (increased total serum calcium) and C (brain tumor); B and C both point to D (coma); D points to E (brain damage).]
Figure 3-2 Bayesian network model of metastatic cancer diagnosis.
The dependency between the events in this diagnostic process can be checked
using d-separation (definition 3.8). Consider the different situations below:
1. Let X=C, Y=B and Z=A. Z d-separates (blocks) X from Y
because along the path C-A-B there are no converging arrows at A, A is
in Z, and the other nodes in the network (D, E) are outside Z. Since
node A separates node C from node B, once the belief value in A is
known, the belief in node C no longer contributes to the belief value in
node B. If a patient has been diagnosed as having metastatic cancer, the
belief that the patient has increased total serum calcium will neither
increase nor decrease the belief that the patient suffers from a brain
tumor, because the metastatic cancer already accounts for both findings.
We say that knowledge of the existence of metastatic cancer in a patient
makes the events of increased total serum calcium and brain tumor
independent.
2. The situation is different if we take X=C, Y=B and
Z=D. The arrows C→D and B→D converge at node D, which is in Z, so
D does not d-separate nodes C and B. If a patient has been found in a
coma, then using the same Bayesian network the doctors can consider
that the patient suffers from increased total serum calcium or a brain
tumor. Suppose a further medical test carried out by the doctors finds
that the patient has a brain tumor. This new finding decreases the belief
that the patient has increased total serum calcium. Thus, the knowledge
that a coma has occurred makes the beliefs in the two events, increased
total serum calcium and brain tumor, dependent: a change in one of
these events changes the belief value in the other.
3. Let the same values be assigned to X and Y. If Z=E, then Z
does not d-separate nodes X and Y, because node E is a descendant of
node D, at which the arrows converge; the path C-D-B therefore
satisfies the conditions of definition 3.8 and is active. Returning to the
diagnostic example, suppose a patient is found to be suffering from
brain damage. The Bayesian network representation of the problem
shows that a patient can suffer brain damage after falling into a coma
over a long period of time. Once it is known that the patient has
suffered brain damage, this new evidence can be explained as the result
of being in a coma, and, as in the second situation above, the coma
makes the events of increased total serum calcium and brain tumor
dependent.
The d-separation tests performed above show that the separation
criteria follow the basic pattern of diagnostic reasoning: the two inputs of a
logic gate are presumed independent, but once the output becomes known,
learning one input has a bearing on the other. The d-separation test is the formal
way of determining the independence assumptions between propositions in the
network. Heuristically, the independence assumptions can also be derived by
looking at the topology of the network. There are three basic topologies that
give rise to different independence assumptions; they are depicted in figure 3-3.
[Figure: (a) head-to-head: a → c ← b; (b) head-to-tail: a → c → b; (c) tail-to-tail: a ← c → b.]
Figure 3-3 Different topologies for independence assumptions.
In figure 3-3a, the arrows a→c and b→c meet head-to-head at node c. In
this type of topology, where two arrows meet at a node, the root nodes a and b
are independent a priori: instantiating one of them tells us nothing about the
other. On the other hand, the revelation of the value of the proposition at node c
causes nodes a and b to become dependent.
The second topology, depicted in figure 3-3b, contains two arrows a→c
and c→b meeting head-to-tail at node c. In this topology, the instantiation of the
root node a affects both node c and node b, because the proposition at node c is
the cause of the proposition at node b. A similar situation occurs when we
instantiate the leaf node b: its effects propagate all the way back to node a.
However, instantiating node c renders nodes a and b independent; once node c
is instantiated, it blocks any reasoning from node a to node b and vice versa.
The third topology, depicted in figure 3-3c, contains two arrows c→a
and c→b which meet tail-to-tail at node c. Only the instantiation of node c
makes nodes a and b independent; as long as c is uninstantiated, evidence about
either of the nodes affects belief in the other.
The heuristic check explained above provides an easier way to determine
the conditional independence of propositions in the network, since it can be
carried out by simple visual inspection.
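Definition 3.8 can also be tested mechanically. The sketch below (not from the thesis) uses the moralised-ancestral-graph formulation, which is equivalent to d-separation, applied to the network of figure 3-2 with the node names A to E from the figure:

```python
# d-separation via the moralised ancestral graph: X and Y are d-separated by Z
# iff, in the moral graph of the ancestral set of X, Y and Z, every path from
# X to Y passes through Z.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["D"]}

def ancestors(nodes):
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents[n])
    return seen

def d_separated(x, y, z):
    keep = ancestors({x, y} | set(z))
    edges = set()
    for n in keep:                        # moralise: link parents to child and
        ps = [p for p in parents[n] if p in keep]   # "marry" co-parents
        edges |= {frozenset((n, p)) for p in ps}
        edges |= {frozenset((p, q)) for p in ps for q in ps if p != q}
    frontier, seen = [x], {x} | set(z)    # search for y, never expanding Z
    while frontier:
        n = frontier.pop()
        if n == y:
            return False                  # an active path exists
        for e in edges:
            if n in e:
                (m,) = e - {n}
                if m not in seen:
                    seen.add(m)
                    frontier.append(m)
    return True

# The three situations of section 3.4.2:
print(d_separated("C", "B", ["A"]))       # True:  A blocks the path C-A-B
print(d_separated("C", "B", ["D"]))       # False: conditioning on the collider D
print(d_separated("C", "B", ["E"]))       # False: E is a descendant of D
```

The three calls reproduce the three situations worked through above.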
Diagnostic reasoning involves not only building a model of events and
their dependencies but also observing how the beliefs in those events change
when new evidence or knowledge arrives. This observation and adjustment
process is known as inference. Since this study uses a probabilistic model to
represent knowledge, the discussion in section 3.5 concentrates on probabilistic
inference.
3.5 Probabilistic Inference In Bayesian Networks
The basic task of any probabilistic inference system can be regarded as
computing the posterior probability distribution for a set of query variables,
given exact values for some evidence variables; in other words, computing
P(Query|Evidence). We use the following notation in the discussion of the
inference algorithm.
Upper case letters such as A, B, C, D, ..., X, Y, Z represent variables.
Lower case letters such as a, b, c, d, ..., x, y, z represent the possible
values of the corresponding variables.
+ represents the affirmation of a proposition.
¬ represents the denial of a proposition.
E represents a set of evidence variables.
e represents a set of evidence variables whose values are known or
instantiated.
BEL(x) represents the overall belief accorded to the proposition X=x by
all evidence received so far, BEL(x) = P(x|e), where P(x|e) is the
probability that x is true given the evidence e.
α represents a normalising constant that scales a vector of beliefs so that
its components sum to 1; for example, α[2,2,1]=[0.4,0.4,0.2].
My|x represents a fixed conditional probability matrix which quantifies the
link X→Y.
In this study we deal only with discrete variables, so BEL(x) can be regarded as
a vector whose components correspond to the different values of X. For
example, if the domain of X is {High, Short}, BEL(x) can be written as
BEL(x)=(BEL(X=High),BEL(X=Short))=[0.4,0.6]
3.5.1 Pearl’s Inference Algorithm
Pearl’s algorithm for probabilistic inference works on the directed graph
approach to the Bayesian network, which we adopt for its semantic richness in
representing knowledge. The main idea behind the algorithm is to create
two-way communication between nodes, with each direction of communication
carrying a different type of message. The nodes in the network are ranked by
parent-child association: a direct link between two nodes constitutes a
parent-child link, and the direction of the arrow determines the rank of a node.
The node from which the arrow emanates is the parent, whereas the node at
which the arrow arrives is the child. Consider the network in figure 3-4.
[Figure: a chain A → B → C; causal support messages πA and πB flow from parent to child, and evidential support messages λB and λC flow from child to parent.]
Figure 3-4 Inference in a Bayesian network.
Node B is a parent in relation to node C, but a child in relation to node A. The
messages passed in the network travel through two different channels: the
parent-to-child channel and the child-to-parent channel. Messages passing
through the parent-to-child channel give the inference process causal support
(π) messages; in figure 3-4, πA and πB are the causal support messages, each
passed to the direct children of the node where the message originates. The
child-to-parent channel, on the other hand, is used to pass evidential support (λ)
messages, each passed to the direct parents of the originating node. In figure
3-4, the evidential support messages are represented by λB and λC.
At every node in the network, the values of π and λ are used to update the
belief value of the node through the following formula:
BEL(x) = α λ(x) π(x)
where
x is a value of the proposition variable X,
α is the normalising constant,
π is the causal support value, and
λ is the evidential support value.
By exploiting the conditional independence of the nodes in the network
and using the Bayesian recursive update, the calculation of belief in the
Bayesian network can be performed locally at each node without losing the
global effect of the new evidence. Moreover, parallel computation becomes
permissible once the node dependencies have been determined and independent
sets of nodes have been found. This local and parallel computation gives
Pearl’s algorithm the ability to perform inference efficiently.
To propagate the causal support (π) and evidential support (λ) along a
link X→Y, a link matrix My|x consisting of the conditional probabilities of the
variables in Y given the variables in X is used. To illustrate the inference
process, consider the following situation (a modified version of Pearl’s example
[Pearl88], p. 151):
In a murder trial there are two suspects, one of whom definitely
committed the murder. A gun has been found with some fingerprints on it.
Let A identify the last user of the gun, namely the killer; let B identify the
last person to hold the gun; and let C represent the fingerprint finding
obtained from the laboratory. The probability distributions given below are
held for the situation. After gathering some evidence during the
investigation, the police believe that suspect 1 is the killer with probability
0.8, i.e. BEL(a=1)=0.8, and suspect 2 with probability 0.2, i.e.
BEL(a=2)=0.2.
The above example can be represented as a simple Bayesian network, shown in
figure 3-5.
[Figure: a chain A → B → C with link matrices Mb|a and Mc|b; initially π(a)=[0.8,0.2], λ(a)=[1,1], BEL(a)=[0.8,0.2]; π(b)=[0.68,0.32], λ(b)=[1,1], BEL(b)=[0.68,0.32]; the observation is C=c.]
Figure 3-5 The use of the link matrix in the inference.
The conditional probabilities quantifying the links are
Pr(b|a) = 0.8 if a=b and 0.2 if a≠b, for a,b = 1,2
Pr(c|b) = 1 for b = 1,2
The detailed message passing scheme for this example is as follows.
Prior to the inspection of the fingerprints, all λ messages are unit vectors. The
link matrix Mb|a is used to calculate π(b) and, in turn, BEL(b):
π(b) = π(a)·Mb|a = [0.8, 0.2]·[0.8 0.2; 0.2 0.8] = [0.68, 0.32]
BEL(b) = α π(b)λ(b) = α[0.68×1, 0.32×1] = [0.68, 0.32]
Now assume the laboratory report arrives, summarised as the evidential support
λ(c) = [0.8, 0.6], which gives λ(b) = [0.8, 0.6]. The belief about the last person
to hold the gun changes to
BEL(b) = α[0.68×0.8, 0.32×0.6] = α[0.544, 0.192] = [0.739, 0.261]
The updated evidential message passed from B to A is
λ(a) = Mb|a·λ(b) = [0.8×0.8 + 0.2×0.6, 0.2×0.8 + 0.8×0.6] = [0.76, 0.64]
In turn, the belief that suspect 1 is the killer changes from 0.8 to
BEL(a) = α π(a)λ(a) = α[0.8×0.76, 0.2×0.64] = α[0.608, 0.128] = [0.826, 0.174]
Thus the new belief about the last person to hold the gun, resulting from the
laboratory’s fingerprint report, increases the belief that suspect 1 is guilty from
0.8 to 0.826.
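The chain computation above can be sketched in code (a minimal illustration assuming the standard π/λ message definitions; not a general implementation):

```python
# pi/lambda propagation on the chain A -> B of the murder example.
M_b_a = [[0.8, 0.2],                     # row a, column b: Pr(b | a)
         [0.2, 0.8]]

def normalise(v):
    s = sum(v)
    return [x / s for x in v]

def pi_message(pi_parent, M):
    # causal support: pi(b) = sum_a pi(a) Pr(b | a)
    return [sum(pi_parent[a] * M[a][b] for a in range(2)) for b in range(2)]

def lambda_message(lam_child, M):
    # evidential support: lambda(a) = sum_b Pr(b | a) lambda(b)
    return [sum(M[a][b] * lam_child[b] for b in range(2)) for a in range(2)]

pi_a = [0.8, 0.2]
pi_b = pi_message(pi_a, M_b_a)                       # [0.68, 0.32]

lam_b = [0.8, 0.6]                                   # lab report evidence on B
bel_b = normalise([p * l for p, l in zip(pi_b, lam_b)])

lam_a = lambda_message(lam_b, M_b_a)                 # [0.76, 0.64]
bel_a = normalise([p * l for p, l in zip(pi_a, lam_a)])
# bel_a is approximately [0.826, 0.174]
```

The same two message functions suffice for any chain, applied link by link.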
We have shown the inference process in a Bayesian network in the shape
of a chain. The same inference algorithm can be used in networks of other
shapes, including trees and singly connected networks. All of these shapes have
one thing in common: they contain no cycles. The inference process for a
Bayesian network that contains a cycle differs from the process presented
above.
3.5.2 Handling Loops in the Network
In a network containing a loop, the propagation or inference process faces a
problem in reaching a stable equilibrium state: the message passing scheme
discussed in the previous section causes the inference process to run
indefinitely. The loop is not necessarily obvious; it may be constructed
implicitly through the d-separation rule. Consider our previous example in
figure 3-2, the metastatic cancer diagnosis. There is a loop in the network
through the links metastatic cancer - increased total serum calcium - coma and
metastatic cancer - brain tumor - coma. According to the d-separation rule, the
instantiation of the node coma makes the events increased total serum calcium
and brain tumor dependent, so new knowledge about coma changes the beliefs
in both. In turn, these two new beliefs change the belief in metastatic cancer.
Since metastatic cancer is the cause of increased total serum calcium and brain
tumor, the new belief in metastatic cancer changes the beliefs in these two
events again, which at last changes the belief that the patient will fall into a
coma. Thus, we are back to square one.
There are several methods that can be used to solve the problem of a
loop in a Bayesian network: clustering, conditioning and stochastic simulation.
Clustering involves forming compound variables in such a way that the
resulting network of clusters is singly connected. Conditioning involves
breaking the communication pathways along the loops by instantiating a
selected group of variables. Stochastic simulation involves assigning each
variable a definite value and having each processor inspect the current state of
its neighbours, compute the belief distribution of its variable, and select one
value at random from the computed distribution.
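The stochastic simulation scheme just described can be sketched as Gibbs-style sampling over the loop network of figure 3-2, restricted to A, B, C and D (the conditional probability values below are invented for illustration, since none are given in the text):

```python
import random
random.seed(0)

# Assumed illustrative CPTs for A -> B, A -> C, B -> D, C -> D.
p_a = 0.2
p_b = {True: 0.8, False: 0.2}            # P(B=true | A)
p_c = {True: 0.2, False: 0.05}           # P(C=true | A)
p_d = {(True, True): 0.8, (True, False): 0.8,
       (False, True): 0.8, (False, False): 0.05}   # P(D=true | B, C)

def joint(a, b, c, d):
    pr = (p_a if a else 1 - p_a)
    pr *= p_b[a] if b else 1 - p_b[a]
    pr *= p_c[a] if c else 1 - p_c[a]
    return pr * (p_d[(b, c)] if d else 1 - p_d[(b, c)])

def gibbs(sweeps, d=True):
    state = {"a": True, "b": True, "c": True}   # D is clamped as evidence
    count_a = 0
    for _ in range(sweeps):
        for var in ("a", "b", "c"):      # resample each free variable in turn,
            state[var] = True            # from its distribution given the rest
            p_t = joint(state["a"], state["b"], state["c"], d)
            state[var] = False
            p_f = joint(state["a"], state["b"], state["c"], d)
            state[var] = random.random() < p_t / (p_t + p_f)
        count_a += state["a"]
    return count_a / sweeps

estimate = gibbs(20000)                  # Monte Carlo estimate of P(A | D=true)
```

With the CPTs assumed above, the exact posterior P(A|D=true) is 0.425, and the estimate converges to it as the number of sweeps grows.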
Of the three approaches to handling a loop in a Bayesian network,
stochastic simulation gives the best estimate of the posterior probability.
However, it suffers from computational complexity: its accuracy is related to the
number of simulation runs, and Pearl [Pearl88] suggests that achieving 1%
accuracy requires on the order of 100 runs. In this study we introduce a new
method for handling cycles in a Bayesian network, namely the intelligent node.
We discuss and compare the different methods of handling loops in chapter 5.
3.6 Summary
In this chapter we have discussed the theory behind the Bayesian network.
A Bayesian network combines Bayes theorem and graph theory: Bayes theorem
provides the numerical representation of the model, whereas graph theory
provides the semantic representation. The belief distribution in a Bayesian
network changes when new knowledge arrives at the network, and the changes
are computed through the inference process. One inference algorithm for a
directed Bayesian network is Pearl’s algorithm, which is based on message
passing between parent and child nodes. This algorithm works for chains, trees
and singly connected networks, but not for a network that contains a cycle or
loop. To handle a cycle or loop in the network, clustering, conditioning or
stochastic simulation can be used; stochastic simulation provides the best
accuracy at the expense of computational complexity.
In chapter 4, we will introduce a Bayesian network model for an
information retrieval system. The basic model will be presented together with an
appropriate inference algorithm for the model.
Chapter 4
A Semantically Correct Bayesian Network Model for Information
Retrieval
4.1 Introduction
Probability theory has long been recognised as a useful tool for dealing with
uncertainty. The information retrieval process, as explained in Chapter 2, involves
some uncertainty. This suggests that an appropriate approach to information
retrieval is one which uses probability theory as its basic framework. Indeed,
researchers in the area of information retrieval have been investigating the
probabilistic model since the 1960s [Maron60]. This was the first information
retrieval model with a firm theoretical foundation for handling uncertainty.
Despite the apparent attractiveness of a model with such a basis, its acceptance
has not been universal, largely because the estimation of the probability
parameters in the model is perceived to be somewhat unsatisfactory.
The estimation of probability parameters in the traditional probabilistic
models [Maron60, Robertson76, Rijsbergen79] requires us to look at the
frequency of term occurrences in the sets of relevant and non-relevant
documents. To obtain these sets, the models usually rely on the relevance
feedback process. This is a process whereby the system presents the top-ranked
documents
to the user for a judgement as to whether they are relevant or not. In such a
model, before any relevance feedback data is available, it is very difficult for the
system to determine the relevance status of a document for an ad hoc query in
order to produce an initial document ranking. Existing probabilistic models
usually circumvent this parameter estimation problem by producing an initial
document ranking based on ad hoc estimates of the model parameters, or by
using an alternative retrieval model, for example using the number of index
terms in common with the query to produce the initial ranking. These models
then use the probabilistic formulae to calculate a revised document ranking once
data that can determine the relevant and non-relevant sets becomes available.
Based on this relevance data, it is then possible to estimate the parameters of
the probabilistic model by computing the proportion of times each term occurs
in the documents that have been judged relevant and non-relevant.
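The estimation step just described can be sketched as follows (an illustrative sketch only; the 0.5 adjustment is a common convention for avoiding zero probabilities and is not prescribed by the text):

```python
# Estimate, for a query term, the proportion of judged-relevant and
# judged-non-relevant documents that contain it (Robertson/Sparck-Jones style).

def estimate_term_probabilities(judged, term):
    """judged: list of (set_of_index_terms, is_relevant) relevance judgements."""
    rel = [terms for terms, is_rel in judged if is_rel]
    nonrel = [terms for terms, is_rel in judged if not is_rel]
    p = (sum(term in d for d in rel) + 0.5) / (len(rel) + 1)       # P(term | relevant)
    q = (sum(term in d for d in nonrel) + 0.5) / (len(nonrel) + 1) # P(term | non-relevant)
    return p, q

judged = [({"bayes", "network"}, True),
          ({"bayes", "graph"}, True),
          ({"retrieval"}, False)]
p, q = estimate_term_probabilities(judged, "bayes")   # p > q: "bayes" indicates relevance
```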
There have been some attempts to formulate methods which estimate the
probability parameters of the probabilistic model using either no prior
knowledge of relevance data [Croft79] or partial relevance information
[Sparck-Jones79]. However, these models are still based on some degree of ad
hoc estimation. Thus, one weakness of the traditional probabilistic models, such
as those of Maron and Kuhns [Maron60], Robertson and Sparck-Jones
[Robertson76], Fuhr [Fuhr89] and Rijsbergen [Rijsbergen79], is that they use
two different computation
methods: one to produce the initial ranking and another to handle relevance
feedback. We will also refer to the traditional probabilistic models as non-
inference probabilistic models.
An additional weakness of existing probabilistic models concerns their
inability to learn from past queries in order to determine the prior distribution
for the model parameters. This means that the parameters apply only to the
current query; as a result, a potentially large database of relevance judgements
from past queries is wasted, even though such data is considered useful
[Fuhr92].
In this chapter we introduce a new model for information retrieval based
on probabilistic inference. Probabilistic inference, as explained in chapter 3, is
the mechanism that can be used to revise beliefs as new evidence arrives at the
body of belief. This approach to probabilistic information retrieval overcomes
the weaknesses of the traditional probabilistic model in the following ways:
• The probabilistic inference approach retains the sound theoretical basis
of the traditional probabilistic model, but also incorporates, within a
single framework, methods for producing the initial ranking of
documents and for handling relevance feedback. The initial ranking of
documents is produced using the prior probability distribution, and
relevance feedback data is used as new evidence to update the prior
probabilities.
• Relevance feedback fits naturally into the model. The probabilistic
inference approach provides an automatic mechanism for learning. By
modeling prior distributions on the model parameters, we can
coherently update the prior distributions as more feedback data
becomes available.
• The probabilistic inference approach allows us to incorporate relevance
information from other queries into the model by using its ability to
incorporate learning into the model. This and the natural application of
relevance feedback means that the probabilistic inference model
provides a better learning framework than do the traditional
probabilistic models.
The difference between the existing non-inference probabilistic models and
the probabilistic inference approach arises due to their differing treatment of the
meaning of probability. In the former, the probability is considered from a
frequency point of view. The probability values in this model are obtained simply
by counting the number of documents containing a particular descriptor or index
term. On the other hand, the probabilistic inference models interpret the
probability as the degree of belief in an event or proposition. This is known as an
epistemological view of probability, a view which considers the assignment of
beliefs in propositions, quite devoid of any statistical background. The
epistemological view provides the information retrieval model with the ability to
capture the model’s semantics. Although these two views of probability are
considered to be contradictory, the statistical notion of probability may be used as
a way to measure chance at the implementation level [Rijsbergen79]. A clear
explanation of probability at the conceptual level to capture the semantics of the
document collection is required and has been lacking from the traditional
probabilistic models [Rijsbergen92]. A major objective of our model therefore is
that of providing a conceptual model for information retrieval.
Probabilistic inference is also superior to the traditional probabilistic
models in providing the information retrieval mechanism with effective means by
which important semantic information can be incorporated into the retrieval
process, thus circumventing some statistical problems inherent in the traditional
probabilistic model.
We will adopt one specific approach to probabilistic inference, namely that
of Bayesian networks. A Bayesian network is a model for probabilistic inference
which uses a combination of probability and graph theory. The inclusion of graph
theory into the model provides the Bayesian network with additional
characteristics lacking in retrieval models based on non-graphical approaches
to probabilistic inference. These additional characteristics are:
• The documents in the collection may be represented as complex
objects with multilevel representations, not merely as collections of
index terms. The document may be considered as a collection of either
index terms, sentences or phrases. The level of representation of the
document is implementation dependent.
• Dependencies between documents are built implicitly into the model by
using the independence assumption of the Bayesian network. We will
explore this in detail in section 4.2.2. Citation or nearest-neighbour
links can also be easily incorporated into the model because of its
graphical nature; such links have been shown to improve the
performance of information retrieval systems [Turtle90].
• Synonyms and a thesaurus can be easily implemented as part of the
network. Any index terms that are synonyms can be linked, so that the
system can use all those synonyms during retrieval. The index terms
that belong to the same concept may be linked into a concept node.
The collection of the concept nodes in the network forms a thesaurus
in the system. The addition of a thesaurus has been shown to increase
the performance of a retrieval system [Sparck-Jones71].
We will present in detail the characteristics of the model and how the
model addresses problems of the traditional probabilistic models in section 4.2.
This section also examines the concept of prior probability which is one of the
major issues in probabilistic information retrieval models. Section 4.3 includes the
discussion of the inference process in the network which occurs once new
information or evidence has arrived. Since a Bayesian network is a directed acyclic
graph, the determination of the correct causal direction of the inference is
considered an important issue. Different causal directions in the model lead to
different inference results. We will compare different causal direction models and
discuss their application to information retrieval in section 4.4. To show that our
model may be used as a conceptual model, in section 4.5 we compare our model
with existing information retrieval models such as the Boolean, vector space and
traditional probabilistic models.
4.2 The Bayesian Network Model
In the probabilistic inference model, we assume that there exists an ideal concept
space U called the universe of discourse or the domain reference. The elements in
this space are considered to be elementary concepts. A proposition is a subset of
U. The correspondence between propositions and subsets of elementary concepts
can translate the logical notions of conjunction, disjunction, negation, and
implication into the more familiar set-theoretic notions of intersection, union, and
complementation, respectively. We will use the set-theoretic and logical operations
interchangeably during our discussion in this chapter.
4.2.1 Probability Space
In information retrieval, the ideal concept space can be interpreted as the
knowledge space, in which documents, index terms, and user queries are all
represented as propositions or subsets.
A probability function P is defined on the concept space U. The probability
P(d) is interpreted as the degree to which U is covered by the area occupied by
the knowledge contained in document d. Similarly, P(q∩ d) represents the degree
to which U is covered by the knowledge common to both query q and document d.
We assume that all documents in the collection have been indexed and are
represented by a set of index terms and that the concept space U includes all the
index terms in the collection.
Definition 4.1 Let n be the number of index terms in the system and let ti be an
index term. U = {t1, t2, t3, …, tn} is the set of all the index terms and defines
the sample space. Let u ⊂ U be a subset of U.
The concept u may represent a document or a query in the information retrieval
model. In the Bayesian network, set relationships are specified using random
variables as follows.
Definition 4.2 Each index term ti is associated with a random binary variable ki
as follows:
ki=1 ⇔ ti ∈ u
ki=0 ⇔ ti ∉ u
We also define a function g: U → R such that g(ti) represents the term
weight of the index term ti.
The term weights g(ti) will be implemented in our model as the weights of the
links in the network. Now we define the document and query concepts in the
sample space U.
Definition 4.3 A document is represented as a concept d = (k1, k2, k3, …, kn) where ki
is a binary random variable and ki=1 if ti ∈ d and ki=0 otherwise. Similarly,
the user query can be represented as a concept q = (k1, k2, k3, …, km) where
ki=1 if ti ∈ q and ki=0 otherwise.
In a Bayesian network, the concepts ti, concept d and concept q are represented as
nodes. The relations g(ti) are represented by the links in the network. Then, given
that documents and queries are represented as concepts in the space U, we can
apply a basic model of information retrieval as a concept matching system
[Kwok90].
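As an illustration of definitions 4.1 to 4.3, the sketch below builds the binary vectors d and q over a shared universe of index terms. It is an illustrative reconstruction, not code from the thesis; the four-term vocabulary and the documents are invented.

```python
# Sketch of Definitions 4.1-4.3: documents and queries as binary
# random-variable vectors over a shared index-term universe U.
# The vocabulary below is invented for illustration.
U = ["automatic", "information", "retrieval", "image"]  # sample space

def concept(terms, universe=U):
    """Return the binary vector (k1,...,kn): ki = 1 iff ti is in the concept."""
    term_set = set(terms)
    return [1 if t in term_set else 0 for t in universe]

doc1 = concept(["automatic", "information", "retrieval"])
query = concept(["information", "image", "retrieval"])
print(doc1)   # [1, 1, 1, 0]
print(query)  # [0, 1, 1, 1]
```

Concept matching then reduces to comparing these vectors position by position over the common universe U.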
To represent the knowledge contained in document d and the knowledge
represented by query q, a Bayesian network model uses two separate networks.
One network, the document network represents the documents in the collection.
The other network, the query network represents the user query [Ghazfan95].
The two networks are combined when a retrieval process is performed. The
network model is depicted by figure 4-1.
[Figure omitted: a query node linked to the index terms automatic, information,
retrieval, and image, which in turn link to the document nodes doc-1, doc-2,
and doc-3.]
Figure 4-1 Bayesian network model for information retrieval systems.
The document and query networks are similar except that the document
network is established at the creation of the database collection and remains the
same unless new documents are added to or obsolete documents are deleted from
the collection. The query network, on the other hand, exists only for the duration
of the user’s query and is dynamic: it changes from query to query, whereas the
document network remains the same across queries.
The output of retrieval is a ranking of the collection's documents,
obtained by calculating P(d|q), that is, the probability at each document node di
once the value of the query q is known. This inference network approach to
information retrieval was first introduced by Turtle and Croft [Turtle91]. In their
model, however, they use the inference P(q|d) which produces a different
document ranking output than does our model. We will show in section 4.4 that
our model provides a more general and appropriate framework for modeling
information retrieval than the model of Turtle and Croft.
4.2.2 The Document Network
A document in a Bayesian network is represented as a complex object with
several levels of representation. The smallest possible document network consists
of two layers of nodes. The top layer represents the system's dictionary or the
concept universe U. The next layer down represents arbitrary objects that can be
constructed by the combination of index term ti in U. These objects are the
concept objects in U and may represent arbitrary sets or propositions. In
information retrieval implementations, these objects include phrases, subject
classifications, sentences and documents. In the document network used for
example in figure 4-1, the concept universe U consists of the index terms as the
elementary concepts and documents as higher level concepts that are built from
the elementary concepts. The index terms automatic, information, retrieval, and
image make up the concept universe U (the top layer of the nodes) and the
document objects doc-1, doc-2 and doc-3 make up the bottom layer. The number
of layers in the network is not limited to two.
The number of layers in the network depends on the level of abstraction to
be modeled. For example, if we do not want to implement classification of the
index terms into subjects, we only need a two-layer network, one layer to
represent the dictionary layer and the other to represent the document layer as in
figure 4-1. However, if we need to model an information retrieval system that
represents subject classification explicitly, we need to introduce another layer
between the document and the dictionary layer. Such a situation is depicted in
figure 4-2. We can see that the index terms image and information are combined
into a node multimedia in the middle layer. The network in this situation consists
of three layers. Regardless of the number of layers required for the implementation
of the model, the top most layer is always the dictionary layer and the bottom
most layer is always the document layer. The index terms in the dictionary layer
are always required because they are the common elements shared by documents
and queries and are used during the matching process. The documents themselves
are also required as they are the objects to be presented to the user as the result of
the matching process.
[Figure omitted: the index terms automatic, information, retrieval, and image in
the top layer; a middle-layer subject node multimedia; the document nodes doc-1,
doc-2, and doc-3 in the bottom layer.]
Figure 4-2 Document network with a subject classification layer.
The link arrows in the network signify the causal relationships between the
layers. In figure 4-1, the arrows emanating from the index terms automatic,
information, and retrieval to the doc-1 node signify that these three
index terms are the cause for document 1 to exist in the collection. In other
words, the documents may be considered as the subsets of U formed by the index
terms ti in U for which g(ti) > 0.
4.2.3 The Query Network
A user’s information need is represented in a Bayesian network model by the
query network. Compared to the document network, the query network can be
considered as an upside-down network. The root node of this network represents
an abstraction of the user’s information need. The nodes in the layers underneath
the root node further explain the abstract concept of the information need. This
lower layer consists of the index terms tj which are part of the universe U. The
number of layers in the query network is always two.
We note here that as in any model in information retrieval, the index terms
used in the query come from the system’s vocabulary of index terms. Without this
limiting assumption, the task of matching between document and query
representation becomes almost impossible.
The user’s information needs are represented explicitly in our model in
order to allow the user to further refine their query by assigning weights to the
index terms used to explain the information need. Consider the example shown in
figure 4-3.
[Figure omitted: a query node linked to the index terms information, image, and
retrieval.]
Figure 4-3 An example of a query network.
In this example, the user’s information needs are expressed by the index terms
information, image and retrieval. If the user is certain about the relative
importance of these index terms, they can quantify their degree of belief in the
individual index terms for meeting their information need; that is, they can assign
weights to the links query→index term. For example, they may decide to give
more weight to the term image than information, if they prefer image retrieval
rather than information retrieval articles but do not wish to eliminate the
possibility that some of the articles concerned with information retrieval may
discuss image retrieval.
4.2.4 Prior Probability
Before any inference process can be performed, any inference-based probability
model requires that prior probabilities be defined. In a Bayesian network, the
conditional probabilities for each node given the belief values of its parent nodes,
and the prior probabilities for all the root nodes, need to be
defined. Therefore in our model for information retrieval, we need to define
probabilities for the following:
• In the document network:
• P(d | t1, …, tn) - conditional probabilities for a document node
given the values of the index terms that construct the
document. These conditional probabilities act as the weights of
the links between the document and index term nodes.
• P(ti) – probability that the index term ti will be found relevant
to the user’s information need.
• In the query network:
• P(ti|Q=true) - conditional probabilities for an index term ti
given the probability of the query.
Any interior nodes and leaf nodes do not require prior probabilities to be defined
and their values will change according to the inference process. Note that we do
not have to assign the probability value to the query node because we will
instantiate this node during the inference process.
For reasons of clarity and simplicity, we will use the model with a two-layer
document network depicted in figure 4-1. The conditional probabilities of a
document node given its parents, that is P(d | t1, …, tn), are defined using the term
weights. The term weight can be calculated as the product of the term frequency
(tf) and inverse document frequency (idf) (refer section 2.5.2). This term weight
based on tf*idf may not produce values in the range [0,1], the usual range for
probability values. Therefore to use term frequency in the conditional probability
we need to normalise the term weight. Table 4.1 shows two such term frequency
normalisation methods.
    Max         tf/max(tf)
    Augmented   0.5 + 0.5*tf/max(tf)
Table 4-1 Variety of term frequency normalisations.
The main difference between the max and the augmented methods lies with
the density of the distribution they produce. The augmented term frequency for a
particular document will be in the range [0.5,1], whereas the max method
produces a term frequency in the range [0,1].
We can use different combinations of term frequency normalisations and
inverse document frequency weighting schemes. For example, we may adopt the
combination of the max term frequency and the inverse document frequency. With
these choices, the term weights are then calculated as
termweight = (tf / max(tf)) * log(N / df)    (4.1)
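Table 4-1 and equation (4.1) can be sketched in a few lines. This is an illustrative reconstruction with invented counts, not thesis code; the base of the logarithm is an implementation choice (the natural logarithm is used here).

```python
import math

def tf_max(tf, max_tf):
    """Max normalisation from table 4-1: tf / max(tf)."""
    return tf / max_tf

def tf_augmented(tf, max_tf):
    """Augmented normalisation from table 4-1: 0.5 + 0.5 * tf/max(tf)."""
    return 0.5 + 0.5 * (tf / max_tf)

def term_weight(tf, max_tf, N, df):
    """Equation (4.1): max-normalised tf times idf = log(N / df)."""
    return tf_max(tf, max_tf) * math.log(N / df)

# Invented counts: the term occurs 3 times, the most frequent term in the
# document 6 times; the collection has 1000 documents, 100 containing the term.
w = term_weight(tf=3, max_tf=6, N=1000, df=100)
```

Note how the augmented form squeezes its output into [0.5, 1], while the max form spans [0, 1], the density difference discussed above.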
We found that different weighting schemes provide different levels of
recall and precision. These differences occur due to the different densities of the
probability distributions provided by the different weighting schemes. We found
that the best recall and precision was achieved by the middle range density
distribution, that is the distribution in the max approach. Further detailed
discussion on the behaviour of the model with different weighting schemes is
presented in Chapter 6.
The index terms are treated as though they have an equal chance to be
found relevant to the information need. Thus we can assign 1/(total no. of index
terms) as the prior probability of each index term node. The values of these prior
probabilities in the index term nodes will change when a new query is submitted
and a query network is attached to the document network.
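The uniform prior just described is straightforward to set up; a minimal sketch with a hypothetical term list follows.

```python
# Uniform priors for the index-term (root) nodes: each term is assumed
# equally likely to be found relevant before any query arrives.
# The term list is invented for illustration.
index_terms = ["automatic", "information", "retrieval", "image"]
prior = {t: 1.0 / len(index_terms) for t in index_terms}
print(prior["image"])  # 0.25
```

These values are only starting beliefs; attaching a query network and performing inference overwrites them.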
4.3 Probabilistic Inference in Information
Retrieval
The Bayesian probability approach views probability values of nodes as degrees of
belief in an event or proposition. The links in the network represent the cause and
effect relationships between two propositions. The whole network may be viewed
as our universe of belief, in which all the propositions interact, their effects on
each other being derived through the links and any independence assumptions
inherent in the network. This process is recognised as being similar to the human
reasoning process. A person always has some belief value for a particular issue.
These belief values are arrived at using knowledge which comes through
experience. A new experience arrives at the human belief system regularly through
observation of new evidence. This new evidence causes the human to adjust their
belief value, ie. to perform a reasoning process. The belief in some propositions
may be amplified while belief in others may be lessened. A Bayesian probabilistic
inference algorithm tries to model this human reasoning process. Any new
evidence which comes to the network will alter the belief distribution in the
network. In order to change the belief distribution, a probabilistic inference
process is used to make a decision as to which propositions will be affected and by
how much the belief in these propositions may have to change. A Bayesian
probabilistic inference algorithm carries out this decision using two characteristics
of the network model, namely the semantics of the network to determine the
independence assumption, that is the decision as to which propositions will be
affected by the new evidence, and the numeric contents (quantitative
representation) to calculate the value of new belief for all the affected nodes.
We will use Pearl’s inference algorithm [Pearl88] in our model. In this
algorithm, the independence assumption can be validated formally through a d-
separation check or heuristically by considering the shape of the network (see
chapter 3). Once the independence assumption between nodes has been
established, the process of the belief updating is performed using the link matrices.
A link matrix represents all possible conditional probabilities of a node given the
belief values of its parents. For example, if a node x has a set of parents
πx = {p1, p2, p3, …, pn}, we must estimate P(x|πx) = P(x|p1,p2,p3,…,pn). Since we are
dealing with binary valued propositions ( see definition 4.2), this link matrix can
be represented by a matrix of size 2 × 2^n for a node with n parents. The matrix
elements specify the probability taken by a node x given the truth value of its
parents. Given that all the parents of x are independent, the estimate P(x|
p1,p2,p3,…,pn) can be presented as the sum of all these truth values. For
illustration, we will assume that a node doc-1 is constructed from three index
terms X,Y and Z (figure 4-4) and that
P(X = true) = x
P(Y = true) = y
P(Z = true) = z
[Figure omitted: a node doc with three parent nodes X, Y, and Z.]
Figure 4-4 Network for the link matrix example.
The link matrix for the information retrieval network in figure 4-4 can be
constructed as L[i,j], i ∈ {0,1}, 0 ≤ j < 2^n, given that the parents correspond to pj
(j ∈ {0,1,2} in our example) [Turtle91]. We will use the row number to index the
values assumed by the child node and use the binary representation of the column
number to index the values of the parents. The high order bit of the column
number indexes the first parent’s value, the second highest order bit indexes the
second parent and so on. The link matrix for figure 4-4 is therefore:
      ⎡ P(doc = false | j=0)  …  P(doc = false | j=7) ⎤
L  =  ⎣ P(doc = true | j=0)   …  P(doc = true | j=7)  ⎦

where the binary representation of the column index j gives the truth values of
the parent nodes X, Y, Z (X as the high-order bit). The first row holds the values
taken when Pr(doc = false) and the second row those taken when Pr(doc = true).
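The indexing convention just described can be made concrete. The sketch below (not thesis code) computes the closed-form belief update P(x = true) for an arbitrary link matrix, assuming independent parents; the OR matrix used as a check anticipates section 4.3.1.1.

```python
from itertools import product

def belief(link_matrix, parent_probs):
    """Closed-form update: P(child = true) is the sum over the 2^n parent
    configurations j of L[1][j] * P(configuration j), assuming the parents
    are independent. The binary representation of j gives the parents'
    truth values, with the high-order bit indexing the first parent."""
    n = len(parent_probs)
    total = 0.0
    for j, config in enumerate(product([0, 1], repeat=n)):
        p_config = 1.0
        for value, p in zip(config, parent_probs):
            p_config *= p if value else (1.0 - p)
        total += link_matrix[1][j] * p_config
    return total

# OR link matrix for three parents: doc is false only when all parents are false.
L_or = [[1, 0, 0, 0, 0, 0, 0, 0],
        [0, 1, 1, 1, 1, 1, 1, 1]]
print(belief(L_or, [0.5, 0.5, 0.5]))  # 0.875 = 1 - (1-0.5)**3
```

The enumeration order of itertools.product matches the binary column indexing, so no explicit bit manipulation is needed.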
4.3.1 Link Matrices
We will describe three link matrix forms that can be used for different information
retrieval implementations, namely the OR and AND link matrices for Boolean retrieval
and the weighted-sum matrix for probabilistic retrieval. We will base our
discussion in this section on the network example in figure 4-4.
4.3.1.1 OR-link matrix
In an or-combination link matrix, the doc node will be true when any of X,Y, or Z
is true and false when all of X,Y,Z are false. So, for our example,
       ⎡ 1  0  0  0  0  0  0  0 ⎤
Lor =  ⎣ 0  1  1  1  1  1  1  1 ⎦
Using a closed form update procedure we have
P(doc=true) = (1-x)(1-y)z + (1-x)y(1-z) + (1-x)yz + x(1-y)(1-z) +
x(1-y)z + xy(1-z) + xyz
The update procedure can be simplified as
P(doc = true) = 1 – (1- x)(1-y)(1-z)
P(doc = false) = (1-x)(1-y)(1-z)
4.3.1.2 AND-link matrix
For an and-combination link matrix, the doc node will be true when all of X,Y
and Z are true and false otherwise. Thus we have a matrix of the form
        ⎡ 1  1  1  1  1  1  1  0 ⎤
Land =  ⎣ 0  0  0  0  0  0  0  1 ⎦
Again using closed form update, we have
P(doc=false) = (1-x)(1-y)(1-z) + (1-x)(1-y)z + (1-x)y(1-z) + (1-x)yz +
x(1-y)(1-z) + x(1-y)z + xy(1-z)
The calculation can be simplified as
P(doc = true) = xyz
P(doc = false) = 1 – xyz
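Both simplified closed forms are easy to check numerically; a sketch with invented belief values:

```python
def p_or(x, y, z):
    """Closed-form OR update: doc is true unless all three parents are false."""
    return 1.0 - (1.0 - x) * (1.0 - y) * (1.0 - z)

def p_and(x, y, z):
    """Closed-form AND update: doc is true only when all three parents are true."""
    return x * y * z

print(p_or(0.9, 0.5, 0.1))   # 0.955
print(p_and(0.9, 0.5, 0.1))  # 0.045
```

The two results bracket the parent beliefs from above and below, which is the difference in influence discussed next.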
The AND and OR link matrices infer different degrees of influence to the belief in
the child node. The influence of the belief values for (parents = true) are greater in
the OR link matrix than in the AND matrix. Therefore, we can use the OR matrix
when we are interested in having the child belief values significantly influenced
by the belief values of (parent = true).
4.3.1.3 WEIGHTED-SUM link matrix
The weighted-sum link matrix is an attempt to weight the influence of individual
parent nodes on the probability value of the child node. A parent with a larger
weight will influence the child more than a parent with a smaller weight. If we let the
links between the node doc and nodes X ,Y, Z be weighted as wx,wy,wz respectively
and set t= wx + wy + wz, for our example we have the link matrix of the form
       ⎡ 1   1−wdwz/t   1−wdwy/t   1−wd(wy+wz)/t   1−wdwx/t   1−wd(wx+wz)/t   1−wd(wx+wy)/t   1−wd(wx+wy+wz)/t ⎤
Lws =  ⎣ 0    wdwz/t      wdwy/t     wd(wy+wz)/t     wdwx/t     wd(wx+wz)/t     wd(wx+wy)/t     wd(wx+wy+wz)/t  ⎦

where wd is the weight attached to the doc node itself and the columns follow the
same binary parent-configuration ordering as before (X the high-order bit, Z the
low-order bit).
Evaluation of the link matrix produces

P(doc = true) = (wd/t)[ wz(1−x)(1−y)z + wy(1−x)y(1−z) + (wy+wz)(1−x)yz
                + wx x(1−y)(1−z) + (wx+wz)x(1−y)z + (wx+wy)xy(1−z)
                + (wx+wy+wz)xyz ]

              = wd(wx x + wy y + wz z) / t
P(doc = false) = 1 − P(doc = true) = 1 − wd(wx x + wy y + wz z) / t
We may use the term weight (such as the term frequency (tf)) for the parent's
or index term weight (wx, wy, wz in the above example) to implement the weighted
sum, because the parents' (the index terms') weights are summed and normalised
over a document. Analogously, we may use the inverse document frequency (idf)
for the weight of the document node to represent the index term's ability to
discriminate the document from the other documents in the collection. Since we
always multiply the weight of the individual parent (index term) by the weight of
the child (document) when P(doc=true), the weight we have used is actually
equivalent to a tf*idf weighting. Thus, we can assign the tf*idf values as the link
weights. In other words, the link weight represents P(dj|ti).
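A sketch of the weighted-sum closed form P(doc = true) = wd(wx·x + wy·y + wz·z)/t, with invented weights and beliefs (not thesis code):

```python
def p_weighted_sum(parent_probs, parent_weights, w_doc):
    """Weighted-sum closed form: P(doc = true) = w_doc * sum(wi * pi) / t,
    where t is the sum of the parent weights."""
    t = sum(parent_weights)
    return w_doc * sum(w * p for w, p in zip(parent_weights, parent_probs)) / t

# Invented numbers: three parents with beliefs x, y, z and tf-style weights.
print(p_weighted_sum([0.8, 0.4, 0.2], [3.0, 1.0, 1.0], w_doc=1.0))
# (3*0.8 + 1*0.4 + 1*0.2) / 5 = 0.6
```

The heavily weighted first parent dominates the result, which is exactly the behaviour the weighted-sum matrix is designed to give.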
In a query network, the link weight may be interpreted as the user
weighting the index term’s relative importance in representing their information
need. The link matrix of the query network is less complicated than that of the
document network since all index terms in the query network only have one parent
node (ie. the query node).
4.4 Directionality of the Inference
The notion of causation, that is the idea that a given random variable can be
perceived as the cause for another variable to exist or change its belief value, is
fundamental to inference using a Bayesian network. Different causal directions in
the network produce different reasoning models and thus it is important to
consider the direction of the causation in the network. In many cases the direction
of the causation is clear; in others it is difficult to distinguish between causal and
evidential support.
Causal support is represented as an arc in the network whereas evidential
support flows against the direction of the arc. By drawing an arc from node x to y
we are asserting that proposition x has in some way caused proposition y to be
observed. That is, if we observe proposition x, then this observation in turn will
determine our belief in proposition y, assuming that x is the only parent of y. If y
has other parents in addition to x, then we need to consider the influence of these
other parents.
Evidential support, on the other hand, means that the observation of
proposition y may change the belief in proposition x because y is a potential
explanation of x. Thus, in this case, knowing y will confirm or oppose the belief in
x.
In the basic information retrieval model depicted in figure 4-1, there are
three different propositions which may be used as causal or evidential support,
namely the queries, the index terms and the documents. In our model, we assert in
the query network that the observation of a query influences our belief as to which
index terms are useful in representing the user’s information need. In the
document network, we assert that the combination of the index terms causes the
object document to exist. The inference process is performed by instantiating the
query network and observing the result of inference at the document nodes. The
topology is depicted in figure 4-5a.
[Figure omitted: (a) arrows from the query to the index terms and from the index
terms to the documents; (b) the inverse, arrows from the documents to the index
terms and from the index terms to the query; (c) arrows from both the query and
the documents into the index terms.]
Figure 4-5 Contrasting causal topologies.
There are at least two other possible topologies which may be used in
information retrieval modeling. In the first, we simply invert the network as shown
in figure 4-5b. Thus, we assert that the observation of the document causes a
change in belief in the index terms and in turn changes the belief in the query. This
approach to modeling the inference network was taken by Turtle and Croft
[Turtle91]. Superficially, the difference between the two topologies appears
trivial, however we have found that the topology shown in figure 4-5b does not
provide a “correct” inference model for information retrieval [Ghazfan96]. In this
section, we will show what we mean by the “correct” inference. We will also
defer until later in this section the discussion of the third topology shown in figure
4-5c.
As we have stated previously in chapter 2, an information retrieval model's
task is to estimate the relevance of the document to a given query. In other words,
it attempts to estimate P(Relevant|documenti,queryj), ie. the probability that a
document is relevant to a given query. Applying this estimation to the topology
shown in figures 4-5a and 4-5b respectively, we have the situation as shown in
figure 4-6
[Figure omitted: the topologies of figures 4-5a and 4-5b, each extended with a
relevance node at the end of the inference chain.]
Figure 4-6 Causal topologies with relevance node.
The node relevance in figure 4-6 is introduced to the graph to explicitly represent
the belief value that exists at the end of the inference network chain. In this case,
this is the belief value of whether the document is relevant to the query. Thus, the
node relevance in figure 4-6 corresponds to an area in the universe U which is
occupied by the query Q, the set of index terms ti and the document d, in other
words P(Q∩ti∩d). Using Bayes theorem notation, the graph in figure 4-6a gives:
P(Relevant|d,Q) = P(Relevant|d,t,Q)
= P(d,t,Q) (4.2)
By conditioning the probability values only on the index terms in the evidence set,
ie. the index terms in the query, the above equation can be simplified further into:
P(Relevant|di,Qj) = P(di|Qj)    (4.3)
Using the same procedure, the graph in figure 4-6b gives:
P(Relevant|di,Qj) = P(Qj|di)    (4.4)
The belief value presented by the relevance node is the result of the belief
propagation process triggered by the arrival of new evidence. The arrival of new
evidence for figure 4-6a (our approach) is indicated by the introduction of a query
into the system. On the other hand, the result of the inference for the graph in
figure 4-6b is obtained by instantiating one document node at a time in the
network. In our approach, the relevance value to be measured is that associated
with the probabilities at the document nodes. The hypothesis that we have to verify
is whether the documents are relevant to a given query (calculating
P(d|Q)). In contrast, the approach adopted in figure 4-5b and figure 4-6b
measures the relevance of the query node. The hypothesis then to be verified is
the relevancy of the query to a selected document, ie. P(Q|d).
If we apply Bayes theorem to equations 4.3 and 4.4, the two above models
of retrieval produce the following interpretations respectively:
P(di|Qj) = P(di, Qj) / P(Qj)    (4.5)

P(Qj|di) = P(di, Qj) / P(di)    (4.6)
Note that the denominators of equations (4.5) and (4.6) are normalisation
factors for the equations. Therefore, in the process of answering an arbitrary query,
equation (4.5) uses the same normalisation factor in every document matching
process for that query. Equation (4.6) on the other hand, uses a different
normalisation factor for each observed document in the same query. Since an
information retrieval system fully evaluates all the documents in the collection for
a single query introduced to the system to find the relevant documents, equation
4.6 will produce different normalisation values across documents instantiated for a
single query. Hence we can assert that an implementation that exhibits the features
of equation (4.5) will give a “correct” result. To clarify this assertion and the
importance of a common normalisation value, consider the following case as an
example. Suppose there are four objects in the knowledge universe U: a book A, a
thesis B, an article C, and a query Q. The mapping of the document and query sets
to the knowledge space is given in figure 4-7. A quick visual inspection of the
figure reveals these facts:
Book A covers around 50% of the knowledge space occupied by set Q.
Thesis B covers around 30% of the knowledge space occupied by set Q.
Article C covers very little of Q.
[Figure omitted: the sets A, B, C, and Q drawn as overlapping regions inside the
concept universe U.]
Figure 4-7 Document and query mapping in concept universe U.
Conversely, note that query Q covers around 30% of set A, around 70% of
set B and most of set C. Intuitively, we would choose A, followed by B and C as
our order of ranking for the retrieval of the documents relevant to query Q.
However, if we apply equation (4.6) and examine the document ranking produced
with this equation, we obtain the following order: article C followed by thesis B
and then book A. This example illustrates that adopting a topology that uses an
inference direction of P(d|Q) (figure 4-5a) provides a semantically more accurate
result than the model that uses P(Q|d) as in figure 4-5b.
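The ranking argument of figure 4-7 can be reproduced with explicit sets. The sets below are invented, with set-overlap ratios standing in for the probabilities P(d|Q) and P(Q|d):

```python
# Invented concept sets: Q has 10 elements; A (a book) covers half of Q but
# is large; B (a thesis) covers 3/10 of Q and is small; C (an article) is a
# single element lying inside Q.
Q = set(range(10))
A = set(range(5)) | set(range(100, 112))  # |A| = 17, |A & Q| = 5
B = {0, 1, 2, 200}                        # |B| = 4,  |B & Q| = 3
C = {9}                                   # |C| = 1,  |C & Q| = 1

def rank(docs, score):
    """Rank document names by descending score."""
    return sorted(docs, key=lambda name: score(docs[name]), reverse=True)

docs = {"A": A, "B": B, "C": C}
by_d_given_q = rank(docs, lambda d: len(d & Q) / len(Q))  # ~ P(d|Q)
by_q_given_d = rank(docs, lambda d: len(d & Q) / len(d))  # ~ P(Q|d)
print(by_d_given_q)  # ['A', 'B', 'C']
print(by_q_given_d)  # ['C', 'B', 'A']
```

The P(d|Q) direction matches the intuitive ranking A, B, C; the P(Q|d) direction inverts it, exactly the pathology described above.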
The third topology in figure 4-5, (figure 4-5c) asserts that both the query
and the document are causal agents or propositions for the existence of the index
terms. This model is actually the Bayesian network equivalent of the document
space modification model adopted by [Yu88, Fuhr90]. To see why this topology
is not appropriate as an information retrieval model we have to consider the
independence assumption of Bayesian networks. Using the independence
assumption heuristic in chapter 3, we can see that the observation of index term
nodes in network topology in figure 4-5c will in fact cause the query and
document to be independent. This is not a desired outcome because we would like
to infer our belief in the query to the document through the index terms. In fact if
we use the topology in figure 4-5c, we observe that the query node and the
document nodes become competing explanations for the index terms. That is, if
we observe the query to be true then it will diminish the causal effect of the
document on the index terms. We know however, that the effect of documents on
the index terms is constant for a set of document collections. The effect of
documents on index terms will only change if a new document is added or an
obsolete document is deleted from the collection. Thus, the topology in figure 4-5c
is also not able to model the information retrieval task correctly. Therefore, we
use the topology depicted in figure 4-5a as the basis for our model.
4.5 Comparison with Other Models
We have presented our Bayesian network model for information retrieval in
sections 4.2 to 4.4. In this section we will show that we can use our model to
implement other retrieval models such as the Boolean and binary independence
probabilistic model [Rijsbergen79, Robertson76]. We began a comparison with
Turtle and Croft’s inference model [Turtle91] in the discussion of causation in the
network model in section 4.4. We will further analyse the difference between their
model and ours in this section. We will show that our model provides not just a
semantically correct document ranking in general but also a richer framework for
information retrieval modeling. It is worth noting at this point that the drawbacks
of Turtle and Croft’s model do not preclude it from producing good recall and
precision [Turtle90, Rajashekar95]. Our model has also shown promising results
(see chapter 6) as well as providing a more general and richer framework for
information retrieval.
4.5.1 Simulating the Boolean Model
A Bayesian inference network can be used to simulate the Boolean information
retrieval model. In this model, each document is evaluated independently of the
other documents in the collection (the fact that there is no document ranking
involved means that documents in the collection may be assumed to be
independent). Thus, to simulate such retrieval we can create a disjoint network for
each individual document in the collection. Each network is then evaluated to
determine its relevance. In the implementation we will actually have one network
with different prior probabilities assigned to the index term (ti) nodes. In the
Boolean model, no weighting is applied, ie. P(ti|Q=true)=1 for all index terms ti
used to represent the user’s information need Q. The simulated network can be
built using the following steps:
• Build an expression tree for the query network.
• Assign P(ti|Q)=1 for all ti∈d and P(ti|Q)=0 for all ti∉d.
• Instantiate node Q.
• Use the logical link matrices Lor and Land (section 4.3.1) to calculate the
value of P(d), depending on the Boolean relationships between the term
nodes in the query.
Then, P(d)=1 means that the document d satisfies the query and P(d)<1 means it
does not satisfy the query. To illustrate, consider the Boolean query
(information or image) and retrieval
applied to the document collection in figure 4-1. The new network with the
expression tree is depicted in figure 4-8.
[Figure: the query node Q is linked to the index term nodes information, image
and retrieval; information and image feed an or node, which together with
retrieval feeds an and node attached to the document node d.]
Figure 4-8 Boolean model implemented by a Bayesian Network.
Consider a document which includes all those index terms appearing in the
query. That is, P(information|Q), P(image|Q) and P(retrieval|Q) are all equal to
1. Using the Lor link matrix (section 4.3.1), the new beliefs in the index terms in
turn cause the or node to take the value 1. Using the Land link matrix, the or
node and retrieval node cause the value of node and to become equal to 1. This
value is then passed down to the node document d so that P(d)=1, therefore d
satisfies the query as expected.
We can consider another example, a document which contains only the
index terms information and image but not the index term retrieval. In this
example, we can assign P(retrieval|Q)=0, P(information|Q)=1 and
P(image|Q)=1. Using the same link matrix, the value of P(d) becomes 0, therefore
the document d does not satisfy the query as expected. These two examples show
how the Boolean retrieval model can be effectively simulated by the Bayesian
network model.
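The two worked examples above can be sketched in code. This is a minimal illustration, not the thesis implementation: for binary beliefs, the logical Lor and Land link matrices reduce to max and min respectively.

```python
# Minimal sketch: simulating the Boolean query
#   (information OR image) AND retrieval
# with logical link matrices. For 0/1 beliefs, Lor reduces to max()
# and Land reduces to min().

def l_or(*parents):
    # Logical OR link matrix: child is true if any parent is true.
    return max(parents)

def l_and(*parents):
    # Logical AND link matrix: child is true only if all parents are true.
    return min(parents)

def boolean_query(doc_terms):
    # P(ti|Q) = 1 if index term ti occurs in document d, 0 otherwise.
    p = {t: 1 if t in doc_terms else 0
         for t in ("information", "image", "retrieval")}
    return l_and(l_or(p["information"], p["image"]), p["retrieval"])

# First example: document containing all three index terms -> P(d) = 1.
assert boolean_query({"information", "image", "retrieval"}) == 1
# Second example: document without "retrieval" -> P(d) = 0.
assert boolean_query({"information", "image"}) == 0
```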
4.5.2 Simulating the Probabilistic Retrieval Model
In the probabilistic retrieval model, a document is described by the presence or
absence of index terms. Any document can therefore be represented as a binary
vector d(k1,k2,…,kn), where ki=1 indicates the presence of index term i in the
document and ki=0 indicates its absence. The document ranking is calculated
from a cost function for the retrieval of a particular document containing index
term i, with the document considered either relevant or non-relevant. The
Bayesian network model for the probabilistic model is represented by figure 4-9.
[Figure: a relevant node linked to the index term nodes t1, t2, …, tn, which are
in turn linked to the document node d.]
Figure 4-9 Probabilistic retrieval using a Bayesian network.
In a traditional probabilistic model, each individual document is considered
in isolation from the other documents in the collection. The set of index terms
observed in a document is restricted to the subset which occurs in the query. The
values used for the ranking depend on the ratio of the values of P(ti|relevant) and
P(ti|non-relevant) (see section 2.4.3.1).
In a Bayesian network, we use the values of P(d) at the leaf nodes of the
network (the document nodes) to determine the ranking of the document d
relative to that of the other documents in the collection. There is no explicit
representation of the query in the network. The query node is replaced by the node
relevance. The index terms are conditioned on the relevance node and restricted to those that occur in the
query. The main consideration for the probabilistic model is the estimation of the
relevant and non-relevant document set. This estimation will not be accurate
without comprehensive sampling of queries and relevance judgments. One way to
overcome this problem is to estimate the relevance of the index terms which are
found in the relevant documents. This estimation may be achieved using a small
sample set of relevant documents; the relevance of other documents can then be
estimated from the presence, in the currently observed document, of the index
terms found in the relevant documents.
The probability that an index term ti is found to be relevant is given by the
conditional probability P(ti|relevance). These values may be estimated from a
small sample retrieval or by using the inverse document frequency [Croft79] when
no relevance information is available. One advantage of the probabilistic retrieval
implementation using our model is that this estimation may be derived from a
user’s confidence in the terms they use in the query. This is possible because we
provide an explicit relationship between relevance and index terms via the
relevance->index term link. When a set of relevance judgements is available after
the retrieval is made, only the P(ti|relevance) values need be changed to represent
the new confidence level of the user in the index terms. We can see from this that our
model provides a consistent interpretation of the P(ti|relevance) which has been
lacking in the traditional probabilistic model. Thus, we also can claim that our
model subsumes the binary probabilistic retrieval model since it is able to simulate
the model and provides a more intuitive interpretation of the relevance estimation
in the model.
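For comparison, the binary independence ranking that this section relates to our network can be sketched as follows. The p and q estimates below are purely illustrative, not values from the thesis:

```python
import math

# Hedged sketch of binary independence ranking [Robertson76]:
#   p = P(term present | relevant), q = P(term present | non-relevant).
# Documents are ranked by the sum of log-odds weights over query terms
# present in the document. The p/q values here are invented.

def bim_score(doc_terms, term_stats):
    score = 0.0
    for term, (p, q) in term_stats.items():
        if term in doc_terms:
            score += math.log((p * (1 - q)) / (q * (1 - p)))
    return score

term_stats = {"information": (0.8, 0.3), "retrieval": (0.6, 0.2)}
d1 = bim_score({"information", "retrieval"}, term_stats)
d2 = bim_score({"information"}, term_stats)

# A document matching more (and better-discriminating) terms ranks higher.
assert d1 > d2 > 0
```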
4.5.3 Inference Network
We began discussion on the different approaches to the inference network model
for information retrieval in section 4.4.3. In this section, we will further analyse
and comment on the differences between our model and Turtle and Croft’s
inference model.
The two models differ in their assumptions about causation in the
network. Turtle and Croft’s model assumes that documents are the main cause of
the index terms’ existence, which in turn causes the information need to exist. Our
model, on the other hand, asserts that the main cause of the existence of index terms
is the information need. These contradicting causation assumptions lead to
different inference directions. Turtle and Croft’s model infers the
evidence as P(Q|d), whereas ours infers it as P(d|Q). We have shown in
section 4.4 that the inference process of P(d|Q) produces a more accurate
document ranking in the retrieved document set. The benefits of applying P(d|Q)
rather than P(Q|d) are not limited to the provision of accurate ranking. There are
other benefits which may be gained through our model compared with Turtle and
Croft’s model. For example, our model is able to capture interconnectivity
between documents in a collection. This enables us to implement
relevance feedback in the model easily. To illustrate this, consider the two networks
depicted in figure 4-10.
[Figure: two three-layer networks relating the query node q, index terms T1–T6
and documents D1–D3.]
Figure 4-10a Our model    Figure 4-10b Turtle and Croft’s model
Figure 4-10 Contrasting Bayesian networks.
Relevance feedback is a method used in information retrieval to refine a
user’s information requirements after the user judges the relevance of the
documents retrieved. There are two basic ways by which feedback data can be
incorporated in a Bayesian network: adding evidence and altering the
dependencies represented in the network. The two approaches are fundamentally
different. Adding evidence always leaves the probability distribution in the
network unchanged. However, it will alter the beliefs in the network to be
consistent with that distribution. Altering the dependencies, either by changing the
topology of the network or by altering the link matrices changes the probability
distribution which in turn alters beliefs in the network.
Implementing the addition of evidence in the network is very
straightforward in our model (see figure 4-10a). The document nodes for all the
documents that the user chose as relevant are set as evidence nodes; in other
words, we assign belief = 1 to those document nodes found relevant by the
user. Then we can instantiate the query node again and calculate the new
probabilities in all the document nodes. Note that considering the document nodes
as evidence will set all the index terms for those documents to be dependent, and
in turn they will change the beliefs in the other documents that share the same
index terms with the relevant documents. Therefore the introduction of new
evidence into the network will change the beliefs not only in those documents
found to be relevant but also in other documents in the network. This approach
cannot be implemented in Turtle and Croft’s model because they have disjoint
inference for each document; i.e. they instantiate each document in isolation from
other documents in the collection and only consider those index terms that exist in
the query. If we assign the document nodes as evidence, then it is only possible to
change the belief of those index terms activated by the instantiated document
nodes and shared by both document and query.
Consider the small network depicted in figure 4-10b. The new evidence of
D1 and D3 (the shaded document nodes) will not change the belief in D1, D2 or D3
because D1 and D3 do not share common index terms with the query. Even if there
is an index term shared by D1 and the query, say for example D1 contains term T3,
the instantiation of the document nodes D1 and D3 will make the index terms
independent. Thus, it will fail to alter the belief in the document D2 which the user
has not chosen as a relevant document.
The only way relevance feedback can be implemented in Turtle and Croft’s
model is by altering the dependencies in the network. Given a set of
documents that a user has chosen as the relevant documents, a new query
representation layer in the network can be built. The new query representations
can either replace or extend the original query representations. Considering the
same situation as in our previous example for evidence feedback, this dependency
feedback is implemented as the networks depicted in figures 4-11 and 4-12.
This approach can be implemented in both our model and in the model of Turtle
and Croft.
Figures 4-11a and 4-11b show how dependency feedback can be
implemented in both our model and Turtle and Croft’s model, by augmenting the
query with all the index terms found in the judged relevant documents (i.e. D1 and D3 in
our example). The inference process is then performed as normal, without
assigning evidence to the relevant documents as in the adding-evidence approach.
[Figure: the networks of figure 4-10 with the query representation layer
augmented by the index terms found in the documents judged relevant.]
Figure 4-11a Our model    Figure 4-11b Turtle and Croft’s model
Figure 4-11 Dependency feedback by augmenting.
Figure 4-12, on the other hand, shows dependency feedback implemented
by replacing the index terms in the query with the index terms in the documents
judged relevant. Notice that T3 and T4 are deleted from the query network
regardless of whether they were in the set of original index terms used to represent
the query.
[Figure: the networks of figure 4-10 with the original query index terms replaced
by the index terms of the documents judged relevant; T3 and T4 no longer appear
in the query network.]
Figure 4-12a Our model    Figure 4-12b Turtle and Croft’s model
Figure 4-12 Dependency feedback using replacement.
It has been shown that our model supports both relevance feedback
methods, whereas Turtle and Croft’s model supports only the dependency-alteration
approach. The efficiency of the respective approaches depends on the retrieval
situation. Evidential feedback is appropriate when we are confident that the
distribution in the collection is “correct”. A very specific collection domain is an
example where this approach is appropriate. Altering dependencies is appropriate
when we have low confidence in the model distribution and therefore want to
obtain better information about the nature of the true distribution. An example of
this approach is document space modification [Yu88, Fuhr90], which uses a set of
queries and relevance judgements to learn the “correct” distribution for documents
and representation concepts.
The failure to capture the document interconnectivity means that Turtle
and Croft’s model is static. By this we mean that the probabilities used to rank the
documents will not change when a new document is introduced to the system.
This situation occurs because the relevance of a document is calculated in
isolation from other documents in the collection. Although information retrieval is
often considered a static system whereby document addition and deletion are not
performed frequently, the ability to capture the changes of the distribution of
knowledge in the collection is still a desirable feature which our model is able to
provide.
The difference between the two models can also be seen in terms of the
efficiency of the inference process. In our model, the inference process starts with
the instantiation of the query node. Since there is only one query we only need to
perform one inference process. The Turtle and Croft model starts the inference
process by instantiating each individual document in the collection. Thus repeated
inference processes are required, in proportion to the number of documents in the
collection.
Although our model is more efficient in terms of the inference process, this
can only be achieved by carefully handling the independence assumption and the
link matrix. Our model is richer than Turtle and Croft’s model by virtue of its
document interconnectivity; however, this also means that our network is more
complex than Turtle and Croft’s model, and thus the independence assumption
needs to be handled carefully. The link matrices will also be larger in our model,
because we have to create a link matrix over the index term parents of each
document node. Since the size of such a link matrix is 2^n, where n is the number
of index term parents of a document node, n greater than 20 will be common. This
link matrix size issue is not a problem in Turtle and Croft’s model, because the
maximum size of their link matrix is proportional to the number of index terms
used in the query.
We will discuss these implementation issues in Chapter 5.
4.6 Summary
In this chapter, we have presented a new formal model for information retrieval
based on Bayesian network theory. The proposed model subsumes the existing
models by providing a more general framework for modeling information retrieval.
The proposed model can represent existing models by using appropriate network
representations. As a result, the decision of adopting a specific network model can
be seen as an issue of implementation.
The proposed model consists of two separate networks, namely a
document network and a query network. These two networks are combined
during the matching process. The matching process is started by instantiating the
query node and calculating the effect of this new evidence in the probability
distribution in the network.
Different inference directions in the network have different effects on the
probability distribution. The two possible inference directions in information
retrieval are P(d|Q) and P(Q|d). We have shown in this chapter that the first
approach gives a more accurate result and also provides a richer model through its
ability to support both evidence and dependency relevance feedback.
We have concentrated in this chapter on a discussion of our model of
information retrieval using Bayesian networks. Implementation issues will be
discussed in Chapter 5. These can be categorised into two groups, namely the
computational complexity of the inference algorithm and the indirect loop which
arises during the relevance feedback process. We will discuss some existing approaches to these
two issues and their practicality in information retrieval implementations. We will
117
also present our approaches to reducing the computation complexity and for
dealing with the indirect loop.
Chapter 5
Handling Large Bayesian Networks
5.1 Introduction
We have presented a Bayesian network model for information retrieval in the
previous chapter. This chapter presents discussion on issues associated with the
practical implementation of the model. Implementing an information retrieval
system using a Bayesian network is not a straightforward task. Exact inference
in a Bayesian network, as performed by algorithms such as Pearl’s, has been shown to
be NP-hard [Cooper90]. Thus, approximation techniques need to be considered in
order to implement the model in practice. Before we discuss the issue of
implementing the Bayesian network using an exact algorithm, first we will
illustrate the use of the exact algorithm in a retrieval process (section 5.2).
Following this example we will discuss possible problems that may occur during
implementation if we strictly follow Pearl’s algorithm or use the basic model
presented in chapter 4. There are two main issues to be considered, namely the
complexity of the computation and the indirect loop.
The complexity of computation is caused by the size of the link matrix.
The link matrix size in an information retrieval network is determined by the
number of index terms contained in a document. Documents with more than 30
index terms are common in a document collection. Therefore, the total size of the
link matrices for the collection will be generally very large. We propose an
approximation method involving the addition of a virtual layer to the network
[Indrawan96]. This method provides a solution to the problem of computational
complexity without losing much of the accuracy required by the network to
perform the retrieval. Existing approximation methods such as node deletion, link
deletion and layer reduction do not provide adequate accuracy in modeling the
retrieval task. We will discuss these different methods and compare them with our
proposed method in section 5.3.
The indirect loop problem in the network occurs when evidence
relevance feedback is used. We will discuss the existing solutions to the indirect
loop problems in section 5.4. The discussion includes the methods of clustering,
conditioning and sampling. The main problem with these existing methods lies
with their own individual computational complexity. This complexity prevents us
from adopting these methods for information retrieval. We propose a new
method involving the use of an intelligent node [Indrawan98]. This method
provides for much less complex computation than the existing methods, thus
providing a good solution to the indirect loop problem. We present this new
method of handling the indirect loop in section 5.5.
5.2 An Illustration of an Exact Algorithm
As we discussed in chapter 4, we have adopted Pearl’s algorithm to carry out the
inference process in our model. There are two main approaches to inference
algorithms, namely the exact and approximate approaches [Henrion90] and
Pearl’s algorithm falls into the former category. An exact algorithm defines the
complete probability distribution of the propositions in the network, and as such is
computationally intensive. The approximate approach, on the other hand, uses
estimation techniques to approximate the probability distribution in the network. To
illustrate the use of Pearl’s algorithm in our model, consider an example of
retrieval in a small network depicted by figure 5-1.
[Figure: the query node q is linked to information (weight 0.7) and retrieval (0.4);
document D1 is linked from similarity (0.4) and image (0.7); D2 from image (0.3),
information (0.5) and retrieval (0.4); D3 from retrieval (0.6) and feedback (0.8).]
Figure 5-1 A network example of a retrieval process.
We assume that we know the values of the link weights for each link in the
document network and that these links were derived from the index term
distribution in the collection using equation 4.1 (chapter 4). We also assume that
a user has found that the index term information carries more weight than does
the index term retrieval. Therefore their links are assigned weights of, for example,
0.7 and 0.4 respectively. If the user is not willing to assign the weights themselves,
approximate weights can be derived from the distribution of index terms in the
collection such as the tf*idf value. Note that the query is submitted as a natural
language query and is not restricted to one sentence. Thus it is possible that an
index term occurs more than once in the query, i.e. tf > 1 for that term. The idf
value for an index term is derived in the same way as in the document network,
that is, from the number of documents containing the index term in the collection.
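As an illustration, a hedged sketch of deriving such approximate link weights from tf*idf, assuming the common formulation idf = log(N/n_t); all of the counts below are invented for the example:

```python
import math

# Illustrative sketch of approximate query link weights via tf*idf,
# assuming idf = log(N / n_t). The counts are invented.

def tf_idf(tf, n_docs_with_term, n_docs_total):
    return tf * math.log(n_docs_total / n_docs_with_term)

# "information" occurs twice in the query and in 100 of 1000 documents;
# "retrieval" occurs once and appears in 400 of 1000 documents.
w_information = tf_idf(2, 100, 1000)
w_retrieval = tf_idf(1, 400, 1000)

# Rarer, repeated terms receive larger link weights.
assert w_information > w_retrieval > 0
```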
The retrieval process is started by instantiating the query node q. The
effect of this new evidence in the network is then passed on to its children, that is,
to the nodes information and retrieval. The value P(information=true|q=true)
is given by the weight of the link that connects the two nodes. Thus, using the
weight-sum link matrix (see section 4.3.1), we have the following link matrices for
the nodes information and retrieval respectively:
Linformation = | 1   0.3 |
               | 0   0.7 |

Lretrieval   = | 1   0.6 |
               | 0   0.4 |
Link matrices also have to be created for nodes D1, D2 and D3. Since
these nodes have more than one parent node, their link matrices need to reflect
the possibility that only one parent is true or that multiple parents are true. To
capture these possibilities we can combine the weight-sum and or approaches to
the link matrix. That is, the probability that the child node is true is obtained by
considering every possible combination of true parent nodes. Using this
approach, we can calculate the probability of the child node given the states of its
parent nodes in the document network as follows:
P(D1|similarity=true)=0.4
P(D1|image=true)=0.7
P(D1|similarity,image=true)=0.4*(1-0.7)+(1-0.4)*0.7+0.4*0.7=0.82
or, equivalently,
P(D1|similarity,image=true)=1-(1-0.4)(1-0.7)=0.82
LD1 = | 1   0.6   0.3   0.18 |
      | 0   0.4   0.7   0.82 |
P(D2|image=true)=0.3
P(D2|information=true)=0.5
P(D2|retrieval=true)=0.4
P(D2|image,information=true)=1-(1-0.3)(1-0.5)=0.65
P(D2|image,retrieval=true)= 1-(1-0.3)(1-0.4)=0.58
P(D2|information,retrieval=true)=1-(1-0.5)(1-0.4)=0.7
P(D2|image,information,retrieval=true)= 1-(1-0.3)(1-0.5)(1-0.4)=0.79
LD2 = | 1   0.7   0.5   0.6   0.35   0.42   0.3   0.21 |
      | 0   0.3   0.5   0.4   0.65   0.58   0.7   0.79 |
P(D3|retrieval=true)=0.6
P(D3|feedback=true)=0.8
P(D3|retrieval,feedback=true)=1-(1-0.6)(1-0.8)=0.92
LD3 = | 1   0.4   0.2   0.08 |
      | 0   0.6   0.8   0.92 |
If we assume that there are 2000 index terms in the collection, the prior
probability for the nodes similarity, image and feedback is equal to 1/2000, or
5×10⁻⁴. Instantiating node q thus results in:

P(similarity) = 5×10⁻⁴    P(retrieval) = 0.4
P(image) = 5×10⁻⁴         P(feedback) = 5×10⁻⁴
P(information) = 0.7
Using the appropriate link matrix we can calculate
P(D1|q) = 0.4(5×10⁻⁴)(1−5×10⁻⁴) + 0.7(1−5×10⁻⁴)(5×10⁻⁴) + 0.82(5×10⁻⁴)(5×10⁻⁴)
≈ 5.5×10⁻⁴
P(D2|q) = 0.3(5×10⁻⁴)(1−0.7)(1−0.4) + 0.5(1−5×10⁻⁴)(0.7)(1−0.4) +
0.4(1−5×10⁻⁴)(1−0.7)(0.4) + 0.65(5×10⁻⁴)(0.7)(1−0.4) +
0.58(5×10⁻⁴)(1−0.7)(0.4) + 0.7(1−5×10⁻⁴)(0.7)(0.4) +
0.79(5×10⁻⁴)(0.7)(0.4)
≈ 0.454
P(D3|q) = 0.6(0.4)(1−5×10⁻⁴) + 0.8(1−0.4)(5×10⁻⁴) + 0.92(0.4)(5×10⁻⁴) ≈ 0.240
Therefore, the relevance ranking of the documents in the retrieval network
depicted in figure 5-1 for query q is document D2, followed by document D3 and
then document D1.
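This ranking computation can be sketched by enumerating every configuration of parent index terms; a hedged illustration assuming the combined weight-sum/or link matrices above. Whatever rounding is used, the resulting order of D2 ahead of D3 ahead of D1 holds:

```python
from itertools import product

# Hedged sketch: recompute the document scores of figure 5-1 by summing
# the link-matrix "true" entries (1 - prod(1 - w) over true parents)
# across every configuration of the parent index term nodes.

def posterior(parent_weights, parent_priors):
    names = list(parent_weights)
    total = 0.0
    for states in product([False, True], repeat=len(names)):
        p_config, p_fail = 1.0, 1.0
        for name, on in zip(names, states):
            prior = parent_priors[name]
            p_config *= prior if on else 1.0 - prior
            if on:
                p_fail *= 1.0 - parent_weights[name]
        total += (1.0 - p_fail) * p_config
    return total

priors = {"similarity": 5e-4, "image": 5e-4, "information": 0.7,
          "retrieval": 0.4, "feedback": 5e-4}
docs = {
    "D1": {"similarity": 0.4, "image": 0.7},
    "D2": {"image": 0.3, "information": 0.5, "retrieval": 0.4},
    "D3": {"retrieval": 0.6, "feedback": 0.8},
}
scores = {d: posterior(w, priors) for d, w in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
assert ranking == ["D2", "D3", "D1"]
```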
In order to implement the evidence feedback, the document nodes which
are found to be relevant by the user are instantiated. For example, suppose the
user found that they were actually looking for information retrieval articles that
discuss image retrieval through similarity and the possibility of using relevance
feedback to improve the retrieval. In this case they might choose documents D1 and
D3 as the relevant documents. To recalculate the belief in the document nodes, we
instantiate nodes D1 and D3. When we instantiate these nodes, this new evidence
will influence the belief in the nodes similarity and image due to document D1,
and in nodes retrieval and feedback due to document D3. Because the link q→
information and q→retrieval meet tail-to-tail in node q, any change in the belief
in nodes information or retrieval will require the recalculation of belief in the node
q (see the heuristic check for the independence assumption in section 3.4.2). As a
result an indirect loop exists in the network when the evidence feedback approach
to relevance feedback is used. In other words, the network now becomes multi-
connected.
When a local propagation algorithm like Pearl’s which is devised to handle
singly connected networks is used in a multi-connected network, failure may occur
in one of two ways. Firstly, it is possible that an updating message sent by one
node cycles around the loop and causes the same node to update again. This will
repeat indefinitely, preventing convergence of the propagation. Secondly, even if
the propagation does converge, the posterior probability may not be correct due
to the algorithm’s independence assumption, which does not hold for multi-connected
networks. We therefore need to adopt some method that enables us to
break this loop so that the network becomes singly connected and hence allows
use of Pearl’s algorithm. To achieve this we need to look at the independence
assumption of the network and approximation methods.
Apart from the indirect loop issue, another important issue to be
considered during implementation is that of the overall size of the network. The
investigation of reasoning with uncertainty using Bayesian networks began during
the development of diagnosis aids for medical applications [Fryback78,
Cooper84, Heckerman85, Shwe90]. The model’s assumptions and inference
algorithms were developed based on this medical diagnostic application. The size
of the network in the medical diagnosis problem was relatively small compared to
that of information retrieval. For example, the number of nodes involved in the
Pathfinder (an expert system to assist pathologists with hematopathology
diagnosis, jointly developed by Stanford University and the University of
Southern California) is 63, whereas the smallest test collection available to
information retrieval research (the ADI collection, containing 82 individual
documents) requires around 900 nodes in a Bayesian network. In real-life
applications, the number of documents in an information retrieval collection may
be more than one thousand.
The big difference in the size of the networks occurs due to the increase in the
number of propositions introduced to the network, and the size of the link
matrices (which is dictated by the number of parent nodes of a child node). This
increase in network size causes the computational complexity of the inference
algorithm to increase accordingly.
The link matrices in a retrieval network are large due to the fact that most
collections will have documents containing, on average, more than 20 index
terms. With this number of index terms per document and the binary assignment of
index terms to documents, we will have link matrices with more than 2^20 elements
for most of the document nodes. When all the documents in the collection
are considered, the overall size of the link matrices will be large. Consider the ADI
collection, with 82 documents and an average of 25 index terms per document: the
total size of the link matrices will be around 82×2^25.
Another aspect that contributes to the increase in computational
complexity is the fact that a retrieval network is a dense network whereby a large
number of nodes share common children or parent nodes. Thus, any change in
belief in an index term node may influence a large part of the network and cause
intensive recalculation.
5.3 Reducing the Computational Complexity
Reducing the computational complexity can be achieved through the utilization of
some approximation methods. There is a trade off between a Bayesian network
model’s accuracy and its computation complexity. We need to choose carefully an
approximation technique that enables us to reduce the computation space without
sacrificing much of the accuracy. By loss of the accuracy we mean the event that
the posterior probability of the approximate method is different from the exact
algorithm posterior probability.
Approximation approaches to Bayesian network inference are not new.
Many researchers have investigated approximation techniques, motivated by the
complexity of exact algorithms like those of Olmsted [Olmsted83], Pearl
[Pearl88], and Lauritzen and Spiegelhalter [Lauritzen88]. One of the common
approaches to approximation involves coarsening the state space [Chang91,
Provan95]. The coarsening effect can be achieved in different ways, namely node
and link deletion, layer reduction and intermediate node layer addition.
5.3.1 Node and Link Deletion
One obvious way to reduce the complexity of a network is to delete the parent
node and its links when we consider that its influence on its children nodes is
minimal. Consider the example in figure 5-2.
[Figure: a network with parent nodes P1–P4, children C1 and C2, and link weights
W1–W6; in the reduced network the below-threshold links W4 and W5 are deleted,
and node P4 disappears with its only link.]
Figure 5-2a Original network    Figure 5-2b Reduced network
Figure 5-2 Reducing the network with node deletion.
Let W1,W2,…,W6 represent the link weights of the network in figure 5-2a.
If we set a threshold value x, we can compare every link in the network with x and
delete any link with link weight < x. Let us assume that W4 < x and W5 < x. Thus
we can delete those links with weights W4 and W5. The result of this operation is
depicted by figure 5-2b. Notice also that node P4 is deleted from the network,
because it exists in the network solely for the proposition expressed by P(C1|P4);
its effect on other propositions in the network is therefore lost when link W4 is
deleted. This approach can be used for approximation in information retrieval
systems with the proviso that caution needs to be taken in determining the
threshold value.
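The deletion step can be sketched as a simple filter over weighted links; the threshold x = 0.1 and the weights below are invented for illustration:

```python
# Sketch of node and link deletion: drop every link whose weight falls
# below a threshold x; any parent left without links disappears with it.
# The weights and threshold are invented for illustration.

def prune(links, x):
    # links maps (parent, child) -> link weight.
    return {edge: w for edge, w in links.items() if w >= x}

links = {("P1", "C1"): 0.6, ("P2", "C1"): 0.5, ("P3", "C1"): 0.4,
         ("P4", "C1"): 0.05, ("P3", "C2"): 0.08, ("P2", "C2"): 0.7}
kept = prune(links, 0.1)

assert len(kept) == 4
# P4's only link was below threshold, so the node itself is deleted.
assert "P4" not in {parent for parent, _ in kept}
```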
In information retrieval systems, we discriminate between documents
according to their relevance using term weights, because the link weights in our
model are implemented using tf*idf weighting. This weighting is known to
measure the scarcity of terms in the collection, and it influences the precision level
of retrieval [Sparck-Jones72]. Thus, documents containing index terms with high
associated term weights are assumed to be highly relevant. When we
remove all the links with weights below the threshold value from the network, we
actually reduce the term discrimination ability of the network because the range of
the term weights is reduced. Therefore, with this approach we will not sacrifice
system performance in terms of recall, but may lose some performance in terms of
precision. The question that remains is how much reduction in precision we can
tolerate. The only way to check the degradation in precision is to choose an
arbitrary threshold value and then examine the precision level of the reduced
network model. The chosen threshold will vary from collection to collection due
to differences in network structure.
This method is therefore appropriate for recall-oriented systems. A
recall-oriented system aims to provide the best coverage of the concept required
by the user, without worrying too much about the positions of the relevant
documents in the retrieved document list; it aims, however, to retrieve all the
relevant documents.
5.3.2 Layer Reduction
Node deletion may also be achieved through the layer reduction method
[Provan95]. In this approach the nodes of a particular layer are deleted. The links
that lead to and from these nodes are then joined to create the new links. This
approach is illustrated by figure 5-3.
[Figure: a three-layer network with links p1, p2 and p3 between the top and middle
layers and q1, q2 and q3 between the middle and bottom layers; in the reduced
network the middle layer is collapsed and each path is replaced by a direct link
with a combined weight such as p1q1.]
Figure 5-3a Original network    Figure 5-3b Reduced network
Figure 5-3 Reducing the network by collapsed layer.
Consider a part of a larger network as depicted in figure 5-3a. It contains
three layers and has two sets of link weights. The links p1, p2 and p3 connect the
top and the middle layers, whereas q1 ,q2, q3 connect the nodes in the middle and
the bottom layers. When the middle layer is collapsed, the reduced network is
depicted by figure 5-3b. The effect of the nodes in the middle layer is replaced by
the new combined link weights. For example, links p1 and q1 are now combined
into link p1q1, whose new weight can be calculated as the product of p1 and q1.
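The weight combination used in layer reduction can be sketched as follows (a simplified case with a single path through each middle node; the weights are invented):

```python
# Sketch of collapsing a middle layer: a path through a middle node with
# incoming weight p and outgoing weight q becomes a direct link with
# weight p * q. The weights are invented for illustration.

def collapse(in_weights, out_weights):
    # in_weights[m] / out_weights[m]: link weights into / out of middle node m.
    return {m: in_weights[m] * out_weights[m] for m in in_weights}

combined = collapse({"m1": 0.5, "m2": 0.9}, {"m1": 0.4, "m2": 0.2})
assert abs(combined["m1"] - 0.2) < 1e-9
assert abs(combined["m2"] - 0.18) < 1e-9
```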
With this approach, the number of link weights and nodes is reduced.
However, the number of elements in the link matrix of each child node actually
increases, because each node in the bottom layer will have more parent nodes
than in the original network. Moreover, if we collapse the index term layer
in the document network of our model, we will lose the ability to produce
document interconnectivity. In fact, if we reduce the network by taking out the
index term layer from the document network, we will have a network that gives a
retrieval function similar to that of using inner product with tf*idf weighting (see
130
section 2.4.2). In other words, the retrieval function of the network will be
equivalent to that of counting of the number of index terms shared by the query
and the document. Thus, when we use this approximation technique, we are
restricting the retrieval to a simple matching function.
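The simple matching function that the collapsed network reduces to can be written in a few lines (an illustrative sketch; the term names and weights are invented):

```python
def inner_product_score(query_terms, doc_weights):
    """Inner-product retrieval with tf*idf weighting (section 2.4.2):
    sum the document's weights over the index terms shared with the query."""
    return sum(doc_weights.get(t, 0.0) for t in query_terms)

# hypothetical tf*idf weights for one document
doc = {"information": 0.7, "retrieval": 0.4, "image": 0.5}
score = inner_product_score(["information", "retrieval", "feedback"], doc)
```

Only the shared terms "information" and "retrieval" contribute to the score; the unmatched query term contributes nothing.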
5.3.3 Adding a Virtual Layer
We propose a new technique for reducing the computational complexity, namely
adding a virtual layer to the network. In this approach, for a child node whose
number of parents exceeds a specified maximum number of parents per node,
the parent nodes are divided into a number of groups. Each group is then linked
to a virtual node. These virtual nodes are then connected to the original child
node. To illustrate the idea, consider the example network depicted in figure
5-4a. The child node in figure 5-4a has 100 parent nodes. If we assign only binary
propositions to all the nodes in the network, we will have a link matrix with 2^100
elements to calculate P(child). This is virtually impossible to implement in practice
due to limitations of computer resources. If we divide the parent nodes into small
groups, link each group to a virtual node and connect these virtual nodes to the
child node, we can greatly reduce the number of elements of the link matrix in the
child node.
Figure 5-4b portrays our modified version of the network in figure 5-4a. In
the modified network the 100 parent nodes are divided into 10 groups (with each
group containing 10 nodes). In this example, we have 10 virtual nodes in the
virtual layer.² The child node now only has 10 parents (the number of virtual
nodes). Thus the number of elements in the link matrix of the child node has been
reduced to 2^10. Each virtual node is linked to 10 parent nodes and will have a link
matrix of size 2^10. This makes the total size of the link matrices in the network
11 x 2^10, a dramatic reduction from the original size of 2^100.
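The arithmetic behind this reduction can be checked with a short sketch, assuming binary propositions and an exact split of the parents into equal-sized groups:

```python
def link_matrix_cost(n_parents, group_size):
    """Total link-matrix entries with binary propositions, before and
    after inserting one virtual layer that splits the parents into
    equal-sized groups."""
    original = 2 ** n_parents
    n_groups = n_parents // group_size      # assumes an exact split
    # each virtual node needs a 2**group_size matrix; the child now has
    # n_groups parents, hence a 2**n_groups matrix
    reduced = n_groups * 2 ** group_size + 2 ** n_groups
    return original, reduced

orig, red = link_matrix_cost(100, 10)
```

For 100 parents in groups of 10 this reproduces the figures in the text: 2^100 entries originally, 11 x 2^10 after adding the virtual layer.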
Figure 5-4a Original network; Figure 5-4b Modified network with virtual layer
Figure 5-4 Network with virtual layers.
The computing resources available for implementing the Bayesian network
dictate the choice of the maximum number of parents per node. Since information
retrieval systems are mostly used interactively, some small experiments may need
to be performed to find an acceptable response time for a query, with the
maximum number of parents per node adjusted accordingly.
The number of virtual layers is not limited to one. Once the limit has been
determined, we can distribute the index term nodes into a number of groups. The
total number of layers depends on the total number of index terms to be
distributed and the limit on the maximum number of parents per node. If the
number of virtual nodes is greater than the specified limit, then these virtual nodes
need to be grouped together as were the index term nodes. This process of
grouping and introducing new layers continues until every node in the network
has a number of parent nodes less than the specified limit. The optimum network
is obtained when we have a symmetric distribution of nodes in the network. For
example, consider our previous situation where a child node has 100 parent
nodes. If we set the limit to 15 parent nodes, it is better to have 10 groups
of 10 members each rather than, say, 6 groups of 15 members and one group
of 10 members.
² We call the node and the layer virtual because they do not actually form part of the original knowledge; they are artificially added to it.
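The preference for a symmetric split can be expressed as a small search. This is a hypothetical heuristic of our own, not an algorithm given in the thesis: it tries increasing numbers of groups, never exceeding the parent limit, and stops at the most even split.

```python
import math

def balanced_grouping(n_parents, max_parents):
    """Choose a number of groups so that group sizes never exceed
    max_parents and are as equal as possible; ties are broken in
    favour of fewer virtual nodes."""
    best = None
    for g in range(math.ceil(n_parents / max_parents), n_parents + 1):
        base, rem = divmod(n_parents, g)
        sizes = [base + 1] * rem + [base] * (g - rem)
        spread = sizes[0] - sizes[-1]   # largest minus smallest group
        if best is None or (spread, g) < (best[0], best[1]):
            best = (spread, g, sizes)
        if spread == 0:                 # perfectly symmetric: stop
            break
    return best[1], best[2]

groups, sizes = balanced_grouping(100, 15)
```

For 100 parents with a limit of 15 this selects 10 groups of 10 members, matching the example in the text.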
The virtual node acts as a summary node for a group of parent nodes. That
means that the weight of the link that connects the virtual node and the child node
has to capture summary information about the distribution of the parent nodes.
One obvious way to achieve this is to take the group average of the original
parent-to-child links and assign this average to the virtual-to-child link. The link
weights of the parent-to-virtual links are then obtained by dividing the original
link weights by the group average.
Another possible approach is to normalise the virtual-to-child link by
assigning it the maximum weight of the parent-to-child links in the group, and to
modify the original parent-to-child links in the group by dividing them by this
maximum weight.
The two approaches can be described as follows:
Let
v be the virtual node to be introduced,
p1, p2, p3, …, pn be the parent nodes connected to v,
c be the child node,
w1, w2, w3, …, wn be the weights of the links p1->c, p2->c, …, pn->c respectively,
u1, u2, u3, …, un be the weights of the links p1->v, p2->v, …, pn->v,
wv be the weight of the link v->c.
The weight wv of the link v->c using the average approach is:
wv = (w1 + w2 + … + wn) / n      (5.1)
The weight wv of the link v->c using the max approach is:
wv = max(w1, w2, w3, …, wn)      (5.2)
The weight ui of the link pi->v is:
ui = wi / wv,  for 1 ≤ i ≤ n      (5.3)
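Equations 5.1–5.3 can be exercised with a short sketch (the weights are invented for illustration):

```python
def virtual_link_weights(w, mode):
    """Compute the virtual->child weight w_v (eq. 5.1 or 5.2) and the
    normalised parent->virtual weights u_i (eq. 5.3) for one group."""
    wv = sum(w) / len(w) if mode == "average" else max(w)
    u = [wi / wv for wi in w]
    return wv, u

wv_max, u_max = virtual_link_weights([0.2, 0.4, 0.8], "max")
wv_avg, u_avg = virtual_link_weights([0.2, 0.4, 0.8], "average")
```

Under the max approach the strongest parent keeps a normalised weight of 1.0, so a high-weight index term is never diluted by its group; under the average approach it is scaled against the group mean.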
The average and the maximum approaches have different ramifications for
information retrieval systems. First, we look at the effect of taking the average
approach. Since we are averaging the values within the groups and assigning the
nodes randomly to the groups, we would have a similar wv for different virtual
nodes in the network. Note that we assign the value of tf*idf to wi, and a high
value of tf*idf is associated with a high level of importance conferred on an index
term for finding the relevant documents. Therefore, the effect of index terms with high
term weight values on the calculation of P(d) may be reduced if there are index
terms with low weight values in the same group. As a result, a document which
contains these high tf*idf index terms may lose its relative superiority compared
with a document which has low term weight but belongs to a group with a higher
weight average. This means we cannot interpret the probability in a document
node as the absolute value of the document’s probability in matching the user’s
request, but rather we should see it as a relative ranking value in comparison with
other documents in the collection. In the worst case, the precision may be affected
and may even decline.
The maximum approach, on the other hand, ensures that index terms with
high term weight values are not much affected by nodes with low weight values
in the group. This is achieved by assigning the maximum tf*idf of the group to
wv. Since we assign the maximum value of the group to the virtual-to-child link,
we ensure that the index terms with high term weight values have a major
influence in estimating the probability of a document's relevance to the user's
request. A document whose index terms have high term weight values will not be
undervalued as in the average approach. Thus, with this approach the probability
values in the document nodes will be a closer approximation of the absolute
probability of relevance to the user's query than under the average approach.
The choice of normalising the parent-to-virtual links with the average or the
maximum of the group should be made according to the implementation
requirements of the system. The average approach may be used when precision
is not a major consideration. The maximum approach, on the other hand, will
suit systems which require high precision retrieval. Regardless of the choice of
normalising approach, adding the virtual layer provides a
practical layer reduction solution to the computational complexity problem
through a drastic reduction in the size of link matrices. Moreover, our proposed
method retains the semantic structure of the original network presented in chapter
4. This characteristic of our approach provides a more accurate approximation
than the link and deletion approaches because these existing approaches reduce
the network model to an inner product retrieval function.
The adoption of a better clustering mechanism for grouping the parent
nodes can further increase the accuracy of our approximation method. In the
next section we present one clustering algorithm that can be adopted. A method
of assessing the goodness of the clustering model will be presented in chapter 7.
5.3.3.1 Clustering the Parent Nodes
In the clustering described in the previous section, we ignored the link weight
distribution in the network. The grouping is based on the sequence of the weights
in the index file. With this random approach, the performance will depend on the
sequence of the link weights to be classified. To avoid this dependency, we
propose another simple classification that takes into consideration the distribution
of the link weights.
In this non-random classification, we group similar link weights into a
group. The similarity is measured by the difference between a link weight under
consideration and the mean of a group of link weights. To generalise the proposed
concept, consider a set of items that have some attributes and these items are to be
classified into a number of groups. The clustering process involves examining an
individual item and finding its most appropriate group. The similarity in our
clustering is measured by the distance between the item’s attribute value from the
means of the groups. Each time an item is examined, its attribute values are
compared with the existing group’s mean. During the clustering process, an item
may have several candidate groups because the difference between its attribute
value and the group’s mean is still within the boundary of the maximum difference
allowed (in our algorithm this difference is called the significant level). The item,
however, can only be assigned to one group. The best group for the item is the
group whose mean is closest to the value of the item's attribute. Our clustering
algorithm is thus as follows:
TYPE item
    id              TYPE INTEGER
    attributeValue  TYPE FLOAT        /* one value per attribute */
TYPE population
    total           TYPE INTEGER
    noAttributes    TYPE INTEGER
    individual      TYPE item
TYPE group
    id              TYPE INTEGER
    member          TYPE population
    mean            TYPE FLOAT        /* one mean per attribute */
TYPE class
    totalIndividual TYPE INTEGER
    totalGroup      TYPE INTEGER
    member          TYPE group
MAXITERATION TYPE INTEGER

procedure cluster(input TYPE population, output TYPE class,
                  significantLevel TYPE FLOAT, iteration TYPE INTEGER)
begin procedure
    DECLARE /* local variables */
        i, j, k, numberAttribs, found, candidateGroup, changes TYPE INTEGER
        currentDifference TYPE FLOAT

    if iteration = MAXITERATION then return 0

    for i = 0 to output→totalIndividual - 1 do
        found = 9999              /* sentinel: item not yet located */
        candidateGroup = 9999     /* sentinel: no fitter group found */
        for j = 0 to output→totalGroup - 1 do
            /* count how many of the item's attributes fall within
               significantLevel of this group's means */
            numberAttribs = 0
            for k = 0 to input→noAttributes - 1 do
                currentDifference = | output→member[j].mean[k] -
                                      input→individual[i].attributeValue[k] |
                if currentDifference < significantLevel then
                    numberAttribs = numberAttribs + 1
                endif
            enddo
            /* the group is a candidate if all attributes are close enough */
            if numberAttribs = input→noAttributes and j ≠ found then
                candidateGroup = j
            endif
            /* locate the group the item currently belongs to, if any */
            if found = 9999 then
                for k = 0 to output→member[j].member.total - 1 do
                    if output→member[j].member.individual[k].id =
                       input→individual[i].id then
                        found = j
                    endif
                enddo
            endif
        enddo
        if found ≠ 9999 then      /* item already assigned to a group */
            if found ≠ candidateGroup and candidateGroup ≠ 9999 then
                /* a better group exists: delete the item from the old
                   group and add it to the new group */
                changes = changes + 1
            endif
        else                      /* first assignment of this item */
            if candidateGroup ≠ 9999 then
                /* add the item to the existing candidate group */
                changes = changes + 1
            else
                /* create a new group for this item */
                changes = changes + 1
            endif
        endif
    enddo
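For concreteness, the mean-based grouping can also be expressed as a short runnable sketch. This is our own simplification, not the thesis algorithm verbatim: it handles a single attribute per item and makes a single pass rather than iterating to MAXITERATION.

```python
def cluster(items, significant_level):
    """Assign each item (a single attribute value, for brevity) to the
    group whose running mean is closest, provided the distance is within
    significant_level; otherwise start a new group."""
    groups = []  # each group is a list of attribute values
    for x in items:
        best, best_dist = None, None
        for g in groups:
            mean = sum(g) / len(g)
            dist = abs(x - mean)
            # candidate group: close enough and closest seen so far
            if dist <= significant_level and (best is None or dist < best_dist):
                best, best_dist = g, dist
        if best is None:
            groups.append([x])   # no candidate: open a new group
        else:
            best.append(x)       # join the fittest group
    return groups
```

For example, link weights 0.1, 0.12, 0.5 and 0.52 with a significant level of 0.05 fall into two groups, one around 0.11 and one around 0.51.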
Applying the clustering algorithm in our information retrieval network
model, the “items” to be classified are the index terms within a document. The
attributes of the items are given by the link weights. The estimation of the
significant level can be derived from the standard deviation of the link weight
distribution in the document collection. For example, in the ADI collection, the
standard deviations of the distribution of the link weights of the individual
documents are in the range 0.08 to 1.0. Thus, the significant level should be
estimated within this range.
We suggest that adopting a clustering technique that recognises the
distribution of the link weights will increase the precision but not the recall of the
retrieval. The recall will be unchanged since no additional knowledge is
introduced into the network. We will present a comparison of the performance of
the two clustering approaches, random and non-random, in chapter 6. In chapter 7
we will present a method to evaluate the goodness of the clustering model, which
in turn can help us to determine the optimal clustering for our network.
5.4 Handling the Indirect Loops
The indirect loop exists in our Bayesian network model when evidence feedback is
implemented. Pearl's inference algorithm as used in our model will not work
properly in this situation. There are some existing approaches to handling the
indirect loops. These approaches perform some preprocessing to find and break
the loops before performing inference. We propose a new method for handling
the indirect loop. Our method is based on the idea that we can relax the
independence assumption in the network so that we can have a finite propagation
in the loop. This approach will suit the information retrieval application or indeed
any other large network applications because the proposed independence
assumption does not require much additional computation compared to the
preprocessing approaches.
There are three existing preprocessing approaches for handling cyclic
propagation or loops in Bayesian networks, namely clustering, conditioning and
stochastic simulation [Pearl88]. Clustering involves forming compound nodes in
such a way that the resulting network of clusters is singly connected. Conditioning
involves breaking the communication pathways along the loops by instantiating a
select group of nodes. Stochastic simulation involves assigning to each node a
definite value and having each processor inspect the current state of its neighbour,
compute the belief distribution of its host node, and select one value at random
from the computed distribution. Beliefs are then computed by recording the
percentage of times that each processor selects a given value.
Consider the small retrieval network in Figure 5-5 which serves to
illustrate these different approaches for handling the network loop. Note that this
network is similar to the network in figure 5-1. The only difference is that we have
instantiated document D3 as the user chose it as the relevant document during the
relevance feedback process.
Figure 5-5 Retrieval network with a loop.
A loop exists in the network when we use document nodes as evidence in
relevance feedback. In the example in figure 5-5, if we take D3 as evidence (thus
P(D3)=1), it will change the belief in nodes retrieval and node feedback. The
belief in the proposition in node retrieval will change the belief in node q and the
belief in document node D2. The new belief in D2 in turn will change the belief in
the index term nodes image, information and retrieval and this belief in turn will
change the belief of the ancestor nodes all the way up to the node q thus creating
an indirect cycle or loop. To allow Pearl’s algorithm to work properly, a method
to transform this multi connected network into a singly connected network is
required. In the following sections, we present the possible approaches to the
indirect loop propagation problem in Bayesian networks and discuss their
appropriateness for the implementation of information retrieval systems.
5.4.1 Clustering
The clustering approach involves collapsing nodes to transform the network from
a multi-connected network to a singly connected network. In our example in
figure 5-5, the obvious choice for the nodes to be collapsed are information and
retrieval. The modified network is now depicted in figure 5-6.
Figure 5-6 Clustered network.
The collapsing of the nodes information and retrieval into one node,
information-retrieval, forces us to estimate P(information, retrieval | q),
P(D2 | information, retrieval) and P(D3 | information, retrieval). In the medical
diagnosis field, where
this method was originally introduced, the estimation of any combination
proposition’s conditional probabilities was relatively easy to obtain because each
node in the medical diagnosis application represented a medical condition which
could be easily observed. The combination of two observations was usually
available from observations of past diagnoses. Thus, in the medical diagnostic
context, the estimation of (information,retrieval), (¬information,retrieval)³,
(information,¬retrieval), (¬information,¬retrieval) can be derived and used to
create a link matrix estimation of the effect of the collapsed node such as
P(D2|information,retrieval), P(D3|information,retrieval) or
P(information,retrieval|q). Such observations are not as straightforward in an
information retrieval network. In our model, the probability values in the
document nodes are used to rank the documents and the document nodes are the
nodes which exhibit the effect of knowing something about the beliefs in the index
term nodes. Since the inference process in information retrieval aims to find the
most relevant documents given a user query whereby a set of index terms are
considered, isolating and observing the effect of individual index terms or a group
of index terms is not desirable and certainly not a simple task.
Another problem that may occur in implementing clustering in information
retrieval is deciding which nodes are to be clustered. We know that any document
which shares two or more index terms with the query network will create a loop in
the network. An extreme choice is to clamp all these index terms into one
compound node, both in the query network and document network as in the
approaches of Cooper [Cooper84] and Peng and Reggia [Peng86] for the medical
diagnosis application. Unfortunately, the exponential cardinality and
structurelessness of the link matrix for these large compound nodes make the
inference difficult to compute.
³ ¬ symbolises negation
A popular method of clustering the nodes is the join tree [Lauritzen88]. If
the clusters are allowed to overlap each other until they cover all the links of the
original network, then the interdependencies between any two clusters are
mediated solely by the variables they share. If we insist that these clusters continue
to grow until their interdependencies form a tree structure, then Pearl’s tree
propagation algorithm can be used in the inference. This method of clustering
produces a better structure and less complex propositions in the clustered nodes
than Cooper's approach. However, implementing this approach in information
retrieval may be very costly, because the number of nodes in the network means
that the preprocessing involved in finding the cluster set will be time and
resource consuming. It is also worth noting that the retrieval
network is a dense network. That is, there is a high interconnectivity between the
index terms and the document nodes in the network. This characteristic means that
there is increased complexity inherent in the process of finding the cluster set. The
clustering method may be appropriate when the document collection is relatively
stable, that is when documents are not often deleted or added to the collection.
Because any addition or deletion of documents means changes will occur in the
network distribution, the clustered sets need to be regenerated in such an event.
5.4.2 Conditioning
Conditioning is based on our ability to change the connectivity of the network to
render it singly connected by instantiating a selected group of nodes [Dechter85].
We can condition the multi-connected network in figure 5-5 into a singly
connected network as depicted by figure 5-7 by cutting the loop in the network at
node q. The node q is called the loop-cutset node. Once we assign a node to be a
loop-cutset node, we can instantiate node q to block the propagation of the belief
in the path information-q-retrieval. By doing this, we will have a singly connected
network and Pearl's singly connected algorithm becomes applicable.
Figure 5-7 A singly connected network as the decomposition of the multi connected network.
If we want to recalculate the value of P(D2) given that the user chose D2
as the feedback evidence, we first need to assume that q=0 and then propagate its
value through the network until it reaches D2. Using the same network, we now
assume that q=1 and repeat the propagation process. Finally, we average the two
results weighted by the posterior probabilities P(q=1|D2=1) and P(q=0|D2=1).
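The final weighted-average step of conditioning can be sketched as follows; the probability values are invented for illustration, and in a real system each conditional belief would come from a propagation run:

```python
def condition_on_cutset(posterior, belief_given):
    """Conditioning: combine the beliefs computed under each
    instantiation of the loop-cutset node q, weighted by the
    posterior probability of that instantiation."""
    return sum(posterior[v] * belief_given[v] for v in posterior)

# hypothetical P(q=v | evidence) and the belief propagated under q=v
p_d2 = condition_on_cutset({0: 0.3, 1: 0.7}, {0: 0.2, 1: 0.6})
```

Each additional cutset node multiplies the number of instantiations to enumerate, which is the source of the combinatorial explosion discussed below.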
Conditioning provides a working solution in many Bayesian network
applications. Unlike clustering, however, if the network is highly connected or
dense it may suffer from combinatorial explosion [Pearl88].
The message size grows exponentially with the number of nodes required for
breaking up the loops in the network. Since during the inference we must
consider each possible combination of instantiated values of the loop-cutset
nodes, the number of loop-cutset instances is equal to the product of the numbers
of possible values of those nodes. This product is clearly exponential in the
number of loop-cutset nodes.
The information retrieval network suffers from this combinatorial
explosion because it is a dense network. It is possible to use a minimisation
algorithm to reduce the cutset; however, it has been shown that the minimisation
problem is NP-hard [Stillman91]. Thus conditioning in an information retrieval
network would be a very costly process.
5.4.3 Sampling and Simulation
Stochastic simulation is a method of computing probabilities by computing how
frequently events occur in a series of simulation runs. If a causal model of a
domain is available, the model can be used to generate random samples of
hypothetical scenarios that are likely to develop in the domain. The probability of
any event or combination of events can be computed by counting the percentage
of samples in which the event is true.
In general, the simulation methods are divided into two main categories,
namely Forward sampling [Bundy85, Henrion86, Shachter86,88] and Markov
simulation [Pearl87, Chavez90, Berzuini89]. The main difference between the two
approaches lies with the directionality of the propagation during the simulation.
Forward sampling, as the name implies, only involves propagation in the causal
direction of the network. The drawback of this method is that its
complexity is exponential in the number of observed or evidence nodes
[Henrion90, Hulme95]. Thus, forward sampling can only be practical if the
evidence nodes are at the root of the network.
Markov simulation (sometimes known as Gibbs sampling) on the other
hand, allows propagation in both directions. However, this method will have
convergence problems when the network contains links that are near deterministic,
that is close to 0 or 1 [Chin89].
In our information retrieval model, we have propagation in both directions.
The diagnostic or back propagation occurs when we need to infer P(ti) given
knowledge of P(dj) with an arc from ti to dj. Moreover, a loop exists in the
network when we apply evidence feedback and the evidence lies with the
document nodes, which are non-root nodes. Thus, forward sampling is not
appropriate for our information retrieval network because of these two problems:
the lack of support for backward propagation and the exponential complexity of
the algorithm for non-root evidence nodes.
Markov simulation (Gibbs sampling) on the other hand does not suffer
from the above two problems. To implement this method, we need samples of
propositions and their associated observation values. For information retrieval, we
can obtain this from the relevance judgment of a test collection. A test collection
contains sets of queries with associated documents that are judged to be relevant
to the queries. A number of simulations may then be run on a particular query and
the set of retrieved document observed. A score is kept for each time a particular
document in the relevance judgment set for the query is retrieved. With this
approach, we have to make one important assumption, namely that the ‘causal
model’ in the network represents the correct distribution of the document
collection and that it will generate a 100% level of recall and precision. However,
it has been shown that this level of performance is unachievable in information
retrieval models [Wallis95]. Even if we were content with the approximate model
and hence with accepting less than a 100% level of recall and precision, the size
of the network would make running the simulation too costly. Pearl [Pearl88]
showed that to get within 1% of the approximate value, we need over 100 runs. It
is accepted that the accuracy of the sampling depends on the number of runs
performed [Henrion90]. Thus, although many researchers have taken the sampling
approach towards handling multi connected networks [Henrion86, Pearl87,
Fung90b, Shachter90, Hulme95], this approach does not provide a practical
solution for information retrieval. We propose instead a method using intelligent
nodes to solve the problem of multi connected networks.
5.5 Dealing with a Loop Using Intelligent Nodes
We have investigated different approaches to handle loops in Bayesian networks.
However, all of them are computationally impractical for information retrieval
networks due to the network size and density. We propose a method involving
intelligent nodes. The aims of our proposed approach are as follows:
1. Providing a means to break the loop so that the propagation in the
network is finite.
2. Providing a means to break the loop without introducing additional
computational complexity to the inference process.
We use the term intelligent because in Pearl's inference algorithm the nodes are
memoryless, whereas in our approach the nodes do have some memory. The
memory is used to “remember” the source of a received message so that the next
time a message arrives from this source the node will reject it. In other words, the
intelligent nodes act as filters of messages in the network loop. They filter the
child messages of a node so that a message is blocked from updating the parent
node value of the original message. To illustrate the method, consider the network
shown in figure 5-8.
Figure 5-8 Network with intelligent node retrieval.
Consider that the retrieval producing the initial document ranking has been
performed. Thus, each node in the above network has a belief value attached to it
(see section 5.1 for the actual values). When document D3 is chosen as the
relevant document during relevance feedback, i.e. the node D3 is instantiated, this
node will send evidential support, or a child message, to both its parent nodes,
namely retrieval and feedback. With this new evidential support, the node
retrieval recalculates its belief in the proposition represented by the node. In
Pearl's algorithm this new belief would then be passed to its ancestors. Our
approach, on the other hand, stops the message from going to the parent(s) of node
retrieval. Note that we have produced the initial document ranking, so that the
belief at node retrieval is arrived at due to the instantiation of the proposition on
the node q. Therefore the degree of influence of the query on node retrieval has
been reflected in this node's belief value. If we send evidential support λretrieval to
node q, λretrieval will contain some value of πq. This means that the value of πq will
be amplified. This amplifying effect does not aid our understanding of the problem
and will cause the propagation to run indefinitely.
Consider the following reasoning process in a real life situation. In the
morning coffee break my colleague tells me that it is going to rain tomorrow. If I
tell her after lunch break that it is going to rain tomorrow because I happen to be
reading the weather column in a newspaper at lunch, her belief about the
proposition tomorrow it is going to rain should not increase because that
information came from me, a person who received the same information earlier
from her (the same source). Her initial information may have come from the same
article in the newspaper that I read at lunch. I may be considered to be acting as a
mirror of her information. My information does not introduce new knowledge to
her. The same principle may be applied to our loop problem in the information
retrieval network when we block the child message λretrieval from node q. The
message λretrieval would only amplify πq.
The independence assumption is slightly changed with this approach. We
actually relax the independence assumption to solve the loop problem. In the strict
d-separation or heuristic check, setting D3 as evidence will cause all the nodes
in the sample network to become dependent. We add to the checking procedure a
routine to find a filter node that will make some of the nodes in the network
independent and hence break the loop. A loop can easily be identified by checking
whether any node in the network has fan-in descendants and fan-out descendants.
If there is a node that meets this condition, a loop exists in the network and
needs to be broken using the filter node. The modified independence assumption
now becomes:
Given a node with descendants which have fan-in links and
ancestors which have fan-out links; if this node is the direct
parent of a node with fan-in links, the ancestors of this node are
independent when the direct child of this node is instantiated.
Using this independence assumption, the node retrieval causes this node and node
q to be independent when D3 is instantiated. In the implementation of this
independence assumption for information retrieval, we can safely assume that the
candidates for the filter nodes are the index term nodes in the document layer. The
filter nodes are the index terms that appear both in the query and in the relevant
document found in the relevance feedback. Note that this filtering does not apply
in the production of the initial document ranking because we have instantiated
node q and node q is not the direct child of either node information or node
retrieval which are the candidates for the filter node.
The modified independence assumption proposed is not significantly
different from Pearl's original independence assumption (see section 3.4.2), apart
from the fact that our proposed assumption includes knowledge of the
information source. However, our proposed assumption provides a method for
breaking the infinite propagation with very little computational cost. The
additional computational cost involved is only the storage cost of keeping the
knowledge of the information source, or the memory of the intelligent node. This
memory can be easily implemented as a boolean variable. Thus, for a system that
involves large network structures, such as an information retrieval system, this
assumption presents a workable solution to the problem of indirect loops.
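The filtering behaviour of such an intelligent node can be sketched in code. This is an illustrative sketch only, not the thesis implementation: the class and method names are invented, and a set of seen information sources is used as a direct generalisation of the single boolean memory described above.

```python
# Sketch of an intelligent node that remembers the information source of
# each message and filters repeats, so that propagation around an
# indirect loop terminates. Names (Node, propagate) are assumptions.

class Node:
    def __init__(self, name):
        self.name = name
        self.children = []
        self.seen_sources = set()  # memory of information sources

    def propagate(self, message, source):
        # Filter: a message from an already-seen source would only be
        # travelling around a loop, so it is dropped here.
        if source in self.seen_sources:
            return
        self.seen_sources.add(source)
        for child in self.children:
            child.propagate(message, source)
```

Connecting three such nodes in a cycle and calling propagate once shows that the message visits each node exactly once instead of circulating forever.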
5.5.1 Example of the Feedback Process Using
Intelligent Nodes
Assume that we assign node D3 as the evidence node used in the relevance
feedback process. We assign P(D3)=1 and P(retrieval=true | D3=true) = 0.8. The
initial belief in node retrieval is 0.4 as calculated in section 5.1. The new belief in
node retrieval is calculated as the combination of the effect of the new evidence
which arrives in node retrieval as λretrieval. λretrieval comes from two of its child
nodes, namely D2 and D3. With these values, the beliefs in the network nodes
become:
λretrieval = 0.4 + 0.8 = 1.2
P(retrieval) = 0.4 * 1.2 = 0.48
P(D2|q,D3)= 0.3(5*10-4)(1-0.7)(1-0.48) + 0.5(1-5*10-4)0.7(1-0.48) +
0.7(1-5*10-4)0.7*0.48 + 0.6(5*10-4)(1-0.7)(1-0.48) +
0.58(5*10-4)(1-0.7)0.48 + 0.65(5*10-4)0.7(1-0.48) +
0.71(5*10-4)0.7*0.48 = 0.4173
P(D1|q,D3) = P(D1|q) because document D1 does not share any common
index terms with document D3. If we have another document called D4 which
contains the index term feedback, then P(D4|q,D3) ≠ P(D4|q). The value of
P(D2|q,D3), as expected, is increased because it contains the index term retrieval
which is found in the relevant document D3.
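The arithmetic of this example can be checked directly. The sketch below simply reproduces the numbers above, using the additive combination of the λ messages as given in the text (the normalising constant is omitted, as in the example):

```python
# Re-checking the feedback example: lambda messages arriving at node
# retrieval from its children D2 and D3, combined as in the text.
prior_retrieval = 0.4    # initial belief from section 5.1
lam_from_D2 = 0.4        # lambda message contributed via child D2
lam_from_D3 = 0.8        # P(retrieval=true | D3=true), D3 instantiated

lam_retrieval = lam_from_D2 + lam_from_D3   # 1.2
belief = prior_retrieval * lam_retrieval    # 0.48
print(round(lam_retrieval, 2), round(belief, 2))
```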
5.6 Summary
We have presented issues and changes to the basic network model which need to
be considered when implementing Bayesian networks for information retrieval
systems. The main issues have been shown to be the complexity of the
computation and indirect loop propagation during the relevance feedback process.
The complexity of computation arises from the large number of parents per node,
which causes an explosion in the size of the link matrices.
An information retrieval network can be considered a dense network
whereby a large number of nodes share the same parent nodes. The fact that the
network is dense precludes some of the existing approaches, such as layer reduction
and link-node deletion, from being of practical use in information retrieval
implementations. We have proposed a new method involving the addition of a
virtual layer in order to reduce the size of the link matrices. Although the total
number of nodes in the network is increased, this approach provides a systematic
method for reducing the size of the link matrices in order to meet the computing
resources available. In the virtual layer approach, the parent nodes are grouped
into a number of clusters. Each cluster is then connected to a virtual node. This
virtual node is in turn connected to the child node.
There are different ways of grouping the parent nodes. We introduce two
simple methods, namely random and non-random clustering. The random
clustering approach does not take into consideration the distribution of the link
weights. The assignment of a node to a group is determined arbitrarily by the
sequence of the link weights in the data file. The non-random clustering scheme, on the other
hand, considers the link weight distribution and classifies the nodes accordingly.
We will also present in chapter 7 a method which can be used to measure the
goodness of the clustering methods in order to find the optimal approximation of
the model.
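The two grouping schemes can be sketched as follows; the function names and the dictionary representation of parent nodes and link weights are assumptions for illustration, not the thesis code.

```python
# Sketch of parent clustering for the virtual layer. Each returned group
# of parents is attached to one virtual node, which in turn connects to
# the child node, so each link matrix shrinks from 2^n to 2^size rows.

def random_clusters(links, size):
    """Random scheme: groups follow the link-weight sequence in the data file."""
    nodes = list(links)
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]

def weight_clusters(links, size):
    """Non-random scheme: sort parents by link weight first, then group."""
    nodes = sorted(links, key=links.get, reverse=True)
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]
```

With the weight-based scheme, parents with similar link weights end up under the same virtual node, which is the property the non-random clustering exploits.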
Another issue in the implementation of the Bayesian network model for
information retrieval discussed in this chapter is the indirect loop problem. The
indirect loop exists in our network when we want to implement evidence
feedback. We have proposed a solution involving the use of intelligent nodes
which act as message filters in the network and break the loops in the network.
The intelligent nodes are part of the original network but, under our
independence assumption, differ slightly in that they remember the information
source. By knowing the information source, these nodes can filter messages
better than under Pearl’s independence assumption and guarantee finite
propagation around the loop.
In the next chapter, we will measure the performance of our retrieval
model using three test collections, namely ADI, MEDLINE and CACM. The
performance will be reported in terms of recall and precision, a common
performance measurement unit in information retrieval research. Firstly, we will
look into the influence of different weightings applied to the link weights. Detailed
discussion of ways of estimating the link weights in both query and document
networks are presented. Secondly, we will present a performance comparison
between the two clustering methods discussed in this chapter. In the last part of
the next chapter, we compare the performance of our model with other
information retrieval models to show that our model not only provides a more
general model for information retrieval but also exhibits higher recall and
precision.
Chapter 6
Model Performance Evaluation
6.1 Introduction
Information retrieval systems provide us with the ability to locate and retrieve
useful documents from a large collection of documents. As users, we would
expect these systems to perform retrieval tasks as rapidly and economically as
possible. Beyond this requirement, the value of information retrieval systems
can also be seen to depend on their [Salton83]:
• ability to identify useful information accurately and quickly.
• ability to reject non-relevant documents.
• versatility of the retrieval methods.
We have shown in chapter 5 that the proposed Bayesian model fulfills the last
requirement since different retrieval models can be simulated using appropriate
network representations. In this chapter, we present evaluation results measuring
our system’s performance against the first two requirements above. The
conventional measures of recall and precision will be used to study the
performance of the system.
The recall level measures the ability of the systems to find all the useful
or relevant documents for a given query. The precision level measures the rate of
rejecting non-relevant documents and of finding the relevant ones before the non-
relevant documents are retrieved. A perfect information retrieval system is one
which achieves a 100% level of recall and precision. This is achieved by retrieving
all the relevant documents before retrieving any non-relevant documents for a
given query. This is not easy to achieve because most practical retrieval systems
retrieve some non-relevant documents before all the relevant documents are
retrieved or, in other words, the level of precision usually decreases as the recall
level increases. In fact, it has been shown that without relevance feedback, most
current information retrieval systems can only achieve a maximum of 80%
precision at 100% recall [Rijsbergen92].
Improvements in the performance of information retrieval systems may seem
very small in terms of absolute percentages. However, a small percentage
makes a substantial difference when we consider the massive number of
documents involved in the retrieval process. Moreover, increases in
precision also become more difficult to achieve near the optimum level, as noted
by Rijsbergen [Rijsbergen92].
In information retrieval experiments, the recall and precision levels are
obtained by performing several retrievals on the test collection using the supplied
queries. A test collection in information retrieval experiments comprises:
• A set of documents – current test collections generally contain
information from the original document such as title, author, date and
an abstract. The collection may include additional information such as
controlled vocabulary terms, author-assigned descriptor and citation
information. The documents used in the collection are usually taken
from journals and/or newspapers.
• A set of queries – These queries are often taken from actual queries
submitted by users. They may be either expressed in natural language
or in some formal query language such as boolean expressions.
• A set of relevance judgements – For each query in the query sets,
normally a set of relevant documents is identified. This identification
process can be done manually by human experts or by statistically pooling
the retrieval output of several information retrieval systems.
Each of these query-document sets in the test collection is used during
experiments. The interaction of these sets in an information retrieval experiment
is depicted by figure 6-1.
[Figure: standard queries from the test collection are run through the retrieval model; the resulting document ranking is compared with the relevance judgements to compute the recall and precision levels.]
Figure 6-1 Model for experiments in information retrieval systems.
Using the standard queries in a test collection, a retrieval system under evaluation
performs a document search in the documents set. The result of the search is a list
of document identifications whereby the document assumed most relevant is
ranked first. This list of rankings is then compared with the list of relevance
judgments. The relevance judgment list itself does not imply any ranking; it only
contains the identification numbers of documents judged relevant to the
query. Using the recall and precision formulae (see section 6.2), the recall and
precision levels are calculated.
There are several existing standard test collections available for
comparing the performance of information retrieval systems. These collections
vary in collection size, the number of queries, the structure of information and
domain of the information. We used three popular and well-studied test
collections to evaluate the performance of our system. These were ADI1,
MEDLINE and CACM respectively2. The characteristics of these collections are
shown in table 6-1.
                            ADI        MEDLINE    CACM
Information domain          Computing  Medical    Computing
No. documents               82         1033       3204
No. index terms             2086       52,831     74,391
Ave. no. index terms/doc    25.451     51.145     23.218
St. dev. index terms/doc    8.282      22.547     19.903
No. queries                 35         30         64
Ave. no. query terms        9.967      9.967      10.577
Size in kilobytes           2,188      1,091      37,158
Table 6-1 Test collection characteristics.
The ADI collection is the smallest test collection. It contains articles from
computing journals. This collection is usually used only in the initial
experimental stage because of its limited size. The MEDLINE test collection was
created from medical articles in the MEDLINE database. The queries in this
collection were obtained from the queries submitted by actual users of the
MEDLINE database. The CACM test collection was created from articles published
in the Communications of the ACM from 1958 to 1979. Each record in this
collection contains author, title, abstract, citation information, manually assigned
keywords and Computing Review categories. The CACM collection is the largest
test collection amongst the traditional test collections.
1 The full document collection of the ADI is given in appendix A. The full queries are given in appendix B.
2 The collections can be obtained from the anonymous-ftp site ftp.cornell.cs.edu.
The nature of the test collection influences to some degree the result of
experiments in information retrieval research. More specifically, the query and the
relevance judgment sets are the two main influences on the experimental results.
Experiments presented at the 5th Text REtrieval Conference [Voorhees96] showed
that retrieval using long, more specific queries produces better recall and
precision levels than retrieval using short queries.3 Compared with the test
collection used in TREC-5, most of the queries in the traditional test collection
such as ADI, MEDLINE, and CACM are considered to be short. Thus the
maximum level of precision with 100% recall will be expected to be less than
100%.
We have started this chapter by looking at the methodology involved in
conducting experiments in information retrieval. The rest of this chapter is
organised as follows. Section 6.2 reviews in detail how one part of the test
collection, namely the relevance judgment set, can influence the outcome of the
experiments. We will discuss how the relevance judgment sets are created and the effect
these different creation methods have on information retrieval experiments. In
this section, we will also provide examples which show how to calculate the
recall and precision levels using the retrieved document ranking produced by the
system and the relevance judgement from the test collection.
3 On average, the short queries consist of fewer than 20 index terms and the long queries contain 100 index terms.
Section 6.3 presents the performance of our basic model. We use the term
basic model to refer to a retrieval model that does not use any weighting scheme.
We will use this basic model to compare and discuss the performance of different
approaches to probability estimation in section 6.4. The effects of assigning
different probability estimates to the document link weights, query link weights and
virtual link weights are discussed there. Finally, we will compare the
performance of our model with other existing retrieval models namely the vector
space model [Salton83] and Turtle’s inference network [Turtle90].
6.2 The Relevance Judgement Set
The most difficult task in creating a test collection is the creation of the relevance
judgement set. In the current test collections, these relevance judgments are
created using one of two methods:
• Human judgements
In this approach, the relevance judgment sets are created by humans
who judge whether a document is relevant to the query. They may be the
actual users who submitted the queries or independent experts in the
collection domain. This method, however, is only practical for small
collections, especially when independent domain experts are used,
because the experts have to inspect every document in the collection
in order to determine its relevance to the query. The
relevance judgments in the three test collections used in our
experiments are created using this method.
• Pooling methods
In this method, the output of a number of different information
retrieval systems is pooled: the first N documents in each ranked
output are combined using some statistical method. This method has
been claimed by some to find the vast majority of relevant documents
[Salton83]. However, Wallis [Wallis95] argues the opposite. As in
other pooling applications, the number of pool participants affects the
accuracy of the relevance judgment pool: the higher the number of
participants, the higher the chance of finding the relevant
documents.4 Despite this issue, the pooling method is the only
practical way to derive the relevance judgment set when the collection
is very large, such as the Wall Street Journal collection (250Mb).
Performing domain expert judgement is too expensive in such
collections, since it is not possible for experts to inspect every single
document in the collection.
Regardless of the limitations of the methods for creating the relevance judgment
sets, test collections remain the most widely used tool for comparing retrieval
performance. The test collections are used to generate the recall and precision
figures, which are the units of comparison in information retrieval experiments. The recall and
precision can be calculated using the following formulae:
recall = r / R                                                  (6.1)

precision = r / N                                               (6.2)

where
r is the number of relevant documents retrieved for a given query,
R is the number of relevant documents in the collection for a given query,
N is the number of documents retrieved for a given query.

4 The relevance judgement of the Wall Street Journal collection has been improved over the years by the participants of TREC; the first few versions of the relevance judgement sets for this collection may thus suffer from the low number of systems used in the pool [Voorhees96].
To illustrate the use of the above formulae, consider the following example of a
set of ranked retrieved document numbers and a set of document numbers judged
relevant in a given query.
Retrieved: 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
Relevant: 1,2,4,5,6,8,11,14,17,20
Using equations 6.1 and 6.2 respectively, the recall and precision for the above
retrieved set are:
                               Recall (%)   Precision (%)
first 5 documents retrieved    40           80
first 10 documents retrieved   60           60
first 15 documents retrieved   80           53.3
first 20 documents retrieved   100          50
Table 6-2 Examples of recall and precision for different numbers of inspected documents.
We can see from table 6-2 that as the recall level increases the precision level
decreases. Thus, the aim of achieving 100% recall and 100% precision may be
considered an unachievable goal [Rijsbergen92].
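The worked example above can be reproduced with a short routine applying equations 6.1 and 6.2 at each inspection point (the function name and data layout are illustrative, not part of the thesis):

```python
# Recall and precision (equations 6.1, 6.2) at fixed inspection points,
# reproducing table 6-2 for the example ranking above.

def recall_precision(retrieved, relevant, cutoff):
    r = sum(1 for d in retrieved[:cutoff] if d in relevant)
    return r / len(relevant), r / cutoff   # (recall, precision)

retrieved = list(range(1, 21))
relevant = {1, 2, 4, 5, 6, 8, 11, 14, 17, 20}
for n in (5, 10, 15, 20):
    rec, prec = recall_precision(retrieved, relevant, n)
    print(n, round(100 * rec, 1), round(100 * prec, 1))
```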
The inspection points, i.e. the number of documents retrieved at a given
reporting point, may vary from experiment to experiment. This depends on
the size of the collection and the rate of increase in recall or of decrease in
precision. If the rate of increase in recall or of decrease in precision is very high,
a smaller interval may be needed. However, if the rate is low and the collection is
large, a bigger interval may be sufficient for us to report the performance of our
experiments without losing detail in the trends in the recall and precision level. In
the above example, we have used an interval of 5 documents as the inspection
point for calculating the recall and precision levels.
There is another way of reporting the recall and precision level. This
approach reports the value of the precision at a given recall level. In this
approach, the size of the collection does not come into consideration when
determining the interval. The only consideration is how much detail the
experimenter wants in reporting the relation between recall and precision. This
approach provides more useful information than the previous approach because it
shows clearly the relationship between recall and precision at a given point and
the trend in the recall and precision levels over the whole experiment. Using this
approach, table 6-3 reports the recall and precision levels for our
previous example. We will use this approach, reporting precision at given
recall levels, for our experimental results throughout this
chapter.
Recall level (%)   Precision (%)
10                 100.00
20                 100.00
30                 75.00
40                 80.00
50                 83.33
60                 75.00
70                 63.63
80                 57.14
90                 52.94
100                50.00
Table 6-3 Example of measuring precision at a given recall level.
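Table 6-3 can likewise be reproduced by recording the precision at the rank position where each successive relevant document appears (a sketch; the function name is illustrative):

```python
# Precision at each recall level: whenever the next relevant document is
# found in the ranking, record the recall level reached and the
# precision at that rank, as in table 6-3. The integer recall percentage
# assumes, as here, a relevant set whose size divides 100 evenly.

def precision_at_recall_levels(retrieved, relevant):
    levels, found = [], 0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            found += 1
            recall_pct = 100 * found // len(relevant)
            levels.append((recall_pct, 100 * found / rank))
    return levels

retrieved = list(range(1, 21))
relevant = {1, 2, 4, 5, 6, 8, 11, 14, 17, 20}
for level, prec in precision_at_recall_levels(retrieved, relevant):
    print(level, round(prec, 2))
```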
We have mentioned that we performed the experiments against three test
collections. For simplicity, in the majority of the discussion in this chapter we
will use only the ADI collection when discussing the effect of different
probability estimates on the recall and precision of the system. The results of the
MEDLINE and CACM experiments will be presented at the end, once the
optimum model has been established. Unless mentioned specifically, tables of
recall and precision in this chapter will be for the ADI test collection. In the next
section, we will examine the performance of our basic model.
6.3 Performance of the Basic Model
In the basic model, the value of link weights between the query node and the
query term nodes or P(ti|Q=true) is calculated as term frequency within the query
(qf). The link weights between the document nodes and the index term nodes, or
P(dj|ti=true), are calculated as the term’s frequency within the document (tf, equation
2.3.1). In what follows, we discuss the individual components of these estimates
independently, although in fact they are dependent. As a result, conclusions about
the performance of one component cannot be based on a single observation. We
will use the values of recall and precision in this basic model (table 6-4) as the
baseline performance to show the effect of varying probability estimation for the
link weights. The results were obtained from performing retrieval based on the
basic model for all queries in the ADI collection.
The precision at given recall levels for the basic system is very low, as we
expected, since this system does not confer any measure of importance on index
terms in the documents and collection. The tf and qf provide only a local measure
of importance within a document or the query. As a result, long documents will
be more likely to be ranked higher since the chance that a term will occur
frequently increases in longer documents.
Recall level (%)   Precision (%)
10                 21.79
20                 21.50
30                 15.61
40                 13.79
50                 12.59
60                 12.59
70                 12.45
80                 11.95
90                 11.60
100                11.60
Average            13.22
Table 6-4 Recall and precision of basic model.
The highest precision for this model is only 21.79% for a recall level of
10%. This result agrees with previous experimental results of various information
retrieval systems [Sparck-Jones72, Salton83, Turtle90]. The performance of this
basic model can be further improved by adopting good estimations for the
probability parameters of the model. In the next section we present the estimates of
those probability parameters.
6.4 Estimating the Probabilities
The basic system provides very simple probability estimates for the
links in the network and produces poor experimental results. We investigated
several methods of estimation. The subsequent discussion of these
probability estimates is divided into three sections, namely:
1. estimates of the importance of the query terms in explaining the
information needs of the user or P(ti|Q=true) (section 6.4.1).
2. estimates of the dependence of the documents upon the index terms in
the collection or P(dj|ti=true) (section 6.4.2).
3. estimates of the virtual layers’ distribution (section 6.4.3).
These estimates represent the link weights in the network; thus correct
estimates will lead to good retrieval performance of the model.
There are two networks used in the model, the query and document networks,
and each of them may take different estimates. We will state clearly the
parameters estimated in one network when discussing the other network’s
parameters, because the combination of the two networks’ parameters influences the
choice of parameters in each individual network.
6.4.1 Estimating P(ti|Q=true)
A user’s information need, which is represented by node Q, can be submitted to the
system using either Boolean expressions or natural language. With the natural language
approach, the query submitted to the system is indexed using a process similar to
that of indexing documents. All the words in the query that generally do not
affect the retrieval performance are removed using the stop word list (see section
2.4.1). The remaining words are then stemmed to remove common endings in
order to reduce simple spelling variations to a single form. The stemmed words
are then weighted according to their importance to the user. The weighting is used
to increase the influence of terms that are believed to be important on the
document ranking.
Two factors are commonly used in weighting the contribution of the
query terms; the frequency of a term in the query (qf) and the inverse document
frequency (idf) of a term in the collection. The assumptions made in this
approach are that:
1. a content-bearing term, which occurs frequently in the query, is more
likely to be important than the one that occurs infrequently.
2. those index terms that occur infrequently in the collection are more
likely to be important than frequent or common index terms.
Moreover, such index terms can be used as discriminators of the
document in the collection.
As we have discussed in section 4.3.2, the importance of query terms can be
estimated by the users if they have some confidence to do so. We would prefer
the user to be able to assign the importance of the query terms in their query.
However, as explained in section 2.2, sometimes users are not clear about their
information needs. Thus, they do not have the ability to estimate the importance
of the query terms and in this situation, the above qf and idf estimates can be used
as an alternative. We have tested these estimates individually as well as in
combination. Unlike the basic model, we normalised the qf estimates in order
to reduce the bias of the estimation towards long queries. The normalised qf (nqf)
of a term i in a given query j is calculated as:
nqf(i,j) = qf(i,j) / max qf(j)                                  (6.3)
where
qfi,j is the query term i's frequency within query j.
max qfj is the maximum frequency of any term in query j.
The second parameter, idf of term i in document k is calculated as:
idf(i,k) = log( N / df(i) )                                     (6.4)
where
dfi is the number of document containing term i
N is the number of documents in the collection
The combination of these two parameters may be derived from the product of
equations 6.3 and 6.4. In the rest of the discussion we will refer to it as the qf.idf
estimate. In the qf.idf estimate, the value of this parameter may be higher than 1 for
those query terms that occur infrequently in the collection. Thus, we need to further
normalise this parameter. One of the normalisation techniques is the cosine
normalisation method, introduced by the vector space model. The
nqf and idf values are considered as vectors. Equation 6.5 shows the normalisation
formula:
normalised qf.idf = (nqf × idf) / sqrt( Σ (nqf × idf)² )        (6.5)

where the sum runs over the terms in the query.
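A minimal sketch of these query weight estimates follows. The function names are illustrative, and the logarithm in equation 6.4 is taken as the natural logarithm, which is an assumption since the text does not fix the base.

```python
import math

# Query term weights of section 6.4.1: normalised query frequency
# (eq. 6.3), inverse document frequency (eq. 6.4), and the
# cosine-normalised qf.idf combination (eq. 6.5).

def nqf(qf):
    m = max(qf.values())
    return {t: f / m for t, f in qf.items()}

def idf(df, n_docs):
    # Log base is an assumption; the text writes only "log".
    return {t: math.log(n_docs / d) for t, d in df.items()}

def cosine_qf_idf(qf, df, n_docs):
    q, i = nqf(qf), idf(df, n_docs)
    w = {t: q[t] * i[t] for t in qf}
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {t: v / norm for t, v in w.items()}
```

For a query term occurring twice and found in 1 of 100 documents, this assigns a much larger weight than to a term occurring once and found in 10 documents, which is exactly the local-times-global behaviour discussed below.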
Table 6-5 shows the results of experiments using different term weights in
the query. We use the document network estimates of the basic model for these
experiments so that we can see the effect of the query’s parameter estimates.
Compared with the basic model, the performance of the model which uses qf
alone decreases. This drop in performance occurred at every recall level and
can be explained by the short nature of the queries. Since the queries involved in
the experiments are relatively short, achieving high accuracy in statistical
estimation using such limited data is difficult. This estimate may be considered
noise and, as a result, it reduces the performance.
Precision (%)
Recall (%)   Basic    qf weights   idf weights   qf.idf weights
10           21.79    21.18        34.62         58.19
20           21.50    20.11        34.62         56.86
30           15.61    13.80        29.05         52.38
40           13.79    12.45        18.97         44.57
50           12.59    11.68        18.94         40.47
60           12.59    11.62        18.14         37.85
70           12.45    11.36        15.62         31.11
80           11.95    10.32        15.62         23.65
90           11.60    9.09         15.62         23.04
100          11.60    9.09         15.62         20.25
Average      13.22    11.88        19.71         35.31
Table 6-5 Performance using different weights for query terms.
The implementation of the idf factor alone, on the other hand, increases
the performance significantly. The idf estimate is based on the statistical data
collected from the collection. The distribution of the index terms in the collection
can provide more accurate statistical estimates than the query because it derives
from a larger sample population. Moreover, the idf introduces a global
discriminator. An index term that occurs often in a query will not be a good
document discriminator when it occurs in most of the documents in the
collection. Index terms that occur less frequently in the collection are treated as
more important than those that occur more frequently.
The combination of the qf and idf factors further increases the performance
of the system beyond the idf weight alone. This combination of qf and idf produces
better results than the idf or qf used alone because the combination of both gives
local and global estimates of the parameters used. An index term that occurs
frequently in a query but does not occur frequently in the collection will be a
good discriminator of documents in the collection. Thus, instead of acting as
noise as in case of the pure qf weights, these qf weights work as intensifiers of the
statistical data provided by the idf weights.
6.4.2 Dependence of Documents on Index Terms
The probability that a term accurately describes the content of a document
can be estimated in several ways, but previous information retrieval research has
consistently shown that index term frequency in a document (tf) and inverse
document frequency (idf) are useful components of such estimates [Salton83].
Therefore, we will concentrate on estimating the link weights that involve tf and
idf.
6.4.2.1 Estimating the tf and idf Components
The tf estimate can be represented by the common ntf [Salton83,
Rijsbergen79] in which the tf of a term i in a given document j is given by
dividing the tfi,j by the maximum frequency of any term in the document as
shown in equation 6.6.
ntf(i,j) = tf(i,j) / max tf(j)                                  (6.6)
The formula is similar to the qf weighting scheme. The only difference is that it is
applied to a document instead of the query. The idf component can be estimated
using equation 6.4.
Table 6-6 shows the performance of the two estimates in the ADI
collection. The average performance of retrieval based solely on the tf
estimates of P(dj|ti=true) shows a 5.01% drop compared with
retrieval based solely on idf estimates. The difference in performance between
retrieval based on tf weights alone and idf weights alone is smaller than that
observed for the qf and idf estimates (table 6-5).
Precision (%)
Recall (%)   tf       idf
10           34.38    43.22
20           34.38    40.47
30           28.96    38.13
40           24.47    35.61
50           19.18    27.86
60           19.14    25.76
70           18.36    21.22
80           15.95    18.55
90           15.95    17.59
100          15.95    16.17
Average      22.33    27.34
Table 6-6 Performance of the retrieval using tf and idf components.
Again, this situation may be explained by the fact that documents contain more
index terms than queries, thus providing a larger sample population for the
estimates.
6.4.2.2 Estimating the Combination of tf and idf Components
The belief P(dj|ti=true) may be estimated by determining the default
belief, that is, the belief in the absence of any index terms that support or oppose the
proposition represented by the document nodes [Salton83, Rijsbergen79]. The
estimate is given by
P(dj|ti=true)=α + (1-α) × ntf × idf
Estimates for P(dj|ti=true) should lie in the range 0.5 to 1.0 and estimates
for the default belief should lie in the range 0.0 to 0.5. We investigated
several values of α in the range 0.5 to 1. The best performance was obtained when
α=0.5.
A large number of functions for combining and normalising the tf and idf
estimates were tested. Since we require the probabilities to lie in the range
[0,1], we need to normalise the combination of tf and idf because the combination
may produce values greater than 1. For example, consider the index term educat
in the ADI collection in document 14. This index term has the value of 0.8 for the
tf component when calculated using equation 6.6. The idf component of this
index term in the ADI collection is 12.13 when calculated using equation 6.4.
Thus without the normalisation the weight of index term educat will be greater
than 1.
There were two normalisation functions that we found performed best in
our experiments. The first estimation uses the cosine normalisation as shown in
equation 6.7.
P(dj|ti=true) = (0.5 + 0.5 × ntf(i,j) × idf(i)) / sqrt( Σ (0.5 + 0.5 × ntf(i,j) × idf(i))² )    (6.7)
This equation is slightly different from the cosine normalisation for the query
network (equation 6.5) in order to take into consideration the default belief of 0.5.
With this estimation method, the P(dj|ti=true) in the ADI collection are
estimated in the range of [0.03,0.468], the MEDLINE collection in the range of
[0.017,0.634] and the CACM in the range of [0.017,0.994]. These measures give
a broad range for the CACM and MEDLINE collections, but a considerably narrower
range for the ADI collection. We note that this difference influences the behaviour
of the system accordingly.
We also investigated a maximum normalisation function to produce
similar estimation ranges among the collections. In this function, the tf.idf value is
divided by the maximum tf.idf in the collection. The function is shown in
equation 6.8.
P(dj|ti=true) = 0.5 + 0.5 × ( ntf(i,j) × idf(i) / max(ntf × idf) )    (6.8)
Using this scheme, P(dj|ti=true) in the ADI collection now lies in the range [0.527, 1.0], in the MEDLINE collection in the range [0.503, 1.0] and in the CACM collection in the range [0.505, 1.0]. Compared with the cosine normalisation, this normalisation produces similar ranges for all three collections. Thus, the differences in characteristics among the collections during the experiments can be minimised.
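To make the two schemes concrete, the following Python sketch computes both estimates. It is illustrative only: the names raw_weight, tf and nidf are ours, and the per-document weight vector is assumed to be precomputed.

```python
import math

DEFAULT_BELIEF = 0.5  # the default belief (alpha) for P(dj|ti=true)

def raw_weight(tf, nidf):
    # Unnormalised combination 0.5 + 0.5 * tf * nidf; may exceed 1.
    return DEFAULT_BELIEF + DEFAULT_BELIEF * tf * nidf

def cosine_normalised(tf, nidf, document_weights):
    # Equation 6.7: divide the raw weight by the Euclidean norm of all
    # raw weights in the document, keeping the result within [0, 1].
    norm = math.sqrt(sum(w * w for w in document_weights))
    return raw_weight(tf, nidf) / norm

def maximum_normalised(tf, nidf, max_tf_nidf):
    # Equation 6.8: scale tf * nidf by the collection-wide maximum
    # before applying the default belief, giving values in [0.5, 1].
    return DEFAULT_BELIEF + DEFAULT_BELIEF * (tf * nidf / max_tf_nidf)
```

Note how the maximum scheme never drops below 0.5, which is consistent with the [0.503, 1.0]-style ranges reported above.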
Table 6-7 compares the performance of the normalisation functions on the
ADI collection.
             Precision (%)
Recall (%)   Cosine Normalisation   Maximum Normalisation
10           64.09                  63.40
20           63.80                  62.75
30           59.59                  57.52
40           54.91                  51.94
50           48.56                  45.71
60           47.35                  43.98
70           37.61                  36.01
80           28.87                  27.83
90           25.56                  26.49
100          23.37                  24.38
Average      45.37                  44.00
Table 6-7 Performance for two normalisation functions.
The average performance of the two normalisation functions differs by only 1.37%, with the cosine normalisation consistently providing higher precision. The figures in table 6-7 suggest that when only qf is used to estimate P(ti|Q=true), the choice between the cosine and maximum normalisation for estimating P(dj|ti=true) does not influence the performance significantly, although they provide different weight distribution ranges. However, when queries with cosine-normalised weighted terms are used (equation 6.7 applied to nqf), the effect of the different probability distributions in the collection, due to the choice of normalisation function for the document term weights, is significant.
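As a check, the averages in table 6-7 can be recomputed directly from the listed precision values:

```python
# Precision at 10%..100% recall from table 6-7 (ADI collection).
cosine = [64.09, 63.80, 59.59, 54.91, 48.56, 47.35, 37.61, 28.87, 25.56, 23.37]
maximum = [63.40, 62.75, 57.52, 51.94, 45.71, 43.98, 36.01, 27.83, 26.49, 24.38]

avg_cosine = sum(cosine) / len(cosine)           # ≈ 45.37
avg_maximum = sum(maximum) / len(maximum)        # ≈ 44.00
difference = round(avg_cosine - avg_maximum, 2)  # 1.37
```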
Table 6-8 shows the comparative performance when the dependence of documents on index terms is estimated using the cosine and maximum tf.idf normalisations, and the dependence of the query on index terms is estimated using cosine-normalised qf.idf.
             Precision (%)
             ADI                MEDLINE            CACM
Recall (%)   Cosine   Maximum   Cosine   Maximum   Cosine   Maximum
10           64.09    68.92     91.10    89.50     71.76    78.84
20           63.80    68.04     79.51    81.55     59.73    69.41
30           59.59    62.24     74.72    75.09     48.26    57.37
40           54.90    56.73     70.96    72.27     39.18    43.60
50           48.56    53.41     66.78    65.15     32.77    36.90
60           47.34    50.84     60.53    58.44     27.48    32.47
70           37.61    38.15     55.92    51.39     21.35    28.70
80           28.87    28.53     47.18    45.65     18.52    20.65
90           25.56    27.65     40.78    39.49     15.55    17.03
100          23.37    25.46     33.68    32.58     12.53    13.14
Average      45.37    48.00     62.11    61.11     34.71    39.81
Table 6-8 Performance using cosine and maximum normalisation in all collections.
The average precision for both the ADI and CACM experiments is higher for the maximum normalisation. The maximum normalisation produces 2.63% better average precision in the ADI collection and 5.1% better in the CACM collection. The MEDLINE experiments, on the other hand, show a 1.0% decrease in the average precision for the maximum normalisation compared with the cosine normalisation. This different behaviour in the MEDLINE collection may be explained by the fact that the lengths of the documents in the MEDLINE collection vary enormously (see the average number of index terms per document and its standard deviation for this collection in table 6-1). The average number of index terms per document is 51.1 and the standard deviation is 22.5. Thus, we can see that there are some documents which are very short. The maximum normalisation is slightly biased toward short documents.
The maximum normalisation method also produces a smaller rate of decrease in precision. Indeed, table 6-8 shows that the drop in precision is slower in the maximum normalisation columns for all three collections. For example, the precision drops 12.03% as the recall increases by 10% in the CACM experiments using the cosine normalisation. With the maximum normalisation the corresponding drop in precision is only 9.43%. Similarly in MEDLINE, the drop is greater for the cosine normalisation. In the ADI collection, although the drop is smaller (0.29% for the cosine normalisation), the fact that the highest precision is only 64.09% means that the maximum normalisation can be considered to perform better than the cosine normalisation.
The results of these experiments show that the maximum normalisation performs better overall. The cosine normalisation, although providing a slightly better average precision in the MEDLINE collection, still suffers from a rapid decrease in precision as recall increases. The cosine approach should thus be considered only for those applications that do not require high recall, such as interactive searching of library items. For applications that require high recall, such as the searching of patent records, the maximum approach is more appropriate. In the rest of the discussion in this chapter, we will thus adopt the maximum approach as our normalisation method of choice for estimating P(dj|ti=true).
6.4.3 Estimating the Virtual Layer Distribution
Section 5.2.3 introduced the concept of a virtual layer into the network in order to reduce the complexity of calculation during the inference process. A virtual layer consists of virtual nodes which act as summary nodes for a given group of index term nodes. Thus, it is important to be able to estimate the weights of the links that connect the virtual nodes to the child node of a given group of index term nodes.
There are two possible estimation methods for these link weights, namely the average and the maximum approach. The average approach takes the average value of the group's link weights as the weight of the virtual links. The maximum approach, on the other hand, takes the maximum value of the link weights in the group as the weight of the virtual links. As we predicted earlier in chapter 5, the maximum approach produces better results than the average approach. Table 6-9 shows the comparison of performance of the two approaches across the three collections.
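The two estimation rules can be stated in a few lines of Python (an illustrative sketch; the function name is ours):

```python
def virtual_link_weight(group_weights, method="maximum"):
    # Summarise the link weights of an index-term group as the weight of
    # the link from the group's virtual node to the document node.
    if method == "maximum":
        return max(group_weights)                   # maximum approach
    return sum(group_weights) / len(group_weights)  # average approach
```

For a group with weights [0.9, 0.2, 0.4], the maximum approach yields 0.9 and the average approach 0.5, illustrating how low-weight links pull down the summary in the averaging case.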
We expected that the average and the maximum approaches would have different ramifications for the accuracy of the summary estimation of the groups formed by the virtual layer approach (see section 5.2.3). However, the difference between the two approaches was not as marked as we expected. As discussed in section 5.2.3, we expected that assigning the average weight of the group to the links between virtual nodes and the document nodes would cause the low-weight links to pull down the importance of the high-weight links in the group. This situation occurs because we have assigned the nodes randomly to the groups and, as a result, similar virtual link weights may occur throughout the network.
             Precision (%)
             ADI              MEDLINE          CACM
Recall (%)   Max      Ave     Max      Ave     Max      Ave
10           68.92    68.21   89.51    89.70   78.85    76.56
20           68.04    67.26   81.55    81.15   69.41    68.49
30           62.24    61.58   75.09    75.27   57.37    58.20
40           56.72    56.48   72.27    72.20   43.60    44.03
50           53.41    52.54   65.15    65.19   36.90    36.89
60           50.84    49.92   58.44    58.08   32.47    32.58
70           38.15    37.30   51.40    51.11   28.70    28.66
80           28.53    27.75   45.65    45.61   20.66    20.41
90           27.65    26.88   39.49    39.43   17.03    16.80
100          25.46    24.72   32.58    32.27   13.14    12.83
Average      48.00    47.26   61.11    61.00   39.81    39.54
Table 6-9 Performance comparison using average and maximum estimation for virtual links.
Experimental results (table 6-9) show that the average and maximum approaches differ only slightly in their performance. In the ADI collection, the maximum approach is only 0.74% better than the averaging approach. The differences are much smaller in the MEDLINE and CACM collections, being 0.11% and 0.27% respectively. The rate of decrease in precision is also similar for the two approaches. Neither approach exhibits a retrieval bias toward either precision or recall. In this sense, both approaches may be considered of equal value.
We suggested a method of improving on the random grouping of the index term nodes for the virtual layers in section 5.3.3.1. Table 6-10 reports the comparative performance of the random and non-random clustering techniques. The non-random clustering method requires the estimation of a significance level. The standard deviation of the link weight distribution is a good estimate for this significance level. It gives us a better chance of dividing the index term nodes evenly into groups. Recall from the discussion in section 5.3.3 that the most optimised network is given by a symmetric network.
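The thesis defines the non-random grouping in section 5.3.3.1. Purely as an illustration of how a significance level n might drive such a grouping, the hypothetical sketch below sorts the links by weight and starts a new group whenever a weight differs from the first weight of the current group by more than n:

```python
def cluster_by_weight(link_weights, n):
    # link_weights: mapping from index term to link weight.
    # Returns groups of (term, weight) pairs; within each group the
    # weights differ from the group's first weight by at most n.
    groups, current = [], []
    for term, weight in sorted(link_weights.items(), key=lambda kv: kv[1]):
        if current and weight - current[0][1] > n:
            groups.append(current)
            current = []
        current.append((term, weight))
    if current:
        groups.append(current)
    return groups
```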
We calculated the standard deviation of the distribution of the link weights within each document that required classification in the ADI collection. Most of the standard deviations within the documents lie in the range 0.08 to 0.1. We tried several significance levels in this standard deviation range and performed the retrieval on the ADI collection. The results of these experiments are reported in table 6-10.
             Precision (%)
             Random cluster   Non-random cluster with significance level n
Recall (%)                    n=0.08   n=0.09   n=0.1
10           68.92            68.78    70.73    69.34
20           68.04            66.43    68.35    66.64
30           62.24            60.54    63.94    61.39
40           56.73            54.36    57.03    54.87
50           53.41            49.19    50.97    49.87
60           50.84            48.39    50.05    48.16
70           38.15            37.26    36.76    37.88
80           28.53            27.85    27.49    27.52
90           27.65            26.40    26.81    26.03
100          25.46            24.42    24.66    24.27
Average      47.997           46.362   47.679   46.597
Table 6-10 Performance of different clustering schemes.
The average precision of the random clustering method is slightly better than that of the non-random clustering method. The difference in average precision, however, is relatively small (0.32%) compared to the gain in precision at 10% to 40% recall. The results shown in table 6-10 agree with our hypothesis discussed in chapter 5, which stated that the non-random clustering method does not find new relevant documents; instead, it shifts the relevant documents higher in the ranked output. If the non-random clustering method were able to find relevant documents not found by the random clustering, the experimental results would show an increase in the average precision. Therefore, the choice between the two clustering methods depends on the objective of the retrieval system being built. If precision is very important, then the non-random clustering method is the choice, at the cost of more expensive preprocessing. On the other hand, the random clustering method is the choice when precision is less important, because it requires less computation during the clustering process.
We have presented experimental results for the different approaches to estimating the probability parameters of the model. A summary of the average precision obtained by the different estimations is shown in table 6-11.
From this table, we can conclude that the model performs best when the following probability parameters are used:
1. P(ti|Q=true), the weight of the link from the node Q to a query term node, is estimated using normalised qf.idf (equation 6.5).
2. The default belief for P(dj|ti=true) is α=0.5.
3. The tf component of P(dj|ti=true) is estimated using the normalised tf (equation 6.6).
4. The tf.idf combination of P(dj|ti=true) is best normalised using the maximum normalisation (equation 6.8).
5. The virtual link weights are estimated using the maximum probability values of the group (equation 5.2).
Estimation                                                    Maximum     Average
                                                              precision   precision
Query (none), document (none)                                 21.79       13.22
Query (qf), document (none)                                   21.18       11.88
Query (idf), document (none)                                  34.62       19.71
Query (normalised qf.idf), document (none)                    58.19       35.31
Query (normalised qf.idf), document (tf)                      34.38       22.33
Query (normalised qf.idf), document (idf)                     43.22       27.34
Query (qf), document (tf.idf with cosine normalisation)       64.09       45.37
Query (qf), document (tf.idf with maximum normalisation)      63.40       44.00
Query (normalised qf.idf), document (tf.idf with cosine
normalisation)                                                64.09       45.37
Query (normalised qf.idf), document (tf.idf with maximum
normalisation)                                                68.09       45.37
Virtual layer with maximum estimation                         68.92       48.00
Virtual layer with average estimation                         68.21       47.26
Virtual layer with non-random cluster method                  70.73       47.68
Table 6-11 Performance summary of different estimations in the ADI collection.
6.5 Performance Comparison with Existing
Models
Using the best estimations suggested in the previous section, we compared the performance of our model with two other well-known models, namely the vector space model [Salton83] and Turtle and Croft's model [Turtle90]. We chose these two models of information retrieval because they are both well-known and their experimental results are publicly available.
We can only compare Turtle and Croft's model [Turtle90] with our Bayesian network model for the CACM collection, because they did not report experimental results for either the ADI collection or the MEDLINE collection. We should also note that the accuracy of the reporting was different for Turtle and Croft's model: they report their experimental results to only one decimal place, and we will use them as they appear in their published results [Turtle90]. We compared our model with the vector space model for all three collections.
Most reporting of information retrieval experiments to date has concentrated on the average precision across different recall levels. The problem with this approach is that the comparison is biased towards precision-oriented systems [Wallis95]. As we have mentioned, not all applications of information retrieval are suited to these types of systems, for example patent office systems. For systems that require high recall, such as patent office systems, lower precision at the low recall levels is not necessarily as important as having higher precision at the high recall levels. A system that produces high precision at the high recall levels is better able to distinguish the relevant from the non-relevant documents retrieved than a system that produces lower precision at the high recall levels.
Precision at the high recall levels is very important because the actual number of documents retrieved is much greater at high recall levels. Therefore, a system with slightly higher precision will retrieve far fewer non-relevant documents than a system that produces high precision at low recall but low precision at its high recall levels. We will show that our Bayesian network model not only outperforms the vector space model and Turtle and Croft's model in terms of average precision but also, more importantly, in terms of precision at high recall levels.
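The arithmetic behind this point is simple: if a query has R relevant documents, then reaching recall r at precision p means retrieving r·R/p documents in total. A small sketch (the figures below are illustrative, not taken from the experiments):

```python
def documents_retrieved(num_relevant, recall, precision):
    # Total documents that must be retrieved to reach the given recall
    # at the given precision: (recall * R) / precision.
    return recall * num_relevant / precision

# With 100 relevant documents at 90% recall, a system with 17% precision
# forces the user to inspect fewer documents than one with 13% precision.
fewer = documents_retrieved(100, 0.9, 0.17)  # ≈ 529 documents
more = documents_retrieved(100, 0.9, 0.13)   # ≈ 692 documents
```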
6.5.1 Comparative Performance for the ADI
Our Bayesian network model outperforms the vector space model in the experiments on the ADI collection. Table 6-12 and figure 6-2 show the performance comparison between the vector space model and our model. On average, our model produces 0.62% better precision over the 10 recall levels. The improvement provided by our model is achieved at both ends of the recall range. The maximum improvement is achieved at 50% recall (2.01%) and the minimum improvement at 70% recall (0.16%).
At the low recall levels (10-20% recall), the precision of our model is between 0.96% and 1.66% better than that of the vector space model. The vector space model performs at almost the same level of precision over the middle recall levels (30-80% recall). Our model starts to outperform the vector space model again at the high recall levels, producing a 1.1% and 1.66% improvement at 90% and 100% recall respectively.
             Precision (%)
Recall (%)   Vector Space   Bayesian Network
10           67.26          68.92
20           67.26          68.22
30           62.61          62.30
40           57.85          56.78
50           51.46          53.47
60           49.98          50.71
70           37.99          38.15
80           29.28          28.53
90           26.52          27.65
100          23.80          25.46
Average      47.40          48.02
Table 6-12 Performance comparison with vector space for ADI collection.
The graph in figure 6-2 shows the comparative recall and precision levels for retrieval on the ADI collection. From this graph we can see clearly that the rate of decrease in precision is almost the same for the vector space model and our Bayesian network model, with the exception of the 90% and 100% recall levels. The curve representing our model is flatter at these two recall points; in other words, our model exhibits a smaller rate of decrease in precision as recall increases. Therefore, our Bayesian network model will clearly outperform the vector space model for high-recall-oriented systems.
The higher precision at the high recall levels achieved by our Bayesian network model is mainly due to the adoption of a graph that enables explicit representation of the connectivity among the index terms and the documents in the collection. It has been suggested that this connectivity will improve retrieval performance [Croft84, Croft87a], because the explicit representation allows documents that do not contain the query terms to be retrieved if they share many index terms with documents that do contain query terms. This index term and document connectivity is absent from conventional keyword-based matching models such as the vector space model, and this absence has contributed to their inferior performance.
[Figure: precision (%) versus recall (%) curves comparing the vector space and Bayesian network models on the ADI collection.]
Figure 6-2 Comparative performance for the ADI collection.
6.5.2 Comparative Performance for the MEDLINE
The experimental results on the MEDLINE collection show a similar behaviour to those on the ADI collection. Table 6-13 shows the experimental results. The average precision of the Bayesian network model for the experiments on the MEDLINE collection is 1.01% better than that of the vector space model. The maximum improvement is achieved at 100% recall (7.61%) and the minimum at 90% recall (1.69%). The vector space model shows good precision at the 10%, 30%, 50%, 60% and 70% recall levels. However, the precision produced by the vector space model decreases drastically at the two extreme recall levels compared with our model. For example, it drops 11.41% in precision when the recall increases from 10% to 20%, whereas our model's drop is 7.96%. Figure 6-3 clearly shows that our model produces a steadier decrease in precision than the vector space model.
The Bayesian network model's superiority is clearly shown by the precision at the high recall levels. For example, our model produces 7.61% better precision than the vector space model at the 100% recall level. This behaviour is similar to the behaviour of our model in the ADI experiments, but the difference in precision is much greater in the MEDLINE experiments: in the ADI experiments, the difference between the precision of our model and that of the vector space model at 100% recall is 1.66%.
             Precision (%)
Recall (%)   Vector Space   Bayesian Network
10           91.12          89.51
20           79.71          81.55
30           75.40          75.10
40           70.54          72.27
50           67.00          65.15
60           58.85          58.44
70           52.96          51.40
80           43.06          45.65
90           37.80          39.49
100          24.97          32.58
Average      60.10          61.11
Table 6-13 MEDLINE experimental results.
6.5.3 Comparative Performance for the CACM
The Bayesian network model behaves similarly on the CACM collection as on the ADI and MEDLINE collections. It outperforms both the vector space model and Turtle and Croft's model (see table 6-14 and figure 6-4). The average precision for the Bayesian network is 2.74% and 0.5% better than that of the vector space model and Turtle and Croft's model respectively. Our model is also superior to both the vector space model and Turtle and Croft's model in terms of precision at the low recall levels (10%-30%). Table 6-14 shows that our model produces precision at 10% recall that is 2.35% and 5.69% higher than that of Turtle and Croft's model and the vector space model respectively.
[Figure: precision (%) versus recall (%) curves comparing the vector space and Bayesian network models on the MEDLINE collection.]
Figure 6-3 Comparative performance for the MEDLINE collection.
             Precision (%)
Recall (%)   Bayesian Network   Vector Space   Turtle's Network
10           78.85              73.16          76.5
20           69.42              61.69          65.5
30           57.37              52.22          54.4
40           43.61              43.97          48.6
50           36.90              35.54          42.3
60           32.47              28.94          36.1
70           28.70              24.87          25.5
80           20.66              20.07          21.1
90           17.03              16.75          12.7
100          13.14              12.45          9.6
Average      39.70              36.96          39.2
Table 6-14 Experimental results for CACM collection.
In the middle recall range, Turtle and Croft's model shows better performance than our model. However, from the point of view of practical applications, higher precision at both ends of the recall spectrum is more desirable than higher precision in the middle recall range, for the following reasons:
1. High precision in the middle range does not provide a clear cut-off point in situations where the system needs to produce limited-recall output. The recall cut-off point is clearer in a model that concentrates the relevant documents at the top and bottom of the recall range.
2. High precision in the middle range combined with low precision at the high recall levels does not provide the best support for recall-oriented systems. This is because the number of documents that must be inspected in order to find relevant documents at the high recall levels is much greater than at the medium recall levels. Thus, considering the number of documents to be inspected, recall-oriented systems will benefit from high precision at the high recall levels.
The vector space model performs worse at almost every recall level than both Turtle and Croft's model and our model. The exceptions occur only at 90% and 100% recall, where it performs better than Turtle and Croft's model.
The CACM experiments demonstrate the superiority of our model over the vector space model and Turtle and Croft's model. The superiority of our model with respect to the vector space model is particularly clear: the Bayesian network outperforms the vector space model at every recall level. This shows that the addition of knowledge, through the use of the network model adopted by our approach, provides a better information retrieval model than the simple index term matching function adopted by the vector space model. The benefit of adopting a network model is clearly demonstrated by the fact that our model maintains comparatively higher precision at the high recall levels. The adoption of the network model provides a natural classification which allows a document that does not contain the query terms to be retrieved. Such a characteristic cannot be produced by keyword-based retrieval models such as the vector space model.
We claimed in chapter 5 that our model provides greater versatility (Salton's [Salton83] third requirement, introduced earlier in this chapter) than does Turtle and Croft's inference network. We showed in that chapter that our model is not only able to simulate other existing models of information retrieval, but is also able to support both evidence-alteration and dependency-alteration relevance feedback.
[Figure: precision (%) versus recall (%) curves comparing the vector space model, Turtle's inference network and the Bayesian network model on the CACM collection.]
Figure 6-4 Comparative performance for the CACM collection.
In this chapter, using the experimental results on the three test collections, we have also shown that our Bayesian network model is more effective in identifying useful information accurately and quickly (Salton's first requirement). In other words, our model exhibits better precision at most recall levels in the three collections. This ability is achieved by adopting the correct network semantics, in contrast to Turtle and Croft's model. Table 6-15 shows a summary of the performance improvement for the three collections.
             Performance improvement (%)
Collection   Maximum   Minimum   Average
ADI          2.01      0.16      0.62
MEDLINE      7.73      0.28      0.97
CACM         7.61      1.69      2.74
Table 6-15 Summary of performance improvement of the experiments.
Our Bayesian network model also meets the second requirement stated by Salton, namely the ease of rejecting extraneous documents, because our model produces a higher average precision for the three collections. Therefore, we can claim that our model provides a better and more versatile model for information retrieval systems than the two popular existing information retrieval models.
6.6 Summary
The Bayesian network model's experimental performance has been reported in this chapter. The experiments were run on three collections: ADI, MEDLINE and CACM. These three collections vary in size, in the distribution of the index term weights, and in the number of queries and their length.
Different probability estimation methods have also been tested and the results reported in this chapter. In general, the retrieval performance of the network is better when weighting schemes are used for the link weights in both the query and document networks. The best performance is achieved when P(ti|Q=true) in the query network is estimated by qf.idf with cosine normalisation and P(dj|ti=true) in the document network is estimated by tf.idf with maximum normalisation. The default belief of P(dj|ti=true) for the best performance is given by the value 0.5.
As the size of the Bayesian network for information retrieval is large, the introduction of virtual layers becomes necessary as an aid to reducing the network's complexity, in particular by reducing the size of the link matrix. In implementing the virtual layer solution, index terms are grouped and connected to a virtual node in the virtual layer. The link weights from the virtual layer to the document layer require estimation; these link weights have to be able to summarise the weight distribution of the group. We tested two approaches to this estimation task, namely taking the average and the maximum of the weights in the group. We found that the two approaches do not differ much in performance, with the maximum method producing slightly better performance than the average method.
Using the probability estimations that produce the best retrieval performance for the Bayesian network model, we compared its performance with the vector space model and the model of Turtle and Croft. The Bayesian network model shows much better precision than these two models, especially at the two recall extremes. This result supports our hypothesis that the introduction of the network as a knowledge base will increase retrieval performance compared with a purely index term matching method (as adopted by the vector space model). The fact that our network model outperforms Turtle and Croft's network shows that the direction of the inference and the assumptions about the causal relations between propositions affect the performance of retrieval in the network model. This is the case because the direction of the inference dictates the semantics of the model.
We have suggested the adoption of the virtual layer approximation to reduce the computational complexity of Bayesian networks. This approximation method involves the classification of parent nodes into smaller groups, which in turn reduces the size of the link matrix. In the next chapter we will present an evaluation model based on Minimum Message Length [Wallace68] to assess the goodness of a classification of parent nodes in the network. This model provides a useful and effective means of finding the optimum virtual layer model. With its help, we can eliminate extensive retrieval testing of different virtual layer models in order to find the optimum model.
Chapter 7 Measuring the Effectiveness of
Virtual Layer Model
7.1 Introduction
The Bayesian network model for information retrieval is inherently computationally complex and requires some optimisation in order to make the model practically useful. The primary cause of this high computational complexity lies in the size of the link matrices. We discussed several approaches that can be used to reduce the link matrix size in chapter 5, in particular the addition of a virtual layer.
In the virtual layer optimisation approach, the parent nodes are partitioned or classified into a number of groups. Each group is then attached to a virtual node, and this virtual node in turn is attached to a document node. The choice of the clustering method applied to the virtual layer influences the retrieval performance directly, as shown in the experimental results presented in chapter 6. The two clustering methods (namely random and non-random) introduced in section 5.3.3 produce different retrieval performance, with the random clustering method producing the highest average precision.
Different clustering techniques result in different Bayesian network structures. The optimal clustering within the network will lead to the most efficient inference. In this chapter we utilise a method by which we can measure the effectiveness of the classification or clustering, namely Minimum Message Length (MML). The objective of this method is to provide a means of measuring the complexity of modelling the virtual layer in a Bayesian network for information retrieval. We use this method to determine the effectiveness of a classification model for Bayesian network nodes without performing intensive retrieval testing. Section 7.2 presents the background theory of this measurement method, with emphasis given to modelling with real-valued parameters (the weights of the links are real values). Section 7.3 shows how the MML method can be used to measure the effectiveness of classification in our information retrieval application. We present an example of such a calculation, using one of the clustering methods generated by the experiments on the ADI collection, in section 7.4.
7.2 Minimum Message Length
The Minimum Message Length (MML) paradigm was introduced in [Wallace68]. In that paper, it is stated that a classification may be regarded as a method of representing more briefly the information contained in S×D attribute measurements, where S is a set of items and D is a set of attributes. These measurements contain a certain amount of information which, without classification, can be recorded directly as S lists of the D attribute values. If the items are classified, then the measurements can be recorded by listing the following:
1. The class to which each item belongs.
2. The characteristics of each class.
3. The deviation of each item from the characteristics of its parent class.
The best classification is suggested by the briefest recording of all the attribute
information. In MML, the recording of the attribute information is achieved by
regarding the attribute information as a message.
Consider that we have some measurements from the real world and a set of models, M = {m1, m2, …, mn}, that attempt to explain the measurements. MML assesses the effectiveness of each model mi ∈ M by calculating the length of the message needed to explain the measurements. In other words, since this method is based on information theory, it calculates the length of the message required by a receiver to reconstruct the information sent by the sender during communication. The communication between the sender and the receiver in MML comprises two parts, namely:
1. The message that describes the model. The model is usually the probability distribution of the data values.
2. The message that describes the data values. This message can be constructed by using a code dictionary. The code dictionary can easily be constructed from the probability distribution defined in the model part.
The best or optimum model is given by the model mi ∈ M that produces the shortest two-part message. There is a trade-off between the model complexity and the message length of the data values. A complex model requires a long message to describe the model but only a short message to describe the data values. On the other hand, a simple model leads to a short message for the model description but a long message for the data values. MML asserts that the model producing the shortest overall message, for both the model and data value parts, is the optimum model.
Following Shannon’s law, the message length may be assumed to be
proportional to minus the logarithm of the relative frequency of occurrence of
the event which it nominates. More specifically, considering the difference in the
nature of the attributes, the encoding is calculated as follows [Oliver94]:
1. Encoding an event that is equally likely to occur from N possible events
requires a code length of log₂ N bits.
2. Encoding an event which has the probability P requires −log₂ P bits.
3. Encoding a real value y sampled from a probability density f(y) with an
accuracy of measurement ε requires −log₂(ε·f(y)) bits.
Since we are dealing with real-valued parameters in the clustering of the link
weights, we will concentrate on the encoding and calculation of message length
for real-valued parameters in the next section.
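The three coding rules above can be sketched directly in Python (an illustrative sketch; the function names are ours, not part of the thesis):

```python
import math

def len_equally_likely(n):
    # Rule 1: an event drawn uniformly from n possibilities costs log2(n) bits.
    return math.log2(n)

def len_event(p):
    # Rule 2: an event of probability p costs -log2(p) bits.
    return -math.log2(p)

def len_real_value(f_y, eps):
    # Rule 3: a real value with density f(y), measured to accuracy eps,
    # costs -log2(eps * f(y)) bits.
    return -math.log2(eps * f_y)
```

For example, one of 8 equally likely events costs 3 bits, and an event of probability 0.25 costs 2 bits.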
7.2.1 Encoding Real Valued Parameters
Assessing a model of a real valued distribution requires the description of the
distribution using real valued parameters. Real valued parameters cannot be
described to infinite precision in a finite message. Thus constructing the code
dictionary for the parameter values involves some approximations.
One method of approximation involves the construction of a code
dictionary from a density function which is divided into cells whose width is
called the accuracy of the parameter value (AOPV). If the uniform density
function is used, the number of cells is given by (b − a)/AOPV, where the
parameter values lie in the range [a, b]. Thus, to specify the cell to which a
parameter belongs requires ⌈log₂((b − a)/AOPV)⌉ bits [Wallace68].
The message length, as stated, depends on the message length of the model
and of the data. The model and data lengths of the message are directly influenced
by the AOPV. The smaller the AOPV (i.e. the more accurately the parameter value
is specified), the shorter the message length for the data. However, in this case, the model’s
message length will be longer. The optimal message length is achieved when the
shortest combined length of the model and the data message is obtained. The
optimal message length can be approximated by calculating the expected message
length, which is given by the following formula [Oliver94]:

E(MessLen) = log₂(rangeµ/AOPVµ) + log₂(rangeσ/AOPVσ) + N log₂(√(2π)·σ/ε) + (N·s²/(2σ²))·log₂ e    (7.1)
where
µ is the mean used to code the data values xi (i = 1, …, N).
σ is the standard deviation used to code the xi.
s̄ is the unbiased sample standard deviation.
s is the sample standard deviation.
ε is the accuracy of the measurement of the data values xi.
N is the number of data values.
The first two terms in equation 7.1 represent the length of the message to describe
the model and the last two terms represent the length of the message to describe
the data. The optimal AOPVs are given by [Wallace68]:

AOPVµ = σ·√(12/N)    (7.2)

AOPVσ = s̄·√(6/(N − 1))    (7.3)
In MML, the value of the AOPV depends upon the data which the message is
describing. It is worth noting that the two-part message used in MML may seem
incomplete and that a three-part message (AOPV, model, data) is required.
However, Wallace and Freeman [Wallace87] showed that in many cases a three-part
code is not necessary, and hence we will use a two-part message calculation
in identifying the effectiveness of the virtual layer model used to optimise
our Bayesian network.
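As a concrete sketch, equations 7.1–7.3 can be implemented as follows (assuming, per the note above, that σ is set to the unbiased estimate s̄; the data part computed this way can differ by a fraction of a bit from the thesis’s tabulated values, so the figures below should be read as an illustration rather than an exact reproduction):

```python
import math

LOG2E = math.log2(math.e)

def aopv_mu(sigma, n):
    # eq. 7.2: optimal accuracy of parameter value for the mean
    return sigma * math.sqrt(12 / n)

def aopv_sigma(sigma, n):
    # eq. 7.3: optimal accuracy of parameter value for the standard deviation
    return sigma * math.sqrt(6 / (n - 1))

def expected_message_length(xs, eps=0.01, range_mu=1.0, range_sigma=1.0):
    """Two-part expected message length (eq. 7.1) for one cluster, in bits."""
    n = len(xs)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / n   # biased sample variance s^2
    sigma = math.sqrt(s2 * n / (n - 1))         # sigma <- unbiased estimate s-bar
    model = (math.log2(range_mu / aopv_mu(sigma, n))
             + math.log2(range_sigma / aopv_sigma(sigma, n)))
    data = (n * math.log2(math.sqrt(2 * math.pi) * sigma / eps)
            + n * s2 / (2 * sigma ** 2) * LOG2E)
    return model, data, model + data
```

Applied to the GROUP 1 link weights of figure 7-1 (section 7.4), this sketch reproduces the tabulated model length of about 9.54 bits and the AOPV values 0.042309 and 0.031732.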
7.3. Measuring Effectiveness of Virtual Layer
Model with MML
In the classification of index terms for the virtual layer approach to Bayesian
networks optimisation, the following assumptions are made:
1. The index terms are assigned independently to documents in the
collections.
2. The weights assigned to the links between the index term nodes and
document nodes are normally distributed within a document
collection.
3. The index terms cannot be assigned to more than one cluster or group,
i.e. the clusters are disjoint.
We will now consider how the MML method may be used to judge the
effectiveness of the clustering method employed in constructing the Bayesian
network with virtual layers. Some prior knowledge is communicated between the
sender and the receiver at the start of the transfer process, and is therefore not
included as part of the message. In our MML model of index term clustering for
the virtual layer, the prior knowledge consists of the following:
1. The total number of link weights to be classified.
2. The number of attributes per term, which is equal to 1 (i.e. the link
weight values).
3. The nature of the attribute distribution, which is continuous.
4. The range of the mean used to code a link weight xi (rangeµ).
5. The range of the standard deviation to code xi (rangeσ).
6. The accuracy of measurement ε.
7. The total number of groups in the classification.
Using the assumptions and the prior knowledge stated above, we can
calculate the complexity of the index term clusters as follows. Given n clusters
c1, c2, c3, …, cn (with respective link weights xi for a document dj), and given
that all the clusters are disjoint, the total expected message length over the n
clusters is calculated as:

E(MessLen_total) = E(MessLen_c1) + E(MessLen_c2) + … + E(MessLen_cn)    (7.4)
The expected message length for the individual clusters is calculated using
equation 7.1. The accuracies of parameter values (AOPVs) for the mean and the
standard deviation of each cluster are estimated using equations 7.2 and 7.3
respectively. The rangeµ, the rangeσ and the accuracy of measurement ε are
determined through prior knowledge, and are therefore the same for all the
clusters.
The best clustering method generates the shortest E(MessLen); in other
words, given two clustering methods C1 and C2, C1 is a more efficient clustering
method than C2 if E(MessLen_C1) < E(MessLen_C2).
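Sketched in Python, the comparison criterion of equation 7.4 amounts to summing the per-cluster lengths and picking the smaller total (the function names are illustrative, not from the thesis):

```python
def total_message_length(cluster_lengths):
    # eq. 7.4: the clusters are disjoint, so the total expected message
    # length is simply the sum over the individual clusters.
    return sum(cluster_lengths)

def more_efficient(lengths_c1, lengths_c2):
    # C1 is the more efficient clustering iff E(MessLen_C1) < E(MessLen_C2).
    if total_message_length(lengths_c1) < total_message_length(lengths_c2):
        return "C1"
    return "C2"
```

For instance, feeding in per-cluster totals such as those tabulated for doc-1 in section 7.4 sums the random method’s clusters to 389.79 bits and the non-random method’s to 271.71 bits, so the non-random method wins on total length.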
7.4 Illustration of MML Calculation for Index
Term Clusters
Consider the clusters produced by the random classification method (see
section 5.3.3) for a document (doc-1) in the ADI collection, as shown in figure
7-1. For this document, the random clustering method produces 8 clusters, with
each cluster having a different mean and standard deviation. The difference in the
value of the standard deviation will cause a difference in the value of the AOPV
(the accuracy of the parameter value). In turn, this difference in standard
deviation will influence the complexity of the model and the total message length.
GROUP 1 MEAN 0.651148 SD 0.0366408
MEMBER divid approach copy standard exper off conclud overdu draw
WEIGHT 0.634928 0.665769 0.610031 0.634928 0.58992 0.665769 0.696609 0.696609 0.665769
GROUP2 MEAN 0.6602816 SD 0.0897963
MEMBER trad comput inform system notic dissemin use techn control
WEIGHT 0.795458 0.549967 0.584713 0.787264 0.665769 0.610031 0.616889 0.728355 0.604088
GROUP 3 MEAN 0.6997416 SD 0.1075428
MEMBER actual docu produc data compatibl record evaluat receiv
WEIGHT 0.696609 0.563319 0.672094 0.664971 0.795458 0.904785 0.604088 0.696609
GROUP 4 MEAN 0.6422281 SD 0.0292658
MEMBER reversibl advant orient combin microfilm base year statist
WEIGHT 0.696609 0.665769 0.624999 0.647729 0.634928 0.610031 0.647729 0.610031
GROUP 5 MEAN 0.6843268 SD 0.1372901
MEMBER integr mechan hour machin format provid manual card
WEIGHT 0.634928 1 0.665769 0.586049 0.616889 0.750002 0.616889 0.604088
GROUP 6 MEAN 0.6814146 SD 0.0798737
MEMBER total effic libr gap tic output develop simpl
WEIGHT 0.665769 0.647729 0.859674 0.696609 0.696609 0.624999 0.594159 0.665769
GROUP 7 MEAN 0.7004235 SD 0.1341351
MEMBER sophist ibm access tool organ catalog prog discontinu
WEIGHT 0.647729 1 0.750002 0.696609 0.594159 0.647729 0.570551 0.696609
GROUP 8 MEAN 0.7068336 SD 0.1154598
MEMBER operat cent rely retrief featur process circl dsd
WEIGHT 0.733775 0.679837 0.696609 0.546787 0.696609 0.956714 0.647729 0.696609
Figure 7-1 Clusters for doc-1 in ADI collection using the random clustering method.
To calculate the expected message length of GROUP 1, the rangeµ can be
estimated as 1, since the possible values of the link weights in our network lie
between 0 and 1. The rangeσ is also taken as 1, and ε can be estimated as 0.01 as
this is our accuracy of measurement. Using these values and taking N = 9 (the
population of this cluster), we can calculate the AOPVs and E(MessLen) of
GROUP 1 as follows:

AOPVµ = 0.036641 × √(12/9) = 0.042309

AOPVσ = 0.036641 × √(6/(9 − 1)) = 0.031732

E(MessLen_GROUP1) = log₂(1/0.042309) + log₂(1/0.031732) + 9 log₂(√(2π) × 0.036641/0.01) + (9 × 0.034545²/(2 × 0.036641²)) log₂ e = 44.83
Note that the optimal value for σ is the unbiased estimate of the standard deviation
[Wallace68]. Following the above procedure, E(MessLen) for the remaining
clusters can be calculated. The E(MessLen) values for these groups are shown in
table 7-1.
GROUP   Model length   Data length   E(MessLen)
1       9.54           35.29         44.83
2       6.95           46.92         53.88
3       6.25           43.79         50.05
4       10.01          28.77         38.78
5       5.55           46.61         52.16
6       7.11           40.36         47.47
7       5.62           46.34         51.96
8       6.05           44.61         50.66
TOTAL   57.08          332.69        389.79

Table 7-1 Expected message length for doc-1 using the random clustering method.
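The GROUP 1 figures above can be checked with a short script (a verification sketch; the weight values are taken from figure 7-1 and σ is taken as the unbiased sample standard deviation, as the chapter notes):

```python
import math

# Link weights for GROUP 1 of doc-1 (figure 7-1)
weights = [0.634928, 0.665769, 0.610031, 0.634928, 0.58992,
           0.665769, 0.696609, 0.696609, 0.665769]

n = len(weights)
mean = sum(weights) / n
# Unbiased sample standard deviation (the SD reported in figure 7-1)
sbar = math.sqrt(sum((x - mean) ** 2 for x in weights) / (n - 1))
aopv_mu = sbar * math.sqrt(12 / n)           # eq. 7.2
aopv_sigma = sbar * math.sqrt(6 / (n - 1))   # eq. 7.3
model_len = math.log2(1 / aopv_mu) + math.log2(1 / aopv_sigma)

print(round(mean, 6), round(sbar, 7))            # ~0.651148, ~0.0366408
print(round(aopv_mu, 6), round(aopv_sigma, 6))   # ~0.042309, ~0.031732
print(round(model_len, 2))                       # model part ~9.54 bits
```

The recomputed mean, standard deviation, AOPVs, and model length all agree with the GROUP 1 entries of figure 7-1 and table 7-1.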
We now consider the output of another clustering method, shown in figure
7-2, which clusters link weights with similar values (i.e. values such that the
difference between the value and the group mean is less than some threshold
difference).
GROUP 1 MEAN 0.775636 SD 0.023639
MEMBER trad system compatibl provid access
WEIGHT 0.795458 0.787264 0.795458 0.750002 0.750002
GROUP 2 MEAN 0.686579 SD 0.019755
MEMBER discontinu rely featur dsd conclud overdu actual receiv reversibl
WEIGHT 0.696609 0.696609 0.696609 0.696609 0.696609 0.696609 0.696609 0.696609 0.696609
cent produc data advant hour total simpl approach off
0.679837 0.672094 0.664971 0.665769 0.665769 0.665769 0.665769 0.665769 0.665769
gap tic tool draw notic techn operat
0.696609 0.696609 0.696609 0.665769 0.665769 0.728355 0.733775
GROUP 3 MEAN 0.882229 SD 0.031898
MEMBER record libr
WEIGHT 0.904785 0.859674
GROUP 4 MEAN 0.985571 SD 0.024991
MEMBER mechan ibm process
WEIGHT 1 1 0.956714
GROUP 5 MEAN 0.611573 SD 0.028925
MEMBER retrief comput exper inform machin develop organ dissemin use
0.546787 0.549967 0.58992 0.584713 0.586049 0.594159 0.594159 0.610031 0.616889
control evaluat base statist format manual card copy docu
0.604088 0.604088 0.610031 0.610031 0.616889 0.616889 0.604088 0.610031 0.563319
program orient microfilm integr output divid standard effic sophist
0.570551 0.624999 0.634928 0.634928 0.624999 0.634928 0.634928 0.647729 0.647729
catalog circl combin year
0.647729 0.647729 0.647729 0.647729
Figure 7-2 Clusters for doc-1 in ADI collection using the non-random clustering method.
This method is similar to the non-random clustering method (discussed in
section 5.3.3.1) except that it does not further break down clusters whose
population is larger than the allowable number of parents per node (the limit).
We need to adopt this change in order to generalise the example and avoid
complications inherent in comparing non-hierarchical and hierarchical clustering
schemes. The message lengths for the clusters produced by this method are given
in table 7-2.
GROUP   Model   Data     E(MessLen)
1       9.88    16.44    26.32
2       12.85   75.73    88.59
3       7.36    7.44     14.80
4       8.85    10.11    18.96
5       12.07   123.04   123.04
TOTAL   51.01   232.76   271.71

Table 7-2 Expected message length for doc-1 using the non-random clustering method.
We note that the model complexity for individual clusters is higher on
average for the non-random approach than for the random approach. However,
because the data values (link weights) in individual clusters in doc-1 are similar
(by virtue of the clustering method itself), the data part of the total message is
much shorter under this method than under the random clustering method.
Moreover, the cost of having more groups in the random clustering method
overshadows the simplicity of its model. Therefore, the total message length
required to describe the clustering in the random method (389.79 bits) is longer
than that of the non-random method (271.71 bits).
Observation of the behaviour of the expected message length over all the
documents shows that the random clustering method always produces a longer
message than the non-random clustering method. However, it consistently
produces similar message lengths for the individual clusters within a document
(see Appendix C for the full list of expected message lengths of documents in the
ADI collection1).
The experimental results in chapter 6 (see figure 6-10) showed that the
random clustering method performs better in average precision by 0.32% but
performs relatively worse at the low recall levels (10% to 50% recall). Observing
the message lengths produced by the two methods, we can relate the performance
of the methods, in terms of recall and precision, to the nature of the virtual layer
models' message lengths in the following ways:
• A virtual layer model that produces similar expected message lengths
for the individual clusters in a layer will produce a higher average
precision than a virtual layer model that produces distinct expected
message lengths for the individual clusters.
• A virtual layer model that produces a shorter expected message length
will produce higher precision at the low recall levels, but not necessarily
a higher average precision.
Considering the relations stated previously, we can conclude that the
virtual layer model with the shortest expected message length may be optimal in
terms of computational complexity; however, it does not lead to a higher average
precision in an information retrieval context. To produce a high average
precision, the virtual layer model has to produce similar expected message
lengths for the individual clusters. This similarity in expected
1 The list in Appendix C shows that index terms in each document in ADI are classified into a number of groups. Each group produces an almost identical expected message length value under the random clustering method. On the other hand, the groups produce varied expected message length values under the non-random clustering method.
message length for the individual clusters is, in fact, a measure of the "symmetry"
of the model. This symmetry, as we suggested in chapter 5, produces the
optimum performance for the virtual layers created using the random clustering
method.
On the other hand, a virtual layer model that produces a shorter message
length, such as that of the non-random clustering method, will produce higher
precision at the low recall levels. Hence, it still has some benefit when used in
precision-oriented systems.
7.5 Summary
In this chapter, we have used a model based on Minimal Message Length (MML)
to evaluate the effectiveness of the virtual layer model in finding the optimum
Bayesian network. It is important to clarify the meaning of "optimum" according
to the objective of the system. In an information retrieval context, the optimum
model may be considered either as the model that produces the highest average
precision or as the model that produces the most efficient computation. The
MML model suggests that computational complexity is optimised by the
clustering method that produces the shortest expected message length, and that
average precision is optimised by the model that produces similar expected
message lengths for the individual clusters in the virtual layer. The virtual layer
model that produces the shortest expected message length may not be optimal in
terms of average precision but will be optimal in terms of computational
complexity, and vice versa.
We have shown that for the ADI collection, according to this evaluation
method, the random clustering method provides a more optimal clustering than
the non-random method in terms of average precision, because the random
clusters in the virtual layers produce similar expected message lengths for the
individual clusters. The results of the evaluation agree with the experimental
results presented in chapter 6: the random clustering method produces 0.32%
higher average precision than the non-random clustering method. The non-random
clustering method, however, is still beneficial for precision-oriented
systems because it produces higher precision at low recall levels and is optimal in
terms of computational complexity.
Some consideration also has to be given to the preprocessing required to
generate the clusters. The preprocessing required in the non-random clustering
method is more costly in terms of computation than that of the random clustering
method. This is counterbalanced by the fact that the preprocessing only occurs
once, at the time of building the document collection, and thus its high
computational cost will be offset by more efficient inference during the retrieval
process. In terms of recall and precision, the choice of clustering is still based on
the overall objective of the information retrieval system. In this respect, recall-oriented
systems will benefit more from the random clustering method and
precision-oriented systems will benefit more from the non-random clustering
method.
Chapter 8
Conclusion and Future Research
8.1 Conclusion
In this thesis, we have described a new formal information retrieval
model. Unlike other models of information retrieval, the proposed model has a
strong mathematical foundation for handling uncertainty because it is based on
Bayesian networks, a well-known artificial intelligence method for handling
uncertainty.
The proposed model consists of two separate networks, namely query and
document networks. The use of network representations in the model confers the
following benefits:
• It subsumes other existing retrieval models through its capacity to
simulate those models using an appropriate network representation so that
the choice of the specific model can be taken during the implementation
stage.
• It provides methods of representing documents and users' information
needs as complex objects with multilevel representations. This capacity
allows the information retrieval developer to provide multiple
representations of the same documents or users' information needs, which
gives flexibility in implementation.
• It provides a natural model for incorporation of a thesaurus into the
system. The adoption of the network produces a natural grouping of the
index terms.
• It provides implicit inter-document dependency. This dependency will
allow the retrieval of documents that do not contain query terms but share
some common index terms with the documents that contain the query
terms. As a result, this model will produce a higher recall compared with
a model which considers only those documents that share common terms
with the query.
• It provides a common and mathematically sound model for producing
both the initial ranked output and handling relevance feedback. This
situation is not possible with the existing probabilistic models, which use
ad-hoc methods to produce the initial ranking.
• It supports both the evidence and dependency alteration techniques for
relevance feedback in a common model, which, again, is not supported in
the existing probabilistic models.
We have also presented a comparison of performance between our model
and the two well-known information retrieval models, namely those of the vector
space model and Turtle and Croft's inference network (chapter 6). The
experiments were performed on three well-studied collections, namely ADI,
MEDLINE and CACM. The experimental results showed that our model
outperforms both models in terms of average precision. The improvement
achieved by our model varies from 0.62% to 2.74%. The results also showed that
our model produces a higher precision at both ends, low and high, of the recall
level. At low levels of recall, the improvement in precision is in the range of
1.66% to 5.69%. The improvement in precision at the high recall levels is in the
range of 3.54% to 7.61%. This behaviour makes our model a better choice for
supporting both precision-oriented and recall-oriented systems.
With respect to implementation issues of the model, we have proposed
new methods to optimise Bayesian networks using alternative approximated
networks. The approximated network can be created using virtual layers. The
introduction of the virtual layers in the network reduces the size of the link matrix
which in turn reduces the computational complexity in the network (chapter 5).
The convergence problem of an exact inference algorithm involving indirect
loops is solved by modifying the independence assumption of the algorithm
(chapter 5). This modified independence assumption is in accordance with the
human reasoning process and does not require massive preprocessing as do
current techniques.
We have also presented a model that evaluates the effectiveness of the
approximated networks using the Minimal Message Length principle. This
evaluation model enables us to choose the optimal approximated network without
performing extensive retrieval testing.
8.2 Future Work
The approach taken in this thesis suggests several further areas of research. These
areas include: adoption of phrases and thesauri, fusion of the retrieval output,
clustering methods for index terms, and the development of evaluation models
for Bayesian networks in general.
8.2.1 Phrases and Thesaurus
The utility of a thesaurus in increasing recall in information retrieval has been
proven [Salton71, Croft88]. We have described means of incorporating thesauri
in the proposed model in chapter 4. Traditionally, this thesaurus is generated by
looking at the similarity between index terms. Two index terms are considered
similar when they have similar weights [Salton83]. In Bayesian networks, the
graph represents explicitly the connectivity between index terms and documents.
Thus, the thesaurus may be created based not just on index term weight similarity
but also on the index terms shared between documents. When two or more index
terms co-occur in some documents, we can assume that these index terms
represent a higher-level concept or that they are part of a phrase. Hence, an
automatic thesaurus and phrase finder that can exploit this characteristic of
Bayesian networks is worthy of further investigation.
8.2.2 Retrieval Fusion
The proposed model provides the flexibility to represent a single information
need using multiple representations. The results of the retrieval of this
information need may vary for different representation networks [Turtle90]. So
far in this thesis, we have only performed retrieval using one query network
representation. A further investigation into the effect of the use of multiple query
representations could be useful. The main issue in performing this task is in
finding the optimal model to merge ranked outputs produced by the multiple
query network representations.
8.2.3 Index Term Clustering
In chapter 5, we have described two simple methods of index term clustering.
The clustering is introduced to the network in order to reduce the size of the link
matrix. Further investigation is required to find the optimal clustering using some
traditional clustering methods such as k-mean and new methods such as neural
network classification.
8.2.4 Evaluation Model for Bayesian Networks
In this thesis, we provide an evaluation model that measures the effectiveness of
the approximated network by evaluating the effectiveness of the clusters of
parent nodes of a given node in a Bayesian network. In this sense, we compare
two Bayesian networks locally, within a given node and its parents, disregarding
the global structure of the network. For example, the model does not take into
consideration the effect of the clustering in doc-1 on the clustering in doc-2. A
further evaluation model that can measure globally the effect of approximation in
a Bayesian network could be investigated. With such a model, we expect to be
able to take any two arbitrary approximated Bayesian networks and choose the
optimal one.
References
[Allen87] Allen, J. Natural Language Understanding.
Benjamin/Cummings, 1987.
[Amsler89] Amsler, R.A. Research Toward the Development of Lexical
Knowledge Base for Natural Language Processing. In the
Proceedings of the 12th Annual International Conference on
Research and Development in Information Retrieval, Belkin,
N.J. and van Rijsbergen, C. (eds), pp 242-249, ACM, New
York, 1989.
[Bhatnagar86] Bhatnagar, R.K. and Kanal, L.N. Handling Uncertain
Information: a Review of Numeric and Non-Numeric Methods.
In Uncertainty in Artificial Intelligence, Kanal, L.N. and
Lemmer, J.F. (eds), pp 3-26, North Holland, Amsterdam, 1986.
[Belew89] Belew, R.K., Adaptive Information Retrieval: Using a
Connectionist Representation to Retrieve and Learn about
Documents. In the Proceedings of the 12th Annual International
Conference on Research and Development in Information
Retrieval, Belkin, N.J. and van Rijsbergen, C. (eds), pp 11-20,
ACM, New York, 1989.
[Berzuini89] Berzuini, C., Bellazzi, R. and Quaglini, S. Temporal Reasoning
with Probabilities. In the Proceedings of Fifth Workshop on
Uncertainty and AI, Henrion, M. (ed.), pp 14-21, Windsor,
Ontario, 1989.
[Boguraev87] Boguraev, B., Briscoe, T., Carroll, J., Carter, D. and Grover, C.
The Derivation of a Grammatically Indexed Lexicon from the
Longman Dictionary of Contemporary English. In the
Proceedings of 25th Annual Meeting of the ACL, pp 193-200,
Stanford University, Stanford, CA, 1987.
[Brachman88] Brachman, R.J. and McGuinness, D.L. Knowledge
Representation, Connectionism, and Conceptual Retrieval. In
the Proceedings of the 11th International Conference on
Research and Development in Information Retrieval, pp 161-
174, ACM, New York, 1988.
[Brent91] Brent, M.R. From Grammar to Lexicon: Unsupervised Learning
of Lexical Syntax. Computational Linguistics, 19(2):243-262,
1991.
[Bundy85] Bundy, A. Incidence Calculus: A Mechanism for Probabilistic
Reasoning. Journal of Automated Reasoning, 1:263-283, 1985.
[Carmody66] Carmody, B.T. and Jones Jr, P.E. Automatic derivation of
microsentences. Communications of the ACM, June:435-445,
1966.
[Chang91] Chang, K.C. and Fung, R. Refinement and Coarsening of
Bayesian Networks. In Uncertainty in Artificial Intelligence 6,
Kanal, L.N. and Lemmer, J.F. (eds), pp 435-446, North
Holland, Amsterdam, 1991.
[Chavez90] Chavez, R.M. and Cooper, G.F. An Empirical Evaluation of a
Randomized Algorithm for Probabilistic Inference. In
Uncertainty in Artificial Intelligence 5, Henrion, M. et.al (eds),
pp 191-208, Elsevier Science, 1990.
[Cheeseman85] Cheeseman, P. In Defense of Probability. In the Proceedings of
the 9th International Joint Conference on Artificial Intelligence,
pp 1002-1009, 1985.
[Cheeseman91] Cheeseman, P. Probabilistic vs Fuzzy Reasoning. In
Uncertainty in Artificial Intelligence 6, Kanal, L.N. and
Lemmer, J.F. (eds), pp 85-102, North Holland, Amsterdam,
1991.
[Chevallet96] Chevallet, J.P. and Chiaramella, Y. Our Experience in Logical
IR Modeling. In the Proceedings of Glasgow Workshop on
LOGIC, University of Glasgow, U.K, 1996.
[Chin89] Chin, H.L. and Cooper, G.F. Bayesian Belief Network
Inference Using Simulation. In Uncertainty in Artificial
Intelligence 3, pp 129-148, North Holland, Amsterdam, 1989.
[Cohen87] Cohen, P.R. and Kjeldsen, R. Information Retrieval by
Constrained Spreading Activation in Semantic Networks.
Information Processing and Management, 23(2):255-268, 1987.
[Coombs90] Coombs, J.H. Hypertext, Full Text, and Automatic Linking. In
Proceedings of the 13th International Conference on Research
and Development in Information Retrieval, pp 83-98, 1990.
[Cooper71] Cooper, W.S. A Definition of Relevance for Information
Retrieval. Information Storage and Retrieval, 7:19-37, 1971.
[Cooper78] Cooper, W.S. and Maron, M.E. Foundations of Probabilistic and
Utility-Theoretic Indexing. Journal of the ACM, 25(1):67-80,
1978
[Cooper84] Cooper, G.F. NESTOR: A Computer Based Medical Diagnosis
Aid that Integrates Causal and Probabilistic Knowledge. Ph.D
Thesis, Computer Science Department, Stanford University,
1984.
[Cooper90] Cooper, G.F. The Computational Complexity of Probabilistic
Inference Using Bayesian Belief Network. Artificial
Intelligence, 42:393-405, 1990.
[Crestani94] Crestani, F. and van Rijsbergen, C.J. Information Retrieval by
Imaging. In the Proceedings of 16th British Computer Science
Colloquium, Drymen, Scotland, 1994.
[Crestani95] Crestani, F., Ruthven, I., Sanderson, M. and van Rijsbergen,
C.J. The Troubles with Using a Logical Model of IR on Large
Collection of Documents. In the Proceedings of 4th Text
Retrieval Conference, pp 509-526, NIST 500-236, National
Institute of Standard and Technology, US, 1995.
[Croft79] Croft, W.B. and Harper, D.J. Using Probabilistic Models of
Document Retrieval without Relevance Information. Journal of
Documentation, 35(3):285-295, 1979.
[Croft80] Croft, W.B. A Model of Cluster Searching Based on
Classification. Information Systems, 5:189-195, 1980.
[Croft84] Croft, W.B. and Thompson, R.H. The Use of Adaptive
Mechanism for Selection of Search Strategies in Document
Retrieval Systems. In the Proceedings of the ACM/BCS
International Conference on Research and Development in
Information Retrieval, pp 95-110, 1984.
[Croft85] Croft, W.B. and Parenty, T.J. A Comparison of a Network
Structure and a Database System used for Document Retrieval.
Information Systems, 10(4):377-390, 1985.
[Croft86] Croft, W.B. Boolean Queries and Term Dependencies in
Probabilistic Retrieval Models. Journal of the American Society
for Information Science, 37(2):71-77, 1986.
[Croft87a] Croft, W.B. Approaches to Intelligent Information Retrieval.
Information Processing and Management, 23(4):249-254, 1987.
[Croft87b] Croft, W.B. and Thompson, R.H. I3R: A New Approach to the
Design of Document Retrieval Systems. Journal of the
American Society for Information Science, 38(6):389-404,
1987.
[Croft88] Croft, W.B. and Savino, P. Implementing Ranking Strategies
Using Text Signatures. ACM Transactions on Office
Information Systems, 6(1):42-62, 1988.
[Croft89a] Croft, W.B. and Turtle, H. A Retrieval Model Incorporating
Hypertext Links. In the Proceedings of Hypertext’89, pp 213-
224, 1989.
[Croft89b] Croft, W.B., Lucia, T.J. and Willet, P. Retrieving Documents
by Plausible Inference: an Experimental Study. Information
Processing and Management, 25(6):599-614, 1989.
[Dechter85] Dechter, R. and Pearl, J. The Anatomy of Easy Problems: A
Constraint-Satisfaction Formulation. In the Proceedings of the
8th International Joint Conference on AI, pp 1066-1072, 1985
[Freimuth89] Freimuth, M.E., Stein, J.A. and Kear, T.J. Searching for Health
Information, University of Pennsylvania Press, Philadelphia,
1989
[Frisse89] Frisse, M.E. and Cousin, S.B. Information Retrieval from
Hypertext: Update on the Dynamic Medical Handbook Project.
In the Proceedings of Hypertext’89, pp199-212, 1989.
[Fryback78] Fryback, D.G. Bayes’ Theorem and Conditional
Nonindependence of Data in Medical Diagnosis, Computers
and Biomedical Research, 11:423-434, 1978.
[Fuhr86] Fuhr, N. Two Models of Retrieval with Probabilistic Indexing.
In the Proceedings of the 9th Annual Conference on Research
and Development in Information Retrieval, Rabitti, F (ed), pp
249-257, ACM Press, New York, 1986.
[Fuhr89] Fuhr, N. Models for Retrieval with Probabilistic Indexing.
Information Processing and Management, 25(1):55-72, 1989.
[Fuhr90] Fuhr, N. A Probabilistic Framework for Vague Queries and
Imprecise Information in Database. In the Proceedings of the
16th International Conference on Very Large Databases,
McLeod, D., Sacks-Davis, R. and Schek, H. (eds), pp 696-707,
Morgan Kaufmann, Los Altos, CA, 1990
[Fuhr92] Fuhr, N. Probabilistic Models in Information Retrieval. The
Computer Journal, 35(3):243-255, 1992.
[Fung90a] Fung, R.M., Crawford, S.L., Applebaum, L.A. and Tong, R.M.
An Architecture for Probabilistic Concept-Based Information
Retrieval. In the Proceedings of the 13th International
Conference on Research and Development in Information
Retrieval, Vidick, J.-L. (ed), pp 455-467, 1990.
[Fung90b] Fung, R. and Chang, K.C. Weighting and Integrating Evidence
for Stochastic Simulation in Bayesian Networks. In Uncertainty
in Artificial Intelligence 5, Henrion, M. et al. (eds), pp 209-219,
North Holland, Amsterdam, 1990.
[Ghazfan94] Ghazfan, D., Indrawan, M., Srinivasan, B. and Korb, K. A
Bayesian Model for Information Retrieval. In the Proceedings
of the 5th Australian Conference on Information Systems, Arnott, D.
and Shank, G. (eds), pp 259-272, Department of Information
Systems, Monash University, Australia, 1994.
[Ghazfan95] Ghazfan, D., Indrawan, M., and Srinivasan, B. A Semantically
Correct Bayesian Network based Information Retrieval. In the
Proceedings of the 5th Hellenic Conference on Informatics, pp
639-648, 1995.
[Ghazfan96] Ghazfan, D., Indrawan, M. and Srinivasan, B. Towards
Meaningful Bayesian Network for Information Retrieval
Systems. In the Proceedings of the 6th International Conference
in Information Processing and Management of Uncertainty in
Knowledge-Based Systems (IPMU), pp 841-846, Spain, 1996.
[Gordon85] Gordon, J. and Shortliffe, E.H. A Method of Managing
Evidential Reasoning in Hierarchical Hypothesis Space.
Artificial Intelligence, 26:323-357, 1985.
[Hansen95] Hansen, J.H.L. and Bou-Ghazale, S. Duration and Spectral
Based Stress Token Generation for Keyword Recognition
Using Hidden Markov Models. IEEE Transactions on Speech
and Audio Processing, 3(5):415-421, 1995.
[Harman92] Harman, D. Relevance Feedback Revisited. In the Proceedings
of the 15th Annual International SIGIR, pp 1-10, ACM Press,
Denmark, 1992.
[Harper78] Harper, D. and van Rijsbergen, C.J. An Evaluation of Feedback
in Document Retrieval Using Co-occurrence Data, Journal of
Documentation, 34(3):189-216, 1978.
[Heckerman85] Heckerman, D.E., Horvitz, E.J. and Nathwani, B.N. Pathfinder
Research Directions, Technical Report KSL-89-64, Knowledge
Systems Laboratory, Stanford University, Stanford, California,
1985.
[Henrion86] Henrion, M. Propagating Uncertainty in Bayesian Networks by
Probabilistic Logic Sampling. In Uncertainty in Artificial
Intelligence 2, pp 149-163, 1986.
[Henrion90] Henrion, M. Towards Efficient Inference in Multiply
Connected Belief Networks. In Influence Diagrams, Belief Nets
and Decision Analysis, Oliver, R.M. and Smith, J.Q. (eds.), pp
385-407, Wiley, Chichester, 1990.
[Hulme95] Hulme, M. Improved Sampling for Diagnostic Reasoning in
Bayesian Networks, In Uncertainty in Artificial Intelligence 95,
Besnard, P. and Hanks, S. (eds), pp 315-322, Morgan
Kaufmann, San Francisco, US, 1995.
[Indrawan96] Indrawan, M., Ghazfan, D. and Srinivasan, B. Bayesian
Network as a Retrieval Engine. In the Proceedings of the 5th
Text Retrieval Conference, pp 437-444, NIST 500-238,
National Institute of Standards and Technology, US, 1996.
[Indrawan98] Indrawan, M., Srinivasan, B., Ghazfan, D. and Wilson, C.
Handling Large Bayesian Networks: a Case Study of
Information Retrieval Systems. 1998 IEEE Conference on
Systems, Man and Cybernetics, USA, (submitted).
[Jones87] Jones, W.P. and Furnas, G.W. Pictures of Relevance – a
Geometric Analysis of Similarity Measures. Journal of the
American Society for Information Science, 38(6):420-442,
1987.
[Kupieck92] Kupiec, J. Robust Part-of-Speech Tagging using a Hidden
Markov Model. Computer Speech and Language, 6:225-242,
1992.
[Kwok89] Kwok, K.L. A Neural Network for Probabilistic Information
Retrieval. In the Proceedings of the 12th International
Conference on Research and Development in Information
Retrieval, Belkin, N.J. and van Rijsbergen, C.J. (eds), pp 21-30,
ACM, New York, 1989.
[Kwok90] Kwok, K.L. A Network Approach to Probabilistic Information
Retrieval. ACM Transactions on Information Systems,
13(3):324-353, 1995.
[Lancaster69] Lancaster, F.W. MEDLARS: Report on the Evaluation of Its
Operating Efficiency. American Documentation, 20(2), 1969.
[Lauritzen88] Lauritzen, S.L. and Spiegelhalter, D.J. Local Computations
with Probabilities on Graphical Structures and Their
Application to Expert Systems. Journal of the Royal Statistical
Society, Series B, 50:157-224, 1988.
[Lewis90] Lewis, D.D. Representation, Learning, and Language in
Information Retrieval. PhD Thesis, University of Massachusetts,
1990.
[Lewis96] Lewis, D.D. and Sparck-Jones, K. Natural Language Processing
for Information Retrieval. Communications of the ACM,
39(1):92-101, 1996.
[Losee88] Losee, R.M. and Bookstein, A. Integrating Boolean Queries in
Conjunctive Normal Form with Probabilistic Retrieval Models.
Information Processing and Management, 24(3):315-321, 1988.
[Luhn58] Luhn, H.P. The Automatic Creation of Literature Abstracts.
IBM Journal of Research and Development, 2(2):159-165,
1958.
[Maron60] Maron, M.E. and Kuhns, J.L. On Relevance, Probabilistic
Indexing and Information Retrieval. Journal of the ACM,
7:216-244, 1960.
[McDermott85] McDermott, D. and Doyle, J. Non-Monotonic Logic. Artificial
Intelligence, 25:41-72, 1985.
[Mel’cuk89] Mel’cuk, I. Semantic Primitives from the Viewpoint of the
Meaning Text Linguistic Theory, Quaderni di Semantica,
10:27-62, 1989.
[Milstead89] Milstead, J.L. Subject Access Systems. Academic Press,
Orlando, 1989.
[Neapolitan90] Neapolitan, R.E. Probabilistic Reasoning in Expert Systems:
Theory and Algorithms. John Wiley & Sons, US, 1990.
[Oddy77] Oddy, R.N. Information Retrieval through Man-machine
Dialogue. Journal of Documentation, 33:1-14, 1977.
[Oliver94] Oliver, J.J. and Hand, D.J. Introduction to Minimum Encoding
Inference. Technical Report TR94-205, Computer Science
Department, Monash University, Australia, 1994.
[Olmsted83] Olmsted, S.M. On Representing and Solving Decision
Problems. PhD Thesis, Engineering-Economic Systems
Department, Stanford University, Stanford, California, 1983.
[Pearl87] Pearl, J. Evidential Reasoning Using Stochastic Simulation of
Causal Models. Artificial Intelligence, 32:245-257, 1987.
[Pearl88] Pearl, J. Probabilistic Reasoning in Intelligent Systems:
Networks of Plausible Inference. Morgan Kaufmann, 1988.
[Peng86] Peng, Y. and Reggia, J.A. A Probabilistic Causal Model for
Diagnostic Problem Solving, Parts I and II. IEEE Transactions on
Systems, Man and Cybernetics, 17:140-145, 1986.
[Provan95] Provan, G. Abstraction in Belief Networks: The Role of
Intermediate States in Diagnostic Reasoning. In Uncertainty in
Artificial Intelligence 95, Besnard, P. and Hanks, S. (eds), pp
464-471, Morgan Kaufmann, San Francisco, US, 1995.
[Rajashekar95] Rajashekar, T.B. and Croft, W.B. Combining Automatic and
Manual Index Representations. Journal of the American Society
for Information Science, 46(4):272-283, 1995.
[Reiter87] Reiter, R. Nonmonotonic Reasoning. Annual Review of
Computer Science, 2:147-186, 1987.
[Rijsbergen79] Van Rijsbergen, C.J. Information Retrieval. Butterworths,
1979.
[Rijsbergen86] Van Rijsbergen, C.J. A Non-Classical Logic for Information
Retrieval. Computer Journal, 29:481-485, 1986.
[Rijsbergen89] Van Rijsbergen, C.J. Towards an Information Logic. In the
Proceedings of 12th Annual International ACM SIGIR
Conference on Research and Development in Information
Retrieval, Belkin, N.J. and van Rijsbergen, C.J. (eds), pp 77-86,
New York, 1989.
[Rijsbergen92] Van Rijsbergen, C.J. Probabilistic Retrieval Revisited. Computer
Journal, 35(3):291-298, 1992.
[Robertson76] Robertson, S.E. and Sparck-Jones, K. Relevance Weighting of
Search Terms. Journal of the American Society for Information
Science, 27:129-146, 1976.
[Robertson77] Robertson, S.E. The Probability Ranking Principle in IR.
Journal of Documentation, 33(4):294-304, 1977.
[Robertson82] Robertson, S.E., Maron, M.E. and Cooper, W.S. Probability of
Relevance: A Unification of Two Competing Models for
Document Retrieval. Information Technology: Research and
Development, 1(1):1-21, 1982.
[Salton68] Salton, G. Automatic Information Organization and Retrieval.
McGraw-Hill, 1968.
[Salton71] Salton, G (ed). The SMART Retrieval System – Experiments in
Automatic Document Processing. Prentice-Hall, Inc.,
Englewood Cliffs, New Jersey, 1971.
[Salton83] Salton, G. and McGill M.J. Introduction to Modern Information
Retrieval, McGraw-Hill, 1983.
[Salton88] Salton, G. A Simple Blueprint for Automatic Boolean Query
Processing. Information Processing and Management,
24(3):269-280, 1988.
[Schank77] Schank, R.C. and Abelson, R.P. Scripts, Plans, Goals, and
Understanding. Lawrence Erlbaum Press, 1977.
[Shachter86] Shachter, R.D. Intelligent Probabilistic Inference. In
Uncertainty in Artificial Intelligence, Kanal, L. and Lemmer, J.
(eds), pp 371-382, North Holland, 1986.
[Shachter88] Shachter, R.D. Probabilistic Inference and Influence Diagrams.
Operations Research, 36:871-882, 1988.
[Shachter90] Shachter, R.D. and Peot, M.A. Simulation Approaches to
General Probabilistic Inference on Belief Networks. In
Uncertainty in Artificial Intelligence 5, Kanal, L. and Lemmer, J.
(eds), pp 221-231, North Holland, 1990.
[Shafter76] Shafer, G. A Mathematical Theory of Evidence. Princeton
University Press, 1976.
[Shortlife75] Shortliffe, E.H. and Buchanan, B.G. A Model of Inexact
Reasoning in Medicine. Mathematical Biosciences, 23:351-376,
1975.
[Shoval85] Shoval, P. Principles, Procedures and Rules in Expert System
for Information Retrieval. Information Processing and
Management, 21(6):475-487, 1985.
[Shwe90] Shwe, M. and Cooper, G. An Empirical Analysis of Likelihood-
Weighting Simulation on Large, Multiply Connected Belief
Network. Technical Report KSL-90-23, Knowledge Systems
Laboratory, Stanford University, Stanford, CA, 1990.
[Sparck-Jones71] Sparck-Jones, K. Automatic Keyword Classification for
Information Retrieval. Archon Books, 1971.
[Sparck-Jones72] Sparck-Jones, K. A Statistical Interpretation of Term
Specificity and Its Application in Retrieval. Journal of
Documentation, 28(1):11-20, 1972.
[Sparck-Jones74] Sparck-Jones, K. Automatic Indexing. Journal of
Documentation, 30(4):393-432, 1974.
[Sparck-Jones79] Sparck-Jones, K. Search Term Relevance Weighting Given
Little Relevance Information. Journal of Documentation,
35:30-48, 1979.
[Stillman91] Stillman, J. On Heuristics for Finding Loop Cutsets in Multiply
Connected Belief Networks. In Uncertainty in Artificial
Intelligence 6, Bonissone, P.P., Henrion, M., Kanal, L.N. and
Lemmer, J.F. (eds), pp 233-243, North Holland, 1991.
[Strzalkowski93] Strzalkowski, T. Natural Language Processing in Large-Scale
Text Retrieval Tasks. In the Proceedings of the 1st Text
Retrieval Conference (TREC-1), pp 39-54, NIST 500-207,
National Institute of Standards and Technology, US, 1993.
[Tong83] Tong, R.M., Shapiro, D., McCune, B.P. and Dean, J.S. A Rule-
Based Approach to Information Retrieval: Some Results and
Comments. In the Proceedings of the National Conference on
Artificial Intelligence, pp 411-415, 1983.
[Tong85] Tong, R.M. and Shapiro, D. Experimental Investigations of
Uncertainty in a Rule-Based System for Information Retrieval.
International Journal of Man-Machine Studies, 22:265-282,
1985.
[Tong86] Tong, R.M., Applebaum, L.A., Askman, V.N. and
Cunningham, J.F. RUBRIC III: An Object Oriented Expert
System for Information Retrieval. In the Proceedings of the 2nd
Expert Systems in Government Symposium, Karna, K.L.,
Parsaye, K. and Silverman, B.G. (eds), pp 106-115, 1986.
[Turtle90] Turtle, H. Inference Networks for Document Retrieval. Ph.D.
Thesis, University of Massachusetts, October, 1990.
[Turtle91] Turtle, H. and Croft, W.B. Evaluation of an Inference Network-
Based Retrieval Model. ACM Transactions on Information
Systems, 9:187-222, 1991.
[Voorhees96] Voorhees, E. and Harman, D. Overview of the 5th Text
Retrieval Conference. In the Proceedings of the 5th Text
Retrieval Conference, pp 1-28, NIST 500-238, National
Institute of Standards and Technology, US, 1996.
[Wallace68] Wallace, C.S. and Boulton, D.M. An Information Measure for
Classification. Computer Journal, 11(2):185-194, 1968.
[Wallis95] Wallis, P. Semantic Signatures for Information Retrieval. PhD
Thesis RT-6, Computer Science Department, Royal Melbourne
Institute of Technology, 1995.
[Willet88] Willet, P. Recent Trends in Hierarchic Document Clustering: a
Critical Review. Information Processing and Management,
24(5):577-598, 1988.
[Wilson73] Wilson, P. Situational Relevance. Information Storage and
Retrieval, 9:457-471, 1973.
[Yu88] Yu, C. and Mizuno, H. Two Learning Schemes in Information
Retrieval. In the Proceedings of the 11th International Conference
on Research and Development in Information Retrieval,
Chiaramella, Y. (ed), pp 201-218, Presses Universitaires de
Grenoble, Grenoble, France, 1988.
[Zadeh78] Zadeh, L.A. Fuzzy Sets as a Basis for a Theory of Possibility.
Fuzzy Sets and Systems, 1:3-28, 1978.
[Zadeh86] Zadeh, L.A. Is Probability Theory Sufficient for Dealing with
Uncertainty in AI: a Negative View. In Uncertainty in Artificial
Intelligence, Kanal, L.N. and Lemmer, J.F. (eds), pp 103-116,
North Holland, Amsterdam, 1986.