A FRAMEWORK FOR INFORMATION
RETRIEVAL BASED ON BAYESIAN
NETWORKS
by
Maria Indrawan
B.Comp.(Hons), MACS
School of Computer Science and Software Engineering
Monash University
Thesis Submitted for Examination
for the Degree of
Doctor of Philosophy
1998
Declaration
I declare that the thesis contains no material which has been accepted for the
award of any degree or diploma in any university and that, to the best of my
knowledge, the thesis contains no material previously published or written by any
other person except where due reference is made in the text.
Signed
Date
School of Computer Science and Software Engineering,
Monash University,
Caulfield, Victoria, 3168
1998
Table of Contents
TITLE............................................................................................................... i
ABSTRACT .................................................................................................... ii
DECLARATION............................................................................................ iv
ACKNOWLEDGEMENT .............................................................................. v
CHAPTER 1
INTRODUCTION
1.1 Background and Motivation .................................................................. 1
1.2 Uncertainty and Artificial Intelligence .................................................... 2
1.3 Previous Work Using Network Models for Information Retrieval .......... 6
1.4 Contribution of the Thesis...................................................................... 9
1.5 Research Methodology........................................................................ 12
1.6 Thesis Overview.................................................................................. 13
CHAPTER 2
AUTOMATIC INFORMATION RETRIEVAL
2.1 Introduction......................................................................................... 16
2.2 Information Retrieval Model................................................................ 16
2.3 Document and Query Indexing ............................................................ 21
2.3.1 Indexing Problems........................................................................ 22
2.3.2 Indexing Language....................................................................... 24
2.4 Matching Functions ............................................................................. 27
2.4.1 Boolean Model ............................................................................ 28
2.4.2 Vector Space Model .................................................................... 29
2.4.3 Probabilistic Model ...................................................................... 32
2.4.3.1 Binary Independence Model.................................................... 33
2.4.3.2 Unified Model......................................................................... 37
2.4.3.3 Retrieval with Probabilistic Indexing (RPI) Model................... 39
2.5 Increasing Retrieval Performance......................................................... 40
2.5.1 Stop List ...................................................................................... 41
2.5.2 Term Weighting ........................................................................... 42
2.5.3 Thesaurus .................................................................................... 43
2.5.4 Relevance Feedback.................................................................... 44
2.6 Summary............................................................................................ 47
CHAPTER 3
THEORY IN BAYESIAN NETWORKS
3.1 Introduction........................................................................................ 49
3.2 Bayes Theorem................................................................................... 50
3.3 Bayesian vs Classical Probability Theory............................ 57
3.4 The Bayesian Network as a Knowledge Base...................................... 61
3.4.1 Bayesian Network Structure........................................................ 63
3.4.2 Conditional Independence ........................................................... 65
3.5 Probabilistic Inference in Bayesian Networks ...................................... 69
3.5.1 Pearl's Inference Algorithm ......................................... 71
3.5.2 Handling Loops in the Network .................................................. 75
3.6 Summary............................................................................................ 76
CHAPTER 4
A SEMANTICALLY CORRECT BAYESIAN NETWORK MODEL
FOR INFORMATION RETRIEVAL
4.1 Introduction........................................................................................ 78
4.2 The Bayesian Network Model............................................................. 83
4.2.1 Probability Space ........................................................................ 84
4.2.2 The Document Network.............................................................. 87
4.2.3 The Query Network .................................................................... 89
4.2.4 Prior Probability.......................................................................... 90
4.3 Probabilistic Inference in Information Retrieval................................... 93
4.3.1 Link Matrices.............................................................................. 95
4.3.1.1 OR-link matrix..................................................................... 96
4.3.1.2 AND-link matrix.................................................................. 96
4.3.1.3 WEIGHTED-SUM link matrix ............................................ 97
4.4 Directionality of the Inference............................................................. 98
4.5 Comparison with Other Models ........................................................ 105
4.5.1 Simulating the Boolean Model .................................................. 106
4.5.2 Simulating the Probabilistic Retrieval Model.............................. 108
4.5.3 Inference Network .................................................................... 110
4.6 Summary.......................................................................................... 116
CHAPTER 5
HANDLING LARGE BAYESIAN NETWORKS
5.1 Introduction...................................................................................... 118
5.2 An Illustration of an Exact Algorithm ............................................... 119
5.3 Reducing the Computational Complexity .......................................... 126
5.3.1 Node and Link Deletion ............................................................ 126
5.3.2 Layer Reduction........................................................................ 128
5.3.3 Adding a Virtual Layer.............................................................. 130
5.3.3.1 Clustering the Parent Nodes .............................................. 135
5.4 Handling Indirect Loops ................................................................... 139
5.4.1 Clustering ................................................................................. 141
5.4.2 Conditioning ............................................................................. 144
5.4.3 Sampling and Simulation........................................................... 145
5.5 Dealing with a Loop Using Intelligent Nodes .................................... 147
5.5.1 Example of the Feedback Process Using Intelligent Nodes ........ 151
5.6 Summary.......................................................................................... 152
CHAPTER 6
MODEL PERFORMANCE EVALUATION
6.1 Introduction...................................................................................... 155
6.2 The Relevance Judgement Set........................................................... 160
6.3 Performance of the Basic Model ....................................................... 164
6.4 Estimating the Probabilities............................................................... 166
6.4.1 Estimating P(ti|Q=true)............................................................. 166
6.4.2 Dependence of Documents on Index Terms............................... 170
6.4.2.1 Estimating the tf and idf Components ................................ 170
6.4.2.2 Estimating the Combination of tf and idf Components........ 171
6.4.3 Estimating the Virtual Layer Distribution .................................. 176
6.5 Performance Comparison with Existing Models ................................ 181
6.5.1 Comparative Performance for the ADI ....................................... 183
6.5.2 Comparative Performance for the MEDLINE............................. 186
6.5.3 Comparative Performance for the CACM................................... 187
6.6 Summary........................................................................................... 192
CHAPTER 7
MEASURING THE EFFECTIVENESS OF THE VIRTUAL LAYER MODEL
7.1 Introduction....................................................................................... 195
7.2 Minimum Message Length................................................................. 196
7.2.1 Encoding Real Valued Parameters.............................................. 198
7.3 Measuring the Effectiveness of the Virtual Layer Model with MML ........ 200
7.4 Illustration of MML Calculation for Index Term Clusters ................... 202
7.5 Summary........................................................................................... 208
CHAPTER 8
CONCLUSION AND FUTURE RESEARCH
8.1 Conclusion ........................................................................................ 210
8.2 Future Work...................................................................................... 212
8.2.1 Phrases and Thesaurus ............................................................... 213
8.2.2 Retrieval Fusion ......................................................................... 213
8.2.3 Index Term Clustering................................................................ 214
8.2.4 Comparison Model for Bayesian Networks ................................ 214
REFERENCES ............................................................................................ 215
APPENDIX A ............................................................................................. 234
APPENDIX B .............................................................................................. 264
APPENDIX C .............................................................................................. 269
Chapter 1
Introduction
1.1 Background and Motivation
Information is a vital resource for all organisations. The efficient management
and retrieval of information is therefore an important organisational function. It
has been suggested that the quantity of new information produced in the western
world is growing at a rate of 13 percent each year [Freimuth89]. With the
development of the internet and other global networks this figure is expected to
increase markedly. As a result, people who need information are frequently
overwhelmed by the sheer amount of information available and finding useful
information requires enormous effort.
In early information retrieval systems, such as library catalogues, searching
is achieved through a catalogue in which documents are represented by several
fixed categories such as author, title and subject. The assignment of
categories is done manually by domain experts. With the present explosion in
the amount of information available, handling information effectively and
efficiently in this way is very difficult, if not impossible.
if not impossible. In addition to the problem of information volume, the current
format of the information has also introduced another dimension to the
information retrieval task. Most information is currently delivered in electronic
format which lacks the well-structured form of books. Such ill-structured
documents include articles, WEB pages, medical records, patent records, legal
case records, software libraries and manuals. Compared with traditional library
systems, a different kind of computer-based search strategy is required for these
electronic documents. The search strategy needs to be able to retrieve documents
based on the ‘content’ of the items directly. The objective of modern information
retrieval systems is to provide such types of search.
The automation of search and retrieval by content is not straightforward. Most of
the information available is written in natural language such as English and, to
date, information systems have not been able to process and ‘understand’ the
natural language as competently as human beings, despite extensive efforts by
natural language researchers [Carmody66, Schank77, Allen87, Boguraev87,
Mel’cuk89, Amsler89, Brent91, Kupiec92]. Thus, a major problem inherent in
information retrieval systems may be seen as the ‘uncertainty’ in understanding
user's information needs and the content of documents. In the last two decades,
researchers in computer science have actively investigated the possibility of
applying Artificial Intelligence (AI) techniques to handle uncertainty. In the next
section we will present a summary of different AI techniques used to handle
uncertainty in information retrieval.
1.2 Uncertainty and Artificial Intelligence
In the recent past, one focus in Artificial Intelligence has been the problem of
"Approximate Reasoning". This problem deals with the decision making and
reasoning processes in situations where information is not fully reliable, the
representation language is inherently imprecise and information from multiple
sources is conflicting. In information retrieval, documents and queries are
represented by index terms. The precision in representing document and query
content relies on the effectiveness of the text analysis methods in 'understanding'
the natural language. As stated in the previous section, existing natural language
processing models have not been able to process and 'understand' the natural
language as competently as human beings. As a result, document and query
representation cannot be represented precisely, or in other words, it may be seen
as a problem that requires an "approximate reasoning". Thus, an AI approach
may be considered as a solution to this uncertainty problem inherent in
information retrieval task.
Representation techniques for uncertain or imprecise information can be
classified into numeric and non-numeric (also known as symbolic) techniques.
In the numeric context, the approximation can be viewed as a value with a known
error margin such as in Bayesian models, Evidence Theory [Shafer76] and
Fuzzy theory [Zadeh78]. The Bayesian belief network was introduced in the
eighties as an extension to the traditional Bayesian models. It incorporates graph
theory into the Bayesian model to enrich the semantic representation.
The symbolic representation approach at first concentrated on the use of
logic or, more specifically, first order predicate calculus. This classical symbolic
logic failed to produce consistent representations due to its lack of tools for
describing how to devise a formal theory to deal with inconsistencies caused by
new information [Bhatnagar86]. A modification of symbolic logic, namely
non-monotonic logic [McDermott85], was introduced to overcome the problems of
first order logic.
In addition, there are other AI methods such as neural networks, genetic
algorithms and hidden Markov models. The first two methods have attracted a
considerable number of researchers, especially in recent years with the
increase in computational power. The hidden Markov model is supported by
rigorous mathematical theory and is mainly used in the areas of speech and
character recognition [Hansen95].
There have been many conflicting views regarding the merits of particular
models [Cheeseman85, Cheeseman91, Zadeh86]. Each model exhibits
comparative advantages depending on the domain and application being
considered. In information retrieval research, probabilistic methods, which can
be categorised as numeric AI techniques, are well accepted and have shown
promising results [Robertson76, Rijsbergen79, Turtle91, Ghazfan94]. However,
in the last few years there has been a surge in research adopting symbolic AI
techniques, in particular non-classical logic approaches, for information
retrieval [Rijsbergen86, Rijsbergen89, Crestani94, Chevallet96]. The results
have not been widely reported due to the computational complexity of these
models [Crestani95].
We will adopt the probabilistic approach, more specifically that of
Bayesian networks, in our information retrieval model. A Bayesian network is a
directed acyclic graph where the nodes represent events or propositions and the
arcs represent causal relations between those propositions represented in the
nodes. The support of explicit relations between the propositions in the Bayesian
network can overcome the following problems experienced by other probabilistic
retrieval models:
1. The traditional probabilistic model, such as those of Maron and Kuhn
[Maron60], Robertson and Sparck-Jones [Robertson76], Fuhr [Fuhr89]
and Rijsbergen [Rijsbergen79], uses two different models to produce the
initial ranking and to handle relevance feedback. The initial ranking is
usually produced using some ad-hoc probability estimations and the
relevance feedback is handled using some learning models.
2. Relevance feedback is confined to relevance information gathered from
documents, although Fuhr [Fuhr92] showed that relevance feedback
gathered from queries can also be used to improve retrieval
performance.
3. Multiple representations of documents and queries are not possible,
although Turtle [Turtle90] showed that an information need represented
by different query representations generates different ranked outputs,
and that combining these outputs may increase retrieval performance.
4. Thesauri, citations and synonyms are added on top of the retrieval
model instead of forming part of the retrieval model itself.
Our proposed Bayesian network model for information retrieval addresses the
problems inherent in the traditional probabilistic retrieval model in the following
ways:
1. The probabilistic inference in the Bayesian network retains the sound
theoretical basis of the traditional probabilistic models, but also
incorporates a common method for producing initial rankings of
documents and for handling relevance feedback.
2. Relevance feedback fits naturally into the model. The probabilistic
inference approach provides an automatic mechanism for learning.
3. The probabilistic inference approach allows us to incorporate relevance
information from other queries into the model by using a separate network
representation for the query and exploiting the use of multiple query
network representations for a single information need.
4. Documents in the collection may be represented as a complex object with
multilevel representations, not merely as a collection of index terms.
5. Dependencies between documents are built implicitly in the model by
using the conditional independence principle of Bayesian networks, which
allows the retrieval of documents that do not share common index terms
with the query. Citation or nearest-neighbour links can be easily
incorporated because of the graphical nature of the model.
6. Synonyms and a thesaurus can be easily implemented as part of the
network. Any index terms that are synonyms can be linked, so the system
can use all those synonyms during retrieval.
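The structure outlined above (a directed acyclic graph whose nodes are propositions and whose arcs carry causal dependence quantified by conditional probabilities) can be sketched minimally in code. The three-node query → term → document chain and all probability values below are illustrative assumptions, not figures from this thesis:

```python
# A minimal Bayesian network sketch: binary propositions in a DAG, each node
# carrying a conditional probability table (CPT) keyed on its parents' values.
# Structure and numbers are illustrative only.
from itertools import product

# Each CPT maps (tuple of parent values) -> P(node = True | parents).
network = {
    "query":    {"parents": [],        "cpt": {(): 0.5}},
    "term":     {"parents": ["query"], "cpt": {(True,): 0.8, (False,): 0.1}},
    "document": {"parents": ["term"],  "cpt": {(True,): 0.7, (False,): 0.05}},
}

def joint_probability(assignment):
    """P(assignment) by the chain rule: the product over all nodes of
    P(node | parents), following the arcs of the DAG."""
    p = 1.0
    for name, node in network.items():
        parent_vals = tuple(assignment[par] for par in node["parents"])
        p_true = node["cpt"][parent_vals]
        p *= p_true if assignment[name] else 1.0 - p_true
    return p

# The joint distribution sums to 1 over all 2^3 assignments.
total = sum(
    joint_probability(dict(zip(network, values)))
    for values in product([True, False], repeat=len(network))
)
print(round(total, 10))  # → 1.0
```

The chain-rule factorisation is what makes the network compact: each node needs only a table over its parents, not over every other node in the graph.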
Graph and network structures have been widely used in information retrieval.
Salton [Salton68] made early use of tree and graph models in information
retrieval, describing the implementation of many basic retrieval structures in
graph-theoretic terms. However, their use in combination with a
formal inference technique is still a current topic of research.
1.3 Previous Work Using Network Models for
Information Retrieval
Salton's [Salton68] early use of tree and graph models for information
retrieval provides a starting point for much of the information retrieval
research that uses tree or graph models. A number of current information
retrieval models use a network representation. These models can be loosely
categorised according to whether they support clustering, rule-based
inference, browsing, spreading activation, or connections.
Clustering. In the clustering approach, the network structure is derived
naturally from the representation of document and term clusters. Sparck-Jones
[Sparck-Jones71] investigated the term clustering technique and later used it to
develop the automatic indexing technique [Sparck-Jones74]. Croft [Croft80]
describes a retrieval model incorporating document and term clusters. Croft and
Parenty [Croft85] compare the performance of cluster based network
representation with a conventional database implementation. A survey of
document clustering techniques, especially those for hierarchic clustering, is
presented by Willet [Willet88]. All the different approaches to clustering have
one common feature, namely that they assume there is a natural similarity
between index terms or documents and these similarities can be exploited to
increase the retrieval performance.
Rule-based inference. The rule-based inference method in RUBRIC
systems [Tong83, Tong85] represents queries as a set of rules in an evaluation
tree that specifies how individual document features can be combined to estimate
the certainty that a document matches the query. One of the objectives of the
RUBRIC design was to allow the comparison of different uncertainty calculi
models [Tong86]. Recently, the RUBRIC system included the inference network
approach [Fung90a]. Rule-based inference using network structures has also been
used with the construction of automatic thesauri [Croft87b, Shoval85].
Browsing. A network representation is essential in information retrieval
systems that support a browsing capability. Hypertext systems are a typical
example of browsing systems and are also common in thesaurus based systems.
The THOMAS system [Oddy77] uses a method that allows browsing in a simple
network of documents and terms. A more complex network model for browsing is
investigated by Croft and Thompson [Croft87b] using the I3R system. Croft and
Turtle [Croft89a] and Frisse and Cousins [Frisse89] describe a retrieval model for
hypertext networks. A survey of hypertext retrieval research can be found in
Coombs [Coombs90].
Spreading activation. Spreading activation is a search technique in
which the query is used to activate a set of nodes in the representation network,
which in turn activates the neighbouring nodes. The rank of the retrieved
documents is generated by the pattern of activation in the network. The variation
between such models usually arises due to different halting conditions and
weighting functions. Jones and Furnas [Jones87] present a representative
spreading activation model which is compared to the conventional retrieval
models by Salton [Salton88]. Croft [Croft89b] used spreading activation in a
network based on document clustering. Cohen and Kjeldsen [Cohen87] used
spreading activation in a more complex representation network with typed edges.
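The spreading activation search described above can be sketched as follows. The term-document links, the weights, and the fixed-step halting condition are illustrative assumptions rather than any particular published model:

```python
# Toy spreading activation over a term-document network: query terms receive
# an initial activation, which propagates along weighted links to neighbouring
# document nodes; documents are ranked by accumulated activation.
# Links: term -> [(document, weight), ...]; all values illustrative.
links = {
    "bayes":     [("doc1", 0.9), ("doc2", 0.4)],
    "network":   [("doc1", 0.6), ("doc3", 0.8)],
    "retrieval": [("doc2", 0.7)],
}

def spread(query_terms, steps=1):
    activation = {}
    frontier = {t: 1.0 for t in query_terms}   # initial activation on query nodes
    for _ in range(steps):                     # halting condition: fixed step count
        next_frontier = {}
        for node, act in frontier.items():
            for neighbour, weight in links.get(node, []):
                next_frontier[neighbour] = next_frontier.get(neighbour, 0.0) + act * weight
        for node, act in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + act
        frontier = next_frontier
    # Rank by activation, highest first.
    return sorted(activation.items(), key=lambda kv: kv[1], reverse=True)

print(spread(["bayes", "network"]))  # doc1 ranked first (0.9 + 0.6 activation)
```

Variations between published models then amount to different choices of weighting function and halting condition, as the text notes.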
Connections approach. The connectionist approaches are similar to
spreading activation. However, the connectionist approach does not include
the clear semantic interpretation of the network links that the spreading
activation approach provides. The weights associated with the links are
learned from training samples or through user guidance. Croft and Thompson
[Croft84] used a connectionist network in an attempt to learn and select a query
strategy. Brachman and McGuinness [Brachman88] used a connectionist approach
to retrieve facts from knowledge bases about programming languages. Belew
[Belew89] and Kwok [Kwok89] describe other connectionist approaches to
information retrieval. Lewis [Lewis90] further explores the relationship between
information retrieval and machine learning.
All the network approaches discussed in this section lack one major
feature required to produce a good information retrieval model, namely that of a
strong mathematical foundation. In this thesis, we introduce a new formal model
based on a Bayesian network that provides a strong mathematical foundation.
1.4 Contribution of the Thesis
Recent information retrieval research has suggested that significant
improvements in retrieval performance will require techniques that, in some
sense, ‘understand’ the content of documents and queries [Rijsbergen86,
Croft87a], in order to infer probable relationships between documents and
queries.
The idea that the retrieval process is an inference or evidential reasoning
process is not new. Cooper’s logical relevance approach [Cooper71] is based on
deductive relationships between representations of documents and information
needs. Wilson [Wilson73] used situational relevance to extend Cooper’s logical
relevance by incorporating inductive inference.
In the research described in this thesis we present a semantically sound
Bayesian network model as a formal model of information retrieval. This thesis
contains two areas of contribution, namely to information retrieval modeling and
to Bayesian network inference theory. In detail, the thesis contains the following
contributions:
• We formally define a new model for information retrieval based on a
Bayesian network. The model provides a strong mathematical
foundation to model uncertainty in information retrieval. The new
model can be used as a general framework for information retrieval
because it can represent different existing information retrieval
models, such as the Boolean and probabilistic models, by using
appropriate network representations. With this framework, the
decision to adopt a specific retrieval model can be postponed until
the implementation level.
• We introduce a specific implementation of the above model to
perform probabilistic retrieval. The probability model presented
includes probability estimations that produce better performance
than other well-known information retrieval systems, such as the
vector space model [Salton83] and Turtle and Croft's [Turtle90]
network model. The performance tests were carried out on three
well-studied test collections, namely ADI, MEDLINE and CACM.
Moreover, the adoption of a graph, which captures the connectivity
between the index terms and the documents, enables our proposed
model to produce higher recall than the information retrieval
models previously mentioned.
• We provide a framework within the Bayesian network model to
support both evidential and dependency alteration relevance feedback.
Existing information retrieval models have failed to provide a
common model for both approaches to relevance feedback, although
the two approaches have been shown to benefit different retrieval
situations. The evidential feedback is suited for modeling the situation
where we perceive that the probability distribution has been correctly
modeled; hence the data received from relevance feedback is
treated as new evidence for this probability distribution. Altering the
dependencies, on the other hand, is best used when we perceive the
probability distribution to be incorrect and the data gathered from the
relevance feedback process should be used to correct this probability
distribution. As we can see, the two relevance feedback approaches
each have their own place in information retrieval applications.
Therefore the ability to support both approaches in a single framework
is essential to information retrieval.
• Cooper [Cooper90] proved that exact inference in Bayesian networks
is NP-hard. It is common for
information retrieval systems to deal with large document collections.
Therefore, we see the importance of adopting some approximation
methods to reduce the inference complexity in the Bayesian network
model for information retrieval. We introduce some heuristics to
reduce the complexity of the inference in Bayesian networks.
• Finally, we present an evaluation model that can be used to measure
the complexity of the heuristics proposed in the previous point. The
model is based on the idea of Minimum Message Length
[Wallace68]. The best approximation or heuristic is the one
that produces the shortest coding in describing
the probability distribution. This evaluation model will enable us to
evaluate the efficiency of a given approximation model without
performing extensive retrieval tests.
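As a toy illustration of the Minimum Message Length idea behind this evaluation model: each candidate description of a distribution is scored by a two-part message (first state the model, then encode the data under it), and the shortest total message wins. The candidate set, the data, and the fixed parameter precision below are illustrative assumptions, not the scheme developed in chapter 7:

```python
# Two-part MML scoring for candidate Bernoulli models of binary data:
# message length = cost of stating the model + cost of encoding the data
# under that model (-log2 likelihood, in bits). All values illustrative.
from math import log2

data = [1, 1, 1, 0, 1, 1, 0, 1]   # binary observations (sample mean 0.75)

def message_length(p, data, precision_bits=4):
    model_cost = precision_bits                              # part 1: state p
    data_cost = sum(-log2(p if x else 1.0 - p) for x in data)  # part 2: data | p
    return model_cost + data_cost

candidates = [i / 16 for i in range(1, 16)]   # Bernoulli parameters at 4-bit precision
best = min(candidates, key=lambda p: message_length(p, data))
print(best)  # → 0.75
```

The shortest message here picks the candidate matching the sample mean; richer model classes trade extra statement cost against shorter data encoding in the same way.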
1.5 Research Methodology
In information retrieval research, experiments are performed using test
collections. Recall and precision levels are used to measure the performance of
the system. The recall measures the ability of the system to retrieve all the
relevant documents. The precision measures the ability of the system to
discriminate between the relevant and non-relevant documents. A test collection
in information retrieval experiments comprises:
• A set of documents – current test collections generally contain
information from the original documents such as title, author, date and
an abstract.
• A set of queries – These queries are often taken from actual queries
submitted by the users. They can be expressed either in natural
language or in some formal query language such as Boolean
expressions.
• A set of relevance judgements – For each query in the query set, a set
of relevant documents is identified. The identification process can be
done manually by a human expert or by using pooling methods for
results from several information retrieval systems.
The interaction of these sets in an information retrieval experiment is
depicted in Figure 1-1.
[Figure: standard queries from the test collection are run through the
retrieval model to produce a document ranking; comparison with the relevance
judgement yields recall and precision levels.]
Figure 1-1 Model for experiments in information retrieval systems.
Using the standard queries in the test collection, the retrieval system under
evaluation is used to perform a search in the document set. The result of the
search is a list of document identifiers with the documents assumed most relevant
being ranked first. This ranked list is then compared with the relevance
judgements. The relevance judgement itself does not include any ranking; it
only contains the identifiers of documents judged relevant to the query.
Using the recall and precision formulae, the recall and precision levels are then
measured.
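The comparison step can be sketched as a small recall/precision computation at a chosen rank cut-off. The document identifiers and the cut-off below are illustrative:

```python
# Recall and precision for a ranked retrieval run: the system's ranked list
# is compared against the unranked set of relevance judgements at a cut-off.
def recall_precision(ranking, relevant, cutoff):
    retrieved = set(ranking[:cutoff])
    hits = len(retrieved & relevant)
    recall = hits / len(relevant)   # fraction of all relevant documents found
    precision = hits / cutoff       # fraction of retrieved documents that are relevant
    return recall, precision

ranking = ["d3", "d7", "d1", "d9", "d4"]   # system output, most relevant first
relevant = {"d3", "d1", "d5"}              # judged relevant (no ranking)

print(recall_precision(ranking, relevant, cutoff=3))  # recall 2/3, precision 2/3
```

Evaluating at several cut-offs yields the recall-precision curves conventionally reported for test-collection experiments.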
1.6 Thesis Overview
Researchers have adopted artificial intelligence to solve the problem of
uncertainty across different knowledge domains. We have adopted one particular
artificial intelligence technique, namely that of the Bayesian network, to solve the
problem of uncertainty in information retrieval.
In the next chapter, Chapter 2, a summary of the current state of
information retrieval is given. The research problem introduced in the current
chapter will be discussed further there. We also present a
comparison of the two major existing retrieval models: the vector space and the
probabilistic models.
Chapter 3 describes the development of Bayesian network theory. The use
of inference in the Bayesian network is also discussed in this chapter.
Based on the information discussed in chapter 2 and chapter 3, we present
a semantically sound Bayesian network model for information retrieval in chapter
4. We show that our model provides a correct semantic interpretation of the
retrieval model and also provides a general model for information retrieval
through its effectiveness to simulate existing models using appropriate network
representations.
One major consideration in implementing a Bayesian network for
information retrieval is the computational complexity inherent within the
network. Chapter 5 investigates the possibilities of adopting an approximation
model that can reduce the computational complexity in the network in order to
make the implementation practical.
We report the results of our experiments in chapter 6. Different
probability estimations and their effect on the performance of the system are
tested and reported in this chapter. We also compare the performance of our
network model with other well-known retrieval models.
Chapter 7 introduces an evaluation model that can be used to measure the
effectiveness of approximation models introduced in chapter 5. The evaluation
model enables us to choose the optimal approximation without performing
extensive retrieval performance tests. Finally, we provide the conclusion of our
research and possible future direction in chapter 8.
Chapter 2
Automatic Information Retrieval
2.1 Introduction
Information retrieval systems are designed to help people extract useful or
interesting information from document collections. Information or document
retrieval systems are not recent innovations. They have existed since the first libraries,
in the form of manual library catalogue systems. Since that time, information
retrieval systems have changed rapidly due to the growth in the amount of textual
information available in both digital and paper format. This dramatic increase in
available information has driven the need for the development of automatic
information retrieval.
In this chapter, we present models of the retrieval systems and their
associated problems. We organise this chapter into three major parts. The first
section defines the information retrieval models and their problem domain. The
second provides detailed explanations of those parts that constitute the
information retrieval models. The third and last section examines some methods
that can be used for improving the performance of information retrieval systems.
2.2 Information Retrieval Model
An information retrieval system involves three major tasks (figure 2-1), namely
document indexing, query formulation and the use of a matching function. The
document indexing task involves building and organising representations for each
document involved in the collection. Query formulation is a similar task to that of
document indexing, translating the user’s information needs to a format which can
be understood by a matching function. Document and query indexing are
discussed in detail in section 2.3.
[Figure: query formulation and document indexing produce query and document representations; the matching function compares them to return relevant documents, which the user then reads]
Figure 2-1 Information retrieval task model.
Once the two representations are built, the matching function will use both
the document and query representations to find those documents judged to be
relevant by the system. However, the documents returned by the system may not
necessarily be relevant from the user’s point of view. The two main factors that
influence the disparity between the set of documents judged relevant by the system
and those perceived to be relevant by the user to their original query are natural
language ambiguity and the possible limited background knowledge of users on
the query subject.
The first problem of natural language ambiguity results from the fact that a
concept may be expressed in many ways. For example, consider the word
windows. A user may use this word to search for documents explaining windows
based operating systems or for documents explaining how to classify different types
of architecture by looking at the shape of windows. The formulation of methods
to overcome the problem of the ambiguity in natural languages is a major
objective of information retrieval research.
The second problem, that of limited background knowledge, from the
point of view of information retrieval research, may not be completely eliminated
since it is partially the responsibility of the users. Upon the delivery of the
documents judged relevant by the retrieval system, the users may read the
documents to expand their knowledge (we refer to this as the reading process).
As the users’ knowledge of the subject expands, the query submitted to the system
can be refined using the new knowledge learnt. Therefore, the responsibility of the
system lies with providing means of refining and resubmitting the query that
reflects the additional knowledge learnt. This facility is known as relevance
feedback (section 2.5.4 discusses relevance feedback in more detail).
The reading process we discussed in the previous paragraph plays an
important role in solving the problem of users’ limited background knowledge
concerning the topic of a query. Indeed, the emphasis on the role of the reading
process differentiates information retrieval systems from other information systems
like data or knowledge retrieval systems. Although most of the literature to date
uses the terms information retrieval system and data or knowledge retrieval
system interchangeably, further investigation of the area reveals that information
retrieval systems are in fact a broader generalisation of data or knowledge retrieval
systems. In essence, the three systems differ in the nature of the queries involved
and in the expected result of the queries.
Data retrieval systems provide users with the ability to retrieve specific
data. Thus, the queries in data retrieval systems are necessarily very precise in
nature. Aside from the query, the data are usually also organised in a well defined
structure. An example of a data retrieval system is a relational database.
Knowledge retrieval systems provide users with the ability to find answers to
specific questions. Unlike data retrieval systems, knowledge retrieval systems’
data may not necessarily be well structured. However, in both knowledge retrieval
and data retrieval systems queries on the data are very specific and precise.
On the other hand, a query in an information retrieval system, as we have
stated previously, may involve ambiguity or uncertainty. The user of an
information retrieval system does not search for specific data as in data retrieval,
nor search for direct answers to a question as in knowledge retrieval. The
information or knowledge is acquired by the user through the reading of the
documents. For example, a user may want to get information on the topic cheap
production methods for assembled electronic goods. This does not necessarily
imply that the user wants a specific answer to the specific question, What are the
cheap methods? or How do the cheap and expensive methods differ? Even in the
situation whereby one has some specific questions in mind, the aim is to acquire
overall information such that not only those questions but also others suggested by
reading the documents can be answered.
As well as increasing the user’s knowledge of the subject behind the
query, the reading process may clarify the relationship between the user’s needs
and those documents perceived to be relevant by the users since the relationship
between those needs and what information meets them is not necessarily obvious.
For instance, the user’s query on cheap production methods for assembled
electronic goods may be met by the article entitled “Assembly line workers in
third world countries: human rights vs national income”. This article may not
specifically discuss how to cheaply produce electronic goods, but it makes the related
assertion that cheap labour in third world countries can be a way of reducing
production costs. There is, therefore, a link in terms of relevance between the
user’s query and the article. However, if the information retrieval system in use
based its matching function solely on matching keywords, as do traditional
systems, the above article may not be retrieved because the article may not
actually contain the words cheap, production or electronic goods.
As we have stated, traditional information retrieval systems performed
matching at the keyword level. We have shown through the above example that
this approach may miss relevant articles which match the query at the concept
level. As a result, adding a knowledge base into the systems has become necessary
in order to provide better retrieval. Current research in information retrieval
systems aims to perform matching at the concept level. Section 2.4 discusses
different methods used in defining the matching functions. In the next section, we
investigate the two other tasks involved in the information retrieval, namely
document and query indexing.
2.3 Document and Query Indexing
Document and query indexing are very important tasks in any information retrieval
system. However, document and query indexing are also considered to be the
most difficult tasks to carry out successfully. The indexing task is considered
difficult to implement because natural language ambiguity introduces uncertainty
in the text analysis process of indexing documents and queries. Salton [Salton88]
suggests the indexing process is not required if the collection is considered small.
In a small collection, a full text scanning method will be more efficient in
retrieving the documents from the collection than using matching function on
document and query indexes.
Today’s document databases are large due to the amount of information
available in digital form and this volume of information will only increase with
time. Full text scanning methods are impractical for such databases given the
capability of the current computer technology. In other words, document indexing
has to be performed regardless of the problem of uncertainty in text analysis
during the indexing process. As a result, reducing uncertainty becomes part of the
problem domain of both the document and query indexing tasks in information
retrieval systems.
2.3.1 Indexing Problems
Three main factors contribute to the problem of uncertainty in document and
query indexing. Firstly, there is the problem posed by the variability in the ways
that a concept may be expressed [Fuhr86]. One word may have different
interpretations in different contexts. This is partly a matter of language.
Considering the same query example introduced earlier in this chapter, cheap
production methods for assembled electronic goods, the word assembled may be
interpreted as a unit of construction or as how the assembly will be carried out, e.g.
machine-made.
The second problem may occur due to underspecification of the request.
Sometimes a user does not provide enough details or specifications in the query.
This produces a vague request, such as the qualification of cheap methods in the
example query. Does this mean cheap in the sense of economical production or
cheap as in low quality? Request underspecification can also occur when the
request itself is incomplete. For example, considering our example query, the user
may want information not only about the production method of assembled
electronic goods, but also about design aspects of the goods. However, it is
unlikely that the system will retrieve documents containing design aspects of cheap
electronic goods since design and production cannot be generalised into
“production method”. Both vague and incomplete requests contribute to the
request underspecification, the difference between them being that in the first case,
a vague request, the user may not realise the inherent ambiguity in the query,
whereas in the second case, that of an incomplete request, the user has failed to
include sufficient detail in the query. Request underspecification is less obvious
than the variability problems. Nevertheless, it still contributes to uncertainty in the
information retrieval process. Both request underspecification and variability
problems follow from the user’s ignorance before the reading process is
undertaken.
The third problem is that of document descriptor reduction. The following
example illustrates this problem: in the article “Assembly line workers in the third
world countries: human rights vs national income”, the term national income has
a narrower meaning than may be expected. The article actually describes national
income, but in the narrower sense of national income generated from export. In
this case, the reduction of a document description by the author led to an indirect
description, a generalisation of export-generated income to national income.
This problem can never be completely avoided - the author of a document always
leaves much unsaid on a subject - nor is it always harmful.
descriptions of document contents may seem to increase ambiguity, but it can
increase both the efficiency of matching and the effectiveness of document
classifying.
Information retrieval can thus be seen to impose conflicting demands on
text descriptors. It requires that they be generalising but accurate, as well as
discriminating and summarising. Meeting these demands becomes the fundamental
goal of an indexing language, a language that is required to perform the indexing
process [Lewis96].
2.3.2 Indexing Language
In the previous section, we have examined problems associated with the indexing
process. Since human beings have the capability to handle ambiguity in natural
language, the obvious solution to the indexing problems would seem to be
manual indexing. In fact, the indexing process in early information retrieval
systems was carried out manually by human experts in the subject domain. To
date, manual indexing is still considered superior to automatic indexing in its
capacity to handle uncertainty. However, manual indexing suffers from high
operational costs and would be almost impossible to perform in today’s document
databases due to their size. Automatic indexing has become an active area of
information retrieval research. To perform automatic indexing, an indexing
language needs to be defined. The indexing language consists of a term
vocabulary and methods of constructing representation.
An indexing language’s term vocabulary can be either derived from the
text of the document described or may be arrived at independently from the text.
The use of elements of vocabulary derived from the text itself is called the natural
language approach. The other approach, which uses terms independent of the text
in the vocabulary is known as the controlled vocabulary method.
There are many representation construction methods in text retrieval
systems [Milstead89]. However, they all share the common goal of indexing, that
is, to create document and query representations which are both summarising
and discriminating. To achieve this goal, the index construction methods perform
the following steps:
1. Eliminate common terms from the document or query which are bad
discriminators. The systems usually have a list of common terms which
are kept in a stop word list (refer to section 2.5.1).
2. Break down the document and query into individual terms.
3. Eliminate suffixes and prefixes from the terms.
4. Assign weights to the terms to identify those significant in the
collection.
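The four steps above can be sketched as a small indexing routine (the stop word list and the crude suffix rule below are illustrative stand-ins for a real stop list and stemmer):

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "for", "and", "is"}  # illustrative stop word list

def index_text(text):
    # Steps 1-2: break the text into individual terms and drop common terms.
    terms = [t for t in text.lower().split() if t not in STOP_WORDS]
    # Step 3: crude suffix stripping; a real system would use a proper stemmer.
    stems = [t[:-1] if t.endswith("s") else t for t in terms]
    # Step 4: assign a weight to each term; here simply its term frequency.
    return Counter(stems)

index = index_text("cheap production methods for assembled electronic goods")
# "for" is removed as a stop word; "methods" and "goods" lose their suffix.
print(index)
```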
One of the common methods for assigning weights to the indexed terms uses
statistical methods, with each term given a weight according to its importance to
the collection. The first such weighting scheme was introduced by Luhn [Luhn58].
He proposed the use of a term frequency (tf) to measure the term’s significance in
the document. In fact it provides a local weight calculation for each term and can
be formulated as:
x_{i,k} = f_{i,k}    (2.1)
where
xi,k is the weight of term i in document k
fi,k is the frequency occurrence of term i in document k
This idea was developed further by Sparck-Jones [Sparck-Jones72] who added an
inverse document frequency(idf) to the weighting scheme as a global weight
which can be formulated as follows:
y_i = log(N / f_i)    (2.2)
where
yi is term i inverse document frequency.
N is number of documents in the collection
fi is number of documents in which term i appears.
Global weighting is important for discriminating terms because very high
frequency words cannot be considered to be good discriminators if they appear in
most of the documents in the collection. By taking into consideration the number
of documents containing a given term, this problem can be tackled.
Combining the term frequency (equation 2.1) and inverse document
frequency (equation 2.2), the final weight of a term in the collection can be
calculated as
w_{i,k} = x_{i,k} × y_i    (2.3)
where
wi,k = weight of term i in document k
xi,k = term i's term frequency in document k
yi = term i's inverse document frequency
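As an illustration, the tf-idf weighting of equations 2.1-2.3 can be computed for a toy collection (the documents and the function name are invented for this sketch):

```python
import math

def tfidf_weights(docs):
    """Compute w_{i,k} = x_{i,k} * y_i (equations 2.1-2.3) for a toy collection.

    docs: list of documents, each given as a list of terms.
    Returns a dict mapping (term, document index) to its weight.
    """
    N = len(docs)
    # f_i: number of documents in which term i appears.
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = {}
    for k, doc in enumerate(docs):
        for term in set(doc):
            tf = doc.count(term)          # x_{i,k}, equation 2.1
            idf = math.log(N / df[term])  # y_i, equation 2.2
            weights[(term, k)] = tf * idf # w_{i,k}, equation 2.3
    return weights

docs = [["cheap", "production", "cheap"], ["production", "cost"]]
w = tfidf_weights(docs)
# "production" occurs in every document, so its idf (and hence weight) is zero.
print(w)
```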
There have been some objections from natural language processing
researchers (namely [Strzalkowski93] and [Lewis96]) to the use of pure statistical
methods for estimating the terms’ weight in a collection. Their objection concerns
mostly their doubt in the ability of such a weighting scheme to handle phrases. It
becomes more difficult to justify the assertion of term independence in a collection
once phrases are introduced. For example, consider the phrase take over. Weights
used to discriminate documents containing this phrase may not be successful
because the individual words take and over are common words (i.e. they may have
a small global weighting). Although the idea of using phrases is quite attractive, it
has not been shown for all cases that this form of retrieval exhibits advantages
over non-phrase supporting information retrieval systems.
Indexing languages may be classified as pre-coordinate or post-coordinate
according to the time at which they choose to organise and use the terms resulting
from the indexing process. In pre-coordinate indexing, the terms are coordinated
at the time of indexing by logically combining any index terms as a label which
identifies a class of documents. In post-coordinate indexing, the same class of
documents would be identified at search time by combining classes of documents
labeled with the individual terms.
In the next section, we examine how matching functions use the document
and query representations resulting from the indexing process.
2.4 Matching Functions
Matching functions are the main engine of information retrieval systems. Once
representations for documents and queries are built, these representations are used
by the matching function to achieve the three following related tasks:
1. To locate or identify items related to a user query.
2. To identify both related and distinct documents in the collection.
3. To predict the relevance of a document to the user’s information request
through the use of index terms with well defined scope and meaning.
Many matching functions have been proposed over the years by researchers in the
information retrieval area. In this section we examine three different matching
function models, namely the Boolean, vector space and probabilistic models.
2.4.1 Boolean Model
The Boolean model is considered to be the simplest matching function in
information retrieval. Relationships or similarities between individual documents
are not utilised, neither are any relationships between query terms. In systems
which use the Boolean model, the users’ query is represented only as
combinations of terms that a relevant document is expected to contain. For
example, one may require all documents which contain the two terms (design and
production) or the three terms (cheap, electronics and good). The query Q can be
formulated as
Q = (design AND production) OR (cheap AND electronics AND good)
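A minimal sketch of how this Boolean query could be evaluated against a document's set of index terms (the example documents are invented for illustration):

```python
def matches(doc_terms):
    """Evaluate Q = (design AND production) OR (cheap AND electronics AND good).

    doc_terms: set of index terms present in the document.
    Returns a binary (True/False) retrieval decision, as in the Boolean model.
    """
    return (("design" in doc_terms and "production" in doc_terms)
            or {"cheap", "electronics", "good"} <= doc_terms)

print(matches({"design", "production", "line"}))  # True
print(matches({"cheap", "electronics"}))          # False
```

Note that every matching document gets the same (binary) score, which is exactly the drawback discussed next.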
Simplicity of implementation is the main advantage of the Boolean model. The
documents’ similarity to the query is calculated solely on the basis of a binary
decision as to whether the query terms exist in the document representation. As a
result, documents retrieved by the Boolean model are weighted equally against the
users’ query. Thus the first document retrieved is not necessarily the most relevant
document. This drawback of the Boolean model due to the binary nature of its
retrieval decision function is frequently cited [Croft86, Salton83, Salton88,
Losee88].
The solution to the problem of equally weighted documents exhibited in the
Boolean model has become a goal of research in information retrieval. This
research has concentrated on building a retrieval model that has the ability to
weight the relevance of the documents against the query. In the following
sections, 2.4.2 and 2.4.3, we examine two other models of retrieval, the vector
space and probabilistic models, which do produce a ranked output.
2.4.2 Vector Space Model
The vector space model represents both the documents and queries as a vector of
terms. Both document and query representations are described as points in T
dimensional space, where T is the number of unique terms in the document
collection. Figure 2-2 shows an example of a vector space model representation
for a system with three terms.
[Figure: document vectors D1 and D2 and query vector Q plotted in a three-dimensional term space; α is the angle between D1 and D2, β the angle between D1 and Q]
Figure 2-2 Three dimensional vector space.
Each axis in the space corresponds to a different term. The position of each
document vector in the space is determined by the magnitude (weight) of the
terms in that vector. A similarity computation measuring the similarity between a
particular document vector and a particular query vector as a function of the
magnitudes of the matching terms in the respective vectors may be used to identify
the relevant documents. The simplest such scheme to calculate the similarity is to
assume that the document containing the most terms from the query will be the
most relevant. Thus the similarity between a query Q and the kth document, Dk,
can be calculated as an inner product of term vectors in Q and Dk. Formally it can
be represented as
sim(Q, D_k) = Σ_{i=1}^{n} q_i t_{ik}    (2.4)
where
Q is the query vector
Dk is the kth document vector in the collection
qi is the term i in the query Q
tik is the term i in the document Dk
n is the total number of query terms.
Besides the inner product approach, another well understood (and more
widely accepted in information retrieval systems) vector similarity measure is the
cosine correlation function. In the cosine correlation function, the angle between
documents or documents and a query measures the similarity between the vectors
that represents them. Consider the situation depicted in figure 2-2. The similarity
between D1 and D2 would be measured by the angle α. The similarity between
documents D1 to query Q is measured by angle β. The cosine correlation
function is shown in Table 2-1 (which also includes some other common vector
space similarity measures).
Similarity measure      Formula for sim(Q, D_k)

Inner product           Σ_{i=1}^{n} q_i t_{ik}

Cosine correlation      Σ_{i=1}^{n} q_i t_{ik} / [ (Σ_{i=1}^{n} q_i^2)^{1/2} (Σ_{i=1}^{n} t_{ik}^2)^{1/2} ]

Dice measure            2 Σ_{i=1}^{n} q_i t_{ik} / ( Σ_{i=1}^{n} q_i^2 + Σ_{i=1}^{n} t_{ik}^2 )

Jaccard measure         Σ_{i=1}^{n} q_i t_{ik} / ( Σ_{i=1}^{n} q_i^2 + Σ_{i=1}^{n} t_{ik}^2 − Σ_{i=1}^{n} q_i t_{ik} )

Table 2-1 Similarity measures.
We note that the numerator of the cosine formula gives the sum of the
product of the matching terms between query Q and document Dk. That is, when
binary indexing is used, the numerator is the total number of matching terms in
query Q and document Dk. When the indexing is not binary, the numerator
represents the sum of the products of term weights for the matching terms in Q
and Dk. The denominator in the cosine similarity function acts as a normalising
factor because it takes into consideration the number of terms contained in a
document. The longer the documents, that is the more terms used to describe the
documents, the smaller the cosine similarities. Thus, unlike the inner product, the
cosine measure takes into consideration the effect of a document’s length. Inner
product measures always discriminate against short documents because short
documents always produce a shorter term vector sum compared with long
documents.
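The similarity measures of table 2-1 can be sketched directly from their formulas (the example term vectors are invented; binary or weighted vectors both work):

```python
import math

def inner(q, d):
    # Inner product: sum of products of matching term weights.
    return sum(qi * di for qi, di in zip(q, d))

def cosine(q, d):
    # Cosine correlation: inner product normalised by the vector lengths.
    return inner(q, d) / math.sqrt(inner(q, q) * inner(d, d))

def dice(q, d):
    return 2 * inner(q, d) / (inner(q, q) + inner(d, d))

def jaccard(q, d):
    return inner(q, d) / (inner(q, q) + inner(d, d) - inner(q, d))

q = [1, 1, 0]  # query vector over three terms
d = [2, 0, 1]  # document vector over the same terms
# cosine(q, d) normalises away document length; inner(q, d) does not.
print(inner(q, d), cosine(q, d))
```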
Using such similarity functions, the vector space model can produce a
ranked output. The capability to produce ranked document output gives the
vector space model an advantage over the Boolean model. However, the lack of
formal methods to support the vector space model in handling uncertainty has
driven research in information retrieval towards seeking models that can support
uncertainty. In the next section, we analyse the probabilistic matching function
model which does provide more formal support to handle uncertainty.
2.4.3 Probabilistic Model
The probabilistic model attempts to address the uncertainty problem in
information retrieval through the formal methods of probability theory. Unlike in
the vector space model, in this model the document ranking is based on the
probability of the relevance of documents to the query submitted by the user.
This has been formalised and is known as the Probability Ranking Principle
[Robertson77]. There are three different models of probabilistic retrieval: binary
independence [Robertson76, Rijsbergen79], the unified model [Robertson82] and
retrieval with probabilistic indexing (RPI) [Fuhr89]. The models differ in their
treatment of and assumptions behind the probability of relevance. In this section,
we analyse the formulation of these probabilistic models and state the assumptions
associated with them.
2.4.3.1 Binary Independence Model
As the name implies, this model assumes that the index terms exist independently
in the documents and we can then assign binary values to these index terms. For a
further illustration of this model, consider a document Dk in a collection,
represented by a binary vector t = (t1, t2, t3, …, tu), where u represents the total
number of terms in the collection, ti = 1 indicates the presence of the ith index term and
ti=0 indicates its absence. A decision rule can be formulated by which any
document can be assigned to either the relevant or non-relevant set of documents
for a particular query. The obvious rule is to assign a document to the relevant set
if the probability of the document being relevant given the document
representation is greater than the probability of document being non relevant, that
is, if:
P(relevant|t) > P(non-relevant|t) (2.5)
Using Bayes’s theorem, equation 2.5 can be rewritten as:
P(t|relevant) > P(t|non-relevant)    (2.6)
This decision rule, when expressed as a weighting function g(t), becomes:
g(t) = log P(t|relevant) − log P(t|non-relevant)    (2.7)
This means we can now use the weighting function g(t) to rank the documents
according to their g(·) value, such that the more highly ranked a document is, the
more likely it is to be relevant to the query.
Since the calculation of the probabilities P(t|relevant) and P(t|non-relevant) is
difficult, we have to assume that the index terms occur independently in the
relevant and non-relevant documents, so that we can calculate P(t|relevant) as:
P(t|relevant)=P(t1|relevant)P(t2|relevant)…P(tn|relevant) (2.8)
and similarly for P(t|non-relevant).
Now let:
pi=P(ti=1|relevant) (2.9)
qi=P(ti=1|non-relevant) (2.10)
So pi and qi are the probabilities that an index term occurs in the relevant or non-
relevant document sets respectively. Then
P(t|relevant) = Π_{i=1}^{n} p_i^{t_i} (1 − p_i)^{1−t_i}    (2.11)

P(t|non-relevant) = Π_{i=1}^{n} q_i^{t_i} (1 − q_i)^{1−t_i}    (2.12)
Substituting 2.11 and 2.12 into 2.7, we have

g(t) = Σ_{i=1}^{n} t_i log[ p_i(1 − q_i) / (q_i(1 − p_i)) ] + Σ_{i=1}^{n} log[ (1 − p_i) / (1 − q_i) ]    (2.13)
The second summation in equation 2.13 is constant for a given query and does not
affect the ranking of documents. Since probabilistic models assume the relevant
and non-relevant sets can only be calculated for a single query, this second
summation can be omitted from the calculation. However, it can be interpreted as
a cut-off value to the retrieval function. That is, only documents that have a
relevance value greater than this constant value are retrieved as relevant
documents. This capability in fact gives the probabilistic model an advantage over
the vector space model. In the vector space model, such a cut-off value has to be
found through trial and error, because its mathematical model does not provide
support for it.
Omitting the second part of equation 2.13, the weighting function g(t) can be
formulated as

g(t) = Σ_{i=1}^{n} t_i log[ p_i(1 − q_i) / (q_i(1 − p_i)) ]    (2.14)
Observation of equation 2.14 shows that g(t) is equivalent to a simple matching
function between query and document where query term i has the weight
log[ p_i(1 − q_i) / (q_i(1 − p_i)) ]. This weighting scheme was first introduced by
Robertson and Sparck-Jones [Robertson76].
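The ranking function of equation 2.14 can be sketched as follows (the probabilities p_i and q_i below are invented for illustration; in practice they must be estimated from relevance information, as discussed next):

```python
import math

def rsj_weight(p, q):
    """Robertson/Sparck-Jones term weight log[p(1-q) / (q(1-p))] (equation 2.14)."""
    return math.log((p * (1 - q)) / (q * (1 - p)))

def g(doc_vector, p, q):
    """Ranking score g(t) = sum_i t_i * rsj_weight(p_i, q_i) over the query terms.

    doc_vector: binary term vector t for the document (1 = term present).
    p, q      : per-term probabilities p_i and q_i (equations 2.9 and 2.10).
    """
    return sum(t * rsj_weight(pi, qi) for t, pi, qi in zip(doc_vector, p, q))

# Illustrative probabilities for two query terms:
p = [0.8, 0.5]   # p_i = P(t_i = 1 | relevant)
q = [0.2, 0.5]   # q_i = P(t_i = 1 | non-relevant)
t = [1, 1]       # both terms present in the document
# The second term is equally likely in both sets, so it contributes nothing.
print(g(t, p, q))
```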
As in any probabilistic model, a prior probability needs to be defined
before any inference can be calculated. Therefore, probabilistic models rely on the
major assumption that relevance information is available in the collection to define
the prior probability; that is, that some or all of the relevant and non-relevant
documents have been identified. In reality, this assumption is very difficult
to satisfy because the relevance information is not easy to obtain at the early stage
of a search. One way of overcoming this problem is to use an interactive search at
an early stage of a search. An interactive search can be used to provide the
information retrieval systems with relevance information. The users’ judgement of
the document ranking in this search is then used as relevance information in the
next search [Sparck-Jones79].
In the situation where there is no relevance information available or in the
case of a non-interactive search, a combination of similarity measures shown in
table 2.1 with the inverse document frequency can be used to define prior
probability [Croft79]. Consider the inner product similarity measure. The
combined measure using the inverse document frequency in this case can be
formulated as:
sim(Q, D_k) = Σ_{i=1}^{n} f_{iq} f_{ik} log(N / f_i)    (2.15)
where
fiq is term i's term frequency in query Q.
fik is term i's term frequency in document Dk.
N is the total number of documents in the collection.
fi is the total number of documents in which term i exists.
n is the total number of query terms.
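Equation 2.15 can be sketched directly (the toy frequencies below are invented for illustration):

```python
import math

def sim(query_tf, doc_tf, df, N):
    """Equation 2.15: sum over query terms of f_iq * f_ik * log(N / f_i).

    query_tf : {term: frequency of the term in the query}
    doc_tf   : {term: frequency of the term in the document}
    df       : {term: number of documents containing the term}
    N        : total number of documents in the collection
    """
    return sum(fq * doc_tf.get(term, 0) * math.log(N / df[term])
               for term, fq in query_tf.items())

query_tf = {"cheap": 1, "production": 1}
doc_tf = {"cheap": 2, "cost": 1}
df = {"cheap": 10, "production": 50}
# Only "cheap" matches the document, so only it contributes to the score.
print(sim(query_tf, doc_tf, df, N=1000))
```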
We have stated above that the probabilistic model assumes that the terms
in the document are distributed independently. However, Rijsbergen [Harper78]
argues that this assumption is often made as a matter of mathematical
convenience, although it is generally agreed that exploitation of associations
between items of information retrieval systems, such as index terms or documents
will improve the effectiveness of retrieval. In our studies we analyse the possibility
of exploiting these associations in order to improve retrieval performance. We use
the Bayesian belief network model, which will be explained in detail in chapter 4.
2.4.3.2 Unified Model
The Unified model exists as a combination of Maron-Kuhn’s model [Maron60]
and Cooper’s model [Cooper78] with the binary independence model. The Maron-
Kuhn model differs from the binary independence model in the assumption of
document relevancy. In their model, a record of the number of times each query is
submitted and of which documents are judged relevant or non-relevant to each
query is kept. This information is then used to determine the frequency with which
a document has been judged relevant to each query submitted. This frequency
in turn is used to estimate the probability of relevance and documents are ranked
accordingly.
Thus, this model combines the judgements of multiple users in order to
compute the probability of relevance with respect to a set of equivalent queries.
This differs from the binary independence model, which views the association
between a document and its index terms as fixed by the collection and independent
of the use of the index terms in the queries. In other words, the relevance
judgment in the binary independence model does not come from the association of
the index terms in the query and the documents. To illustrate the difference in
more detail, consider the following, letting:
Q be the set of all (past and future) queries of the retrieval system.
D be the set of all (past and present) documents in the system.
QS be the set of queries that use the same query terms.1
DS be the set of all the documents to which the same index terms have
been applied.
qm be an individual query (qm ∈ QS).
dk be an individual document (dk ∈ DS).
R be the event of relevance.
1 It is assumed that the same query terms may represent different information needs. The same applies to documents: documents represented by the same index terms may contain different information.
The set consisting of all pairs (dk,qm) represents the event space. The
relevance R is a subset of this event space. Using the above notation, the methods
of calculating relevance according to the Maron-Kuhn, binary independence and
unified models, respectively, are as follows:
Maron-Kuhn model: P(R|QS,dk).
Binary Independence: P(R|qm,DS).
Unified: P(R|dk,qm).
The unified model combines the estimates provided by the Maron-Kuhn model and
the binary independence model to derive the relevance judgment of an
individual document for a query. The unified model attempts to generalise the two
models such that it reduces to the Maron-Kuhn model when only query history is
available, and to the binary independence model when only document
representation data is available. When both query and document representation
data are available, the full unified model is used. However, the combination of the two
models provided by the unified model does not solve the problems of probability
estimation inherent in the two individual models [Fuhr92]. Thus, a better model,
one that incorporates good probability estimation when no initial statistical data is
available, is still required.
2.4.3.3 Retrieval with Probabilistic Indexing (RPI) Model
The RPI model is a generalisation of the binary independence model. This model
makes a more detailed assumption about the relevancy of the index term
assignment to the document than the binary independence model does.
To illustrate the model, consider the following. Let:
dk represent a document in the collection,
ti be the binary vector (t1,t2,t3,…,tn) of index terms in document dk,
qm be a query,
C denote the event of correctness.
Unlike the binary independence and unified models, which calculate the
probability of relevance over document-query pairs (dk,qm), the RPI model measures the correctness of
the assignment of ti to dk by assigning a value to C. The probability is now measured
as P(C|ti,dk,qm). The decision as to whether the assignment of ti to dk is correct or
not can be specified in various ways, for example, by comparison with the results
of manual indexing, or by comparing retrieval results. Thus, parameter
estimation, or more specifically the estimation of the correctness of the index term
assignment, still relies on ad-hoc estimation.
We have in this section discussed several information retrieval matching
function models, namely the Boolean, vector space, binary independence, unified
and RPI models. Each has its own drawbacks. The vector space
model lacks mathematical support for the handling of uncertainty. The
probabilistic approach attempts to provide models with strong mathematical
foundations but falls short due to the need for some ad-hoc probability estimations.
In chapter 4 we will introduce a probabilistic retrieval model, based on Bayesian
networks, which overcomes these problems of probability estimation.
2.5 Increasing Retrieval Performance
Regardless of the limitations of the retrieval models discussed in the previous
section, there are several methods available to improve the retrieval performance
of such information retrieval systems. These methods are usually not considered to
be part of the retrieval model as such, but rather as additional components of the
retrieval system. Before we discuss these methods further, we present a
common definition of performance measurement in information retrieval systems.
An information retrieval system finds documents that are intended to be
relevant to the user’s query. In a very real sense only the user knows exactly what
is relevant to his or her information needs. An information retrieval system can only
suggest “relevant” documents for the user to read. In this situation, providing
a uniform performance evaluation can be difficult. Research in this area, however,
has provided common performance measurements, based on recall-precision on a
standard test collection. The recall level describes the completeness of the
retrieval; precision represents the accuracy of the retrieval. These can be defined
formally as follows:
recall = r / R    (2.16)

precision = r / N    (2.17)
where
r is the number of relevant documents retrieved for a given query.
R is the number of relevant documents in the collection for a given query.
N is the number of documents retrieved for a given query.
Both high recall and high precision are desirable in information retrieval systems.
However, they are very difficult to achieve simultaneously: high recall usually
comes at the cost of poor precision. Chapter 6 discusses performance
measurement in information retrieval systems in detail.
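As a minimal sketch, equations 2.16 and 2.17 can be computed directly from the sets of retrieved and relevant documents for a query; the document identifiers below are hypothetical:

```python
def recall_precision(retrieved, relevant):
    """Recall (eq. 2.16) and precision (eq. 2.17) for a single query."""
    r = len(retrieved & relevant)                         # relevant documents retrieved
    recall = r / len(relevant) if relevant else 0.0       # r / R
    precision = r / len(retrieved) if retrieved else 0.0  # r / N
    return recall, precision

# 3 of the 4 relevant documents appear among the 5 retrieved.
rec, prec = recall_precision({1, 2, 3, 4, 5}, {2, 3, 5, 9})
# rec = 0.75, prec = 0.6
```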
The following sub-sections 2.5.1-2.5.4 analyse different methods that can
be used to improve the performance of all the retrieval models discussed in section
2.4. These methods may be combined to achieve optimal retrieval.
2.5.1 Stop List
Every word in a language has a meaning. However, not all words have the
ability to distinguish one document from another. For example, the word “the”
will never provide such information. Many retrieval systems therefore provide a stop list, a
list containing such words that have no discrimination capacity. The stop
list is used during document indexing and query formulation: any word within the
list that appears in the document or query is discarded. A word may have
discrimination capacity in one domain and not in another. Thus, different
knowledge domains may employ different stop lists, but caution is required when
adding words to a stop list. A very specific stop list can result in low recall because
the query becomes too specific.
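A minimal sketch of stop-list filtering during indexing; the word list here is purely illustrative, not a recommended stop list:

```python
# Illustrative stop list: words with no discrimination capacity.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def index_terms(text, stop_words=STOP_WORDS):
    """Tokenise a text and discard any word that appears in the stop list."""
    return [w for w in text.lower().split() if w not in stop_words]

index_terms("The retrieval of the documents in a collection")
# -> ['retrieval', 'documents', 'collection']
```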
2.5.2 Term Weighting
A document can be described by the presence or absence of index terms; that is, any
document can be represented by a binary vector. For example, if document dk belongs
to a collection whose vocabulary contains six terms, and dk contains
terms t1, t3, t4 and t5 but not t2 and t6, it can be represented as
dk = (1,0,1,1,1,0)
Every term in the index is treated equally. One may argue that this does
not reflect the real-life situation, where one term or word may be more
important than others. Indeed, many information retrieval systems employ
term weighting to capture the importance of individual terms in the collection, as
we discussed in section 2.3.2.
The term frequency (tf) within a document can indicate the importance of
the terms in the document. In other words, the term frequency can be used to
summarise the contents of a document. However, using within-document
frequency alone is not enough, because it cannot be used to discriminate between
documents in the collection effectively [Sparck-Jones79]. Consider the following
case: the word computer may have a very high frequency in a document belonging
to the Communications of the ACM collection. However, almost every document in that
collection has a high frequency of the word computer, because the collection's domain is
computer theory and application. This shows that the word computer
has no ability to discriminate between the documents. The more documents
represented by a particular term, the less important this term is for
distinguishing one document from another. As we have explained in section 2.3.1,
a good document representation has to be able to summarise and discriminate the
documents at the same time. The inverse document frequency (idf) may be introduced
into the term weighting as a discriminator. The combination of tf and idf is usually
used as in equation 2.3.
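As a sketch, one common tf-idf combination multiplies the within-document frequency by log(N/fi), in the notation of section 2.4; the exact form of equation 2.3 may differ, and the frequencies below are invented:

```python
import math

def tf_idf(f_ik, N, f_i):
    """Weight of term i in document k: within-document frequency f_ik
    multiplied by the inverse document frequency log(N / f_i)."""
    return f_ik * math.log(N / f_i)

# A term that occurs in almost every document gets almost no weight,
# even with a high within-document frequency; a rarer term dominates.
w_common = tf_idf(5, 1000, 990)   # frequent word, present in 990 of 1000 documents
w_rare = tf_idf(2, 1000, 10)      # less frequent word, present in 10 documents
```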
2.5.3 Thesaurus
One obvious problem with query formulation is that there are often many ways to
say the same thing. Introducing a thesaurus to match synonyms and closely
related words is one solution to this problem. It can be used to expand the user’s
query by adding the synonyms or related words to the initial query submitted.
The thesaurus can be generated automatically from the text in the
collection by means of calculating similarity amongst the terms in the collection.
Given the matrix of document-term relations:

      T1   T2   …   Tm
D1   w11  w12   …  w1m
D2   w21  w22   …  w2m
D3   w31  w32   …  w3m
…      …    …   …    …
DN   wN1  wN2   …  wNm

The similarity measure between term Tj and term Tm can be calculated by

sim(Tj, Tm) = Σ_{i=1}^{N} wij wim    (2.18)

where
N is the number of documents in the collection
wij is the weight of term j in document i
Once the similarities for all the terms are computed, a term can be put into
a group if its similarity with at least one member of the cluster, or with all
members of the cluster, exceeds a stated threshold. The first situation is
called single-link classification; the second is called complete-link classification. It
has been claimed that an automatic thesaurus generated from the text of the
domain in which it is used can increase recall by up to 20% [Salton71, Croft88].
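Equation 2.18 and single-link grouping can be sketched as follows; the small matrix and the threshold are illustrative only:

```python
def term_similarity(W, j, m):
    """sim(Tj, Tm) = sum over documents i of w_ij * w_im (equation 2.18).
    W is the document-term matrix: W[i][j] is the weight of term j
    in document i."""
    return sum(row[j] * row[m] for row in W)

def single_link_groups(W, threshold):
    """Single-link classification: a term joins an existing group if its
    similarity with at least one member exceeds the threshold."""
    groups = []
    for t in range(len(W[0])):
        for g in groups:
            if any(term_similarity(W, t, m) > threshold for m in g):
                g.append(t)
                break
        else:                      # no group was similar enough: start a new one
            groups.append([t])
    return groups

W = [[1, 1, 0],                    # terms 0 and 1 co-occur in two documents
     [1, 1, 0],
     [0, 0, 1]]
single_link_groups(W, 1)           # -> [[0, 1], [2]]
```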
2.5.4 Relevance Feedback
Retrieving all relevant documents, that is, achieving a 100% level of recall, has not been
accomplished by existing information retrieval systems. The problem of limited
recall has been recognised as a major difficulty in information retrieval systems
[Lancaster69]. More recently, van Rijsbergen spoke of the limits of providing
increasingly better ranked results based solely on the initial query. He indicated a
need to modify the initial query to enable increased performance after a certain
level of recall is reached [Rijsbergen86].
For many years researchers have suggested relevance feedback as a
solution for query modification, because a user may give a vague or incomplete
initial request, as we discussed for the indexing problems in section 2.3.1. The
feedback given by the user can be used to re-weight the query terms and/or
to expand the query by adding new terms.
In the vector space model, relevance feedback is achieved by
merging the relevant document vectors with the initial query vector. This
automatically re-weights the query terms, adding weight to the initial query
terms for any query terms existing in the relevant documents and subtracting the
weights of those query terms occurring in non-relevant documents. Ide (1971)
formulated this as:
Q1 = Q0 + Σ_{k=1}^{x} Rk − Σ_{k=1}^{y} Sk    (2.19)

where
Q1 is the modified query
Q0 is the original query
Rk is the vector for relevant document k
Sk is the vector for non-relevant document k
x is the number of relevant documents
y is the number of non-relevant documents
The query is also automatically expanded by adding all terms not in the
original query that occur in the relevant and non-relevant documents.
These terms are added with positive or negative values according to whether they
come from relevant or non-relevant vectors respectively. Although the new
query includes new terms from non-relevant documents, the fact that such terms
carry negative weight means that they only contribute to determining the weights
of the new terms introduced by relevant documents.
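Equation 2.19 can be sketched on plain term-weight lists; the query and document vectors below are hypothetical:

```python
def ide_feedback(q0, relevant_docs, nonrelevant_docs):
    """Ide (1971) query modification (equation 2.19):
    Q1 = Q0 + sum of relevant document vectors - sum of non-relevant ones."""
    q1 = list(q0)
    for vec in relevant_docs:               # add weights from relevant documents
        q1 = [q + v for q, v in zip(q1, vec)]
    for vec in nonrelevant_docs:            # subtract weights from non-relevant ones
        q1 = [q - v for q, v in zip(q1, vec)]
    return q1

# Terms present only in a relevant document enter the query with positive
# weight; terms from a non-relevant document enter with negative weight.
ide_feedback([1, 0, 0], [[1, 1, 0]], [[0, 0, 1]])
# -> [2, 1, -1]
```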
Probabilistic retrieval treats the relevant and non-relevant sets equally in re-
weighting the query. Harman [Harman92] suggests that this particular treatment
may cause the poor performance of the probabilistic model in relevance feedback.
Her study has shown that the performance of relevance feedback in probabilistic
retrieval varies from one collection to another.
Another issue that the probabilistic models have to address in
incorporating relevance feedback in the model is that of probability estimations.
The fact that most of the probabilistic models use different probabilistic
estimations for producing the initial document ranking and relevance feedback2
may contribute to the inconsistent ability of probabilistic models to handle
relevance feedback. A model that has a common method for estimating the
probabilities for the initial document ranking and for relevance feedback is required.
A promising solution may be presented by considering an inference model.
Inference models are known to have learning capability. Relevance feedback is, in
fact, a learning process, since the document ranking may change due to new
knowledge learnt from the previous retrieval. Bayesian networks are one such
inference model, and may thus be used as an effective tool for incorporating
relevance feedback into an information retrieval system.
2.6 Summary
We have analysed and discussed the problems inherent in models for information
retrieval systems. The major problem faced by information retrieval systems is the
uncertainty involved due to the ambiguity of natural language. This ambiguity
itself cannot be totally eliminated, which makes the reading process important in
information retrieval and also makes information retrieval systems different from
data retrieval or question-answering systems. Regardless of the ambiguity inherent in
natural language, information retrieval systems must find those documents
assumed relevant to a user's given information need.
We have discussed several approaches to information retrieval models in
this chapter, namely the Boolean, vector space and probabilistic models. With the
2 See section 2.4.3; the estimation of the prior probability used to produce the initial document ranking is derived from ad-hoc estimation [Sparck-Jones79, Croft79].
limitations of these models in mind, there are several methods, usually not
considered part of the model, that can be used to further improve retrieval
performance. These methods include the stop list, thesaurus, term
weighting and relevance feedback. The probabilistic approach may be considered
the best of the approaches because it is based on well-established mathematical
theory for handling uncertainty. However, it still requires improvement in terms
of the development of built-in methods for estimating the probabilities for the initial
ranking and relevance feedback. We will introduce a model based on a Bayesian
network that can overcome this problem in chapter 4.
In the next chapter, we review probability theory in detail, in particular that
of Bayesian belief networks. Bayesian networks are a good candidate for a
framework that can provide the retrieval model with a common probability
estimation method for the initial document ranking and relevance feedback
through their support for inference.
CHAPTER 3
THEORY IN BAYESIAN NETWORKS
3.1 Introduction
Over the last few decades, interest in artificial intelligence research has been
growing rapidly, especially in the area of knowledge based systems. The phrases
knowledge based system and expert system are usually employed to denote
computer systems which incorporate some symbolic representation of human
knowledge. This symbolic representation of knowledge is used in turn by the
computer system to make decisions as if they had been made by a human expert.
By studying the many knowledge based systems developed for
different problem domains, artificial intelligence researchers have found that the
knowledge required for the decision process often cannot be precisely defined. In
fact, many real-life problem domains are fraught with uncertainty. Chapter 2 has
shown that information retrieval systems are not spared uncertainty in their
problem domain. The challenge in the research of building knowledge based
systems can be seen as that of modeling a human expert’s capability for handling
uncertainty. Human experts in particular problem domains are able to form
judgements and take decisions based on uncertain, incomplete or even
contradictory information. Therefore, a good knowledge based system, to be of
practical use, has to perform at least as well as a human expert
in handling uncertainty in a given problem domain.
In this chapter, we introduce a formalism for representing uncertainty
using Bayesian networks and associated algorithms for manipulating uncertain
information. There are many other formalisms, including rule based systems and
fuzzy logic. However, Bayesian networks have been accepted by a large
population of artificial intelligence researchers due to their powerful formalism
for representing domain knowledge and its associated uncertainty. In section 3.2
we recap classical probability theory and Bayes theory. In this section we provide
the formal development of Bayes theorem from that of the classical probability
theory. Section 3.3 reviews the difference between the Bayesian and classical
approaches to probability theory. Section 3.4 discusses the use of Bayesian
networks in knowledge based systems. This section includes a discussion of
the Bayesian network formalism and its properties, including the
implementation of conditional independence.
Any knowledge based system needs to be able to adapt to additional
knowledge that arrives at the system as evidence. The procedure to perform this
operation is known as the inference process. Section 3.5 looks at inference
processes in the Bayesian network. We conclude the chapter with a summary in
section 3.6.
3.2 Bayes Theorem
To understand Bayesian networks, it is important to understand the Bayesian
approach to probability and statistics. In this section, we contrast the Bayesian
view of probability with the classical view. We also present the main
theorem on which Bayesian probability and statistics are based, that is, Bayes
theorem. First, we present the development of the Bayes theorem.
The following derivation follows that of Neapolitan [Neapolitan90].
According to Laplace [Neapolitan90, pp28], probability is defined as:
The theory of chance consists in reducing all the events of some kind to a
certain number of cases equally possible, that is to say, such as we may be
equally undecided about in regard to their existence, and in determining the
number of cases favorable to the event whose probability is sought. The
ratio of this number to that of all the cases possible is the measure of the
probability.
Laplace's definition gives the framework of the classical approach, which states
that every possible outcome of an experiment has an equal chance. We discuss
the meaning of this definition in more detail below. First, we define the
sample space from which the possible outcomes are derived. During
this discussion we use the example of picking a card from a 52 card deck.
Definition 3.1. Let an experiment which has a set of mutually exclusive and
exhaustive outcomes be given. That set of outcomes is called the sample
space and is denoted by Ω.
In our experiment of picking a card from a deck, the sample space Ω is the set of 52
different outcomes. Next, we define an event in a sample space.
Definition 3.2 Let ℑ be the set of subsets of Ω such that
1. Ω ∈ ℑ
2. E1 and E2 ∈ ℑ implies E1 ∪ E2 ∈ ℑ
3. E ∈ ℑ implies Ē ∈ ℑ, where Ē denotes the complement of E
Then ℑ is called a set of events relative to Ω.
According to this definition, an event is simply a proposition with a
corresponding set of possible outcomes in the sample space Ω. For example, if an
event E is the proposition of getting a king from the deck of cards, then there are
4 corresponding possible outcomes in the sample space Ω, namely the king of spades,
king of hearts, king of diamonds and king of clubs. Next, we define the meaning of
a probability value for an event.
Definition 3.3 For each event E ∈ ℑ, there is a corresponding real number P(E),
called the probability of E. This number is obtained by dividing the
number of equipossible alternatives favorable to E by the total number of
equipossible alternatives or outcomes.
According to definition 3.3, the probability of the event that a king turns up is 4/52.
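Definition 3.3 can be checked mechanically by enumerating the equipossible outcomes of the card experiment; this sketch simply counts favorable outcomes:

```python
from fractions import Fraction

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["spades", "hearts", "diamonds", "clubs"]
deck = [(rank, suit) for rank in RANKS for suit in SUITS]   # sample space, 52 outcomes

def prob(event, sample_space):
    """Classical probability: favorable equipossible outcomes / total outcomes."""
    favorable = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(favorable, len(sample_space))

prob(lambda card: card[0] == "K", deck)     # 4/52 = Fraction(1, 13)
```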
Using the above definitions we can now prove some properties of probability
theory and of conditional probability [Neapolitan90].
Theorem 3.1 Let Ω be a finite set of sample points, ℑ a set of events relative to
Ω, and, for each E ∈ ℑ, let P(E) be the probability of event E according to the
classical definition of probability in definition 3.3. Then
1. P(E) ≥ 0 for E ∈ ℑ
2. P(Ω) = 1
3. If E1 and E2 are disjoint events in ℑ, then P(E1∪E2) = P(E1) + P(E2)
Proof. Let n be the number of equipossible outcomes in Ω.
1. If k is the number of equipossible outcomes in E, then, according to
definition 3.3,
P(E) = k/n ≥ 0
2. Following definition 3.3,
P(Ω) = n/n = 1
3. Let E1 and E2 be disjoint events, let k be the number of equipossible
outcomes in E1, and let m be the number of equipossible outcomes in E2.
Then, since E1 and E2 are disjoint, k+m is the number of equipossible
outcomes in E1∪E2. Thus, following definition 3.3,
P(E1∪E2) = (k+m)/n = k/n + m/n = P(E1) + P(E2)
Definition 3.4. Let Ω be the set of sample points, ℑ a set of events relative to Ω,
and P a function that assigns a unique real number to each E ∈ ℑ. Suppose P
satisfies the properties defined by theorem 3.1. Then (Ω,ℑ,P) is called a
probability space and P is called a probability measure on Ω.
Defining a probability space is very important for measuring any
probability value. Laplace [Neapolitan90] states that there is no absolute
probability value: any probability space exists relative to partial information or
knowledge, and different knowledge will generate different probability spaces. For
example, Natalie, a sneaky girl, peeks at the top of the card deck before a card is
drawn from it. She sees that the top card is a king but does not know to which suit
that king belongs. By doing this, Natalie has changed her probability space from
52 possible outcomes to 4 possible outcomes. The probability of the drawn card
being the king of hearts now becomes 1/4 instead of 1/52. This example illustrates
the importance of conditioning the probability on known knowledge or
information. Now, we define the meaning of conditional probability.
Theorem 3.2. Let (Ω,ℑ,P) be the probability space created according to the
classical definition of probability. Suppose E1 ∈ ℑ is nonempty and therefore
has positive probability. Then, assuming that the alternatives in E1
remain equipossible when it is known for certain that E1 has occurred, the
probability of E2 given that E1 has occurred is equal to
P(E1∩E2) / P(E1)
Proof. Let n, m and k be the number of sample points in Ω, E1 and E1 ∩ E2,
respectively. Then the number of equipossible alternatives based on the
information that E1 has occurred is equal to m, while the number of these
alternatives which are favorable to E2 is equal to k. Therefore the
probability of E2 given that E1 has occurred is equal to
k/m = (k/n)/(m/n) = P(E1∩E2)/P(E1)
Definition 3.5. Let (Ω,ℑ,P) be a probability space and E1 ∈ ℑ such that P(E1) >
0. Then for E2 ∈ ℑ, the conditional probability of E2, given E1, which is
denoted by P(E2|E1), is defined as follows:
P(E2|E1) = P(E1∩E2) / P(E1)
Definition 3.6. Let (Ω,ℑ,P) be a probability space and E1,E2,...,En be a set of
events such that for i≠j
Ei ∩ Ej = ∅
and
∪_{i=1}^{n} Ei = Ω
Then the events E1,E2,...,En are said to be mutually exclusive and
exhaustive.
Lemma 3.1. Let (Ω,ℑ,P) be a probability space and E1,E2,...,En be a set of
mutually exclusive and exhaustive events in ℑ such that for 1 ≤ i ≤ n,
P(Ei) > 0. Then for any E ∈ ℑ,
P(E) = Σ_{i=1}^{n} P(E|Ei)P(Ei)
Proof. Since the Ei's are exhaustive, we have that
E = (E∩E1) ∪ (E∩E2) ∪ … ∪ (E∩En)
Therefore, since the Ei's are mutually exclusive, by definition 3.4 we have that
P(E) = P(E∩E1) + P(E∩E2) + … + P(E∩En)
From definition 3.5,
P(E) = P(E|E1)P(E1) + P(E|E2)P(E2) + … + P(E|En)P(En)
Definition 3.5 is known as the classical or traditional view of conditional
probability. The following discussion illustrates the development of Bayes theorem,
which provides a different view of conditional probability.
Theorem 3.3. Bayes Theorem. Let (Ω,ℑ,P) be a probability space and
E1,E2,...,En be a set of mutually exclusive and exhaustive events in ℑ
such that for 1 ≤ i ≤ n, P(Ei) > 0. Then for any E ∈ ℑ such that P(E) > 0,
we have for 1 ≤ j ≤ n
P(Ej|E) = P(E|Ej)P(Ej) / Σ_{i=1}^{n} P(E|Ei)P(Ei)
Proof. Let E1,E2,...,En be a set of mutually exclusive and exhaustive events in ℑ
such that for 1 ≤ i ≤ n, P(Ei) > 0.
It follows from definition 3.5 that
P(Ej|E) = P(Ej∩E) / P(E)
or
P(Ej|E) = P(E|Ej)P(Ej) / P(E)
From Lemma 3.1 we have
P(E) = Σ_{i=1}^{n} P(E|Ei)P(Ei).
If E and E' are any two events such that P(E) and P(E') are both positive, then the
following equality follows directly from definition 3.5:
P(E|E') = P(E'|E)P(E) / P(E')
Notice that in Bayes theorem, the conditional probability is not represented in
terms of joint events as in classical conditional probability. This different treatment of
conditional probability leads to several philosophical differences between the
Bayesian and classical approaches. Section 3.3 compares and discusses these two
approaches towards probability in detail.
In our study, we use Bayes theorem as a diagnostic tool. A typical
diagnostic process consists of a hypothesis that has been postulated and some
evidence that can be used to verify the hypothesis. For a diagnostic process,
Bayes theorem as in theorem 3.3 can be written as
P(H|e) = P(e|H) P(H) / P(e)    (3.1)
P(H|e) represents the belief that we hold in a hypothesis H upon obtaining
evidence e. This belief can be calculated by multiplying our previous belief P(H)
by the likelihood P(e|H), that is, the probability that e will materialise if hypothesis H is true. P(H)
is sometimes called the prior probability and P(H|e) the posterior probability. The
denominator P(e) hardly enters into the calculation, because it is a normalising
constant. We will use this form of Bayes theorem in the rest of the discussion in
this thesis, unless we feel it necessary to return to the general form as in
theorem 3.3.
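A small numerical sketch of equation 3.1, computing P(e) via Lemma 3.1 over the exhaustive pair of hypotheses H and not-H; all the probability values are invented for illustration:

```python
def posterior(prior, likelihood, p_evidence):
    """Equation 3.1: P(H|e) = P(e|H) * P(H) / P(e)."""
    return likelihood * prior / p_evidence

p_h = 0.01                       # prior belief P(H) in the hypothesis
p_e_h, p_e_not_h = 0.9, 0.1      # likelihoods P(e|H) and P(e|not H)
p_e = p_e_h * p_h + p_e_not_h * (1 - p_h)   # P(e) by Lemma 3.1
belief = posterior(p_h, p_e_h, p_e)         # P(H|e), roughly 0.083
```

Observing the evidence raises the belief in H well above the prior, even though the hypothesis remains unlikely in absolute terms.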
3.3 Bayesian vs Classical Probability Theory
We compare the Bayesian and classical views of probability on two
important aspects, namely the meaning of probability and the
meaning of conditional independence. First, we discuss the different
meanings of probability according to the Bayesian and classical views
respectively.
The Bayesian approach views probability as a person's degree of belief in
an event x occurring, given the information available to that person. A probability
of 1 corresponds to belief in the absolute truth of a proposition, a probability
of 0 to belief in the proposition's negation, and the intervening values to
partial belief or knowledge.
Classical probability theory considers the probability of an event x as the
physical probability of the event x occurring. The probability values are acquired
through a number of repeated experiments. The larger the number of experiments
performed, the more accurate the value of the probability. Thus, the classical
approach relies on the existence of the experiments and is not willing to attach
any probability value to an event that is not a member of a repeatable sequence of
events. The Bayesian approach, on the other hand, considers a probability as a
person's degree of belief, so a belief can be assigned to unique events that are not
members of any repeatable sequence of events. For example, consider assigning
a probability to the belief that the Australian team will win the Ashes in 1997,
although the matches have not yet taken place. Although the Bayesian approach is
willing to assign a probability value to this event, the assignment of this
subjective probability should be considered carefully. It must be based on all the
information available to the individual who makes the prediction. This
information may include items that are known to be true, items deducible in a
logical sense, and empirical frequency information. For example, in predicting the
probability of the Australian team winning the Ashes, information about the
current form of all the Australian and England players, the Australian team's past
experience of playing in England, and the weather pattern in England
during summer may be used.
The second main difference between the Bayesian and classical
approaches is their treatment of conditional independence. We define the
conditional independence as follows:
Definition 3.7. Let (Ω,ℑ,P) be a probability space and H and e events in ℑ such
that one of the following is true:
1. P(H)=0 or P(e)=0
2. P(H|e)=P(H)
Then H is said to be independent of e.
Based on this definition, classical probability introduced the following theorem.
Theorem 3.4. Let (Ω,ℑ,P) be the probability space and H and e be arbitrary
events in ℑ. Then H and e are independent if and only if
P(H∩e) = P(H)P(e)
Proof. Let H and e be independent events in ℑ and P(H) > 0.
It follows from definition 3.5 that P(H∩e) = P(H|e)P(e).
From definition 3.7 we have P(H|e) = P(H) for H independent of e.
Combining the two, we have P(H∩e) = P(H)P(e).
Theorem 3.4 shows that the classical probability formalism checks
independence through the equality of the joint probability of the
events and the product of the probabilities of the individual events. The problem with this check
is that the result of the joint probability calculation does not provide
any psychological meaning to the user or developer of the knowledge-based system
about the dependency between the events. Humans cannot easily attach numerical
values to an event, but can easily determine whether two events are independent
by looking at the cause-effect relationship between the events involved. The
Bayesian approach, on the other hand, bases its conditional independence concept
on the human reasoning process.
The Bayesian approach sees the conditional relationship as more basic
than that of joint events. According to this approach, conditional probability
should reflect the organisation of human knowledge, which consists of sets of
evidence e that serve as pointers to a context or frame of knowledge H. In other
words, H|e stands for an event H in the context specified by e. Consequently,
empirical knowledge will invariably be encoded in
conditional statements, while belief in joint events, if ever needed, will be
computed from those statements via the product
P(H,e)=P(H|e)P(e) (3.2)
Therefore, the Bayesian approach states conditional independence in terms of
conditional probabilities, for example P(H|e), which specifies the belief in
hypothesis H under the assumption that evidence e is known with absolute
certainty. If P(H|e) = P(H), H and e are said to be independent.
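The two formulations can be contrasted in a few lines; the probabilities below come from the card-drawing example used earlier in the chapter:

```python
def independent_classical(p_joint, p_h, p_e, tol=1e-12):
    """Theorem 3.4: H and e are independent iff P(H ∩ e) = P(H) P(e)."""
    return abs(p_joint - p_h * p_e) <= tol

def independent_bayesian(p_h_given_e, p_h, tol=1e-12):
    """Definition 3.7: H is independent of e iff P(H|e) = P(H)."""
    return abs(p_h_given_e - p_h) <= tol

# Drawing a king and drawing a heart: P(king ∩ heart) = 1/52 and
# P(king|heart) = 1/13 = P(king), so the two tests agree.
independent_classical(1 / 52, 4 / 52, 13 / 52)   # True
independent_bayesian(1 / 13, 4 / 52)             # True
```

The Bayesian test asks the question in the form a human reasoner would: does learning e change the belief in H?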
Treating conditional independence using conditional probabilities rather
than joint probabilities not only mirrors the human reasoning process but also
provides the capability for knowledge based systems to use the recursive and
incremental updating of belief values. Consider the following situation. Let H
denote a hypothesis, en = e1,e2,…,en denote a sequence of data observed in the
past, and e denote a new fact. A brute-force way of calculating the belief in H
would be to add the new datum e to the past data en and perform a global
computation of the impact on H of the entire set en+1 = en,e. In other words, the
system needs to compute the joint probability of H, en and e. To calculate this
joint probability, the entire stream of past data needs to be stored and made
available for subsequent computation. In practice, this can be time and storage
consuming. Using Bayes theorem, to include the new datum e, we have
P(H|en,e) = P(e|en,H) P(H|en) / P(e|en)    (3.3)
In the above equation, P(H|en) is the old belief in the hypothesis H given the
past data en; thus P(H|en) can be considered a summary of past experience. An
update of the belief due to the new datum can then be calculated by multiplying
this past experience by
the likelihood function P(e|en,H). Thus, the calculation of the new belief for a
hypothesis given a new datum does not require the memory of the past data
values. It can always be performed as a recursive and incremental computation.
In this section we have presented the background of Bayes theorem and its
advantages over the traditional approach to probability. Bayes theorem gives us
a way to quantify the probability model of a situation by a method close to the
human reasoning process; however, this purely numerical representation lacks
psychological meaningfulness. The numerical
model can produce coherent probability measures for all propositional sentences,
but often leads to computations that a human reasoner would not use. As a result,
the process leading from the premises to the conclusions cannot be followed,
tested, or justified by the users, or even the designer of the reasoning system. An
extension of the numerical representation is needed to provide psychological
meaningfulness of the reasoning system. Such an extension of the numerical
representation of the Bayes theorem is provided by the Bayesian network.
3.4 The Bayesian Network as a Knowledge Base
We have mentioned in the previous section that a purely numerical representation
is inadequate in representing the human reasoning process. For that reason, many
researchers in AI consider probability theory to be epistemologically inadequate.
Due to this perceived inadequacy of the probabilistic approach to AI, some
researchers have looked into representing qualitative reasoning through a
symbolic reasoning approach. This includes non-monotonic logic [Reiter87],
fuzzy logic [Zadeh78], certainty factors [Shortlife75] and Dempster-Shafer belief
functions [Gordon85]. Further investigation of the probabilistic approach to AI,
however, shows that exploiting the conditional independence assumptions
implicit in the qualitative structure of expert knowledge provides a rich way of
representing knowledge within the probabilistic approach. This qualitative
structure of the expert knowledge can be represented by a graph. Using this
graph, we can capture and exploit the human ability to easily detect
dependencies between events or propositions without knowing precisely the
numerical estimates of their probabilities.
Consider the following situation. A person may be reluctant to estimate
the probability of a third world war breaking out at the end of the century, or of
winning the Lotto jackpot in the next draw. However, this person can nevertheless state
with ease whether these two events are dependent, that is, whether knowing the
truth of one event or proposition will alter the belief in the other. Evidently, the
notions of relevance and dependence between propositions are far more basic to
human reasoning than are the numerical values attached to the probability
judgements.
A knowledge based system that models the human expert reasoning
process therefore needs to use a language to represent probabilistic information
that allows assertions about dependency relationships to be expressed
qualitatively, directly and explicitly [Pearl88]. One way of providing qualitative
dependence relationships in the probability model is by the use of graph theory.
The nodes in graphs can be used to represent proposition variables, and the arcs
can be used to represent direct dependencies between them; the absence of an
arc encodes conditional independence.
There are several graph models used in AI, and they can be
classified into two main groups. The first group uses undirected graphs; falling
into this category are the Markov networks [Lauritzen88]. The second group uses
directed graphs in order to represent causal dependencies between propositions
explicitly. The Bayesian network falls into the second group. Pearl [Pearl88]
suggests that the directed graph is a closer representation of the human reasoning
process and a semantically richer model than the undirected graph. Because of
this richness in capturing diagnostic reasoning processes, we use the Bayesian
network model in our study.
3.4.1 Bayesian Network Structure
A Bayesian network is a directed acyclic graph (DAG) whereby a node represents
a proposition or an event and an arc represents a direct cause-effect dependency
between two propositions or events. Consider the following situation1: an office
worker called Mr Goody lives in the outer suburbs of Melbourne and works in
the central business district of Melbourne. His boss, Ms Habib, has noticed
recently that he comes to the office late most of the time. She does not like the
situation, but she wants to give Mr Goody another chance because he is a good
worker and it is only recently that he has often been coming late to the office.
One day, Mr Goody has a very important meeting with a client and he is late.
Ms Habib, currently doing Mr Goody's performance evaluation, needs to decide
whether to give him a good or bad evaluation. She needs to know whether
Mr Goody is late because of his carelessness or whether he is just an innocent
person caught in bad traffic. Ms Habib's decision process can be described by
figure 3-1.
1 The names of the characters in this example are copyright of the BBC programme "The Thin Blue Line".
While she is waiting for Mr Goody, Ms Habib listens to the radio and
learns that there has been an accident on the freeway Mr Goody takes to work
every day. Since the freeway is undergoing repairs and only one lane is open,
the traffic is almost at a standstill. With the arrival of this new knowledge,
Ms Habib concludes that Mr Goody is caught in bad traffic and is therefore
innocent; her belief that Mr Goody is late because of carelessly sleeping in has
decreased. In this situation we can say that the two events, the traffic being
heavy and Mr Goody sleeping in, are dependent given the new evidence that
Mr Goody is late.
[Figure: nodes A (traffic is heavy) and B (Mr Goody sleeps in) each point to C (Mr Goody is late), with Pr(A)=0.01, Pr(B)=0.001, Pr(C|A,B)=0.99, Pr(C|A,¬B)=0.9, Pr(C|¬A,B)=0.5, Pr(C|¬A,¬B)=0.01.]
Figure 3-1 An example of a Bayesian network.
The situation depicted through the Bayesian network model in the
previous paragraph shows how a directed graph can be used to explain the
qualitative part of a decision process: it clearly shows the dependency or
cause-effect relations between events or propositions. If the same situation were
described using only the probability distribution, the dependency between the
events Mr Goody sleeps in and traffic is heavy would have to be checked
through numerous probability computations, which are not only time consuming
but also sometimes difficult to interpret. The example clearly shows that the
richness the directed graph gives the Bayesian network makes it a powerful tool
for building knowledge-based systems.
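The "explaining away" effect Ms Habib relies on can be verified numerically from the probabilities in figure 3-1 (a sketch by enumeration over the joint distribution; not part of the thesis):

```python
# Numerical check of the "explaining away" pattern, using the probabilities of
# figure 3-1: A = traffic is heavy, B = Mr Goody sleeps in, C = Mr Goody is late.
p_a, p_b = 0.01, 0.001
p_c = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.5, (False, False): 0.01}    # P(C=true | A, B)

def joint(a, b, c):
    pr = (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)
    return pr * (p_c[(a, b)] if c else 1 - p_c[(a, b)])

# Belief that Mr Goody slept in, given only that he is late:
num = sum(joint(a, True, True) for a in (True, False))
den = sum(joint(a, b, True) for a in (True, False) for b in (True, False))
p_b_given_c = num / den                              # roughly 0.026

# ... and after also learning that the traffic is heavy:
p_b_given_c_a = joint(True, True, True) / sum(joint(True, b, True)
                                              for b in (True, False))
# much smaller: the accident explains the lateness, lowering belief in B
```

Learning that the traffic is heavy sharply reduces the belief that Mr Goody slept in, just as the narrative describes.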
3.4.2 Conditional Independence
The example and discussion in section 3.4.1 have shown that the semantics of the
Bayesian network demand a clear correspondence between the topology of a
DAG and the dependence relationships portrayed by it. Conditional
independence of events in a Bayesian network can be determined by checking
d-separation.
Definition 3.8 If X, Y and Z are three disjoint subsets of nodes in a DAG D, then Z is
said to d-separate X from Y, denoted <X|Z|Y>D, if there is no path between a
node in X and a node in Y along which the following conditions hold:
1. Every node with converging arrows is in Z or has a descendant in Z.
2. Every other node is outside Z.
Any path satisfying the above conditions is said to be active, and blocked
otherwise. Consider the diagnostic procedure for a metastatic cancer patient
illustrated by figure 3-2. A patient diagnosed with metastatic cancer may show
two different consequences, namely increased total serum calcium or a brain
tumor. Either of these may cause the patient to fall into a coma, and a coma
lasting a long period of time will damage the brain cells.
[Figure: node A (metastatic cancer) points to B (increased total serum calcium) and C (brain tumor); B and C both point to D (coma); D points to E (brain damage).]
Figure 3-2 Bayesian network model of metastatic cancer diagnosis.
The dependency between the events in this diagnostic process can be checked
using d-separation (definition 3.8). Consider the different situations below:
1. Let X=C, Y=B and Z=A. Z d-separates (blocks) X from Y
because along the path C-A-B there are no converging arrows at A, A is
in Z, and the other nodes in the network (D, E) are outside Z. Since
node A separates node C from node B, once the belief value in A is
known, the belief in node C no longer contributes to the belief value in
node B. If a patient has been diagnosed as having metastatic cancer, the
belief that the patient has increased total serum calcium will neither
increase nor decrease the belief that the patient suffers from a brain
tumor, because the metastatic cancer already accounts for both findings.
We say that knowledge of the existence of metastatic cancer in a patient
makes the events of increased total serum calcium and brain tumor
independent.
2. The situation is different if we take X=C, Y=B and
Z=D. The arrows C→D and B→D converge at node D, which is in Z, so
D does not d-separate nodes C and B. If a patient has been found in a
coma, then using the same Bayesian network the doctors can consider
that the patient suffers from increased total serum calcium or a brain
tumor. Suppose a further medical test carried out by the doctors finds
that the patient has a brain tumor. This new finding decreases the belief
that the patient has increased total serum calcium. Thus, the knowledge
that a coma has occurred makes the beliefs in the two events, increased
total serum calcium and brain tumor, dependent: a change in one of
these events changes the belief value in the other.
3. Let the same values be assigned to X and Y. If Z=E, then Z
does not d-separate nodes X and Y, because node E is a descendant of
node D, at which the arrows converge; the path C-D-B therefore
satisfies the conditions of definition 3.8 and is active. Returning to the
diagnostic example, suppose a patient is found to be suffering from
brain damage. The Bayesian network representation of the problem
shows that a patient can suffer brain damage after falling into a coma
over a long period of time. Once it is known that the patient has
suffered brain damage, this new evidence can be explained as the result
of being in a coma, and, as in the second situation above, the coma
makes the events of increased total serum calcium and brain tumor
dependent.
The d-separation tests performed above show that the separation
criteria follow the basic pattern of diagnostic reasoning: the two inputs of a
logic gate are presumed independent, but once the output becomes known,
learning one input has a bearing on the other. The d-separation test is the formal
way of determining the independence assumptions between propositions in the
network. Heuristically, the independence assumptions can also be derived by
looking at the topology of the network. There are three basic topologies that
give rise to different independence assumptions; they are depicted in figure 3-3.
[Figure: (a) head-to-head: a → c ← b; (b) head-to-tail: a → c → b; (c) tail-to-tail: a ← c → b.]
Figure 3-3 Different topologies for independence assumptions.
In figure 3-3a, the arrows a→c and b→c meet head-to-head at node c. In
this type of topology, where two arrows meet at a node, the root nodes a and b
are independent a priori: instantiating one of them tells us nothing about the
other. On the other hand, the revelation of the value of the proposition at node c
causes nodes a and b to become dependent.
The second topology, depicted in figure 3-3b, contains two arrows a→c
and c→b meeting head-to-tail at node c. In this topology, the instantiation of the
root node a affects both node c and node b, because the proposition at node c is
the cause of the proposition at node b. A similar situation occurs when we
instantiate the leaf node b: its effects propagate all the way back to node a.
However, instantiating node c renders nodes a and b independent; once node c
is instantiated, it blocks any reasoning from node a to node b and vice versa.
The third topology, depicted in figure 3-3c, contains two arrows c→a
and c→b which meet tail-to-tail at node c. Only the instantiation of node c
makes nodes a and b independent; as long as c is uninstantiated, evidence about
either of the nodes affects belief in the other.
The heuristic check explained above provides an easier way to determine
the conditional independence of propositions in the network, since it can be
carried out by simple visual inspection.
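Definition 3.8 can also be tested mechanically. The sketch below (not from the thesis) uses the moralised-ancestral-graph formulation, which is equivalent to d-separation, applied to the network of figure 3-2 with the node names A to E from the figure:

```python
# d-separation via the moralised ancestral graph: X and Y are d-separated by Z
# iff, in the moral graph of the ancestral set of X, Y and Z, every path from
# X to Y passes through Z.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["D"]}

def ancestors(nodes):
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents[n])
    return seen

def d_separated(x, y, z):
    keep = ancestors({x, y} | set(z))
    edges = set()
    for n in keep:                        # moralise: link parents to child and
        ps = [p for p in parents[n] if p in keep]   # "marry" co-parents
        edges |= {frozenset((n, p)) for p in ps}
        edges |= {frozenset((p, q)) for p in ps for q in ps if p != q}
    frontier, seen = [x], {x} | set(z)    # search for y, never expanding Z
    while frontier:
        n = frontier.pop()
        if n == y:
            return False                  # an active path exists
        for e in edges:
            if n in e:
                (m,) = e - {n}
                if m not in seen:
                    seen.add(m)
                    frontier.append(m)
    return True

# The three situations of section 3.4.2:
print(d_separated("C", "B", ["A"]))       # True:  A blocks the path C-A-B
print(d_separated("C", "B", ["D"]))       # False: conditioning on the collider D
print(d_separated("C", "B", ["E"]))       # False: E is a descendant of D
```

The three calls reproduce the three situations worked through above.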
Diagnostic reasoning involves not only building a model of events and
their dependencies but also observing how the beliefs in those events change
when new evidence or knowledge arrives. This observation and adjustment
process is known as inference. Since this study uses a probabilistic model to
represent knowledge, the discussion in section 3.5 concentrates on probabilistic
inference.
3.5 Probabilistic Inference In Bayesian Networks
The basic task of any probabilistic inference system can be regarded as
computing the posterior probability distribution for a set of query variables,
given exact values for some evidence variables; in other words, computing
P(Query|Evidence). We use the following notation in the discussion of the
inference algorithm.
Upper case letters such as A, B, C, D, ..., X, Y, Z represent variables.
Lower case letters such as a, b, c, d, ..., x, y, z represent the possible
values of the corresponding variables.
+ represents the affirmation of a proposition.
¬ represents the denial of a proposition.
E represents a set of evidence variables.
e represents a set of evidence variables whose values are known or
instantiated.
BEL(x) represents the overall belief accorded to the proposition X=x by
all evidence received so far, BEL(x) = P(x|e), where P(x|e) is the
probability that x is true given the evidence e.
α represents a normalising constant that scales a vector of beliefs so that
its components sum to 1; for example, α[2,2,1]=[0.4,0.4,0.2].
My|x represents a fixed conditional probability matrix which quantifies the
link X→Y.
In this study we deal only with discrete variables, so BEL(x) can be regarded as
a vector whose components correspond to the different values of X. For
example, if the domain of X is {High, Short}, BEL(x) can be written as
BEL(x)=(BEL(X=High),BEL(X=Short))=[0.4,0.6]
3.5.1 Pearl’s Inference Algorithm
Pearl’s algorithm for probabilistic inference works on the directed graph
approach to the Bayesian network, which we adopt for its semantic richness in
representing knowledge. The main idea behind the algorithm is to create
two-way communication between nodes, with each direction of communication
carrying a different type of message. The nodes in the network are ranked by
parent-child association: a direct link between two nodes constitutes a
parent-child link, and the direction of the arrow determines the rank of a node.
The node from which the arrow emanates is the parent, whereas the node at
which the arrow arrives is the child. Consider the network in figure 3-4.
[Figure: a chain A → B → C; causal support messages πA and πB flow from parent to child, and evidential support messages λB and λC flow from child to parent.]
Figure 3-4 Inference in a Bayesian network.
Node B is a parent in relation to node C, but a child in relation to node A. The
messages passed in the network travel through two different channels: the
parent-to-child channel and the child-to-parent channel. Messages passing
through the parent-to-child channel give the inference process causal support
(π) messages; in figure 3-4, πA and πB are the causal support messages, each
passed to the direct children of the node where the message originates. The
child-to-parent channel, on the other hand, is used to pass evidential support (λ)
messages, each passed to the direct parents of the originating node. In figure
3-4, the evidential support messages are represented by λB and λC.
At every node in the network, the values of π and λ are used to update the
belief value of the node through the following formula:
BEL(x) = α λ(x) π(x)
where
x is a value of the proposition variable X,
α is the normalising constant,
π is the causal support value, and
λ is the evidential support value.
By exploiting the conditional independence of the nodes in the network
and using the Bayesian recursive update, the calculation of belief in the
Bayesian network can be performed locally at each node without losing the
global effect of the new evidence. Moreover, parallel computation becomes
permissible once the node dependencies have been determined and independent
sets of nodes have been found. This local and parallel computation gives
Pearl’s algorithm the ability to perform inference efficiently.
To propagate the causal support (π) and evidential support (λ) along a
link X→Y, a link matrix My|x consisting of the conditional probabilities of the
variables in Y given the variables in X is used. To illustrate the inference
process, consider the following situation (a modified version of Pearl’s example
[Pearl88], p. 151):
In a murder trial there are two suspects, one of whom definitely
committed the murder. A gun has been found with some fingerprints on it.
Let A identify the last user of the gun, namely the killer; let B identify the
last person to hold the gun; and let C represent the fingerprint finding
obtained from the laboratory. The probability distributions given below are
held for the situation. After gathering some evidence during the
investigation, the police believe that suspect 1 is the killer with probability
0.8, i.e. BEL(a=1)=0.8, and suspect 2 with probability 0.2, i.e.
BEL(a=2)=0.2.
The above example can be represented as a simple Bayesian network, shown in
figure 3-5.
[Figure: a chain A → B → C with link matrices Mb|a and Mc|b; initially π(a)=[0.8,0.2], λ(a)=[1,1], BEL(a)=[0.8,0.2]; π(b)=[0.68,0.32], λ(b)=[1,1], BEL(b)=[0.68,0.32]; the observation is C=c.]
Figure 3-5 The use of the link matrix in the inference.
The conditional probabilities quantifying the links are
Pr(b|a) = 0.8 if a=b and 0.2 if a≠b, for a,b = 1,2
Pr(c|b) = 1 for b = 1,2
The detailed message passing scheme for this example is as follows.
Prior to the inspection of the fingerprints, all λ messages are unit vectors. The
link matrix Mb|a is used to calculate π(b) and, in turn, BEL(b):
π(b) = π(a)·Mb|a = [0.8, 0.2]·[0.8 0.2; 0.2 0.8] = [0.68, 0.32]
BEL(b) = α π(b)λ(b) = α[0.68×1, 0.32×1] = [0.68, 0.32]
Now assume the laboratory report arrives, summarised as the evidential support
λ(c) = [0.8, 0.6], which gives λ(b) = [0.8, 0.6]. The belief about the last person
to hold the gun changes to
BEL(b) = α[0.68×0.8, 0.32×0.6] = α[0.544, 0.192] = [0.739, 0.261]
The updated evidential message passed from B to A is
λ(a) = Mb|a·λ(b) = [0.8×0.8 + 0.2×0.6, 0.2×0.8 + 0.8×0.6] = [0.76, 0.64]
In turn, the belief that suspect 1 is the killer changes from 0.8 to
BEL(a) = α π(a)λ(a) = α[0.8×0.76, 0.2×0.64] = α[0.608, 0.128] = [0.826, 0.174]
Thus the new belief about the last person to hold the gun, resulting from the
laboratory’s fingerprint report, increases the belief that suspect 1 is guilty from
0.8 to 0.826.
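The chain computation above can be sketched in code (a minimal illustration assuming the standard π/λ message definitions; not a general implementation):

```python
# pi/lambda propagation on the chain A -> B of the murder example.
M_b_a = [[0.8, 0.2],                     # row a, column b: Pr(b | a)
         [0.2, 0.8]]

def normalise(v):
    s = sum(v)
    return [x / s for x in v]

def pi_message(pi_parent, M):
    # causal support: pi(b) = sum_a pi(a) Pr(b | a)
    return [sum(pi_parent[a] * M[a][b] for a in range(2)) for b in range(2)]

def lambda_message(lam_child, M):
    # evidential support: lambda(a) = sum_b Pr(b | a) lambda(b)
    return [sum(M[a][b] * lam_child[b] for b in range(2)) for a in range(2)]

pi_a = [0.8, 0.2]
pi_b = pi_message(pi_a, M_b_a)                       # [0.68, 0.32]

lam_b = [0.8, 0.6]                                   # lab report evidence on B
bel_b = normalise([p * l for p, l in zip(pi_b, lam_b)])

lam_a = lambda_message(lam_b, M_b_a)                 # [0.76, 0.64]
bel_a = normalise([p * l for p, l in zip(pi_a, lam_a)])
# bel_a is approximately [0.826, 0.174]
```

The same two message functions suffice for any chain, applied link by link.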
We have shown the inference process in a Bayesian network in the shape
of a chain. The same inference algorithm can be used in networks of other
shapes, including trees and singly connected networks. All of these shapes have
one thing in common: they contain no cycles. The inference process for a
Bayesian network that contains a cycle differs from the process presented
above.
3.5.2 Handling Loops in the Network
In a network containing a loop, the propagation or inference process faces a
problem in reaching a stable equilibrium state: the message passing scheme
discussed in the previous section causes the inference process to run
indefinitely. The loop is not necessarily obvious; it may be constructed
implicitly through the d-separation rule. Consider our previous example in
figure 3-2, the metastatic cancer diagnosis. There is a loop in the network
through the links metastatic cancer - increased total serum calcium - coma and
metastatic cancer - brain tumor - coma. According to the d-separation rule, the
instantiation of the node coma makes the events increased total serum calcium
and brain tumor dependent, so new knowledge about coma changes the beliefs
in both. In turn, these two new beliefs change the belief in metastatic cancer.
Since metastatic cancer is the cause of increased total serum calcium and brain
tumor, the new belief in metastatic cancer changes the beliefs in these two
events again, which at last changes the belief that the patient will fall into a
coma. Thus, we are back to square one.
There are several methods that can be used to solve the problem of a
loop in a Bayesian network: clustering, conditioning and stochastic simulation.
Clustering involves forming compound variables in such a way that the
resulting network of clusters is singly connected. Conditioning involves
breaking the communication pathways along the loops by instantiating a
selected group of variables. Stochastic simulation involves assigning each
variable a definite value and having each processor inspect the current state of
its neighbours, compute the belief distribution of its variable, and select one
value at random from the computed distribution.
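The stochastic simulation scheme just described can be sketched as Gibbs-style sampling over the loop network of figure 3-2, restricted to A, B, C and D (the conditional probability values below are invented for illustration, since none are given in the text):

```python
import random
random.seed(0)

# Assumed illustrative CPTs for A -> B, A -> C, B -> D, C -> D.
p_a = 0.2
p_b = {True: 0.8, False: 0.2}            # P(B=true | A)
p_c = {True: 0.2, False: 0.05}           # P(C=true | A)
p_d = {(True, True): 0.8, (True, False): 0.8,
       (False, True): 0.8, (False, False): 0.05}   # P(D=true | B, C)

def joint(a, b, c, d):
    pr = (p_a if a else 1 - p_a)
    pr *= p_b[a] if b else 1 - p_b[a]
    pr *= p_c[a] if c else 1 - p_c[a]
    return pr * (p_d[(b, c)] if d else 1 - p_d[(b, c)])

def gibbs(sweeps, d=True):
    state = {"a": True, "b": True, "c": True}   # D is clamped as evidence
    count_a = 0
    for _ in range(sweeps):
        for var in ("a", "b", "c"):      # resample each free variable in turn,
            state[var] = True            # from its distribution given the rest
            p_t = joint(state["a"], state["b"], state["c"], d)
            state[var] = False
            p_f = joint(state["a"], state["b"], state["c"], d)
            state[var] = random.random() < p_t / (p_t + p_f)
        count_a += state["a"]
    return count_a / sweeps

estimate = gibbs(20000)                  # Monte Carlo estimate of P(A | D=true)
```

With the CPTs assumed above, the exact posterior P(A|D=true) is 0.425, and the estimate converges to it as the number of sweeps grows.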
Of the three approaches to handling a loop in a Bayesian network,
stochastic simulation gives the best estimate of the posterior probability.
However, it suffers from computational complexity: its accuracy is related to the
number of simulation runs, and Pearl [Pearl88] suggests that achieving 1%
accuracy requires on the order of 100 runs. In this study we introduce a new
method for handling cycles in a Bayesian network, namely the intelligent node.
We discuss and compare the different methods of handling loops in chapter 5.
3.6 Summary
In this chapter we have discussed the theory behind the Bayesian network.
A Bayesian network combines Bayes theorem and graph theory: Bayes theorem
provides the numerical representation of the model, whereas graph theory
provides the semantic representation. The belief distribution in a Bayesian
network changes when new knowledge arrives at the network, and the changes
are computed through the inference process. One inference algorithm for a
directed Bayesian network is Pearl’s algorithm, which is based on message
passing between parent and child nodes. This algorithm works for chains, trees
and singly connected networks, but not for a network that contains a cycle or
loop. To handle a cycle or loop in the network, clustering, conditioning or
stochastic simulation can be used; stochastic simulation provides the best
accuracy at the expense of computational complexity.
In chapter 4, we will introduce a Bayesian network model for an
information retrieval system. The basic model will be presented together with an
appropriate inference algorithm for the model.
Chapter 4
A Semantically Correct Bayesian Network Model for Information
Retrieval
4.1 Introduction
Probability theory has long been recognised as a useful tool for dealing with
uncertainty. The information retrieval process, as explained in Chapter 2, involves
some uncertainty. This suggests that an appropriate approach to information
retrieval is one which uses probability theory as its basic framework. Indeed,
researchers in the area of information retrieval have been investigating the
probabilistic model since the 1960s [Maron60]. This was the first information
retrieval model with a firm theoretical foundation for handling uncertainty.
Despite the apparent attractiveness of a model with such a basis, its acceptance
has not been universal, largely because the estimation of the probability
parameters in the model is perceived to be somewhat unsatisfactory.
The estimation of probability parameters in the traditional probabilistic
models [Maron60, Robertson76, Rijsbergen79] requires us to look at the
frequency of term occurrences in the sets of relevant and non-relevant
documents. To obtain these sets, the models usually rely on the relevance
feedback process. This is a process whereby the system presents the top-ranked
documents
to the user for a judgement as to whether they are relevant or not. In such a
model, before any relevance feedback data is available, it is very difficult for the
system to determine the relevance status of a document for an ad hoc query in
order to produce an initial document ranking. Existing probabilistic models
usually circumvent this parameter estimation problem by producing an initial
document ranking based on ad hoc estimates of the model parameters, or by
using an alternative retrieval model, for example using the number of index
terms in common with the query to produce the initial ranking. These models
then use the probabilistic formulae to calculate a revised document ranking once
data that can determine the relevant and non-relevant sets becomes available.
Based on this relevance data, it is then possible to estimate the parameters of
the probabilistic model by computing the proportion of times each term occurs
in the documents that have been judged relevant and non-relevant.
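The estimation step just described can be sketched as follows (an illustrative sketch only; the 0.5 adjustment is a common convention for avoiding zero probabilities and is not prescribed by the text):

```python
# Estimate, for a query term, the proportion of judged-relevant and
# judged-non-relevant documents that contain it (Robertson/Sparck-Jones style).

def estimate_term_probabilities(judged, term):
    """judged: list of (set_of_index_terms, is_relevant) relevance judgements."""
    rel = [terms for terms, is_rel in judged if is_rel]
    nonrel = [terms for terms, is_rel in judged if not is_rel]
    p = (sum(term in d for d in rel) + 0.5) / (len(rel) + 1)       # P(term | relevant)
    q = (sum(term in d for d in nonrel) + 0.5) / (len(nonrel) + 1) # P(term | non-relevant)
    return p, q

judged = [({"bayes", "network"}, True),
          ({"bayes", "graph"}, True),
          ({"retrieval"}, False)]
p, q = estimate_term_probabilities(judged, "bayes")   # p > q: "bayes" indicates relevance
```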
There have been some attempts to formulate methods which estimate the
probability parameters of the probabilistic model using either no prior
knowledge of relevance data [Croft79] or partial relevance information
[Sparck-Jones79]. However, these models are still based on some degree of ad
hoc estimation. Thus, one weakness of the traditional probabilistic models, such
as those of Maron and Kuhns [Maron60], Robertson and Sparck-Jones
[Robertson76], Fuhr [Fuhr89] and Rijsbergen [Rijsbergen79], is that they use
two different computation
methods: one to produce the initial ranking and another to handle relevance
feedback. We will also refer to the traditional probabilistic models as non-
inference probabilistic models.
An additional weakness of existing probabilistic models concerns their
inability to learn from past queries in order to determine the prior distribution
for the model parameters. This means that the parameters apply only to the
current query; as a result, a potentially large database of relevance judgements
from past queries is wasted, even though such data is considered useful
[Fuhr92].
In this chapter we introduce a new model for information retrieval based
on probabilistic inference. Probabilistic inference, as explained in chapter 3, is
the mechanism that can be used to revise beliefs as new evidence arrives at the
body of belief. This approach to probabilistic information retrieval overcomes
the weaknesses of the traditional probabilistic model in the following ways:
• The probabilistic inference approach retains the sound theoretical basis
of the traditional probabilistic model, but also incorporates, within a
single framework, methods for producing the initial ranking of
documents and for handling relevance feedback. The initial ranking of
documents is produced using the prior probability distribution, and
relevance feedback data is used as new evidence to update the prior
probabilities.
• Relevance feedback fits naturally into the model. The probabilistic
inference approach provides an automatic mechanism for learning. By
modeling prior distributions on the model parameters, we can
coherently update the prior distributions as more feedback data
becomes available.
• The probabilistic inference approach allows us to incorporate relevance
information from other queries into the model by using its ability to
incorporate learning into the model. This and the natural application of
relevance feedback means that the probabilistic inference model
provides a better learning framework than do the traditional
probabilistic models.
The difference between the existing non-inference probabilistic models and
the probabilistic inference approach arises due to their differing treatment of the
meaning of probability. In the former, the probability is considered from a
frequency point of view. The probability values in this model are obtained simply
by counting the number of documents containing a particular descriptor or index
term. On the other hand, the probabilistic inference models interpret the
probability as the degree of belief in an event or proposition. This is known as an
epistemological view of probability, a view which considers the assignment of
beliefs in propositions, quite devoid of any statistical background. The
epistemological view provides the information retrieval model with the ability to
capture the model’s semantics. Although these two views of probability are
considered to be contradictory, the statistical notion of probability may be used as
a way to measure chance at the implementation level [Rijsbergen79]. A clear
explanation of probability at the conceptual level to capture the semantics of the
document collection is required and has been lacking from the traditional
probabilistic models [Rijsbergen92]. A major objective of our model therefore is
that of providing a conceptual model for information retrieval.
Probabilistic inference is also superior to the traditional probabilistic
models in providing the information retrieval mechanism with effective means by
which important semantic information can be incorporated into the retrieval
process, thus circumventing some statistical problems inherent in the traditional
probabilistic model.
We will adopt one specific approach to probabilistic inference, namely that
of Bayesian networks. A Bayesian network is a model for probabilistic inference
which uses a combination of probability and graph theory. The inclusion of graph
theory into the model provides the Bayesian network with additional
characteristics lacking in retrieval models based on non-graphical approaches
to probabilistic inference. These additional characteristics are:
• The documents in the collection may be represented as complex
objects with multilevel representations, not merely as collections of
index terms. The document may be considered as a collection of either
index terms, sentences or phrases. The level of representation of the
document is implementation dependent.
• Dependencies between documents are built implicitly into the model by
using the independence assumption of the Bayesian network. We will
explore this in detail in section 4.2.2. Citation or nearest-neighbour
links can also be easily incorporated into the model because of its
graphical nature; such links have been shown to improve the
performance of information retrieval systems [Turtle90].
• Synonyms and a thesaurus can be easily implemented as part of the
network. Any index terms that are synonyms can be linked, so that the
system can use all those synonyms during retrieval. The index terms
that belong to the same concept may be linked into a concept node.
The collection of the concept nodes in the network forms a thesaurus
in the system. The addition of a thesaurus has been shown to increase
the performance of a retrieval system [Sparck-Jones71].
We will present in detail the characteristics of the model and how the
model addresses problems of the traditional probabilistic models in section 4.2.
This section also examines the concept of prior probability which is one of the
major issues in probabilistic information retrieval models. Section 4.3 includes the
discussion of the inference process in the network which occurs once new
information or evidence has arrived. Since a Bayesian network is a directed acyclic
graph, the determination of the correct causal direction of the inference is
considered an important issue. Different causal directions in the model lead to
different inference results. We will compare different causal direction models and
discuss their application to information retrieval in section 4.4. To show that our
model may be used as a conceptual model, in section 4.5 we compare our model
with existing information retrieval models such as the Boolean, vector space and
traditional probabilistic models.
4.2 The Bayesian Network Model
In the probabilistic inference model, we assume that there exists an ideal concept
space U called the universe of discourse or the domain reference. The elements in
this space are considered to be elementary concepts. A proposition is a subset of
U. The correspondence between propositions and subsets of elementary concepts
can translate the logical notions of conjunction, disjunction, negation, and
implication into the more familiar set-theoretic notions of intersection, union, and
complementation, respectively. We will use the set-theoretic and logical operations
interchangeably during our discussion in this chapter.
4.2.1 Probability Space
In information retrieval, the ideal concept space can be interpreted as the
knowledge space, in which documents, index terms, and user queries are all
represented as propositions or subsets.
A probability function P is defined on the concept space U. The probability
P(d) is interpreted as the degree to which U is covered by the area occupied by
the knowledge contained in document d. Similarly, P(q∩ d) represents the degree
to which U is covered by the knowledge common to both query q and document d.
We assume that all documents in the collection have been indexed and are
represented by a set of index terms and that the concept space U includes all the
index terms in the collection.
Definition 4.1 Let n be the number of index terms in the system and let ti be an
index term. U = {t1, t2, t3, …, tn} is the set of all the index terms and defines
the sample space. Let u ⊂ U be a subset of U.
The concept u may represent a document or a query in the information retrieval
model. In the Bayesian network, set relationships are specified using random
variables as follows.
Definition 4.2 Each index term ti is associated with a random binary variable ki
as follows:
ki=1 ⇔ ti ∈ u
ki=0 ⇔ ti ∉ u
We also define a function g: U → R such that g(ti) represents the term
weight of the index term ti.
The term weights g(ti) will be implemented in our model as the weights of the
links in the network. Now we define the document and query concepts in the
sample space U.
Definition 4.3 A document is represented as a concept d = (k1, k2, k3, …, kn) where ki
is a binary random variable and ki=1 if ti ∈ d and ki=0 otherwise. Similarly,
the user query can be represented as a concept q = (k1, k2, k3, …, km) where
ki=1 if ti ∈ q and ki=0 otherwise.
In a Bayesian network, the concepts ti, concept d and concept q are represented as
nodes. The relations g(ti) are represented by the links in the network. Then, given
that documents and queries are represented as concepts in the space U, we can
apply a basic model of information retrieval as a concept matching system
[Kwok90].
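As an illustration of definitions 4.1 to 4.3, the sketch below builds the binary vectors d and q over a shared universe of index terms. It is an illustrative reconstruction, not code from the thesis; the four-term vocabulary and the documents are invented.

```python
# Sketch of Definitions 4.1-4.3: documents and queries as binary
# random-variable vectors over a shared index-term universe U.
# The vocabulary below is invented for illustration.
U = ["automatic", "information", "retrieval", "image"]  # sample space

def concept(terms, universe=U):
    """Return the binary vector (k1,...,kn): ki = 1 iff ti is in the concept."""
    term_set = set(terms)
    return [1 if t in term_set else 0 for t in universe]

doc1 = concept(["automatic", "information", "retrieval"])
query = concept(["information", "image", "retrieval"])
print(doc1)   # [1, 1, 1, 0]
print(query)  # [0, 1, 1, 1]
```

Concept matching then reduces to comparing these vectors position by position over the common universe U.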
To represent the knowledge contained in document d and the knowledge
represented by query q, a Bayesian network model uses two separate networks.
One network, the document network represents the documents in the collection.
The other network, the query network represents the user query [Ghazfan95].
The two networks are combined when a retrieval process is performed. The
network model is depicted by figure 4-1.
[Figure omitted: a query node linked to the index terms automatic, information,
retrieval, and image, which in turn link to the document nodes doc-1, doc-2,
and doc-3.]
Figure 4-1 Bayesian network model for information retrieval systems.
The document and query networks are similar except that the document
network is established at the creation of the database collection and remains the
same unless new documents are added to or obsolete documents are deleted from
the collection. The query network, on the other hand, exists only for the duration
of the user’s query and is dynamic: it changes from query to query, whereas the
document network remains the same across queries.
The output of retrieval is a ranking of the collection's documents,
obtained by calculating P(d|q), that is, the probability at each document node di
once the value of the query q is known. This inference network approach to
information retrieval was first introduced by Turtle and Croft [Turtle91]. In their
model, however, they use the inference P(q|d) which produces a different
document ranking output than does our model. We will show in section 4.4 that
our model provides a more general and appropriate framework for modeling
information retrieval than the model of Turtle and Croft.
4.2.2 The Document Network
A document in a Bayesian network is represented as a complex object with
several levels of representation. The smallest possible document network consists
of two layers of nodes. The top layer represents the system's dictionary or the
concept universe U. The next layer down represents arbitrary objects that can be
constructed by the combination of index term ti in U. These objects are the
concept objects in U and may represent arbitrary sets or propositions. In
information retrieval implementations, these objects include phrases, subject
classifications, sentences and documents. In the document network used for
example in figure 4-1, the concept universe U consists of the index terms as the
elementary concepts and documents as higher level concepts that are built from
the elementary concepts. The index terms automatic, information, retrieval, and
image make up the concept universe U (the top layer of the nodes) and the
document objects doc-1, doc-2 and doc-3 make up the bottom layer. The number
of layers in the network is not limited to two.
The number of layers in the network depends on the level of abstraction to
be modeled. For example, if we do not want to implement classification of the
index terms into subjects, we only need a two-layer network, one layer to
represent the dictionary layer and the other to represent the document layer as in
figure 4-1. However, if we need to model an information retrieval system that
represents subject classification explicitly, we need to introduce another layer
between the document and the dictionary layer. Such a situation is depicted in
figure 4-2. We can see that the index terms image and information are combined
into a node multimedia in the middle layer. The network in this situation consists
of three layers. Regardless of the number of layers required for the implementation
of the model, the top most layer is always the dictionary layer and the bottom
most layer is always the document layer. The index terms in the dictionary layer
are always required because they are the common elements shared by documents
and queries and are used during the matching process. The documents themselves
are also required as they are the objects to be presented to the user as the result of
the matching process.
[Figure omitted: the index terms automatic, information, retrieval, and image in
the top layer; a middle-layer subject node multimedia; the document nodes doc-1,
doc-2, and doc-3 in the bottom layer.]
Figure 4-2 Document network with a subject classification layer.
The link arrows in the network signify the causal relationships between the
layers. In figure 4-1, the arrows emanating from the index terms automatic,
information, and retrieval to the doc-1 node signify that these three
index terms are the cause for document 1 to exist in the collection. In other
words, the documents may be considered as the subsets of U formed by the index
terms ti in U for which g(ti) > 0.
4.2.3 The Query Network
A user’s information need is represented in a Bayesian network model by the
query network. Compared to the document network, the query network can be
considered as an upside-down network. The root node of this network represents
an abstraction of the user’s information need. The nodes in the layers underneath
the root node further explain the abstract concept of the information need. This
lower layer consists of the index terms tj which are part of the universe U. The
number of layers in the query network is always two.
We note here that as in any model in information retrieval, the index terms
used in the query come from the system’s vocabulary of index terms. Without this
limiting assumption, the task of matching between document and query
representation becomes almost impossible.
The user’s information needs are represented explicitly in our model in
order to allow the user to further refine their query by assigning weights to the
index terms used to explain the information need. Consider the example shown in
figure 4-3.
[Figure omitted: a query node linked to the index terms information, image, and
retrieval.]
Figure 4-3 An example of a query network.
In this example, the user’s information needs are expressed by the index terms
information, image and retrieval. If the user is certain about the relative
importance of these index terms, they can quantify their degree of belief in the
individual index terms for meeting their information need; that is, they can assign
weights to the links query→index term. For example, they may decide to give
more weight to the term image than information, if they prefer image retrieval
rather than information retrieval articles but do not wish to eliminate the
possibility that some of the articles concerned with information retrieval may
discuss image retrieval.
4.2.4 Prior Probability
Before any inference process can be performed, any inference-based probability
model requires that prior probabilities be defined. In a Bayesian network, the
conditional probabilities for each node given the belief values of its parent nodes,
and the prior probabilities for all the root nodes, need to be
defined. Therefore in our model for information retrieval, we need to define
probabilities for the following:
• In the document network:
• P(d | t1, …, tn) - conditional probabilities for a document node
given the values of the index terms that construct the
document. These conditional probabilities act as the weights of
the links between the document and index term nodes.
• P(ti) – probability that the index term ti will be found relevant
to the user’s information need.
• In the query network:
• P(ti|Q=true) - conditional probabilities for an index term ti
given the probability of the query.
Any interior nodes and leaf nodes do not require prior probabilities to be defined
and their values will change according to the inference process. Note that we do
not have to assign the probability value to the query node because we will
instantiate this node during the inference process.
For reasons of clarity and simplicity, we will use the model with a two-layer
document network depicted in figure 4-1. The conditional probabilities of a
document node given its parents, that is P(d | t1, …, tn), are defined using the term
weights. The term weight can be calculated as the product of the term frequency
(tf) and inverse document frequency (idf) (refer section 2.5.2). This term weight
based on tf*idf may not produce values in the range [0,1], the usual range for
probability values. Therefore to use term frequency in the conditional probability
we need to normalise the term weight. Table 4.1 shows two such term frequency
normalisation methods.
    Max         tf/max(tf)
    Augmented   0.5 + 0.5*tf/max(tf)
Table 4-1 Variety of term frequency normalisations.
The main difference between the max and the augmented methods lies with
the density of the distribution they produce. The augmented term frequency for a
particular document will be in the range [0.5,1], whereas the max method
produces a term frequency in the range [0,1].
We can use different combinations of term frequency normalisations and
inverse document frequency weighting schemes. For example, we may adopt the
combination of the max term frequency and the inverse document frequency. With
these choices, the term weights are then calculated as
termweight = (tf / max(tf)) * log(N / df)    (4.1)
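Table 4-1 and equation (4.1) can be sketched in a few lines. This is an illustrative reconstruction with invented counts, not thesis code; the base of the logarithm is an implementation choice (the natural logarithm is used here).

```python
import math

def tf_max(tf, max_tf):
    """Max normalisation from table 4-1: tf / max(tf)."""
    return tf / max_tf

def tf_augmented(tf, max_tf):
    """Augmented normalisation from table 4-1: 0.5 + 0.5 * tf/max(tf)."""
    return 0.5 + 0.5 * (tf / max_tf)

def term_weight(tf, max_tf, N, df):
    """Equation (4.1): max-normalised tf times idf = log(N / df)."""
    return tf_max(tf, max_tf) * math.log(N / df)

# Invented counts: the term occurs 3 times, the most frequent term in the
# document 6 times; the collection has 1000 documents, 100 containing the term.
w = term_weight(tf=3, max_tf=6, N=1000, df=100)
```

Note how the augmented form squeezes its output into [0.5, 1], while the max form spans [0, 1], the density difference discussed above.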
We found that different weighting schemes provide different levels of
recall and precision. These differences occur due to the different densities of the
probability distributions provided by the different weighting schemes. We found
that the best recall and precision was achieved by the middle range density
distribution, that is the distribution in the max approach. Further detailed
discussion on the behaviour of the model with different weighting schemes is
presented in Chapter 6.
The index terms are treated as though they have an equal chance to be
found relevant to the information need. Thus we can assign 1/(total no. of index
terms) as the prior probability of each index term node. The values of these prior
probabilities in the index term nodes will change when a new query is submitted
and a query network is attached to the document network.
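The uniform prior just described is straightforward to set up; a minimal sketch with a hypothetical term list follows.

```python
# Uniform priors for the index-term (root) nodes: each term is assumed
# equally likely to be found relevant before any query arrives.
# The term list is invented for illustration.
index_terms = ["automatic", "information", "retrieval", "image"]
prior = {t: 1.0 / len(index_terms) for t in index_terms}
print(prior["image"])  # 0.25
```

These values are only starting beliefs; attaching a query network and performing inference overwrites them.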
4.3 Probabilistic Inference in Information
Retrieval
The Bayesian probability approach views probability values of nodes as degrees of
belief in an event or proposition. The links in the network represent the cause and
effect relationships between two propositions. The whole network may be viewed
as our universe of belief, in which all the propositions interact, their effects on
each other being derived through the links and any independence assumptions
inherent in the network. This process is recognised as being similar to the human
reasoning process. A person always has some belief value for a particular issue.
These belief values are arrived at using knowledge which comes through
experience. A new experience arrives at the human belief system regularly through
observation of new evidence. This new evidence causes the human to adjust their
belief value, ie. to perform a reasoning process. The belief in some propositions
may be amplified while belief in others may be lessened. A Bayesian probabilistic
inference algorithm tries to model this human reasoning process. Any new
evidence which comes to the network will alter the belief distribution in the
network. In order to change the belief distribution, a probabilistic inference
process is used to make a decision as to which propositions will be affected and by
how much the belief in these propositions may have to change. A Bayesian
probabilistic inference algorithm carries out this decision using two characteristics
of the network model, namely the semantics of the network to determine the
independence assumption, that is the decision as to which propositions will be
affected by the new evidence, and the numeric contents (quantitative
representation) to calculate the value of new belief for all the affected nodes.
We will use Pearl’s inference algorithm [Pearl88] in our model. In this
algorithm, the independence assumption can be validated formally through a d-
separation check or heuristically by considering the shape of the network (see
chapter 3). Once the independence assumption between nodes has been
established, the process of the belief updating is performed using the link matrices.
A link matrix represents all possible conditional probabilities of a node given the
belief values of its parents. For example, if a node x has a set of parents
πx = {p1, p2, p3, …, pn}, we must estimate P(x|πx) = P(x|p1,p2,p3,…,pn). Since we are
dealing with binary valued propositions ( see definition 4.2), this link matrix can
be represented by a matrix of size 2 × 2^n for a node with n parents. The matrix
elements specify the probability taken by a node x given the truth value of its
parents. Given that all the parents of x are independent, the estimate P(x|
p1,p2,p3,…,pn) can be presented as the sum of all these truth values. For
illustration, we will assume that a node doc-1 is constructed from three index
terms X,Y and Z (figure 4-4) and that
P(X = true) = x
P(Y = true) = y
P(Z = true) = z
[Figure omitted: a node doc with three parent nodes X, Y, and Z.]
Figure 4-4 Network for the link matrix example.
The link matrix for the information retrieval network in figure 4-4 can be
constructed as L[i,j], i ∈ {0,1}, 0 ≤ j < 2^n, given that the parents correspond to pj
(j ∈ {0,1,2} in our example) [Turtle91]. We will use the row number to index the
values assumed by the child node and use the binary representation of the column
number to index the values of the parents. The high order bit of the column
number indexes the first parent’s value, the second highest order bit indexes the
second parent and so on. The link matrix for figure 4-4 is therefore:
      ⎡ P(doc = false | j=0)  …  P(doc = false | j=7) ⎤
L  =  ⎣ P(doc = true | j=0)   …  P(doc = true | j=7)  ⎦

where the binary representation of the column index j gives the truth values of
the parent nodes X, Y, Z (X as the high-order bit). The first row holds the values
taken when Pr(doc = false) and the second row those taken when Pr(doc = true).
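The indexing convention just described can be made concrete. The sketch below (not thesis code) computes the closed-form belief update P(x = true) for an arbitrary link matrix, assuming independent parents; the OR matrix used as a check anticipates section 4.3.1.1.

```python
from itertools import product

def belief(link_matrix, parent_probs):
    """Closed-form update: P(child = true) is the sum over the 2^n parent
    configurations j of L[1][j] * P(configuration j), assuming the parents
    are independent. The binary representation of j gives the parents'
    truth values, with the high-order bit indexing the first parent."""
    n = len(parent_probs)
    total = 0.0
    for j, config in enumerate(product([0, 1], repeat=n)):
        p_config = 1.0
        for value, p in zip(config, parent_probs):
            p_config *= p if value else (1.0 - p)
        total += link_matrix[1][j] * p_config
    return total

# OR link matrix for three parents: doc is false only when all parents are false.
L_or = [[1, 0, 0, 0, 0, 0, 0, 0],
        [0, 1, 1, 1, 1, 1, 1, 1]]
print(belief(L_or, [0.5, 0.5, 0.5]))  # 0.875 = 1 - (1-0.5)**3
```

The enumeration order of itertools.product matches the binary column indexing, so no explicit bit manipulation is needed.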
4.3.1 Link Matrices
We will describe three link matrix forms that can be used for different information
retrieval implementations, namely the OR and AND link matrices for Boolean retrieval
and the weighted-sum matrix for probabilistic retrieval. We will base our
discussion in this section on the network example in figure 4-4.
4.3.1.1 OR-link matrix
In an or-combination link matrix, the doc node will be true when any of X,Y, or Z
is true and false when all of X,Y,Z are false. So, for our example,
       ⎡ 1  0  0  0  0  0  0  0 ⎤
Lor =  ⎣ 0  1  1  1  1  1  1  1 ⎦
Using a closed form update procedure we have
P(doc=true) = (1-x)(1-y)z + (1-x)y(1-z) + (1-x)yz + x(1-y)(1-z) +
x(1-y)z + xy(1-z) + xyz
The update procedure can be simplified as
P(doc = true) = 1 – (1- x)(1-y)(1-z)
P(doc = false) = (1-x)(1-y)(1-z)
4.3.1.2 AND-link matrix
For an and-combination link matrix, the doc node will be true when all of X,Y
and Z are true and false otherwise. Thus we have a matrix of the form
        ⎡ 1  1  1  1  1  1  1  0 ⎤
Land =  ⎣ 0  0  0  0  0  0  0  1 ⎦
Again using closed form update, we have
P(doc=false) = (1-x)(1-y)(1-z) + (1-x)(1-y)z + (1-x)y(1-z) + (1-x)yz +
x(1-y)(1-z) + x(1-y)z + xy(1-z)
The calculation can be simplified as
P(doc = true) = xyz
P(doc = false) = 1 – xyz
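Both simplified closed forms are easy to check numerically; a sketch with invented belief values:

```python
def p_or(x, y, z):
    """Closed-form OR update: doc is true unless all three parents are false."""
    return 1.0 - (1.0 - x) * (1.0 - y) * (1.0 - z)

def p_and(x, y, z):
    """Closed-form AND update: doc is true only when all three parents are true."""
    return x * y * z

print(p_or(0.9, 0.5, 0.1))   # 0.955
print(p_and(0.9, 0.5, 0.1))  # 0.045
```

The two results bracket the parent beliefs from above and below, which is the difference in influence discussed next.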
The AND and OR link matrices infer different degrees of influence to the belief in
the child node. The influence of the belief values for (parents = true) are greater in
the OR link matrix than in the AND matrix. Therefore, we can use the OR matrix
when we are interested in having the child belief values significantly influenced
by the belief values of (parent = true).
4.3.1.3 WEIGHTED-SUM link matrix
The weighted-sum link matrix is an attempt to weight the influence of individual
parent nodes on the probability value of the child node. A parent with a larger
weight will influence the child more than a parent with a smaller weight. If we let the
links between the node doc and nodes X ,Y, Z be weighted as wx,wy,wz respectively
and set t= wx + wy + wz, for our example we have the link matrix of the form
       ⎡ 1   1−wdwz/t   1−wdwy/t   1−wd(wy+wz)/t   1−wdwx/t   1−wd(wx+wz)/t   1−wd(wx+wy)/t   1−wd(wx+wy+wz)/t ⎤
Lws =  ⎣ 0    wdwz/t      wdwy/t     wd(wy+wz)/t     wdwx/t     wd(wx+wz)/t     wd(wx+wy)/t     wd(wx+wy+wz)/t  ⎦

where wd is the weight attached to the doc node itself and the columns follow the
same binary parent-configuration ordering as before (X the high-order bit, Z the
low-order bit).
Evaluation of the link matrix produces

P(doc = true) = (wd/t)[ wz(1−x)(1−y)z + wy(1−x)y(1−z) + (wy+wz)(1−x)yz
                + wx x(1−y)(1−z) + (wx+wz)x(1−y)z + (wx+wy)xy(1−z)
                + (wx+wy+wz)xyz ]

              = wd(wx x + wy y + wz z) / t
P(doc = false) = 1 − P(doc = true) = 1 − wd(wx x + wy y + wz z) / t
We may use the term weight (such as the term frequency (tf)) for the parent's
or index term weight (wx, wy, wz in the above example) to implement the weighted
sum, because the parents' (the index terms') weights are summed and normalised
over a document. Analogously, we may use the inverse document frequency (idf)
for the weight of the document node to represent the index term's ability to
discriminate the document from the other documents in the collection. Since we
always multiply the weight of the individual parent (index term) by the weight of
the child (document) when P(doc=true), the weight we have used is actually
equivalent to a tf*idf weighting. Thus, we can assign the tf*idf values as the link
weights. In other words, the link weight represents P(dj|ti).
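A sketch of the weighted-sum closed form P(doc = true) = wd(wx·x + wy·y + wz·z)/t, with invented weights and beliefs (not thesis code):

```python
def p_weighted_sum(parent_probs, parent_weights, w_doc):
    """Weighted-sum closed form: P(doc = true) = w_doc * sum(wi * pi) / t,
    where t is the sum of the parent weights."""
    t = sum(parent_weights)
    return w_doc * sum(w * p for w, p in zip(parent_weights, parent_probs)) / t

# Invented numbers: three parents with beliefs x, y, z and tf-style weights.
print(p_weighted_sum([0.8, 0.4, 0.2], [3.0, 1.0, 1.0], w_doc=1.0))
# (3*0.8 + 1*0.4 + 1*0.2) / 5 = 0.6
```

The heavily weighted first parent dominates the result, which is exactly the behaviour the weighted-sum matrix is designed to give.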
In a query network, the link weight may be interpreted as the user
weighting the index term’s relative importance in representing their information
need. The link matrix of the query network is less complicated than that of the
document network since all index terms in the query network only have one parent
node (ie. the query node).
4.4 Directionality of the Inference
The notion of causation, that is the idea that a given random variable can be
perceived as the cause for another variable to exist or change its belief value, is
fundamental to inference using a Bayesian network. Different causal directions in
the network produce different reasoning models and thus it is important to
consider the direction of the causation in the network. In many cases the direction
of the causation is clear; in others it is difficult to distinguish between causal and
evidential support.
Causal support is represented as an arc in the network whereas evidential
support flows against the direction of the arc. By drawing an arc from node x to y
we are asserting that proposition x has in some way caused proposition y to be
observed. That is, if we observe proposition x, then this observation in turn will
determine our belief in proposition y, assuming that x is the only parent of y. If y
has other parents in addition to x, then we need to consider the influence of these
other parents.
Evidential support, on the other hand, means that the observation of
proposition y may change the belief in proposition x because y is a potential
explanation of x. Thus, in this case, knowing y will confirm or oppose the belief in
x.
In the basic information retrieval model depicted in figure 4-1, there are
three different propositions which may be used as causal or evidential support,
namely the queries, the index terms and the documents. In our model, we assert in
the query network that the observation of a query influences our belief as to which
index terms are useful in representing the user’s information need. In the
document network, we assert that the combination of the index terms causes the
object document to exist. The inference process is performed by instantiating the
query network and observing the result of inference at the document nodes. The
topology is depicted in figure 4-5a.
[Figure omitted: (a) arrows from the query to the index terms and from the index
terms to the documents; (b) the inverse, arrows from the documents to the index
terms and from the index terms to the query; (c) arrows from both the query and
the documents into the index terms.]
Figure 4-5 Contrasting causal topologies.
There are at least two other possible topologies which may be used in
information retrieval modeling. In the first, we simply invert the network as shown
in figure 4-5b. Thus, we assert that the observation of the document causes a
change in belief in the index terms and in turn changes the belief in the query. This
approach to modeling the inference network was taken by Turtle and Croft
[Turtle91]. Superficially, the difference between the two topologies appears
trivial, however we have found that the topology shown in figure 4-5b does not
provide a “correct” inference model for information retrieval [Ghazfan96]. In this
section, we will show what we mean by the “correct” inference. We will also
defer until later in this section the discussion of the third topology shown in figure
4-5c.
As we have stated previously in chapter 2, an information retrieval model's
task is to estimate the relevance of the document to a given query. In other words,
it attempts to estimate P(Relevant|documenti,queryj), ie. the probability that a
document is relevant to a given query. Applying this estimation to the topology
shown in figures 4-5a and 4-5b respectively, we have the situation as shown in
figure 4-6
[Figure omitted: the topologies of figures 4-5a and 4-5b, each extended with a
relevance node at the end of the inference chain.]
Figure 4-6 Causal topologies with relevance node.
The node relevance in figure 4-6 is introduced to the graph to explicitly represent
the belief value that exists at the end of the inference network chain. In this case,
this is the belief value of whether the document is relevant to the query. Thus, the
node relevance in figure 4-6 corresponds to an area in the universe U which is
occupied by the query Q, the set of index terms ti and the document d, in other
words P(Q∩ti∩d). Using Bayes theorem notation, the graph in figure 4-6a gives:
P(Relevant|d,Q) = P(Relevant|d,t,Q)
= P(d,t,Q) (4.2)
By conditioning the probability values only on the index terms in the evidence set,
ie. the index terms in the query, the above equation can be simplified further into:
P(Relevant|di,Qj) = P(di|Qj)    (4.3)
Using the same procedure, the graph in figure 4-6b gives:
P(Relevant|di,Qj) = P(Qj|di)    (4.4)
The belief value presented by the relevance node is the result of the belief
propagation process triggered by the arrival of new evidence. The arrival of new
evidence for figure 4-6a (our approach) is indicated by the introduction of a query
into the system. On the other hand, the result of the inference for the graph in
figure 4-6b is obtained by instantiating one document node at a time in the
network. In our approach, the relevance value to be measured is that associated
with the probabilities at the document nodes. The hypothesis that we have to verify
is whether the documents are relevant to a given query (calculating
P(d|Q)). In contrast, the approach adopted in figure 4-5b and figure 4-6b
measures the relevance of the query node. The hypothesis then to be verified is
the relevancy of the query to a selected document, ie. P(Q|d).
If we apply Bayes theorem to equations 4.3 and 4.4, the two above models
of retrieval produce the following interpretations respectively:
P(di|Qj) = P(di, Qj) / P(Qj)    (4.5)

P(Qj|di) = P(di, Qj) / P(di)    (4.6)
Note that the denominators of equations (4.5) and (4.6) are normalisation
factors for the equations. Therefore, in the process of answering an arbitrary query,
equation (4.5) uses the same normalisation factor in every document matching
process for that query. Equation (4.6) on the other hand, uses a different
normalisation factor for each observed document in the same query. Since an
information retrieval system fully evaluates all the documents in the collection for
a single query introduced to the system to find the relevant documents, equation
4.6 will produce different normalisation values across documents instantiated for a
single query. Hence we can assert that an implementation that exhibits the features
of equation (4.5) will give a “correct” result. To clarify this assertion and the
importance of a common normalisation value, consider the following case as an
example. Suppose there are four objects in the knowledge universe U: a book A, a
thesis B, an article C, and a query Q. The mapping of the document and query sets
to the knowledge space is given in figure 4-7. A quick visual inspection of the
figure reveals these facts:
Book A covers around 50% of the knowledge space occupied by set Q.
Thesis B covers around 30% of the knowledge space occupied by set Q.
Article C covers very little of Q.
[Figure omitted: the sets A, B, C, and Q drawn as overlapping regions inside the
concept universe U.]
Figure 4-7 Document and query mapping in concept universe U.
Conversely, note that query Q covers around 30% of set A, around 70% of
set B and most of set C. Intuitively, we would choose A, followed by B and C as
our order of ranking for the retrieval of the documents relevant to query Q.
However, if we apply equation (4.6) and examine the document ranking produced
with this equation, we obtain the following order: article C followed by thesis B
and then book A. This example illustrates that adopting a topology that uses an
inference direction of P(d|Q) (figure 4-5a) provides a semantically more accurate
result than the model that uses P(Q|d) as in figure 4-5b.
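The ranking argument of figure 4-7 can be reproduced with explicit sets. The sets below are invented, with set-overlap ratios standing in for the probabilities P(d|Q) and P(Q|d):

```python
# Invented concept sets: Q has 10 elements; A (a book) covers half of Q but
# is large; B (a thesis) covers 3/10 of Q and is small; C (an article) is a
# single element lying inside Q.
Q = set(range(10))
A = set(range(5)) | set(range(100, 112))  # |A| = 17, |A & Q| = 5
B = {0, 1, 2, 200}                        # |B| = 4,  |B & Q| = 3
C = {9}                                   # |C| = 1,  |C & Q| = 1

def rank(docs, score):
    """Rank document names by descending score."""
    return sorted(docs, key=lambda name: score(docs[name]), reverse=True)

docs = {"A": A, "B": B, "C": C}
by_d_given_q = rank(docs, lambda d: len(d & Q) / len(Q))  # ~ P(d|Q)
by_q_given_d = rank(docs, lambda d: len(d & Q) / len(d))  # ~ P(Q|d)
print(by_d_given_q)  # ['A', 'B', 'C']
print(by_q_given_d)  # ['C', 'B', 'A']
```

The P(d|Q) direction matches the intuitive ranking A, B, C; the P(Q|d) direction inverts it, exactly the pathology described above.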
The third topology in figure 4-5, (figure 4-5c) asserts that both the query
and the document are causal agents or propositions for the existence of the index
terms. This model is actually the Bayesian network equivalent of the document
space modification model adopted by [Yu88, Fuhr90]. To see why this topology
is not appropriate as an information retrieval model we have to consider the
independence assumption of Bayesian networks. Using the independence
assumption heuristic in chapter 3, we can see that the observation of index term
nodes in network topology in figure 4-5c will in fact cause the query and
document to be independent. This is not a desired outcome because we would like
to infer our belief in the query to the document through the index terms. In fact if
we use the topology in figure 4-5c, we observe that the query node and the
document nodes become competing explanations for the index terms. That is, if
we observe the query to be true then it will diminish the causal effect of the
document on the index terms. We know however, that the effect of documents on
the index terms is constant for a set of document collections. The effect of
documents on index terms will only change if a new document is added or an
obsolete document is deleted from the collection. Thus, the topology in figure 4-5c
is also not able to model the information retrieval task correctly. Therefore, we
use the topology depicted in figure 4-5a as the basis for our model.
4.5 Comparison with Other Models
We have presented our Bayesian network model for information retrieval in
sections 4.2 to 4.4. In this section we will show that we can use our model to
implement other retrieval models such as the Boolean and binary independence
probabilistic model [Rijsbergen79, Robertson76]. We began a comparison with
Turtle and Croft’s inference model [Turtle91] in the discussion of causation in the
network model in section 4.4. We will further analyse the difference between their
model and ours in this section. We will show that our model provides not just a
semantically correct document ranking in general but also a richer framework for
information retrieval modeling. It is worth noting at this point that the drawbacks
of Turtle and Croft’s model do not preclude it from producing good recall and
precision [Turtle90, Rajashekar95]. Our model has also shown promising results
(see chapter 6) as well as providing a more general and richer framework for
information retrieval.
4.5.1 Simulating the Boolean Model
A Bayesian inference network can be used to simulate the Boolean information
retrieval model. In this model, each document is evaluated independently of the
other documents in the collection (the fact that there is no document ranking
involved means that documents in the collection may be assumed to be
independent). Thus, to simulate such retrieval we can create a disjoint network for
each individual document in the collection. Each network is then evaluated to
determine its relevance. In the implementation we will actually have one network
with different prior probabilities assigned to the index term (ti) nodes. In the
Boolean model, no weighting is applied, ie. P(ti|Q=true)=1 for all index terms ti
used to represent the user’s information need Q. The simulated network can be
built using the following steps:
• Build an expression tree for the query network.
• Assign P(ti|Q)=1 for all ti∈d and P(ti|Q)=0 for all ti∉d.
• Instantiate node Q.
• Use the logical link matrices Lor and Land (section 4.3.1) to calculate the
value of P(d), depending on the Boolean relationships between the term
nodes in the query.
Then, P(d)=1 means that the document d satisfies the query and P(d)<1 means it
does not satisfy the query. To illustrate, consider the Boolean query
(information or image) and retrieval
applied to the document collection in figure 4-1. The new network with the
expression tree is depicted in figure 4-8.
[Figure: the query node Q is linked to the index term nodes information, image
and retrieval; information and image feed an or node, which together with
retrieval feeds an and node attached to the document node d.]
Figure 4-8 Boolean model implemented by a Bayesian Network.
Consider a document which includes all those index terms appearing in the
query. That is, P(information|Q), P(image|Q) and P(retrieval|Q) are all equal to
1. Using the Lor link matrix (section 4.3.1), the new beliefs in the index terms in
turn cause the or node to take the value 1. Using the Land link matrix, the or
node and retrieval node cause the value of node and to become equal to 1. This
value is then passed down to the node document d so that P(d)=1, therefore d
satisfies the query as expected.
We can consider another example, a document which contains only the
index terms information and image but not the index term retrieval. In this
example, we can assign P(retrieval|Q)=0, P(information|Q)=1 and
P(image|Q)=1. Using the same link matrix, the value of P(d) becomes 0, therefore
the document d does not satisfy the query as expected. These two examples show
how the Boolean retrieval model can be effectively simulated by the Bayesian
network model.
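The two worked examples above can be sketched in code. This is a minimal illustration, not the thesis implementation: for binary beliefs, the logical Lor and Land link matrices reduce to max and min respectively.

```python
# Minimal sketch: simulating the Boolean query
#   (information OR image) AND retrieval
# with logical link matrices. For 0/1 beliefs, Lor reduces to max()
# and Land reduces to min().

def l_or(*parents):
    # Logical OR link matrix: child is true if any parent is true.
    return max(parents)

def l_and(*parents):
    # Logical AND link matrix: child is true only if all parents are true.
    return min(parents)

def boolean_query(doc_terms):
    # P(ti|Q) = 1 if index term ti occurs in document d, 0 otherwise.
    p = {t: 1 if t in doc_terms else 0
         for t in ("information", "image", "retrieval")}
    return l_and(l_or(p["information"], p["image"]), p["retrieval"])

# First example: document containing all three index terms -> P(d) = 1.
assert boolean_query({"information", "image", "retrieval"}) == 1
# Second example: document without "retrieval" -> P(d) = 0.
assert boolean_query({"information", "image"}) == 0
```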
4.5.2 Simulating the Probabilistic Retrieval Model
In the probabilistic retrieval model, a document is described by the presence or
absence of index terms. Any document can therefore be represented as a binary
vector d(k1,k2,…,kn), where ki=1 indicates the presence of index term i in the
document and ki=0 indicates its absence. The document ranking is calculated
from a cost function for the retrieval of a particular document containing index
term i, with the document considered either relevant or non-relevant. The
Bayesian network model for the probabilistic model is represented by figure 4-9.
[Figure: a relevant node linked to the index term nodes t1, t2, …, tn, which are
in turn linked to the document node d.]
Figure 4-9 Probabilistic retrieval using a Bayesian network.
In a traditional probabilistic model, each individual document is considered
in isolation from the other documents in the collection. The set of index terms
observed in a document is restricted to the subset which occurs in the query. The
values used for the ranking depend on the ratio of the values of P(ti|relevant) and
P(ti|non-relevant) (see section 2.4.3.1).
In a Bayesian network, we use the values of P(d) at the leaf nodes of the
network (the document nodes) to determine the ranking of the document d
relative to that of the other documents in the collection. There is no explicit
representation of the query in the network. The query node is replaced by the node
relevance. The index terms are conditioned on the relevance node and restricted to those that occur in the
query. The main consideration for the probabilistic model is the estimation of the
relevant and non-relevant document set. This estimation will not be accurate
without comprehensive sampling of queries and relevance judgments. One way to
overcome this problem is to estimate the relevance of the index terms which are
found in the relevant documents. This estimation may be achieved using a small
sample set of relevant documents; the relevance of other documents can then be
estimated from the presence, in the currently observed document, of the index
terms found in the relevant documents.
The probability that an index term ti is found to be relevant is given by the
conditional probability P(ti|relevance). These values may be estimated from a
small sample retrieval or by using the inverse document frequency [Croft79] when
no relevance information is available. One advantage of the probabilistic retrieval
implementation using our model is that this estimation may be derived from a
user’s confidence in the terms they use in the query. This is possible because we
provide an explicit relationship between relevance and index terms via the
relevance->index term link. When a set of relevance judgements is available after
the retrieval is made, only the P(ti|relevance) values need be changed to represent
the new confidence level of the user in the index terms. We can see from this that our
model provides a consistent interpretation of the P(ti|relevance) which has been
lacking in the traditional probabilistic model. Thus, we also can claim that our
model subsumes the binary probabilistic retrieval model since it is able to simulate
the model and provides a more intuitive interpretation of the relevance estimation
in the model.
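For comparison, the binary independence ranking that this section relates to our network can be sketched as follows. The p and q estimates below are purely illustrative, not values from the thesis:

```python
import math

# Hedged sketch of binary independence ranking [Robertson76]:
#   p = P(term present | relevant), q = P(term present | non-relevant).
# Documents are ranked by the sum of log-odds weights over query terms
# present in the document. The p/q values here are invented.

def bim_score(doc_terms, term_stats):
    score = 0.0
    for term, (p, q) in term_stats.items():
        if term in doc_terms:
            score += math.log((p * (1 - q)) / (q * (1 - p)))
    return score

term_stats = {"information": (0.8, 0.3), "retrieval": (0.6, 0.2)}
d1 = bim_score({"information", "retrieval"}, term_stats)
d2 = bim_score({"information"}, term_stats)

# A document matching more (and better-discriminating) terms ranks higher.
assert d1 > d2 > 0
```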
4.5.3 Inference Network
We began discussion on the different approaches to the inference network model
for information retrieval in section 4.4.3. In this section, we will further analyse
and comment on the differences between our model and Turtle and Croft’s
inference model.
The two models differ in their assumptions about causation in the
network. Turtle and Croft’s model assumes that documents are the main cause of
the index terms’ existence, which in turn causes the information need to exist. Our
model, on the other hand, asserts that the main cause of the existence of index terms
is the information need. These contradicting causation assumptions lead to
different inference directions. Turtle and Croft’s model infers the
evidence as P(Q|d), whereas ours infers it as P(d|Q). We have shown in
section 4.4 that the inference process of P(d|Q) produces a more accurate
document ranking in the retrieved document set. The benefits of applying P(d|Q)
rather than P(Q|d) are not limited to the provision of accurate ranking. There are
other benefits which may be gained through our model compared with Turtle and
Croft’s model. For example, our model is able to capture interconnectivity
between documents in a collection. This enables us to implement
relevance feedback in the model easily. To illustrate this, consider the two networks
depicted in figure 4-10.
[Figure: two three-layer networks relating the query node q, index terms T1–T6
and documents D1–D3.]
Figure 4-10a Our model    Figure 4-10b Turtle and Croft’s model
Figure 4-10 Contrasting Bayesian networks.
Relevance feedback is a method used in information retrieval to refine a
user’s information requirements after the user judges the relevance of the
documents retrieved. There are two basic ways by which feedback data can be
incorporated in a Bayesian network: adding evidence and altering the
dependencies represented in the network. The two approaches are fundamentally
different. Adding evidence always leaves the probability distribution in the
network unchanged. However, it will alter the beliefs in the network to be
consistent with that distribution. Altering the dependencies, either by changing the
topology of the network or by altering the link matrices changes the probability
distribution which in turn alters beliefs in the network.
Implementing the addition of evidence in the network is very
straightforward in our model (see figure 4-10a). The document nodes for all the
documents that the user chose as relevant are set as evidence nodes; in other
words, we assign belief = 1 to those document nodes found relevant by the
user. Then we can instantiate the query node again and calculate the new
probabilities in all the document nodes. Note that considering the document nodes
as evidence will set all the index terms for those documents to be dependent, and
in turn they will change the beliefs in the other documents that share the same
index terms with the relevant documents. Therefore the introduction of new
evidence into the network will change the beliefs not only in those documents
found to be relevant but also in other documents in the network. This approach
cannot be implemented in Turtle and Croft’s model because they have disjoint
inference for each document; i.e. they instantiate each document in isolation from
other documents in the collection and only consider those index terms that exist in
the query. If we assign the document nodes as evidence, then it is only possible to
change the belief of those index terms activated by the instantiated document
nodes and shared by both document and query.
Consider the small network depicted in figure 4-10b. The new evidence of
D1 and D3 (the shaded document nodes) will not change the belief in D1, D2 or D3
because D1 and D3 do not share common index terms with the query. Even if there
is an index term shared by D1 and the query, say for example D1 contains term T3,
the instantiation of the document nodes D1 and D3 will make the index terms
independent. Thus, it will fail to alter the belief in the document D2 which the user
has not chosen as a relevant document.
The only way relevance feedback can be implemented in Turtle and Croft’s
model is by altering the dependencies in the network. Given a set of
documents that a user has chosen as the relevant documents, a new query
representation layer in the network can be built. The new query representations
can either replace or extend the original query representations. Considering the
same situation as in our previous example for evidence feedback, this dependency
feedback is implemented as the networks depicted in figures 4-11 and 4-12.
This approach can be implemented in both our model and in the model of Turtle
and Croft.
Figures 4-11a and 4-11b show how dependency feedback can be
implemented in both our model and Turtle and Croft’s model, by augmenting the
query with all the index terms found in the judged relevant documents (i.e. D1 and D3 in
our example). The inference process is then performed as normal, without
assigning evidence to the relevant documents as in the adding-evidence approach.
[Figure: the networks of figure 4-10 with the query representation layer
augmented by the index terms found in the documents judged relevant.]
Figure 4-11a Our model    Figure 4-11b Turtle and Croft’s model
Figure 4-11 Dependency feedback by augmenting.
Figure 4-12, on the other hand, shows dependency feedback implemented
by replacing the index terms in the query with the index terms in the documents
judged relevant. Notice that T3 and T4 are deleted from the query network
regardless of whether they were in the set of original index terms used to represent
the query.
[Figure: the networks of figure 4-10 with the original query index terms replaced
by the index terms of the documents judged relevant; T3 and T4 no longer appear
in the query network.]
Figure 4-12a Our model    Figure 4-12b Turtle and Croft’s model
Figure 4-12 Dependency feedback using replacement.
It has been shown that our model supports both relevance feedback
methods, whereas Turtle and Croft’s model supports only the dependency-alteration
approach. The efficiency of the respective approaches depends on the retrieval
situation. Evidential feedback is appropriate when we are confident that the
distribution in the collection is “correct”. A very specific collection domain is an
example where this approach is appropriate. Altering dependencies is appropriate
when we have low confidence in the model distribution and therefore want to
obtain better information about the nature of the true distribution. An example of
this approach is document space modification [Yu88, Fuhr90], which uses a set of
queries and relevance judgements to learn the “correct” distribution for documents
and representation concepts.
The failure to capture the document interconnectivity means that Turtle
and Croft’s model is static. By this we mean that the probabilities used to rank the
documents will not change when a new document is introduced to the system.
This situation occurs because the relevance of a document is calculated in
isolation from other documents in the collection. Although information retrieval is
often considered a static system whereby document addition and deletion are not
performed frequently, the ability to capture the changes of the distribution of
knowledge in the collection is still a desirable feature which our model is able to
provide.
The difference between the two models can also be seen in terms of the
efficiency of the inference process. In our model, the inference process starts with
the instantiation of the query node. Since there is only one query we only need to
perform one inference process. The Turtle and Croft model starts the inference
process by instantiating each individual document in the collection. Thus repeated
inference processes are required, in proportion to the number of documents in the
collection.
Although our model is more efficient in terms of the inference process, this
can only be achieved by carefully handling the independence assumption and the
link matrix. Our model is richer than Turtle and Croft’s model by virtue of its
document interconnectivity; however, this also means that our network is more
complex than Turtle and Croft’s model, and thus the independence assumption
needs to be handled carefully. The link matrices will also be larger in our model,
because we have to create a link matrix over the index term parents of each
document node. Since the size of such a link matrix is 2^n, where n is the number
of index term parents of a document node, n greater than 20 will be common. This
link matrix size issue is not a problem in Turtle and Croft’s model, because the
maximum size of their link matrix is proportional to the number of index terms
used in the query.
We will discuss these implementation issues in Chapter 5.
4.6 Summary
In this chapter, we have presented a new formal model for information retrieval
based on Bayesian network theory. The proposed model subsumes the existing
models by providing a more general framework for modeling information retrieval.
The proposed model can represent existing models by using appropriate network
representations. As a result, the decision of adopting a specific network model can
be seen as an issue of implementation.
The proposed model consists of two separate networks, namely a
document network and a query network. These two networks are combined
during the matching process. The matching process is started by instantiating the
query node and calculating the effect of this new evidence in the probability
distribution in the network.
Different inference directions in the network have different effects on the
probability distribution. The two possible inference directions in information
retrieval are P(d|Q) and P(Q|d). We have shown in this chapter that the first
approach gives a more accurate result and also provides a richer model through its
ability to support both evidence and dependency relevance feedback.
We have concentrated in this chapter on a discussion of our model of
information retrieval using Bayesian networks. Implementation issues will be
discussed in Chapter 5. These can be categorised into two groups, namely the
computational complexity of the inference algorithm and the indirect loop which
arises during the relevance feedback process. We will discuss some existing approaches to these
two issues and their practicality in information retrieval implementations. We will
117
also present our approaches to reducing the computation complexity and for
dealing with the indirect loop.
Chapter 5
Handling Large Bayesian Networks
5.1 Introduction
We have presented a Bayesian network model for information retrieval in the
previous chapter. This chapter presents discussion on issues associated with the
practical implementation of the model. Implementing an information retrieval
system using a Bayesian network is not a straightforward task. Exact inference
in a Bayesian network, as performed by algorithms such as Pearl’s, has been shown to
be NP-hard [Cooper90]. Thus, approximation techniques need to be considered in
order to implement the model in practice. Before we discuss the issue of
implementing the Bayesian network using an exact algorithm, first we will
illustrate the use of the exact algorithm in a retrieval process (section 5.2).
Following this example we will discuss possible problems that may occur during
implementation if we strictly follow Pearl’s algorithm or use the basic model
presented in chapter 4. There are two main issues to be considered, namely the
complexity of the computation and the indirect loop.
The complexity of computation is caused by the size of the link matrix.
The link matrix size in an information retrieval network is determined by the
number of index terms contained in a document. Documents with more than 30
index terms are common in a document collection. Therefore, the total size of the
link matrices for the collection will be generally very large. We propose an
approximation method involving the addition of a virtual layer to the network
[Indrawan96]. This method provides a solution to the problem of computational
complexity without losing much of the accuracy required by the network to
perform the retrieval. Existing approximation methods such as node deletion, link
deletion and layer reduction do not provide adequate accuracy in modeling the
retrieval task. We will discuss these different methods and compare them with our
proposed method in section 5.3.
The indirect loop problem in the network occurs when evidence
relevance feedback is used. We will discuss the existing solutions to the indirect
loop problems in section 5.4. The discussion includes the methods of clustering,
conditioning and sampling. The main problem with these existing methods lies
with their own individual computational complexity. This complexity prevents us
from adopting these methods for information retrieval. We propose a new
method involving the use of an intelligent node [Indrawan98]. This method
provides for much less complex computation than the existing methods, thus
providing a good solution to the indirect loop problem. We present this new
method of handling the indirect loop in section 5.5.
5.2 An Illustration of an Exact Algorithm
As we discussed in chapter 4, we have adopted Pearl’s algorithm to carry out the
inference process in our model. There are two main approaches to inference
algorithms, namely the exact and approximate approaches [Henrion90] and
Pearl’s algorithm falls into the former category. An exact algorithm defines the
complete probability distribution of the propositions in the network, and as such is
computationally intensive. The approximate approach, on the other hand, uses
estimation techniques to approximate the probability distribution in the network. To
illustrate the use of Pearl’s algorithm in our model, consider an example of
retrieval in a small network depicted by figure 5-1.
[Figure: the query node q is linked to information (weight 0.7) and retrieval (0.4);
document D1 is linked from similarity (0.4) and image (0.7); D2 from image (0.3),
information (0.5) and retrieval (0.4); D3 from retrieval (0.6) and feedback (0.8).]
Figure 5-1 A network example of a retrieval process.
We assume that we know the values of the link weights for each link in the
document network and that these links were derived from the index term
distribution in the collection using equation 4.1 (chapter 4). We also assume that
a user has found that the index term information carries more weight than does
the index term retrieval. Therefore their links are assigned weights of, for example,
0.7 and 0.4 respectively. If the user is not willing to assign the weights themselves,
approximate weights can be derived from the distribution of index terms in the
collection such as the tf*idf value. Note that the query is submitted as a natural
language query and is not restricted to one sentence. Thus it is possible that an
index term occurs more than once in the query, i.e. tf > 1 for that term. The idf
value for an index term is derived in the same way as in the document network,
that is, from the number of documents containing the index term in the collection.
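As an illustration, a hedged sketch of deriving such approximate link weights from tf*idf, assuming the common formulation idf = log(N/n_t); all of the counts below are invented for the example:

```python
import math

# Illustrative sketch of approximate query link weights via tf*idf,
# assuming idf = log(N / n_t). The counts are invented.

def tf_idf(tf, n_docs_with_term, n_docs_total):
    return tf * math.log(n_docs_total / n_docs_with_term)

# "information" occurs twice in the query and in 100 of 1000 documents;
# "retrieval" occurs once and appears in 400 of 1000 documents.
w_information = tf_idf(2, 100, 1000)
w_retrieval = tf_idf(1, 400, 1000)

# Rarer, repeated terms receive larger link weights.
assert w_information > w_retrieval > 0
```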
The retrieval process is started by instantiating the query node q. The
effect of this new evidence in the network is then passed on to its children, that is,
to the nodes information and retrieval. The value P(information=true|q=true)
is given by the weight of the link that connects the two nodes. Thus, using the
weight-sum link matrix (see section 4.3.1), we have the following link matrices for
the nodes information and retrieval respectively:
Linformation = | 1   0.3 |
               | 0   0.7 |

Lretrieval   = | 1   0.6 |
               | 0   0.4 |
Link matrices also have to be created for nodes D1, D2 and D3. Since
these nodes have more than one parent node, their link matrices need to reflect
the possibility that only one parent is true or that multiple parents are true. To
capture these possibilities we can combine the weight-sum and or approaches to
the link matrix. That is, the probability that the child node is true is obtained by
considering every possible combination of true parent nodes. Using this
approach, we can calculate the probability of the child node given the states of its
parent nodes in the document network as follows:
P(D1|similarity=true)=0.4
P(D1|image=true)=0.7
P(D1|similarity,image=true)=0.4*(1-0.7)+(1-0.4)*0.7+0.4*0.7=0.82
or, equivalently,
P(D1|similarity,image=true)=1-(1-0.4)(1-0.7)=0.82
LD1 = | 1   0.6   0.3   0.18 |
      | 0   0.4   0.7   0.82 |
P(D2|image=true)=0.3
P(D2|information=true)=0.5
P(D2|retrieval=true)=0.4
P(D2|image,information=true)=1-(1-0.3)(1-0.5)=0.65
P(D2|image,retrieval=true)= 1-(1-0.3)(1-0.4)=0.58
P(D2|information,retrieval=true)=1-(1-0.5)(1-0.4)=0.7
P(D2|image,information,retrieval=true)= 1-(1-0.3)(1-0.5)(1-0.4)=0.79
LD2 = | 1   0.7   0.5   0.6   0.35   0.42   0.3   0.21 |
      | 0   0.3   0.5   0.4   0.65   0.58   0.7   0.79 |
P(D3|retrieval=true)=0.6
P(D3|feedback=true)=0.8
P(D3|retrieval,feedback=true)=1-(1-0.6)(1-0.8)=0.92
LD3 = | 1   0.4   0.2   0.08 |
      | 0   0.6   0.8   0.92 |
If we assume that there are 2000 index terms in the collection, the prior
probability for the nodes similarity, image and feedback is equal to 1/2000, or
5×10⁻⁴. Instantiating node q thus results in:

P(similarity) = 5×10⁻⁴    P(retrieval) = 0.4
P(image) = 5×10⁻⁴         P(feedback) = 5×10⁻⁴
P(information) = 0.7
Using the appropriate link matrix we can calculate
P(D1|q) = 0.4(5×10⁻⁴)(1−5×10⁻⁴) + 0.7(1−5×10⁻⁴)(5×10⁻⁴) + 0.82(5×10⁻⁴)(5×10⁻⁴)
≈ 5.5×10⁻⁴
P(D2|q) = 0.3(5×10⁻⁴)(1−0.7)(1−0.4) + 0.5(1−5×10⁻⁴)(0.7)(1−0.4) +
0.4(1−5×10⁻⁴)(1−0.7)(0.4) + 0.65(5×10⁻⁴)(0.7)(1−0.4) +
0.58(5×10⁻⁴)(1−0.7)(0.4) + 0.7(1−5×10⁻⁴)(0.7)(0.4) +
0.79(5×10⁻⁴)(0.7)(0.4)
≈ 0.454
P(D3|q) = 0.6(0.4)(1−5×10⁻⁴) + 0.8(1−0.4)(5×10⁻⁴) + 0.92(0.4)(5×10⁻⁴) ≈ 0.240
Therefore, the relevance ranking of the documents in the retrieval network
depicted in figure 5-1 for query q is document D2, followed by document D3 and
then document D1.
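This ranking computation can be sketched by enumerating every configuration of parent index terms; a hedged illustration assuming the combined weight-sum/or link matrices above. Whatever rounding is used, the resulting order of D2 ahead of D3 ahead of D1 holds:

```python
from itertools import product

# Hedged sketch: recompute the document scores of figure 5-1 by summing
# the link-matrix "true" entries (1 - prod(1 - w) over true parents)
# across every configuration of the parent index term nodes.

def posterior(parent_weights, parent_priors):
    names = list(parent_weights)
    total = 0.0
    for states in product([False, True], repeat=len(names)):
        p_config, p_fail = 1.0, 1.0
        for name, on in zip(names, states):
            prior = parent_priors[name]
            p_config *= prior if on else 1.0 - prior
            if on:
                p_fail *= 1.0 - parent_weights[name]
        total += (1.0 - p_fail) * p_config
    return total

priors = {"similarity": 5e-4, "image": 5e-4, "information": 0.7,
          "retrieval": 0.4, "feedback": 5e-4}
docs = {
    "D1": {"similarity": 0.4, "image": 0.7},
    "D2": {"image": 0.3, "information": 0.5, "retrieval": 0.4},
    "D3": {"retrieval": 0.6, "feedback": 0.8},
}
scores = {d: posterior(w, priors) for d, w in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
assert ranking == ["D2", "D3", "D1"]
```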
In order to implement the evidence feedback, the document nodes which
are found to be relevant by the user are instantiated. For example, suppose the
user found that they were actually looking for information retrieval articles that
discuss image retrieval through similarity and the possibility of using relevance
feedback to improve the retrieval. In this case they might choose documents D1 and
D3 as the relevant documents. To recalculate the belief in the document nodes, we
instantiate nodes D1 and D3. When we instantiate these nodes, this new evidence
will influence the belief in the nodes similarity and image due to document D1,
and in nodes retrieval and feedback due to document D3. Because the link q→
information and q→retrieval meet tail-to-tail in node q, any change in the belief
in nodes information or retrieval will require the recalculation of belief in the node
q (see the heuristic check for the independence assumption in section 3.4.2). As a
result an indirect loop exists in the network when the evidence feedback approach
to relevance feedback is used. In other words, the network now becomes multi-
connected.
When a local propagation algorithm like Pearl’s which is devised to handle
singly connected networks is used in a multi-connected network, failure may occur
in one of two ways. Firstly, it is possible that an updating message sent by one
node cycles around the loop and causes the same node to update again. This will
repeat indefinitely, preventing convergence of the propagation. Secondly, even if
the propagation does converge, the posterior probability may not be correct due
to the algorithm’s independence assumption, which does not hold for multi-connected
networks. We therefore need to adopt some method that enables us to
break this loop so that the network becomes singly connected and hence allows
use of Pearl’s algorithm. To achieve this we need to look at the independence
assumption of the network and approximation methods.
Apart from the indirect loop issue, another important issue to be
considered during implementation is that of the overall size of the network. The
investigation of reasoning with uncertainty using Bayesian networks began during
the development of diagnosis aids for medical applications [Fryback78,
Cooper84, Heckerman85, Shwe90]. The model’s assumptions and inference
algorithms were developed based on this medical diagnostic application. The size
of the network in the medical diagnosis problem was relatively small compared to
that of information retrieval. For example, the number of nodes involved in the
Pathfinder (an expert system to assist pathologists with hematopathology
diagnosis, jointly developed by Stanford University and the University of
Southern California) is 63, whereas the smallest test collection available to
information retrieval research (the ADI collection, containing 82 individual
documents) requires around 900 nodes in a Bayesian network. In real-life
applications, the number of documents in an information retrieval collection may
be more than one thousand.
The big difference in the size of the networks occurs due to the increase in the
number of propositions introduced to the network, and the size of the link
matrices (which is dictated by the number of parent nodes of a child node). This
increase in network size causes the computational complexity of the inference
algorithm to increase accordingly.
The link matrices in a retrieval network are large due to the fact that most
collections will have documents containing, on average, more than 20 index
terms. With this number of index terms per document and the binary assignment of
index terms to documents, we will have link matrices with more than 2^20 elements
for most of the document nodes. When all the documents in the collection
are considered, the overall size of the link matrices will be large. Consider the ADI
collection, with 82 documents and an average of 25 index terms per document: the
total size of the link matrices will be around 82×2^25.
Another aspect that contributes to the increase in computational
complexity is the fact that a retrieval network is a dense network whereby a large
number of nodes share common children or parent nodes. Thus, any change in
belief in an index term node may influence a large part of the network and cause
intensive recalculation.
5.3 Reducing the Computational Complexity
Reducing the computational complexity can be achieved through the utilization of
some approximation methods. There is a trade off between a Bayesian network
model’s accuracy and its computation complexity. We need to choose carefully an
approximation technique that enables us to reduce the computation space without
sacrificing much of the accuracy. By loss of the accuracy we mean the event that
the posterior probability of the approximate method is different from the exact
algorithm posterior probability.
Approximation approaches to Bayesian network inference are not new.
Many researchers have investigated approximation techniques, motivated by the
complexity of exact algorithms like those of Olmsted [Olmsted83], Pearl
[Pearl88], and Lauritzen and Spiegelhalter [Lauritzen88]. One of the common
approaches to approximation involves coarsening the state space [Chang91,
Provan95]. The coarsening effect can be achieved in different ways, namely node
and link deletion, layer reduction and intermediate node layer addition.
5.3.1 Node and Link Deletion
One obvious way to reduce the complexity of a network is to delete the parent
node and its links when we consider that its influence on its children nodes is
minimal. Consider the example in figure 5-2.
[Figure: a network with parent nodes P1–P4, children C1 and C2, and link weights
W1–W6; in the reduced network the below-threshold links W4 and W5 are deleted,
and node P4 disappears with its only link.]
Figure 5-2a Original network    Figure 5-2b Reduced network
Figure 5-2 Reducing the network with node deletion.
Let W1,W2,…,W6 represent the link weights of the network in figure 5-2a.
If we set a threshold value x, we can compare every link in the network with x and
delete any link with link weight < x. Let us assume that W4 < x and W5 < x. Thus
we can delete those links with weights W4 and W5. The result of this operation is
depicted by figure 5-2b. Notice also that node P4 is deleted from the network,
because it exists in the network solely for the proposition expressed by P(C1|P4);
its effect on other propositions in the network is therefore lost when link W4 is
deleted. This approach can be used for approximation in information retrieval
systems with the proviso that caution needs to be taken in determining the
threshold value.
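The deletion step can be sketched as a simple filter over weighted links; the threshold x = 0.1 and the weights below are invented for illustration:

```python
# Sketch of node and link deletion: drop every link whose weight falls
# below a threshold x; any parent left without links disappears with it.
# The weights and threshold are invented for illustration.

def prune(links, x):
    # links maps (parent, child) -> link weight.
    return {edge: w for edge, w in links.items() if w >= x}

links = {("P1", "C1"): 0.6, ("P2", "C1"): 0.5, ("P3", "C1"): 0.4,
         ("P4", "C1"): 0.05, ("P3", "C2"): 0.08, ("P2", "C2"): 0.7}
kept = prune(links, 0.1)

assert len(kept) == 4
# P4's only link was below threshold, so the node itself is deleted.
assert "P4" not in {parent for parent, _ in kept}
```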
In information retrieval systems, we discriminate between documents
according to their relevance using term weights, because the link weights in our
model are implemented using tf*idf weighting. This weighting is known to
measure the scarcity of terms in the collection, and it influences the precision level
of retrieval [Sparck-Jones72]. Thus, documents containing index terms with high
associated term weights are assumed to be highly relevant. When we
remove all the links with weights below the threshold value from the network, we
actually reduce the term discrimination ability of the network because the range of
the term weights is reduced. Therefore, with this approach we will not sacrifice
system performance in terms of recall, but may lose some performance in terms of
precision. The question that remains is how much reduction in precision we can
tolerate. The only way to check the degradation in precision is to choose an
arbitrary threshold value and then examine the precision level of the reduced
network model. The chosen threshold will vary from collection to collection due
to differences in network structure.
This method is therefore appropriate for recall-oriented systems. A
recall-oriented system aims to provide the best coverage of the concept required
by the user, without worrying too much about the positions of the relevant
documents in the retrieved document list; it aims, however, to retrieve all the
relevant documents.
5.3.2 Layer Reduction
Node deletion may also be achieved through the layer reduction method
[Provan95]. In this approach the nodes of a particular layer are deleted. The links
that lead to and from these nodes are then joined to create the new links. This
approach is illustrated by figure 5-3.
[Figure: a three-layer network with links p1, p2 and p3 between the top and middle
layers and q1, q2 and q3 between the middle and bottom layers; in the reduced
network the middle layer is collapsed and each path is replaced by a direct link
with a combined weight such as p1q1.]
Figure 5-3a Original network    Figure 5-3b Reduced network
Figure 5-3 Reducing the network by collapsed layer.
Consider a part of a larger network as depicted in figure 5-3a. It contains
three layers and has two sets of link weights. The links p1, p2 and p3 connect the
top and the middle layers, whereas q1 ,q2, q3 connect the nodes in the middle and
the bottom layers. When the middle layer is collapsed, the reduced network is
depicted by figure 5-3b. The effect of the nodes in the middle layer is replaced by
the new combined link weights. For example, links p1 and q1 are now combined
into link p1q1, whose new weight can be calculated as the product of p1 and q1.
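The weight combination used in layer reduction can be sketched as follows (a simplified case with a single path through each middle node; the weights are invented):

```python
# Sketch of collapsing a middle layer: a path through a middle node with
# incoming weight p and outgoing weight q becomes a direct link with
# weight p * q. The weights are invented for illustration.

def collapse(in_weights, out_weights):
    # in_weights[m] / out_weights[m]: link weights into / out of middle node m.
    return {m: in_weights[m] * out_weights[m] for m in in_weights}

combined = collapse({"m1": 0.5, "m2": 0.9}, {"m1": 0.4, "m2": 0.2})
assert abs(combined["m1"] - 0.2) < 1e-9
assert abs(combined["m2"] - 0.18) < 1e-9
```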
With this approach, the number of link weights and nodes is reduced.
However, the number of elements in the link matrix of each child node actually
increases, because each node in the bottom layer will have more parent nodes
than in the original network. Moreover, if we collapse the index term layer
in the document network of our model, we will lose the ability to produce
document interconnectivity. In fact, if we reduce the network by taking out the
index term layer from the document network, we will have a network that gives a
retrieval function similar to that of using inner product with tf*idf weighting (see
130
section 2.4.2). In other words, the retrieval function of the network will be
equivalent to that of counting of the number of index terms shared by the query
and the document. Thus, when we use this approximation technique, we are
restricting the retrieval to a simple matching function.
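The simple matching function that the collapsed network reduces to can be written in a few lines (an illustrative sketch; the term names and weights are invented):

```python
def inner_product_score(query_terms, doc_weights):
    """Inner-product retrieval with tf*idf weighting (section 2.4.2):
    sum the document's weights over the index terms shared with the query."""
    return sum(doc_weights.get(t, 0.0) for t in query_terms)

# hypothetical tf*idf weights for one document
doc = {"information": 0.7, "retrieval": 0.4, "image": 0.5}
score = inner_product_score(["information", "retrieval", "feedback"], doc)
```

Only the shared terms "information" and "retrieval" contribute to the score; the unmatched query term contributes nothing.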
5.3.3 Adding a Virtual Layer
We propose a new technique for reducing the computational complexity, namely
adding a virtual layer to the network. In this approach, for a child node whose
number of parents exceeds a specified maximum number of parents per node,
the parent nodes are divided into a number of groups. Each group is then linked
to a virtual node. These virtual nodes are then connected to the original child
node. To illustrate the idea, consider the example network depicted in figure
5-4a. The child node in figure 5-4a has 100 parent nodes. If we assign only binary
propositions to all the nodes in the network, we will have a link matrix with 2^100
elements to calculate P(child). This is virtually impossible to implement in practice
due to limitations of computer resources. If we divide the parent nodes into small
groups, link each group to a virtual node and connect these virtual nodes to the
child node, we can greatly reduce the number of elements of the link matrix in the
child node.
Figure 5-4b portrays our modified version of the network in figure 5-4a. In
the modified network the 100 parent nodes are divided into 10 groups (with each
group containing 10 nodes). In this example, we have 10 virtual nodes in the
virtual layer.² The child node now only has 10 parents (the number of virtual
nodes). Thus the number of elements in the link matrix of the child node has been
reduced to 2^10. Each virtual node is linked to 10 parent nodes and will have a link
matrix of size 2^10. This makes the total size of the link matrices in the network
11 x 2^10, a dramatic reduction from the original size of 2^100.
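The arithmetic behind this reduction can be checked with a short sketch, assuming binary propositions and an exact split of the parents into equal-sized groups:

```python
def link_matrix_cost(n_parents, group_size):
    """Total link-matrix entries with binary propositions, before and
    after inserting one virtual layer that splits the parents into
    equal-sized groups."""
    original = 2 ** n_parents
    n_groups = n_parents // group_size      # assumes an exact split
    # each virtual node needs a 2**group_size matrix; the child now has
    # n_groups parents, hence a 2**n_groups matrix
    reduced = n_groups * 2 ** group_size + 2 ** n_groups
    return original, reduced

orig, red = link_matrix_cost(100, 10)
```

For 100 parents in groups of 10 this reproduces the figures in the text: 2^100 entries originally, 11 x 2^10 after adding the virtual layer.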
Figure 5-4a Original network; Figure 5-4b Modified network with virtual layer
Figure 5-4 Network with virtual layers.
The computing resources available for implementing the Bayesian network
dictate the choice of the maximum number of parents per node. Since information
retrieval systems are mostly used interactively, some small experiments may need
to be performed to find an acceptable response time for a query, with the
maximum number of parents per node adjusted accordingly.
The number of virtual layers is not limited to one. Once the limit has been
determined, we can distribute the index term nodes into a number of groups. The
total number of layers depends on the total number of index terms to be
distributed and the limit on the maximum number of parents per node. If the
number of virtual nodes is greater than the specified limit, then these virtual nodes
need to be grouped together as were the index term nodes. This process of
grouping and introducing new layers continues until every node in the network
has a number of parent nodes less than the specified limit. The optimum network
is obtained when we have a symmetric distribution of nodes in the network. For
example, consider our previous situation where a child node has 100 parent
nodes. If we set the limit to 15 parent nodes, it is better to have 10 groups
of 10 members each rather than, say, 6 groups of 15 members and one group
of 10 members.
² We call the node and the layer virtual because they do not actually form part of the original knowledge; they are artificially added to it.
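The preference for a symmetric split can be expressed as a small search. This is a hypothetical heuristic of our own, not an algorithm given in the thesis: it tries increasing numbers of groups, never exceeding the parent limit, and stops at the most even split.

```python
import math

def balanced_grouping(n_parents, max_parents):
    """Choose a number of groups so that group sizes never exceed
    max_parents and are as equal as possible; ties are broken in
    favour of fewer virtual nodes."""
    best = None
    for g in range(math.ceil(n_parents / max_parents), n_parents + 1):
        base, rem = divmod(n_parents, g)
        sizes = [base + 1] * rem + [base] * (g - rem)
        spread = sizes[0] - sizes[-1]   # largest minus smallest group
        if best is None or (spread, g) < (best[0], best[1]):
            best = (spread, g, sizes)
        if spread == 0:                 # perfectly symmetric: stop
            break
    return best[1], best[2]

groups, sizes = balanced_grouping(100, 15)
```

For 100 parents with a limit of 15 this selects 10 groups of 10 members, matching the example in the text.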
The virtual node acts as a summary node for a group of parent nodes. That
means that the weight of the link that connects the virtual node and the child node
has to capture summary information about the distribution of the parent nodes.
One obvious way to achieve this is to take the group average of the original
parent-to-child links and assign this average to the virtual-to-child link. The link
weights of the parent-to-virtual links are then obtained by dividing the original
link weights by the group average.
Another possible approach is to normalise the virtual-to-child link by
assigning it the maximum weight of the parent-to-child links in the group, and to
modify the original parent-to-child links in the group by dividing them by this
maximum weight.
The two approaches can be described as follows:
Let
v be the virtual node to be introduced,
p1, p2, p3, …, pn be the parent nodes connected to v,
c be the child node,
w1, w2, w3, …, wn be the weights of the links p1->c, p2->c, …, pn->c respectively,
u1, u2, u3, …, un be the weights of the links p1->v, p2->v, …, pn->v,
wv be the weight of the link v->c.
The weight wv of the link v->c using the average approach is:
wv = (w1 + w2 + … + wn) / n      (5.1)
The weight wv of the link v->c using the max approach is:
wv = max(w1, w2, w3, …, wn)      (5.2)
The weight ui of the link pi->v is:
ui = wi / wv,  for 1 ≤ i ≤ n      (5.3)
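Equations 5.1–5.3 can be exercised with a short sketch (the weights are invented for illustration):

```python
def virtual_link_weights(w, mode):
    """Compute the virtual->child weight w_v (eq. 5.1 or 5.2) and the
    normalised parent->virtual weights u_i (eq. 5.3) for one group."""
    wv = sum(w) / len(w) if mode == "average" else max(w)
    u = [wi / wv for wi in w]
    return wv, u

wv_max, u_max = virtual_link_weights([0.2, 0.4, 0.8], "max")
wv_avg, u_avg = virtual_link_weights([0.2, 0.4, 0.8], "average")
```

Under the max approach the strongest parent keeps a normalised weight of 1.0, so a high-weight index term is never diluted by its group; under the average approach it is scaled against the group mean.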
The average and the maximum approaches have different ramifications for
information retrieval systems. First, we look at the effect of taking the average
approach. Since we are averaging the values within the groups and assigning the
nodes randomly to the groups, we would have a similar wv for different virtual
nodes in the network. Note that we assign the value of tf*idf to wi, and a high
value of tf*idf is associated with a high level of importance conferred on an index
term for finding the relevant documents. Therefore, the effect of index terms with high
term weight values on the calculation of P(d) may be reduced if there are index
terms with low weight values in the same group. As a result, a document which
contains these high tf*idf index terms may lose its relative superiority compared
with a document which has low term weight but belongs to a group with a higher
weight average. This means we cannot interpret the probability in a document
node as the absolute value of the document’s probability in matching the user’s
request, but rather we should see it as a relative ranking value in comparison with
other documents in the collection. In the worst case, the precision may be affected
and may even decline.
The maximum approach, on the other hand, ensures that index terms with
high term weight values are not much affected by nodes with low weight values
in the group. This is achieved by assigning the maximum tf*idf of the group to
wv. Since we assign the maximum value of the group to the virtual-to-child link,
we ensure that the index terms with high term weight values have a major
influence in estimating the probability of a document's relevance to the user's
request. A document whose index terms have high term weight values will not be
undervalued as in the average approach. Thus, with this approach the probability
values in the document nodes will be a closer approximation of the absolute
probability of relevance to the user's query than under the average approach.
The choice of normalising the parent-to-virtual links with the average or the
maximum of the group should be made according to the implementation
requirements of the system. The average approach may be used when precision
is not a major consideration. The maximum approach, on the other hand, will
suit systems which require high precision retrieval. Regardless of the choice of
normalising approach, adding the virtual layer provides a
practical layer reduction solution to the computational complexity problem
through a drastic reduction in the size of link matrices. Moreover, our proposed
method retains the semantic structure of the original network presented in chapter
4. This characteristic of our approach provides a more accurate approximation
than the link and deletion approaches because these existing approaches reduce
the network model to an inner product retrieval function.
The adoption of a better clustering mechanism for grouping the parent
nodes can further increase the accuracy of our approximation method. In the
next section we present one clustering algorithm that can be adopted. A method
of assessing the goodness of the clustering model will be presented in chapter 7.
5.3.3.1 Clustering the Parent Nodes
In the clustering described in the previous section, we ignored the link weight
distribution in the network. The grouping is based on the sequence of the weights
in the index file. With this random approach, the performance will depend on the
sequence of the link weights to be classified. To avoid this dependency, we
propose another simple classification that takes into consideration the distribution
of the link weights.
In this non-random classification, we group similar link weights into a
group. The similarity is measured by the difference between a link weight under
consideration and the mean of a group of link weights. To generalise the proposed
concept, consider a set of items that have some attributes and these items are to be
classified into a number of groups. The clustering process involves examining an
individual item and finding its most appropriate group. The similarity in our
clustering is measured by the distance between the item’s attribute value from the
means of the groups. Each time an item is examined, its attribute values are
compared with the existing group’s mean. During the clustering process, an item
may have several candidate groups because the difference between its attribute
value and the group’s mean is still within the boundary of the maximum difference
allowed (in our algorithm this difference is called the significant level). The item,
however, can only be assigned to one group. The best group for the item is the
group whose mean is closest to the value of the item's attribute. Our clustering
algorithm is thus as follows:
TYPE item
    id              TYPE INTEGER
    attributeValue  TYPE FLOAT        /* one value per attribute */
TYPE population
    total           TYPE INTEGER
    noAttributes    TYPE INTEGER
    individual      TYPE item
TYPE group
    id              TYPE INTEGER
    member          TYPE population
    mean            TYPE FLOAT        /* one mean per attribute */
TYPE class
    totalIndividual TYPE INTEGER
    totalGroup      TYPE INTEGER
    member          TYPE group
MAXITERATION TYPE INTEGER

procedure cluster(input TYPE population, output TYPE class,
                  significantLevel TYPE FLOAT, iteration TYPE INTEGER)
begin procedure
    DECLARE /* local variables */
        i, j, k, numberAttribs, found, candidateGroup, changes TYPE INTEGER
        currentDifference TYPE FLOAT

    if iteration = MAXITERATION then return 0

    for i = 0 to output→totalIndividual - 1 do
        found = 9999              /* sentinel: item not yet located */
        candidateGroup = 9999     /* sentinel: no fitter group found */
        for j = 0 to output→totalGroup - 1 do
            /* count how many of the item's attributes fall within
               significantLevel of this group's means */
            numberAttribs = 0
            for k = 0 to input→noAttributes - 1 do
                currentDifference = | output→member[j].mean[k] -
                                      input→individual[i].attributeValue[k] |
                if currentDifference < significantLevel then
                    numberAttribs = numberAttribs + 1
                endif
            enddo
            /* the group is a candidate if all attributes are close enough */
            if numberAttribs = input→noAttributes and j ≠ found then
                candidateGroup = j
            endif
            /* locate the group the item currently belongs to, if any */
            if found = 9999 then
                for k = 0 to output→member[j].member.total - 1 do
                    if output→member[j].member.individual[k].id =
                       input→individual[i].id then
                        found = j
                    endif
                enddo
            endif
        enddo
        if found ≠ 9999 then      /* item already assigned to a group */
            if found ≠ candidateGroup and candidateGroup ≠ 9999 then
                /* a better group exists: delete the item from the old
                   group and add it to the new group */
                changes = changes + 1
            endif
        else                      /* first assignment of this item */
            if candidateGroup ≠ 9999 then
                /* add the item to the existing candidate group */
                changes = changes + 1
            else
                /* create a new group for this item */
                changes = changes + 1
            endif
        endif
    enddo
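For concreteness, the mean-based grouping can also be expressed as a short runnable sketch. This is our own simplification, not the thesis algorithm verbatim: it handles a single attribute per item and makes a single pass rather than iterating to MAXITERATION.

```python
def cluster(items, significant_level):
    """Assign each item (a single attribute value, for brevity) to the
    group whose running mean is closest, provided the distance is within
    significant_level; otherwise start a new group."""
    groups = []  # each group is a list of attribute values
    for x in items:
        best, best_dist = None, None
        for g in groups:
            mean = sum(g) / len(g)
            dist = abs(x - mean)
            # candidate group: close enough and closest seen so far
            if dist <= significant_level and (best is None or dist < best_dist):
                best, best_dist = g, dist
        if best is None:
            groups.append([x])   # no candidate: open a new group
        else:
            best.append(x)       # join the fittest group
    return groups
```

For example, link weights 0.1, 0.12, 0.5 and 0.52 with a significant level of 0.05 fall into two groups, one around 0.11 and one around 0.51.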
Applying the clustering algorithm in our information retrieval network
model, the “items” to be classified are the index terms within a document. The
attributes of the items are given by the link weights. The estimation of the
significant level can be derived from the standard deviation of the link weight
distribution in the document collection. For example, in the ADI collection, the
standard deviations of the distribution of the link weights of the individual
documents are in the range 0.08 to 1.0. Thus, the significant level should be
estimated within this range.
We suggest that adopting a clustering technique that recognises the
distribution of the link weights will increase the precision but not the recall of the
retrieval. The recall will be unchanged since no additional knowledge is
introduced into the network. We will present a comparison of the performance of
the two clustering approaches, random and non-random, in chapter 6. In chapter 7
we will present a method to evaluate the goodness of the clustering model, which
in turn can help us to determine the optimal clustering for our network.
5.4 Handling the Indirect Loops
The indirect loop exists in our Bayesian network model when evidence feedback is
implemented. Pearl's inference algorithm as used in our model will not work
properly in this situation. There are some existing approaches to handling the
indirect loops. These approaches perform some preprocessing to find and break
the loops before performing inference. We propose a new method for handling
the indirect loop. Our method is based on the idea that we can relax the
independence assumption in the network so that we can have a finite propagation
in the loop. This approach will suit the information retrieval application or indeed
any other large network applications because the proposed independence
assumption does not require much additional computation compared to the
preprocessing approaches.
There are three existing preprocessing approaches for handling cyclic
propagation or loops in Bayesian networks, namely clustering, conditioning and
stochastic simulation [Pearl88]. Clustering involves forming compound nodes in
such a way that the resulting network of clusters is singly connected. Conditioning
involves breaking the communication pathways along the loops by instantiating a
select group of nodes. Stochastic simulation involves assigning to each node a
definite value and having each processor inspect the current state of its neighbour,
compute the belief distribution of its host node, and select one value at random
from the computed distribution. Beliefs are then computed by recording the
percentage of times that each processor selects a given value.
Consider the small retrieval network in Figure 5-5 which serves to
illustrate these different approaches for handling the network loop. Note that this
network is similar to the network in figure 5-1. The only difference is that we have
instantiated document D3 as the user chose it as the relevant document during the
relevance feedback process.
Figure 5-5 Retrieval network with a loop.
A loop exists in the network when we use document nodes as evidence in
relevance feedback. In the example in figure 5-5, if we take D3 as evidence (thus
P(D3)=1), it will change the belief in nodes retrieval and node feedback. The
belief in the proposition in node retrieval will change the belief in node q and the
belief in document node D2. The new belief in D2 in turn will change the belief in
the index term nodes image, information and retrieval and this belief in turn will
change the belief of the ancestor nodes all the way up to the node q thus creating
an indirect cycle or loop. To allow Pearl’s algorithm to work properly, a method
to transform this multi connected network into a singly connected network is
required. In the following sections, we present the possible approaches to the
indirect loop propagation problem in Bayesian networks and discuss their
appropriateness for the implementation of information retrieval systems.
5.4.1 Clustering
The clustering approach involves collapsing nodes to transform the network from
a multi-connected network to a singly connected network. In our example in
figure 5-5, the obvious choice for the nodes to be collapsed are information and
retrieval. The modified network is now depicted in figure 5-6.
Figure 5-6 Clustered network.
The collapsing of the nodes information and retrieval into one node,
information-retrieval, forces us to estimate P(information, retrieval | q),
P(D2 | information, retrieval) and P(D3 | information, retrieval). In the medical
diagnosis field, where
this method was originally introduced, the estimation of any combination
proposition’s conditional probabilities was relatively easy to obtain because each
node in the medical diagnosis application represented a medical condition which
could be easily observed. The combination of two observations was usually
available from observations of past diagnoses. Thus, in the medical diagnostic
context, the estimation of (information,retrieval), (¬information,retrieval)³,
(information,¬retrieval), (¬information,¬retrieval) can be derived and used to
create a link matrix estimation of the effect of the collapsed node such as
P(D2|information,retrieval), P(D3|information,retrieval) or
P(information,retrieval|q). Such observations are not as straightforward in an
information retrieval network. In our model, the probability values in the
document nodes are used to rank the documents and the document nodes are the
nodes which exhibit the effect of knowing something about the beliefs in the index
term nodes. Since the inference process in information retrieval aims to find the
most relevant documents given a user query whereby a set of index terms are
considered, isolating and observing the effect of individual index terms or a group
of index terms is not desirable and certainly not a simple task.
Another problem that may occur in implementing clustering in information
retrieval is deciding which nodes are to be clustered. We know that any document
which shares two or more index terms with the query network will create a loop in
the network. An extreme choice is to clamp all these index terms into one
compound node, both in the query network and document network as in the
approaches of Cooper [Cooper84] and Peng and Reggia [Peng86] for the medical
diagnosis application. Unfortunately, the exponential cardinality and
structurelessness of the link matrix for these large compound nodes make the
inference difficult to compute.
³ ¬ symbolises negation
A popular method of clustering the nodes is the join tree [Lauritzen88]. If
the clusters are allowed to overlap each other until they cover all the links of the
original network, then the interdependencies between any two clusters are
mediated solely by the variables they share. If we insist that these clusters continue
to grow until their interdependencies form a tree structure, then Pearl’s tree
propagation algorithm can be used in the inference. This method of clustering
produces a better structure and less complex propositions in the clustered nodes
than Cooper's approach. However, implementing this approach in information
retrieval may be very costly, because the number of nodes in the network means
that the preprocessing involved in finding the cluster set will be time and
resource consuming. It is also worth noting that the retrieval
network is a dense network. That is, there is a high interconnectivity between the
index terms and the document nodes in the network. This characteristic means that
there is increased complexity inherent in the process of finding the cluster set. The
clustering method may be appropriate when the document collection is relatively
stable, that is when documents are not often deleted or added to the collection.
Because any addition or deletion of documents means changes will occur in the
network distribution, the clustered sets need to be regenerated in such an event.
5.4.2 Conditioning
Conditioning is based on our ability to change the connectivity of the network to
render it singly connected by instantiating a selected group of nodes [Dechter85].
We can condition the multi-connected network in figure 5-5 into a singly
connected network as depicted by figure 5-7 by cutting the loop in the network at
node q. The node q is called the loop-cutset node. Once we assign a node to be a
loop-cutset node, we can instantiate node q to block the propagation of the belief
in the path information-q-retrieval. By doing this, we will have a singly connected
network and Pearl's singly connected algorithm becomes applicable.
Figure 5-7 A singly connected network as the decomposition of the multi connected network.
If we want to recalculate the value of P(D2) given that the user chose D2
as the feedback evidence, we first need to assume that q=0 and then propagate its
value through the network until it reaches D2. Using the same network, we now
assume that q=1 and repeat the propagation process. Finally, we average the two
results weighted by the posterior probabilities P(q=1|D2=1) and P(q=0|D2=1).
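The final weighted-average step of conditioning can be sketched as follows; the probability values are invented for illustration, and in a real system each conditional belief would come from a propagation run:

```python
def condition_on_cutset(posterior, belief_given):
    """Conditioning: combine the beliefs computed under each
    instantiation of the loop-cutset node q, weighted by the
    posterior probability of that instantiation."""
    return sum(posterior[v] * belief_given[v] for v in posterior)

# hypothetical P(q=v | evidence) and the belief propagated under q=v
p_d2 = condition_on_cutset({0: 0.3, 1: 0.7}, {0: 0.2, 1: 0.6})
```

Each additional cutset node multiplies the number of instantiations to enumerate, which is the source of the combinatorial explosion discussed below.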
Conditioning provides a working solution in many Bayesian network
applications. Unlike clustering, however, if the network is highly connected or
dense it may suffer from combinatorial explosion [Pearl88].
The message size grows exponentially with the number of nodes required for
breaking up the loops in the network. Since during the inference we must
consider each possible combination of instantiated values of the loop-cutset
nodes, the number of loop-cutset instances is equal to the product of the numbers
of possible values of those nodes. This product is clearly exponential in the
number of loop-cutset nodes.
The information retrieval network suffers from this combinatorial
explosion because it is a dense network. It is possible to use a minimisation
algorithm to reduce the cutset; however, it has been shown that the minimisation
problem is NP-hard [Stillman91]. Thus conditioning in an information retrieval
network would be a very costly process.
5.4.3 Sampling and Simulation
Stochastic simulation is a method of computing probabilities by computing how
frequently events occur in a series of simulation runs. If a causal model of a
domain is available, the model can be used to generate random samples of
hypothetical scenarios that are likely to develop in the domain. The probability of
any event or combination of events can be computed by counting the percentage
of samples in which the event is true.
In general, the simulation methods are divided into two main categories,
namely Forward sampling [Bundy85, Henrion86, Shachter86,88] and Markov
simulation [Pearl87, Chavez90, Berzuini89]. The main difference between the two
approaches lies with the directionality of the propagation during the simulation.
Forward sampling, as the name implies, only involves propagation in the causal
direction of the network. The drawback of this method is that its
complexity is exponential in the number of observed or evidence nodes
[Henrion90, Hulme95]. Thus, forward sampling can only be practical if the
evidence nodes are at the root of the network.
Markov simulation (sometimes known as Gibbs sampling) on the other
hand, allows propagation in both directions. However, this method will have
convergence problems when the network contains links that are near deterministic,
that is close to 0 or 1 [Chin89].
In our information retrieval model, we have propagation in both directions.
The diagnostic or back propagation occurs when we need to infer P(ti) given
knowledge of P(dj) with an arc from ti to dj. Moreover, a loop exists in the
network when we apply evidence feedback and the evidence lies with the
document nodes, which are non-root nodes. Thus, forward sampling is not
appropriate for our information retrieval network because of these two problems:
the lack of support for backward propagation and the exponential complexity of
the algorithm for non-root evidence nodes.
Markov simulation (Gibbs sampling) on the other hand does not suffer
from the above two problems. To implement this method, we need samples of
propositions and their associated observation values. For information retrieval, we
can obtain this from the relevance judgment of a test collection. A test collection
contains sets of queries with associated documents that are judged to be relevant
to the queries. A number of simulations may then be run on a particular query and
the set of retrieved document observed. A score is kept for each time a particular
document in the relevance judgment set for the query is retrieved. With this
approach, we have to make one important assumption, namely that the ‘causal
model’ in the network represents the correct distribution of the document
collection and that it will generate a 100% level of recall and precision. However,
it has been shown that this level of performance is unachievable in information
retrieval models [Wallis95]. Even if we were content with the approximate model
and hence with accepting less than a 100% level of recall and precision, the size
of the network would make running the simulation too costly. Pearl [Pearl88]
showed that to get within 1% of the approximate value, we need over 100 runs. It
is accepted that the accuracy of the sampling depends on the number of runs
performed [Henrion90]. Thus, although many researchers have taken the sampling
approach towards handling multi connected networks [Henrion86, Pearl87,
Fung90b, Shachter90, Hulme95], this approach does not provide a practical
solution for information retrieval. We propose instead a method using intelligent
nodes to solve the problem of multi connected networks.
5.5 Dealing with a Loop Using Intelligent Nodes
We have investigated different approaches to handle loops in Bayesian networks.
However, all of them are computationally impractical for information retrieval
networks due to the network size and density. We propose a method involving
intelligent nodes. The aims of our proposed approach are as follows:
1. Providing a means to break the loop so that the propagation in the
network is finite.
2. Providing a means to break the loop without introducing additional
computational complexity to the inference process.
We use the term intelligent because in Pearl's inference algorithm the nodes are
memoryless, whereas in our approach the nodes do have some memory. The
memory is used to “remember” the source of a received message so that the next
time a message arrives from this source the node will reject it. In other words, the
intelligent nodes act as filters of messages in the network loop. They filter the
child messages of a node so that a message is blocked from updating the parent
node value of the original message. To illustrate the method, consider the network
shown in figure 5-8.
Figure 5-8 Network with intelligent node retrieval.
Consider that the retrieval producing the initial document ranking has been
performed. Thus, each node in the above network has a belief value attached to it
(see section 5.1 for the actual values). When document D3 is chosen as the
relevant document during relevance feedback, i.e. the node D3 is instantiated, this
node will send evidential support, or a child message, to both its parent nodes,
namely retrieval and feedback. With this new evidential support, the node
retrieval recalculates its belief in the proposition represented by the node. In
Pearl's algorithm this new belief would then be passed to its ancestors. Our
approach, on the other hand, stops the message from going to the parent(s) of node
retrieval. Note that we have produced the initial document ranking, so that the
belief at node retrieval is arrived at due to the instantiation of the proposition on
the node q. Therefore the degree of influence of the query on node retrieval has
been reflected in this node's belief value. If we send evidential support λretrieval to
node q, λretrieval will contain some value of πq. This means that the value of πq will
be amplified. This amplifying effect does not aid our understanding of the problem
and will cause the propagation to run indefinitely.
Consider the following reasoning process in a real life situation. In the
morning coffee break my colleague tells me that it is going to rain tomorrow. If I
tell her after lunch break that it is going to rain tomorrow because I happen to be
reading the weather column in a newspaper at lunch, her belief about the
proposition tomorrow it is going to rain should not increase because that
information came from me, a person who received the same information earlier
from her (the same source). Her initial information may have come from the same
article in the newspaper that I read at lunch. I may be considered to be acting as a
mirror of her information. My information does not introduce new knowledge to
her. The same principle may be applied to our loop problem in the information
retrieval network when we block the child message λretrieval from node q. The
message λretrieval would only amplify πq.
The independence assumption is slightly changed with this approach. We
actually relax the independence assumption to solve the loop problem. In the strict
d-separation or heuristic check, setting D3 as evidence will cause all the nodes
in the sample network to become dependent. We add to the checking procedure a
routine to find a filter node that will make some of the nodes in the network
independent and hence break the loop. A loop can easily be identified by checking
whether any node in the network has fan-in descendants and fan-out descendants.
If there is a node that meets this condition, a loop exists in the network and
needs to be broken using the filter node. The modified independence assumption
now becomes:
Given a node with descendants which have fan-in links and
ancestors which have fan-out links; if this node is the direct
parent of a node with fan-in links, the ancestors of this node are
independent when the direct child of this node is instantiated.
Using this independence assumption, the node retrieval causes this node and node
q to be independent when D3 is instantiated. In the implementation of this
independence assumption for information retrieval, we can safely assume that the
candidates for the filter nodes are the index term nodes in the document layer. The
filter nodes are the index terms that appear both in the query and in the relevant
document found in the relevance feedback. Note that this filtering does not apply
in the production of the initial document ranking because we have instantiated
node q and node q is not the direct child of either node information or node
retrieval which are the candidates for the filter node.
The modified independence assumption proposed is not significantly
different from Pearl's original independence assumption (see section 3.4.2), apart
from the fact that our proposed assumption includes knowledge of the
information source. However, our proposed assumption provides a method for
breaking the infinite propagation with very little computational cost. The
additional computational cost involved is only the storage cost of keeping the
knowledge of the information source, or the memory of the intelligent node. This
memory can be easily implemented as a boolean variable. Thus, for a system that
involves large network structures, such as an information retrieval system, this
assumption presents a workable solution to the problem of indirect loops.
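The filtering behaviour of such an intelligent node can be sketched in code. This is an illustrative sketch only, not the thesis implementation: the class and method names are invented, and a set of seen information sources is used as a direct generalisation of the single boolean memory described above.

```python
# Sketch of an intelligent node that remembers the information source of
# each message and filters repeats, so that propagation around an
# indirect loop terminates. Names (Node, propagate) are assumptions.

class Node:
    def __init__(self, name):
        self.name = name
        self.children = []
        self.seen_sources = set()  # memory of information sources

    def propagate(self, message, source):
        # Filter: a message from an already-seen source would only be
        # travelling around a loop, so it is dropped here.
        if source in self.seen_sources:
            return
        self.seen_sources.add(source)
        for child in self.children:
            child.propagate(message, source)
```

Connecting three such nodes in a cycle and calling propagate once shows that the message visits each node exactly once instead of circulating forever.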
5.5.1 Example of the Feedback Process Using
Intelligent Nodes
Assume that we assign node D3 as the evidence node used in the relevance
feedback process. We assign P(D3)=1 and P(retrieval=true | D3=true) = 0.8. The
initial belief in node retrieval is 0.4 as calculated in section 5.1. The new belief in
node retrieval is calculated as the combination of the effect of the new evidence
which arrives in node retrieval as λretrieval. λretrieval comes from two of its child
nodes, namely D2 and D3. With these values, the beliefs in the network nodes
become:
λretrieval = 0.4 + 0.8 = 1.2
P(retrieval) = 0.4 * 1.2 = 0.48
P(D2|q,D3)= 0.3(5*10-4)(1-0.7)(1-0.48) + 0.5(1-5*10-4)0.7(1-0.48) +
0.7(1-5*10-4)0.7*0.48 + 0.6(5*10-4)(1-0.7)(1-0.48) +
0.58(5*10-4)(1-0.7)0.48 + 0.65(5*10-4)0.7(1-0.48) +
0.71(5*10-4)0.7*0.48 = 0.4173
P(D1|q,D3) = P(D1|q) because document D1 does not share any common
index terms with document D3. If we have another document called D4 which
contains the index term feedback, then P(D4|q,D3) ≠ P(D4|q). The value of
P(D2|q,D3), as expected, is increased because it contains the index term retrieval
which is found in the relevant document D3.
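The arithmetic of this example can be checked directly. The sketch below simply reproduces the numbers above, using the additive combination of the λ messages as given in the text (the normalising constant is omitted, as in the example):

```python
# Re-checking the feedback example: lambda messages arriving at node
# retrieval from its children D2 and D3, combined as in the text.
prior_retrieval = 0.4    # initial belief from section 5.1
lam_from_D2 = 0.4        # lambda message contributed via child D2
lam_from_D3 = 0.8        # P(retrieval=true | D3=true), D3 instantiated

lam_retrieval = lam_from_D2 + lam_from_D3   # 1.2
belief = prior_retrieval * lam_retrieval    # 0.48
print(round(lam_retrieval, 2), round(belief, 2))
```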
5.6 Summary
We have presented issues and changes to the basic network model which need to
be considered when implementing Bayesian networks for information retrieval
systems. The main issues have been shown to be the complexity of the
computation and indirect loop propagation during the relevance feedback process.
The complexity of computation arises from the large number of parents per node,
which causes an explosion in the size of the link matrices.
An information retrieval network can be considered a dense network
whereby a large number of nodes share the same parent nodes. The fact that the
network is dense precludes some of the existing approaches, such as layer reduction
and link-node deletion, from being of practical use in information retrieval
implementations. We have proposed a new method involving the addition of a
virtual layer in order to reduce the size of the link matrices. Although the total
number of nodes in the network is increased, this approach provides a systematic
method for reducing the size of the link matrices in order to meet the computing
resources available. In the virtual layer approach, the parent nodes are grouped
into a number of clusters. Each cluster is then connected to a virtual node. This
virtual node is in turn connected to the child node.
There are different ways of grouping the parent nodes. We introduce two
simple methods, namely random and non-random clustering. The random
clustering approach does not take into consideration the distribution of the link
weights. The assignment of a node to a group is determined arbitrarily by the
sequence of the link weights in the data file. The non-random clustering scheme, on the other
hand, considers the link weight distribution and classifies the nodes accordingly.
We will also present in chapter 7 a method which can be used to measure the
goodness of the clustering methods in order to find the optimal approximation of
the model.
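The two grouping schemes can be sketched as follows; the function names and the dictionary representation of parent nodes and link weights are assumptions for illustration, not the thesis code.

```python
# Sketch of parent clustering for the virtual layer. Each returned group
# of parents is attached to one virtual node, which in turn connects to
# the child node, so each link matrix shrinks from 2^n to 2^size rows.

def random_clusters(links, size):
    """Random scheme: groups follow the link-weight sequence in the data file."""
    nodes = list(links)
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]

def weight_clusters(links, size):
    """Non-random scheme: sort parents by link weight first, then group."""
    nodes = sorted(links, key=links.get, reverse=True)
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]
```

With the weight-based scheme, parents with similar link weights end up under the same virtual node, which is the property the non-random clustering exploits.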
Another issue in the implementation of the Bayesian network model for
information retrieval discussed in this chapter is the indirect loop problem. The
indirect loop exists in our network when we want to implement evidence
feedback. We have proposed a solution involving the use of intelligent nodes
which act as message filters in the network and break the loops in the network.
The intelligent nodes are part of the original network but, under our
independence assumption, differ slightly in that they remember the information
source. By knowing the information source, these nodes can filter messages
better than under Pearl’s independence assumption and guarantee finite
propagation around the loop.
In the next chapter, we will measure the performance of our retrieval
model using three test collections, namely ADI, MEDLINE and CACM. The
performance will be reported in terms of recall and precision, a common
performance measurement unit in information retrieval research. Firstly, we will
look into the influence of different weightings applied to the link weights. Detailed
discussion of ways of estimating the link weights in both query and document
networks are presented. Secondly, we will present a performance comparison
between the two clustering methods discussed in this chapter. In the last part of
the next chapter, we compare the performance of our model with other
information retrieval models to show that our model not only provides a more
general model for information retrieval but also exhibits higher recall and
precision.
Chapter 6
Model Performance Evaluation
6.1 Introduction
Information retrieval systems provide us with the ability to locate and retrieve
useful documents from a large collection of documents. As users, we would
expect these systems to perform retrieval tasks as rapidly and economically as
possible. Beyond this requirement, the value of information retrieval systems
can also be seen to depend on their [Salton83]:
• ability to identify useful information accurately and quickly.
• ability to reject non-relevant documents.
• versatility of the retrieval methods.
We have shown in chapter 5 that the proposed Bayesian model fulfills the last
requirement since different retrieval models can be simulated using appropriate
network representations. In this chapter, we present evaluation results measuring
our system’s performance against the first two requirements above. The
conventional measures of recall and precision will be used to study the
performance of the system.
The recall level measures the ability of the systems to find all the useful
or relevant documents for a given query. The precision level measures the rate of
rejecting non-relevant documents and of finding the relevant ones before the non-
relevant documents are retrieved. A perfect information retrieval system is one
which achieves a 100% level of recall and precision. This is achieved by retrieving
all the relevant documents before retrieving any non-relevant documents for a
given query. This is not easy to achieve because most practical retrieval systems
retrieve some non-relevant documents before all the relevant documents are
retrieved or, in other words, the level of precision usually decreases as the recall
level increases. In fact, it has been shown that without relevance feedback, most
current information retrieval systems can only achieve a maximum of 80%
precision at 100% recall [Rijsbergen92].
Improvements in the performance of information retrieval systems may seem
very small in terms of absolute percentages. However, a small percentage
makes a substantial difference when we consider the massive number of
documents involved in the retrieval process. Moreover, increases in
precision also become more difficult to achieve near the optimum level, as noted
by Rijsbergen [Rijsbergen92].
In information retrieval experiments, the recall and precision levels are
obtained by performing several retrievals on the test collection using the supplied
queries. A test collection in information retrieval experiments comprises:
• A set of documents – current test collections generally contain
information from the original document such as title, author, date and
an abstract. The collection may include additional information such as
controlled vocabulary terms, author-assigned descriptor and citation
information. The documents used in the collection are usually taken
from journals and/or newspapers.
• A set of queries – These queries are often taken from actual queries
submitted by users. They may be either expressed in natural language
or in some formal query language such as boolean expressions.
• A set of relevance judgements – For each query in the query sets,
normally a set of relevant documents is identified. This identification
process can be done manually by human experts or by statistically pooling
the retrieval output of several information retrieval systems.
Each of these query-document sets in the test collection is used during
experiments. The interaction of these sets in an information retrieval experiment
is depicted by figure 6-1.
[Figure: standard queries from the test collection are run through the retrieval model; the resulting document ranking is compared with the relevance judgements to compute the recall and precision levels.]
Figure 6-1 Model for experiments in information retrieval systems.
Using the standard queries in a test collection, a retrieval system under evaluation
performs a document search in the documents set. The result of the search is a list
of document identifications whereby the document assumed most relevant is
ranked first. This list of rankings is then compared with the list of relevance
judgments. The relevance judgment list itself does not imply any ranking; it only
contains the identification numbers of documents judged relevant to the
query. Using the recall and precision formulae (see section 6.2), the recall and
precision levels are calculated.
There are several existing standard test collections available for
comparing the performance of information retrieval systems. These collections
vary in collection size, the number of queries, the structure of information and
domain of the information. We used three popular and well-studied test
collections to evaluate the performance of our system. These were ADI1,
MEDLINE and CACM respectively2. The characteristics of these collections are
shown in table 6-1.
                            ADI        MEDLINE    CACM
Information domain          Computing  Medical    Computing
No. documents               82         1033       3204
No. index terms             2086       52,831     74,391
Ave. no. index terms/doc    25.451     51.145     23.218
St. dev. index terms/doc    8.282      22.547     19.903
No. queries                 35         30         64
Ave. no. query terms        9.967      9.967      10.577
Size in kilobytes           2,188      1,091      37,158
Table 6-1 Test collection characteristics.
The ADI collection is the smallest test collection. It contains articles from
computing journals. This collection is usually used only in the initial
experimental stage because of its limited size. The MEDLINE test collection was
created from medical articles in the MEDLINE database. The queries in this
collection were obtained from the queries submitted by actual users of the
MEDLINE database. The CACM test collection was created from articles published
in the Communications of the ACM from 1958 to 1979. Each record in this
collection contains author, title, abstract, citation information, manually assigned
keywords and Computing Review categories. The CACM collection is the largest
test collection amongst the traditional test collections.
1 The full document collection of the ADI is given in appendix A. The full queries are given in appendix B.
2 The collections can be obtained from the anonymous-ftp site ftp.cornell.cs.edu.
The nature of the test collection influences to some degree the result of
experiments in information retrieval research. More specifically, the query and the
relevance judgment sets are the two main influences on the experimental results.
Experiments presented at the 5th Text REtrieval Conference [Voorhees96] showed
that retrieval using long, more specific queries produces better recall and
precision levels than retrieval using short queries.3 Compared with the test
collection used in TREC-5, most of the queries in the traditional test collection
such as ADI, MEDLINE, and CACM are considered to be short. Thus the
maximum level of precision with 100% recall will be expected to be less than
100%.
We have started this chapter by looking at the methodology involved in
conducting experiments in information retrieval. The rest of this chapter is
organised as follows. Section 6.2 reviews in detail how one part of the test
collection, namely the relevance judgment set, can influence the outcome of the
experiments. We will discuss how the relevance judgment sets are created and the effect
these different creation methods have on information retrieval experiments. In
this section, we will also provide examples which show how to calculate the
recall and precision levels using the retrieved document ranking produced by the
system and the relevance judgement from the test collection.
3 On average, the short queries consist of fewer than 20 index terms and the long queries contain 100 index terms.
Section 6.3 presents the performance of our basic model. We use the term
basic model to refer to a retrieval model that does not use any weighting scheme.
We will use this basic model to compare and discuss the performance of different
approaches to probability estimation in section 6.4. The effects of assigning
different probability estimates to the document link weights, query link weights and
virtual link weights are discussed there. Finally, we will compare the
performance of our model with other existing retrieval models namely the vector
space model [Salton83] and Turtle’s inference network [Turtle90].
6.2 The Relevance Judgement Set
The most difficult task in creating a test collection is the creation of the relevance
judgement set. In the current test collections, these relevance judgments are
created using one of two methods:
• Human judgements
In this approach, the relevance judgment sets are created by humans
who judge whether a document is relevant to the query. They may be the
actual users who submitted the queries or independent experts in the
collection domain. This method, however, is only practical for small
collections, especially when independent domain experts are used,
because the experts have to inspect every document in the collection
in order to determine its relevance to the query. The
relevance judgments in the three test collections used in our
experiments are created using this method.
• Pooling methods
In this method, the output of a number of different information
retrieval systems is pooled: the first N documents in each ranked
output are combined using some statistical method. This method has
been claimed by some to find the vast majority of relevant documents
[Salton83]. However, Wallis [Wallis95] argues the opposite. As in
other pooling applications, the number of pool participants affects the
accuracy of the relevance judgment pool: the higher the number of
participants, the higher the chance of finding the relevant
documents.4 Despite this issue, the pooling method is the only
practical way to derive the relevance judgment set when the collection
is very large, such as the Wall Street Journal collection (250Mb).
Performing domain expert judgement is too expensive in such
collections, since it is not possible for experts to inspect every single
document in the collection.
Regardless of the limitations of the methods for creating the relevance judgment
sets, test collections remain the most widely used tool for comparing retrieval
performance. The test collections are used to generate the recall and precision
figures, which are the units of comparison in information retrieval experiments. The recall and
precision can be calculated using the following formulae:
recall = r / R                                                  (6.1)

precision = r / N                                               (6.2)

where
r is the number of relevant documents retrieved for a given query,
R is the number of relevant documents in the collection for a given query,
N is the number of documents retrieved for a given query.

4 The relevance judgement of the Wall Street Journal collection has been improved over the years by the participants of TREC; the first few versions of the relevance judgement sets for this collection may thus suffer from the low number of systems used in the pool [Voorhees96].
To illustrate the use of the above formulae, consider the following example of a
set of ranked retrieved document numbers and a set of document numbers judged
relevant in a given query.
Retrieved: 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
Relevant: 1,2,4,5,6,8,11,14,17,20
Using equations 6.1 and 6.2 respectively, the recall and precision for the above
retrieved set are:
                               Recall (%)   Precision (%)
first 5 documents retrieved    40           80
first 10 documents retrieved   60           60
first 15 documents retrieved   80           53.3
first 20 documents retrieved   100          50
Table 6-2 Examples of recall and precision for different numbers of inspected documents.
We can see from table 6-2 that as the recall level increases the precision level
decreases. Thus, the aim of achieving 100% recall and 100% precision may be
considered an unachievable goal [Rijsbergen92].
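The worked example above can be reproduced with a short routine applying equations 6.1 and 6.2 at each inspection point (the function name and data layout are illustrative, not part of the thesis):

```python
# Recall and precision (equations 6.1, 6.2) at fixed inspection points,
# reproducing table 6-2 for the example ranking above.

def recall_precision(retrieved, relevant, cutoff):
    r = sum(1 for d in retrieved[:cutoff] if d in relevant)
    return r / len(relevant), r / cutoff   # (recall, precision)

retrieved = list(range(1, 21))
relevant = {1, 2, 4, 5, 6, 8, 11, 14, 17, 20}
for n in (5, 10, 15, 20):
    rec, prec = recall_precision(retrieved, relevant, n)
    print(n, round(100 * rec, 1), round(100 * prec, 1))
```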
The inspection points, i.e. the number of documents retrieved at a given
reporting point, may vary from experiment to experiment. This depends on
the size of the collection and the rate of increase in recall or of decrease in
precision. If the rate of increase in recall or of decrease in precision is very high,
a smaller interval may be needed. However, if the rate is low and the collection is
large, a bigger interval may be sufficient for us to report the performance of our
experiments without losing detail in the trends in the recall and precision level. In
the above example, we have used an interval of 5 documents as the inspection
point for calculating the recall and precision levels.
There is another way of reporting the recall and precision level. This
approach reports the value of the precision at a given recall level. In this
approach, the size of the collection does not come into consideration when
determining the interval. The only consideration is how much detail the
experimenter wants in reporting the relation between recall and precision. This
approach provides more useful information than the previous approach because it
shows clearly the relationship between recall and precision at a given point and
the trend in the recall and precision levels over the whole experiment. Using this
approach, table 6-3 reports the recall and precision levels for our
previous example. We will use this approach, reporting precision at given
recall levels, for our experimental results throughout this
chapter.
Recall level (%)   Precision (%)
10                 100.00
20                 100.00
30                 75.00
40                 80.00
50                 83.33
60                 75.00
70                 63.63
80                 57.14
90                 52.94
100                50.00
Table 6-3 Example of measuring precision at a given recall level.
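Table 6-3 can likewise be reproduced by recording the precision at the rank position where each successive relevant document appears (a sketch; the function name is illustrative):

```python
# Precision at each recall level: whenever the next relevant document is
# found in the ranking, record the recall level reached and the
# precision at that rank, as in table 6-3. The integer recall percentage
# assumes, as here, a relevant set whose size divides 100 evenly.

def precision_at_recall_levels(retrieved, relevant):
    levels, found = [], 0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            found += 1
            recall_pct = 100 * found // len(relevant)
            levels.append((recall_pct, 100 * found / rank))
    return levels

retrieved = list(range(1, 21))
relevant = {1, 2, 4, 5, 6, 8, 11, 14, 17, 20}
for level, prec in precision_at_recall_levels(retrieved, relevant):
    print(level, round(prec, 2))
```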
We have mentioned that we performed the experiments against three test
collections. For simplicity, in the majority of the discussion in this chapter we
will use only the ADI collection when discussing the effect of different
probability estimates on the recall and precision of the system. The results of the
MEDLINE and CACM experiments will be presented at the end, once the
optimum model has been established. Unless mentioned specifically, tables of
recall and precision in this chapter will be for the ADI test collection. In the next
section, we will examine the performance of our basic model.
6.3 Performance of the Basic Model
In the basic model, the value of link weights between the query node and the
query term nodes or P(ti|Q=true) is calculated as term frequency within the query
(qf). The link weights between the document nodes and the index term nodes, or
P(dj|ti=true), are calculated as the term’s frequency within the document (tf, equation
2.3.1). In what follows, we discuss the individual components of these estimates
independently, although in fact they are dependent. As a result, conclusions about
the performance of one component cannot be based on a single observation. We
will use the values of recall and precision in this basic model (table 6-4) as the
baseline performance to show the effect of varying probability estimation for the
link weights. The results were obtained from performing retrieval based on the
basic model for all queries in the ADI collection.
The precision at given recall levels for the basic system is very low, as we
expected, since this system does not confer any measure of importance on index
terms in the documents and collection. The tf and qf provide only a local measure
of importance within a document or the query. As a result, long documents will
be more likely to be ranked higher since the chance that a term will occur
frequently increases in longer documents.
Recall level (%)   Precision (%)
10                 21.79
20                 21.50
30                 15.61
40                 13.79
50                 12.59
60                 12.59
70                 12.45
80                 11.95
90                 11.60
100                11.60
Average            13.22
Table 6-4 Recall and precision of basic model.
The highest precision for this model is only 21.79% for a recall level of
10%. This result agrees with previous experimental results of various information
retrieval systems [Sparck-Jones72, Salton83, Turtle90]. The performance of this
basic model can be further improved by adopting good estimations for the
probability parameters of the model. In the next section we present the estimates of
those probability parameters.
6.4 Estimating the Probabilities
The basic system provides very simple probability estimates for the
links in the network and produces poor experimental results. We investigated
several methods of estimation. The subsequent discussion of these
probability estimates is divided into three sections, namely:
1. estimates of the importance of the query terms in explaining the
information needs of the user or P(ti|Q=true) (section 6.4.1).
2. estimates of the dependence of the documents upon the index terms in
the collection or P(dj|ti=true) (section 6.4.2).
3. estimates of the virtual layers’ distribution (section 6.4.3).
These estimates represent the link weights in the network; thus correct
estimates will lead to good retrieval performance of the model.
There are two networks used in the model, the query and document networks,
and each of them may take different estimates. We will state clearly the
parameters estimated in one network when discussing the other network’s
parameters, because the combination of the two networks’ parameters influences the
choice of parameters in each individual network.
6.4.1 Estimating P(ti|Q=true)
A user’s information need, which is represented by node Q, can be submitted to the
system using either Boolean expressions or natural language. With the natural language
approach, the query submitted to the system is indexed using a process similar to
that of indexing documents. All the words in the query that generally do not
affect the retrieval performance are removed using the stop word list (see section
2.4.1). The remaining words are then stemmed to remove common endings in
order to reduce simple spelling variations to a single form. The stemmed words
are then weighted according to their importance to the user. The weighting is used
to increase the influence of terms that are believed to be important on the
document ranking.
Two factors are commonly used in weighting the contribution of the
query terms; the frequency of a term in the query (qf) and the inverse document
frequency (idf) of a term in the collection. The assumptions made in this
approach are that:
1. a content-bearing term, which occurs frequently in the query, is more
likely to be important than the one that occurs infrequently.
2. those index terms that occur infrequently in the collection are more
likely to be important than frequent or common index terms.
Moreover, such index terms can be used as discriminators of the
document in the collection.
As we have discussed in section 4.3.2, the importance of query terms can be
estimated by the users if they have some confidence to do so. We would prefer
the user to be able to assign the importance of the query terms in their query.
However, as explained in section 2.2, sometimes users are not clear about their
information needs. Thus, they do not have the ability to estimate the importance
of the query terms and in this situation, the above qf and idf estimates can be used
as an alternative. We have tested these estimates individually as well as in
combination. Unlike the basic model, we normalised the qf estimates in order
to reduce the bias of the estimation towards long queries. The normalised qf (nqf)
of a term i in a given query j is calculated as:
nqf(i,j) = qf(i,j) / max qf(j)                                  (6.3)
where
qfi,j is the query term i's frequency within query j.
max qfj is the maximum frequency of any term in query j.
The second parameter, idf of term i in document k is calculated as:
idf(i,k) = log( N / df(i) )                                     (6.4)
where
dfi is the number of document containing term i
N is the number of documents in the collection
The combination of these two parameters may be derived from the product of
equations 6.3 and 6.4. In the rest of the discussion we will refer to it as the qf.idf
estimate. In the qf.idf estimate, the value of this parameter may be higher than 1 for
those query terms that occur infrequently in the collection. Thus, we need to further
normalise this parameter. One of the normalisation techniques is the cosine
normalisation method, introduced by the vector space model. The
nqf and idf values are considered as vectors. Equation 6.5 shows the normalisation
formula:
normalised qf.idf = (nqf × idf) / sqrt( Σ (nqf × idf)² )        (6.5)

where the sum runs over the terms in the query.
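A minimal sketch of these query weight estimates follows. The function names are illustrative, and the logarithm in equation 6.4 is taken as the natural logarithm, which is an assumption since the text does not fix the base.

```python
import math

# Query term weights of section 6.4.1: normalised query frequency
# (eq. 6.3), inverse document frequency (eq. 6.4), and the
# cosine-normalised qf.idf combination (eq. 6.5).

def nqf(qf):
    m = max(qf.values())
    return {t: f / m for t, f in qf.items()}

def idf(df, n_docs):
    # Log base is an assumption; the text writes only "log".
    return {t: math.log(n_docs / d) for t, d in df.items()}

def cosine_qf_idf(qf, df, n_docs):
    q, i = nqf(qf), idf(df, n_docs)
    w = {t: q[t] * i[t] for t in qf}
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {t: v / norm for t, v in w.items()}
```

For a query term occurring twice and found in 1 of 100 documents, this assigns a much larger weight than to a term occurring once and found in 10 documents, which is exactly the local-times-global behaviour discussed below.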
Table 6-5 shows the results of experiments using different term weights in
the query. We use the document network estimates of the basic model for these
experiments so that we can see the effect of the query’s parameter estimates.
Compared with the basic model, the performance of the model which uses qf
alone decreases. This drop in performance occurred at every recall level and
can be explained by the short nature of the queries. Since the queries involved in
the experiments are relatively short, achieving high accuracy in statistical
estimation using such limited data is difficult. This estimate may be considered
noise and, as a result, it reduces the performance.
Precision (%)
Recall (%)   Basic    qf weights   idf weights   qf.idf weights
10           21.79    21.18        34.62         58.19
20           21.50    20.11        34.62         56.86
30           15.61    13.80        29.05         52.38
40           13.79    12.45        18.97         44.57
50           12.59    11.68        18.94         40.47
60           12.59    11.62        18.14         37.85
70           12.45    11.36        15.62         31.11
80           11.95    10.32        15.62         23.65
90           11.60    9.09         15.62         23.04
100          11.60    9.09         15.62         20.25
Average      13.22    11.88        19.71         35.31
Table 6-5 Performance using different weights for query terms.
The implementation of the idf factor alone, on the other hand, increases
the performance significantly. The idf estimate is based on the statistical data
collected from the collection. The distribution of the index terms in the collection
can provide more accurate statistical estimates than the query because it derives
from a larger sample population. Moreover, the idf introduces a global
discriminator. An index term that occurs often in a query will not be a good
document discriminator when it occurs in most of the documents in the
collection. Index terms that occur less frequently in the collection are treated as
more important than those that occur more frequently.
The combination of the qf and idf factors further increases the performance
of the system beyond the idf weight alone. This combination of qf and idf produces
better results than the idf or qf used alone because the combination of both gives
local and global estimates of the parameters used. An index term that occurs
frequently in a query but does not occur frequently in the collection will be a
good discriminator of documents in the collection. Thus, instead of acting as
noise as in case of the pure qf weights, these qf weights work as intensifiers of the
statistical data provided by the idf weights.
6.4.2 Dependence of Documents on Index Terms
The probability that a term accurately describes the content of a document
can be estimated in several ways, but previous information retrieval research has
consistently shown that index term frequency in a document (tf) and inverse
document frequency (idf) are useful components of such estimates [Salton83].
Therefore, we will concentrate on estimating the link weights that involve tf and
idf.
6.4.2.1 Estimating the tf and idf Components
The tf estimate can be represented by the common ntf [Salton83,
Rijsbergen79] in which the tf of a term i in a given document j is given by
dividing the tfi,j by the maximum frequency of any term in the document as
shown in equation 6.6.
ntf(i,j) = tf(i,j) / max tf(j)                                  (6.6)
The formula is similar to the qf weighting scheme. The only difference is that it is
applied to a document instead of the query. The idf component can be estimated
using equation 6.4.
Table 6-6 shows the performance of the two estimates in the ADI
collection. The average performance of retrieval based solely on the tf
estimates of P(dj|ti=true) shows a 5.01% drop compared with
retrieval based solely on idf estimates. The difference in performance between
retrieval based on tf weights alone and idf weights alone is smaller than that
observed for the qf and idf estimates (table 6-5).
Precision (%)
Recall (%)   tf       idf
10           34.38    43.22
20           34.38    40.47
30           28.96    38.13
40           24.47    35.61
50           19.18    27.86
60           19.14    25.76
70           18.36    21.22
80           15.95    18.55
90           15.95    17.59
100          15.95    16.17
Average      22.33    27.34
Table 6-6 Performance of the retrieval using tf and idf components.
Again, this situation may be explained by the fact that documents contain more
index terms than queries, thus providing a larger sample population for the
estimates.
6.4.2.2 Estimating the Combination of tf and idf Components
The belief P(dj|ti=true) may be estimated by determining the default
belief, that is, the belief in the absence of any index terms that support or oppose the
proposition represented by the document nodes [Salton83, Rijsbergen79]. The
estimate is given by
P(dj|ti=true)=α + (1-α) × ntf × idf
Estimates for P(dj|ti=true) should lie in the range 0.5 to 1.0 and estimates
for the default belief should lie in the range 0.0 to 0.5. We investigated
several values of α in the range 0.5 to 1. The best performance was obtained when
α=0.5.
A large number of functions for combining and normalising the tf and idf
estimates were tested. Since we require the probabilities to lie in the range
[0,1], we need to normalise the combination of tf and idf because the combination
may produce values greater than 1. For example, consider the index term educat
in the ADI collection in document 14. This index term has the value of 0.8 for the
tf component when calculated using equation 6.6. The idf component of this
index term in the ADI collection is 12.13 when calculated using equation 6.4.
Thus without the normalisation the weight of index term educat will be greater
than 1.
There were two normalisation functions that we found performed best in
our experiments. The first estimation uses the cosine normalisation as shown in
equation 6.7.
P(dj|ti=true) = (0.5 + 0.5 × ntf(i,j) × idf(i)) / sqrt( Σ (0.5 + 0.5 × ntf(i,j) × idf(i))² )    (6.7)
This equation is slightly different from the cosine normalisation for the query
network (equation 6.5) in order to take into consideration the default belief of 0.5.
With this estimation method, the P(dj|ti=true) in the ADI collection are
estimated in the range of [0.03,0.468], the MEDLINE collection in the range of
[0.017,0.634] and the CACM in the range of [0.017,0.994]. These measures give
a broad range for the CACM and MEDLINE collections, but a considerably narrower
range for the ADI collection. We note that this difference influences the behaviour
of the system accordingly.
We also investigated a maximum normalisation function to produce
similar estimation ranges among the collections. In this function, the tf.idf value is
divided by the maximum tf.idf in the collection. The function is shown in
equation 6.8.
P(dj|ti=true) = 0.5 + 0.5 × ( ntf(i,j) × idf(i) / max(ntf × idf) )    (6.8)
Using this scheme, P(dj|ti=true) in the ADI collection now lies in the range [0.527, 1.0], in the MEDLINE collection in the range [0.503, 1.0] and in the CACM collection in the range [0.505, 1.0]. Compared with the cosine normalisation, this normalisation produces similar ranges for all three collections. Thus, the differences in characteristics among the collections during the experiments can be minimised.
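To make the two schemes concrete, the following Python sketch computes both estimates. It is illustrative only: the names raw_weight, tf and nidf are ours, and the per-document weight vector is assumed to be precomputed.

```python
import math

DEFAULT_BELIEF = 0.5  # the default belief (alpha) for P(dj|ti=true)

def raw_weight(tf, nidf):
    # Unnormalised combination 0.5 + 0.5 * tf * nidf; may exceed 1.
    return DEFAULT_BELIEF + DEFAULT_BELIEF * tf * nidf

def cosine_normalised(tf, nidf, document_weights):
    # Equation 6.7: divide the raw weight by the Euclidean norm of all
    # raw weights in the document, keeping the result within [0, 1].
    norm = math.sqrt(sum(w * w for w in document_weights))
    return raw_weight(tf, nidf) / norm

def maximum_normalised(tf, nidf, max_tf_nidf):
    # Equation 6.8: scale tf * nidf by the collection-wide maximum
    # before applying the default belief, giving values in [0.5, 1].
    return DEFAULT_BELIEF + DEFAULT_BELIEF * (tf * nidf / max_tf_nidf)
```

Note how the maximum scheme never drops below 0.5, which is consistent with the [0.503, 1.0]-style ranges reported above.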
Table 6-7 compares the performance of the normalisation functions on the
ADI collection.
             Precision (%)
Recall (%)   Cosine Normalisation   Maximum Normalisation
10           64.09                  63.40
20           63.80                  62.75
30           59.59                  57.52
40           54.91                  51.94
50           48.56                  45.71
60           47.35                  43.98
70           37.61                  36.01
80           28.87                  27.83
90           25.56                  26.49
100          23.37                  24.38
Average      45.37                  44.00
Table 6-7 Performance for two normalisation functions.
The average performance of the two normalisation functions differs by only 1.37%, with the cosine normalisation consistently providing higher precision. The figures in table 6-7 suggest that when only qf is used to estimate P(ti|Q=true), the choice between the cosine and maximum normalisation for estimating P(dj|ti=true) does not influence the performance significantly, although they provide different weight distribution ranges. However, when queries with cosine-normalised weighted terms are used (equation 6.7 applied to nqf), the effect of the different probability distributions in the collection, due to the choice of normalisation function for the document term weights, is significant.
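As a check, the averages in table 6-7 can be recomputed directly from the listed precision values:

```python
# Precision at 10%..100% recall from table 6-7 (ADI collection).
cosine = [64.09, 63.80, 59.59, 54.91, 48.56, 47.35, 37.61, 28.87, 25.56, 23.37]
maximum = [63.40, 62.75, 57.52, 51.94, 45.71, 43.98, 36.01, 27.83, 26.49, 24.38]

avg_cosine = sum(cosine) / len(cosine)           # ≈ 45.37
avg_maximum = sum(maximum) / len(maximum)        # ≈ 44.00
difference = round(avg_cosine - avg_maximum, 2)  # 1.37
```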
Table 6-8 shows the comparative performance when the dependence of documents on index terms is estimated using the cosine and maximum tf.idf normalisations, and the dependence of the query on index terms is estimated using cosine-normalised qf.idf.
             Precision (%)
             ADI                MEDLINE            CACM
Recall (%)   Cosine   Maximum   Cosine   Maximum   Cosine   Maximum
10           64.09    68.92     91.10    89.50     71.76    78.84
20           63.80    68.04     79.51    81.55     59.73    69.41
30           59.59    62.24     74.72    75.09     48.26    57.37
40           54.90    56.73     70.96    72.27     39.18    43.60
50           48.56    53.41     66.78    65.15     32.77    36.90
60           47.34    50.84     60.53    58.44     27.48    32.47
70           37.61    38.15     55.92    51.39     21.35    28.70
80           28.87    28.53     47.18    45.65     18.52    20.65
90           25.56    27.65     40.78    39.49     15.55    17.03
100          23.37    25.46     33.68    32.58     12.53    13.14
Average      45.37    48.00     62.11    61.11     34.71    39.81
Table 6-8 Performance using cosine and maximum normalisation in all collections.
The average precision for both the ADI and CACM experiments is higher for the maximum normalisation. The maximum normalisation produces 2.63% better average precision in the ADI collection and 5.1% better in the CACM collection. The MEDLINE experiments, on the other hand, show a 1.0% decrease in the average precision for the maximum normalisation compared with the cosine normalisation. This different behaviour in the MEDLINE collection may be explained by the fact that the lengths of the documents in the MEDLINE collection vary enormously (see the average number of index terms per document and its standard deviation for this collection in table 6-1). The average number of index terms per document is 51.1 and the standard deviation is 22.5. Thus, we can see that there are some documents which are very short. The maximum normalisation is slightly biased toward short documents.
The maximum normalisation method also produces a smaller rate of decrease in precision. Indeed, table 6-8 shows that the drop in precision is slower in the maximum normalisation columns for all three collections. For example, the precision drops 12.03% as the recall increases by 10% in the CACM experiments using the cosine normalisation. With the maximum normalisation the corresponding drop in precision is only 9.43%. Similarly in MEDLINE, the drop is greater for the cosine normalisation. In the ADI collection, although the drop is smaller (0.29% for the cosine normalisation), the fact that the highest precision is only 64.09% means that the maximum normalisation can be considered to perform better than the cosine normalisation.
The results of these experiments show that the maximum normalisation performs better overall. The cosine normalisation, although providing a slightly better average precision in the MEDLINE collection, still suffers from a rapid decrease in precision as recall increases. The cosine approach should thus be considered only for those applications that do not require high recall, such as interactive searching of library items. For applications that require high recall, such as the searching of patent records, the maximum approach is more appropriate. In the rest of the discussion in this chapter, we will thus adopt the maximum approach as our normalisation method of choice for estimating P(dj|ti=true).
6.4.3 Estimating the Virtual Layer Distribution
Section 5.2.3 introduced the concept of a virtual layer into the network in order to reduce the complexity of calculation during the inference process. A virtual layer consists of virtual nodes which act as summary nodes for a given group of index term nodes. Thus, it is important to be able to estimate the weights of the links that connect the virtual nodes to the child node of a given group of index term nodes.
There are two possible estimation methods for these link weights, namely the average and the maximum approach. The average approach takes the average value of the group's link weights as the weight of the virtual links. The maximum approach, on the other hand, takes the maximum value of the link weights in the group as the weight of the virtual links. As we predicted earlier in chapter 5, the maximum approach produces better results than the average approach. Table 6-9 shows the comparison of performance of the two approaches across the three collections.
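The two estimation rules can be stated in a few lines of Python (an illustrative sketch; the function name is ours):

```python
def virtual_link_weight(group_weights, method="maximum"):
    # Summarise the link weights of an index-term group as the weight of
    # the link from the group's virtual node to the document node.
    if method == "maximum":
        return max(group_weights)                   # maximum approach
    return sum(group_weights) / len(group_weights)  # average approach
```

For a group with weights [0.9, 0.2, 0.4], the maximum approach yields 0.9 and the average approach 0.5, illustrating how low-weight links pull down the summary in the averaging case.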
We expected that the average and the maximum approaches would have different ramifications for the accuracy of the summary estimation of the groups formed by the virtual layer approach (see section 5.2.3). However, the difference between the two approaches was not as marked as we expected. As discussed in section 5.2.3, we expected that assigning the average weight of the group to the links between virtual nodes and the document nodes would cause the low-weight links to pull down the importance of the high-weight links in the group. This situation occurs because we have assigned the nodes randomly to the groups and, as a result, similar virtual link weights may occur throughout the network.
             Precision (%)
             ADI              MEDLINE          CACM
Recall (%)   Max      Ave     Max      Ave     Max      Ave
10           68.92    68.21   89.51    89.70   78.85    76.56
20           68.04    67.26   81.55    81.15   69.41    68.49
30           62.24    61.58   75.09    75.27   57.37    58.20
40           56.72    56.48   72.27    72.20   43.60    44.03
50           53.41    52.54   65.15    65.19   36.90    36.89
60           50.84    49.92   58.44    58.08   32.47    32.58
70           38.15    37.30   51.40    51.11   28.70    28.66
80           28.53    27.75   45.65    45.61   20.66    20.41
90           27.65    26.88   39.49    39.43   17.03    16.80
100          25.46    24.72   32.58    32.27   13.14    12.83
Average      48.00    47.26   61.11    61.00   39.81    39.54
Table 6-9 Performance comparison using average and maximum estimation for virtual links.
Experimental results (table 6-9) show that the average and maximum approaches differ only slightly in their performance. In the ADI collection, the maximum approach is only 0.74% better than the averaging approach. The differences are much smaller in the MEDLINE and CACM collections, being 0.11% and 0.27% respectively. The rate of decrease in precision is also similar for the two approaches. Neither approach exhibits a retrieval bias toward either precision or recall. In this sense, both approaches may be considered of equal value.
We suggested a method of improving on the random grouping of the index term nodes for the virtual layers in section 5.3.3.1. Table 6-10 reports the comparative performance of the random and non-random clustering techniques. The non-random clustering method requires the estimation of a significance level. The standard deviation of the link weight distribution is a good estimate for this significance level. It gives us a better chance of dividing the index term nodes evenly into groups. Recall from the discussion in section 5.3.3 that the most optimised network is given by a symmetric network.
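The thesis defines the non-random grouping in section 5.3.3.1. Purely as an illustration of how a significance level n might drive such a grouping, the hypothetical sketch below sorts the links by weight and starts a new group whenever a weight differs from the first weight of the current group by more than n:

```python
def cluster_by_weight(link_weights, n):
    # link_weights: mapping from index term to link weight.
    # Returns groups of (term, weight) pairs; within each group the
    # weights differ from the group's first weight by at most n.
    groups, current = [], []
    for term, weight in sorted(link_weights.items(), key=lambda kv: kv[1]):
        if current and weight - current[0][1] > n:
            groups.append(current)
            current = []
        current.append((term, weight))
    if current:
        groups.append(current)
    return groups
```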
We calculated the standard deviation of the distribution of the link weights within each document that required classification in the ADI collection. Most of the standard deviations within the documents lie in the range 0.08 to 0.1. We tried several significance levels in this standard deviation range and performed the retrieval on the ADI collection. The results of these experiments are reported in table 6-10.
             Precision (%)
             Random cluster   Non-random cluster with significance level n
Recall (%)                    n=0.08   n=0.09   n=0.1
10           68.92            68.78    70.73    69.34
20           68.04            66.43    68.35    66.64
30           62.24            60.54    63.94    61.39
40           56.73            54.36    57.03    54.87
50           53.41            49.19    50.97    49.87
60           50.84            48.39    50.05    48.16
70           38.15            37.26    36.76    37.88
80           28.53            27.85    27.49    27.52
90           27.65            26.40    26.81    26.03
100          25.46            24.42    24.66    24.27
Average      47.997           46.362   47.679   46.597
Table 6-10 Performance of different clustering schemes.
The average precision of the random clustering method is slightly better than that of the non-random clustering method. The difference in average precision, however, is relatively small (0.32%) compared to the gain in precision at 10% to 40% recall. The results shown in table 6-10 agree with our hypothesis discussed in chapter 5, which stated that the non-random clustering method does not find new relevant documents; instead, it shifts the relevant documents higher in the ranked output. If the non-random clustering method were able to find relevant documents not found by the random clustering, the experimental results would show an increase in the average precision. Therefore, the choice between the two clustering methods depends on the objective of the retrieval system being built. If precision is very important, then the non-random clustering method is the choice, at the cost of more expensive preprocessing. On the other hand, the random clustering method is the choice when precision is less important, because it requires less computation during the clustering process.
We have presented experimental results for the different approaches to estimating the probability parameters of the model. A summary of the average precision obtained by the different estimations is shown in table 6-11.
From this table, we can conclude that the model performs best when the following probability parameters are used:
1. P(ti|Q=true), the weight of the link from the node Q to a query term node, is estimated using normalised qf.idf (equation 6.5).
2. The default belief for P(dj|ti=true) is α=0.5.
3. The tf component of P(dj|ti=true) is estimated using the normalised tf (equation 6.6).
4. The tf.idf combination of P(dj|ti=true) is best normalised using the maximum normalisation (equation 6.8).
5. The virtual link weights are estimated using the maximum probability values of the group (equation 5.2).
Estimation                                                    Maximum     Average
                                                              precision   precision
Query (none), document (none)                                 21.79       13.22
Query (qf), document (none)                                   21.18       11.88
Query (idf), document (none)                                  34.62       19.71
Query (normalised qf.idf), document (none)                    58.19       35.31
Query (normalised qf.idf), document (tf)                      34.38       22.33
Query (normalised qf.idf), document (idf)                     43.22       27.34
Query (qf), document (tf.idf with cosine normalisation)       64.09       45.37
Query (qf), document (tf.idf with maximum normalisation)      63.40       44.00
Query (normalised qf.idf), document (tf.idf with cosine
normalisation)                                                64.09       45.37
Query (normalised qf.idf), document (tf.idf with maximum
normalisation)                                                68.09       45.37
Virtual layer with maximum estimation                         68.92       48.00
Virtual layer with average estimation                         68.21       47.26
Virtual layer with non-random cluster method                  70.73       47.68
Table 6-11 Performance summary of different estimations in the ADI collection.
6.5 Performance Comparison with Existing
Models
Using the best estimations suggested in the previous section, we compared the performance of our model with two other well-known models, namely the vector space model [Salton83] and Turtle and Croft's model [Turtle90]. We chose these two models of information retrieval because they are both well-known and their experimental results are publicly available.
We can only compare Turtle and Croft's model [Turtle90] with our Bayesian network model for the CACM collection, because they did not report experimental results for either the ADI collection or the MEDLINE collection. We should also note that the accuracy of the reporting was different for Turtle and Croft's model: they report their experimental results to only one decimal place, and we will use them as they appear in their published results [Turtle90]. We compared our model with the vector space model for all three collections.
Most reporting of information retrieval experiments to date has concentrated on the average precision across different recall levels. The problem with this approach is that the comparison is biased towards precision-oriented systems [Wallis95]. As we have mentioned, not all applications of information retrieval are suited to these types of systems, for example patent office systems. For systems that require high recall, such as patent office systems, lower precision at the low recall levels is not necessarily as important as having higher precision at the high recall levels. A system that produces high precision at the high recall levels is better able to distinguish the relevant from the non-relevant documents retrieved than a system that produces lower precision at the high recall levels.
Precision at the high recall levels is very important because the actual number of documents retrieved is much greater at high recall levels. Therefore, a system with slightly higher precision will retrieve far fewer non-relevant documents than a system that produces high precision at low recall but low precision at its high recall levels. We will show that our Bayesian network model not only outperforms the vector space model and Turtle and Croft's model in terms of average precision but also, more importantly, in terms of precision at high recall levels.
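The arithmetic behind this point is simple: if a query has R relevant documents, then reaching recall r at precision p means retrieving r·R/p documents in total. A small sketch (the figures below are illustrative, not taken from the experiments):

```python
def documents_retrieved(num_relevant, recall, precision):
    # Total documents that must be retrieved to reach the given recall
    # at the given precision: (recall * R) / precision.
    return recall * num_relevant / precision

# With 100 relevant documents at 90% recall, a system with 17% precision
# forces the user to inspect fewer documents than one with 13% precision.
fewer = documents_retrieved(100, 0.9, 0.17)  # ≈ 529 documents
more = documents_retrieved(100, 0.9, 0.13)   # ≈ 692 documents
```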
6.5.1 Comparative Performance for the ADI
Our Bayesian network model outperforms the vector space model in the experiments on the ADI collection. Table 6-12 and figure 6-2 show the performance comparison between the vector space model and our model. On average, our model produces 0.62% better precision over the 10 recall levels. The improvement provided by our model is achieved at both ends of the recall range. The maximum improvement is achieved at 50% recall (2.01%) and the minimum improvement at 70% recall (0.16%).
At the low recall levels (10-20% recall), the precision of our model is between 0.96% and 1.66% better than that of the vector space model. The vector space model performs at almost the same level of precision over the middle recall levels (30-80% recall). Our model starts to outperform the vector space model again at the high recall levels, producing a 1.1% and 1.66% improvement at 90% and 100% recall respectively.
             Precision (%)
Recall (%)   Vector Space   Bayesian Network
10           67.26          68.92
20           67.26          68.22
30           62.61          62.30
40           57.85          56.78
50           51.46          53.47
60           49.98          50.71
70           37.99          38.15
80           29.28          28.53
90           26.52          27.65
100          23.80          25.46
Average      47.40          48.02
Table 6-12 Performance comparison with vector space for ADI collection.
The graph in figure 6-2 shows the comparative recall and precision levels for retrieval on the ADI collection. From this graph we can see clearly that the rate of decrease in precision is almost the same for the vector space model and our Bayesian network model, with the exception of the 90% and 100% recall levels. The curve representing our model is flatter at these two recall points; in other words, our model exhibits a smaller rate of decrease in precision as recall increases. Therefore, our Bayesian network model will clearly outperform the vector space model for high-recall-oriented systems.
The higher precision at the high recall levels achieved by our Bayesian network model is mainly due to the adoption of a graph that enables explicit representation of the connectivity among the index terms and the documents in the collection. It has been suggested that this connectivity will improve retrieval performance [Croft84, Croft87a], because the explicit representation allows documents that do not contain the query terms to be retrieved if they share many index terms with documents that do contain query terms. This index term and document connectivity is absent from conventional keyword-based matching models such as the vector space model, and this absence has contributed to their inferior performance.
[Figure: precision (%) versus recall (%) curves comparing the vector space and Bayesian network models on the ADI collection.]
Figure 6-2 Comparative performance for the ADI collection.
6.5.2 Comparative Performance for the MEDLINE
The experimental results on the MEDLINE collection show a similar behaviour to those on the ADI collection. Table 6-13 shows the experimental results. The average precision of the Bayesian network model for the experiments on the MEDLINE collection is 1.01% better than that of the vector space model. The maximum improvement is achieved at 100% recall (7.61%) and the minimum at 90% recall (1.69%). The vector space model shows good precision at the 10%, 30%, 50%, 60% and 70% recall levels. However, the precision produced by the vector space model decreases drastically at the two extreme recall levels compared with our model. For example, it drops 11.41% in precision when the recall increases from 10% to 20%, whereas our model's drop is 7.96%. Figure 6-3 clearly shows that our model produces a steadier decrease in precision than the vector space model.
The Bayesian network model's superiority is clearly shown by the precision at the high recall levels. For example, our model produces 7.61% better precision than the vector space model at the 100% recall level. This behaviour is similar to the behaviour of our model in the ADI experiments, but the difference in precision is much greater in the MEDLINE experiments: in the ADI experiments, the difference between the precision of our model and that of the vector space model at 100% recall is 1.66%.
             Precision (%)
Recall (%)   Vector Space   Bayesian Network
10           91.12          89.51
20           79.71          81.55
30           75.40          75.10
40           70.54          72.27
50           67.00          65.15
60           58.85          58.44
70           52.96          51.40
80           43.06          45.65
90           37.80          39.49
100          24.97          32.58
Average      60.10          61.11
Table 6-13 MEDLINE experimental results.
6.5.3 Comparative Performance for the CACM
The Bayesian network model behaves similarly on the CACM collection as on the ADI and MEDLINE collections. It outperforms both the vector space model and Turtle and Croft's model (see table 6-14 and figure 6-4). The average precision for the Bayesian network is 2.74% and 0.5% better than that of the vector space model and Turtle and Croft's model respectively. Our model is also superior to both the vector space model and Turtle and Croft's model in terms of precision at the low recall levels (10%-30%). Table 6-14 shows that our model produces precision at 10% recall that is 2.35% and 5.69% higher than that of Turtle and Croft's model and the vector space model respectively.
[Figure: precision (%) versus recall (%) curves comparing the vector space and Bayesian network models on the MEDLINE collection.]
Figure 6-3 Comparative performance for the MEDLINE collection.
             Precision (%)
Recall (%)   Bayesian Network   Vector Space   Turtle's Network
10           78.85              73.16          76.5
20           69.42              61.69          65.5
30           57.37              52.22          54.4
40           43.61              43.97          48.6
50           36.90              35.54          42.3
60           32.47              28.94          36.1
70           28.70              24.87          25.5
80           20.66              20.07          21.1
90           17.03              16.75          12.7
100          13.14              12.45          9.6
Average      39.70              36.96          39.2
Table 6-14 Experimental results for CACM collection.
In the middle recall range, Turtle and Croft's model shows better performance than our model. However, from the point of view of practical applications, higher precision at both ends of the recall spectrum is more desirable than higher precision in the middle recall range, for the following reasons:
1. High precision in the middle range does not provide a clear cut-off point in situations where the system needs to produce limited-recall output. The recall cut-off point is clearer in a model that concentrates the relevant documents at the top and bottom of the recall range.
2. High precision in the middle range combined with low precision at the high recall levels does not provide the best support for recall-oriented systems. This is because the number of documents that must be inspected in order to find relevant documents at the high recall levels is much greater than at the medium recall levels. Thus, considering the number of documents to be inspected, recall-oriented systems will benefit from high precision at the high recall levels.
The vector space model performs worse at almost every recall level than both Turtle and Croft's model and our model. The exceptions occur only at 90% and 100% recall, where it performs better than Turtle and Croft's model.
The CACM experiments demonstrate the superiority of our model over the vector space model and Turtle and Croft's model. The superiority of our model with respect to the vector space model is particularly clear: the Bayesian network outperforms the vector space model at every recall level. This shows that the addition of knowledge, through the use of the network model adopted by our approach, provides a better information retrieval model than the simple index term matching function adopted by the vector space model. The benefit of adopting a network model is clearly demonstrated by the fact that our model maintains comparatively higher precision at the high recall levels. The adoption of the network model provides a natural classification which allows a document that does not contain the query terms to be retrieved. Such a characteristic cannot be produced by keyword-based retrieval models such as the vector space model.
We claimed in chapter 5 that our model provides greater versatility (Salton's [Salton83] third requirement, introduced earlier in this chapter) than does Turtle and Croft's inference network. We showed in that chapter that our model is not only able to simulate other existing models of information retrieval, but is also able to support both evidence-alteration and dependency-alteration relevance feedback.
[Figure: precision (%) versus recall (%) curves comparing the vector space model, Turtle's inference network and the Bayesian network model on the CACM collection.]
Figure 6-4 Comparative performance for the CACM collection.
In this chapter, using the experimental results on the three test collections, we have also shown that our Bayesian network model is more effective in identifying useful information accurately and quickly (Salton's first requirement). In other words, our model exhibits better precision at most recall levels in the three collections. This ability is achieved by adopting the correct network semantics, in contrast to Turtle and Croft's model. Table 6-15 shows a summary of the performance improvement for the three collections.
             Performance improvement (%)
Collection   Maximum   Minimum   Average
ADI          2.01      0.16      0.62
MEDLINE      7.73      0.28      0.97
CACM         7.61      1.69      2.74
Table 6-15 Summary of performance improvement of the experiments.
Our Bayesian network model also meets the second requirement stated by Salton, namely the ease of rejecting extraneous documents, because our model produces a higher average precision for the three collections. Therefore, we can claim that our model provides a better and more versatile model for information retrieval systems than the two popular existing information retrieval models.
6.6 Summary
The Bayesian network model's experimental performance has been reported in this chapter. The experiments were run on three collections: ADI, MEDLINE and CACM. These three collections vary in size, in the distribution of the index term weights, and in the number of queries and their length.
Different probability estimation methods have also been tested and the results reported in this chapter. In general, the retrieval performance of the network is better when weighting schemes are used for the link weights in both the query and document networks. The best performance is achieved when P(ti|Q=true) in the query network is estimated by qf.idf with cosine normalisation and P(dj|ti=true) in the document network is estimated by tf.idf with maximum normalisation. The default belief of P(dj|ti=true) for the best performance is given by the value 0.5.
As the size of the Bayesian network for information retrieval is large, the introduction of virtual layers becomes necessary as an aid to reducing the network's complexity, in particular by reducing the size of the link matrix. In implementing the virtual layer solution, index terms are grouped and connected to a virtual node in the virtual layer. The link weights from the virtual layer to the document layer require estimation; these link weights have to be able to summarise the weight distribution of the group. We tested two approaches to this estimation task, namely taking the average and the maximum of the weights in the group. We found that the two approaches do not differ much in performance, with the maximum method producing slightly better performance than the average method.
Using the probability estimations that produce the best retrieval performance for the Bayesian network model, we compared its performance with the vector space model and the model of Turtle and Croft. The Bayesian network model shows much better precision than these two models, especially at the two recall extremes. This result supports our hypothesis that the introduction of the network as a knowledge base will increase retrieval performance compared with a purely index term matching method (as adopted by the vector space model). The fact that our network model outperforms Turtle and Croft's network shows that the direction of the inference and the assumptions about the causal relations between propositions affect the performance of retrieval in the network model. This is the case because the direction of the inference dictates the semantics of the model.
We have suggested the adoption of the virtual layer approximation to reduce the computational complexity of Bayesian networks. This approximation method involves the classification of parent nodes into smaller groups, which in turn reduces the size of the link matrix. In the next chapter we will present an evaluation model based on Minimum Message Length [Wallace68] to assess the goodness of a classification of parent nodes in the network. This model provides a useful and effective means of finding the optimum virtual layer model. With its help, we can eliminate extensive retrieval testing of different virtual layer models in order to find the optimum model.
Chapter 7 Measuring the Effectiveness of
Virtual Layer Model
7.1 Introduction
The Bayesian network model for information retrieval is inherently computationally complex and requires some optimisation in order to make the model practically useful. The primary cause of this high computational complexity lies in the size of the link matrices. We discussed several approaches that can be used to reduce the link matrix size in chapter 5, in particular the addition of a virtual layer.
In the virtual layer optimisation approach, the parent nodes are partitioned or classified into a number of groups. Each group is then attached to a virtual node, and this virtual node in turn is attached to a document node. The choice of the clustering method applied to the virtual layer influences the retrieval performance directly, as shown in the experimental results presented in chapter 6. The two clustering methods (namely random and non-random) introduced in section 5.3.3 produce different retrieval performance, with the random clustering method producing the highest average precision.
Different clustering techniques result in different Bayesian network structures. The optimal clustering within the network will lead to the most efficient inference. In this chapter we utilise a method by which we can measure the effectiveness of the classification or clustering, namely Minimum Message Length (MML). The objective of this method is to provide a means of measuring the complexity of modelling the virtual layer in a Bayesian network for information retrieval. We use this method to determine the effectiveness of a classification model for Bayesian network nodes without performing intensive retrieval testing. Section 7.2 presents the background theory of this measurement method, with emphasis given to modelling with real-valued parameters (the weights of the links are real values). Section 7.3 shows how the MML method can be used to measure the effectiveness of classification in our information retrieval application. We present an example of such a calculation, using one of the clustering methods generated by the experiments on the ADI collection, in section 7.4.
7.2 Minimum Message Length
The Minimum Message Length (MML) paradigm was introduced in [Wallace68]. In that paper, it is stated that a classification may be regarded as a method of representing more briefly the information contained in S×D attribute measurements, where S is a set of items and D is a set of attributes. These measurements contain a certain amount of information which, without classification, can be recorded directly as S lists of the D attribute values. If the items are classified, then the measurements can be recorded by listing the following:
1. The class to which each item belongs.
2. The characteristics of each class.
3. The deviation of each item from the characteristics of its parent class.
The best classification is suggested by the briefest recording of all the attribute
information. In MML, the recording of the attribute information is achieved by
regarding the attribute information as a message.
Consider that we have some measurements from the real world and a set of models, M = {m1, m2, …, mn}, that attempt to explain the measurements. MML assesses the effectiveness of each model mi ∈ M by calculating the length of the message needed to explain the measurements. In other words, since this method is based on information theory, it calculates the length of the message required by a receiver to reconstruct the information sent by the sender during communication. The communication between the sender and the receiver in MML comprises two parts, namely:
1. The message that describes the model. The model is usually the probability distribution of the data values.
2. The message that describes the data values. This message can be constructed by using a code dictionary. The code dictionary can easily be constructed from the probability distribution defined in the model part.
The best or optimum model is given by the model mi ∈ M that produces the shortest two-part message. There is a trade-off between the model complexity and the message length of the data values. A complex model requires a long message to describe the model but only a short message to describe the data values. On the other hand, a simple model leads to a short message for the model description but a long message for the data values. MML asserts that the model producing the shortest overall message, for both the model and data value parts, is the optimum model.
Following Shannon’s law, the message length may be assumed to be
proportional to minus the logarithm of the relative frequency of occurrence of
the event which it nominates. More specifically, considering the difference in the
nature of the attributes, the encoding is calculated as follows [Oliver94]:
1. Encoding an event that is equally likely to occur from N possible events
requires a code length of log₂ N bits.
2. Encoding an event which has the probability P requires −log₂ P bits.
3. Encoding a real value y sampled from a probability density f(y) with an
accuracy of measurement ε requires −log₂(ε·f(y)) bits.
Since we are dealing with real-valued parameters in the clustering of the link
weights, we will concentrate on the encoding and calculation of message length
for real-valued parameters in the next section.
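The three coding rules above can be sketched directly in Python (an illustrative sketch; the function names are ours, not part of the thesis):

```python
import math

def len_equally_likely(n):
    # Rule 1: an event drawn uniformly from n possibilities costs log2(n) bits.
    return math.log2(n)

def len_event(p):
    # Rule 2: an event of probability p costs -log2(p) bits.
    return -math.log2(p)

def len_real_value(f_y, eps):
    # Rule 3: a real value with density f(y), measured to accuracy eps,
    # costs -log2(eps * f(y)) bits.
    return -math.log2(eps * f_y)
```

For example, one of 8 equally likely events costs 3 bits, and an event of probability 0.25 costs 2 bits.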
7.2.1 Encoding Real Valued Parameters
Assessing a model of a real valued distribution requires the description of the
distribution using real valued parameters. Real valued parameters cannot be
described to infinite precision in a finite message. Thus constructing the code
dictionary for the parameter values involves some approximations.
One method of approximation involves the construction of a code
dictionary from a density function which is divided into cells whose width is
called the accuracy of the parameter value (AOPV). If the uniform density
function is used, the number of cells is given by (b − a)/AOPV, where the
parameter values lie in the range [a, b]. Thus, to specify the cell to which a
parameter belongs requires ⌈log₂((b − a)/AOPV)⌉ bits [Wallace68].
The message length, as stated, depends on the message length of the model
and of the data. The model and data lengths of the message are directly influenced
by the AOPV. The smaller the AOPV (i.e. the more accurately the parameter value
is specified), the shorter the message length for the data. However, in this case, the model’s
message length will be longer. The optimal message length is achieved when the
shortest combined length of the model and the data message is obtained. The
optimal message length can be approximated by calculating the expected message
length, which is given by the following formula [Oliver94]:

E(MessLen) = log₂(rangeµ/AOPVµ) + log₂(rangeσ/AOPVσ) + N log₂(√(2π)·σ/ε) + (N·s²/(2σ²))·log₂ e    (7.1)
where
µ is the mean used to code the data values xi (i = 1, …, N).
σ is the standard deviation used to code the xi.
s̄ is the unbiased sample standard deviation.
s is the sample standard deviation.
ε is the accuracy of the measurement of the data values xi.
N is the number of data values.
The first two terms in equation 7.1 represent the length of the message to describe
the model and the last two terms represent the length of the message to describe
the data. The optimal AOPVs are given by [Wallace68]:

AOPVµ = σ·√(12/N)    (7.2)

AOPVσ = s̄·√(6/(N − 1))    (7.3)
In MML, the value of the AOPV depends upon the data which the message is
describing. It is worth noting that the two-part message used in MML may seem
incomplete and that a three-part message (AOPV, model, data) is required.
However, Wallace and Freeman [Wallace87] showed that in many cases a three-part
code is not necessary, and hence we will use a two-part message calculation
in identifying the effectiveness of the virtual layer model used to optimise
our Bayesian network.
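As a concrete sketch, equations 7.1–7.3 can be implemented as follows (assuming, per the note above, that σ is set to the unbiased estimate s̄; the data part computed this way can differ by a fraction of a bit from the thesis’s tabulated values, so the figures below should be read as an illustration rather than an exact reproduction):

```python
import math

LOG2E = math.log2(math.e)

def aopv_mu(sigma, n):
    # eq. 7.2: optimal accuracy of parameter value for the mean
    return sigma * math.sqrt(12 / n)

def aopv_sigma(sigma, n):
    # eq. 7.3: optimal accuracy of parameter value for the standard deviation
    return sigma * math.sqrt(6 / (n - 1))

def expected_message_length(xs, eps=0.01, range_mu=1.0, range_sigma=1.0):
    """Two-part expected message length (eq. 7.1) for one cluster, in bits."""
    n = len(xs)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / n   # biased sample variance s^2
    sigma = math.sqrt(s2 * n / (n - 1))         # sigma <- unbiased estimate s-bar
    model = (math.log2(range_mu / aopv_mu(sigma, n))
             + math.log2(range_sigma / aopv_sigma(sigma, n)))
    data = (n * math.log2(math.sqrt(2 * math.pi) * sigma / eps)
            + n * s2 / (2 * sigma ** 2) * LOG2E)
    return model, data, model + data
```

Applied to the GROUP 1 link weights of figure 7-1 (section 7.4), this sketch reproduces the tabulated model length of about 9.54 bits and the AOPV values 0.042309 and 0.031732.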
7.3. Measuring Effectiveness of Virtual Layer
Model with MML
In the classification of index terms for the virtual layer approach to Bayesian
networks optimisation, the following assumptions are made:
1. The index terms are assigned independently to documents in the
collections.
2. The weights assigned to the links between the index term nodes and
document nodes are normally distributed within a document
collection.
3. The index terms cannot be assigned to more than one cluster or group,
i.e. the clusters are disjoint.
We will now consider how the MML method may be used to judge the
effectiveness of the clustering method employed in constructing the Bayesian
network with virtual layers. Some prior knowledge is communicated between the
sender and the receiver at the start of the transfer process, and is therefore not
included as part of the message. In our MML model of index term clustering for
the virtual layer, the prior knowledge consists of the following:
1. The total number of link weights to be classified.
2. The number of attributes per term, which is equal to 1 (i.e. the link
weight values).
3. The nature of the attribute distribution, which is continuous.
4. The range of the mean used to code a link weight xi (rangeµ).
5. The range of the standard deviation to code xi (rangeσ).
6. The accuracy of measurement ε.
7. The total number of groups in the classification.
Using the assumptions and the prior knowledge stated above, we can
calculate the complexity of the index term clusters as follows. Given n clusters
c1, c2, c3, …, cn (with respective link weights xi for a document dj), and given
that all the clusters are disjoint, the total expected message length over the n
clusters is calculated as:

E(MessLen_total) = E(MessLen_c1) + E(MessLen_c2) + … + E(MessLen_cn)    (7.4)
The expected message length for the individual clusters is calculated using
equation 7.1. The accuracies of parameter values (AOPVs) for the mean and the
standard deviation of each cluster are estimated using equations 7.2 and 7.3
respectively. The rangeµ, the rangeσ and the accuracy of measurement ε are
determined through prior knowledge, and are therefore the same for all the
clusters.
The best clustering method generates the shortest E(MessLen); in other
words, given two clustering methods C1 and C2, C1 is a more efficient clustering
method than C2 if E(MessLen_C1) < E(MessLen_C2).
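Sketched in Python, the comparison criterion of equation 7.4 amounts to summing the per-cluster lengths and picking the smaller total (the function names are illustrative, not from the thesis):

```python
def total_message_length(cluster_lengths):
    # eq. 7.4: the clusters are disjoint, so the total expected message
    # length is simply the sum over the individual clusters.
    return sum(cluster_lengths)

def more_efficient(lengths_c1, lengths_c2):
    # C1 is the more efficient clustering iff E(MessLen_C1) < E(MessLen_C2).
    if total_message_length(lengths_c1) < total_message_length(lengths_c2):
        return "C1"
    return "C2"
```

For instance, feeding in per-cluster totals such as those tabulated for doc-1 in section 7.4 sums the random method’s clusters to 389.79 bits and the non-random method’s to 271.71 bits, so the non-random method wins on total length.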
7.4 Illustration of MML Calculation for Index
Term Clusters
Consider the clusters produced by the random classification method (see
section 5.3.3) for a document (doc-1) in the ADI collection, as shown in figure
7-1. For this document, the random clustering method produces 8 clusters, with
each cluster having a different mean and standard deviation. The difference in the
value of the standard deviation will cause a difference in the value of the AOPV
(the accuracy of the parameter value). In turn, this difference in standard
deviation will influence the complexity of the model and the total message length.
GROUP 1 MEAN 0.651148 SD 0.0366408
MEMBER divid approach copy standard exper off conclud overdu draw
WEIGHT 0.634928 0.665769 0.610031 0.634928 0.58992 0.665769 0.696609 0.696609 0.665769
GROUP2 MEAN 0.6602816 SD 0.0897963
MEMBER trad comput inform system notic dissemin use techn control
WEIGHT 0.795458 0.549967 0.584713 0.787264 0.665769 0.610031 0.616889 0.728355 0.604088
GROUP 3 MEAN 0.6997416 SD 0.1075428
MEMBER actual docu produc data compatibl record evaluat receiv
WEIGHT 0.696609 0.563319 0.672094 0.664971 0.795458 0.904785 0.604088 0.696609
GROUP 4 MEAN 0.6422281 SD 0.0292658
MEMBER reversibl advant orient combin microfilm base year statist
WEIGHT 0.696609 0.665769 0.624999 0.647729 0.634928 0.610031 0.647729 0.610031
GROUP 5 MEAN 0.6843268 SD 0.1372901
MEMBER integr mechan hour machin format provid manual card
WEIGHT 0.634928 1 0.665769 0.586049 0.616889 0.750002 0.616889 0.604088
GROUP 6 MEAN 0.6814146 SD 0.0798737
MEMBER total effic libr gap tic output develop simpl
WEIGHT 0.665769 0.647729 0.859674 0.696609 0.696609 0.624999 0.594159 0.665769
GROUP 7 MEAN 0.7004235 SD 0.1341351
MEMBER sophist ibm access tool organ catalog prog discontinu
WEIGHT 0.647729 1 0.750002 0.696609 0.594159 0.647729 0.570551 0.696609
GROUP 8 MEAN 0.7068336 SD 0.1154598
MEMBER operat cent rely retrief featur process circl dsd
WEIGHT 0.733775 0.679837 0.696609 0.546787 0.696609 0.956714 0.647729 0.696609
Figure 7-1 Clusters for doc-1 in ADI collection using the random clustering method.
To calculate the expected message length of GROUP 1, the rangeµ can be
estimated as 1, since the possible values of the link weights in our network lie
between 0 and 1. The rangeσ is also taken as 1, and ε can be estimated as 0.01 as
this is our accuracy of measurement. Using these values and taking N = 9 (the
population of this cluster), we can calculate the AOPVs and E(MessLen) of
GROUP 1 as follows:

AOPVµ = 0.036641 × √(12/9) = 0.042309

AOPVσ = 0.036641 × √(6/(9 − 1)) = 0.031732

E(MessLen_GROUP1) = log₂(1/0.042309) + log₂(1/0.031732) + 9 log₂(√(2π) × 0.036641/0.01) + (9 × 0.034545²/(2 × 0.036641²)) log₂ e = 44.83
Note that the optimal value for σ is the unbiased estimate of the standard deviation
[Wallace68]. Following the above procedure, E(MessLen) for the remaining
clusters can be calculated. The E(MessLen) values for these groups are shown in
table 7-1.
GROUP   Model length   Data length   E(MessLen)
1       9.54           35.29         44.83
2       6.95           46.92         53.88
3       6.25           43.79         50.05
4       10.01          28.77         38.78
5       5.55           46.61         52.16
6       7.11           40.36         47.47
7       5.62           46.34         51.96
8       6.05           44.61         50.66
TOTAL   57.08          332.69        389.79

Table 7-1 Expected message length for doc-1 using the random clustering method.
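The GROUP 1 figures above can be checked with a short script (a verification sketch; the weight values are taken from figure 7-1 and σ is taken as the unbiased sample standard deviation, as the chapter notes):

```python
import math

# Link weights for GROUP 1 of doc-1 (figure 7-1)
weights = [0.634928, 0.665769, 0.610031, 0.634928, 0.58992,
           0.665769, 0.696609, 0.696609, 0.665769]

n = len(weights)
mean = sum(weights) / n
# Unbiased sample standard deviation (the SD reported in figure 7-1)
sbar = math.sqrt(sum((x - mean) ** 2 for x in weights) / (n - 1))
aopv_mu = sbar * math.sqrt(12 / n)           # eq. 7.2
aopv_sigma = sbar * math.sqrt(6 / (n - 1))   # eq. 7.3
model_len = math.log2(1 / aopv_mu) + math.log2(1 / aopv_sigma)

print(round(mean, 6), round(sbar, 7))            # ~0.651148, ~0.0366408
print(round(aopv_mu, 6), round(aopv_sigma, 6))   # ~0.042309, ~0.031732
print(round(model_len, 2))                       # model part ~9.54 bits
```

The recomputed mean, standard deviation, AOPVs, and model length all agree with the GROUP 1 entries of figure 7-1 and table 7-1.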
We now consider the output of another clustering method, shown in figure
7-2, which clusters link weights with similar values (i.e. values such that the
difference between the value and the group mean is less than some threshold
difference).
GROUP 1 MEAN 0.775636 SD 0.023639
MEMBER trad system compatibl provid access
WEIGHT 0.795458 0.787264 0.795458 0.750002 0.750002
GROUP 2 MEAN 0.686579 SD 0.019755
MEMBER discontinu rely featur dsd conclud overdu actual receiv reversibl
WEIGHT 0.696609 0.696609 0.696609 0.696609 0.696609 0.696609 0.696609 0.696609 0.696609
cent produc data advant hour total simpl approach off
0.679837 0.672094 0.664971 0.665769 0.665769 0.665769 0.665769 0.665769 0.665769
gap tic tool draw notic techn operat
0.696609 0.696609 0.696609 0.665769 0.665769 0.728355 0.733775
GROUP 3 MEAN 0.882229 SD 0.031898
MEMBER record libr
WEIGHT 0.904785 0.859674
GROUP 4 MEAN 0.985571 SD 0.024991
MEMBER mechan ibm process
WEIGHT 1 1 0.956714
GROUP 5 MEAN 0.611573 SD 0.028925
MEMBER retrief comput exper inform machin develop organ dissemin use
0.546787 0.549967 0.58992 0.584713 0.586049 0.594159 0.594159 0.610031 0.616889
control evaluat base statist format manual card copy docu
0.604088 0.604088 0.610031 0.610031 0.616889 0.616889 0.604088 0.610031 0.563319
program orient microfilm integr output divid standard effic sophist
0.570551 0.624999 0.634928 0.634928 0.624999 0.634928 0.634928 0.647729 0.647729
catalog circl combin year
0.647729 0.647729 0.647729 0.647729
Figure 7-2 Clusters for doc-1 in ADI collection using the non-random clustering method.
This method is similar to the non-random clustering method (discussed in
section 5.3.3.1) except that it does not further break down clusters whose
population is larger than the allowable number of parents per node (the limit).
We need to adopt this change in order to generalise the example and avoid
complications inherent in comparing non-hierarchical and hierarchical clustering
schemes. The message lengths for the clusters produced by this method are given
in table 7-2.
GROUP   Model   Data     E(MessLen)
1       9.88    16.44    26.32
2       12.85   75.73    88.59
3       7.36    7.44     14.80
4       8.85    10.11    18.96
5       12.07   123.04   123.04
TOTAL   51.01   232.76   271.71

Table 7-2 Expected message length for doc-1 using the non-random clustering method.
We note that the model complexity for individual clusters is higher on
average for the non-random approach than for the random approach. However,
because the data values (link weights) in individual clusters in doc-1 are similar
(by virtue of the clustering method itself), the data part of the total message is
much shorter under this method than under the random clustering method.
Moreover, the cost of having more groups in the random clustering method
overshadows the simplicity of its model. Therefore, the total message length
required to describe the clustering in the random method (389.79 bits) is longer
than that of the non-random method (271.71 bits).
Observation of the behaviour of the expected message length over all the
documents shows that the random clustering method always produces a longer
message than the non-random clustering method. However, it consistently
produces similar message lengths for the individual clusters within a document
(see Appendix C for the full list of expected message lengths of documents in the
ADI collection1).
The experimental results in chapter 6 (see figure 6-10) showed that the
random clustering method performs better in average precision by 0.32% but
performs relatively worse at the low recall levels (10% to 50% recall). Observing
the message lengths produced by the two methods, we can relate the performance
of the methods, in terms of recall and precision, to the nature of the virtual layer
models' message lengths in the following ways:
• A virtual layer model that produces similar expected message lengths
for the individual clusters in a layer will produce a higher average
precision than a virtual layer model that produces distinct expected
message lengths for the individual clusters.
• A virtual layer model that produces a shorter expected message length
will produce higher precision at the low recall levels, but not necessarily
a higher average precision.
Considering the relations stated previously, we can conclude that the
virtual layer model with the shortest expected message length may be optimal in
terms of computational complexity; however, it does not lead to a higher average
precision in an information retrieval context. To produce a high average
precision, the virtual layer model has to produce similar expected message
lengths for the individual clusters. This similarity in expected
1 The list in Appendix C shows that index terms in each document in ADI are classified into a number of groups. Each group produces an almost identical expected message length value under the random clustering method. On the other hand, the groups produce varied expected message length values under the non-random clustering method.
message length for the individual clusters is, in fact, a measure of the "symmetry"
of the model. This symmetry, as we suggested in chapter 5, produces the
optimum performance for the virtual layers created using the random clustering
method.
On the other hand, a virtual layer model that produces a shorter message
length, such as that of the non-random clustering method, will produce higher
precision at the low recall levels. Hence, it still has some benefit when used in
precision-oriented systems.
7.5 Summary
In this chapter, we have used a model based on Minimal Message Length (MML)
to evaluate the effectiveness of the virtual layer model in finding the optimum
Bayesian network. It is important to clarify the meaning of "optimum" according
to the objective of the system. In an information retrieval context, the optimum
model may be considered either as the model that produces the highest average
precision or as the model that produces the most efficient computation. The
MML model suggests that computational complexity is optimised by the
clustering method that produces the shortest expected message length, and that
average precision is optimised by the model that produces similar expected
message lengths for the individual clusters in the virtual layer. The virtual layer
model that produces the shortest expected message length may not be optimal in
terms of average precision but will be optimal in terms of computational
complexity, and vice versa.
We have shown that for the ADI collection, according to this evaluation
method, the random clustering method provides a more optimal clustering than
the non-random method in terms of average precision, because the random
clusters in the virtual layers produce similar expected message lengths for the
individual clusters. The results of the evaluation agree with the experimental
results presented in chapter 6: the random clustering method produces 0.32%
higher average precision than the non-random clustering method. The non-random
clustering method, however, is still beneficial for precision-oriented
systems because it produces higher precision at low recall levels and is optimal in
terms of computational complexity.
Some consideration also has to be given to the preprocessing required to
generate the clusters. The preprocessing required in the non-random clustering
method is more costly in terms of computation than that of the random clustering
method. This is counterbalanced by the fact that the preprocessing only occurs
once, at the time of building the document collection, and thus its high
computational cost will be offset by more efficient inference during the retrieval
process. In terms of recall and precision, the choice of clustering is still based on
the overall objective of the information retrieval system. In this respect, recall-oriented
systems will benefit more from the random clustering method and
precision-oriented systems will benefit more from the non-random clustering
method.
Chapter 8
Conclusion and Future Research
8.1 Conclusion
In this thesis, we have described a new formal information retrieval
model. Unlike other models of information retrieval, the proposed model has a
strong mathematical foundation for handling uncertainty because it is based on
Bayesian networks, a well-known artificial intelligence method for handling
uncertainty.
The proposed model consists of two separate networks, namely query and
document networks. The use of network representations in the model confers the
following benefits:
• It subsumes other existing retrieval models through its capacity to
simulate those models using an appropriate network representation so that
the choice of the specific model can be taken during the implementation
stage.
• It provides methods of representing documents and users' information
needs as complex objects with multilevel representations. This capacity
allows the information retrieval developer to provide multiple
representations of the same documents or users' information needs, which
gives flexibility in implementation.
• It provides a natural model for incorporation of a thesaurus into the
system. The adoption of the network produces a natural grouping of the
index terms.
• It provides implicit inter-document dependency. This dependency will
allow the retrieval of documents that do not contain query terms but share
some common index terms with the documents that contain the query
terms. As a result, this model will produce a higher recall compared with
a model which considers only those documents that share common terms
with the query.
• It provides a common and mathematically sound model for producing
both the initial ranked output and handling relevance feedback. This
situation is not possible with the existing probabilistic models, which use
ad-hoc methods to produce the initial ranking.
• It supports both the evidence and dependency alteration techniques for
relevance feedback in a common model, which, again, is not supported in
the existing probabilistic models.
We have also presented a comparison of performance between our model
and the two well-known information retrieval models, namely those of the vector
space model and Turtle and Croft's inference network (chapter 6). The
experiments were performed on three well-studied collections, namely ADI,
MEDLINE and CACM. The experimental results showed that our model
outperforms both models in terms of average precision. The improvement
achieved by our model varies from 0.62% to 2.74%. The results also showed that
our model produces a higher precision at both ends, low and high, of the recall
level. At low levels of recall, the improvement in precision is in the range of
1.66% to 5.69%. The improvement in precision at the high recall levels is in the
range of 3.54% to 7.61%. This behaviour makes our model a better choice for
supporting both precision-oriented and recall-oriented systems.
With respect to implementation issues of the model, we have proposed
new methods to optimise Bayesian networks using alternative approximated
networks. The approximated network can be created using virtual layers. The
introduction of the virtual layers in the network reduces the size of the link matrix
which in turn reduces the computational complexity in the network (chapter 5).
The convergence problem of an exact inference algorithm involving indirect
loops is solved by modifying the independence assumption of the algorithm
(chapter 5). This modified independence assumption is in accordance with the
human reasoning process and does not require massive preprocessing as do
current techniques.
We have also presented a model that evaluates the effectiveness of the
approximated networks using the Minimal Message Length principle. This
evaluation model enables us to choose the optimal approximated network without
performing extensive retrieval testing.
8.2 Future Work
The approach taken in this thesis suggests several further areas of research. These
areas include: adoption of phrases and thesauri, fusion of the retrieval output,
clustering methods for index terms, and the development of evaluation models
for Bayesian networks in general.
8.2.1 Phrases and Thesaurus
The utility of a thesaurus in increasing recall in information retrieval has been
proven [Salton71, Croft88]. We have described means of incorporating thesauri
in the proposed model in chapter 4. Traditionally, this thesaurus is generated by
looking at the similarity between index terms. Two index terms are considered
similar when they have similar weights [Salton83]. In Bayesian networks, the
graph represents explicitly the connectivity between index terms and documents.
Thus, the thesaurus may be created based not just on index term weight similarity
but also on the index terms shared between documents. When two or more index
terms co-occur in some documents, we can assume that these index terms
represent a higher-level concept or that they are part of a phrase. Hence, an
automatic thesaurus and phrase finder that can exploit this characteristic of
Bayesian networks is worthy of further investigation.
8.2.2 Retrieval Fusion
The proposed model provides the flexibility to represent a single information
need using multiple representations. The results of the retrieval of this
information need may vary for different representation networks [Turtle90]. So
far in this thesis, we have only performed retrieval using one query network
representation. A further investigation into the effect of the use of multiple query
representations could be useful. The main issue in performing this task is in
finding the optimal model to merge ranked outputs produced by the multiple
query network representations.
8.2.3 Index Term Clustering
In chapter 5, we have described two simple methods of index term clustering.
The clustering is introduced to the network in order to reduce the size of the link
matrix. Further investigation is required to find the optimal clustering using some
traditional clustering methods such as k-mean and new methods such as neural
network classification.
8.2.4 Evaluation Model for Bayesian Networks
In this thesis, we provide an evaluation model that measures the effectiveness of
the approximated network by evaluating the effectiveness of the clusters of
parent nodes of a given node in a Bayesian network. In this sense, we compare
two Bayesian networks locally, within a given node and its parents, disregarding
the global structure of the network. For example, the model does not take into
consideration the effect of the clustering in doc-1 on the clustering in doc-2. A
further evaluation model that can measure globally the effect of approximation in
a Bayesian network could be investigated. With such a model, we expect to be
able to take any two arbitrary approximated Bayesian networks and choose the
optimal one.
References
[Allen87] Allen, J. Natural Language Understanding.
Benjamin/Cummings, 1987.
[Amsler89] Amsler, R.A. Research Toward the Development of Lexical
Knowledge Base for Natural Language Processing. In the
Proceedings of the 12th Annual International Conference on
Research and Development in Information Retrieval, Belkin,
N.J. and van Rijsbergen, C. (eds), pp 242-249, ACM, New
York, 1989.
[Bhatnagar86] Bhatnagar, R.K. and Kanal, L.N. Handling Uncertain
Information: a Review of Numeric and Non-Numeric Methods.
In Uncertainty in Artificial Intelligence, Kanal, L.N. and
Lemmer, J.F. (eds), pp 3-26, North Holland, Amsterdam, 1986.
[Belew89] Belew, R.K., Adaptive Information Retrieval: Using a
Connectionist Representation to Retrieve and Learn about
Documents. In the Proceedings of the 12th Annual International
Conference on Research and Development in Information
Retrieval, Belkin, N.J. and van Rijsbergen, C. (eds), pp 11-20,
ACM, New York, 1989.
[Berzuini89] Berzuini, C., Bellazzi, R. and Quaglini, S. Temporal Reasoning
with Probabilities. In the Proceedings of Fifth Workshop on
Uncertainty and AI, Henrion, M. (ed.), pp 14-21, Windsor,
Ontario, 1989.
[Boguraev87] Boguraev, B., Briscoe, T., Carroll, J., Carter, D. and Grover, C.
The Derivation of a Grammatically Indexed Lexicon from the
Longman Dictionary of Contemporary English. In the
Proceedings of 25th Annual Meeting of the ACL, pp 193-200,
Stanford University, Stanford, CA, 1987.
[Brachman88] Brachman, R.J. and McGuinness, D.L. Knowledge
Representation, Connectionism, and Conceptual Retrieval. In
the Proceedings of the 11th International Conference on
Research and Development in Information Retrieval, pp 161-
174, ACM, New York, 1988.
[Brent91] Brent, M.R. From Grammar to Lexicon: Unsupervised Learning
of Lexical Syntax. Computational Linguistics, 19(2):243-262,
1991.
[Bundy85] Bundy, A. Incidence Calculus: A Mechanism for Probabilistic
Reasoning. Journal of Automated Reasoning, 1:263-283, 1985.
[Carmody66] Carmody, B.T. and Jones Jr, P.E. Automatic derivation of
microsentences. Communications of the ACM, June:435-445,
1966.
[Chang91] Chang, K.C. and Fung, R. Refinement and Coarsening of
Bayesian Networks. In Uncertainty in Artificial Intelligence 6,
Kanal, L.N. and Lemmer, J.F. (eds), pp 435-446, North
Holland, Amsterdam, 1991.
[Chavez90] Chavez, R.M. and Cooper, G.F. An Empirical Evaluation of a
Randomized Algorithm for Probabilistic Inference. In
Uncertainty in Artificial Intelligence 5, Henrion, M. et.al (eds),
pp 191-208, Elsevier Science, 1990.
[Cheeseman85] Cheeseman, P. In Defense of Probability. In the Proceedings of
the 9th International Joint Conference on Artificial Intelligence,
pp 1002-1009, 1985.
[Cheeseman91] Cheeseman, P. Probabilistic vs Fuzzy Reasoning. In
Uncertainty in Artificial Intelligence 6, Kanal, L.N. and
Lemmer, J.F. (eds), pp 85-102, North Holland, Amsterdam,
1991.
[Chevallet96] Chevallet, J.P. and Chiaramella, Y. Our Experience in Logical
IR Modeling. In the Proceedings of Glasgow Workshop on
LOGIC, University of Glasgow, U.K, 1996.
[Chin89] Chin, H.L. and Cooper, G.F. Bayesian Belief Network
Inference Using Simulation. In Uncertainty in Artificial
Intelligence 3, pp 129-148, North Holland, Amsterdam, 1989.
[Cohen87] Cohen, P.R. and Kjeldsen, R. Information Retrieval by
Constrained Spreading Activation in Semantic Networks.
Information Processing and Management, 23(2):255-268, 1987.
[Coombs90] Coombs, J.H. Hypertext, Full Text, and Automatic Linking. In
Proceedings of the 13th International Conference on Research
and Development in Information Retrieval, pp 83-98, 1990.
[Cooper71] Cooper, W.S. A Definition of Relevance for Information
Retrieval. Information Storage and Retrieval, 7:19-37, 1971.
[Cooper78] Cooper, W.S. and Maron, M.E. Foundations of Probabilistic and
Utility-Theoretic Indexing. Journal of the ACM, 25(1):67-80,
1978
[Cooper84] Cooper, G.F. NESTOR: A Computer Based Medical Diagnosis
Aid that Integrates Causal and Probabilistic Knowledge. Ph.D
Thesis, Computer Science Department, Stanford University,
1984.
[Cooper90] Cooper, G.F. The Computational Complexity of Probabilistic
Inference Using Bayesian Belief Network. Artificial
Intelligence, 42:393-405, 1990.
[Crestani94] Crestani, F. and van Rijsbergen, C.J. Information Retrieval by
Imaging. In the Proceedings of 16th British Computer Science
Colloquium, Drymen, Scotland, 1994.
[Crestani95] Crestani, F., Ruthven, I., Sanderson, M. and van Rijsbergen,
C.J. The Troubles with Using a Logical Model of IR on Large
Collection of Documents. In the Proceedings of 4th Text
Retrieval Conference, pp 509-526, NIST 500-236, National
Institute of Standard and Technology, US, 1995.
[Croft79] Croft, W.B. and Harper, D.J. Using Probabilistic Models of
Document Retrieval without Relevance Information. Journal of
Documentation, 35(3):285-295, 1979.
[Croft80] Croft, W.B. A Model of Cluster Searching Based on
Classification. Information Systems, 5:189-195, 1980.
[Croft84] Croft, W.B. and Thompson, R.H. The Use of Adaptive
Mechanism for Selection of Search Strategies in Document
Retrieval Systems. In the Proceedings of the ACM/BCS
International Conference on Research and Development in
Information Retrieval, pp 95-110, 1984.
[Croft85] Croft, W.B. and Parenty, T.J. A Comparison of a Network
Structure and a Database System used for Document Retrieval.
Information Systems, 10(4):377-390, 1985.
[Croft86] Croft, W.B. Boolean Queries and Term Dependencies in
Probabilistic Retrieval Models. Journal of the American Society
for Information Science, 37(2):71-77, 1986.
[Croft87a] Croft, W.B. Approaches to Intelligent Information Retrieval.
Information Processing and Management, 23(4):249-254, 1987.
[Croft87b] Croft, W.B. and Thompson, R.H. I3R: A New Approach to the
Design of Document Retrieval Systems. Journal of the
American Society for Information Science, 38(6):389-404,
1987.
[Croft88] Croft, W.B. and Savino, P. Implementing Ranking Strategies
Using Text Signatures. ACM Transactions on Office
Information Systems, 6(1):42-62, 1988.
[Croft89a] Croft, W.B. and Turtle, H. A Retrieval Model Incorporating
Hypertext Links. In the Proceedings of Hypertext’89, pp 213-
224, 1989.
[Croft89b] Croft, W.B., Lucia, T.J. and Willet, P. Retrieving Documents
by Plausible Inference: an Experimental Study. Information
Processing and Management, 25(6):599-614, 1989.
[Dechter85] Dechter, R. and Pearl, J. The Anatomy of Easy Problems: A
Constraint-Satisfaction Formulation. In the Proceedings of the
8th International Joint Conference on AI, pp 1066-1072, 1985
[Freimuth89] Freimuth, M.E., Stein, J.A. and Kear, T.J. Searching for Health
Information, University of Pennsylvania Press, Philadelphia,
1989
[Frisse89] Frisse, M.E. and Cousin, S.B. Information Retrieval from
Hypertext: Update on the Dynamic Medical Handbook Project.
In the Proceedings of Hypertext’89, pp199-212, 1989.
[Fryback78] Fryback, D.G. Bayes’ Theorem and Conditional
Nonindependence of Data in Medical Diagnosis, Computers
and Biomedical Research, 11:423-434, 1978.
[Fuhr86] Fuhr, N. Two Models of Retrieval with Probabilistic Indexing.
In the Proceedings of the 9th Annual Conference on Research
and Development in Information Retrieval, Rabitti, F (ed), pp
249-257, ACM Press, New York, 1986.
[Fuhr89] Fuhr, N. Models for Retrieval with Probabilistic Indexing.
Information Processing and Management, 25(1):55-72, 1989.
[Fuhr90] Fuhr, N. A Probabilistic Framework for Vague Queries and
Imprecise Information in Database. In the Proceedings of the
16th International Conference on Very Large Databases,
McLeod, D., Sacks-Davis, R. and Schek, H. (eds), pp 696-707,
Morgan Kaufmann, Los Altos, CA, 1990
[Fuhr92] Fuhr, N. Probabilistic Models in Information Retrieval. The
Computer Journal, 35(3):243-255, 1992.
[Fung90a] Fung, R.M., Crawford, S.L., Applebaum, L.A. and Tong, R.M.
An Architecture for Probabilistic Concept-Based Information
Retrieval. In the Proceedings of the 13th International
Conference on Research and Development in Information
Retrieval, Vidick, J.-L. (ed), pp 455-467, 1990.
[Fung90b] Fung, R. and Chang, K.C. Weighting and Integrating Evidence
for Stochastic Simulation in Bayesian Networks. In Uncertainty
in Artificial Intelligence 5, Henrion, M. et al. (eds), pp 209-219,
North Holland, Amsterdam, 1990.
[Ghazfan94] Ghazfan, D., Indrawan, M., Srinivasan, B. and Korb, K. A
Bayesian Model for Information Retrieval. In the Proceedings
of the 5th Australian Conference on Information Systems, Arnott, D.
and Shank, G. (eds), pp 259-272, Department of Information
Systems, Monash University, Australia, 1994.
[Ghazfan95] Ghazfan, D., Indrawan, M., and Srinivasan, B. A Semantically
Correct Bayesian Network based Information Retrieval. In the
Proceedings of the 5th Hellenic Conference on Informatics, pp
639-648, 1995.
[Ghazfan96] Ghazfan, D., Indrawan, M. and Srinivasan, B. Towards
Meaningful Bayesian Network for Information Retrieval
Systems. In the Proceedings of the 6th International Conference
in Information Processing and Management of Uncertainty in
Knowledge-Based Systems (IPMU), pp 841-846, Spain, 1996.
[Gordon85] Gordon, J. and Shortliffe, E.H. A Method of Managing
Evidential Reasoning in Hierarchical Hypothesis Space.
Artificial Intelligence, 26:323-357, 1985.
[Hansen95] Hansen, J.H.L. and Bou-Ghazale, S. Duration and Spectral
Based Stress Token Generation for Keyword Recognition
Using Hidden Markov Models. IEEE Transactions on Speech
and Audio Processing, 3(5):415-421, 1995.
[Harman92] Harman, D. Relevance Feedback Revisited. In the Proceedings
of the 15th Annual International SIGIR, pp 1-10, ACM Press,
Denmark, 1992.
[Harper78] Harper, D. and van Rijsbergen, C.J. An Evaluation of Feedback
in Document Retrieval Using Co-occurrence Data, Journal of
Documentation, 34(3):189-216, 1978.
[Heckerman85] Heckerman, D.E., Horvitz, E.J. and Nathwani, B.N. Pathfinder
Research Directions, Technical Report KSL-89-64, Knowledge
Systems Laboratory, Stanford University, Stanford, California,
1985.
[Henrion86] Henrion, M. Propagating Uncertainty in Bayesian Networks by
Probabilistic Logic Sampling. In Uncertainty in Artificial
Intelligence 2, pp 149-163, 1986.
[Henrion90] Henrion, M. Towards Efficient Inference in Multiply
Connected Belief Networks. In Influence Diagrams, Belief Nets
and Decision Analysis, Oliver, R.M. and Smith, J.Q. (eds.), pp
385-407, Wiley, Chichester, 1990.
[Hulme95] Hulme, M. Improved Sampling for Diagnostic Reasoning in
Bayesian Networks, In Uncertainty in Artificial Intelligence 95,
Besnard, P. and Hanks, S. (eds), pp 315-322, Morgan
Kaufmann, San Francisco, US, 1995.
[Indrawan96] Indrawan, M., Ghazfan, D. and Srinivasan, B. Bayesian
Network as a Retrieval Engine. In the Proceedings of the 5th
Text Retrieval Conference, pp 437-444, NIST 500-238,
National Institute of Standards and Technology, US, 1996.
[Indrawan98] Indrawan, M., Srinivasan, B., Ghazfan, D. and Wilson, C.
Handling Large Bayesian Networks: a Case Study of
Information Retrieval Systems. 1998 IEEE Conference on
Systems, Man and Cybernetics, USA, (submitted).
[Jones87] Jones, W.P. and Furnas, G.W. Pictures of Relevance – a
Geometric Analysis of Similarity Measures. Journal of the
American Society for Information Science, 38(6):420-442,
1987.
[Kupieck92] Kupiec, J. Robust Part-of-Speech Tagging using a Hidden
Markov Model. Computer Speech and Language, 6:225-242,
1992.
[Kwok89] Kwok, K.L. A Neural Network for Probabilistic Information
Retrieval. In the Proceedings of the 12th International
Conference on Research and Development in Information
Retrieval, Belkin, N.J. and van Rijsbergen, C.J. (eds), pp 21-30,
ACM, New York, 1989.
[Kwok90] Kwok, K.L. A Network Approach to Probabilistic Information
Retrieval. ACM Transactions on Information Systems,
13(3):324-353, 1995.
[Lancaster69] Lancaster, F.W. MEDLARS: Report on the Evaluation of Its
Operating Efficiency. American Documentation, 20(2), 1969.
[Lauritzen88] Lauritzen, S.L. and Spiegelhalter, D.J. Local Computations
with Probabilities on Graphical Structures and Their
Application to Expert Systems. Journal of the Royal Statistical
Society, Series B, 50:157-224, 1988.
[Lewis90] Lewis, D.D. Representation, Learning, and Language in
Information Retrieval. PhD Thesis, University of Massachusetts,
1990.
[Lewis96] Lewis, D.D. and Sparck-Jones, K. Natural Language Processing
for Information Retrieval. Communications of the ACM,
39(1):92-101, 1996.
[Losee88] Losee, R.M. and Bookstein, A. Integrating Boolean Queries in
Conjunctive Normal Form with Probabilistic Retrieval Models.
Information Processing and Management, 24(3):315-321, 1988.
[Luhn58] Luhn, H.P. The Automatic Creation of Literature Abstracts.
IBM Journal of Research and Development, 2(2):159-165,
1958.
[Maron60] Maron, M.E. and Kuhns, J.L. On Relevance, Probabilistic
Indexing and Information Retrieval. Journal of the ACM,
7:216-244, 1960.
[McDermott85] McDermott, D. and Doyle, J. Non-Monotonic Logic. Artificial
Intelligence, 25:41-72, 1985.
[Mel’cuk89] Mel’cuk, I. Semantic Primitives from the Viewpoint of the
Meaning Text Linguistic Theory, Quaderni di Semantica,
10:27-62, 1989.
[Milstead89] Milstead, J.L. Subject Access Systems. Academic Press,
Orlando, 1989.
[Neapolitan90] Neapolitan, R.E. Probabilistic Reasoning in Expert Systems:
Theory and Algorithms. John Wiley & Sons, US, 1990.
[Oddy77] Oddy, R.N. Information Retrieval through Man-machine
Dialogue. Journal of Documentation, 33:1-14, 1977.
[Oliver94] Oliver, J.J. and Hand, D.J. Introduction to Minimum Encoding
Inference. Technical Report TR94-205, Computer Science
Department, Monash University, Australia, 1994.
[Olmsted83] Olmsted, S.M. On Representing and Solving Decision
Problems. PhD Thesis, Engineering-Economic Systems
Department, Stanford University, Stanford, California, 1983.
[Pearl87] Pearl, J. Evidential Reasoning Using Stochastic Simulation of
Causal Models. Artificial Intelligence, 32:245-257, 1987.
[Pearl88] Pearl, J. Probabilistic Reasoning in Intelligent Systems:
Networks of Plausible Inference. Morgan Kaufmann, 1988.
[Peng86] Peng, Y. and Reggia, J.A. A Probabilistic Causal Model for
Diagnostic Problem Solving, Parts I and II. IEEE Transactions on
Systems, Man and Cybernetics, 17:140-145, 1986.
[Provan95] Provan, G. Abstraction in Belief Networks: The Role of
Intermediate States in Diagnostic Reasoning. In Uncertainty in
Artificial Intelligence 95, Besnard, P. and Hanks, S. (eds), pp
464-471, Morgan Kaufmann, San Francisco, US, 1995.
[Rajashekar95] Rajashekar, T.B. and Croft, W.B. Combining Automatic and
Manual Index Representations. Journal of the American Society
for Information Science, 46(4):272-283, 1995.
[Reiter87] Reiter, R. Nonmonotonic Reasoning. Annual Review of
Computer Science, 2:147-186, 1987.
[Rijsbergen79] Van Rijsbergen, C.J. Information Retrieval. Butterworths,
1979.
[Rijsbergen86] Van Rijsbergen, C.J. A Non-Classical Logic for Information
Retrieval. Computer Journal, 29:481-485, 1986.
[Rijsbergen89] Van Rijsbergen, C.J. Towards an Information Logic. In the
Proceedings of 12th Annual International ACM SIGIR
Conference on Research and Development in Information
Retrieval, Belkin, N.J. and van Rijsbergen, C.J. (eds), pp 77-86,
New York, 1989.
[Rijsbergen92] Van Rijsbergen, C.J. Probabilistic Retrieval Revisited. Computer
Journal, 35(3):291-298, 1992.
[Robertson76] Robertson, S.E. and Sparck-Jones, K. Relevance Weighting of
Search Terms. Journal of the American Society for Information
Science, 27:129-146, 1976.
[Robertson77] Robertson, S.E. The Probability Ranking Principle in IR.
Journal of Documentation, 33(4):294-304, 1977.
[Robertson82] Robertson, S.E., Maron, M.E. and Cooper, W.S. Probability of
Relevance: A Unification of Two Competing Models for
Document Retrieval. Information Technology: Research and
Development, 1(1):1-21, 1982.
[Salton68] Salton, G. Automatic Information Organization and Retrieval.
McGraw-Hill, 1968.
[Salton71] Salton, G (ed). The SMART Retrieval System – Experiments in
Automatic Document Processing. Prentice-Hall, Inc.,
Englewood Cliffs, New Jersey, 1971.
[Salton83] Salton, G. and McGill M.J. Introduction to Modern Information
Retrieval, McGraw-Hill, 1983.
[Salton88] Salton, G. A Simple Blueprint for Automatic Boolean Query
Processing. Information Processing and Management,
24(3):269-280, 1988.
[Schank77] Schank, R.C. and Abelson, R.P. Scripts, Plans, Goals, and
Understanding. Lawrence Erlbaum Press, 1977.
[Shachter86] Shachter, R.D. Intelligent Probabilistic Inference. In
Uncertainty in Artificial Intelligence, Kanal, L. and Lemmer, J.
(eds), pp 371-382, North Holland, 1986.
[Shachter88] Shachter, R.D. Probabilistic Inference and Influence Diagrams.
Operations Research, 36:871-882, 1988.
[Shachter90] Shachter, R.D. and Peot, M.A. Simulation Approaches to
General Probabilistic Inference on Belief Networks. In
Uncertainty in Artificial Intelligence 5, Kanal, L. and Lemmer, J.
(eds), pp 221-231, North Holland, 1990.
[Shafter76] Shafer, G. A Mathematical Theory of Evidence. Princeton
University Press, 1976.
[Shortlife75] Shortliffe, E.H. and Buchanan, B.G. A Model of Inexact
Reasoning in Medicine. Mathematical Biosciences, 23:351-376,
1975.
[Shoval85] Shoval, P. Principles, Procedures and Rules in Expert System
for Information Retrieval. Information Processing and
Management, 21(6):475-487, 1985.
[Shwe90] Shwe, M. and Cooper, G. An Empirical Analysis of Likelihood-
Weighting Simulation on Large, Multiply Connected Belief
Network. Technical Report KSL-90-23, Knowledge Systems
Laboratory, Stanford University, Stanford, CA, 1990.
[Sparck-Jones71] Sparck-Jones, K. Automatic Keyword Classification for
Information Retrieval. Archon Books, 1971.
[Sparck-Jones72] Sparck-Jones, K. A Statistical Interpretation of Term
Specificity and Its Application in Retrieval. Journal of
Documentation, 28(1):11-20, 1972.
[Sparck-Jones74] Sparck-Jones, K. Automatic Indexing. Journal of
Documentation, 30(4):393-432, 1974.
[Sparck-Jones79] Sparck-Jones, K. Search Term Relevance Weighting Given
Little Relevance Information. Journal of Documentation,
35:30-48, 1979.
[Stillman91] Stillman, J. On Heuristics for Finding Loop Cutsets in Multiply
Connected Belief Networks. In Uncertainty in Artificial
Intelligence 6, Bonissone, P.P., Henrion, M., Kanal, L.N. and
Lemmer, J.F. (eds), pp 233-243, North Holland, 1991.
[Strzalkowski93] Strzalkowski, T. Natural Language Processing in Large-Scale
Text Retrieval Tasks. In the Proceedings of the 1st Text
Retrieval Conference (TREC-1), pp 39-54, NIST 500-207,
National Institute of Standards and Technology, US, 1993.
[Tong83] Tong, R.M., Shapiro, D., McCune, B.P. and Dean, J.S. A Rule-
Based Approach to Information Retrieval: Some Results and
Comments. In the Proceedings of the National Conference on
Artificial Intelligence, pp 411-415, 1983.
[Tong85] Tong, R.M. and Shapiro, D. Experimental Investigations of
Uncertainty in a Rule-Based System for Information Retrieval.
International Journal of Man-Machine Studies, 22:265-282,
1985.
[Tong86] Tong, R.M., Applebaum, L.A., Askman, V.N. and
Cunningham, J.F. RUBRIC III: An Object Oriented Expert
System for Information Retrieval. In the Proceedings of the 2nd
Expert Systems in Government Symposium, Karna, K.L.,
Parsaye, K. and Silverman, B.G. (eds), pp 106-115, 1986.
[Turtle90] Turtle, H. Inference Networks for Document Retrieval. Ph.D.
Thesis, University of Massachusetts, October, 1990.
[Turtle91] Turtle, H. and Croft, W.B. Evaluation of an Inference Network-
Based Retrieval Model. ACM Transactions on Information
Systems, 9:187-222, 1991.
[Voorhees96] Voorhees, E. and Harman, D. Overview of the 5th Text
Retrieval Conference. In the Proceedings of the 5th Text
Retrieval Conference, pp 1-28, NIST 500-238, National
Institute of Standards and Technology, US, 1996.
[Wallace68] Wallace, C.S. and Boulton, D.M. An Information Measure for
Classification. Computer Journal, 11(2):185-194, 1968.
[Wallis95] Wallis, P. Semantic Signatures for Information Retrieval. PhD
Thesis RT-6, Computer Science Department, Royal Melbourne
Institute of Technology, 1995.
[Willet88] Willet, P. Recent Trends in Hierarchic Document Clustering: a
Critical Review. Information Processing and Management,
24(5):577-598, 1988.
[Wilson73] Wilson, P. Situational Relevance. Information Storage and
Retrieval, 9:457-471, 1973.
[Yu88] Yu, C. and Mizuno, H. Two Learning Schemes in Information
Retrieval. In the Proceedings of the 11th International Conference
on Research and Development in Information Retrieval,
Chiaramella, Y. (ed), pp 201-218, Presses Universitaires de
Grenoble, Grenoble, France, 1988.
[Zadeh78] Zadeh, L.A. Fuzzy Sets as a Basis for a Theory of Possibility.
Fuzzy Sets and Systems, 1:3-28, 1978.
[Zadeh86] Zadeh, L.A. Is Probability Theory Sufficient for Dealing with
Uncertainty in AI: a Negative View. In Uncertainty in Artificial
Intelligence, Kanal, L.N. and Lemmer, J.F. (eds), pp 103-116,
North Holland, Amsterdam, 1986.