
PROBABILISTIC RETRIEVAL INCORPORATING THE

RELATIONSHIPS OF DESCRIPTORS INCREMENTALLY

WON YONG KIM,* MYOUNG HO KIM¹ and YOON JOON LEE²

Department of Computer Science, Korea Advanced Institute of Science and Technology, 373-1, Kusong-Dong, Yusong-Gu, Taejon, 305-701, South Korea

(Received 1 July 1997; accepted 1 January 1998)

Abstract: The previous probabilistic retrieval models assume that the relevance probability of a document is independent of the descriptors that are not specified in a query. This is not true in practice because there can be many descriptors that represent the same concept. The probabilistic retrieval model developed in this paper overcomes this unsuitable assumption and incorporates the relationships of descriptors. A learning method is also proposed to figure out the relationships incrementally. Each time retrieval results are available, the method identifies in the relevant documents the descriptors that designate the concepts specified by the query descriptors. Although it employs user feedback like relevance feedback, it attempts to capture certain stable relationships of descriptors from many past user queries rather than to distinguish relevant documents from non-relevant ones for a particular query. We show through experiments that the proposed information retrieval method improves retrieval effectiveness over time. © 1998 Elsevier Science Ltd. All rights reserved.

1. INTRODUCTION

The relevance of a document depends on many variables related to the document (e.g. its

scope, how it is written) as well as numerous user characteristics (e.g. why the search

was initiated, user's previous knowledge). Since many such factors affect the judgment

in a complex way, it is generally considered that an IR system cannot precisely select

only and all relevant documents. In order to cope with this environment, most IR

systems present to a user a sequence of documents which are ranked in decreasing order

of estimated relevancy (Robertson and Jones, 1976; Salton, 1989; Lee et al., 1993). In

general, the contents of documents and the information need of a user are identified by

sets of concepts, and the concepts are encoded into descriptors such as single words,

phrases, etc. The relevance probability of a document is expressed as the conditional

probability of relevance given the descriptors attached to the documents in a collection.

If the conditional probability distribution functions are maintained for all possible

queries, we could obtain the correct relevance probabilities of documents for any given

query. The estimation of the probabilities in most IR systems, however, is not complete

because those are not available for reasons of storage space limitation, computational

efficiency, and the lack of the required information in many cases (Croft and Harper,

1979; Fuhr and Huther, 1989; Losee, 1988; Robertson and Walker, 1994; Wu and

Salton, 1981).


*Corresponding author. Fax: +82-42-869-3510; e-mail: [email protected].

¹E-mail: [email protected]. ²E-mail: [email protected].


Two major clues for the estimation are the probability distribution of each descriptor

and the query submitted by a user. The previous probabilistic retrieval models assume

that only the descriptors in the query affect the probabilities of relevance (Fuhr, 1989;

Robertson and Jones, 1976). If there were only one corresponding descriptor for

each concept, the assumption would make sense. However, there are a variety of

descriptors for a given concept, which is known as the vocabulary gap problem. For

example, given a query descriptor ``information retrieval'', the relevance probability

would be conditionally dependent on the descriptor and its analogues such as

``document retrieval'', ``text retrieval'', and so on. Therefore, the probability estimation

based on the descriptors in a query can be correct only if the authors who write the

documents and the user who issues the query choose the same descriptors for the query

concepts; otherwise, incorrect estimation is unavoidable. This is one of the

major reasons why an IR system without knowledge provides poor retrieval results.

It has been recognized that human knowledge is crucial for solving the vocabulary

gap problem. The human knowledge can be either obtained from predefined

terminology thesauri, or automatically learned from training samples given by humans.

A well-observed problem of the general purpose thesauri is that they do not have

sufficient vocabulary coverage across different applications (Salton and Buckley,

1988). Furthermore, subjective decisions in thesaurus development often do not reflect

the context sensitivity of descriptor meaning and the application dependent relationship

between descriptors. The automatic learning approach from training samples could

alleviate the problems of the general purpose thesauri (Park and Choi, 1996). However,

it is difficult to obtain training samples which completely cover the vocabulary in an

application.

In this paper, we propose an alternative method, called Learning from Relevant

Documents (LRD), to construct knowledge. LRD uses relevance feedback from the

users of an IR system to learn analogues of descriptors. We assume that a relevant

document has all the concepts represented by a query though the relevance is judged in

a complex way. This implies that if a descriptor is in a query but is not in a relevant

document, at least one of the analogues of the descriptor is in the document. LRD

attempts to identify analogues from the descriptors for the document. LRD takes an

average among the results from many users to avoid being biased in favor of some

persons' opinions. Note that although the idea of employing user feedback resembles

the notion of traditional relevance feedback (Haines and Croft, 1993; Harman, 1988,

1992; Smeaton and van Rijsbergen, 1983; Spink, 1994), quite well known in the IR

literature, the objective here is not to learn how to distinguish relevant documents from

non-relevant ones for a particular query. Instead, LRD attempts to capture certain

stable relationships of descriptors from many past user queries.

We also extend the previous probabilistic retrieval models since they do not take

knowledge into account (Fuhr, 1989; Robertson and Jones, 1976). The extension is

described with Bayesian inference networks (Pearl, 1988) because they provide a

plausible inference mechanism for reasoning under conditions of uncertainty. We

evaluate the extended model with the knowledge constructed by LRD. The extended

model is implemented on the SMART system (Salton, 1971). The experimental results

show that the proposed method works well and improves over time as we expect.

The remainder of the paper is organized as follows. In Section 2, we map the

previous probabilistic retrieval models into a Bayesian inference network, and then

describe the necessity of knowledge to alleviate the vocabulary gap problem. We extend

the retrieval models to incorporate knowledge in a retrieval session, and then present

how to estimate the relevance probability of a document in Section 3. An incremental

knowledge construction method is proposed in Section 4. The proposed knowledge

construction method and the extended retrieval model are experimentally evaluated in

Section 5. Finally, the conclusions are given in Section 6.


2. MAPPING THE PREVIOUS PROBABILISTIC RETRIEVAL MODELS INTO A BAYESIAN INFERENCE NETWORK

A Bayesian inference network is a directed, acyclic dependency graph (DAG) in which nodes represent propositional variables or constants and edges represent dependence relations between propositions (Pearl, 1988). If a proposition represented by a node p causes or implies the proposition represented by node q, we draw a directed edge from p to q. The node q contains a matrix that specifies P(q|p) for all possible values of the two variables. When a node has multiple parents, the matrix specifies the dependence of the node on the set of parents and characterizes the dependence relationships between that node and all nodes representing its potential causes. Given a set of prior probabilities for the roots of the DAG, these networks can be used to compute the probability or degree of belief associated with all remaining nodes.

An inference network for the previous probabilistic retrieval models is shown in Fig. 1 (Fuhr, 1989; Robertson and Jones, 1976; Turtle and Croft, 1990). It consists of a document representation component and a query representation component. All nodes in the inference network have a value, either 1 or 0. The document representation component is built for a collection of documents. It is composed of a set of document nodes D, a set of descriptor nodes T, and dependence relations between document nodes and descriptor nodes. In the figure, Di ∈ D, i = 1, ..., |D| and Tj ∈ T, j = 1, ..., |T|. Each document node represents an actual document in the collection. The elements of T are the descriptors for the documents in the collection, and those in queries. The unit of a descriptor can be a single word, a phrase, etc., depending on the indexing strategy of the IR system. We draw a directed arc to a descriptor node from a document node to which the descriptor has been assigned by an indexing system. The query representation component consists of the set of all the descriptor nodes T, a single node RQ which represents a user's information need, and dependence relations between descriptor nodes and RQ. It is built for each retrieval request and can be modified through interactive query formulation or relevance feedback. A directed arc from a descriptor node to RQ is drawn if the descriptor is attached to the query by an indexing system or relevance feedback. The query representation component in Fig. 1 shows that a user's information need RQ has dependence relations with T1, T3 and T|T| (the last descriptor node). It also says that all the other descriptors do not have dependence relations with RQ.

An IR system cannot precisely select only and all documents in which a user is interested since many variables determine the concerns of the user in a complex way. Therefore, a probabilistic retrieval model ranks documents according to P(RQ=1|Di=1), the relevance probability of a document Di with respect to the user's request. Based on the inference network shown in Fig. 1, P(RQ=1|Di=1) becomes

$$P(R_Q = 1 \mid D_i = 1) = \sum_{\mathbf{x} \in X} P(R_Q = 1 \mid x_1, \ldots, x_k)\, P(x_1 \mid D_i = 1) \cdots P(x_k \mid D_i = 1) \qquad (1)$$

Fig. 1. An inference network for the previous probabilistic retrieval models.


where x is a value vector (x1, x2, . . . , xk) whose element xj corresponds to a possible

value of a parent node tj of RQ, and X is the set of possible values for x.

Since the inference network assumes that the descriptors are assigned independently,

$$P(\mathbf{x} \mid D_i) = \prod_{j=1}^{k} P(x_j \mid D_i)$$

Thus, equation (1) reduces to

$$P(R_Q = 1 \mid D_i = 1) = \sum_{\mathbf{x} \in X} P(R_Q = 1 \mid \mathbf{x})\, P(\mathbf{x} \mid D_i = 1) \qquad (2)$$

Note that equation (2) is equivalent to the basic ranking expression of the previous

probabilistic retrieval models which are Binary Independence Retrieval (BIR)

(Robertson and Jones, 1976) and Retrieval with Probabilistic Indexing (RPI) (Fuhr,

1989). BIR reflects certainty indexing such that P(Tj=1|Di=1) = 1 if a descriptor Tj is

attached to a document Di. On the other hand, RPI makes use of uncertainty indexing

such that P(Tj=1|Di=1) stands for the probability of correctness estimated by an

indexing system. Correctness here means that the descriptor assignment correctly

represents the contents of a document.
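To make the ranking expression concrete, the following Python sketch (not from the paper; all probability values are illustrative assumptions) evaluates equation (2) by brute-force enumeration of the binary value vectors x for a toy query with two descriptor parents of RQ.

```python
from itertools import product

# Link matrix P(R_Q = 1 | x1, x2) for a toy query with two descriptor
# parents T1, T2 of R_Q; the values are illustrative assumptions.
p_rq_given_x = {
    (0, 0): 0.05,
    (0, 1): 0.40,
    (1, 0): 0.50,
    (1, 1): 0.90,
}

# P(T_j = 1 | D_i = 1) for one document D_i (uncertainty indexing, as in RPI);
# with certainty indexing (BIR) these would be exactly 0 or 1.
p_t_given_d = [0.8, 0.3]

def relevance_probability(p_rq_given_x, p_t_given_d):
    """Equation (2): sum over all binary vectors x of
    P(R_Q = 1 | x) * prod_j P(x_j | D_i = 1)."""
    total = 0.0
    for x in product([0, 1], repeat=len(p_t_given_d)):
        p_x = 1.0
        for xj, pj in zip(x, p_t_given_d):
            p_x *= pj if xj == 1 else 1.0 - pj
        total += p_rq_given_x[x] * p_x
    return total

print(relevance_probability(p_rq_given_x, p_t_given_d))  # 0.527 for these assumed values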

We focus, in this paper, on the dependence relations expressed in Fig. 1 and

equation (1). In the figure, only the parent nodes of RQ affect the probability,

P(RQ=1|Di=1). Therefore, the query representation component for a retrieval request

should exactly present all the dependence relations between descriptor nodes and RQ. A

document may or may not be relevant to the interest in the user's mind, which depends

on many factors in a complex way. Although the factors and their relationships have

not been completely figured out, the most important factors are concepts. The relevance

of a document would be determined by factors such as the importance, relationship and

description of some concepts. A user usually submits a set of descriptors to specify the

concepts. The query statement, therefore, is the main contributor to constructing the query

representation component.

A query statement alone, however, could not provide all the required dependence

relationships in many cases. This is because a concept can be expressed in a variety of

descriptors, which is known as the vocabulary gap problem. For example, consider a

query statement, ``the importance of indexing in information retrieval''. We can extract

from the query only the information that the relevance probability of a document

would be dependent on ``indexing'' and ``information retrieval''. It is shown as bold

lines in the query representation component of Fig. 2. For the given query in the figure,

BIR produces the same relevance probability of D3 and D4. However, we can intuitively

recognize that P(RQ=1|D3=1) would be higher than P(RQ=1|D4=1) because D3

contains ``document retrieval'' which designates the same concept represented by

``information retrieval''.

An IR system needs to utilize human knowledge in order to capture all the required

dependence relations from a query statement. The human knowledge in this regard is a

collection of relationships between descriptors. However, the inference network shown

in Fig. 1 does not take it into account appropriately. The knowledge can be either

obtained from predefined terminology thesauri, or automatically learned from training

samples given by humans. A well-observed problem of the general purpose thesauri is

that they often do not have sufficient vocabulary coverage across different

applications. Furthermore, subjective decisions in thesaurus development often do not

reflect the context sensitivity of descriptor meaning and the application-dependent

relationships between descriptors. The automatic learning approach from training

samples could alleviate the problems of the general purpose thesauri. However, it is

difficult to obtain training samples that cover the entire vocabulary in an application. In


Section 4, we propose an alternative method to construct knowledge, which incrementally learns relationships between descriptors from relevance feedback.

3. A RETRIEVAL NETWORK INCORPORATING KNOWLEDGE

The networks shown in Figs 1 and 2 state that a relevance judgment depends on the descriptors specified by a query. However, the judgment, in fact, must depend on the concepts in the user's mind. The descriptors in the query are merely chosen by the user based on his/her vocabulary and diction, possibly without much investigation. In other words, the descriptors in the query may not be all and only the ones that represent the concepts in the user's mind. Therefore, we propose to extend the inference network so that relationships between descriptors and concepts can be incorporated. Figure 3 shows this extended inference network.

In the proposed extended inference network, there are three components: the document, knowledge, and query representation components. The document representation component is the same as that of Fig. 1. The knowledge representation component is composed of a set of descriptor nodes T, a set of concept nodes C and

Fig. 2. An inference network for a query, ``the importance of indexing in information retrieval''.

Fig. 3. An extended inference network incorporating knowledge.


dependence relations between descriptors and concepts. In the figure, Ti ∈ T, i = 1, ..., |T| and Cj ∈ C, j = 1, ..., |C|. The set of descriptor nodes in the figure is equal to that in Fig. 1. Ci denotes a concept which is represented by descriptor Ti with certainty and

can be represented by several other descriptors uncertainly. An arc from descriptor node Tj to concept node Ci is drawn if the descriptor can designate the concept perfectly or partially. For instance, a concept ``INFORMATION RETRIEVAL'' may have several incoming arcs from descriptors such as ``information retrieval'', ``document

retrieval'', ``text retrieval'' and so on. We can draw arcs to reflect relationships defined in a thesaurus, such as broader descriptor, narrower descriptor, synonym, and analogue (Salton, 1989). The query representation component consists of the set of concept nodes C, a single node RQ that represents a user's information need, and dependence relations

between concept nodes and RQ. It is built for each retrieval request as in the previous probabilistic retrieval model. An arc from a concept node to RQ is placed if the descriptor that represents the concept with certainty is attached to the query.

The extended inference network should reflect all of the significant probabilistic dependencies among the variables represented by nodes in the document, knowledge, and query components. Each descriptor node contains a specification of its dependence upon its parent document nodes. We assume that this dependence is complete; a

descriptor is observed (Tj=1) exactly when one of its parent documents is observed (Di=1). In other words, we can certainly find the descriptor in the observed document. Each concept node Ck specifies the occurrence probability of the concept when at least one of its parent descriptor nodes is observed. The probability could be acquired from

a thesaurus or a knowledge base constructed automatically. We will propose a knowledge construction method in Section 4.

The node for an information need also has to specify the conditional probability for its potential causes, P(RQ=1|x), where x is a binary vector (x1, x2, ..., xm) whose element xj corresponds to a possible value of a parent node cj of RQ. It, however, cannot characterize the probability if it is regarded as a temporary node in an IR

system. We must estimate the conditional probability in this case. The estimation could make use of the distribution of the concepts in relevant documents, which is similar to that of BIR and RPI in that they utilize the distribution of the descriptors (Fuhr, 1989; Robertson and Jones, 1976). The distributions can be used by applying Bayes' theorem:

$$P(R_Q = 1 \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid R_Q = 1)}{P(\mathbf{x})}\, P(R_Q = 1) \qquad (3)$$

in which P(RQ=1) is the probability that a randomly selected document would be relevant to the interest of the user. As this probability does not change for a retrieval session, there is no need for its estimation when only a ranking of documents for this request is required. P(x|RQ=1) is the probability that a relevant document has

concepts x. Finally, P(x) is the probability that a randomly selected document has concepts x.

Given the conditional probabilities associated with the nodes other than the document nodes in the extended inference network, we observe a single document Di and attach evidence to the network asserting Di=1. Then an evidence propagation procedure is invoked to compute the probability P(RQ=1|Di=1) that Di is relevant (Pearl, 1988). We can now remove this evidence and instead assert that some Dj, j ≠ i, is observed. By repeating this process we can compute the probability for each document in the collection and rank the documents accordingly. However, the evidence propagation procedure is too complex to provide the user with the ranked documents within a reasonable time for a large collection. Moreover, node RQ may not be able to specify the dependence on its potential causes because most IR systems regard it as a temporary node for reasons of limited resources. Therefore, in the remainder of this section, we will attempt to simplify the following general ranking formula represented by the extended inference network, and then present the estimation of the


probabilistic parameters:

$$P(R_Q = 1 \mid D_i = 1) = \sum_{\mathbf{x} \in X} \Biggl[ P(R_Q = 1 \mid \mathbf{x}) \prod_j \sum_{\mathbf{y}_j \in Y_j} \Bigl( P(x_j \mid \mathbf{y}_j) \prod_k P(y_{jk} \mid D_i = 1) \Bigr) \Biggr] \qquad (4)$$

In the above equation, x is a binary vector (x1, x2, ..., xm) whose element xj corresponds to a possible value of concept cj, which is a parent of RQ. The set of all possible binary vectors for the parents of RQ is denoted as X. yj is a binary vector (yj1, yj2, ..., yjn) whose element yjk corresponds to a possible value of descriptor tjk, which is a parent of cj. The set of all possible binary vectors for the parents of cj is denoted as Yj.

As equation (4) is a general ranking formula for retrieval with concepts, it is possible to make some assumptions in order to break down the complex expression into a combination of simpler ones. The first reduction is achieved by the complete dependence assumption that a descriptor node depends on its parents completely, i.e. P(yjk|Di=1) is either 1 or 0. Under this assumption, for each j only the vector that matches the descriptors observed in Di contributes to the inner sum, so

$$P(R_Q = 1 \mid D_i = 1) = \sum_{\mathbf{x} \in X} \Bigl[ P(R_Q = 1 \mid \mathbf{x}) \prod_j P(x_j \mid \mathbf{y}_{ij}) \Bigr] \qquad (5)$$

Here, a binary vector yij represents whether the parent descriptors of cj appear in document Di or not.

For simplicity, two independence assumptions have frequently been adopted in the past: the distribution of descriptors in relevant documents is independent, and their distribution in all documents is independent. However, recent research has shown that the former is incompatible with the latter, and has suggested the linked dependence assumption, which is consistent with the latter (Cooper, 1991; Cooper et al., 1992):

$$\frac{P(\mathbf{x} \mid R_Q = 1)}{P(\mathbf{x} \mid R_Q = 0)} = \prod_{j=1}^{m} \frac{P(x_j \mid R_Q = 1)}{P(x_j \mid R_Q = 0)} \qquad (6)$$

This asserts that if two or more query concepts are statistically dependent in the set of relevant documents, then they are statistically dependent to a corresponding degree in the set of non-relevant ones.

Though the linked dependence assumption cannot simplify equations (3)–(5), it is used to derive a simple expression for the odds³ of relevance as follows:

$$O(R_Q \mid D_i = 1) = O(R_Q)\, \frac{\displaystyle\sum_{\mathbf{x} \in X} \prod_j \bigl[ P(x_j \mid R_Q = 1)/P(x_j) \bigr]\, P(x_j \mid \mathbf{y}_{ij})}{\displaystyle\sum_{\mathbf{x} \in X} \prod_j \bigl[ P(x_j \mid R_Q = 0)/P(x_j) \bigr]\, P(x_j \mid \mathbf{y}_{ij})} \qquad (7)$$

where P(xj|RQ=1) is the probability that concept cj has value xj in relevant documents, and P(xj|RQ=0) is the corresponding probability for the non-relevant documents.

Since the summand factorizes over j, the sum over x and the product over j can be interchanged, and this can be transformed into

$$O(R_Q \mid D_i = 1) = O(R_Q) \prod_j \frac{\bigl[ P(c_j = 1 \mid R_Q = 1)/P(c_j = 1) \bigr] P(c_j = 1 \mid \mathbf{y}_{ij}) + \bigl[ P(c_j = 0 \mid R_Q = 1)/P(c_j = 0) \bigr] P(c_j = 0 \mid \mathbf{y}_{ij})}{\bigl[ P(c_j = 1 \mid R_Q = 0)/P(c_j = 1) \bigr] P(c_j = 1 \mid \mathbf{y}_{ij}) + \bigl[ P(c_j = 0 \mid R_Q = 0)/P(c_j = 0) \bigr] P(c_j = 0 \mid \mathbf{y}_{ij})} \qquad (8)$$

With the following notation:

³The odds of an event E is defined as O(E) = P(E = 1)/P(E = 0).


$$o_j = P(c_j = 1 \mid R_Q = 1), \qquad p_j = P(c_j = 1 \mid R_Q = 0), \qquad q_j = P(c_j = 1), \qquad r_{ij} = P(c_j = 1 \mid \mathbf{y}_{ij})$$

we have

$$O(R_Q \mid D_i = 1) = O(R_Q) \prod_j \frac{(o_j/q_j)\, r_{ij} + \bigl[ (1 - o_j)/(1 - q_j) \bigr] (1 - r_{ij})}{(p_j/q_j)\, r_{ij} + \bigl[ (1 - p_j)/(1 - q_j) \bigr] (1 - r_{ij})} \qquad (9)$$

We will use equation (9) as the ranking formula. This is because ranking documents by the odds is equivalent to ranking them by the probability, and the odds is expressed more simply, with consistent assumptions, than the probability. Note that the relevance probability would also be a useful feature in a probability-based retrieval system because it can give the user some notion of the retrieval quality of the documents retrieved.

To apply the formula, we must estimate the parameters oj, pj, qj and rij. The

parameters oj and pj are specific to an information need. They depend on the style of a user in constructing a query. A user writes a query so that it includes a set of descriptors specifying the concepts which affect the relevance judgment on a document, and tries to give a necessary and sufficient condition to retrieve only the relevant documents in a collection. But the query in general cannot represent the condition perfectly, since a necessary and sufficient condition cannot be formulated in most cases. It usually represents, instead, a necessary condition to retrieve all relevant documents. In other words, all relevant documents have the concepts specified in the query, i.e. oj = 1. We call this the Faithful User Assumption (FUA). With the assumption, we have pj = (cfj − |R|)/(|D| − |R|), where |D| denotes the total number of documents in a collection and |R| is the number of documents relevant to the query; cfj, the frequency of occurrence of concept cj in the collection, is related to qj (under FUA all |R| relevant documents contain the concept, so cfj − |R| of the |D| − |R| non-relevant documents contain it). When the number of documents in a collection is large enough to cover a lot of concepts, the occurrence probability of a concept in the non-relevant documents can be approximated by the occurrence probability of the concept in the entire document collection, that is, pj ≈ qj. This is because the majority of the documents in the collection are in general non-relevant to a specific query. A retrieval with FUA and this approximation is called Faithful User Retrieval (FUR).
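As a concrete illustration, the following is a minimal Python sketch (not the authors' implementation; it assumes the parameters have already been estimated as described above and in Section 4) of the per-document score of equation (9) under FUR, i.e. with the Faithful User Assumption oj = 1 and the approximation pj ≈ qj.

```python
def fur_odds(query_concepts, doc_concept_probs, concept_stats, o_default=1.0):
    """Rank score of one document under equation (9).

    query_concepts    : iterable of concept ids c_j appearing in the query
    doc_concept_probs : dict c_j -> r_ij = P(c_j=1 | y_ij), the concept occurrence
                        probability given the descriptors observed in document D_i
    concept_stats     : dict c_j -> (p_j, q_j); under FUR p_j is approximated by q_j
    o_default         : o_j = P(c_j=1 | R_Q=1); the Faithful User Assumption sets it to 1
    The query-constant factor O(R_Q) is dropped, since it does not affect the ranking.
    """
    score = 1.0
    for cj in query_concepts:
        r = doc_concept_probs.get(cj, 0.0)
        p, q = concept_stats[cj]
        q = min(max(q, 1e-6), 1.0 - 1e-6)   # guard against division by zero
        p = min(max(p, 1e-6), 1.0 - 1e-6)
        o = o_default
        num = (o / q) * r + ((1.0 - o) / (1.0 - q)) * (1.0 - r)
        den = (p / q) * r + ((1.0 - p) / (1.0 - q)) * (1.0 - r)
        score *= num / den
    return score
```

Note that with oj = 1 and pj = qj each factor algebraically reduces to rij/qj, so a document scores higher the more query concepts it exhibits (directly or through analogues) and the rarer those concepts are in the collection.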

The parameters qj and rij will be approximated with the probabilities for the descriptors rather than with the frequency of the concept, since a concept is a conceptual object represented by descriptors, which are physical objects. The details will be presented in the next section because they depend on the knowledge construction method.

4. INCREMENTAL IDENTIFICATION OF ANALOGUES

In this section we propose a knowledge construction method called ``Learning from Relevant Documents (LRD)'' which involves user feedback. Although the idea of employing user feedback resembles the notion of traditional relevance feedback (Haines and Croft, 1993; Harman, 1988, 1992; Smeaton and van Rijsbergen, 1983; Spink, 1994), quite well known in the IR literature, the objective of LRD is somewhat different. In traditional relevance feedback the user indicates which of the documents retrieved for the current query are relevant to his/her request. Then some of the descriptors in the relevant documents are added to the query. The added descriptors make the information need more precise, which may not always be known exactly in the initial query, and they also reduce the vocabulary gap between documents and the


query. The query expansion by the traditional relevance feedback has shown signi®cant

improvement over the initial query. It also shows an improvement over the expanded

query with a thesaurus in many cases. Traditional relevance feedback focuses on the successive refinement of the condition specified in a query, which is based on the

relevance information for the query. It does not, however, consider utilizing this previous relevance information for other, different information needs. In other words, the knowledge obtained from one query's refinement is not used to improve performance for other information requests. LRD, on the other hand, makes use of the relevance information given by many past users to increase the retrieval effectiveness of other, different queries. It effectively constructs a knowledge-base of certain stable relationships among descriptors to alleviate the vocabulary gap problem.

The knowledge-base created and maintained by LRD is a collection of analogues of

the descriptors which have been used at least once in the past queries. If descriptor Tk

can perfectly or partially designate concept Cj, which is represented by descriptor Tj with

certainty, Tk is regarded as an analogue of Tj, shown as a parent of the Cj node in Fig. 3.

All descriptors classified as synonyms, broader or narrower descriptors in a general

thesaurus are analogues. The concept occurrence probability in a document is a

parameter of the proposed ranking formula. In order to obtain the concept occurrence

probability when one or more analogues are observed in a document, we need a sample collection and users. The collection should have sufficient vocabulary coverage to reflect

all analogues of descriptors in an application area. The users should make decisions

whether a concept appears in a document or not. In most cases, however, it is difficult

to get the collection and the users.

We instead derive the concept occurrence probability from the distributions of relevant

documents in descriptor spaces. The Bayesian network shown in Fig. 3

assumes that relevant documents and a descriptor are conditionally independent given a

concept. We can get the following equation with the assumption:

$$P(R_{C_j} = 1 \mid C_j = 1, T_j = 1) = P(R_{C_j} = 1 \mid C_j = 1, \mathbf{y}_j) = P(R_{C_j} = 1 \mid C_j = 1) \qquad (10)$$

where RCj denotes a document relevant to a query with descriptor Tj, the descriptor that represents the concept Cj with certainty, and yj denotes a binary vector (yj1, yj2, ..., yjn) in which each yjk corresponds to a possible value of a parent node tjk.

The equation can be transformed into

$$P(C_j = 1 \mid \mathbf{y}_j) = \frac{P(R_{C_j} = 1, C_j = 1 \mid \mathbf{y}_j)}{P(R_{C_j} = 1, C_j = 1 \mid T_j = 1)}\, P(C_j = 1 \mid T_j = 1) \qquad (11)$$

We have made the assumption FUA, that P(Cj=1|RCj=1) = 1, in the previous section. The assumption implies that P(RCj=1, Cj=1|z) = P(RCj=1|z), where z is a space specified by some descriptors. Also, P(Cj=1|Tj=1) = 1 because Cj is the concept designated by Tj with certainty. Therefore, equation (11) is simplified as follows:

$$P(C_j = 1 \mid \mathbf{y}_j) = \frac{P(R_{C_j} = 1 \mid \mathbf{y}_j)}{P(R_{C_j} = 1 \mid T_j = 1)} \qquad (12)$$

Given a query with Tj, P(RCj=1|yj) and P(RCj=1|Tj=1) must be predicted because the distribution of documents relevant to the query is not known in advance of retrieval. We estimate the probabilities by their expected values over the queries with Tj, that is, P(RCj=1|yj) = E(P(RCj=1|yj)) and P(RCj=1|Tj=1) = E(P(RCj=1|Tj=1)). In environments where a general or a manually constructed thesaurus is provided, one can obtain the expected probabilities to estimate the concept occurrence probability when a document contains one or more analogues identified by the thesaurus. In other environments, it is very difficult to obtain E(P(RCj=1|yj)) since not all analogues have been identified previously. LRD is a learning scheme to estimate the concept occurrence probability in the latter environments.
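The following sketch (hypothetical class and field names, not from the paper) shows one way the expected probabilities of equation (12) could be maintained as running averages over the past queries containing Tj and then turned into a concept occurrence probability.

```python
class ConceptStats:
    """Running averages used to instantiate equation (12) for one concept C_j.

    The structure is illustrative, not the authors' implementation.
    """
    def __init__(self):
        self.n_queries = 0           # number of past queries containing T_j
        self.sum_p_rel_given_y = {}  # analogue pattern y -> sum of P(R=1 | y)
        self.sum_p_rel_given_t = 0.0 # sum of P(R=1 | T_j=1) over past queries

    def update(self, p_rel_given_y, p_rel_given_t):
        """p_rel_given_y: dict pattern -> fraction of documents with that analogue
        pattern judged relevant in this query's feedback; p_rel_given_t: fraction of
        documents containing T_j itself that were judged relevant."""
        self.n_queries += 1
        for y, p in p_rel_given_y.items():
            self.sum_p_rel_given_y[y] = self.sum_p_rel_given_y.get(y, 0.0) + p
        self.sum_p_rel_given_t += p_rel_given_t

    def concept_probability(self, y):
        """Equation (12): P(C_j=1 | y) ~ E(P(R=1 | y)) / E(P(R=1 | T_j=1))."""
        if self.n_queries == 0:
            return 0.0
        e_y = self.sum_p_rel_given_y.get(y, 0.0) / self.n_queries
        e_t = self.sum_p_rel_given_t / self.n_queries
        return min(e_y / e_t, 1.0) if e_t > 0 else 0.0
```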


LRD incrementally identifies analogues from the descriptors in the relevant

documents. The analogues are represented as an ordered set (tj1, tj2, ..., tjn). The first

item tj1 is descriptor Tj that designates concept Cj with certainty, and the others are

descriptors identified from the past queries incrementally. For incremental update without storing a large amount of past data, LRD assumes Exclusive Determination of Concept (EDC): when a document contains tjk and some tjl, where 1 ≤ k ≤ l ≤ n, the concept occurrence in the document is determined by tjk exclusively and independently of the others, that is:

$$P(C_j = 1 \mid t_{j1} = y_{j1}, \ldots, t_{jn} = y_{jn}) = P(C_j = 1 \mid t_{j1} = y_{j1}, \ldots, t_{jk} = y_{jk}) \qquad (13)$$

where yj1 = 0, ..., yj(k−1) = 0, yjk = 1.

Each time retrieval results for a query with Tj are available, the expected probabilities

should be updated for all possible values of the parents of Cj in an attempt to capture

the consensual meaning of descriptors shared by users. Note that the (2^n − 1) possible values of the parents are covered by n possibilities according to equation (13). If there are one or more relevant documents in which no analogue appears, LRD selects new analogues recursively from the descriptors (tj1′, tj2′, ..., tjm′) attached to the relevant documents until all of them have at least one analogue. tjk′ is regarded as the (n+1)th analogue if it satisfies the following condition for all descriptors tjl′ where P(Cj=1|tj1=0, ..., tjn=0, tjl′=1) ≤ 1:

$$0 < P(C_j = 1 \mid t_{j1} = 0, \ldots, t_{jn} = 0, t_{jl}' = 1) \le P(C_j = 1 \mid t_{j1} = 0, \ldots, t_{jn} = 0, t_{jk}' = 1) \le 1 \qquad (14)$$

If tjo′ also satisfies this condition, as tjk′ does, where k < o, only tjk′ is registered in the ordered set.
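A minimal sketch of this selection step is given below (hypothetical names; the estimates in concept_prob are assumed to be maintained elsewhere from past relevance feedback). It appends candidate analogues, chosen by the largest non-zero estimate as in condition (14), until every relevant document of the current feedback contains at least one analogue.

```python
def select_new_analogues(analogues, relevant_docs, concept_prob):
    """One hypothetical LRD update step for concept C_j.

    analogues     : ordered list [t_j1, ..., t_jn]; t_j1 is the query descriptor T_j itself
    relevant_docs : list of descriptor sets, one per document judged relevant
                    to a query containing T_j
    concept_prob  : dict descriptor -> current estimate of
                    P(C_j=1 | t_j1=0, ..., t_jn=0, descriptor=1)
    """
    analogues = list(analogues)
    while True:
        # Relevant documents not yet covered by any registered analogue.
        uncovered = [d for d in relevant_docs
                     if not any(t in d for t in analogues)]
        if not uncovered:
            break
        # Candidate descriptors: those attached to the uncovered relevant documents.
        candidates = set().union(*uncovered) - set(analogues)
        scored = [(concept_prob.get(t, 0.0), t) for t in candidates]
        scored = [st for st in scored if st[0] > 0.0]
        if not scored:
            break  # nothing satisfies the lower bound of condition (14)
        _, best = max(scored)   # largest estimate is registered next
        analogues.append(best)
    return analogues
```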

One may object that an analogue identified by LRD may be completely unrelated to the

query descriptor. Although this is true, it does not matter. This is because

the relation of an analogue to the query descriptor is quantified. If the number of

documents that have analogue Tk and are relevant to the past queries with descriptor

Tj is larger than the number of documents that have analogue Tl and are relevant to

the past queries, one can say that Tk is more tightly related to Tj than Tl. LRD assigns

a larger value to Tk than to Tl. If an analogue is completely unrelated, the quantified

value would be so small that it does not affect the ranking of a document. An IR

system with LRD, therefore, would outperform systems without knowledge for the

same reason as one expects that an expanded query with descriptors related to the

query descriptors distinguishes relevant documents from non-relevant ones more effectively than

the original query.

Finally, once the concept occurrence probability in the collection, P(Cj=1), has been estimated, we can rank documents with respect to the given query. With equation (13) the probability is expressed as follows:

$$P(C_j = 1) = \sum_{k=1}^{n} P(C_j = 1 \mid t_{j1} = 0, \ldots, t_{j(k-1)} = 0, t_{jk} = 1)\, P(t_{j1} = 0, \ldots, t_{j(k-1)} = 0, t_{jk} = 1) + P(C_j = 1 \mid t_{j1} = 0, \ldots, t_{jn} = 0)\, P(t_{j1} = 0, \ldots, t_{jn} = 0) \qquad (15)$$

In this equation P(Cj=1|tj1=0, ..., tjn=0) is the concept occurrence probability in the

documents which contain none of the previously identified analogues. The probability is

assumed to be equal to the probability that the concept appears in the documents with

at least one analogue, that is:

$$P(C_j = 1 \mid t_{j1} = 0, \ldots, t_{jn} = 0) = \sum_{k=1}^{n} P(C_j = 1 \mid t_{j1} = 0, \ldots, t_{j(k-1)} = 0, t_{jk} = 1)\, P(t_{j1} = 0, \ldots, t_{j(k-1)} = 0, t_{jk} = 1) \qquad (16)$$
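The following sketch (illustrative only, written under the EDC assumption and using the reconstructed equations (15) and (16) as displayed above) estimates P(Cj=1) from the collection, taking for each document the first registered analogue it contains.

```python
def concept_collection_probability(analogues, doc_descriptors, concept_prob):
    """Equations (15)-(16): estimate P(C_j = 1) over the whole collection.

    analogues       : ordered list [t_j1, ..., t_jn]
    doc_descriptors : list of descriptor sets, one per document in the collection
    concept_prob    : dict t_jk -> P(C_j=1 | t_j1=0, ..., t_j(k-1)=0, t_jk=1)
    """
    n_docs = len(doc_descriptors)
    total = 0.0            # sum over documents containing some analogue
    docs_without_analogue = 0
    for doc in doc_descriptors:
        # EDC: the first analogue (in registration order) present in the
        # document determines the concept occurrence probability.
        for t in analogues:
            if t in doc:
                total += concept_prob.get(t, 0.0)
                break
        else:
            docs_without_analogue += 1
    # First term of equation (15): sum_k P(C_j=1 | pattern_k) P(pattern_k).
    p_with_analogue = total / n_docs
    # Per equation (16), documents with no analogue are assumed to exhibit the
    # concept with the same aggregate probability as that sum.
    p_none = docs_without_analogue / n_docs
    return p_with_analogue + p_with_analogue * p_none
```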


5. EXPERIMENTAL EVALUATION

We implemented the proposed ranking formula FUR and the knowledge construction method LRD on the SMART experimental system. Experiments were carried out with two different document collections, namely CACM and CISI, which are standard collections frequently used to assess the effectiveness of a retrieval scheme. CACM is a collection of 3204 documents and 52 queries in the field of computer science. The CISI collection is composed of 76 queries and 1460 documents of information science literature. A document in the collections has a title and an abstract, and a query is expressed in natural language. The collections provide relevance assessments for each document with respect to a query. Descriptors are attached to the documents and the queries by the automatic single-word indexing facility of SMART.

To evaluate the effectiveness of a retrieval scheme, it is customary to compute the average values of the recall and the precision over the queries in a collection. For a query, documents are ranked by the retrieval scheme, and then we compute the recall after retrieving the first 100 high-ranked documents and the average precision at the eleven fixed recall levels 0.0, 0.1, and so on up to 1.0. Note that the precision at a fixed recall level is interpolated precision: e.g., for a particular query, the precision at 0.4 is the maximum of the precision at any recall point greater than or equal to 0.4. Finally, the recall and the precision are averaged over the queries in a collection.
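For reference, a small Python sketch of this evaluation protocol (recall at 100 retrieved documents and eleven-point interpolated precision) might look as follows; it is a standard computation, not code taken from the SMART system.

```python
def recall_at_100(ranking, relevant):
    """ranking: list of document ids in decreasing score order;
    relevant: set of document ids judged relevant to the query."""
    retrieved = set(ranking[:100])
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

def eleven_point_interpolated_precision(ranking, relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0: at each level take
    the maximum precision at any recall point >= that level."""
    points = []  # (recall, precision) at each relevant document retrieved
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    levels = [i / 10 for i in range(11)]
    interpolated = []
    for level in levels:
        candidates = [p for r, p in points if r >= level]
        interpolated.append(max(candidates) if candidates else 0.0)
    return interpolated
```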

The experiments for the proposed retrieval scheme were run 20 times to show the learning effects. The same set of queries was used for all runs. For a query, documents are ranked by FUR with the knowledge constructed from the retrieval results for the past queries in the same run and the previous runs. Then we use only the relevant documents among the first 100 high-ranked documents to update the knowledge, regarding the others as non-relevant. If the recall and the precision increase for the query in the next run, we can affirm that the updated knowledge moves relevant documents toward the top of the ranking and increases the number of relevant documents in the first 100. In short, we can conclude that the knowledge improves the retrieval effectiveness.

We have also evaluated BIR 20 times to serve as a basis for comparison with FUR. Given a document D = (d1, d2, ..., dt), where dj is 1 or 0 depending on whether the jth descriptor is present in the document, BIR defines a document ranking function for a query Q as follows (Salton, 1989):

$$\sum_{j=1}^{t} d_j \log \frac{p_j (1 - q_j)}{q_j (1 - p_j)} \qquad (17)$$

where pj = P(tj=1|RQ=1) and qj = P(tj=1|RQ=0).

When little is known about the relevance or non-relevance of any particular document with respect to the query, qj is approximated by P(tj=1), and pj is assumed to be equal to 0.5 (Croft and Harper, 1979). If retrieval results by BIR are available for the past queries with tj in the same run and the previous runs, the probabilities are estimated with the relevance information: pj and qj are assumed to be equal to their expected probabilities in the queries with tj, i.e., pj = E(P(tj=1|Rtj=1)) and qj = E(P(tj=1|Rtj=0)). The documents relevant to a query that are not in the top 100 ranking are regarded as non-relevant.
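A minimal sketch of this BIR baseline (equation (17)) with the Croft and Harper defaults is shown below; the smoothing of qj and the clamping of pj are our own assumptions, not taken from the paper.

```python
import math

def bir_score(query_terms, doc_terms, doc_freq, n_docs, p_rel=None):
    """Equation (17): sum, over query descriptors present in the document, of
    log( p_j (1 - q_j) / ( q_j (1 - p_j) ) ).

    query_terms : set of descriptors in the query
    doc_terms   : set of descriptors in the document
    doc_freq    : dict descriptor -> number of documents containing it
    n_docs      : number of documents in the collection
    p_rel       : optional dict descriptor -> estimated P(t_j=1 | R_Q=1); when
                  absent, the Croft and Harper (1979) default p_j = 0.5 is used
                  and q_j is approximated by P(t_j = 1).
    """
    score = 0.0
    for t in query_terms & doc_terms:
        q = (doc_freq.get(t, 0) + 0.5) / (n_docs + 1.0)   # smoothed P(t_j=1), an assumption
        p = p_rel.get(t, 0.5) if p_rel else 0.5
        p = min(max(p, 1e-6), 1.0 - 1e-6)                 # keep the log term finite
        score += math.log(p * (1.0 - q) / (q * (1.0 - p)))
    return score
```

With p_j = 0.5 the per-term weight reduces to log((1 − q_j)/q_j), the familiar idf-like weight of the no-relevance-information case.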

The experimental results of each run for FUR and BIR are plotted in Figs 4 and 5. At the first run, the average recalls of BIR for CACM and CISI at 100 top-ranked documents are 0.6459 and 0.3570, respectively. The average precision for CACM is 0.2846 and that for CISI is 0.1624. For FUR, the average recall and the average precision for CACM are 0.6346 and 0.2814, and those for CISI are 0.3653 and 0.1688, respectively. Although a query is evaluated with the knowledge collected from the previous queries, the results for the first run can be considered as the retrieval effectiveness without knowledge. This is because the number of queries is small, and most


query terms appear in at most three queries. Therefore, we can claim from these results that the retrieval effectiveness of an IR system based on FUR is similar to that of BIR when no knowledge is available.

The knowledge constructed by LRD during the first run increases the number of relevant documents in the first 100 high-ranked documents at the second run. In other words, it improves the average recall of FUR after retrieving 100 documents, to 0.6811 for CACM and 0.3892 for CISI. The average precision of FUR is also increased over the first run, i.e., 0.4458 for CACM and 0.2265 for CISI. The retrieval effectiveness is

Fig. 4. Precisions and Recalls for CACM.

Fig. 5. Precisions and Recalls for CISI.


improved significantly until the fourth run with the aid of the knowledge updated from the previous runs. In the remaining runs the average recalls do not vary much, so the knowledge changes little. This causes the average precision to be almost constant. The average recall and the average precision of FUR after the 20th run for CACM are 0.7132 and 0.4937, respectively; those for CISI are 0.4008 and 0.3032. The corresponding recalls and precisions of BIR are 0.7201 and 0.3598 for CACM, and 0.4281 and 0.2155 for CISI, respectively. After learning based on the assumptions FUA and EDC, the proposed scheme provides improvements of average precision over BIR of +37% for CACM and +41% for CISI, while the corresponding changes of the average recall are −1% and −6%.

The collections with which we have experimented are fairly homogeneous and have a small number of queries. The experimentation shows that improvements in effectiveness are achieved within the first four or five runs, after which the situation stabilizes. This plateau effect takes place when query expansion by the analogues of query descriptors no longer improves recall, and the knowledge then becomes unvarying. The number of queries needed to figure out stable relationships between a query descriptor and its analogues depends on the number of contexts related to the query descriptor. The larger the number of contexts, the more queries are required. A real environment would consist of more heterogeneous documents and a large number of queries related to a variety of contexts. Hence, stability in a real environment would be reached considerably later than in these experiments.

In these experiments a query was expanded with the analogues of all the descriptors in the query itself. However, in a real environment it is desirable to expand a query with the analogues of only narrow query descriptors. Although this paper does not define narrow descriptors formally, this seems to be worthwhile for several reasons. First, expansion with the analogues of broad query descriptors can make the topics expressed in the query ambiguous, and consequently the retrieval effectiveness is decreased; the analogues of a descriptor reflect the contexts in which the descriptor has appeared, and since a broad query descriptor relates to many contexts, expansion with its analogues makes the query cover many contexts not related to the query. Second, a narrow query descriptor occurs in correlated contexts, and its analogues clarify those contexts. Third, document ranking is dominated by narrow query descriptors, so one could obtain large improvements by alleviating the vocabulary gap for narrow query descriptors only.

6. CONCLUDING REMARKS

Most information retrieval systems present to a user a sequence of documents ranked in decreasing order of their relevance probabilities. This is because only and all relevant documents cannot be precisely retrieved. In the previous probabilistic retrieval models the relevance probability of a document is expressed as the joint probability of descriptors. Those models have assumed that the probability is independent of the descriptors that are not specified in the query. However, there can be many descriptors that represent the same concepts. Thus, the probability is in fact dependent on the many descriptors designating the concepts of the query descriptors, regardless of whether they are specified in the query or not; it is independent only of the concepts which do not appear in the query. We have extended the probabilistic retrieval models to reflect this notion by incorporating the dependence relationships between descriptors and concepts. We have also proposed a learning method from relevant documents that incrementally figures out the descriptors designating a concept. It is distinguished from traditional relevance feedback in that it attempts to capture certain stable relationships of descriptors from many past user queries.


The experimental results have shown that the proposed retrieval model with the learning method can successfully be used for ranking documents. While the results of the proposed model without knowledge are similar to those of the previous probabilistic retrieval model, significant improvements are achieved after learning from relevant documents. Since the quality of the knowledge dominates the retrieval effectiveness of the proposed model, our method can provide better performance if the learning method is improved to identify analogues more accurately.

REFERENCES

Cooper, W. S. (1991). Some inconsistencies and misnomers in probabilistic information retrieval. In Proceedings of ACM SIGIR International Conference on Research and Development in Information Retrieval, pp. 57–61.

Cooper, W. S., Gey, F. C., and Dabney, D. P. (1992). Probabilistic retrieval based on staged logistic regression. In Proceedings of ACM SIGIR International Conference on Research and Development in Information Retrieval, pp. 198–210.

Croft, W., & Harper, D. (1979). Using probabilistic models of information retrieval without relevance information. Journal of Documentation, 35(4), 285–295.

Fuhr, N. (1989). Models for retrieval with probabilistic indexing. Information Processing and Management, 25(1), 55–72.

Fuhr, N., & Huther, H. (1989). Optimum probability estimation from empirical distributions. Information Processing and Management, 25(5), 491–507.

Haines, D. and Croft, W. B. (1993). Relevance feedback and inference networks. In Proceedings of ACM SIGIR International Conference on Research and Development in Information Retrieval, pp. 2–11.

Harman, D. (1988). Towards interactive query expansion. In Proceedings of ACM SIGIR International Conference on Research and Development in Information Retrieval, pp. 321–331.

Harman, D. (1992). Relevance feedback revisited. In Proceedings of ACM SIGIR International Conference on Research and Development in Information Retrieval, pp. 1–10.

Lee, J. H., Kim, M. H., & Lee, Y. J. (1993). Information retrieval based on conceptual distance in IS-A hierarchies. Journal of Documentation, 49(2), 188–207.

Losee, R. (1988). Parameter estimation for probabilistic document-retrieval models. Journal of the American Society for Information Science, 39(1), 8–16.

Park, Y. C., & Choi, K.-S. (1996). Automatic thesaurus construction using Bayesian networks. Information Processing and Management, 32(5), 543–553.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers.

Robertson, S. and Jones, K. (1976). Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3), 129–146.

Robertson, S. and Walker, S. (1994). Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of ACM SIGIR International Conference on Research and Development in Information Retrieval, pp. 232–241.

Salton, G. (1971). The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, NJ.

Salton, G. (1989). Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison Wesley.

Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513–523.

Smeaton, A., & van Rijsbergen, C. (1983). The retrieval effects of query expansion on a feedback document retrieval system. The Computer Journal, 26(3), 239–246.

Spink, A. (1994). Term relevance feedback and query expansion. In Proceedings of ACM SIGIR International Conference on Research and Development in Information Retrieval, pp. 79–90.

Turtle, H. and Croft, W. (1990). Inference networks for document retrieval. In Proceedings of ACM SIGIR International Conference on Research and Development in Information Retrieval, pp. 1–24.

Wu, H., & Salton, G. (1981). The estimation of term relevance weights using relevance feedback. Journal of Documentation, 37(4), 194–214.
