Application of Markov chains in an interactive information retrieval system
A brief introduction for people knowledgeable about Markov chains but not information retrieval, and vice versa.
Scenario
• Need info to perform a task, but are uncertain about the utility of the documents ...
• Guidance from experienced people would be helpful!
• An Information Retrieval (IR) system presents lots of documents, but which suite of documents would be helpful?
• From an IR perspective, we’d have to design an IR system to retrieve relevant documents and gather experienced users’ preferences over time to establish the probability of a suite of associated documents.
Scenario
• Today’s talk asks whether it is feasible to model and build just such a probabilistic interactive retrieval system.
Audience: Given that the audience either knows about Markov chains but not about IR, or about IR but not Markov chains, I will present only the basics of the project. It may be useful first to define both IR and Markov chains and then discuss their intersection.
The Problem
In IR the goal is to retrieve the most relevant documents in response to a user’s query and then present them in a comprehensible way. The problem is that IR systems rely on
(a) a semantic, or surface-level, model of language without any contextualization, and ignore
(b) how the language is actually used (the pragmatic entailments),
(c) the domain of use by real people, and
(d) how people think during information-seeking sessions.
A potential solution
• There are many avenues of research:
• Group awareness, query chains, interactive visualization
• This sounds like a lot: to make things clearer for this audience, I’ll present the usual models of IR and then explain the project.
Taxonomy of IR models
Browsing (user not sure what he wants):
• [Browsing] Flat files, Structure guided, Hypertext
Retrieval (user knows what he wants):
• [Classic models] Boolean, Vector, Probabilistic
• [Set theoretic] Fuzzy set theoretic, Extended Boolean
• [Algebraic] Generalized vector, Latent Semantic Indexing, Neural networks, Genetic algorithms, Genetic programming
• [Probabilistic] Inference networks, Belief networks
• [Structured models] Non-overlapping lists, Proximal nodes
Information Retrieval (IR)
• Focuses on the representation, storage, organization of, and access to information sources.
• The emphasis in IR is on locating and presenting data that are useful to people, that is, information.
IR
• IR is mostly associated with “full-text retrieval”
• Also focuses on “indexing and searching” of records, users and interface design, and novel representations of data, provided they help the user interpret the retrieval set.
• Since IR encompasses so many types of computer-based files, IR sees documents, users, and information from an abstract point of view, leaving researchers & developers to create their own implementation of the abstract model.
IR defined
An information retrieval model is a quadruple [D, Q, F, R(qi, dj)] where
D = the “logical view” of the document collection,
Q = the “logical view” of the user’s queries,
F = a framework for matching D and Q, and
R(qi, dj) = a mathematical function to rank retrieved individual documents (dj) from the collection against the user’s query (qi).
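The quadruple can be made concrete with a minimal sketch in Python. The overlap-based `rank` function and the sample collection below are illustrative stand-ins for F and R, not the talk’s actual model:

```python
# A minimal sketch of the IR quadruple [D, Q, F, R].
# The names and the overlap scoring are illustrative assumptions.

def rank(query_terms, documents):
    """R(q_i, d_j): score each document by simple term overlap."""
    scores = {}
    for doc_id, doc_terms in documents.items():
        scores[doc_id] = len(set(query_terms) & set(doc_terms))
    # Present documents in descending score order.
    return sorted(scores, key=scores.get, reverse=True)

# D: logical view of the collection (here, bags of terms)
D = {"d1": ["markov", "chain", "model"],
     "d2": ["information", "retrieval", "model"],
     "d3": ["cooking", "recipes"]}
# Q: logical view of one user query
q = ["retrieval", "model"]
print(rank(q, D))  # ['d2', 'd1', 'd3']
```

Here F is simply set intersection; a real system would use the weighted matching described later in the talk.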
IR example
• One example of IR is any Internet search engine.
– For instance, have you ever wondered what happens to your search terms when you send them to an Internet search engine?
• Why are some documents presented to you in the retrieval set and others are not?
• Why are the retrieved documents listed (or ranked) as they are?
The IR Model
• D [document collection] can be full-text documents, documents with library cataloguing records, HTML documents ... pretty much anything!
• Q [query] is the “user’s expression of information need.” Since everyone expresses an information need differently, queries are the least stable part of IR.
The IR Model
• F [framework] is typically the computing environment
• R [ranking] is how retrieved elements are associated with the user’s query and with each other
File parsing
• Before IR can occur, the document must be parsed.
• Using the example of a full-text document:
– The file is opened by the computer program
– Each term, one by one, is examined by the program:
• Is this term a “stop term” (i.e., a term to be skipped)?
• Is this term very common (e.g., “the”, “an”)?
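The parsing pass above can be sketched in a few lines of Python; the small `STOP_TERMS` list and the `parse` helper are illustrative, not part of the talk:

```python
# Toy parsing pass: examine each term, skip stop terms,
# and tally the remaining terms' frequencies.
STOP_TERMS = {"the", "an", "a", "of", "and", "in"}  # illustrative sample

def parse(text):
    counts = {}
    for term in text.lower().split():
        if term in STOP_TERMS:      # very common term: skip it
            continue
        counts[term] = counts.get(term, 0) + 1
    return counts

print(parse("The cat and the hat"))  # {'cat': 1, 'hat': 1}
```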
File parsing
• The term is stored in a database along with the frequency of the term’s occurrence, either within the individual document or within the entire document collection
• Optionally: terms may be “stemmed” - that is, the grammatical endings are removed (e.g., “fishes” becomes “fish”; “goes” becomes “go”). In English, the usual stemming technique is the “Porter stemming algorithm.”
File parsing
• When complete, the program calculates a “weight” for each term which is stored in a “term/document matrix”. The matrix looks like a spreadsheet!
• The weight may be based on normalized frequency (a comparable weight value is calculated based on the size of the document) or something else.
• Usually, the weight is calculated with the famous “idf·tf” [inverse document frequency · term frequency] scheme
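The idf·tf weighting can be sketched as follows over a tiny toy collection (the documents and numbers are invented for illustration):

```python
import math

# Sketch of idf.tf weighting over a tiny toy collection.
docs = {
    "d1": ["markov", "chain", "chain"],
    "d2": ["markov", "retrieval"],
    "d3": ["retrieval", "ranking"],
}

def tf(term, doc):
    """Normalized term frequency within one document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Rarer terms across the collection get a larger idf."""
    n = sum(1 for d in docs.values() if term in d)
    return math.log(len(docs) / n)

def tfidf(term, doc_id):
    return tf(term, docs[doc_id]) * idf(term, docs)

# "chain" appears only in d1, so it outweighs "markov",
# which appears in two of the three documents.
print(tfidf("chain", "d1"), tfidf("markov", "d1"))
```

This shows the effect described on the next slide: the rare term “chain” gets a higher weight in d1 than the more widespread “markov”.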
File parsing
• The idea is that rarely occurring terms have more “informational value” and so should be weighted to cause documents with those terms to rank highly in the retrieval set.
• Terms that occur very frequently or very rarely rank lower.
• User-oriented techniques for interacting with the IR system, whether graphical or term-based (e.g., using Boolean operators “and”, “or”, “not” in the query), help the user manipulate the weights to get a useful retrieval set.
Document/Term Frequency Matrix 1
Term 1 Term 2 Term 3 Term 4 Term 5 Term 6 Term n
Doc 1 0 0 0 0 0 3 0
Doc 2 2 0 9 8 7 3 1
Doc 3 49 39 28 73 64 100 92
Doc 4 0 0 1938 27362 2737 1162 283
Doc 5 And so on…
Doc 6
Doc n
RAW COUNTS: the actual number of times the term appears in each document.
Document/Term Frequency Matrix 2
Term 1 Term 2 Term 3 Term 4 Term 5 Term 6 Term n
Doc 1 0 0 0 0 0 .01 0
Doc 2 .02 0 .03 .021 .22 .013 .02
Doc 3 .432 .32 .23 .73 .32 .4092 .92
Doc 4 0 0 .47 .752 .74 .5012 .350
Doc 5 And so on…
Doc 6
Doc n
NORMALIZED FREQUENCIES: the number of times the term appears, normalized to overcome the different document lengths. First step towards the usual idf·tw (inverse document frequency · term weighting).
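The move from Matrix 1 to Matrix 2 is just a row-wise division by document length. A small sketch, with invented counts:

```python
# Turn raw counts into normalized frequencies by dividing each
# document's counts by its total term count (the row sum).
raw = {
    "doc1": [0, 0, 3],
    "doc2": [2, 9, 1],
}

normalized = {}
for doc, counts in raw.items():
    total = sum(counts)
    normalized[doc] = [c / total for c in counts]

print(normalized["doc1"])  # [0.0, 0.0, 1.0]
```

After this step every document’s frequencies are comparable regardless of its length, which is what makes idf·tw weighting meaningful across documents.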
Related work
• Markov (Anderson ’91; Asmussen ’87; Jackson & Lafrere 1998)
• AI (Zhang 2001)
• Paterman (1990), Cassandra (1998), Rajgopal & Mazumdar (2002); Chen & Cooper (2002): stochastic modeling of use; statistical inference
• Danilowicz & Balinski (2001): vector/idf·tw
Related work
• “If query terms have multiple senses, a mixture of these senses may be present in the expanded model. For semantic smoothing, a more content-dependent model that takes into account the relationship between query terms may be desirable. One way to accomplish this is through a pseudo-feedback mechanism ... In this way the expanded language model may be more ‘semantically coherent,’ capturing the topic implicit in the set of documents rather than representing words related to the query terms qi in general.” (Lafferty & Zhai, 2001, p. 15)
In short...
We have ...
• Semantic-level parsing of documents
• IDF·TW for relevance-ranked retrieval sets
• Hierarchical lists & graphic displays
We ask ...
• What type of visualization?
• How to incorporate group awareness & query chains?
• How to build on what the IR model offers?
IR as a Markov Process
• The randomness of query terms
• Modeling the Markov process
• Incorporating group/previous users’ input
• Presenting the whole through an interactive information retrieval system’s interface
Markov chain defined
The behavior of an informationally closed and generative system that is specified by transition probabilities between that system’s states.
Named after A. A. Markov who studied stochastic sequences of characters (symbols, letters, words).
Probabilities of a Markov chain are entered in a transition matrix indicating which state or symbol follows which other state or symbol.
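A transition matrix and a chain that walks it can be sketched in a few lines. The matrix below holds the same numbers as the talk’s later example, but written row-stochastically (each row is a “from” state); the talk’s own matrix uses columns as the “from” state:

```python
import random

# Sketch of a Markov chain over three states (e.g., query terms).
# Row i holds the probabilities of moving from state i to each state.
T = [[0.8, 0.1, 0.1],
     [0.3, 0.2, 0.5],
     [0.2, 0.6, 0.2]]

def step(state, rng):
    """Pick the next state according to row `state` of T."""
    return rng.choices(range(3), weights=T[state])[0]

rng = random.Random(0)   # fixed seed for reproducibility
state = 0
path = [state]
for _ in range(5):
    state = step(state, rng)
    path.append(state)
print(path)  # one random walk through the states
```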
Modeling with Markov chains
Consider the following example of a homogeneous stochastic process with discrete time and finite state space. The physical model of IR permits a number of terms and allows users to move from one set of terms to another at arbitrary time points. For our purposes, we identify the set of possible terms with a finite set of states S = {1, ..., m}.
Feedback from the end-user of the system causes the retrieval system to jump from one state into another and to recreate the retrieval set’s members’ association. Furthermore, such transitions may take place only at certain instants [feedback inputs from the end-user] of a discrete time unit n ∈ ℕ₀.
Modeling with Markov chains
Using idf·tw as a starting point, we’re able to estimate the (hypothetical) probabilities pij, (i, j) ∈ S², for a transition from state i to state j.
These probabilities do not depend on or vary with the time n. For the complete specification of the stochastic process (Xn)n∈ℕ₀, where Xn is the state of the system at time point n, we need to provide a distribution of the initial state X0, denoted by P0 = (p01, ..., p0m). Here p0i = P[X0 = i] denotes the probability of a start in state i, i ∈ S. There’s no risk of confusion between the initial probabilities p0i and the transition probabilities pij since we index with natural numbers.
Modeling with Markov chains
The actual state Xn may not be useful in itself. So we consider a function f: S → ℝ whose value f(i) expresses a property of the system which can be measured (the “observables”). As a natural extension we consider observables which are defined on the set of all possible s-tuples of successive states of the chain, where s ∈ ℕ. In the case of IR, it’s appropriate to work with observables which depend on pairs (i, j) ∈ S² (query/query representations) of states, where i is the state of departure and j is the destination of the transition which takes place between time points n and n+1. We may, then, consider the query terms. This observable depends on pairs (Xn, Xn+1) of successive states.
Randomness of query terms
• A query represents the seeker’s semantic representation of a concept, e.g., “Ford, car, auto, vehicle”
• The terms form a finite set; without other considerations, each has an equal chance of being selected
• We can know where a user is in the chain and predict the next choices ...
A state vector gives the probability of each state i. A transition matrix of query terms Q can be made of entries that reflect the transition probabilities mij: the probability of a given term being selected the first time, the probability of the term being selected the second time, and so on.

      .8 .3 .2
T =   .1 .2 .6
      .1 .5 .2

After selecting a term and trying again, the probabilities of a term being selected change. Over time, we capture user choices to seed the probability of a term being selected:

initial (0, 1, 0) → 2nd (.3, .2, .5) → 3rd (.4, .37, .23)
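The evolution of the state vector can be checked directly, using the transition matrix T from the talk and starting from certainty about term 2 (the column-stochastic orientation of T is an assumption consistent with the slide’s numbers):

```python
# Reproducing the slide's state-vector sequence under the talk's T.
T = [[0.8, 0.3, 0.2],
     [0.1, 0.2, 0.6],
     [0.1, 0.5, 0.2]]

def apply(T, v):
    """Multiply T by the column vector v: next state distribution."""
    return [sum(T[i][j] * v[j] for j in range(3)) for i in range(3)]

v0 = [0, 1, 0]          # initial: term 2 chosen with certainty
v1 = apply(T, v0)       # second selection: (.3, .2, .5)
v2 = apply(T, v1)       # third selection: (.4, .37, .23)
print(v1, v2)
```

The computed vectors match the slide’s “initial → 2nd → 3rd” sequence.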
If potential terms were numbered from 1...m, the order of the queries is described by some permutation (i1, i2, ..., im).
The probability of each term being selected: let pk be the probability of term k being selected.
Ex: if there were m = 2 possible terms, then there are only two possible permutations: c1 = (1, 2) and c2 = (2, 1).
Probabilities are p11 = p21 = p1; p12 = p22 = p2.
Model description
• System side:
– In relevance feedback systems, recalculate relevance ranking
– Can calculate probability that control passes from node i to node j, at different times or states
– Transition probabilities are used to reflect the IR system’s relevancy ranking:

piS + Σ_{j=1..n} pij = 1;   piF + Σ_{j=1..n} pij = 0
Group awareness and Best choice
• Recall that from a closed set of terms, the individual info seeker may select any term with equal probability ...
• IR systems usually add weights
• Previous input by a group of experienced users may also add weights ... probabilities of how experienced users move from tn to tm ... tx
Compare
Individual terms; each has an equal probability of selection:

Pn = (pij(n)) =
    p11(n)  p12(n)  ...
    p21(n)  p22(n)  ...
    ...

Group input provides the weighting factor:

Pn = (pij(n) + wij(n)) =
    p11(n) + w11(n)   p12(n) + w12(n)  ...
    p21(n) + w21(n)   p22(n) + w22(n)  ...
    ...
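Adding group-derived weights to the individual transition probabilities can be sketched as below. The weight values are invented, and the final renormalization step is an assumption on my part (the talk only shows pij + wij, which need no longer sum to one):

```python
# Sketch: combine individual transition probabilities P with
# group-derived weights W, then renormalize each column so it is
# again a probability distribution. W's values are illustrative.
P = [[0.8, 0.3, 0.2],
     [0.1, 0.2, 0.6],
     [0.1, 0.5, 0.2]]
W = [[0.0, 0.1, 0.0],
     [0.2, 0.0, 0.0],
     [0.0, 0.0, 0.1]]

m = len(P)
raw = [[P[i][j] + W[i][j] for j in range(m)] for i in range(m)]
col_sums = [sum(raw[i][j] for i in range(m)) for j in range(m)]
P_weighted = [[raw[i][j] / col_sums[j] for j in range(m)] for i in range(m)]

# Each column of the weighted matrix sums to 1 again.
print([round(sum(P_weighted[i][j] for i in range(m)), 6) for j in range(m)])
```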
Example
User interested in financial work uses the terms {budget, payroll, faculty}
Doc 1 = {t1, t2, t3, ... tn}
Doc 2 = {t1, t2, t3}
Doc 3 = {t3, t5, t9}
Each term has a 33% chance.

      budget payroll faculty
       .8     .3      .2
R =    .1     .2      .6
       .1     .5      .2

With additional input, the entry t23 = .6 gives the term a 60% chance of becoming higher ranked after one user interaction.
Interface over matrix
Discussion
• Markov model parallels IR ideas of
– Relevancy ranking
– Uses what the system offers: semantic tokens
– Uses transition probabilities the same way relevancy feedback systems use “more like these”
– Integrates group awareness as a weighting scheme
– Provides data on group & individual heuristics
Discussion
• Potential uses
– Can be used to recommend search paths
– In group settings: may encourage confidence and certainty
– System can react to/prevent the seeker going too far astray
Discussion
• Potential uses
– Provides data about probable relationships that can be incorporated into larger IR systems
– Combined with a class-relationship approach, probabilistic data compensate for missing data in the object definition (Asmussen, 2000)
– Probabilistic distribution of the sum of classes (Conniffe & Spencer, 2000)
Future research
• Test end-user confidence
• Compare idf·tw to term weights generated by group awareness
• 2D and 3D interfaces