Application of Markov chains in an interactive information retrieval system
A brief introduction for people knowledgeable about Markov chains but not information retrieval, and vice versa.
Scenario
• Need info to perform a task, but are uncertain about the utility of the documents ...
• Guidance from experienced people would be helpful!
• An Information Retrieval (IR) system presents lots of documents, but which suite of documents would be helpful?
• From an IR perspective, we’d have to design an IR system to retrieve relevant documents and gather experienced users’ preferences over time to establish the probability of a suite of associated documents.
Scenario
• Today’s talk asks whether it is feasible to model and build just such a probabilistic interactive retrieval system.
Audience: Given that the audience either knows about Markov chains but not about IR, or about IR but not Markov chains, I will present only the basics of the project. It may be useful first to define both IR and Markov chains and then discuss their intersection.
The Problem
In IR the goal is to retrieve the most relevant documents in response to a user’s query and then present them in a comprehensible way. The problem is that IR systems rely on
(a) a semantic, or surface-level, model of language without any contextualization, and ignore
(b) how the language is actually used (the pragmatic entailments),
(c) the domain of use by real people, and
(d) how people think during information-seeking sessions.
A potential solution
• There are many avenues of research:
• Group awareness, query chains, interactive visualization
• This sounds like a lot: to make things clearer for this audience, I’ll present the usual models of IR and then explain the project.
Taxonomy of IR models
Browsing (user not sure what he wants):
• [Browsing] Flat files, Structure guided, Hypertext
Retrieval (user knows what he wants):
• [Classic models] Boolean, Vector, Probabilistic
• [Set theoretic] Fuzzy set theoretic, Extended Boolean
• [Algebraic] Generalized vector, Latent Semantic Indexing, Neural networks, Genetic algorithms, Genetic programming
• [Probabilistic] Inference networks, Belief networks
• [Structured models] Non-overlapping lists, Proximal nodes
Information Retrieval (IR)
• Focuses on the representation, storage, organization of, and access to information sources.
• The emphasis in IR is on locating and presenting data that are useful to people, that is, information.
IR
• IR is mostly associated with “full-text retrieval”
• Also focuses on “indexing and searching” of records, users and interface design, and novel representations of data, provided they help the user interpret the retrieval set.
• Since IR encompasses so many types of computer-based files, IR sees documents, users, and information from an abstract point of view, leaving researchers & developers to create their own implementation of the abstract model.
IR defined
An information retrieval model is a quadruple [D, Q, F, R(qi, dj)] where
D = the “logical view” of the document collection,
Q = the “logical view” of the user’s queries,
F = a framework for matching D and Q, and
R(qi, dj) = a mathematical function to rank retrieved individual documents (dj) from the collection against the user’s query (qi).
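The quadruple can be made concrete with a minimal sketch in Python. The overlap-based `rank` function and the sample collection below are illustrative stand-ins for F and R, not the talk’s actual model:

```python
# A minimal sketch of the IR quadruple [D, Q, F, R].
# The names and the overlap scoring are illustrative assumptions.

def rank(query_terms, documents):
    """R(q_i, d_j): score each document by simple term overlap."""
    scores = {}
    for doc_id, doc_terms in documents.items():
        scores[doc_id] = len(set(query_terms) & set(doc_terms))
    # Present documents in descending score order.
    return sorted(scores, key=scores.get, reverse=True)

# D: logical view of the collection (here, bags of terms)
D = {"d1": ["markov", "chain", "model"],
     "d2": ["information", "retrieval", "model"],
     "d3": ["cooking", "recipes"]}
# Q: logical view of one user query
q = ["retrieval", "model"]
print(rank(q, D))  # ['d2', 'd1', 'd3']
```

Here F is simply set intersection; a real system would use the weighted matching described later in the talk.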
IR example
• One example of IR is any Internet search engine.
– For instance, have you ever wondered what happens to your search terms when you send them to an Internet search engine?
• Why are some documents presented to you in the retrieval set and others are not?
• Why are the retrieved documents listed (or ranked) as they are?
The IR Model
• D [document collection] can be full-text documents, documents with library cataloguing records, HTML documents ... pretty much anything!
• Q [query] is the “user’s expression of information need.” Since everyone expresses an information need differently, queries are the least stable part of IR.
The IR Model
• F [framework] is typically the computing environment
• R [ranking] is how retrieved elements are associated with the user’s query and with each other
File parsing
• Before IR can occur, the document must be parsed.
• Using the example of a full-text document:
– The file is opened by the computer program
– Each term, one by one, is examined by the program:
• Is this term a “stop term” (i.e., a term to be skipped)?
• Is this term very common (e.g., “the”, “an”)?
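The parsing pass above can be sketched in a few lines of Python; the small `STOP_TERMS` list and the `parse` helper are illustrative, not part of the talk:

```python
# Toy parsing pass: examine each term, skip stop terms,
# and tally the remaining terms' frequencies.
STOP_TERMS = {"the", "an", "a", "of", "and", "in"}  # illustrative sample

def parse(text):
    counts = {}
    for term in text.lower().split():
        if term in STOP_TERMS:      # very common term: skip it
            continue
        counts[term] = counts.get(term, 0) + 1
    return counts

print(parse("The cat and the hat"))  # {'cat': 1, 'hat': 1}
```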
File parsing
• The term is stored in a database along with the frequency of the term’s occurrence, either within the individual document or within the entire document collection
• Optionally: terms may be “stemmed” - that is, the grammatical endings are removed (e.g., “fishes” becomes “fish”; “goes” becomes “go”). In English, the usual stemming technique is the “Porter stemming algorithm.”
File parsing
• When complete, the program calculates a “weight” for each term which is stored in a “term/document matrix”. The matrix looks like a spreadsheet!
• The weight may be based on normalized frequency (a comparable weight value is calculated based on the size of the document) or something else.
• Usually, the weight is calculated with the famous “idf·tf” [inverse document frequency · term frequency] scheme
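The idf·tf weighting can be sketched as follows over a tiny toy collection (the documents and numbers are invented for illustration):

```python
import math

# Sketch of idf.tf weighting over a tiny toy collection.
docs = {
    "d1": ["markov", "chain", "chain"],
    "d2": ["markov", "retrieval"],
    "d3": ["retrieval", "ranking"],
}

def tf(term, doc):
    """Normalized term frequency within one document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Rarer terms across the collection get a larger idf."""
    n = sum(1 for d in docs.values() if term in d)
    return math.log(len(docs) / n)

def tfidf(term, doc_id):
    return tf(term, docs[doc_id]) * idf(term, docs)

# "chain" appears only in d1, so it outweighs "markov",
# which appears in two of the three documents.
print(tfidf("chain", "d1"), tfidf("markov", "d1"))
```

This shows the effect described on the next slide: the rare term “chain” gets a higher weight in d1 than the more widespread “markov”.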
File parsing
• The idea is that rarely occurring terms have more “informational value” and so should be weighted to cause documents with those terms to rank highly in the retrieval set.
• Terms that occur very frequently or very rarely rank lower.
• User-oriented techniques for interacting with the IR system, whether graphical or term-based (e.g., using Boolean operators “and”, “or”, “not” in the query), help the user manipulate the weights to get a useful retrieval set.
Document/Term Frequency Matrix 1
Term 1 Term 2 Term 3 Term 4 Term 5 Term 6 Term n
Doc 1 0 0 0 0 0 3 0
Doc 2 2 0 9 8 7 3 1
Doc 3 49 39 28 73 64 100 92
Doc 4 0 0 1938 27362 2737 1162 283
Doc 5 And so on…
Doc 6
Doc n
RAW COUNTS: the actual number of times the term appears in each document.
Document/Term Frequency Matrix 2
Term 1 Term 2 Term 3 Term 4 Term 5 Term 6 Term n
Doc 1 0 0 0 0 0 .01 0
Doc 2 .02 0 .03 .021 .22 .013 .02
Doc 3 .432 .32 .23 .73 .32 .4092 .92
Doc 4 0 0 .47 .752 .74 .5012 .350
Doc 5 And so on…
Doc 6
Doc n
NORMALIZED FREQUENCIES: the number of times the term appears, normalized to overcome the different document lengths. First step towards the usual idf·tw (inverse document frequency · term weighting).
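The move from Matrix 1 to Matrix 2 is just a row-wise division by document length. A small sketch, with invented counts:

```python
# Turn raw counts into normalized frequencies by dividing each
# document's counts by its total term count (the row sum).
raw = {
    "doc1": [0, 0, 3],
    "doc2": [2, 9, 1],
}

normalized = {}
for doc, counts in raw.items():
    total = sum(counts)
    normalized[doc] = [c / total for c in counts]

print(normalized["doc1"])  # [0.0, 0.0, 1.0]
```

After this step every document’s frequencies are comparable regardless of its length, which is what makes idf·tw weighting meaningful across documents.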
Related work
• Markov (Anderson ’91; Asmussen ’87; Jackson & Lafrere 1998)
• AI (Zhang 2001)
• Paterman (1990), Cassandra (1998), Rajgopal & Mazumdar (2002); Chen & Cooper (2002): stochastic modeling of use; statistical inference
• Danilowicz & Balinski (2001): vector/idf·tw
Related work
• “If query terms have multiple senses, a mixture of these senses may be present in the expanded model. For semantic smoothing, a more content-dependent model that takes into account the relationship between query terms may be desirable. One way to accomplish this is through a pseudo-feedback mechanism ... In this way the expanded language model may be more ‘semantically coherent,’ capturing the topic implicit in the set of documents rather than representing words related to the query terms qi in general.” (Lafferty & Zhai, 2001, p. 15)
In short...
We have ...
• Semantic-level parsing of documents
• IDF·TW for relevance-ranked retrieval sets
• Hierarchical lists & graphic displays
We ask ...
• What type of visualization?
• How to incorporate group awareness & query chains?
• How to build on what the IR model offers?
IR as a Markov Process
• The randomness of query terms
• Modeling the Markov process
• Incorporating group/previous users’ input
• Presenting the whole through an interactive information retrieval system’s interface
Markov chain defined
The behavior of an informationally closed and generative system that is specified by transition probabilities between that system’s states.
Named after A. A. Markov who studied stochastic sequences of characters (symbols, letters, words).
Probabilities of a Markov chain are entered in a transition matrix indicating which state or symbol follows which other state or symbol.
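A transition matrix and a chain that walks it can be sketched in a few lines. The matrix below holds the same numbers as the talk’s later example, but written row-stochastically (each row is a “from” state); the talk’s own matrix uses columns as the “from” state:

```python
import random

# Sketch of a Markov chain over three states (e.g., query terms).
# Row i holds the probabilities of moving from state i to each state.
T = [[0.8, 0.1, 0.1],
     [0.3, 0.2, 0.5],
     [0.2, 0.6, 0.2]]

def step(state, rng):
    """Pick the next state according to row `state` of T."""
    return rng.choices(range(3), weights=T[state])[0]

rng = random.Random(0)   # fixed seed for reproducibility
state = 0
path = [state]
for _ in range(5):
    state = step(state, rng)
    path.append(state)
print(path)  # one random walk through the states
```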
Modeling with Markov chains
Consider the following example of a homogeneous stochastic process with discrete time and finite state space. The physical model of IR permits a number of terms and allows users to move from one set of terms to another at arbitrary time points. For our purposes, we identify the set of possible terms with a finite set of states S = {1, ..., m}.
Feedback from the end-user of the system causes the retrieval system to jump from one state into another and to recreate the retrieval set’s members’ association. Furthermore, such transitions may take place only at certain instants [feedback inputs from the end-user] of a discrete time unit n ∈ ℕ₀.
Modeling with Markov chains
Using idf·tw as a starting point, we’re able to estimate the (hypothetical) probabilities pij, (i, j) ∈ S², for a transition from state i to state j.
These probabilities do not depend on or vary with the time n. For the complete specification of the stochastic process (Xn)n∈ℕ₀, where Xn is the state of the system at time point n, we need to provide a distribution of the initial state X0, denoted by P0 = (p01, ..., p0m). Here p0i = P[X0 = i] denotes the probability of a start in state i, i ∈ S. There’s no risk of confusion between the initial probabilities p0i and the transition probabilities pij since we index with natural numbers.
Modeling with Markov chains
The actual state Xn may not be useful in itself. So we consider a function f: S → ℝ whose value f(i) expresses a property of the system which can be measured (the “observables”). As a natural extension we consider observables which are defined on the set of all possible s-tuples of successive states of the chain, where s ∈ ℕ. In the case of IR, it’s appropriate to work with observables which depend on pairs (i, j) ∈ S² (query/query representations) of states, where i is the state of departure and j is the destination of the transition which takes place between time points n and n+1. We may, then, consider the query terms. This observable depends on pairs (Xn, Xn+1) of successive states.
Randomness of query terms
• A query represents the seeker’s semantic representation of a concept, e.g., “Ford, car, auto, vehicle”
• The terms form a finite set; without other considerations, each has an equal chance of being selected
• We can know where a user is in the chain and predict the next choices ...
A state vector gives the probability of each state i. A transition matrix of query terms Q can be made of entries that reflect the transition probabilities mij: the probability of a given term being selected the first time, the probability of the term being selected the second time, and so on.

      .8 .3 .2
T =   .1 .2 .6
      .1 .5 .2

After selecting a term and trying again, the probabilities of a term being selected change. Over time, we capture user choices to seed the probability of a term being selected:

initial (0, 1, 0) → 2nd (.3, .2, .5) → 3rd (.4, .37, .23)
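The evolution of the state vector can be checked directly, using the transition matrix T from the talk and starting from certainty about term 2 (the column-stochastic orientation of T is an assumption consistent with the slide’s numbers):

```python
# Reproducing the slide's state-vector sequence under the talk's T.
T = [[0.8, 0.3, 0.2],
     [0.1, 0.2, 0.6],
     [0.1, 0.5, 0.2]]

def apply(T, v):
    """Multiply T by the column vector v: next state distribution."""
    return [sum(T[i][j] * v[j] for j in range(3)) for i in range(3)]

v0 = [0, 1, 0]          # initial: term 2 chosen with certainty
v1 = apply(T, v0)       # second selection: (.3, .2, .5)
v2 = apply(T, v1)       # third selection: (.4, .37, .23)
print(v1, v2)
```

The computed vectors match the slide’s “initial → 2nd → 3rd” sequence.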
If potential terms were numbered from 1...m, the order of the queries is described by some permutation (i1, i2, ..., im).
The probability of each term being selected: let pk be the probability of term k being selected.
Ex: if there were m = 2 possible terms, then there are only two possible permutations: c1 = (1, 2) and c2 = (2, 1).
Probabilities are p11 = p21 = p1; p12 = p22 = p2.
Model description
• System side:
– In relevance feedback systems, recalculate relevance ranking
– Can calculate probability that control passes from node i to node j, at different times or states
– Transition probabilities are used to reflect the IR system’s relevancy ranking:

piS + Σ_{j=1..n} pij = 1;   piF + Σ_{j=1..n} pij = 0
Group awareness and Best choice
• Recall that from a closed set of terms, the individual info seeker may select any term with equal probability ...
• IR systems usually add weights
• Previous input by a group of experienced users may also add weights ... probabilities of how experienced users move from tn to tm ... tx
Compare
Individual terms; each has an equal probability of selection:

Pn = (pij(n)) =
    p11(n)  p12(n)  ...
    p21(n)  p22(n)  ...
    ...

Group input provides the weighting factor:

Pn = (pij(n) + wij(n)) =
    p11(n) + w11(n)   p12(n) + w12(n)  ...
    p21(n) + w21(n)   p22(n) + w22(n)  ...
    ...
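Adding group-derived weights to the individual transition probabilities can be sketched as below. The weight values are invented, and the final renormalization step is an assumption on my part (the talk only shows pij + wij, which need no longer sum to one):

```python
# Sketch: combine individual transition probabilities P with
# group-derived weights W, then renormalize each column so it is
# again a probability distribution. W's values are illustrative.
P = [[0.8, 0.3, 0.2],
     [0.1, 0.2, 0.6],
     [0.1, 0.5, 0.2]]
W = [[0.0, 0.1, 0.0],
     [0.2, 0.0, 0.0],
     [0.0, 0.0, 0.1]]

m = len(P)
raw = [[P[i][j] + W[i][j] for j in range(m)] for i in range(m)]
col_sums = [sum(raw[i][j] for i in range(m)) for j in range(m)]
P_weighted = [[raw[i][j] / col_sums[j] for j in range(m)] for i in range(m)]

# Each column of the weighted matrix sums to 1 again.
print([round(sum(P_weighted[i][j] for i in range(m)), 6) for j in range(m)])
```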
Example
User interested in financial work uses the terms {budget, payroll, faculty}
Doc 1 = {t1, t2, t3, ... tn}
Doc 2 = {t1, t2, t3}
Doc 3 = {t3, t5, t9}
Each term has a 33% chance.

      budget payroll faculty
       .8     .3      .2
R =    .1     .2      .6
       .1     .5      .2

With additional input, the entry t23 = .6 gives the term a 60% chance of becoming higher ranked after one user interaction.
Interface over matrix
Discussion
• Markov model parallels IR ideas of
– Relevancy ranking
– Uses what the system offers: semantic tokens
– Uses transition probabilities the same way relevancy feedback systems use “more like these”
– Integrates group awareness as a weighting scheme
– Provides data on group & individual heuristics
Discussion
• Potential uses
– Can be used to recommend search paths
– In group settings: may encourage confidence and certainty
– System can react to/prevent the seeker going too far astray
Discussion
• Potential uses
– Provides data about probable relationships that can be incorporated into larger IR systems
– Combined with a class-relationship approach, probabilistic data compensate for missing data in the object definition (Asmussen, 2000)
– Probabilistic distribution of the sum of classes (Conniffe & Spencer, 2000)
Future research
• Test end-user confidence
• Compare idf·tw to term weights generated by group awareness
• 2D and 3D interfaces