Information Retrieval Modeling
CS 652 Information Extraction and Integration
2
Introduction
IR systems usually adopt index terms to index documents & process queries
Index term:
  a keyword or group of selected/related words
  any word (in general)
Problems of IR based on index terms:
  oversimplification & loss of semantics
  badly formed queries
Stemming might be used:
  connect: connecting, connection, connections
An inverted file is built for the chosen index terms
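As a rough illustration, an inverted file over stemmed index terms can be sketched as below. The `crude_stem` helper is a hypothetical toy suffix-stripper for the "connect" family above, not a real stemming algorithm such as Porter's:

```python
from collections import defaultdict

def crude_stem(word):
    # Toy suffix stripping (assumption: NOT a real stemmer); it maps
    # "connecting", "connection", "connections" to the shared root "connect".
    for suffix in ("ions", "ion", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def build_inverted_file(docs):
    """Map each stemmed index term to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[crude_stem(word)].add(doc_id)
    return index

docs = {1: "network connections", 2: "connecting a network", 3: "graph theory"}
index = build_inverted_file(docs)
# index["connect"] now holds {1, 2}: both docs share the stemmed index term
```

At query time the same stemmer is applied to the query words, and the posting sets are intersected or united according to the query.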
4
Introduction
Matching at the index term level is quite imprecise
Consequence: no surprise that users frequently get unsatisfactory results
Since most users have no training in query formulation, the problem is even worse
Result: frequent dissatisfaction of Web users
The issue of deciding relevance is critical for IR systems: ranking
5
Introduction
A ranking is an ordering of the retrieved documents that (hopefully) reflects their relevance to the user query
A ranking is based on fundamental premises regarding the notion of relevance, such as:
  common sets of index terms
  sharing of weighted terms
  likelihood of relevance
Each set of premises leads to a distinct IR model
6
IR Models
[Figure: taxonomy of IR models]
User Task
  Retrieval: Ad hoc, Filtering
  Browsing
Retrieval models
  Classic Models: Boolean, Vector (Space), Probabilistic
  Set Theoretic: Fuzzy, Extended Boolean
  Algebraic: Generalized Vector (Space), Latent Semantic Indexing, Neural Networks
  Probabilistic: Inference Network, Belief Network
  Structured Models: Non-Overlapping Lists, Proximal Nodes
Browsing models
  Flat, Structure Guided, Hypertext
7
IR Models
The IR model, the logical view of the docs, and the retrieval task are distinct aspects of the system

                      LOGICAL VIEW OF DOCUMENTS
USER TASK    Index Terms        Full Text          Full Text + Structure
Retrieval    Classic            Classic            Structured
             Set Theoretic      Set Theoretic
             Algebraic          Algebraic
             Probabilistic      Probabilistic
Browsing     Flat               Flat               Structure Guided
                                Hypertext          Hypertext
8
Retrieval: Ad Hoc vs. Filtering
Ad hoc retrieval: static set of documents & dynamic queries
[Figure: a collection of "fixed size" being probed by transient queries Q1 through Q5]
9
Retrieval: Ad Hoc vs. Filtering
Filtering: dynamic set of documents & static queries
[Figure: an incoming stream of documents is matched against stored profiles; the docs filtered for User 1 and User 2 flow to the outgoing side]
10
Classic IR Models – Basic Concepts
Each document is described by a set of representative keywords or index terms
Index terms are document words (e.g., nouns) which have meaning by themselves and capture the main themes of a document
However, search engines assume that all words are index terms (full text representation)
User profile construction process:
  1) user-provided keywords
  2) relevance feedback cycle
  3) system-adjusted user profile description
  4) repeat step (2) until stabilized
11
Classic IR Models - Basic Concepts
Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents
The importance of the index terms is represented by weights associated with them
Let
  ki be an index term
  dj be a document
  wij be a weight associated with the pair (ki, dj), which quantifies the importance of ki for describing the contents of dj
12
Classic IR Models - Basic Concepts
ki is a generic index term
dj is a document
t is the total number of index terms in an IR system
K = (k1, k2, ..., kt) is the set of all index terms
wij ≥ 0 is a weight associated with (ki, dj)
wij = 0 indicates that ki does not belong to dj
vec(dj) = (w1j, w2j, ..., wtj) is the weighted vector associated with the document dj
gi(vec(dj)) = wij is a function which returns the weight associated with the pair (ki, dj)
13
The Boolean Model
Simple model based on set theory & Boolean algebra
Queries specified as Boolean expressions
  precise semantics, neat formalism
  e.g., q = ka ∧ (kb ∨ ¬kc)
Terms are either present or absent, i.e., the index term weights are binary: wij ∈ {0, 1}
Consider
  q = ka ∧ (kb ∨ ¬kc)
  vec(qdnf) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0), the disjunctive normal form of q over (ka, kb, kc)
  vec(qcc) = (1,1,0) is a conjunctive component
14
The Boolean Model
q = ka ∧ (kb ∨ ¬kc)
sim(q, dj) = 1 if ∃ vec(qcc) | (vec(qcc) ∈ vec(qdnf)) ∧ (∀ ki, gi(vec(dj)) = gi(vec(qcc)))
             0 otherwise
[Figure: Venn diagram of Ka, Kb, Kc with the regions (1,1,1), (1,1,0), and (1,0,0) shaded]
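The matching rule above can be sketched in a few lines: a document matches exactly when its binary term vector equals one of the query's conjunctive components. The DNF tuples are those of the slide's example, q = ka ∧ (kb ∨ ¬kc):

```python
def boolean_sim(doc, qdnf):
    """Return 1 if the document's binary term vector equals some
    conjunctive component of the query's disjunctive normal form."""
    return 1 if tuple(doc) in qdnf else 0

# q = ka AND (kb OR NOT kc), over the term vector (ka, kb, kc):
qdnf = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

boolean_sim((1, 1, 0), qdnf)  # ka and kb present, kc absent: matches
boolean_sim((0, 1, 1), qdnf)  # ka missing: no match
```

Note the all-or-nothing outcome: there is no score in between, which is exactly the drawback discussed next.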
15
Drawbacks of the Boolean Model
Retrieval is based on binary decision criteria with no notion of partial matching
No ranking of the documents is provided (absence of a grading scale)
The information need has to be translated into a Boolean expression, which most users find awkward
The Boolean queries formulated by the users are most often too simplistic
Using the exact matching approach, the model frequently returns either too few or too many documents in response to a user query
16
The Vector (Space) Model (VSM)
Use of binary weights is too limiting
Non-binary weights provide consideration for partial matching
These term weights are used to compute a degree of similarity between a query and a document
A ranked set of documents provides better matching, i.e., partial matching vs. exact matching
The answer set of the VSM is a lot more precise than the answer set retrieved by the Boolean model
17
The Vector (Space) Model
Define:
  wij > 0 whenever ki ∈ dj
  wiq ≥ 0 associated with the pair (ki, q)
  vec(dj) = (w1j, w2j, ..., wtj), the document vector of dj
  vec(q) = (w1q, w2q, ..., wtq), the query vector of q
To each term ki is associated a unitary vector vec(i)
The unitary vectors vec(i) and vec(j) are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
The t unitary vectors vec(i) form an orthonormal basis for a t-dimensional space
In this space, queries and documents are represented as weighted vectors
18
The Vector (Space) Model
sim(q, dj) = cos(θ) = [vec(dj) · vec(q)] / (|dj| |q|)
           = [Σi=1..t wij × wiq] / (√(Σi=1..t wij²) × √(Σi=1..t wiq²))
where · is the inner product operator, |dj| is the length of dj, and |q| is the length of q
Since wij ≥ 0 and wiq ≥ 0, 0 ≤ sim(q, dj) ≤ 1
A document is retrieved even if it matches the query terms only partially
[Figure: vectors vec(dj) and vec(q) in term space, with θ the angle between them]
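A minimal sketch of the cosine formula above (the function name `cosine_sim` is illustrative, not from the slides):

```python
import math

def cosine_sim(d, q):
    """sim(q, dj) = (vec(dj) . vec(q)) / (|dj| |q|);
    lies in [0, 1] when all weights are non-negative."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # guard: an empty vector has no direction
    return dot / (norm_d * norm_q)

# A document matching only some query terms still gets a nonzero score:
cosine_sim((1, 0, 1), (1, 1, 1))  # partial match, score strictly between 0 and 1
```

Normalizing by the vector lengths keeps long documents from dominating the ranking purely by having more terms.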
19
The Vector (Space) Model
sim(q, dj) = [Σi=1..t wij × wiq] / (|dj| |q|)
How to compute the weights wij and wiq?
A good weight must take into account two effects:
  1. quantification of intra-document contents (similarity): the tf factor, the term frequency within a document
  2. quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency
wij = tf(i, j) × idf(i)
20
The Vector (Space) Model
Let
  N be the total number of documents in the collection
  ni be the number of documents which contain ki
  freq(i, j) be the raw frequency of ki within dj
A normalized tf factor is given by
  f(i, j) = freq(i, j) / maxl freq(l, j)
where the maximum is computed over all terms which occur within the document dj
The inverse document frequency (idf) factor is
  idf(i) = log(N / ni)
The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with term ki.
21
The Vector (Space) Model
The best term-weighting schemes use weights which are given by
  wij = f(i, j) × log(N / ni)
This strategy is called a tf-idf weighting scheme
For the query term weights, a suggestion is
  wiq = (0.5 + [0.5 × freq(i, q) / maxl freq(l, q)]) × log(N / ni)
The vector model with tf-idf weights is a good ranking strategy for general collections
The VSM is usually as good as the known ranking alternatives. It is also simple and fast to compute
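The tf-idf scheme above can be sketched as follows; the term frequencies and document counts in the usage example are hypothetical, and the natural log is used (the base only rescales all weights uniformly):

```python
import math

def tf_idf_weights(doc_freqs, N, doc_counts):
    """wij = f(i, j) * log(N / ni), with tf normalized by the most
    frequent term in the document (the f(i, j) of the slide above)."""
    max_freq = max(doc_freqs.values())
    return {
        term: (freq / max_freq) * math.log(N / doc_counts[term])
        for term, freq in doc_freqs.items()
    }

# Hypothetical stats: 'retrieval' occurs in 10 of 100 docs, 'the' in all 100.
weights = tf_idf_weights({"retrieval": 3, "the": 6}, N=100,
                         doc_counts={"retrieval": 10, "the": 100})
# 'the' gets weight 0 since idf = log(100/100) = 0; 'retrieval' gets a
# positive weight despite its lower raw frequency within the document.
```

This illustrates the point of the idf factor: a term appearing in every document carries no discriminating power, regardless of how often it occurs inside one document.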
22
The Vector (Space) Model
Advantages:
  term-weighting improves quality of the answer set
  partial matching allows retrieval of documents that approximate the query conditions
  cosine ranking formula sorts documents according to degree of similarity to the query
  a popular IR model because of its simplicity & speed
Disadvantages:
  assumes mutual independence of index terms; not clear that this is bad, though
23
The Vector (Space) Model
Example I (binary weights; the last column is the inner product q · dj)

      k1  k2  k3   q · dj
d1     1   0   1      2
d2     1   0   0      1
d3     0   1   1      2
d4     1   0   0      1
d5     1   1   1      3
d6     1   1   0      2
d7     0   1   0      1
q      1   1   1

[Figure: documents d1..d7 plotted in the 3-dimensional term space (k1, k2, k3)]
24
The Vector (Space) Model
Example II (weighted query; the last column is the inner product q · dj)

      k1  k2  k3   q · dj
d1     1   0   1      4
d2     1   0   0      1
d3     0   1   1      5
d4     1   0   0      1
d5     1   1   1      6
d6     1   1   0      3
d7     0   1   0      2
q      1   2   3

[Figure: documents d1..d7 plotted in the 3-dimensional term space (k1, k2, k3)]
25
The Vector (Space) Model
Example III (weighted documents & query; the last column is the inner product q · dj)

      k1  k2  k3   q · dj
d1     2   0   1      5
d2     1   0   0      1
d3     0   1   3     11
d4     2   0   0      2
d5     1   2   4     17
d6     1   2   0      5
d7     0   5   0     10
q      1   2   3

[Figure: documents d1..d7 plotted in the 3-dimensional term space (k1, k2, k3)]
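The q · dj column in these examples is a plain inner product over the term weights. A short sketch that reproduces the Example III scores (vectors and query taken from the table above):

```python
def score(d, q):
    """Inner-product score, the q . dj column of the examples."""
    return sum(wd * wq for wd, wq in zip(d, q))

docs = {          # weighted vectors over (k1, k2, k3), from Example III
    "d1": (2, 0, 1), "d2": (1, 0, 0), "d3": (0, 1, 3), "d4": (2, 0, 0),
    "d5": (1, 2, 4), "d6": (1, 2, 0), "d7": (0, 5, 0),
}
q = (1, 2, 3)
scores = {name: score(d, q) for name, d in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)  # d5 first with 17
```

Note that a full VSM implementation would divide each score by |dj| |q| (the cosine); the tables keep the raw inner product, so longer, heavier document vectors such as d5 come out on top.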
26
Probabilistic Model
Objective: to capture the IR problem using a probabilistic framework
Given a user query, there is an ideal answer set
Querying is a specification of the properties of this ideal answer set (clustering)
But what are these properties?
  Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set)
  Improve by iteration
27
Probabilistic Model
An initial set of documents is retrieved somehow
The user inspects these docs looking for the relevant ones (in truth, only the top 10-20 need to be inspected)
The IR system uses this information to refine the description of the ideal answer set
By repeating this process, it is expected that the description of the ideal answer set will improve
Keep in mind the need to guess, at the very beginning, the description of the ideal answer set
The description of the ideal answer set is modeled in probabilistic terms
28
Probabilistic Ranking Principle
Given a user query q and a document d in a collection, the probabilistic model tries to estimate the probability that the user will find the document d interesting (i.e., relevant)
The model assumes that this probability of relevance depends on the query and the document representations only
The ideal answer set R should maximize the probability of relevance; documents in R are predicted to be relevant to q
Problems:
  How to compute the probabilities of relevance?
  What is the sample space?
29
The Ranking
Probabilistic ranking is computed as:
  sim(q, dj) = P(dj relevant to q) / P(dj non-relevant to q)
This is the odds of the document dj being relevant to q
Taking the odds minimizes the probability of an erroneous judgement
Definitions:
  wij, wiq ∈ {0, 1}, i.e., all index term weights are binary
  P(R | vec(dj)): probability that doc dj is relevant to q, where R is the (known) set of relevant docs
  P(¬R | vec(dj)): probability that doc dj is not relevant to q, where ¬R is the complement of R
30
The Ranking
sim(dj, q) = P(R | vec(dj)) / P(¬R | vec(dj))
           = [P(vec(dj) | R) × P(R)] / [P(vec(dj) | ¬R) × P(¬R)]   (Bayes' rule)
           ~ P(vec(dj) | R) / P(vec(dj) | ¬R)
since P(R) and P(¬R) are the same for all documents
where
  P(vec(dj) | R): probability of randomly selecting dj from the set R of relevant documents
  P(R): probability that a randomly selected doc from the collection is relevant
31
The Ranking
Assuming independence of the index terms:
sim(dj, q) ~ P(vec(dj) | R) / P(vec(dj) | ¬R)
           ~ [Π gi(vec(dj))=1 P(ki | R) × Π gi(vec(dj))=0 P(¬ki | R)] / [Π gi(vec(dj))=1 P(ki | ¬R) × Π gi(vec(dj))=0 P(¬ki | ¬R)]
where
  P(ki | R): probability that the index term ki is present in a document randomly selected from the set R of relevant documents
  P(¬ki | R): probability that ki is not present in such a document
32
The Ranking
Taking logs, dropping factors that are constant for all documents, and using P(ki | R) + P(¬ki | R) = 1:
sim(dj, q) ~ Σi=1..t wiq × wij × ( log [P(ki | R) / (1 - P(ki | R))] + log [(1 - P(ki | ¬R)) / P(ki | ¬R)] )
33
The Initial Ranking
sim(dj, q) ~ Σi wiq × wij × ( log [P(ki | R) / (1 - P(ki | R))] + log [(1 - P(ki | ¬R)) / P(ki | ¬R)] )
How to get the probabilities P(ki | R) and P(ki | ¬R)?
Estimates based on assumptions:
  P(ki | R) = 0.5
  P(ki | ¬R) = ni / N
where ni is the number of docs that contain ki and N is the number of docs in the collection
Use this initial guess to retrieve an initial ranking
Improve upon this initial ranking
34
Improving the Initial Ranking
sim(dj, q) ~ Σi wiq × wij × ( log [P(ki | R) / (1 - P(ki | R))] + log [(1 - P(ki | ¬R)) / P(ki | ¬R)] )
Let
  V: set of docs initially retrieved & ranked
  Vi: subset of V that contain ki
Re-evaluate the estimates:
  P(ki | R) = Vi / V   (approximate P(ki | R) by the distribution of ki among the docs in V)
  P(ki | ¬R) = (ni - Vi) / (N - V)   (assume all the non-retrieved docs are irrelevant)
Repeat recursively
35
Improving the Initial Ranking
sim(dj, q) ~ Σi wiq × wij × ( log [P(ki | R) / (1 - P(ki | R))] + log [(1 - P(ki | ¬R)) / P(ki | ¬R)] )
To avoid problems with V = 1 & Vi = 0:
  P(ki | R) = (Vi + 0.5) / (V + 1)
  P(ki | ¬R) = (ni - Vi + 0.5) / (N - V + 1)
Or, alternatively:
  P(ki | R) = (Vi + ni/N) / (V + 1)
  P(ki | ¬R) = (ni - Vi + ni/N) / (N - V + 1)
(ni / N is an adjustment factor)
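The smoothed 0.5 estimates above can be plugged directly into the per-term log-odds weight of the ranking formula; a sketch (the function name and the sample numbers are illustrative):

```python
import math

def prob_term_weight(V, Vi, N, ni):
    """Per-term log-odds weight using the 0.5-smoothed estimates:
    P(ki|R)  = (Vi + 0.5) / (V + 1)
    P(ki|~R) = (ni - Vi + 0.5) / (N - V + 1)"""
    p = (Vi + 0.5) / (V + 1)            # estimate of P(ki | R)
    u = (ni - Vi + 0.5) / (N - V + 1)   # estimate of P(ki | not R)
    return math.log(p / (1 - p)) + math.log((1 - u) / u)

# Initial round (V = Vi = 0): P(ki|R) = 0.5 exactly, P(ki|~R) ~ ni/N,
# so a rare term (ni << N) already gets a positive weight.
w0 = prob_term_weight(V=0, Vi=0, N=1000, ni=50)

# After feedback, a term concentrated in the retrieved set gains weight:
w_high = prob_term_weight(V=10, Vi=8, N=1000, ni=50)
w_low = prob_term_weight(V=10, Vi=1, N=1000, ni=50)
```

The document score is then the sum of these weights over the query terms present in the document, matching the sim(dj, q) formula above.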
36
Pluses and Minuses
Advantages:
  Documents are ranked in decreasing order of their probability of relevance
Disadvantages:
  Need to guess the initial estimates for P(ki | R)
  Method does not take into account the tf and idf factors
  Assumes mutual independence of index terms; not clear that this is bad, though
37
Brief Comparison of Classic Models
The Boolean model does not provide for partial matches and is considered to be the weakest classic model
Salton and Buckley did a series of experiments that indicate that, in general, the vector (space) model outperforms the probabilistic model with general collections
This also seems to be the view of the research community