malik-treadway
Query operations
1- Introduction
2- Relevance feedback with user relevance information
3- Relevance feedback without user relevance information
- Local analysis (pseudo-relevance feedback)
- Global analysis (thesaurus)
4- Evaluation
5- Issues
Introduction (1)
No detailed knowledge of the collection and retrieval environment
- Difficult to formulate queries well designed for retrieval
- Need many formulations of queries for effective retrieval
First formulation: often a naïve attempt to retrieve relevant information
Documents initially retrieved:
- Examined for relevance information (by the user or automatically)
- Used to improve query formulations for retrieving additional relevant documents
Query reformulation:
- Expanding the original query with new terms
- Reweighting the terms in the expanded query
Introduction (2)
Approaches based on feedback from users (relevance feedback)
Approaches based on information derived from set of initially retrieved documents (local set of documents)
Approaches based on global information derived from document collection
Relevance feedback with user relevance information (1)
Most popular query reformulation strategy
Cycle:
- User is presented with a list of retrieved documents
- User marks those which are relevant
- In practice: the top 10-20 ranked documents are examined
- Incremental
- Select important terms from documents assessed relevant by users
- Enhance the importance of these terms in a new query
Expected: the new query moves towards relevant documents and away from non-relevant documents
Relevance feedback with user relevance information (2)
Two basic techniques
- Query expansion: add new terms from relevant documents
- Term reweighting: modify term weights based on user relevance judgements
Advantages
- Shield users from the details of the query reformulation process
- Search broken down into a sequence of small steps
- Controlled process: emphasise some terms (relevant ones), de-emphasise other terms (non-relevant ones)
Relevance feedback with user relevance information (3)
Query expansion and term reweighting in the vector space model
Term reweighting in the probabilistic model
Query expansion and term reweighting in the vector space model
Term weight vectors of documents assessed relevant
- Similar among themselves
Term weight vectors of documents assessed non-relevant
- Dissimilar from those of relevant documents
Reformulated query:
- Closer to the term weight vectors of relevant documents
Query expansion and term reweighting in the vector space model
For query q
- Dr: set of relevant documents among retrieved documents
- Dn: set of non-relevant documents among retrieved documents
- Cr: set of relevant documents among all documents in the collection
- N: number of documents in the collection
- α, β, γ: tuning constants
Assume that Cr is known (unrealistic!)
Best query vector for distinguishing relevant documents from non-relevant documents:
$$ \vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j - \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j $$
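As a minimal sketch of the optimal query vector above, assuming documents are given as small dense term-weight vectors and that the full relevant set Cr is known (which the slide itself calls unrealistic). The names `centroid`, `optimal_query` and the four toy documents are illustrative only.

```python
def centroid(vectors, dim):
    """Mean of a list of equal-length vectors; zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def optimal_query(docs, relevant_ids):
    """Centroid of the relevant documents minus centroid of all other documents."""
    dim = len(next(iter(docs.values())))
    rel = [v for d, v in docs.items() if d in relevant_ids]
    non = [v for d, v in docs.items() if d not in relevant_ids]
    c_rel, c_non = centroid(rel, dim), centroid(non, dim)
    return [a - b for a, b in zip(c_rel, c_non)]


# Toy collection (hypothetical): three terms, four documents.
docs = {
    "d1": [1.0, 1.0, 0.0],
    "d2": [1.0, 0.0, 0.0],
    "d3": [0.0, 0.0, 1.0],
    "d4": [0.0, 1.0, 1.0],
}
q_opt = optimal_query(docs, relevant_ids={"d1", "d2"})
print(q_opt)  # [1.0, 0.0, -1.0]
```

The second term occurs equally often in both groups, so it gets weight 0: it does not help separate relevant from non-relevant documents.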
Query expansion and term reweighting in the vector space model
Problem: |Cr| is unknown
Approach
- Formulate an initial query
- Incrementally change the initial query vector
- Use |Dr| and |Dn| instead
Two formulas
- Rocchio formula
- Ide formula
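Rocchio formula
Direct application of the previous formula, with the original query added
Initial formulation: α = 1
Usually information in relevant documents is more important than in non-relevant documents (γ << β)
Positive relevance feedback (γ = 0)
$$ \vec{q}_{i+1} = \alpha \vec{q}_i + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j $$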
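A minimal sketch of one Rocchio iteration, assuming dense term-weight vectors of equal length. The function name `rocchio` and the toy vectors are illustrative; the values β = 1 and γ = 0.5 are chosen only to keep the arithmetic easy to follow, not recommended settings.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=1.0, gamma=0.5):
    """One feedback step: alpha*q + beta*mean(Dr) - gamma*mean(Dn)."""
    new_query = [alpha * w for w in query]
    # Each group contributes its centroid, scaled by its tuning constant.
    for docs, signed_coeff in ((relevant, beta), (non_relevant, -gamma)):
        if not docs:
            continue  # an empty group contributes nothing
        for doc in docs:
            for k, w in enumerate(doc):
                new_query[k] += signed_coeff * w / len(docs)
    return new_query


q = [1.0, 0.0, 0.0]
d_rel = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
d_non = [[0.0, 0.0, 1.0]]

q_next = rocchio(q, d_rel, d_non)
print(q_next)  # [2.0, 0.5, -0.5]

# Positive relevance feedback only (gamma = 0).
q_pos = rocchio(q, d_rel, d_non, gamma=0.0)
print(q_pos)  # [2.0, 0.5, 0.0]
```

The second term was not in the original query and enters with a positive weight (query expansion), while the first term's weight grows (term reweighting).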
Rocchio formula in practice (SMART)
α = 1
Terms used in the new query:
- Terms of the original query
- Terms that appear in more relevant documents than non-relevant documents
- Terms that appear in more than half of the relevant documents
Negative weights are ignored
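$$ \vec{q}_{i+1} = \alpha \vec{q}_i + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j $$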
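The practical variant can be sketched as follows, under one assumed reading of the term conditions: original query terms are always kept, and a new term is kept only if it occurs in more relevant than non-relevant documents and in more than half of the relevant documents. The slide does not say how the conditions combine, so this combination, the function name `smart_style_feedback` and the toy data are assumptions, not the SMART implementation.

```python
def smart_style_feedback(query, relevant, non_relevant, beta=1.0, gamma=1.0):
    """query and documents are dicts mapping term -> weight; alpha is fixed at 1."""

    def doc_freq(term, docs):
        return sum(1 for d in docs if d.get(term, 0.0) > 0.0)

    candidates = set(query)
    for doc in relevant:
        candidates.update(doc)

    new_query = {}
    for term in candidates:
        in_rel = doc_freq(term, relevant)
        in_non = doc_freq(term, non_relevant)
        # Assumed reading: query terms always pass; new terms need both conditions.
        keep = term in query or (in_rel > in_non and in_rel > len(relevant) / 2)
        if not keep:
            continue
        weight = query.get(term, 0.0)
        if relevant:
            weight += beta * sum(d.get(term, 0.0) for d in relevant) / len(relevant)
        if non_relevant:
            weight -= gamma * sum(d.get(term, 0.0) for d in non_relevant) / len(non_relevant)
        if weight > 0.0:  # negative (and zero) weights are ignored
            new_query[term] = weight
    return new_query


query = {"jaguar": 1.0, "car": 1.0}
relevant = [
    {"jaguar": 1.0, "engine": 1.0},
    {"jaguar": 1.0, "engine": 1.0, "speed": 1.0},
]
non_relevant = [{"jaguar": 1.0, "cat": 1.0, "car": 3.0}]

new_query = smart_style_feedback(query, relevant, non_relevant)
print(sorted(new_query.items()))  # [('engine', 1.0), ('jaguar', 1.0)]
```

Here "car" ends up with a negative weight and is dropped, "speed" occurs in only half of the relevant documents and is not added, and "engine" is added.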
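Ide formula
Initial formulation: α = β = γ = 1
Same comments as for the Rocchio formula
Both Ide and Rocchio: no optimality criterion
$$ \vec{q}_{i+1} = \alpha \vec{q}_i + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \sum_{\vec{d}_j \in D_n} \vec{d}_j $$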
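For comparison, a minimal sketch of the Ide update under the same assumptions as before (dense term-weight vectors, illustrative names and data). The only difference from Rocchio is that the document sums are not divided by |Dr| and |Dn|.

```python
def ide_feedback(query, relevant, non_relevant, alpha=1.0, beta=1.0, gamma=1.0):
    """One feedback step: alpha*q + beta*sum(Dr) - gamma*sum(Dn), no normalisation."""
    new_query = [alpha * w for w in query]
    for doc in relevant:
        for k, w in enumerate(doc):
            new_query[k] += beta * w
    for doc in non_relevant:
        for k, w in enumerate(doc):
            new_query[k] -= gamma * w
    return new_query


q_ide = ide_feedback(
    [1.0, 0.0, 0.0],
    relevant=[[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]],
    non_relevant=[[0.0, 0.0, 1.0]],
)
print(q_ide)  # [3.0, 1.0, -1.0]
```

Because the sums are not normalised, the size of the shift depends on how many documents were judged.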
Term reweighting for the probabilistic model
(see note on the BIR model)
Use idf to rank documents for original query
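Calculate
$$ g(D) = \sum_{i=1}^{n} c_i d_i $$
where c_i is the weight of term t_i and d_i indicates whether t_i occurs in document D
Predict relevance
Improved (optimal) retrieval function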
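A minimal sketch of the initial ranking step, assuming binary document representations (d_i is 1 if the term occurs, 0 otherwise) and the common idf form c_i = log(N / n_i); the slide does not fix the exact idf variant. Names and counts below are illustrative.

```python
import math


def idf_weights(query_terms, doc_freq, num_docs):
    """Initial term weights c_i = log(N / n_i) for query terms seen in the collection."""
    return {
        t: math.log(num_docs / doc_freq[t])
        for t in query_terms
        if doc_freq.get(t, 0) > 0
    }


def g(doc_terms, weights):
    """g(D) = sum of c_i over the query terms present in the document."""
    return sum(c for term, c in weights.items() if term in doc_terms)


# Hypothetical collection statistics: 100 documents in total.
doc_freq = {"car": 10, "red": 50}
docs = {"A": {"car", "red"}, "B": {"red"}, "C": {"car"}}

weights = idf_weights(["car", "red"], doc_freq, num_docs=100)
ranking = sorted(docs, key=lambda name: g(docs[name], weights), reverse=True)
print(ranking)  # ['A', 'C', 'B']
```

Once relevance judgements are available, the same `g` can be reused with the idf weights replaced by relevance-based weights.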
Term reweighting for the probabilistic model
Independence assumptions
- I1: the distribution of terms in relevant documents is independent, and their distribution in all documents is independent
- I2: the distribution of terms in relevant documents is independent, and their distribution in irrelevant documents is independent
Ordering principles
- O1: probable relevance is based only on the presence of search terms in documents
- O2: probable relevance is based on both the presence of search terms in documents and their absence from documents
Term reweighting for the probabilistic model
Various combinations
                         Independence assumption I1    Independence assumption I2
Ordering principle O1    F1                            F2
Ordering principle O2    F3                            F4
Term reweighting for the probabilistic model
F1 formula
Ratio of the proportion of relevant documents in which the query term ti occurs to the proportion of all documents in which ti occurs
$$ c_i = \log \frac{r_i / R}{n_i / N} $$
ri = number of relevant documents containing ti
ni = number of documents containing ti
R = number of relevant documents
N = number of documents in the collection
Term reweighting for the probabilistic model
F2 formula
Ratio of the proportion of relevant documents in which the term ti occurs to the proportion of irrelevant documents in which ti occurs
$$ c_i = \log \frac{r_i / R}{(n_i - r_i) / (N - R)} $$
ri = number of relevant documents containing ti
ni = number of documents containing ti
R = number of relevant documents
N = number of documents in the collection
Term reweighting for the probabilistic model
F3 formula
Ratio of “relevance odds” (ratio of relevant documents containing ti to relevant documents not containing ti) to “collection odds” (ratio of documents containing ti to documents not containing ti)
$$ c_i = \log \frac{r_i / (R - r_i)}{n_i / (N - n_i)} $$
ri = number of relevant documents containing ti
ni = number of documents containing ti
R = number of relevant documents
N = number of documents in the collection
Term reweighting for the probabilistic model
F4 formula
Ratio of “relevance odds” (ratio of relevant documents containing ti to relevant documents not containing ti) to “non-relevance odds” (ratio of non-relevant documents containing ti to non-relevant documents not containing ti)
$$ c_i = \log \frac{r_i / (R - r_i)}{(n_i - r_i) / (N - n_i - R + r_i)} $$
ri = number of relevant documents containing ti
ni = number of documents containing ti
R = number of relevant documents
N = number of documents in the collection
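The four weights can be computed side by side. This is a minimal sketch that assumes non-degenerate counts (0 < r < R, r < n, and every denominator positive), so it applies no smoothing; the function name `relevance_weights` and the counts in the example are made up for illustration.

```python
import math


def relevance_weights(r, n, R, N):
    """F1-F4 term weights from the counts r_i, n_i, R and N defined above."""
    return {
        "F1": math.log((r / R) / (n / N)),
        "F2": math.log((r / R) / ((n - r) / (N - R))),
        "F3": math.log((r / (R - r)) / (n / (N - n))),
        "F4": math.log((r / (R - r)) / ((n - r) / (N - n - R + r))),
    }


# Hypothetical term: in 8 of 10 relevant documents and 20 of 100 documents overall.
w = relevance_weights(r=8, n=20, R=10, N=100)
for name in ("F1", "F2", "F3", "F4"):
    print(name, round(w[name], 3))
```

For these counts the arguments of the logarithms are 4, 6, 16 and 26, so the weights are roughly 1.386, 1.792, 2.773 and 3.258.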
Experiments
F1, F2, F3 and F4 outperform both no relevance weighting and idf
F1 and F2 perform in the same range, as do F3 and F4
F3 and F4 > F1 and F2
F4 slightly > F3
O2 is correct (looking at presence and absence of terms)
No conclusion with respect to I1 and I2, although I2 seems a more realistic assumption.
Relevance feedback without user relevance
Relevance feedback with user relevance
- Clustering hypothesis: known relevant documents contain terms which can be used to describe a larger cluster of relevant documents
- Description of the cluster built interactively with user assistance
Relevance feedback without user relevance
- Obtain the cluster description automatically
- Identify terms related to query terms (e.g. synonyms, stemming variations, terms close to query terms in text)
- Local strategies
- Global strategies
Local analysis (pseudo-relevance feedback)
Examine documents retrieved for query to determine query expansion
No user assistance
Clustering techniques
Query “drift”
Clusters (1)
Synonymy association (one example): terms that frequently co-occur inside local set of documents
Term-term (e.g., stem-stem) association matrix (normalised)
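$$ c_{i,j} = \sum_{d \in D_l} tf(t_i, d) \times tf(t_j, d) $$
$$ m_{i,j} = \frac{c_{i,j}}{c_{i,i} + c_{j,j} - c_{i,j}} $$
where Dl is the local set of documents and tf(t, d) is the frequency of term t in document d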
Clusters (2)
For term ti
- Take the n largest values mi,j
- The resulting terms tj form the cluster for ti
Query q
- Find clusters for the |q| query terms
- Keep clusters small
- Expand the original query
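The cluster construction above can be sketched as follows, assuming each local document is a dict of term frequencies. Excluding a term from its own cluster (m_i,i is always 1) and breaking ties alphabetically are choices made here for the sketch; the function names and toy documents are illustrative.

```python
def association_matrix(local_docs):
    """Normalised term-term association m[i, j] over the local document set."""
    terms = sorted({t for doc in local_docs for t in doc})
    c = {
        (a, b): sum(doc.get(a, 0) * doc.get(b, 0) for doc in local_docs)
        for a in terms
        for b in terms
    }
    m = {}
    for a in terms:
        for b in terms:
            denom = c[a, a] + c[b, b] - c[a, b]
            m[a, b] = c[a, b] / denom if denom else 0.0
    return terms, m


def cluster(term, terms, m, n):
    """The n terms most strongly associated with `term` (the term itself excluded)."""
    others = [t for t in terms if t != term]
    others.sort(key=lambda t: (-m[term, t], t))
    return others[:n]


def expand_query(query_terms, local_docs, n=1):
    """Append the cluster of each query term to the original query."""
    terms, m = association_matrix(local_docs)
    expanded = list(query_terms)
    for q in query_terms:
        if q not in terms:
            continue
        for t in cluster(q, terms, m, n):
            if t not in expanded:
                expanded.append(t)
    return expanded


local_docs = [
    {"car": 2, "engine": 1},
    {"car": 1, "engine": 2, "wheel": 1},
    {"cat": 3, "wheel": 1},
]
terms, m = association_matrix(local_docs)
expanded = expand_query(["car"], local_docs, n=1)
print(expanded)  # ['car', 'engine']
```

In this toy set, "car" and "engine" have association 4 / (5 + 5 - 4) = 2/3, while "car" and "wheel" have only 1/6, so a cluster of size 1 adds "engine".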
Global analysis
Expand query using information from whole set of documents in collection
Thesaurus-like structure using all documents
Approach to automatically build a thesaurus (e.g. a similarity thesaurus based on co-occurrence frequency)
Approach to select terms for query expansion
Evaluation of relevance feedback strategies
Use qi and compute the precision-recall graph
Use qi+1 and compute the precision-recall graph
Use all documents in the collection
- Spectacular improvements
- Also due to relevant documents being ranked higher
- These documents are already known to the user
- Must evaluate with respect to documents not seen by the user
Three techniques
Evaluation of relevance feedback strategies
Freezing
Full freezing
- Top n documents are frozen (the ones used in RF)
- Remaining documents are re-ranked
- Precision-recall computed on the whole ranking
- Change in effectiveness thus comes from unseen documents
- With many iterations, the growing contribution of frozen documents may lead to a decrease in effectiveness
Modified freezing
- Freeze at the rank position of the last marked relevant document
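Full freezing can be sketched as below, assuming rankings are lists of document identifiers and that the feedback ranking covers the same documents as the initial one. The name `full_freezing` and the toy rankings are illustrative.

```python
def full_freezing(initial_ranking, feedback_ranking, n):
    """Keep the top n of the initial ranking fixed; re-rank the rest by the new query."""
    frozen = initial_ranking[:n]
    seen = set(frozen)
    # Remaining documents follow in the order given by the reformulated query.
    rest = [d for d in feedback_ranking if d not in seen]
    return frozen + rest


initial = ["d1", "d2", "d3", "d4", "d5"]   # ranking for q_i
feedback = ["d4", "d1", "d5", "d2", "d3"]  # ranking for q_i+1
evaluated = full_freezing(initial, feedback, n=2)
print(evaluated)  # ['d1', 'd2', 'd4', 'd5', 'd3']
```

Precision and recall are then computed on the combined ranking, so any change relative to the initial ranking comes from positions n+1 onwards.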
Evaluation of relevance feedback strategies
Test and control group
Random splitting of documents: test documents and control documents
Query reformulation performed on test documents
New query run against the control documents
RF evaluated only on the control group
Difficulty in splitting the collection
- Distribution of relevant documents
Evaluation of relevance feedback strategies
Residual ranking
Documents used in assessing relevance are removed
Precision-recall computed on the “residual collection”
Considers the effect on unseen documents
Results not comparable with original ranking (fewer relevant documents)
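A minimal sketch of residual evaluation, assuming rankings are lists of document identifiers and relevance judgements are a set. Precision and recall at a cutoff k stand in here for the full precision-recall graph; the function name and the data are illustrative.

```python
def residual_precision_recall(ranking, relevant, seen, k):
    """Precision and recall at k after removing the documents seen during feedback."""
    residual = [d for d in ranking if d not in seen]
    residual_relevant = relevant - seen
    top = residual[:k]
    hits = sum(1 for d in top if d in residual_relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(residual_relevant) if residual_relevant else 0.0
    return precision, recall


relevant = {"d1", "d4", "d5"}
seen = {"d1", "d2"}  # documents the user judged during feedback

before = residual_precision_recall(["d1", "d2", "d3", "d4", "d5"], relevant, seen, k=2)
after = residual_precision_recall(["d4", "d1", "d5", "d2", "d3"], relevant, seen, k=2)
print(before)  # (0.5, 0.5)
print(after)   # (1.0, 1.0)
```

Both queries are scored on the same residual collection, which holds only two of the three relevant documents. That is why the figures cannot be compared with those of the original ranking.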
Issues
Interface
- Allow the user to quickly identify relevant and non-relevant documents
- What happens with 2D and 3D visualisation?
Global analysis
- On the web? Yahoo!
Local analysis
- Computation cost (on-line)
Interactive query expansion
- User chooses the terms to be added
Negative relevance feedback
Documents explicitly marked as non-relevant by users
Implementation
Clarity
Usability