Query operations

1- Introduction

2- Relevance feedback with user relevance information

3- Relevance feedback without user relevance information

- Local analysis (pseudo-relevance feedback)

- Global analysis (thesaurus)

4- Evaluation

5- Issues

Introduction (1)

- No detailed knowledge of collection and retrieval environment
  - Difficult to formulate queries well designed for retrieval
  - Need many formulations of queries for effective retrieval
- First formulation: often a naïve attempt to retrieve relevant information
- Documents initially retrieved:
  - Examined for relevance information (by the user, or automatically)
  - Improve query formulations for retrieving additional relevant documents
- Query reformulation:
  - Expanding original query with new terms
  - Reweighting the terms in expanded query

Introduction (2)

Approaches based on feedback from users (relevance feedback)

Approaches based on information derived from set of initially retrieved documents (local set of documents)

Approaches based on global information derived from document collection

Relevance feedback with user relevance information (1)

- Most popular query reformulation strategy
- Cycle:
  - User presented with list of retrieved documents
  - User marks those which are relevant (in practice: top 10-20 ranked documents are examined)
  - Incremental
  - Select important terms from documents assessed relevant by users
  - Enhance importance of these terms in a new query
- Expected: new query moves towards relevant documents and away from non-relevant documents


- Two basic techniques
  - Query expansion: add new terms from relevant documents
  - Term reweighting: modify term weights based on user relevance judgements
- Advantages
  - Shield users from details of query reformulation process
  - Search broken down in sequence of small steps
  - Controlled process: emphasise some terms (relevant ones), de-emphasise other terms (non-relevant ones)


Query expansion and term reweighting in the vector space model

Term reweighting in the probabilistic model

Query expansion and term reweighting in the vector space model

- Term weight vectors of documents assessed relevant: similar to one another
- Term weight vectors of documents assessed non-relevant: dissimilar from those of relevant documents
- Reformulated query: closer to term weight vectors of relevant documents


For query q:
- Dr: set of relevant documents among retrieved documents
- Dn: set of non-relevant documents among retrieved documents
- Cr: set of relevant documents among all documents in collection
- α, β, γ: tuning constants

Assume that Cr is known (unrealistic!)

Best query vector for distinguishing relevant documents from non-relevant documents

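$$\vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j \;-\; \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j$$

where N is the number of documents in the collection.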
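As a rough illustration of the optimal query vector, the sketch below subtracts the centroid of the non-relevant documents from the centroid of the relevant ones. It assumes dense term-weight vectors stored as Python lists and that the full relevant set Cr is known, which the slide itself flags as unrealistic. The names `centroid` and `optimal_query` are illustrative and do not come from the source.

```python
from typing import List

Vector = List[float]


def centroid(vectors: List[Vector], size: int) -> Vector:
    """Average of a list of term-weight vectors (zero vector if the list is empty)."""
    if not vectors:
        return [0.0] * size
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(size)]


def optimal_query(relevant: List[Vector], non_relevant: List[Vector]) -> Vector:
    """Centroid of the relevant set minus centroid of all other documents.

    Assumes at least one of the two sets is non-empty.
    """
    size = len((relevant or non_relevant)[0])
    c_rel = centroid(relevant, size)
    c_non = centroid(non_relevant, size)
    return [a - b for a, b in zip(c_rel, c_non)]
```

Terms that are frequent in relevant documents and rare elsewhere end up with positive weights; terms typical of non-relevant documents end up negative.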


- Problem: |Cr| is unknown
- Approach
  - Formulate initial query
  - Incrementally change initial query vector
  - Use |Dr| and |Dn| instead
- Two formulas: Rocchio formula, Ide formula

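Rocchio formula

- Direct application of previous formula, plus the original query
- Initial formulation: α = 1
- Usually information in relevant documents is more important than in non-relevant documents (γ << β)
- Positive relevance feedback only: γ = 0

$$\vec{q}_{i+1} = \alpha \vec{q}_i + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j$$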
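One feedback iteration of the Rocchio update could be sketched as follows. This is a minimal version assuming dense term-weight vectors as Python lists; the function name `rocchio` and the default values of `beta` and `gamma` are arbitrary illustrative choices (picked only so that gamma is smaller than beta), not values given in the source.

```python
from typing import List

Vector = List[float]


def rocchio(query: Vector, relevant: List[Vector], non_relevant: List[Vector],
            alpha: float = 1.0, beta: float = 1.0, gamma: float = 0.5) -> Vector:
    """Return q_{i+1} = alpha*q_i + beta*mean(Dr) - gamma*mean(Dn)."""
    new_query = [alpha * w for w in query]
    if relevant:  # skip the term when no document was marked relevant
        for k in range(len(query)):
            new_query[k] += beta * sum(d[k] for d in relevant) / len(relevant)
    if non_relevant:  # skip the term when no document was marked non-relevant
        for k in range(len(query)):
            new_query[k] -= gamma * sum(d[k] for d in non_relevant) / len(non_relevant)
    return new_query
```

Calling it with `gamma=0.0` gives positive relevance feedback only: the non-relevant documents no longer influence the new query.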

Rocchio formula in practice (SMART)

- α = 1
- Terms
  - Original query
  - Appear in more relevant documents than non-relevant documents
  - Appear in more than half the relevant documents
- Negative weights ignored

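$$\vec{q}_{i+1} = \alpha \vec{q}_i + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j$$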
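The term selection and the handling of negative weights might be sketched as below. The slide does not say how the three term criteria combine, so this sketch assumes one possible reading: original query terms are always kept, and a new term is kept only if it appears in more relevant than non-relevant documents and in more than half of the relevant documents. The helper names `select_terms` and `clip_and_filter` are hypothetical.

```python
from typing import List, Set

Vector = List[float]


def select_terms(query: Vector, relevant: List[Vector],
                 non_relevant: List[Vector]) -> Set[int]:
    """Indexes of terms allowed in the reformulated query (assumed reading)."""
    allowed = set()
    for k in range(len(query)):
        in_rel = sum(1 for d in relevant if d[k] > 0)
        in_non = sum(1 for d in non_relevant if d[k] > 0)
        is_query_term = query[k] > 0
        is_good_new_term = in_rel > in_non and in_rel > len(relevant) / 2
        if is_query_term or is_good_new_term:
            allowed.add(k)
    return allowed


def clip_and_filter(new_query: Vector, allowed: Set[int]) -> Vector:
    """Zero out terms that were not selected and ignore negative weights."""
    return [max(w, 0.0) if k in allowed else 0.0
            for k, w in enumerate(new_query)]
```

The vector passed to `clip_and_filter` would be the output of a Rocchio-style update; filtering keeps the expanded query short, which matters when the query is run against a large index.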

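Ide formula

- Initial formulation: α = β = γ = 1
- Same comments as for the Rocchio formula
- Both Ide and Rocchio: no optimality criterion

$$\vec{q}_{i+1} = \alpha \vec{q}_i + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \sum_{\vec{d}_j \in D_n} \vec{d}_j$$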
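The Ide update differs from Rocchio only in that the sums are not divided by the set sizes. A minimal sketch, again assuming dense list vectors and an illustrative function name `ide`:

```python
from typing import List

Vector = List[float]


def ide(query: Vector, relevant: List[Vector], non_relevant: List[Vector],
        alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> Vector:
    """Return q_{i+1} = alpha*q_i + beta*sum(Dr) - gamma*sum(Dn)."""
    new_query = [alpha * w for w in query]
    for d in relevant:
        for k, w in enumerate(d):
            new_query[k] += beta * w
    for d in non_relevant:
        for k, w in enumerate(d):
            new_query[k] -= gamma * w
    return new_query
```

Because there is no normalisation, the more documents the user judges, the larger their combined effect on the new query.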

Term reweighting for the probabilistic model

(see note on the BIR model)

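- Use idf to rank documents for original query
- Calculate

$$g(D) = \sum_{i=1}^{n} c_i d_i$$

  where d_i indicates whether term ti occurs in document D and c_i is the weight of term ti
- Predict relevance
- Improved (optimal) retrieval function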
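For the initial ranking, before any relevance information exists, the weight of each term can be taken as its idf and plugged into g(D). The sketch below assumes binary document representations (a set of terms per document) and the plain form idf = log(N / n_i); both the idf variant and the names `idf_weights` and `score` are assumptions made for illustration.

```python
import math
from typing import Dict, Iterable, Set


def idf_weights(doc_freq: Dict[str, int], n_docs: int) -> Dict[str, float]:
    """c_i = log(N / n_i) for every term with a known document frequency."""
    return {t: math.log(n_docs / df) for t, df in doc_freq.items()}


def score(doc_terms: Set[str], query_terms: Iterable[str],
          weights: Dict[str, float]) -> float:
    """g(D): sum of the weights of the query terms present in the document."""
    return sum(weights[t] for t in query_terms if t in doc_terms)
```

Once relevance judgements are available, the same `score` function can be reused with relevance-based weights in place of idf.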


- Independence assumptions
  - I1: distribution of terms in relevant documents is independent, and their distribution in all documents is independent
  - I2: distribution of terms in relevant documents is independent, and their distribution in irrelevant documents is independent
- Ordering principles
  - O1: probable relevance based only on presence of search terms in documents
  - O2: probable relevance based on presence of search terms in documents and their absence from documents


Various combinations

                          Independence assumption I1   Independence assumption I2
Ordering principle O1     F1                           F2
Ordering principle O2     F3                           F4


F1 formula

- ri = number of relevant documents containing ti
- ni = number of documents containing ti
- R = number of relevant documents
- N = number of documents in collection

Ratio of the proportion of relevant documents in which the query term ti occurs to the proportion of all documents in which the term ti occurs.

$$c_i = \log \frac{r_i / R}{n_i / N}$$


F2 formula

Ratio of the proportion of relevant documents in which the term ti occurs to the proportion of all irrelevant documents in which ti occurs.

$$c_i = \log \frac{r_i / R}{(n_i - r_i) / (N - R)}$$


F3 formula

Ratio of “relevance odds” (ratio of relevant documents containing term ti to relevant documents not containing ti) and “collection odds” (ratio of documents containing ti to documents not containing ti).

$$c_i = \log \frac{r_i / (R - r_i)}{n_i / (N - n_i)}$$


F4 formula

Ratio of “relevance odds” and “non-relevance odds” (ratio of non-relevant documents containing ti to non-relevant documents not containing ti).

$$c_i = \log \frac{r_i / (R - r_i)}{(n_i - r_i) / (N - n_i - R + r_i)}$$
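The four weighting functions F1 to F4 can be compared on the same counts with a few lines of code. This is a sketch using the slide's symbols (r, n, R, N for one term); the function names are illustrative. The raw formulas are undefined when a count is zero, and a common correction adds 0.5 to each count, which is deliberately not applied here.

```python
import math


def f1(r: int, n: int, R: int, N: int) -> float:
    """Proportion in relevant documents vs proportion in all documents."""
    return math.log((r / R) / (n / N))


def f2(r: int, n: int, R: int, N: int) -> float:
    """Proportion in relevant documents vs proportion in non-relevant documents."""
    return math.log((r / R) / ((n - r) / (N - R)))


def f3(r: int, n: int, R: int, N: int) -> float:
    """Relevance odds vs collection odds."""
    return math.log((r / (R - r)) / (n / (N - n)))


def f4(r: int, n: int, R: int, N: int) -> float:
    """Relevance odds vs non-relevance odds."""
    return math.log((r / (R - r)) / ((n - r) / (N - n - R + r)))
```

For example, with r = 8, n = 20, R = 10 and N = 100, the arguments of the logarithm are 4, 6, 16 and 26 for F1 to F4 respectively.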

Experiments

- F1, F2, F3 and F4 outperform no relevance weighting and idf
- F1 and F2 perform in the same range; F3 and F4 perform in the same range
- F3 and F4 > F1 and F2
- F4 slightly > F3
- O2 is correct (looking at presence and absence of terms)
- No conclusion with respect to I1 and I2, although I2 seems a more realistic assumption

Relevance feedback without user relevance

- Relevance feedback with user relevance
  - Clustering hypothesis: known relevant documents contain terms which can be used to describe a larger cluster of relevant documents
  - Description of cluster built interactively with user assistance
- Relevance feedback without user relevance
  - Obtain cluster description automatically
  - Identify terms related to query terms (e.g. synonyms, stemming variations, terms close to query terms in text)
  - Local strategies
  - Global strategies

Local analysis (pseudo-relevance feedback)

Examine documents retrieved for query to determine query expansion

No user assistance

Clustering techniques

Query “drift”

Clusters (1)

Synonymy association (one example): terms that frequently co-occur inside local set of documents

Term-term (e.g., stem-stem) association matrix (normalised)

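$$c_{i,j} = \sum_{d \in D_l} tf(t_i, d) \times tf(t_j, d)$$

$$m_{i,j} = \frac{c_{i,j}}{c_{i,i} + c_{j,j} - c_{i,j}}$$

where Dl is the local set of documents and tf(t, d) is the frequency of term t in document d.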
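The association matrix and its normalised form could be computed as in the sketch below. It assumes each local document is given as a dictionary mapping a term (or stem) to its positive frequency; the names `correlations` and `normalise` are illustrative.

```python
from collections import defaultdict
from typing import Dict, List

Matrix = Dict[str, Dict[str, float]]


def correlations(local_docs: List[Dict[str, int]]) -> Matrix:
    """c[ti][tj] = sum over local documents of tf(ti, d) * tf(tj, d)."""
    c: Matrix = defaultdict(lambda: defaultdict(float))
    for tf in local_docs:
        for ti, fi in tf.items():
            for tj, fj in tf.items():
                c[ti][tj] += fi * fj
    return c


def normalise(c: Matrix) -> Matrix:
    """m[ti][tj] = c_ij / (c_ii + c_jj - c_ij); pairs that never co-occur get 0."""
    m: Matrix = {}
    for ti in c:
        m[ti] = {}
        for tj in c:
            cij = c[ti].get(tj, 0.0)
            m[ti][tj] = cij / (c[ti][ti] + c[tj][tj] - cij)
    return m
```

The normalised value is 1 for a term paired with itself and 0 for two terms that never occur in the same local document.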

Clusters (2)

For term ti

Take the n largest values mi,j

The resulting terms tj form cluster for ti

Query q
- Find clusters for the |q| query terms
- Keep clusters small
- Expand original query
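Building the clusters and expanding the query might look like the following sketch. It assumes the normalised association matrix is available as a nested dictionary `m[ti][tj]`; the function names `cluster` and `expand_query` and the tie-breaking by alphabetical order are choices made for this example only.

```python
from typing import Dict, List

Matrix = Dict[str, Dict[str, float]]


def cluster(term: str, m: Matrix, n: int) -> List[str]:
    """The n terms tj with the largest m[term][tj], excluding the term itself."""
    neighbours = {t: v for t, v in m.get(term, {}).items() if t != term}
    ranked = sorted(neighbours, key=lambda t: (-neighbours[t], t))
    return ranked[:n]


def expand_query(query: List[str], m: Matrix, n: int) -> List[str]:
    """Original query terms followed by the cluster terms of each query term."""
    expanded = list(query)
    for term in query:
        for t in cluster(term, m, n):
            if t not in expanded:
                expanded.append(t)
    return expanded
```

A small `n` keeps the clusters small, which limits the number of loosely related terms that enter the query.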

Global analysis

Expand query using information from whole set of documents in collection

Thesaurus-like structure using all documents

Approach to automatically build a thesaurus (e.g. similarity thesaurus based on co-occurrence frequency)

Approach to select terms for query expansion

Evaluation of relevance feedback strategies

Use qi and compute precision and recall graph

Use qi+1 and compute precision recall graph

Use all documents in the collection

- Spectacular improvements
  - Also due to relevant documents ranked higher
  - Documents known to user
- Must evaluate with respect to documents not seen by user

Three techniques


Freezing

- Full freezing
  - Top n documents are frozen (ones used in RF)
  - Remaining documents are re-ranked
  - Precision-recall on whole ranking
  - Change in effectiveness thus comes from unseen documents
  - With many iterations, higher contribution of frozen documents may lead to decrease in effectiveness
- Modified freezing
  - Freeze up to the rank position of the last marked relevant document
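Both freezing variants can be sketched as operations on ranked lists of document identifiers. The sketch assumes rankings are Python lists ordered from best to worst; the function names are illustrative, and the treatment of the case where nothing was marked relevant (nothing is frozen) is an assumption.

```python
from typing import List, Set


def full_freezing(initial_ranking: List[str], new_ranking: List[str],
                  n: int) -> List[str]:
    """Keep the top n of the initial ranking, then the re-ranked remainder."""
    frozen = initial_ranking[:n]
    seen = set(frozen)
    return frozen + [d for d in new_ranking if d not in seen]


def modified_freezing(initial_ranking: List[str], new_ranking: List[str],
                      marked_relevant: Set[str]) -> List[str]:
    """Freeze only down to the last document the user marked relevant."""
    positions = [i for i, d in enumerate(initial_ranking) if d in marked_relevant]
    cut = positions[-1] + 1 if positions else 0
    return full_freezing(initial_ranking, new_ranking, cut)
```

Precision and recall are then computed on the combined ranking, so any change relative to the initial run comes from the documents below the frozen part.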


Test and control group

- Random splitting of documents: test documents and control documents
- Query reformulation performed on test documents
- New query run against the control documents

Evaluation performed only on control group

Difficulty in splitting the collection
- Distribution of relevant documents
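The random split itself is simple to sketch; the hard part, as noted above, is that relevant documents may be distributed unevenly between the two halves. The function name `split_collection`, the even half-and-half split and the fixed seed are assumptions made so the example is reproducible.

```python
import random
from typing import List, Tuple


def split_collection(doc_ids: List[str], seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly split document identifiers into a test group and a control group."""
    shuffled = list(doc_ids)
    random.Random(seed).shuffle(shuffled)  # local generator, global state untouched
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]
```

The query would be reformulated using judgements on the first group and then run against the second group only.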


Residual ranking

Documents used in assessing relevance are removed

Precision-recall on “residual collection”

Consider effect of unseen documents

Results not comparable with original ranking (fewer relevant documents)
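Residual ranking amounts to filtering the assessed documents out of both the ranking and the relevant set before measuring. A minimal sketch, assuming rankings are lists of document identifiers and using precision at a cutoff k as a stand-in for the full precision-recall graph; the function names are illustrative.

```python
from typing import List, Set


def residual_ranking(ranking: List[str], assessed: Set[str]) -> List[str]:
    """Ranking with every document already shown to the user removed."""
    return [d for d in ranking if d not in assessed]


def precision_at(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k documents (k > 0) that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k
```

The relevant set must be reduced in the same way (`relevant - assessed`), which is why the figures are not comparable with those of the original ranking.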

Issues

- Interface
  - Allow user to quickly identify relevant and non-relevant documents
  - What happens with 2D and 3D visualisation?
- Global analysis
  - On the web? Yahoo!
- Local analysis
  - Computation cost (on-line)
- Interactive query expansion
  - User chooses the terms to be added

Negative relevance feedback

Documents explicitly marked as non-relevant by users

Implementation

Clarity

Usability