53
1 Advanced information Advanced information retrieval retrieval Chapter. 05: Query Reformulation Chapter. 05: Query Reformulation

1 Advanced information retrieval Chapter. 05: Query Reformulation

Embed Size (px)

Citation preview

Page 1: 1 Advanced information retrieval Chapter. 05: Query Reformulation

11

Advanced information Advanced information retrievalretrieval

Chapter. 05: Query ReformulationChapter. 05: Query Reformulation

Page 2: 1 Advanced information retrieval Chapter. 05: Query Reformulation

22

OutlineOutline

1. Basic concept1. Basic concept

2. Query refinement through relevance feedback2. Query refinement through relevance feedback

3. Query expansion through term clustering3. Query expansion through term clustering

1. Query-specific clustering1. Query-specific clustering

2. Global clustering (statistical thesauri)2. Global clustering (statistical thesauri)1. Clustering techniques1. Clustering techniques

2. Document clustering2. Document clustering

3. Cluster Hierarchies3. Cluster Hierarchies

• 3. Semantic thesaurus3. Semantic thesaurus

Page 3: 1 Advanced information retrieval Chapter. 05: Query Reformulation

33

Basic concept

• Often, initial queries are not optimal. Possibly because• User's lack of knowledge of the domain.• User's vocabulary does not match authors' vocabulary.

• Normally, users follow-up initial queries with reformulations that take advantage of experience gained in the initial attempt.• Automatic query reformulation: Rewrite the initial query automatically (or with some assistance from the user). • Process could be a single (final) reformulation, or an iterative refinement process that (hopefully) converges to the optimal query.• We examine two approaches:

1. Query refinement through user relevance feedback.2. Query expansion through word clusters.

Page 4: 1 Advanced information retrieval Chapter. 05: Query Reformulation

44

Query refinement by Relevance Feedback

• User examines the set of documents retrieved by the query and marks those that are relevant to his search.• In practice, only the top (10-20) documents retrieved need to be examined.• The system then reformulates the query to emphasize the concepts (e.g., terms) expressed in the documents judged to be relevant, and de-emphasize the concepts in the documents judged to be non-relevant.• Process may be repeated iteratively.• Advantage: User is only required to mark documents, and does not get involved in the technicalities of reformulation.• Next, we examine this technique in the vector model.

Page 5: 1 Advanced information retrieval Chapter. 05: Query Reformulation

55

Refinement by relevance feedback (cont.)• In the vector model, a query is a vector of term weights; hence reformulation involves reassigning term weights.• If a document is known to be relevant, the query can be improved by increasing its similarity to that document.• If a document is known to be non-relevant, the query can be improved by decreasing its similarity to that document.• Problems:

• What if the query has to increase its similarity to two very non-similar documents (each “pulls” the query in an entirely different direction)?• What if the query has to be decrease its similarity to two very non-similar documents (each “pushes” the query in an entirely different direction)?

• Critical assumptions that must be made: • Relevant documents resemble each other (are clustered).• Non-relevant documents resemble each other (are clustered).• Non-relevant documents differ from the relevant documents.

Page 6: 1 Advanced information retrieval Chapter. 05: Query Reformulation

66

Refinement by relevance feedback (cont.)

• For a query q, denote• DR : The set of relevant documents in the answer (as identified by the user)• DN : The set of non-relevant documents in the answer.• CR : The set of relevant documents in the collection (the ideal answer).

• Assume (unrealistic!) that CR is known in advance.• It can then be shown that the best query vector for distinguishing the relevant documents from the non-relevant documents is:

• The left expression is the centroid of the relevant documents, the right expression is the centroid of the non-relevant documents.• Note: This expression is vector arithmetic! dj are vectors, whereas |CR| and |n – CR| are scalars.

RjRj Cd

j

R

Cdj

R

dCn

dC

11

Page 7: 1 Advanced information retrieval Chapter. 05: Query Reformulation

77

Refinement by relevance feedback (cont.)

• Since we don’t know CR, we shall substitute it by DR in each of the two expressions, and then use them to modify the initial query q:

• , , and are tuning constants; for example, 1.0, 0.5, 0.25.• Note: This expression is vector arithmetic! q and dj are vectors, whereas |DR|, |DN|, , , and are scalars.

NjRj Dd

j

N

Ddj

R

newd

Dd

Dqq

.

Page 8: 1 Advanced information retrieval Chapter. 05: Query Reformulation

88

Refinement by relevance feedback (cont.)

• Positive feedback factor. Uses the user's judgments on relevant documents to increase the values of terms. Moves the query to retrieve documents similar to relevant documents retrieved (in the direction of more relevant documents).

• Negative feedback factor. Uses the user's judgments on non-relevant documents to decrease the values of terms. Moves the query away from non-relevant documents.

• Positive feedback often weighted significantly more than negative feedback; often, only positive feedback is used.

Rj Dd

j

R

dD

Nj Dd

j

N

dD

Page 9: 1 Advanced information retrieval Chapter. 05: Query Reformulation

99

Refinement by relevance feedback (cont.)

• Example: • Assume query q = (3,0,0,2,0) retrieved three documents: d1, d2, d3. • Assume d1 and d2 are judged relevant and d3 is judged non-relevant.• Assume the tuning constants used are 1.0, 0.5, 0.25.

k1 k2 k3 k4 k5 q 3 0 0 2 0 d1 2 4 0 0 2 d2 1 3 0 0 0 d3 0 0 4 3 3 qnew 3.75 1.75 0 1.25 0 The revised query is:

qnew = (3, 0, 0, 2, 0) + 0.5 * ((2+1)/2, (4+3)/2, (0+0)/2, (0+0)/2, (2+0)/2) – 0.25 * (0, 0, 4, 3, 3)

= (3.75, 1.75, –1, 1.25, 0) = (3.75, 1.75, 0, 1.25, 0)

Page 10: 1 Advanced information retrieval Chapter. 05: Query Reformulation

1010

Refinement by relevance feedback (cont.)• Using a simplified similarity formula (the nominator only of the cosine):

we can compare the similarity of q and qnew to the three documents:

d1 d2 d3 q 6 3 6 qnew 14.5 9 3.75

• Compared to the original query, the new query is indeed more similar to d1, d2. (which were judged relevant), and less similar to d3 (which was judged non-relevant).• Notice how the new query added k2, which was not in the original query.

• For example, a user may be searching for software for a PC, may (inadvertently) mark as relevant documents about software which is actually for a Mac, and the revised query may introduce the term Mac.

similarity d j , qi 1

t

W i , j W i , q

Page 11: 1 Advanced information retrieval Chapter. 05: Query Reformulation

1111

Refinement by relevance feedback (cont.)

• Problem: Relevance feedback may not operate satisfactorily, if the identified relevant documents do not form a tight cluster.

• Possible solution: Cluster the identified relevant documents, then split the original query into several, by constructing a new query for each cluster.

• Problem: Some of the query terms might not be found in any of the retrieved documents. This will lead to reduction of their relative weight in the modified query (or even elimination). Undesirable, because these terms might still be found in future iterations.

• Possible solutions: Ensure that the original terms are kept; or present all modified queries to the user for review.

• Problem: New query terms might be introduced that conflict with the intention of the user (as in the previous example).

• Possible solutions: Present all modified queries to the user for review.

Page 12: 1 Advanced information retrieval Chapter. 05: Query Reformulation

1212

Refinement by relevance feedback (cont.)

• Conclusion: Experimentation showed that user relevance feedback in the vector model gives good results.• However:

• Users are sometimes reluctant to provide explicit feedback• Results in long queries that needs more computation to retrieve, which is a lot of time for search engines.• Makes it harder to understand why a particualr document was retrieved.

• “Fully automatic” relevance feedback: The rank values for the documents in the first answer are used as relevance feedback to automatically generate the second query (no human judgment).

• The highest ranking documents are assumed to be relevant (positive feedback only).

Page 13: 1 Advanced information retrieval Chapter. 05: Query Reformulation

1313

Query expansion by term clustering

• Expand the query with terms that are similar to the terms mentioned in the query.

• For each term, create a cluster of similar terms.• The relevance feedback method shifts the query:

• Some terms are emphasized, others de-emphasized.• Consequently, new documents are discovered and some old documents are abandoned.

• The word-clustering method expands the query:• Terms are added.• New documents are discovered (but none discarded).• Recall increases (and precision decreases).• Compensates for the different vocabularies of author and user.• Requires no user involvement.

Page 14: 1 Advanced information retrieval Chapter. 05: Query Reformulation

1414

Local vs. Global Automatic AnalysisLocal vs. Global Automatic Analysis

• Local – Documents retrieved are Local – Documents retrieved are examined to automatically examined to automatically determine query expansion. No determine query expansion. No relevance feedback needed.relevance feedback needed.

• Global – Thesaurus used to help Global – Thesaurus used to help select terms for expansion.select terms for expansion.

Page 15: 1 Advanced information retrieval Chapter. 05: Query Reformulation

1515

Automatic Local AnalysisAutomatic Local Analysis

• At query time, dynamically determine At query time, dynamically determine similar terms based on analysis of top-similar terms based on analysis of top-ranked retrieved documents.ranked retrieved documents.

• Base correlation analysis on only the Base correlation analysis on only the “local” set of retrieved documents for “local” set of retrieved documents for a specific query.a specific query.

• Avoids ambiguity by determining Avoids ambiguity by determining similar (correlated) terms only within similar (correlated) terms only within relevant documents.relevant documents.– ““Apple computer” Apple computer”

“Apple computer Powerbook “Apple computer Powerbook laptop”laptop”

Page 16: 1 Advanced information retrieval Chapter. 05: Query Reformulation

1616

Automatic Local AnalysisAutomatic Local Analysis

• Expand query with terms found in local Expand query with terms found in local clusters.clusters.

• DDll – set of documents retireved for query q. – set of documents retireved for query q.

• VVll – Set of words used in D – Set of words used in Dll..

• SSll – Set of distinct stems in V – Set of distinct stems in Vll..

• ffsi,jsi,j –Frequency of stem s –Frequency of stem sii in document d in document djj found in Dfound in Dll..

• Construct stem-stem association matrix.Construct stem-stem association matrix.

Page 17: 1 Advanced information retrieval Chapter. 05: Query Reformulation

1717

Association MatrixAssociation Matrixw1 w2 w3 …………………..wn

w1

w2

w3

.

.wn

c11 c12 c13…………………c1n

c21

c31

.

.cn1

cij: Correlation factor between stems si and stem sj

k l

ij sik sjkd D

c f f

fik : Frequency of term i in document k

Page 18: 1 Advanced information retrieval Chapter. 05: Query Reformulation

1818

Normalized Association Normalized Association MatrixMatrix• Frequency based correlation factor Frequency based correlation factor

favors more frequent terms.favors more frequent terms.

• Normalize association scores:Normalize association scores:

• Normalized score is 1 if two stems Normalized score is 1 if two stems have the same frequency in all have the same frequency in all documents.documents.

ijjjii

ijij ccc

cs

Page 19: 1 Advanced information retrieval Chapter. 05: Query Reformulation

1919

Metric Correlation MatrixMetric Correlation Matrix

• Association correlation does not Association correlation does not account for the proximity of terms account for the proximity of terms in documents, just co-occurrence in documents, just co-occurrence frequencies within documents.frequencies within documents.

• Metric correlations account for term Metric correlations account for term proximity.proximity.

iu jvVk Vk vu

ijkkr

c),(

1

Vi: Set of all occurrences of term i in any document.r(ku,kv): Distance in words between word occurrences ku and kv

( if ku and kv are occurrences in different documents).

Page 20: 1 Advanced information retrieval Chapter. 05: Query Reformulation

2020

Normalized Metric Correlation Normalized Metric Correlation

MatrixMatrix • Normalize scores to account for Normalize scores to account for

term frequencies:term frequencies:

ji

ijij

VV

cs

Page 21: 1 Advanced information retrieval Chapter. 05: Query Reformulation

2121

Query Expansion with Correlation Query Expansion with Correlation MatrixMatrix

• For each term For each term ii in query, expand in query, expand query with the query with the nn terms, terms, jj, with the , with the highest value of highest value of ccijij ((ssijij).).

• This adds semantically related This adds semantically related terms in the “neighborhood” of the terms in the “neighborhood” of the query terms.query terms.

Page 22: 1 Advanced information retrieval Chapter. 05: Query Reformulation

2222

Problems with Local Problems with Local AnalysisAnalysis• Term ambiguity may introduce Term ambiguity may introduce

irrelevant statistically correlated irrelevant statistically correlated terms.terms.– ““Apple computer” Apple computer” “Apple red fruit “Apple red fruit

computer”computer”

• Since terms are highly correlated Since terms are highly correlated anyway, expansion may not retrieve anyway, expansion may not retrieve many additional documents.many additional documents.

Page 23: 1 Advanced information retrieval Chapter. 05: Query Reformulation

2323

Automatic Global AnalysisAutomatic Global Analysis• Determine term similarity through Determine term similarity through

a pre-computed statistical analysis a pre-computed statistical analysis of the complete corpus.of the complete corpus.

• Compute association matrices Compute association matrices which quantify term correlations in which quantify term correlations in terms of how frequently they co-terms of how frequently they co-occur.occur.

• Expand queries with statistically Expand queries with statistically most similar terms.most similar terms.

Page 24: 1 Advanced information retrieval Chapter. 05: Query Reformulation

2424

Automatic Global AnalysisAutomatic Global Analysis

• There are two modern variants There are two modern variants based on a thesaurus-like structure based on a thesaurus-like structure built using all documents in built using all documents in collectioncollection– Query Expansion based on a Similarity Query Expansion based on a Similarity

ThesaurusThesaurus– Query Expansion based on a Statistical Query Expansion based on a Statistical

ThesaurusThesaurus

Page 25: 1 Advanced information retrieval Chapter. 05: Query Reformulation

2525

ThesaurusThesaurus

•A thesaurus provides information A thesaurus provides information on synonyms and semantically on synonyms and semantically related words and phrases.related words and phrases.

•Example:Example: physician physician

syn: ||croaker, doc, doctor, MD, syn: ||croaker, doc, doctor, MD, medical, mediciner, medico, ||medical, mediciner, medico, ||sawbonessawbones

rel: medic, general practitioner, rel: medic, general practitioner, surgeon, surgeon,

Page 26: 1 Advanced information retrieval Chapter. 05: Query Reformulation

2626

Thesaurus-based Query Thesaurus-based Query ExpansionExpansion

• For each term, For each term, tt, in a query, expand , in a query, expand the query with synonyms and related the query with synonyms and related words of words of tt from the thesaurus. from the thesaurus.

• May weight added terms less than May weight added terms less than original query terms.original query terms.

• Generally increases recall.Generally increases recall.• May significantly decrease precision, May significantly decrease precision,

particularly with ambiguous terms.particularly with ambiguous terms.– ““interest rate” interest rate” “interest rate fascinate “interest rate fascinate

evaluate”evaluate”

Page 27: 1 Advanced information retrieval Chapter. 05: Query Reformulation

2727

Similarity ThesaurusSimilarity Thesaurus

• The similarity thesaurus is based on term to term The similarity thesaurus is based on term to term relationships rather than on a matrix of co-relationships rather than on a matrix of co-occurrence.occurrence.

• This relationship are not derived directly from co-This relationship are not derived directly from co-occurrence of terms inside documents.occurrence of terms inside documents.

• They are obtained by considering that the terms They are obtained by considering that the terms are concepts in a concept space.are concepts in a concept space.

• In this concept space, each term is indexed by the In this concept space, each term is indexed by the documents in which it appears.documents in which it appears.

• Terms assume the original role of documents while Terms assume the original role of documents while documents are interpreted as indexing elementsdocuments are interpreted as indexing elements

Page 28: 1 Advanced information retrieval Chapter. 05: Query Reformulation

2828

Similarity ThesaurusSimilarity Thesaurus

• The following definitions establish The following definitions establish the proper frameworkthe proper framework– t: number of terms in the collectiont: number of terms in the collection– N: number of documents in the N: number of documents in the

collectioncollection

– ffi,ji,j: frequency of occurrence of the term : frequency of occurrence of the term ki in the document djki in the document dj

– ttjj: vocabulary of document d: vocabulary of document djj

– itfitfjj: inverse term frequency for : inverse term frequency for document ddocument djj

Page 29: 1 Advanced information retrieval Chapter. 05: Query Reformulation

2929

Similarity ThesaurusSimilarity Thesaurus

• Inverse term frequency for Inverse term frequency for document ddocument djj

• To kTo kii is associated a vector is associated a vector

jj t

titf log

),....,,(k ,2,1,i Niii www

Page 30: 1 Advanced information retrieval Chapter. 05: Query Reformulation

3030

Similarity ThesaurusSimilarity Thesaurus

• where wwhere wi,ji,j is a weight associated to is a weight associated to index-document pair[kindex-document pair[kii,d,djj]. These ]. These weights are computed as followsweights are computed as follows

N

lj

lil

li

jjij

ji

ji

itff

f

itff

f

w

1

22

,

,

,

,

,

))(max

5.05.0(

))(max

5.05.0(

Page 31: 1 Advanced information retrieval Chapter. 05: Query Reformulation

3131

Similarity ThesaurusSimilarity Thesaurus

• The relationship between two The relationship between two terms kterms kuu and k and kvv is computed as a is computed as a correlation factor c correlation factor c u,vu,v given by given by

• The global similarity thesaurus is The global similarity thesaurus is built through the computation of built through the computation of correlation factor ccorrelation factor cu,vu,v for each pair for each pair of indexing terms [kof indexing terms [kuu,k,kvv] in the ] in the collectioncollection

jd

jv,ju,vuvu, wwkkc

Page 32: 1 Advanced information retrieval Chapter. 05: Query Reformulation

3232

Similarity ThesaurusSimilarity Thesaurus

• This computation is expensiveThis computation is expensive• Global similarity thesaurus has to Global similarity thesaurus has to

be computed only once and can be be computed only once and can be updated incrementallyupdated incrementally

Page 33: 1 Advanced information retrieval Chapter. 05: Query Reformulation

3333

Query Expansion based on a Similarity ThesaurusQuery Expansion based on a Similarity Thesaurus

• Query expansion is done in three Query expansion is done in three steps as follows:steps as follows: Represent the query in the concept space Represent the query in the concept space

used for representation of the index termsused for representation of the index terms2 Based on the global similarity thesaurus, Based on the global similarity thesaurus,

compute a similarity sim(q,kv) between compute a similarity sim(q,kv) between each term kv correlated to the query each term kv correlated to the query terms and the whole query q.terms and the whole query q.

3 Expand the query with the top r ranked Expand the query with the top r ranked terms according to sim(q,kv)terms according to sim(q,kv)

Page 34: 1 Advanced information retrieval Chapter. 05: Query Reformulation

3434

Query Expansion - step oneQuery Expansion - step one

• To the query q is associated a vector To the query q is associated a vector q in the term-concept space given q in the term-concept space given byby

• where wwhere wi,qi,q is a weight associated to is a weight associated to the index-query pair[kthe index-query pair[kii,q],q]

iqk

qi kwi

,q

Page 35: 1 Advanced information retrieval Chapter. 05: Query Reformulation

3535

Query Expansion - step twoQuery Expansion - step two

• Compute a similarity sim(q,kCompute a similarity sim(q,kvv) ) between each term kbetween each term kvv and the user and the user query qquery q

• where cwhere cu,vu,v is the correlation factor is the correlation factor

qk

vu,qu,vv

u

cwkq)ksim(q,

Page 36: 1 Advanced information retrieval Chapter. 05: Query Reformulation

3636

Query Expansion - step threeQuery Expansion - step three

• Add the top r ranked terms according to sim(q,kAdd the top r ranked terms according to sim(q,kvv) ) to the original query q to form the expanded to the original query q to form the expanded query q’query q’

• To each expansion term kTo each expansion term kvv in the query q’ is in the query q’ is assigned a weight wassigned a weight wv,q’v,q’ given by given by

• The expanded query q’ is then used to retrieve The expanded query q’ is then used to retrieve new documents to the usernew documents to the user

qkqu,

vq'v,

u

w

)ksim(q,w

Page 37: 1 Advanced information retrieval Chapter. 05: Query Reformulation

3737

Query Expansion SampleQuery Expansion Sample

• Doc1 = D, D, A, B, C, A, B, CDoc1 = D, D, A, B, C, A, B, C• Doc2 = E, C, E, A, A, DDoc2 = E, C, E, A, A, D• Doc3 = D, C, B, B, D, A, B, C, ADoc3 = D, C, B, B, D, A, B, C, A• Doc4 = ADoc4 = A

• c(A,A) = 10.991c(A,A) = 10.991• c(A,C) = 10.781c(A,C) = 10.781• c(A,D) = 10.781c(A,D) = 10.781• ......• c(D,E) = 10.398c(D,E) = 10.398• c(B,E) = 10.396c(B,E) = 10.396• c(E,E) = 10.224c(E,E) = 10.224

Page 38: 1 Advanced information retrieval Chapter. 05: Query Reformulation

3838

Query Expansion SampleQuery Expansion Sample

• Query: q = A E EQuery: q = A E E• sim(q,A) = 24.298sim(q,A) = 24.298• sim(q,C) = 23.833sim(q,C) = 23.833• sim(q,D) = 23.833sim(q,D) = 23.833• sim(q,B) = 23.830sim(q,B) = 23.830• sim(q,E) = 23.435sim(q,E) = 23.435

• New query: q’ = A C D E ENew query: q’ = A C D E E• w(A,q')= 6.88w(A,q')= 6.88• w(C,q')= 6.75w(C,q')= 6.75• w(D,q')= 6.75w(D,q')= 6.75• w(E,q')= 6.64w(E,q')= 6.64

Page 39: 1 Advanced information retrieval Chapter. 05: Query Reformulation

3939

WordNetWordNet

• A more detailed database of A more detailed database of semantic relationships between semantic relationships between English words.English words.

• Developed by famous cognitive Developed by famous cognitive psychologist George Miller and a psychologist George Miller and a team at Princeton University.team at Princeton University.

• About 144,000 English words.About 144,000 English words.• Nouns, adjectives, verbs, and Nouns, adjectives, verbs, and

adverbs grouped into about 109,000 adverbs grouped into about 109,000 synonym sets called synonym sets called synsetssynsets..

Page 40: 1 Advanced information retrieval Chapter. 05: Query Reformulation

4040

WordNet Synset WordNet Synset RelationshipsRelationships• AntonymAntonym: front : front back back• AttributeAttribute: benevolence : benevolence good (noun to good (noun to

adjective)adjective)• PertainymPertainym: alphabetical : alphabetical alphabet alphabet

(adjective to noun)(adjective to noun)• SimilarSimilar: unquestioning : unquestioning absolute absolute• CauseCause: kill : kill die die• EntailmentEntailment: breathe : breathe inhale inhale• HolonymHolonym: chapter : chapter text (part-of) text (part-of)• MeronymMeronym: computer : computer cpu (whole-of) cpu (whole-of)• Hyponym: Hyponym: tree tree plant (specialization) plant (specialization)• Hypernym:Hypernym: fruit fruit apple (generalization) apple (generalization)

Page 41: 1 Advanced information retrieval Chapter. 05: Query Reformulation

4141

WordNet Query ExpansionWordNet Query Expansion• Add synonyms in the same synset.Add synonyms in the same synset.

• Add hyponyms to add specialized Add hyponyms to add specialized terms.terms.

• Add hypernyms to generalize a Add hypernyms to generalize a query.query.

• Add other related terms to expand Add other related terms to expand query.query.

Page 42: 1 Advanced information retrieval Chapter. 05: Query Reformulation

4242

Statistical ThesaurusStatistical Thesaurus

• Existing human-developed thesauri Existing human-developed thesauri are not easily available in all are not easily available in all languages.languages.

• Human thesuari are limited in the Human thesuari are limited in the type and range of synonymy and type and range of synonymy and semantic relations they represent.semantic relations they represent.

• Semantically related terms can be Semantically related terms can be discovered from statistical analysis discovered from statistical analysis of corpora.of corpora.

Page 43: 1 Advanced information retrieval Chapter. 05: Query Reformulation

4343

Query Expansion Based on a Statistical ThesaurusQuery Expansion Based on a Statistical Thesaurus

• Global thesaurus is composed of classes which group Global thesaurus is composed of classes which group correlated terms in the context of the whole collectioncorrelated terms in the context of the whole collection

• Such correlated terms can then be used to expand the Such correlated terms can then be used to expand the original user queryoriginal user query

• This terms must be low frequency termsThis terms must be low frequency terms• However, it is difficult to cluster low frequency terms However, it is difficult to cluster low frequency terms • To circumvent this problem, we cluster documents To circumvent this problem, we cluster documents

into classes instead and use the low frequency terms into classes instead and use the low frequency terms in these documents to define our thesaurus classes.in these documents to define our thesaurus classes.

• This algorithm must produce small and tight clusters.This algorithm must produce small and tight clusters.

Page 44: 1 Advanced information retrieval Chapter. 05: Query Reformulation

4444

Complete link algorithmComplete link algorithm

• This is document clustering algorithm This is document clustering algorithm with produces small and tight clusterswith produces small and tight clusters– Place each document in a distinct cluster.Place each document in a distinct cluster.– Compute the similarity between all pairs of Compute the similarity between all pairs of

clusters.clusters.– Determine the pair of clusters [CDetermine the pair of clusters [Cuu,C,Cvv] with the ] with the

highest inter-cluster similarity.highest inter-cluster similarity.– Merge the clusters CMerge the clusters Cuu and C and Cvv

– Verify a stop criterion. If this criterion is not met Verify a stop criterion. If this criterion is not met then go back to step 2.then go back to step 2.

– Return a hierarchy of clusters.Return a hierarchy of clusters.• Similarity between two clusters is defined Similarity between two clusters is defined

as the minimum of similarities between as the minimum of similarities between all pair of inter-cluster documentsall pair of inter-cluster documents

Page 45: 1 Advanced information retrieval Chapter. 05: Query Reformulation

4545

Selecting the terms that compose each classSelecting the terms that compose each class

• Given the document cluster Given the document cluster hierarchy for the whole collection, hierarchy for the whole collection, the terms that compose each class the terms that compose each class of the global thesaurus are selected of the global thesaurus are selected as followsas follows– Obtain from the user three parametersObtain from the user three parameters

• TC: Threshold classTC: Threshold class• NDC: Number of documents in classNDC: Number of documents in class• MIDF: Minimum inverse document frequencyMIDF: Minimum inverse document frequency

Page 46: 1 Advanced information retrieval Chapter. 05: Query Reformulation

4646

Selecting the terms that compose each classSelecting the terms that compose each class

• Use the parameter TC as threshold Use the parameter TC as threshold value for determining the document value for determining the document clusters that will be used to clusters that will be used to generate thesaurus classesgenerate thesaurus classes

• This threshold has to be surpassed This threshold has to be surpassed by sim(Cby sim(Cuu,C,Cvv) if the documents in ) if the documents in the clusters Cthe clusters Cuu and C and Cvv are to be are to be selected as sources of terms for a selected as sources of terms for a thesaurus classthesaurus class

Page 47: 1 Advanced information retrieval Chapter. 05: Query Reformulation

4747

Selecting the terms that compose each classSelecting the terms that compose each class

• Use the parameter NDC as a limit Use the parameter NDC as a limit on the size of clusters (number of on the size of clusters (number of documents) to be considered.documents) to be considered.

• A low value of NDC might restrict A low value of NDC might restrict the selection to the smaller cluster the selection to the smaller cluster

Page 48: 1 Advanced information retrieval Chapter. 05: Query Reformulation

4848

Selecting the terms that compose each Selecting the terms that compose each classclass

• Consider the set of document in each Consider the set of document in each document cluster pre-selected above.document cluster pre-selected above.

• Only the lower frequency documents are Only the lower frequency documents are used as sources of terms for the thesaurus used as sources of terms for the thesaurus classesclasses

• The parameter MIDF defines the minimum The parameter MIDF defines the minimum value of inverse document frequency for value of inverse document frequency for any term which is selected to participate in any term which is selected to participate in a thesaurus classa thesaurus class

Page 49: 1 Advanced information retrieval Chapter. 05: Query Reformulation

4949

Query Expansion SampleQuery Expansion Sample

• TC = 0.90 NDC = 2.00 MIDF = 0.2TC = 0.90 NDC = 2.00 MIDF = 0.2

Doc1 = D, D, A, B, C, A, B, CDoc2 = E, C, E, A, A, DDoc3 = D, C, B, B, D, A, B, C, ADoc4 = A

C1

D1 D2D3 D4

C2C3 C4

C1,3

0.99

C1,3,2

0.29

C1,3,2,4

0.00

idf A = 0.0idf B = 0.3idf C = 0.12idf D = 0.12idf E = 0.60 q'=A B E E

q= A E E

sim(1,3) = 0.99sim(1,2) = 0.40sim(1,2) = 0.40sim(2,3) = 0.29sim(4,1) = 0.00sim(4,2) = 0.00sim(4,3) = 0.00

Page 50: 1 Advanced information retrieval Chapter. 05: Query Reformulation

5050

Query Expansion based on a Statistical ThesaurusQuery Expansion based on a Statistical Thesaurus

• Problems with this approachProblems with this approach– initialization of parameters TC,NDC and initialization of parameters TC,NDC and

MIDFMIDF– TC depends on the collection TC depends on the collection – Inspection of the cluster hierarchy is almost Inspection of the cluster hierarchy is almost

always necessary for assisting with the always necessary for assisting with the setting of TC.setting of TC.

– A high value of TC might yield classes with A high value of TC might yield classes with too few termstoo few terms

Page 51: 1 Advanced information retrieval Chapter. 05: Query Reformulation

5151

Global vs. Local AnalysisGlobal vs. Local Analysis

• Global analysis requires intensive term Global analysis requires intensive term correlation computation only once at correlation computation only once at system development time.system development time.

• Local analysis requires intensive term Local analysis requires intensive term correlation computation for every correlation computation for every query at run time (although number of query at run time (although number of terms and documents is less than in terms and documents is less than in global analysis).global analysis).

• But local analysis gives better results.But local analysis gives better results.

Page 52: 1 Advanced information retrieval Chapter. 05: Query Reformulation

5252

Query Expansion Query Expansion ConclusionsConclusions• Expansion of queries with related Expansion of queries with related

terms can improve performance, terms can improve performance, particularly recall.particularly recall.

• However, must select similar terms However, must select similar terms very carefully to avoid problems, very carefully to avoid problems, such as loss of precision.such as loss of precision.

Page 53: 1 Advanced information retrieval Chapter. 05: Query Reformulation

5353

ConclusionConclusion• Thesaurus is a efficient method to Thesaurus is a efficient method to

expand queriesexpand queries• The computation is expensive but it The computation is expensive but it

is executed only onceis executed only once• Query expansion based on Query expansion based on

similarity thesaurus may use high similarity thesaurus may use high term frequency to expand the queryterm frequency to expand the query

• Query expansion based on Query expansion based on statistical thesaurus need well statistical thesaurus need well defined parametersdefined parameters