A paper by Tao Tao, Xuanhui Wang, Qiaozhu Mei, ChengXiang Zhai
Presented by Kumar Ashish
INF384H/CS395T: Concepts of Information Retrieval (and Web Search), Fall 2011

Language Model Information Retrieval with Document Expansion


Page 1: Language Model Information Retrieval with Document Expansion

Page 2: Language Model Information Retrieval with Document Expansion

Zero-Count Problem: a term that is a plausible word for the information need may not occur in the document.

General Estimation Problem: terms occurring only once are overestimated, because their occurrence was partly due to chance.

To solve these problems, high-quality external data is required to enlarge the document sample.

Page 3: Language Model Information Retrieval with Document Expansion

This gives the average logarithmic distance between the probabilities that a word would be observed at random from the unigram query language model and from the unigram document language model.
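Concretely, the KL-divergence ranking function this describes (standard in language-model IR) scores a document by the negative divergence between the query model and the document model:

```latex
\mathrm{score}(q, d) \;=\; -D\!\left(\theta_q \,\|\, \theta_d\right)
\;=\; -\sum_{w \in V} p(w \mid \theta_q)\,\log\frac{p(w \mid \theta_q)}{p(w \mid \theta_d)}
```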

Page 4: Language Model Information Retrieval with Document Expansion

c(w, d) is the number of times word w occurs in document d, and |d| is the length of the document.
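With these definitions, the maximum-likelihood estimate of the document language model is:

```latex
p_{ml}(w \mid \theta_d) \;=\; \frac{c(w, d)}{|d|}
```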

Problems:
• Assigns zero probability to any word not present in the document, causing problems when scoring a document with KL-divergence.

Page 5: Language Model Information Retrieval with Document Expansion

Jelinek-Mercer(JM) Smoothing

Dirichlet Smoothing

Page 6: Language Model Information Retrieval with Document Expansion

Proposes a fixed parameter λ to control interpolation.

p(w | Θc) denotes the probability of word w given by the collection model Θc.
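In symbols, the JM-smoothed document model with fixed interpolation weight λ is:

```latex
p_\lambda(w \mid d) \;=\; (1 - \lambda)\, p_{ml}(w \mid d) \;+\; \lambda\, p(w \mid \Theta_C)
```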

Page 7: Language Model Information Retrieval with Document Expansion

It uses a document-dependent coefficient (parameterized with μ) to control the interpolation.
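Both smoothing schemes are easy to sketch in code. Dirichlet smoothing uses p_μ(w|d) = (c(w,d) + μ·p(w|Θc)) / (|d| + μ), so the effective interpolation weight shrinks as the document grows. A minimal runnable sketch (function names and toy parameters are illustrative, not from the paper):

```python
from collections import Counter

def jm_smoothed(word, doc, collection, lam=0.5):
    """Jelinek-Mercer: fixed-weight interpolation of the document ML
    estimate with the collection model."""
    d, c = Counter(doc), Counter(collection)
    p_ml = d[word] / len(doc)
    p_coll = c[word] / len(collection)
    return (1 - lam) * p_ml + lam * p_coll

def dirichlet_smoothed(word, doc, collection, mu=2000):
    """Dirichlet: interpolation weight depends on document length |d|."""
    d, c = Counter(doc), Counter(collection)
    p_coll = c[word] / len(collection)
    return (d[word] + mu * p_coll) / (len(doc) + mu)
```

Under both schemes the smoothed probabilities still sum to one over the vocabulary, and words unseen in the document now receive nonzero probability from the collection model.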

Page 8: Language Model Information Retrieval with Document Expansion

Uses clustering information to smooth a document:

◦ Divides all documents into K clusters.
◦ First smoothes the cluster model with the collection model using Dirichlet smoothing.
◦ Then takes the smoothed cluster model as a new reference model to smooth the document using JM smoothing.

Page 9: Language Model Information Retrieval with Document Expansion

ΘLd stands for document d's cluster model, and λ and β are smoothing parameters.
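A reconstruction of the CBDM estimate consistent with the two-stage description (Dirichlet with β at the cluster level, then JM with λ at the document level); the exact notation on the original slide may differ:

```latex
p(w \mid \theta_d) \;=\; (1 - \lambda)\, p_{ml}(w \mid d)
\;+\; \lambda\, \frac{c(w, L_d) + \beta\, p(w \mid \Theta_C)}{|L_d| + \beta}
```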

Page 10: Language Model Information Retrieval with Document Expansion

Better than JM or Dirichlet smoothing: it expands a document with more data from its cluster, instead of using the same collection language model for every document.

Page 11: Language Model Information Retrieval with Document Expansion

A cluster D may be good for smoothing document a but not for document d.

Ideally each document should have its own cluster centered around itself.

Page 12: Language Model Information Retrieval with Document Expansion

Expand each document using its probabilistic neighborhood to estimate a virtual document (d').

Apply any interpolation-based method (e.g., JM or Dirichlet) to this virtual document, treating the word counts given by the virtual document as if they were the original word counts.

Page 13: Language Model Information Retrieval with Document Expansion

Cosine similarity can be used to determine the documents in the neighborhood of the original document.

Problems:
◦ In a narrow sense the neighborhood would contain only a few documents, whereas in a wide sense the whole collection may be included.
◦ Neighbor documents cannot be assumed to be sampled the same way as the original document.

Page 14: Language Model Information Retrieval with Document Expansion

Associates a Confidence Value with every document in the collection

◦ This Confidence Value reflects the belief that the document is sampled from the same underlying model as the original one.

Page 15: Language Model Information Retrieval with Document Expansion

A confidence value (γd) is associated with every document to indicate how strongly we believe it was sampled from document d's underlying model.

The confidence value is assumed to follow a normal distribution:
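One Gaussian-shaped form consistent with this slide, where dist(d, b) is a distance (e.g., one minus cosine similarity) and σ is a spread parameter; both are assumptions here, since the slide omits the formula:

```latex
\gamma_b \;\propto\; \exp\!\left(-\,\frac{\mathrm{dist}(d, b)^2}{2\sigma^2}\right)
```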

Page 16: Language Model Information Retrieval with Document Expansion

Shorter documents require more help from their neighbors.

Longer documents can rely more on themselves.

A parameter α is introduced to control this balance.
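Putting the pieces together, the virtual-document counts can be written as an α-weighted interpolation of the document's own counts with normalized, confidence-weighted neighbor counts (notation assumed, not copied from the slide):

```latex
c(w, d') \;=\; \alpha\, c(w, d) \;+\; (1 - \alpha) \sum_{b \neq d} \frac{\gamma_b}{\sum_{b' \neq d} \gamma_{b'}}\; c(w, b)
```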

Page 17: Language Model Information Retrieval with Document Expansion

For efficiency: the pseudo term counts can be calculated using only the top M closest neighbors (since the confidence value follows a decay shape).
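The whole expansion step, neighborhood confidence plus top-M truncation, can be sketched as follows; the Gaussian confidence shape, σ, and all function names are illustrative assumptions rather than the paper's exact recipe:

```python
import math
from collections import Counter

def cosine_sim(d1, d2):
    """Cosine similarity between two bag-of-words documents."""
    c1, c2 = Counter(d1), Counter(d2)
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def pseudo_counts(doc, corpus, alpha=0.5, top_m=100, sigma=0.3):
    """Virtual-document counts: interpolate the document's own counts
    with confidence-weighted counts from its top-M closest neighbors."""
    # Gaussian-shaped confidence from cosine distance (a modeling assumption here).
    scored = []
    for other in corpus:
        if other is doc:
            continue
        dist = 1.0 - cosine_sim(doc, other)
        scored.append((math.exp(-dist ** 2 / (2 * sigma ** 2)), other))
    # Keep only the top-M most confident neighbors for efficiency.
    scored.sort(key=lambda x: -x[0])
    scored = scored[:top_m]
    total_conf = sum(g for g, _ in scored) or 1.0
    counts = Counter()
    for w, c in Counter(doc).items():
        counts[w] += alpha * c
    for gamma, other in scored:
        for w, c in Counter(other).items():
            counts[w] += (1 - alpha) * (gamma / total_conf) * c
    return counts
```

Because the Gaussian confidence decays quickly with distance, truncating to the top M neighbors changes the pseudo counts very little, which is what makes the efficiency shortcut safe.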

Page 18: Language Model Information Retrieval with Document Expansion

For performance comparison, it uses four TREC data sets:
◦ AP (Associated Press news, 1988-90)
◦ LA (LA Times)
◦ WSJ (Wall Street Journal, 1987-92)
◦ SJMN (San Jose Mercury News, 1991)

For testing how the algorithm scales up:
◦ Uses TREC8

For testing the effect on short documents:
◦ Uses DOE (Department of Energy)

Page 19: Language Model Information Retrieval with Document Expansion

λ for JM and μ for Dirichlet are set to their optimal values, and the same λ or μ is used for DELM without further tuning. M is 100 and α is 0.5 for DELM. DELM outperforms JM and Dirichlet on every data set, with improvements as high as 15% in the case of the Associated Press news (AP).

Comparison of DELM +(Diri/JM) with Diri/JM

Page 20: Language Model Information Retrieval with Document Expansion

Compares precision values at different levels of recall for the AP data set. DELM + Dirichlet outperforms Dirichlet at every precision point.

Precision-Recall Curve on AP Data

Page 21: Language Model Information Retrieval with Document Expansion

Compares the performance trend with respect to M (the top M closest neighbors for each document).

Conclusion: neighborhood information improves retrieval accuracy. Performance becomes insensitive to M once M is sufficiently large.

Performance change with respect to M

Page 22: Language Model Information Retrieval with Document Expansion

DELM + Dirichlet outperforms CBDM in MAP values on all four data sets.

Comparison of DELM+Dirichlet with CBDM

Page 23: Language Model Information Retrieval with Document Expansion

Documents in AP88-89 were shrunk to 30% of their original length in the first experiment, 50% in the second, and 70% in the third. Results show that DELM helps shorter documents more than longer ones (improvements range from 41% on the 30%-length corpus down to 16% on the full-length corpus).

Page 24: Language Model Information Retrieval with Document Expansion

The optimal point migrates as documents become shorter (the 100%-length corpus is optimal at α = 0.4, but the 30%-length corpus has to use α = 0.2).

Performance change with respect to α

Page 25: Language Model Information Retrieval with Document Expansion

DELM is combined with the model-based feedback proposed in (Zhai and Lafferty, 2001a). The experiment was performed by:

◦ Retrieving documents with the DELM method
◦ Choosing the top five documents to do model-based feedback
◦ Using the expanded query model to retrieve documents again

Result: DELM can be combined with pseudo feedback to improve performance

Combination of DELM with Pseudo Feedback

Page 26: Language Model Information Retrieval with Document Expansion

References:
◦ http://sifaka.cs.uiuc.edu/czhai/pub/hlt06-exp.pdf
◦ http://nlp.stanford.edu/IR-book/pdf/12lmodel.pdf
◦ http://krisztianbalog.com/files/sigir2008-csiro.pdf