A paper by Tao Tao, Xuanhui Wang, Qiaozhu Mei, ChengXiang Zhai
Presented by Kumar Ashish
INF384H/CS395T: Concepts of Information Retrieval (and Web Search), Fall 2011

Language Model Information Retrieval with Document Expansion


Page 1: Language Model Information Retrieval with Document Expansion

Page 2: Language Model Information Retrieval with Document Expansion

Zero-Count Problem: a term that is a plausible word for the information need may not occur in the document.

General Estimation Problem: terms occurring only once are overestimated, because their occurrence was partly due to chance.

To solve these problems, high-quality external data is required to enlarge the document sample.

Page 3: Language Model Information Retrieval with Document Expansion

This gives the average logarithmic distance between the probabilities that a word would be observed at random from the unigram query language model and from the unigram document language model.
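Concretely, the KL-divergence ranking function this describes (standard in language-model IR) scores a document by the negative divergence between the query model and the document model:

```latex
\mathrm{score}(q, d) \;=\; -D\!\left(\theta_q \,\|\, \theta_d\right)
\;=\; -\sum_{w \in V} p(w \mid \theta_q)\,\log\frac{p(w \mid \theta_q)}{p(w \mid \theta_d)}
```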

Page 4: Language Model Information Retrieval with Document Expansion

c(w, d) is the number of times word w occurs in document d, and |d| is the length of the document.
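With these definitions, the maximum-likelihood estimate of the document language model is:

```latex
p_{ml}(w \mid \theta_d) \;=\; \frac{c(w, d)}{|d|}
```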

Problems:
• Assigns zero probability to any word not present in the document, causing problems when scoring a document with KL-divergence.

Page 5: Language Model Information Retrieval with Document Expansion

Jelinek-Mercer(JM) Smoothing

Dirichlet Smoothing

Page 6: Language Model Information Retrieval with Document Expansion

Proposes a fixed parameter λ to control interpolation.

p(w | Θc) denotes the probability of word w given by the collection model Θc.
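In symbols, the JM-smoothed document model with fixed interpolation weight λ is:

```latex
p_\lambda(w \mid d) \;=\; (1 - \lambda)\, p_{ml}(w \mid d) \;+\; \lambda\, p(w \mid \Theta_C)
```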

Page 7: Language Model Information Retrieval with Document Expansion

It uses a document-dependent coefficient (parameterized with μ) to control the interpolation.
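Both smoothing schemes are easy to sketch in code. Dirichlet smoothing uses p_μ(w|d) = (c(w,d) + μ·p(w|Θc)) / (|d| + μ), so the effective interpolation weight shrinks as the document grows. A minimal runnable sketch (function names and toy parameters are illustrative, not from the paper):

```python
from collections import Counter

def jm_smoothed(word, doc, collection, lam=0.5):
    """Jelinek-Mercer: fixed-weight interpolation of the document ML
    estimate with the collection model."""
    d, c = Counter(doc), Counter(collection)
    p_ml = d[word] / len(doc)
    p_coll = c[word] / len(collection)
    return (1 - lam) * p_ml + lam * p_coll

def dirichlet_smoothed(word, doc, collection, mu=2000):
    """Dirichlet: interpolation weight depends on document length |d|."""
    d, c = Counter(doc), Counter(collection)
    p_coll = c[word] / len(collection)
    return (d[word] + mu * p_coll) / (len(doc) + mu)
```

Under both schemes the smoothed probabilities still sum to one over the vocabulary, and words unseen in the document now receive nonzero probability from the collection model.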

Page 8: Language Model Information Retrieval with Document Expansion

Uses clustering information to smooth a document:

◦ Divides all documents into K clusters.
◦ First smoothes the cluster model with the collection model using Dirichlet smoothing.
◦ Then takes the smoothed cluster model as a new reference model to smooth the document using JM smoothing.

Page 9: Language Model Information Retrieval with Document Expansion

ΘLd stands for document d's cluster model, and λ and β are smoothing parameters.
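A reconstruction of the CBDM estimate consistent with the two-stage description (Dirichlet with β at the cluster level, then JM with λ at the document level); the exact notation on the original slide may differ:

```latex
p(w \mid \theta_d) \;=\; (1 - \lambda)\, p_{ml}(w \mid d)
\;+\; \lambda\, \frac{c(w, L_d) + \beta\, p(w \mid \Theta_C)}{|L_d| + \beta}
```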

Page 10: Language Model Information Retrieval with Document Expansion

Better than JM or Dirichlet smoothing: it expands a document with more data from its cluster, instead of using the same collection language model for every document.

Page 11: Language Model Information Retrieval with Document Expansion

A cluster D may be good for smoothing document a but not for document d.

Ideally each document should have its own cluster centered around itself.

Page 12: Language Model Information Retrieval with Document Expansion

Expand each document using its probabilistic neighborhood to estimate a virtual document (d').

Apply any interpolation-based method (e.g., JM or Dirichlet) to this virtual document, treating the word counts given by the virtual document as if they were the original word counts.

Page 13: Language Model Information Retrieval with Document Expansion

Cosine similarity can be used to determine the documents in the neighborhood of the original document.

Problems:
◦ In a narrow sense the neighborhood would contain only a few documents, whereas in a wide sense the whole collection may be included.
◦ Neighbor documents cannot be assumed to be sampled the same way as the original document.

Page 14: Language Model Information Retrieval with Document Expansion

Associates a Confidence Value with every document in the collection

◦ This Confidence Value reflects the belief that the document is sampled from the same underlying model as the original one.

Page 15: Language Model Information Retrieval with Document Expansion

A confidence value (γd) is associated with every document to indicate how strongly we believe it was sampled from document d's underlying model.

The confidence value is assumed to follow a normal distribution:
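One Gaussian-shaped form consistent with this slide, where dist(d, b) is a distance (e.g., one minus cosine similarity) and σ is a spread parameter; both are assumptions here, since the slide omits the formula:

```latex
\gamma_b \;\propto\; \exp\!\left(-\,\frac{\mathrm{dist}(d, b)^2}{2\sigma^2}\right)
```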

Page 16: Language Model Information Retrieval with Document Expansion

Shorter documents require more help from their neighbors.

Longer documents can rely more on themselves.

A parameter α is introduced to control this balance.
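Putting the pieces together, the virtual-document counts can be written as an α-weighted interpolation of the document's own counts with normalized, confidence-weighted neighbor counts (notation assumed, not copied from the slide):

```latex
c(w, d') \;=\; \alpha\, c(w, d) \;+\; (1 - \alpha) \sum_{b \neq d} \frac{\gamma_b}{\sum_{b' \neq d} \gamma_{b'}}\; c(w, b)
```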

Page 17: Language Model Information Retrieval with Document Expansion

For efficiency: the pseudo term counts can be calculated using only the top M closest neighbors (since the confidence value follows a decay shape).
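The whole expansion step, neighborhood confidence plus top-M truncation, can be sketched as follows; the Gaussian confidence shape, σ, and all function names are illustrative assumptions rather than the paper's exact recipe:

```python
import math
from collections import Counter

def cosine_sim(d1, d2):
    """Cosine similarity between two bag-of-words documents."""
    c1, c2 = Counter(d1), Counter(d2)
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def pseudo_counts(doc, corpus, alpha=0.5, top_m=100, sigma=0.3):
    """Virtual-document counts: interpolate the document's own counts
    with confidence-weighted counts from its top-M closest neighbors."""
    # Gaussian-shaped confidence from cosine distance (a modeling assumption here).
    scored = []
    for other in corpus:
        if other is doc:
            continue
        dist = 1.0 - cosine_sim(doc, other)
        scored.append((math.exp(-dist ** 2 / (2 * sigma ** 2)), other))
    # Keep only the top-M most confident neighbors for efficiency.
    scored.sort(key=lambda x: -x[0])
    scored = scored[:top_m]
    total_conf = sum(g for g, _ in scored) or 1.0
    counts = Counter()
    for w, c in Counter(doc).items():
        counts[w] += alpha * c
    for gamma, other in scored:
        for w, c in Counter(other).items():
            counts[w] += (1 - alpha) * (gamma / total_conf) * c
    return counts
```

Because the Gaussian confidence decays quickly with distance, truncating to the top M neighbors changes the pseudo counts very little, which is what makes the efficiency shortcut safe.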

Page 18: Language Model Information Retrieval with Document Expansion

For performance comparison, it uses four TREC data sets:
◦ AP (Associated Press news, 1988-90)
◦ LA (LA Times)
◦ WSJ (Wall Street Journal, 1987-92)
◦ SJMN (San Jose Mercury News, 1991)

For testing how the algorithm scales up:
◦ Uses TREC8

For testing the effect on short documents:
◦ Uses DOE (Department of Energy)

Page 19: Language Model Information Retrieval with Document Expansion

λ for JM and μ for Dirichlet are set to their optimal values, and the same λ or μ is used for DELM without further tuning. M is 100 and α is 0.5 for DELM. DELM outperforms JM and Dirichlet on every data set, with improvements as high as 15% in the case of the Associated Press news (AP).

Comparison of DELM +(Diri/JM) with Diri/JM

Page 20: Language Model Information Retrieval with Document Expansion

Compares precision values at different levels of recall for the AP data set. DELM + Dirichlet outperforms Dirichlet at every precision point.

Precision-Recall Curve on AP Data

Page 21: Language Model Information Retrieval with Document Expansion

Compares the performance trend with respect to M (the top M closest neighbors for each document).

Conclusion: neighborhood information improves retrieval accuracy. Performance becomes insensitive to M once M is sufficiently large.

Performance change with respect to M

Page 22: Language Model Information Retrieval with Document Expansion

DELM + Dirichlet outperforms CBDM in MAP values on all four data sets.

Comparison of DELM+Dirichlet with CBDM

Page 23: Language Model Information Retrieval with Document Expansion

Documents in AP88-89 were shrunk to 30% of their original length in the first experiment, 50% in the second, and 70% in the third. Results show that DELM helps shorter documents more than longer ones (improvements range from 41% on the 30%-length corpus down to 16% on the full-length corpus).

Page 24: Language Model Information Retrieval with Document Expansion

The optimal point migrates as documents become shorter (the 100%-length corpus is optimal at α = 0.4, but the 30%-length corpus has to use α = 0.2).

Performance change with respect to α

Page 25: Language Model Information Retrieval with Document Expansion

DELM is combined with the model-based feedback proposed in (Zhai and Lafferty, 2001a). The experiment was performed by:

◦ Retrieving documents with the DELM method
◦ Choosing the top five documents to do model-based feedback
◦ Using the expanded query model to retrieve documents again

Result: DELM can be combined with pseudo feedback to improve performance

Combination of DELM with Pseudo Feedback

Page 26: Language Model Information Retrieval with Document Expansion

References:
◦ http://sifaka.cs.uiuc.edu/czhai/pub/hlt06-exp.pdf
◦ http://nlp.stanford.edu/IR-book/pdf/12lmodel.pdf
◦ http://krisztianbalog.com/files/sigir2008-csiro.pdf