Language Models for Information Retrieval
References:
1. W. B. Croft and J. Lafferty (Editors). Language Modeling for Information Retrieval. July 2003
2. X. Liu and W. B. Croft. Statistical Language Modeling for Information Retrieval. Annual Review of Information Science and Technology, vol. 39, 2005
3. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008 (Chapter 12)
4. D. A. Grossman and O. Frieder. Information Retrieval: Algorithms and Heuristics. Springer, 2004 (Chapter 2)
5. C. X. Zhai. Statistical Language Models for Information Retrieval (Synthesis Lectures on Human Language Technologies). Morgan & Claypool Publishers, 2008
Berlin Chen
Department of Computer Science & Information Engineering
National Taiwan Normal University
IR – Berlin Chen 2
Taxonomy of Classic IR Models
• Document property: Text, Links, Multimedia; Proximal Nodes and others; XML-based; Semi-structured Text
• Classic Models: Boolean, Vector, Probabilistic
• Set Theoretic: Fuzzy, Extended Boolean, Set-based
• Probabilistic: BM25, Language Models, Divergence from Randomness, Bayesian Networks
• Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks, Support Vector Machines
• Web: PageRank, Hubs & Authorities
• Multimedia Retrieval: Image Retrieval, Audio and Music Retrieval, Video Retrieval
Statistical Language Models (1/2)
• A probabilistic mechanism for "generating" a piece of text
  – Defines a distribution over all possible word sequences W = w_1 w_2 … w_L
  – The LM is used to quantify the acceptability of a given word sequence: P(W) = ?
• What is an LM used for?
  – Speech recognition
  – Spelling correction
  – Handwriting recognition
  – Optical character recognition
  – Machine translation
  – Document classification and routing
  – Information retrieval, …
Statistical Language Models (2/2)
• (Statistical) language models (LMs) have been widely used for speech recognition and language (machine) translation for more than thirty years
• However, their use for information retrieval started only in 1998 [Ponte and Croft, SIGIR 1998]
  – Basically, a query is considered to be generated from an "ideal" document that satisfies the information need
  – The system's job is then to estimate the likelihood of each document in the collection being the ideal document, and to rank them accordingly (in decreasing order)
Ponte and Croft. A language modeling approach to information retrieval. SIGIR 1998
Three Ways of Developing LM Approaches for IR
(a) Query likelihood
(b) Document likelihood
(c) Model comparison
literal term matching or concept matching
Query-Likelihood Language Models
• Criterion: documents are ranked based on Bayes (decision) rule

  P(D|Q) = P(Q|D) P(D) / P(Q)

  – P(Q) is the same for all documents and can be ignored
  – P(D) might have to do with authority, length, genre, etc.
    • There is no general way to estimate it
    • It can be treated as uniform across all documents
• Documents can therefore be ranked based on P(Q|D), also denoted P(Q|M_D), where M_D is the document model
  – The user has a prototype (ideal) document in mind and generates a query based on words that appear in this document
  – A document is treated as a model to predict (generate) the query
Another Criterion: Maximum Mutual Information
• Documents can be ranked based on their mutual information with the query (in decreasing order)

  MI(Q,D) = log [ P(Q,D) / ( P(Q) P(D) ) ] = log P(Q|D) − log P(Q)

  – log P(Q) is the same for all documents and hence can be ignored
• Document ranking by mutual information (MI) is therefore equivalent to ranking by likelihood:

  argmax_D MI(Q,D) = argmax_D P(Q|D)   (rank equivalence)
Yet Another Criterion: Minimum KL Divergence
• Documents are ranked by Kullback-Leibler (KL) divergence (in increasing order)

  KL(Q‖D) = Σ_w P(w|Q) log [ P(w|Q) / P(w|D) ]
          = Σ_w P(w|Q) log P(w|Q) − Σ_w P(w|Q) log P(w|D)

  – P(w|Q): query model; P(w|D): document model
  – The first term is the same for all documents and can be disregarded
  – The second term, − Σ_w P(w|Q) log P(w|D), is the cross entropy between the language models of the query and the document: relevant documents are deemed to have lower cross entropies
  – Equivalent to ranking in decreasing order of Σ_w c(w,Q) log P(w|D)
Schematic Depiction for Query-Likelihood Approach
(Schematic: each document D_1, D_2, D_3, … in the document collection is represented by its own document model M_D1, M_D2, M_D3, …, against which the query Q is scored.)
Building Document Models: n-grams
• Multiplication (chain) rule
  – Decomposes the probability of a sequence of events into the probabilities of successive events conditioned on earlier events

  P(w_1 w_2 … w_L) = P(w_1) P(w_2|w_1) P(w_3|w_1 w_2) … P(w_L|w_1 … w_{L−1})

• n-gram assumption
  – Unigram
    • Each word occurs independently of the other words
    • The so-called "bag-of-words" model (e.g., it cannot distinguish "street market" from "market street")

    P(w_1 w_2 … w_L) = P(w_1) P(w_2) P(w_3) … P(w_L)

  – Bigram

    P(w_1 w_2 … w_L) = P(w_1) P(w_2|w_1) P(w_3|w_2) … P(w_L|w_{L−1})

  – Most language-modeling work in IR has used unigram models
    • IR does not directly depend on the structure of sentences
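The bag-of-words limitation mentioned above ("street market" vs. "market street") can be illustrated directly. A minimal sketch with hypothetical toy probabilities:

```python
def unigram_prob(seq, p1):
    """Unigram: P(w1...wL) = prod_i P(w_i) -- order is ignored."""
    prob = 1.0
    for w in seq:
        prob *= p1[w]
    return prob

def bigram_prob(seq, p1, p2):
    """Bigram: P(w1) * prod_i P(w_i | w_{i-1}) -- order matters."""
    prob = p1[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= p2[(prev, cur)]
    return prob

# Hypothetical probabilities, chosen only to make the point.
p1 = {"street": 0.5, "market": 0.5}
p2 = {("street", "market"): 0.7, ("market", "street"): 0.1}

# The unigram model assigns both orders the same probability...
assert unigram_prob(["street", "market"], p1) == unigram_prob(["market", "street"], p1)
# ...while the bigram model distinguishes them.
assert bigram_prob(["street", "market"], p1, p2) != bigram_prob(["market", "street"], p1, p2)
```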
Unigram Model (1/4)
• The likelihood of a query Q = w_1 w_2 … w_L given a document D

  P(Q|M_D) = P(w_1|M_D) P(w_2|M_D) … P(w_L|M_D) = ∏_{i=1}^{L} P(w_i|M_D)

  – Words are conditionally independent of each other given the document
  – How do we estimate the probability of a (query) word given the document?
• Assume that words follow a multinomial distribution given the document

  P(c(w_1), c(w_2), …, c(w_|V|) | M_D) = [ (Σ_i c(w_i))! / ∏_i c(w_i)! ] · ∏_i λ_{w_i}^{c(w_i)}

  where c(w_i) is the number of times word w_i occurs, λ_{w_i} = P(w_i|M_D), and permutation is considered here
Unigram Model (2/4)
• Use each document itself as a sample for estimating its corresponding unigram (multinomial) model
  – If Maximum Likelihood Estimation (MLE) is adopted:

  P(w_i|M_D) = c(w_i,D) / |D|

  where c(w_i,D) is the number of times w_i occurs in D, and |D| = Σ_j c(w_j,D) is the length of D

  – Example: a document D of length 10 containing w_a four times, w_b three times, w_c twice and w_d once gives
    P(w_a|M_D) = 0.4, P(w_b|M_D) = 0.3, P(w_c|M_D) = 0.2, P(w_d|M_D) = 0.1, P(w_e|M_D) = P(w_f|M_D) = 0.0

• The zero-probability problem
  – If w_e and w_f do not occur in D, then P(w_e|M_D) = P(w_f|M_D) = 0
  – This will cause a problem in predicting the query likelihood (see the equation for the query likelihood in the preceding slide)
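The MLE estimate and the resulting zero-probability problem can be sketched as follows (the toy document mirrors the counts in the example above):

```python
from collections import Counter

def mle_unigram(doc_tokens):
    """P(w | M_D) = c(w, D) / |D| (maximum likelihood estimate)."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {w: c / total for w, c in counts.items()}

# Toy document: wa x4, wb x3, wc x2, wd x1 (length 10).
doc = ["wa"] * 4 + ["wb"] * 3 + ["wc"] * 2 + ["wd"]
model = mle_unigram(doc)
assert model["wa"] == 0.4 and model["wb"] == 0.3
assert model["wc"] == 0.2 and model["wd"] == 0.1

# Zero-probability problem: one unseen query word zeroes the whole likelihood.
query = ["wa", "we"]                 # "we" does not occur in D
likelihood = 1.0
for w in query:
    likelihood *= model.get(w, 0.0)  # unseen words get probability 0
assert likelihood == 0.0
```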
Unigram Model (3/4)
• Smooth the document-specific unigram model with a collection model (two states, or a mixture of two multinomials), for a query Q = w_1 w_2 … w_L:

  P(Q|M_D) = ∏_{i=1}^{L} [ λ P(w_i|M_D) + (1 − λ) P(w_i|M_C) ]

• The role of the collection unigram model P(w_i|M_C)
  – Helps to solve the zero-probability problem
  – Helps to differentiate the contributions of different missing terms in a document (global information like IDF?)
• The collection unigram model can be estimated in a similar way to the document-specific unigram model

  P(w_i|M_C) = c(w_i,Collection) / Σ_j c(w_j,Collection)   or   P(w_i|M_C) = n_{w_i} / N   (normalized doc frequency)

  where n_{w_i} is the number of docs in the collection containing w_i, and N is the number of docs in the collection
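A minimal sketch of the smoothed query likelihood (the λ value and toy models are illustrative, not from the slides):

```python
def query_likelihood(query, doc_model, coll_model, lam=0.7):
    """P(Q|M_D) = prod_i [ lam * P(w_i|M_D) + (1 - lam) * P(w_i|M_C) ]."""
    prob = 1.0
    for w in query:
        prob *= lam * doc_model.get(w, 0.0) + (1 - lam) * coll_model[w]
    return prob

# Toy models (hypothetical values; coll_model must cover the vocabulary).
doc_model = {"wa": 0.4, "wb": 0.3, "wc": 0.2, "wd": 0.1}
coll_model = {"wa": 0.2, "wb": 0.2, "wc": 0.2, "wd": 0.2, "we": 0.2}

# The unseen word "we" no longer zeroes the likelihood.
assert query_likelihood(["wa", "we"], doc_model, coll_model) > 0.0
```

Jelinek-Mercer interpolation like this keeps every query word's factor strictly positive as long as the collection model covers the vocabulary.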
Unigram Model (4/4)
• An evaluation on the Topic Detection and Tracking (TDT) corpora

  – Language Model:

    mAP           Unigram   Unigram+Bigram
    TDT2  TQ/TD   0.6327    0.5427
    TDT2  TQ/SD   0.5658    0.4803
    TDT3  TQ/TD   0.6569    0.6141
    TDT3  TQ/SD   0.6308    0.5808

  – Vector Space Model:

    mAP           Unigram   Unigram+Bigram
    TDT2  TQ/TD   0.5548    0.5623
    TDT2  TQ/SD   0.5122    0.5225
    TDT3  TQ/TD   0.6505    0.6531
    TDT3  TQ/SD   0.6216    0.6233

  – Model equations:

    P_Unigram(Q|M_D) = ∏_{i=1}^{L} [ λ_1 P(w_i|M_D) + λ_2 P(w_i|M_C) ],   λ_1 + λ_2 = 1

    P_Unigram+Bigram(Q|M_D) = ∏_{i=1}^{L} [ λ_1 P(w_i|M_D) + λ_2 P(w_i|M_C) + λ_3 P(w_i|w_{i−1},M_D) ],   λ_1 + λ_2 + λ_3 = 1
    (for i = 1 only the unigram terms apply)

• Consideration of contextual information (higher-order language models, e.g., bigrams) will not always lead to improved performance
Training Mixture Weights of LMs
• Expectation-Maximization (EM) training
  – The weights are tied among the documents
  – E.g., m_1 of the Type I HMM can be re-estimated using the following equation:

    m̂_1 = [ Σ_{Q∈TrainSet} Σ_{D∈Doc_Q} Σ_{q_n∈Q} m_1 P(q_n|D) / ( m_1 P(q_n|D) + m_2 P(q_n|Corpus) ) ] / [ Σ_{Q∈TrainSet} |Doc_Q| · |Q| ]

    where TrainSet is the set of training query exemplars, Doc_Q is the set of docs that are relevant to a specific training query exemplar Q, |Q| is the length of the query Q, and |Doc_Q| is the total number of docs relevant to the query
  – m_1 on the right-hand side is the old weight; m̂_1 is the new weight (reported setup: 819 queries, ≤ 2265 docs)
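One tied EM re-estimation step can be sketched as follows (the training set, document and corpus probability functions below are hypothetical placeholders):

```python
def em_update_m1(train_set, m1, p_doc, p_corpus):
    """One EM re-estimate of the tied document-model weight m1 (with m2 = 1 - m1).

    train_set: list of (query_terms, relevant_doc_ids) pairs.
    p_doc(q, d): P(q|D);  p_corpus(q): P(q|Corpus).
    """
    m2 = 1.0 - m1
    num, denom = 0.0, 0
    for query, rel_docs in train_set:
        for d in rel_docs:
            for q in query:
                p_d = p_doc(q, d)
                # posterior probability that q was emitted by the document state
                num += m1 * p_d / (m1 * p_d + m2 * p_corpus(q))
                denom += 1  # accumulates sum over queries of |Doc_Q| * |Q|
    return num / denom

# Hypothetical toy setup: two query terms, one relevant doc.
p_doc = lambda q, d: {"a": 0.5, "b": 0.1}[q]
p_corpus = lambda q: 0.2
train_set = [(["a", "b"], ["D1"])]
m1_new = em_update_m1(train_set, 0.5, p_doc, p_corpus)
assert 0.0 < m1_new < 1.0  # a valid mixture weight after one iteration
```

Because the weight is tied across all documents, a single scalar is re-estimated from the pooled posterior counts rather than one weight per document.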
Discriminative Training of LMs (1/5)
• Minimum Classification Error (MCE) training
  – Given a query Q and a desired relevant doc D*, define the classification error function as:

    E(Q,D*) = −(1/|Q|) log P(Q|D* is relevant) + (1/|Q|) log max_{D'≠D*} P(Q|D' is not relevant)

    "E(Q,D*) > 0" means a misclassification; "E(Q,D*) ≤ 0" means a correct decision
  – Transform the error function into the loss function:

    L(Q,D*) = 1 / ( 1 + exp(−α E(Q,D*)) )

    • Its value lies in the range between 0 and 1
B. Chen et al., “A discriminative HMM/N-gram-based retrieval approach for Mandarin spoken documents,” ACM Transactions on Asian Language Information Processing, 3(2), pp. 128-145, June 2004
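The error and loss functions above can be sketched directly (α and the log-probabilities below are illustrative values only):

```python
import math

def classification_error(logp_rel, logp_best_nonrel, q_len):
    """E(Q, D*) = -(1/|Q|) log P(Q|D*) + (1/|Q|) log max_{D'} P(Q|D')."""
    return (-logp_rel + logp_best_nonrel) / q_len

def mce_loss(error, alpha=1.0):
    """Sigmoid loss L = 1 / (1 + exp(-alpha * E)); value lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-alpha * error))

# Correct decision (relevant doc scores higher) -> E < 0 -> loss below 0.5.
e_good = classification_error(logp_rel=-10.0, logp_best_nonrel=-12.0, q_len=4)
assert e_good < 0 and mce_loss(e_good) < 0.5

# Misclassification (a non-relevant doc scores higher) -> E > 0 -> loss above 0.5.
e_bad = classification_error(logp_rel=-12.0, logp_best_nonrel=-10.0, q_len=4)
assert e_bad > 0 and mce_loss(e_bad) > 0.5
```

The sigmoid makes the 0/1 classification error differentiable, which is what allows the gradient-descent updates on the following slides.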
Discriminative Training of LMs (2/5)
• Minimum Classification Error (MCE) training
  – Apply the loss function to the MCE procedure for iteratively updating the weighting parameters
  – Constraints: Σ_k m_k = 1, 0 ≤ m_k ≤ 1
  – Parameter transformation (e.g., Type I HMM):

    m_1 = e^{m̃_1} / ( e^{m̃_1} + e^{m̃_2} )   and   m_2 = e^{m̃_2} / ( e^{m̃_1} + e^{m̃_2} )

  – Iteratively update m̃_1 by gradient descent (e.g., Type I HMM):

    m̃_1^{(i+1)} = m̃_1^{(i)} − ε ∂L(Q,D*)/∂m̃_1 |_{m̃_1 = m̃_1^{(i)}}

    where

    ∂L(Q,D*)/∂m̃_1 = α L(Q,D*) ( 1 − L(Q,D*) ) ∂E(Q,D*)/∂m̃_1
Discriminative Training of LMs (3/5)
• Minimum Classification Error (MCE) training
  – Iteratively update m̃_1 (e.g., Type I HMM). Differentiating the error function with respect to m̃_1, using d log f(x)/dx = f′(x)/f(x) and the softmax derivatives ∂m_1/∂m̃_1 = m_1 m_2 and ∂m_2/∂m̃_1 = −m_1 m_2:

    ∂E(Q,D*)/∂m̃_1 = −(1/|Q|) Σ_{q_n∈Q} m_1 m_2 ( P(q_n|D*) − P(q_n|Corpus) ) / ( m_1 P(q_n|D*) + m_2 P(q_n|Corpus) )
                  = −(1/|Q|) Σ_{q_n∈Q} [ m_1 P(q_n|D*) / ( m_1 P(q_n|D*) + m_2 P(q_n|Corpus) ) − m_1 ]
Discriminative Training of LMs (4/5)
• Minimum Classification Error (MCE) training
  – Iteratively update m̃_1 (e.g., Type I HMM). Substituting the parameter transformation m_1^{(i)} = e^{m̃_1^{(i)}} / ( e^{m̃_1^{(i)}} + e^{m̃_2^{(i)}} ) into the gradient gives

    m̃_1^{(i+1)} = m̃_1^{(i)} − ε(i) α L(Q,D*) ( 1 − L(Q,D*) ) ∂E(Q,D*)/∂m̃_1 |_{m̃_1 = m̃_1^{(i)}}

    with

    ∂E(Q,D*)/∂m̃_1 |_{m̃_1 = m̃_1^{(i)}} = −(1/|Q|) Σ_{q_n∈Q} [ m_1^{(i)} P(q_n|D*) / ( m_1^{(i)} P(q_n|D*) + m_2^{(i)} P(q_n|Corpus) ) − m_1^{(i)} ]

  – m̃_1^{(i)} is the old weight; m̃_1^{(i+1)} is the new weight
Discriminative Training of LMs (5/5)
• Minimum Classification Error (MCE) training
  – Final equations: iteratively update m_1 via

    m̃_1^{(i+1)} = m̃_1^{(i)} + ε(i) α L(Q,D*) ( 1 − L(Q,D*) ) (1/|Q|) Σ_{q_n∈Q} [ m_1^{(i)} P(q_n|D*) / ( m_1^{(i)} P(q_n|D*) + m_2^{(i)} P(q_n|Corpus) ) − m_1^{(i)} ]

    and then m_1^{(i+1)} = e^{m̃_1^{(i+1)}} / ( e^{m̃_1^{(i+1)}} + e^{m̃_2^{(i+1)}} )
  – m_2 can be updated in a similar way
Statistical Translation Model (1/2)
• A query is viewed as a translation or distillation from a document
  – That is, the similarity measure is computed by estimating the probability that the query would have been generated as a translation of that document
  – It assumes context independence (so its ability to handle the ambiguity of word senses is limited)
  – However, it can handle the issues of synonymy (multiple terms having similar meanings) and polysemy (the same term having multiple meanings)
  sim(Q,D) = P(Q|D) = ∏_{q∈Q} P(q|D)^{c(q,Q)},   where P(q|D) = Σ_{w∈D} P_Trans(q|w) P(w|D)   (word-to-word translation)

  With smoothing by a collection model:

  P̂(Q|D) = ∏_{q∈Q} [ λ P(q|D) + (1 − λ) P(q|C) ]^{c(q,Q)}

  Berger & Lafferty (1999)
  A. Berger and J. Lafferty. Information retrieval as statistical translation. SIGIR 1999
Statistical Translation Model (2/2)
• Weakness of the statistical translation model
  – It needs a large collection of training data for estimating the translation probabilities, and it is inefficient for ranking documents
• Jin et al. (2002) proposed a "Title Language Model" approach to capture the intrinsic document-to-query translation patterns
  – Queries are more like titles than documents (queries and titles both tend to be very short, concise descriptions of information, created through a similar generation process)
  – Train the statistical translation model on the document-title pairs in the whole collection
R. Jin et al. Title language model for information retrieval. SIGIR 2002
  M* = argmax_M ∏_{j=1}^{N} P(T_j|D_j, M)
     = argmax_M ∏_{j=1}^{N} ∏_{t∈T_j} P(t|D_j, M)^{c(t,T_j)}
     = argmax_M ∏_{j=1}^{N} ∏_{t∈T_j} [ Σ_{w∈D_j} P(t|w, M) P(w|D_j) ]^{c(t,T_j)}

  where (D_j, T_j), j = 1 … N, are the document-title pairs of the collection
Probabilistic Latent Semantic Analysis (PLSA)
• Also called the Aspect Model or Probabilistic Latent Semantic Indexing (PLSI)
  – Graphical model representation (a kind of Bayesian network) — Hofmann (1999)

  T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning 2001

• Retrieval with a language (unigram) model:

  sim(Q,D) = P(Q|D) = ∏_{w∈Q} [ λ P(w|M_D) + (1 − λ) P(w|M_C) ]^{c(w,Q)}

  (and P(D|Q) = P(Q|D) P(D) / P(Q) ∝ P(Q|D) P(D))

• Retrieval with PLSA:

  sim(Q,D) = P(Q|D) = ∏_{w∈Q} P(w|D)^{c(w,Q)},   P(w|D) = Σ_{k=1}^{K} P(w|T_k) P(T_k|D)

  – The latent variables T_k are the unobservable class variables (topics or domains)
PLSA: Formulation
• Definitions
  – P(D): the probability of selecting a document D
  – P(T_k|D): the probability of picking a latent class T_k for the document D
  – P(w|T_k): the probability of generating a word w from the class T_k
PLSA: Assumptions
• Bag-of-words: treat docs as a memoryless source; words are generated independently
• Conditional independence: the doc D and word w are independent conditioned on the state of the associated latent variable T_k:

  P(w,D|T_k) = P(w|T_k) P(D|T_k)

  Hence

  P(w,D) = Σ_{k=1}^{K} P(w,D,T_k) = Σ_{k=1}^{K} P(w,D|T_k) P(T_k)
         = Σ_{k=1}^{K} P(w|T_k) P(D|T_k) P(T_k)
         = P(D) Σ_{k=1}^{K} P(w|T_k) P(T_k|D)

  and sim(Q,D) = ∏_{w∈Q} P(w|D)^{c(w,Q)} with P(w|D) = Σ_{k=1}^{K} P(w|T_k) P(T_k|D)
PLSA: Training (1/2)
• Probabilities are estimated by maximizing the collection likelihood, using the Expectation-Maximization (EM) algorithm:

  L_C = Σ_D Σ_w c(w,D) log P(w|D) = Σ_D Σ_w c(w,D) log [ Σ_k P(w|T_k) P(T_k|D) ]

  EM tutorial:
  – Jeff A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models," U.C. Berkeley TR-97-021
PLSA: Training (2/2)
• E (expectation) step:

  P(T_k|w,D) = P(w|T_k) P(T_k|D) / Σ_{k'} P(w|T_{k'}) P(T_{k'}|D)

• M (maximization) step:

  P̂(w|T_k) = Σ_D c(w,D) P(T_k|w,D) / Σ_{w'} Σ_D c(w',D) P(T_k|w',D)

  P̂(T_k|D) = Σ_w c(w,D) P(T_k|w,D) / Σ_w c(w,D)
PLSA: Latent Probability Space (1/2)
  P(w_j,D_i) = Σ_k P(w_j|T_k) P(T_k) P(D_i|T_k)

  In matrix notation: P = U Σ V^T, where
  – U = [ P(w_j|T_k) ]_{j,k}
  – Σ = diag( P(T_k) )
  – V = [ P(D_i|T_k) ]_{i,k}

  (Figure: example with dimensionality K = 128 latent classes; sample classes include "image sequence analysis", "medical imaging", "context of contour / boundary detection", and "phonetic segmentation".)
PLSA: Latent Probability Space (2/2)
  (Schematic: the m×n matrix P with entries P(w_j,D_i) — rows indexed by words w_1 … w_m, columns by documents D_1 … D_n — factorizes as P = U_{m×K} Σ_{K×K} V^T_{K×n}, where U holds P(w_j|T_k), Σ = diag(P(T_k)), and V^T holds P(D_i|T_k) over topics T_1 … T_K.)
PLSA: One more example on TDT1 dataset
(Example topics discovered on the TDT1 dataset: aviation, space missions, family love, Hollywood love.)
PLSA: Experiment Results (1/4)
• Experimental results
  – Two ways to smooth the empirical distribution with PLSA:
    • PLSA-U*: combine the cosine score with that of the vector space model, as is done for LSA (see the next slide)
    • PLSA-Q*: combine the multinomials individually:

      P*_{PLSA-Q}(w|D) = λ P_Empirical(w|D) + (1 − λ) P_PLSA(w|D)

      where P_Empirical(w|D) = c(w,D)/|D| and P_PLSA(w|D) = Σ_{k=1}^{K} P(w|T_k) P(T_k|D), giving

      P_{PLSA-Q*}(Q|D) = ∏_{w∈Q} [ λ P_Empirical(w|D) + (1 − λ) P_PLSA(w|D) ]^{c(w,Q)}

  – Both provide almost identical performance
  – The performance of PLSA ( P_PLSA(w|D) ) is not promising when used alone
PLSA: Experiment Results (2/4)
• PLSA-U*
  – Use the low-dimensional representations P(T_k|Q) and P(T_k|D) (viewed in a K-dimensional latent space) to evaluate relevance by means of the cosine measure:

    R_{PLSA-U*}(Q,D) = Σ_k P(T_k|Q) P(T_k|D) / [ sqrt(Σ_k P(T_k|Q)^2) · sqrt(Σ_k P(T_k|D)^2) ]

  – The query representation is folded in online:

    P(T_k|Q) = Σ_{w∈Q} c(w,Q) P(T_k|Q,w) / Σ_{w∈Q} c(w,Q)

  – Combine the cosine score with that of the vector space model, using an ad hoc weight λ to re-weight the different model components (dimensions):

    R̃_{PLSA-U*}(Q,D) = λ R_{PLSA-U*}(Q,D) + (1 − λ) R_VSM(Q,D)
PLSA: Experiment Results (3/4)
• Why the cosine measure in the latent space?
  – Recall that in LSA, the relations between any two docs can be formulated through

    A'^T A' = (U'Σ'V'^T)^T (U'Σ'V'^T) = V'Σ'^T U'^T U'Σ'V'^T = (V'Σ')(V'Σ')^T

    so sim(D_i,D_s) = cosine(D̂_i,D̂_s), where D̂_i and D̂_s are row vectors of V'Σ'
  – PLSA mimics LSA in its similarity measure: using P(T_k|D_i) = P(D_i|T_k) P(T_k) / P(D_i),

    R*(D_i,D_s) = Σ_k P(T_k|D_i) P(T_k|D_s) / [ sqrt(Σ_k P(T_k|D_i)^2) · sqrt(Σ_k P(T_k|D_s)^2) ]
                = Σ_k P(D_i|T_k) P(T_k) P(D_s|T_k) P(T_k) / [ sqrt(Σ_k (P(D_i|T_k) P(T_k))^2) · sqrt(Σ_k (P(D_s|T_k) P(T_k))^2) ]

    (the document priors P(D_i) and P(D_s) cancel in the cosine normalization)
PLSA: Experiment Results (4/4)
PLSA vs. LSA
• Decomposition/approximation
  – LSA: least-squares criterion measured on the L2 (Frobenius) norm of the word-doc matrix
  – PLSA: maximization of the likelihood function, based on the cross entropy (Kullback-Leibler divergence) between the empirical distribution and the model
• Computational complexity
  – LSA: SVD decomposition
  – PLSA: EM training, which can be time-consuming over many iterations
  – The model complexity of both LSA and PLSA grows linearly with the number of training documents
• There is no general way to estimate or predict the vector representation (for LSA) or the model parameters (for PLSA) of a newly observed document
Latent Dirichlet Allocation (LDA) (1/2)
• The basic generative process of LDA closely resembles PLSA; however,
  – In PLSA, the topic mixture P(T_k|D) is conditioned on each document (P(T_k|D) is fixed but unknown)
  – In LDA, the topic mixture P(T_k|D) is drawn from a Dirichlet distribution, the so-called conjugate prior (P(T_k|D) is unknown and follows a probability distribution)
• Process of generating a corpus with LDA — Blei et al. (2003):
  1) Pick a multinomial distribution φ_T for each topic T from a Dirichlet distribution with parameter β
  2) Pick a multinomial distribution θ_D for each document D from a Dirichlet distribution with parameter α
  3) Pick a topic T from the multinomial distribution θ_D
  4) Pick a word w from the multinomial distribution φ_T
Blei et al. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003
Latent Dirichlet Allocation (2/2)
  L_LDA = ∏_D ∫ P(θ_D|α) [ ∏_i Σ_{k=1}^{K} P(w_i|T_k) P(T_k|θ_D) ] dθ_D

  (Figure: word distributions live on the probability simplex — for a three-word vocabulary, the points (P(w_1), P(w_2), P(w_3)) with P(w_1) + P(w_2) + P(w_3) = 1.)
Word Topic Models (WTM)
• Each word of the language is treated as a word topical mixture model for predicting the occurrences of other words:

  P_WTM(w_i|M_{w_j}) = Σ_{k=1}^{K} P(w_i|T_k) P(T_k|M_{w_j})

• WTM can also be viewed as a nonnegative factorization of a "word-word" matrix consisting of probability entries
  – Each column encodes the vicinity information of all occurrences of a distinct word in the document collection
B. Chen, “Word topic models for spoken document retrieval and transcription,” ACM Transactions on Asian Language Information Processing, 8(1), pp. 2:1-2:27, March 2009
Comparison of WTM and PLSA/LDA
• A schematic comparison of the matrix factorizations of PLSA/LDA and WTM:
  – PLSA/LDA factorize the normalized "word-document" co-occurrence matrix A (words × documents) into mixture components (words × topics) and mixture weights (topics × documents):

    P_PLSA(w_i|M_D) = Σ_{k=1}^{K} P(w_i|T_k) P(T_k|M_D)

  – WTM factorizes the normalized "word-word" co-occurrence matrix B (words × vicinities of words) into the same kind of mixture components (words × topics) and mixture weights (topics × vicinities of words):

    P_WTM(w_i|M_{w_j}) = Σ_{k=1}^{K} P(w_i|T_k) P(T_k|M_{w_j})
WTM: Information Retrieval (1/4)
• The relevance measure between a query Q and a document D can be expressed by

  P_WTM(Q|D) = ∏_{w_i∈Q} [ Σ_{w_j∈D} P(w_j|D) Σ_{k=1}^{K} P(w_i|T_k) P(T_k|M_{w_j}) ]^{c(w_i,Q)}

• Unsupervised training
  – The WTM of each word w_j can be trained by concatenating the words occurring within a context window of size S around each occurrence of w_j, which are postulated to be relevant to w_j, giving observations O_{w_j} = O_{w_j,1}, O_{w_j,2}, …, O_{w_j,N}; the log-likelihood to be maximized is

    log L = Σ_{w_j} log P_WTM(O_{w_j}|M_{w_j}) = Σ_{w_j} Σ_{w_i∈O_{w_j}} c(w_i,O_{w_j}) log P_WTM(w_i|M_{w_j})
WTM: Information Retrieval (2/4)
• Supervised training: the model parameters are trained using a training set of query exemplars and the associated query-document relevance information
  – Maximize the log-likelihood of the training queries being generated by their relevant documents:

    log L = Σ_{Q∈TrainSet} Σ_{D∈R_Q} log P_WTM(Q|D)
WTM: Information Retrieval (3/4)
• Example topic distributions of WTM
  Topic 13: Vena (靜脈) 1.202; Resection (切除) 0.674; Myoma (肌瘤) 0.668; Cephalitis (腦炎) 0.618; Uterus (子宮) 0.501; Bronchus (支氣管) 0.500
  Topic 23: Cholera (霍亂) 0.752; Colorectal cancer (大腸直腸癌) 0.681; Salmonella enterica (沙門氏菌) 0.471; Aphtae epizooticae (口蹄疫) 0.337; Thyroid (甲狀腺) 0.303; Gastric cancer (胃癌) 0.298
  Topic 14: Land tax (土地稅) 0.704; Tobacco and alcohol tax law (菸酒稅法) 0.489; Tax (財稅) 0.457; Amend drafts (修正草案) 0.446; Acquisition (購併) 0.396; Insurance law (保險法) 0.373
WTM: Information Retrieval (4/4)
• Pairing of PLSA and WTM — sharing the same set of latent topics
  – The normalized "word-document" and "word-word" co-occurrence matrices are factorized jointly, with shared mixture components P(w|T_k) and separate mixture weights P(T_k|D) (for documents) and P(T_k|M_w) (for word vicinities)
Applying Relevance Feedback to LM Framework (1/2)
• There is still no formal mechanism to incorporate relevance feedback (judgments) into the language modeling framework (especially the query-likelihood approach)
  – The query is a fixed sample, while the focus is on accurate estimation of the document language models P(w|M_D)
• Ponte (1998) proposed a limited way to incorporate blind relevance feedback into the LM framework
  – Think of example relevant documents as examples of what the query might have been, and re-sample (or expand) the query by adding the k most descriptive words from these documents, e.g. the words w maximizing

    Σ_{D∈R̃} log [ P(w|M_D) / P(w|C) ]

    where R̃ is the set of (blindly) assumed-relevant documents and C is the collection
J. M. Ponte, A language modeling approach to information retrieval, Ph.D. dissertation, UMass, 1998
Applying Relevance Feedback to LM Framework (2/2)
• Miller et al. (1999) proposed two relevance feedback approaches
  – Query expansion: add to the initial query those words that appear in two or more of the top m retrieved documents
  – Document model re-estimation: use a set of outside training query exemplars to train the transition probabilities of the document models

    P(Q|M_D) = ∏_{i=1}^{L} [ λ P(w_i|M_D) + (1 − λ) P(w_i|M_C) ]

    λ̂ = [ Σ_{Q∈TrainSet} Σ_{D∈Doc_Q} Σ_{q_n∈Q} λ P(q_n|M_D) / ( λ P(q_n|M_D) + (1 − λ) P(q_n|M_C) ) ] / [ Σ_{Q∈TrainSet} |Doc_Q| · |Q| ]

    where TrainSet is the set of training query exemplars, Doc_Q is the set of documents relevant to a specific training query exemplar Q, |Q| is the length of the query Q, and |Doc_Q| is the total number of documents relevant to the query; λ is the old weight and λ̂ is the new weight (reported setup: 819 queries, ≤ 2265 docs)
Miller et al. , A hidden Markov model information retrieval system, SIGIR 1999
Relevance Modeling (1/3)
• A schematic illustration of the information retrieval system with relevance feedback for improved query modeling
(Schematic: a query and the initial query model are matched against the document models in an initial round of retrieval; the top-ranked documents serve as representative documents for various query modeling techniques; the re-estimated query model then drives a second round of retrieval over the document collection to produce the final retrieval result.)
K.-Y. Chen et al., "Exploring the use of unsupervised query modeling techniques for speech recognition and summarization,“Speech Communication, 80, pp. 49-59, June 2016.
Relevance Modeling (2/3)
• Use the top-ranked pseudo-relevant documents to approximate the relevance class of each query
  – The joint probability of a query Q = q_1 q_2 … q_L and any word w being generated by the relevance class is computed on the basis of the list of M top-ranked pseudo-relevant documents obtained from the initial round of retrieval:

    P_RM(w,Q) = Σ_{m=1}^{M} P(D_m) P(w, q_1, q_2, …, q_L | D_m)

  – If we further assume that words are conditionally independent given D_m and that their order is of no importance:

    P_RM(w,Q) = Σ_{m=1}^{M} P(D_m) P(w|D_m) ∏_{l=1}^{L} P(q_l|D_m)
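A sketch of estimating the relevance model from pseudo-relevant documents (the document models and the uniform prior P(D_m) are illustrative choices; unsmoothed zeros are kept for brevity):

```python
import math

def relevance_model(query, docs, doc_models, vocab):
    """P_RM(w,Q) = sum_m P(D_m) * P(w|D_m) * prod_l P(q_l|D_m), uniform P(D_m);
    then normalized to P_RM(w|Q) = P_RM(w,Q) / P_RM(Q)."""
    p_d = 1.0 / len(docs)
    p_wq = {}
    for w in vocab:
        p_wq[w] = sum(
            p_d * doc_models[d].get(w, 0.0)
            * math.prod(doc_models[d].get(q, 0.0) for q in query)
            for d in docs
        )
    z = sum(p_wq.values())  # equals P_RM(Q) when vocab covers the doc models
    return {w: p / z for w, p in p_wq.items()}

# Toy pseudo-relevant document models (hypothetical values).
doc_models = {
    "D1": {"space": 0.5, "mission": 0.3, "launch": 0.2},
    "D2": {"space": 0.2, "movie": 0.6, "launch": 0.2},
}
rm = relevance_model(["space"], ["D1", "D2"], doc_models,
                     ["space", "mission", "movie", "launch"])
assert abs(sum(rm.values()) - 1.0) < 1e-9
assert rm["mission"] > 0  # query expansion: related words gain probability mass
```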
Relevance Modeling (3/3)
• The enhanced query model P_RM(w|Q), therefore, can be expressed by

  P_RM(w|Q) = P_RM(w,Q) / P_RM(Q)
            = [ Σ_{m=1}^{M} P(D_m) P(w|D_m) ∏_{l=1}^{L} P(q_l|D_m) ] / [ Σ_{m=1}^{M} P(D_m) ∏_{l=1}^{L} P(q_l|D_m) ]

• Incorporation of latent topic information:

  P̃(w|D_m) = Σ_{k=1}^{K} P(w|T_k) P(T_k|D_m)

  P_TRM(w,Q) = Σ_{m=1}^{M} P(D_m) Σ_{k=1}^{K} P(T_k|D_m) P(w|T_k) ∏_{l=1}^{L} P(q_l|T_k)
Leveraging Non-relevant Information
• Hypothesis: the low-ranked (or pseudo non-relevant) documents can also provide useful cues to boost the retrieval effectiveness for a given query
  – For this idea to work, we may estimate a non-relevance model P(w|NR_Q) for each test query based on the selected pseudo non-relevant documents
• The similarity measure between a query and a document can then be computed as

  SIM(Q,D) = −KL(Q‖D) + KL(NR_Q‖D)
Incorporating Prior Knowledge into LM Framework
• Several efforts have been made to use prior knowledge in the LM framework, especially for modeling the document prior P(D) in

  P(D|Q) = P(Q|D) P(D) / P(Q)

  – Document length
  – Document source
  – Average word length
  – Aging (time information/period)
  – URL
  – Page links
Implementation Notes: Probability Manipulation
• For language modeling approaches to IR, many conditional probabilities are usually multiplied; this can result in floating-point underflow
• It is better to perform the computation by adding logarithms of probabilities instead
  – The logarithm function is monotonic (order-preserving)

  P(Q|M_D) = ∏_{i=1}^{L} [ λ P(w_i|M_D) + (1 − λ) P(w_i|M_C) ]

  log P(Q|M_D) = Σ_{i=1}^{L} log [ λ P(w_i|M_D) + (1 − λ) P(w_i|M_C) ]

• We should also avoid the problem of zero probabilities (or estimates) owing to sparse data, by using appropriate probability smoothing techniques
Implementation Notes: Converting to tf-idf-like Weighting
• The query likelihood retrieval model:

  log P(Q|M_D) = Σ_{i=1}^{L} log [ λ P(w_i|M_D) + (1 − λ) P(w_i|M_C) ]
               = Σ_{i=1}^{L} log [ λ c(w_i,D)/|D| + (1 − λ) c(w_i,C)/|C| ]
               = Σ_{i: c(w_i,D)>0} log [ λ c(w_i,D)/|D| + (1 − λ) c(w_i,C)/|C| ] + Σ_{i: c(w_i,D)=0} log [ (1 − λ) c(w_i,C)/|C| ]
               = Σ_{i: c(w_i,D)>0} log [ 1 + ( λ c(w_i,D)/|D| ) / ( (1 − λ) c(w_i,C)/|C| ) ] + Σ_{i=1}^{L} log [ (1 − λ) c(w_i,C)/|C| ]
               ≡ Σ_{i: c(w_i,D)>0} log [ 1 + ( λ/(1 − λ) ) · ( c(w_i,D)/|D| ) · ( |C|/c(w_i,C) ) ]   (rank equivalence)

  – The last sum over all query words is the same for all documents, so it can be discarded
  – The logarithm is a monotonic (rank-preserving) transformation
• Therefore, the similarity score is directly proportional to a word's frequency in the document and inversely proportional to its frequency in the collection — a tf-idf-like weighting
  => Can be efficiently implemented with inverted files (to be discussed later on!)
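The rank equivalence derived on this slide can be verified numerically (toy counts are illustrative):

```python
import math

def full_log_likelihood(query, doc, coll, lam=0.5):
    """log P(Q|M_D) with Jelinek-Mercer smoothing, summed over all query words."""
    dl = sum(doc.values())
    cl = sum(coll.values())
    return sum(
        math.log(lam * doc.get(w, 0) / dl + (1 - lam) * coll[w] / cl) for w in query
    )

def rank_equivalent_score(query, doc, coll, lam=0.5):
    """Sum over matched words only: log(1 + lam/(1-lam) * (c(w,D)/|D|) * (|C|/c(w,C)))."""
    dl = sum(doc.values())
    cl = sum(coll.values())
    return sum(
        math.log(1 + lam / (1 - lam) * (doc[w] / dl) * (cl / coll[w]))
        for w in query if w in doc
    )

# Toy collection counts and three toy documents.
coll = {"a": 10, "b": 5, "c": 20}
docs = [{"a": 3, "c": 1}, {"b": 2, "c": 2}, {"a": 1, "b": 1, "c": 1}]
query = ["a", "b"]

full = sorted(range(3), key=lambda i: full_log_likelihood(query, docs[i], coll), reverse=True)
fast = sorted(range(3), key=lambda i: rank_equivalent_score(query, docs[i], coll), reverse=True)
assert full == fast  # identical ranking: the dropped term is document-independent
```

The rank-equivalent form touches only the query words that actually occur in the document, which is what makes the inverted-file implementation efficient.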