Language Models
Hongning Wang
CS@UVa
Two-stage smoothing [Zhai & Lafferty 02]

P(w|d) = (1−λ) · [c(w,d) + μ·p(w|C)] / (|d| + μ) + λ·p(w|U)

• Stage-1: explain unseen words with a Dirichlet prior (Bayesian), smoothing the document model toward the collection LM p(w|C).
• Stage-2: explain noise in the query with a 2-component mixture against the user background model p(w|U), which can be approximated by p(w|C).
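As a concrete illustration, here is a minimal Python sketch of this formula; the function name, dict-based inputs, and default parameter values are hypothetical, not from the original slides:

```python
def two_stage_prob(w, doc_tf, doc_len, p_coll, p_user, mu=2000, lam=0.5):
    """Two-stage smoothed P(w|d) [Zhai & Lafferty 02].

    doc_tf: term frequencies c(w,d) for the document, as a dict.
    p_coll: collection LM p(w|C); p_user: user background model p(w|U).
    mu: Dirichlet prior (stage-1); lam: mixture weight (stage-2).
    """
    # Stage-1: Dirichlet-prior smoothing explains unseen words via p(w|C)
    stage1 = (doc_tf.get(w, 0) + mu * p_coll[w]) / (doc_len + mu)
    # Stage-2: 2-component mixture explains query noise via p(w|U)
    return (1 - lam) * stage1 + lam * p_user[w]
```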
Estimating λ using the EM algorithm [Zhai & Lafferty 02]

Consider generation of the query Q = q1…qm as a mixture over all the relevant documents d_1…d_N, where π_j is the probability of choosing document LM p(w|d_j):

p(Q|λ, U) = ∏_{i=1}^{m} Σ_{j=1}^{N} π_j · [(1−λ)·p(q_i|d_j) + λ·p(q_i|U)]

λ̂ = argmax_λ p(Q|λ, U)

Stage-1 supplies each document model, with μ̂ estimated in stage-1:

p(q_i|d_j) = [c(q_i, d_j) + μ̂·p(q_i|C)] / (|d_j| + μ̂)

Stage-2 mixes it with the user background model: (1−λ)·p(q_i|d_j) + λ·p(q_i|U). The Expectation-Maximization (EM) algorithm estimates π and λ.
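A minimal EM sketch for this mixture, assuming the stage-1 probabilities p(q_i|d_j) and the background probabilities p(q_i|U) are precomputed arrays; the array shapes and names are illustrative, not the authors' implementation:

```python
import numpy as np

def em_lambda(p_q_d, p_q_u, iters=100):
    """EM for p(Q|lam,U) = prod_i sum_j pi_j [(1-lam) p(q_i|d_j) + lam p(q_i|U)].

    p_q_d: (m, N) array of stage-1 smoothed p(q_i|d_j).
    p_q_u: (m,) array of user background p(q_i|U).
    Returns the noise weight lam and document weights pi.
    """
    m, N = p_q_d.shape
    lam, pi = 0.5, np.full(N, 1.0 / N)
    for _ in range(iters):
        # E-step: hidden variables are the chosen document j and the
        # origin of each query term (document topic vs. user background).
        term = (1 - lam) * p_q_d + lam * p_q_u[:, None]            # (m, N)
        p_j = pi * term / (pi * term).sum(axis=1, keepdims=True)   # p(j|q_i)
        p_bg = lam * p_q_u[:, None] / term                         # p(bg|q_i, j)
        # M-step: re-estimate lam and pi from expected counts
        lam = (p_j * p_bg).sum() / m
        pi = p_j.sum(axis=0) / m
    return lam, pi
```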
Introduction to EM

• Parameter estimation when all data is observable:
  • Maximum likelihood method
  • Optimize the analytic form of L(θ) = log p(X|θ)
• Missing/unobservable data:
  • Data: X (observed) + H (hidden)
  • Likelihood: L(θ) = log ∫ p(X, H|θ) dH, which is intractable in most cases
  • Approximate it!

Running example of hidden data: the sentence "We have missing date." (was the intended word "date" or "data"?)
Background knowledge

• Jensen's inequality: for any convex function f and positive weights λ_i with Σ_i λ_i = 1,

  f(Σ_i λ_i·x_i) ≤ Σ_i λ_i·f(x_i)

• For a concave function such as log, the inequality reverses: f(E[x]) ≥ E[f(x)], which is the form used below.
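A quick numeric check of the concave (log) form of Jensen's inequality, with arbitrary example values:

```python
import numpy as np

# Jensen's inequality for the concave log: log(E[x]) >= E[log(x)].
x = np.array([0.2, 1.0, 5.0])          # arbitrary positive values
w = np.array([0.3, 0.4, 0.3])          # positive weights summing to 1
lhs = np.log(w @ x)                    # log of the weighted average
rhs = w @ np.log(x)                    # weighted average of the logs
print(lhs, rhs, lhs >= rhs)            # lhs >= rhs always holds
```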
Expectation Maximization

• Maximize the data likelihood function by pushing up a lower bound:

L(θ) = log p(X|θ) = log ∫ p(X, H|θ) dH = log ∫ q(H) · [p(X, H|θ) / q(H)] dH
     ≥ ∫ q(H) log p(X, H|θ) dH − ∫ q(H) log q(H) dH

by Jensen's inequality, f(E[x]) ≥ E[f(x)] for the concave f = log.

• The lower bound is easier to compute and has many good properties!
• Components we need to tune when optimizing L(θ): the proposal distribution q(H) over the hidden variables, and θ.
Intuitive understanding of EM

[Figure: the data likelihood p(X|θ) as a curve over θ, with a lower bound beneath it. The lower bound is easier to optimize, and improving it is guaranteed to improve the data likelihood.]
Expectation Maximization (cont.)

• Optimize the lower bound with respect to q(H):

L(θ) ≥ ∫ q(H) log p(X, H|θ) dH − ∫ q(H) log q(H) dH
     = ∫ q(H) · [log p(H|X, θ) + log p(X|θ)] dH − ∫ q(H) log q(H) dH
     = ∫ q(H) log [p(H|X, θ) / q(H)] dH + log p(X|θ)
     = −KL(q(H) ‖ p(H|X, θ)) + L(θ)

The first term is the negated KL-divergence between q(H) and p(H|X, θ); the second term, log p(X|θ), is constant with respect to q(H).
Expectation Maximization (cont.)

• Optimize the lower bound with respect to q(H):

  L(θ) ≥ −KL(q(H) ‖ p(H|X, θ)) + L(θ)

• KL-divergence is non-negative, and equals zero iff q(H) = p(H|X, θ).
• A step further: when q(H) = p(H|X, θ), the bound becomes L(θ), i.e., the lower bound is tight!
• Other choices of q(H) cannot lead to this tight bound, but might reduce computational complexity.
• Note: the calculation of p(H|X, θ) is based on the current θ.
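A quick numeric check of these two KL properties, with arbitrary example distributions:

```python
import numpy as np

def kl(q, p):
    """KL(q || p) for discrete distributions with full support."""
    return float(np.sum(q * np.log(q / p)))

q = np.array([0.2, 0.3, 0.5])
p = np.array([0.1, 0.4, 0.5])
print(kl(q, p))   # positive whenever q != p
print(kl(q, q))   # exactly 0 iff the two distributions are equal
```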
Expectation Maximization (cont.)

• Optimize the lower bound with respect to q(H).
• Optimal solution: q(H) = p(H|X, θ^t), the posterior distribution of H given the current model θ^t.
Expectation Maximization (cont.)

• Optimize the lower bound with respect to θ; the −∫ q(H) log q(H) dH term is constant w.r.t. θ, so

θ^(t+1) = argmax_θ ∫ p(H|X, θ^t) log p(X, H|θ) dH = argmax_θ E_{H|X,θ^t}[log p(X, H|θ)]

i.e., maximize the expectation of the complete data likelihood.
Expectation Maximization

• EM tries to iteratively maximize the likelihood.
• "Complete" likelihood: L_c(θ) = log p(X, H|θ)
• Starting from an initial guess θ^(0):
  1. E-step: compute the expectation of the complete likelihood,
     Q(θ; θ^t) = E_{H|X,θ^t}[L_c(θ)] = ∫ p(H|X, θ^t) log p(X, H|θ) dH
  2. M-step: compute θ^(t+1) by maximizing the Q-function (the key step!),
     θ^(t+1) = argmax_θ Q(θ; θ^t)
Intuitive understanding of EM

[Figure: the data likelihood p(X|θ) with a lower bound (the Q-function) touching it at the current guess; maximizing the lower bound gives the next guess.]

E-step = computing the lower bound; M-step = maximizing the lower bound.

Example: S = "We have missing date."
• E-step: p(date|S, θ^t) = 0.1, p(data|S, θ^t) = 0.9
• M-step: p(data|θ^(t+1)) ∝ 0.9 + Σ_observed 1, i.e., the expected count contributed by the hidden word plus the count of observed occurrences of "data".
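A tiny sketch of this E/M step on the toy example; the token list, the fixed posterior, and the normalization are illustrative assumptions:

```python
# Toy M-step for S = "We have missing date.", where the hidden variable is
# whether the ambiguous token "date" was actually intended as "data".
tokens = ["we", "have", "missing"]     # unambiguous observed tokens
p_data = 0.9                           # E-step posterior p(data | S, theta_t)

# Expected count of "data": posterior mass from the ambiguous token
# plus its observed occurrences (none here).
expected_count = p_data + tokens.count("data")
total_tokens = len(tokens) + 1         # include the ambiguous token
p_data_next = expected_count / total_tokens
print(p_data_next)                     # 0.9 / 4 = 0.225, the new p(data|theta)
```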
Convergence guarantee

• Proof of EM:

log p(X|θ) = log p(H, X|θ) − log p(H|X, θ)

Taking the expectation with respect to p(H|X, θ^t) on both sides:

log p(X|θ) = ∫ p(H|X, θ^t) log p(H, X|θ) dH − ∫ p(H|X, θ^t) log p(H|X, θ) dH
           = Q(θ; θ^t) + H(θ; θ^t)     ← the second term is a cross-entropy

Then the change of the log data likelihood between EM iterations is:

log p(X|θ) − log p(X|θ^t) = Q(θ; θ^t) + H(θ; θ^t) − Q(θ^t; θ^t) − H(θ^t; θ^t)

By Jensen's inequality, H(θ; θ^t) ≥ H(θ^t; θ^t), which means

log p(X|θ) − log p(X|θ^t) ≥ Q(θ; θ^t) − Q(θ^t; θ^t) ≥ 0     ← the M-step guarantees this
What is not guaranteed

• The global optimum is not guaranteed!
  • The likelihood L(θ) is non-convex in most cases
  • EM boils down to a greedy algorithm: alternating ascent
• Generalized EM
  • E-step: compute q(H)
  • M-step: choose any θ that improves Q(θ; θ^t), rather than maximizing it
Estimating λ using a Mixture Model [Zhai & Lafferty 02]

Back to the stage-2 estimation problem: the origin of each query term is unknown, i.e., does it come from the user background or from the topic? This hidden origin is exactly the latent variable that EM handles.

With query Q = q1…qm and the stage-1 document models p(w|d_1)…p(w|d_N):

p(Q|λ, U) = ∏_{i=1}^{m} Σ_{j=1}^{N} π_j · [(1−λ)·p(q_i|d_j) + λ·p(q_i|U)]

λ̂ = argmax_λ p(Q|λ, U)

where p(q_i|d_j) = [c(q_i, d_j) + μ̂·p(q_i|C)] / (|d_j| + μ̂), with μ̂ estimated in stage-1, and the Expectation-Maximization (EM) algorithm estimates π and λ.
Variants of the basic LM approach

• Different smoothing strategies
  • Hidden Markov Models (essentially linear interpolation) [Miller et al. 99]
  • Smoothing with an IDF-like reference model [Hiemstra & Kraaij 99]
  • Performance tends to be similar to the basic LM approach
  • Many other possibilities for smoothing [Chen & Goodman 98]
• Different priors
  • Link information as prior leads to significant improvement of Web entry page retrieval performance [Kraaij et al. 02]
  • Time as prior [Li & Croft 03]
  • PageRank as prior [Kurland & Lee 05]
• Passage retrieval [Liu & Croft 02]
Improving language models

• Capturing limited dependencies
  • Bigrams/trigrams [Song & Croft 99]
  • Grammatical dependency [Nallapati & Allan 02, Srikanth & Srihari 03, Gao et al. 04]
  • Generally insignificant improvement compared with other extensions such as feedback
• Full Bayesian query likelihood [Zaragoza et al. 03]
  • Performance similar to the basic LM approach
• Translation model for p(Q|D,R) [Berger & Lafferty 99, Jin et al. 02, Cao et al. 05]
  • Addresses polysemy and synonyms; improves over the basic LM methods, but computationally expensive
• Cluster-based smoothing/scoring [Liu & Croft 04, Kurland & Lee 04, Tao et al. 06]
  • Improves over the basic LM, but computationally expensive
• Parsimonious LMs [Hiemstra et al. 04]
  • Use a mixture model to "factor out" non-discriminative words
A unified framework for IR: Risk Minimization

• Long-standing IR challenges
  • Improve IR theory
    • Develop theoretically sound and empirically effective models
    • Go beyond the limited traditional notion of relevance (independent, topical relevance)
  • Improve IR practice
    • Optimize retrieval parameters automatically
• Language models are promising tools…
  • How can we systematically exploit LMs in IR?
  • Can LMs offer anything hard/impossible to achieve in traditional IR?
Idea 1: Retrieval as decision-making (a more general notion of relevance)

Given a query:
• Which documents should be selected? (D)
• How should these docs be presented to the user? (π)
Choose: (D, π)

[Figure: alternative presentation strategies for a query: a ranked list (1, 2, 3, 4), an unordered subset, or a clustering.]
Idea 2: Systematic language modeling

[Diagram: documents feed document language models (DOC MODELING); the query feeds a query language model (QUERY MODELING); the user defines a loss function (USER MODELING); all three drive the retrieval decision.]
Generative model of document & query [Lafferty & Zhai 01b]

[Diagram:]
• User U generates a query model θ_Q ~ p(θ_Q|U), which generates the query q ~ p(q|θ_Q, U).
• Source S generates a document model θ_D ~ p(θ_D|S), which generates the document d ~ p(d|θ_D, S).
• Relevance: R ~ p(R|θ_Q, θ_D).
• q and d are observed; U and S are partially observed; θ_Q, θ_D, and R are inferred.
Applying Bayesian Decision Theory [Lafferty & Zhai 01b, Zhai 02, Zhai & Lafferty 06]

Observed: query q, user U, doc set C, source S. Hidden: the models θ (θ_q, θ_1, …, θ_N). Each possible choice (D_1, π_1), (D_2, π_2), …, (D_n, π_n) incurs a loss L(D, π, θ).

RISK MINIMIZATION: select the choice with the minimum Bayes risk,

(D*, π*) = argmin_{D,π} ∫ L(D, π, θ) · p(θ|q, U, C, S) dθ
Special cases

• Set-based models (choose D) → Boolean model
• Ranking models (choose π)
  • Independent loss
    • Relevance-based loss → probabilistic relevance model, Generative Relevance Theory
    • Distance-based loss → vector-space model, two-stage LM, KL-divergence model
  • Dependent loss
    • Maximal Marginal Relevance loss → subtopic retrieval model
    • Maximal Diverse Relevance loss → subtopic retrieval model
Optimal ranking for independent loss

Decision space = {rankings π}; assume sequential browsing, where s_j is the probability that the user would stop reading after seeing the top j documents:

π* = argmin_π ∫ L(π, θ) · p(θ|q, U, C, S) dθ

L(π, θ) = L(d_{i_1} … d_{i_N}, θ) = Σ_{j=1}^{N} s_j · l(d_{i_j}, θ)

so

π* = argmin_π Σ_{j=1}^{N} s_j ∫ l(d_{i_j}, θ) · p(θ|q, U, C, S) dθ

Since the risk of each document, r(d_k) = ∫ l(d_k, θ) · p(θ|q, U, C, S) dθ, does not depend on the other documents, independent risk = independent scoring: the optimal ranking is based on r(d_k) alone, the "risk ranking principle" [Zhai 02].
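A minimal sketch of independent-risk scoring, assuming per-document losses evaluated at weighted posterior samples of θ; the inputs are illustrative, not the paper's implementation:

```python
import numpy as np

def rank_by_risk(losses, post_weights):
    """Rank documents in ascending order of risk r(d) = E_theta[l(d, theta)].

    losses: (n_docs, n_samples) array of l(d_k, theta_s) at posterior samples.
    post_weights: (n_samples,) posterior weights p(theta_s|q,U,C,S), summing to 1.
    Returns document indices ordered by increasing expected loss.
    """
    risk = losses @ post_weights   # Monte Carlo estimate of r(d_k)
    return np.argsort(risk)        # independent risk = independent scoring
```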
Automatic parameter tuning

• Retrieval parameters are needed to
  • model different user preferences
  • customize a retrieval model to specific queries and documents
• Retrieval parameters in traditional models
  • EXTERNAL to the model, hard to interpret
  • Parameters are introduced heuristically to implement "intuition"
  • No principles to quantify them; they must be set empirically through many experiments
  • Still no guarantee for new queries/documents
• Language models make it possible to estimate parameters…
Parameter setting in risk minimization

[Diagram: documents feed the document language models, the query feeds the query language model, and the user defines the loss function. Query model parameters: estimate. Doc model parameters: estimate. User model parameters: set.]
Summary of risk minimization

• Risk minimization is a general probabilistic retrieval framework
  • Retrieval as a decision problem (= risk minimization)
  • Separate/flexible language models for queries and docs
• Advantages
  • A unified framework for existing models
  • Automatic parameter tuning due to LMs
  • Allows for modeling complex retrieval tasks
• Lots of potential for exploring LMs…
What you should know
• EM algorithm
• Risk minimization framework