Language Models for TR
Rong Jin
Department of Computer Science and Engineering
Michigan State University
What is a Statistical LM?
• A probability distribution over word sequences
– p(“Today is Wednesday”) ≈ 0.001
– p(“Today Wednesday is”) ≈ 0.0000000000001
– p(“The eigenvalue is positive”) ≈ 0.00001
• Context-dependent!
• Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model
Why is a LM Useful?
• Provides a principled way to quantify the uncertainties associated with natural language
• Allows us to answer questions like:
– Given that we see “John” and “feels”, how likely will we see “happy” as opposed to “habit” as the next word? (speech recognition)
– Given that we observe “baseball” three times and “game” once in a news article, how likely is it about “sports”? (text categorization, information retrieval)
– Given that a user is interested in sports news, how likely would the user use “baseball” in a query? (information retrieval)
The Simplest Language Model (Unigram Model)
• Generate a piece of text by generating each word INDEPENDENTLY
• Thus, p(w1 w2 ... wn) = p(w1) p(w2) … p(wn)
• Parameters: {p(wi)}, with p(w1) + … + p(wN) = 1 (N is the vocabulary size)
• Essentially a multinomial distribution over words
• A piece of text can be regarded as a sample drawn according to this word distribution
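The independence assumption above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the toy model `unigram` and the helper names `sequence_probability` and `sample_text` are hypothetical.

```python
import math
import random

# Hypothetical toy unigram model; the probabilities sum to 1.
unigram = {"text": 0.5, "mining": 0.3, "food": 0.2}

def sequence_probability(words, model):
    """p(w1 ... wn) = p(w1) * ... * p(wn) under the independence assumption."""
    return math.prod(model.get(w, 0.0) for w in words)

def sample_text(model, n, seed=0):
    """Draw n words independently from the multinomial defined by `model`."""
    rng = random.Random(seed)
    vocab = list(model)
    weights = [model[w] for w in vocab]
    return rng.choices(vocab, weights=weights, k=n)

# Word order does not matter to a unigram model.
p1 = sequence_probability(["text", "mining"], unigram)
p2 = sequence_probability(["mining", "text"], unigram)
sample = sample_text(unigram, 5)
```

Because each word is generated independently, `p1` and `p2` are equal: a unigram model cannot tell “Today is Wednesday” from “Today Wednesday is”.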
Text Generation with Unigram LM
(Unigram) Language Model p(w|θ)
• Topic 1: Text mining
– … text 0.2, mining 0.1, association 0.01, clustering 0.02, … food 0.00001 …
– Sampling → Document: text mining paper
• Topic 2: Health
– … food 0.25, nutrition 0.1, healthy 0.05, diet 0.02 …
– Sampling → Document: food nutrition paper
Estimation of Unigram LM
(Unigram) Language Model p(w|θ) = ?
• Document: a “text mining paper” (total #words = 100)
– Word counts: text 10, mining 5, association 3, database 3, algorithm 2, … query 1, efficient 1
• Estimation
– p(text) = 10/100
– p(mining) = 5/100
– p(association) = 3/100
– p(database) = 3/100
– …
– p(query) = 1/100
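The estimate on this slide is the relative frequency of each word in the document. A minimal sketch follows; the function name `estimate_unigram` and the filler token `<other>` (standing in for the words the slide elides) are assumptions made for illustration.

```python
from collections import Counter

def estimate_unigram(tokens):
    """Maximum likelihood estimate: p(w|d) = c(w, d) / |d|."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Counts taken from the slide. The slide elides the rest of the document,
# so a hypothetical filler token pads the total to 100 words.
doc = (["text"] * 10 + ["mining"] * 5 + ["association"] * 3 +
       ["database"] * 3 + ["algorithm"] * 2 + ["query"] + ["efficient"] +
       ["<other>"] * 75)

model = estimate_unigram(doc)
```

With these counts, `model["text"]` is 10/100 and `model["query"]` is 1/100, matching the slide. Any word absent from the document gets no entry at all, which is the zero-probability problem that smoothing addresses.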
Language Models for Retrieval (Ponte & Croft 98)
• Document: text mining paper
– Language Model: … text ?, mining ?, association ?, clustering ?, … food ? …
• Document: food nutrition paper
– Language Model: … food ?, nutrition ?, healthy ?, diet ? …
• Query = “data mining algorithms”
• Which model would most likely have generated this query?
Ranking Docs by Query Likelihood
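[Figure: each document d1, d2, …, dN has its own document LM θ_d1, θ_d2, …, θ_dN. The query q is scored against each one by its query likelihood p(q|θ_d1), p(q|θ_d2), …, p(q|θ_dN).]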
But, where is the relevance?
And, what’s good about this approach?
The Notion of Relevance
Relevance
• Similarity between representations: (Rep(q), Rep(d))
– Different rep & similarity
– Vector space model (Salton et al., 75)
– Prob. distr. model (Wong & Yao, 89)
– …
• Probability of relevance: P(r=1|q,d), r ∈ {0,1}
– Regression model (Fox 83)
– Generative model, doc generation: classical prob. model (Robertson & Sparck Jones, 76)
– Generative model, query generation: LM approach (Ponte & Croft, 98) (Lafferty & Zhai, 01a)
• Probabilistic inference: P(d→q) or P(q→d)
– Different inference system
– Prob. concept space model (Wong & Yao, 95)
– Inference network model (Turtle & Croft, 91)
Refining P(R=1|Q,D) Method 2: generative models
• Basic idea
– Define P(Q,D|R)
– Compute P(R|Q,D) using Bayes’ rule
$$O(R=1 \mid Q,D) = \frac{P(Q,D \mid R=1)}{P(Q,D \mid R=0)} \cdot \frac{P(R=1)}{P(R=0)}$$
– The prior odds P(R=1)/P(R=0) are ignored for ranking D
• Special cases
– Document “generation”: P(Q,D|R) = P(D|Q,R) P(Q|R)
– Query “generation”: P(Q,D|R) = P(Q|D,R) P(D|R)
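Query Generation
$$O(R=1 \mid Q,D) \propto \frac{P(Q,D \mid R=1)}{P(Q,D \mid R=0)} = \frac{P(Q \mid D,R=1)\,P(D \mid R=1)}{P(Q \mid D,R=0)\,P(D \mid R=0)} \propto P(Q \mid D,R=1)\,\frac{P(D \mid R=1)}{P(D \mid R=0)}$$
(the last step assumes P(Q|D,R=0) ≈ P(Q|R=0))
• P(Q|D,R=1) is the query likelihood p(q|θ_d)
• P(D|R=1)/P(D|R=0) is the document prior
Assuming uniform prior, we have
$$O(R=1 \mid Q,D) \propto P(Q \mid D,R=1)$$
Now, the question is how to compute P(Q|D,R=1).
Generally involves two steps:
(1) estimate a language model based on D
(2) compute the query likelihood according to the estimated model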
Retrieval as Language Model Estimation
• Document ranking based on query likelihood
$$\log p(q \mid d) = \sum_{i=1}^{n} \log p(w_i \mid d), \quad \text{where } q = w_1 w_2 \ldots w_n$$
– p(w_i|d) is the document language model
• The retrieval problem becomes the estimation of p(w_i|d)
• Smoothing is an important issue, and distinguishes different approaches
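A small sketch of ranking by query log-likelihood, using the two topic models from the text-generation slide as unsmoothed document models. The names `query_log_likelihood`, `rank`, and the document keys are hypothetical.

```python
import math

def query_log_likelihood(query, doc_model):
    """log p(q|d) = sum_i log p(w_i|d); -inf if any query word has p = 0."""
    total = 0.0
    for w in query:
        p = doc_model.get(w, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

def rank(query, doc_models):
    """Return document ids sorted by decreasing query log-likelihood."""
    return sorted(doc_models,
                  key=lambda d: query_log_likelihood(query, doc_models[d]),
                  reverse=True)

# Partial word probabilities copied from the earlier slide (unsmoothed).
doc_models = {
    "text_mining": {"text": 0.2, "mining": 0.1, "association": 0.01,
                    "clustering": 0.02, "food": 0.00001},
    "health": {"food": 0.25, "nutrition": 0.1, "healthy": 0.05, "diet": 0.02},
}

ranking = rank(["text", "mining"], doc_models)
health_score = query_log_likelihood(["text", "mining"], doc_models["health"])
```

The health document scores negative infinity because it contains neither query word. A single unseen word is enough to zero out the whole query likelihood, which is why the estimation of p(w_i|d) hinges on smoothing.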
A General Smoothing Scheme
• All smoothing methods try to
– discount the probability of words seen in a doc
– re-allocate the extra probability so that unseen words will have a non-zero probability
• Most use a reference model (collection language model) to discriminate unseen words
$$p(w \mid d) = \begin{cases} p_{seen}(w \mid d) & \text{if } w \text{ is seen in } d \\ \alpha_d \, p(w \mid C) & \text{otherwise} \end{cases}$$
– p_seen(w|d) is the discounted ML estimate
– p(w|C) is the collection language model
Smoothing & TF-IDF Weighting
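• Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain
$$\log p(q \mid d) = \sum_{w_i \in d,\; w_i \in q} \log \frac{p_{seen}(w_i \mid d)}{\alpha_d \, p(w_i \mid C)} + n \log \alpha_d + \sum_{i} \log p(w_i \mid C)$$
– TF weighting: the numerator p_seen(w_i|d)
– IDF weighting: the denominator p(w_i|C)
– Doc length normalization: n log α_d (a long doc is expected to have a smaller α_d)
– Ignore for ranking: the last sum, which does not depend on d
• Smoothing with p(w|C) ≈ TF-IDF + length norm.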
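The decomposition can be checked numerically: the directly computed log p(q|d) should equal the sum over matched query words of log[p_seen / (α_d p(w|C))], plus n log α_d, plus the sum of log p(w|C) over the query. The sketch below assumes Dirichlet smoothing for p_seen and α_d (one possible instantiation, chosen here for illustration); the function name, the toy collection model, and the value of `mu` are hypothetical.

```python
import math
from collections import Counter

def direct_and_decomposed(query, doc, p_coll, mu=10.0):
    """Return log p(q|d) computed directly and via the three-term decomposition."""
    counts = Counter(doc)
    dlen = len(doc)
    alpha = mu / (dlen + mu)  # alpha_d under Dirichlet smoothing (assumption)

    def p_seen(w):
        return (counts[w] + mu * p_coll[w]) / (dlen + mu)

    # Direct: seen words use p_seen, unseen words use alpha_d * p(w|C).
    direct = sum(
        math.log(p_seen(w)) if counts[w] > 0 else math.log(alpha * p_coll[w])
        for w in query
    )
    # Decomposed: matched-term sum + n log alpha_d + collection-only sum.
    matched = sum(math.log(p_seen(w) / (alpha * p_coll[w]))
                  for w in query if counts[w] > 0)
    length_norm = len(query) * math.log(alpha)
    collection = sum(math.log(p_coll[w]) for w in query)
    return direct, matched + length_norm + collection

p_coll = {"data": 0.2, "mining": 0.3, "the": 0.5}  # toy collection model
direct, decomposed = direct_and_decomposed(
    ["data", "mining"], ["data", "data", "the"], p_coll)
```

Only the matched-term sum requires looking at words shared by the query and the document, so the score can be computed from an inverted index in the same way as a TF-IDF score.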
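Three Smoothing Methods (Zhai & Lafferty 01)
• Simplified Jelinek-Mercer: Shrink uniformly toward p(w|C)
$$p(w \mid d) = (1-\lambda)\, p_{ml}(w \mid d) + \lambda\, p(w \mid C)$$
• Dirichlet prior (Bayesian): Assume pseudo counts μ p(w|C)
$$p(w \mid d) = \frac{c(w;d) + \mu\, p(w \mid C)}{|d| + \mu} = \frac{|d|}{|d|+\mu}\, p_{ml}(w \mid d) + \frac{\mu}{|d|+\mu}\, p(w \mid C)$$
• Absolute discounting: Subtract a constant δ
$$p(w \mid d) = \frac{\max(c(w;d) - \delta,\, 0) + \delta\, |d|_u\, p(w \mid C)}{|d|}$$
– |d|_u is the number of unique words in d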
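The three methods can be written side by side as below. This is a sketch under stated assumptions: `lam`, `mu`, and `delta` name the Jelinek-Mercer interpolation weight, the Dirichlet prior parameter, and the absolute discount; the default values, the function names, and the toy data are illustrative only.

```python
from collections import Counter

def jelinek_mercer(w, counts, dlen, p_coll, lam=0.5):
    """(1 - lam) * p_ml(w|d) + lam * p(w|C)."""
    return (1 - lam) * counts[w] / dlen + lam * p_coll[w]

def dirichlet(w, counts, dlen, p_coll, mu=2000.0):
    """(c(w;d) + mu * p(w|C)) / (|d| + mu)."""
    return (counts[w] + mu * p_coll[w]) / (dlen + mu)

def absolute_discount(w, counts, dlen, p_coll, delta=0.7):
    """(max(c(w;d) - delta, 0) + delta * |d|_u * p(w|C)) / |d|."""
    unique = sum(1 for c in counts.values() if c > 0)  # |d|_u
    return (max(counts[w] - delta, 0.0) + delta * unique * p_coll[w]) / dlen

# Toy data: a 4-word document and a 3-word vocabulary.
doc = ["data", "data", "mining", "data"]
counts = Counter(doc)  # a Counter returns 0 for unseen words
p_coll = {"data": 0.3, "mining": 0.2, "food": 0.5}

sums = {
    f.__name__: sum(f(w, counts, len(doc), p_coll) for w in p_coll)
    for f in (jelinek_mercer, dirichlet, absolute_discount)
}
```

Each method yields a proper distribution over the vocabulary (every entry of `sums` is 1 up to rounding) and gives the unseen word “food” a non-zero probability.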
Comparison of Three Methods
Query Type   JM      Dir     AD
Title        0.228   0.256   0.237
Long         0.278   0.276   0.260

[Figure: Relative performance of JM, Dir. and AD. Bar chart of precision (0 to 0.3) by method (JM, DIR, AD) for title queries and long queries.]
The Need of Query-Modeling (Dual-Role of Smoothing)
[Figure: two panels, one for verbose queries and one for keyword queries.]
Another Reason for Smoothing
Query = “the algorithms for data mining”

Word probabilities p(w|d), in query order:
      the     algorithms   for     data    mining
d1:   0.04    0.001        0.02    0.002   0.003
d2:   0.02    0.001        0.01    0.003   0.004

p(“algorithms”|d1) = p(“algorithms”|d2)
p(“data”|d1) < p(“data”|d2)
p(“mining”|d1) < p(“mining”|d2)
But p(q|d1) > p(q|d2)!
We should make p(“the”) and p(“for”) less different across docs.
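The arithmetic behind this example is easy to verify. The sketch below multiplies the per-word probabilities from the slide, first over the whole query and then over the content words only; the variable names are illustrative.

```python
import math

query = ["the", "algorithms", "for", "data", "mining"]
content_words = ["algorithms", "data", "mining"]

# Per-word probabilities from the slide, listed in query order.
p = {
    "d1": dict(zip(query, [0.04, 0.001, 0.02, 0.002, 0.003])),
    "d2": dict(zip(query, [0.02, 0.001, 0.01, 0.003, 0.004])),
}

# Likelihood of the full query, and of the content words alone.
full = {d: math.prod(p[d][w] for w in query) for d in p}
content = {d: math.prod(p[d][w] for w in content_words) for d in p}
```

On the content words alone d2 wins, but over the full query d1 wins, because its higher probabilities for “the” and “for” contribute a factor of 4 that outweighs d2's factor of 2 on “data” and “mining”.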
Two-stage Smoothing
$$P(w \mid d) = (1-\lambda)\, \frac{c(w,d) + \mu\, p(w \mid C)}{|d| + \mu} + \lambda\, p(w \mid U)$$
• Stage-1 (parameter μ)
– Explain unseen words
– Dirichlet prior (Bayesian)
• Stage-2 (parameter λ)
– Explain noise in query
– 2-component mixture
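A minimal sketch of the two-stage estimate: stage 1 applies Dirichlet prior smoothing against the collection model p(w|C), and stage 2 mixes the result with a query background model p(w|U). The parameter names `mu` and `lam`, their default values, the function name, and the toy models are assumptions for illustration.

```python
from collections import Counter

def two_stage(w, counts, dlen, p_coll, p_user, mu=2000.0, lam=0.5):
    """(1 - lam) * (c(w,d) + mu * p(w|C)) / (|d| + mu) + lam * p(w|U)."""
    stage1 = (counts.get(w, 0) + mu * p_coll[w]) / (dlen + mu)  # unseen words
    return (1 - lam) * stage1 + lam * p_user[w]                 # query noise

doc = ["data", "mining", "data"]
counts = Counter(doc)
p_coll = {"data": 0.3, "mining": 0.2, "the": 0.5}  # toy collection model
p_user = {"data": 0.1, "mining": 0.1, "the": 0.8}  # toy query background model

probs = {w: two_stage(w, counts, len(doc), p_coll, p_user, mu=10.0, lam=0.3)
         for w in p_coll}
```

Setting `lam=0` recovers plain Dirichlet smoothing, and letting `mu` approach 0 with `lam > 0` gives a Jelinek-Mercer-style interpolation with p(w|U).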
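Estimating μ using leave-one-out
• Leave-one-out: predict each word occurrence from the rest of its document, i.e. P(w1|d − w1), P(w2|d − w2), ..., P(wn|d − wn)
• Leave-one-out log-likelihood
$$\ell_{-1}(\mu \mid C) = \sum_{i=1}^{N} \sum_{w \in V} c(w, d_i)\, \log \frac{c(w, d_i) - 1 + \mu\, p(w \mid C)}{|d_i| - 1 + \mu}$$
• Maximum Likelihood Estimator
$$\hat{\mu} = \arg\max_{\mu}\; \ell_{-1}(\mu \mid C)$$
• Solved with Newton’s Method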
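The leave-one-out objective sums, over every word occurrence, the log of the Dirichlet-smoothed probability of that word after removing one occurrence from its document: c(w,d) · log[(c(w,d) − 1 + μ p(w|C)) / (|d| − 1 + μ)]. The sketch below evaluates it and, as a simplification, picks μ by grid search over a few hypothetical candidates rather than by Newton's method as on the slide. All names and data are illustrative.

```python
import math
from collections import Counter

def loo_log_likelihood(mu, docs, p_coll):
    """Sum over docs and words of c(w,d) * log((c - 1 + mu*p(w|C)) / (|d| - 1 + mu))."""
    total = 0.0
    for doc in docs:
        counts = Counter(doc)
        dlen = len(doc)
        for w, c in counts.items():
            total += c * math.log((c - 1 + mu * p_coll[w]) / (dlen - 1 + mu))
    return total

def best_mu(docs, p_coll, candidates):
    """Grid search stand-in for Newton's method (a swapped-in simplification)."""
    return max(candidates, key=lambda mu: loo_log_likelihood(mu, docs, p_coll))

# Toy collection of two documents; p(w|C) is the pooled relative frequency.
docs = [["a", "a", "b"], ["b", "c", "c", "c"]]
pooled = Counter(w for d in docs for w in d)
n_tokens = sum(pooled.values())
p_coll = {w: c / n_tokens for w, c in pooled.items()}

candidates = [0.1, 1.0, 10.0, 100.0]
mu_hat = best_mu(docs, p_coll, candidates)
```

The candidates must be strictly positive so that the argument of the logarithm stays positive for words occurring only once in a document.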
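Estimating λ using Mixture Model
• Stage-1: estimate the document models P(w|d1), ..., P(w|dN) for documents d1, ..., dN
• Stage-2: mix each document model with the query background model: (1−λ) p(w|d1) + λ p(w|U), ..., (1−λ) p(w|dN) + λ p(w|U)
• Likelihood of the query q = q1 ... qm
$$p(q \mid \lambda, U) = \sum_{i=1}^{N} \pi_i \prod_{j=1}^{m} \left( (1-\lambda)\, p(q_j \mid \hat{\theta}_{d_i}) + \lambda\, p(q_j \mid U) \right)$$
• Maximum Likelihood Estimator
$$\hat{\lambda} = \arg\max_{\lambda}\; p(q \mid \lambda, U)$$
• Solved with the Expectation-Maximization (EM) algorithm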
Collection   Query   Optimal-JM   Optimal-Dir   Auto-2stage
AP88-89      SK      20.3%        23.0%         22.2%*
AP88-89      LK      36.8%        37.6%         37.4%
AP88-89      SV      18.8%        20.9%         20.4%
AP88-89      LV      28.8%        29.8%         29.2%
WSJ87-92     SK      19.4%        22.3%         21.8%*
WSJ87-92     LK      34.8%        35.3%         35.8%
WSJ87-92     SV      17.2%        19.6%         19.9%
WSJ87-92     LV      27.7%        28.2%         28.8%*
ZIFF1-2      SK      17.9%        21.5%         20.0%
ZIFF1-2      LK      32.6%        32.6%         32.2%
ZIFF1-2      SV      15.6%        18.5%         18.1%
ZIFF1-2      LV      26.7%        27.9%         27.9%*

Automatic 2-stage results ≈ Optimal 1-stage results
Average precision (3 DB’s + 4 query types, 150 topics)
Acknowledgement
• Many thanks to Chengxiang Zhai, who generously shared his slides on the language modeling approach to information retrieval