Language Models for TR

Rong Jin

Department of Computer Science and Engineering

Michigan State University

What is a Statistical LM?

• A probability distribution over word sequences

– p(“Today is Wednesday”) ≈ 0.001

– p(“Today Wednesday is”) ≈ 0.0000000000001

– p(“The eigenvalue is positive”) ≈ 0.00001

• Context-dependent!

• Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model

Why is a LM Useful?

• Provides a principled way to quantify the uncertainties associated with natural language

• Allows us to answer questions like:

– Given that we see “John” and “feels”, how likely will we see “happy” as opposed to “habit” as the next word? (speech recognition)

– Given that we observe “baseball” three times and “game” once in a news article, how likely is it about “sports”? (text categorization, information retrieval)

– Given that a user is interested in sports news, how likely would the user use “baseball” in a query? (information retrieval)

The Simplest Language Model (Unigram Model)

• Generate a piece of text by generating each word INDEPENDENTLY

• Thus, p(w1 w2 ... wn)=p(w1)p(w2)…p(wn)

• Parameters: {p(wi)}, subject to p(w1)+…+p(wN)=1 (N is the vocabulary size)

• Essentially a multinomial distribution over words

• A piece of text can be regarded as a sample drawn according to this word distribution
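The independence assumption can be illustrated with a minimal Python sketch. The toy vocabulary, its probabilities, and the function names `sequence_probability` and `sample_text` are assumptions made for illustration; they are not from the slides.

```python
import math
import random

# Hypothetical toy unigram model; the probabilities are illustrative only.
unigram = {"today": 0.2, "is": 0.3, "wednesday": 0.1, "the": 0.4}

def sequence_probability(words, model):
    """p(w1 ... wn) = p(w1) * p(w2) * ... * p(wn) under word independence."""
    return math.prod(model.get(w, 0.0) for w in words)

def sample_text(model, length, seed=0):
    """Generate text by drawing each word independently from the distribution."""
    rng = random.Random(seed)
    vocab = list(model)
    weights = [model[w] for w in vocab]
    return rng.choices(vocab, weights=weights, k=length)

p_a = sequence_probability(["today", "is", "wednesday"], unigram)
p_b = sequence_probability(["today", "wednesday", "is"], unigram)
sample = sample_text(unigram, 5)
```

Because a unigram model ignores word order, `p_a` and `p_b` are equal here, which is exactly what the independence assumption gives up.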

Text Generation with Unigram LM

(Unigram) Language Model p(w|θ) → Sampling → Document

Topic 1: Text mining

… text 0.2, mining 0.1, association 0.01, clustering 0.02, … food 0.00001 …

→ Text mining paper

Topic 2: Health

… food 0.25, nutrition 0.1, healthy 0.05, diet 0.02 …

→ Food nutrition paper

Estimation of Unigram LM

Document → Estimation → (Unigram) Language Model p(w|θ) = ?

Document: a “text mining paper” (total #words = 100)

Word counts: text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1

Estimated p(w|θ): text 10/100, mining 5/100, association 3/100, database 3/100, …, query 1/100, …
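The estimate on this slide is the relative frequency of each word in the document. A minimal sketch, assuming a hypothetical function name `estimate_unigram_ml` and a placeholder token `<other>` that stands in for the 75 words the slide elides:

```python
from collections import Counter

def estimate_unigram_ml(tokens):
    """Maximum likelihood estimate: p(w|theta) = c(w, d) / |d|."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Counts taken from the slide; "<other>" is a hypothetical filler so that
# the document has 100 words in total.
slide_counts = {"text": 10, "mining": 5, "association": 3, "database": 3,
                "algorithm": 2, "query": 1, "efficient": 1, "<other>": 75}
tokens = [w for w, c in slide_counts.items() for _ in range(c)]

model = estimate_unigram_ml(tokens)
```

Any word that does not occur in the document gets no entry at all, that is, probability zero under this estimate.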

Language Models for Retrieval (Ponte & Croft 98)

Document → Language Model

Text mining paper → … text ?, mining ?, association ?, clustering ?, … food ? …

Food nutrition paper → … food ?, nutrition ?, healthy ?, diet ? …

Query = “data mining algorithms”

Which model would most likely have generated this query?

Ranking Docs by Query Likelihood

The query q is scored against the language model of each document:

Document    Doc LM    Query likelihood
d1          θ_d1      p(q|θ_d1)
d2          θ_d2      p(q|θ_d2)
…
dN          θ_dN      p(q|θ_dN)

But, where is the relevance?

And, what’s good about this approach?

The Notion of Relevance

Relevance

• Similarity of (Rep(q), Rep(d)): different rep & similarity

– Vector space model (Salton et al., 75)

– Prob. distr. model (Wong & Yao, 89)

– …

• Probability of relevance P(r=1|q,d), r ∈ {0,1}

– Regression model (Fox 83)

– Generative model, doc generation: Classical prob. model (Robertson & Sparck Jones, 76)

– Generative model, query generation: LM approach (Ponte & Croft, 98) (Lafferty & Zhai, 01a)

• Probabilistic inference P(d→q) or P(q→d): different inference system

– Prob. concept space model (Wong & Yao, 95)

– Inference network model (Turtle & Croft, 91)

Refining P(R=1|Q,D), Method 2: Generative Models

• Basic idea

– Define P(Q,D|R)

– Compute P(R|Q,D) using Bayes’ rule

• Special cases

– Document “generation”: P(Q,D|R)=P(D|Q,R)P(Q|R)

– Query “generation”: P(Q,D|R)=P(Q|D,R)P(D|R)

\[
O(R=1|Q,D) = \frac{P(Q,D|R=1)}{P(Q,D|R=0)} \cdot \frac{P(R=1)}{P(R=0)}
\]

The factor P(R=1)/P(R=0) is ignored for ranking D.

Query Generation

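\[
\begin{aligned}
O(R=1|Q,D) &\propto \frac{P(Q,D|R=1)}{P(Q,D|R=0)} \\
&= \frac{P(Q|D,R=1)\,P(D|R=1)}{P(Q|D,R=0)\,P(D|R=0)} \\
&\propto P(Q|D,R=1)\,\frac{P(D|R=1)}{P(D|R=0)} \qquad (\text{assume } P(Q|D,R=0) \approx P(Q|R=0))
\end{aligned}
\]

Here P(Q|D,R=1) is the query likelihood p(q|θ_d), and P(D|R=1)/P(D|R=0) is the document prior.

Assuming uniform prior, we have

\[
O(R=1|Q,D) \propto P(Q|D,R=1)
\]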

Now, the question is how to compute P(Q|D,R=1).

Generally involves two steps:

(1) estimate a language model based on D

(2) compute the query likelihood according to the estimated model

Retrieval as Language Model Estimation

• Document ranking based on query likelihood

\[
\log p(q|d) = \sum_{i=1}^{n} \log p(w_i|d), \quad \text{where } q = w_1 w_2 \ldots w_n
\]

Here p(wi|d) is the document language model.

• Retrieval problem ≈ Estimation of p(wi|d)

• Smoothing is an important issue, and distinguishes different approaches
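A minimal sketch of ranking by log query likelihood, using unsmoothed maximum likelihood document models. The two toy documents and the function names are assumptions for illustration only. It also shows why smoothing matters: a single unseen query word drives the score to minus infinity.

```python
import math
from collections import Counter

def ml_model(tokens):
    """Unsmoothed document language model: relative word frequencies."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def log_query_likelihood(query, doc_model):
    """log p(q|d) = sum_i log p(w_i|d); -inf if any query word is unseen."""
    total = 0.0
    for w in query:
        p = doc_model.get(w, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

# Hypothetical toy collection.
docs = {
    "d1": "text mining algorithms for text data mining".split(),
    "d2": "food nutrition and healthy diet data".split(),
}
query = "data mining algorithms".split()

scores = {name: log_query_likelihood(query, ml_model(toks))
          for name, toks in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here `d2` contains “data” but not “mining”, so without smoothing it receives a score of minus infinity regardless of its other matches.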

A General Smoothing Scheme

• All smoothing methods try to

– discount the probability of words seen in a doc

– re-allocate the extra probability so that unseen words will have a non-zero probability

• Most use a reference model (collection language model) to discriminate unseen words

\[
p(w|d) = \begin{cases} p_{seen}(w|d) & \text{if } w \text{ is seen in } d \\ \alpha_d\, p(w|C) & \text{otherwise} \end{cases}
\]

Here p_seen(w|d) is the discounted ML estimate and p(w|C) is the collection language model.

Smoothing & TF-IDF Weighting

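• Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain

\[
\log p(q|d) = \sum_{i:\, w_i \in d} \log \frac{p_{seen}(w_i|d)}{\alpha_d\, p(w_i|C)} + n \log \alpha_d + \sum_{i=1}^{n} \log p(w_i|C)
\]

– The ratio p_seen(wi|d) / (α_d p(wi|C)) gives TF weighting (through p_seen) and IDF weighting (through 1/p(wi|C))

– n log α_d gives doc length normalization (long doc is expected to have a smaller α_d)

– The last sum, over log p(wi|C), does not depend on d and can be ignored for ranking

• Smoothing with p(w|C) ≈ TF-IDF + length norm.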
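The intermediate steps behind this decomposition can be sketched as follows, assuming the query is q = w_1 … w_n and α_d is the document-dependent constant of the general smoothing scheme. Split the query words into those seen in d and those not seen, then add and subtract log α_d p(w_i|C) for the seen words:

```latex
\begin{aligned}
\log p(q|d)
  &= \sum_{i:\, w_i \in d} \log p_{seen}(w_i|d)
   + \sum_{i:\, w_i \notin d} \log \alpha_d\, p(w_i|C) \\
  &= \sum_{i:\, w_i \in d} \log \frac{p_{seen}(w_i|d)}{\alpha_d\, p(w_i|C)}
   + \sum_{i=1}^{n} \log \alpha_d\, p(w_i|C) \\
  &= \sum_{i:\, w_i \in d} \log \frac{p_{seen}(w_i|d)}{\alpha_d\, p(w_i|C)}
   + n \log \alpha_d + \sum_{i=1}^{n} \log p(w_i|C)
\end{aligned}
```

Only the first sum requires looking at matched query words, which is why scoring can be done as efficiently as with TF-IDF weighting.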

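Three Smoothing Methods (Zhai & Lafferty 01)

• Simplified Jelinek-Mercer: Shrink uniformly toward p(w|C)

\[
p(w|d) = (1-\lambda)\, p_{ml}(w|d) + \lambda\, p(w|C)
\]

• Dirichlet prior (Bayesian): Assume pseudo counts μ p(w|C)

\[
p(w|d) = \frac{c(w;d) + \mu\, p(w|C)}{|d| + \mu} = \frac{|d|}{|d| + \mu}\, p_{ml}(w|d) + \frac{\mu}{|d| + \mu}\, p(w|C)
\]

• Absolute discounting: Subtract a constant δ

\[
p(w|d) = \frac{\max(c(w;d) - \delta,\, 0) + \delta\, |d|_u\, p(w|C)}{|d|}
\]

Here |d|_u is the number of unique words in d.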
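The three methods can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the toy collection, the function names, and the default values of `lam`, `mu`, and `delta` are hypothetical and are not tuned values from the slides.

```python
from collections import Counter

def jelinek_mercer(w, doc, p_c, lam=0.5):
    """(1 - lambda) * p_ml(w|d) + lambda * p(w|C)."""
    p_ml = doc[w] / sum(doc.values())
    return (1 - lam) * p_ml + lam * p_c[w]

def dirichlet_prior(w, doc, p_c, mu=2000.0):
    """(c(w;d) + mu * p(w|C)) / (|d| + mu): mu * p(w|C) acts as pseudo counts."""
    return (doc[w] + mu * p_c[w]) / (sum(doc.values()) + mu)

def absolute_discount(w, doc, p_c, delta=0.7):
    """(max(c(w;d) - delta, 0) + delta * |d|_u * p(w|C)) / |d|."""
    unique = len(doc)  # |d|_u, the number of distinct words in d
    return (max(doc[w] - delta, 0.0) + delta * unique * p_c[w]) / sum(doc.values())

# Hypothetical toy collection of two documents.
d1 = Counter("text mining text data".split())
d2 = Counter("food diet data".split())
collection = d1 + d2
total = sum(collection.values())
p_c = {w: c / total for w, c in collection.items()}

methods = (jelinek_mercer, dirichlet_prior, absolute_discount)
sums = [sum(f(w, d1, p_c) for w in p_c) for f in methods]
unseen = [f("food", d1, p_c) for f in methods]
```

Each method yields a proper distribution over the vocabulary (each entry of `sums` is 1 up to rounding) and gives the unseen word “food” a non-zero probability in `d1`.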

Comparison of Three Methods

Query Type   JM      Dir     AD
Title        0.228   0.256   0.237
Long         0.278   0.276   0.260

[Figure: Relative performance of JM, Dir. and AD. x-axis: Method (JM, DIR, AD); y-axis: precision; series: Title Query, Long Query]

The Need of Query-Modeling (Dual-Role of Smoothing)

[Figure: panels “Verbose queries” and “Keyword queries”]

Another Reason for Smoothing

Query = “the algorithms for data mining”

        the     algorithms   for     data    mining
d1:     0.04    0.001        0.02    0.002   0.003
d2:     0.02    0.001        0.01    0.003   0.004

p(“algorithms”|d1) = p(“algorithms”|d2)

p(“data”|d1) < p(“data”|d2)

p(“mining”|d1) < p(“mining”|d2)

But p(q|d1) > p(q|d2)!

We should make p(“the”) and p(“for”) less different for all docs.
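The arithmetic behind this example can be checked directly. The sketch below assumes the five probabilities on the slide line up with the five query words in order, which agrees with the three comparisons stated on the slide.

```python
import math

query = ["the", "algorithms", "for", "data", "mining"]
# Probabilities from the slide, assumed to be aligned with the query words.
p_d1 = dict(zip(query, [0.04, 0.001, 0.02, 0.002, 0.003]))
p_d2 = dict(zip(query, [0.02, 0.001, 0.01, 0.003, 0.004]))

def query_likelihood(words, model):
    """p(q|d) as the product of the per-word probabilities."""
    return math.prod(model[w] for w in words)

# Full query: the common words "the" and "for" dominate the comparison.
full_d1 = query_likelihood(query, p_d1)
full_d2 = query_likelihood(query, p_d2)

# Content words only: this is the comparison we would like to drive ranking.
content = ["algorithms", "data", "mining"]
content_d1 = query_likelihood(content, p_d1)
content_d2 = query_likelihood(content, p_d2)
```

On the full query d1 scores about twice as high as d2, while on the content words alone d2 scores about twice as high as d1. The reversal comes entirely from “the” and “for”.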

Two-stage Smoothing

\[
P(w|d) = (1-\lambda)\,\frac{c(w,d) + \mu\, p(w|C)}{|d| + \mu} + \lambda\, p(w|U)
\]

Stage-1 (parameter μ)

-Explain unseen words

-Dirichlet prior (Bayesian)

Stage-2 (parameter λ)

-Explain noise in query

-2-component mixture
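A minimal sketch of the two-stage formula, P(w|d) = (1-λ)(c(w,d) + μ p(w|C)) / (|d| + μ) + λ p(w|U). The toy data, the function name `two_stage`, and the parameter defaults are hypothetical. Setting the user background model p(w|U) equal to p(w|C) is an assumption made here only to keep the example small.

```python
from collections import Counter

def two_stage(w, doc, p_c, p_u, mu=10.0, lam=0.3):
    """Stage 1: Dirichlet prior smoothing with the collection model p(w|C).
    Stage 2: interpolation with the user background model p(w|U)."""
    stage1 = (doc[w] + mu * p_c[w]) / (sum(doc.values()) + mu)
    return (1 - lam) * stage1 + lam * p_u[w]

# Hypothetical toy collection.
d1 = Counter("text mining text data".split())
d2 = Counter("food diet data".split())
collection = d1 + d2
total = sum(collection.values())
p_c = {w: c / total for w, c in collection.items()}
p_u = p_c  # assumption: approximate p(w|U) by p(w|C)

total_prob = sum(two_stage(w, d1, p_c, p_u) for w in p_c)
stage1_only = two_stage("text", d1, p_c, p_u, lam=0.0)
```

With `lam=0.0` the second stage vanishes and the function reduces to plain Dirichlet prior smoothing.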

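Estimating μ using leave-one-out

Leave-one-out: each word w1, w2, ..., wn of a document d is predicted from the rest of the document, P(w1|d − w1), P(w2|d − w2), ..., P(wn|d − wn)

log-likelihood

\[
\ell_{-1}(\mu|C) = \sum_{i=1}^{N} \sum_{w \in V} c(w,d_i) \log \frac{c(w,d_i) - 1 + \mu\, p(w|C)}{|d_i| - 1 + \mu}
\]

Maximum Likelihood Estimator

\[
\hat{\mu} = \arg\max_{\mu}\, \ell_{-1}(\mu|C)
\]

Newton’s Method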
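A minimal sketch of the leave-one-out objective for μ. Note the substitution: the slide names Newton’s method for the maximization, whereas this sketch uses a plain grid search over a few candidate values, which is simpler but coarser. The toy collection, the grid, and the function name are hypothetical.

```python
import math
from collections import Counter

def leave_one_out_loglik(mu, docs, p_c):
    """Sum over documents and words of
    c(w,d) * log((c(w,d) - 1 + mu * p(w|C)) / (|d| - 1 + mu))."""
    total = 0.0
    for counts in docs:
        doc_len = sum(counts.values())
        for w, c in counts.items():
            total += c * math.log((c - 1 + mu * p_c[w]) / (doc_len - 1 + mu))
    return total

# Hypothetical toy collection.
docs = [
    Counter("text mining text data mining text".split()),
    Counter("food diet food data".split()),
    Counter("data mining algorithms data".split()),
]
collection = sum(docs, Counter())
total = sum(collection.values())
p_c = {w: c / total for w, c in collection.items()}

# Grid search stands in for Newton's method (an assumption of this sketch).
grid = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 50.0, 100.0]
best_mu = max(grid, key=lambda mu: leave_one_out_loglik(mu, docs, p_c))
```

The objective is only defined for μ > 0: a word that occurs once in a document contributes log(μ p(w|C) / (|d| − 1 + μ)), which diverges as μ approaches zero.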

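Estimating λ using Mixture Model

Stage-1: P(w|d1), ..., P(w|dN) for documents d1, ..., dN

Stage-2: (1−λ)p(w|d1) + λp(w|U), ..., (1−λ)p(w|dN) + λp(w|U)

For a query q = q1 ... qm:

\[
p(q|\lambda,U) = \sum_{i=1}^{N} \pi_i \prod_{j=1}^{m} \left( (1-\lambda)\, p(q_j|\hat{\theta}_{d_i}) + \lambda\, p(q_j|U) \right)
\]

\[
\hat{\lambda} = \arg\max_{\lambda}\, p(q|\lambda,U)
\]

Maximum Likelihood Estimator, computed with the Expectation-Maximization (EM) algorithm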
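A minimal EM sketch for the mixture p(q|λ,U) = Σ_i π_i Π_j ((1−λ) p(q_j|θ_di) + λ p(q_j|U)). It treats the generating document and, per query term, the generating component as latent variables. The toy probabilities, the uniform initialization of π, the starting value of λ, and the iteration count are all assumptions of this sketch.

```python
import math

def mixture_likelihood(query, doc_models, p_u, lam, pi):
    """p(q|lambda,U) = sum_i pi_i * prod_j ((1-lam) p(q_j|d_i) + lam p(q_j|U))."""
    return sum(
        pi_i * math.prod((1 - lam) * m[w] + lam * p_u[w] for w in query)
        for pi_i, m in zip(pi, doc_models)
    )

def em_lambda(query, doc_models, p_u, lam=0.5, iterations=50):
    pi = [1.0 / len(doc_models)] * len(doc_models)
    history = []
    for _ in range(iterations):
        history.append(mixture_likelihood(query, doc_models, p_u, lam, pi))
        # E-step: posterior over documents, and per-term posterior that the
        # term was generated by the background model U.
        doc_post, noise_post = [], []
        for pi_i, m in zip(pi, doc_models):
            mix = [(1 - lam) * m[w] + lam * p_u[w] for w in query]
            doc_post.append(pi_i * math.prod(mix))
            noise_post.append([lam * p_u[w] / x for w, x in zip(query, mix)])
        norm = sum(doc_post)
        doc_post = [g / norm for g in doc_post]
        # M-step: re-estimate pi and lambda from the expected assignments.
        pi = doc_post
        lam = sum(g * sum(z) for g, z in zip(doc_post, noise_post)) / len(query)
    return lam, pi, history

# Hypothetical stage-1 (already smoothed) document models and background model,
# restricted to the query words.
query = ["the", "data", "mining"]
doc_models = [
    {"the": 0.05, "data": 0.2, "mining": 0.2},
    {"the": 0.05, "data": 0.01, "mining": 0.01},
]
p_u = {"the": 0.5, "data": 0.05, "mining": 0.05}

lam, pi, history = em_lambda(query, doc_models, p_u)
```

In this toy setting the background model explains “the” better than either document does, so the estimated λ stays strictly between 0 and 1, and the likelihood recorded in `history` never decreases from one iteration to the next.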

Automatic 2-stage results ≈ Optimal 1-stage results

Average precision (3 DB’s + 4 query types, 150 topics)

Collection   Query   Optimal-JM   Optimal-Dir   Auto-2stage
AP88-89      SK      20.3%        23.0%         22.2%*
             LK      36.8%        37.6%         37.4%
             SV      18.8%        20.9%         20.4%
             LV      28.8%        29.8%         29.2%
WSJ87-92     SK      19.4%        22.3%         21.8%*
             LK      34.8%        35.3%         35.8%
             SV      17.2%        19.6%         19.9%
             LV      27.7%        28.2%         28.8%*
ZIFF1-2      SK      17.9%        21.5%         20.0%
             LK      32.6%        32.6%         32.2%
             SV      15.6%        18.5%         18.1%
             LV      26.7%        27.9%         27.9%*

Acknowledgement

• Many thanks to Chengxiang Zhai, who generously shared his slides on the language modeling approach to information retrieval
