Language Models for Information Retrieval
Andy Luong and Nikita Sudan
Outline
Language Model
Types of Language Models
Query Likelihood Model
Smoothing
Evaluation
Comparison with other approaches
Language Model
A language model is a function that puts a probability measure over strings drawn from some vocabulary.
Language Models
Rank by P(q | Md), the probability of the query under the document's language model, instead of directly modeling P(R=1 | q, d).
Example
Doc1: “frog said that toad likes frog”
Doc2: “toad likes frog”

     frog  said  that  toad  likes  STOP
M1   1/3   1/6   1/6   1/6   1/6    .2
M2   1/3   0     0     1/3   1/3    .2
Example Continued
q = “frog likes toad”
P(q | M1) = (1/3)*(1/6)*(1/6)*0.8*0.8*0.2 ≈ .0012
P(q | M2) = (1/3)*(1/3)*(1/3)*0.8*0.8*0.2 ≈ .0047
P(q | M1) < P(q | M2)
     frog  said  that  toad  likes  STOP
M1   1/3   1/6   1/6   1/6   1/6    .2
M2   1/3   0     0     1/3   1/3    .2
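The example above can be reproduced with a few lines of Python. This is a toy sketch of the computation, not code from the slides; the function names `unigram_lm` and `query_likelihood` and the `stop_prob` parameter are my own, and STOP is modeled as in the example (continue with probability 0.8 between terms, stop with 0.2 at the end):

```python
def unigram_lm(doc):
    """MLE unigram model: P(t | Md) = count of t in d / length of d."""
    tokens = doc.split()
    return {t: tokens.count(t) / len(tokens) for t in set(tokens)}

def query_likelihood(query, model, stop_prob=0.2):
    """P(q | Md): product of term probabilities, times the STOP factors."""
    terms = query.split()
    p = stop_prob * (1 - stop_prob) ** (len(terms) - 1)
    for t in terms:
        p *= model.get(t, 0.0)  # a term unseen in the document zeroes the product
    return p

m1 = unigram_lm("frog said that toad likes frog")
m2 = unigram_lm("toad likes frog")
p1 = query_likelihood("frog likes toad", m1)  # (1/3)*(1/6)*(1/6)*0.8*0.8*0.2
p2 = query_likelihood("frog likes toad", m2)  # (1/3)*(1/3)*(1/3)*0.8*0.8*0.2
```

The shorter document wins here because every query term has high relative frequency in it, matching the comparison on the slide.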
Types of Language Models
Chain rule: P(t1 t2 t3 t4) = P(t1) P(t2 | t1) P(t3 | t1 t2) P(t4 | t1 t2 t3)
Unigram LM: P(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4)
Bigram LM: P(t1 t2 t3 t4) = P(t1) P(t2 | t1) P(t3 | t2) P(t4 | t3)
The unigram model is a multinomial distribution over terms, where M is the size of the term vocabulary.
Higher-order models capture word order, but their additional constraints require far more data to estimate term frequencies reliably.
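The unigram and bigram factorizations can be contrasted in a toy sketch (the helper functions and the example probability tables below are my own illustration, not from the slides; log probabilities are used to avoid underflow):

```python
import math

def unigram_logprob(seq, p1):
    """Unigram LM: terms are independent, so log P = sum of per-term logs."""
    return sum(math.log(p1[t]) for t in seq)

def bigram_logprob(seq, p1, p2):
    """Bigram LM: first term unconditional, each later term conditioned on its predecessor."""
    return math.log(p1[seq[0]]) + sum(math.log(p2[(a, b)]) for a, b in zip(seq, seq[1:]))

seq = ["toad", "likes", "frog"]
p1 = {"toad": 1/3, "likes": 1/3, "frog": 1/3}             # hypothetical unigram table
p2 = {("toad", "likes"): 1.0, ("likes", "frog"): 1.0}     # hypothetical bigram table
uni = unigram_logprob(seq, p1)       # 3 * log(1/3)
bi = bigram_logprob(seq, p1, p2)     # log(1/3), since both bigrams are certain
```

The bigram model assigns this sequence a much higher probability because it exploits the order information the unigram model throws away.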
Query Likelihood Model
By Bayes' rule, P(d | q) ∝ P(q | d) P(d); with a uniform document prior, ranking by P(d | q) ≈ ranking by P(q | Md).
Infer an LM for each document, estimate P(q | Md(i)), and rank documents based on these probabilities.
MLE: P̂(t | Md) = tf(t,d) / Ld, the frequency of t in d divided by the number of tokens in d.
Smoothing: Basic Intuition
A query term that is new or unseen in the document gets P(t | Md) = 0 under MLE, and a single zero probability makes P(q | Md) = 0.
Why else should we smooth? MLE overestimates terms that happen to occur in the document and gives no weight to everything else.
Smoothing Continued
A non-occurring term should receive a probability close to, but not exceeding, its relative frequency in the whole collection; the collection frequency acts as an upper probability bound.
Linear Interpolation Language Model
P(t | d) = λ P̂(t | Md) + (1 − λ) P̂(t | Mc), mixing the document model with a model Mc built from the entire collection.
Example
Doc1: “frog said that toad likes frog”
Doc2: “toad likes frog”

     frog  said  that  toad  likes
M1   1/3   1/6   1/6   1/6   1/6
M2   1/3   0     0     1/3   1/3
C    1/3   1/9   1/9   2/9   2/9
Example Continued
q = “frog said”, λ = ½
P(q | M1) = [(1/3 + 1/3)*(1/2)] * [(1/6 + 1/9)*(1/2)] ≈ .046
P(q | M2) = [(1/3 + 1/3)*(1/2)] * [(0 + 1/9)*(1/2)] ≈ .018
P(q | M1) > P(q | M2)
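The interpolated example can also be checked in code. This is a sketch under the same setup as the slide (the collection model is built by pooling the two documents; the function names `mle` and `interpolated_likelihood` are my own):

```python
def mle(doc):
    """MLE unigram model over a whitespace-tokenized document."""
    toks = doc.split()
    return {t: toks.count(t) / len(toks) for t in set(toks)}

def interpolated_likelihood(query, doc_model, coll_model, lam=0.5):
    """P(q | d) = product over query terms of lam*P(t|Md) + (1-lam)*P(t|Mc)."""
    p = 1.0
    for t in query.split():
        p *= lam * doc_model.get(t, 0.0) + (1 - lam) * coll_model.get(t, 0.0)
    return p

doc1 = "frog said that toad likes frog"
doc2 = "toad likes frog"
m1, m2 = mle(doc1), mle(doc2)
mc = mle(doc1 + " " + doc2)  # collection model from the pooled documents
p1 = interpolated_likelihood("frog said", m1, mc)  # 5/108 ≈ .046
p2 = interpolated_likelihood("frog said", m2, mc)  # 1/54  ≈ .018
```

Note that Doc2, which lacks “said”, still gets a nonzero score: the collection model supplies the missing mass, which is exactly what smoothing is for.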
Evaluation
Precision = |relevant documents ∩ retrieved documents| / |retrieved documents|
Recall = |relevant documents ∩ retrieved documents| / |relevant documents|
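The two measures above are set ratios, which makes them a one-liner each. A minimal sketch (the document IDs are made up for illustration):

```python
def precision_recall(relevant, retrieved):
    """Precision and recall from sets of relevant and retrieved document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    return hits / len(retrieved), hits / len(relevant)

# 2 of the 4 retrieved docs are relevant; 2 of the 3 relevant docs were retrieved
p, r = precision_recall({"d1", "d2", "d3"}, {"d2", "d3", "d4", "d5"})
```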
Tf-Idf
A term's importance increases proportionally with the number of times it appears in the document, but is offset by the frequency of the term in the corpus.
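That intuition can be sketched with one common tf-idf variant (raw term frequency times log inverse document frequency; there are several weighting schemes, and this choice is mine, not the slides'):

```python
import math

def tf_idf(term, doc, corpus):
    """Raw term frequency times log inverse document frequency."""
    tf = doc.split().count(term)
    df = sum(1 for d in corpus if term in d.split())  # documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = ["frog said that toad likes frog", "toad likes frog"]
w_said = tf_idf("said", corpus[0], corpus)  # appears in 1 of 2 docs: tf=1, idf=log(2)
w_frog = tf_idf("frog", corpus[0], corpus)  # appears in every doc, so idf = log(1) = 0
```

“frog” is frequent in the document but occurs in every document, so its weight collapses to zero, which is exactly the offsetting effect described above.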
Ponte and Croft’s Experiments
Pros and Cons
“Mathematically precise, conceptually simple, computationally tractable and intuitively appealing.”
Relevance is not explicitly captured by the model.
Query vs. Document Model
(a) Query likelihood: rank by P(q | Md)
(b) Document likelihood: rank by P(d | Mq)
(c) Model comparison: build both a query model and a document model and rank by the KL divergence between them, KL(Mq ‖ Md)
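The model-comparison score can be sketched directly from the definition of KL divergence, KL(p ‖ q) = Σ p(t) log(p(t)/q(t)). The toy query and document models below are my own illustration (in practice the document model would be smoothed so that q(t) > 0 wherever p(t) > 0):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) over a shared vocabulary; terms with p(t) = 0 contribute nothing."""
    return sum(pt * math.log(pt / q[t]) for t, pt in p.items() if pt > 0)

mq = {"frog": 0.5, "toad": 0.5, "likes": 0.0}        # hypothetical query model
md = {"frog": 1/3, "toad": 1/3, "likes": 1/3}        # hypothetical document model
d = kl_divergence(mq, md)  # log(1.5): the divergence between the two models
```

A smaller divergence means the document model is closer to the query model, so documents are ranked by increasing KL(Mq ‖ Md); a document whose model equals the query model scores a perfect 0.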
Thank you.
Questions?