Statistical Language Models for Information Retrieval
Tutorial at ACM SIGIR 2005, Aug. 15, 2005
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai
© ChengXiang Zhai, 2005 2
Goal of the Tutorial
• Introduce the emerging area of applying statistical language models (SLMs) to information retrieval (IR).
• Targeted audience:
– IR practitioners who are interested in acquiring advanced modeling techniques
– IR researchers who are looking for new research problems in IR models
• Accessible to anyone with basic knowledge of probability and statistics
Scope of the Tutorial
• What will be covered
– Brief background on IR and SLMs
– Review of recent applications of unigram SLMs in IR
– Details of some specific methods that are either empirically effective or theoretically important
– A framework for systematically exploring SLMs in IR
– Outstanding research issues in applying SLMs to IR
• What will not be covered
– Traditional IR methods (see any IR textbook, e.g., [Baeza-Yates & Ribeiro-Neto 99, Grossman & Frieder 04])
– Implementation of IR systems (see [Witten et al. 99])
– Discussion of high-order or other complex SLMs (see [Manning & Schutze 99] and [Jelinek 98])
– Application of SLMs in supervised learning, e.g., TDT, text categorization (see publications in Machine Learning, Speech Recognition, and Natural Language Processing)
Tutorial Outline
1. Introduction
2. The Basic Language Modeling Approach
3. More Advanced Language Models
4. Language Models for Special Retrieval Tasks
5. A General Framework for Applying SLMs to IR
6. Summary
Part 1: Introduction
1. Introduction
- Information Retrieval (IR)
- Statistical Language Models (SLMs)
- Applications of SLMs to IR
2. The Basic Language Modeling Approach
3. More Advanced Language Models
4. Language Models for Special Retrieval Tasks
5. A General Framework for Applying SLMs to IR
6. Summary
We are here
What is Information Retrieval (IR)?
• Narrow sense (= ad hoc text retrieval)
– Given a collection of text documents (information items)
– Given a text query from a user (information need)
– Retrieve relevant documents from the collection
• A broader sense of IR may include
– Retrieving non-textual information (e.g., images)
– Other tasks (e.g., filtering, categorization or summarization)
• In this tutorial, IR ≈ ad hoc text retrieval
• Ad hoc text retrieval is fundamental to IR and has many applications (e.g., search engines, digital libraries, …)
IR “is Easy”?
• Easy queries
– Try “ACM SIGIR 2005” (or just “SIGIR 2005”) with Google, and you’ll get the conference home page right on the top
– Try “retrieval applications”, and you’ll be happy to see many pages mentioning “retrieval applications”
• IR CAN be perceived as being easy because
– Queries can be specific and match the words in a page exactly
– The user can’t easily judge the completeness of results -- you’ll be happy if Google returns 3 relevant pages on the top, even if there are 30 more relevant pages missing
• Harder queries:
– “design philosophy of Microsoft windows XP”, “progress in developing new retrieval models”, “IR applications”, …
IR is Hard!
• Under/over-specified queries
– Ambiguous: “buying CDs” (money or music?)
– Incomplete: What kind of CDs?
– What if “CD” is never mentioned in document?
• Vague semantics of documents
– Ambiguity: e.g., word-sense, structural
– Incomplete: Inferences required
• Even hard for people!
– ~ 80% agreement in human judgments(?)
Formalization of IR Tasks
• Vocabulary V = {w1, w2, …, wN} of a language
• Query q = q1,…,qm, where qi ∈ V
• Document di = di1,…,di,mi, where dij ∈ V
• Collection C = {d1, …, dk}
• Set of relevant documents R(q) ⊆ C
– Generally unknown and user-dependent
– The query is a “hint” on which docs are in R(q)
• Task = compute R’(q), an approximation of R(q)
Computing R’(q): Doc Selection vs. Ranking
(Figure: a collection of relevant (+) and non-relevant (-) docs, either partitioned by a classifier or sorted by a ranking function)
• Doc selection, f(d,q)=?: R’(q) = {d∈C | f(d,q)=1}, where f(d,q) ∈ {0,1} is an indicator function (classifier)
• Doc ranking, f(d,q)=?: R’(q) = {d∈C | f(d,q) > θ}, where f(d,q) is a ranking function and θ is a cutoff implicitly set by the user
• Example ranking (score, doc, true relevance): 0.98 d1 +, 0.95 d2 +, 0.83 d3 -, 0.80 d4 +, 0.76 d5 -, 0.56 d6 -, 0.34 d7 -, 0.21 d8 +, 0.21 d9 -; here the cutoff θ = 0.77 defines R’(q)
Problems with Doc Selection
• The classifier is unlikely to be accurate
– “Over-constrained” query (terms are too specific): no relevant documents found
– “Under-constrained” query (terms are too general): over delivery
– It is extremely hard to find the right position between these two extremes
• Even if it is accurate, not all relevant documents are equally relevant
• Relevance is a matter of degree!
Ranking is often preferred
• A user can stop browsing anywhere, so the boundary/cutoff is controlled by the user
– High recall users would view more items
– High precision users would view only a few
• Theoretical justification: Probability Ranking Principle [Robertson 77], Risk Minimization [Zhai 02, Zhai & Lafferty 03]
• The retrieval problem is now reduced to defining a ranking function f, such that, for all q, d1, d2, f(q,d1) > f(q,d2) iff p(Relevant|q,d1) >p(Relevant|q,d2)
• Function f is an operational definition of relevance
• Most IR research is centered on finding a good f…
Two Well-Known Traditional Retrieval Formulas [Singhal 01]
(The two formulas, pivoted length normalization and Okapi/BM25, are omitted here; see [Singhal 01])
Key retrieval heuristics:
– TF (Term Frequency)
– IDF (Inverse Doc Freq.)
– Length normalization
Other heuristics:
– Stemming
– Stop word removal
– Phrases
Similar quantities will occur in the LMs…
Feedback in IR
(Figure: the user’s query goes to the retrieval engine, which searches the document collection and returns scored results, e.g., d1 3.5, d2 2.4, …, dk 0.5; feedback on these results produces an updated query: learning from examples)
• Relevance feedback: the user judges documents (e.g., d1 +, d2 -, d3 +, …, dk -), and the system learns from these examples
• Pseudo feedback: assume the top 10 docs are relevant (e.g., d1 +, d2 +, d3 +, …, dk -), and learn from them automatically
Feedback in IR (cont.)
• An essential component in any IR method
• Relevance feedback is always desirable, but a user may not be willing to provide explicit judgments
• Pseudo/automatic feedback is always possible, and often improves performance on average through
– Exploiting word co-occurrences
– Enriching a query with additional related words
– Indirectly addressing issues such as ambiguous words and synonyms
Evaluation of Retrieval Performance
• Example ranked list: 1. d1, 2. d2, 3. d3, 4. d4, 5. d5, 6. d6, 7. d7, 8. d8, 9. d9, 10. d10 (total # relevant docs = 8)
• As a SET of results:
– precision = #(relevant ∩ retrieved) / #retrieved, e.g., 4/10 = 0.4
– recall = #(relevant ∩ retrieved) / #relevant, e.g., 4/8 = 0.5
• As a ranked list: plot precision against recall at each relevant document (PR curve)
(Figure: PR curves for three rankings A, B, and C, from precision 1.0 at recall 0.0 down toward recall 1.0; A > C and B > C, but is A > B?)
• How do we compare different rankings? Which is the best? Summarize a ranking with a single number:
AvgPrec = (1/k) Σ_{i=1}^{k} p_i
– p_i = precision at the rank where the i-th relevant doc is retrieved
– p_i = 0 if the i-th relevant doc is not retrieved
– k is the total # of relevant docs
• Avg. Prec. is sensitive to the position of each relevant doc! E.g., with relevant docs at ranks 1, 2, 4, and 10:
AvgPrec = (1/1 + 2/2 + 3/4 + 4/10 + 0 + 0 + 0 + 0)/8 = 0.394
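The AvgPrec definition above can be sketched in code. This is an illustrative helper (`average_precision` is not from the tutorial), applied to the slide's hypothetical ranking with relevant docs at ranks 1, 2, 4, and 10 out of 8 relevant docs in total; the four unretrieved relevant docs are given made-up ids.

```python
def average_precision(ranking, relevant):
    """ranking: list of doc ids in ranked order; relevant: set of relevant ids."""
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # p_i at the rank of the i-th relevant doc
    # Unretrieved relevant docs contribute p_i = 0; divide by k = total # relevant.
    return precision_sum / len(relevant)

ranking = [f"d{i}" for i in range(1, 11)]
# Relevant at ranks 1, 2, 4, 10; d11..d14 are hypothetical unretrieved relevant docs.
relevant = {"d1", "d2", "d4", "d10", "d11", "d12", "d13", "d14"}
print(average_precision(ranking, relevant))  # (1/1 + 2/2 + 3/4 + 4/10) / 8 ≈ 0.394
```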
Part 1: Introduction (cont.)
1. Introduction
- Information Retrieval (IR)
- Statistical Language Models (SLMs)
- Application of SLMs to IR
2. The Basic Language Modeling Approach
3. More Advanced Language Models
4. Language Models for Special Retrieval Tasks
5. A General Framework for Applying SLMs to IR
6. Summary
We are here
What is a Statistical LM?
• A probability distribution over word sequences
– p(“Today is Wednesday”) ≈ 0.001
– p(“Today Wednesday is”) ≈ 0.0000000000001
– p(“The eigenvalue is positive”) ≈ 0.00001
• Context/topic dependent!
• Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model
Why is a LM Useful?
• Provides a principled way to quantify the uncertainties associated with natural language
• Allows us to answer questions like:
– Given that we see “John” and “feels”, how likely are we to see “happy” as opposed to “habit” as the next word? (speech recognition)
– Given that we observe “baseball” three times and “game” once in a news article, how likely is it about “sports”? (text categorization, information retrieval)
– Given that a user is interested in sports news, how likely would the user use “baseball” in a query? (information retrieval)
Source-Channel Framework (Model of Communication System [Shannon 48])
Source → Transmitter (encoder) → Noisy Channel → Receiver (decoder) → Destination
The source emits X with P(X); the channel transforms it into Y with P(Y|X); the receiver recovers X’ from Y: P(X|Y)=?
X̂ = argmax_X p(X|Y) = argmax_X p(Y|X) p(X)   (Bayes Rule)
When X is text, p(X) is a language model.
Many examples:
– Speech recognition: X = word sequence, Y = speech signal
– Machine translation: X = English sentence, Y = Chinese sentence
– OCR error correction: X = correct word, Y = erroneous word
– Information retrieval: X = document, Y = query
– Summarization: X = summary, Y = document
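The Bayes-rule decoding step can be illustrated with a minimal sketch in the OCR setting. All probabilities below are made-up toy numbers, and `decode` is a hypothetical helper, not part of the tutorial.

```python
def decode(y, candidates, p_x, p_y_given_x):
    """Pick the source X maximizing p(Y|X) * p(X), per the argmax formula above."""
    return max(candidates, key=lambda x: p_y_given_x[(y, x)] * p_x[x])

# OCR error correction: X = correct word, Y = observed (possibly erroneous) word.
p_x = {"form": 0.6, "farm": 0.4}            # language model p(X)
p_y_given_x = {("forn", "form"): 0.2,       # channel model p(Y|X)
               ("forn", "farm"): 0.1}
print(decode("forn", ["form", "farm"], p_x, p_y_given_x))  # "form": 0.12 > 0.04
```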
The Simplest Language Model(Unigram Model)
• Generate a piece of text by generating each word independently
• Thus, p(w1 w2 ... wn)=p(w1)p(w2)…p(wn)
• Parameters: {p(wi)}, with p(w1)+…+p(wN)=1 (N is the vocabulary size)
• Essentially a multinomial distribution over words
• A piece of text can be regarded as a sample drawn according to this word distribution
Text Generation with Unigram LM
A (unigram) language model p(w|θ) generates document D by sampling; given θ, p(D|θ) varies according to D.
• Topic 1 (text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, … → a text mining paper
• Topic 2 (health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, … → a food nutrition paper
Estimation of Unigram LM
(Unigram) language model p(w|θ) = ? Estimated from a document with total # words = 100:
text 10 → 10/100, mining 5 → 5/100, association 3 → 3/100, database 3 → 3/100, algorithm 2 → 2/100, …, query 1 → 1/100, efficient 1 → 1/100, …
How good is the estimated model?
It gives our document sample the highest probability, but it doesn’t generalize well… More about this later…
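The relative-frequency estimate above is a one-liner over word counts. A minimal sketch on a toy document (not the slide's counts; `ml_unigram` is an illustrative name):

```python
from collections import Counter

def ml_unigram(tokens):
    """Maximum likelihood unigram LM: p(w|theta) = count(w, d) / |d|."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

doc = "text mining text association database text mining".split()
lm = ml_unigram(doc)
print(lm["text"])                  # 3/7: relative frequency in the document
print(lm.get("clustering", 0.0))   # 0.0: an unseen word gets zero probability
```

The zero probability for unseen words is exactly the generalization problem the slide flags, and it motivates the smoothing methods discussed later.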
More Sophisticated LMs
• N-gram language models
– In general, p(w1 w2 ... wn)=p(w1)p(w2|w1)…p(wn|w1 …wn-1)
– n-gram: conditioned only on the past n-1 words
– E.g., bigram: p(w1 ... wn)=p(w1)p(w2|w1) p(w3|w2) …p(wn|wn-1)
• Remote-dependence language models (e.g., Maximum Entropy model)
• Structured language models (e.g., probabilistic context-free grammar)
• Will barely be covered in this tutorial. If interested, read [Jelinek 98, Manning & Schutze 99, Rosenfeld 00]
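The bigram factorization above can be sketched as follows; the probability tables are made-up toy numbers, and `bigram_prob` is a hypothetical helper:

```python
def bigram_prob(sentence, p_first, p_next):
    """Chain rule with a bigram assumption: p(w1...wn) = p(w1) * prod p(wi|wi-1)."""
    words = sentence.split()
    prob = p_first[words[0]]
    for prev, cur in zip(words, words[1:]):
        prob *= p_next[(cur, prev)]  # p(cur | prev): conditioned on one past word
    return prob

p_first = {"today": 0.01}
p_next = {("is", "today"): 0.2, ("wednesday", "is"): 0.05}
print(bigram_prob("today is wednesday", p_first, p_next))  # 0.01 * 0.2 * 0.05
```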
Why Just Unigram Models?
• Difficulty in moving toward more complex models
– They involve more parameters, so need more data to estimate (A doc is an extremely small sample)
– They increase the computational complexity significantly, both in time and space
• Capturing word order or structure may not add so much value for “topical inference”
• But, using more sophisticated models can still be expected to improve performance ...
Evaluation of SLMs
• Direct evaluation criterion: How well does the model fit the data to be modeled?
– Example measures: Data likelihood, perplexity, cross entropy, Kullback-Leibler divergence (mostly equivalent)
• Indirect evaluation criterion: Does the model help improve the performance of the task?
– Specific measure is task dependent
– For retrieval, we look at whether a model helps improve retrieval accuracy
– We hope more “reasonable” LMs would achieve better retrieval performance
Part 1: Introduction (cont.)
1. Introduction
- Information Retrieval (IR)
- Statistical Language Models (SLMs)
- Application of SLMs to IR
2. The Basic Language Modeling Approach
3. More Advanced Language Models
4. Language Models for Special Retrieval Tasks
5. A General Framework for Applying SLMs to IR
6. Summary
We are here
Representative LMs for IR (1998–2005, timeline figure redrawn as a list)
• Basic LMs
– Query likelihood scoring: Ponte & Croft 98; Hiemstra & Kraaij 99; Miller et al. 99
– Beyond unigram: Song & Croft 99
– Parameter tuning: Ng 00
– Smoothing examined: Zhai & Lafferty 01a
– Term-specific smoothing: Hiemstra 02
– Bayesian query likelihood: Zaragoza et al. 03
– Concept likelihood: Srikanth & Srihari 03
– Dependency LM: Gao et al. 04
– Cluster LM: Liu & Croft 04; Kurland & Lee 04
– Parsimonious LM: Hiemstra et al. 04
• Feedback LMs
– Translation model: Berger & Lafferty 99
– Relevance LM: Lavrenko & Croft 01
– Model-based FB: Zhai & Lafferty 01b
– Rel. query FB: Nallapati et al. 03
• Framework
– Theoretical justification: Lafferty & Zhai 01a
– Two-stage LMs: Zhai & Lafferty 02
– Lafferty & Zhai 01b; Zhai & Lafferty 03
• Special IR tasks
– Xu & Croft 99; Xu et al. 01; Lavrenko et al. 02; Zhang et al. 02; Cronen-Townsend et al. 02; Si et al. 02; URL prior: Kraaij et al. 02; Title LM: Jin et al. 02; Time prior: Li & Croft 03; Ogilvie & Callan 03; Zhai et al. 03; Shen et al. 05
• Dissertations: Ponte 98; Berger 01; Hiemstra 01; Zhai 02; Kraaij 04; Lavrenko 04
Ponte & Croft’s Pioneering Work [Ponte & Croft 98]
• Contribution 1: – A new “query likelihood” scoring method: p(Q|D)
– [Maron and Kuhns 60] had the idea of query likelihood, but didn’t work out how to estimate p(Q|D)
• Contribution 2:– Connecting LMs with text representation and weighting in IR
– [Wong & Yao 89] had the idea of representing text with a multinomial distribution (relative frequency), but didn’t study the estimation problem
• Good performance is reported using the simple query likelihood method
Early Work (1998-1999)
• Slightly after SIGIR 98, in TREC 7, two groups explored similar ideas independently: BBN [Miller et al., 99] & Univ. of Twente [Hiemstra & Kraaij 99]
• In TREC-8, Ng from MIT motivated the same query likelihood method in a different way [Ng 99]
• All follow the simple query likelihood method; they differ in how the model is estimated and in the event model for the query
• All show promising empirical results
• Main problems: – Feedback is explored heuristically
– Lack of understanding why the method works….
Later Work (1999-)
• Attempt to understand why LMs work [Zhai & Lafferty 01a, Lafferty & Zhai 01a, Ponte 01, Greiff & Morgan 03, Sparck Jones et al. 03, Lavrenko 04]
• Further extend/improve the basic LMs [Song & Croft 99, Berger & Lafferty 99, Jin et al. 02, Nallapati & Allan 02, Hiemstra 02, Zaragoza et al. 03, Srikanth & Srihari 03, Nallapati et al 03, Gao et al. 04, Li & Croft 04, Kurland & Lee 04,Hiemstra et al. 04]
• Explore alternative ways of using LMs for ad hoc IR [Xu & Croft 99, Lavrenko & Croft 01, Lafferty & Zhai 01a, Zhai & Lafferty 01b, Lavrenko 04]
• Explore the use of SLMs for special retrieval tasks [Xu & Croft 99, Xu et al. 01, Lavrenko et al. 02, Cronen-Townsend et al. 02, Zhang et al. 02, Ogilvie & Callan 03, Zhai et al. 03, Shen et al. 05]
Part 2: The Basic LM Approach
1. Introduction
2. The Basic Language Modeling Approach - Query Likelihood Document Ranking
- Smoothing of Language Models
- Why does it work?
- Variants of the basic LM
3. More Advanced Language Models
4. Language Models for Special Retrieval Tasks
5. A General Framework for Applying SLMs to IR
6. Summary
We are here
The Basic LM Approach [Ponte & Croft 98]
(Figure: each document, e.g., a text mining paper or a food nutrition paper, is used to estimate its own language model: p(“text”)=?, p(“mining”)=?, p(“association”)=?, p(“clustering”)=?, …; p(“food”)=?, p(“nutrition”)=?, p(“healthy”)=?, p(“diet”)=?, …)
Query = “data mining algorithms”
? Which model would most likely have generated this query?
Ranking Docs by Query Likelihood
(Figure: each doc d1, d2, …, dN is associated with a doc LM θd1, θd2, …, θdN; given query q, docs are ranked by the query likelihoods p(q|θd1), p(q|θd2), …, p(q|θdN))
Modeling Queries: Different Assumptions
• Multi-Bernoulli
– Event: word presence/absence
– Q = (x1, …, x|V|), xi = 1 for presence of word wi, xi = 0 for absence
– Parameters: {p(wi=1|D), p(wi=0|D)}, with p(wi=1|D) + p(wi=0|D) = 1
p(Q=(x1,…,x|V|)|D) = ∏_{i=1}^{|V|} p(wi=xi|D) = ∏_{i: xi=1} p(wi=1|D) × ∏_{i: xi=0} p(wi=0|D)
• Multinomial (Unigram Language Model)
– Event: word selection/sampling
– Q = (n1, …, n|V|), ni = frequency of word wi, n = n1 + … + n|V|
– Conditioned on a fixed n, Q = q1,…,qn, p(Q|D) = p(q1|D)…p(qn|D)
– Parameters: {p(wi|D)}, with p(w1|D) + … + p(w|V||D) = 1
p(Q=(n1,…,n|V|)|D) = p(n|D) × [n! / (n1! … n|V|!)] × ∏_{i=1}^{|V|} p(wi|D)^ni
[Ponte & Croft 98] uses multi-Bernoulli; all other work uses multinomial. Multinomial appears to work better [Song & Croft 99, McCallum & Nigam 98, Lavrenko 04]
Retrieval as LM Estimation
• Document ranking based on query likelihood:
log p(Q|D) = Σ_{i=1}^{m} log p(qi|D), where Q = q1 q2 … qm
and p(w|D) is the document language model
• Retrieval problem ≈ estimation of p(wi|D)
• Smoothing is an important issue, and distinguishes different approaches
How to Estimate p(w|D)
• Simplest solution: Maximum Likelihood Estimator
– P(w|D) = relative frequency of word w in D
– What if a word doesn’t appear in the text? P(w|D)=0
• In general, what probability should we give a word that has not been observed?
• If we want to assign non-zero probabilities to such words, we’ll have to discount the probabilities of observed words
• This is what “smoothing” is about …
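The zero-probability problem above is easy to see in code. A minimal sketch with a toy document (`ml_query_likelihood` is an illustrative name, not from the tutorial):

```python
from collections import Counter

def ml_query_likelihood(query, doc):
    """Query likelihood p(Q|D) = prod_i p(q_i|D) with an unsmoothed ML doc model."""
    counts = Counter(doc.split())
    total = sum(counts.values())
    prob = 1.0
    for q in query.split():
        prob *= counts[q] / total  # zero whenever q never occurs in the doc
    return prob

doc = "text mining algorithms for text data"
print(ml_query_likelihood("text mining", doc))            # positive: both words occur
print(ml_query_likelihood("data mining clustering", doc)) # 0.0: "clustering" is unseen
```

One unseen query word drives the whole likelihood to zero, which is why smoothing is indispensable in this approach.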
Part 2: The Basic LM Approach (cont.)
1. Introduction
2. The Basic Language Modeling Approach - Query Likelihood Document Ranking
- Smoothing of Language Models
- Why does it work?
- Variants of the basic LM
3. More Advanced Language Models
4. Language Models for Special Retrieval Tasks
5. A General Framework for Applying SLMs to IR
6. Summary
We are here
Language Model Smoothing (Illustration)
(Figure: p(w) over words w; the maximum likelihood estimate, p_ML(w) = count of w / count of all words, is spiky and assigns zero to unseen words, while the smoothed LM flattens it out)
How to Smooth?
• All smoothing methods try to
– discount the probability of words seen in a document
– re-allocate the extra counts so that unseen words will have a non-zero count
• Method 1, additive smoothing [Chen & Goodman 98]: add a constant to the count of each word, e.g., “add one” (Laplace):
p(w|d) = (c(w,d) + 1) / (|d| + |V|)
where c(w,d) is the count of w in d, |d| is the length of d (total counts), and |V| is the vocabulary size
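The add-one formula above can be sketched directly; the vocabulary and document are toy data, and `add_one` is an illustrative helper:

```python
from collections import Counter

def add_one(word, doc_tokens, vocab):
    """Laplace smoothing: p(w|d) = (c(w,d) + 1) / (|d| + |V|)."""
    counts = Counter(doc_tokens)
    return (counts[word] + 1) / (len(doc_tokens) + len(vocab))

vocab = {"text", "mining", "association", "clustering", "food"}
doc = ["text", "mining", "text"]
p = {w: add_one(w, doc, vocab) for w in vocab}
print(p["text"], p["clustering"])  # (2+1)/8 for a seen word, (0+1)/8 for an unseen one
assert abs(sum(p.values()) - 1.0) < 1e-12  # still a distribution over V
```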
Improve Additive Smoothing
• Should all unseen words get equal probabilities?
• We can use a reference model to discriminate unseen words:
p(w|d) = p_DML(w|d) if w is seen in d; α_d p(w|REF) otherwise
where p_DML(w|d) is the discounted ML estimate, p(w|REF) is the reference language model, and α_d is the normalizer
α_d = (1 − Σ_{w seen} p_DML(w|d)) / Σ_{w unseen} p(w|REF)
which distributes the probability mass reserved for unseen words
Other Smoothing Methods
• Method 2, absolute discounting [Ney et al. 94]: subtract a constant δ from the count of each word:
p(w|d) = (max(c(w,d) − δ, 0) + δ |d|_u p(w|REF)) / |d|
where |d|_u is the # of unique words in d
• Method 3, linear interpolation [Jelinek-Mercer 80]: “shrink” uniformly toward p(w|REF) with parameter λ:
p(w|d) = (1 − λ) c(w,d)/|d| + λ p(w|REF)
where c(w,d)/|d| is the ML estimate
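Method 3 is a one-line interpolation of the ML estimate with the reference model. A minimal sketch with toy data (`jelinek_mercer` is an illustrative name):

```python
from collections import Counter

def jelinek_mercer(word, doc_tokens, p_ref, lam=0.5):
    """Linear interpolation: p(w|d) = (1 - lam) * c(w,d)/|d| + lam * p(w|REF)."""
    counts = Counter(doc_tokens)
    return (1 - lam) * counts[word] / len(doc_tokens) + lam * p_ref.get(word, 0.0)

p_ref = {"text": 0.05, "mining": 0.01, "clustering": 0.02}
doc = ["text", "mining", "text", "text"]
print(jelinek_mercer("text", doc, p_ref))        # 0.5*(3/4) + 0.5*0.05
print(jelinek_mercer("clustering", doc, p_ref))  # unseen word gets 0.5 * p(w|REF)
```

Note how unseen words now inherit mass in proportion to p(w|REF), rather than the equal shares of additive smoothing.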
Other Smoothing Methods (cont.)
• Method 4, Dirichlet prior/Bayesian [MacKay & Peto 95, Zhai & Lafferty 01a, Zhai & Lafferty 02]: assume μ pseudo counts distributed as p(w|REF):
p(w|d) = (c(w,d) + μ p(w|REF)) / (|d| + μ) = (|d|/(|d|+μ)) · c(w,d)/|d| + (μ/(|d|+μ)) · p(w|REF)
where μ is a parameter
• Method 5, Good-Turing [Good 53]: assume the total # of unseen events to be n1 (# of singletons), and adjust the seen events in the same way:
p(w|d) = r*/|d|, where r* = (r+1) n_{r+1}/n_r and r = c(w,d)
e.g., 0* = n1/n0, 1* = 2 n2/n1, … What if n_{r+1} = 0? What about p(w|REF)? Heuristics needed
So, which method is the best?
It depends on the data and the task!
Cross validation is generally used to choose the best method and/or set the smoothing parameters…
For retrieval, Dirichlet prior performs well…
Backoff smoothing [Katz 87] doesn’t work well due to a lack of 2nd-stage smoothing…
Note that many other smoothing methods exist; see [Chen & Goodman 98] and other publications in speech recognition…
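A minimal sketch of Method 4 with toy data; μ is set artificially small so the effect is visible on a three-word document (`dirichlet` is an illustrative name):

```python
from collections import Counter

def dirichlet(word, doc_tokens, p_ref, mu=2000):
    """Dirichlet prior smoothing: p(w|d) = (c(w,d) + mu*p(w|REF)) / (|d| + mu)."""
    counts = Counter(doc_tokens)
    return (counts[word] + mu * p_ref.get(word, 0.0)) / (len(doc_tokens) + mu)

p_ref = {"text": 0.05, "clustering": 0.02}
doc = ["text", "mining", "text"]
print(dirichlet("text", doc, p_ref, mu=10))        # (2 + 10*0.05) / (3 + 10)
print(dirichlet("clustering", doc, p_ref, mu=10))  # (0 + 10*0.02) / (3 + 10)
```

Because the interpolation weight |d|/(|d|+μ) grows with document length, long documents are smoothed less than short ones, unlike fixed-λ interpolation.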
Comparison of Three Methods [Zhai & Lafferty 01a]
Query Type | Jelinek-Mercer | Dirichlet | Abs. Discounting
Title      | 0.228          | 0.256     | 0.237
Long       | 0.278          | 0.276     | 0.260
(Figure: relative precision of JM, Dir., and AD for title queries vs. long queries)
Comparison is performed on a variety of test collections
Part 2: The Basic LM Approach (cont.)
1. Introduction
2. The Basic Language Modeling Approach - Query Likelihood Document Ranking
- Smoothing of Language Models
- Why does it work?
- Variants of the basic LM
3. More Advanced Language Models
4. Language Models for Special Retrieval Tasks
5. A General Framework for Applying SLMs to IR
6. Summary
We are here
Understanding Smoothing
The general smoothing scheme:
p(w|d) = p_DML(w|d) if w is seen in d; α_d p(w|REF) otherwise
with p_DML(w|d) the discounted ML estimate, p(w|REF) the reference language model, and normalizer
α_d = (1 − Σ_{w seen} p_DML(w|d)) / Σ_{w unseen} p(w|REF)
Plugging this into the query likelihood (the key rewriting step; similar rewritings are very common when using LMs for IR):
log p(q|d) = Σ_{w∈V} c(w,q) log p(w|d)
= Σ_{w: c(w,d)>0} c(w,q) log p_DML(w|d) + Σ_{w: c(w,d)=0} c(w,q) log [α_d p(w|REF)]
= Σ_{w: c(w,d)>0} c(w,q) log [p_DML(w|d) / (α_d p(w|REF))] + |q| log α_d + Σ_{w∈V} c(w,q) log p(w|REF)
This is the retrieval formula using the general smoothing scheme.
Smoothing & TF-IDF Weighting [Zhai & Lafferty 01a]
• Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain
log p(q|d) = Σ_{w∈V: c(w,d)>0, c(w,q)>0} c(w,q) log [p_DML(w|d) / (α_d p(w|REF))] + |q| log α_d + Σ_{w∈V} c(w,q) log p(w|REF)
– The first sum runs only over words in both the query and the doc; it combines TF weighting (through p_DML) with IDF-like weighting (through 1/p(w|REF))
– |q| log α_d acts as doc length normalization (a long doc is expected to have a smaller α_d)
– The last term does not depend on the doc and can be ignored for ranking
• Smoothing with p(w|C) implements TF-IDF weighting + length normalization, i.e., traditional retrieval heuristics
• LMs with simple smoothing can be computed as efficiently as traditional retrieval models
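The rewritten formula can be verified numerically. The sketch below uses Jelinek-Mercer smoothing, for which α_d = λ and p_DML(w|d) = (1−λ) c(w,d)/|d| + λ p(w|REF); the reference model and documents are toy data, and `score` is an illustrative helper:

```python
import math
from collections import Counter

def score(query, doc, p_ref, lam=0.5):
    """log p(q|d) computed via the TF-IDF-like rewrite, touching only matched words."""
    counts, dlen = Counter(doc), len(doc)
    q_counts = Counter(query)
    s = 0.0
    for w, qtf in q_counts.items():
        if counts[w] > 0:  # first sum: words in both query and doc
            p_smooth = (1 - lam) * counts[w] / dlen + lam * p_ref[w]
            s += qtf * math.log(p_smooth / (lam * p_ref[w]))
    s += sum(q_counts.values()) * math.log(lam)               # |q| log alpha_d
    s += sum(qtf * math.log(p_ref[w]) for w, qtf in q_counts.items())  # doc-independent
    return s  # equals log p(q|d) under JM smoothing

p_ref = {"data": 0.01, "mining": 0.005, "text": 0.02}
d1 = ["data", "mining", "data"]
d2 = ["text", "mining"]
q = ["data", "mining"]
print(score(q, d1, p_ref) > score(q, d2, p_ref))  # d1 matches the query better
```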
The Dual-Role of Smoothing [Zhai & Lafferty 02]
(Figure: retrieval precision as a function of the smoothing parameter, for keyword queries vs. verbose queries, on short and long variants of each; verbose queries are much more sensitive to smoothing)
Why does query type affect smoothing sensitivity?
Another Reason for Smoothing
Query = “the algorithms for data mining”
d1: p(“the”)=0.04, p(“algorithms”)=0.001, p(“for”)=0.02, p(“data”)=0.002, p(“mining”)=0.003
d2: p(“the”)=0.02, p(“algorithms”)=0.001, p(“for”)=0.01, p(“data”)=0.003, p(“mining”)=0.004
On the content words, p(“algorithms”|d1) = p(“algorithms”|d2), p(“data”|d1) < p(“data”|d2), p(“mining”|d1) < p(“mining”|d2)
Intuitively, d2 should have a higher score, but p(q|d1) > p(q|d2)…
So we should make p(“the”) and p(“for”) less different for all docs, and smoothing helps achieve this goal:
p(w|REF): “the” 0.2, “algorithms” 0.00001, “for” 0.2, “data” 0.00001, “mining” 0.00001
Smoothed d1: 0.04×0.1+0.2×0.9 = 0.184; 0.001×0.1+0.00001×0.9 = 0.000109; 0.02×0.1+0.2×0.9 = 0.182; 0.002×0.1+0.00001×0.9 = 0.000209; 0.003×0.1+0.00001×0.9 = 0.000309
Smoothed d2: 0.02×0.1+0.2×0.9 = 0.182; 0.001×0.1+0.00001×0.9 = 0.000109; 0.01×0.1+0.2×0.9 = 0.181; 0.003×0.1+0.00001×0.9 = 0.000309; 0.004×0.1+0.00001×0.9 = 0.000409
After smoothing with p(w|d) = 0.1 p_DML(w|d) + 0.9 p(w|REF), p(q|d1) < p(q|d2)!
Two-stage Smoothing [Zhai & Lafferty 02]
• Stage 1: explain unseen words, via a Dirichlet prior (Bayesian) with the collection LM p(w|C):
p(w|d) = (c(w,d) + μ p(w|C)) / (|d| + μ)
• Stage 2: explain noise in the query, via a 2-component mixture with a user background model p(w|U), which can be approximated by p(w|C):
p(w|d) = (1−λ) (c(w,d) + μ p(w|C)) / (|d| + μ) + λ p(w|U)
Estimating μ using leave-one-out [Zhai & Lafferty 02]
Leave each word occurrence out in turn and predict it from the rest of the document: p(w1|d−w1), p(w2|d−w2), …, p(wn|d−wn).
Leave-one-out log-likelihood:
l−1(μ|C) = Σ_{i=1}^{N} Σ_{w∈V} c(w,di) log[ (c(w,di) − 1 + μ p(w|C)) / (|di| − 1 + μ) ]
Maximum likelihood estimator (computed with Newton’s method):
μ̂ = argmax_μ l−1(μ|C)
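The leave-one-out objective above is easy to evaluate; the sketch below maximizes it by grid search over a few candidate values instead of the Newton's method used in the paper. The reference model and documents are toy data, and `loo_loglik` is an illustrative name:

```python
import math
from collections import Counter

def loo_loglik(mu, docs, p_ref):
    """l_{-1}(mu|C): each occurrence of w is predicted from the doc minus itself."""
    ll = 0.0
    for d in docs:
        counts, dlen = Counter(d), len(d)
        for w, c in counts.items():
            ll += c * math.log((c - 1 + mu * p_ref[w]) / (dlen - 1 + mu))
    return ll

p_ref = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
docs = [["a", "a", "b", "a"], ["c", "d", "a", "b"]]
grid = [0.1, 0.5, 1, 2, 5, 10, 50]
mu_hat = max(grid, key=lambda mu: loo_loglik(mu, docs, p_ref))
print(mu_hat)
```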
Why would “leave-one-out” work?
Suppose two authors each wrote a 20-word sample:
20 words by author1: abc abc ab c d d abc cd d d abd ab ab ab abcd d e cd e
20 words by author2: abc abc ab c d d abe cb e f acf fb ef aff abef cdc db ge f s
Suppose we keep sampling and get 10 more words. Which author is likely to “write” more new words? (author2, who uses a larger vocabulary)
Now, suppose we leave one occurrence of “e” out and predict it from the remaining 19 words:
p_smooth(“e”|author1) = (1 + μ p(“e”|REF)) / (19 + μ)   (another “e” remains, so μ doesn’t have to be big)
p_smooth(“e”|author2) = (0 + μ p(“e”|REF)) / (19 + μ)   (no other “e” remains, so μ must be big, i.e., more smoothing)
The amount of smoothing is closely related to the underlying vocabulary size
Estimating λ using Mixture Model [Zhai & Lafferty 02]
• Stage 1: for each doc di, estimate a Dirichlet-smoothed LM with the μ̂ estimated by leave-one-out:
p(qj|di) = (c(qj,di) + μ̂ p(qj|C)) / (|di| + μ̂)
• Stage 2: treat the query Q = q1…qm as a sample from the 2-component mixture
(1−λ) p(w|di) + λ p(w|U)
and estimate λ with the maximum likelihood estimator, computed by the Expectation-Maximization (EM) algorithm
Automatic 2-stage results vs. optimal 1-stage results [Zhai & Lafferty 02]
Collection | Query | Optimal-JM | Optimal-Dir | Auto-2stage
AP88-89    | SK    | 20.3%      | 23.0%       | 22.2%*
AP88-89    | LK    | 36.8%      | 37.6%       | 37.4%
AP88-89    | SV    | 18.8%      | 20.9%       | 20.4%
AP88-89    | LV    | 28.8%      | 29.8%       | 29.2%
WSJ87-92   | SK    | 19.4%      | 22.3%       | 21.8%*
WSJ87-92   | LK    | 34.8%      | 35.3%       | 35.8%
WSJ87-92   | SV    | 17.2%      | 19.6%       | 19.9%
WSJ87-92   | LV    | 27.7%      | 28.2%       | 28.8%*
ZIFF1-2    | SK    | 17.9%      | 21.5%       | 20.0%
ZIFF1-2    | LK    | 32.6%      | 32.6%       | 32.2%
ZIFF1-2    | SV    | 15.6%      | 18.5%       | 18.1%
ZIFF1-2    | LV    | 26.7%      | 27.9%       | 27.9%*
Average precision (3 DBs × 4 query types, 150 topics); * indicates significant difference
Completely automatic tuning of parameters IS POSSIBLE!
The Notion of Relevance
• Relevance ≈ Similarity(Rep(q), Rep(d)): different representations & similarity functions
– Vector space model (Salton et al., 75)
– Prob. distr. model (Wong & Yao, 89)
• Relevance ≈ P(r=1|q,d), r ∈ {0,1}: probability of relevance
– Regression model (Fox 83)
– Generative models
– Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
– Query generation: basic LM approach (Ponte & Croft, 98), the first application of LMs to IR; later, LMs are used along the other lines too
• Relevance ≈ P(d→q) or P(q→d): probabilistic inference, with different inference systems
– Prob. concept space model (Wong & Yao, 95)
– Inference network model (Turtle & Croft, 91)
Justification of Query Likelihood [Lafferty & Zhai 01a]
• The general probabilistic retrieval model
– Define P(Q,D|R)
– Compute P(R|Q,D) using Bayes’ rule
– Rank documents by the odds:
O(R=1|Q,D) = [P(Q,D|R=1) / P(Q,D|R=0)] × [P(R=1) / P(R=0)]
(the prior odds P(R=1)/P(R=0) are ignored for ranking D)
• Special cases
– Document “generation”: P(Q,D|R) = P(D|Q,R) P(Q|R)
– Query “generation”: P(Q,D|R) = P(Q|D,R) P(D|R)
Doc generation leads to the classic Robertson-Sparck Jones model; query generation leads to the query likelihood language modeling approach
Query Generation [Lafferty & Zhai 01a]
O(R=1|Q,D) ∝ P(Q,D|R=1) / P(Q,D|R=0)
= [P(Q|D,R=1) P(D|R=1)] / [P(Q|D,R=0) P(D|R=0)]
= [P(Q|D,R=1) / P(Q|R=0)] × [P(D|R=1) / P(D|R=0)]   (assume P(Q|D,R=0) = P(Q|R=0))
The second factor is a document prior; assuming a uniform prior (and dropping the doc-independent P(Q|R=0)), we have
O(R=1|Q,D) ∝ P(Q|D,R=1), the query likelihood p(q|θd)
Computing P(Q|D,R=1) generally involves two steps:
(1) estimate a language model based on D
(2) compute the query likelihood according to the estimated model
P(Q|D) = P(Q|D,R=1)! The probability that a user who likes D would pose query Q: a relevance-based interpretation of the so-called “document language model”
Part 2: The Basic LM Approach (cont.)
1. Introduction
2. The Basic Language Modeling Approach - Query Likelihood Document Ranking
- Smoothing of Language Models
- Why does it work?
- Variants of the basic LM
3. More Advanced Language Models
4. Language Models for Special Retrieval Tasks
5. A General Framework for Applying SLMs to IR
6. Summary
We are here
Variants of the Basic LM Approach
• Different smoothing strategies
– Hidden Markov Models (essentially linear interpolation) [Miller et al. 99]
– Smoothing with an IDF-like reference model [Hiemstra & Kraaij 99]
– Performance tends to be similar to the basic LM approach
– Many other possibilities for smoothing [Chen & Goodman 98]
• Different priors
– Link information as prior leads to significant improvement of Web entry page retrieval performance [Kraaij et al. 02]
– Time as prior [Li & Croft 03]
• Passage retrieval [Liu & Croft 02]
Part 3: More Advanced LMs
1. Introduction
2. The Basic Language Modeling Approach
3. More Advanced Language Models
- Improving the basic LM approach
- Feedback and alternative ways of using LMs
4. Language Models for Special Retrieval Tasks
5. A General Framework for Applying SLMs to IR
6. Summary
We are here
Improving the Basic LM Approach
• Capturing limited dependencies
– Bigrams/trigrams [Song & Croft 99]; grammatical dependency [Nallapati & Allan 02, Srikanth & Srihari 03, Gao et al. 04]
– Generally insignificant improvement compared with other extensions such as feedback
• Full Bayesian query likelihood [Zaragoza et al. 03]
– Performance similar to the basic LM approach
• Translation model for p(Q|D,R) [Berger & Lafferty 99, Jin et al. 02]
– Addresses polysemy and synonyms; improves over the basic LM methods, but computationally expensive
• Cluster-based smoothing/scoring [Liu & Croft 04, Kurland & Lee 04]
– Improves over the basic LM, but computationally expensive
• Parsimonious LMs [Hiemstra et al. 04]
– Use a mixture model to “factor out” non-discriminative words
Translation Models
• Directly model the “translation” relationship between words in the query and words in a doc
• A basic translation model:
p(Q|D,R) = ∏_{i=1}^{m} Σ_{wj∈V} pt(qi|wj) p(wj|D)
where pt(qi|wj) is the translation model and p(wj|D) is the regular doc LM
• When relevance judgments are available, (q,d) pairs serve as training data for the translation model
• Without relevance judgments:
– Synthetic data can be used [Berger & Lafferty 99]
– <title, body> pairs can be used as an approximation [Jin et al. 02]
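The basic translation model above can be sketched as follows. The translation table is hand-made for illustration (it is not a trained model), and `translation_likelihood` is a hypothetical helper:

```python
from collections import Counter

def translation_likelihood(query, doc, p_t):
    """p(Q|D,R) = prod_i sum_j p_t(q_i|w_j) p(w_j|D), with an ML doc LM."""
    counts, dlen = Counter(doc), len(doc)
    p_doc = {w: c / dlen for w, c in counts.items()}
    prob = 1.0
    for q in query:
        prob *= sum(p_t.get((q, w), 0.0) * pw for w, pw in p_doc.items())
    return prob

# "car" in a query can be generated by "automobile" in a document,
# so a doc with no literal match still gets a nonzero likelihood.
p_t = {("car", "automobile"): 0.3, ("car", "car"): 0.9}
doc = ["automobile", "repair", "guide"]
print(translation_likelihood(["car"], doc, p_t))  # 0.3 * (1/3)
```

This is how the translation model addresses synonymy: the inner sum lets any document word contribute to generating a query word.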
Cluster-based Smoothing/Scoring
• Cluster-based smoothing: smooth a document LM with a cluster of similar documents [Liu & Croft 04]:
p(w|D,R) = λ1 p(w|D) + (1−λ1) [λ2 p(w|Cluster(D)) + (1−λ2) p(w|Collection)]
where p(w|D) is the “self” LM; improves over the basic LM method, but insignificantly
• Cluster-based query likelihood: similar to the translation model, but “translate” the whole document to the query through a set of clusters [Kurland & Lee 04]:
p(Q|D,R) = Σ_{C∈Clusters} p(Q|C) p(C|D)
where p(Q|C) is the likelihood of Q given the cluster LM and p(C|D) is how likely doc D belongs to cluster C; only effective when interpolated with the basic LM scores
Part 3: More Advanced LMs (cont.)
1. Introduction
2. The Basic Language Modeling Approach
3. More Advanced Language Models
- Improving the basic LM approach
- Feedback and alternative ways of using LMs
4. Language Models for Special Retrieval Tasks
5. A General Framework for Applying SLMs to IR
6. Summary
We are here
Feedback and Doc/Query Generation
• Classic prob. model (doc generation): O(R=1|Q,D) ∝ P(D|Q,R=1) / P(D|Q,R=0)
– P(D|Q,R=1) is a rel. doc model, P(D|Q,R=0) a nonrel. doc model → doc-based feedback
• Query likelihood (“language model”): O(R=1|Q,D) ∝ P(Q|D,R=1)
– P(Q|D,R=1) is a “rel. query” model → query-based feedback
• Parameter estimation from judged (query, doc, relevance) examples, e.g., (q1,d1,1), (q1,d2,1), (q1,d3,1), (q1,d4,0), (q1,d5,0), (q3,d1,1), (q4,d1,1), (q5,d1,1), (q6,d2,1), (q6,d3,0)
• Initial retrieval:
– query as rel doc vs. doc as rel query
– P(Q|D,R=1) is more accurate
• Feedback:
– P(D|Q,R=1) can be improved for the current query and future docs
– P(Q|D,R=1) can also be improved, but for the current doc and future queries
Difficulty in Feedback with Query Likelihood
• Traditional query expansion [Ponte 98, Miller et al. 99, Ng 99]
– Improvement is reported, but there is a conceptual inconsistency
– What’s an expanded query, a piece of text or a set of terms?
• Avoid expansion
– Query term reweighting [Hiemstra 01, Hiemstra 02]
– Translation models [Berger & Lafferty 99, Jin et al. 02]
– Only achieving limited feedback
• Doing relevant query expansion instead [Nallapati et al 03]
• The difficulty is due to the lack of a query/relevance model
• The difficulty can be overcome with alternative ways of using LMs for retrieval
– Relevance model estimation [Lavrenko & Croft 01]
– Query model estimation [Lafferty & Zhai 01b; Zhai & Lafferty 01b]
Two Alternative Ways of Using LMs

• Classic probabilistic model: doc generation as opposed to query generation

    O(R=1|Q,D) ∝ p(D|Q,R=1) / p(D|Q,R=0)

  – Natural for relevance feedback
  – Challenge: estimate p(D|Q,R=1) without relevance feedback [Lavrenko & Croft 01] (p(D|Q,R=0) can be approximated by p(D))

• Probabilistic distance model: similar to the vector-space model, but with LMs as opposed to TF-IDF weight vectors
  – A popular distance function: Kullback-Leibler (KL) divergence, covering query likelihood as a special case

    score(Q,D) = −D(θ_Q || θ_D), essentially Σ_{w∈V} p(w|θ_Q) log p(w|θ_D)

  – Retrieval now amounts to estimating query and doc models, and feedback is treated as query-LM updating [Lafferty & Zhai 01b; Zhai & Lafferty 01b]

Both methods provide a more principled way for full feedback and are empirically effective.
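The KL-divergence score above reduces, rank-equivalently, to a cross entropy between the query model and the document model. A minimal sketch, assuming both models are given as word→probability dicts and the document model is already smoothed so every query-model word has nonzero mass:

```python
import math

def kl_score(query_model, doc_model):
    """Rank-equivalent KL-divergence score: sum_w p(w|Q) log p(w|D).
    When query_model is the ML estimate of the query itself, this
    reduces to (length-normalized) query likelihood."""
    return sum(pq * math.log(doc_model[w]) for w, pq in query_model.items())
```

Documents whose (smoothed) models concentrate mass on the query-model words score higher, which is exactly what feedback exploits by enriching the query model.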
Relevance Model Estimation [Lavrenko & Croft 01]

• Question: how to estimate p(D|Q,R) (or p(w|Q,R)) without relevant documents?
• Key idea:
  – Treat the query as observations about p(w|Q,R)
  – Approximate the model space with all the document models
• Two methods for decomposing p(w,Q):
  – Independent sampling (Bayesian model averaging):

    p(w|Q,R=1) = ∫ p(w|θ_D) p(θ_D|Q,R=1) dθ_D ∝ Σ_{D∈C} p(w|θ_D) p(θ_D) Π_{j=1}^{m} p(q_j|θ_D)

  – Conditional sampling, p(w,Q) = p(w)p(Q|w):

    p(w|Q,R=1) ∝ p(w) Π_{i=1}^{m} p(q_i|w) = p(w) Π_{i=1}^{m} Σ_{D∈C} p(q_i|D) p(D|w)

    with p(D|w) = p(w|D) p(D) / p(w)   (original formula in [Lavrenko & Croft 01])
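The independent-sampling estimate can be sketched as below: each document model is weighted by its query likelihood, and word probabilities are averaged under those weights. This is an illustrative toy version; the uniform doc prior and the Dirichlet-smoothing constant `mu` are assumptions, not values from [Lavrenko & Croft 01].

```python
from collections import Counter

def relevance_model(query, collection, mu=1.0):
    """Sketch of the independent-sampling relevance model:
    p(w|Q,R) proportional to sum_D p(D) p(w|D) prod_j p(q_j|D),
    with Dirichlet-smoothed document models (mu is illustrative)."""
    bg = Counter()
    for d in collection:
        bg.update(d)
    bg_total = sum(bg.values())

    def p_w_d(w, counts, dlen):
        # Dirichlet-smoothed document model.
        return (counts[w] + mu * bg[w] / bg_total) / (dlen + mu)

    rel = Counter()
    prior = 1.0 / len(collection)          # uniform p(theta_D)
    for d in collection:
        counts, dlen = Counter(d), len(d)
        qlik = 1.0
        for q in query:                    # prod_j p(q_j | theta_D)
            qlik *= p_w_d(q, counts, dlen)
        for w in bg:
            rel[w] += prior * p_w_d(w, counts, dlen) * qlik
    z = sum(rel.values())                  # normalize into a distribution
    return {w: v / z for w, v in rel.items()}
```

Words that co-occur with the query words in high-likelihood documents receive high mass, even if they never appear in the query — the mechanism that makes relevance models act as query expansion.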
Kernel-based Allocation [Lavrenko 04]

• A general generative model for text (an infinite mixture model):

    p(w_1 ... w_n) = ∫ Π_{i=1}^{n} p(w_i|θ) p(θ) dθ

  with a kernel-based density function over models,

    p(θ) = (1/N) Σ_{w ∈ TrainingData} K_w(θ),

  where the kernel K_w(θ) measures similarity(θ, w)

• Choices of the kernel function:
  – Delta kernel:

    p(w_1 ... w_n) = (1/N) Σ_{w ∈ TrainingData} Π_{i=1}^{n} p(w_i|w)

    i.e., the average probability of w_1 ... w_n over all training points
  – Dirichlet kernel: allows a training point to "spread" its influence
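With the delta kernel the density over models collapses onto the training points, so the sequence likelihood is just an average over per-training-point unigram models. A toy sketch under that reading; treating each training document as one "point" and the floor `eps` are illustrative assumptions:

```python
from collections import Counter

def delta_kernel_likelihood(seq, training_docs, eps=1e-3):
    """Delta-kernel allocation (sketch): average the probability of the
    sequence under each training document's (lightly smoothed) unigram
    model. eps is an illustrative floor for unseen words."""
    total = 0.0
    for d in training_docs:
        c, n = Counter(d), len(d)
        p = 1.0
        for w in seq:
            p *= (c[w] + eps) / (n + eps * (len(c) + 1))
        total += p
    return total / len(training_docs)
```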
Query Model Estimation[Lafferty & Zhai 01b, Zhai & Lafferty 01b]
• Question: How to estimate a better query model than the ML estimate based on the original query?
• "Massive feedback" (Markov chain) [Lafferty & Zhai 01b]:

– Improve a query model through co-occurrence patterns learned from a document-term Markov chain that outputs the query

• Model-based feedback (model interpolation) [Zhai & Lafferty 01b]:
– Estimate a feedback topic model based on feedback documents
– Update the query model by interpolating the original query model with the learned feedback model
Feedback as Model Interpolation [Zhai & Lafferty 01b]

• Query Q → query model θ_Q; document D → document model θ_D (generative model); results are scored by the divergence D(θ_Q || θ_D)
• Given feedback docs F = {d_1, d_2, ..., d_n}, estimate a feedback model θ_F (divergence minimization) and interpolate:

    θ_Q' = (1−α) θ_Q + α θ_F

  – α = 0: no feedback, θ_Q' = θ_Q
  – α = 1: full feedback, θ_Q' = θ_F
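The interpolation step itself is a one-liner over the union vocabulary. A minimal sketch, with toy models and an illustrative α:

```python
def interpolate_query_model(query_model, feedback_model, alpha=0.5):
    """Feedback as model interpolation: p(w|Q') =
    (1-alpha) p(w|Q) + alpha p(w|F). alpha=0 ignores feedback,
    alpha=1 replaces the query model with the feedback model."""
    vocab = set(query_model) | set(feedback_model)
    return {w: (1 - alpha) * query_model.get(w, 0.0)
               + alpha * feedback_model.get(w, 0.0)
            for w in vocab}
```

Because both inputs are probability distributions, the output is too, for any α in [0, 1].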
θ_F Estimation Method I: Generative Mixture Model

• Each word in F = {D_1, ..., D_n} is generated either from the topic model p(w|θ) (topic words, with probability 1−λ) or from the background model p(w|C) (background words, with probability λ):

    log p(F|θ) = Σ_{D∈F} Σ_{w} c(w;D) log( (1−λ) p(w|θ) + λ p(w|C) )

• Maximum likelihood: θ_F = argmax_θ log p(F|θ)

The learned topic model is called a "parsimonious language model" in [Hiemstra et al. 04]
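The maximum-likelihood topic model under this mixture can be fitted with EM. A sketch under the usual EM updates for a fixed-background two-component mixture; the toy data, λ, and iteration count are illustrative:

```python
from collections import Counter

def estimate_feedback_model(feedback_docs, collection_model, lam=0.5, iters=30):
    """EM for the two-component generative mixture (sketch): each word
    occurrence in F comes from the topic model with prob (1-lam) or the
    fixed collection model with prob lam; EM fits the topic model."""
    counts = Counter()
    for d in feedback_docs:
        counts.update(d)
    vocab = list(counts)
    theta = {w: 1.0 / len(vocab) for w in vocab}   # uniform init
    for _ in range(iters):
        # E-step: posterior that each occurrence of w came from the topic
        z = {w: (1 - lam) * theta[w] /
                ((1 - lam) * theta[w] + lam * collection_model.get(w, 1e-12))
             for w in vocab}
        # M-step: re-estimate the topic model from fractional counts
        norm = sum(counts[w] * z[w] for w in vocab)
        theta = {w: counts[w] * z[w] / norm for w in vocab}
    return theta
```

Words frequent in the feedback docs but also frequent in the collection (stopwords) are explained by the background component and pushed out of θ_F, which is what makes the learned model "parsimonious".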
θ_F Estimation Method II: Empirical Divergence Minimization

• Find a model θ that is close to the models θ_{D_1}, ..., θ_{D_n} of the feedback docs F = {D_1, ..., D_n} but far from the background model θ_C:

    D_emp(θ, F, C) = (1/|F|) Σ_{i=1}^{n} D(θ || θ_{D_i}) − λ D(θ || θ_C)

• Divergence minimization: θ_F = argmin_θ D_emp(θ, F, C)
Example of Feedback Query Model

TREC topic 412: "airport security" (mixture model approach; θ_F estimated from the top 10 docs of the Web database)

  λ = 0.9                            λ = 0.7
  w               p(w|θ_F)           w               p(w|θ_F)
  security        0.0558             the             0.0405
  airport         0.0546             security        0.0377
  beverage        0.0488             airport         0.0342
  alcohol         0.0474             beverage        0.0305
  bomb            0.0236             alcohol         0.0304
  terrorist       0.0217             to              0.0268
  author          0.0206             of              0.0241
  license         0.0188             and             0.0214
  bond            0.0186             author          0.0156
  counter-terror  0.0173             bomb            0.0150
  terror          0.0142             terrorist       0.0137
  newsnet         0.0129             in              0.0135
  attack          0.0124             license         0.0127
  operation       0.0121             state           0.0127
  headline        0.0121             by              0.0125
Model-based Feedback Improves over Simple LM [Zhai & Lafferty 01b]

Collection  Metric   Simple LM   Mixture     Improv.   Div.Min.    Improv.
AP88-89     AvgPr    0.21        0.296       +41%      0.295       +40%
            InitPr   0.617       0.591       -4%       0.617       +0%
            Recall   3067/4805   3888/4805   +27%      3665/4805   +19%
TREC8       AvgPr    0.256       0.282       +10%      0.269       +5%
            InitPr   0.729       0.707       -3%       0.705       -3%
            Recall   2853/4728   3160/4728   +11%      3129/4728   +10%
WEB         AvgPr    0.281       0.306       +9%       0.312       +11%
            InitPr   0.742       0.732       -1%       0.728       -2%
            Recall   1755/2279   1758/2279   +0%       1798/2279   +2%

Translation models, relevance models, and feedback-based query models have all been shown to improve performance significantly over the simple LMs (parameter tuning is necessary in many cases).
Part 4: LMs for Special Retrieval Tasks
1. Introduction
2. The Basic Language Modeling Approach
3. More Advanced Language Models
4. Language Models for Special Retrieval Tasks

- Cross-lingual IR
- Distributed IR
- Structured document retrieval
- Personalized/context-sensitive search
- Modeling redundancy
- Predicting query difficulty
- Subtopic retrieval
5. A General Framework for Applying SLMs to IR
6. Summary
We are here
Cross-lingual IR
• Use query in language A (e.g., English) to retrieve documents in language B (e.g., Chinese)
• Cross-lingual p(Q|D,R) [Xu et al 01]:

    p(Q|D,R) = Π_{i=1}^{m} [ α p(q_i|REF_E) + (1−α) Σ_{c ∈ V_Chinese} p_trans(q_i|c) p(c|D) ]

  – q_i is an English query word, c a Chinese word, p(q_i|REF_E) an English reference LM, and p_trans(q_i|c) an English-Chinese translation model

• Cross-lingual p(D|Q,R) [Lavrenko et al 02]: estimate a Chinese relevance model from the English query,

    p(c|Q,R=1) = p(c, q_1 ... q_m) / p(q_1 ... q_m)

  with the joint probability decomposed in two ways:
  – Method 1 (estimate with parallel corpora):

    p(c, q_1 ... q_m) = Σ_{(M_E, M_C)} p(M_E, M_C) p(c|M_C) Π_{i=1}^{m} p(q_i|M_E)

  – Method 2 (estimate with a bilingual lexicon or parallel corpora):

    p(c, q_1 ... q_m) = Σ_{M_C} p(M_C) p(c|M_C) Π_{i=1}^{m} Σ_{c' ∈ V_Chinese} p_trans(q_i|c') p(c'|M_C)
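The cross-lingual query likelihood (the first formula above) can be sketched as below. The toy translation table, pinyin tokens, and mixing weight `alpha` are purely illustrative assumptions:

```python
import math
from collections import Counter

def clir_query_likelihood(query_en, doc_zh, trans, p_ref, alpha=0.3):
    """Cross-lingual query likelihood (sketch): each English query word
    is generated either from an English reference model p_ref (prob
    alpha) or by translating a Chinese word drawn from the document
    model (prob 1-alpha). trans[(e, c)] is an illustrative p_trans(e|c)."""
    c_counts, dlen = Counter(doc_zh), len(doc_zh)
    logp = 0.0
    for e in query_en:
        # sum over Chinese words in the document: p_trans(e|c) p(c|D)
        p_translated = sum(trans.get((e, c), 0.0) * n / dlen
                           for c, n in c_counts.items())
        logp += math.log(alpha * p_ref.get(e, 1e-12)
                         + (1 - alpha) * p_translated)
    return logp
```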
Distributed IR
• Retrieve documents from multiple collections
• The task is generally decomposed into two subtasks: Collection selection and result fusion
• Using LMs for collection selection [Xu & Croft 99, Si et al. 02]
– Treat collection selection as retrieving "collections" as opposed to "documents"
– Estimate each collection model by maximum likelihood estimate [Si et al. 02] or clustering [Xu & Croft 99]
• Using LMs for result fusion [ Si et al. 02]
– Assume query likelihood scoring for all collections, but on each collection, a distinct reference LM is used for smoothing
– Adjust the biased score p(Q|D,Collection) to recover the fair score p(Q|D)
Structured Document Retrieval [Ogilvie & Callan 03]

• A document D consists of parts D_1, ..., D_k (e.g., Title, Abstract, Body-Part1, Body-Part2, ...)
• To generate each query word, select a part D_j and generate the word using D_j:

    Q = q_1 q_2 ... q_m

    p(Q|D,R=1) = Π_{i=1}^{m} p(q_i|D,R=1)

    p(q_i|D,R=1) = Σ_{j=1}^{k} s(D_j|D,R=1) p(q_i|D_j,R=1)

  where the "part selection" probability s(D_j|D,R=1) serves as the weight for D_j and can be trained using EM
• Want to combine different parts of a document with appropriate weights
• Anchor text can be treated as a "part" of a document
• Applicable to XML retrieval
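The part-mixture query likelihood can be sketched as below. The part weights are given directly here (the slide notes they can be trained with EM); the Dirichlet constant `mu` and the background model are illustrative smoothing assumptions:

```python
import math
from collections import Counter

def structured_query_likelihood(query, parts, weights, mu=10.0, bg=None):
    """Structured-document query likelihood (sketch):
    p(q|D) = sum_j s(D_j) p(q|D_j), mixing Dirichlet-smoothed per-part
    language models with part-selection weights s(D_j)."""
    bg = bg or Counter(w for p in parts for w in p)
    bg_total = sum(bg.values())
    models = []
    for p in parts:
        c, n = Counter(p), len(p)
        # Bind c, n as defaults so each lambda keeps its own part model.
        models.append(lambda w, c=c, n=n:
                      (c[w] + mu * bg[w] / bg_total) / (n + mu))
    return sum(math.log(sum(s * m(w) for s, m in zip(weights, models)))
               for w in query)
```

Raising the weight of a part (e.g., the title) makes matches in that part count more, which is the intended effect of the part-selection probabilities.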
Personalized/Context-Sensitive Search[Shen et al. 05]
• KL-divergence retrieval model:
  – Task 1: estimating a query model, θ̂_Q = argmax p(θ_Q | Query, User, SearchContext)
  – Task 2: estimating a doc model, θ̂_D = argmax p(θ_D | Doc)
• User information and search-context information can be used to estimate a better query model
• Refinement of this model leads to specific retrieval formulas; simple models often end up interpolating many unigram language models based on different sources of evidence [Shen et al. 05]
Modeling Redundancy
• Given two documents D1 and D2, decide how redundant D1 (or D2) is w.r.t. D2 (or D1)
• Redundancy of D1 ≈ "to what extent can D1 be explained by a model estimated based on D2"
• Use a unigram mixture model [Zhai 02] of the LM for D2 and a reference LM:

    log p(D1 | λ, θ_D2) = Σ_{w∈V} c(w, D1) log[ λ p(w|θ_D2) + (1−λ) p(w|REF) ]

    λ* = argmax_λ log p(D1 | λ, θ_D2)

  The maximum likelihood estimate λ* (computed with the EM algorithm) is the measure of redundancy
• [Zhang et al. 02] explored a more sophisticated (3-component) redundancy model
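Fitting λ* by EM is a one-dimensional fixed-point iteration. A sketch under the standard EM update for a two-component mixture weight; the toy models and iteration count are illustrative:

```python
from collections import Counter

def redundancy(d1, d2_model, ref_model, iters=50):
    """Redundancy of D1 w.r.t. D2 (sketch): fit the mixing weight lam in
    lam*p(w|D2) + (1-lam)*p(w|REF) to D1 by EM; the maximum-likelihood
    lam* measures how much of D1 the D2 model explains."""
    counts = Counter(d1)
    n = sum(counts.values())
    lam = 0.5
    for _ in range(iters):
        # E-step: expected count of occurrences generated by the D2 model
        acc = 0.0
        for w, c in counts.items():
            p2 = d2_model.get(w, 1e-12)
            pr = ref_model.get(w, 1e-12)
            acc += c * lam * p2 / (lam * p2 + (1 - lam) * pr)
        lam = acc / n  # M-step
    return lam
```

λ* near 1 means D1 is almost entirely explained by D2's model (highly redundant); near 0 means D1 is novel relative to D2.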
Predicting Query Difficulty [Cronen-Townsend et al. 02]
• Observations:
  – Discriminative queries tend to be easier
  – Comparing the query model with the collection model indicates how discriminative a query is
• Method:
  – Define "query clarity" as the KL-divergence between an estimated query model (or relevance model) and the collection LM:

    clarity(Q) = Σ_w p(w|θ_Q) log [ p(w|θ_Q) / p(w|Collection) ]

  – An enriched query LM can be estimated by exploiting pseudo feedback (e.g., a relevance model)
• A correlation between clarity scores and retrieval performance was found
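The clarity score is a direct KL divergence between two word distributions. A minimal sketch with toy models:

```python
import math

def clarity(query_model, collection_model):
    """Query clarity (sketch): KL divergence between the estimated query
    model and the collection model. Larger values indicate a more
    discriminative, and empirically easier, query."""
    return sum(pq * math.log(pq / collection_model[w])
               for w, pq in query_model.items() if pq > 0)
```

A query model concentrated on collection-rare words ("laser optics") scores much higher than one spread over common words ("the of"), matching the intuition that discriminative queries are easier.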
Subtopic Retrieval [Zhai 02, Zhai et al 03]
• Subtopic retrieval– Assume existence of subtopics for a query, and aim at retrieving
as many distinct subtopics as possible
– E.g., Retrieve “different applications of robotics”
– Need to go beyond independent relevance
• Two methods explored in [Zhai 02]
– Maximal Marginal Relevance: • Maximizing subtopic coverage indirectly through redundancy elimination
• LMs can be used to model redundancy
– Maximal Diverse Relevance: • Maximizing subtopic coverage directly through subtopic modeling
• Define a retrieval function based on subtopic representation of query and documents
• Mixture LMs can be used to model subtopics (essentially clustering)
Unigram Mixture Models
• Each subtopic is modeled with one unigram LM
• A document is treated as observations from a mixture model involving all subtopic LMs
• Two different sampling strategies to generate a document
– Strategy 1: Document Clustering
• Choose a subtopic model and use the chosen model to generate all the words in a document
• A document is always generated from one single LM
– Strategy 2: Aspect Models [Hofmann 99; Blei et al 02]
• Choose a (potentially) different subtopic model when generating each word in a document
• A document may be generated from multiple LMs
• For subtopic retrieval, we assume a document may have multiple subtopics, so strategy 2 is more appropriate
• Many other applications…
Aspect Models

• Each word w of a document D = d_1 ... d_n is drawn from a mixture of k subtopic LMs p(w|θ_1), ..., p(w|θ_k):

    p(w | θ_1,...,θ_k, π_1,...,π_k) = Σ_{i=1}^{k} π_i p(w|θ_i)

• Probabilistic LSI [Hofmann 99]: each document D has its own set of mixing weights π^D:

    p(D | θ_1,...,θ_k, π^D_1,...,π^D_k) = Π_{i=1}^{n} Σ_{a=1}^{k} π^D_a p(d_i|θ_a)

  – Flexible aspect distribution, but needs regularization

• Latent Dirichlet Allocation [Blei et al 02, Minka & Lafferty 03]: the π's are drawn from a common Dirichlet distribution, so π is now regularized:

    p(D | θ_1,...,θ_k, α_1,...,α_k) = ∫ Π_{i=1}^{n} [ Σ_{a=1}^{k} π_a p(d_i|θ_a) ] Dir(π|α) dπ
Part 5: A General Framework for Applying SLMs to IR
1. Introduction
2. The Basic Language Modeling Approach
3. More Advanced Language Models
4. Language Models for Special Retrieval Tasks
5. A General Framework for Applying SLMs to IR
- Risk minimization framework
- Special cases
6. Summary
We are here
Risk Minimization: Motivation
• Long-standing IR Challenges
– Improve IR theory
• Develop theoretically sound and empirically effective models
• Go beyond the limited traditional notion of relevance (independent, topical relevance)
– Improve IR practice
• Optimize retrieval parameters automatically
• SLMs are very promising tools …
– How can we systematically exploit SLMs in IR?
– Can SLMs offer anything hard/impossible to achieve in traditional IR?
Idea 1: Retrieval as Decision-Making(A more general notion of relevance)
Given a query:
  – Which documents should be selected? (D)
  – How should these docs be presented to the user? (π)
Choose (D, π): a ranked list (1 2 3 4 ...)? An unordered subset? Clustering?
Idea 2: Systematic Language Modeling
Documents → Document Language Models    (DOC MODELING)
Query → Query Language Model            (QUERY MODELING)
User → Loss Function                    (USER MODELING)
Retrieval decision: ?
Generative Model of Document & Query [Lafferty & Zhai 01b]
User U → θ_Q ~ p(θ_Q|U) → Query q ~ p(q|θ_Q, U)         (q observed, θ_Q inferred)
Source S → θ_D ~ p(θ_D|S) → Document d ~ p(d|θ_D, S)    (d observed, θ_D inferred)
Relevance R ~ p(R|θ_Q, θ_D)                             (partially observed)
Applying Bayesian Decision Theory [Lafferty & Zhai 01b, Zhai 02, Zhai & Lafferty 03]
Observed: query q, user U, doc set C, source S; hidden: the models θ. Each possible action is a choice (D_i, π_i) of a document set and a presentation strategy, with loss L(D_i, π_i, θ).

RISK MINIMIZATION: choose the action with minimum Bayes risk:

    (D*, π*) = argmin_{D,π} ∫ L(D, π, θ) p(θ | q, U, C, S) dθ
Special Cases
• Set-based models (choose D): Boolean model
• Ranking models (choose π):
  – Independent loss
    • Relevance-based loss: probabilistic relevance model, Generative Relevance Theory [Lavrenko 04], two-stage LM
    • Distance-based loss: vector-space model, KL-divergence model
  – Dependent loss
    • MMR loss: subtopic retrieval model
    • MDR loss: subtopic retrieval model
Optimal Ranking for Independent Loss
• Decision space = {rankings π}; the user browses sequentially, and the loss is independent across documents
• With s_j the probability that the user examines the document at rank j, the loss of a ranking decomposes:

    π* = argmin_π ∫ L(π, θ) p(θ | q, U, C, S) dθ

    L(π, θ) = Σ_{j=1}^{N} s_j l(d_{π_j}, θ)

    ⇒ π* = argmin_π Σ_{j=1}^{N} s_j ∫ l(d_{π_j}, θ) p(θ | q, U, C, S) dθ
          = argmin_π Σ_{j=1}^{N} s_j r(d_{π_j})

  where r(d_k) = ∫ l(d_k, θ) p(θ | q, U, C, S) dθ is the expected risk of document d_k
• Independent loss ⇒ independent risk ⇒ independent scoring: the optimal π ranks documents in ascending order of r(d_k) — the "risk ranking principle" [Zhai 02]
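With a discrete posterior over candidate models, ranking by expected loss is a few lines. A sketch; the models, posterior, and 0/1 loss below are illustrative stand-ins for the framework's continuous integral:

```python
def rank_by_risk(docs, models, posterior, loss):
    """Risk ranking principle (sketch): with an independent loss, the
    Bayes-optimal ranking orders documents by ascending expected loss
    r(d) = sum_theta loss(d, theta) * p(theta | q, U, C, S)."""
    def risk(d):
        return sum(loss(d, theta) * p for theta, p in zip(models, posterior))
    return sorted(docs, key=risk)
```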
Automatic Parameter Tuning
• Retrieval parameters are needed to
– model different user preferences
– customize a retrieval model according to specific queries and documents
• Retrieval parameters in traditional models
– EXTERNAL to the model, hard to interpret
– Most parameters are introduced heuristically to implement our “intuition”
– As a result, no principles to quantify them, must set through empirical experiments
• Lots of experimentation is needed; so far, parameters have been set through empirical experimentation

• Optimality for new queries is not guaranteed
• Language models make it possible to estimate parameters…
Parameter Setting in Risk Minimization
Query → Query Language Model            (query model parameters: estimated)
Documents → Document Language Models    (doc model parameters: estimated)
User → Loss Function                    (user model parameters: set)
Generative Relevance Hypothesis [Lavrenko 04]
• Generative Relevance Hypothesis: – For a given information need, queries expressing that need and
documents relevant to that need can be viewed as independent random samples from the same underlying generative model
• A special case of risk minimization when document models and query models are in the same space
• Implications for retrieval models: “the same underlying generative model” makes it possible to– Match queries and documents even if they are in different
languages or media
– Estimate/improve a relevant document model based on example queries or vice versa
Risk Minimization: Summary
• Risk minimization is a general probabilistic retrieval framework
– Retrieval as a decision problem (=risk min.)
– Separate/flexible language models for queries and docs
• Advantages
– A unified framework for existing models
– Automatic parameter tuning due to LMs
– Allows for modeling complex retrieval tasks
• Lots of potential for exploring LMs…
• For more information, see [Zhai 02]
Part 6: Summary
1. Introduction
2. The Basic Language Modeling Approach
3. More Advanced Language Models
4. Language Models for Special Retrieval Tasks
5. A General Framework for Applying SLMs to IR
6. Summary

– SLMs vs. traditional methods: Pros & Cons
– What we have achieved so far
– Challenges and future directions
We are here
SLMs vs. Traditional IR

• Pros:
– Statistical foundations (better parameter setting)
– More principled way of handling term weighting
– More powerful for modeling subtopics, passages,..
– Leverage LMs developed in related areas (e.g., speech recognition, machine translation)
– Empirically as effective as well-tuned traditional models with potential for automatic parameter tuning
• Cons:
– Limitations of generative models in general (lack of discrimination)
– Less robust in some cases (e.g., when queries are semi-structured)
– Computationally complex
– Empirically, performance appears to be inferior to well-tuned full-fledged traditional methods (at least, no evidence for beating them)
What We Have Achieved So Far
• Framework and justification for using LMs for IR
• Several effective models are developed – Basic LM with Dirichlet prior smoothing is a reasonable baseline
– Basic LM with informative priors often improves performance
– Translation model handles polysemy & synonyms
– Relevance model incorporates LMs into the classic probabilistic IR model
– KL-divergence model ties feedback with query model estimation
– Mixture models can model redundancy and subtopics
• Completely automatic tuning of parameters is possible
• LMs can be applied to virtually any retrieval task with great potential for modeling complex IR problems
Challenges and Future Directions
• Challenge 1: Establish a robust and effective LM that
– Optimizes retrieval parameters automatically
– Performs as well as or better than well-tuned traditional retrieval methods with pseudo feedback
– Is as efficient as traditional retrieval methods
• Challenge 2: Demonstrate consistent and substantial improvement by going beyond unigram LMs
– Model limited dependency between terms
– Derive more principled weighting methods for phrases
Can LMs completely and convincingly beat traditional methods?
Can we do much better by going beyond unigram LMs?
Challenges and Future Directions (cont.)
• Challenge 3: Develop LMs that can model document structures and subtopics

– Recognize query-specific boundaries of relevant passages
– Passage-based/subtopic-based feedback
– Combine different parts of a document
• Challenge 4: Develop LMs to support personalized search

– Infer and track a user's interests with LMs
– Model search context with LMs
– Incorporate user’s preferences and search context in retrieval
– Customize/organize search results according to user’s interests
How can we exploit user information and search context to improve search?
How can we break the document unit in a principled way?
Challenges and Future Directions (cont.)
• Challenge 5: Generalize LMs to handle relational data
– Develop LMs for semi-structured data (e.g., XML)
– Develop LMs to handle structured queries
– Develop LMs for keyword search in relational databases
• Challenge 6: Develop LMs for retrieval with complex information need, e.g.,
– Subtopic retrieval
– Readability constrained retrieval
– Entity retrieval
How can we exploit LMs to develop models for complex retrieval tasks?
What role can LMs play when combining text with relational data?
References

[Baeza-Yates & Ribeiro-Neto 99] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval, Addison-Wesley, 1999.
[Berger & Lafferty 99] A. Berger and J. Lafferty. Information retrieval as statistical translation. Proceedings of the ACM SIGIR 1999, pages 222-229.
[Blei et al. 02] D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. In T G Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.
[Carbonell and Goldstein 98] J. Carbonell and J. Goldstein, The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of SIGIR'98, pages 335-336.
[Chen & Goodman 98] S. F. Chen and J. T. Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University.
[Cronen-Townsend et al. 02] Steve Cronen-Townsend, Yun Zhou, and W. Bruce Croft. Predicting query performance. In Proceedings of the ACM Conference on Research in Information Retrieval (SIGIR), 2002.
[Croft & Lafferty 03] W. B. Croft and J. Lafferty (ed), Language Modeling and Information Retrieval. Kluwer Academic Publishers. 2003.
[Fox 83] E. Fox. Extending the Boolean and Vector Space Models of Information Retrieval with P-Norm Queries and Multiple Concept Types. PhD thesis, Cornell University. 1983.
[Fuhr 01] N. Fuhr. Language models and uncertain inference in information retrieval. In Proceedings of the Language Modeling and IR workshop, pages 6--11.
[Gao et al. 04] J. Gao, J. Nie, G. Wu, and G. Cao, Dependence language model for information retrieval, In Proceedings of ACM SIGIR 2004.
[Good 53] I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3 and 4):237--264, 1953.
[Greiff & Morgan 03] W. Greiff and W. Morgan, Contributions of Language Modeling to the Theory and Practice of IR, In W. B. Croft and J. Lafferty (eds), Language Modeling for Information Retrieval, Kluwer Academic Pub. 2003.
[Hiemstra & Kraaij 99] D. Hiemstra and W. Kraaij, Twenty-One at TREC-7: Ad-hoc and Cross-language track, In Proceedings of the Seventh Text REtrieval Conference (TREC-7), 1999.
[Hiemstra 01] D. Hiemstra. Using Language Models for Information Retrieval. PhD dissertation, University of Twente, Enschede, The Netherlands, January 2001.
[Hiemstra 02] D. Hiemstra. Term-specific smoothing for the language modeling approach to information retrieval: the importance of a query term. In Proceedings of ACM SIGIR 2002, 35-41
References (cont.)
[Hiemstra et al. 04] D. Hiemstra, S. Robertson, and H. Zaragoza. Parsimonious language models for information retrieval, In Proceedings of ACM SIGIR 2004.
[Hofmann 99] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings on the 22nd annual international ACM-SIGIR 1999, pages 50-57.
[Jelinek 98] F. Jelinek, Statistical Methods for Speech Recognition, Cambridge: MIT Press, 1998.
[Jelinek & Mercer 80] F. Jelinek and R. L. Mercer. Interpolated estimation of markov source parameters from sparse data. In E. S. Gelsema and L. N. Kanal, editors, Pattern Recognition in Practice. 1980. Amsterdam, North-Holland,.
[Jeon et al. 03] J. Jeon, V. Lavrenko and R. Manmatha, Automatic Image Annotation and Retrieval using Cross-media Relevance Models, In Proceedings of ACM SIGIR 2003
[Jin et al. 02] R. Jin, A. Hauptmann, and C. Zhai, Title language models for information retrieval, In Proceedings of ACM SIGIR 2002.
[Kalt 96] T. Kalt. A new probabilistic model of text classification and retrieval. University of Massachusetts Technical report TR98-18, 1996.
[Katz 87] S. M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, volume ASSP-35:400--401.
[Kraaij 04] W. Kraaij. Variations on Language Modeling for Information Retrieval, Ph.D. thesis, University of Twente, 2004,
[Kurland & Lee 04] O. Kurland and L. Lee. Corpus structure, language models, and ad hoc information retrieval. In Proceedings of ACM SIGIR 2004.
[Lafferty and Zhai 01a] J. Lafferty and C. Zhai, Probabilistic IR models based on query and document generation. In Proceedings of the Language Modeling and IR workshop, pages 1--5.
[Lafferty & Zhai 01b] J. Lafferty and C. Zhai. Document language models, query models, and risk minimization for information retrieval. In Proceedings of the ACM SIGIR 2001, pages 111-119.
[Lavrenko & Croft 01] V. Lavrenko and W. B. Croft. Relevance-based language models . In Proceedings of the ACM SIGIR 2001, pages 120-127.
[Lavrenko et al. 02] V. Lavrenko, M. Choquette, and W. Croft. Cross-lingual relevance models. In Proceedings of SIGIR 2002, pages 175-182.
[Lavrenko 04] V. Lavrenko, A generative theory of relevance. Ph.D. thesis, University of Massachusetts. 2004.
References (cont.)
[Li & Croft 03] X. Li, and W.B. Croft, Time-Based Language Models, In Proceedings of CIKM'03, 2003
[Liu & Croft 02] X. Liu and W. B. Croft. Passage retrieval based on language models . In Proceedings of CIKM 2002, pages 15-19.
[Liu & Croft 04] X. Liu and W. B. Croft. Cluster-based retrieval using language models. In Proceedings of ACM SIGIR 2004.
[MacKay & Peto 95] D. MacKay and L. Peto. (1995). A hierarchical Dirichlet language model. Natural Language Engineering, 1(3):289--307.
[Maron & Kuhns 60] M. E. Maron and J. L. Kuhns, On relevance, probabilistic indexing and information retrieval. Journal of the ACM, 7:216--244.
[McCallum & Nigam 98] A. McCallum and K. Nigam (1998). A comparison of event models for Naïve Bayes text classification. In AAAI-1998 Learning for Text Categorization Workshop, pages 41--48.
[Miller et al. 99] D. R. H. Miller, T. Leek, and R. M. Schwartz. A hidden Markov model information retrieval system. In Proceedings of ACM-SIGIR 1999, pages 214-221.
[Minka & Lafferty 03] T. Minka and J. Lafferty, Expectation-propagation for the generative aspect model, In Proceedings of the UAI 2002, pages 352--359.
[Nallapati & Allan 02] R. Nallapati and J. Allan, Capturing term dependencies using a language model based on sentence trees. In Proceedings of CIKM 2002, pages 383-390.
[Nallapati et al 03] R. Nallapati, W. B. Croft, and J. Allan, Relevant query feedback in statistical language modeling, In Proceedings of CIKM 2003.
[Ney et al. 94] H. Ney, U. Essen, and R. Kneser. On Structuring Probabilistic Dependencies in Stochastic Language Modeling. Comput. Speech and Lang., 8(1), 1-28.
[Ng 00] K. Ng. A maximum likelihood ratio information retrieval model. In Voorhees, E. and Harman, D., editors, Proceedings of the Eighth Text REtrieval Conference (TREC-8), pages 483-492. 2000.
[Ogilvie & Callan 03] P. Ogilvie and J. Callan Combining Document Representations for Known Item Search. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2003), pp. 143-150
References (cont.)
[Ponte & Croft 98]] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In Proceedings of ACM-SIGIR 1998, pages 275-281.
[Ponte 98] J. M. Ponte. A language modeling approach to information retrieval. PhD dissertation, University of Massachusetts, Amherst, MA, September 1998.
[Ponte 01] J. Ponte. Is information retrieval anything more than smoothing? In Proceedings of the Workshop on Language Modeling and Information Retrieval, pages 37-41, 2001.
[Robertson & Sparck Jones 76] S. Robertson and K. Sparck Jones. (1976). Relevance Weighting of Search Terms. JASIS, 27, 129-146.
[Robertson 77] S. E. Robertson. The probability ranking principle in IR. Journal of Documentation, 33:294-304, 1977.
[Rosenfeld 00] R. Rosenfeld, Two decades of statistical language modeling: where do we go from here? In Proceedings of the IEEE, volume 88.
[Salton et al. 75] G. Salton, A. Wong and C. S. Yang, A vector space model for automatic indexing. Communications of the ACM, 18(11):613--620.
[Shannon 48] Shannon, C. E. (1948). A mathematical theory of communication. Bell System Tech. J. 27, 379-423, 623-656.
[Shen et al. 05] X. Shen, B. Tan, and C. Zhai. Context-sensitive information retrieval with implicit feedback. In Proceedings of ACM SIGIR 2005.
[Si et al. 02] L. Si, R. Jin, J. Callan, and P. Ogilvie. A Language Model Framework for Resource Selection and Results Merging. In Proceedings of the 11th International Conference on Information and Knowledge Management (CIKM). 2002
[Singhal 01] A. Singhal, Modern Information Retrieval: A Brief Overview. In IEEE Data Engineering Bulletin 24(4), pages 35-43, 2001.
[Song & Croft 99] F. Song and W. B. Croft. A general language model for information retrieval. In Proceedings of Eighth International Conference on Information and Knowledge Management (CIKM 1999)
References (cont.)
[Sparck Jones et al. 00] K. Sparck Jones, S. Walker, and S. E. Robertson, A probabilistic model of information retrieval: development and comparative experiments - part 1 and part 2. Information Processing and Management, 36(6):779--808 and 809--840.
[Sparck Jones et al. 03] K. Sparck Jones, S. Robertson, D. Hiemstra, H. Zaragoza, Language Modeling and Relevance, In W. B. Croft and J. Lafferty (eds), Language Modeling for Information Retrieval, Kluwer Academic Pub. 2003.
[Srikanth & Srihari 03] M. Srikanth, R. K. Srihari. Exploiting Syntactic Structure of Queries in a Language Modeling Approach to IR. in Proceedings of Conference on Information and Knowledge Management(CIKM'03).
[Turtle & Croft 91]H. Turtle and W. B. Croft, Evaluation of an inference network-based retrieval model. ACM Transactions on Information Systems, 9(3):187--222.
[van Rijsbergen 86] C. J. van Rijsbergen. A non-classical logic for information retrieval. The Computer Journal, 29(6).
[Witten et al. 99] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes - Compressing and Indexing Documents and Images. Academic Press, San Diego, 2nd edition, 1999.
[Wong & Yao 89] S. K. M. Wong and Y. Y. Yao, A probability distribution model for information retrieval. Information Processing and Management, 25(1):39--53.
[Wong & Yao 95] S. K. M. Wong and Y. Y. Yao. On modeling information retrieval with probabilistic inference. ACM Transactions on Information Systems, 13(1):69--99.
[Kraaij et al. 02] W. Kraaij, T. Westerveld, and D. Hiemstra. The Importance of Prior Probabilities for Entry Page Search. Proceedings of SIGIR 2002, pp. 27-34
[Xu & Croft 99] J. Xu and W. B. Croft. Cluster-based language models for distributed retrieval. In Proceedings of the ACM SIGIR 1999, pages 15-19,
[Xu et al. 01] J. Xu, R. Weischedel, and C. Nguyen. Evaluating a probabilistic model for cross-lingual information retrieval. In Proceedings of the ACM-SIGIR 2001, pages 105-110.
[Zaragoza et al. 03] Hugo Zaragoza, D. Hiemstra and M. Tipping, Bayesian extension to the language model for ad hoc information retrieval. In Proceedings of SIGIR 2003: 4-9
References (cont.)
[Zhai & Lafferty 01a] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the ACM-SIGIR 2001, pages 334-342.
[Zhai & Lafferty 01b] C. Zhai and J. Lafferty. Model-based feedback in the language modeling approach to information retrieval, In Proceedings of the Tenth International Conference on Information and Knowledge Management (CIKM 2001).
[Zhai & Lafferty 02] C. Zhai and J. Lafferty. Two-stage language models for information retrieval . In Proceedings of the ACM-SIGIR 2002, pages 49-56.
[Zhai & Lafferty 03] C. Zhai and J. Lafferty, A risk minimization framework for information retrieval, In Proceedings of the ACM SIGIR 2003 Workshop on Mathematical/Formal Methods in IR.
[Zhai et al. 03] C. Zhai, W. Cohen, and J. Lafferty, Beyond Independent Relevance: Methods and Evaluation Metrics for Subtopic Retrieval, In Proceedings of ACM SIGIR 2003.
[Zhai 02] C. Zhai, Language Modeling and Risk Minimization in Text Retrieval, Ph.D. thesis, Carnegie Mellon University, 2002.
[Zhang et al. 02] Y. Zhang, J. Callan, and T. Minka, Novelty and redundancy detection in adaptive filtering. In Proceedings of SIGIR 2002, pages 81-88