Information Retrieval Models
Peng Bo, Oct 30, 2010
Review of the Last Lecture
Basic Index Techniques: inverted index, dictionary & postings
Scoring and Ranking: term weighting, tf·idf, Vector Space Model, cosine similarity
IR evaluation: precision, recall, F, interpolation, MAP, interpolated AP
Outline of This Lecture
Information Retrieval Models: Vector Space Model (VSM), Latent Semantic Indexing (LSI), Language Model (LM)
Relevance Feedback, Query Expansion
Vector Space Model
Documents as vectors
Each document j can be viewed as a vector: each term is a dimension, and the value is the log-scaled tf.idf weight.
So we have a vector space: terms are the axes, and docs live in this space. It is high-dimensional: even with stemming, there may be 20,000+ dimensions.
Term      D1    D2    D3    D4    D5    D6   …
中国       4.1   0.0   3.7   5.9   3.1   0.0
文化       4.5   4.5   0.0   0.0  11.6   0.0
日本       0.0   3.5   2.9   0.0   2.1   3.9
留学生     0.0   3.1   5.1  12.8   0.0   0.0
教育       2.9   0.0   0.0   2.2   0.0   0.0
北京       7.1   0.0   0.0   0.0   4.4   3.8
…
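To make the weighting concrete, here is a minimal numpy sketch that builds such a log-scaled tf·idf matrix. The toy corpus is a hypothetical stand-in, not the data behind the table above:

```python
import numpy as np

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["中国", "文化", "教育", "北京", "中国"],
    ["日本", "文化", "留学生"],
    ["中国", "日本", "留学生", "北京"],
]
vocab = sorted({t for d in docs for t in d})
N = len(docs)

# tf[i, j] = raw count of term i in document j
tf = np.array([[d.count(t) for d in docs] for t in vocab], dtype=float)

# df[i] = number of documents containing term i
df = (tf > 0).sum(axis=1)

# Log-scaled tf (1 + log tf where tf > 0), times idf = log(N / df)
wf = np.zeros_like(tf)
wf[tf > 0] = 1.0 + np.log(tf[tf > 0])
W = wf * np.log(N / df)[:, None]        # the term-document weight matrix

for term, row in zip(vocab, W):
    print(term, np.round(row, 2))
```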
Intuition
Postulate: documents that are "close together" in the vector space talk about the same things.
[Figure: documents d1-d5 as vectors in a three-term space with axes t1, t2, t3; the angles θ and φ between vectors measure their closeness.]
Use cases: Query-by-example; a free-text query treated as a vector.
Cosine similarity
The "closeness" of two vectors d1 and d2 can be measured by the angle between them; concretely, the cosine of the angle is used as the vector similarity.
Vectors are normalized by their length:
$$|\vec d_j| = \sqrt{\sum_{i=1}^{M} w_{i,j}^2}$$
$$\mathrm{sim}(d_j, d_k) = \frac{\vec d_j \cdot \vec d_k}{|\vec d_j|\,|\vec d_k|} = \frac{\sum_{i=1}^{M} w_{i,j}\,w_{i,k}}{\sqrt{\sum_{i=1}^{M} w_{i,j}^2}\;\sqrt{\sum_{i=1}^{M} w_{i,k}^2}}$$
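A direct transcription of this formula into code (a sketch; a real system would compute these scores from a sparse inverted index rather than dense vectors):

```python
import numpy as np

def cosine_sim(dj: np.ndarray, dk: np.ndarray) -> float:
    """sim(dj, dk) = (dj . dk) / (|dj| |dk|): the cosine of the angle between them."""
    denom = np.linalg.norm(dj) * np.linalg.norm(dk)
    return float(dj @ dk / denom) if denom else 0.0

# Columns D1 and D5 of the term-document table above:
d1 = np.array([4.1, 4.5, 0.0, 0.0, 2.9, 7.1])
d5 = np.array([3.1, 11.6, 2.1, 0.0, 0.0, 4.4])
print(cosine_sim(d1, d5))   # documents sharing heavily weighted terms score higher
```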
#1. Cosine Similarity
Compute the similarity between the query "digital cameras" and the document "digital cameras and video cameras".
Assume N = 10,000,000; both the query and the document use logarithmic term weighting (wf columns); the query additionally uses idf weighting, and the document uses cosine normalization. "and" is treated as a stop word.
#2. Evaluation
Define the precision-recall graph as follows: for a ranked result list, compute a precision/recall point at each returned document; the graph consists of these points.
On this graph, define the breakeven point as the point where precision and recall are equal.
Question: can a graph have more than one breakeven point? If so, give an example; if not, prove it.
Latent Semantic Model
Vector Space Model: Pros
Automatic selection of index terms
Partial matching of queries and documents (dealing with the case where no document contains all search terms)
Ranking according to similarity score (dealing with large result sets)
Term weighting schemes (improve retrieval performance)
Various extensions: document clustering, relevance feedback (modifying the query vector)
Geometric foundation
Problems with Lexical Semantics
Polysemy: words often have a multitude of meanings and different uses. The Vector Space Model cannot distinguish between different senses of the same word, i.e. ambiguity.
Synonymy: different terms may have an identical or similar meaning. The Vector Space Model cannot represent associations between words.
Issues in the VSM
Independence is assumed between terms, yet some terms are more likely to occur together (synonyms, related words, spelling variants, etc.), and terms may have different meanings depending on context.
The term-document matrix is very high-dimensional: does every document / every term really have that many important features?
Singular Value Decomposition
Apply Singular Value Decomposition to the term-document matrix. Here r is the rank of the matrix, Σ is a diagonal matrix of singular values (in decreasing order), and T and D have orthonormal columns (TᵀT = I, DᵀD = I).
$$W_{t\times d} = T_{t\times r}\,\Sigma_{r\times r}\,D^T_{r\times d}$$
The diagonal of Σ² holds the eigenvalues of WWᵀ; the columns of T and D are the eigenvectors of WWᵀ and WᵀW, respectively.
Singular Values
Σ gives an ordering to the dimensions. The values drop off very quickly, and the trailing singular values represent "noise".
Cutting off the low-value dimensions reduces this noise and improves retrieval performance.
Low-rank Approximation
$$W_{t\times d} = T_{t\times r}\,\Sigma_{r\times r}\,D^T_{r\times d} \;\approx\; W'_{t\times d} = T_{t\times k}\,\Sigma_{k\times k}\,D^T_{k\times d}$$
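A minimal numpy sketch of this truncation; the matrix here is a random stand-in for a real term-document matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((6, 5))                  # stand-in t x d term-document matrix

# W = T diag(sigma) Dt, with singular values in decreasing order
T, sigma, Dt = np.linalg.svd(W, full_matrices=False)

k = 2                                   # keep only the k largest singular values
W_k = T[:, :k] @ np.diag(sigma[:k]) @ Dt[:k, :]

print(np.linalg.matrix_rank(W_k))       # 2
print(np.linalg.norm(W - W_k, "fro"))   # error from the dropped dimensions
print(np.sqrt((sigma[k:] ** 2).sum()))  # the same value (Eckart-Young)
```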
Latent Semantic Indexing (LSI)
Perform a low-rank approximation of the term-document matrix (typical rank: 100-300).
General idea: map documents (and terms) to a low-dimensional representation; design the mapping so that the low-dimensional space reflects semantic associations (the latent semantic space); compute document similarity based on the inner product in this latent semantic space.
What it is
From the original term-document matrix A_r we compute its approximation A_k.
In A_k, each row still corresponds to a term and each column to a document. The difference is that documents now live in a new space of dimension k << r.
How do we compare two terms? How do we compare two documents? How do we compare a term and a document?
$$A_k^T A_k = D\Sigma T^T T\Sigma D^T = (\Sigma D^T)^T(\Sigma D^T)$$
(inner products of the columns of ΣDᵀ compare documents)
$$A_k A_k^T = T\Sigma D^T D\Sigma T^T = (T\Sigma)(T\Sigma)^T$$
(inner products of the rows of TΣ compare terms), and the single entry A_k[i,j] compares term i with document j.
LSI Term matrix T
The T matrix holds each term's vector in the LSI space. In the original matrix, term vectors are d-dimensional; in T they are much smaller.
The dimensions correspond to groups of terms that tend to "co-occur" in the same documents as this term: synonyms, contextually related words, variant endings.
(TΣ) is used to compute term similarity.
Document matrix D
The D matrix is the representation of documents in the LSI space; its vectors have the same dimensionality as the T vectors. (DΣ) is used to compute document similarity, and can also be used to compute the similarity between a query and a document.
Retrieval with LSI
The LSI retrieval process: the query is mapped (projected) into the LSI document space, which is called "folding in". Since W = TΣDᵀ, if q projects to q' in that space, then q = TΣq'ᵀ, and hence
$$q' = (\Sigma^{-1} T^T q)^T = q^T T \Sigma^{-1}$$
Folding in thus amounts to multiplying the document/query vector by TΣ⁻¹.
The collection's document vectors are given by DΣ, and the query and documents are compared by their dot product.
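A sketch of this fold-in step in the slide's notation (W = TΣDᵀ); the matrix and query here are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.random((8, 6))                      # t x d term-document matrix (stand-in)

k = 2
T, sigma, Dt = np.linalg.svd(W, full_matrices=False)
T_k = T[:, :k]                              # terms in LSI space
S_k = np.diag(sigma[:k])
D_k = Dt[:k, :].T                           # rows = documents in LSI space

# Fold a query (a vector in term space) into the LSI space: q' = q^T T Sigma^-1
q = np.zeros(8)
q[[0, 3]] = 1.0                             # hypothetical query containing terms 0 and 3
q_lsi = q @ T_k @ np.linalg.inv(S_k)

# Rank documents by dot product with the folded-in query
scores = D_k @ q_lsi
print(np.argsort(-scores))                  # document ids, best first
```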
Improved Retrieval with LSI
Where does the improvement come from? Noise is removed; there is no need to stem terms (variants will co-occur); no stop list is needed. There is no improvement in speed or space, though…
Example
[Worked example: a term-document matrix C is factored as C = T_r Σ_r D_rᵀ; truncating to Σ_2 and D_2ᵀ maps the documents into a 2-dimensional space.]
Latent Semantic Analysis
[Figure: latent semantic space, an illustrating example (courtesy of Susan Dumais)]
Empirical evidence
Experiments on TREC 1/2/3 (Dumais): precision at or above the median TREC precision; top scorer on almost 20% of TREC topics; slightly better on average than straight vector spaces.
Effect of dimensionality:

Dimensions  Precision
250         0.367
300         0.371
346         0.374
LSI has many other applications
In many settings we have a feature-object matrix that is high-dimensional and highly redundant, which is exactly what makes low-rank approximation work.
In text retrieval, the terms are the features and the docs are the objects: Latent Semantic Indexing.
With opinions and users, the data is incomplete (e.g., users' opinions), and the missing values can be recovered in the low-dimensional space.
A powerful general analytical technique.
Language Models
IR based on Language Model (LM)
[Figure: an information need gives rise to a query; each document d1, d2, …, dn in the collection induces a language model M_d1, M_d2, …, M_dn that may generate the query.]
The usual search approach: guess the words the author would have used when writing a relevant document, and form the query from them. The LM approach directly exploits that idea: documents are ranked by the generation probability
$$P(Q \mid M_d)$$
Formal Language (Model)
A traditional generative model generates strings: finite state machines or regular grammars, etc.
Example: the grammar (I wish)* generates
I wish
I wish I wish
I wish I wish I wish
…
Stochastic Language Models
Models probability of generating strings in the language (commonly all strings over alphabet ∑)
Model M:

the    0.2
a      0.1
man    0.01
woman  0.01
said   0.03
likes  0.02
…

s = "the man likes the woman"
Multiply the unigram probabilities:
P(s | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008
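The "multiply" step is a one-liner in code; a sketch using the model M from the slide (only the listed words; everything else gets probability 0):

```python
M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}

def string_prob(s: str, model: dict) -> float:
    """P(s | M): the product of the unigram probabilities (0 for unseen words)."""
    p = 1.0
    for w in s.split():
        p *= model.get(w, 0.0)
    return p

print(string_prob("the man likes the woman", M))   # 0.2*0.01*0.02*0.2*0.01 = 8e-08
```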
Stochastic Language Models
Model the probability of generating any string.

Model M1:            Model M2:
the      0.2         the      0.2
class    0.01        class    0.0001
sayst    0.0001      sayst    0.03
pleaseth 0.0001      pleaseth 0.02
yon      0.0001      yon      0.1
maiden   0.0005      maiden   0.01
woman    0.01        woman    0.0001

s = "the class pleaseth yon maiden"
P(s | M1) = 0.2 × 0.01 × 0.0001 × 0.0001 × 0.0005
P(s | M2) = 0.2 × 0.0001 × 0.02 × 0.1 × 0.01
P(s | M2) > P(s | M1)
Stochastic Language Models
A statistical model for generating text: a probability distribution over strings in a given language. For a string of words w1 w2 w3 w4:
$$P(w_1 w_2 w_3 w_4 \mid M) = P(w_1 \mid M)\,P(w_2 \mid M, w_1)\,P(w_3 \mid M, w_1 w_2)\,P(w_4 \mid M, w_1 w_2 w_3)$$
Unigram and higher-order models
Unigram Language Models:
$$P(w_1 w_2 w_3 w_4) = P(w_1)\,P(w_2)\,P(w_3)\,P(w_4)$$
Easy. Effective!
Bigram (generally, n-gram) Language Models:
$$P(w_1 w_2 w_3 w_4) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_2)\,P(w_4 \mid w_3)$$
Other Language Models: grammar-based models (PCFGs), etc. Probably not the first thing to try in IR.
The fundamental problem of LMs
The model M is unknown; we only have sample text representative of that model.
So: estimate the model from the sample text, then compute the probability of the observed text under it:
$$P(s \mid M(\text{sample text}))$$
Using Language Models in IR
Each document corresponds to a model. Rank documents by P(d | q), where P(d | q) = P(q | d) × P(d) / P(q).
P(q) is the same for all documents, so it can be ignored. P(d) [the prior] is often treated as the same for all d, but we could use criteria like authority, length, genre. P(q | d) is the probability of q given d's model.
Very general formal approach
Language Models for IR
Language modeling approaches model the query generation process: documents are ranked by the probability that the query would be observed as a random sample from the respective document model (a multinomial approach).
Retrieval based on probabilistic LM
Treat query generation as a random process. The approach: infer a language model for each document; estimate the probability that each document model generated this query; rank the documents by this probability. A unigram model is usually used.
Query generation probability (1)
Ranking formula:
$$p(Q, d) = p(d)\,p(Q \mid d) \approx p(d)\,p(Q \mid M_d)$$
Using the maximum likelihood estimate:
$$\hat p(Q \mid M_d) = \prod_{t \in Q} \hat p_{ml}(t \mid M_d) = \prod_{t \in Q} \frac{tf_{(t,d)}}{dl_d}$$
where M_d is the language model of document d, tf_{(t,d)} is the raw tf of term t in document d, and dl_d is the total number of tokens in document d.
Unigram assumption: given a particular language model, the query terms occur independently.
Insufficient data
Zero probability: what happens when a document contains none of the occurrences of some query term, i.e. $p(t \mid M_d) = 0$?
General approach: a term that does not occur in a document is assigned its probability of occurring in the whole collection. If $tf_{(t,d)} = 0$:
$$p(t \mid M_d) = \frac{cf_t}{cs}$$
where cf_t is the raw count of term t in the collection and cs is the raw collection size (the total number of tokens in the collection).
Insufficient data
Zero probabilities spell disaster, so smooth the probabilities:
Discount nonzero probabilities and give some probability mass to unseen things. There are many approaches, e.g. adding 1, ½ or a small ε to counts, Dirichlet priors, discounting, and interpolation [see FSNLP ch. 6 if you want more].
Alternatively, use a mixture model: a mixture between the document multinomial and the collection multinomial distribution.
Mixture model
P(w | d) = λ·Pmle(w | Md) + (1 - λ)·Pmle(w | Mc)
The parameter λ is very important: a high λ makes the search "conjunctive-like", which suits short queries; a low λ is more suitable for long queries.
Tune λ to optimize performance, for example by making it dependent on document length (cf. Dirichlet prior or Witten-Bell smoothing).
Basic mixture model summary
General formulation of the LM for IR
$$p(Q, d) = p(d) \prod_{t \in Q} \big((1 - \lambda)\,p(t) + \lambda\,p(t \mid M_d)\big)$$
where p(t) is the general language model and p(t | M_d) the individual-document model.
Example
Document collection (2 documents):
d1: Xerox reports a profit but revenue is down
d2: Lucent narrows quarter loss but revenue decreases further
Model: MLE unigram from documents; λ = ½. Query: revenue down
P(Q | d1) = [(1/8 + 2/16)/2] × [(1/8 + 1/16)/2] = 1/8 × 3/32 = 3/256
P(Q | d2) = [(1/8 + 2/16)/2] × [(0 + 1/16)/2] = 1/8 × 1/32 = 1/256
Ranking: d1 > d2
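The same computation in code: a minimal sketch of the mixture model that reproduces the numbers above.

```python
# P(w|d) = lam * Pmle(w|Md) + (1 - lam) * Pmle(w|Mc), with lam = 1/2
d1 = "xerox reports a profit but revenue is down".split()
d2 = "lucent narrows quarter loss but revenue decreases further".split()
collection = d1 + d2
lam = 0.5

def p_mix(w: str, doc: list) -> float:
    p_doc = doc.count(w) / len(doc)                  # MLE from the document
    p_col = collection.count(w) / len(collection)    # MLE from the collection
    return lam * p_doc + (1 - lam) * p_col

def query_likelihood(query: str, doc: list) -> float:
    p = 1.0
    for w in query.split():
        p *= p_mix(w, doc)
    return p

print(query_likelihood("revenue down", d1))   # 3/256 = 0.01171875
print(query_likelihood("revenue down", d2))   # 1/256 = 0.00390625
```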
Alternative Models of Text Generation
[Diagram: a Searcher generates a query model, P(Query Model | Searcher), from which a Query is drawn, P(Query | Query Model); a Writer generates a doc model, P(Doc Model | Writer), from which a Doc is drawn, P(Doc | Doc Model).]
Is this the same model?
Retrieval Using Language Models
[Diagram: a Query is described by a query model P(w | Query), and a Doc by a doc model P(w | Doc); the two sides can be matched in three ways.]
(1) Query likelihood, (2) document likelihood, (3) model comparison.
Query Likelihood
Rank by P(Q | Dm). The main problem is estimating the document model, i.e. smoothing techniques instead of tf.idf weights.
Retrieval performance is good, e.g. at UMass, BBN, Twente, CMU.
Problem: relevance feedback, query expansion and structured queries are hard to handle.
Document Likelihood
Rank by P(D | R) / P(D | NR), where P(w | R) is estimated by P(w | Qm) (Qm is the query or relevance model) and P(w | NR) is estimated by collection probabilities P(w).
The problem is estimating the relevance model: treat the query as generated by a mixture of topic and background, and estimate the relevance model from related documents (query expansion).
Relevance feedback is easily incorporated. Good retrieval results, e.g. UMass at SIGIR 01, but inconsistent with heterogeneous document collections.
Model Comparison
Estimate both a query model and a document model, and compare the two models using KL divergence:
$$D(Q_m \parallel D_m) = \sum_{x \in X} Q_m(x) \log \frac{Q_m(x)}{D_m(x)}$$
This achieves better results than the two previous approaches.
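A sketch of model comparison via KL divergence; the query and document models below are hypothetical dicts of smoothed probabilities:

```python
import math

def kl_divergence(qm: dict, dm: dict) -> float:
    """D(Qm || Dm) = sum_x Qm(x) log(Qm(x)/Dm(x)); Dm must be smoothed (nonzero)."""
    return sum(p * math.log(p / dm[x]) for x, p in qm.items() if p > 0)

qm = {"revenue": 0.5, "down": 0.5}            # query model (MLE from a 2-term query)
doc_models = {                                # smoothed document models (query terms only)
    "d1": {"revenue": 0.125, "down": 0.09375},
    "d2": {"revenue": 0.125, "down": 0.03125},
}

# Smaller divergence means a better match, so rank ascending
print(sorted(doc_models, key=lambda d: kl_divergence(qm, doc_models[d])))   # ['d1', 'd2']
```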
Language models: pro & con
A novel way of looking at the problem of text retrieval, based on probabilistic language modeling: conceptually simple and explanatory; a formal mathematical model; natural use of collection statistics rather than heuristics (almost…).
LMs provide effective retrieval, and can be improved to the extent that the following conditions are met: our language models are accurate representations of the data, and users have some sense of term distribution.
Comparison With Vector Space
There are clear connections to traditional tf.idf models: (unscaled) term frequency appears directly in the model; the probabilities do length normalization of term frequencies; and the effect of mixing with overall collection frequencies is a little like idf: terms rare in the general collection but common in some documents will have a greater influence on the ranking.
Comparison With Vector Space
Similarities: term weights based on frequency; terms often used as if they were independent; inverse document/collection frequency used; some form of length normalization used.
Differences: based on probability rather than similarity; intuitions are probabilistic rather than geometric; details of the use of document length and of term, document, and collection frequency differ.
Summary of This Lecture
Latent Semantic Indexing: singular value decomposition, matrix low-rank approximation.
Language Model: generative model, smoothed probabilities, mixture model:
$$p(Q, d) = p(d) \prod_{t \in Q} \big((1 - \lambda)\,p(t) + \lambda\,p(t \mid M_d)\big)$$
Resources
The Template Numerical Toolkit (TNT): http://math.nist.gov/tnt/documentation.html
The Lemur Toolkit for Language Modeling and Information Retrieval: http://www-2.cs.cmu.edu/~lemur/ (CMU/UMass LM and IR system in C(++), currently actively developed)
Thank You!
Q&A
Reading Material
[1] IIR Ch. 12, Ch. 18
[2] A. Moffat, J. Zobel, and D. Hawking, "Recommended reading for IR research students," SIGIR Forum, vol. 39, pp. 3-14, 2005.
#2 Evaluation
Question (a): there cannot be two or more breakeven points.
Proof: consider a retrieval run I with relevant set R. Suppose the current point is a breakeven point, with retrieved set A and retrieved-relevant set Ra. Then precision = |Ra|/|A| and recall = |Ra|/|R|; by the definition of the breakeven point, precision = recall, and hence |R| = |A|. Suppose that after retrieving k (k > 0) more documents another breakeven point occurs; then precision = |R'a|/|A'| and recall = |R'a|/|R|, which gives |A'| = |R|. But |A'| = |A| + k with k > 0 and |A| = |R|, a contradiction. So there cannot be two or more breakeven points.
Note: when no relevant document has been retrieved yet, precision and recall are both zero. Is that a breakeven point? If such points are counted, then there can be two or more breakeven points.
Matrix Low-rank Approximation for LSI
Eigenvalues & Eigenvectors
Eigenvectors (for a square m×m matrix S):
$$S v = \lambda v$$
with eigenvalue λ and (right) eigenvector v.
How many eigenvalues are there at most? The equation $(S - \lambda I)v = 0$ only has a non-zero solution if $|S - \lambda I| = 0$; this is an m-th order equation in λ, which can have at most m distinct solutions (the roots of the characteristic polynomial). They can be complex even though S is real.
Example: matrix-vector multiplication
$$S = \begin{pmatrix} 3 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 0 \end{pmatrix}$$
S has eigenvalues 3, 2, 0 with corresponding eigenvectors
$$v_1 = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix},\quad v_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix},\quad v_3 = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}$$
Any vector, say $x = (2, 4, 6)^T$, can be viewed as a combination of the eigenvectors: $x = 2v_1 + 4v_2 + 6v_3$.
Matrix vector multiplication
Thus a matrix-vector multiplication such as Sx (S, x as in the previous slide) can be rewritten in terms of the eigenvalues/vectors:
Even though x is an arbitrary vector, the action of S on x is determined by the eigenvalues/vectors.
Suggestion: the effect of “small” eigenvalues is small.
$$Sx = S(2v_1 + 4v_2 + 6v_3) = 2Sv_1 + 4Sv_2 + 6Sv_3 = 2\lambda_1 v_1 + 4\lambda_2 v_2 + 6\lambda_3 v_3 = 6v_1 + 8v_2 + 0v_3$$
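A quick numpy check of this eigenvalue view of the product:

```python
import numpy as np

S = np.diag([3.0, 2.0, 0.0])
x = np.array([2.0, 4.0, 6.0])

lam, V = np.linalg.eig(S)          # eigenvalues, eigenvectors (as columns of V)

coeffs = np.linalg.solve(V, x)     # expand x in the eigenbasis: x = V @ coeffs
print(V @ (lam * coeffs))          # [6. 8. 0.] -- scale each component by its eigenvalue
print(S @ x)                       # the same result, computed directly
```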
Eigenvalues & Eigenvectors
For symmetric matrices, eigenvectors for distinct eigenvalues are orthogonal:
$$S v_{\{1,2\}} = \lambda_{\{1,2\}} v_{\{1,2\}} \text{ and } \lambda_1 \neq \lambda_2 \;\Rightarrow\; v_1 \cdot v_2 = 0$$
All eigenvalues of a real symmetric matrix are real:
$$S = S^T \text{ and } |S - \lambda I| = 0 \;\Rightarrow\; \lambda \in \mathbb{R}$$
Example
Let
$$S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$$
(real, symmetric). Then
$$|S - \lambda I| = \begin{vmatrix} 2-\lambda & 1 \\ 1 & 2-\lambda \end{vmatrix} = (2-\lambda)^2 - 1 = 0$$
The eigenvalues are 1 and 3 (nonnegative, real). Plug these values in and solve for the eigenvectors. The eigenvectors are orthogonal (and real):
$$\begin{pmatrix} 1 \\ -1 \end{pmatrix} \quad\text{and}\quad \begin{pmatrix} 1 \\ 1 \end{pmatrix}$$
Eigen/diagonal Decomposition
Let S be a square matrix with m linearly independent eigenvectors (a "diagonalizable" matrix).
Theorem: there exists an eigen decomposition
$$S = U \Lambda U^{-1}$$
where Λ is diagonal. The columns of U are the eigenvectors of S, and the diagonal elements of Λ are the eigenvalues of S.

Diagonal decomposition: why/how
Let U have the eigenvectors as columns: $U = (v_1 \cdots v_n)$. Then SU can be written
$$SU = S(v_1 \cdots v_n) = (\lambda_1 v_1 \cdots \lambda_n v_n) = (v_1 \cdots v_n)\,\mathrm{diag}(\lambda_1, \ldots, \lambda_n) = U\Lambda$$
Thus $SU = U\Lambda$, i.e. $U^{-1}SU = \Lambda$, and $S = U\Lambda U^{-1}$.
Diagonal decomposition - example
Recall
$$S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}; \qquad \lambda_1 = 1,\ \lambda_2 = 3.$$
The eigenvectors $(1, -1)$ and $(1, 1)$ form
$$U = \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}$$
Inverting, we have
$$U^{-1} = \begin{pmatrix} 1/2 & -1/2 \\ 1/2 & 1/2 \end{pmatrix}$$
(recall UU⁻¹ = I). Then
$$S = U\Lambda U^{-1} = \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix} \begin{pmatrix} 1/2 & -1/2 \\ 1/2 & 1/2 \end{pmatrix}$$
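The factorization is easy to check numerically:

```python
import numpy as np

S = np.array([[2.0, 1.0], [1.0, 2.0]])
U = np.array([[1.0, 1.0], [-1.0, 1.0]])    # eigenvectors (1,-1) and (1,1) as columns
L = np.diag([1.0, 3.0])                    # matching eigenvalues

print(U @ L @ np.linalg.inv(U))            # recovers [[2. 1.], [1. 2.]]
```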
Example continued
Let's divide U (and multiply U⁻¹) by √2. Then
$$S = \underbrace{\begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}}_{Q} \begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}$$
with $Q^{-1} = Q^T$. Why? Stay tuned …
Symmetric Eigen Decomposition
If S is a symmetric matrix:
Theorem: there exists a (unique) eigen decomposition
$$S = Q \Lambda Q^T$$
where Q is orthogonal: Q⁻¹ = Qᵀ. The columns of Q are normalized eigenvectors, the columns are orthogonal, and everything is real.
Time out!
What do these matrices have to do with text? Recall the m×n term-document matrices… But everything so far needs square matrices, so…
Singular Value Decomposition
For an m×n matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) as follows:
$$A = U \Sigma V^T$$
where U is m×m, Σ is m×n, and V is n×n.
The columns of U are orthogonal eigenvectors of AAᵀ.
The columns of V are orthogonal eigenvectors of AᵀA.
The eigenvalues λ1 … λr of AAᵀ are also the eigenvalues of AᵀA, and
$$\sigma_i = \sqrt{\lambda_i}, \qquad \Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$$
are the singular values.
Singular Value Decomposition
[Figure: illustration of SVD dimensions and sparseness]
SVD example
Let
$$A = \begin{pmatrix} 1 & -1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}$$
Thus m = 3, n = 2. Its SVD is
$$A = \begin{pmatrix} 0 & 2/\sqrt{6} & 1/\sqrt{3} \\ 1/\sqrt{2} & -1/\sqrt{6} & 1/\sqrt{3} \\ 1/\sqrt{2} & 1/\sqrt{6} & -1/\sqrt{3} \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & \sqrt{3} \\ 0 & 0 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}$$
(Here the singular values are 1 and √3.) Typically, the singular values are arranged in decreasing order.
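This can be verified numerically; note that numpy returns the singular values in decreasing order, and the signs of the singular vectors may differ (the SVD is only unique up to such sign flips):

```python
import numpy as np

A = np.array([[1.0, -1.0], [0.0, 1.0], [1.0, 0.0]])
U, s, Vt = np.linalg.svd(A)                # U: 3x3, s: [sqrt(3), 1], Vt: 2x2

Sigma = np.zeros((3, 2))                   # rebuild the m x n Sigma
Sigma[0, 0], Sigma[1, 1] = s
print(np.allclose(U @ Sigma @ Vt, A))      # True: the product recovers A
```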
Low-rank Approximation
SVD can be used to compute optimal low-rank approximations.
Approximation problem: find A_k of rank k such that
$$A_k = \min_{X:\,\mathrm{rank}(X)=k} \|A - X\|_F$$
where ‖·‖_F is the Frobenius norm. A_k and X are both m×n matrices; typically we want k << r.
Solution via SVD: set the smallest r - k singular values to zero:
$$A_k = U\,\mathrm{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0)\,V^T$$
In column notation this is a sum of k rank-1 matrices:
$$A_k = \sum_{i=1}^{k} \sigma_i\, u_i v_i^T$$
Approximation error
How good (bad) is this approximation? It's the best possible, as measured by the Frobenius norm of the error:
$$\min_{X:\,\mathrm{rank}(X)=k} \|A - X\|_F = \|A - A_k\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}$$
where the σ_i are ordered so that σ_i ≥ σ_{i+1}. (In the spectral norm, the error is just σ_{k+1}.) This suggests why the Frobenius error drops as k is increased.
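A sketch checking this error identity on a random matrix (assuming the Frobenius-norm form given above):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

for k in range(1, 4):
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # best rank-k approximation
    err = np.linalg.norm(A - A_k, "fro")
    print(k, err, np.sqrt((s[k:] ** 2).sum()))     # the two error columns agree
```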
SVD Low-rank approximation
Whereas the term-doc matrix A may have m = 50,000 and n = 10 million (and rank close to 50,000), we can construct an approximation A_100 with rank 100. Of all rank-100 matrices, it would have the lowest Frobenius error.
Great … but why would we?? Answer: Latent Semantic Indexing.
C. Eckart, G. Young, The approximation of a matrix by another of lower rank. Psychometrika, 1, 211-218, 1936.
Performing the maps
Each row and column of A gets mapped into the k-dimensional LSI space, by the SVD.
A query q is also mapped into this space by
$$q_k = \Sigma_k^{-1} U_k^T\, q$$