Information Retrieval Models
Peng Bo, Oct 30, 2010
Review of the Last Lecture
Basic Index Techniques: inverted index, dictionary & postings
Scoring and Ranking: term weighting, tf·idf, Vector Space Model, cosine similarity
IR evaluation: precision, recall, F, interpolation, MAP, interpolated AP
Outline of This Lecture
Information Retrieval Models: Vector Space Model (VSM), Latent Semantic Indexing (LSI), Language Model (LM)
Relevance Feedback, Query Expansion
Vector Space Model
Documents as vectors
Each document j can be viewed as a vector: each term is a dimension, and the value is the log-scaled tf.idf weight.
So we have a vector space: terms are the axes, and docs live in this space. It is high-dimensional: even with stemming, there may be 20,000+ dimensions.
Term      D1    D2    D3    D4    D5    D6   …
中国       4.1   0.0   3.7   5.9   3.1   0.0
文化       4.5   4.5   0.0   0.0  11.6   0.0
日本       0.0   3.5   2.9   0.0   2.1   3.9
留学生     0.0   3.1   5.1  12.8   0.0   0.0
教育       2.9   0.0   0.0   2.2   0.0   0.0
北京       7.1   0.0   0.0   0.0   4.4   3.8
…
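To make the weighting concrete, here is a minimal numpy sketch that builds such a log-scaled tf·idf matrix. The toy corpus is a hypothetical stand-in, not the data behind the table above:

```python
import numpy as np

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["中国", "文化", "教育", "北京", "中国"],
    ["日本", "文化", "留学生"],
    ["中国", "日本", "留学生", "北京"],
]
vocab = sorted({t for d in docs for t in d})
N = len(docs)

# tf[i, j] = raw count of term i in document j
tf = np.array([[d.count(t) for d in docs] for t in vocab], dtype=float)

# df[i] = number of documents containing term i
df = (tf > 0).sum(axis=1)

# Log-scaled tf (1 + log tf where tf > 0), times idf = log(N / df)
wf = np.zeros_like(tf)
wf[tf > 0] = 1.0 + np.log(tf[tf > 0])
W = wf * np.log(N / df)[:, None]        # the term-document weight matrix

for term, row in zip(vocab, W):
    print(term, np.round(row, 2))
```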
Intuition
Postulate: documents that are "close together" in the vector space talk about the same things.
[Figure: documents d1-d5 as vectors in a three-term space with axes t1, t2, t3; the angles θ and φ between vectors measure their closeness.]
Use cases: Query-by-example; a free-text query treated as a vector.
Cosine similarity
The "closeness" of two vectors d1 and d2 can be measured by the angle between them; concretely, the cosine of the angle is used as the vector similarity.
Vectors are normalized by their length:
$$|\vec d_j| = \sqrt{\sum_{i=1}^{M} w_{i,j}^2}$$
$$\mathrm{sim}(d_j, d_k) = \frac{\vec d_j \cdot \vec d_k}{|\vec d_j|\,|\vec d_k|} = \frac{\sum_{i=1}^{M} w_{i,j}\,w_{i,k}}{\sqrt{\sum_{i=1}^{M} w_{i,j}^2}\;\sqrt{\sum_{i=1}^{M} w_{i,k}^2}}$$
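A direct transcription of this formula into code (a sketch; a real system would compute these scores from a sparse inverted index rather than dense vectors):

```python
import numpy as np

def cosine_sim(dj: np.ndarray, dk: np.ndarray) -> float:
    """sim(dj, dk) = (dj . dk) / (|dj| |dk|): the cosine of the angle between them."""
    denom = np.linalg.norm(dj) * np.linalg.norm(dk)
    return float(dj @ dk / denom) if denom else 0.0

# Columns D1 and D5 of the term-document table above:
d1 = np.array([4.1, 4.5, 0.0, 0.0, 2.9, 7.1])
d5 = np.array([3.1, 11.6, 2.1, 0.0, 0.0, 4.4])
print(cosine_sim(d1, d5))   # documents sharing heavily weighted terms score higher
```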
#1. Cosine Similarity
Compute the similarity between the query "digital cameras" and the document "digital cameras and video cameras".
Assume N = 10,000,000; both the query and the document use logarithmic term weighting (wf columns); the query additionally uses idf weighting, and the document uses cosine normalization. "and" is treated as a stop word.
#2. Evaluation
Define the precision-recall graph as follows: for a ranked result list, compute a precision/recall point at each returned document; the graph consists of these points.
On this graph, define the breakeven point as the point where precision and recall are equal.
Question: can a graph have more than one breakeven point? If so, give an example; if not, prove it.
Latent Semantic Model
Vector Space Model: Pros
Automatic selection of index terms
Partial matching of queries and documents (dealing with the case where no document contains all search terms)
Ranking according to similarity score (dealing with large result sets)
Term weighting schemes (improve retrieval performance)
Various extensions: document clustering, relevance feedback (modifying the query vector)
Geometric foundation
Problems with Lexical Semantics
Polysemy: words often have a multitude of meanings and different uses. The Vector Space Model cannot distinguish between different senses of the same word, i.e. ambiguity.
Synonymy: different terms may have an identical or similar meaning. The Vector Space Model cannot represent associations between words.
Issues in the VSM
Independence is assumed between terms, yet some terms are more likely to occur together (synonyms, related words, spelling variants, etc.), and terms may have different meanings depending on context.
The term-document matrix is very high-dimensional: does every document / every term really have that many important features?
Singular Value Decomposition
Apply Singular Value Decomposition to the term-document matrix. Here r is the rank of the matrix, Σ is a diagonal matrix of singular values (in decreasing order), and T and D have orthonormal columns (TᵀT = I, DᵀD = I).
$$W_{t\times d} = T_{t\times r}\,\Sigma_{r\times r}\,D^T_{r\times d}$$
The diagonal of Σ² holds the eigenvalues of WWᵀ; the columns of T and D are the eigenvectors of WWᵀ and WᵀW, respectively.
Singular Values
Σ gives an ordering to the dimensions. The values drop off very quickly, and the trailing singular values represent "noise".
Cutting off the low-value dimensions reduces this noise and improves retrieval performance.
Low-rank Approximation
$$W_{t\times d} = T_{t\times r}\,\Sigma_{r\times r}\,D^T_{r\times d} \;\approx\; W'_{t\times d} = T_{t\times k}\,\Sigma_{k\times k}\,D^T_{k\times d}$$
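A minimal numpy sketch of this truncation; the matrix here is a random stand-in for a real term-document matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((6, 5))                  # stand-in t x d term-document matrix

# W = T diag(sigma) Dt, with singular values in decreasing order
T, sigma, Dt = np.linalg.svd(W, full_matrices=False)

k = 2                                   # keep only the k largest singular values
W_k = T[:, :k] @ np.diag(sigma[:k]) @ Dt[:k, :]

print(np.linalg.matrix_rank(W_k))       # 2
print(np.linalg.norm(W - W_k, "fro"))   # error from the dropped dimensions
print(np.sqrt((sigma[k:] ** 2).sum()))  # the same value (Eckart-Young)
```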
Latent Semantic Indexing (LSI)
Perform a low-rank approximation of the term-document matrix (typical rank: 100-300).
General idea: map documents (and terms) to a low-dimensional representation; design the mapping so that the low-dimensional space reflects semantic associations (the latent semantic space); compute document similarity based on the inner product in this latent semantic space.
What it is
From the original term-document matrix A_r we compute its approximation A_k.
In A_k, each row still corresponds to a term and each column to a document. The difference is that documents now live in a new space of dimension k << r.
How do we compare two terms? How do we compare two documents? How do we compare a term and a document?
$$A_k^T A_k = D\Sigma T^T T\Sigma D^T = (\Sigma D^T)^T(\Sigma D^T)$$
(inner products of the columns of ΣDᵀ compare documents)
$$A_k A_k^T = T\Sigma D^T D\Sigma T^T = (T\Sigma)(T\Sigma)^T$$
(inner products of the rows of TΣ compare terms), and the single entry A_k[i,j] compares term i with document j.
LSI Term matrix T
The T matrix holds each term's vector in the LSI space. In the original matrix, term vectors are d-dimensional; in T they are much smaller.
The dimensions correspond to groups of terms that tend to "co-occur" in the same documents as this term: synonyms, contextually related words, variant endings.
(TΣ) is used to compute term similarity.
Document matrix D
The D matrix is the representation of documents in the LSI space; its vectors have the same dimensionality as the T vectors. (DΣ) is used to compute document similarity, and can also be used to compute the similarity between a query and a document.
Retrieval with LSI
The LSI retrieval process: the query is mapped (projected) into the LSI document space, which is called "folding in". Since W = TΣDᵀ, if q projects to q' in that space, then q = TΣq'ᵀ, and hence
$$q' = (\Sigma^{-1} T^T q)^T = q^T T \Sigma^{-1}$$
Folding in thus amounts to multiplying the document/query vector by TΣ⁻¹.
The collection's document vectors are given by DΣ, and the query and documents are compared by their dot product.
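A sketch of this fold-in step in the slide's notation (W = TΣDᵀ); the matrix and query here are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.random((8, 6))                      # t x d term-document matrix (stand-in)

k = 2
T, sigma, Dt = np.linalg.svd(W, full_matrices=False)
T_k = T[:, :k]                              # terms in LSI space
S_k = np.diag(sigma[:k])
D_k = Dt[:k, :].T                           # rows = documents in LSI space

# Fold a query (a vector in term space) into the LSI space: q' = q^T T Sigma^-1
q = np.zeros(8)
q[[0, 3]] = 1.0                             # hypothetical query containing terms 0 and 3
q_lsi = q @ T_k @ np.linalg.inv(S_k)

# Rank documents by dot product with the folded-in query
scores = D_k @ q_lsi
print(np.argsort(-scores))                  # document ids, best first
```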
Improved Retrieval with LSI
Where does the improvement come from? Noise is removed; there is no need to stem terms (variants will co-occur); no stop list is needed. There is no improvement in speed or space, though…
Example
[Worked example: a term-document matrix C is factored as C = T_r Σ_r D_rᵀ; truncating to Σ_2 and D_2ᵀ maps the documents into a 2-dimensional space.]
Latent Semantic Analysis
[Figure: latent semantic space, an illustrating example (courtesy of Susan Dumais)]
Empirical evidence
Experiments on TREC 1/2/3 (Dumais): precision at or above the median TREC precision; top scorer on almost 20% of TREC topics; slightly better on average than straight vector spaces.
Effect of dimensionality:

Dimensions  Precision
250         0.367
300         0.371
346         0.374
LSI has many other applications
In many settings we have a feature-object matrix that is high-dimensional and highly redundant, which is exactly what makes low-rank approximation work.
In text retrieval, the terms are the features and the docs are the objects: Latent Semantic Indexing.
With opinions and users, the data is incomplete (e.g., users' opinions), and the missing values can be recovered in the low-dimensional space.
A powerful general analytical technique.
Language Models
IR based on Language Model (LM)
[Figure: an information need gives rise to a query; each document d1, d2, …, dn in the collection induces a language model M_d1, M_d2, …, M_dn that may generate the query.]
The usual search approach: guess the words the author would have used when writing a relevant document, and form the query from them. The LM approach directly exploits that idea: documents are ranked by the generation probability
$$P(Q \mid M_d)$$
Formal Language (Model)
A traditional generative model generates strings: finite state machines or regular grammars, etc.
Example: the grammar (I wish)* generates
I wish
I wish I wish
I wish I wish I wish
…
Stochastic Language Models
Models probability of generating strings in the language (commonly all strings over alphabet ∑)
Model M:

the    0.2
a      0.1
man    0.01
woman  0.01
said   0.03
likes  0.02
…

s = "the man likes the woman"
Multiply the unigram probabilities:
P(s | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008
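The "multiply" step is a one-liner in code; a sketch using the model M from the slide (only the listed words; everything else gets probability 0):

```python
M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}

def string_prob(s: str, model: dict) -> float:
    """P(s | M): the product of the unigram probabilities (0 for unseen words)."""
    p = 1.0
    for w in s.split():
        p *= model.get(w, 0.0)
    return p

print(string_prob("the man likes the woman", M))   # 0.2*0.01*0.02*0.2*0.01 = 8e-08
```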
Stochastic Language Models
Model the probability of generating any string.

Model M1:            Model M2:
the      0.2         the      0.2
class    0.01        class    0.0001
sayst    0.0001      sayst    0.03
pleaseth 0.0001      pleaseth 0.02
yon      0.0001      yon      0.1
maiden   0.0005      maiden   0.01
woman    0.01        woman    0.0001

s = "the class pleaseth yon maiden"
P(s | M1) = 0.2 × 0.01 × 0.0001 × 0.0001 × 0.0005
P(s | M2) = 0.2 × 0.0001 × 0.02 × 0.1 × 0.01
P(s | M2) > P(s | M1)
Stochastic Language Models
A statistical model for generating text: a probability distribution over strings in a given language. For a string of words w1 w2 w3 w4:
$$P(w_1 w_2 w_3 w_4 \mid M) = P(w_1 \mid M)\,P(w_2 \mid M, w_1)\,P(w_3 \mid M, w_1 w_2)\,P(w_4 \mid M, w_1 w_2 w_3)$$
Unigram and higher-order models
Unigram Language Models:
$$P(w_1 w_2 w_3 w_4) = P(w_1)\,P(w_2)\,P(w_3)\,P(w_4)$$
Easy. Effective!
Bigram (generally, n-gram) Language Models:
$$P(w_1 w_2 w_3 w_4) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_2)\,P(w_4 \mid w_3)$$
Other Language Models: grammar-based models (PCFGs), etc. Probably not the first thing to try in IR.
The fundamental problem of LMs
The model M is unknown; we only have sample text representative of that model.
So: estimate the model from the sample text, then compute the probability of the observed text under it:
$$P(s \mid M(\text{sample text}))$$
Using Language Models in IR
Each document corresponds to a model. Rank documents by P(d | q), where P(d | q) = P(q | d) × P(d) / P(q).
P(q) is the same for all documents, so it can be ignored. P(d) [the prior] is often treated as the same for all d, but we could use criteria like authority, length, genre. P(q | d) is the probability of q given d's model.
Very general formal approach
Language Models for IR
Language modeling approaches model the query generation process: documents are ranked by the probability that the query would be observed as a random sample from the respective document model (a multinomial approach).
Retrieval based on probabilistic LM
Treat query generation as a random process. The approach: infer a language model for each document; estimate the probability that each document model generated this query; rank the documents by this probability. A unigram model is usually used.
Query generation probability (1)
Ranking formula:
$$p(Q, d) = p(d)\,p(Q \mid d) \approx p(d)\,p(Q \mid M_d)$$
Using the maximum likelihood estimate:
$$\hat p(Q \mid M_d) = \prod_{t \in Q} \hat p_{ml}(t \mid M_d) = \prod_{t \in Q} \frac{tf_{(t,d)}}{dl_d}$$
where M_d is the language model of document d, tf_{(t,d)} is the raw tf of term t in document d, and dl_d is the total number of tokens in document d.
Unigram assumption: given a particular language model, the query terms occur independently.
Insufficient data
Zero probability: what happens when a document contains none of the occurrences of some query term, i.e. $p(t \mid M_d) = 0$?
General approach: a term that does not occur in a document is assigned its probability of occurring in the whole collection. If $tf_{(t,d)} = 0$:
$$p(t \mid M_d) = \frac{cf_t}{cs}$$
where cf_t is the raw count of term t in the collection and cs is the raw collection size (the total number of tokens in the collection).
Insufficient data
Zero probabilities spell disaster, so smooth the probabilities:
Discount nonzero probabilities and give some probability mass to unseen things. There are many approaches, e.g. adding 1, ½ or a small ε to counts, Dirichlet priors, discounting, and interpolation [see FSNLP ch. 6 if you want more].
Alternatively, use a mixture model: a mixture between the document multinomial and the collection multinomial distribution.
Mixture model
P(w | d) = λ·Pmle(w | Md) + (1 - λ)·Pmle(w | Mc)
The parameter λ is very important: a high λ makes the search "conjunctive-like", which suits short queries; a low λ is more suitable for long queries.
Tune λ to optimize performance, for example by making it dependent on document length (cf. Dirichlet prior or Witten-Bell smoothing).
Basic mixture model summary
General formulation of the LM for IR
$$p(Q, d) = p(d) \prod_{t \in Q} \big((1 - \lambda)\,p(t) + \lambda\,p(t \mid M_d)\big)$$
where p(t) is the general language model and p(t | M_d) the individual-document model.
Example
Document collection (2 documents):
d1: Xerox reports a profit but revenue is down
d2: Lucent narrows quarter loss but revenue decreases further
Model: MLE unigram from documents; λ = ½. Query: revenue down
P(Q | d1) = [(1/8 + 2/16)/2] × [(1/8 + 1/16)/2] = 1/8 × 3/32 = 3/256
P(Q | d2) = [(1/8 + 2/16)/2] × [(0 + 1/16)/2] = 1/8 × 1/32 = 1/256
Ranking: d1 > d2
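The same computation in code: a minimal sketch of the mixture model that reproduces the numbers above.

```python
# P(w|d) = lam * Pmle(w|Md) + (1 - lam) * Pmle(w|Mc), with lam = 1/2
d1 = "xerox reports a profit but revenue is down".split()
d2 = "lucent narrows quarter loss but revenue decreases further".split()
collection = d1 + d2
lam = 0.5

def p_mix(w: str, doc: list) -> float:
    p_doc = doc.count(w) / len(doc)                  # MLE from the document
    p_col = collection.count(w) / len(collection)    # MLE from the collection
    return lam * p_doc + (1 - lam) * p_col

def query_likelihood(query: str, doc: list) -> float:
    p = 1.0
    for w in query.split():
        p *= p_mix(w, doc)
    return p

print(query_likelihood("revenue down", d1))   # 3/256 = 0.01171875
print(query_likelihood("revenue down", d2))   # 1/256 = 0.00390625
```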
Alternative Models of Text Generation
[Diagram: a Searcher generates a query model, P(Query Model | Searcher), from which a Query is drawn, P(Query | Query Model); a Writer generates a doc model, P(Doc Model | Writer), from which a Doc is drawn, P(Doc | Doc Model).]
Is this the same model?
Retrieval Using Language Models
[Diagram: a Query is described by a query model P(w | Query), and a Doc by a doc model P(w | Doc); the two sides can be matched in three ways.]
(1) Query likelihood, (2) document likelihood, (3) model comparison.
Query Likelihood
Rank by P(Q | Dm). The main problem is estimating the document model, i.e. smoothing techniques instead of tf.idf weights.
Retrieval performance is good, e.g. at UMass, BBN, Twente, CMU.
Problem: relevance feedback, query expansion and structured queries are hard to handle.
Document Likelihood
Rank by P(D | R) / P(D | NR), where P(w | R) is estimated by P(w | Qm) (Qm is the query or relevance model) and P(w | NR) is estimated by collection probabilities P(w).
The problem is estimating the relevance model: treat the query as generated by a mixture of topic and background, and estimate the relevance model from related documents (query expansion).
Relevance feedback is easily incorporated. Good retrieval results, e.g. UMass at SIGIR 01, but inconsistent with heterogeneous document collections.
Model Comparison
Estimate both a query model and a document model, and compare the two models using KL divergence:
$$D(Q_m \parallel D_m) = \sum_{x \in X} Q_m(x) \log \frac{Q_m(x)}{D_m(x)}$$
This achieves better results than the two previous approaches.
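A sketch of model comparison via KL divergence; the query and document models below are hypothetical dicts of smoothed probabilities:

```python
import math

def kl_divergence(qm: dict, dm: dict) -> float:
    """D(Qm || Dm) = sum_x Qm(x) log(Qm(x)/Dm(x)); Dm must be smoothed (nonzero)."""
    return sum(p * math.log(p / dm[x]) for x, p in qm.items() if p > 0)

qm = {"revenue": 0.5, "down": 0.5}            # query model (MLE from a 2-term query)
doc_models = {                                # smoothed document models (query terms only)
    "d1": {"revenue": 0.125, "down": 0.09375},
    "d2": {"revenue": 0.125, "down": 0.03125},
}

# Smaller divergence means a better match, so rank ascending
print(sorted(doc_models, key=lambda d: kl_divergence(qm, doc_models[d])))   # ['d1', 'd2']
```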
Language models: pro & con
A novel way of looking at the problem of text retrieval, based on probabilistic language modeling: conceptually simple and explanatory; a formal mathematical model; natural use of collection statistics rather than heuristics (almost…).
LMs provide effective retrieval, and can be improved to the extent that the following conditions are met: our language models are accurate representations of the data, and users have some sense of term distribution.
Comparison With Vector Space
There are clear connections to traditional tf.idf models: (unscaled) term frequency appears directly in the model; the probabilities do length normalization of term frequencies; and the effect of mixing with overall collection frequencies is a little like idf: terms rare in the general collection but common in some documents will have a greater influence on the ranking.
Comparison With Vector Space
Similarities: term weights based on frequency; terms often used as if they were independent; inverse document/collection frequency used; some form of length normalization used.
Differences: based on probability rather than similarity; intuitions are probabilistic rather than geometric; details of the use of document length and of term, document, and collection frequency differ.
Summary of This Lecture
Latent Semantic Indexing: singular value decomposition, matrix low-rank approximation.
Language Model: generative model, smoothed probabilities, mixture model:
$$p(Q, d) = p(d) \prod_{t \in Q} \big((1 - \lambda)\,p(t) + \lambda\,p(t \mid M_d)\big)$$
Resources
The Template Numerical Toolkit (TNT): http://math.nist.gov/tnt/documentation.html
The Lemur Toolkit for Language Modeling and Information Retrieval: http://www-2.cs.cmu.edu/~lemur/ (CMU/UMass LM and IR system in C(++), currently actively developed)
Thank You!
Q&A
Reading Material
[1] IIR Ch. 12, Ch. 18
[2] A. Moffat, J. Zobel, and D. Hawking, "Recommended reading for IR research students," SIGIR Forum, vol. 39, pp. 3-14, 2005.
#2 Evaluation
Question (a): there cannot be two or more breakeven points.
Proof: consider a retrieval run I with relevant set R. Suppose the current point is a breakeven point, with retrieved set A and retrieved-relevant set Ra. Then precision = |Ra|/|A| and recall = |Ra|/|R|; by the definition of the breakeven point, precision = recall, and hence |R| = |A|. Suppose that after retrieving k (k > 0) more documents another breakeven point occurs; then precision = |R'a|/|A'| and recall = |R'a|/|R|, which gives |A'| = |R|. But |A'| = |A| + k with k > 0 and |A| = |R|, a contradiction. So there cannot be two or more breakeven points.
Note: when no relevant document has been retrieved yet, precision and recall are both zero. Is that a breakeven point? If such points are counted, then there can be two or more breakeven points.
Matrix Low-rank Approximation for LSI
Eigenvalues & Eigenvectors
Eigenvectors (for a square m×m matrix S):
$$S v = \lambda v$$
with eigenvalue λ and (right) eigenvector v.
How many eigenvalues are there at most? The equation $(S - \lambda I)v = 0$ only has a non-zero solution if $|S - \lambda I| = 0$; this is an m-th order equation in λ, which can have at most m distinct solutions (the roots of the characteristic polynomial). They can be complex even though S is real.
Example: matrix-vector multiplication
$$S = \begin{pmatrix} 3 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 0 \end{pmatrix}$$
S has eigenvalues 3, 2, 0 with corresponding eigenvectors
$$v_1 = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix},\quad v_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix},\quad v_3 = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}$$
Any vector, say $x = (2, 4, 6)^T$, can be viewed as a combination of the eigenvectors: $x = 2v_1 + 4v_2 + 6v_3$.
Matrix vector multiplication
Thus a matrix-vector multiplication such as Sx (S, x as in the previous slide) can be rewritten in terms of the eigenvalues/vectors:
Even though x is an arbitrary vector, the action of S on x is determined by the eigenvalues/vectors.
Suggestion: the effect of “small” eigenvalues is small.
$$Sx = S(2v_1 + 4v_2 + 6v_3) = 2Sv_1 + 4Sv_2 + 6Sv_3 = 2\lambda_1 v_1 + 4\lambda_2 v_2 + 6\lambda_3 v_3 = 6v_1 + 8v_2 + 0v_3$$
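A quick numpy check of this eigenvalue view of the product:

```python
import numpy as np

S = np.diag([3.0, 2.0, 0.0])
x = np.array([2.0, 4.0, 6.0])

lam, V = np.linalg.eig(S)          # eigenvalues, eigenvectors (as columns of V)

coeffs = np.linalg.solve(V, x)     # expand x in the eigenbasis: x = V @ coeffs
print(V @ (lam * coeffs))          # [6. 8. 0.] -- scale each component by its eigenvalue
print(S @ x)                       # the same result, computed directly
```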
Eigenvalues & Eigenvectors
For symmetric matrices, eigenvectors for distinct eigenvalues are orthogonal:
$$S v_{\{1,2\}} = \lambda_{\{1,2\}} v_{\{1,2\}} \text{ and } \lambda_1 \neq \lambda_2 \;\Rightarrow\; v_1 \cdot v_2 = 0$$
All eigenvalues of a real symmetric matrix are real:
$$S = S^T \text{ and } |S - \lambda I| = 0 \;\Rightarrow\; \lambda \in \mathbb{R}$$
Example
Let
$$S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$$
(real, symmetric). Then
$$|S - \lambda I| = \begin{vmatrix} 2-\lambda & 1 \\ 1 & 2-\lambda \end{vmatrix} = (2-\lambda)^2 - 1 = 0$$
The eigenvalues are 1 and 3 (nonnegative, real). Plug these values in and solve for the eigenvectors. The eigenvectors are orthogonal (and real):
$$\begin{pmatrix} 1 \\ -1 \end{pmatrix} \quad\text{and}\quad \begin{pmatrix} 1 \\ 1 \end{pmatrix}$$
Eigen/diagonal Decomposition
Let S be a square matrix with m linearly independent eigenvectors (a "diagonalizable" matrix).
Theorem: there exists an eigen decomposition
$$S = U \Lambda U^{-1}$$
where Λ is diagonal. The columns of U are the eigenvectors of S, and the diagonal elements of Λ are the eigenvalues of S.

Diagonal decomposition: why/how
Let U have the eigenvectors as columns: $U = (v_1 \cdots v_n)$. Then SU can be written
$$SU = S(v_1 \cdots v_n) = (\lambda_1 v_1 \cdots \lambda_n v_n) = (v_1 \cdots v_n)\,\mathrm{diag}(\lambda_1, \ldots, \lambda_n) = U\Lambda$$
Thus $SU = U\Lambda$, i.e. $U^{-1}SU = \Lambda$, and $S = U\Lambda U^{-1}$.
Diagonal decomposition - example
Recall
$$S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}; \qquad \lambda_1 = 1,\ \lambda_2 = 3.$$
The eigenvectors $(1, -1)$ and $(1, 1)$ form
$$U = \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}$$
Inverting, we have
$$U^{-1} = \begin{pmatrix} 1/2 & -1/2 \\ 1/2 & 1/2 \end{pmatrix}$$
(recall UU⁻¹ = I). Then
$$S = U\Lambda U^{-1} = \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix} \begin{pmatrix} 1/2 & -1/2 \\ 1/2 & 1/2 \end{pmatrix}$$
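The factorization is easy to check numerically:

```python
import numpy as np

S = np.array([[2.0, 1.0], [1.0, 2.0]])
U = np.array([[1.0, 1.0], [-1.0, 1.0]])    # eigenvectors (1,-1) and (1,1) as columns
L = np.diag([1.0, 3.0])                    # matching eigenvalues

print(U @ L @ np.linalg.inv(U))            # recovers [[2. 1.], [1. 2.]]
```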
Example continued
Let's divide U (and multiply U⁻¹) by √2. Then
$$S = \underbrace{\begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}}_{Q} \begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}$$
with $Q^{-1} = Q^T$. Why? Stay tuned …
Symmetric Eigen Decomposition
If S is a symmetric matrix:
Theorem: there exists a (unique) eigen decomposition
$$S = Q \Lambda Q^T$$
where Q is orthogonal: Q⁻¹ = Qᵀ. The columns of Q are normalized eigenvectors, the columns are orthogonal, and everything is real.
Time out!
What do these matrices have to do with text? Recall the m×n term-document matrices… But everything so far needs square matrices, so…
Singular Value Decomposition
For an m×n matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) as follows:
$$A = U \Sigma V^T$$
where U is m×m, Σ is m×n, and V is n×n.
The columns of U are orthogonal eigenvectors of AAᵀ.
The columns of V are orthogonal eigenvectors of AᵀA.
The eigenvalues λ1 … λr of AAᵀ are also the eigenvalues of AᵀA, and
$$\sigma_i = \sqrt{\lambda_i}, \qquad \Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$$
are the singular values.
Singular Value Decomposition
[Figure: illustration of SVD dimensions and sparseness]
SVD example
Let
$$A = \begin{pmatrix} 1 & -1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}$$
Thus m = 3, n = 2. Its SVD is
$$A = \begin{pmatrix} 0 & 2/\sqrt{6} & 1/\sqrt{3} \\ 1/\sqrt{2} & -1/\sqrt{6} & 1/\sqrt{3} \\ 1/\sqrt{2} & 1/\sqrt{6} & -1/\sqrt{3} \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & \sqrt{3} \\ 0 & 0 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}$$
(Here the singular values are 1 and √3.) Typically, the singular values are arranged in decreasing order.
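This can be verified numerically; note that numpy returns the singular values in decreasing order, and the signs of the singular vectors may differ (the SVD is only unique up to such sign flips):

```python
import numpy as np

A = np.array([[1.0, -1.0], [0.0, 1.0], [1.0, 0.0]])
U, s, Vt = np.linalg.svd(A)                # U: 3x3, s: [sqrt(3), 1], Vt: 2x2

Sigma = np.zeros((3, 2))                   # rebuild the m x n Sigma
Sigma[0, 0], Sigma[1, 1] = s
print(np.allclose(U @ Sigma @ Vt, A))      # True: the product recovers A
```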
Low-rank Approximation
SVD can be used to compute optimal low-rank approximations.
Approximation problem: find A_k of rank k such that
$$A_k = \min_{X:\,\mathrm{rank}(X)=k} \|A - X\|_F$$
where ‖·‖_F is the Frobenius norm. A_k and X are both m×n matrices; typically we want k << r.
Solution via SVD: set the smallest r - k singular values to zero:
$$A_k = U\,\mathrm{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0)\,V^T$$
In column notation this is a sum of k rank-1 matrices:
$$A_k = \sum_{i=1}^{k} \sigma_i\, u_i v_i^T$$
Approximation error
How good (bad) is this approximation? It's the best possible, as measured by the Frobenius norm of the error:
$$\min_{X:\,\mathrm{rank}(X)=k} \|A - X\|_F = \|A - A_k\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}$$
where the σ_i are ordered so that σ_i ≥ σ_{i+1}. (In the spectral norm, the error is just σ_{k+1}.) This suggests why the Frobenius error drops as k is increased.
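A sketch checking this error identity on a random matrix (assuming the Frobenius-norm form given above):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

for k in range(1, 4):
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # best rank-k approximation
    err = np.linalg.norm(A - A_k, "fro")
    print(k, err, np.sqrt((s[k:] ** 2).sum()))     # the two error columns agree
```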
SVD Low-rank approximation
Whereas the term-doc matrix A may have m = 50,000 and n = 10 million (and rank close to 50,000), we can construct an approximation A_100 with rank 100. Of all rank-100 matrices, it would have the lowest Frobenius error.
Great … but why would we?? Answer: Latent Semantic Indexing.
C. Eckart, G. Young, The approximation of a matrix by another of lower rank. Psychometrika, 1, 211-218, 1936.
Performing the maps
Each row and column of A gets mapped into the k-dimensional LSI space, by the SVD.
A query q is also mapped into this space by
$$q_k = \Sigma_k^{-1} U_k^T\, q$$