9/19/2000 Information Organization and Retrieval
Vector and Probabilistic Ranking
Ray Larson & Marti Hearst
University of California, Berkeley
School of Information Management and Systems
SIMS 202: Information Organization and Retrieval
Review
• Inverted files
• The Vector Space Model
• Term weighting
Inverted indexes
• Permit fast search for individual terms
• For each term, you get a list consisting of:
– document ID
– frequency of term in doc (optional)
– position of term in doc (optional)
• These lists can be used to solve Boolean queries:
– country -> d1, d2
– manor -> d2
– country AND manor -> d2
• Also used for statistical ranking algorithms
How Inverted Files are Used
Dictionary and postings for a two-document collection (the postings file holds a (Doc #, Freq) pair for each term occurrence):

Term      N docs  Tot Freq
a           1       1
aid         1       1
all         1       1
and         1       1
come        1       1
country     2       2
dark        1       1
for         1       1
good        1       1
in          1       1
is          1       1
it          1       1
manor       1       1
men         1       1
midnight    1       1
night       1       1
now         1       1
of          1       1
past        1       1
stormy      1       1
the         2       4
their       1       1
time        2       2
to          1       2
was         1       2

Query on “time” AND “dark”:
– 2 docs with “time” in dictionary -> IDs 1 and 2 from posting file
– 1 doc with “dark” in dictionary -> ID 2 from posting file
– Therefore, only doc 2 satisfies the query.
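As a concrete sketch of the slide's example (the helper names are mine, not from the slides), the two-document collection can be indexed and the Boolean AND answered by intersecting posting lists:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a postings dict of doc_id -> within-document frequency."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

docs = {
    1: "now is the time for all good men to come to the aid of their country",
    2: "it was a dark and stormy night in the country manor the time was past midnight",
}
index = build_inverted_index(docs)

# Boolean AND: intersect the doc-ID sets of the two posting lists
time_docs = set(index["time"])
dark_docs = set(index["dark"])
print(sorted(time_docs & dark_docs))  # only doc 2 contains both terms
```

The same postings (with frequencies) are what a statistical ranking algorithm would consume.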
Vector Space Model
• Documents are represented as vectors in term space
– Terms are usually stems
– Documents represented by binary vectors of terms
• Queries represented the same as documents
• Query and Document weights are based on length and direction of their vector
• A vector distance measure between the query and documents is used to rank retrieved documents
Documents in 3D Space
Assumption: Documents that are “close together” in space are similar in meaning.
Vector Space Documents and Queries

docs  t1  t2  t3  RSV = Q·Di
D1    1   0   1   4
D2    1   0   0   1
D3    0   1   1   5
D4    1   0   0   1
D5    1   1   1   6
D6    1   1   0   3
D7    0   1   0   2
D8    0   1   0   2
D9    0   0   1   3
D10   0   1   1   5
D11   1   0   1   4
Q     1   2   3

[Figure: documents D1–D11 and query Q plotted in term space (t1, t2, t3), illustrating Boolean term combinations]

Q is a query – also represented as a vector
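The RSV column above is just the inner product of the query vector with each binary document vector; a minimal sketch (function name mine) that reproduces the table's ranking:

```python
# Binary document vectors over terms (t1, t2, t3), query weights Q = (1, 2, 3)
docs = {
    "D1": (1, 0, 1), "D2": (1, 0, 0), "D3": (0, 1, 1), "D4": (1, 0, 0),
    "D5": (1, 1, 1), "D6": (1, 1, 0), "D7": (0, 1, 0), "D8": (0, 1, 0),
    "D9": (0, 0, 1), "D10": (0, 1, 1), "D11": (1, 0, 1),
}
Q = (1, 2, 3)

def rsv(q, d):
    """Retrieval status value: inner product Q . Di."""
    return sum(qi * di for qi, di in zip(q, d))

ranked = sorted(docs, key=lambda name: rsv(Q, docs[name]), reverse=True)
print([(name, rsv(Q, docs[name])) for name in ranked])  # D5 (RSV 6) ranks first
```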
Documents in Vector Space

[Figure: documents D1–D11 plotted in the term space (t1, t2, t3)]
Assigning Weights to Terms
• Binary Weights
• Raw term frequency
• tf x idf
– Recall the Zipf distribution
– Want to weight terms highly if they are
• frequent in relevant documents … BUT
• infrequent in the collection as a whole
• Automatically derived thesaurus terms
Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector
docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1
D11   1   0   1
Raw Term Weights
• The frequency of occurrence for the term in each document is included in the vector
docs  t1  t2  t3
D1    2   0   3
D2    1   0   0
D3    0   4   7
D4    3   0   0
D5    1   6   3
D6    3   5   0
D7    0   8   0
D8    0  10   0
D9    0   0   1
D10   0   3   5
D11   4   0   1
Assigning Weights
• tf x idf measure:
– term frequency (tf)
– inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution
• Goal: assign a tf x idf weight to each term in each document
tf x idf

w_ik = tf_ik · log(N / n_k)

where:
– T_k = term k in document D_i
– tf_ik = frequency of term T_k in document D_i
– idf_k = inverse document frequency of term T_k in collection C
– N = total number of documents in collection C
– n_k = number of documents in C that contain T_k
– idf_k = log(N / n_k)
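A direct transcription of this weight (assuming, as in the worked idf examples elsewhere in these slides, that the log is base 10):

```python
import math

def tfidf(tf_ik, N, n_k):
    """w_ik = tf_ik * log10(N / n_k): raw term frequency scaled by
    inverse document frequency."""
    return tf_ik * math.log10(N / n_k)

# A term occurring 3 times in a document, and in 20 of 10000 documents:
print(round(tfidf(3, 10000, 20), 3))  # 3 * log10(500) = 8.097
```

A term present in every document (n_k = N) gets weight 0, no matter how frequent it is locally.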
Inverse Document Frequency
• IDF provides high values for rare words and low values for common words
For a collection of 10000 documents (base-10 log):
– n_k = 10000: idf = log(10000/10000) = 0
– n_k = 5000: idf = log(10000/5000) = 0.301
– n_k = 20: idf = log(10000/20) = 2.698
– n_k = 1: idf = log(10000/1) = 4
Similarity Measures
Simple matching (coordination level match): |Q ∩ D|

Dice’s Coefficient: 2·|Q ∩ D| / (|Q| + |D|)

Jaccard’s Coefficient: |Q ∩ D| / |Q ∪ D|

Cosine Coefficient: |Q ∩ D| / (|Q|^(1/2) · |D|^(1/2))

Overlap Coefficient: |Q ∩ D| / min(|Q|, |D|)
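For binary (set-of-terms) representations, these coefficients are a few lines each; a sketch with an invented query/document pair:

```python
def simple_matching(q, d): return len(q & d)
def dice(q, d): return 2 * len(q & d) / (len(q) + len(d))
def jaccard(q, d): return len(q & d) / len(q | d)
def cosine(q, d): return len(q & d) / (len(q) ** 0.5 * len(d) ** 0.5)
def overlap(q, d): return len(q & d) / min(len(q), len(d))

Q = {"star", "galaxy", "nebula"}
D = {"star", "galaxy", "planet", "comet"}
print(jaccard(Q, D))  # 2 shared terms out of 5 distinct -> 0.4
```

All five agree on the numerator |Q ∩ D|; they differ only in how they normalize for the sizes of Q and D.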
Vector Space Visualization
Text Clustering

Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw

[Figure: document clusters plotted against Term 1 and Term 2]
K-Means Clustering
1. Create a pair-wise similarity measure
2. Find K centers using agglomerative clustering
– take a small sample
– group bottom up until K groups found
3. Assign each document to nearest center, forming new clusters
4. Repeat step 3 as necessary
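A minimal sketch of the assign-and-recenter loop (steps 3–4) on 2-D points; for simplicity the seeds are passed in directly rather than found by agglomerative clustering as step 2 describes:

```python
def kmeans(points, centers, iters=10):
    """Repeat: assign each point to its nearest center, then move each
    center to the mean of its cluster."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, clusters = kmeans(points, centers=[(0, 0), (9, 9)])
print(centers)  # each center moves to the mean of its 3-point cluster
```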
Scatter/Gather
Cutting, Pedersen, Tukey & Karger 92, 93; Hearst & Pedersen 95
• Cluster sets of documents into general “themes”, like a table of contents
• Display the contents of the clusters by showing topical terms and typical titles
• User chooses subsets of the clusters and re-clusters the documents within
• Resulting new groups have different “themes”
S/G Example: query on “star” (Encyclopedia text)

14 sports; 8 symbols; 47 film, tv; 68 film, tv (p); 7 music; 97 astrophysics; 67 astronomy (p); 12 stellar phenomena; 10 flora/fauna; 49 galaxies, stars; 29 constellations; 7 miscellaneous
Clustering and re-clustering is entirely automated
Another use of clustering
• Use clustering to map the entire multidimensional document space into a large number of small clusters.
• “Project” these onto a 2D graphical representation:
Clustering Multi-Dimensional Document Space
(image from Wise et al 95)
Concept “Landscapes”

[Figure: concept landscape with regions labeled Pharmacology, Anatomy, Legal, Disease, Hospitals]

(e.g., Lin, Chen, Wise et al.)
• Too many concepts, or too coarse
• Single concept per document
• No titles
• Browsing without search
Clustering
• Advantages:
– See some main themes
• Disadvantage:
– Many ways documents could group together are hidden
• Thinking point: what is the relationship to classification systems and facets?
Today
• Vector Space Ranking
• Probabilistic Models and Ranking
• (lots of math)
tf x idf normalization
• Normalize the term weights (so longer documents are not unfairly given more weight)
– normalize usually means force all values to fall within a certain range, usually between 0 and 1, inclusive.

w_ik = tf_ik · log(N / n_k) / sqrt( Σ_{k=1}^{t} (tf_ik)^2 · [log(N / n_k)]^2 )
Vector space similarity (use the weights to compare the documents)

Now, the similarity of two documents is:

sim(D_i, D_j) = Σ_{k=1}^{t} w_ik · w_jk

This is also called the cosine, or normalized inner product. (Normalization was done when weighting the terms.)
Vector Space Similarity Measure
combine tf x idf into a similarity measure

D_i = (d_i1, d_i2, …, d_it)
Q = (w_q1, w_q2, …, w_qt), where w = 0 if a term is absent

if term weights normalized:

sim(Q, D_i) = Σ_{j=1}^{t} w_qj · w_dij

otherwise normalize in the similarity comparison:

sim(Q, D_i) = Σ_{j=1}^{t} w_qj · w_dij / sqrt( Σ_{j=1}^{t} (w_qj)^2 · Σ_{j=1}^{t} (w_dij)^2 )
To Think About
• How does this ranking algorithm behave?
– Make a set of hypothetical documents consisting of terms and their weights
– Create some hypothetical queries
– How are the documents ranked, depending on the weights of their terms and the queries’ terms?
Computing Similarity Scores
[Figure: two-dimensional term-weight space showing Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7), with cos α1 = 0.74 and cos α2 = 0.98]
Computing a similarity score
Say we have query vector Q = (0.4, 0.8) and document D2 = (0.2, 0.7). What does their similarity comparison yield?

sim(Q, D2) = (0.4 · 0.2 + 0.8 · 0.7) / sqrt( [(0.4)^2 + (0.8)^2] · [(0.2)^2 + (0.7)^2] ) = 0.64 / sqrt(0.42) ≈ 0.98
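The same computation as a small function (names mine), applied to the slide's vectors:

```python
def cosine_sim(q, d):
    """sim(Q,D) = sum(q_j * d_j) / (sqrt(sum q_j^2) * sqrt(sum d_j^2))"""
    dot = sum(qj * dj for qj, dj in zip(q, d))
    norm_q = sum(qj ** 2 for qj in q) ** 0.5
    norm_d = sum(dj ** 2 for dj in d) ** 0.5
    return dot / (norm_q * norm_d)

Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
print(round(cosine_sim(Q, D2), 2))  # 0.98: D2 points in nearly the same direction as Q
print(round(cosine_sim(Q, D1), 2))  # ~0.73
```

D2 ranks above D1 because the cosine measures direction, not magnitude: D2's term weights are in almost the same proportion as the query's.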
Vector Space with Term Weights and Cosine Matching

D_i = (d_i1, w_di1; d_i2, w_di2; …; d_it, w_dit)
Q = (q_i1, w_qi1; q_i2, w_qi2; …; q_it, w_qit)

sim(Q, D_i) = Σ_{j=1}^{t} w_qj · w_dij / ( sqrt( Σ_{j=1}^{t} (w_qj)^2 ) · sqrt( Σ_{j=1}^{t} (w_dij)^2 ) )

[Figure: Term A / Term B plot of Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7), with angles α1 and α2]

Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)

sim(Q, D2) = (0.4 · 0.2 + 0.8 · 0.7) / sqrt( [(0.4)^2 + (0.8)^2] · [(0.2)^2 + (0.7)^2] ) = 0.64 / sqrt(0.42) ≈ 0.98

sim(Q, D1) = 0.56 / sqrt(0.58) ≈ 0.74
Weighting schemes
• We have seen something of
– Binary
– Raw term weights
– TF*IDF
• There are many other possibilities
– IDF alone
– Normalized term frequency
Term Weights in SMART
• SMART is an experimental IR system developed by Gerard Salton (and continued by Chris Buckley) at Cornell.
• Designed for laboratory experiments in IR
– Easy to mix and match different weighting methods
– Really terrible user interface
– Intended for use by code hackers (and even they have trouble using it)
Term Weights in SMART
• In SMART, weights are decomposed into three factors:

w_kd = freq_kd · collect_k / norm
SMART Freq Components
freq_kd =
– Binary: freq_kd ∈ {0, 1}
– maxnorm: freq_kd / max(freq_d)
– augmented: 1/2 + (1/2) · freq_kd / max(freq_d)
– log: ln(freq_kd) + 1
Collection Weighting in SMART
collect_k =
– Inverse: log(NDoc / Doc_k)
– squared: [log(NDoc / Doc_k)]^2
– probabilistic: log((NDoc - Doc_k) / Doc_k)
– frequency: 1 / Doc_k

(NDoc = number of documents in the collection; Doc_k = number of documents containing term k)
Term Normalization in SMART
norm =
– sum: Σ_j w_vector,j
– cosine: sqrt( Σ_j (w_vector,j)^2 )
– fourth: ( Σ_j (w_vector,j)^4 )^(1/4)
– max: max_j (w_vector,j)
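One way to see how the three factors combine: a sketch (function names mine, and the particular scheme chosen — augmented frequency, Inverse collection weight, cosine normalization — is just one of SMART's many mix-and-match combinations):

```python
import math

def augmented_tf(freq_kd, max_freq_d):
    # "augmented" frequency component: 1/2 + (1/2) * freq / max(freq)
    return 0.5 + 0.5 * freq_kd / max_freq_d

def inverse_collect(n_docs, doc_k):
    # "Inverse" collection component: log(NDoc / Doc_k)
    return math.log(n_docs / doc_k)

def smart_weights(freqs, doc_counts, n_docs):
    """freqs: term -> raw frequency in this document;
    doc_counts: term -> number of documents containing the term."""
    max_freq = max(freqs.values())
    raw = {t: augmented_tf(f, max_freq) * inverse_collect(n_docs, doc_counts[t])
           for t, f in freqs.items()}
    # cosine normalization: divide by sqrt of sum of squared weights
    norm = math.sqrt(sum(w ** 2 for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

w = smart_weights({"star": 3, "galaxy": 1}, {"star": 50, "galaxy": 5}, n_docs=1000)
print(w)  # the rarer term "galaxy" ends up weighted higher despite lower tf
```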
Problems with Vector Space
• There is no real theoretical basis for the assumption of a term space
– it is more for visualization than having any real basis
– most similarity measures work about the same regardless of model
• Terms are not really orthogonal dimensions
– Terms are not independent of all other terms
Probabilistic Models
• Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query
• Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)
• Rely on accurate estimates of probabilities
Probability Ranking Principle
• If a reference retrieval system’s response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.
Stephen E. Robertson, J. Documentation 1977
Model 1 – Maron and Kuhns
• Concerned with estimating probabilities of relevance at the point of indexing:– If a patron came with a request using term ti,
what is the probability that she/he would be satisfied with document Dj ?
Bayes Formula
• Bayesian statistical inference used in both models…

P(B, C | A) = P(C, B | A)
P(B | A) · P(C | A, B) = P(C | A) · P(B | A, C)

therefore:

P(B | A, C) = P(B | A) · P(C | A, B) / P(C | A)
Model 1 Bayes
• A is the class of events of using the library
• D_i is the class of events of Document i being judged relevant
• I_j is the class of queries consisting of the single term I_j
• P(D_i | A, I_j) = probability that if a query is submitted to the system then a relevant document is retrieved

P(D_i | A, I_j) = P(D_i | A) · P(I_j | A, D_i) / P(I_j | A)
Model 2 – Robertson & Sparck Jones
Given a term t and a query q:

                        Document Relevance
                        +          -          Total
Document      +         r          n-r        n
indexing      -         R-r        N-n-R+r    N-n
              Total     R          N-R        N
Robertson-Sparck Jones Weights
• Retrospective formulation --

w = log [ (r / (R - r)) / ((n - r) / (N - n - R + r)) ]
Robertson-Sparck Jones Weights
• Predictive formulation --

w^(1) = log [ ((r + 0.5) / (R - r + 0.5)) / ((n - r + 0.5) / (N - n - R + r + 0.5)) ]
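The predictive weight transcribes directly from the contingency table (the 0.5 corrections keep the estimate defined when any cell is zero); the example values below are invented for illustration:

```python
import math

def rsj_weight(r, R, n, N):
    """Predictive Robertson-Sparck Jones weight w(1).
    r: relevant docs containing the term, R: relevant docs,
    n: docs containing the term, N: docs in the collection."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

# A term in 8 of 10 relevant docs but only 20 of 1000 docs overall
# gets a strongly positive weight:
print(round(rsj_weight(r=8, R=10, n=20, N=1000), 3))
```

A term distributed the same way inside and outside the relevant set gets a weight near zero, i.e. it carries no evidence about relevance.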
Model 1
• A patron submits a query (call it Q) consisting of some specification of her/his information need. Different patrons submitting the same stated query may differ as to whether or not they judge a specific document to be relevant. The function of the retrieval system is to compute for each individual document the probability that it will be judged relevant by a patron who has submitted query Q.
Robertson, Maron & Cooper, 1982
Model 2
• Documents have many different properties; some documents
have all the properties that the patron asked for, and other documents have only some or none of the properties. If the inquiring patron were to examine all of the documents in the collection she/he might find that some having all the sought after properties were relevant, but others (with the same properties) were not relevant. And conversely, he/she might find that some of the documents having none (or only a few) of the sought after properties were relevant, others not. The function of a document retrieval system is to compute the probability that a document is relevant, given that it has one (or a set) of specified properties.
Robertson, Maron & Cooper, 1982
Probabilistic Models: Some Unifying Notation
• D = All present and future documents
• Q = All present and future queries
• (D_i, Q_j) = A document-query pair
• x = class of similar documents, x ⊆ D
• y = class of similar queries, y ⊆ Q
• Relevance (R) is a relation:

R = {(D_i, Q_j) | D_i ∈ D, Q_j ∈ Q, document D_i is judged relevant by the user submitting Q_j}
Probabilistic Models
• Model 1 -- Probabilistic Indexing, P(R|y,Di)
• Model 2 -- Probabilistic Querying, P(R|Qj,x)
• Model 3 -- Merged Model, P(R|Q_j, D_i)
• Model 0 -- P(R|y, x)
• Probabilities are estimated based on prior usage or relevance estimation
Probabilistic Models

[Figure: document space D and query space Q, with document class x containing D_i and query class y containing Q_j]
Logistic Regression
• Based on work by William Cooper, Fred Gey and Daniel Dabney.
• Builds a regression model for relevance prediction based on a set of training data
• Uses less restrictive independence assumptions than Model 2
– Linked Dependence
Probabilistic Models: Logistic Regression
• Estimates for relevance based on log-linear model with various statistical measures of document content as independent variables.

Log odds of relevance is a linear function of attributes:

log O(R | q_i, d_j) = c_0 + c_1·v_1 + c_2·v_2 + … + c_n·v_n

Term contributions summed:

log O(R | q_i, d_j) = Σ_{k=1}^{m} [ log O(R | q_i, d_j, t_k) - log O(R) ]

Probability of Relevance is inverse of log odds:

P(R | q_i, d_j) = 1 / (1 + e^(-log O(R | q_i, d_j)))
Logistic Regression
[Figure: relevance (0-100) plotted against term frequency in document (0-60)]
Probabilistic Models: Logistic Regression attributes
X_1 = (1/M) · Σ log QAF_t   (Average Absolute Query Frequency)
X_2 = sqrt(QL)              (Query Length)
X_3 = (1/M) · Σ log DAF_t   (Average Absolute Document Frequency)
X_4 = sqrt(DL)              (Document Length)
X_5 = (1/M) · Σ log IDF_t   (Average Inverse Document Frequency), where IDF_t = N / n_t (Inverse Document Frequency)
X_6 = log M                 (Number of Terms in common between query and document -- logged)

The sums run over the M terms shared by the query and the document.
Probabilistic Models: Logistic Regression
Probability of relevance is based on logistic regression from a sample set of documents to determine values of the coefficients. At retrieval the probability estimate is obtained from:

log O(R | Q, D) = c_0 + Σ_{i=1}^{6} c_i · X_i

for the 6 X attribute measures shown previously.
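A sketch of the retrieval-time computation; the coefficient values here are invented placeholders (in a real system such as Cheshire II they come from fitting the regression to training data with relevance judgments):

```python
import math

# Hypothetical coefficients c0..c6 (would be estimated from training data)
coeffs = [-3.5, 0.4, -0.2, 0.3, -0.1, 0.5, 0.6]

def relevance_probability(x, c=coeffs):
    """P(R|Q,D) = 1 / (1 + exp(-(c0 + sum(ci * Xi)))) over the six attributes."""
    log_odds = c[0] + sum(ci * xi for ci, xi in zip(c[1:], x))
    return 1 / (1 + math.exp(-log_odds))

x = [1.2, 2.0, 1.5, 9.0, 3.1, 1.6]  # example values of X1..X6
print(round(relevance_probability(x), 3))
```

Because the logistic function is monotonic in the log odds, ranking by probability and ranking by the linear score give the same document order.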
Probabilistic Models
Advantages:
• Strong theoretical basis
• In principle should supply the best predictions of relevance given available information
• Can be implemented similarly to Vector

Disadvantages:
• Relevance information is required -- or is “guestimated”
• Important indicators of relevance may not be term -- though terms only are usually used
• Optimally requires on-going collection of relevance information
Vector and Probabilistic Models
• Support “natural language” queries
• Treat documents and queries the same
• Support relevance feedback searching
• Support ranked retrieval
• Differ primarily in theoretical basis and in how the ranking is calculated
– Vector assumes relevance
– Probabilistic relies on relevance judgments or estimates
Current use of Probabilistic Models
• Virtually all the major systems in TREC now use the “Okapi BM25 formula” which incorporates the Robertson-Sparck Jones weights…
w^(1) = log [ ((r + 0.5) / (R - r + 0.5)) / ((n - r + 0.5) / (N - n - R + r + 0.5)) ]
Okapi BM25
Σ_{T ∈ Q} w^(1) · ( (k_1 + 1) · tf / (K + tf) ) · ( (k_3 + 1) · qtf / (k_3 + qtf) )

• Where:
– Q is a query containing terms T
– K is k_1 · ((1 - b) + b · dl / avdl)
– k_1, b and k_3 are parameters, usually 1.2, 0.75 and 7-1000
– tf is the frequency of the term in a specific document
– qtf is the frequency of the term in a topic from which Q was derived
– dl and avdl are the document length and the average document length measured in some convenient unit
– w^(1) is the Robertson-Sparck Jones weight
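A sketch of the BM25 score as defined above (function names and the example w(1), tf, and length values are invented for illustration):

```python
def bm25_term(w1, tf, qtf, dl, avdl, k1=1.2, b=0.75, k3=7.0):
    """One term's contribution: w(1) * (k1+1)tf/(K+tf) * (k3+1)qtf/(k3+qtf),
    where K = k1 * ((1-b) + b*dl/avdl)."""
    K = k1 * ((1 - b) + b * dl / avdl)
    return w1 * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))

def bm25_score(query_terms, doc_tf, dl, avdl, weights, qtf):
    """Sum the contributions over the terms T in query Q."""
    return sum(bm25_term(weights[t], doc_tf.get(t, 0), qtf[t], dl, avdl)
               for t in query_terms)

score = bm25_score(["star"], {"star": 3}, dl=120, avdl=100,
                   weights={"star": 2.1}, qtf={"star": 1})
print(round(score, 3))
```

The tf factor saturates as tf grows, and K penalizes documents longer than average, which is what distinguishes BM25 from a plain tf x idf sum.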
Logistic Regression and Cheshire II
• The Cheshire II system uses Logistic Regression equations estimated from TREC full-text data.
• Demo (?)