
Vector and Probabilistic Ranking


Page 1: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Vector and Probabilistic Ranking

Ray Larson & Marti Hearst

University of California, Berkeley

School of Information Management and Systems

SIMS 202: Information Organization and Retrieval

Page 2: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Review

• Inverted files

• The Vector Space Model

• Term weighting

Page 3: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Inverted indexes

• Permit fast search for individual terms
• For each term, you get a list consisting of:
  – document ID
  – frequency of term in doc (optional)
  – position of term in doc (optional)
• These lists can be used to solve Boolean queries (see the sketch after this list):
  – country -> d1, d2
  – manor -> d2
  – country AND manor -> d2
• Also used for statistical ranking algorithms
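A minimal sketch (in Python, not from the slides) of how postings lists answer a Boolean AND query; the toy documents and the helper name build_inverted_index are assumptions for illustration:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    "d1": "now is the time for all good men to come to the aid of their country",
    "d2": "it was a dark and stormy night in the country manor",
}
index = build_inverted_index(docs)

# Boolean AND = intersection of the two posting sets
print(index["country"] & index["manor"])   # {'d2'}
```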

Page 4: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

How Inverted Files are Used

Dictionary (term, number of docs, total frequency), where each entry points to a postings list of (doc #, freq) pairs:

Term       N docs   Tot Freq
a          1        1
aid        1        1
all        1        1
and        1        1
come       1        1
country    2        2
dark       1        1
for        1        1
good       1        1
in         1        1
is         1        1
it         1        1
manor      1        1
men        1        1
midnight   1        1
night      1        1
now        1        1
of         1        1
past       1        1
stormy     1        1
the        2        4
their      1        1
time       2        2
to         1        2
was        1        2

Query on "time" AND "dark"

• 2 docs with "time" in dictionary -> IDs 1 and 2 from the postings file
• 1 doc with "dark" in dictionary -> ID 2 from the postings file
• Therefore, only doc 2 satisfied the query.

Page 5: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Vector Space Model

• Documents are represented as vectors in term space
  – Terms are usually stems
  – Documents represented by binary vectors of terms
• Queries represented the same as documents
• Query and Document weights are based on length and direction of their vector
• A vector distance measure between the query and documents is used to rank retrieved documents

Page 6: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Documents in 3D Space

Assumption: Documents that are “close together” in space are similar in meaning.

Page 7: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Vector Space Documents and Queries

docs   t1   t2   t3   RSV = Q.Di
D1     1    0    1    4
D2     1    0    0    1
D3     0    1    1    5
D4     1    0    0    1
D5     1    1    1    6
D6     1    1    0    3
D7     0    1    0    2
D8     0    1    0    2
D9     0    0    1    3
D10    0    1    1    5
D11    1    0    1    4

Q      1    2    3    (query weights q1, q2, q3)

(Figure: documents D1-D11 plotted as Boolean term combinations in the space of terms t1, t2, t3.)

Q is a query – also represented as a vector
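A minimal sketch (not from the slides) of the retrieval status value RSV = Q.Di as an inner product over the binary document vectors in the table above; it reproduces the table's last column:

```python
docs = {
    "D1": (1, 0, 1), "D2": (1, 0, 0), "D3": (0, 1, 1), "D4": (1, 0, 0),
    "D5": (1, 1, 1), "D6": (1, 1, 0), "D7": (0, 1, 0), "D8": (0, 1, 0),
    "D9": (0, 0, 1), "D10": (0, 1, 1), "D11": (1, 0, 1),
}
Q = (1, 2, 3)  # query term weights q1, q2, q3

# RSV(Q, Di) = inner product of the query and document vectors
for name, d in docs.items():
    rsv = sum(q * w for q, w in zip(Q, d))
    print(name, rsv)   # D1 4, D2 1, D3 5, ...
```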

Page 8: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Documents in Vector Space

(Figure: the eleven documents D1-D11 positioned in the three-dimensional vector space spanned by t1, t2, t3.)

Page 9: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Assigning Weights to Terms

• Binary Weights
• Raw term frequency
• tf x idf
  – Recall the Zipf distribution
  – Want to weight terms highly if they are
    • frequent in relevant documents ... BUT
    • infrequent in the collection as a whole

• Automatically derived thesaurus terms

Page 10: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Binary Weights

• Only the presence (1) or absence (0) of a term is included in the vector

docs   t1   t2   t3
D1     1    0    1
D2     1    0    0
D3     0    1    1
D4     1    0    0
D5     1    1    1
D6     1    1    0
D7     0    1    0
D8     0    1    0
D9     0    0    1
D10    0    1    1
D11    1    0    1

Page 11: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Raw Term Weights

• The frequency of occurrence for the term in each document is included in the vector

docs   t1   t2   t3
D1     2    0    3
D2     1    0    0
D3     0    4    7
D4     3    0    0
D5     1    6    3
D6     3    5    0
D7     0    8    0
D8     0    10   0
D9     0    0    1
D10    0    3    5
D11    4    0    1

Page 12: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Assigning Weights

• tf x idf measure:
  – term frequency (tf)
  – inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution

• Goal: assign a tf * idf weight to each term in each document

Page 13: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

tf x idf

w_ik = tf_ik * log(N / n_k)

where:
  T_k   = term k in document D_i
  tf_ik = frequency of term T_k in document D_i
  idf_k = inverse document frequency of term T_k in collection C
  N     = total number of documents in collection C
  n_k   = number of documents in C that contain T_k

  idf_k = log(N / n_k)
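A minimal sketch (not from the slides) of the w_ik = tf_ik * log(N/n_k) weight, using the base-10 log as in the worked idf values on the next slide; the toy counts are assumptions:

```python
import math

def tf_idf(tf_ik, n_k, N):
    """w_ik = tf_ik * log10(N / n_k)."""
    return tf_ik * math.log10(N / n_k)

# a term occurring 3 times in the document and in 20 of 10,000 documents overall
print(tf_idf(tf_ik=3, n_k=20, N=10_000))   # 3 * 2.698... ≈ 8.10
```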

Page 14: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Inverse Document Frequency

• IDF provides high values for rare words and low values for common words

For a collection of 10,000 documents:

  log(10000 / 1)     = 4
  log(10000 / 20)    = 2.698
  log(10000 / 5000)  = 0.301
  log(10000 / 10000) = 0

Page 15: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Similarity Measures

Simple matching (coordination level match):  |Q ∩ D|

Dice's Coefficient:     2 |Q ∩ D| / (|Q| + |D|)

Jaccard's Coefficient:  |Q ∩ D| / |Q ∪ D|

Cosine Coefficient:     |Q ∩ D| / (|Q|^(1/2) * |D|^(1/2))

Overlap Coefficient:    |Q ∩ D| / min(|Q|, |D|)
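A minimal sketch (not from the slides) of these set-based coefficients applied to binary term sets; the function name and the toy sets are assumptions:

```python
import math

def similarities(Q, D):
    """Set-based similarity coefficients for binary (set-of-terms) representations."""
    common = len(Q & D)
    return {
        "simple matching": common,
        "dice":    2 * common / (len(Q) + len(D)),
        "jaccard": common / len(Q | D),
        "cosine":  common / math.sqrt(len(Q) * len(D)),
        "overlap": common / min(len(Q), len(D)),
    }

print(similarities({"country", "manor"}, {"country", "manor", "time"}))
```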

Page 16: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Vector Space Visualization

Page 17: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Text Clustering

Clustering is "The art of finding groups in data." -- Kaufmann and Rousseeuw

(Figure: document clusters plotted against Term 1 and Term 2.)

Page 18: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

K-Means Clustering

• 1 Create a pair-wise similarity measure
• 2 Find K centers using agglomerative clustering
  – take a small sample
  – group bottom up until K groups found
• 3 Assign each document to nearest center, forming new clusters
• 4 Repeat 3 as necessary (see the sketch after this list)
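A minimal sketch (not from the slides) of the assign/re-center loop in steps 3-4, assuming the K centers have already been seeded (e.g. by the agglomerative pass of step 2); the vectors and function names are illustrative:

```python
import math

def nearest(centers, vec):
    """Index of the closest center by Euclidean distance."""
    return min(range(len(centers)), key=lambda i: math.dist(centers[i], vec))

def kmeans_passes(docs, centers, iterations=10):
    """Repeat: assign each document to its nearest center, then recompute the centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for vec in docs:
            clusters[nearest(centers, vec)].append(vec)
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
    return centers, clusters

docs = [(0.1, 0.9), (0.2, 0.8), (0.9, 0.1), (0.8, 0.2)]
centers, clusters = kmeans_passes(docs, centers=[(0.0, 1.0), (1.0, 0.0)])
print(centers)   # two centers near (0.15, 0.85) and (0.85, 0.15)
```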

Page 19: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Scatter/Gather

Cutting, Pedersen, Tukey & Karger 92, 93; Hearst & Pedersen 95

• Cluster sets of documents into general “themes”, like a table of contents

• Display the contents of the clusters by showing topical terms and typical titles

• User chooses subsets of the clusters and re-clusters the documents within

• Resulting new groups have different “themes”

Page 20: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

S/G Example: query on "star" -- Encyclopedia text

  14 sports
   8 symbols
  47 film, tv
  68 film, tv (p)
   7 music
  97 astrophysics
  67 astronomy (p)
  12 stellar phenomena
  10 flora/fauna
  49 galaxies, stars
  29 constellations
   7 miscellaneous

Clustering and re-clustering is entirely automated

Page 21: Vector and Probabilistic Ranking
Page 22: Vector and Probabilistic Ranking
Page 23: Vector and Probabilistic Ranking
Page 24: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Another use of clustering

• Use clustering to map the entire huge multidimensional document space into a huge number of small clusters.

• “Project” these onto a 2D graphical representation:

Page 25: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Clustering Multi-Dimensional Document Space

(image from Wise et al 95)

Page 26: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Clustering Multi-Dimensional Document Space

(image from Wise et al 95)

Page 27: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Concept “Landscapes”

(Figure: a concept "landscape" with regions labeled Pharmacology, Anatomy, Legal, Disease, Hospitals.)

(e.g., Lin, Chen, Wise et al.)
• Too many concepts, or too coarse
• Single concept per document
• No titles
• Browsing without search

Page 28: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Clustering

• Advantages:
  – See some main themes
• Disadvantage:
  – Many ways documents could group together are hidden

• Thinking point: what is the relationship to classification systems and facets?

Page 29: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Today

• Vector Space Ranking

• Probabilistic Models and Ranking

• (lots of math)

Page 30: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

tf x idf normalization

• Normalize the term weights (so longer documents are not unfairly given more weight)
  – normalize usually means force all values to fall within a certain range, usually between 0 and 1, inclusive

w_ik = [ tf_ik * log(N / n_k) ] / sqrt( Σ (k = 1..t) (tf_ik)^2 * [log(N / n_k)]^2 )
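A minimal sketch (not from the slides) of this cosine-style normalization applied to one document's weight vector; the toy term statistics are assumptions:

```python
import math

def normalized_tfidf(tfs, dfs, N):
    """tf*idf weights for one document, divided by the vector's Euclidean length."""
    raw = [tf * math.log10(N / df) for tf, df in zip(tfs, dfs)]
    length = math.sqrt(sum(w * w for w in raw))
    return [w / length for w in raw]

# three terms: their frequencies in this document and their document frequencies
weights = normalized_tfidf(tfs=[2, 0, 3], dfs=[20, 5000, 1], N=10_000)
print(weights, sum(w * w for w in weights))   # squared weights sum to 1.0
```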

Page 31: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Vector space similarity (use the weights to compare the documents)

Now, the similarity of two documents is:

sim(D_i, D_j) = Σ (k = 1..t) w_ik * w_jk

This is also called the cosine, or normalized inner product.
(Normalization was done when weighting the terms.)

Page 32: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Vector Space Similarity Measure (combine tf x idf into a similarity measure)

D_i = (d_i1, d_i2, ..., d_it)
Q   = (w_q1, w_q2, ..., w_qt)     (w = 0 if a term is absent)

If term weights are normalized:

sim(Q, D_i) = Σ (j = 1..t) w_qj * d_ij

otherwise normalize in the similarity comparison:

sim(Q, D_i) = [ Σ (j = 1..t) w_qj * d_ij ] / [ sqrt( Σ (j = 1..t) (w_qj)^2 ) * sqrt( Σ (j = 1..t) (d_ij)^2 ) ]

Page 33: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

To Think About

• How does this ranking algorithm behave?
  – Make a set of hypothetical documents consisting of terms and their weights
  – Create some hypothetical queries
  – How are the documents ranked, depending on the weights of their terms and the queries' terms?

Page 34: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Computing Similarity Scores

(Figure: query Q = (0.4, 0.8) and documents D1 = (0.8, 0.3), D2 = (0.2, 0.7) plotted in a two-term space, with cos α1 = 0.74 and cos α2 = 0.98.)

Page 35: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Computing a similarity score

Say we have query vector Q = (0.4, 0.8)
Also, document D2 = (0.2, 0.7)
What does their similarity comparison yield?

sim(Q, D2) = (0.4 * 0.2 + 0.8 * 0.7) / sqrt( [(0.4)^2 + (0.8)^2] * [(0.2)^2 + (0.7)^2] )
           = 0.64 / sqrt(0.42)
           = 0.98

Page 36: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Vector Space with Term Weights and Cosine Matching

(Figure: Q, D1 and D2 plotted against Term A and Term B, with angles α1 and α2 between the query and each document.)

Di = (d_i1, w_di1; d_i2, w_di2; ...; d_it, w_dit)
Q  = (q_i1, w_qi1; q_i2, w_qi2; ...; q_it, w_qit)

sim(Q, D_i) = [ Σ (j = 1..t) w_qj * w_dij ] / [ sqrt( Σ (j = 1..t) (w_qj)^2 ) * sqrt( Σ (j = 1..t) (w_dij)^2 ) ]

Q  = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

sim(Q, D2) = (0.4 * 0.2 + 0.8 * 0.7) / sqrt( [(0.4)^2 + (0.8)^2] * [(0.2)^2 + (0.7)^2] )
           = 0.64 / sqrt(0.42) ≈ 0.98

sim(Q, D1) = 0.56 / sqrt(0.58) ≈ 0.74
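A minimal sketch (not from the slides) of the cosine match that reproduces the two scores above:

```python
import math

def cosine(q, d):
    """Cosine of the angle between a query vector and a document vector."""
    dot = sum(qj * dj for qj, dj in zip(q, d))
    return dot / (math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d)))

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(cosine(Q, D1), cosine(Q, D2))   # ≈ 0.73 and 0.98 (the slide rounds the first to 0.74)
```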

Page 37: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Weighting schemes

• We have seen something of
  – Binary
  – Raw term weights
  – TF*IDF
• There are many other possibilities
  – IDF alone
  – Normalized term frequency

Page 38: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Term Weights in SMART

• SMART is an experimental IR system developed by Gerard Salton (and continued by Chris Buckley) at Cornell.

• Designed for laboratory experiments in IR
  – Easy to mix and match different weighting methods
  – Really terrible user interface
  – Intended for use by code hackers (and even they have trouble using it)

Page 39: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Term Weights in SMART

• In SMART weights are decomposed into three factors:

w_kd = (freq_kd * collect_k) / norm

(freq, collect and norm are each chosen from the options on the following slides.)

Page 40: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

SMART Freq Components

freq_kd options:

  Binary:     freq_kd ∈ {0, 1}
  maxnorm:    freq_kd / max(freq_kd)
  augmented:  0.5 + 0.5 * freq_kd / max(freq_kd)
  log:        ln(freq_kd) + 1

Page 41: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Collection Weighting in SMART

collect_k options:

  Inverse:        log(NDoc / Doc_k)
  squared:        [log(NDoc / Doc_k)]^2
  probabilistic:  log((NDoc - Doc_k) / Doc_k)
  frequency:      1

(NDoc = number of documents in the collection; Doc_k = number of documents containing term k.)

Page 42: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Term Normalization in SMART

norm options:

  sum:     Σ (j over vector) w_j
  cosine:  sqrt( Σ (j over vector) (w_j)^2 )
  fourth:  Σ (j over vector) (w_j)^4
  max:     max (j over vector) w_j
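A minimal sketch (not from the slides) that composes one choice from each factor (augmented freq, inverse collection weight, cosine normalization); the function and argument names are assumptions:

```python
import math

def smart_weights(freqs, doc_counts, n_docs):
    """w_kd = freq component * collection component, then cosine-normalized."""
    max_f = max(freqs)
    tf = [0.5 + 0.5 * f / max_f for f in freqs]              # "augmented" freq component
    collect = [math.log(n_docs / dk) for dk in doc_counts]   # "inverse" collection component
    raw = [t * c for t, c in zip(tf, collect)]
    norm = math.sqrt(sum(w * w for w in raw))                # "cosine" normalization
    return [w / norm for w in raw]

print(smart_weights(freqs=[2, 1, 3], doc_counts=[20, 5000, 1], n_docs=10_000))
```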

Page 43: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Problems with Vector Space

• There is no real theoretical basis for the assumption of a term space
  – it is more for visualization than having any real basis
  – most similarity measures work about the same regardless of model
• Terms are not really orthogonal dimensions
  – Terms are not independent of all other terms

Page 44: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Probabilistic Models

• Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query

• Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)

• Rely on accurate estimates of probabilities

Page 45: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Probability Ranking Principle

• If a reference retrieval system’s response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.

Stephen E. Robertson, J. Documentation 1977

Page 46: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Model 1 – Maron and Kuhns

• Concerned with estimating probabilities of relevance at the point of indexing:
  – If a patron came with a request using term t_i, what is the probability that she/he would be satisfied with document D_j?

Page 47: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Bayes Formula

• Bayesian statistical inference used in both models…

P(B, C | A) = P(C, B | A)

P(B | A) * P(C | A, B) = P(C | A) * P(B | A, C)

P(B | A, C) = P(B | A) * P(C | A, B) / P(C | A)

Page 48: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Model 1 Bayes

• A is the class of events of using the library
• D_i is the class of events of Document i being judged relevant
• I_j is the class of queries consisting of the single term I_j
• P(D_i | A, I_j) = probability that if a query is submitted to the system then a relevant document is retrieved

P(D_i | A, I_j) = P(D_i | A) * P(I_j | A, D_i) / P(I_j | A)

Page 49: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Model 2 – Robertson & Sparck Jones

Given a term t and a query q:

                          Document Relevance
                          +            -
Document      +           r            n - r            n
indexing      -           R - r        N - n - R + r    N - n
                          R            N - R            N

Page 50: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Robertson-Sparck Jones Weights

• Retrospective formulation:

w = log [ (r / (R - r)) / ((n - r) / (N - n - R + r)) ]

Page 51: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Robertson-Sparck Jones Weights

• Predictive formulation:

w^(1) = log [ ((r + 0.5) / (R - r + 0.5)) / ((n - r + 0.5) / (N - n - R + r + 0.5)) ]
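A minimal sketch (not from the slides) of the predictive weight with its 0.5 smoothing; the toy counts are assumptions:

```python
import math

def rsj_weight(r, n, R, N):
    """Robertson-Sparck Jones predictive relevance weight w(1)."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

# term occurs in 3 of 10 known-relevant docs and in 50 of 1,000 docs overall
print(rsj_weight(r=3, n=50, R=10, N=1000))
```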

Page 52: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Model 1

• A patron submits a query (call it Q) consisting of some specification of her/his information need. Different patrons submitting the same stated query may differ as to whether or not they judge a specific document to be relevant. The function of the retrieval system is to compute for each individual document the probability that it will be judged relevant by a patron who has submitted query Q.

Robertson, Maron & Cooper, 1982

Page 53: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Model 2

• Documents have many different properties; some documents have all the properties that the patron asked for, and other documents have only some or none of the properties. If the inquiring patron were to examine all of the documents in the collection she/he might find that some having all the sought after properties were relevant, but others (with the same properties) were not relevant. And conversely, he/she might find that some of the documents having none (or only a few) of the sought after properties were relevant, others not. The function of a document retrieval system is to compute the probability that a document is relevant, given that it has one (or a set) of specified properties.

Robertson, Maron & Cooper, 1982

Page 54: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Probabilistic Models: Some Unifying Notation

• D = All present and future documents

• Q = All present and future queries

• (Di,Qj) = A document query pair

• x = class of similar documents, x ⊆ D

• y = class of similar queries, y ⊆ Q

• Relevance (R) is a relation:

R = { (D_i, Q_j) | D_i ∈ D, Q_j ∈ Q, document D_i is judged relevant by the user submitting Q_j }

Page 55: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Probabilistic Models

• Model 1 -- Probabilistic Indexing, P(R|y,Di)

• Model 2 -- Probabilistic Querying, P(R|Qj,x)

• Model 3 -- Merged Model, P(R | Q_j, D_i)
• Model 0 -- P(R | y, x)
• Probabilities are estimated based on prior usage or relevance estimation

Page 56: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Probabilistic Models

(Figure: the document space D and the query space Q, with a class x of similar documents containing D_i and a class y of similar queries containing Q_j.)

Page 57: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Logistic Regression

• Based on work by William Cooper, Fred Gey and Daniel Dabney.

• Builds a regression model for relevance prediction based on a set of training data

• Uses less restrictive independence assumptions than Model 2
  – Linked Dependence

Page 58: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Probabilistic Models: Logistic Regression

• Estimates for relevance based on log-linear model with various statistical measures of document content as independent variables.

Log odds of relevance is a linear function of attributes:

log O(R | q_i, d_j, t_k) = c_0 + c_1 v_1 + c_2 v_2 + ... + c_n v_n

Term contributions summed:

log O(R | q_i, d_j) = Σ (k = 1..m) [ log O(R | q_i, d_j, t_k) - log O(R) ]

Probability of Relevance is inverse of log odds:

P(R | q_i, d_j) = 1 / (1 + e^(-log O(R | q_i, d_j)))

Page 59: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Logistic Regression

(Figure: estimated relevance plotted against term frequency in the document, for frequencies from 0 to 60.)

Page 60: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Probabilistic Models: Logistic Regression attributes

X_1 = (1/M) Σ (j = 1..M) log QAF_tj     Average Absolute Query Frequency
X_2 = QL                                Query Length
X_3 = (1/M) Σ (j = 1..M) log DAF_tj     Average Absolute Document Frequency
X_4 = DL                                Document Length
X_5 = (1/M) Σ (j = 1..M) log IDF_tj     Average Inverse Document Frequency
      IDF_t = log(N / n_t)              Inverse Document Frequency
X_6 = log M                             Number of Terms in common between query and document -- logged

Page 61: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Probabilistic Models: Logistic Regression

Probability of relevance is based on logistic regression from a sample set of documents to determine values of the coefficients. At retrieval the probability estimate is obtained by:

P(R | Q, D) = c_0 + Σ (i = 1..6) c_i X_i

For the 6 X attribute measures shown previously
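A minimal sketch (not from the slides) of turning the fitted linear combination into a relevance probability via the logistic transform from the earlier log-odds slide; the attribute values and coefficients are made-up placeholders:

```python
import math

def relevance_probability(x, coeffs, c0):
    """Logistic transform of c0 + sum(ci * Xi) into P(R | Q, D)."""
    log_odds = c0 + sum(c * xi for c, xi in zip(coeffs, x))
    return 1.0 / (1.0 + math.exp(-log_odds))

X = [1.2, 8.0, 0.9, 120.0, 2.3, 1.1]          # the six attribute values X1..X6
C = [0.4, -0.01, 0.3, -0.002, 0.5, 0.7]       # made-up coefficients c1..c6
print(relevance_probability(X, C, c0=-3.5))   # a value between 0 and 1
```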

Page 62: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Probabilistic Models

Advantages:
• Strong theoretical basis
• In principle should supply the best predictions of relevance given available information
• Can be implemented similarly to Vector

Disadvantages:
• Relevance information is required -- or is "guestimated"
• Important indicators of relevance may not be terms -- though terms only are usually used
• Optimally requires on-going collection of relevance information

Page 63: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Vector and Probabilistic Models

• Support "natural language" queries
• Treat documents and queries the same
• Support relevance feedback searching
• Support ranked retrieval
• Differ primarily in theoretical basis and in how the ranking is calculated
  – Vector assumes relevance

– Probabilistic relies on relevance judgments or estimates

Page 64: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Current use of Probabilistic Models

• Virtually all the major systems in TREC now use the “Okapi BM25 formula” which incorporates the Robertson-Sparck Jones weights…

w^(1) = log [ ((r + 0.5) / (R - r + 0.5)) / ((n - r + 0.5) / (N - n - R + r + 0.5)) ]

Page 65: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Okapi BM25

• Where:
  – Q is a query containing terms T
  – K is k1((1-b) + b*dl/avdl)
  – k1, b and k3 are parameters, usually 1.2, 0.75 and 7-1000
  – tf is the frequency of the term in a specific document
  – qtf is the frequency of the term in a topic from which Q was derived
  – dl and avdl are the document length and the average document length measured in some convenient unit
  – w(1) is the Robertson-Sparck Jones weight

Σ (T in Q) w^(1) * [ (k1 + 1) tf / (K + tf) ] * [ (k3 + 1) qtf / (k3 + qtf) ]
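A minimal sketch (not from the slides) of scoring one document against a query with this formula; with no relevance information (r = R = 0) the w(1) weight reduces to an idf-like term, and the counts and parameter values below are illustrative:

```python
import math

def bm25_score(query_tf, doc_tf, dl, avdl, doc_freq, N, k1=1.2, b=0.75, k3=7.0):
    """Okapi BM25: sum over query terms of w(1) * tf part * qtf part."""
    K = k1 * ((1 - b) + b * dl / avdl)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        n = doc_freq[term]
        w1 = math.log((N - n + 0.5) / (n + 0.5))   # w(1) with r = R = 0
        score += w1 * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score

print(bm25_score(query_tf={"country": 1, "manor": 1},
                 doc_tf={"country": 2, "manor": 1},
                 dl=12, avdl=14,
                 doc_freq={"country": 120, "manor": 3}, N=10_000))
```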

Page 66: Vector and Probabilistic Ranking

9/19/2000 Information Organization and Retrieval

Logistic Regression and Cheshire II

• The Cheshire II system uses Logistic Regression equations estimated from TREC full-text data.

• Demo (?)