Term Weighting in Information Retrieval
Polettini Nicola, Monday, June 5, 2006
Web Information Retrieval


Page 1

Term Weighting in Information Retrieval
Polettini Nicola

Monday, June 5, 2006

Web Information Retrieval

Page 2

Contents

1. Introduction to the Vector Space Model
   - Vocabulary & terms
   - Documents & queries
   - Similarity measures
2. Term weighting
   - Binary weights
   - SMART Retrieval System
3. Salton: "Term Precision Model" paper analysis
4. New weighting schemas: Web documents
5. Conclusions
6. References

Page 3

The Vector Space Model

1. Vocabulary
2. Terms
3. Documents & queries
4. Vector representation
5. Similarity measures
6. Cosine similarity

Page 4

Vocabulary

• Documents are represented as vectors in term space (the set of all terms = the vocabulary).

• Queries are represented the same way as documents.

• Query and document weights are based on the length and direction of their vectors.

• A vector distance measure between the query and each document is used to rank retrieved documents.

Page 5

Terms

• Documents are represented by binary or weighted vectors of terms.

• Terms are usually stems.
• Terms can also be n-grams:
  – "Computer Science" = bigram
  – "World Wide Web" = trigram

Page 6

Documents & Queries Vectors

• Documents and queries are represented as "bags of words" (BOW).

• Represented as vectors:
  – A vector is like an array of floating-point numbers.
  – It has direction and magnitude.
  – Each vector holds a place for every term in the collection.
  – Therefore, most vectors are sparse.

Page 7

Vector representation

• Documents and queries are represented as vectors.
• Vocabulary = n terms.
• Position 1 corresponds to term 1, position 2 to term 2, ..., position n to term n.

D_i = (w_di1, w_di2, ..., w_din)
Q   = (w_q1, w_q2, ..., w_qn)

w = 0 if a term is absent.

Page 8

Similarity Measures

Simple matching:       |Q ∩ D|
Dice's Coefficient:    2·|Q ∩ D| / (|Q| + |D|)
Jaccard's Coefficient: |Q ∩ D| / |Q ∪ D|
Cosine Coefficient:    |Q ∩ D| / (|Q|^½ · |D|^½)
Overlap Coefficient:   |Q ∩ D| / min(|Q|, |D|)

Page 9

Cosine Similarity

The similarity of a query Q and a document D_i is:

sim(Q, D_i) = Σ_{j=1..n} (w_qj · w_dij) / ( √(Σ_{j=1..n} w_qj²) · √(Σ_{j=1..n} w_dij²) )

• This is called the cosine similarity.
• The normalization can be done when weighting the terms; otherwise normalization and similarity are combined in one formula, as above.
• Cosine similarity sorts documents according to their degree of similarity to the query.

With pre-normalized weights it reduces to:

sim(Q, D_i) = Σ_{j=1..n} w_qj · w_dij

Page 10

Example: Computing Cosine Similarity

Q  = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

cos α1 ≈ 0.73  (angle between Q and D1)
cos α2 ≈ 0.98  (angle between Q and D2)

[Figure: Q, D1 and D2 drawn as vectors in a two-term space, axes from 0.2 to 1.0.]

Page 11

Example: Computing Cosine Similarity (2)

Say we have query vector Q = (0.4, 0.8) and document D2 = (0.2, 0.7).
What does their similarity comparison yield?

sim(Q, D2) = (0.4·0.2 + 0.8·0.7) / √([(0.4)² + (0.8)²] · [(0.2)² + (0.7)²])
           = 0.64 / √0.42
           ≈ 0.98
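The computation above can be sketched in a few lines of Python (a minimal illustration; the function name is our own, not from the slides):

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between vectors q and d."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

Q = (0.4, 0.8)
D2 = (0.2, 0.7)
print(round(cosine_similarity(Q, D2), 2))  # 0.98
```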

Page 12

Term Weighting

1. Binary weights
2. SMART Retrieval System
   - Local formulas
   - Global formulas
   - Normalization formulas
3. TFIDF

Page 13

Binary Weights

• Only the presence (1) or absence (0) of a term is recorded in the vector:

docs  t1  t2  t3
D1     1   0   1
D2     1   0   0
D3     0   1   1
D4     1   0   0
D5     1   1   1
D6     1   1   0
D7     0   1   0
D8     0   1   0
D9     0   0   1
D10    0   1   1
D11    1   0   1

Page 14

Binary Weights Formula

b(freq_dk) = 1 if freq_dk > 0
             0 if freq_dk = 0

The binary formula gives every word that appears in a document equal relevance. It can be useful when frequency is not important.

Page 15

Why use term weighting?

• Binary weights are too limiting: terms are either present or absent.

• Non-binary weights allow modeling of partial matching: partial matching allows retrieval of documents that approximate the query.

• Ranking of retrieved documents by best match: term weighting improves the quality of the answer set.

Page 16

Smart Retrieval System

• SMART is an experimental IR system developed by Gerard Salton (and continued by Chris Buckley) at Cornell.

• Designed for laboratory experiments in IR: easy to mix and match different weighting methods.

Paper: Salton, "The SMART Retrieval System – Experiments in Automatic Document Processing", 1971.

Page 17

Smart Retrieval System (2)

• In SMART, weights are decomposed into three factors:

w_dk = local_dk · global_k · norm

Page 18

Local term-weighting formulas

local_dk:

Binary:                b(freq_dk) = 1 if freq_dk > 0, else 0
Frequency:             freq_dk
Maxnorm:               freq_dk / max_k(freq_dk)
Augmented normalized:  1/2 + 1/2 · freq_dk / max_k(freq_dk)
Alternate log:         ln(freq_dk) + 1
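The local formulas can be sketched as small Python functions (a sketch; the function names are our own):

```python
import math

def local_binary(freq):
    # 1 if the term occurs in the document, 0 otherwise
    return 1.0 if freq > 0 else 0.0

def local_frequency(freq):
    # raw term frequency
    return float(freq)

def local_maxnorm(freq, max_freq):
    # frequency relative to the most frequent term in the document
    return freq / max_freq

def local_augmented(freq, max_freq):
    # 1/2 + 1/2 * freq / max_freq for present terms
    return 0.5 + 0.5 * freq / max_freq if freq > 0 else 0.0

def local_log(freq):
    # ln(freq) + 1 dampens large differences in frequency
    return math.log(freq) + 1.0 if freq > 0 else 0.0
```

For example, a term with frequency 4 in a document whose most frequent term occurs 8 times gets 0.5 under maxnorm and 0.75 under the augmented normalized formula.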

Page 19

Term frequency

• TF (term frequency) = count of times a term occurs in a document:

docs  t1  t2  t3
D1     2   0   3
D2     1   0   0
D3     0   4   7
D4     3   0   0
D5     1   6   3
D6     3   5   0
D7     0   8   0
D8     0  10   0
D9     0   0   1
D10    0   3   5
D11    4   0   1

Page 20

Term frequency (2)

• The more times a term t occurs in document d, the more likely it is that t is relevant to the document.

• Used alone, it favors common words and long documents.

• It gives too much credit to words that appear frequently.

• Typically used for query weighting.

Page 21

Augmented Normalized Term Frequency

K + (1 − K) · freq_dk / max_k(freq_dk)

• This formula was proposed by Croft.

• Usually K = 0.5.

• K < 0.5 for large documents.

• K = 0.5 for shorter documents.

• Output varies between 0.5 and 1 for terms that appear in the document. It's a "weak" form of normalization.

Page 22

Logarithmic Term Frequency

• Logarithms are a way to de-emphasize the effect of raw frequency.

• The logarithmic formula decreases the effect of large differences in term frequencies:

ln(freq_dk) + 1

Page 23

Global term-weighting formulas

global_k:

Inverse:        log(NDoc / Doc_k)
Squared:        [log(NDoc / Doc_k)]²
Probabilistic:  log((NDoc − Doc_k) / Doc_k)
Frequency:      1 / Doc_k

(NDoc = total number of documents; Doc_k = number of documents containing term k.)

Page 24

Document Frequency

• DF = document frequency

– Counts, for each term, the number of documents in the whole collection that contain it.

– The less frequently a term appears in the whole collection, the more discriminating it is.

Page 25

Inverse Document Frequency

• Measures the rarity of the term in the collection.
• Inverts the document frequency.
• It's the most used global formula.
• Higher if the term occurs in fewer documents:
  – Gives full weight to terms that occur in one document only.
  – Gives the lowest weight to terms that occur in all documents.

log(NDoc / Doc_k)

Page 26

Inverse Document Frequency (2)

• IDF provides high values for rare words and low values for common words.

Examples for a collection of 10000 documents (N = 10000):

log(10000 / 10000) = 0
log(10000 / 5000)  ≈ 0.301
log(10000 / 20)    ≈ 2.699
log(10000 / 1)     = 4
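These values can be reproduced with a short sketch (base-10 logarithm, as in the examples):

```python
import math

def idf(n_docs, doc_freq):
    # log10(N / Doc_k): high for rare terms, 0 for ubiquitous ones
    return math.log10(n_docs / doc_freq)

N = 10000
for doc_freq in (10000, 5000, 20, 1):
    print(doc_freq, round(idf(N, doc_freq), 3))
# 10000 -> 0.0, 5000 -> 0.301, 20 -> 2.699, 1 -> 4.0
```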

Page 27

Other IDF Schemes

• Squared IDF: rarely used, a variant of IDF.

• Probabilistic IDF:
  – Assigns weights ranging from −∞ for a term that appears in every document to log(N − 1) for a term that appears in only one document.
  – Gives negative weights to terms appearing in more than half of the documents.

Page 28

Normalization formulas

norm:

Sum of weights:  1 / (Σ_{j=1..n} w_j)
Cosine:          1 / √(Σ_{j=1..n} w_j²)
Fourth:          1 / (Σ_{j=1..n} w_j⁴)
Max:             1 / max_{j=1..n}(w_j)
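The four factors can be sketched as follows (our own function names; each returns the multiplier applied to every weight in the vector):

```python
import math

def norm_sum(weights):
    # 1 / sum of weights
    return 1.0 / sum(weights)

def norm_cosine(weights):
    # 1 / Euclidean length of the vector
    return 1.0 / math.sqrt(sum(w * w for w in weights))

def norm_fourth(weights):
    # 1 / sum of fourth powers
    return 1.0 / sum(w ** 4 for w in weights)

def norm_max(weights):
    # 1 / largest weight in the vector
    return 1.0 / max(weights)

w = [3.0, 4.0]
print(norm_cosine(w))  # 0.2  (= 1 / sqrt(9 + 16))
```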

Page 29

Document Normalization

• Long documents have an unfair advantage:
  – They use a lot of terms, so they get more matches than short documents.
  – And they use the same words repeatedly, so they have much higher term frequencies.

• Normalization seeks to remove these effects:
  – Related somehow to maximum term frequency.
  – But also sensitive to the number of terms.

• If we don't normalize, short documents may not be recognized as relevant.

Page 30

Cosine Normalization• It’s the most used and popular.• Normalize the term weights (so longer

documents are not unfairly given more weight).

• If weights are normalized the cosine similarity results:

),( 1

t

kjkikji wwDDsim

Page 31

Other normalizations

• Sum-of-weights and fourth normalization are rarely used; they are variants of cosine normalization.

• Max weight normalization: assigns weights between 0 and 1, but does not take into account the distribution of terms over documents. It gives high importance to the most heavily weighted terms within a document (used in CiteSeer).

Page 32

TFIDF Term-weighting

w_ik = tf_ik · log(N / n_k)

T_k   = term k in document D_i
tf_ik = frequency of term T_k in document D_i
idf_k = inverse document frequency of term T_k in collection C = log(N / n_k)
N     = total number of documents in the collection C
n_k   = number of documents in C that contain T_k
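Putting tf and idf together (a sketch, using a base-10 logarithm as in the IDF examples earlier):

```python
import math

def tfidf(tf, n_docs, doc_freq):
    # w_ik = tf_ik * log(N / n_k)
    return tf * math.log10(n_docs / doc_freq)

# a term occurring twice in a document, and in 2 of 4 documents overall:
print(round(tfidf(2, 4, 2), 3))  # 0.602
```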

Page 33

TFIDF Example• It’s the most used term-weighting

4

5

6

3

1

3

1

6

5

3

4

3

7

1

nuclear

fallout

siberia

contaminated

interesting

complicated

information

retrieval

2

1 2 3

2

3

2

4

4

0.50

0.63

0.90

0.13

0.60

0.75

1.51

0.38

0.50

2.11

0.13

1.20

1 2 3

0.60

0.38

0.50

4

0.301

0.125

0.125

0.125

0.602

0.301

0.000

0.602

tf Wi,j

idf

Page 34

Normalization example

[Worked example, continued from the previous slide: the document lengths (cosine norms) of documents 1–4 are 1.70, 0.97, 2.67 and 0.87; dividing each weight W_i,j by its document's length yields the normalized weights W'_i,j.]

Page 35

Retrieval Example

Query: contaminated retrieval

• The query vector has weight 1 for "contaminated" and "retrieval" and 0 for the other vocabulary terms (nuclear, fallout, siberia, interesting, complicated, information).
• Cosine similarity scores against the normalized document vectors W'_i,j:
  Doc 1: 0.29, Doc 2: 0.90, Doc 3: 0.19, Doc 4: 0.57
• Ranked list: Doc 2, Doc 4, Doc 1, Doc 3.
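This last ranking step can be sketched as below, assuming cosine-normalized document vectors so that the score is just a dot product with a binary query vector (the toy weights here are made up for illustration, not the slide's):

```python
def rank(query_terms, doc_vectors, vocab):
    """Return (doc_id, score) pairs sorted by descending dot product
    of a binary query vector with each normalized document vector."""
    q = [1.0 if term in query_terms else 0.0 for term in vocab]
    scores = {doc_id: sum(qi * wi for qi, wi in zip(q, vec))
              for doc_id, vec in doc_vectors.items()}
    return sorted(scores.items(), key=lambda item: -item[1])

vocab = ["contaminated", "retrieval", "nuclear"]
docs = {1: [0.10, 0.00, 0.99], 2: [0.60, 0.60, 0.00]}
print(rank({"contaminated", "retrieval"}, docs, vocab))
# doc 2 (score 1.2) ranks above doc 1 (score 0.1)
```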

Page 36

Gerard Salton paper: "The Term Precision Model"

1. Weighting schema proposed
2. Cosine similarity
3. Density formula
4. Discrimination Value formulas
5. Term Precision formulas
6. Conclusions

Page 37

Gerard Salton paper: "Weighting schema proposed"

1. Use of tf·idf formulas.
2. Underlines the importance of term weighting.
3. Use of cosine similarity.

Page 38

Gerard Salton paper: “Density formula”

s = (1 / (N(N − 1))) · Σ_{i=1..N} Σ_{j=1..N, j≠i} s(D_i, D_j)

• Density = the average pairwise cosine similarity between distinct document pairs.

• N = total number of documents.
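The density can be sketched directly from this definition (any pairwise similarity function can be plugged in; a plain dot product is used here just for illustration):

```python
import itertools

def density(doc_vectors, sim):
    # average similarity over all N(N-1) ordered pairs of distinct documents
    n = len(doc_vectors)
    total = sum(sim(a, b) for a, b in itertools.permutations(doc_vectors, 2))
    return total / (n * (n - 1))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# two identical unit vectors and one orthogonal to them:
print(density([(1, 0), (0, 1), (1, 0)], dot))
# 2 of the 6 ordered pairs have similarity 1, so the density is 1/3
```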

Page 39

Gerard Salton paper: "Discrimination Value formulas"

DV_k = s_k − s

w_ik = f_ik · DV_k

DV = Discrimination Value.

• It’s the difference between the two average densities where sk is the density for document pairs from which term k has been removed.

• If k is useful DV is positive.

Page 40

Gerard Salton paper: "Discrimination Value formulas" (2)

• Terms with a high document frequency increase the total density, and DV is negative.

• Terms with a low document frequency leave the density unchanged, and DV is near zero.

• Terms with a medium document frequency decrease the total density, and DV is positive.

Page 41

Gerard Salton paper: “Term Precision formulas”

w = log( (r / (R − r)) / (s / (I − s)) )

• N = total documents.

• R = relevant documents with respect to a query.

• I = (N − R) non-relevant documents.

• r = relevant documents in which the term appears.

• s = non-relevant documents in which the term appears (df = r + s).

• w increases for 0 < df < R and decreases for R < df < N.

• The maximum value of w is reached at df = R.
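Under the reconstruction of the formula above, the precision weight can be sketched as follows (the guards and the base of the logarithm are our assumptions; the base only rescales scores monotonically):

```python
import math

def term_precision_weight(r, R, s, I):
    # w = log( (r / (R - r)) / (s / (I - s)) )
    # assumes 0 < r < R and 0 < s < I so both ratios are defined
    return math.log((r / (R - r)) / (s / (I - s)))

# a term appearing in 8 of R=10 relevant docs and 10 of I=90 non-relevant docs:
print(round(term_precision_weight(8, 10, 10, 90), 3))  # log(32) = 3.466
```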

Page 42

Gerard Salton paper: “Conclusions”

Precision weights are difficult to compute in practice because the required relevance assessments of documents with respect to queries are not normally available in real retrieval situations.

Page 43

New Weighting Schemas

1. Web problems
2. Document structure
3. Hyperlinks
4. Different weighting schemas

Page 44

New Weighting Schemas (2)

• Weight tokens under particular HTML tags more heavily:
  – <TITLE> tokens (Google seems to like title matches)
  – <H1>, <H2>, ... tokens
  – <META> keyword tokens

• Parse page into conceptual sections (e.g. navigation links vs. page content) and weight tokens differently based on section.

Page 45

References

Gerard Salton and Chris Buckley. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988.

Gerard Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Co., New York, 1983.

Gerard Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, November 1975.

Gerard Salton. The SMART Retrieval System – Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, N.J., 1971.

Page 46

References (2)

Erica Chisholm and Tamara G. Kolda. New Term Weighting Formulas for the Vector Space Method in Information Retrieval. Computer Science and Mathematics Division, Oak Ridge National Laboratory, 1999.

W. B. Croft. Experiments with representations in a document retrieval system. Information Technology: Research and Development, 2:1–21, 1983.

Ray Larson and Marc Davis. SIMS 202: Information Organization and Retrieval. UC Berkeley SIMS, Lecture 18: Vector Representation, 2002.

Kishore Papineni. Why Inverse Document Frequency? IBM T.J. Watson Research Center, Yorktown Heights, New York, USA, 2001.

Chris Buckley. The importance of proper weighting methods. In M. Bates, editor, Human Language Technology. Morgan Kaufmann, 1993.

Page 47

Questions?