Term Weighting in Information Retrieval
Polettini Nicola, Monday, June 5, 2006
Web Information Retrieval


Page 1

Term Weighting in Information Retrieval
Polettini Nicola

Monday, June 5, 2006

Web Information Retrieval

Page 2

Contents

1. Introduction to the Vector Space Model
   - Vocabulary & terms
   - Documents & queries
   - Similarity measures
2. Term weighting
   - Binary weights
   - SMART Retrieval System
3. Salton: "Term Precision Model" paper analysis
4. New weighting schemas: Web documents
5. Conclusions
6. References

Page 3

The Vector Space Model

1. Vocabulary
2. Terms
3. Documents & queries
4. Vector representation
5. Similarity measures
6. Cosine similarity

Page 4

Vocabulary

• Documents are represented as vectors in term space (the set of all terms = the vocabulary).

• Queries are represented the same way as documents.

• Query and document weights are based on the length and direction of their vectors.

• A vector distance measure between the query and each document is used to rank retrieved documents.

Page 5

Terms

• Documents are represented by binary or weighted vectors of terms.

• Terms are usually stems.
• Terms can also be n-grams:
  – "Computer Science" = bigram
  – "World Wide Web" = trigram

Page 6

Documents & Queries Vectors

• Documents and queries are represented as "bags of words" (BOW).

• Represented as vectors:
  – A vector is like an array of floating-point numbers.
  – It has direction and magnitude.
  – Each vector holds a place for every term in the collection.
  – Therefore, most vectors are sparse.

Page 7

Vector representation

• Documents and queries are represented as vectors.
• Vocabulary = n terms.
• Position 1 corresponds to term 1, position 2 to term 2, ..., position n to term n.

D_i = (w_di1, w_di2, ..., w_din)
Q   = (w_q1, w_q2, ..., w_qn)

w = 0 if a term is absent.

Page 8

Similarity Measures

Simple matching:       |Q ∩ D|
Dice's Coefficient:    2·|Q ∩ D| / (|Q| + |D|)
Jaccard's Coefficient: |Q ∩ D| / |Q ∪ D|
Cosine Coefficient:    |Q ∩ D| / (|Q|^½ · |D|^½)
Overlap Coefficient:   |Q ∩ D| / min(|Q|, |D|)

Page 9

Cosine Similarity

The similarity of a query Q and a document D_i is:

sim(Q, D_i) = Σ_{j=1..n} (w_qj · w_dij) / ( √(Σ_{j=1..n} w_qj²) · √(Σ_{j=1..n} w_dij²) )

• This is called the cosine similarity.
• The normalization can be done when weighting the terms; otherwise normalization and similarity are combined in one formula, as above.
• Cosine similarity sorts documents according to their degree of similarity to the query.

With pre-normalized weights it reduces to:

sim(Q, D_i) = Σ_{j=1..n} w_qj · w_dij

Page 10

Example: Computing Cosine Similarity

Q  = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

cos α1 ≈ 0.73  (angle between Q and D1)
cos α2 ≈ 0.98  (angle between Q and D2)

[Figure: Q, D1 and D2 drawn as vectors in a two-term space, axes from 0.2 to 1.0.]

Page 11

Example: Computing Cosine Similarity (2)

Say we have query vector Q = (0.4, 0.8) and document D2 = (0.2, 0.7).
What does their similarity comparison yield?

sim(Q, D2) = (0.4·0.2 + 0.8·0.7) / √([(0.4)² + (0.8)²] · [(0.2)² + (0.7)²])
           = 0.64 / √0.42
           ≈ 0.98
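The computation above can be sketched in a few lines of Python (a minimal illustration; the function name is our own, not from the slides):

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between vectors q and d."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

Q = (0.4, 0.8)
D2 = (0.2, 0.7)
print(round(cosine_similarity(Q, D2), 2))  # 0.98
```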

Page 12

Term Weighting

1. Binary weights
2. SMART Retrieval System
   - Local formulas
   - Global formulas
   - Normalization formulas
3. TFIDF

Page 13

Binary Weights

• Only the presence (1) or absence (0) of a term is recorded in the vector:

docs  t1  t2  t3
D1     1   0   1
D2     1   0   0
D3     0   1   1
D4     1   0   0
D5     1   1   1
D6     1   1   0
D7     0   1   0
D8     0   1   0
D9     0   0   1
D10    0   1   1
D11    1   0   1

Page 14

Binary Weights Formula

b(freq_dk) = 1 if freq_dk > 0
             0 if freq_dk = 0

The binary formula gives every word that appears in a document equal relevance. It can be useful when frequency is not important.

Page 15

Why use term weighting?

• Binary weights are too limiting: terms are either present or absent.

• Non-binary weights allow modeling of partial matching: partial matching allows retrieval of documents that approximate the query.

• Ranking of retrieved documents by best match: term weighting improves the quality of the answer set.

Page 16

Smart Retrieval System

• SMART is an experimental IR system developed by Gerard Salton (and continued by Chris Buckley) at Cornell.

• Designed for laboratory experiments in IR: easy to mix and match different weighting methods.

Paper: Salton, "The SMART Retrieval System – Experiments in Automatic Document Processing", 1971.

Page 17

Smart Retrieval System (2)

• In SMART, weights are decomposed into three factors:

w_dk = local_dk · global_k · norm

Page 18

Local term-weighting formulas

local_dk:

Binary:                b(freq_dk) = 1 if freq_dk > 0, else 0
Frequency:             freq_dk
Maxnorm:               freq_dk / max_k(freq_dk)
Augmented normalized:  1/2 + 1/2 · freq_dk / max_k(freq_dk)
Alternate log:         ln(freq_dk) + 1
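The local formulas can be sketched as small Python functions (a sketch; the function names are our own):

```python
import math

def local_binary(freq):
    # 1 if the term occurs in the document, 0 otherwise
    return 1.0 if freq > 0 else 0.0

def local_frequency(freq):
    # raw term frequency
    return float(freq)

def local_maxnorm(freq, max_freq):
    # frequency relative to the most frequent term in the document
    return freq / max_freq

def local_augmented(freq, max_freq):
    # 1/2 + 1/2 * freq / max_freq for present terms
    return 0.5 + 0.5 * freq / max_freq if freq > 0 else 0.0

def local_log(freq):
    # ln(freq) + 1 dampens large differences in frequency
    return math.log(freq) + 1.0 if freq > 0 else 0.0
```

For example, a term with frequency 4 in a document whose most frequent term occurs 8 times gets 0.5 under maxnorm and 0.75 under the augmented normalized formula.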

Page 19

Term frequency

• TF (term frequency) = count of times a term occurs in a document:

docs  t1  t2  t3
D1     2   0   3
D2     1   0   0
D3     0   4   7
D4     3   0   0
D5     1   6   3
D6     3   5   0
D7     0   8   0
D8     0  10   0
D9     0   0   1
D10    0   3   5
D11    4   0   1

Page 20

Term frequency (2)

• The more times a term t occurs in document d, the more likely it is that t is relevant to the document.

• Used alone, it favors common words and long documents.

• It gives too much credit to words that appear frequently.

• Typically used for query weighting.

Page 21

Augmented Normalized Term Frequency

K + (1 − K) · freq_dk / max_k(freq_dk)

• This formula was proposed by Croft.

• Usually K = 0.5.

• K < 0.5 for large documents.

• K = 0.5 for shorter documents.

• Output varies between 0.5 and 1 for terms that appear in the document. It's a "weak" form of normalization.

Page 22

Logarithmic Term Frequency

• Logarithms are a way to de-emphasize the effect of raw frequency.

• The logarithmic formula decreases the effect of large differences in term frequencies:

ln(freq_dk) + 1

Page 23

Global term-weighting formulas

global_k:

Inverse:        log(NDoc / Doc_k)
Squared:        [log(NDoc / Doc_k)]²
Probabilistic:  log((NDoc − Doc_k) / Doc_k)
Frequency:      1 / Doc_k

(NDoc = total number of documents; Doc_k = number of documents containing term k.)

Page 24

Document Frequency

• DF = document frequency

– Counts, for each term, the number of documents in the whole collection that contain it.

– The less frequently a term appears in the whole collection, the more discriminating it is.

Page 25

Inverse Document Frequency

• Measures the rarity of the term in the collection.
• Inverts the document frequency.
• It's the most used global formula.
• Higher if the term occurs in fewer documents:
  – Gives full weight to terms that occur in one document only.
  – Gives the lowest weight to terms that occur in all documents.

log(NDoc / Doc_k)

Page 26

Inverse Document Frequency (2)

• IDF provides high values for rare words and low values for common words.

Examples for a collection of 10000 documents (N = 10000):

log(10000 / 10000) = 0
log(10000 / 5000)  ≈ 0.301
log(10000 / 20)    ≈ 2.699
log(10000 / 1)     = 4
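These values can be reproduced with a short sketch (base-10 logarithm, as in the examples):

```python
import math

def idf(n_docs, doc_freq):
    # log10(N / Doc_k): high for rare terms, 0 for ubiquitous ones
    return math.log10(n_docs / doc_freq)

N = 10000
for doc_freq in (10000, 5000, 20, 1):
    print(doc_freq, round(idf(N, doc_freq), 3))
# 10000 -> 0.0, 5000 -> 0.301, 20 -> 2.699, 1 -> 4.0
```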

Page 27

Other IDF Schemes

• Squared IDF: rarely used, a variant of IDF.

• Probabilistic IDF:
  – Assigns weights ranging from −∞ for a term that appears in every document to log(N − 1) for a term that appears in only one document.
  – Gives negative weights to terms appearing in more than half of the documents.

Page 28

Normalization formulas

norm:

Sum of weights:  1 / (Σ_{j=1..n} w_j)
Cosine:          1 / √(Σ_{j=1..n} w_j²)
Fourth:          1 / (Σ_{j=1..n} w_j⁴)
Max:             1 / max_{j=1..n}(w_j)
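The four factors can be sketched as follows (our own function names; each returns the multiplier applied to every weight in the vector):

```python
import math

def norm_sum(weights):
    # 1 / sum of weights
    return 1.0 / sum(weights)

def norm_cosine(weights):
    # 1 / Euclidean length of the vector
    return 1.0 / math.sqrt(sum(w * w for w in weights))

def norm_fourth(weights):
    # 1 / sum of fourth powers
    return 1.0 / sum(w ** 4 for w in weights)

def norm_max(weights):
    # 1 / largest weight in the vector
    return 1.0 / max(weights)

w = [3.0, 4.0]
print(norm_cosine(w))  # 0.2  (= 1 / sqrt(9 + 16))
```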

Page 29

Document Normalization

• Long documents have an unfair advantage:
  – They use a lot of terms, so they get more matches than short documents.
  – And they use the same words repeatedly, so they have much higher term frequencies.

• Normalization seeks to remove these effects:
  – Related somehow to maximum term frequency.
  – But also sensitive to the number of terms.

• If we don't normalize, short documents may not be recognized as relevant.

Page 30

Cosine Normalization• It’s the most used and popular.• Normalize the term weights (so longer

documents are not unfairly given more weight).

• If weights are normalized the cosine similarity results:

),( 1

t

kjkikji wwDDsim

Page 31

Other normalizations

• Sum-of-weights and fourth normalization are rarely used; they are variants of cosine normalization.

• Max weight normalization: assigns weights between 0 and 1, but does not take into account the distribution of terms over documents. It gives high importance to the most heavily weighted terms within a document (used in CiteSeer).

Page 32

TFIDF Term-weighting

w_ik = tf_ik · log(N / n_k)

T_k   = term k in document D_i
tf_ik = frequency of term T_k in document D_i
idf_k = inverse document frequency of term T_k in collection C = log(N / n_k)
N     = total number of documents in the collection C
n_k   = number of documents in C that contain T_k
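Putting tf and idf together (a sketch, using a base-10 logarithm as in the IDF examples earlier):

```python
import math

def tfidf(tf, n_docs, doc_freq):
    # w_ik = tf_ik * log(N / n_k)
    return tf * math.log10(n_docs / doc_freq)

# a term occurring twice in a document, and in 2 of 4 documents overall:
print(round(tfidf(2, 4, 2), 3))  # 0.602
```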

Page 33

TFIDF Example• It’s the most used term-weighting

4

5

6

3

1

3

1

6

5

3

4

3

7

1

nuclear

fallout

siberia

contaminated

interesting

complicated

information

retrieval

2

1 2 3

2

3

2

4

4

0.50

0.63

0.90

0.13

0.60

0.75

1.51

0.38

0.50

2.11

0.13

1.20

1 2 3

0.60

0.38

0.50

4

0.301

0.125

0.125

0.125

0.602

0.301

0.000

0.602

tf Wi,j

idf

Page 34

Normalization example

[Worked example, continued from the previous slide: the document lengths (cosine norms) of documents 1–4 are 1.70, 0.97, 2.67 and 0.87; dividing each weight W_i,j by its document's length yields the normalized weights W'_i,j.]

Page 35

Retrieval Example

Query: contaminated retrieval

• The query vector has weight 1 for "contaminated" and "retrieval" and 0 for the other vocabulary terms (nuclear, fallout, siberia, interesting, complicated, information).
• Cosine similarity scores against the normalized document vectors W'_i,j:
  Doc 1: 0.29, Doc 2: 0.90, Doc 3: 0.19, Doc 4: 0.57
• Ranked list: Doc 2, Doc 4, Doc 1, Doc 3.
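This last ranking step can be sketched as below, assuming cosine-normalized document vectors so that the score is just a dot product with a binary query vector (the toy weights here are made up for illustration, not the slide's):

```python
def rank(query_terms, doc_vectors, vocab):
    """Return (doc_id, score) pairs sorted by descending dot product
    of a binary query vector with each normalized document vector."""
    q = [1.0 if term in query_terms else 0.0 for term in vocab]
    scores = {doc_id: sum(qi * wi for qi, wi in zip(q, vec))
              for doc_id, vec in doc_vectors.items()}
    return sorted(scores.items(), key=lambda item: -item[1])

vocab = ["contaminated", "retrieval", "nuclear"]
docs = {1: [0.10, 0.00, 0.99], 2: [0.60, 0.60, 0.00]}
print(rank({"contaminated", "retrieval"}, docs, vocab))
# doc 2 (score 1.2) ranks above doc 1 (score 0.1)
```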

Page 36

Gerard Salton paper: "The Term Precision Model"

1. Weighting schema proposed
2. Cosine similarity
3. Density formula
4. Discrimination Value formulas
5. Term Precision formulas
6. Conclusions

Page 37

Gerard Salton paper: "Weighting schema proposed"

1. Use of tf·idf formulas.
2. Underlines the importance of term weighting.
3. Use of cosine similarity.

Page 38

Gerard Salton paper: “Density formula”

s = (1 / (N(N − 1))) · Σ_{i=1..N} Σ_{j=1..N, j≠i} s(D_i, D_j)

• Density = the average pairwise cosine similarity between distinct document pairs.

• N = total number of documents.
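The density can be sketched directly from this definition (any pairwise similarity function can be plugged in; a plain dot product is used here just for illustration):

```python
import itertools

def density(doc_vectors, sim):
    # average similarity over all N(N-1) ordered pairs of distinct documents
    n = len(doc_vectors)
    total = sum(sim(a, b) for a, b in itertools.permutations(doc_vectors, 2))
    return total / (n * (n - 1))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# two identical unit vectors and one orthogonal to them:
print(density([(1, 0), (0, 1), (1, 0)], dot))
# 2 of the 6 ordered pairs have similarity 1, so the density is 1/3
```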

Page 39

Gerard Salton paper: "Discrimination Value formulas"

DV_k = s_k − s

w_ik = f_ik · DV_k

DV = Discrimination Value.

• It’s the difference between the two average densities where sk is the density for document pairs from which term k has been removed.

• If k is useful DV is positive.

Page 40

Gerard Salton paper: "Discrimination Value formulas" (2)

• Terms with a high document frequency increase the total density, and DV is negative.

• Terms with a low document frequency leave the density unchanged, and DV is near zero.

• Terms with a medium document frequency decrease the total density, and DV is positive.

Page 41

Gerard Salton paper: “Term Precision formulas”

w = log( (r / (R − r)) / (s / (I − s)) )

• N = total documents.

• R = relevant documents with respect to a query.

• I = (N − R) non-relevant documents.

• r = relevant documents in which the term appears.

• s = non-relevant documents in which the term appears (df = r + s).

• w increases for 0 < df < R and decreases for R < df < N.

• The maximum value of w is reached at df = R.
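Under the reconstruction of the formula above, the precision weight can be sketched as follows (the guards and the base of the logarithm are our assumptions; the base only rescales scores monotonically):

```python
import math

def term_precision_weight(r, R, s, I):
    # w = log( (r / (R - r)) / (s / (I - s)) )
    # assumes 0 < r < R and 0 < s < I so both ratios are defined
    return math.log((r / (R - r)) / (s / (I - s)))

# a term appearing in 8 of R=10 relevant docs and 10 of I=90 non-relevant docs:
print(round(term_precision_weight(8, 10, 10, 90), 3))  # log(32) = 3.466
```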

Page 42

Gerard Salton paper: “Conclusions”

Precision weights are difficult to compute in practice because the required relevance assessments of documents with respect to queries are not normally available in real retrieval situations.

Page 43

New Weighting Schemas

1. Web problems
2. Document structure
3. Hyperlinks
4. Different weighting schemas

Page 44

New Weighting Schemas (2)

• Weight tokens under particular HTML tags more heavily:
  – <TITLE> tokens (Google seems to like title matches)
  – <H1>, <H2>, ... tokens
  – <META> keyword tokens

• Parse page into conceptual sections (e.g. navigation links vs. page content) and weight tokens differently based on section.

Page 45

References

Gerard Salton and Chris Buckley. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988.

Gerard Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Co., New York, 1983.

Gerard Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, November 1975.

Gerard Salton. The SMART Retrieval System – Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, N.J., 1971.

Page 46

References (2)

Erica Chisholm and Tamara G. Kolda. New Term Weighting Formulas for the Vector Space Method in Information Retrieval. Computer Science and Mathematics Division, Oak Ridge National Laboratory, 1999.

W. B. Croft. Experiments with representations in a document retrieval system. Information Technology: Research and Development, 2:1–21, 1983.

Ray Larson and Marc Davis. SIMS 202: Information Organization and Retrieval. UC Berkeley SIMS, Lecture 18: Vector Representation, 2002.

Kishore Papineni. Why Inverse Document Frequency? IBM T.J. Watson Research Center, Yorktown Heights, New York, USA, 2001.

Chris Buckley. The importance of proper weighting methods. In M. Bates, editor, Human Language Technology. Morgan Kaufmann, 1993.

Page 47

Questions?