Information Retrieval and Vector Space Model

Computational Linguistics Course
Instructor: Professor Cercone
Presenter: Morteza Zihayat


Page 1

Computational Linguistics Course

Instructor: Professor Cercone

Presenter: Morteza Zihayat

Information Retrieval and Vector Space Model

Page 2

Outline

Introduction to IR
IR System Architecture
Vector Space Model (VSM)
How to Assign Weights?
TF-IDF Weighting
Example
Advantages and Disadvantages of VS Model
Improving the VS Model


Page 4

Introduction to IR

The world's total yearly production of unique information stored in the form of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth.

(Lyman & Varian, 2000)

Page 5

Growth of textual information

How can we help manage and exploit all the information?

Literature, Email, WWW, Desktop, News, Intranet, Blog

Page 6

Information overflow

Page 7

What is Information Retrieval (IR)?

Narrow sense: IR = search engine technologies (IR = Google, library information systems); IR = text matching/classification

Broad sense: IR = text information management. The general problem: how to manage text information?
How to find useful information? (retrieval). Example: Google
How to organize information? (text classification). Example: automatically assign emails to different folders
How to discover knowledge from text? (text mining). Example: discover correlations of events


Page 9

Formalizing IR Tasks

Vocabulary: V = {w1, w2, ..., wT} of a language
Query: q = q1, q2, ..., qm, where each qi ∈ V
Document: di = di1, di2, ..., dimi, where each dij ∈ V
Collection: C = {d1, d2, ..., dN}
Relevant document set: R(q) ⊆ C. Generally unknown and user-dependent; the query provides a "hint" on which documents should be in R(q)

IR: find the approximate relevant document set R'(q)
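To make the notation concrete, here is a minimal Python sketch of this formalization; the type aliases and the threshold-based retrieval function are illustrative assumptions, not part of the slides.

    # Illustrative sketch of the formalization above (names are hypothetical).
    from typing import Callable, List, Set

    Term = str
    Vocabulary = Set[Term]          # V = {w1, ..., wT}
    Query = List[Term]              # q = q1, ..., qm, each qi in V
    Document = List[Term]           # di = di1, ..., dimi, each dij in V
    Collection = List[Document]     # C = {d1, ..., dN}

    def retrieve(q: Query, C: Collection,
                 f: Callable[[Query, Document], float], theta: float) -> Collection:
        """Approximate the relevant set R'(q): all documents scoring above a threshold."""
        return [d for d in C if f(q, d) > theta]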

Source: This slide is borrowed from [1]

Page 10

Evaluation measures

The quality of many retrieval systems depends on how well they manage to rank relevant documents.

How can we evaluate rankings in IR? IR researchers have developed evaluation measures specifically designed to evaluate rankings. Most of these measures combine precision and recall in a way that takes account of the ranking.

Page 11

Precision & Recall

Source: This slide is borrowed from [1]

Page 12

In other words:

Precision is the percentage of relevant items in the returned set.

Recall is the percentage of all relevant documents in the collection that are in the returned set.
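A minimal sketch of these two measures, assuming the returned and relevant sets are given as sets of document ids:

    # Precision and recall of a returned set against the known relevant set.
    def precision(returned: set, relevant: set) -> float:
        return len(returned & relevant) / len(returned) if returned else 0.0

    def recall(returned: set, relevant: set) -> float:
        return len(returned & relevant) / len(relevant) if relevant else 0.0

    # Example: 3 of the 4 returned documents are relevant; 6 relevant documents exist.
    returned = {"d1", "d2", "d3", "d4"}
    relevant = {"d1", "d2", "d3", "d7", "d8", "d9"}
    print(precision(returned, relevant))  # 0.75
    print(recall(returned, relevant))     # 0.5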


Page 13

Evaluating Retrieval Performance

Source: This slide is borrowed from [1]

Page 14

IR System Architecture

[Architecture diagram: the user interacts through the INTERFACE, issuing a query and relevance judgments; INDEXING builds the document representation (DocRep) from the docs; the query is turned into a query representation (QueryRep); SEARCHING ranks documents (Ranking) and returns results; Feedback uses the judgments for QUERY MODIFICATION.]

Page 15

Indexing Documents

Page 16

Searching

Given a query, score documents efficiently.

The basic question: given a query, how do we know whether document A is more relevant than document B?
Perhaps document A uses more query words than document B, or word usage in document A is more similar to that in the query, ...

We need a way to compute the relevance between the query and the documents.

Page 17

The Notion of Relevance

[Taxonomy of retrieval models:
Similarity between Rep(q) and Rep(d), with different representations and similarity measures: vector space model (Salton et al., 75); prob. distr. model (Wong & Yao, 89)
Probability of relevance, P(r=1|q,d), r ∈ {0,1}: regression model (Fox, 83); generative models, via doc generation (classical prob. model, Robertson & Sparck Jones, 76) or query generation (LM approach, Ponte & Croft, 98; Lafferty & Zhai, 01a)
Probabilistic inference, P(d -> q) or P(q -> d), with different inference systems: inference network model (Turtle & Croft, 91); prob. concept space model (Wong & Yao, 95)]

Today's lecture: the vector space model.

Page 18

Relevance = Similarity

Assumptions:
Query and document are represented similarly, so a query can be regarded as a "document"
Relevance(d,q) ∝ similarity(d,q)

R(q) = {d ∈ C | f(d,q) > θ}, where f(q,d) = sim(Rep(q), Rep(d)) and θ is a cutoff threshold

Key issues:
How to represent the query/document? The Vector Space Model (VSM)
How to define the similarity measure?


Page 20

Vector Space Model (VSM)

The vector space model is one of the most widely used models for ad-hoc retrieval.

It is used in information filtering, information retrieval, indexing, and relevancy ranking.

Page 21

VSM

Represent a doc/query by a term vector
Term: a basic concept, e.g., a word or phrase
Each term defines one dimension; N terms define a high-dimensional space
E.g., d = (x1, ..., xN), where xi is the "importance" of term i

Measure relevance by the distance between the query vector and the document vector in the vector space.

Page 22

VS Model: illustration

[Illustration: documents D1-D11 and a query plotted in a three-dimensional term space with axes "Java", "Microsoft", and "Starbucks"; the documents closest to the query vector are the candidates to return.]

Page 23

Some Issues about VS Model

There is no consistent definition of the "basic concept" (term).

How to assign weights to words is not fully determined; the weight in a query indicates the importance of the term.


Page 25

How to Assign Weights?

Different terms have different importance in a text.

A term weighting scheme plays an important role in the similarity measure: higher weight = greater impact.

We now turn to the question of how to weight words in the vector space model.

Page 26

There are three components in a weighting scheme:

gi: the global weight of the ith term
tij: the local weight of the ith term in the jth document
dj: the normalization factor for the jth document


Page 28

TF Weighting

Idea: a term is more important if it occurs more frequently in a document.

Formulas: let f(t,d) be the frequency count of term t in document d
Raw TF: TF(t,d) = f(t,d)
Log TF: TF(t,d) = log f(t,d)
Maximum frequency normalization: TF(t,d) = 0.5 + 0.5 * f(t,d) / MaxFreq(d)

Normalization of TF is very important!
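A minimal sketch of the three TF variants above (function names are illustrative):

    import math

    # f is the raw frequency count f(t, d).
    def raw_tf(f: int) -> float:
        return float(f)

    def log_tf(f: int) -> float:
        return math.log(f) if f > 0 else 0.0  # only applied when f(t,d) > 0

    def max_norm_tf(f: int, max_freq: int) -> float:
        # 0.5 + 0.5 * f(t,d) / MaxFreq(d), MaxFreq(d) = highest term count in d
        return 0.5 + 0.5 * f / max_freq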


Page 29

TF Methods

Page 30

IDF Weighting

Idea: a term is more discriminative if it occurs in fewer documents.

Formula: IDF(t) = 1 + log(n/k)
n: total number of docs
k: number of docs containing term t (document frequency)
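A one-line sketch of this IDF formula:

    import math

    # IDF(t) = 1 + log(n/k): n = total number of docs, k = number of docs containing t.
    def idf(n: int, k: int) -> float:
        return 1 + math.log(n / k)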


Page 31

IDF Weighting Methods

Page 32

TF Normalization

Why? Document length varies, and "repeated occurrences" are less informative than the "first occurrence".

Two views of document length:
A doc is long because it uses more words
A doc is long because it has more content

Generally penalize long documents, but avoid over-penalizing.

Page 33

TF-IDF Weighting

TF-IDF weighting: weight(t,d) = TF(t,d) * IDF(t)
Common in the document → high TF → high weight
Rare in the collection → high IDF → high weight

Imagine a word count profile: what kind of terms would have high weights?
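A minimal sketch combining raw TF with the IDF formula from the previous slide; the helper name and the dictionary-based sparse vector are assumptions for illustration, not from the slides:

    import math
    from collections import Counter

    # weight(t, d) = TF(t, d) * IDF(t), using raw TF and IDF(t) = 1 + log(n/k).
    def tf_idf_vector(doc_tokens, doc_freq, n_docs):
        tf = Counter(doc_tokens)
        return {t: f * (1 + math.log(n_docs / doc_freq[t])) for t, f in tf.items()}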


Page 34

How to Measure Similarity?

Given a document Di = (wi1, ..., wiN) and a query Q = (wq1, ..., wqN), where a weight is 0 if the term is absent:

Dot product similarity: S(Q, Di) = Σj (wqj * wij), for j = 1..N

Cosine similarity (normalized dot product): sim(Q, Di) = Σj (wqj * wij) / ( sqrt(Σj wqj^2) * sqrt(Σj wij^2) )
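A minimal sketch of both similarity measures on sparse weight vectors stored as {term: weight} dictionaries (this representation is an assumption for illustration):

    import math

    # Absent terms simply have no entry, i.e. weight 0.
    def dot_product(q: dict, d: dict) -> float:
        return sum(w * d.get(t, 0.0) for t, w in q.items())

    def cosine(q: dict, d: dict) -> float:
        norm_q = math.sqrt(sum(w * w for w in q.values()))
        norm_d = math.sqrt(sum(w * w for w in d.values()))
        return dot_product(q, d) / (norm_q * norm_d) if norm_q and norm_d else 0.0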



Page 36

VS Example: Raw TF & Dot Product

query = "information retrieval"

doc1 = {information, retrieval, search, engine, information}
doc2 = {travel, information, map, travel}
doc3 = {government, president, congress}

Term:        Info.    Retrieval  Travel   Map      Search   Engine   Govern.  President  Congress
IDF (fake):  2.4      4.5        2.8      3.3      2.1      5.4      2.2      3.2        4.3
Doc1:        2 (4.8)  1 (4.5)    -        -        1 (2.1)  1 (5.4)  -        -          -
Doc2:        1 (2.4)  -          2 (5.6)  1 (3.3)  -        -        -        -          -
Doc3:        -        -          -        -        -        -        1 (2.2)  1 (3.2)    1 (4.3)
Query:       1 (2.4)  1 (4.5)    -        -        -        -        -        -          -

(each entry is the raw term frequency, with the TF * IDF weight in parentheses)

Sim(q, doc1) = 4.8*2.4 + 4.5*4.5
Sim(q, doc2) = 2.4*2.4
Sim(q, doc3) = 0

Page 37

Example

Q: "gold silver truck"
D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"

Document frequency of the jth term: dfj
Inverse document frequency: idfj = log10(n / dfj)

TF*IDF is used as the term weight here.

Page 38

Example (Cont'd)

Id  Term      df  idf
1   a         3   0
2   arrived   2   0.176
3   damaged   1   0.477
4   delivery  1   0.477
5   fire      1   0.477
6   gold      2   0.176
7   in        3   0
8   of        3   0
9   silver    1   0.477
10  shipment  2   0.176
11  truck     2   0.176

Page 39

Example (Cont'd)

TF*IDF weights are used here.

SC(Q, D1) = (0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) + (0)(0) + (0)(0) = 0.031
SC(Q, D2) = 0.486
SC(Q, D3) = 0.062

The ranking would be D2, D3, D1. This SC uses the dot product.
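A minimal sketch that reproduces these scores; the tokenization (a simple lower-case split) is an assumption, since the slides do not specify any preprocessing:

    import math
    from collections import Counter

    docs = {
        "D1": "shipment of gold damaged in a fire".split(),
        "D2": "delivery of silver arrived in a silver truck".split(),
        "D3": "shipment of gold arrived in a truck".split(),
    }
    query = "gold silver truck".split()

    n = len(docs)
    df = Counter(t for d in docs.values() for t in set(d))        # document frequencies
    idf = {t: math.log10(n / df[t]) for t in df}                  # idf = log10(n / df)

    def tf_idf(tokens):
        return {t: f * idf.get(t, 0.0) for t, f in Counter(tokens).items()}

    q_vec = tf_idf(query)
    for name, d in docs.items():
        d_vec = tf_idf(d)
        sc = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())  # dot product
        print(name, round(sc, 3))  # D1 0.031, D2 0.486, D3 0.062 -> ranking D2, D3, D1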



Page 41

Advantages of VS Model

Empirically effective! (Top TREC performance)
Intuitive
Easy to implement
Well studied / most evaluated
The SMART system: developed at Cornell, 1960-1999, and still widely used

Warning: there are many variants of TF-IDF!

Page 42

Disadvantages of VS Model

Assumes term independence

Assumes the query and documents are represented in the same way

Lots of parameter tuning!


Page 44

Improving the VS Model

We can improve the model by:

Reducing the number of dimensions: eliminating all stop words and very common terms, stemming terms to their roots, Latent Semantic Analysis

Not retrieving documents below a defined cosine threshold

Using normalized frequencies: the normalized frequency of a term i in document j (normalized document frequencies and normalized query frequencies) is given in [1]

Page 45

Stop List

Function words do not bear useful information for IR: of, not, to, or, in, about, with, I, be, ...

Stop list: contains stop words that are not to be used as index terms: prepositions, articles, pronouns, some adverbs and adjectives, some frequent words (e.g., "document")

The removal of stop words usually improves IR effectiveness.

A few "standard" stop lists are commonly used.
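A minimal sketch of stop-word removal with a small hand-picked stop list (illustrative only; a real system would use one of the "standard" stop lists mentioned above):

    STOP_WORDS = {"of", "not", "to", "or", "in", "about", "with", "i", "be", "a", "the"}

    def remove_stop_words(tokens):
        return [t for t in tokens if t.lower() not in STOP_WORDS]

    print(remove_stop_words("shipment of gold damaged in a fire".split()))
    # ['shipment', 'gold', 'damaged', 'fire']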


Page 46

Stemming

Reason: different word forms may bear similar meaning (e.g., search, searching); create a "standard" representation for them.

Stemming: removing some endings of words, e.g., dancer, dancers, dance, danced, dancing → dance

Page 47

Stemming (Cont'd)

Two main methods:

Linguistic/dictionary-based stemming: high stemming accuracy, high implementation and processing costs, and higher coverage

Porter-style stemming: lower stemming accuracy, lower implementation and processing costs, and lower coverage; usually sufficient for IR
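For illustration, a deliberately crude suffix-stripping sketch in the spirit of Porter-style stemming; this is not the actual Porter algorithm, which applies measure-based rewrite rules in several passes:

    # Crude, illustrative suffix stripping (NOT the real Porter stemmer).
    SUFFIXES = ("ings", "ing", "ers", "ies", "ed", "er", "es", "s")

    def crude_stem(word: str) -> str:
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word

    print([crude_stem(w) for w in ["dancer", "dancers", "danced", "dancing"]])
    # ['danc', 'danc', 'danc', 'danc'] -- one shared (non-dictionary) root form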


Page 48

Latent Semantic Indexing (LSI) [3]

Reduces the dimensions of the term-document space

Attempts to solve the synonymy and polysemy problems

Uses Singular Value Decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text

Based on the principle that words that are used in the same contexts tend to have similar meanings.

Page 49

LSI Process

In general, the process involves:
constructing a weighted term-document matrix
performing a Singular Value Decomposition (SVD) on the matrix
using the matrix to identify the concepts contained in the text

LSI statistically analyses the patterns of word usage across the entire document collection.
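A minimal LSI sketch using NumPy's SVD on a toy term-document matrix; the matrix values and the choice of k = 2 concepts are illustrative assumptions:

    import numpy as np

    # Toy (terms x documents) weight matrix; values stand in for TF-IDF-style weights.
    A = np.array([
        [1.0, 0.0, 1.0],   # term "gold"
        [0.0, 2.0, 0.0],   # term "silver"
        [0.0, 1.0, 1.0],   # term "truck"
        [1.0, 0.0, 1.0],   # term "shipment"
    ])

    k = 2                                          # number of latent concepts to keep
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation of A
    doc_concepts = np.diag(s[:k]) @ Vt[:k, :]      # documents in the k-dim concept space
    print(doc_concepts.shape)                      # (2, 3): 2 concepts x 3 documents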


Page 50

References

Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schuetze

https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/2.pdf

https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/ir4up.pdf

https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/e09-3009.pdf

https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/07models-vsm.pdf

https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/03vectorspaceimplementation-6per.pdf

https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture02.ppt

https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/vector_space_model-updated.ppt

https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture_13_ir_and_vsm_.ppt

Document Classification based on Wikipedia Content, http://www.iicm.tugraz.at/cguetl/courses/isr/opt/classification/Vector_Space_Model.html?timestamp=1318275702299

Page 51

Thanks For Your Attention ...