Information Retrieval and Vector Space Model

Computational Linguistics Course
Instructor: Professor Cercone
Presenter: Morteza Zihayat


Page 1

Computational Linguistics Course

Instructor: Professor Cercone

Presenter: Morteza Zihayat

Information Retrieval and Vector Space Model

Page 2

Outline

Introduction to IR
IR System Architecture
Vector Space Model (VSM)
How to Assign Weights?
TF-IDF Weighting
Example
Advantages and Disadvantages of VS Model
Improving the VS Model


Page 4

Introduction to IR

The world's total yearly production of unique information stored in the form of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth.

(Lyman & Varian, 2000)

Page 5

Growth of textual information

How can we help manage and exploit all the information?

Literature, Email, WWW, Desktop, News, Intranet, Blog

Page 6

Information overflow

Page 7

What is Information Retrieval (IR)?

Narrow sense: IR = search engine technologies (IR = Google, library information systems); IR = text matching/classification

Broad sense: IR = text information management. The general problem: how to manage text information?
How to find useful information? (retrieval). Example: Google
How to organize information? (text classification). Example: automatically assign emails to different folders
How to discover knowledge from text? (text mining). Example: discover correlations of events


Page 9

Formalizing IR Tasks

Vocabulary: V = {w1, w2, ..., wT} of a language
Query: q = q1, q2, ..., qm, where each qi ∈ V
Document: di = di1, di2, ..., dimi, where each dij ∈ V
Collection: C = {d1, d2, ..., dN}
Relevant document set: R(q) ⊆ C. Generally unknown and user-dependent; the query provides a "hint" on which documents should be in R(q)

IR: find the approximate relevant document set R'(q)
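To make the notation concrete, here is a minimal Python sketch of this formalization; the type aliases and the threshold-based retrieval function are illustrative assumptions, not part of the slides.

    # Illustrative sketch of the formalization above (names are hypothetical).
    from typing import Callable, List, Set

    Term = str
    Vocabulary = Set[Term]          # V = {w1, ..., wT}
    Query = List[Term]              # q = q1, ..., qm, each qi in V
    Document = List[Term]           # di = di1, ..., dimi, each dij in V
    Collection = List[Document]     # C = {d1, ..., dN}

    def retrieve(q: Query, C: Collection,
                 f: Callable[[Query, Document], float], theta: float) -> Collection:
        """Approximate the relevant set R'(q): all documents scoring above a threshold."""
        return [d for d in C if f(q, d) > theta]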

Source: This slide is borrowed from [1]

Page 10

Evaluation measures

The quality of many retrieval systems depends on how well they manage to rank relevant documents.

How can we evaluate rankings in IR? IR researchers have developed evaluation measures specifically designed to evaluate rankings. Most of these measures combine precision and recall in a way that takes account of the ranking.

Page 11

Precision & Recall

Source: This slide is borrowed from [1]

Page 12

In other words:

Precision is the percentage of relevant items in the returned set.

Recall is the percentage of all relevant documents in the collection that are in the returned set.
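A minimal sketch of these two measures, assuming the returned and relevant sets are given as sets of document ids:

    # Precision and recall of a returned set against the known relevant set.
    def precision(returned: set, relevant: set) -> float:
        return len(returned & relevant) / len(returned) if returned else 0.0

    def recall(returned: set, relevant: set) -> float:
        return len(returned & relevant) / len(relevant) if relevant else 0.0

    # Example: 3 of the 4 returned documents are relevant; 6 relevant documents exist.
    returned = {"d1", "d2", "d3", "d4"}
    relevant = {"d1", "d2", "d3", "d7", "d8", "d9"}
    print(precision(returned, relevant))  # 0.75
    print(recall(returned, relevant))     # 0.5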


Page 13

Evaluating Retrieval Performance

Source: This slide is borrowed from [1]

Page 14

IR System Architecture

[Architecture diagram: the user interacts through the INTERFACE, issuing a query and relevance judgments; INDEXING builds the document representation (DocRep) from the docs; the query is turned into a query representation (QueryRep); SEARCHING ranks documents (Ranking) and returns results; Feedback uses the judgments for QUERY MODIFICATION.]

Page 15

Indexing Documents

Page 16

Searching

Given a query, score documents efficiently.

The basic question: given a query, how do we know whether document A is more relevant than document B?
Perhaps document A uses more query words than document B, or word usage in document A is more similar to that in the query, ...

We need a way to compute the relevance between the query and the documents.

Page 17

The Notion of Relevance

[Taxonomy of retrieval models:
Similarity between Rep(q) and Rep(d), with different representations and similarity measures: vector space model (Salton et al., 75); prob. distr. model (Wong & Yao, 89)
Probability of relevance, P(r=1|q,d), r ∈ {0,1}: regression model (Fox, 83); generative models, via doc generation (classical prob. model, Robertson & Sparck Jones, 76) or query generation (LM approach, Ponte & Croft, 98; Lafferty & Zhai, 01a)
Probabilistic inference, P(d -> q) or P(q -> d), with different inference systems: inference network model (Turtle & Croft, 91); prob. concept space model (Wong & Yao, 95)]

Today's lecture: the vector space model.

Page 18

Relevance = Similarity

Assumptions:
Query and document are represented similarly, so a query can be regarded as a "document"
Relevance(d,q) ∝ similarity(d,q)

R(q) = {d ∈ C | f(d,q) > θ}, where f(q,d) = sim(Rep(q), Rep(d)) and θ is a cutoff threshold

Key issues:
How to represent the query/document? The Vector Space Model (VSM)
How to define the similarity measure?


Page 20

Vector Space Model (VSM)

The vector space model is one of the most widely used models for ad-hoc retrieval.

It is used in information filtering, information retrieval, indexing, and relevancy ranking.

Page 21

VSM

Represent a doc/query by a term vector
Term: a basic concept, e.g., a word or phrase
Each term defines one dimension; N terms define a high-dimensional space
E.g., d = (x1, ..., xN), where xi is the "importance" of term i

Measure relevance by the distance between the query vector and the document vector in the vector space.

Page 22

VS Model: illustration

[Illustration: documents D1-D11 and a query plotted in a three-dimensional term space with axes "Java", "Microsoft", and "Starbucks"; the documents closest to the query vector are the candidates to return.]

Page 23

Some Issues about VS Model

There is no consistent definition of the "basic concept" (term).

How to assign weights to words is not fully determined; the weight in a query indicates the importance of the term.


Page 25

How to Assign Weights?

Different terms have different importance in a text.

A term weighting scheme plays an important role in the similarity measure: higher weight = greater impact.

We now turn to the question of how to weight words in the vector space model.

Page 26

There are three components in a weighting scheme:

gi: the global weight of the ith term
tij: the local weight of the ith term in the jth document
dj: the normalization factor for the jth document


Page 28

TF Weighting

Idea: a term is more important if it occurs more frequently in a document.

Formulas: let f(t,d) be the frequency count of term t in document d
Raw TF: TF(t,d) = f(t,d)
Log TF: TF(t,d) = log f(t,d)
Maximum frequency normalization: TF(t,d) = 0.5 + 0.5 * f(t,d) / MaxFreq(d)

Normalization of TF is very important!
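A minimal sketch of the three TF variants above (function names are illustrative):

    import math

    # f is the raw frequency count f(t, d).
    def raw_tf(f: int) -> float:
        return float(f)

    def log_tf(f: int) -> float:
        return math.log(f) if f > 0 else 0.0  # only applied when f(t,d) > 0

    def max_norm_tf(f: int, max_freq: int) -> float:
        # 0.5 + 0.5 * f(t,d) / MaxFreq(d), MaxFreq(d) = highest term count in d
        return 0.5 + 0.5 * f / max_freq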


Page 29

TF Methods

Page 30

IDF Weighting

Idea: a term is more discriminative if it occurs in fewer documents.

Formula: IDF(t) = 1 + log(n/k)
n: total number of docs
k: number of docs containing term t (document frequency)
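A one-line sketch of this IDF formula:

    import math

    # IDF(t) = 1 + log(n/k): n = total number of docs, k = number of docs containing t.
    def idf(n: int, k: int) -> float:
        return 1 + math.log(n / k)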


Page 31

IDF Weighting Methods

Page 32

TF Normalization

Why? Document length varies, and "repeated occurrences" are less informative than the "first occurrence".

Two views of document length:
A doc is long because it uses more words
A doc is long because it has more content

Generally penalize long documents, but avoid over-penalizing.

Page 33

TF-IDF Weighting

TF-IDF weighting: weight(t,d) = TF(t,d) * IDF(t)
Common in the document → high TF → high weight
Rare in the collection → high IDF → high weight

Imagine a word count profile: what kind of terms would have high weights?
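A minimal sketch combining raw TF with the IDF formula from the previous slide; the helper name and the dictionary-based sparse vector are assumptions for illustration, not from the slides:

    import math
    from collections import Counter

    # weight(t, d) = TF(t, d) * IDF(t), using raw TF and IDF(t) = 1 + log(n/k).
    def tf_idf_vector(doc_tokens, doc_freq, n_docs):
        tf = Counter(doc_tokens)
        return {t: f * (1 + math.log(n_docs / doc_freq[t])) for t, f in tf.items()}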


Page 34

How to Measure Similarity?

Given a document Di = (wi1, ..., wiN) and a query Q = (wq1, ..., wqN), where a weight is 0 if the term is absent:

Dot product similarity: S(Q, Di) = Σj (wqj * wij), for j = 1..N

Cosine similarity (normalized dot product): sim(Q, Di) = Σj (wqj * wij) / ( sqrt(Σj wqj^2) * sqrt(Σj wij^2) )
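A minimal sketch of both similarity measures on sparse weight vectors stored as {term: weight} dictionaries (this representation is an assumption for illustration):

    import math

    # Absent terms simply have no entry, i.e. weight 0.
    def dot_product(q: dict, d: dict) -> float:
        return sum(w * d.get(t, 0.0) for t, w in q.items())

    def cosine(q: dict, d: dict) -> float:
        norm_q = math.sqrt(sum(w * w for w in q.values()))
        norm_d = math.sqrt(sum(w * w for w in d.values()))
        return dot_product(q, d) / (norm_q * norm_d) if norm_q and norm_d else 0.0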



Page 36

VS Example: Raw TF & Dot Product

query = "information retrieval"

doc1 = {information, retrieval, search, engine, information}
doc2 = {travel, information, map, travel}
doc3 = {government, president, congress}

Term:        Info.    Retrieval  Travel   Map      Search   Engine   Govern.  President  Congress
IDF (fake):  2.4      4.5        2.8      3.3      2.1      5.4      2.2      3.2        4.3
Doc1:        2 (4.8)  1 (4.5)    -        -        1 (2.1)  1 (5.4)  -        -          -
Doc2:        1 (2.4)  -          2 (5.6)  1 (3.3)  -        -        -        -          -
Doc3:        -        -          -        -        -        -        1 (2.2)  1 (3.2)    1 (4.3)
Query:       1 (2.4)  1 (4.5)    -        -        -        -        -        -          -

(each entry is the raw term frequency, with the TF * IDF weight in parentheses)

Sim(q, doc1) = 4.8*2.4 + 4.5*4.5
Sim(q, doc2) = 2.4*2.4
Sim(q, doc3) = 0

Page 37

Example

Q: "gold silver truck"
D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"

Document frequency of the jth term: dfj
Inverse document frequency: idfj = log10(n / dfj)

TF*IDF is used as the term weight here.

Page 38

Example (Cont'd)

Id  Term      df  idf
1   a         3   0
2   arrived   2   0.176
3   damaged   1   0.477
4   delivery  1   0.477
5   fire      1   0.477
6   gold      2   0.176
7   in        3   0
8   of        3   0
9   silver    1   0.477
10  shipment  2   0.176
11  truck     2   0.176

Page 39

Example (Cont'd)

TF*IDF weights are used here.

SC(Q, D1) = (0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) + (0)(0) + (0)(0) = 0.031
SC(Q, D2) = 0.486
SC(Q, D3) = 0.062

The ranking would be D2, D3, D1. This SC uses the dot product.
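A minimal sketch that reproduces these scores; the tokenization (a simple lower-case split) is an assumption, since the slides do not specify any preprocessing:

    import math
    from collections import Counter

    docs = {
        "D1": "shipment of gold damaged in a fire".split(),
        "D2": "delivery of silver arrived in a silver truck".split(),
        "D3": "shipment of gold arrived in a truck".split(),
    }
    query = "gold silver truck".split()

    n = len(docs)
    df = Counter(t for d in docs.values() for t in set(d))        # document frequencies
    idf = {t: math.log10(n / df[t]) for t in df}                  # idf = log10(n / df)

    def tf_idf(tokens):
        return {t: f * idf.get(t, 0.0) for t, f in Counter(tokens).items()}

    q_vec = tf_idf(query)
    for name, d in docs.items():
        d_vec = tf_idf(d)
        sc = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())  # dot product
        print(name, round(sc, 3))  # D1 0.031, D2 0.486, D3 0.062 -> ranking D2, D3, D1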



Page 41

Advantages of VS Model

Empirically effective! (Top TREC performance)
Intuitive
Easy to implement
Well studied / most evaluated
The SMART system: developed at Cornell, 1960-1999, and still widely used

Warning: there are many variants of TF-IDF!

Page 42

Disadvantages of VS Model

Assumes term independence

Assumes the query and documents are represented in the same way

Lots of parameter tuning!


Page 44

Improving the VS Model

We can improve the model by:

Reducing the number of dimensions: eliminating all stop words and very common terms, stemming terms to their roots, Latent Semantic Analysis

Not retrieving documents below a defined cosine threshold

Using normalized frequencies: the normalized frequency of a term i in document j (normalized document frequencies and normalized query frequencies) is given in [1]

Page 45

Stop List

Function words do not bear useful information for IR: of, not, to, or, in, about, with, I, be, ...

Stop list: contains stop words that are not to be used as index terms: prepositions, articles, pronouns, some adverbs and adjectives, some frequent words (e.g., "document")

The removal of stop words usually improves IR effectiveness.

A few "standard" stop lists are commonly used.
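A minimal sketch of stop-word removal with a small hand-picked stop list (illustrative only; a real system would use one of the "standard" stop lists mentioned above):

    STOP_WORDS = {"of", "not", "to", "or", "in", "about", "with", "i", "be", "a", "the"}

    def remove_stop_words(tokens):
        return [t for t in tokens if t.lower() not in STOP_WORDS]

    print(remove_stop_words("shipment of gold damaged in a fire".split()))
    # ['shipment', 'gold', 'damaged', 'fire']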


Page 46

Stemming

Reason: different word forms may bear similar meaning (e.g., search, searching); create a "standard" representation for them.

Stemming: removing some endings of words, e.g., dancer, dancers, dance, danced, dancing → dance

Page 47

Stemming (Cont'd)

Two main methods:

Linguistic/dictionary-based stemming: high stemming accuracy, high implementation and processing costs, and higher coverage

Porter-style stemming: lower stemming accuracy, lower implementation and processing costs, and lower coverage; usually sufficient for IR
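For illustration, a deliberately crude suffix-stripping sketch in the spirit of Porter-style stemming; this is not the actual Porter algorithm, which applies measure-based rewrite rules in several passes:

    # Crude, illustrative suffix stripping (NOT the real Porter stemmer).
    SUFFIXES = ("ings", "ing", "ers", "ies", "ed", "er", "es", "s")

    def crude_stem(word: str) -> str:
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word

    print([crude_stem(w) for w in ["dancer", "dancers", "danced", "dancing"]])
    # ['danc', 'danc', 'danc', 'danc'] -- one shared (non-dictionary) root form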


Page 48

Latent Semantic Indexing (LSI) [3]

Reduces the dimensions of the term-document space

Attempts to solve the synonymy and polysemy problems

Uses Singular Value Decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text

Based on the principle that words that are used in the same contexts tend to have similar meanings.

Page 49

LSI Process

In general, the process involves:
constructing a weighted term-document matrix
performing a Singular Value Decomposition (SVD) on the matrix
using the matrix to identify the concepts contained in the text

LSI statistically analyses the patterns of word usage across the entire document collection.
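A minimal LSI sketch using NumPy's SVD on a toy term-document matrix; the matrix values and the choice of k = 2 concepts are illustrative assumptions:

    import numpy as np

    # Toy (terms x documents) weight matrix; values stand in for TF-IDF-style weights.
    A = np.array([
        [1.0, 0.0, 1.0],   # term "gold"
        [0.0, 2.0, 0.0],   # term "silver"
        [0.0, 1.0, 1.0],   # term "truck"
        [1.0, 0.0, 1.0],   # term "shipment"
    ])

    k = 2                                          # number of latent concepts to keep
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation of A
    doc_concepts = np.diag(s[:k]) @ Vt[:k, :]      # documents in the k-dim concept space
    print(doc_concepts.shape)                      # (2, 3): 2 concepts x 3 documents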


Page 50

References

Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schuetze

https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/2.pdf

https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/ir4up.pdf

https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/e09-3009.pdf

https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/07models-vsm.pdf

https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/03vectorspaceimplementation-6per.pdf

https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture02.ppt

https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/vector_space_model-updated.ppt

https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture_13_ir_and_vsm_.ppt

Document Classification based on Wikipedia Content, http://www.iicm.tugraz.at/cguetl/courses/isr/opt/classification/Vector_Space_Model.html?timestamp=1318275702299

Page 51

Thanks For Your Attention ...