Information Retrieval and Vector Space Model
Computational Linguistics Course
Instructor: Professor Cercone
Presenter: Morteza Zihayat
Outline
• Introduction to IR
• IR System Architecture
• Vector Space Model (VSM)
• How to Assign Weights?
• TF-IDF Weighting
• Example
• Advantages and Disadvantages of VS Model
• Improving the VS Model
Introduction to IR
The world's total yearly production of unique information stored in the form of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth.
(Lyman & Varian, 2000)
Growth of textual information
How can we help manage and exploit all the information?
[Figure: growing sources of textual information: literature, email, WWW, desktop, news, intranet, blog]
Information overflow
What is Information Retrieval (IR)?
Narrow sense:
• IR = search engine technologies (IR = Google, library info systems)
• IR = text matching/classification
Broad sense: IR = text information management
• General problem: how to manage text information?
• How to find useful information? (retrieval)
  ◦ Example: Google
• How to organize information? (text classification)
  ◦ Example: automatically assign emails to different folders
• How to discover knowledge from text? (text mining)
  ◦ Example: discover correlation of events
Formalizing IR Tasks
• Vocabulary: V = {w1, w2, …, wT} of a language
• Query: q = q1, q2, …, qm where qi ∈ V
• Document: di = di1, di2, …, dimi where dij ∈ V
• Collection: C = {d1, d2, …, dN}
• Relevant document set: R(q) ⊆ C
  ◦ Generally unknown and user-dependent
  ◦ Query provides a “hint” on which documents should be in R(q)
• IR: find the approximate relevant document set R’(q)
Source: This slide is borrowed from [1]
Evaluation measures
The quality of many retrieval systems depends on how well they manage to rank relevant documents.
How can we evaluate rankings in IR?
• IR researchers have developed evaluation measures specifically designed to evaluate rankings.
• Most of these measures combine precision and recall in a way that takes account of the ranking.
Precision & Recall
Source: This slide is borrowed from [1]
In other words:
Precision is the percentage of relevant items in the returned set
Recall is the percentage of all relevant documents in the collection that is in the returned set.
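The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `precision_recall` and the document ids are hypothetical, and both measures are returned as fractions rather than percentages.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a returned set against the relevant set."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # relevant documents that were actually returned
    # Guard against empty sets; returning 0.0 in that case is a convention chosen here.
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents returned, 3 relevant documents in the collection.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
print(p, r)
```

Here two of the four returned documents are relevant, and two of the three relevant documents were returned, so precision and recall differ even though the number of hits is the same.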
Evaluating Retrieval Performance
Source: This slide is borrowed from [1]
IR System Architecture
[Figure: IR system architecture. Components: INTERFACE (user query, results, judgments), INDEXING (docs to DocRep), SEARCHING (QueryRep and DocRep to Ranking), QUERY MODIFICATION (Feedback)]
Indexing Document
Searching
Given a query, score documents efficiently.
The basic question: given a query, how do we know if document A is more relevant than B?
• If document A uses more query words than document B
• If word usage in document A is more similar to that in the query
• …
We should find a way to compute relevance between query and documents.
The Notion of Relevance
Relevance
• Similarity: Δ(Rep(q), Rep(d)), with different rep & similarity
  ◦ Vector space model (Salton et al., 75)
  ◦ Prob. distr. model (Wong & Yao, 89)
  ◦ …
• Probability of relevance: P(r=1|q,d), r ∈ {0,1}
  ◦ Regression model (Fox 83)
  ◦ Generative model
    - Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
    - Query generation: LM approach (Ponte & Croft, 98) (Lafferty & Zhai, 01a)
• Probabilistic inference: P(d→q) or P(q→d), with different inference systems
  ◦ Prob. concept space model (Wong & Yao, 95)
  ◦ Inference network model (Turtle & Croft, 91)
Today’s lecture: the vector space model
Relevance = Similarity
Assumptions
• Query and document are represented similarly
• A query can be regarded as a “document”
• Relevance(d,q) ⇔ similarity(d,q)
R(q) = {d ∈ C | f(d,q) > θ}, f(q,d) = Δ(Rep(q), Rep(d))
Key issues
• How to represent query/document? Vector Space Model (VSM)
• How to define the similarity measure Δ?
Vector Space Model (VSM)
• The vector space model is one of the most widely used models for ad-hoc retrieval
• Used in information filtering, information retrieval, indexing and relevancy rankings.
VSM
• Represent a doc/query by a term vector
  ◦ Term: basic concept, e.g., word or phrase
  ◦ Each term defines one dimension
  ◦ N terms define a high-dimensional space
  ◦ E.g., d = (x1, …, xN), where xi is the “importance” of term i
• Measure relevance by the distance between the query vector and document vector in the vector space
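As a rough sketch of the representation described above, the following Python maps a text to a vector with one dimension per term. The helper name `term_vector`, the three-term vocabulary, and the use of raw counts as the "importance" xi are assumptions made for illustration only; the slides leave the choice of weights open at this point.

```python
from collections import Counter


def term_vector(text, vocabulary):
    """Map a text to a vector with one dimension per vocabulary term."""
    counts = Counter(text.lower().split())
    # Assumption: the raw occurrence count stands in for the "importance" x_i.
    return [counts[term] for term in vocabulary]


vocabulary = ["java", "microsoft", "starbucks"]  # hypothetical 3-term space
doc = term_vector("Java Java Starbucks", vocabulary)
query = term_vector("java microsoft", vocabulary)
print(doc, query)
```

Because the query goes through the same function as the document, both end up as points in the same space, which is what makes a distance or similarity between them meaningful.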
VS Model: illustration
[Figure: documents D1 to D11 and the query plotted in a three-dimensional term space with axes Java, Microsoft and Starbucks; question marks beside D1, D2 and D3]
Some Issues about VS Model
• There is no consistent definition of the basic concept (term)
• How to assign weights to words has not been determined
  ◦ Weight in query indicates importance of term
How to Assign Weights?
• Different terms have different importance in a text
• A term weighting scheme plays an important role for the similarity measure
  ◦ Higher weight = greater impact
• We now turn to the question of how to weight words in the vector space model.
There are three components in a weighting scheme:
• gi: the global weight of the ith term
• tij: the local weight of the ith term in the jth document
• dj: the normalization factor for the jth document
TF Weighting
Idea: A term is more important if it occurs more frequently in a document
Formulas: Let f(t,d) be the frequency count of term t in doc d
• Raw TF: TF(t,d) = f(t,d)
• Log TF: TF(t,d) = log f(t,d)
• Maximum frequency normalization: TF(t,d) = 0.5 + 0.5*f(t,d)/MaxFreq(d)
Normalization of TF is very important!
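The three TF formulas can be written out as a short Python sketch. The function names are hypothetical, the natural logarithm is an assumption (the slide does not state a base), and returning 0 for an absent term in the log variant is a convention chosen here because log f(t,d) is undefined when f(t,d) = 0.

```python
import math
from collections import Counter


def raw_tf(term, doc_tokens):
    """Raw TF: TF(t,d) = f(t,d)."""
    return Counter(doc_tokens)[term]


def log_tf(term, doc_tokens):
    """Log TF: TF(t,d) = log f(t,d); 0.0 for absent terms (assumption)."""
    f = raw_tf(term, doc_tokens)
    return math.log(f) if f > 0 else 0.0


def max_norm_tf(term, doc_tokens):
    """Maximum frequency normalization: 0.5 + 0.5 * f(t,d) / MaxFreq(d)."""
    counts = Counter(doc_tokens)  # assumes a non-empty document
    return 0.5 + 0.5 * counts[term] / max(counts.values())


doc = "information retrieval search engine information".split()
print(raw_tf("information", doc), log_tf("information", doc), max_norm_tf("search", doc))
```

Note that the plain log formula gives 0 for a term that occurs exactly once, the same value this sketch uses for an absent term; that is one reason many variants of the log formula exist.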
TF Methods
IDF Weighting
Idea: A term is more discriminative if it occurs in fewer documents
Formula: IDF(t) = 1 + log(n/k)
• n: total number of docs
• k: number of docs with term t (doc freq)
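A minimal Python sketch of the IDF formula follows. The function name `idf`, the toy collection, and the natural logarithm are assumptions (the slide does not fix the log base); raising an error for a term that occurs in no document is also a choice made here, since n/k is undefined when k = 0.

```python
import math


def idf(term, docs):
    """IDF(t) = 1 + log(n/k): n docs in total, k docs containing the term."""
    n = len(docs)
    k = sum(1 for tokens in docs if term in tokens)
    if k == 0:
        raise ValueError(f"term {term!r} does not occur in the collection")
    return 1 + math.log(n / k)


# Hypothetical toy collection of three tokenized documents.
docs = [
    ["information", "retrieval", "search", "engine", "information"],
    ["travel", "information", "map", "travel"],
    ["government", "president", "congress"],
]
print(idf("information", docs), idf("map", docs))
```

"map" occurs in one document and "information" in two, so "map" receives the higher IDF, matching the idea that rarer terms are more discriminative.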
IDF weighting Methods
TF Normalization
Why?
• Document length variation
• “Repeated occurrences” are less informative than the “first occurrence”
Two views of document length
• A doc is long because it uses more words
• A doc is long because it has more contents
Generally penalize long docs, but avoid over-penalizing
TF-IDF Weighting
TF-IDF weighting: weight(t,d) = TF(t,d) * IDF(t)
• Common in doc → high tf → high weight
• Rare in collection → high idf → high weight
Imagine a word count profile: what kind of terms would have high weights?
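One way to answer the question above is to compute the weights for a tiny collection. The sketch below combines raw TF with IDF(t) = 1 + log(n/k) from the earlier slides; the function name `tfidf_weights`, the natural logarithm, and the three toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document, weight = TF * IDF."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # raw TF
        weights.append({t: tf[t] * (1 + math.log(n / df[t])) for t in tf})
    return weights


# Hypothetical collection: "the" is everywhere, "apple" is frequent in one doc only.
docs = [["apple", "apple", "the"], ["the", "banana"], ["the", "cherry"]]
weights = tfidf_weights(docs)
print(weights[0])
```

In the first document "apple" is both frequent locally and rare in the collection, so it outweighs "the", which occurs in every document and gets the minimum IDF of 1.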
How to Measure Similarity?
$$D_i = (w_{i1}, \ldots, w_{iN})$$
$$Q = (w_{q1}, \ldots, w_{qN}), \qquad w = 0 \text{ if a term is absent}$$
Dot product similarity:
$$SC(Q, D_i) = \sum_{j=1}^{N} w_{qj}\, w_{ij}$$
Cosine (normalized dot product):
$$sim(Q, D_i) = \frac{\sum_{j=1}^{N} w_{qj}\, w_{ij}}{\sqrt{\sum_{j=1}^{N} (w_{qj})^2}\; \sqrt{\sum_{j=1}^{N} (w_{ij})^2}}$$
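Both measures are easy to sketch in Python once query and document are lists of weights over the same N terms. The function names and the example vectors are hypothetical; returning 0.0 when either vector is all zeros is a convention chosen here to avoid dividing by zero.

```python
import math


def dot(q, d):
    """Dot product similarity: sum over j of w_qj * w_ij."""
    return sum(wq * wd for wq, wd in zip(q, d))


def cosine(q, d):
    """Cosine similarity: the dot product normalized by both vector lengths."""
    norm = math.sqrt(dot(q, q)) * math.sqrt(dot(d, d))
    return dot(q, d) / norm if norm else 0.0


q = [1, 1, 0]
d1 = [2, 0, 1]
d2 = [4, 0, 2]  # same direction as d1, twice as long
print(dot(q, d1), dot(q, d2), cosine(q, d1), cosine(q, d2))
```

The dot product doubles when the document vector is doubled, while the cosine stays the same, which is why the cosine is less sensitive to document length.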
VS Example: Raw TF & Dot Product
query = “information retrieval”
doc1: information retrieval search engine information
doc2: travel information map travel
doc3: government president congress
…

Each cell shows the raw TF with the TF*IDF weight in parentheses; a dash marks an absent term.

            Info.   Retrieval  Travel  Map     Search  Engine  Govern.  President  Congress
IDF (fake)  2.4     4.5        2.8     3.3     2.1     5.4     2.2      3.2        4.3
Doc1        2(4.8)  1(4.5)     -       -       1(2.1)  1(5.4)  -        -          -
Doc2        1(2.4)  -          2(5.6)  1(3.3)  -       -       -        -          -
Doc3        -       -          -       -       -       -       1(2.2)   1(3.2)     1(4.3)
Query       1(2.4)  1(4.5)     -       -       -       -       -        -          -

Sim(q,doc1) = 4.8*2.4 + 4.5*4.5
Sim(q,doc2) = 2.4*2.4
Sim(q,doc3) = 0
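The example can be replayed in Python using the raw term counts and the fake IDF values given on the slide. The dictionary layout and the helper names `weights` and `sim` are assumptions for illustration; the numbers themselves come from the example.

```python
# Fake IDF values from the example.
idf = {"info": 2.4, "retrieval": 4.5, "travel": 2.8, "map": 3.3, "search": 2.1,
       "engine": 5.4, "govern": 2.2, "president": 3.2, "congress": 4.3}

# Raw term frequencies per document and for the query.
tf = {
    "doc1": {"info": 2, "retrieval": 1, "search": 1, "engine": 1},
    "doc2": {"info": 1, "travel": 2, "map": 1},
    "doc3": {"govern": 1, "president": 1, "congress": 1},
}
query_tf = {"info": 1, "retrieval": 1}


def weights(term_counts):
    """Weight each term by raw TF * IDF."""
    return {t: c * idf[t] for t, c in term_counts.items()}


def sim(q, d):
    """Dot product over the query terms; absent terms contribute 0."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())


q = weights(query_tf)
scores = {name: sim(q, weights(counts)) for name, counts in tf.items()}
print(scores)
```

doc1 scores highest because it matches both query terms, doc2 matches only "information", and doc3 shares no term with the query.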
Example
Q: “gold silver truck”
• D1: “Shipment of gold damaged in a fire”
• D2: “Delivery of silver arrived in a silver truck”
• D3: “Shipment of gold arrived in a truck”
• Document frequency of the jth term: dfj
• Inverse document frequency: idfj = log10(n / dfj), where n is the number of documents
Tf*idf is used as term weight here
Example (Cont’d)
Id  Term      df  idf
1   a         3   0
2   arrived   2   0.176
3   damaged   1   0.477
4   delivery  1   0.477
5   fire      1   0.477
6   gold      2   0.176
7   in        3   0
8   of        3   0
9   silver    1   0.477
10  shipment  2   0.176
11  truck     2   0.176
Example (Cont’d)
Tf*idf is used here
SC(Q, D1) = (0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) + (0)(0) + (0)(0) + (0.477)(0) + (0)(0.176) + (0.176)(0) = 0.031
SC(Q, D2) = 0.486
SC(Q, D3) = 0.062
The ranking would be D2, D3, D1.
• This SC uses the dot product.
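The whole example can be checked with a short Python sketch that computes df, idf = log10(n/df), tf*idf weights, and the dot-product score. Variable and function names are hypothetical, and D1 is written with "damaged", the word that the term table lists; the word does not occur in the query, so it has no effect on the scores.

```python
import math
from collections import Counter

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

tokens = {name: text.split() for name, text in docs.items()}
n = len(tokens)
# Document frequency: in how many documents each term occurs.
df = Counter(t for words in tokens.values() for t in set(words))
idf = {t: math.log10(n / k) for t, k in df.items()}


def weights(words):
    """tf*idf weight for every term in a list of words."""
    tf = Counter(words)
    return {t: c * idf.get(t, 0.0) for t, c in tf.items()}


def sc(q, d):
    """Similarity coefficient: dot product of query and document weights."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())


q = weights(query.split())
scores = {name: sc(q, weights(words)) for name, words in tokens.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print({name: round(s, 3) for name, s in scores.items()}, ranking)
```

D2 wins largely because "silver" is rare in the collection and occurs twice in that document, so its weight enters the dot product as 0.477 times 0.954.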
Advantages of VS Model
• Empirically effective! (Top TREC performance)
• Intuitive
• Easy to implement
• Well-studied/most evaluated
• The SMART system
  ◦ Developed at Cornell: 1960-1999
  ◦ Still widely used
• Warning: Many variants of TF-IDF!
Disadvantages of VS Model
• Assumes term independence
• Assumes query and document to be the same
• Lots of parameter tuning!
Improving the VSM Model
We can improve the model by:
• Reducing the number of dimensions
  ◦ eliminating all stop words and very common terms
  ◦ stemming terms to their roots
  ◦ Latent Semantic Analysis
• Not retrieving documents below a defined cosine threshold
• Normalized frequency of a term i in document j is given by [1]:
  [Formulas: Normalized Document Frequencies; Normalized Query Frequencies]
Stop List
• Function words do not bear useful information for IR: of, not, to, or, in, about, with, I, be, …
• Stop list: contains stop words, not to be used as index
  ◦ Prepositions
  ◦ Articles
  ◦ Pronouns
  ◦ Some adverbs and adjectives
  ◦ Some frequent words (e.g. document)
• The removal of stop words usually improves IR effectiveness
• A few “standard” stop lists are commonly used.
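Stop-word removal can be sketched in Python as a simple filter before indexing. The stop list below is hypothetical: it uses the words quoted on the slide plus the article "a", and is not one of the standard lists; whitespace tokenization and lowercasing are also simplifying assumptions.

```python
# Hypothetical stop list: the slide's examples plus the article "a".
STOP_WORDS = {"of", "not", "to", "or", "in", "about", "with", "i", "be", "a"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, and drop words found in the stop list."""
    return [w for w in text.lower().split() if w not in stop_words]


index_terms = remove_stop_words("Shipment of gold arrived in a truck")
print(index_terms)
```

Only the content-bearing words remain as index terms, which also shrinks the number of dimensions in the vector space.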
Stemming
Reason:
◦ Different word forms may bear similar meaning (e.g. search, searching): create a “standard” representation for them
Stemming:
◦ Removing some endings of word
Example: dancer, dancers, dance, danced, dancing → dance
Stemming (Cont’d)
Two main methods:
• Linguistic/dictionary-based stemming
  ◦ high stemming accuracy
  ◦ high implementation and processing costs and higher coverage
• Porter-style stemming
  ◦ lower stemming accuracy
  ◦ lower implementation and processing costs and lower coverage
  ◦ Usually sufficient for IR
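To make "removing some endings" concrete, here is a toy suffix-stripping stemmer in Python. It is only a sketch in the spirit of rule-based stemming and is not the Porter algorithm: the suffix list, the minimum stem length, and the function name `naive_stem` are all assumptions made for illustration.

```python
# Hypothetical suffix list, checked in order so longer endings are tried first.
SUFFIXES = ("ers", "ing", "er", "ed", "es", "s", "e")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


forms = ["dancer", "dancers", "dance", "danced", "dancing"]
print({w: naive_stem(w) for w in forms})
print(naive_stem("search"), naive_stem("searching"))
```

All five forms collapse to the same stem, which is what matters for matching; the stem itself ("danc") does not need to be a dictionary word.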
Latent Semantic Indexing (LSI) [3]
• Reduces the dimensions of the term-document space
• Attempts to solve the synonymy and polysemy problems
• Uses Singular Value Decomposition (SVD)
  ◦ identifies patterns in the relationships between the terms and concepts contained in an unstructured collection of text
• Based on the principle that words that are used in the same contexts tend to have similar meanings.
LSI Process
In general, the process involves:
• constructing a weighted term-document matrix
• performing a Singular Value Decomposition on the matrix
• using the matrix to identify the concepts contained in the text
LSI statistically analyses the patterns of word usage across the entire document collection
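The SVD step can be written compactly in standard notation. The symbols here are assumptions not used on the slides: A is the weighted term-document matrix (terms as rows, documents as columns), k is the number of concepts kept, and the last formula is the usual way of folding a query vector q into the reduced space.

```latex
A = U \Sigma V^{\top}
\qquad\text{(full SVD: } U, V \text{ orthonormal, } \Sigma \text{ diagonal with singular values)}

A_k = U_k \Sigma_k V_k^{\top}
\qquad\text{(keep only the } k \text{ largest singular values)}

\hat{q} = \Sigma_k^{-1} U_k^{\top} q
\qquad\text{(query mapped into the } k\text{-dimensional concept space)}
```

Documents and the folded-in query can then be compared with the cosine measure in k dimensions instead of in the full term space.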
References
Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schuetze
https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/2.pdf
https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/ir4up.pdf
https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/e09-3009.pdf
https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/07models-vsm.pdf
https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/03vectorspaceimplementation-6per.pdf
https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture02.ppt
https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/vector_space_model-updated.ppt
https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture_13_ir_and_vsm_.ppt
Document Classification based on Wikipedia Content, http://www.iicm.tugraz.at/cguetl/courses/isr/opt/classification/Vector_Space_Model.html?timestamp=1318275702299
Thanks For Your Attention ….