
Page 1: Vector Space Model

Vector Space Model

CS 652 Information Extraction and Integration

Page 2: Vector Space Model

Introduction

[Diagram: documents are represented by index terms; the user's information need is expressed as a query; the query is matched against the document representations and the results are ranked]

Page 3: Vector Space Model

Introduction

A ranking is an ordering of the retrieved documents that (hopefully) reflects their relevance to the user query

A ranking is based on fundamental premises regarding the notion of relevance, such as:

common sets of index terms

sharing of weighted terms

likelihood of relevance

Each set of premises leads to a distinct IR model

Page 4: Vector Space Model

IR Models

User task: retrieval or browsing

Classic Models: Boolean, Vector (Space), Probabilistic

Set Theoretic: Fuzzy, Extended Boolean

Algebraic: Generalized Vector (Space), Latent Semantic Indexing, Neural Networks

Probabilistic: Inference Network, Belief Network

Structured Models: Non-Overlapping Lists, Proximal Nodes

Browsing: Flat, Structure Guided, Hypertext

Page 5: Vector Space Model


Basic Concepts

Each document is described by a set of representative keywords or index terms

Index terms are document words (e.g., nouns) whose meaning by itself helps capture the main themes of a document

However, search engines assume that all words are index terms (full text representation)
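As a minimal illustration of the full-text representation, where every word of the document becomes an index term (the tokenization choices here are assumptions, not from the slides):

import re

def index_terms(text: str) -> set[str]:
    """All words of the document, lowercased, as its set of index terms."""
    return set(re.findall(r"[a-z]+", text.lower()))

print(index_terms("The silver truck arrived at the silver mine"))
# e.g. {'the', 'silver', 'truck', 'arrived', 'at', 'mine'} (set order varies)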

Page 6: Vector Space Model

Basic Concepts

Not all terms are equally useful for representing the document contents

The importance of the index terms is represented by weights associated with them

Let

k_i be an index term

d_j be a document

w_ij be a weight associated with the pair (k_i, d_j), which quantifies the importance of k_i for describing the contents of d_j

Page 7: Vector Space Model

The Vector (Space) Model

Define:

w_ij > 0 whenever k_i appears in d_j

w_iq ≥ 0, associated with the pair (k_i, q)

vec(d_j) = (w_1j, w_2j, ..., w_tj), the document vector of d_j

vec(q) = (w_1q, w_2q, ..., w_tq), the query vector of q

The unit vectors vec(k_i) of the index terms are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)

Queries and documents are represented as weighted vectors

Page 8: Vector Space Model

The Vector (Space) Model

sim(q, d_j) = cos(θ) = (vec(d_j) • vec(q)) / (|d_j| |q|)
            = Σ_{i=1..t} w_ij w_iq / (√(Σ_{i=1..t} w_ij²) √(Σ_{i=1..t} w_iq²))

where • is the inner-product operator, θ is the angle between vec(d_j) and vec(q), and |q| is the length of q

Since w_ij ≥ 0 and w_iq ≥ 0, 0 ≤ sim(q, d_j) ≤ 1

A document is retrieved even if it matches the query terms only partially

[Figure: document vector vec(d_j) and query vector vec(q) plotted on term axes i and j, with angle θ between them]
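A minimal sketch of this cosine ranking in Python (the toy weight vectors are illustrative, not from the slides):

import math

def cosine_sim(d: list[float], q: list[float]) -> float:
    # sim = (d . q) / (|d| |q|); zero-length vectors get similarity 0
    dot = sum(w_d * w_q for w_d, w_q in zip(d, q))
    len_d = math.sqrt(sum(w * w for w in d))
    len_q = math.sqrt(sum(w * w for w in q))
    if len_d == 0 or len_q == 0:
        return 0.0
    return dot / (len_d * len_q)

# Partial matching: d2 shares only one of the query's two terms,
# yet still receives a nonzero similarity.
q = [1.0, 1.0, 0.0]
d1 = [0.5, 0.8, 0.0]
d2 = [0.0, 0.9, 0.4]
print(cosine_sim(q, d1), cosine_sim(q, d2))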

Page 9: Vector Space Model

The Vector (Space) Model

sim(q, d_j) = Σ_{i=1..t} w_ij w_iq / (|d_j| |q|)

How do we compute the weights w_ij and w_iq?

A good weight must take into account two effects:

quantification of intra-document content (similarity)

tf factor, the term frequency within a document

quantification of inter-document separation (dissimilarity)

idf factor, the inverse document frequency

w_ij = tf(i, j) × idf(i)

Page 10: Vector Space Model

The Vector (Space) Model

Let

N be the total number of documents in the collection

n_i be the number of documents which contain k_i

freq(i, j) be the raw frequency of k_i within d_j

A normalized tf factor is given by

f(i, j) = freq(i, j) / max_l freq(l, j)

where the maximum is computed over all terms which occur within the document d_j

The inverse document frequency (idf) factor is

idf(i) = log(N / n_i)

The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with term k_i.
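A sketch of these two factors on a toy collection (the documents and tokenization are illustrative):

import math
from collections import Counter

docs = ["new york times", "new york post", "los angeles times"]
N = len(docs)

def tf_factor(term: str, doc: str) -> float:
    # f(i, j) = freq(i, j) / max_l freq(l, j)
    freq = Counter(doc.split())
    return freq[term] / max(freq.values())

def idf_factor(term: str) -> float:
    # idf(i) = log(N / n_i); assumes the term occurs in at least one document
    n_i = sum(1 for d in docs if term in d.split())
    return math.log(N / n_i)

print(tf_factor("new", docs[0]))   # 1.0
print(idf_factor("times"))         # log(3/2), about 0.405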

Page 11: Vector Space Model

The Vector (Space) Model

The best term-weighting schemes use weights which are given by

w_ij = f(i, j) × log(N / n_i)

This strategy is called a tf-idf weighting scheme

For the query term weights, a suggestion is

w_iq = (0.5 + 0.5 × freq(i, q) / max_l freq(l, q)) × log(N / n_i)

The vector model with tf-idf weights is a good ranking strategy with general collections

The VSM is usually as good as the known ranking alternatives. It is also simple and fast to compute
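A sketch of the two weighting formulas above, with f and idf as defined on the previous slide (the example values are toy, not from the slides):

import math

def doc_weight(f_ij: float, N: int, n_i: int) -> float:
    # w_ij = f(i, j) * log(N / n_i)
    return f_ij * math.log(N / n_i)

def query_weight(freq_iq: int, max_freq_q: int, N: int, n_i: int) -> float:
    # w_iq = (0.5 + 0.5 * freq(i, q) / max_l freq(l, q)) * log(N / n_i)
    return (0.5 + 0.5 * freq_iq / max_freq_q) * math.log(N / n_i)

# A term occurring once in the query and in 2 of 1000 documents:
print(doc_weight(1.0, 1000, 2), query_weight(1, 1, 1000, 2))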

Page 12: Vector Space Model

The Vector (Space) Model

Advantages:

term-weighting improves quality of the answer set

partial matching allows retrieval of documents that approximate the query conditions

cosine ranking formula sorts documents according to degree of similarity to the query

A popular IR model because of its simplicity & speed

Disadvantages:

assumes mutual independence of index terms; it is not clear that this is a bad assumption in practice, though

Page 13: Vector Space Model

Naïve Bayes Classifier

CS 652 Information Extraction and Integration

Page 14: Vector Space Model


Bayes Theorem

The basic starting point for inference problems using probability theory as logic
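Stated in the usual notation:

P(h | D) = P(D | h) P(h) / P(D)

where P(h) is the prior probability of hypothesis h, P(D) is the prior probability of observing data D, and P(h | D) is the posterior probability of h given D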

Page 15: Vector Space Model


Bayes Theorem

A worked example: does a patient have cancer, given that a lab test comes back positive?

P(cancer) = 0.008, P(¬cancer) = 0.992
P(+ | cancer) = 0.98, P(− | cancer) = 0.02
P(+ | ¬cancer) = 0.03, P(− | ¬cancer) = 0.97

P(+ | cancer) P(cancer) = 0.98 × 0.008 = 0.0078
P(+ | ¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298

Since 0.0298 > 0.0078, the MAP hypothesis is ¬cancer; normalizing gives P(cancer | +) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21
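Checking the arithmetic numerically:

p_cancer = 0.008
p_pos_given_cancer = 0.98
p_pos_given_not = 0.03

joint_cancer = p_pos_given_cancer * p_cancer     # 0.00784 (slide rounds to 0.0078)
joint_not = p_pos_given_not * (1 - p_cancer)     # 0.02976 (slide rounds to 0.0298)
print(round(joint_cancer / (joint_cancer + joint_not), 3))  # 0.209, i.e. about 0.21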

Page 16: Vector Space Model


Basic Formulas for Probabilities
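Presumably the standard identities usually grouped under this heading:

Product rule: P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A)

Sum rule: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

Theorem of total probability: if A_1, ..., A_n are mutually exclusive and Σ_i P(A_i) = 1, then P(B) = Σ_i P(B | A_i) P(A_i)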

Page 17: Vector Space Model


Naïve Bayes Classifier

Page 18: Vector Space Model


Naïve Bayes Classifier

Page 19: Vector Space Model


Naïve Bayes Classifier
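In its standard formulation, the classifier picks the maximum a posteriori target value and assumes the attributes a_1, ..., a_n are conditionally independent given the class:

v_MAP = argmax_{v_j ∈ V} P(a_1, ..., a_n | v_j) P(v_j)

v_NB = argmax_{v_j ∈ V} P(v_j) Π_i P(a_i | v_j)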

Page 20: Vector Space Model


Naïve Bayes Algorithm

Page 21: Vector Space Model


Naïve Bayes Subtleties

Page 22: Vector Space Model


Naïve Bayes Subtleties

m-estimate of probability
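In its usual form, the m-estimate replaces the raw frequency ratio n_c / n with

(n_c + m p) / (n + m)

where n is the number of training examples, n_c is the number of those matching the target value, p is a prior estimate of the probability, and m (the equivalent sample size) controls how heavily the prior is weighted. Choosing p = 1 / |Vocabulary| and m = |Vocabulary| yields the Laplace estimate (n_c + 1) / (n + |Vocabulary|)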

Page 23: Vector Space Model


Learning to Classify Text

Classify text into manually defined groups
Estimate probability of class membership
Rank by relevance
Discover groupings and relationships

– between texts
– between real-world entities mentioned in text

Page 24: Vector Space Model


Learn_Naïve_Bayes_Text(Example, V)

Page 25: Vector Space Model


Calculate_Probability_Terms

Page 26: Vector Space Model


Classify_Naïve_Bayes_Text(Doc)
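A sketch of what these two procedures typically do, following the textbook (Mitchell-style) treatment; the original slide pseudocode is not reproduced here, so names and details are illustrative assumptions:

import math
from collections import Counter

def learn_naive_bayes_text(examples):
    """examples: list of (list_of_words, label) pairs.
    Returns class priors and smoothed per-class word probabilities."""
    vocab = {w for words, _ in examples for w in words}
    labels = {label for _, label in examples}
    priors, cond = {}, {}
    for v in labels:
        docs_v = [words for words, label in examples if label == v]
        priors[v] = len(docs_v) / len(examples)
        counts = Counter(w for words in docs_v for w in words)
        n = sum(counts.values())
        # Laplace / m-estimate smoothing: P(w | v) = (n_k + 1) / (n + |vocab|)
        cond[v] = {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}
    return priors, cond, vocab

def classify_naive_bayes_text(words, priors, cond, vocab):
    """Return argmax_v P(v) * prod_i P(w_i | v), computed in log space."""
    def score(v):
        s = math.log(priors[v])
        for w in words:
            if w in vocab:  # words never seen in training are skipped
                s += math.log(cond[v][w])
        return s
    return max(priors, key=score)

train = [("free money now".split(), "spam"),
         ("meeting agenda attached".split(), "ham")]
model = learn_naive_bayes_text(train)
print(classify_naive_bayes_text("free money".split(), *model))  # spam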

Page 27: Vector Space Model


How to Improve

More training data
Better training data
Better text representation

– usual IR tricks (term weighting, etc.)
– manually construct good predictor features

Hand off hard cases to a human being