Växjö, 23 Jan. 2004


Evaluation of Vector Space Models Obtained by Latent Semantic Indexing

Leif Grönqvist (leifg@ling.gu.se)
Växjö University (Mathematics and Systems Engineering)
GSLT (Graduate School of Language Technology)
Göteborg University (Department of Linguistics)


Outline of the talk

- Vector space models in IR (reminder since last seminar)
  - The traditional model
  - Latent semantic indexing (LSI)
  - Singular value decomposition (SVD)
- Evaluation
  - Why
  - How & data sources


The traditional vector model

- One dimension for each index term
- A document is a vector in a very high dimensional space
- The similarity between a document and a query is:

$$\mathrm{sim}(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \, |\vec{q}|}$$

- Gives us a degree of similarity instead of yes/no as for basic keyword search
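As a minimal sketch of the similarity measure on this slide, the snippet below computes the cosine between a document vector and a query vector with NumPy. The index terms, the count vectors, and the convention of returning 0 for an all-zero vector are assumptions made for illustration only.

```python
import numpy as np


def cosine_similarity(d: np.ndarray, q: np.ndarray) -> float:
    """Cosine of the angle between document vector d and query vector q."""
    norm_product = np.linalg.norm(d) * np.linalg.norm(q)
    if norm_product == 0.0:
        return 0.0  # assumed convention for an all-zero vector
    return float(np.dot(d, q) / norm_product)


# Hypothetical index terms: ["tennis", "match", "election", "party"]
doc_sports = np.array([2.0, 1.0, 0.0, 0.0])
doc_politics = np.array([0.0, 0.0, 3.0, 1.0])
query = np.array([1.0, 0.0, 0.0, 0.0])  # the one-word query "tennis"

print(cosine_similarity(doc_sports, query))    # a graded score, not yes/no
print(cosine_similarity(doc_politics, query))  # the two share no index terms
```

The score is graded: a document that mentions the query term among other terms gets a value between 0 and 1 rather than a plain match or no match.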


The traditional vector model, cont.

- Assumption used: all terms are unrelated
- Could be fixed partially using different weights for each term
- Still, we have a lot more dimensions than we want
- How should we decide the index terms?
- Similarity between terms is always 0
- Very similar documents may have sim ≈ 0 if they:
  - use a different vocabulary
  - don’t use the index terms
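The vocabulary problem can be sketched with two invented sentences that say roughly the same thing with different words. The index terms, the texts, and the helper names `term_vector` and `cosine` are hypothetical and serve only to illustrate the point.

```python
import math
from collections import Counter


def term_vector(text: str, index_terms: list[str]) -> list[int]:
    """Count how often each index term occurs in a whitespace-tokenised text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in index_terms]


def cosine(u: list[int], v: list[int]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical index terms; each one is its own, unrelated dimension.
index_terms = ["car", "automobile", "engine", "motor"]

vec_a = term_vector("the car has a new engine", index_terms)
vec_b = term_vector("the automobile has a new motor", index_terms)
sim_ab = cosine(vec_a, vec_b)

print(vec_a, vec_b, sim_ab)
```

Because "car" and "automobile" occupy separate dimensions, the two near-paraphrases share no non-zero coordinate and their similarity is zero.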


Latent semantic indexing (LSI)

- Similar to factor analysis
- Number of dimensions can be chosen as we like
- We make some kind of projection from a vector space with all terms to the smaller dimensionality
- Each dimension is a mix of terms
- Impossible to know the meaning of the dimension


LSI, cont.

- Distance between vectors is cosine just as before
- Meaningful to calculate distance between all terms and/or documents
- How can we do the projection? There are some ways:
  - Singular value decomposition (SVD)
  - Random indexing
  - Neural nets, factor analysis, etc.


Why SVD?

I prefer SVD since:

- Michael W. Berry 1992: “… This important result indicates that A_k is the best k-rank approximation (in a least squares sense) to the matrix A.”
- Leif 2003: What Berry says is that SVD gives the best projection from n to k dimensions, that is, the projection that keeps distances in the best possible way.
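The least-squares property Berry describes (usually known as the Eckart–Young theorem) can be checked numerically. The sketch below uses a random matrix as a stand-in for a term-by-document matrix; the matrix, its size, and the choice k = 2 are assumptions. It compares the truncated SVD with one other rank-k candidate, an orthogonal projection onto k random directions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((8, 6))  # hypothetical 8 terms x 6 documents
k = 2

# Truncated SVD: keep only the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The Frobenius error equals the root of the sum of squared discarded singular values.
svd_error = np.linalg.norm(A - A_k, "fro")
discarded = np.sqrt(np.sum(s[k:] ** 2))

# Another rank-k approximation: project A onto k random orthonormal directions.
Q, _ = np.linalg.qr(rng.standard_normal((8, k)))
A_rand = Q @ (Q.T @ A)
rand_error = np.linalg.norm(A - A_rand, "fro")

print(svd_error, discarded, rand_error)
```

The theorem guarantees that no rank-k matrix has a smaller Frobenius error than `A_k`, so `rand_error` can never be below `svd_error`.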


A small example input to SVD


What SVD gives us

X = T0 S0 D0, where X, T0, S0, and D0 are matrices
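A minimal sketch of the decomposition using NumPy's `numpy.linalg.svd`. The small count matrix is invented, and the names `T0`, `S0`, and `D0t` simply mirror the slide's notation. Note that NumPy returns the document factor already transposed and the singular values as a vector rather than a diagonal matrix.

```python
import numpy as np

# Hypothetical term-by-document count matrix: 5 terms (rows) x 4 documents (columns).
X = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

# Reduced SVD: T0 is 5 x 4, the singular values come as a length-4 vector,
# and D0t (the transposed document factor) is 4 x 4.
T0, singular_values, D0t = np.linalg.svd(X, full_matrices=False)
S0 = np.diag(singular_values)

# Multiplying the three factors gives back the original matrix.
X_rebuilt = T0 @ S0 @ D0t

print(T0.shape, S0.shape, D0t.shape)
print(np.allclose(X_rebuilt, X))
```

The columns of `T0` and the rows of `D0t` are orthonormal, and the singular values are sorted in descending order, which is what makes truncation to the first few dimensions straightforward.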


Using the SVD

- The matrices make it easy to project term and document vectors into an m-dimensional space (m ≤ min(terms, docs)) using ordinary linear algebra
- We can select m easily just by using as many rows/columns of T0, S0, D0 as we want
- It is possible to calculate a new (approximated) X – it will still be a t x d matrix
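The truncation step can be sketched as follows. The random matrix and the sizes t = 6, d = 5, m = 2 are assumptions, and scaling the term and document vectors by the singular values is one common convention rather than necessarily the one used in the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
t, d, m = 6, 5, 2        # hypothetical sizes: 6 terms, 5 documents, 2 dimensions
X = rng.random((t, d))   # stand-in for a t x d term-by-document matrix

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)

# Select m dimensions: first m columns of T0, first m singular values, first m rows of D0t.
T = T0[:, :m]
S = np.diag(s0[:m])
Dt = D0t[:m, :]

term_vectors = T @ S       # one m-dimensional row per term
doc_vectors = (S @ Dt).T   # one m-dimensional row per document

# The approximated X has rank at most m but keeps the original t x d shape.
X_hat = T @ S @ Dt

# A document column can also be projected directly with ordinary linear algebra.
projected_docs = (T.T @ X).T

print(term_vectors.shape, doc_vectors.shape, X_hat.shape)
```

Projecting the original document columns with `T.T @ X` gives the same coordinates as `doc_vectors`, so a new document vector could be placed in the reduced space the same way.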


Some applications

- Automatic generation of a domain specific thesaurus
- Keyword extraction from documents
- Find sets of similar documents in a collection
- Find documents related to a given document or a set of terms


An example based on 50 000 newspaper articles

stefan edberg
  edberg                 0.918
  cincinnatis            0.887
  edbergs                0.883
  världsfemman           0.883
  stefans                0.883
  tennisspelarna         0.863
  stefan                 0.861
  turneringsseger        0.859
  queensturneringen      0.858
  växjöspelaren          0.852
  grästurnering          0.847

bengt johansson
  johansson              0.852
  johanssons             0.704
  bengt                  0.678
  centerledare           0.674
  miljöcentern           0.667
  landsbygdscentern      0.667
  implikationer          0.645
  ickesocialistisk       0.643
  centerledaren          0.627
  regeringsalternativet  0.620
  vagare                 0.616
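Lists like the ones above can be produced by ranking all term vectors by their cosine to a query. The sketch below is only an illustration: the two-dimensional vectors are invented, and averaging the normalised vectors of the query words is an assumed way to handle a two-word query, since the slide does not say how the query was combined.

```python
import numpy as np


def nearest_terms(query, vocab, vectors, top_n=3):
    """Rank vocabulary terms by cosine similarity to the averaged query vector."""
    index = {term: i for i, term in enumerate(vocab)}
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = unit[[index[word] for word in query]].mean(axis=0)  # assumed combination rule
    q = q / np.linalg.norm(q)
    scores = unit @ q
    order = np.argsort(-scores)
    return [(vocab[i], float(scores[i])) for i in order[:top_n]]


# Hypothetical reduced term vectors (2 dimensions instead of a few hundred).
vocab = ["stefan", "edberg", "tennisspelarna", "bengt", "johansson", "centerledare"]
vectors = np.array([
    [0.9, 0.1],
    [1.0, 0.0],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.0, 1.0],
    [0.2, 0.8],
])

result = nearest_terms(["stefan", "edberg"], vocab, vectors)
for term, score in result:
    print(f"{term:<16}{score:.3f}")
```

As in the slide, the query words themselves appear in the ranking, followed by the terms whose vectors point in nearly the same direction.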


Evaluation

- We need evaluation metrics to be able to improve the model!
- How can we evaluate millions of vectors?
  - “similar terms have vectors with a low cosine distance”
  - What is similar?
- Seems impossible to evaluate the model objectively…
- Possible solution: look at specific applications! They may be much easier to evaluate


Applications using the model

Vector models may be evaluated using:

- A typical IR test suite of queries, documents, and relevance information
- Texts with lists of manually selected keywords (multiword units included)
- The Test of English as a Foreign Language (TOEFL), which tests the ability to select synonyms from a set of alternatives

Still subjectivity, but the more the vector model improves these applications the better it is!

Let’s look in detail at the first application
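To make the synonym-test option above concrete, here is a minimal sketch of how a vector model could answer such items: pick the alternative whose vector has the highest cosine to the stem word, then count correct answers. The vectors and the two items are invented for illustration and are not real TOEFL material.

```python
import numpy as np


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def answer_item(stem, alternatives, term_vectors):
    """Choose the alternative most similar to the stem word."""
    return max(alternatives, key=lambda alt: cosine(term_vectors[stem], term_vectors[alt]))


def synonym_test_accuracy(items, term_vectors):
    """Fraction of items where the model's choice equals the gold answer."""
    correct = sum(answer_item(stem, alts, term_vectors) == gold for stem, alts, gold in items)
    return correct / len(items)


# Hypothetical term vectors in a 3-dimensional reduced space.
term_vectors = {
    "big": np.array([1.0, 0.1, 0.0]),
    "large": np.array([0.9, 0.2, 0.1]),
    "tiny": np.array([0.5, 0.1, 0.6]),
    "fast": np.array([0.1, 1.0, 0.0]),
    "quick": np.array([0.2, 0.9, 0.1]),
    "slow": np.array([0.3, 0.6, 0.5]),
    "green": np.array([0.0, 0.1, 1.0]),
}

# Each item: (stem word, alternatives, gold synonym). All invented.
items = [
    ("big", ["large", "green", "slow", "tiny"], "large"),
    ("fast", ["quick", "green", "large", "tiny"], "quick"),
]

accuracy = synonym_test_accuracy(items, term_vectors)
print(accuracy)
```

The resulting accuracy gives a single number that can be compared across model variants, for example across different numbers of dimensions.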


An IR testbed

- There are such testbeds for English, but Swedish has other problems
  - Very different from English
  - Compounds without spaces
  - “New” letters (åäö)
  - Complex morphology
  - Other stop words
  - …


A new Swedish test collection

- A group in Borås is building it
  - Per Ahlgren
  - Johan Eklund
  - Leif Grönqvist
- It will contain
  - Documents
  - Topics
  - Relevance judgments


Document collection

- Newspaper articles from GP and HD
- 161 000 articles, 40 million tokens
- Good to have more than one newspaper:
  - Same content, different author (not always)
- 10% of my newspaper article collection
- Copyright is a problem


Topics

- Borrowed from CLEF
- 52/90, but not the most difficult
- Examples (translated from Swedish):
  - Films by the Kaurismäki brothers
    - Description: Search for information about films directed by either of the two brothers Aki and Mika Kaurismäki.
    - Narrative: Relevant documents name one or more titles of films directed by Aki or Mika Kaurismäki.
  - Finland's first EU commissioner
    - Description: Who was appointed as the first EU commissioner for Finland in the European Union?
    - Narrative: Give the name of Finland's first EU commissioner. Relevant documents may also mention the subject areas of the new commissioner's mandate.


Relevance judgments

- Only a subset for each topic
  - Selected by earlier experiments
  - Similar approach to TREC and CLEF
  - 100 documents for each of 5 strategies: 100 ≤ N ≤ 500
  - Important to include relevant and irrelevant documents
- A scale of relevance proposed by Sormunen:
  - Irrelevant (0)
  - Marginally relevant (1)
  - Fairly relevant (2)
  - Highly relevant (3)
- Manually annotated
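The subset selection described above resembles pooling: the top documents from several retrieval strategies are merged, and only the merged set is judged. The sketch below assumes that reading. The five ranked lists are invented, and the names `build_pool` and `RELEVANCE_SCALE` are hypothetical.

```python
# The four-point scale from the slide, as a lookup table.
RELEVANCE_SCALE = {
    0: "irrelevant",
    1: "marginally relevant",
    2: "fairly relevant",
    3: "highly relevant",
}


def build_pool(ranked_lists, depth=100):
    """Union of the top `depth` document IDs from every ranked list."""
    pool = set()
    for ranking in ranked_lists:
        pool.update(ranking[:depth])
    return pool


# Hypothetical output of five strategies for one topic; the lists partly overlap.
runs = [[f"doc{i + offset}" for i in range(150)] for offset in (0, 20, 40, 60, 80)]

pool = build_pool(runs, depth=100)
print(len(pool))
```

With five lists and depth 100 the pool size N is 100 when all strategies return the same documents and 500 when they return disjoint ones, which is where the bounds on N come from.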


Statistics

Some difficult topics got very few relevant documents


Statistics per relevance category


Evaluation metrics

- Recall & precision are problematic:
  - Ranked lists – how much better is position 1 than pos 5 and 10?
  - How long should the lists be?
  - Relevance scale – how much better is “highly relevant” than “fairly relevant”?
  - What about the unknown documents not judged?
  - Too many unknown leads to a need of more manual judgments…
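The questions above can be made concrete with a small sketch of precision and recall at a cutoff k, where the relevance threshold is a parameter and unjudged documents are counted separately. Treating unjudged documents as irrelevant is an assumption made here, and the judgments and ranking are invented.

```python
def precision_recall_at_k(ranking, judgments, k, min_grade=1):
    """Precision and recall over the top k, plus the number of unjudged documents.

    judgments maps document ID -> grade on the 0-3 scale; documents missing
    from it are unjudged and counted as irrelevant (an assumption).
    """
    top = ranking[:k]
    relevant_total = sum(1 for grade in judgments.values() if grade >= min_grade)
    hits = sum(1 for doc in top if judgments.get(doc, 0) >= min_grade)
    unjudged = sum(1 for doc in top if doc not in judgments)
    precision = hits / k
    recall = hits / relevant_total if relevant_total else 0.0
    return precision, recall, unjudged


# Hypothetical judgments and a ranked result list containing two unjudged documents.
judgments = {"d1": 3, "d2": 0, "d3": 2, "d4": 1, "d5": 0, "d6": 3}
ranking = ["d1", "d7", "d3", "d2", "d4", "d8", "d6"]

for min_grade in (1, 3):
    print(min_grade, precision_recall_at_k(ranking, judgments, k=5, min_grade=min_grade))
```

The same ranking scores quite differently depending on the cutoff and on whether only highly relevant documents count, and the unjudged count shows how much of the list the judgments do not cover.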


The End!

Questions?