LINGO

Sandra Gama

Search Results Clustering

Internet endless document collection

Search Engines

NO question answering

FAST access to Web content

SENSITIVE to query quality

we NEED meaningful RESULTS

CLUSTERING!

GROUPING by Similarity

Semantic structure

Groups

Description

Luxury Car

Feline, panther family

Description QUALITY

How to cluster?

LINGO: a new approach

user query → Pre-processing → filtered docs → Phrase extraction → frequent phrases → Cluster-label induction → cluster labels → Cluster-content allocation → clustered documents

STAGE 1/4: PREPROCESSING

1. Text segmentation

2. Stemming

3. Ignore stop words
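
The three steps above can be sketched in a few lines of Python. This is only an illustration: the stop-word list and the suffix-stripping `stem` function are made-up stand-ins (a real system would use a proper stemmer such as Porter's), and marking stop words instead of deleting them is an assumption made here so that later stages can still inspect phrase boundaries.

```python
import re

# Assumption: a tiny illustrative stop-word list, not the one LINGO ships with.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

def segment(text):
    """Step 1: split text into sentences, then into lowercase word tokens."""
    sentences = re.split(r"[.!?;:]+", text)
    return [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences if s.strip()]

def stem(word):
    """Step 2: toy suffix stripper standing in for a real stemmer."""
    for suffix in ("ations", "ation", "ings", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Step 3: return one list of (stem, is_stop_word) pairs per sentence."""
    return [[(stem(w), w in STOP_WORDS) for w in sentence]
            for sentence in segment(text)]

result = preprocess("Software for the sparse singular value decomposition. "
                    "Matrix computations")
```

Here `result[1]` holds the stemmed tokens of the second sentence, and the pair for "for" in the first sentence is flagged as a stop word.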

STAGE 2/4: PHRASE EXTRACTION

Goal: find frequent phrases. Each phrase must

1. occur more than N times
2. not cross a sentence boundary
3. be a complete phrase
4. neither begin nor end with a stop word
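
A brute-force sketch of these four conditions, using plain n-gram counting in place of the suffix-array method shown on the following slides. The function name, the `n` and `max_len` parameters, and the reading of "complete" (no one-word-longer phrase containing it is equally frequent) are assumptions made for illustration.

```python
from collections import Counter

def frequent_phrases(sentences, stop_words, n=1, max_len=4):
    """sentences: list of token lists. Returns {phrase tuple: occurrence count}."""
    counts = Counter()
    for tokens in sentences:                    # condition 2: stay inside one sentence
        for size in range(1, max_len + 2):      # one extra length for the completeness check
            for i in range(len(tokens) - size + 1):
                counts[tuple(tokens[i:i + size])] += 1

    result = {}
    for phrase, count in counts.items():
        if len(phrase) > max_len or count <= n:                  # condition 1
            continue
        if phrase[0] in stop_words or phrase[-1] in stop_words:  # condition 4
            continue
        # condition 3: reject if some one-word extension is exactly as frequent
        extended = any(
            c == count and len(p) == len(phrase) + 1
            and (p[1:] == phrase or p[:-1] == phrase)
            for p, c in counts.items()
        )
        if not extended:
            result[phrase] = count
    return result

TITLES = [
    "large scale singular value computations",
    "software for the sparse singular value decomposition",
    "introduction to modern information retrieval",
    "linear algebra for intelligent information retrieval",
    "matrix computations",
    "singular value cryptogram analysis",
    "automatic information organization",
]
phrases = frequent_phrases([t.split() for t in TITLES], {"for", "the", "to"})
```

On these seven titles the sketch keeps "singular value" and "information retrieval", while "retrieval" alone is rejected because every occurrence is preceded by "information".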

How it works

Position:   1  2  3  4  5  6  7  8  9 10 11
Character:  a  b  r  a  c  a  d  a  b  r  a

How many non-empty suffixes?

abracadabra

bracadabra

racadabra

acadabra

cadabra

adabra

dabra

abra

bra

ra

a

11 suffixes

Sorted suffix   Index
a               11
abra            8
abracadabra     1
acadabra        4
adabra          6
bra             9
bracadabra      2
cadabra         5
dabra           7
ra              10
racadabra       3

Position:   1  2  3  4  5  6  7  8  9 10 11 12
Character:  a  b  r  a  c  a  d  a  b  r  a  $

Suffix array: 11 8 1 4 6 9 2 5 7 10 3
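
The construction above can be reproduced with a naive sort, sketched here in Python. The slides work on characters; applying the same idea to word tokens for phrase discovery is an assumption of this sketch, and `repeated_substrings` is a hypothetical helper showing why sorted suffixes are useful: repeated substrings show up as common prefixes of neighbouring entries.

```python
def suffix_array(text):
    """Return the 1-based start positions of all suffixes, in sorted suffix order."""
    # Naive construction: sort start positions by the suffix that begins there.
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

def repeated_substrings(text, sa):
    """Longest common prefixes of adjacent sorted suffixes (each occurs at least twice)."""
    found = set()
    for a, b in zip(sa, sa[1:]):
        s, t = text[a - 1:], text[b - 1:]
        k = 0
        while k < min(len(s), len(t)) and s[k] == t[k]:
            k += 1
        if k:
            found.add(s[:k])
    return found

sa = suffix_array("abracadabra")
repeats = repeated_substrings("abracadabra", sa)
```

For "abracadabra", `sa` matches the suffix array on the slide. The naive sort compares whole suffixes, so it is only suitable for short inputs.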

STAGE 3/4: CLUSTER-LABEL INDUCTION

Singular Value Decomposition

A: term × document matrix

Find matrices U, Σ, V such that A = U Σ V^T


Documents

D1: Large-scale singular value computations
D2: Software for the sparse singular value decomposition
D3: Introduction to modern information retrieval
D4: Linear algebra for intelligent information retrieval
D5: Matrix computations
D6: Singular value cryptogram analysis
D7: Automatic information organization

Terms

T1: Information
T2: Singular
T3: Value
T4: Computations
T5: Retrieval

Phrases

P1: Singular value
P2: Information retrieval

Term × document matrix A:

                  D1    D2    D3    D4    D5    D6    D7
T1 Information    0.00  0.00  0.56  0.56  0.00  0.00  1.00
T2 Singular       0.49  0.71  0.00  0.00  0.00  0.71  0.00
T3 Value          0.49  0.71  0.00  0.00  0.00  0.71  0.00
T4 Computations   0.72  0.00  0.00  0.00  1.00  0.00  0.00
T5 Retrieval      0.00  0.00  0.83  0.83  0.00  0.00  0.00

Abstract concept matrix (SVD)

U =
   0.00   0.75   0.00  -0.66   0.00
   0.65   0.00  -0.28   0.00  -0.71
   0.65   0.00  -0.28   0.00   0.71
   0.39   0.00   0.92   0.00   0.00
   0.00   0.66   0.00   0.75   0.00

P =
                  P1    P2    T1    T2    T3    T4    T5
T1 Information    0.00  0.56  1.00  0.00  0.00  0.00  0.00
T2 Singular       0.71  0.00  0.00  1.00  0.00  0.00  0.00
T3 Value          0.71  0.00  0.00  0.00  1.00  0.00  0.00
T4 Computations   0.00  0.00  0.00  0.00  0.00  1.00  0.00
T5 Retrieval      0.00  0.83  0.00  0.00  0.00  0.00  1.00

Columns: P1 Singular value, P2 Information retrieval, T1 Information, T2 Singular, T3 Value, T4 Computations, T5 Retrieval

M = U_k^T P

Rows: abstract concepts. Columns: phrases/single words.

                     P1    P2    T1    T2    T3    T4    T5
Abstract concept 1   0.92  0.00  0.00  0.65  0.65  0.39  0.00
Abstract concept 2   0.00  0.97  0.75  0.00  0.00  0.00  0.66

Columns: P1 Singular value, P2 Information retrieval, T1 Information, T2 Singular, T3 Value, T4 Computations, T5 Retrieval
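
The product M = U_k^T P can be checked by hand with the numbers from the slides. The sketch below copies the rounded U and P values, uses k = 2 (the slide's M has two rows), and picks the highest-scoring column per abstract concept as its label. The function names and the "take the maximum" rule as coded here are illustrative assumptions, and because the inputs are rounded the results only match to two decimals.

```python
# Values copied from the slides (already rounded to two decimals there).
U = [
    [0.00, 0.75, 0.00, -0.66, 0.00],
    [0.65, 0.00, -0.28, 0.00, -0.71],
    [0.65, 0.00, -0.28, 0.00, 0.71],
    [0.39, 0.00, 0.92, 0.00, 0.00],
    [0.00, 0.66, 0.00, 0.75, 0.00],
]
P = [
    [0.00, 0.56, 1.00, 0.00, 0.00, 0.00, 0.00],
    [0.71, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00],
    [0.71, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00],
    [0.00, 0.83, 0.00, 0.00, 0.00, 0.00, 1.00],
]
CANDIDATES = ["Singular value", "Information retrieval", "Information",
              "Singular", "Value", "Computations", "Retrieval"]

def concept_label_matrix(u, p, k):
    """M = U_k^T P: one row per abstract concept, one column per candidate label."""
    terms = range(len(u))
    return [[sum(u[t][c] * p[t][j] for t in terms) for j in range(len(p[0]))]
            for c in range(k)]

def pick_labels(m, names):
    """For each abstract concept, take the candidate with the largest score."""
    return [names[max(range(len(row)), key=row.__getitem__)] for row in m]

M = concept_label_matrix(U, P, k=2)
labels = pick_labels(M, CANDIDATES)
scores = [round(max(row), 2) for row in M]
```

With these inputs the first concept is labelled "Singular value" and the second "Information retrieval", with scores that round to the 0.92 and 0.97 shown in M.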

Last step

Prune overlapping label descriptions

Z^T Z: pairwise similarities between candidate cluster labels
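
A minimal sketch of this pruning step, assuming each candidate label is a vector in term space, that the columns of Z are those vectors normalised to unit length (so Z^T Z holds cosine similarities), and that pruning keeps the higher-scoring label of any pair above a threshold. The greedy order, the threshold value 0.3 and all function names are assumptions, not taken from the slides.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def gram(columns):
    """Z^T Z for a matrix given as a list of columns: entry (i, j) = column i . column j."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]

def prune_labels(label_vectors, scores, threshold=0.3):
    """Greedy pruning: visit labels by descending score, drop any too similar to a kept one."""
    sim = gram([normalize(v) for v in label_vectors])
    order = sorted(range(len(label_vectors)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(sim[i][j] <= threshold for j in kept):
            kept.append(i)
    return sorted(kept)

# Hypothetical candidates in the T1..T5 term space used on the earlier slides.
vectors = [
    [0.00, 0.71, 0.71, 0.00, 0.00],   # "Singular value"
    [0.00, 1.00, 0.00, 0.00, 0.00],   # "Singular"
    [0.56, 0.00, 0.00, 0.00, 0.83],   # "Information retrieval"
]
kept = prune_labels(vectors, scores=[0.92, 0.65, 0.97])
```

Here "Singular" overlaps strongly with "Singular value" and has the lower score, so only indices 0 and 2 survive.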

STAGE 4/4: CLUSTER-CONTENT ALLOCATION

Similarity

Cluster Score
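
One way to sketch the allocation stage, under stated assumptions: documents and labels are compared with cosine similarity in term space, a document joins every cluster whose label it matches above a threshold, unmatched documents go to an "others" group, and the cluster score is taken to be the label score multiplied by the cluster size. The threshold of 0.25 and the function names are illustrative; the vectors are the columns of A and P from the earlier slides.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def allocate(doc_vectors, label_vectors, label_scores, threshold=0.25):
    """Return (clusters, others, cluster_scores) using 0-based document indices."""
    clusters = [[] for _ in label_vectors]
    for d, doc in enumerate(doc_vectors):
        for c, label in enumerate(label_vectors):
            if cosine(doc, label) > threshold:
                clusters[c].append(d)        # a document may join several clusters
    assigned = {d for members in clusters for d in members}
    others = [d for d in range(len(doc_vectors)) if d not in assigned]
    cluster_scores = [label_scores[c] * len(clusters[c]) for c in range(len(clusters))]
    return clusters, others, cluster_scores

DOCS = [  # D1..D7 as columns of A, over terms T1..T5
    [0.00, 0.49, 0.49, 0.72, 0.00],
    [0.00, 0.71, 0.71, 0.00, 0.00],
    [0.56, 0.00, 0.00, 0.00, 0.83],
    [0.56, 0.00, 0.00, 0.00, 0.83],
    [0.00, 0.00, 0.00, 1.00, 0.00],
    [0.00, 0.71, 0.71, 0.00, 0.00],
    [1.00, 0.00, 0.00, 0.00, 0.00],
]
LABELS = [
    [0.00, 0.71, 0.71, 0.00, 0.00],   # "Singular value"
    [0.56, 0.00, 0.00, 0.00, 0.83],   # "Information retrieval"
]
clusters, others, cluster_scores = allocate(DOCS, LABELS, [0.92, 0.97])
```

With these inputs D1, D2 and D6 fall under "Singular value", D3, D4 and D7 under "Information retrieval", and D5 is left in the "others" group.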

Evaluation and Results

Test Data

10 categories

4 subjects

Subject            # docs  Contents
Movies             77      Information about the BladeRunner movie
Movies             92      Information about the Lord of the Rings movie
Health Care        77      Orthopedic equipment and manufacturers
Photography        15      Infrared-photography references
Computer Science   27      Articles about data warehouses (integrator DBs)
Computer Science   42      MySQL database
Computer Science   15      Native XML databases
Computer Science   38      PostgreSQL database
Computer Science   39      Java programming language tutorials and guides
Computer Science   37      VI text editor

Identifier  Merged Categories
G1          LRings, MySQL
G3          LRings, MySQL, Ortho, Infra
G5          MySQL, XMLDB, Dware, Postgr, JavaTut, Vi
G6          MySQL, XMLDB, Dware, Postgr, Ortho

Identifier  Merged Categories
G1          Fan fiction/fan art, image galleries, MySQL, wallpapers, LOTR humour, links
G3          MySQL, news, information on infrared, image galleries, foot orthotics, Lord of the Rings, movie
G5          Java tutorial, Vim page, federated data warehouse, native XML database, Web, Postgresql database
G6          MySQL database, federated data warehouse, foot orthotics, orthopedic products, access Postgresql, Web

Cluster Contamination

Analytical evaluation:

LINGO vs. Suffix Tree Clustering

CONCLUSIONS

Future work

Pointer

Communication!

LINGO

Thank you.

Search Results Clustering