Page 1: Small World Clustering Algorithms

Small World Clustering Algorithms

Brant Chee

Page 2: Small World Clustering Algorithms

Experiments

3 clustering algorithms:
Complete Link (Cluto)
K means (Cluto)
Small World

Page 3: Small World Clustering Algorithms

Test Collections

Collection  Number of Abstracts  Number of Terms  Search Terms
C1          81,746               267,981          plasticity OR acetylcholine
C2          74,533               285,623          microarray OR muscarinic OR plasticity OR ((cholinergic OR noradrenergic) AND receptor)

Page 4: Small World Clustering Algorithms

Experimental Setup

Parameters left at package defaults.
Clustered with n = 50, 100, 150 and 200.
Clusters with fewer than 4 elements or more than 50 elements were eliminated, and the clustering that resulted in fewer than 40 clusters was chosen for evaluation.
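
The filtering and selection rule on this slide can be sketched in Python as follows. This is only an illustration: the function names and the dictionary layout are assumptions, the toy clusters are made up, and the slide does not say what happened when several values of n qualified, so the sketch simply returns every clustering that qualifies.

```python
def filter_clusters(clusters, min_size=4, max_size=50):
    """Drop clusters with fewer than min_size or more than max_size elements."""
    return [c for c in clusters if min_size <= len(c) <= max_size]


def qualifying_clusterings(clusterings_by_n, max_clusters=40):
    """Return {n: filtered clusters} for runs that keep fewer than max_clusters clusters."""
    result = {}
    for n, clusters in clusterings_by_n.items():
        kept = filter_clusters(clusters)
        if len(kept) < max_clusters:
            result[n] = kept
    return result


# Toy example with hypothetical clusters (lists of item ids), not data from the slides.
runs = {
    50: [list(range(3)), list(range(10)), list(range(60))],
    100: [list(range(5)), list(range(4)), list(range(51))],
}
selected = qualifying_clusterings(runs)
print({n: [len(c) for c in kept] for n, kept in selected.items()})
```

Both size bounds are treated as inclusive (4 and 50 are kept), which matches the wording "less than 4" and "more than 50".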

Page 5: Small World Clustering Algorithms

Quantitative Results

Collection  Algorithm  Threshold  Running Time (s)
C1          SW         N/A        40.54
C1          C-Link     50         214.106
C1          K-Means    200        11.581
C2          SW         N/A        47.35
C2          C-Link     100        198.147
C2          K-Means    200        5.538

Page 6: Small World Clustering Algorithms

Quantitative Results II

Collection  Algorithm  # of Clusters  Avg. # of Terms per Cluster  Avg. # of Documents per Cluster
C1          SW         21             6                            15,413
C1          C-Link     22             7                            12,466
C1          K-Means    11             39                           4,425
C2          SW         40             12                           10,258
C2          C-Link     28             6                            25,070
C2          K-Means    38             30                           11,978

Page 7: Small World Clustering Algorithms

Qualitative Evaluation

2 criteria: Utility and Coherence
3-point scale: 1 good, 2 poor, 3 bad
Good: >60% of articles
Poor: 59-41%
Bad: <40%
Evaluate terms in cluster to get context.
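
As a rough sketch, the 3-point scale can be written as a small Python function. Several things here are assumptions rather than statements from the slide: the input is taken to be the percentage of a cluster's articles judged useful (or coherent), the names `rate` and `tally` are hypothetical, and the slide leaves exactly 60% and exactly 40% unassigned, so the sketch arbitrarily puts 60 in "poor" and 40 in "bad".

```python
from collections import Counter


def rate(percent):
    """Map a percentage of articles to the 3-point scale (1 good, 2 poor, 3 bad)."""
    if percent > 60:
        return 1
    if percent > 40:  # assumption: exactly 60 falls here, exactly 40 does not
        return 2
    return 3


def tally(percentages):
    """Count how many clusters received each score, listed as 3, 2, 1."""
    counts = Counter(rate(p) for p in percentages)
    return {score: counts.get(score, 0) for score in (3, 2, 1)}


# Hypothetical per-cluster percentages, not values from the evaluation.
print(tally([75, 50, 10, 5]))
```

A tally of this kind, computed once per criterion, gives one block of three score rows per collection and algorithm.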

Page 8: Small World Clustering Algorithms

Quantitative Results Cont…

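Collection  Criterion  Score  SW  C-Link  K-Means
C1          Utility    3      18  22      9
C1          Utility    2      1   0       1
C1          Utility    1      2   0       1
C1          Coherence  3      7   13      7
C1          Coherence  2      6   5       3
C1          Coherence  1      8   4       1
C2          Utility    3      37  28      38
C2          Utility    2      2   0       0
C2          Utility    1      1   0       0
C2          Coherence  3      9   18      38
C2          Coherence  2      21  9       0
C2          Coherence  1      10  1       0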

Page 9: Small World Clustering Algorithms

Sample Session

Page 14: Small World Clustering Algorithms

Other Approaches

Statistical Methods

Page 15: Small World Clustering Algorithms

Other Clustering Approaches

Can we choose other types of clustering algorithms which could provide better quality results or provide better cluster labels?

SOM (Self Organizing Map): Slow for high numbers of dimensions and large numbers of objects.

Carrot2: Slow for large numbers of items. Huge memory consumption.

Page 16: Small World Clustering Algorithms

Random Projection

Can we reduce the dimensionality of vectors (e.g. 50,000 → 1,000) while preserving distances?
Speed up similarity calculations.

Various methods: random projection, “latent semantic indexing”, multidimensional scaling.

Page 17: Small World Clustering Algorithms

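Very Sparse Random Projections

Let $A \in \mathbb{R}^{n \times D}$ be our $n$ points in $D$ dimensions.
Project with $A \times R$, where the random matrix $R \in \mathbb{R}^{D \times k}$.
$R$ has entries in $\{-1, 0, 1\}$ with probabilities $\left\{ \frac{1}{2\sqrt{D}},\ 1 - \frac{1}{\sqrt{D}},\ \frac{1}{2\sqrt{D}} \right\}$.
Cost of projecting and then computing all pairwise distances: $O(nDk + n^2 k)$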
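
A minimal pure-Python sketch of such a projection is given below, assuming the usual very sparse formulation in which entries are nonzero with total probability $1/\sqrt{D}$. The function names, the seed handling, and the scaling factor $\sqrt{\sqrt{D}/k}$ (chosen so that squared lengths are preserved in expectation) are assumptions of this sketch and do not appear on the slide.

```python
import math
import random


def sparse_random_matrix(D, k, seed=0):
    """D x k matrix with entries -1, +1 (probability 1/(2*sqrt(D)) each) and 0 otherwise."""
    rng = random.Random(seed)
    p = 1.0 / (2.0 * math.sqrt(D))
    matrix = []
    for _ in range(D):
        row = []
        for _ in range(k):
            u = rng.random()
            row.append(-1 if u < p else (1 if u < 2 * p else 0))
        matrix.append(row)
    return matrix


def project(A, R):
    """Multiply each row of A (length D) by R (D x k), skipping zero inputs."""
    D, k = len(R), len(R[0])
    # Assumed scaling: squared vector lengths are preserved in expectation.
    scale = math.sqrt(math.sqrt(D) / k)
    projected = []
    for row in A:
        nonzero = [(j, x) for j, x in enumerate(row) if x != 0]
        projected.append(
            [scale * sum(x * R[j][c] for j, x in nonzero) for c in range(k)]
        )
    return projected


# Toy data: 5 random points in D = 400 dimensions, reduced to k = 50.
data_rng = random.Random(1)
A = [[data_rng.random() for _ in range(400)] for _ in range(5)]
R = sparse_random_matrix(400, 50)
B = project(A, R)
print(len(B), len(B[0]))
```

Because most entries of R are zero, a real implementation would store R in a sparse format; the dense nested lists here are only for readability.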

Page 18: Small World Clustering Algorithms

Reducing Dimensionality

Bank Dataset: 11,000 articles from 11 categories in Dmoz.
11,000 articles reduced from 30K terms with a 1 GB heap in 11 s.
Increase in Purity and decrease in Entropy (measures of clustering quality).

Matrix    Entropy  Purity
Original  0.975    0.146
512_1     0.584    0.476
512_2     0.589    0.495
512_3     0.62     0.502
1000_1    0.533    0.532
1000_2    0.544    0.496
1000_3    0.546    0.485
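
The slide does not define Entropy and Purity. The sketch below assumes the definitions commonly used for clustering evaluation: purity is the fraction of items that belong to the majority class of their cluster, and entropy is the size-weighted class entropy of each cluster, normalized by the log of the number of classes so that it lies in [0, 1]. The function name and argument layout are hypothetical.

```python
import math
from collections import Counter, defaultdict


def purity_and_entropy(cluster_ids, class_labels):
    """Return (purity, normalized entropy) of a clustering against known classes."""
    n = len(cluster_ids)
    q = len(set(class_labels))  # number of distinct classes
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, class_labels):
        members[cid].append(label)

    purity = 0.0
    entropy = 0.0
    for labels in members.values():
        counts = Counter(labels)
        size = len(labels)
        purity += max(counts.values()) / n
        if q > 1:  # with a single class the entropy is zero by convention
            h = -sum((c / size) * math.log(c / size) for c in counts.values())
            entropy += (size / n) * h / math.log(q)
    return purity, entropy


# Toy labels: a perfect clustering and a single mixed cluster.
print(purity_and_entropy([0, 0, 1, 1], ["a", "a", "b", "b"]))
print(purity_and_entropy([0, 0, 0, 0], ["a", "a", "b", "b"]))
```

Under these definitions higher purity and lower entropy both indicate clusters that agree better with the known categories, which is the direction of change reported in the table.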

Page 19: Small World Clustering Algorithms

MI on Phrases

More context than single words.
More meaningful term clusters.

Page 20: Small World Clustering Algorithms

Other approaches

Knowledge Intensive Approaches

Page 21: Small World Clustering Algorithms

Hypernym

“Is-a” relationship:
Shakespeare is an author.
Pug is a dog.

Implicitly hierarchical.
Basis of many ontologies and semantic networks:
WordNet
UMLS

Page 22: Small World Clustering Algorithms

Portion of the UMLS Semantic Network: Biologic Function

Page 23: Small World Clustering Algorithms

Hypernym Relations

NP such as {NP,}* {(or | and)} NP
Vegetables such as Beets, Carrots and Peas.

Such NP as {NP,}* {(or|and)} NP
…works by such authors as Herrick, Goldsmith and Shakespeare.

NP {, NP}* {,} (or|and) other NP
Bruises, …, broken bones or other injuries

NP {,} including {NP,}* {or|and} NP
All common-law countries, including Canada and England …

NP {,} especially {NP,}* {or|and} NP
… most European countries, especially France, England and Spain.
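
To illustrate how one of these patterns could be matched, here is a rough regular-expression sketch for the "NP such as ..." pattern only. It is a heavy simplification and not the extraction method behind the slides: there is no noun-phrase chunker, the hypernym is approximated by the single word before "such as", list items are split on commas and "and"/"or", and the function name is hypothetical.

```python
import re

# One word before "such as", then the list up to the end of the clause.
SUCH_AS = re.compile(r"(\w[\w-]*) such as (.+?)(?:[.;]|$)")
# Separators between list items: commas (optionally followed by and/or) or bare and/or.
ITEM_SEP = re.compile(r"\s*,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+")


def such_as_pairs(sentence):
    """Return (hypernym, hyponym) pairs found by the 'such as' pattern."""
    match = SUCH_AS.search(sentence)
    if not match:
        return []
    hypernym = match.group(1).lower()
    items = ITEM_SEP.split(match.group(2))
    return [(hypernym, item.strip().lower()) for item in items if item.strip()]


print(such_as_pairs("Vegetables such as Beets, Carrots and Peas."))
```

Because only one word is captured as the hypernym, a phrase such as "organic compounds such as sucrose" would yield "compounds" rather than the full noun phrase; handling that properly needs a part-of-speech tagger or chunker.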

Page 24: Small World Clustering Algorithms

Uses of Hypernym Trees

Search: query expansion, faceted metadata

Clustering: parent node defines a cluster

Keyword assignment
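
The idea that a parent node defines a cluster can be sketched as a simple grouping of extracted (hypernym, hyponym) pairs. The function name is hypothetical and the sketch only handles a single level of the tree.

```python
from collections import defaultdict


def cluster_by_parent(pairs):
    """Group hyponyms under their hypernym; each hypernym defines one cluster."""
    clusters = defaultdict(list)
    for hypernym, hyponym in pairs:
        clusters[hypernym].append(hyponym)
    return dict(clusters)


# A few illustrative pairs.
pairs = [
    ("organs", "liver"),
    ("organs", "kidney"),
    ("substances", "zinc"),
]
print(cluster_by_parent(pairs))
```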

Page 25: Small World Clustering Algorithms

Trivial Hypernyms

Hypernym                     Hyponym
organic compounds            d-ribose
organic compounds            d-arabinose
organic compounds            l-arabinose
organic compounds            sucrose
substances                   cortisone
substances                   vitamins a and c
substances                   zinc
organs                       liver
organs                       kidney
sugar-containing products    honey
sugar-containing products    jam
sugar-containing products    glucose
sugar-containing products    fruit juice concentrates
sugar-containing products    tomato
largely populated countries  china
largely populated countries  russia

Page 26: Small World Clustering Algorithms

Bad Hypernyms

Hypernym                 Hyponym
suicidal patients        appears
other agents             plasmin
other agents             plasminogen
such common sensations   illness
phenomena                founder effects
phenomena                migration
phenomena                gene flow
clinical manifestations  80
chemical agents          homocystine
no other explanation     anencephaly
conditions               azure a-0.5 % nahco3 solution
conditions               ph 8.1
fewer side-effects       vegetative disfunction
techniques               carpentier
techniques               's ring

Page 27: Small World Clustering Algorithms

Good? Hypernyms

Hypernym                             Hyponym
entirely synthetic steroids          norgestrel and quingestanol
menstrual disorders                  metrorrhagia
menstrual disorders                  oligoamenorrhea
menstrual disorders                  amenorrhea
mild venous disorders                swollen veins
mild venous disorders                heavy limbs
mild venous disorders                varicosities
obstructive pulmonary lung diseases  alveolar proteinosis
obstructive pulmonary lung diseases  pneumonia
obstructive pulmonary lung diseases  asthma
obstructive pulmonary lung diseases  bronchiectasis
obstructive pulmonary lung diseases  cystic fibrosis
choline analogues                    n,n'-dimethylethanolamine
choline analogues                    n-monomethylethanolamine
choline analogues                    ethanolamine
3alpha-oh-containing steroids        androsterone