TECHNIQUE OF CLUSTERING (Panorama of methods, Works of Benno Stein group) Mikhail Alexandrov, Centro de Investigación en Computación, Instituto Politécnico Nacional de México, (+52) 5729-6000 ext. 56544; [email protected]

Nature of data structure

Definition of the best data structure based on graph presentation

Numerical method of clustering

Development of MajorClust method from the density based group of methods

Validity and usability of clustering

F-measure for cluster usability (expert opinion)

Density index ρ for cluster validity (formal estimation)

Numerical experiments

Comparison of different methods of clustering and different ways of index selection

Results of Investigation of Benno Stein Group

Subject of Grouping

TEXTUAL DATA

NON-TEXTUAL DATA

Local terminology

The source of the data, textual or non-textual, is not important.

Data: work in the space of parameters

Texts: work in the space of words

TEXTS (‘indexed’)

Presentation of Textual Data

TEXTS (parametrized)

Local terminology

Indexed texts are parametrized texts in the space of words

Types of Grouping

Clustering

Characteristics: absence of patterns or descriptions of classes, so the results are defined by the nature of the data themselves (N > 1)

Synonyms: classification without teacher, unsupervised classification

Number of clusters:
[ ] is known exactly
[x] is known approximately
[ ] is not known => searching

Classification

Characteristics: presence of patterns or descriptions of classes, so the results are defined by the user (N >= 1)

Synonyms: classification with teacher, supervised classification

Special terms:
Categorization (of documents)
Diagnostics (engineering, medicine)
Recognition (engineering, science)

Objectives of Grouping

1. Organization (structuring) of an object set

The process is called data structuring. Technologies: clustering, categorization

2. Improvement of the search for interesting patterns

The process is called navigation.

Technology: cluster-based search

3. Grouping for other applications:

- Knowledge discovery

Technology: clustering

- Summarization (generalization) of documents. Technologies: clustering, categorization

Note: do not confuse the type of grouping with its objective

Example 1.

Clustering of opinions of the experts invited by the Moscow City Government

Example 2.

Categorization of letters and complaints from Moscow residents addressed to the Mayor of the city

Learning

General books about Information Retrieval

1. Baeza-Yates, R., Ribeiro-Neto, B. Modern Information Retrieval. Addison Wesley, 1999.

2. Manning, C.D., Schütze, H. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

Journals and Congresses about Information Retrieval

1. Journal “Knowledge and Information Systems”, Springer

2. KDD - Knowledge Discovery from Data Base

3. NLDB - Natural Language and Data Bases

4. XX-AI - National Congresses on Artificial Intelligence (MICAI, ….)

Learning

General books and articles about clustering

1. Hartigan J., Clustering Algorithms, Wiley, 1975 <= !

2. Kaufman L. and Rousseeuw P., Finding Groups in Data, Wiley, 1990

3. Gordon D. “How many clusters?”, Proc. of IFCS-96, Springer, series “Studies in classification, data analysis…..”, 1996

Articles about clustering texts

1. Eissen S., Stein B. “Analysis of clustering algorithms for Web based search”, Springer, LNAI, N_2569

2. Stein B., Eissen S., Wißbrock F. “On cluster validity and the information need of users”, Proc. 3-rd Conf. Artif. Intel. Appl., IASTED, 2003

General-purpose packages with clustering algorithms

1. ClustAn (Scotland), www.clustan.com, Clustan Graphics-6 (2003)

2. MatLab: descriptions are available on the Internet

3. Statistica: descriptions are available on the Internet

General Scheme of Clustering Process

Preprocessing <=

Processing

Principal idea: transform the texts into numerical form in order to use mathematical tools (the problem is grouping the documents, not understanding them).

Here: both the raw and the refined matrices are object/attribute (Obj/Atr) matrices.

General Scheme of Clustering Process

Preprocessing

Processing <=

Here: an Atr/Atr matrix can be used instead of the Obj/Obj matrix.
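The two stages of the scheme can be sketched as follows. This is only an illustration under stated assumptions: the three sample texts, the whitespace tokenization, the raw word counts, and the cosine measure are not taken from the slides, and the function names are hypothetical.

```python
import math
from collections import Counter

def object_attribute_matrix(texts):
    """Preprocessing: one row of word counts per text (matrix Obj/Atr)."""
    counts = [Counter(t.lower().split()) for t in texts]
    vocab = sorted(set().union(*counts))
    rows = [[c[w] for w in vocab] for c in counts]
    return vocab, rows

def cosine(u, v):
    """Cosine similarity of two count vectors (an assumed closeness measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_matrix(rows):
    """Square matrix of pairwise similarities between the given rows."""
    return [[cosine(u, v) for v in rows] for u in rows]

def transpose(rows):
    return [list(col) for col in zip(*rows)]

texts = ["water supply repair", "water supply tariff", "school repair"]
vocab, obj_atr = object_attribute_matrix(texts)
obj_obj = similarity_matrix(obj_atr)              # texts x texts
atr_atr = similarity_matrix(transpose(obj_atr))   # words x words
```

The Obj/Obj matrix compares texts with each other, while the Atr/Atr matrix compares words; both are derived from the same Obj/Atr matrix.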

Matrices to Be Considered

Clustering for Categorization

Colour matrix “words-words”

before clustering

Clustering for Categorization

Colour matrix “words-words”

after clustering

Importance of Preprocessing

Definitions

Def. 1 “Let V be a set of objects. A clustering C = { Ci | Ci ⊆ V } of V is a division of V into subsets for which ∪i Ci = V and Ci ∩ Cj = ∅ for i ≠ j.”

Def. 2 “Let V be a set of nodes, E a set of arcs, and φ a weight function that reflects the distance between objects, so that we have a weighted graph G = { V, E, φ }. In this case C is called a clustering of G.”

Within the second definition every Ci induces a subgraph G(Ci). Both the subsets Ci and the subgraphs G(Ci) are called clusters.

[Figure labels: Graph, Set, Clique]
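The two conditions of Def. 1 (the subsets cover V and are pairwise disjoint) can be checked mechanically. A minimal sketch, where the function name `is_clustering` is hypothetical:

```python
def is_clustering(V, C):
    """True if the subsets in C are pairwise disjoint and their union is V."""
    V = set(V)
    seen = set()
    for Ci in C:
        Ci = set(Ci)
        if not Ci <= V:      # every Ci must be a subset of V
            return False
        if seen & Ci:        # Ci must not overlap any earlier subset
            return False
        seen |= Ci
    return seen == V         # the union of all Ci must be V

V = {1, 2, 3, 4}
print(is_clustering(V, [{1, 2}, {3, 4}]))     # disjoint and covering
print(is_clustering(V, [{1, 2}, {2, 3, 4}]))  # overlapping subsets
print(is_clustering(V, [{1, 2}, {3}]))        # object 4 is not covered
```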

Definitions

Principal note

Both definitions say nothing:

- about the quality of clusters

- about the number of clusters

Reason for the difficulties

There is currently no general agreement on a universal definition of the term "cluster".

What does it mean that a clustering is good?

1. Closeness between objects inside clusters is substantially greater than the closeness between the clusters themselves

2. The constructed clusters correspond to the users' intuitive notions (they are natural clusters)

Classification of methods

Based on the way of grouping

1. Hierarchical methods: any neighbor method (N is not given)

2. Methods oriented on examples: K-means (N is given)

3. Methods oriented on density: MajorClust (N is calculated automatically)

Classification of methods

Based on belonging to cluster

Exclusive methods

Every object belongs to only one cluster. These methods are called hard grouping methods.

Non-exclusive methods

Every object can belong to several clusters. These methods are called soft grouping methods.

Based on data presentation

Methods oriented on sets in free metric space

Every object is presented as a point in a free space

Methods oriented on graphs

Every object is presented as an element of a graph

Fuzzy Grouping

Hard classification: hard clustering, hard categorization

Soft classification: soft clustering, soft categorization

Example.

The distribution of letters from Muscovites to the Government is a soft categorization (the numbers in the table reflect the relative weight of each theme)

Hierarchical methods

Agglomerative methods (bottom => top)

- Relations with neighbors

- Ward’s method

Divisive methods (top => bottom)

- Min cut method

- Methods based on dissimilarity

Hierarchical agglomerative methods

Neighbors

1. Nearest neighbor (single linkage)

2. Farthest neighbor (complete linkage)

3. Average neighbor (average linkage)

- Unweighted Pair-Group Average UPGMA

- Weighted Pair-Group Average WPGMA

- Unweighted Pair-Group Centroid UPGMC <= King method

Density

Ward's method

Difference between methods: how the closeness between clusters is evaluated

Hierarchical methods

Neighbors. Every object is a cluster

General algorithm

1. Initially every object is its own cluster.

2. A series of steps is performed. At every step the closest pair of clusters is found and merged.

3. At the end we have one cluster.
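The general algorithm above can be sketched in a few lines. This is an illustrative sketch, not the implementation used in the talk: the one-dimensional sample points, the absolute-difference distance, and all function names are assumptions. The three linkage functions correspond to the nearest, farthest, and average neighbor rules.

```python
def single(dist, A, B):
    """Nearest neighbor: distance between the closest pair of members."""
    return min(dist(a, b) for a in A for b in B)

def complete(dist, A, B):
    """Farthest neighbor: distance between the most distant pair of members."""
    return max(dist(a, b) for a in A for b in B)

def average(dist, A, B):
    """Average neighbor: mean distance over all pairs of members."""
    return sum(dist(a, b) for a in A for b in B) / (len(A) * len(B))

def agglomerate(points, dist, linkage):
    """Merge the closest pair of clusters until one cluster remains.

    Returns the list of merges in the order they were performed.
    """
    clusters = [[p] for p in points]          # step 1: every object is a cluster
    merges = []
    while len(clusters) > 1:                  # step 2: repeat merge steps
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs,
                   key=lambda ij: linkage(dist, clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges                             # step 3: one cluster is left

points = [0.0, 1.0, 5.0, 6.0, 20.0]
merges = agglomerate(points, lambda a, b: abs(a - b), single)
for left, right in merges:
    print(left, "+", right)
```

Cutting the sequence of merges at some step gives a clustering with a chosen number of clusters, which is why N does not have to be given in advance.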

Hierarchical methods

Nearest neighbor method (NN)

Hierarchical methods

Farthest neighbor method

Hierarchical methods

Average neighbor (Unweighted Pair-Group Centroid UPGMC)

Hierarchical methods

Average neighbors

- Unweighted Pair-Group Centroid UPGMC (a)

- Unweighted Pair-Group Average UPGMA (b)

- Weighted Pair-Group Average WPGMA

Difference: how the closeness between clusters is evaluated

[Figure panels (a) and (b)]
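The difference between the centroid variant (a) and the pair-group average variant (b) can be seen on a small example. This is a sketch under assumptions: plain Euclidean distance is used for both variants, and the sample clusters and function names are hypothetical.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def centroid_distance(A, B):
    """Variant (a): distance between the centroids of the two clusters."""
    return math.dist(centroid(A), centroid(B))

def average_pair_distance(A, B):
    """Variant (b): mean of the distances over all cross-cluster pairs."""
    return sum(math.dist(a, b) for a in A for b in B) / (len(A) * len(B))

A = [(0.0, 0.0), (0.0, 2.0)]
B = [(3.0, 1.0)]
print(centroid_distance(A, B))      # centroids (0, 1) and (3, 1)
print(average_pair_distance(A, B))  # both pairs are sqrt(10) apart
```

The two rules give different values for the same pair of clusters, so they can lead to different merge orders.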

Methods oriented on examples

Variants of K-means methods

K-means, centroid (a)

K-means, medoid (b)

K-means fuzzy

Difference: selection of centers

[Figure panels (a) and (b)]
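The difference in the choice of centers can be illustrated as follows. A minimal sketch with hypothetical sample points and function names: the centroid is the mean of the cluster and need not coincide with any object, while the medoid is the cluster member with the smallest total distance to the other members.

```python
import math

def centroid(points):
    """Variant (a): the coordinate-wise mean, possibly not an actual object."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def medoid(points):
    """Variant (b): the member with the smallest sum of distances to all members."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (14.0, 0.0)]
print(centroid(points))  # pulled toward the distant point
print(medoid(points))    # stays on an actual object
```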

Methods oriented on examples

K-means, centroid

Methods oriented on examples

K-means method. General algorithm

1. Initially K centers are selected at random.

2. A series of steps is performed. At every step the objects are distributed among the centers according to the nearest-center criterion. Then all centers are recalculated.

3. The search ends when the clusters no longer change.
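The three steps can be sketched as follows. This is an illustration under assumptions, not the exact procedure from the talk: the initial centers are sampled from the objects themselves, squared Euclidean distance is used, an empty cluster keeps its old center, and the `seed` and `max_steps` parameters are hypothetical additions.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans(points, k, seed=0, max_steps=100):
    """Return (assignment, centers); assignment[i] is the cluster index of points[i]."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # step 1: K random centers
    assignment = None
    for _ in range(max_steps):                # step 2: repeat until stable
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:      # step 3: clusters did not change
            break
        assignment = new_assignment
        for c in range(k):                    # recalculate all centers
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = mean(members)
    return assignment, centers

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
print(labels)
```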

Natural Structure of Data

Graph decomposition

C = (C1, C2, …, Cn) is a decomposition of the graph

Λ(C) is the weighted partial connectivity of C

λ(Ci) = λi is the edge connectivity of G(Ci)

H(C*, E(C*)) is the connectivity structure
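The slide names the quantities without giving the formula. In the papers of the Stein group the weighted partial connectivity is, to the best of my knowledge, defined as Λ(C) = Σ |Ci| · λi, and the sketch below assumes that definition. The brute-force edge connectivity is only practical for very small graphs, λ of a single-node subgraph is taken as 0 by assumption, and all names are hypothetical.

```python
from itertools import combinations

def edge_connectivity(nodes, edges):
    """Smallest number of edges crossing any split of the nodes into two parts."""
    nodes = list(nodes)
    if len(nodes) < 2:
        return 0                              # assumed convention for one node
    first, rest = nodes[0], nodes[1:]
    best = None
    for r in range(len(rest)):                # never put all nodes on one side
        for side in combinations(rest, r):
            S = {first, *side}
            cut = sum(1 for u, v in edges if (u in S) != (v in S))
            best = cut if best is None else min(best, cut)
    return best

def weighted_partial_connectivity(edges, clustering):
    """Assumed definition: sum over clusters of |Ci| * lambda(G(Ci))."""
    total = 0
    for Ci in clustering:
        inner = [(u, v) for u, v in edges if u in Ci and v in Ci]
        total += len(Ci) * edge_connectivity(Ci, inner)
    return total

# Two triangles joined by a single bridge edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
split = weighted_partial_connectivity(edges, [{1, 2, 3}, {4, 5, 6}])
whole = weighted_partial_connectivity(edges, [{1, 2, 3, 4, 5, 6}])
print(split, whole)
```

Under this definition the decomposition into the two triangles scores higher than keeping the whole graph as one cluster, which matches the intuition of a natural structure.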

Density based methods

MajorClust method

Principal idea

An object belongs to the cluster to which the majority of its neighbors belong

Suboptimal solution

Only a part of the neighbors is considered at every step.
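The principal idea can be sketched as an iterative relabeling of graph nodes. This is a simplified illustration, not the published MajorClust code: edge weights are treated as similarities (larger means closer), ties are broken deterministically instead of at random, and `max_rounds` is a hypothetical safety limit.

```python
def majorclust(nodes, weights, max_rounds=100):
    """weights: dict {(u, v): w} of undirected edges; returns {node: cluster label}."""
    neighbors = {n: {} for n in nodes}
    for (u, v), w in weights.items():
        neighbors[u][v] = w
        neighbors[v][u] = w
    label = {n: n for n in nodes}             # every node starts as its own cluster
    for _ in range(max_rounds):
        changed = False
        for n in nodes:
            score = {}                        # total edge weight per neighboring cluster
            for m, w in neighbors[n].items():
                score[label[m]] = score.get(label[m], 0.0) + w
            if not score:
                continue                      # isolated node keeps its label
            best = max(score.values())
            candidates = sorted(l for l, s in score.items() if s == best)
            if label[n] not in candidates:    # adopt the majority cluster
                label[n] = candidates[0]
                changed = True
        if not changed:
            break
    return label

# Two triangles connected by one weak edge.
weights = {(1, 2): 1.0, (2, 3): 1.0, (1, 3): 1.0,
           (4, 5): 1.0, (5, 6): 1.0, (4, 6): 1.0,
           (3, 4): 0.1}
label = majorclust([1, 2, 3, 4, 5, 6], weights)
print(label)
```

The number of clusters is not passed in; it results from the relabeling itself.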

Density based methods

Comparison of the Nearest Neighbor method and MajorClust

The NN method does not change the cluster membership of objects.

MajorClust changes the cluster membership of objects.

Λ-criterion and MajorClust

Validity

Principal question: which of the formal validity measures best reflects the user's opinion?

Dunn index

Davies-Bouldin index

etc.

Dunn index (to be maximized)
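The slide shows the Dunn index only as a figure. A minimal sketch, assuming the common textbook form (smallest distance between objects of different clusters divided by the largest cluster diameter) and Euclidean distance; it also assumes that at least one cluster contains two distinct points, otherwise the denominator is zero.

```python
import math

def dunn_index(clusters, dist=math.dist):
    """Minimum between-cluster distance divided by maximum cluster diameter."""
    min_between = min(
        dist(a, b)
        for i, A in enumerate(clusters)
        for B in clusters[i + 1:]
        for a in A
        for b in B
    )
    max_diameter = max(dist(a, b) for C in clusters for a in C for b in C)
    return min_between / max_diameter

good = [[(0, 0), (0, 1)], [(5, 0), (5, 1)]]   # compact, well separated
bad = [[(0, 0), (5, 0)], [(0, 1), (5, 1)]]    # stretched, interleaved
print(dunn_index(good), dunn_index(bad))
```

The compact and well separated grouping receives the larger value, which is why the index is to be maximized.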

Validity and Usability

Conclusion

The expected density measure corresponds to the F-measure, which reflects the expert’s opinion
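The F-measure compares a clustering with a reference classification made by an expert. The sketch below assumes the variant commonly used for clustering evaluation: for every expert class the best-matching cluster is found by the harmonic mean of precision and recall, and the class scores are averaged with weights proportional to class size. The function name and the sample data are hypothetical.

```python
def f_measure(classes, clusters):
    """Size-weighted average over expert classes of the best per-cluster F value."""
    n = sum(len(c) for c in classes)
    total = 0.0
    for cls in classes:
        best = 0.0
        for clu in clusters:
            common = len(set(cls) & set(clu))
            if common:
                precision = common / len(clu)
                recall = common / len(cls)
                best = max(best, 2 * precision * recall / (precision + recall))
        total += len(cls) / n * best
    return total

expert = [{1, 2, 3}, {4, 5, 6}]
print(f_measure(expert, [{1, 2, 3}, {4, 5, 6}]))   # clustering equals the expert classes
print(f_measure(expert, [{1, 2, 3, 4, 5, 6}]))     # everything in one cluster
```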

Classification of Technologies

Meta methods (Technology)

They construct separate data sets using optimization criteria and constraints:

- Neither too many nor too few clusters

- Clusters neither too large nor too small

etc.

Visual methods (Technology)

They present visual images to a user so that the clusters can be selected manually

- Using different methods

- Comparing results

Meta Methods

Algorithm

1. Initially the number of clusters Kini and their centers Ci are given (at random)

2. The K-medoid method (or any other one) is run to completion

3. If N > Nmax in any cluster, then this cluster is divided along its diagonal. Go to step 2

4. If N < Nmin in any cluster, then the closest clusters are joined. Go to step 2

5. If D > Dmax in any cluster, then this cluster is divided along its diagonal. Go to step 2

6. If D < Dmin in any cluster, then the closest clusters are joined. Go to step 2

7. When the number of iterations I > Imax, stop

Here: N is the number of objects in a given cluster

D is the diagonal of a given cluster

Technique of Visual Clustering

Clustering on dendrite

Clustering in space of factors

Demographic Map of the World, clusters in space of factors and on dendrite

Searching Leaders in Clusters

Objectives (Leader = Representative)

• To reduce the time needed to process a document corpus

• To improve the procedure of navigation on the Internet (new)

etc.

Typical documents

A. Typical document (the averaged one)

It reflects the principal idea of a given cluster

B. The least typical document

It is useful for reaching consensus between clusters

C. The most typical document

It reflects what differentiates the cluster from the others

Variants of Leaders

[Figure: a cluster with its A. typical, B. least typical, and C. most typical documents]

Typical, most typical, and least typical documents form the so-called semantic sphere (boundaries) of a cluster

Examples

Typical: the most popular costume in China is jeans.

The most typical: the most typical costume in China is the kimono

The least typical: the least typical costume in China is the European suit (it is the closest one to the other clusters)
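One possible way to select the three kinds of leaders is sketched below. It is only an interpretation of the slide, with every choice an assumption: the typical document is taken as the member closest to the centroid of its own cluster, the least typical as the member closest to the centroids of the other clusters, and the most typical as the member farthest from them. Euclidean distance and the function names are hypothetical.

```python
import math

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def leaders(cluster, others):
    """Pick typical, least typical, and most typical members of `cluster`."""
    own = centroid(cluster)
    other_centers = [centroid(o) for o in others]

    def to_others(p):
        # distance from p to the nearest centroid of another cluster
        return min(math.dist(p, c) for c in other_centers)

    return {
        "typical": min(cluster, key=lambda p: math.dist(p, own)),
        "least_typical": min(cluster, key=to_others),
        "most_typical": max(cluster, key=to_others),
    }

cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
others = [[(10.0, 0.0), (11.0, 0.0)]]
result = leaders(cluster, others)
print(result)
```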

Clustering: Methods of Clustering

|| 21 XXC

Open Problems

1. Clustering with filtering (a)

A given object distribution reflects: real structure (nature) + noise (alien objects)

2. Validation of clusters (b)

What is really inside the black box?

3. Statistical clustering (c)

How to take into account the randomness of the data?

4. Visual presentation (d)

Facilities for visual clustering

Certain Observations

The number of clustering methods is slightly larger than the number of researchers working in this area.

The problem is not to find the best method for all cases. The problem is to find the method that is relevant for your data. Only you know which methods are best for your own data.

To be sure that the results are really good and do not depend on the method used, it is recommended to test them using other methods. Preferably, these methods should be antipodes.

Solomon G, 1977. The most pronounced antipodes are the NN method and K-means

The principal problem is the choice of a measure of closeness that is adequate for the given problem and the given data

Frequently the results are bad because of a bad measure, not a bad method!

CLUSTERING DOCUMENTS ( General information, Works of Benno Stein group )

End