TECHNIQUE OF CLUSTERING ( Panorama of methods, Works of Benno Stein group )


TECHNIQUE OF CLUSTERING

( Panorama of methods, Works of Benno Stein group )

Mikhail Alexandrov, Centro de Investigación en Computación, Instituto Politécnico Nacional de México

(+ 52) 5729-6000 ext. 56544; dyner1950@mail.ru

Nature of data structure

Definition of the best data structure based on graph presentation

Numerical method of clustering

Development of the MajorClust method from the density-based group of methods

Validity and usability of clustering

F-measure for cluster usability (expert opinion)
Density index ρ for cluster validity (formal estimation)

Numerical experiments

Comparison of different methods of clustering and different ways of index selection

Results of Investigation of Benno Stein Group

TEXTUAL DATA

Subject of Grouping

NON TEXTUAL DATA

Local Terminology

It does not matter whether the source of the data is textual or non-textual.

Data: work in the space of parameters. Texts: work in the space of words.

TEXTS (‘indexed’)

Presentation of Textual Data

TEXTS (parametrized)

Local Terminology

Indexed texts are parametrized texts in the space of words

Clustering vs. Classification

Clustering:
- Characteristics: absence of patterns or descriptions of classes, so the results are defined by the nature of the data themselves (N > 1)
- Synonyms: classification without teacher, unsupervised classification
- Number of clusters: [ ] is known exactly, [x] is known approximately, [ ] is not known => searching

Classification:
- Characteristics: presence of patterns or descriptions of classes, so the results are defined by the user (N >= 1)
- Synonyms: classification with teacher, supervised classification
- Special terms: categorization (of documents), diagnostics (technics, medicine), recognition (technics, science)

Types of Grouping

Objectives of Grouping

1. Organization (structuring) of object set

The process is named data structuring. Technologies: clustering, categorization

2. Improvement of the process of searching interesting patterns

Process is named navigation

Technology: cluster-based search

3. Grouping for other applications:

- Knowledge discovery

Technology: clustering

- Summarization (generalization) of documents. Technologies: clustering, categorization

Note: do not confuse the type of grouping with its objective

Example 1.

Clustering of opinions of the experts invited by the Moscow City Government

Example 2.

Categorization of letters and complaints of Moscow dwellers directed to the Mayor of the city

Learning

General books about Information Retrieval

1. Baeza-Yates, R., Ribeiro-Neto, B. Modern Information Retrieval. Addison Wesley, 1999
2. Manning, C.D., Schütze, H. Foundations of Statistical Natural Language Processing. MIT Press, 1999

Journals and Congresses about Information Retrieval

1. Journal “Knowledge and Information Systems”, Springer

2. KDD – Knowledge Discovery from Databases
3. NLDB – Natural Language and Data Bases
4. XX-AI – national congresses on Artificial Intelligence (MICAI, …)

Learning

General books and articles about clustering

1. Hartigan, J. Clustering Algorithms. Wiley, 1975 <= !
2. Kaufman, L., Rousseeuw, P. Finding Groups in Data. Wiley, 1990
3. Gordon, A.D. “How many clusters?”, Proc. of IFCS-96, Springer, series “Studies in Classification, Data Analysis, …”, 1996

Articles about clustering texts

1. Eissen, S., Stein, B. “Analysis of clustering algorithms for Web-based search”, Springer, LNAI 2569

2. Stein, B., Eissen, S., Wißbrock, F. “On cluster validity and the information need of users”, Proc. 3rd Conf. on Artificial Intelligence and Applications, IASTED, 2003

Universal packages with clustering algorithms

1. ClustAn (Scotland), www.clustan.com, Clustan Graphics 6 (2003)

2. MatLab – descriptions are available on the Internet

3. Statistica – descriptions are available on the Internet

Preprocessing <=

Processing

General Scheme of Clustering Process

Principal idea: to transform texts into numerical form in order to use mathematical tools (the problem is grouping documents, not understanding them).

Here: the crude and the refined matrices are object/attribute (Obj/Atr) matrices.
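A minimal sketch of this preprocessing step, assuming scikit-learn's TfidfVectorizer (the library choice and the sample documents are not from the slides): it builds the object/attribute (document-term) matrix that the clustering methods below consume.

# Hedged sketch: build an object/attribute (document-term) matrix with TF-IDF weights.
# Assumes scikit-learn is installed; the sample documents are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "clustering groups similar documents together",
    "classification assigns documents to known classes",
    "k-means and MajorClust are methods of clustering",
]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
obj_atr = vectorizer.fit_transform(documents)   # rows = objects (texts), columns = attributes (words)

print(obj_atr.shape)                            # (3, number of index terms)
print(vectorizer.get_feature_names_out())       # the index terms (attributes)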

General Scheme of Clustering Process

Preprocessing

Processing <=

Here:

Instead of the Obj/Obj matrix, the Atr/Atr matrix can be used

Matrices to be Considered

Clustering for Categorization

Colour matrix “words-words”

before clustering

Clustering for Categorization

Colour matrix “words-words”

after clustering

Importance of Preprocessing

Definitions

Def. 1: “Let V be a set of objects. A clustering C = { Ci | Ci ⊆ V } of V is a division of V into subsets for which ∪i Ci = V and Ci ∩ Cj = ∅ for i ≠ j.”

Def. 2: “Let V be a set of nodes, E a set of edges, and φ a weight function that reflects the distance between objects, so that we have a weighted graph G = { V, E, φ }. In this case C is called a clustering of G.”

In the framework of the second definition every Ci induces a subgraph G(Ci). Both the subsets Ci and the subgraphs G(Ci) are called clusters.

(Slide figure: examples of a set, a graph, and a clique)

Definitions

Principal note: both definitions SAY NOTHING

- about the quality of clusters

- about the number of clusters

Reason for the difficulties: nowadays there is no general agreement on a universal definition of the term cluster

What does it mean that a clustering is good?

1. The closeness between objects inside clusters is essentially greater than the closeness between the clusters themselves

2. The constructed clusters correspond to the intuitive expectations of users (they are natural clusters)

Classification of methods

1. Hierarchical methods: any neighbors (N is not given)

2. Methods oriented on examples: K-means (N is given)

3. Methods oriented on density: MajorClust (N is calculated automatically)

Based on the way of grouping

Classification of methods

Based on belonging to cluster

Exclusive methods

Every object belongs only to one cluster. Methods are named hard grouping methods

Non-exclusive methods

Every object can belong to several clusters. Methods are named soft grouping methods.

Based on data presentation

Methods oriented on sets in a free metric space: every object is presented as a point in a free space

Methods oriented on graphs: every object is presented as an element (node) of a graph

Hard classification

Hard clustering

Hard categorization

Soft classification

Soft clustering

Soft categorization

Example.

The distribution of letters of Muscovites to the Government is a soft categorization (the numbers in the table reflect the relative weight of each theme).
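A tiny illustration of the difference between hard and soft grouping, with invented membership values (the real table of Muscovites' letters is not reproduced here): a hard grouping stores one label per object, while a soft grouping needs a full object-by-theme weight matrix.

import numpy as np

# Hard (exclusive) grouping: each of four objects belongs to exactly one of two clusters.
hard_labels = np.array([0, 0, 1, 1])

# Soft (non-exclusive) grouping: each row holds the relative weights of one object
# over the two clusters (themes); invented numbers, analogous to the letters table.
soft_membership = np.array([
    [0.9, 0.1],
    [0.7, 0.3],
    [0.2, 0.8],
    [0.4, 0.6],
])

# A hard grouping can always be recovered from a soft one by taking the strongest membership.
print(soft_membership.argmax(axis=1))   # [0 0 1 1], matching hard_labels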

Fuzzy Grouping

Hierarchical methods

Agglomerative methods (bottom => top)

- Relations with neighbors

- Ward’s method

Divisive methods (top => bottom)

- Min cut method

- Methods based on dissimilarity

Hierarchical agglomerative methods

Neighbors

1. Nearest neighbor (single-linkage)

2. Farthest neighbor (complete linkage)

3. Averaged neighbor (average linkage)

- Unweighted Pair-Group Average UPGMA

- Weighted Pair-Group Average WPGMA

- Unweighted Pair-Group Centroid UPGMC <= King method

Density

Ward’s method

Difference between methods: evaluation of the closeness between clusters

Hierarchical methods

Neighbors. Initially every object is a cluster

General algorithm

1. Initially every object is one cluster.

2. A series of steps is performed: at every step the closest pair of clusters is found and the two clusters are merged.

3. At the end we have one cluster.
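A minimal sketch of this agglomerative scheme, assuming SciPy's hierarchical clustering routines and a small invented 2-D data set (neither appears in the slides); "single", "complete" and "average" linkage correspond to the nearest, farthest and averaged neighbor variants.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Invented toy data: six points forming two well separated groups.
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])

# Steps 1-2: start with every point as its own cluster and repeatedly merge the closest
# pair; "single" = nearest neighbor, "complete" = farthest neighbor, "average" = UPGMA.
merge_tree = linkage(points, method="single")

# Step 3: the merging ends in one cluster; cut the tree to obtain, for example, 2 clusters.
labels = fcluster(merge_tree, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]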

Hierarchical methods

Nearest neighbor method (NN)

Hierarchical methods

Farthest neighbor method

Hierarchical methods

Averaged neighbor (Unweighted Pair-Group Centroid, UPGMC)

Hierarchical methods

Averaged neighbors:

- Unweighted Pair-Group Centroid UPGMC (a)

- Unweighted Pair-Group Average UPGMA (b)

- Weighted Pair-Group Average WPGMA

Difference: evaluation of the closeness between clusters (panels (a) and (b) in the figure)

Methods oriented on examples

Variants of K-means methods:

- K-means, centroid (a)

- K-means, medoid (b)

- K-means, fuzzy

Difference: selection of centers (panels (a) and (b) in the figure)

Methods oriented on examples

K-means, centroid

Methods oriented on examples

Method K-means: General algorithm

1. Initially K centers are selected in some random way.

2. A series of steps is performed: at every step the objects are distributed among the centers according to the nearest-center criterion, and then all centers are recalculated.

3. The search ends when the clusters no longer change.
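A compact sketch of these three steps in plain NumPy, with an invented toy data set and K = 2 (a production run would rather use a library implementation such as scikit-learn's KMeans):

import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initially K centers are selected at random from the objects.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # 2a. Distribute the objects among the centers (nearest-center criterion).
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # 3. The search ends when the clusters no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 2b. Recalculate all centers (centroid variant: the mean of each cluster);
        #     keep the old center if a cluster happens to become empty.
        centers = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

points = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]], dtype=float)
labels, centers = k_means(points, k=2)
print(labels)    # two groups, e.g. [0 0 0 1 1 1]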

Natural Structure of Data


Graph decomposition

C = (C1, C2, …, Cn) is a decomposition of the graph

Λ(C) is the weighted partial connectivity of C

λ(Ci) = λi is the edge connectivity of G(Ci)

H(C*, E(C*)) is the connectivity structure
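A small illustrative sketch, assuming the definition Λ(C) = Σi |Ci| · λ(Ci) for the weighted partial connectivity (as used in the Stein group's papers; the graph and the decompositions below are invented):

import networkx as nx

def weighted_partial_connectivity(graph, decomposition):
    # Λ(C) = sum over all clusters Ci of |Ci| * λ(Ci), where λ(Ci) is the
    # edge connectivity of the subgraph induced by Ci.
    total = 0
    for nodes in decomposition:
        subgraph = graph.subgraph(nodes)
        lam = nx.edge_connectivity(subgraph) if len(nodes) > 1 else 0
        total += len(nodes) * lam
    return total

# Invented example: two triangles joined by a single bridge edge.
g = nx.Graph([(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)])
print(weighted_partial_connectivity(g, [{1, 2, 3}, {4, 5, 6}]))   # 3*2 + 3*2 = 12
print(weighted_partial_connectivity(g, [{1, 2, 3, 4, 5, 6}]))     # 6*1 = 6, a worse structure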

Density based methods

MajorClust method

Principal idea

An object belongs to the cluster to which the majority of its neighbors belong

Suboptimal solution

Only a part of the neighbors is considered at every step.
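A simplified re-implementation of this idea, written for these notes (it is not the Stein group's code, and the graph with its edge weights is invented): every node starts in its own cluster and repeatedly adopts the label that receives the largest total edge weight among its neighbors, until nothing changes.

import random

def major_clust(adjacency, max_sweeps=100, seed=0):
    # adjacency: {node: {neighbor: edge_weight}}; every node starts as its own cluster.
    rng = random.Random(seed)
    labels = {node: node for node in adjacency}
    for _ in range(max_sweeps):
        changed = False
        nodes = list(adjacency)
        rng.shuffle(nodes)
        for node in nodes:
            # Sum the edge weights supporting each neighboring label ...
            votes = {}
            for neighbor, weight in adjacency[node].items():
                votes[labels[neighbor]] = votes.get(labels[neighbor], 0.0) + weight
            if not votes:
                continue
            # ... and adopt the label supported by the (weighted) majority of neighbors.
            winner = max(votes, key=votes.get)
            if winner != labels[node]:
                labels[node] = winner
                changed = True
        if not changed:
            break
    return labels

# Invented example: two dense triangles connected by one weak edge.
graph = {
    "a": {"b": 1, "c": 1}, "b": {"a": 1, "c": 1}, "c": {"a": 1, "b": 1, "d": 0.1},
    "d": {"c": 0.1, "e": 1, "f": 1}, "e": {"d": 1, "f": 1}, "f": {"d": 1, "e": 1},
}
print(major_clust(graph))   # two groups are expected: {a, b, c} and {d, e, f}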

Density based methods

Comparison of the Nearest Neighbor method and MajorClust

The NN-method does not change the assignment of objects once it is made.

MajorClust can change the assignment of objects.

Λ - criterion and MajorClust

Validity

Principal question: which of the formal validity measures best reflects the user’s opinion?

Dunn index

Davies-Bouldin index

etc.

Dunn index (to be maximized)
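A hedged sketch of the Dunn index in its common form, minimum inter-cluster distance divided by maximum cluster diameter (the toy data are invented); larger values indicate a better clustering, hence "to be maximized":

import numpy as np

def dunn_index(points, labels):
    clusters = [points[labels == c] for c in np.unique(labels)]
    # Maximum cluster diameter: the largest within-cluster pairwise distance.
    max_diameter = max(
        np.max(np.linalg.norm(c[:, None, :] - c[None, :, :], axis=2)) for c in clusters
    )
    # Minimum separation: the smallest distance between points of different clusters.
    min_separation = min(
        np.min(np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2))
        for i, a in enumerate(clusters) for b in clusters[i + 1:]
    )
    return min_separation / max_diameter

points = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])
print(dunn_index(points, labels))   # well separated, compact clusters give a value >> 1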

Validity and Usability

Conclusion

The expected density measure corresponds to the F-measure reflecting experts’ opinions
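For reference, a sketch of the commonly used clustering F-measure, which compares a clustering with an expert partition: each expert class is matched with its best-fitting cluster via precision and recall, and the per-class F values are averaged with weights proportional to class size (the label vectors are invented).

import numpy as np

def clustering_f_measure(expert_labels, cluster_labels):
    expert_labels = np.asarray(expert_labels)
    cluster_labels = np.asarray(cluster_labels)
    n = len(expert_labels)
    total = 0.0
    for cls in np.unique(expert_labels):
        in_class = expert_labels == cls
        best_f = 0.0
        for clu in np.unique(cluster_labels):
            in_cluster = cluster_labels == clu
            overlap = np.sum(in_class & in_cluster)
            if overlap == 0:
                continue
            precision = overlap / np.sum(in_cluster)
            recall = overlap / np.sum(in_class)
            best_f = max(best_f, 2 * precision * recall / (precision + recall))
        # Each expert class contributes its best-matching cluster, weighted by class size.
        total += np.sum(in_class) / n * best_f
    return total

print(clustering_f_measure([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))   # about 0.83; 1.0 = perfect match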

Classification of Technologies

Meta methods (Technology)

They construct separated data sets using optimization criteria and constraints:

- neither too large nor too small a number of clusters

- neither too large nor too small a size of clusters

etc.

Visual methods (Technology)

They present visual images to a user so that the clusters can be selected manually

- Using different methods

- Comparing results

Meta Methods: Algorithm

1. Initially the number of clusters Kini and their centers Ci are given (randomly)

2. The K-medoid method (or any other one) is executed

3. If N > Nmax in any cluster, then this cluster is divided along its diagonal. Go to step 2

4. If N < Nmin in any cluster, then the closest clusters are joined. Go to step 2

5. If D > Dmax in any cluster, then this cluster is divided along its diagonal. Go to step 2

6. If D < Dmin in any cluster, then the closest clusters are joined. Go to step 2

7. When the number of iterations I > Imax, stop

Here: N is the number of objects in a given cluster, D is the diagonal of a given cluster
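A rough sketch of this split/merge control loop (an ISODATA-style adaptation written for these notes): the thresholds and the toy data are invented, scikit-learn's KMeans stands in for K-medoid, and instead of literally cutting a cluster along its diagonal the sketch simply increases or decreases the number of clusters and re-runs the base method.

import numpy as np
from sklearn.cluster import KMeans

def bounding_box_diagonal(cluster_points):
    # D: the diagonal of the cluster's bounding box.
    return np.linalg.norm(cluster_points.max(axis=0) - cluster_points.min(axis=0))

def meta_clustering(points, k_ini=2, n_max=40, n_min=3, d_max=6.0, d_min=0.5, i_max=10):
    k = k_ini
    labels = np.zeros(len(points), dtype=int)
    for _ in range(i_max):                                   # step 7: at most Imax iterations
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)   # step 2
        sizes = np.bincount(labels, minlength=k)
        diagonals = np.array([bounding_box_diagonal(points[labels == j]) for j in range(k)])
        if np.any(sizes > n_max) or np.any(diagonals > d_max):                 # steps 3 and 5: split
            k += 1
        elif k > 1 and (np.any(sizes < n_min) or np.any(diagonals < d_min)):   # steps 4 and 6: merge
            k -= 1
        else:
            break
    return labels

# Invented data: three compact groups; starting from Kini = 2 the loop settles on 3 clusters.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in ([0, 0], [5, 5], [0, 5])])
print(meta_clustering(points, k_ini=2))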

Technique of Visual Clustering

Clustering on a dendrite / Clustering in the space of factors

Demographic Map of the World: clusters in the space of factors and on the dendrite

Searching for Leaders in Clusters

Objectives ( Leader = Representative )

• To reduce the time needed for elaboration of a document corpus
• To improve the procedure of navigation in the Internet (new)

etc.

Typical documents

A. Typical document (the averaged one)

It reflects the principal idea of a given cluster

B. The least typical document

It is good for reaching consensus between clusters

C. The most typical document

It reflects the idea of difference between clusters

Variants of Leaders

(Slide figure: a cluster with its typical (A), most typical (C), and least typical (B) documents marked)

The typical, the most typical and the least typical documents form the so-called semantic sphere (boundaries) of a cluster.

Examples

Typical: the most popular costume in China is jeans.

The most typical: the most typical costume in China is the kimono.

The least typical: the least typical costume in China is the European suit (it is the closest one to the other clusters).
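One plausible way to compute the three kinds of leaders, sketched here under an assumption that the slides do not spell out: the typical document is the one closest to its own cluster centroid, the most typical is the one farthest from the other clusters, and the least typical is the one closest to them (the document vectors are invented).

import numpy as np

def cluster_leaders(vectors, labels, cluster_id):
    own = np.where(labels == cluster_id)[0]
    own_centroid = vectors[own].mean(axis=0)
    other_centroids = np.array([vectors[labels == c].mean(axis=0)
                                for c in np.unique(labels) if c != cluster_id])
    dist_to_own = np.linalg.norm(vectors[own] - own_centroid, axis=1)
    dist_to_others = np.linalg.norm(
        vectors[own][:, None, :] - other_centroids[None, :, :], axis=2).min(axis=1)
    return {
        "typical": own[dist_to_own.argmin()],           # A: the "averaged" document
        "most_typical": own[dist_to_others.argmax()],   # C: stresses the difference between clusters
        "least_typical": own[dist_to_others.argmin()],  # B: the closest one to the other clusters
    }

vectors = np.array([[0.0, 0.1], [0.1, 0.0], [0.4, 0.4], [1.0, 1.0], [0.9, 1.1]])
labels = np.array([0, 0, 0, 1, 1])
print(cluster_leaders(vectors, labels, cluster_id=0))   # indices of the three leader documents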

Methods of Clustering


Open Problems

1. Clustering with filtering (a)

The given object distribution reflects: real structure (nature) + noise (alien objects)

2. Validation of clusters (b)

What is really inside the black box?

3. Statistical clustering (c)

How to take into account the randomness of the data?

4. Visual presentation (d)

Facility for visual clustering

Certain Observations

The number of clustering methods is a little bit larger than the number of researchers working in this area.

The problem does not consist in searching for the best method for all cases; it consists in searching for the method that is relevant for your data. Only you know which methods are the best for your own data.

To be sure that the results are really good and do not depend on the method used, it is recommended to test these results with some other methods. These methods should preferably be antipodes.

Solomon G., 1977. The strongest antipodes are the NN-method and K-means.

The principal problem consists in the choice of a closeness measure that is adequate for a given problem and given data.

Frequently the results are bad because of a bad measure, not a bad method!

CLUSTERING DOCUMENTS ( General information, Works of Benno Stein group )

End
