TECHNIQUE OF CLUSTERING
( Panorama of methods, Works of Benno Stein group )
Mikhail Alexandrov, Centro de Investigación en Computación, Instituto Politécnico Nacional de México
(+ 52) 5729-6000 ext. 56544; dyner1950@mail.ru
Nature of data structure
Definition of the best data structure based on graph presentation
Numerical method of clustering
Development of MajorClust method from the density based group of methods
Validity and usability of clustering
F-measure for cluster usability (expert opinion)
Density index ρ for cluster validity (formal estimation)
Numerical experiments
Comparison of different methods of clustering and different ways of index selection
Results of Investigation of Benno Stein Group
TEXTUAL DATA
Subject of Grouping
NON TEXTUAL DATA
Local Terminology
The source of the data, textual or non-textual, is not important.
Data: work in the space of parameters
Texts: work in the space of words
TEXTS (‘indexed’)
Presentation of Textual Data
TEXTS (parametrized)
Local terminology
Indexed texts are parametrized texts in the space of words
Clustering
Characteristics: absence of patterns or descriptions of classes, so the results are defined by the nature of the data themselves ( N > 1 )
Synonyms: classification without teacher, unsupervised classification
Number of clusters:
[ ] is known exactly
[x] is known approximately
[ ] is not known => searching

Classification
Characteristics: presence of patterns or descriptions of classes, so the results are defined by the user ( N >= 1 )
Synonyms: classification with teacher, supervised classification
Special terms:
Categorization (of documents)
Diagnostics (engineering, medicine)
Recognition (engineering, science)
Types of Grouping
Objectives of Grouping
1. Organization (structuring) of object set
The process is called data structuring
Technologies: clustering, categorization
2. Improvement of the process of searching for interesting patterns
The process is called navigation
Technology: cluster-based search
3. Grouping for other applications:
- Knowledge discovery
Technology: clustering
- Summarization (generalization) of documents
Technologies: clustering, categorization
Note: do not confuse the type of grouping with its objective
Example 1.
Clustering of opinions of the experts invited by the Moscow City Government
Example 2.
Categorization of letters and complaints of Moscow residents addressed to the Mayor of the city
Learning
General books about Information Retrieval
1. Baeza-Yates, R., Ribeiro-Neto, B. Modern Information Retrieval. Addison Wesley, 1999
2. Manning, C.D., Schütze, H. Foundations of Statistical Natural Language Processing. MIT Press, 1999
Journals and Congresses about Information Retrieval
1. Journal “Knowledge and Information Systems”, Springer
2. KDD - Knowledge Discovery from Data Base
3. NLDB - Natural Language and Data Bases
4. XX-AI - National Congresses on Artificial Intelligence (MICAI, ….)
Learning
General books and articles about clustering
1. Hartigan J., Clustering Algorithms, Wiley, 1975 <= !
2. Kaufman L. and Rousseeuw P., Finding Groups in Data, Wiley, 1990
3. Gordon D. “How many clusters?”, Proc. of IFCS-96, Springer, series “Studies in classification, data analysis…..”, 1996
Articles about clustering texts
1. Eissen S., Stein B. “Analysis of clustering algorithms for Web based
search”, Springer, LNAI, N_2569
2. Stein B., Eissen S., Wibrock F. “On cluster validity and the information
need of users”, Proc. 3-rd Conf. Artif. Intel. Appl., IASTED, 2003
Universal packages with clustering algorithms
1. ClustAn (Scotland), www.clustan.com, Clustan Graphics-6 (2003)
2. MatLab. Descriptions are available on the Internet
3. Statistica. Descriptions are available on the Internet
Preprocessing <=
Processing
General Scheme of Clustering Process
Principal idea: to transform texts into numerical form so that mathematical tools can be used (the problem is grouping documents, not understanding them)
Here: both the rough and the refined matrices are object/attribute (Obj/Atr) matrices
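The transformation of texts into an Obj/Atr matrix can be illustrated with a minimal sketch. It assumes the simplest possible indexing (lowercasing, splitting on whitespace, raw word counts); the function name and the sample texts are hypothetical and are not taken from the slides.

```python
from collections import Counter

def build_object_attribute_matrix(texts):
    """Return (vocabulary, matrix); matrix[i][j] counts word j in text i."""
    # Assumption: words are whitespace-separated tokens, lowercased.
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # A Counter returns 0 for words that do not occur in this text.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

texts = ["clusters of texts", "texts and more texts", "graph clusters"]
vocabulary, obj_atr = build_object_attribute_matrix(texts)
for row in obj_atr:
    print(row)
```

Each row is one object (text) and each column one attribute (word). Real preprocessing would normally also remove stop words and weight the counts.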
General Scheme of Clustering Process
Preprocessing
Processing <=
Here: an Atr/Atr matrix can be used instead of the Obj/Obj matrix
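A possible way to obtain an Obj/Obj or an Atr/Atr matrix from an Obj/Atr matrix is sketched below. The cosine measure is an assumption made for illustration; the slides do not prescribe a particular closeness measure, and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def object_object_matrix(obj_atr):
    """Pairwise similarity between rows (objects)."""
    n = len(obj_atr)
    return [[cosine_similarity(obj_atr[i], obj_atr[j]) for j in range(n)]
            for i in range(n)]

def attribute_attribute_matrix(obj_atr):
    """Pairwise similarity between columns (attributes)."""
    columns = [list(column) for column in zip(*obj_atr)]
    return object_object_matrix(columns)

obj_atr = [[1, 1, 0], [1, 1, 0], [0, 0, 2]]
obj_obj = object_object_matrix(obj_atr)
atr_atr = attribute_attribute_matrix(obj_atr)
```

The Atr/Atr variant is the same computation applied to the transposed matrix, which is why either matrix can feed the processing stage.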
Matrixes to be Considered
Clustering for Categorization
Colour matrix “words-words”
before clustering
Clustering for Categorization
Colour matrix “words-words”
after clustering
Importance of Preprocessing
Definitions
Def. 1 “Let V be a set of objects. A clustering
C = { Ci | Ci ⊆ V } of V is a division of V into subsets for
which ∪i Ci = V and Ci ∩ Cj = ∅ for i ≠ j.”
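The two conditions of Def. 1 (the clusters cover V and do not overlap) can be checked mechanically. The following is a minimal sketch; the function name is hypothetical.

```python
def is_clustering(objects, clusters):
    """True if the clusters are disjoint subsets of objects whose union is objects."""
    objects = set(objects)
    seen = set()
    for cluster in clusters:
        cluster = set(cluster)
        if not cluster <= objects:   # every Ci must be a subset of V
            return False
        if seen & cluster:           # Ci and Cj must not intersect
            return False
        seen |= cluster
    return seen == objects           # the union of all Ci must be V

print(is_clustering({1, 2, 3, 4}, [{1, 2}, {3, 4}]))     # valid partition
print(is_clustering({1, 2, 3, 4}, [{1, 2}, {2, 3, 4}]))  # clusters overlap
```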
Def. 2 “Let V be a set of nodes, E a set of arcs, and φ a weight
function that reflects the distance between objects, so that we have
a weighted graph G = { V, E, φ }. In this case C is called a
clustering of G.”
In the framework of the second definition every Ci produces a
subgraph G(Ci). Both the subsets Ci and the subgraphs G(Ci) are
called clusters.
Graph
Set
Clique
Definitions
Principal note
Both definitions SAY NOTHING:
- about the quality of clusters
- about the number of clusters
Reason for the difficulties
Nowadays there is no general agreement on a universal definition of the term cluster
What does it mean that a clustering is good?
1. Closeness between objects inside clusters is essentially greater than the closeness between the clusters themselves
2. The constructed clusters correspond to the intuitive notions of users (they are natural clusters)
Classification of methods
1. Hierarchical methods: any neighbors (N is not given)
2. Methods oriented on examples: K-means (N is given)
3. Methods oriented on density: MajorClust (N is calculated automatically)
Based on the way of grouping
Classification of methods
Based on belonging to cluster
Exclusive methods
Every object belongs only to one cluster. Methods are named hard grouping methods
Non-exclusive methods
Every object can belong to several clusters. Methods are named soft grouping methods.
Based on data presentation
Methods oriented on sets in free metric space
Every object is presented as a point in a free space
Methods oriented on graphs
Every object is presented as an element of a graph
Hard classification
Hard clustering
Hard categorization
Soft classification
Soft clustering
Soft categorization
Example.
The distribution of letters from Muscovites
to the Government is a soft categorization
(numbers in the table reflect the relative
weight of each theme)
Fuzzy Grouping
Hierarchical methods
Agglomerative methods (bottom => top)
- Relations with neighbors
- Ward’s method
Divisive methods (top => bottom)
- Min cut method
- Methods based on dissimilarity
Hierarchical agglomerative methods
Neighbors
1. Nearest neighbor (single-linkage)
2. Farthest neighbor (complete linkage)
3. Averaged neighbor (average linkage)
- Unweighted Pair-Group Average, UPGMA
- Weighted Pair-Group Average, WPGMA
- Unweighted Pair-Group Centroid, UPGMC <= King method
Density
Ward’s method
Difference between methods: evaluation of closeness between clusters
Hierarchical methods
Neighbors. Every object is a cluster
General algorithm
1. Initially every object is a cluster of its own.
2. A series of steps is performed. At every step the closest pair of clusters is found and merged.
3. At the end we have one cluster.
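The general algorithm above can be sketched as follows. This is an illustrative, unoptimized version; the three linkage functions correspond to the nearest, farthest and averaged (UPGMA) neighbor variants, and all names as well as the one-dimensional sample data are assumptions made for the example.

```python
def single_linkage(a, b, dist):
    """Nearest neighbor: distance between the two closest members."""
    return min(dist(x, y) for x in a for y in b)

def complete_linkage(a, b, dist):
    """Farthest neighbor: distance between the two most distant members."""
    return max(dist(x, y) for x in a for y in b)

def average_linkage(a, b, dist):
    """Averaged neighbor (UPGMA): mean of all pairwise distances."""
    return sum(dist(x, y) for x in a for y in b) / (len(a) * len(b))

def agglomerate(points, linkage, dist):
    """Merge the closest pair of clusters until one cluster remains.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [[p] for p in points]      # step 1: every object is a cluster
    history = []
    while len(clusters) > 1:              # step 2: repeated merging
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j], dist)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history                        # step 3: one cluster is left

points = [0.0, 1.0, 5.0, 6.0, 20.0]
distance = lambda x, y: abs(x - y)
for a, b, d in agglomerate(points, single_linkage, distance):
    print(a, b, d)
```

Only the `linkage` argument changes between the methods, which matches the remark that they differ in how closeness between clusters is evaluated.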
Hierarchical methods
Nearest neighbor method (NN)
Hierarchical methods
Farthest neighbor method
Hierarchical methods
Averaged neighbor (Unweighted Pair-Group Centroid, UPGMC)
Hierarchical methods
Averaged neighbors
- Unweighted Pair-Group Centroid, UPGMC (a)
- Unweighted Pair-Group Average, UPGMA (b)
- Weighted Pair-Group Average, WPGMA
Difference: evaluation of the closeness between clusters
Methods oriented on examples
Variants of K-means methods
- K-means, centroid (a)
- K-means, medoid (b)
- K-means, fuzzy
Difference: selection of centers
Methods oriented on examples
K - means, centroid
Methods oriented on examples
Method K-means. General algorithm
1. Initially K centers are selected at random.
2. A series of steps is performed. At every step the objects are distributed among the centers according to the nearest-center criterion. Then all centers are recalculated.
3. The search ends when the clusters no longer change.
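The three steps can be sketched in plain Python as follows. This is the centroid variant; the optional `initial_centers` parameter is an addition made here so that the example is reproducible and is not part of the algorithm as stated.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Centroid of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, initial_centers=None, seed=0, max_iterations=100):
    # Step 1: choose K centers, at random unless they are supplied.
    if initial_centers is None:
        centers = random.Random(seed).sample(points, k)
    else:
        centers = list(initial_centers)
    assignment = None
    for _ in range(max_iterations):
        # Step 2a: give every object to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Step 3: stop when the clusters no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 2b: recalculate every center that still has members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = mean_point(members)
    return assignment, centers

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centers = k_means(points, 2, initial_centers=[(0, 0), (10, 10)])
print(assignment)
```

With random initial centers the result may depend on the starting choice, so in practice the procedure is often repeated with several seeds.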
Natural Structure of Data
Graph decomposition
C = (C1, C2, …, Cn) is a decomposition of the graph
Λ(C) is the weighted partial connectivity of C
λ(Ci) = λi is the edge connectivity of G(Ci)
H(C*, E(C*)) is the connectivity structure
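The slide names these quantities without giving a formula. The sketch below assumes the definition used in the graph-decomposition literature, Λ(C) = Σ |Ci| · λi, and computes the edge connectivity by brute force, which is only feasible for very small unweighted graphs. All names and the sample graph are hypothetical.

```python
from itertools import combinations

def edge_connectivity(nodes, edges):
    """Minimum number of edges whose removal disconnects the graph.

    Brute force over all two-sided splits; a single node has connectivity 0.
    """
    nodes = list(nodes)
    if len(nodes) < 2:
        return 0
    first, rest = nodes[0], nodes[1:]
    best = None
    # The side containing `first` never takes all nodes, so both sides are non-empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            side = {first, *extra}
            cut = sum(1 for u, v in edges if (u in side) != (v in side))
            if best is None or cut < best:
                best = cut
    return best

def weighted_partial_connectivity(clusters, edges):
    """Assumed definition: sum over clusters of |Ci| * lambda(G(Ci))."""
    total = 0
    for cluster in clusters:
        members = set(cluster)
        inner = [(u, v) for u, v in edges if u in members and v in members]
        total += len(members) * edge_connectivity(members, inner)
    return total

# Two triangles joined by one bridging edge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
print(weighted_partial_connectivity([{"a", "b", "c"}, {"d", "e", "f"}], edges))
print(weighted_partial_connectivity([{"a", "b", "c", "d", "e", "f"}], edges))
```

Under this assumed definition the decomposition into the two triangles scores higher than the undivided graph, which is the sense in which a larger Λ indicates a better structure.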
Density based methods
MajorClust method
Principal idea
An object belongs to the cluster to which the majority of its neighbors belong
Suboptimal solution
Only part of the neighbors is considered at every step.
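The principal idea can be sketched as a simple relabeling loop. This is a simplified illustration and not the published MajorClust procedure: ties are broken deterministically (an object keeps its cluster if that cluster is among the strongest, otherwise the smallest label wins), and the weighted graph is a hypothetical example.

```python
def major_clust(weights, max_rounds=100):
    """weights: dict node -> dict neighbor -> edge weight (symmetric graph).

    Returns a dict mapping every node to its cluster label.
    """
    label = {node: node for node in weights}   # every node starts alone
    for _ in range(max_rounds):
        changed = False
        for node in sorted(weights):
            # Sum the edge weights towards each neighboring cluster.
            votes = {}
            for neighbor, weight in weights[node].items():
                cluster = label[neighbor]
                votes[cluster] = votes.get(cluster, 0.0) + weight
            if not votes:
                continue                       # isolated node keeps its label
            best = max(votes.values())
            candidates = sorted(c for c, v in votes.items() if v == best)
            if label[node] in candidates:
                continue                       # already in a majority cluster
            label[node] = candidates[0]
            changed = True
        if not changed:
            break
    return label

# Two tight triangles connected by one weak edge c-d.
weights = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0, "d": 0.1},
    "d": {"c": 0.1, "e": 1.0, "f": 1.0},
    "e": {"d": 1.0, "f": 1.0},
    "f": {"d": 1.0, "e": 1.0},
}
print(major_clust(weights))
```

Each node looks only at its direct neighbors, which is why the result is a local, suboptimal solution; the number of clusters is not given in advance but emerges from the relabeling.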
Density based methods
Comparison of the Nearest Neighbor method and MajorClust
The NN-method does not change the cluster membership of objects.
MajorClust changes the cluster membership of objects.
Λ - criterion and MajorClust
Validity
Principal question: which of the formal validity measures best reflects the user's opinion?
Dunn index
Davies-Bouldin index
etc.
Dunn index (to be max)
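For reference, a minimal sketch of the Dunn index in its common textbook form: the smallest distance between objects of different clusters divided by the largest cluster diameter. The function name and the sample data are assumptions; the sketch requires at least two clusters and at least one cluster with two distinct objects.

```python
def dunn_index(clusters, dist):
    """Smallest between-cluster distance divided by largest cluster diameter."""
    min_between = min(
        dist(x, y)
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
        for x in a
        for y in b
    )
    max_diameter = max(dist(x, y) for c in clusters for x in c for y in c)
    return min_between / max_diameter

distance = lambda x, y: abs(x - y)
print(dunn_index([[0.0, 1.0], [5.0, 6.0]], distance))   # compact, well separated
print(dunn_index([[0.0, 5.0], [1.0, 6.0]], distance))   # stretched, interleaved
```

Compact and well separated clusters give a larger value, which is why the index is to be maximized.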
Validity and Usability
Conclusion
The expected density measure corresponds to the F-measure, which reflects the expert’s opinion
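The slides do not state how the F-measure is computed. The sketch below assumes one common definition for clusterings: for every expert class take the best harmonic mean of precision and recall over all clusters, then average these values weighted by class size. Names and data are hypothetical.

```python
def f_measure(classes, clusters):
    """Class-size weighted average of the best F value per expert class."""
    total_objects = sum(len(cls) for cls in classes)
    score = 0.0
    for cls in classes:
        cls = set(cls)
        best = 0.0
        for cluster in clusters:
            cluster = set(cluster)
            overlap = len(cls & cluster)
            if overlap == 0:
                continue
            precision = overlap / len(cluster)   # share of the cluster in the class
            recall = overlap / len(cls)          # share of the class in the cluster
            best = max(best, 2 * precision * recall / (precision + recall))
        score += len(cls) / total_objects * best
    return score

expert_classes = [{1, 2, 3}, {4, 5, 6}]
print(f_measure(expert_classes, [{1, 2, 3}, {4, 5, 6}]))   # perfect agreement
print(f_measure(expert_classes, [{1, 2}, {3, 4, 5, 6}]))   # one object misplaced
```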
Classification of Technologies
Meta methods (Technology)
They construct separated data sets using optimization criteria and constraints:
- Neither too many nor too few clusters
- Neither too large nor too small clusters
etc.
Visual methods (Technology)
They present visual images to a user so that the clusters can be selected manually
- Using different methods
- Comparing results
Meta Methods
Algorithm
1. Initially the number of clusters Kini and their centers Ci are given (randomly)
2. The K-medoid method (or any other one) is run to completion
3. If N > Nmax in any cluster, then this cluster is divided along its diagonal. Go to step 2
4. If N < Nmin in any cluster, then the closest clusters are joined. Go to step 2
5. If D > Dmax in any cluster, then this cluster is divided along its diagonal. Go to step 2
6. If D < Dmin in any cluster, then the closest clusters are joined. Go to step 2
7. When the number of iterations I > Imax, stop
Here: N is the number of objects in a given cluster
D is the diagonal of a given cluster
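The split and join rules (steps 3 to 7) can be sketched as below. Several simplifications are assumed: the diagonal D is taken to be the largest pairwise distance in a cluster, splitting "along the diagonal" assigns each object to the nearer end of that diagonal, the rules are checked cluster by cluster, and the re-run of the base method (step 2) is omitted. All names are hypothetical.

```python
def diameter(cluster, dist):
    """Largest pairwise distance and the pair of objects that realizes it."""
    best = (0.0, cluster[0], cluster[0])
    for i, x in enumerate(cluster):
        for y in cluster[i + 1:]:
            d = dist(x, y)
            if d > best[0]:
                best = (d, x, y)
    return best

def split_along_diagonal(cluster, dist):
    """Assign every object to the nearer end of the diagonal."""
    _, end_a, end_b = diameter(cluster, dist)
    part_a = [p for p in cluster if dist(p, end_a) <= dist(p, end_b)]
    part_b = [p for p in cluster if dist(p, end_a) > dist(p, end_b)]
    return part_a, part_b

def merge_with_closest(clusters, index, dist):
    """Join cluster `index` with the cluster that has the nearest object."""
    def gap(other):
        return min(dist(x, y) for x in clusters[index] for y in clusters[other])
    closest = min((j for j in range(len(clusters)) if j != index), key=gap)
    merged = clusters[index] + clusters[closest]
    rest = [c for j, c in enumerate(clusters) if j not in (index, closest)]
    return rest + [merged]

def adjust_once(clusters, n_min, n_max, d_min, d_max, dist):
    """Apply the first rule that fires; return (clusters, changed)."""
    for i, cluster in enumerate(clusters):
        size, diag = len(cluster), diameter(cluster, dist)[0]
        if (size > n_max or diag > d_max) and diag > 0:
            part_a, part_b = split_along_diagonal(cluster, dist)
            return clusters[:i] + clusters[i + 1:] + [part_a, part_b], True
        if (size < n_min or diag < d_min) and len(clusters) > 1:
            return merge_with_closest(clusters, i, dist), True
    return clusters, False

def meta_adjust(clusters, n_min, n_max, d_min, d_max, dist, max_iterations=20):
    """Repeat the adjustment until nothing changes or Imax is reached."""
    for _ in range(max_iterations):
        clusters, changed = adjust_once(clusters, n_min, n_max, d_min, d_max, dist)
        if not changed:
            break
    return clusters

distance = lambda x, y: abs(x - y)
start = [[0.0, 1.0, 2.0, 10.0, 11.0, 12.0], [13.0]]
print(meta_adjust(start, n_min=2, n_max=4, d_min=0.5, d_max=5.0, dist=distance))
```

The iteration limit matters: with conflicting constraints (for example a distant outlier that is too small to stand alone and too far away to be merged) the split and join rules can alternate indefinitely.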
Technique of Visual Clustering
Clustering on dendrite Clustering in space of factors
Demographic Map of the World, clusters in space of factors and on dendrite
Searching for Leaders in Clusters
Objectives ( Leader = Representative )
• To reduce the time needed to process a document corpus
• To improve the procedure of navigation on the Internet (new)
etc.
Typical documents
A. Typical document (the averaged one)
It reflects the principal idea of a given cluster
B. The least typical document
It is useful for reaching consensus between clusters
C. The most typical document
It reflects the idea of the difference between clusters
Variants of Leaders
C. Most typical
A. Typical
B. Least typical
Cluster
The typical, the most typical and
the least typical documents
form the so-called semantic sphere
(boundaries) of a cluster
Examples
Typical: the most popular costume in
China is jeans.
The most typical: the most typical
costume in China is kimono
The least typical: the least typical
costume in China is the European suit
(it is the closest one to the other
clusters)
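The three kinds of leaders can be selected with a short sketch. The selection rules are assumptions derived from the descriptions above: the typical object is the one nearest to the cluster centroid, the least typical is the one nearest to the objects of other clusters, and the most typical is the one farthest from them. Names and data are hypothetical.

```python
def centroid(cluster):
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def find_leaders(cluster, other_objects):
    """Pick the typical, most typical and least typical object of a cluster."""
    center = centroid(cluster)

    def gap(p):
        # Distance from p to the nearest object outside the cluster.
        return min(euclidean(p, o) for o in other_objects)

    return {
        "typical": min(cluster, key=lambda p: euclidean(p, center)),
        "most_typical": max(cluster, key=gap),
        "least_typical": min(cluster, key=gap),
    }

cluster = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
others = [(10, 0), (11, 0)]
print(find_leaders(cluster, others))
```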
Clustering Methods of Clustering
|| 21 XXC
Open Problems
1. Clustering with filtering (a)
A given object distribution reflects: real structure (nature) + noise (alien objects)
2. Validation of clusters (b)
What is really inside the black box?
3. Statistical clustering (c)
How to take the randomness of the data into account?
4. Visual presentation (d)
Facilities for visual clustering
Certain Observations
The number of clustering methods is slightly larger than the number of researchers working in this area.
The problem is not to find the best method for all cases. The problem is to find the method that is relevant for your data. Only you know which methods are the best for your own data.
To be sure that the results are really good and do not depend on the method used, it is recommended to test them with other methods. Ideally these methods should be antipodes.
Solomon G, 1977. The strongest antipodes are the NN-method and K-means
The principal problem is the choice of a closeness measure that is adequate for the given problem and the given data
Frequently the results are bad because of a bad measure, not because of a bad method!
CLUSTERING DOCUMENTS ( General information, Works of Benno Stein group )
End