Cluster analysis

• Partition Methods

Divide data into disjoint clusters.

• Hierarchical Methods

Build a hierarchy of the observations and deduce the clusters from it.

K-means

Criteria

Same criteria with multivariate data:

Justifying the criteria

• ANOVA: decomposition of the variance.

Univariate:

SST=SSW+SSB

Multivariate:

Minimizing the within-cluster variance is equivalent to maximizing the between-cluster variance (the difference between clusters).
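The univariate identity SST=SSW+SSB can be checked numerically. The following is a minimal sketch, assuming a small made-up data set and a fixed assignment of observations to two clusters; the function name `sum_of_squares_decomposition` and the data are illustrative, not taken from the slides.

```python
def sum_of_squares_decomposition(values, labels):
    """Return (SST, SSW, SSB) for univariate data split into labelled clusters."""
    n = len(values)
    grand_mean = sum(values) / n
    # Total sum of squares around the grand mean.
    sst = sum((x - grand_mean) ** 2 for x in values)

    # Group the observations by cluster label.
    groups = {}
    for x, g in zip(values, labels):
        groups.setdefault(g, []).append(x)

    ssw = 0.0  # within-cluster sum of squares
    ssb = 0.0  # between-cluster sum of squares
    for members in groups.values():
        cluster_mean = sum(members) / len(members)
        ssw += sum((x - cluster_mean) ** 2 for x in members)
        ssb += len(members) * (cluster_mean - grand_mean) ** 2
    return sst, ssw, ssb


# Illustrative data (assumption): two well-separated groups.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = [0, 0, 0, 1, 1, 1]
sst, ssw, ssb = sum_of_squares_decomposition(values, labels)
print(sst, ssw, ssb)
```

Because SST does not depend on the partition, any assignment that lowers SSW must raise SSB by the same amount.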

K-means algorithm
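A minimal sketch of one common formulation of k-means (Lloyd's iteration) follows. The slide's own statement of the algorithm is not reproduced here, so the details are assumptions: initial centres are drawn at random from the data, an empty cluster keeps its previous centre, and the loop stops when assignments no longer change. The names `kmeans`, `squared_distance`, and `mean_point` are illustrative.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Assumption: initial centres are k distinct observations chosen at random.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centre.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centres are too
        labels = new_labels
        # Update step: each centre becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old centre
                centers[j] = mean_point(members)
    return labels, centers


# Illustrative data (assumption): two compact, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

Each step can only lower the within-cluster sum of squares, but the result is a local optimum that depends on the starting centres, so several random starts are often compared in practice.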

Number of clusters

Consequences of standardization
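As an illustration of why standardization matters for distance-based methods such as k-means, the sketch below shows that the nearest neighbour of an observation can change once each variable is rescaled to unit standard deviation. The data (height in metres, weight in kilograms) and the helper names `standardize` and `nearest_neighbour` are made up for this example.

```python
import math


def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1 (population SD)."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    sds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / len(c))
        for c, m in zip(columns, means)
    ]
    return [tuple((x - m) / s for x, m, s in zip(row, means, sds)) for row in rows]


def nearest_neighbour(rows, i):
    """Index of the row closest to rows[i] in Euclidean distance."""
    others = [j for j in range(len(rows)) if j != i]
    return min(others, key=lambda j: math.dist(rows[i], rows[j]))


# Illustrative data (assumption): (height in m, weight in kg).
data = [(1.60, 60), (1.90, 61), (1.62, 66), (1.75, 90), (1.70, 45)]

raw_nn = nearest_neighbour(data, 0)               # weight dominates the raw distance
std_nn = nearest_neighbour(standardize(data), 0)  # both variables now count equally
print(raw_nn, std_nn)
```

On the raw scale the weight column dominates, so observation 0 is matched with the person of similar weight but very different height; after standardization it is matched with the person of similar height. Whether standardizing is desirable depends on whether the original scales carry meaning for the problem.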

Ruspini example

Problems of k-means

• Very sensitive to outliers

• Euclidean distances not appropriate for elliptical clusters

• It does not give the number of clusters.

Hierarchical Algorithms

Agglomerative algorithms

Nearest neighbour distance

Farthest neighbour distance

Average distance

Centroid method distance

Ward’s method distance
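The five between-cluster distances listed above can be sketched as follows, under their usual textbook definitions (the slides' own formulas are not reproduced here). The function names are illustrative, and the Ward distance is taken as the increase in the within-cluster sum of squares caused by a merge; some software uses a rescaled version of this quantity.

```python
import math
from itertools import product


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def nearest_neighbour_distance(a, b):
    """Single linkage: smallest distance between a point of a and a point of b."""
    return min(math.dist(p, q) for p, q in product(a, b))


def farthest_neighbour_distance(a, b):
    """Complete linkage: largest distance between a point of a and a point of b."""
    return max(math.dist(p, q) for p, q in product(a, b))


def average_distance(a, b):
    """Average linkage: mean of all pairwise distances between a and b."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid_distance(a, b):
    """Centroid method: distance between the two cluster means."""
    return math.dist(centroid(a), centroid(b))


def ward_distance(a, b):
    """Ward: increase in within-cluster sum of squares if a and b are merged."""
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * math.dist(centroid(a), centroid(b)) ** 2


# Illustrative clusters (assumption): two vertical pairs of points.
a = [(0, 0), (0, 2)]
b = [(3, 0), (3, 2)]
for f in (nearest_neighbour_distance, farthest_neighbour_distance,
          average_distance, centroid_distance, ward_distance):
    print(f.__name__, f(a, b))
```

An agglomerative algorithm starts with every observation as its own cluster and repeatedly merges the pair of clusters with the smallest value of the chosen distance.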

Dendrograms

Example

Problems of hierarchical clustering

• If n is large, it is slow: each step requires up to n(n-1)/2 comparisons.

• Euclidean distances not always appropriate

• If n is large, the dendrogram is difficult to interpret

Clustering by variables

Distances between quantitative variables
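One common way to measure how far apart two quantitative variables are is through their correlation. The sketch below uses d = 1 - r², which treats strongly positively and strongly negatively correlated variables as equally close. This particular choice is an assumption for illustration; the formula used on the original slide is not shown, and alternatives such as 1 - r exist. The function name `correlation_distance` and the data are made up.

```python
def correlation_distance(x, y):
    """Distance 1 - r^2 between two variables, r being the Pearson correlation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r_squared = sxy ** 2 / (sxx * syy)  # assumes neither variable is constant
    return 1.0 - r_squared


# Illustrative variables (assumption), each observed on the same four units.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]    # perfectly positively correlated with x
w = [8, 6, 4, 2]    # perfectly negatively correlated with x
z = [1, -1, 1, -1]  # weakly correlated with x

print(correlation_distance(x, y))
print(correlation_distance(x, w))
print(correlation_distance(x, z))
```

A matrix of such distances between all pairs of variables can then be passed to a hierarchical algorithm in place of distances between observations.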

Distances between qualitative variables

Similarity between attributes
