Clustering methods Course code: 175314 Pasi Fränti 10.3.2014 Speech & Image Processing Unit...

Clustering methods

Course code: 175314

Pasi Fränti

10.3.2014

Speech & Image Processing Unit

School of Computing

University of Eastern Finland

Joensuu, FINLAND

Part 1: Introduction

Sample data

Sources of RGB vectors

Red-Green plot of the vectors

Sample data

Employment statistics:

Application example 1: Color reconstruction

Image with compression artifacts

Image with original colors

Application example 2: Speaker modeling for voice biometrics

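[Figure: training data from the speakers Matti, Mikko, and Tomi passes through feature extraction and clustering to produce the speaker models. Features extracted from an unknown speaker (?) are compared against the models. Best match: Matti.]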

Speaker modeling

[Figure panels: speech data; result of clustering]

Application example 3: Image segmentation

Normalized color plots according to red and green components.

Image with 4 color clusters

[Plot axes: red, green]

Application example 4: Quantization

[Figure labels: quantized signal, original signal]

Approximation of continuous range values (or a very large set of possible discrete values) by a small set of discrete symbols or integer values
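As a minimal sketch of this idea, the Python snippet below maps each sample of a signal to the nearest value in a small set of discrete levels. The function name `quantize`, the 4-level codebook, and the sample values are assumptions made for illustration; they do not come from the course material.

```python
def quantize(signal, codebook):
    """Map every sample to the nearest codebook value.

    Returns (indices, reconstructed): the symbol index chosen for each
    sample, and the approximated signal rebuilt from those symbols.
    """
    indices = [
        min(range(len(codebook)), key=lambda j: abs(x - codebook[j]))
        for x in signal
    ]
    reconstructed = [codebook[j] for j in indices]
    return indices, reconstructed


# Hypothetical example: a 4-level codebook for samples in the range [0, 1].
codebook = [0.125, 0.375, 0.625, 0.875]
signal = [0.03, 0.41, 0.58, 0.99, 0.26]
indices, reconstructed = quantize(signal, codebook)
print(indices)        # symbol per sample
print(reconstructed)  # quantized signal
```

Here the codebook is fixed by hand. In the clustering setting, the codebook values would instead be cluster prototypes learned from the data.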

Color quantization of images

Color image RGB samples

Clustering

Application example 5: Clustering of spatial data

Clustered locations of users

Clustering of photos

Timeline clustering

Clustering GPS trajectories: mobile users, taxi routes, fleet management

Conclusions from clusters

Cluster 1: Office

Cluster 2: Home

Part I: Clustering problem

Subproblems of clustering

1. Where are the clusters? (Algorithmic problem)

2. How many clusters? (Methodological problem: which criterion?)

3. Selection of attributes (Application-related problem)

4. Preprocessing the data (Practical problems: normalization, outliers)

Clustering result as partition

Illustrated by Voronoi diagram

Illustrated by Convex hulls

Cluster prototypes

Partition of data

Centroids as prototypes

Partition by nearest prototype mapping

Duality of partition and centroids

Cluster missing

Clusters missing

Too many clusters

Incorrect cluster allocation

Incorrect number of clusters

Challenges in clustering

How to solve?

Solve the clustering (algorithmic problem): Given input data (X) of N data vectors and the number of clusters (M), find the clusters. The result is given as a set of prototypes or as a partition.

Solve the number of clusters (mathematical problem): Define an appropriate cluster validity function f. Repeat the clustering algorithm for several values of M. Select the best result according to f.

Solve the problem efficiently (computer science problem).

Taxonomy of clustering

[Jain, Murty, Flynn, Data clustering: A review, ACM Computing Surveys, 1999.]

• One possible classification based on cost function.

• MSE is well defined and most popular.

Definitions and data

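Set of N data points:

$$X = \{x_1, x_2, \ldots, x_N\}$$

Set of M cluster prototypes (centroids):

$$C = \{c_1, c_2, \ldots, c_M\}$$

Partition of the data:

$$P = \{p_1, p_2, \ldots, p_N\}$$

where $p_i \in \{1, \ldots, M\}$ is the index of the cluster to which $x_i$ is assigned.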

Distance and cost function

Euclidean distance of data vectors:

$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{K} \left(x_i^k - x_j^k\right)^2}$$

Mean square error:

$$MSE(C, P) = \frac{1}{N} \sum_{i=1}^{N} \left\| x_i - c_{p_i} \right\|^2$$
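The Euclidean distance and the mean square error can be sketched in a few lines of Python. The function names `distance` and `mse` and the toy data are assumptions for illustration, and the partition uses 0-based cluster indices, whereas the slides count clusters from 1.

```python
import math


def distance(a, b):
    """Euclidean distance between two K-dimensional vectors."""
    return math.sqrt(sum((ak - bk) ** 2 for ak, bk in zip(a, b)))


def mse(data, centroids, partition):
    """Mean squared distance from each vector to its assigned centroid.

    partition[i] is the 0-based index of the centroid assigned to data[i].
    """
    total = sum(distance(x, centroids[p]) ** 2 for x, p in zip(data, partition))
    return total / len(data)


# Hypothetical toy data: two clusters of two points each.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]
partition = [0, 0, 1, 1]
print(mse(data, centroids, partition))
```

Every point in the toy data lies at distance 1 from its centroid, so the printed error is 1.0.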

Dependency of data structures

Centroid condition: for a given partition (P), the optimal cluster centroids (C) for minimizing MSE are the average vectors of the clusters:

$$c_j = \frac{\sum_{p_i = j} x_i}{\sum_{p_i = j} 1}, \quad \forall j \in [1, M]$$

Optimal partition: for given centroids (C), the optimal partition is the one that maps each data vector to its nearest centroid:

$$p_i = \arg\min_{1 \le j \le M} d(x_i, c_j)^2, \quad \forall i \in [1, N]$$
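The two conditions translate directly into two small functions, sketched here in Python. The names `optimal_partition` and `optimal_centroids`, the 0-based cluster indices, and the choice to return `None` for an empty cluster are assumptions of this sketch, not something the slides specify.

```python
def optimal_partition(data, centroids):
    """Optimal partition: index of the nearest centroid for each vector."""
    def sq_dist(a, b):
        return sum((ak - bk) ** 2 for ak, bk in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda j: sq_dist(x, centroids[j]))
        for x in data
    ]


def optimal_centroids(data, partition, num_clusters):
    """Centroid condition: mean vector of each cluster.

    Assumption of this sketch: an empty cluster yields None.
    """
    dim = len(data[0])
    sums = [[0.0] * dim for _ in range(num_clusters)]
    counts = [0] * num_clusters
    for x, p in zip(data, partition):
        counts[p] += 1
        for k in range(dim):
            sums[p][k] += x[k]
    return [
        tuple(s / n for s in cluster_sum) if n else None
        for cluster_sum, n in zip(sums, counts)
    ]


# Hypothetical toy data and deliberately inexact starting centroids.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
partition = optimal_partition(data, [(1.0, 1.0), (8.0, 1.0)])
centroids = optimal_centroids(data, partition, 2)
```

Each function fixes one of the two structures and optimizes the other, which is the dependency that k-means exploits by alternating between them.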

Complexity of clustering

• The clustering problem is NP-complete [Garey et al., 1982]

• Optimal solution by branch-and-bound in exponential time.

• Practical solutions by heuristic algorithms.

• Number of possible clusterings:

$$\left\{ {N \atop M} \right\} = \frac{1}{M!} \sum_{j=1}^{M} (-1)^{M-j} \binom{M}{j} j^{N}$$
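The number of ways to split N data points into M non-empty clusters is the Stirling number of the second kind. The following Python sketch evaluates it with exact integer arithmetic to show how quickly the search space grows; the function name `num_clusterings` is an assumption for illustration.

```python
from math import comb, factorial


def num_clusterings(n, m):
    """Number of partitions of n points into m non-empty clusters
    (Stirling number of the second kind), via the alternating sum."""
    total = sum(
        (-1) ** (m - j) * comb(m, j) * j ** n
        for j in range(1, m + 1)
    )
    # The alternating sum is always an exact multiple of m!.
    return total // factorial(m)


print(num_clusterings(4, 2))   # 7
print(num_clusterings(10, 3))  # 9330
```

Ten points and three clusters already give 9330 candidate partitions, which is why exhaustive search is impractical for realistic data sizes.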

Cluster software

[Screenshot of the Cluster software showing the Main area, Input area, and Output area]

• Main area: working space for data

• Input area: inputs to be processed

• Output area: obtained results

• Menu Process: selection of operation

http://cs.joensuu.fi/sipu/soft/cluster2009.exe

[Figure labels: clustering image, data set, codebook, partition]

Procedure to simulate k-means

Open data set (file *.ts), move it into Input area

Process – Random codebook, select number of clusters

REPEAT

Move obtained codebook from Output area into Input area

Process – Optimal partition, select Error function

Move codebook into Main area, partition into Input area

Process – Optimal codebook

UNTIL DESIRED CLUSTERING
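The same loop can be sketched in Python outside the Cluster tool: a random codebook drawn from the data, followed by alternating optimal-partition and optimal-codebook steps. This is a sketch under stated assumptions, not the tool's implementation: the function names, the `seed` and `max_iterations` parameters, the stopping rule (stop when the partition no longer changes, in place of the manual "until desired clustering"), and the handling of empty clusters are all choices made here for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ak - bk) ** 2 for ak, bk in zip(a, b))


def optimal_partition(data, codebook):
    """Map each vector to the index of its nearest code vector."""
    return [
        min(range(len(codebook)), key=lambda j: sq_dist(x, codebook[j]))
        for x in data
    ]


def optimal_codebook(data, partition, old_codebook):
    """Replace each code vector by the mean of its cluster.

    Assumption of this sketch: an empty cluster keeps its old code vector.
    """
    new_codebook = []
    for j, old in enumerate(old_codebook):
        members = [x for x, p in zip(data, partition) if p == j]
        if members:
            new_codebook.append(
                tuple(sum(col) / len(members) for col in zip(*members))
            )
        else:
            new_codebook.append(old)
    return new_codebook


def k_means(data, num_clusters, max_iterations=100, seed=0):
    rng = random.Random(seed)
    codebook = rng.sample(data, num_clusters)  # random codebook
    partition = None
    for _ in range(max_iterations):
        new_partition = optimal_partition(data, codebook)
        if new_partition == partition:  # assumed stopping rule
            break
        partition = new_partition
        codebook = optimal_codebook(data, partition, codebook)
    return codebook, partition


# Hypothetical toy data: two well-separated pairs of points.
data = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
codebook, partition = k_means(data, 2)
```

On this toy data the loop settles on the code vectors (0.5, 0.0) and (10.5, 0.0). On harder data the result depends on the random initial codebook, which is one reason k-means is classed as a heuristic.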

XLMiner software

http://www.resample.com/xlminer/help/HClst/HClst_ex.htm

Example of data in XLMiner

Distance matrix & dendrogram

Conclusions

Clustering is a fundamental tool in speech and image processing.

Failing to do clustering properly may distort the application analysis.

A good clustering tool is needed so that researchers can focus on application requirements.

Literature

1. S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 3rd edition, 2006.

2. C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

3. A.K. Jain, M.N. Murty and P.J. Flynn, Data clustering: A review, ACM Computing Surveys, 31(3): 264-323, September 1999.

4. M.R. Garey, D.S. Johnson and H.S. Witsenhausen, The complexity of the generalized Lloyd-Max problem, IEEE Transactions on Information Theory, 28(2): 255-256, March 1982.

5. F. Aurenhammer, Voronoi diagrams: a survey of a fundamental geometric data structure, ACM Computing Surveys, 23(3): 345-405, September 1991.