Clustering methods Course code: 175314 Pasi Fränti 10.3.2014 Speech & Image Processing Unit...

Clustering methods
Course code: 175314

Pasi Fränti

10.3.2014

Speech & Image Processing Unit
School of Computing

University of Eastern Finland
Joensuu, FINLAND

Part 1: Introduction

Sample data

Sources of RGB vectors

Red-Green plot of the vectors

Sample data

Employment statistics:

Application example 1: Color reconstruction

Image with compression artifacts

Image with original colors

Application example 2: Speaker modeling for voice biometrics

Training data

Feature extraction and clustering

Speaker models

Feature extraction

Best match: Matti!

Speaker modeling

Speech data

Result of clustering

Application example 3: Image segmentation

Normalized color plots according to red and green components.

Image with 4 color clusters

Application example 4: Quantization

Quantized signal

Original signal

Quantization: approximation of continuous-range values (or a very large set of possible discrete values) by a small set of discrete symbols or integer values.
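A minimal sketch of scalar quantization as defined above: each sample is replaced by the nearest entry of a small codebook. The function name `quantize`, the sample values, and the three-level codebook are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def quantize(signal, levels):
    """Map each sample to the nearest value in `levels` (a small codebook)."""
    signal = np.asarray(signal, dtype=float)
    levels = np.asarray(levels, dtype=float)
    # Absolute difference between every sample and every level,
    # then the index of the closest level per sample.
    idx = np.abs(signal[:, None] - levels[None, :]).argmin(axis=1)
    return idx, levels[idx]

# Hypothetical data: five samples in [0, 1] and a 3-symbol codebook.
original = np.array([0.12, 0.48, 0.51, 0.97, 0.03])
codebook = np.array([0.0, 0.5, 1.0])

symbols, quantized = quantize(original, codebook)
print(symbols)    # discrete symbols (indices into the codebook)
print(quantized)  # approximated signal
```

Here the codebook is fixed by hand; in the clustering setting the codebook would instead be learned from the data.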

Color quantization of images

Color image RGB samples

Clustering

Application example 5: Clustering of spatial data

Clustered locations of users

Clustering of photos

Timeline clustering

Clustering GPS trajectories: mobile users, taxi routes, fleet management

Conclusions from clusters

Cluster 1: Office

Cluster 2: Home

Part I: Clustering problem

Subproblems of clustering

1. Where are the clusters? (Algorithmic problem)

2. How many clusters? (Methodological problem: which criterion?)

3. Selection of attributes (Application-related problem)

4. Preprocessing the data (Practical problems: normalization, outliers)

Clustering result as partition

Illustrated by Voronoi diagram

Illustrated by Convex hulls

Cluster prototypes

Partition of data

Centroids as prototypes

Partition by nearest prototype mapping

Duality of partition and centroids

Challenges in clustering

Incorrect cluster allocation

Incorrect number of clusters

Figure labels: Cluster missing; Clusters missing; Too many clusters

How to solve?

Solve the clustering (algorithmic problem): Given input data (X) of N data vectors and the number of clusters (M), find the clusters. The result is given as a set of prototypes or as a partition.

Solve the number of clusters (mathematical problem): Define an appropriate cluster validity function f. Repeat the clustering algorithm for several values of M. Select the best result according to f.

Solve the problem efficiently (computer science problem).
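The "repeat for several M, select by f" recipe can be sketched as follows. The slides do not name a specific validity function, so this sketch assumes the Calinski-Harabasz index (higher is better) as f. The small k-means routine, its farthest-point initialization, and the synthetic three-blob data are likewise illustrative assumptions.

```python
import numpy as np

def nearest(X, C):
    """Index of the nearest centroid for every data vector."""
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)

def kmeans(X, M, iters=50):
    # Deterministic farthest-point initialization (an assumption,
    # chosen only to make the sketch repeatable).
    C = [X[0]]
    for _ in range(1, M):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in C], axis=0)
        C.append(X[d.argmax()])
    C = np.array(C, dtype=float)
    for _ in range(iters):
        P = nearest(X, C)
        new_C = C.copy()
        for j in range(M):
            if np.any(P == j):          # keep the old centroid if a cluster empties
                new_C[j] = X[P == j].mean(axis=0)
        if np.allclose(new_C, C):
            break
        C = new_C
    return C, nearest(X, C)

def calinski_harabasz(X, C, P):
    """Assumed validity function f: between/within variance ratio."""
    N, M = len(X), len(C)
    within = ((X - C[P]) ** 2).sum()
    counts = np.bincount(P, minlength=M)
    between = (counts * ((C - X.mean(axis=0)) ** 2).sum(axis=1)).sum()
    return (between / (M - 1)) / (within / (N - M))

# Hypothetical data: three well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((50, 2)) for c in centers])

# Repeat the clustering for several M and select the best result according to f.
scores = {M: calinski_harabasz(X, *kmeans(X, M)) for M in range(2, 7)}
best_M = max(scores, key=scores.get)
print(best_M)
```

Any other validity index could be substituted for `calinski_harabasz`; the selection loop stays the same.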

Taxonomy of clustering
[Jain, Murty, Flynn, Data clustering: A review, ACM Computing Surveys, 1999.]

• One possible classification is based on the cost function.

• MSE is well defined and the most popular cost function.

Definitions and data

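Set of N data points:
$X = \{x_1, x_2, \ldots, x_N\}$

Set of M cluster prototypes (centroids):
$C = \{c_1, c_2, \ldots, c_M\}$

Partition of the data:
$P = \{p_1, p_2, \ldots, p_N\}$, where $p_i \in \{1, \ldots, M\}$ is the index of the cluster that contains $x_i$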

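Distance and cost function

Euclidean distance of data vectors (K denotes the dimensionality of the vectors):

$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{K} (x_{ik} - x_{jk})^2}$$

Mean square error:

$$\mathrm{MSE}(C, P) = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - c_{p_i} \rVert^2$$

Dependency of data structures

Centroid condition: for a given partition (P), the optimal cluster centroids (C) for minimizing MSE are the average vectors of the clusters:

$$c_j = \frac{1}{n_j} \sum_{i:\, p_i = j} x_i, \quad \forall j \in [1, M]$$

where $n_j$ is the number of data vectors with $p_i = j$.

Optimal partition: for given centroids (C), the optimal partition assigns each data vector to its nearest centroid:

$$p_i = \arg\min_{1 \le j \le M} d(x_i, c_j)^2, \quad \forall i \in [1, N]$$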
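The centroid condition, the optimal partition, and the MSE can be sketched in a few lines of NumPy. This is an illustration under assumptions: the function names, the four sample points, and the starting centroids are hypothetical, and cluster indices run from 0 rather than 1 as in the formulas.

```python
import numpy as np

def optimal_partition(X, C):
    """p_i = argmin_j d(x_i, c_j)^2 for every data vector x_i."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def optimal_centroids(X, P, M):
    """c_j = average of the vectors with p_i = j (assumes no cluster is empty)."""
    return np.array([X[P == j].mean(axis=0) for j in range(M)])

def mse(X, C, P):
    """Mean squared distance from each vector to the centroid of its cluster."""
    return ((X - C[P]) ** 2).sum(axis=1).mean()

# Hypothetical data: two pairs of points and two rough starting centroids.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
C0 = np.array([[1.0, 0.0], [8.0, 0.0]])

P = optimal_partition(X, C0)        # partition from the centroids
C1 = optimal_centroids(X, P, M=2)   # centroids from the partition

print(P, C1)
print(mse(X, C0, P), mse(X, C1, P))  # MSE does not increase after the centroid step
```

Either structure determines the other, which is why a clustering result can be stored as a set of centroids or as a partition.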

Complexity of clustering

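• The clustering problem is NP-complete [Garey et al., 1982].

• Optimal solution by branch-and-bound in exponential time.

• Practical solutions by heuristic algorithms.

• Number of possible clusterings (Stirling number of the second kind):

$$S(N, M) = \frac{1}{M!} \sum_{j=1}^{M} (-1)^{M-j} \binom{M}{j} j^N$$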
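To get a feel for how quickly the number of possible clusterings grows, the count of ways to split N points into M non-empty clusters (the Stirling number of the second kind) can be evaluated directly. The function name `num_clusterings` is an illustrative assumption.

```python
from math import comb, factorial

def num_clusterings(N, M):
    """Number of ways to partition N points into M non-empty clusters."""
    total = sum((-1) ** (M - j) * comb(M, j) * j ** N for j in range(1, M + 1))
    # The alternating sum is always divisible by M!, so integer division is exact.
    return total // factorial(M)

print(num_clusterings(4, 2))    # 7
print(num_clusterings(10, 3))   # 9330
print(num_clusterings(100, 5))  # a 68-digit number
```

Even for 100 points and 5 clusters the count is far beyond exhaustive enumeration, which motivates the heuristic algorithms.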

Cluster software

• Main area: working space for data

• Input area: inputs to be processed

• Output area: obtained results

• Menu Process: selection of operation

http://cs.joensuu.fi/sipu/soft/cluster2009.exe

Clustering image

Data set

Codebook

Partition

Procedure to simulate k-means

Open data set (file *.ts), move it into Input area

Process – Random codebook, select number of clusters

REPEAT

Move obtained codebook from Output area into Input area

Process – Optimal partition, select Error function

Move codebook into Main area, partition into Input area

Process – Optimal codebook

UNTIL DESIRED CLUSTERING
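The manual procedure above corresponds to the k-means loop. The sketch below mirrors the menu steps in code. It assumes that "desired clustering" means the partition no longer changes, that the random codebook is drawn from the data vectors, and that squared Euclidean distance is the error function. The function name, seed, and sample data are hypothetical.

```python
import numpy as np

def kmeans(X, M, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # Process - Random codebook: pick M distinct data vectors.
    C = X[rng.choice(len(X), size=M, replace=False)].astype(float)
    P = None
    for _ in range(max_iter):                       # REPEAT
        # Process - Optimal partition (squared Euclidean error).
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        new_P = d2.argmin(axis=1)
        if P is not None and np.array_equal(new_P, P):
            break                                   # UNTIL the partition is stable
        P = new_P
        # Process - Optimal codebook: average vector of each cluster.
        for j in range(M):
            if np.any(P == j):                      # keep old centroid if cluster is empty
                C[j] = X[P == j].mean(axis=0)
    return C, P

# Hypothetical data set: two tight groups of three points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
C, P = kmeans(X, M=2)
print(P)
print(C)
```

Running the same steps by hand in the software shows the intermediate codebooks and partitions that this loop passes through.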

XLMiner software
http://www.resample.com/xlminer/help/HClst/HClst_ex.htm

Example of data in XLMiner

Distance matrix & dendrogram

Conclusions

Clustering is a fundamental tool in speech and image processing.

Clustering done poorly can invalidate the analysis of the application.

A good clustering tool is needed so that researchers can focus on application requirements.

Literature

1. S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 3rd edition, 2006.

2. C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

3. A.K. Jain, M.N. Murty and P.J. Flynn, Data clustering: A review, ACM Computing Surveys, 31(3): 264-323, September 1999.

4. M.R. Garey, D.S. Johnson and H.S. Witsenhausen, The complexity of the generalized Lloyd-Max problem, IEEE Transactions on Information Theory, 28(2): 255-256, March 1982.

5. F. Aurenhammer, Voronoi diagrams: a survey of a fundamental geometric data structure, ACM Computing Surveys, 23(3): 345-405, September 1991.
