Genetic algorithms (GA) for clustering Pasi Fränti Clustering Methods: Part 2e Speech and Image Processing Unit School of Computing University of Eastern Finland


Page 1: Genetic algorithms (GA) for clustering

Genetic algorithms (GA) for clustering

Pasi Fränti

Clustering Methods: Part 2e

Speech and Image Processing Unit
School of Computing

University of Eastern Finland

Page 2: Genetic algorithms (GA) for clustering

General structure

Genetic Algorithm:
  Generate S initial solutions
  REPEAT Z iterations
    Select best solutions
    Create new solutions by crossover
    Mutate solutions
  END-REPEAT
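The loop above can be sketched in Python. Everything below (the toy 1-D data, the helper names, the parameter values) is illustrative, not the authors' implementation; a solution is simply a list of M centroids.

```python
import random

# Toy 1-D data with two obvious clusters; a "solution" is a list of
# M centroids. All names and parameters here are illustrative.
DATA = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
M = 2

def distortion(centroids):
    # Sum of squared distances from each point to its nearest centroid.
    return sum(min((x - c) ** 2 for c in centroids) for x in DATA)

def random_solution(rng):
    return [rng.choice(DATA) for _ in range(M)]

def crossover(a, b, rng):
    # Random crossover: sample M centroids from the union of the parents.
    return rng.sample(a + b, M)

def mutate(sol, rng, prob=0.1):
    # Small random change: move one centroid to a random data point.
    if rng.random() < prob:
        sol = sol[:]
        sol[rng.randrange(M)] = rng.choice(DATA)
    return sol

def genetic_algorithm(S=8, Z=30, seed=1):
    # Generate S initial solutions, then repeat Z iterations of
    # selection, crossover and mutation, as in the slide.
    rng = random.Random(seed)
    population = [random_solution(rng) for _ in range(S)]
    for _ in range(Z):
        population.sort(key=distortion)        # select best solutions
        parents = population[:S // 2]
        population = [mutate(crossover(rng.choice(parents),
                                       rng.choice(parents), rng), rng)
                      for _ in range(S)]
    return min(population, key=distortion)
```

With two well-separated clusters, the returned solution should place one centroid near each cluster (distortion well below that of a degenerate one-cluster solution).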

Page 3: Genetic algorithms (GA) for clustering

Components of GA

• Representation of solution

• Selection method

• Crossover method

• Mutation

The crossover method is the most critical!

Page 4: Genetic algorithms (GA) for clustering

Representation of solution

• Partition (P):
  – Optimal centroid can be calculated from P.
  – Only local changes can be made.

• Codebook (C):
  – Optimal partition can be calculated from C.
  – Calculation of P takes O(NM) time, which is slow.

• Combined (C, P):
  – Both data structures are needed anyway.
  – Computationally more efficient.
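The two mappings the slide relies on can be sketched as follows (toy 1-D data, hypothetical function names):

```python
def optimal_partition(data, centroids):
    # Optimal partition from a codebook C: assign each point to its
    # nearest centroid. This is the O(NM) computation mentioned above.
    return [min(range(len(centroids)),
                key=lambda j: (x - centroids[j]) ** 2)
            for x in data]

def optimal_centroids(data, partition, m):
    # Optimal codebook from a partition P: each centroid is the mean
    # of the points assigned to its cluster.
    sums, counts = [0.0] * m, [0] * m
    for x, p in zip(data, partition):
        sums[p] += x
        counts[p] += 1
    return [sums[j] / counts[j] if counts[j] else 0.0 for j in range(m)]
```

Keeping both C and P (the combined representation) avoids recomputing either mapping from scratch after every crossover.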

Page 5: Genetic algorithms (GA) for clustering

Selection method

• To select which solutions will be used in crossover for generating new solutions.

• Main principle: good solutions should be used rather than weak solutions.

• Two main strategies:

– Roulette wheel selection

– Elitist selection.

• Exact implementation not so important.

Page 6: Genetic algorithms (GA) for clustering

Roulette wheel selection

• Select two candidate solutions for the crossover randomly.

• Probability for a solution to be selected is weighted according to its distortion:

  w(Ci, Pi) = 1 / distortion(Ci, Pi)

  p(Ci, Pi) = w(Ci, Pi) / Σj=1..S w(Cj, Pj)
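A minimal sketch of this selection rule (the function name and list-based interface are mine):

```python
import random

def roulette_select(solutions, distortions, rng):
    # Weight w_i = 1 / distortion_i; selection probability
    # p_i = w_i / sum_j w_j, as in the formulas above.
    weights = [1.0 / d for d in distortions]
    r = rng.random() * sum(weights)
    acc = 0.0
    for sol, w in zip(solutions, weights):
        acc += w
        if r <= acc:
            return sol
    return solutions[-1]
```

A solution with distortion 1.0 is picked nine times as often as one with distortion 9.0, since the weights are 1 and 1/9.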

Page 7: Genetic algorithms (GA) for clustering

Elitist selection

Elitist approach using zigzag scanning among the best

solutions

SelectNextPair(i, j):
  REPEAT
    IF (i+j) MOD 2 = 0 THEN
      i ← max(1, i−1); j ← j+1
    ELSE
      j ← max(1, j−1); i ← i+1
  UNTIL i ≠ j
  RETURN (i, j)

• Main principle: select all possible pairs among the best candidates.
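The zigzag scan above can be traced in Python. Note one assumption: the slide's garbled UNTIL condition is read here as "i ≠ j" (pairs on the diagonal are skipped); ranks start from 1 = best solution.

```python
def select_next_pair(i, j):
    # One step of the zigzag scan over pairs of ranked solutions.
    # Repeats the move until i != j; reading the UNTIL condition as
    # "i != j" is an assumption on my part.
    while True:
        if (i + j) % 2 == 0:
            i, j = max(1, i - 1), j + 1
        else:
            j, i = max(1, j - 1), i + 1
        if i != j:
            return i, j
```

Starting from (1, 1), repeated calls visit (1, 2), (2, 1), (3, 1), (1, 3), ..., i.e. all pairs among the best candidates in zigzag order.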

Page 8: Genetic algorithms (GA) for clustering

Crossover methods

Different variants for crossover:
• Random crossover
• Centroid distance
• Pairwise crossover
• Largest partitions
• PNN

Local fine-tuning:
• All methods give a new allocation of the centroids.
• Local fine-tuning must be made by K-means.
• Two iterations of K-means is enough.

Page 9: Genetic algorithms (GA) for clustering

Random crossover

[Figure: centroids of Solution 1 and Solution 2 combined into one set]

Select M/2 centroids randomly from the two parents.

Page 10: Genetic algorithms (GA) for clustering

How to create a new solution?
Pick M/2 randomly chosen cluster centroids from each of the two parents in turn.

How many solutions are there?
36 possibilities how to create a new solution.

What is the probability to select a good one?
Not high: some are good but K-means is needed; most are bad. See statistics below.

Some possibilities (M = 4):

  Parent A    Parent B    Rating
  c2, c4      c1, c4      Optimal
  c1, c2      c3, c4      Good (K-means)
  c2, c3      c2, c3      Bad

Rough statistics:
  Optimal: 1, Good: 7, Bad: 28

(M = number of clusters)

[Figure: parent solutions A and B, each showing data points and centroids c1–c4]
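A sketch of the random crossover itself (centroids are opaque items here; names are mine):

```python
import random

def random_crossover(parent_a, parent_b, rng):
    # Pick M/2 centroids from each of the two parents (M even here).
    m = len(parent_a)
    return rng.sample(parent_a, m // 2) + rng.sample(parent_b, m - m // 2)
```

The child always has M centroids, half from each parent, which is exactly why most children need K-means fine-tuning afterwards.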

Page 11: Genetic algorithms (GA) for clustering

[Figure: parent solutions A and B, and three child solutions: optimal, good, and bad]

Page 12: Genetic algorithms (GA) for clustering

Centroid distance crossover
[Pan, McInnes, Jack, 1995: Electronics Letters]
[Scheunders, 1996: Pattern Recognition Letters]

• For each centroid, calculate its distance to the center point of the entire data set.

• Sort the centroids according to the distance.

• Divide into two sets: central vectors (M/2 closest) and distant vectors (M/2 furthest).

• Take central vectors from one codebook and distant vectors from the other.
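The steps above, for variant (a), can be sketched with 1-D toy data (function name and interface are mine):

```python
def centroid_distance_crossover(parent_a, parent_b, data):
    # Variant (a): central vectors (M/2 closest to the data mean)
    # from parent A, distant vectors (M/2 furthest) from parent B.
    center = sum(data) / len(data)
    m = len(parent_a)
    a = sorted(parent_a, key=lambda c: abs(c - center))
    b = sorted(parent_b, key=lambda c: abs(c - center))
    return a[:m // 2] + b[m // 2:]
```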

Page 13: Genetic algorithms (GA) for clustering

New solution:
Variant (a): take central vectors from parent solution A and distant vectors from parent solution B, OR
Variant (b): take distant vectors from parent solution A and central vectors from parent solution B.

Worked example (M = 4), where Ced is the centroid of the entire data set:

1) Distances d(ci, Ced):
   A: d(c4, Ced) < d(c2, Ced) < d(c1, Ced) < d(c3, Ced)
   B: d(c1, Ced) < d(c3, Ced) < d(c2, Ced) < d(c4, Ced)

2) Sort centroids according to the distance:
   A: c4, c2, c1, c3    B: c1, c3, c2, c4

3) Divide into two sets:
   A: central vectors c4, c2; distant vectors c1, c3
   B: central vectors c1, c3; distant vectors c2, c4

[Figure: parent solutions A and B with data points, centroids c1–c4, and the data set centroid Ced]

Page 14: Genetic algorithms (GA) for clustering

Child solutions:

[Figure: child solution of variant (a) (central vectors of A + distant vectors of B) and of variant (b) (distant vectors of A + central vectors of B), with centroids c1–c4 and the data set centroid Ced]

Page 15: Genetic algorithms (GA) for clustering

Pairwise crossover
[Fränti et al., 1997: The Computer Journal]

Greedy approach:

• For each centroid, find its nearest centroid in the other parent solution that is not yet used.

• Among all pairs, select one of the two randomly.

Small improvement:

• No reason to consider the parents as separate solutions.

• Take union of all centroids.

• Make the pairing independent of parent.
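The greedy pairing step can be sketched with 1-D toy centroids (names and the per-pair random pick are my reading of the description):

```python
import random

def pairwise_crossover(parent_a, parent_b, rng):
    # Greedy pairing: each centroid of A is paired with its nearest
    # not-yet-used centroid of B; one of each pair is kept at random.
    unused = list(parent_b)
    child = []
    for ca in parent_a:
        cb = min(unused, key=lambda c: (c - ca) ** 2)
        unused.remove(cb)
        child.append(rng.choice((ca, cb)))
    return child
```

The improved variant in the slide would instead pool all centroids from both parents before pairing, so the pairing no longer depends on which parent a centroid came from.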

Page 16: Genetic algorithms (GA) for clustering

Initial parent solutions

Pairwise crossover example

MSE = 8.79×10^9 and MSE = 11.92×10^9

Page 17: Genetic algorithms (GA) for clustering

Pairwise crossover example

Pairing between parent solutions

MSE = 7.34×10^9

Page 18: Genetic algorithms (GA) for clustering

Pairing without restrictions

MSE = 4.76×10^9

Pairwise crossover example

Page 19: Genetic algorithms (GA) for clustering

Largest partitions
[Fränti et al., 1997: The Computer Journal]

• Select centroids that represent the largest clusters.

• Selection is done in a greedy manner.

• (illustration to appear later)

Page 20: Genetic algorithms (GA) for clustering

PNN crossover for GA
[Fränti et al., 1997: The Computer Journal]

[Figure: two initial solutions are combined into a union codebook; PNN merging then produces the combined solution]

Page 21: Genetic algorithms (GA) for clustering

The PNN crossover method (1) [Fränti, 2000: Pattern Recognition Letters]

CrossSolutions(C1, P1, C2, P2) → (Cnew, Pnew)
  Cnew ← CombineCentroids(C1, C2)
  Pnew ← CombinePartitions(P1, P2)
  Cnew ← UpdateCentroids(Cnew, Pnew)
  RemoveEmptyClusters(Cnew, Pnew)
  PerformPNN(Cnew, Pnew)

CombineCentroids(C1, C2) → Cnew
  Cnew ← C1 ∪ C2

CombinePartitions(Cnew, P1, P2) → Pnew
  FOR i ← 1 TO N DO
    IF ||xi − c(p1i)||² < ||xi − c(p2i)||² THEN
      pnew_i ← p1i
    ELSE
      pnew_i ← p2i
  END-FOR

Page 22: Genetic algorithms (GA) for clustering

The PNN crossover method (2)

UpdateCentroids(Cnew, Pnew) → Cnew
  FOR j ← 1 TO |Cnew| DO
    cnew_j ← CalculateCentroid(Pnew, j)

PerformPNN(Cnew, Pnew)
  FOR i ← 1 TO |Cnew| DO
    qi ← FindNearestNeighbor(ci)
  WHILE |Cnew| > M DO
    a ← FindMinimumDistance(Q)
    b ← qa
    MergeClusters(ca, pa, cb, pb)
    UpdatePointers(Q)
  END-WHILE
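The merging loop of PerformPNN can be sketched naively in Python. This is a brute-force version with 1-D centroids and cluster sizes; the slides' nearest-neighbor pointer table Q is replaced by a full search each round, so it is a sketch of the idea, not the efficient algorithm.

```python
def perform_pnn(centroids, sizes, m):
    # While more than M clusters remain, merge the pair with the
    # smallest size-weighted merge cost (Ward-style for 1-D points).
    cents, ns = list(centroids), list(sizes)
    while len(cents) > m:
        best = None
        for a in range(len(cents)):
            for b in range(a + 1, len(cents)):
                cost = ns[a] * ns[b] / (ns[a] + ns[b]) * (cents[a] - cents[b]) ** 2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        n = ns[a] + ns[b]
        cents[a] = (ns[a] * cents[a] + ns[b] * cents[b]) / n  # merged centroid
        ns[a] = n
        del cents[b], ns[b]
    return cents, ns
```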

Page 23: Genetic algorithms (GA) for clustering

Importance of K-means (Random crossover)

[Figure: distortion (160–260) vs. generation (0–50) on Bridge; best and worst solutions of the population, with and without k-means]

Page 24: Genetic algorithms (GA) for clustering

Effect of crossover method (with k-means iterations)

Bridge

[Figure: distortion (160–190) vs. generation (0–50); curves for Random, Cent.dist., Pairwise, Largest partitions, and PNN]

Page 25: Genetic algorithms (GA) for clustering

Effect of crossover method (with k-means iterations)

Binary data (Bridge2)

[Figure: distortion (1.25–1.50) vs. generation (0–50); curves for Random, Cent.dist., Pairwise, Largest partitions, and PNN]

Page 26: Genetic algorithms (GA) for clustering

Mutations

• Purpose is to implement small random changes to the solutions.

• Happens with a small probability.

• Sensible approach: change the location of one centroid by a random swap!

• Role of mutations is to simulate local search.

• If mutations are needed, the crossover method is not very good.
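The random swap mutation can be sketched as follows (toy 1-D version; the function name and probability value are illustrative):

```python
import random

def random_swap_mutation(centroids, data, rng, prob=0.05):
    # With small probability, replace one randomly chosen centroid
    # by a randomly chosen data point (the "random swap").
    if rng.random() >= prob:
        return centroids
    mutated = list(centroids)
    mutated[rng.randrange(len(mutated))] = rng.choice(data)
    return mutated
```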

Page 27: Genetic algorithms (GA) for clustering

Effect of k-means and mutations

[Figure: distortion (160–180) vs. number of iterations (0–50) on Bridge; curves for Mutations + K-means, Random crossover + K-means, PNN crossover + K-means, and PNN alone]

Mutations alone better than random crossover!
K-means improves but not vital.

Page 28: Genetic algorithms (GA) for clustering

Pseudo code of GAIS [Virmajoki & Fränti, 2006: Pattern Recognition]

GeneticAlgorithm(X) → (C, P)
  FOR i ← 1 TO Z DO
    Ci ← RandomCodebook(X)
    Pi ← OptimalPartition(X, Ci)
  SortSolutions(C, P)
  REPEAT
    {C, P} ← CreateNewSolutions({C, P})
    SortSolutions(C, P)
  UNTIL no improvement

CreateNewSolutions({C, P}) → {Cnew, Pnew}
  Cnew_1, Pnew_1 ← C1, P1
  FOR i ← 2 TO Z DO
    (a, b) ← SelectNextPair
    Cnew_i, Pnew_i ← Cross(Ca, Pa, Cb, Pb)
    IterateK-Means(Cnew_i, Pnew_i)

Cross(C1, P1, C2, P2) → (Cnew, Pnew)
  Cnew ← CombineCentroids(C1, C2)
  Pnew ← CombinePartitions(P1, P2)
  Cnew ← UpdateCentroids(Cnew, Pnew)
  RemoveEmptyClusters(Cnew, Pnew)
  IS(Cnew, Pnew)

CombineCentroids(C1, C2) → Cnew
  Cnew ← C1 ∪ C2

CombinePartitions(Cnew, P1, P2) → Pnew
  FOR i ← 1 TO N DO
    IF ||xi − c(p1i)||² < ||xi − c(p2i)||² THEN
      pnew_i ← p1i
    ELSE
      pnew_i ← p2i
  END-FOR

UpdateCentroids(Cnew, Pnew) → Cnew
  FOR j ← 1 TO |Cnew| DO
    cnew_j ← CalculateCentroid(Pnew, j)

Page 29: Genetic algorithms (GA) for clustering

PNN vs. IS crossovers

Further improvement of about 1%

[Figure: MSE (160–166) vs. number of iterations (0–50) on Bridge; curves for PNN crossover, PNN crossover + K-means, IS crossover, and IS crossover + K-means]

Page 30: Genetic algorithms (GA) for clustering

Optimized GAIS variants

GAIS short (optimized for speed):
- Create new generations only as long as the best solution keeps improving (T=*).
- Use a small population size (Z=10).
- Apply two iterations of k-means (G=2).

GAIS long (optimized for quality):
- Create a large number of generations (T=100).
- Use a large population size (Z=100).
- Iterate k-means relatively long (G=10).

Page 31: Genetic algorithms (GA) for clustering

Comparison of algorithms
(columns: image sets Bridge, House, Miss America; Birch data sets B1–B3; synthetic data sets S1–S4; time in seconds on Bridge)

Algorithm          Bridge   House  Miss Am.    B1     B2     B3     S1     S2     S3     S4    Time
Random             251.32   12.12    8.34   14.44  35.73   8.20  78.55  72.91  55.42  47.05      <1
K-means (aver.)    179.87    7.81    5.96    5.52   7.99   2.53  20.53  20.91  21.37  16.78       5
K-means (best)     176.95    7.35    5.93    5.13   6.87   2.16  13.23  16.07  18.96  15.71      50
SOM                173.63    7.59    5.92   13.50  10.03  15.18  20.11  13.28  21.10  15.71     376
FCM                178.39    7.79    6.22    5.02   5.29   2.48   8.92  13.28  16.89  15.71     166
Split              170.22    6.18    5.40    4.81   2.29   1.91   8.95  13.33  17.50  16.01      13
Split + k-means    165.77    6.06    5.28    4.64   2.28   1.91   8.92  13.28  16.92  15.77      17
RLS                164.64    5.96    5.28    4.64   2.28   1.86   8.92  13.28  16.89  15.71    1146
Split-n-Merge      163.81    5.98    5.19    4.64   2.28   1.93   8.92  13.28  16.91  15.75      85
SR (average)       162.45    6.02    5.27    4.84   3.39   1.99   9.52  13.68  17.31  15.80     213
SR (best)          161.96    5.98    5.25    4.76   3.12   1.98   8.93  13.28  16.89  15.71    2130
PNN                168.92    6.27    5.36    4.73   2.28   1.96   8.93  13.44  17.70  17.52     272
PNN + k-means      165.04    6.07    5.24    4.64   2.28   1.88   8.92  13.28  16.89  16.87     285
GKM – fast 10      164.12    5.94    5.34    4.64   2.28   1.92   8.92  13.28  16.89  15.71   91721
IS                 163.38    6.09    5.19    4.70   2.28   1.89   8.92  13.29  16.96  15.79     717
IS + k-means       162.38    6.02    5.17    4.64   2.28   1.86   8.92  13.28  16.89  15.71     719
GA (k-means)       174.91    6.61    5.54    6.58   5.96   2.45  11.66  15.99  19.22  16.14     654
GA (PNN)           162.37    5.92    5.17    4.98   2.28   1.98   8.92  13.28  16.89  15.71     404
SAGA               161.22    5.86    5.10    4.64   2.28   1.86   8.92  13.28  16.89  15.71   74554
GAIS (short)       161.59    5.92    5.11    4.64   2.28   1.86   8.92  13.28  16.89  15.72    1311
GAIS (long)        160.73    5.89    5.07    4.64   2.28   1.86   8.92  13.28  16.89  15.71  387533

Page 32: Genetic algorithms (GA) for clustering

Variation of the result

[Figure: frequency histogram of final MSE (160–190) over repeated runs; distributions for k-means, PNN, IS, IS + k-means, and GAIS]

Page 33: Genetic algorithms (GA) for clustering

Time vs. quality comparison

Bridge

[Figure: MSE (160–190) vs. time in seconds (log scale, 1–100000); repeated K-means, PNN, IS, RLS, GAIS, and SAGA]

Page 34: Genetic algorithms (GA) for clustering

Conclusions

• Best clustering obtained by GA.

• Crossover method most important.

• Mutations not needed.

Page 35: Genetic algorithms (GA) for clustering

References

1. P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, May 2006.

2. P. Fränti, "Genetic algorithm with deterministic crossover for vector quantization", Pattern Recognition Letters, 21 (1), 61-68, January 2000.

3. P. Fränti, J. Kivijärvi, T. Kaukoranta and O. Nevalainen, "Genetic algorithms for large scale clustering problems", The Computer Journal, 40 (9), 547-554, 1997.

4. J. Kivijärvi, P. Fränti and O. Nevalainen, "Self-adaptive genetic algorithm for clustering", Journal of Heuristics, 9 (2), 113-129, 2003.

5. J.S. Pan, F.R. McInnes and M.A. Jack, "VQ codebook design using genetic algorithms", Electronics Letters, 31, 1418-1419, August 1995.

6. P. Scheunders, "A genetic Lloyd-Max quantization algorithm", Pattern Recognition Letters, 17, 547-556, 1996.