Data clustering: Topics of Current Interest


Data clustering: Topics of Current Interest

Boris Mirkin 1,2

1 National Research University Higher School of Economics, Moscow, RF

2 Birkbeck University of London, UK

Supported by:
- “Teacher-Student” grants from the Research Fund of NRU HSE Moscow (2011-2013)
- International Lab for Decision Analysis and Choice, NRU HSE Moscow (2008 - present)
- Laboratory of Algorithms and Technologies for Networks Analysis, NRU HSE Nizhniy Novgorod, Russia (2010 - present)

Data clustering: Topics of Current Interest

1. K-Means clustering and two issues
   1.1. Finding the right number of clusters
        (a) Before clustering (anomalous clusters)
        (b) While clustering (divisive; no minima of the density function)
   1.2. Weighting features (3-step iterations)
2. K-Means at similarity clustering (kernel K-Means)
3. Semi-average similarity clustering
4. Consensus clustering
5. Spectral clustering, threshold clustering and modularity clustering
6. Laplacian pseudo-inverse transformation
7. Conclusion

Batch K-Means: a generic clustering method

Entities are presented as multidimensional points (*).
0. Put K hypothetical centroids (seeds).
1. Assign points to the centroids according to the minimum distance rule.
2. Put centroids at the gravity centres of the clusters thus obtained.
3. Iterate 1 and 2 until convergence.

[Figure: scatter of entity points (*) with K = 3 hypothetical centroids (@)]



K-Means: a generic clustering method (continued)

Iterating steps 1 and 2 moves the centroids towards the centres of their clusters; at convergence:
4. Output final centroids and clusters.

[Figure: final configuration of points (*) and centroids (@) at convergence]

A code sketch of the whole procedure follows below.
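For concreteness, here is a minimal sketch of the batch K-Means procedure just described, in Python/NumPy. The function name, the random choice of initial seeds and the parameter defaults are illustrative assumptions, not part of the presentation.

```python
import numpy as np

def batch_kmeans(Y, K, seed=0, max_iter=100):
    """A minimal batch K-Means sketch following steps 0-4 of the slide."""
    rng = np.random.default_rng(seed)
    # 0. Put K hypothetical centroids (seeds): here, K randomly chosen entities.
    centroids = Y[rng.choice(len(Y), size=K, replace=False)]
    for _ in range(max_iter):
        # 1. Minimum distance rule: assign every point to its nearest centroid.
        d2 = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # 2. Put centroids at the gravity centres of the clusters thus obtained.
        new_centroids = np.array([
            Y[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        # 3. Iterate 1 and 2 until convergence.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # 4. Output final centroids and clusters.
    return centroids, labels
```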

K-Means criterion: Summary distance to cluster centroids

Minimize

$$W(S, c) \;=\; \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{v=1}^{M} (y_{iv} - c_{kv})^2 \;=\; \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k)$$

over partitions S = {S_1, ..., S_K} and centroids c = (c_1, ..., c_K), where d is the squared Euclidean distance.

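A small helper, under illustrative naming, that evaluates this summary-distance criterion W(S, c) for a given labelling and set of centroids:

```python
import numpy as np

def kmeans_criterion(Y, labels, centroids):
    # W(S, c) = sum_k sum_{i in S_k} d(y_i, c_k), d = squared Euclidean distance
    return sum(((Y[labels == k] - c) ** 2).sum()
               for k, c in enumerate(centroids))
```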

Advantages of K-Means:
- Models typology building
- Simple "data recovery" criterion
- Computationally effective
- Can be utilised incrementally, 'on-line'

Shortcomings of K-Means:
- Initialisation: no advice on K or on the initial centroids
- Converges to local rather than deep minima of the criterion
- No defence against irrelevant features


Issue: How should the number and location of initial centers be chosen? (Mirkin 1998, Chiang and Mirkin 2010)

Minimize W(S, c) over S and c.

The data scatter (the sum of squared data entries) = W(S, c) + D(S, c).

The data scatter is constant over partitionings.

Equivalent criterion:

Maximize
$$D(S, c) = \sum_{k=1}^{K} N_k \langle c_k, c_k \rangle$$
where N_k is the number of entities in S_k and <c_k, c_k> is the squared Euclidean distance between 0 and c_k.
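Since the argument rests on this decomposition of the data scatter into W(S, c) + D(S, c), here is a short numeric check of the identity. It assumes the centroids are taken as the cluster means; the data, the partition and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 4))
Y -= Y.mean(axis=0)                              # centre the data: 0 is the grand mean
labels = np.repeat(np.arange(3), [70, 65, 65])   # any partition S into K = 3 clusters
centroids = np.array([Y[labels == k].mean(axis=0) for k in range(3)])

T = (Y ** 2).sum()                               # data scatter: sum of squared entries
W = sum(((Y[labels == k] - c) ** 2).sum() for k, c in enumerate(centroids))
D = sum((labels == k).sum() * (c @ c) for k, c in enumerate(centroids))
assert np.isclose(T, W + D)                      # T = W(S, c) + D(S, c)
```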


Issue: How should the number and location of initial centers be chosen? (2)

Maximize $D(S, c) = \sum_{k=1}^{K} N_k \langle c_k, c_k \rangle$, where N_k = |S_k|.

Preprocess the data by centering, so that 0 is the grand mean; <c_k, c_k> is then the squared Euclidean distance between 0 and c_k.

Look for anomalous and populated clusters, further away from the origin!


Issue: How should the number and location of initial centers be chosen? (3)

Preprocess the data by centering at a reference point, typically the grand mean; from then on, 0 stands for the grand mean. Build just one Anomalous cluster at a time.


Issue: How should the number and location of initial centers be chosen? (4)

Preprocess the data by centering at a reference point, typically the grand mean (so 0 is the grand mean). Build the Anomalous cluster as follows (a code sketch is given below):
1. Initial center: c is the entity farthest away from 0.
2. Cluster update: if d(y_i, c) < d(y_i, 0), assign y_i to S.
3. Centroid update: compute the within-S mean c'; if c' ≠ c, set c ← c' and go to 2. Otherwise, halt.
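A minimal sketch of this Anomalous Cluster routine, assuming the data have already been centred so that 0 is the reference point; the function name and the convergence test are illustrative.

```python
import numpy as np

def anomalous_cluster(Y):
    # 1. Initial center c: the entity farthest away from 0 (the reference point).
    c = Y[(Y ** 2).sum(axis=1).argmax()]
    while True:
        # 2. Cluster update: y_i joins S if it is closer to c than to 0.
        in_S = ((Y - c) ** 2).sum(axis=1) < (Y ** 2).sum(axis=1)
        # 3. Centroid update: within-S mean; halt when it stops changing.
        c_new = Y[in_S].mean(axis=0)
        if np.allclose(c_new, c):
            return in_S, c
        c = c_new
```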


Issue: How should the number and location of initial centers be chosen? (5)

The Anomalous Cluster procedure is (almost) K-Means, up to:
(i) the number of clusters is K = 2: the "anomalous" cluster and the "main body" of entities around 0;
(ii) the center of the "main body" cluster is forcibly kept at 0;
(iii) the entity farthest away from 0 initializes the anomalous cluster.


Issue: How should the number and location of initial centers be chosen? (6)

iK-Means with Anomalous Cluster initialization proved superior in the experiments of Chiang and Mirkin (2010); the methods compared (with their acronyms):
- Calinski and Harabasz index (CH)
- Hartigan rule (HK)
- Gap statistic (GS)
- Jump statistic (JS)
- Silhouette width (SW)
- Consensus distribution area (CD)
- Average distance between partitions (DD)
- Square error iK-Means (LS)
- Absolute error iK-Means (LM)

Issue: Weighting features according to relevance, and the Minkowski β-distance (Amorim, Mirkin 2012)

Criterion (to be minimized over s, c, w):
$$\sum_{k=1}^{K} \sum_{i \in I} \sum_{v=1}^{M} s_{ik}\, w_v^{\beta}\, |y_{iv} - c_{kv}|^{\beta} \;=\; \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k)$$
where s_ik = 1 if i ∈ S_k and 0 otherwise, d is the weighted Minkowski β-distance, and w: feature weights = scale factors.

3-step K-Means (a sketch of the weight-update step is given below):
- Given s, c, find w (weights)
- Given w, c, find s (clusters)
- Given s, w, find c (centroids)
- iterate till convergence

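The "given s, c, find w" step can be sketched as follows, assuming one global weight per feature and a Minkowski exponent β > 1; the closed form comes from minimizing the weighted criterion over w subject to the weights summing to one. The function name and the guard against zero dispersion are illustrative.

```python
import numpy as np

def update_weights(Y, labels, centroids, beta):
    """Weight-update step: w_v proportional to D_v ** (-1 / (beta - 1))."""
    K, M = centroids.shape
    # Per-feature dispersion D_v = sum_k sum_{i in S_k} |y_iv - c_kv| ** beta
    D = np.zeros(M)
    for k in range(K):
        D += (np.abs(Y[labels == k] - centroids[k]) ** beta).sum(axis=0)
    D = np.maximum(D, 1e-12)          # guard against a zero-dispersion feature
    # Minimizing sum_v w_v**beta * D_v subject to sum_v w_v = 1 (beta > 1)
    w = D ** (-1.0 / (beta - 1.0))
    return w / w.sum()
```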

Issue: Weighting features according to relevance, and the Minkowski β-distance (2)

Minkowski centers

• Minimize over c, for each feature v within cluster S_k:
$$d(c) = \sum_{i \in S_k} |y_{iv} - c|^{\beta}$$
• At β > 1, d(c) is convex
• Gradient method (a sketch is given below)
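A minimal sketch of such a gradient method for the Minkowski center of one feature within one cluster, assuming β > 1 so that d(c) is convex; the starting point, step size and iteration count are illustrative choices.

```python
import numpy as np

def minkowski_center(y, beta, lr=0.01, n_steps=500):
    """Minimize d(c) = sum_i |y_i - c| ** beta over c by gradient steps (beta > 1)."""
    c = np.median(y)                    # a reasonable starting point
    for _ in range(n_steps):
        # gradient of d(c) with respect to c
        grad = -beta * np.sum(np.sign(y - c) * np.abs(y - c) ** (beta - 1))
        c -= lr * grad / len(y)
    # the minimizer is the mean at beta = 2 and approaches the median as beta -> 1
    return c
```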

Issue: Weighting features according to relevance, and the Minkowski β-distance (3)

Minkowski metric effects

• The more uniform the distribution of the entities over a feature, the smaller its weight
• Uniform distribution: w = 0
• The best Minkowski power β is data dependent
• The best β can be learnt from the data in a semi-supervised manner (with clustering of all objects)
• Example: on Fisher's Iris data, iMWK-Means makes only 5 errors (a record)


K-Means kernelized 1

• K-Means: given a quantitative data matrix, find centers c_k and clusters S_k to minimize
$$W(S, c) = \sum_{k=1}^{K} \sum_{i \in S_k} \|x_i - c_k\|^2$$
• Girolami 2002: W(S, c) can be expressed through the inner products A(i, j) = <x_i, x_j> only, so the kernel trick applies: <x_i, x_j> → K(x_i, x_j)
• Mirkin 2012:
$$W(S, c) = \mathrm{Const} - \sum_{k=1}^{K} \frac{1}{|S_k|} \sum_{i, j \in S_k} A(i, j)$$

K-Means kernelized 2

• K-Means equivalent criterion: find a partition {S_1, ..., S_K} to maximize
$$G(S_1, \dots, S_K) = \sum_{k=1}^{K} \frac{1}{|S_k|} \sum_{i, j \in S_k} A(i, j) = \sum_{k=1}^{K} a(S_k)\, |S_k|$$
where a(S_k) is the within-cluster mean similarity.

• Mirkin (1976, 1996, 2012): build the partition {S_1, ..., S_K} by finding one cluster at a time.

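A short numeric check of this equivalence: with A(i, j) = <x_i, x_j>, the K-Means criterion W(S, c) equals trace(A) minus G(S_1, ..., S_K). The data, the partition and the names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
labels = np.repeat(np.arange(4), 25)          # some partition into K = 4 clusters
A = X @ X.T                                   # Gram (kernel) matrix A(i, j) = <x_i, x_j>

# G = sum_k (1/|S_k|) sum_{i,j in S_k} A(i, j) = sum_k a(S_k) |S_k|
G = sum(A[np.ix_(labels == k, labels == k)].sum() / (labels == k).sum()
        for k in range(4))

centroids = np.array([X[labels == k].mean(axis=0) for k in range(4)])
W = sum(((X[labels == k] - c) ** 2).sum() for k, c in enumerate(centroids))
assert np.isclose(W, np.trace(A) - G)         # here the "Const" is trace(A)
```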

K-Means kernelized 3

• K-Means equivalent criterion, one cluster S at a time: maximize
g(S) = a(S)|S|, where a(S) is the within-cluster mean similarity.

• AddRemAdd(i) algorithm: local search by adding/removing one entity at a time (a sketch of a simplified variant follows below).
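A minimal sketch of an AddRem-style local search for the semi-average criterion g(S) = a(S)|S| = Σ_{i,j∈S} A(i, j)/|S|. This simplified variant accepts the single best add-or-remove move at each step; the names and the brute-force recomputation of g are illustrative, and the talk's AddRemAdd(i) repeats such passes.

```python
import numpy as np

def g(A, S):
    """Semi-average criterion g(S) = sum_{i,j in S} A(i, j) / |S|."""
    idx = np.flatnonzero(S)
    return A[np.ix_(idx, idx)].sum() / len(idx) if len(idx) else 0.0

def add_rem(A, i):
    """Grow a cluster from seed entity i by best single add/remove moves."""
    n = A.shape[0]
    S = np.zeros(n, dtype=bool)
    S[i] = True
    current = g(A, S)
    while True:
        best_val, best_j = current, None
        for j in range(n):              # try flipping each entity in or out of S
            S[j] = not S[j]
            val = g(A, S)
            S[j] = not S[j]
            if val > best_val:
                best_val, best_j = val, j
        if best_j is None:              # no single move improves g(S): stop
            return S, current
        S[best_j] = not S[best_j]
        current = best_val
```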

K-Means kernelized 4

• Semi-average criterion: maximize g(S) = a(S)|S|, where a(S) is the within-cluster mean similarity, with AddRemAdd(i).

(1) Spectral: maximize the Rayleigh quotient
$$g(S) = \frac{s^T A s}{s^T s}$$
over cluster indicator vectors s.

(2) Tight: the average similarity of j to S is > a(S)/2 if j ∈ S, and < a(S)/2 if j ∉ S.

Three extensions to entire data set

• Partitional: take the set of all entities I.
  1. Compute S(i) = AddRem(i) for all i ∈ I.
  2. Take S = S(i*) for the i* maximizing f(S(i)) over all i ∈ I.
  3. Remove S from I; if I is not empty, go to 1; else halt.
• Additive: take the set of all entities I.
  1. Compute S(i) = AddRem(i) for all i ∈ I.
  2. Take S = S(i*) for the i* maximizing f(S(i)) over all i ∈ I.
  3. Subtract a(S) s s^T from A; if the stop condition does not hold, go to 1; else halt.
• Explorative: take the set of all entities I.
  1. Compute S(i) = AddRem(i) for all i ∈ I.
  2. Leave those S(i) that do not much overlap.


Consensus partition 1: Given partitions R1, R2, ..., Rn, find an “average” R

• Partition R = {R_1, R_2, ..., R_K} has incidence matrix Z = (z_ik): z_ik = 1 if i ∈ R_k; z_ik = 0, otherwise.

• Partition R = {R_1, R_2, ..., R_K} has projector matrix P = (p_ij): P = Z(Z^T Z)^{-1} Z^T.

• Criterion (Mirkin, Muchnik 1981 in Russian; Mirkin 2012): minimize over R
$$\sum_{m=1}^{n} \|Z_m - P\, Z_m\|^2$$
where Z_m is the incidence matrix of the given partition R_m.


Consensus partition 2: Given partitions R1,R2,…,Rn, find an “average” R

$$\sum_{m=1}^{n} \|Z_m - P\, Z_m\|^2 \;\to\; \min$$

This is equivalent to maximizing

$$G(R_1, \dots, R_K) = \sum_{k=1}^{K} \frac{1}{|R_k|} \sum_{i, j \in R_k} A(i, j) = \sum_{k=1}^{K} a(R_k)\, |R_k|$$

where A(i, j) is the consensus similarity: the number of the given partitions in which i and j fall in the same class.


Consensus partition 3: Given partitions R1,R2,…,Rn, find an “average” R

$$\sum_{m=1}^{n} \|Z_m - P\, Z_m\|^2 \;\to\; \min
\qquad\Longleftrightarrow\qquad
G(R_1, \dots, R_K) = \sum_{k=1}^{K} \frac{1}{|R_k|} \sum_{i, j \in R_k} A(i, j) \;\to\; \max$$

Mirkin, Shestakov (2013):
(1) This is superior to a number of contemporary consensus clustering approaches.
(2) Consensus clustering of the results of multiple runs of K-Means recovers clusters better than the best single K-Means run.

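A minimal sketch of the ingredients of this consensus construction, assuming A(i, j) is the co-association count, i.e. the number of given partitions placing i and j in the same class (A = Σ_m Z_m Z_m^T); the names are illustrative. In practice the label vectors would come from multiple K-Means runs, as in the comparison cited above.

```python
import numpy as np

def consensus_matrix(labelings):
    """Co-association matrix of several partitions given as label vectors."""
    labelings = np.asarray(labelings)        # shape: (n_partitions, n_entities)
    n = labelings.shape[1]
    A = np.zeros((n, n))
    for labels in labelings:
        Z = (labels[:, None] == np.unique(labels)[None, :]).astype(float)  # incidence matrix
        A += Z @ Z.T                         # adds 1 wherever i and j share a class
    return A

def consensus_criterion(A, labels):
    """G(R_1, ..., R_K) = sum_k (1/|R_k|) sum_{i,j in R_k} A(i, j), to be maximized."""
    return sum(A[np.ix_(labels == k, labels == k)].sum() / (labels == k).sum()
               for k in np.unique(labels))
```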

Additive clustering I

Given a similarity matrix A = (A(i, j)), find:
• clusters u^1 = (u_i^1), u^2 = (u_i^2), ..., u^K = (u_i^K), with u_i^k either 1 or 0 (crisp clusters) or 0 ≤ u_i^k ≤ 1 (fuzzy clusters);
• intensities λ_1, λ_2, ..., λ_K of the weighted clusters λ_1 u^1, λ_2 u^2, ..., λ_K u^K.

Additive model:
$$A(i, j) = \lambda_1 u_i^1 u_j^1 + \dots + \lambda_K u_i^K u_j^K + e_{ij}, \qquad \min \sum_{i, j} e_{ij}^2$$

Shepard, Arabie 1979 (presented 1973); Mirkin 1987 (1976 in Russian)


Additive clustering II

Given a similarity matrix A = (A(i, j)): iterative extraction.
Mirkin 1987 (1976 in Russian): double-greedy.

• OUTER LOOP: one cluster at a time; minimize
$$L(A, \lambda, u) = \sum_{i, j} \big(A(i, j) - \lambda\, u_i u_j\big)^2$$
1. Find a real λ (intensity) and a 1/0 binary u (membership) to (locally) minimize L(A, λ, u).
2. Take the cluster S = { i : u_i = 1 }.
3. Update A ← A − λ u u^T (subtraction of λ on S).
4. Reiterate till a stop condition.


Additive clustering III

Given a similarity matrix A = (A(i, j)): iterative extraction.
Mirkin 1987 (1976 in Russian): double-greedy.

• OUTER LOOP: one cluster at a time leads to the decomposition
$$T(A) = \lambda_1^2 |S_1|^2 + \lambda_2^2 |S_2|^2 + \dots + \lambda_K^2 |S_K|^2 + L \qquad (*)$$
where T(A) = Σ_{i,j} A(i, j)^2 is the similarity data scatter and λ_k^2 |S_k|^2 is the contribution of cluster k.
• Given S_k, the optimal intensity is λ_k = a(S_k), the within-cluster mean.
• Contribution: λ_k^2 |S_k|^2 = f(S_k)^2, the squared semi-average criterion f(S) = a(S)|S|.
• The additive extension of AddRem is applicable (a numeric check of (*) is given below).
• A similar double-greedy approach applies to fuzzy clustering: Mirkin, Nascimento 2012.

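A short numeric check of the decomposition (*): when each intensity is taken as the within-cluster mean a(S_k) of the current residual matrix, the contributions λ_k²|S_k|² and the scatter of the final residual add up to T(A). The similarity matrix and the two clusters below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
A = X @ X.T                                  # an arbitrary similarity matrix
residual, total, contributions = A.copy(), (A ** 2).sum(), []

for S in (np.arange(60) < 20, (np.arange(60) >= 20) & (np.arange(60) < 45)):
    u = S.astype(float)
    lam = residual[np.ix_(S, S)].mean()      # lambda_k = a(S_k), mean of the residual on S x S
    residual -= lam * np.outer(u, u)         # subtract lambda_k on S x S
    contributions.append(lam ** 2 * S.sum() ** 2)

# T(A) = sum_k lambda_k^2 |S_k|^2 + (scatter of the final residual), as in (*)
assert np.isclose(total, sum(contributions) + (residual ** 2).sum())
```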

Different criteria I

• Summary uniform (Mirkin 1976 in Russian): maximize the within-S sum of the similarities A(i, j) − π, where π is a similarity threshold. Relates to the criteria considered above.

• Summary modular (Newman 2004): maximize the within-S sum of A(i, j) − B(i, j), where B(i, j) = A(i, +) A(+, j) / A(+, +).


Different criteria II

• Normalized cut (Shi, Malik 2000): maximize A(S, S)/A(S, +) + A(S̄, S̄)/A(S̄, +), where S̄ is the complement of S and A(S, S), A(S, +) are summary similarities. Can be reformulated as minimizing a Rayleigh quotient f(S) of the Laplace transformation L(A) of A = (A(i, j)) over binary indicator vectors z.


FADDIS: Fuzzy Additive Spectral Clustering

• Spectral: B = pseudo-inverse Laplacian of A; one cluster at a time.
• Minimize ||B − λ² u u^T||² (one cluster to find); residual similarity B ← B − λ² u u^T.
• Stopping conditions.
• Equivalent: a Rayleigh quotient to maximize, max u^T B u / u^T u [follows from the model, in contrast to the very popular, yet purely heuristic, approach of Shi and Malik 2000].
• Experimentally demonstrated to be competitive over:
  - ordinary graphs for community detection
  - conventional (dis)similarity data
  - affinity data (kernel transformations of feature space data)
  - in-house synthetic data generators


Competitive at:

• Community detection in ordinary graphs
• Conventional similarity data
• Affinity similarity data
• Lapin-transformed similarity data: D = diag(B 1_N), L = I - D^{-1/2} B D^{-1/2}, L+ = pinv(L) (a sketch is given below)
• There are examples at which Lapin doesn't work

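A minimal sketch of the Lapin (Laplacian pseudo-inverse) transformation written out above, assuming B is a symmetric similarity matrix with positive row sums; the function name is illustrative.

```python
import numpy as np

def lapin(B):
    """Lapin transform: D = diag(B 1_N), L = I - D^{-1/2} B D^{-1/2}, return pinv(L)."""
    d = B.sum(axis=1)                                   # row sums: D = diag(B * 1_N)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(B)) - D_inv_sqrt @ B @ D_inv_sqrt    # normalized Laplacian
    return np.linalg.pinv(L)                            # Laplacian pseudo-inverse L+
```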

Example at which Lapin does work, but the plain square-error criterion does not

Conclusion

• Clustering is still far from a mathematical theory; however, it is getting meatier:
  + Gaussian kernels bring in distributions
  + the Laplacian transformation brings in dynamics

• To make it into a theory, there is a way to go:
  - modeling dynamics
  - compatibility across multiple data and metadata
  - interpretation
