Data clustering: Topics of Current Interest


Data clustering: Topics of Current Interest

Boris Mirkin 1,2

1 National Research University Higher School of Economics, Moscow, RF

2 Birkbeck University of London, UK

Supported by:
- “Teacher-Student” grants from the Research Fund of NRU HSE Moscow (2011-2013)
- International Lab for Decision Analysis and Choice, NRU HSE Moscow (2008 - present)
- Laboratory of Algorithms and Technologies for Networks Analysis, NRU HSE Nizhniy Novgorod, Russia (2010 - present)

Data clustering: Topics of Current Interest

1. K-Means clustering and two issues
   1.1. Finding the right number of clusters
        (a) Before clustering (anomalous clusters)
        (b) While clustering (divisive; no minima of the density function)
   1.2. Weighting features (3-step iterations)
2. K-Means at similarity clustering (kernel K-Means)
3. Semi-average similarity clustering
4. Consensus clustering
5. Spectral clustering, threshold clustering and modularity clustering
6. Laplacian pseudo-inverse transformation
7. Conclusion

Batch K-Means: a generic clustering method

Entities are presented as multidimensional points (*).
0. Put K hypothetical centroids (seeds).
1. Assign points to the centroids according to the minimum distance rule.
2. Put centroids at the gravity centres of the clusters thus obtained.
3. Iterate 1 and 2 until convergence.

[Figure: scatter of entity points (*) with K = 3 hypothetical centroids (@)]



K-Means: a generic clustering method (continued)

Iterating steps 1 and 2 moves the centroids towards the centres of their clusters; at convergence:
4. Output final centroids and clusters.

[Figure: final configuration of points (*) and centroids (@) at convergence]

A code sketch of the whole procedure follows below.
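For concreteness, here is a minimal sketch of the batch K-Means procedure just described, in Python/NumPy. The function name, the random choice of initial seeds and the parameter defaults are illustrative assumptions, not part of the presentation.

```python
import numpy as np

def batch_kmeans(Y, K, seed=0, max_iter=100):
    """A minimal batch K-Means sketch following steps 0-4 of the slide."""
    rng = np.random.default_rng(seed)
    # 0. Put K hypothetical centroids (seeds): here, K randomly chosen entities.
    centroids = Y[rng.choice(len(Y), size=K, replace=False)]
    for _ in range(max_iter):
        # 1. Minimum distance rule: assign every point to its nearest centroid.
        d2 = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # 2. Put centroids at the gravity centres of the clusters thus obtained.
        new_centroids = np.array([
            Y[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        # 3. Iterate 1 and 2 until convergence.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # 4. Output final centroids and clusters.
    return centroids, labels
```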

K-Means criterion: Summary distance to cluster centroids

Minimize

$$W(S, c) \;=\; \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{v=1}^{M} (y_{iv} - c_{kv})^2 \;=\; \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k)$$

over partitions S = {S_1, ..., S_K} and centroids c = (c_1, ..., c_K), where d is the squared Euclidean distance.

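A small helper, under illustrative naming, that evaluates this summary-distance criterion W(S, c) for a given labelling and set of centroids:

```python
import numpy as np

def kmeans_criterion(Y, labels, centroids):
    # W(S, c) = sum_k sum_{i in S_k} d(y_i, c_k), d = squared Euclidean distance
    return sum(((Y[labels == k] - c) ** 2).sum()
               for k, c in enumerate(centroids))
```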

Advantages of K-Means:
- Models typology building
- Simple "data recovery" criterion
- Computationally effective
- Can be utilised incrementally, 'on-line'

Shortcomings of K-Means:
- Initialisation: no advice on K or on the initial centroids
- Converges to local rather than deep minima of the criterion
- No defence against irrelevant features


Issue: How should the number and location of initial centers be chosen? (Mirkin 1998, Chiang and Mirkin 2010)

Minimize W(S, c) over S and c.

The data scatter (the sum of squared data entries) = W(S, c) + D(S, c).

The data scatter is constant over partitionings.

Equivalent criterion:

Maximize
$$D(S, c) = \sum_{k=1}^{K} N_k \langle c_k, c_k \rangle$$
where N_k is the number of entities in S_k and <c_k, c_k> is the squared Euclidean distance between 0 and c_k.
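Since the argument rests on this decomposition of the data scatter into W(S, c) + D(S, c), here is a short numeric check of the identity. It assumes the centroids are taken as the cluster means; the data, the partition and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 4))
Y -= Y.mean(axis=0)                              # centre the data: 0 is the grand mean
labels = np.repeat(np.arange(3), [70, 65, 65])   # any partition S into K = 3 clusters
centroids = np.array([Y[labels == k].mean(axis=0) for k in range(3)])

T = (Y ** 2).sum()                               # data scatter: sum of squared entries
W = sum(((Y[labels == k] - c) ** 2).sum() for k, c in enumerate(centroids))
D = sum((labels == k).sum() * (c @ c) for k, c in enumerate(centroids))
assert np.isclose(T, W + D)                      # T = W(S, c) + D(S, c)
```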


Issue: How should the number and location of initial centers be chosen? (2)

Maximize $D(S, c) = \sum_{k=1}^{K} N_k \langle c_k, c_k \rangle$, where N_k = |S_k|.

Preprocess the data by centering, so that 0 is the grand mean; <c_k, c_k> is then the squared Euclidean distance between 0 and c_k.

Look for anomalous and populated clusters, further away from the origin!


Issue: How should the number and location of initial centers be chosen? (3)

Preprocess the data by centering at a reference point, typically the grand mean; from then on, 0 stands for the grand mean. Build just one Anomalous cluster at a time.


Issue: How should the number and location of initial centers be chosen? (4)

Preprocess the data by centering at a reference point, typically the grand mean (so 0 is the grand mean). Build the Anomalous cluster as follows (a code sketch is given below):
1. Initial center: c is the entity farthest away from 0.
2. Cluster update: if d(y_i, c) < d(y_i, 0), assign y_i to S.
3. Centroid update: compute the within-S mean c'; if c' ≠ c, set c ← c' and go to 2. Otherwise, halt.
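A minimal sketch of this Anomalous Cluster routine, assuming the data have already been centred so that 0 is the reference point; the function name and the convergence test are illustrative.

```python
import numpy as np

def anomalous_cluster(Y):
    # 1. Initial center c: the entity farthest away from 0 (the reference point).
    c = Y[(Y ** 2).sum(axis=1).argmax()]
    while True:
        # 2. Cluster update: y_i joins S if it is closer to c than to 0.
        in_S = ((Y - c) ** 2).sum(axis=1) < (Y ** 2).sum(axis=1)
        # 3. Centroid update: within-S mean; halt when it stops changing.
        c_new = Y[in_S].mean(axis=0)
        if np.allclose(c_new, c):
            return in_S, c
        c = c_new
```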


Issue: How should the number and location of initial centers be chosen? (5)

The Anomalous Cluster procedure is (almost) K-Means, up to:
(i) the number of clusters is K = 2: the "anomalous" cluster and the "main body" of entities around 0;
(ii) the center of the "main body" cluster is forcibly kept at 0;
(iii) the entity farthest away from 0 initializes the anomalous cluster.


Issue: How should the number and location of initial centers be chosen? (6)

iK-Means with Anomalous Cluster initialization proved superior in the experiments of Chiang and Mirkin (2010); the methods compared (with their acronyms):
- Calinski and Harabasz index (CH)
- Hartigan rule (HK)
- Gap statistic (GS)
- Jump statistic (JS)
- Silhouette width (SW)
- Consensus distribution area (CD)
- Average distance between partitions (DD)
- Square error iK-Means (LS)
- Absolute error iK-Means (LM)

Issue: Weighting features according to relevance, and the Minkowski β-distance (Amorim, Mirkin 2012)

Criterion (to be minimized over s, c, w):
$$\sum_{k=1}^{K} \sum_{i \in I} \sum_{v=1}^{M} s_{ik}\, w_v^{\beta}\, |y_{iv} - c_{kv}|^{\beta} \;=\; \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k)$$
where s_ik = 1 if i ∈ S_k and 0 otherwise, d is the weighted Minkowski β-distance, and w: feature weights = scale factors.

3-step K-Means (a sketch of the weight-update step is given below):
- Given s, c, find w (weights)
- Given w, c, find s (clusters)
- Given s, w, find c (centroids)
- iterate till convergence

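The "given s, c, find w" step can be sketched as follows, assuming one global weight per feature and a Minkowski exponent β > 1; the closed form comes from minimizing the weighted criterion over w subject to the weights summing to one. The function name and the guard against zero dispersion are illustrative.

```python
import numpy as np

def update_weights(Y, labels, centroids, beta):
    """Weight-update step: w_v proportional to D_v ** (-1 / (beta - 1))."""
    K, M = centroids.shape
    # Per-feature dispersion D_v = sum_k sum_{i in S_k} |y_iv - c_kv| ** beta
    D = np.zeros(M)
    for k in range(K):
        D += (np.abs(Y[labels == k] - centroids[k]) ** beta).sum(axis=0)
    D = np.maximum(D, 1e-12)          # guard against a zero-dispersion feature
    # Minimizing sum_v w_v**beta * D_v subject to sum_v w_v = 1 (beta > 1)
    w = D ** (-1.0 / (beta - 1.0))
    return w / w.sum()
```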

Issue: Weighting features according to relevance, and the Minkowski β-distance (2)

Minkowski centers

• Minimize over c, for each feature v within cluster S_k:
$$d(c) = \sum_{i \in S_k} |y_{iv} - c|^{\beta}$$
• At β > 1, d(c) is convex
• Gradient method (a sketch is given below)
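A minimal sketch of such a gradient method for the Minkowski center of one feature within one cluster, assuming β > 1 so that d(c) is convex; the starting point, step size and iteration count are illustrative choices.

```python
import numpy as np

def minkowski_center(y, beta, lr=0.01, n_steps=500):
    """Minimize d(c) = sum_i |y_i - c| ** beta over c by gradient steps (beta > 1)."""
    c = np.median(y)                    # a reasonable starting point
    for _ in range(n_steps):
        # gradient of d(c) with respect to c
        grad = -beta * np.sum(np.sign(y - c) * np.abs(y - c) ** (beta - 1))
        c -= lr * grad / len(y)
    # the minimizer is the mean at beta = 2 and approaches the median as beta -> 1
    return c
```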

Issue: Weighting features according to relevance, and the Minkowski β-distance (3)

Minkowski metric effects

• The more uniform the distribution of the entities over a feature, the smaller its weight
• Uniform distribution: w = 0
• The best Minkowski power β is data dependent
• The best β can be learnt from the data in a semi-supervised manner (with clustering of all objects)
• Example: on Fisher's Iris data, iMWK-Means makes only 5 errors (a record)


K-Means kernelized 1

• K-Means: given a quantitative data matrix, find centers c_k and clusters S_k to minimize
$$W(S, c) = \sum_{k=1}^{K} \sum_{i \in S_k} \|x_i - c_k\|^2$$
• Girolami 2002: W(S, c) can be expressed through the inner products A(i, j) = <x_i, x_j> only, so the kernel trick applies: <x_i, x_j> → K(x_i, x_j)
• Mirkin 2012:
$$W(S, c) = \mathrm{Const} - \sum_{k=1}^{K} \frac{1}{|S_k|} \sum_{i, j \in S_k} A(i, j)$$

K-Means kernelized 2

• K-Means equivalent criterion: find a partition {S_1, ..., S_K} to maximize
$$G(S_1, \dots, S_K) = \sum_{k=1}^{K} \frac{1}{|S_k|} \sum_{i, j \in S_k} A(i, j) = \sum_{k=1}^{K} a(S_k)\, |S_k|$$
where a(S_k) is the within-cluster mean similarity.

• Mirkin (1976, 1996, 2012): build the partition {S_1, ..., S_K} by finding one cluster at a time.

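A short numeric check of this equivalence: with A(i, j) = <x_i, x_j>, the K-Means criterion W(S, c) equals trace(A) minus G(S_1, ..., S_K). The data, the partition and the names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
labels = np.repeat(np.arange(4), 25)          # some partition into K = 4 clusters
A = X @ X.T                                   # Gram (kernel) matrix A(i, j) = <x_i, x_j>

# G = sum_k (1/|S_k|) sum_{i,j in S_k} A(i, j) = sum_k a(S_k) |S_k|
G = sum(A[np.ix_(labels == k, labels == k)].sum() / (labels == k).sum()
        for k in range(4))

centroids = np.array([X[labels == k].mean(axis=0) for k in range(4)])
W = sum(((X[labels == k] - c) ** 2).sum() for k, c in enumerate(centroids))
assert np.isclose(W, np.trace(A) - G)         # here the "Const" is trace(A)
```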

K-Means kernelized 3

• K-Means equivalent criterion, one cluster S at a time: maximize
g(S) = a(S)|S|, where a(S) is the within-cluster mean similarity.

• AddRemAdd(i) algorithm: local search by adding/removing one entity at a time (a sketch of a simplified variant follows below).
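A minimal sketch of an AddRem-style local search for the semi-average criterion g(S) = a(S)|S| = Σ_{i,j∈S} A(i, j)/|S|. This simplified variant accepts the single best add-or-remove move at each step; the names and the brute-force recomputation of g are illustrative, and the talk's AddRemAdd(i) repeats such passes.

```python
import numpy as np

def g(A, S):
    """Semi-average criterion g(S) = sum_{i,j in S} A(i, j) / |S|."""
    idx = np.flatnonzero(S)
    return A[np.ix_(idx, idx)].sum() / len(idx) if len(idx) else 0.0

def add_rem(A, i):
    """Grow a cluster from seed entity i by best single add/remove moves."""
    n = A.shape[0]
    S = np.zeros(n, dtype=bool)
    S[i] = True
    current = g(A, S)
    while True:
        best_val, best_j = current, None
        for j in range(n):              # try flipping each entity in or out of S
            S[j] = not S[j]
            val = g(A, S)
            S[j] = not S[j]
            if val > best_val:
                best_val, best_j = val, j
        if best_j is None:              # no single move improves g(S): stop
            return S, current
        S[best_j] = not S[best_j]
        current = best_val
```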

K-Means kernelized 4

• Semi-average criterion: maximize g(S) = a(S)|S|, where a(S) is the within-cluster mean similarity, with AddRemAdd(i).

(1) Spectral: maximize the Rayleigh quotient
$$g(S) = \frac{s^T A s}{s^T s}$$
over cluster indicator vectors s.

(2) Tight: the average similarity of j to S is > a(S)/2 if j ∈ S, and < a(S)/2 if j ∉ S.

Three extensions to entire data set

• Partitional: take the set of all entities I.
  1. Compute S(i) = AddRem(i) for all i ∈ I.
  2. Take S = S(i*) for the i* maximizing f(S(i)) over all i ∈ I.
  3. Remove S from I; if I is not empty, go to 1; else halt.
• Additive: take the set of all entities I.
  1. Compute S(i) = AddRem(i) for all i ∈ I.
  2. Take S = S(i*) for the i* maximizing f(S(i)) over all i ∈ I.
  3. Subtract a(S) s s^T from A; if the stop condition does not hold, go to 1; else halt.
• Explorative: take the set of all entities I.
  1. Compute S(i) = AddRem(i) for all i ∈ I.
  2. Leave those S(i) that do not much overlap.


Consensus partition 1: Given partitions R1, R2, ..., Rn, find an “average” R

• Partition R = {R_1, R_2, ..., R_K} has incidence matrix Z = (z_ik): z_ik = 1 if i ∈ R_k; z_ik = 0, otherwise.

• Partition R = {R_1, R_2, ..., R_K} has projector matrix P = (p_ij): P = Z(Z^T Z)^{-1} Z^T.

• Criterion (Mirkin, Muchnik 1981 in Russian; Mirkin 2012): minimize over R
$$\sum_{m=1}^{n} \|Z_m - P\, Z_m\|^2$$
where Z_m is the incidence matrix of the given partition R_m.


Consensus partition 2: Given partitions R1,R2,…,Rn, find an “average” R

$$\sum_{m=1}^{n} \|Z_m - P\, Z_m\|^2 \;\to\; \min$$

This is equivalent to maximizing

$$G(R_1, \dots, R_K) = \sum_{k=1}^{K} \frac{1}{|R_k|} \sum_{i, j \in R_k} A(i, j) = \sum_{k=1}^{K} a(R_k)\, |R_k|$$

where A(i, j) is the consensus similarity: the number of the given partitions in which i and j fall in the same class.


Consensus partition 3: Given partitions R1,R2,…,Rn, find an “average” R

$$\sum_{m=1}^{n} \|Z_m - P\, Z_m\|^2 \;\to\; \min
\qquad\Longleftrightarrow\qquad
G(R_1, \dots, R_K) = \sum_{k=1}^{K} \frac{1}{|R_k|} \sum_{i, j \in R_k} A(i, j) \;\to\; \max$$

Mirkin, Shestakov (2013):
(1) This is superior to a number of contemporary consensus clustering approaches.
(2) Consensus clustering of the results of multiple runs of K-Means recovers clusters better than the best single K-Means run.

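A minimal sketch of the ingredients of this consensus construction, assuming A(i, j) is the co-association count, i.e. the number of given partitions placing i and j in the same class (A = Σ_m Z_m Z_m^T); the names are illustrative. In practice the label vectors would come from multiple K-Means runs, as in the comparison cited above.

```python
import numpy as np

def consensus_matrix(labelings):
    """Co-association matrix of several partitions given as label vectors."""
    labelings = np.asarray(labelings)        # shape: (n_partitions, n_entities)
    n = labelings.shape[1]
    A = np.zeros((n, n))
    for labels in labelings:
        Z = (labels[:, None] == np.unique(labels)[None, :]).astype(float)  # incidence matrix
        A += Z @ Z.T                         # adds 1 wherever i and j share a class
    return A

def consensus_criterion(A, labels):
    """G(R_1, ..., R_K) = sum_k (1/|R_k|) sum_{i,j in R_k} A(i, j), to be maximized."""
    return sum(A[np.ix_(labels == k, labels == k)].sum() / (labels == k).sum()
               for k in np.unique(labels))
```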

Additive clustering I

Given a similarity matrix A = (A(i, j)), find:
• clusters u^1 = (u_i^1), u^2 = (u_i^2), ..., u^K = (u_i^K), with u_i^k either 1 or 0 (crisp clusters) or 0 ≤ u_i^k ≤ 1 (fuzzy clusters);
• intensities λ_1, λ_2, ..., λ_K of the weighted clusters λ_1 u^1, λ_2 u^2, ..., λ_K u^K.

Additive model:
$$A(i, j) = \lambda_1 u_i^1 u_j^1 + \dots + \lambda_K u_i^K u_j^K + e_{ij}, \qquad \min \sum_{i, j} e_{ij}^2$$

Shepard, Arabie 1979 (presented 1973); Mirkin 1987 (1976 in Russian)


Additive clustering II

Given a similarity matrix A = (A(i, j)): iterative extraction.
Mirkin 1987 (1976 in Russian): double-greedy.

• OUTER LOOP: one cluster at a time; minimize
$$L(A, \lambda, u) = \sum_{i, j} \big(A(i, j) - \lambda\, u_i u_j\big)^2$$
1. Find a real λ (intensity) and a 1/0 binary u (membership) to (locally) minimize L(A, λ, u).
2. Take the cluster S = { i : u_i = 1 }.
3. Update A ← A − λ u u^T (subtraction of λ on S).
4. Reiterate till a stop condition.


Additive clustering III

Given a similarity matrix A = (A(i, j)): iterative extraction.
Mirkin 1987 (1976 in Russian): double-greedy.

• OUTER LOOP: one cluster at a time leads to the decomposition
$$T(A) = \lambda_1^2 |S_1|^2 + \lambda_2^2 |S_2|^2 + \dots + \lambda_K^2 |S_K|^2 + L \qquad (*)$$
where T(A) = Σ_{i,j} A(i, j)^2 is the similarity data scatter and λ_k^2 |S_k|^2 is the contribution of cluster k.
• Given S_k, the optimal intensity is λ_k = a(S_k), the within-cluster mean.
• Contribution: λ_k^2 |S_k|^2 = f(S_k)^2, the squared semi-average criterion f(S) = a(S)|S|.
• The additive extension of AddRem is applicable (a numeric check of (*) is given below).
• A similar double-greedy approach applies to fuzzy clustering: Mirkin, Nascimento 2012.

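A short numeric check of the decomposition (*): when each intensity is taken as the within-cluster mean a(S_k) of the current residual matrix, the contributions λ_k²|S_k|² and the scatter of the final residual add up to T(A). The similarity matrix and the two clusters below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
A = X @ X.T                                  # an arbitrary similarity matrix
residual, total, contributions = A.copy(), (A ** 2).sum(), []

for S in (np.arange(60) < 20, (np.arange(60) >= 20) & (np.arange(60) < 45)):
    u = S.astype(float)
    lam = residual[np.ix_(S, S)].mean()      # lambda_k = a(S_k), mean of the residual on S x S
    residual -= lam * np.outer(u, u)         # subtract lambda_k on S x S
    contributions.append(lam ** 2 * S.sum() ** 2)

# T(A) = sum_k lambda_k^2 |S_k|^2 + (scatter of the final residual), as in (*)
assert np.isclose(total, sum(contributions) + (residual ** 2).sum())
```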

Different criteria I

• Summary uniform (Mirkin 1976 in Russian): maximize the within-S sum of the similarities A(i, j) − π, where π is a similarity threshold. Relates to the criteria considered above.

• Summary modular (Newman 2004): maximize the within-S sum of A(i, j) − B(i, j), where B(i, j) = A(i, +) A(+, j) / A(+, +).


Different criteria II

• Normalized cut (Shi, Malik 2000): maximize A(S, S)/A(S, +) + A(S̄, S̄)/A(S̄, +), where S̄ is the complement of S and A(S, S), A(S, +) are summary similarities. Can be reformulated as minimizing a Rayleigh quotient f(S) of the Laplace transformation L(A) of A = (A(i, j)) over binary indicator vectors z.


FADDIS: Fuzzy Additive Spectral Clustering

• Spectral: B = pseudo-inverse Laplacian of A; one cluster at a time.
• Minimize ||B − λ² u u^T||² (one cluster to find); residual similarity B ← B − λ² u u^T.
• Stopping conditions.
• Equivalent: a Rayleigh quotient to maximize, max u^T B u / u^T u [follows from the model, in contrast to the very popular, yet purely heuristic, approach of Shi and Malik 2000].
• Experimentally demonstrated to be competitive over:
  - ordinary graphs for community detection
  - conventional (dis)similarity data
  - affinity data (kernel transformations of feature space data)
  - in-house synthetic data generators


Competitive at:

• Community detection in ordinary graphs
• Conventional similarity data
• Affinity similarity data
• Lapin-transformed similarity data: D = diag(B 1_N), L = I - D^{-1/2} B D^{-1/2}, L+ = pinv(L) (a sketch is given below)
• There are examples at which Lapin doesn't work

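A minimal sketch of the Lapin (Laplacian pseudo-inverse) transformation written out above, assuming B is a symmetric similarity matrix with positive row sums; the function name is illustrative.

```python
import numpy as np

def lapin(B):
    """Lapin transform: D = diag(B 1_N), L = I - D^{-1/2} B D^{-1/2}, return pinv(L)."""
    d = B.sum(axis=1)                                   # row sums: D = diag(B * 1_N)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(B)) - D_inv_sqrt @ B @ D_inv_sqrt    # normalized Laplacian
    return np.linalg.pinv(L)                            # Laplacian pseudo-inverse L+
```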

Example at which Lapin does work, but the plain square-error criterion does not

Conclusion

• Clustering is still far from a mathematical theory; however, it is getting meatier:
  + Gaussian kernels bring in distributions
  + the Laplacian transformation brings in dynamics

• To make it into a theory, there is a way to go:
  - modeling dynamics
  - compatibility across multiple data and metadata
  - interpretation
