22
Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup Yufei Ding, Yue Zhao, Xipeng Shen Madanlal Musuvathi, Todd Mytkowicz North Carolina State University, USA Microsoft Research at Redmond, USA K-Means

Yinyang K-Means: A Drop-In Replacement of the Classic K ... › f4b9 › ab75539ffddad031134c596… · Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Yinyang K-Means: A Drop-In Replacement of the Classic K ... › f4b9 › ab75539ffddad031134c596… · Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent

Yinyang K-Means: A Drop-In Replacement of the Classic K-Means

with Consistent Speedup

Yufei Ding, Yue Zhao, Xipeng Shen

Madanlal Musuvathi, Todd Mytkowicz

North Carolina State University, USA Microsoft Research at Redmond, USA

K-Means

Page 2: Yinyang K-Means: A Drop-In Replacement of the Classic K ... › f4b9 › ab75539ffddad031134c596… · Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent

Popular Clustering Algorithm: K-Means

2

One  of  the  most  popular  algorithms  for  clustering.  

Decades  of  usage  in  various  domains    since  proposed  by  Lloyd  in  1957.

Page 3: Yinyang K-Means: A Drop-In Replacement of the Classic K ... › f4b9 › ab75539ffddad031134c596… · Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent

Classic Algorithm [by Lloyd]

3

Update  centers  w/  new  centroids

Assign  points  to  clusters  based  on  d(p,  c)  ∀p,c

Convergence

Set  initial  centersGroup  N  points  into  K  clusters

Each assignment step calculates N*K distances.cause  slow  clustering  for  large  problems

Page 4: Yinyang K-Means: A Drop-In Replacement of the Classic K ... › f4b9 › ab75539ffddad031134c596… · Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent

4

Group  N  points  into  K  clusters

Each assignment step calculates N*K distances.cause  slow  clustering  for  large  problems

Are all distance calculations necessary?

If  c  is  not  the  closest  center  to  p,  it’s  safe  to  skip  computing  d(p,  c).

Page 5: Yinyang K-Means: A Drop-In Replacement of the Classic K ... › f4b9 › ab75539ffddad031134c596… · Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent

Prior Work

5

• K-D Tree [Kanungo et al., 2002]

• Slowdowns for high dimensions

• Incremental Optimization

• [Elkan, 2003; Drake & Hamerly, 2012; Hamerly, 2010]

• Use triangle inequality to filter out the unnecessary

• Many times of memory space cost

• Slowdowns in some cases (medium dim, large K & N)

• Approximation [Wang et al., 2012]

• Different results

• Unable to inherit the level of trust

Lloyd’s  K-­‐Means  remains  the  dominant  choice  in  practice!

Page 6: Yinyang K-Means: A Drop-In Replacement of the Classic K ... › f4b9 › ab75539ffddad031134c596… · Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent

Goal of This Work

6

• Speed: Much faster in all settings (N, K, D)

• Trust: Same results as Lloyd’s K-Means produces

• Simplicity: Easy to implement and deploy

Develop  a  drop-­‐in  replacement  of  Lloyd’s  K-­‐Means.

Page 7: Yinyang K-Means: A Drop-In Replacement of the Classic K ... › f4b9 › ab75539ffddad031134c596… · Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent

Overview of Yinyang K-Means

7

• Carefully  maintain  and  leverage  lower  bounds  &  upper  bounds

• A  3-­‐level  filter

Minimize  #  of  distance  calculations:

YangYin

• Space-­‐conscious  elastic  designA  harmony  w/  contrary  forces

On average, 9.36X faster than classic K-Means. No slowdown regardless of N, K, dim.

Page 8: Yinyang K-Means: A Drop-In Replacement of the Classic K ... › f4b9 › ab75539ffddad031134c596… · Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent

Triangular Inequality

8

• Fundamental tool for getting bounds

cx

c’

d(x,c’) - d(c’,c) ≤ d(x,c) ≤ d(x,c’) + d(c’,c)

x: data point C’: center in iteration i C:center in iteration i+1

lower  bound:  lb(x,  c) upper  bound:  ub(x,  c)

Page 9: Yinyang K-Means: A Drop-In Replacement of the Classic K ... › f4b9 › ab75539ffddad031134c596… · Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent

Assignment Step

9

• Three levels of filters

• Overhead V.S. Benefits

Global  filterGroup  filter

Local  filter

Page 10: Yinyang K-Means: A Drop-In Replacement of the Classic K ... › f4b9 › ab75539ffddad031134c596… · Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent

10

Global  Filtering• One single check to decide whether a point is possible to

change its center assignment.

10

c1

c’1 c2

c’2

c’3

c3

c4

c’4

x

G

G’ = {C’1,C’2, …, C’k} - C’1

G = {C1,C2, …, Ck} - C1

Prior  iter:  

Question: Is c1 the closest center for x in this iteration?

This  iter:  

Page 11: Yinyang K-Means: A Drop-In Replacement of the Classic K ... › f4b9 › ab75539ffddad031134c596… · Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent

11

Global  Filtering

11

c1

c’1 c2

c’2

c’3

c3

c4

c’4

x

G

G’ = {C’1,C’2, …, C’k} - C’1

G = {C1,C2, …, Ck} - C1

Prior  iter:  

This  iter:  

• Compute ub(x,c1) and lb(x, G) ub(x,c1) = ub(x, c’1) + 𝛥c1

lb(x, G) = lb(x,G’) - maxdelta

• 𝛥c1 = d(c,c’) maxdelta = Max (𝛥(c)) ∀c

Principle1: if ub(x,c1) ≤ lb(x, G);

then x does not change assignment.

• One single check to decide whether a point is possible to change its center assignment.

Page 12: Yinyang K-Means: A Drop-In Replacement of the Classic K ... › f4b9 › ab75539ffddad031134c596… · Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent

Orig  time:  O(n•k•d)  

12

Global  Filtering

12

Overhead:Space Cost: O(n) to store boundsTime Cost: O(k•d + n) to compute center shifts and bounds

ub(x,c1)  =  ub(x,  c’1)  +  𝛥c1    lb(x,  G)  ≤  lb(x,G’)  -­‐  maxdelta

Benefits: ~60% redundant distance computation can be removed• Limiting factor: maxdelta = Max (𝛥(c)) ∀c

Orig  space:  O(n•d)  

Page 13: Yinyang K-Means: A Drop-In Replacement of the Classic K ... › f4b9 › ab75539ffddad031134c596… · Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent

13

Group  Filtering

13

• Rules to compute ub(x,c1) and lb(x, G) ub(x,c1) = ub(x, c’1) + 𝛥c1

lb(x, Gi) ≤ lb(x,G’i) - maxdelta(Gi)

• 𝛥c1 = d(c1,c’1) maxdelta(Gi) = Max (𝛥(c)) ∀c ∊ Gi

• Group filtering:

Principle1I: if ub(x,c1) ≤ lb(x, Gi);

then no center in Gi can be closest to x.

Grouping centers into m

groups: {G1,G2, …, Gm}

c’3

c3G2

c1

c’1c5

c’5

c4

c’4

x

G1

c2 c2

G3

Page 14: Yinyang K-Means: A Drop-In Replacement of the Classic K ... › f4b9 › ab75539ffddad031134c596… · Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent

14

Group  Filtering

14

• How to do the grouping?

Partial K-Means on initial centers (one time preprocessing)

Overhead: O(k•m•d•iter) (we set iter to 5)

c’3

c3G2

c1

c’1c5

c’5

c4

c’4

x

G1

c2 c2

G3

Choosing  m—Elasticity:  •        m  =  k/10          if  space  allows  •        max  value      otherwise

Page 15: Yinyang K-Means: A Drop-In Replacement of the Classic K ... › f4b9 › ab75539ffddad031134c596… · Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent

Orig  time:  O(n•k•d)  

15

Group  Filtering

15

• Efficiency and overhead

Overhead:Space Cost: O(n•m) to maintain bounds across iterationTime Cost: O(k•d + n•m) to compute center shifts and bounds

ub(x,c1)  =  ub(x,  c’1)  +  𝛥(c1)      lb(x,  Gi)  ≤  lb(x,Gi)  -­‐  maxdelta(Gi)

Efficiency: ~ 80% redundant distance computation can be removed

Orig  space:  O(n•d)  

Page 16: Yinyang K-Means: A Drop-In Replacement of the Classic K ... › f4b9 › ab75539ffddad031134c596… · Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent

16

Local  Filtering

16

c’3

c3G2

c1

c’1c5

c’5

c4

c’4

x

G1

c2 c2

G3

Check each remaining center, and skip c if min2(x) ≤ lb(x, Gi) - 𝛥cj

min2(x): the so-far 2nd shortest distance from x.

No  extra  space  cost!

Page 17: Yinyang K-Means: A Drop-In Replacement of the Classic K ... › f4b9 › ab75539ffddad031134c596… · Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent

Update Step

17

Assign  points  to  clusters  based  on  d(p,  c)  ∀p,c

Update  centers  w/  new  centroids

Convergence

Set  initial  centers

See  paper.

Page 18: Yinyang K-Means: A Drop-In Replacement of the Classic K ... › f4b9 › ab75539ffddad031134c596… · Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent

18

Evaluation• Compared to three other methods:

• Classic (Lloyd’s) K-Means

• Elkan’s K-Means [2003]

• Drake’s K-Means [2012]

• Input: real-world data sets (with different N, K, Dim)

• 4 from UCI machine learning repository [Bache Lichman, 2013]

• 4 other commonly used image data sets [Wang et al., 2012].

• Implemented in GraphLab (a map-reduce framework)

• Two machines

• 16GB memory, 8-core i7-3770K processor

• 4GB memory, 4-core Core2 CPU

Page 19: Yinyang K-Means: A Drop-In Replacement of the Classic K ... › f4b9 › ab75539ffddad031134c596… · Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent

19

Baseline:  Classic  K-­‐means(16GB,  8-­‐core  Intel  Ivy  Bridge)

Speedu

p  (X)

Page 20: Yinyang K-Means: A Drop-In Replacement of the Classic K ... › f4b9 › ab75539ffddad031134c596… · Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent

20

Speedu

p  (X)

Baseline:  Classic  K-­‐means

Speedu

p  (X)

(16GB,  8-­‐core)

XX

Page 21: Yinyang K-Means: A Drop-In Replacement of the Classic K ... › f4b9 › ab75539ffddad031134c596… · Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent

21

16GBm

Please  read  our  paper  for  details!                                          

Page 22: Yinyang K-Means: A Drop-In Replacement of the Classic K ... › f4b9 › ab75539ffddad031134c596… · Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent

Final  Takeaways• Yinyang K-Means: A Drop-In Replacement

• Consistently much faster

• Minimize distance calculations (via a Yinyang harmony)

• Superior cost-benefit tradeoff

• Elastic design

• Inherited trust

• Same results as classic K-Means gives

Code: http://research.csc.ncsu.edu/nc-caps/yykmeans.tar.bz2

K-Means