Clusterpath: an Algorithm for Clustering using Convex Fusion Penalties
Toby Dylan Hocking, Armand Joulin, Francis Bach, Jean-Philippe Vert
INRIA – Sierra team, Laboratoire d’Informatique de l’ÉNS; Mines ParisTech – CBIO, INSERM U900, Institut Curie
Paris, France
The clustering problem: different approaches
Clustering: assign labels to n points in p dimensions, X ∈ R^{n×p}.
Methods: K-means, hierarchical, mixture models, spectral (Ng et al. 2001).
Issues: hierarchy, convexity, greediness, stability, interpretability.
Clusterpath: relaxing a hard fusion penalty
Hard-thresholding of differences is a combinatorial problem:

    min_{α ∈ R^{n×p}} ||α − X||²_F   subject to   Σ_{i<j} 1_{α_i ≠ α_j} ≤ t

Relaxation:

    Σ_{i<j} w_ij ||α_i − α_j||_q ≤ t

The Lagrange form is useful for optimization algorithms:

    α*(X, λ, q, w) = argmin_{α ∈ R^{n×p}} (1/2) ||α − X||²_F + λ Σ_{i<j} w_ij ||α_i − α_j||_q

The clusterpath of X is the path of optimal α* obtained by varying λ, for fixed weights w_ij ∈ R_+ and norm q ∈ {1, 2, ∞}. Related work: “fused lasso” (Tibshirani and Saunders, 2005), “grouping pursuit” (Shen and Huang, 2010), “sum of norms” (Lindsten et al., 2011).
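As a reference point, the Lagrange objective above is easy to evaluate directly. Below is a minimal pure-Python sketch (function names are ours, not from the clusterpath package); the Gaussian decreasing weights w_ij = exp(−γ ||X_i − X_j||²) are an assumption suggested by the γ parameter used in the figures, with γ = 0 recovering identity weights.

```python
import math

def lq_norm(v, q):
    """||v||_q for q in {1, 2, float('inf')}."""
    if q == 1:
        return sum(abs(x) for x in v)
    if q == 2:
        return math.sqrt(sum(x * x for x in v))
    return max(abs(x) for x in v)  # q = infinity

def clusterpath_objective(alpha, X, lam, q, W):
    """0.5 * ||alpha - X||_F^2 + lam * sum_{i<j} W[i][j] * ||alpha_i - alpha_j||_q."""
    fit = 0.5 * sum((a - x) ** 2
                    for ai, xi in zip(alpha, X) for a, x in zip(ai, xi))
    pen = sum(W[i][j] * lq_norm([a - b for a, b in zip(alpha[i], alpha[j])], q)
              for i in range(len(X)) for j in range(i + 1, len(X)))
    return fit + lam * pen

def gaussian_weights(X, gamma):
    """Assumed decreasing weights w_ij = exp(-gamma * ||X_i - X_j||^2)."""
    n = len(X)
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
             if i != j else 0.0
             for j in range(n)] for i in range(n)]
```

With γ = 0 every off-diagonal weight equals 1 (identity weights), so the penalty reduces to Σ_{i<j} ||α_i − α_j||_q.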
Norm and weights control the clusterpath
[Figure: clusterpaths of two points X for norms 1, 2, and ∞, with weight decay γ = 0 and γ = 1.]
Geometric interpretation: constrain the area between points.

Identity weights, t = Ω(X):
[Figure: ℓ1, ℓ2, and ℓ∞ constraint geometry between points X1, X2, X3.]
Decreasing weights after join, t < Ω(X )
w12w13
X1
X2
X3
α1
αC = α2 = α3
Decreasing weights, t = Ω(X):
[Figure: points X1, X2, X3 with weights w12, w13, w23.]
We propose dedicated algorithms for each norm
Norm  Properties                   Algorithm    Complexity    Problem sizes
1     piecewise linear, separable  path         O(pn log n)   large ≈ 10^5
2     rotation invariant           active-set   O(n²p)        medium ≈ 10^3
∞     piecewise linear             Frank-Wolfe  unknown*      medium ≈ 10^3

*each iteration has complexity O(n²p).
Outline of the ℓ1 path algorithm
Condition sufficient for optimality:

    0 = α_i − X_i + λ Σ_{j≠i, α_i≠α_j} w_ij sign(α_i − α_j) + λ Σ_{j≠i, α_i=α_j} w_ij β_ij,

with |β_ij| ≤ 1 and β_ij = −β_ji (Hoefling 2009).

1. For λ = 0 the solution α = X is optimal. We initialize the clusters C_i = {i} and coefficients α_i = X_i for all i.
2. As λ increases, the solutions follow straight lines.
3. Taking the derivative of the optimality condition with respect to λ and summing over all points in a cluster C leads to:

    dα_C/dλ = v_C = Σ_{j∉C} w_jC sign(α_j − α_C) = (1/|C|) Σ_{j∉C} Σ_{i∈C} w_ij sign(α_j − α_C)

4. When 2 clusters C1 and C2 fuse, they form a new cluster C = C1 ∪ C2 with v_C = (|C1| v_1 + |C2| v_2)/(|C1| + |C2|).
5. Stop when all the points merge at the mean X̄.
6. Combine dimensions using λ values.
ℓ1 clusterpath of 10 points in 2d
Joins the left cluster on α1 before joining the right cluster.
The solution at λ = 0.18 yields 2 clusters.
[Figure: optimal values of the ℓ1 clusterpath along the regularization path λ (0 to 0.20), shown in the (α1, α2) plane.]
Free software! http://clusterpath.r-forge.r-project.org/
Dedicated C++ optimization algorithms with an R interface.
Calculates the exact ℓ1 clusterpath for identity weights.
Active-set algorithm for the ℓ1 and ℓ2 clusterpath with general weights.
R interface to the Python cvxmod clusterpath solver.
Clusterpath visualizations in 2d, 3d, and animations.
Coming soon: picking the number of clusters automatically!

Future work
Necessary and sufficient conditions for cluster splitting?
Automatically learning weights and the number of clusters?
Applications to solving proximal problems.
Clustering performance and timings
Cluster using the prior knowledge that there are 2 clusters.
Quantify partition correspondence using the normalized Rand index (Hubert and Arabie, 1985): 1 for perfect correspondence, 0 for completely random assignment.
Results for 2 non-convex interlocking half-moons in 2d:
Clustering method          Rand  SD    Seconds  SD
eexp spectral clusterpath  0.99  0.00   8.49    2.64
eexp spectral kmeans       0.99  0.00   3.10    0.08
ℓ2 clusterpath             0.95  0.12  29.47    2.31
e01 Ng et al. kmeans       0.95  0.19   7.37    0.42
e01 spectral kmeans        0.91  0.19   3.26    0.21
Gaussian mixture           0.42  0.13   0.07    0.00
average linkage            0.40  0.13   0.05    0.00
kmeans                     0.26  0.04   0.01    0.00
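The Hubert and Arabie (1985) index, also known as the adjusted Rand index, can be computed from the contingency table of the two partitions. A pure-Python sketch (in practice a library implementation would normally be used):

```python
from collections import Counter
from math import comb

def normalized_rand_index(labels_true, labels_pred):
    """Adjusted Rand index of Hubert and Arabie (1985): 1 for a perfect
    match, ~0 for a random assignment."""
    n = len(labels_true)
    pairs = Counter(zip(labels_true, labels_pred))  # contingency table cells
    a = Counter(labels_true)                        # row sums
    b = Counter(labels_pred)                        # column sums
    index = sum(comb(nij, 2) for nij in pairs.values())
    sum_a = sum(comb(ni, 2) for ni in a.values())
    sum_b = sum(comb(nj, 2) for nj in b.values())
    expected = sum_a * sum_b / comb(n, 2)           # expected index under chance
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)
```

The index is invariant to label permutations, so a clustering that swaps the two label names still scores 1.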
Similar performance to spectral clustering, and learns a tree:
The weighted ℓ2 clusterpath applied to the iris data:
[Figure: scatter plot matrix of the iris data (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width), colored by species: setosa, versicolor, virginica.]
Performance for several model sizes
[Figure: normalized Rand index (bigger means better agreement with known clusters) versus number of clusters (2 to 11), for the iris and moons data, comparing the clusterpath with γ = 0.5, 2, 10 against GMM, HC, and kmeans.]