EXPLORATORY TOOLS IN MODEL-BASED CLUSTERING
L.A. García-Escudero, A. Gordaliza, C. Matrán and A. Mayo-Iscar
Dpto. de Estadística e I.O., Universidad de Valladolid (SPAIN)
1.- Introduction
• Two key questions:
¦ How to adequately choose the number of groups in a clustering problem?
¦ How to measure the strength of data point cluster assignments?
• It is impossible to answer these two questions without:
¦ Stating clearly which probabilistic model is assumed.
¦ Putting constraints on the allowed clusters' scatters.
¦ Stating clearly what we understand by noise.
2.- Model-Based Clustering
• Many statistical practitioners view Cluster Analysis as a collection of mostly heuristic techniques for partitioning multivariate data.
• This view relies on the fact that most cluster techniques are not explicitly based on a probabilistic model:
“...lead the naive investigator into believing that he or she did not make any assumption at all, and that the results therefore are ‘objective’...” (Flury 1997, page 123)
⇒ A properly stated underlying probabilistic model is convenient
• Two model-based clustering approaches:

¦ Mixture approach:

$$\prod_{i=1}^{n}\Bigl[\,\sum_{j=1}^{k}\pi_j\,\phi(x_i;\theta_j)\Bigr] \;\Rightarrow\; \text{EM-algorithm}$$

(assign $x_i$ to cluster $j$ whenever $\pi_j\phi(x_i;\theta_j) > \pi_l\phi(x_i;\theta_l)$ for $l \neq j$).

¦ “Crisp” (0-1) approach:

$$\prod_{j=1}^{k}\,\prod_{i\in R_j}\phi(x_i;\theta_j) \;\Rightarrow\; \text{CEM-algorithm}$$

($R_j$ = indexes of the $x_i$'s assigned to cluster $j$).
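A minimal sketch of the CEM idea in the simplest setting, assuming a common spherical covariance $\sigma^2 I$ (so the C-step is k-means-like); the function name `cem`, the initialization and all defaults are ours, not from the talk:

```python
# Minimal CEM (Classification EM) sketch for Gaussian clusters with a
# common spherical covariance sigma^2 * I. Illustrative only: single
# start, no safeguard against clusters becoming empty.
import numpy as np
from scipy.stats import multivariate_normal

def cem(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    centers = X[rng.choice(n, k, replace=False)].astype(float)
    pi = np.full(k, 1.0 / k)          # group weights
    sigma2 = X.var()                  # common spherical scale
    for _ in range(n_iter):
        # C-step: crisp assignment maximizing pi_j * phi(x_i; theta_j)
        logd = np.stack([np.log(pi[j]) +
                         multivariate_normal.logpdf(X, centers[j],
                                                    sigma2 * np.eye(p))
                         for j in range(k)], axis=1)
        labels = logd.argmax(axis=1)
        # M-step: update weights, centers and the common scale
        for j in range(k):
            mask = labels == j
            pi[j] = mask.mean()
            if mask.any():
                centers[j] = X[mask].mean(axis=0)
        sigma2 = np.mean((X - centers[labels]) ** 2)
    return labels, centers, pi
```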
• Noise in real problems ⇒ Robust Clustering

• Two robust clustering approaches providing a “theoretical well-based clustering criterion in presence of outliers” (Bock 2002):

¦ Mixture modeling: The noise is fitted through mixture components (Fraley and Raftery, Peel and McLachlan, ...).

¦ Trimming approach: A fraction α of the most outlying data is trimmed (Gallegos and Ritter, Cuesta-Albertos et al., García-Escudero et al., Neykov et al., ...).

⇒ We will focus on the trimming approach!!
3.- TCLUST methodology
• Spurious-Outlier Model (Gallegos 2001 and Gallegos and Ritter 2005):

$$\Bigl[\prod_{j=1}^{k}\,\prod_{i\in R_j} f(x_i;\mu_j,\Sigma_j)\Bigr]\Bigl[\prod_{i\notin R} g_{\psi_i}(x_i)\Bigr]$$

¦ $f(\cdot;\mu,\Sigma)$ is a $p$-variate normal p.d.f.
¦ $R = \cup_{j=1}^{k} R_j$ contains the $[n(1-\alpha)]$ regular data points.
¦ The $g_{\psi_i}$ are some p.d.f.'s for the non-regular data.
• If no conditions are imposed on the Σj's ⇒ Not a well-defined problem.
• Restrictions are needed:

¦ Same spherical covariance matrices (i.e., $\Sigma_j = \sigma^2 \cdot I$) ⇒ Trimmed k-means (Cuesta-Albertos et al. 1997).

¦ Same (not necessarily spherical) covariance matrices ($\Sigma_j = \Sigma$) ⇒ Determinantal criteria (Gallegos and Ritter 2005).

¦ Different covariances but with equal scales ($|\Sigma_1| = \dots = |\Sigma_k|$) ⇒ Heterogeneous robust clustering (Gallegos 2001, 2003).
• A different constraint:

$$M_n = \max_{j=1,\dots,k}\,\max_{l=1,\dots,p}\lambda_l(\Sigma_j) \quad\text{and}\quad m_n = \min_{j=1,\dots,k}\,\min_{l=1,\dots,p}\lambda_l(\Sigma_j),$$

where $\lambda_l(\Sigma_j)$ are the eigenvalues of the $\Sigma_j$.

• Fix a constant $c$ such that

$$M_n/m_n \le c \quad\text{(eigenvalues-ratio restriction)}.$$

¦ $c$ controls the strength of the restriction:
· $c = 1$ ⇒ Trimmed k-means.
· Large $c$ ⇒ An almost unrestricted solution.

• It extends Hathaway's restrictions $\sigma_i^2 \le c \cdot \sigma_j^2$ for $1 \le i, j \le k$.
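One simple way to picture the eigenvalues-ratio restriction is to truncate all cluster eigenvalues into an interval $[m, c\cdot m]$. The sketch below does this with a crude fixed threshold `m`; the actual TCLUST algorithm (García-Escudero et al. 2008) determines the truncation level optimally, so this is only an illustrative assumption:

```python
# Toy enforcement of the eigenvalue-ratio restriction M_n/m_n <= c:
# truncate every eigenvalue into [m, c*m] for a common threshold m.
import numpy as np

def truncate_eigenvalues(covs, c, m=None):
    """covs: list of (p, p) covariance matrices; c: allowed ratio."""
    eig = [np.linalg.eigh(S) for S in covs]   # (eigenvalues, eigenvectors)
    all_vals = np.concatenate([vals for vals, _ in eig])
    if all_vals.max() <= c * all_vals.min():
        return covs                           # restriction already holds
    if m is None:
        m = np.median(all_vals)               # crude common threshold
    return [vecs @ np.diag(np.clip(vals, m, c * m)) @ vecs.T
            for vals, vecs in eig]
```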
• Weights: We consider group weights πj ∈ [0, 1].
• Trimming + Eigenvalue restrictions + Weights ⇒ TCLUST
(García-Escudero et al. (2008), Annals of Statistics, 36, 1324-1345)

¦ Existence of both theoretical and sample solutions.
¦ Consistency.
¦ Feasible algorithm.
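A compact sketch of one TCLUST-style concentration step (trim, reassign, update and re-impose the restriction), reusing `truncate_eigenvalues` from above. It is an illustrative simplification under our own conventions, not the published feasible algorithm:

```python
# One TCLUST-style iteration: trim the alpha fraction with the smallest
# best-cluster density, then update weights, means and covariances, and
# re-impose the eigenvalue-ratio restriction. Illustrative only.
import numpy as np
from scipy.stats import multivariate_normal

def tclust_step(X, centers, covs, pi, alpha, c):
    n, p = X.shape
    k = len(pi)
    logd = np.stack([np.log(pi[j]) +
                     multivariate_normal.logpdf(X, centers[j], covs[j])
                     for j in range(k)], axis=1)
    best = logd.max(axis=1)
    keep = best >= np.quantile(best, alpha)   # discard ~[n*alpha] points
    labels = logd.argmax(axis=1)
    for j in range(k):
        mask = keep & (labels == j)
        pi[j] = mask.sum() / keep.sum()       # weights over kept points
        if mask.sum() > 1:
            centers[j] = X[mask].mean(axis=0)
            covs[j] = np.cov(X[mask].T) + 1e-8 * np.eye(p)
    covs = truncate_eigenvalues(covs, c)      # enforce M_n/m_n <= c
    return centers, covs, pi, labels, keep
```

Iterating this step from several random starts and keeping the best solution is the usual pattern for algorithms of this type.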
3. Guidance in choosing k
• Many procedures for choosing k in “crisp” clustering are based on monitoring the size of the (log-)“likelihoods”:

$$k \mapsto \sum_{j=1}^{k}\sum_{i\in R_j}\log\phi(x_i;\theta_j) \quad\text{for } k = 1, 2, \dots$$

• Examples:

¦ $\Sigma_j = \sigma^2 I$ (k-means) ⇒ Friedman and Rubin 1967, Engelman and Hartigan 1969, Calinski and Harabasz 1974, ...

¦ $\Sigma_j = \Sigma$ (determinant criterion) ⇒ Marriott 1971, ...

• Trimmed versions can also be considered ⇒ García-Escudero et al. 2003.
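A sketch of how this monitoring might be coded, fitting a Gaussian to each cluster of a k-means partition; the helper name and the use of scikit-learn are our own choices:

```python
# Monitor the "crisp" classification log-likelihood over k, using k-means
# partitions and per-cluster Gaussian fits. Illustrative only.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.cluster import KMeans

def classification_loglik(X, k, seed=0):
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=seed).fit_predict(X)
    p = X.shape[1]
    total = 0.0
    for j in range(k):
        Xj = X[labels == j]
        cov = np.cov(Xj.T) + 1e-8 * np.eye(p)   # small ridge for stability
        total += multivariate_normal.logpdf(Xj, Xj.mean(axis=0), cov).sum()
    return total

# for k in (1, 2, 3): print(k, classification_loglik(X, k))
```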
• Drawback: The log-likelihoods strictly increase as k increases:
[Figure: scatter plots of the same data set (axes x1, x2) with the fitted groups for k = 1, k = 2 and k = 3.]
• Log-likelihoods: -4765.8 (k = 1) < -3530.1 (k = 2) < -3197.7 (k = 3)
• Figure: log-likelihood against the number of groups (k = 1, 2, 3).
• Solutions:
¦ Searching for an “elbow”.
¦ Nonlinear transformation by Sugar and James 2003.
• TCLUST does not suffer from this problem:

[Figure: TCLUST fits of the same data set (axes x1, x2) for k = 1, k = 2 and k = 3.]

• Log-likelihoods: -4765.8 (k = 1) < -4203.2 (k = 2) = -4203.2 (k = 3)

• Recall the presence of weights (which can be set to zero) ⇒ π3 = 0 in the k = 3 solution, so the k = 3 fit coincides with the k = 2 one!
• This fact was already noticed by Bryant (1991) when dealing with the so-called Penalized Classification Maximum Likelihood...
• Importance of the “trimming” and the scatter constraint:

[Figure: (a) k = 3, α = 0 and large c; (b) k = 2, α = 0.1 and moderate c. “◦” marks the trimmed points in panel (b).]
4.- Classification Trimmed Likelihood Curves
• Based on the TCLUST methodology, we monitor:

$$(\alpha, k) \mapsto L(\alpha, k) := \sum_{j=1}^{k} n_j \log \pi_j + \sum_{j=1}^{k}\sum_{i\in R_j}\log\phi(x_i;\theta_j),$$

for $k = 1, 2, \dots$ and $\alpha \in (0, 1)$.
• The smallest k such that L(α, k) ≃ L(α, k + 1) (for almost every α) ⇒ this k is a good choice for the number of groups.

• L(α, k) increases very quickly up to α = α0 ⇒ α0 is a good choice for the trimming level.

• The curves also provide information about the group sizes.
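A sketch of how these curves could be traced over a grid of (α, k), given any fitter returning a TCLUST-type solution in the format of the `tclust_step` sketch above (in R, the tclust package provides the ctlcurves() function for this purpose):

```python
# Classification trimmed likelihood curves L(alpha, k) over a grid.
# `fit_tclust(X, k, alpha)` is a placeholder returning
# (centers, covs, pi, labels, keep), e.g. by iterating tclust_step.
import numpy as np
from scipy.stats import multivariate_normal

def trimmed_loglik(X, centers, covs, pi, labels, keep):
    L = 0.0
    for j in range(len(pi)):
        mask = keep & (labels == j)
        if mask.any():
            L += mask.sum() * np.log(pi[j])   # n_j * log(pi_j) term
            L += multivariate_normal.logpdf(X[mask], centers[j],
                                            covs[j]).sum()
    return L

def ctl_curves(X, fit_tclust, ks=(1, 2, 3),
               alphas=np.linspace(0.0, 0.5, 11)):
    return {(a, k): trimmed_loglik(X, *fit_tclust(X, k, a))
            for k in ks for a in alphas}
```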
Example 1: Mixture $0.9 \cdot N\!\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}4 & 1\\ 1 & 1\end{pmatrix}\right) + 0.1 \cdot N\!\left(\begin{pmatrix}5\\-5\end{pmatrix}, \begin{pmatrix}30 & 0\\ 0 & 30\end{pmatrix}\right)$
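A sketch to simulate data from the Example 1 mixture; the sample size and seed are our assumptions, since the talk does not state them:

```python
# Draw n points from 0.9*N((0,0), [[4,1],[1,1]]) + 0.1*N((5,-5), 30*I).
import numpy as np

rng = np.random.default_rng(0)
n = 1000                                   # assumed sample size
comp = rng.random(n) < 0.9                 # component indicator
X = np.where(comp[:, None],
             rng.multivariate_normal([0, 0], [[4, 1], [1, 1]], n),
             rng.multivariate_normal([5, -5], [[30, 0], [0, 30]], n))
```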
[Figure: curves L(α, k) for k = 1, 2, 3 with α ∈ [0, 0.5]; scatter plot of the (unrestricted) fit; and the differences between L(α, k) and L(α, k − 1) for k = 2 vs. 1 and k = 3 vs. 2.]
Example 2: $0.3 \cdot N\!\left(\begin{pmatrix}-5\\5\end{pmatrix}, \begin{pmatrix}4 & 1\\ 1 & 1\end{pmatrix}\right) + 0.3 \cdot N\!\left(\begin{pmatrix}5\\-5\end{pmatrix}, \begin{pmatrix}4 & -1\\ -1 & 1\end{pmatrix}\right) + 0.3 \cdot N\!\left(\begin{pmatrix}5\\5\end{pmatrix}, \begin{pmatrix}4 & 0\\ 0 & 4\end{pmatrix}\right) + 0.1 \cdot N\!\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}30 & 0\\ 0 & 30\end{pmatrix}\right)$
[Figure: curves L(α, k) for k = 1, 2, 3, 4 with α ∈ [0, 0.6]; scatter plot of the (unrestricted) fit; and the differences for k = 2 vs. 1, k = 3 vs. 2 and k = 4 vs. 3.]
Example 3: “The topography of multivariate normal mixtures” (Ray and Lindsay 2005) ⇒ Mixture with 2 components.
[Figure: curves L(α, k) for k = 1, 2, 3 with α ∈ [0, 0.7]; scatter plot of the (unrestricted) fit; and the differences for k = 2 vs. 1 and k = 3 vs. 2.]
Mixture with 2 components and 3 modes:

[Figure: scatter plot of the data (axes x, y) and the fit with α = 0.5.]
5.- Strength of cluster assignments
• Confirmatory tool: Were the choices made for k and α satisfactory?

• The strength of the cluster assignment of observation $x_i$ to group $j$:

$$D_j(x_i;\theta) = \pi_j\phi(x_i;\theta_j)$$
• If $D_{(1)}(x;\theta) \le \dots \le D_{(k)}(x;\theta)$, define some Bayes factors as:

$$\mathrm{BF}(i) = \log\bigl(D_{(k-1)}(x_i;\theta)/D_{(k)}(x_i;\theta)\bigr).$$
¦ Small BF(i) ⇒ Clear cluster assignment for the observation xi.
• Bayes factors for trimmed points: Given the maximum possible strength

$$d_i = \max_{j=1,\dots,k}\{\pi_j\phi(x_i;\theta_j)\} = D_{(k)}(x_i;\theta),$$

we have:

¦ $d_{(1)} \le \dots \le d_{([n\alpha])} \le \dots \le d_{(n)}$ ⇒ the $[n\alpha]$ observations with the smallest $d_i$ are trimmed.

¦ Bayes factors for trimmed data ⇒ $\mathrm{BF}(i) = \log\bigl(D_{(k)}(x_i;\theta)/d_{([n\alpha])}\bigr)$.

¦ The smaller BF(i), the more clearly observation i should be trimmed.
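A sketch computing both kinds of Bayes factors from a fitted solution in the format of the earlier sketches; all BF(i) values are ≤ 0, with values close to 0 flagging the most doubtful decisions:

```python
# Assignment strengths D_j(x_i) = pi_j * phi(x_i; theta_j) and the two
# "Bayes factor" diagnostics, computed on the log scale (requires k >= 2).
import numpy as np
from scipy.stats import multivariate_normal

def bayes_factors(X, centers, covs, pi, keep):
    k = len(pi)
    logD = np.stack([np.log(pi[j]) +
                     multivariate_normal.logpdf(X, centers[j], covs[j])
                     for j in range(k)], axis=1)
    logD.sort(axis=1)                 # columns now D_(1) <= ... <= D_(k)
    n_trim = int((~keep).sum())       # [n*alpha] trimmed observations
    d_sorted = np.sort(logD[:, -1])   # ordered d_i = D_(k)(x_i)
    log_d_thr = d_sorted[n_trim - 1] if n_trim > 0 else -np.inf
    return np.where(keep,
                    logD[:, -2] - logD[:, -1],   # log(D_(k-1)/D_(k))
                    logD[:, -1] - log_d_thr)     # log(D_(k)/d_([n*alpha]))
```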
• Graphical display I: “Silhouette” plot
[Figure: (a) the clustered data (axes x1, x2) with k = 3 and α = 0.1; (b) “silhouette”-type plot of the Bayes factors by groups.]
• Graphical display II: “Most doubtful assignments”
¦ Label the observations i with BF(i) ≥ log(3/4):

[Figure: the data with the observations satisfying BF(i) > log(3/4) labelled.]
[PCA, discriminant or Bhattacharyya coordinates (Hennig and Christlieb 2002) if p > 2...]