8. Unsupervised Analysis
Unsupervised Analysis
• Previously, all our training samples were labeled: learning from such samples is said to be "supervised"
• We now investigate a number of "unsupervised" procedures which use unlabeled samples
• Collecting and labeling a large set of sample patterns can be costly
• We can train with a large amount of (less expensive) unlabeled data, and only then use supervision to label the groupings found. This is appropriate for large "data mining" applications where the contents of a large database are not known beforehand
• This is also appropriate in many applications where the characteristics of the patterns can change slowly with time
• Improved performance can be achieved if classifiers running in an unsupervised mode are used
• We can use unsupervised methods to identify features that will then be useful for classification
• We gain some insight into the nature (or structure) of the data
Mixture Densities & Identifiability

• We shall begin with the assumption that the functional forms for the underlying probability densities are known and that the only thing that must be learned is the value of an unknown parameter vector
• We make the following assumptions:
  - The samples come from a known number c of classes
  - The prior probabilities P(ωj) for each class are known (j = 1, …, c)
  - The forms of the class-conditional densities P(x | ωj, θj) (j = 1, …, c) are known
  - The values of the c parameter vectors θ1, θ2, …, θc are unknown
• The category labels are unknown
• The samples are then drawn from the density function below, which is called a mixture density:

    P(x \mid \theta) = \sum_{j=1}^{c} \underbrace{P(x \mid \omega_j, \theta_j)}_{\text{component densities}} \; \underbrace{P(\omega_j)}_{\text{mixing parameters}}, \qquad \text{where } \theta = (\theta_1, \theta_2, \ldots, \theta_c)^t

• Our goal will be to use samples drawn from this mixture density to estimate the unknown parameter vector θ. Once θ is known, we can decompose the mixture into its components and use a MAP classifier on the derived densities
• Definition
  - A density p(x | θ) is said to be identifiable if θ ≠ θ′ implies that there exists an x such that p(x | θ) ≠ p(x | θ′)
As a simple example, consider the case where x is binary and P(x | θ) is the mixture:

    P(x \mid \theta) = \frac{1}{2}\theta_1^x(1-\theta_1)^{1-x} + \frac{1}{2}\theta_2^x(1-\theta_2)^{1-x}
                     = \begin{cases} \frac{1}{2}(\theta_1+\theta_2) & \text{if } x = 1 \\ 1-\frac{1}{2}(\theta_1+\theta_2) & \text{if } x = 0 \end{cases}

Assume that P(x = 1 | θ) = 0.6 ⇒ P(x = 0 | θ) = 0.4. By substituting these probability values, we obtain θ1 + θ2 = 1.2, but knowing this sum tells us nothing about the individual values of θ1 and θ2.
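This unidentifiability is easy to verify numerically; a minimal Python sketch follows (the two parameter vectors below are illustrative values, not from the slides):

    # Two different parameter vectors with the same sum theta1 + theta2
    # induce exactly the same distribution on the binary variable x.
    def p_x_equals_1(theta1, theta2):
        # P(x = 1 | theta) = (theta1 + theta2) / 2 for this mixture
        return 0.5 * (theta1 + theta2)

    print(p_x_equals_1(0.4, 0.8))  # 0.6
    print(p_x_equals_1(0.6, 0.6))  # 0.6 -> same distribution, different theta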
• Thus, we have a case in which the mixture distribution is completely unidentifiable, and therefore unsupervised learning is impossible
• For discrete distributions, if there are too many components in the mixture, there may be more unknowns than independent equations, and identifiability can become a serious problem!
• While it can be shown that mixtures of normal densities are usually identifiable, the parameters in the simple mixture density

    P(x \mid \theta) = \frac{P(\omega_1)}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x-\theta_1)^2\right] + \frac{P(\omega_2)}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x-\theta_2)^2\right]

cannot be uniquely identified if P(ω1) = P(ω2) (we cannot recover a unique θ even from an infinite amount of data!)
• θ = (θ1, θ2) and θ = (θ2, θ1) are two possible vectors that can be interchanged without affecting P(x | θ)
• Although identifiability can be a problem, we will always assume that the densities we are dealing with are identifiable!
ML Estimates

Suppose that we have a set of n unlabeled samples drawn independently from the mixture density

    p(x \mid \theta) = \sum_{j=1}^{c} p(x \mid \omega_j, \theta_j)\, P(\omega_j)

(θ is fixed but unknown!)

The log-likelihood of the observed samples is

    l = \sum_{k=1}^{n} \ln p(x_k \mid \theta)

and its gradient with respect to θi (assuming the elements of θi and θj are functionally independent for i ≠ j) is

    \nabla_{\theta_i} l = \sum_{k=1}^{n} P(\omega_i \mid x_k, \theta)\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \theta_i)

with

    P(\omega_i \mid x_k, \theta) = \frac{p(x_k \mid \omega_i, \theta_i)\, P(\omega_i)}{p(x_k \mid \theta)}
Since the gradient must vanish at the value of θi that maximizes l, the ML estimate θ̂i must satisfy the conditions:

    \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \hat{\theta}_i) = 0, \qquad i = 1, \ldots, c
By including the prior probabilities as unknown variables, we finally obtain:

    \hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})

and

    \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \hat{\theta}_i) = 0

where

    \hat{P}(\omega_i \mid x_k, \hat{\theta}) = \frac{p(x_k \mid \omega_i, \hat{\theta}_i)\, \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \hat{\theta}_j)\, \hat{P}(\omega_j)}
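As a concrete illustration, this posterior can be computed as below; a minimal sketch assuming one-dimensional Gaussian components (the function name and parameters are our own, not from the slides):

    import numpy as np

    def posteriors(x, means, sigmas, priors):
        # Returns P_hat(omega_i | x_k, theta_hat) as an array of shape (c, n),
        # assuming 1-D Gaussian component densities.
        x = np.asarray(x, dtype=float)
        means = np.asarray(means, dtype=float)[:, None]
        sigmas = np.asarray(sigmas, dtype=float)[:, None]
        priors = np.asarray(priors, dtype=float)[:, None]
        # component densities p(x_k | omega_i, theta_i)
        comp = np.exp(-0.5 * ((x - means) / sigmas) ** 2) / (np.sqrt(2 * np.pi) * sigmas)
        joint = comp * priors               # numerator of the posterior
        return joint / joint.sum(axis=0)    # normalize by the mixture density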
Applications to Normal Mixtures
Case 1 = simplest case

    Case   μi   Σi   P(ωi)   c
     1     ?    x    x       x
     2     ?    ?    ?       x
     3     ?    ?    ?       ?

    (? = unknown, x = known)
• Case 1: Unknown mean vectors
Here the only unknowns are the means; for normal components ∇μi ln p(xk | ωi, μi) = Σi⁻¹(xk − μi), so the general ML conditions reduce to an equation in the μi alone
• Case 1: Unknown mean vectors (continued)

ML estimate:

    \hat{\mu}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\mu})\, x_k}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\mu})} \qquad (1)

P̂(ωi | xk, μ̂) is the fraction of those samples having value xk that come from the i-th class, and μ̂i is the average of the samples coming from the i-th class.
• Unfortunately, (1) does not give μ̂i explicitly
• However, if we have some way of obtaining good initial estimates for the unknown means, (1) can be seen as an iterative process for improving the estimates (see the sketch below)
• This is a gradient ascent for maximizing the log-likelihood function
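Viewed this way, (1) can be coded directly as a fixed-point iteration; a minimal sketch, assuming one-dimensional components with unit variance and known priors (the function name and iteration count are our own choices):

    import numpy as np

    def update_means(x, means, priors, n_iter=50):
        # Fixed-point iteration of eq. (1): unknown means only.
        x = np.asarray(x, dtype=float)
        means = np.asarray(means, dtype=float).copy()
        priors = np.asarray(priors, dtype=float)
        for _ in range(n_iter):
            # responsibilities P(omega_i | x_k, mu_hat), shape (c, n)
            comp = np.exp(-0.5 * (x[None, :] - means[:, None]) ** 2) / np.sqrt(2 * np.pi)
            post = comp * priors[:, None]
            post = post / post.sum(axis=0)
            # eq. (1): posterior-weighted average of the samples
            means = (post * x[None, :]).sum(axis=1) / post.sum(axis=1)
        return means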
• Example: Consider the simple two-component one-dimensional normal mixture

    p(x \mid \mu_1, \mu_2) = \frac{1}{3\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x-\mu_1)^2\right] + \frac{2}{3\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x-\mu_2)^2\right]

(2 clusters!)

Let's set μ1 = -2, μ2 = 2 and draw 25 samples sequentially from this mixture.
The log-likelihood function is:

    l(\mu_1, \mu_2) = \sum_{k=1}^{25} \ln p(x_k \mid \mu_1, \mu_2)

• The maximum value of l occurs at estimates μ̂1 and μ̂2 that are not far from the true values μ1 = -2 and μ2 = +2
• There is another peak, with the roles of μ1 and μ2 roughly interchanged, which has almost the same height, as can be seen from a contour plot of the log-likelihood surface
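One way to see both peaks is to evaluate l on a grid of (μ1, μ2) values; a minimal sketch (the 25 samples are drawn here with a fixed seed, not taken from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    # 25 samples from the mixture with mu1 = -2, mu2 = 2, weights 1/3 and 2/3
    first = rng.random(25) < 1.0 / 3.0
    x = np.where(first, rng.normal(-2.0, 1.0, 25), rng.normal(2.0, 1.0, 25))

    def loglik(m1, m2):
        p = (np.exp(-0.5 * (x - m1) ** 2) / 3.0
             + 2.0 * np.exp(-0.5 * (x - m2) ** 2) / 3.0) / np.sqrt(2.0 * np.pi)
        return np.log(p).sum()

    grid = np.linspace(-4.0, 4.0, 81)
    L = np.array([[loglik(m1, m2) for m2 in grid] for m1 in grid])
    i, j = np.unravel_index(L.argmax(), L.shape)
    print("global peak near mu1 =", grid[i], ", mu2 =", grid[j])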
• This mixture of normal densities is identifiable
• When the mixture density is not identifiable, the ML solution is not unique
• Case 2: All parameters unknown
• No constraints are placed on the covariance matrix

Let p(x | μ, σ²) be the two-component normal mixture:

    p(x \mid \mu, \sigma^2) = \frac{1}{2\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right] + \frac{1}{2\sqrt{2\pi}} \exp\left[-\frac{1}{2}x^2\right]
Suppose μ = x1; therefore:

    p(x_1 \mid \mu, \sigma^2) = \frac{1}{2\sqrt{2\pi}\,\sigma} + \frac{1}{2\sqrt{2\pi}} \exp\left[-\frac{1}{2}x_1^2\right]

For the rest of the samples:

    p(x_k \mid \mu, \sigma^2) \ge \frac{1}{2\sqrt{2\pi}} \exp\left[-\frac{1}{2}x_k^2\right]

Finally,

    p(x_1, \ldots, x_n \mid \mu, \sigma^2) \ge \left\{\frac{1}{\sigma} + \exp\left[-\frac{1}{2}x_1^2\right]\right\} \frac{1}{(2\sqrt{2\pi})^n} \exp\left[-\frac{1}{2}\sum_{k=2}^{n} x_k^2\right]

which grows without bound as σ → 0. The likelihood can therefore be made arbitrarily large, and the maximum-likelihood solution becomes singular.
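The singularity is easy to reproduce numerically; a minimal sketch with made-up samples, placing μ on x1 and shrinking σ:

    import numpy as np

    x = np.array([0.5, -1.2, 0.3, 2.0])   # made-up samples
    mu = x[0]                              # place the component mean on x1

    def loglik(sigma):
        c1 = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (2.0 * np.sqrt(2.0 * np.pi) * sigma)
        c2 = np.exp(-0.5 * x ** 2) / (2.0 * np.sqrt(2.0 * np.pi))
        return np.log(c1 + c2).sum()

    for s in [1.0, 0.1, 0.01, 0.001]:
        print(s, loglik(s))    # the log-likelihood grows without bound as sigma -> 0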
• Adding an assumption: consider the largest of the finite local maxima of the likelihood function and use ML estimation. We obtain the following iterative scheme:

    \hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})

    \hat{\mu}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, x_k}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})}

    \hat{\Sigma}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, (x_k - \hat{\mu}_i)(x_k - \hat{\mu}_i)^t}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})}
where:

    \hat{P}(\omega_i \mid x_k, \hat{\theta}) = \frac{|\hat{\Sigma}_i|^{-1/2} \exp\left[-\frac{1}{2}(x_k-\hat{\mu}_i)^t \hat{\Sigma}_i^{-1} (x_k-\hat{\mu}_i)\right] \hat{P}(\omega_i)}{\sum_{j=1}^{c} |\hat{\Sigma}_j|^{-1/2} \exp\left[-\frac{1}{2}(x_k-\hat{\mu}_j)^t \hat{\Sigma}_j^{-1} (x_k-\hat{\mu}_j)\right] \hat{P}(\omega_j)}
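Together, these updates form an EM-style iteration; a minimal sketch for the one-dimensional (scalar-variance) case, where the initialization, iteration count, and variance floor are our own assumptions:

    import numpy as np

    def em_mixture_1d(x, c, n_iter=100, seed=0):
        # Iterative ML scheme for a 1-D normal mixture, all parameters unknown.
        rng = np.random.default_rng(seed)
        x = np.asarray(x, dtype=float)
        n = x.size
        priors = np.full(c, 1.0 / c)
        means = rng.choice(x, size=c, replace=False)   # start from random samples
        varis = np.full(c, x.var())
        for _ in range(n_iter):
            # posterior P_hat(omega_i | x_k, theta_hat), shape (c, n)
            comp = np.exp(-0.5 * (x - means[:, None]) ** 2 / varis[:, None]) \
                   / np.sqrt(2.0 * np.pi * varis[:, None])
            post = comp * priors[:, None]
            post = post / post.sum(axis=0)
            w = post.sum(axis=1)                        # effective class counts
            priors = w / n                              # P_hat(omega_i)
            means = (post * x).sum(axis=1) / w          # mu_hat_i
            varis = (post * (x - means[:, None]) ** 2).sum(axis=1) / w  # Sigma_hat_i
            varis = np.maximum(varis, 1e-6)  # guard against the singular solutions above
        return priors, means, varis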
Clustering
What is a Cluster? A cluster is a collection of data points with similar properties.
Partitional clustering
• C-means algorithm (hard)
• ISODATA algorithm
• Fuzzy C-means clustering

Hierarchical clustering
• Single-linkage algorithm (minimum method)
• Complete-linkage algorithm (maximum method)
• Average-linkage algorithm
• Minimum-variance method (Ward's method)
C-Means Clustering

• Goal: find the c mean vectors μ1, μ2, …, μc
• Replace the squared Mahalanobis distance (xk − μ̂i)ᵗ Σ̂i⁻¹ (xk − μ̂i) used in the previous discussion by the squared Euclidean distance ||xk − μ̂i||²
• Find the mean μ̂m nearest to xk and approximate P̂(ωi | xk, θ̂) as:

    \hat{P}(\omega_i \mid x_k, \hat{\theta}) \cong \begin{cases} 1 & \text{if } i = m \\ 0 & \text{otherwise} \end{cases}
• Use the iterative scheme to find μ̂1, μ̂2, …, μ̂c
• If n is the known number of patterns and c the desired number of clusters, the C-means algorithm is:

    Begin
        initialize n, c, μ1, μ2, …, μc (randomly selected)
        do
            classify n samples according to nearest μi
            recompute μi
        until no change in μi
        return μ1, μ2, …, μc
    End
C-means algorithm:
1. Begin with C cluster centers
2. For each sample, find the cluster center nearest to it. Put the sample in the cluster represented by the just-found cluster center.
3. If no samples changed clusters, stop.
4. Recompute the cluster centers of altered clusters and go back to step 2 (see the sketch below).

Properties:
• The number of clusters C must be given in advance.
• The goal is to minimize the squared error, but the algorithm could end up in a local minimum.

Demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
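A direct implementation of these four steps; a minimal sketch, with random-sample initialization as an assumption:

    import numpy as np

    def c_means(x, c, seed=0):
        # Hard C-means (k-means); x has shape (n, d).
        rng = np.random.default_rng(seed)
        x = np.asarray(x, dtype=float)
        centers = x[rng.choice(len(x), size=c, replace=False)].copy()  # step 1
        labels = np.full(len(x), -1)
        while True:
            # step 2: assign each sample to the nearest cluster center
            d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            new_labels = d2.argmin(axis=1)
            if np.array_equal(new_labels, labels):     # step 3: stable -> stop
                return centers, labels
            labels = new_labels
            for i in range(c):                          # step 4: recompute centers
                if (labels == i).any():
                    centers[i] = x[labels == i].mean(axis=0)

Because each pass can only lower the total squared error over a finite set of partitions, the loop terminates, though possibly at a local minimum, as noted above.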
Clustering: ISODATA
Similar to C-means with some enhancements:
• Clusters with too few elements are discarded.
• Clusters are merged if the number of clusters grows too large or if clusters are too close together.
• A cluster is split if the number of clusters is too few or if the cluster contains very dissimilar samples.

Properties:
• The number of clusters C is not given exactly in advance.
• The algorithm may require extensive computations.

A simplified sketch of one such pass follows.
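The exact thresholds and ordering of these heuristics vary between ISODATA implementations; the following is a deliberately simplified sketch of a single discard/merge/split pass, where all threshold values are illustrative assumptions, not the original ISODATA specification:

    import numpy as np

    def isodata_step(x, centers, min_size=5, merge_dist=1.0, split_std=2.0):
        # One simplified ISODATA pass: discard, merge, then split clusters.
        x = np.asarray(x, dtype=float)
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # discard clusters with too few elements
        keep = [i for i in range(len(centers)) if (labels == i).sum() >= min_size]
        centers = centers[keep]
        # greedily merge pairs of centers that are too close together
        merged, used = [], set()
        for i in range(len(centers)):
            if i in used:
                continue
            for j in range(i + 1, len(centers)):
                if j not in used and np.linalg.norm(centers[i] - centers[j]) < merge_dist:
                    merged.append((centers[i] + centers[j]) / 2.0)
                    used.update({i, j})
                    break
            else:
                merged.append(centers[i])
                used.add(i)
        centers = np.array(merged)
        # split clusters containing very dissimilar samples
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        result = []
        for i, ctr in enumerate(centers):
            pts = x[labels == i]
            spread = pts.std(axis=0) if len(pts) else 0.0
            if len(pts) and np.max(spread) > split_std:
                result.extend([ctr + spread / 2.0, ctr - spread / 2.0])
            else:
                result.append(ctr)
        return np.array(result)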
Fuzzy C-Means Clustering
Solution Spaces for Clustering
Constrained c-Partition Matrices
Partition of Data
Defuzzification of Fuzzy Partitions
Conversion via max membership and α-cuts
Fuzzy c-Means (FCM) Clustering

Let's now assume that a sample can belong to all the clusters. Define uij where

    \sum_{i=1}^{c} u_{ij} = 1, \qquad \forall j = 1, \ldots, n

The cost function (or objective function) for FCM can then be written as

    J(U, c_1, \ldots, c_c) = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^m d_{ij}^2

where uij is between 0 and 1; ci is the cluster center of fuzzy group i; dij = ||ci − xj|| is the Euclidean distance between the i-th cluster center and the j-th data point; and m ∈ [1, ∞) is a weighting exponent.
Fuzzy c-Means (FCM) Clustering (continued)

The updated mean vector is

    c_i = \frac{\sum_{j=1}^{n} u_{ij}^m\, x_j}{\sum_{j=1}^{n} u_{ij}^m}

where the updated membership is

    u_{ij} = \frac{1}{\sum_{k=1}^{c} \left(\frac{d_{ij}}{d_{kj}}\right)^{2/(m-1)}}
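Alternating these two updates until the memberships stabilize gives the FCM algorithm; a minimal sketch, assuming m > 1 (the random initialization, fixed iteration count, and small distance floor are our own choices):

    import numpy as np

    def fuzzy_c_means(x, c, m=2.0, n_iter=100, seed=0):
        # FCM; x has shape (n, d), m > 1. Returns centers (c, d) and U (c, n).
        rng = np.random.default_rng(seed)
        x = np.asarray(x, dtype=float)
        u = rng.random((c, len(x)))
        u = u / u.sum(axis=0)                    # enforce sum_i u_ij = 1
        for _ in range(n_iter):
            um = u ** m
            centers = um @ x / um.sum(axis=1, keepdims=True)   # c_i update
            d = np.linalg.norm(x[None, :, :] - centers[:, None, :], axis=2)
            d = np.maximum(d, 1e-12)             # avoid division by zero
            ratio = (d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0))
            u = 1.0 / ratio.sum(axis=1)          # u_ij update
        return centers, u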
C-means Algorithm (FCM/HCM)
Hierarchical Clustering
Agglomerative clustering (bottom up)
1. Begin with n clusters, each containing one sample
2. Merge the most similar two clusters into one.
3. Repeat the previous step until done
Divisive clustering (top down)
[Figure: an agglomerative merging sequence on samples 1–7: {1,7} and {5,6} are formed first, then {3,5,6}, then {1,3,5,6,7} and {2,4}, and finally {1,2,3,4,5,6,7}]
The most natural representation of hierarchical clustering is a corresponding tree, called a dendrogram, which shows how the samples are grouped
Demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html
Single-linkage algorithm (minimum method):

    D_{min}(C_i, C_j) = \min_{a \in C_i,\, b \in C_j} d(a, b)

Complete-linkage algorithm (maximum method):

    D_{max}(C_i, C_j) = \max_{a \in C_i,\, b \in C_j} d(a, b)

Average-linkage algorithm (average method):

    D_{ave}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{a \in C_i,\, b \in C_j} d(a, b)
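These three cluster distances translate directly into code; a minimal sketch where each cluster is a sequence of points and d is the Euclidean distance:

    import numpy as np

    def d_min(ci, cj):
        # single linkage: minimum pairwise distance between clusters
        return min(np.linalg.norm(a - b) for a in ci for b in cj)

    def d_max(ci, cj):
        # complete linkage: maximum pairwise distance
        return max(np.linalg.norm(a - b) for a in ci for b in cj)

    def d_ave(ci, cj):
        # average linkage: mean over all pairs
        return sum(np.linalg.norm(a - b) for a in ci for b in cj) / (len(ci) * len(cj))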
Ward's method (minimum-variance method):

    D_{Ward}(C_i, C_j) = \sum_{j=1}^{d} \sigma_j^2 = \sum_{j=1}^{d} \sum_{i=1}^{m} (x_{ij} - \mu_j)^2

d: the number of features
m: the number of samples in Ci and Cj
μj: the mean of the j-th feature over those m samples
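With any of these cluster distances, agglomerative clustering is a simple loop that repeatedly merges the closest pair of clusters; a minimal sketch (single linkage by default; the sample points are illustrative):

    import numpy as np

    def agglomerative(points, target=1, dist=None):
        # Merge the closest pair of clusters until `target` clusters remain;
        # records the merge sequence (cf. the merging example earlier).
        if dist is None:  # single linkage by default
            dist = lambda ci, cj: min(np.linalg.norm(points[a] - points[b])
                                      for a in ci for b in cj)
        clusters = [{i} for i in range(len(points))]   # one sample per cluster
        merges = []
        while len(clusters) > target:
            # find the most similar (closest) pair of clusters
            i, j = min(((a, b) for a in range(len(clusters))
                        for b in range(a + 1, len(clusters))),
                       key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
            clusters[i] |= clusters[j]
            merges.append(sorted(clusters[i]))
            del clusters[j]
        return clusters, merges

    points = np.array([[0.0], [0.1], [1.0], [1.1], [5.0]])
    print(agglomerative(points)[1])   # the sequence of merged clusters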
Single-linkage algorithm [example figure]
Complete-linkage algorithm [example figure]