8. Unsupervised Analysis
Unsupervised Analysis
• Previously, all our training samples were labeled: learning from such samples is said to be "supervised"
• We now investigate a number of "unsupervised" procedures which use unlabeled samples
• Collecting and labeling a large set of sample patterns can be costly
• We can train with a large amount of (less expensive) unlabeled data, and only then use supervision to label the groupings found. This is appropriate for large "data mining" applications where the contents of a large database are not known beforehand
• This is also appropriate in many applications where the characteristics of the patterns can change slowly with time
• Improved performance can be achieved if classifiers running in an unsupervised mode are used
• We can use unsupervised methods to identify features that will then be useful for classification
• We gain some insight into the nature (or structure) of the data
Mixture Densities & Identifiability

• We shall begin with the assumption that the functional forms for the underlying probability densities are known and that the only thing that must be learned is the value of an unknown parameter vector
• We make the following assumptions:
  - The samples come from a known number c of classes
  - The prior probabilities P(ωj) for each class are known (j = 1, …, c)
  - The forms of the class-conditional densities P(x | ωj, θj) (j = 1, …, c) are known
  - The values of the c parameter vectors θ1, θ2, …, θc are unknown
• The category labels are unknown
• The samples are then drawn from the density function below, which is called a mixture density:

    P(x \mid \theta) = \sum_{j=1}^{c} \underbrace{P(x \mid \omega_j, \theta_j)}_{\text{component densities}} \; \underbrace{P(\omega_j)}_{\text{mixing parameters}}, \qquad \text{where } \theta = (\theta_1, \theta_2, \ldots, \theta_c)^t

• Our goal will be to use samples drawn from this mixture density to estimate the unknown parameter vector θ. Once θ is known, we can decompose the mixture into its components and use a MAP classifier on the derived densities
• Definition
  - A density p(x | θ) is said to be identifiable if θ ≠ θ′ implies that there exists an x such that p(x | θ) ≠ p(x | θ′)
As a simple example, consider the case where x is binary and P(x | θ) is the mixture:

    P(x \mid \theta) = \frac{1}{2}\theta_1^x(1-\theta_1)^{1-x} + \frac{1}{2}\theta_2^x(1-\theta_2)^{1-x}
                     = \begin{cases} \frac{1}{2}(\theta_1+\theta_2) & \text{if } x = 1 \\ 1-\frac{1}{2}(\theta_1+\theta_2) & \text{if } x = 0 \end{cases}

Assume that P(x = 1 | θ) = 0.6 ⇒ P(x = 0 | θ) = 0.4. By substituting these probability values, we obtain θ1 + θ2 = 1.2, but knowing this sum tells us nothing about the individual values of θ1 and θ2.
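This unidentifiability is easy to verify numerically; a minimal Python sketch follows (the two parameter vectors below are illustrative values, not from the slides):

    # Two different parameter vectors with the same sum theta1 + theta2
    # induce exactly the same distribution on the binary variable x.
    def p_x_equals_1(theta1, theta2):
        # P(x = 1 | theta) = (theta1 + theta2) / 2 for this mixture
        return 0.5 * (theta1 + theta2)

    print(p_x_equals_1(0.4, 0.8))  # 0.6
    print(p_x_equals_1(0.6, 0.6))  # 0.6 -> same distribution, different theta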
• Thus, we have a case in which the mixture distribution is completely unidentifiable, and therefore unsupervised learning is impossible
• For discrete distributions, if there are too many components in the mixture, there may be more unknowns than independent equations, and identifiability can become a serious problem!
• While it can be shown that mixtures of normal densities are usually identifiable, the parameters in the simple mixture density

    P(x \mid \theta) = \frac{P(\omega_1)}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x-\theta_1)^2\right] + \frac{P(\omega_2)}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x-\theta_2)^2\right]

cannot be uniquely identified if P(ω1) = P(ω2) (we cannot recover a unique θ even from an infinite amount of data!)
• θ = (θ1, θ2) and θ = (θ2, θ1) are two possible vectors that can be interchanged without affecting P(x | θ)
• Although identifiability can be a problem, we will always assume that the densities we are dealing with are identifiable!
ML Estimates

Suppose that we have a set of n unlabeled samples drawn independently from the mixture density

    p(x \mid \theta) = \sum_{j=1}^{c} p(x \mid \omega_j, \theta_j)\, P(\omega_j)

(θ is fixed but unknown!)

The log-likelihood of the observed samples is

    l = \sum_{k=1}^{n} \ln p(x_k \mid \theta)

and its gradient with respect to θi (assuming the elements of θi and θj are functionally independent for i ≠ j) is

    \nabla_{\theta_i} l = \sum_{k=1}^{n} P(\omega_i \mid x_k, \theta)\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \theta_i)

with

    P(\omega_i \mid x_k, \theta) = \frac{p(x_k \mid \omega_i, \theta_i)\, P(\omega_i)}{p(x_k \mid \theta)}
Since the gradient must vanish at the value of θi that maximizes l, the ML estimate θ̂i must satisfy the conditions:

    \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \hat{\theta}_i) = 0, \qquad i = 1, \ldots, c
By including the prior probabilities as unknown variables, we finally obtain:

    \hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})

and

    \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \hat{\theta}_i) = 0

where

    \hat{P}(\omega_i \mid x_k, \hat{\theta}) = \frac{p(x_k \mid \omega_i, \hat{\theta}_i)\, \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \hat{\theta}_j)\, \hat{P}(\omega_j)}
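As a concrete illustration, this posterior can be computed as below; a minimal sketch assuming one-dimensional Gaussian components (the function name and parameters are our own, not from the slides):

    import numpy as np

    def posteriors(x, means, sigmas, priors):
        # Returns P_hat(omega_i | x_k, theta_hat) as an array of shape (c, n),
        # assuming 1-D Gaussian component densities.
        x = np.asarray(x, dtype=float)
        means = np.asarray(means, dtype=float)[:, None]
        sigmas = np.asarray(sigmas, dtype=float)[:, None]
        priors = np.asarray(priors, dtype=float)[:, None]
        # component densities p(x_k | omega_i, theta_i)
        comp = np.exp(-0.5 * ((x - means) / sigmas) ** 2) / (np.sqrt(2 * np.pi) * sigmas)
        joint = comp * priors               # numerator of the posterior
        return joint / joint.sum(axis=0)    # normalize by the mixture density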
Applications to Normal Mixtures
Case 1 = simplest case

    Case   μi   Σi   P(ωi)   c
     1     ?    x    x       x
     2     ?    ?    ?       x
     3     ?    ?    ?       ?

    (? = unknown, x = known)
• Case 1: Unknown mean vectors
Here the only unknowns are the means; for normal components ∇μi ln p(xk | ωi, μi) = Σi⁻¹(xk − μi), so the general ML conditions reduce to an equation in the μi alone
• Case 1: Unknown mean vectors (continued)

ML estimate:

    \hat{\mu}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\mu})\, x_k}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\mu})} \qquad (1)

P̂(ωi | xk, μ̂) is the fraction of those samples having value xk that come from the i-th class, and μ̂i is the average of the samples coming from the i-th class.
• Unfortunately, (1) does not give μ̂i explicitly
• However, if we have some way of obtaining good initial estimates for the unknown means, (1) can be seen as an iterative process for improving the estimates (see the sketch below)
• This is a gradient ascent for maximizing the log-likelihood function
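Viewed this way, (1) can be coded directly as a fixed-point iteration; a minimal sketch, assuming one-dimensional components with unit variance and known priors (the function name and iteration count are our own choices):

    import numpy as np

    def update_means(x, means, priors, n_iter=50):
        # Fixed-point iteration of eq. (1): unknown means only.
        x = np.asarray(x, dtype=float)
        means = np.asarray(means, dtype=float).copy()
        priors = np.asarray(priors, dtype=float)
        for _ in range(n_iter):
            # responsibilities P(omega_i | x_k, mu_hat), shape (c, n)
            comp = np.exp(-0.5 * (x[None, :] - means[:, None]) ** 2) / np.sqrt(2 * np.pi)
            post = comp * priors[:, None]
            post = post / post.sum(axis=0)
            # eq. (1): posterior-weighted average of the samples
            means = (post * x[None, :]).sum(axis=1) / post.sum(axis=1)
        return means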
• Example: Consider the simple two-component one-dimensional normal mixture

    p(x \mid \mu_1, \mu_2) = \frac{1}{3\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x-\mu_1)^2\right] + \frac{2}{3\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x-\mu_2)^2\right]

(2 clusters!)

Let's set μ1 = -2, μ2 = 2 and draw 25 samples sequentially from this mixture.
The log-likelihood function is:

    l(\mu_1, \mu_2) = \sum_{k=1}^{25} \ln p(x_k \mid \mu_1, \mu_2)

• The maximum value of l occurs at estimates μ̂1 and μ̂2 that are not far from the true values μ1 = -2 and μ2 = +2
• There is another peak, with the roles of μ1 and μ2 roughly interchanged, which has almost the same height, as can be seen from a contour plot of the log-likelihood surface
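One way to see both peaks is to evaluate l on a grid of (μ1, μ2) values; a minimal sketch (the 25 samples are drawn here with a fixed seed, not taken from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    # 25 samples from the mixture with mu1 = -2, mu2 = 2, weights 1/3 and 2/3
    first = rng.random(25) < 1.0 / 3.0
    x = np.where(first, rng.normal(-2.0, 1.0, 25), rng.normal(2.0, 1.0, 25))

    def loglik(m1, m2):
        p = (np.exp(-0.5 * (x - m1) ** 2) / 3.0
             + 2.0 * np.exp(-0.5 * (x - m2) ** 2) / 3.0) / np.sqrt(2.0 * np.pi)
        return np.log(p).sum()

    grid = np.linspace(-4.0, 4.0, 81)
    L = np.array([[loglik(m1, m2) for m2 in grid] for m1 in grid])
    i, j = np.unravel_index(L.argmax(), L.shape)
    print("global peak near mu1 =", grid[i], ", mu2 =", grid[j])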
• This mixture of normal densities is identifiable
• When the mixture density is not identifiable, the ML solution is not unique
• Case 2: All parameters unknown
• No constraints are placed on the covariance matrix

Let p(x | μ, σ²) be the two-component normal mixture:

    p(x \mid \mu, \sigma^2) = \frac{1}{2\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right] + \frac{1}{2\sqrt{2\pi}} \exp\left[-\frac{1}{2}x^2\right]
Suppose μ = x1; therefore:

    p(x_1 \mid \mu, \sigma^2) = \frac{1}{2\sqrt{2\pi}\,\sigma} + \frac{1}{2\sqrt{2\pi}} \exp\left[-\frac{1}{2}x_1^2\right]

For the rest of the samples:

    p(x_k \mid \mu, \sigma^2) \ge \frac{1}{2\sqrt{2\pi}} \exp\left[-\frac{1}{2}x_k^2\right]

Finally,

    p(x_1, \ldots, x_n \mid \mu, \sigma^2) \ge \left\{\frac{1}{\sigma} + \exp\left[-\frac{1}{2}x_1^2\right]\right\} \frac{1}{(2\sqrt{2\pi})^n} \exp\left[-\frac{1}{2}\sum_{k=2}^{n} x_k^2\right]

which grows without bound as σ → 0. The likelihood can therefore be made arbitrarily large, and the maximum-likelihood solution becomes singular.
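The singularity is easy to reproduce numerically; a minimal sketch with made-up samples, placing μ on x1 and shrinking σ:

    import numpy as np

    x = np.array([0.5, -1.2, 0.3, 2.0])   # made-up samples
    mu = x[0]                              # place the component mean on x1

    def loglik(sigma):
        c1 = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (2.0 * np.sqrt(2.0 * np.pi) * sigma)
        c2 = np.exp(-0.5 * x ** 2) / (2.0 * np.sqrt(2.0 * np.pi))
        return np.log(c1 + c2).sum()

    for s in [1.0, 0.1, 0.01, 0.001]:
        print(s, loglik(s))    # the log-likelihood grows without bound as sigma -> 0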
• Adding an assumption: consider the largest of the finite local maxima of the likelihood function and use ML estimation. We obtain the following iterative scheme:

    \hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})

    \hat{\mu}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, x_k}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})}

    \hat{\Sigma}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, (x_k - \hat{\mu}_i)(x_k - \hat{\mu}_i)^t}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})}
where:

    \hat{P}(\omega_i \mid x_k, \hat{\theta}) = \frac{|\hat{\Sigma}_i|^{-1/2} \exp\left[-\frac{1}{2}(x_k-\hat{\mu}_i)^t \hat{\Sigma}_i^{-1} (x_k-\hat{\mu}_i)\right] \hat{P}(\omega_i)}{\sum_{j=1}^{c} |\hat{\Sigma}_j|^{-1/2} \exp\left[-\frac{1}{2}(x_k-\hat{\mu}_j)^t \hat{\Sigma}_j^{-1} (x_k-\hat{\mu}_j)\right] \hat{P}(\omega_j)}
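Together, these updates form an EM-style iteration; a minimal sketch for the one-dimensional (scalar-variance) case, where the initialization, iteration count, and variance floor are our own assumptions:

    import numpy as np

    def em_mixture_1d(x, c, n_iter=100, seed=0):
        # Iterative ML scheme for a 1-D normal mixture, all parameters unknown.
        rng = np.random.default_rng(seed)
        x = np.asarray(x, dtype=float)
        n = x.size
        priors = np.full(c, 1.0 / c)
        means = rng.choice(x, size=c, replace=False)   # start from random samples
        varis = np.full(c, x.var())
        for _ in range(n_iter):
            # posterior P_hat(omega_i | x_k, theta_hat), shape (c, n)
            comp = np.exp(-0.5 * (x - means[:, None]) ** 2 / varis[:, None]) \
                   / np.sqrt(2.0 * np.pi * varis[:, None])
            post = comp * priors[:, None]
            post = post / post.sum(axis=0)
            w = post.sum(axis=1)                        # effective class counts
            priors = w / n                              # P_hat(omega_i)
            means = (post * x).sum(axis=1) / w          # mu_hat_i
            varis = (post * (x - means[:, None]) ** 2).sum(axis=1) / w  # Sigma_hat_i
            varis = np.maximum(varis, 1e-6)  # guard against the singular solutions above
        return priors, means, varis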
Clustering
What is a Cluster? A cluster is a collection of data points with similar properties.
Partitional clustering
• C-means algorithm (hard)
• ISODATA algorithm
• Fuzzy C-means clustering

Hierarchical clustering
• Single-linkage algorithm (minimum method)
• Complete-linkage algorithm (maximum method)
• Average-linkage algorithm
• Minimum-variance method (Ward's method)
C-Means Clustering

• Goal: find the c mean vectors μ1, μ2, …, μc
• Replace the squared Mahalanobis distance (xk − μ̂i)ᵗ Σ̂i⁻¹ (xk − μ̂i) used in the previous discussion by the squared Euclidean distance ||xk − μ̂i||²
• Find the mean μ̂m nearest to xk and approximate P̂(ωi | xk, θ̂) as:

    \hat{P}(\omega_i \mid x_k, \hat{\theta}) \cong \begin{cases} 1 & \text{if } i = m \\ 0 & \text{otherwise} \end{cases}
• Use the iterative scheme to find μ̂1, μ̂2, …, μ̂c
• If n is the known number of patterns and c the desired number of clusters, the C-means algorithm is:

    Begin
        initialize n, c, μ1, μ2, …, μc (randomly selected)
        do
            classify n samples according to nearest μi
            recompute μi
        until no change in μi
        return μ1, μ2, …, μc
    End
C-means algorithm:
1. Begin with C cluster centers
2. For each sample, find the cluster center nearest to it. Put the sample in the cluster represented by the just-found cluster center.
3. If no samples changed clusters, stop.
4. Recompute the cluster centers of altered clusters and go back to step 2 (see the sketch below).

Properties:
• The number of clusters C must be given in advance.
• The goal is to minimize the squared error, but the algorithm could end up in a local minimum.

Demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
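A direct implementation of these four steps; a minimal sketch, with random-sample initialization as an assumption:

    import numpy as np

    def c_means(x, c, seed=0):
        # Hard C-means (k-means); x has shape (n, d).
        rng = np.random.default_rng(seed)
        x = np.asarray(x, dtype=float)
        centers = x[rng.choice(len(x), size=c, replace=False)].copy()  # step 1
        labels = np.full(len(x), -1)
        while True:
            # step 2: assign each sample to the nearest cluster center
            d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            new_labels = d2.argmin(axis=1)
            if np.array_equal(new_labels, labels):     # step 3: stable -> stop
                return centers, labels
            labels = new_labels
            for i in range(c):                          # step 4: recompute centers
                if (labels == i).any():
                    centers[i] = x[labels == i].mean(axis=0)

Because each pass can only lower the total squared error over a finite set of partitions, the loop terminates, though possibly at a local minimum, as noted above.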
Clustering: ISODATA
Similar to C-means with some enhancements:
• Clusters with too few elements are discarded.
• Clusters are merged if the number of clusters grows too large or if clusters are too close together.
• A cluster is split if the number of clusters is too few or if the cluster contains very dissimilar samples.

Properties:
• The number of clusters C is not given exactly in advance.
• The algorithm may require extensive computations.

A simplified sketch of one such pass follows.
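The exact thresholds and ordering of these heuristics vary between ISODATA implementations; the following is a deliberately simplified sketch of a single discard/merge/split pass, where all threshold values are illustrative assumptions, not the original ISODATA specification:

    import numpy as np

    def isodata_step(x, centers, min_size=5, merge_dist=1.0, split_std=2.0):
        # One simplified ISODATA pass: discard, merge, then split clusters.
        x = np.asarray(x, dtype=float)
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # discard clusters with too few elements
        keep = [i for i in range(len(centers)) if (labels == i).sum() >= min_size]
        centers = centers[keep]
        # greedily merge pairs of centers that are too close together
        merged, used = [], set()
        for i in range(len(centers)):
            if i in used:
                continue
            for j in range(i + 1, len(centers)):
                if j not in used and np.linalg.norm(centers[i] - centers[j]) < merge_dist:
                    merged.append((centers[i] + centers[j]) / 2.0)
                    used.update({i, j})
                    break
            else:
                merged.append(centers[i])
                used.add(i)
        centers = np.array(merged)
        # split clusters containing very dissimilar samples
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        result = []
        for i, ctr in enumerate(centers):
            pts = x[labels == i]
            spread = pts.std(axis=0) if len(pts) else 0.0
            if len(pts) and np.max(spread) > split_std:
                result.extend([ctr + spread / 2.0, ctr - spread / 2.0])
            else:
                result.append(ctr)
        return np.array(result)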
Fuzzy C-Means Clustering
Solution Spaces for Clustering
Constrained c-Partition Matrices
Partition of Data
Defuzzification of Fuzzy Partitions
Conversion via max membership and α-cuts
Fuzzy c-Means (FCM) Clustering

Let's now assume that a sample can belong to all the clusters. Define uij where

    \sum_{i=1}^{c} u_{ij} = 1, \qquad \forall j = 1, \ldots, n

The cost function (or objective function) for FCM can then be written as

    J(U, c_1, \ldots, c_c) = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^m d_{ij}^2

where uij is between 0 and 1; ci is the cluster center of fuzzy group i; dij = ||ci − xj|| is the Euclidean distance between the i-th cluster center and the j-th data point; and m ∈ [1, ∞) is a weighting exponent.
Fuzzy c-Means (FCM) Clustering (continued)

The updated mean vector is

    c_i = \frac{\sum_{j=1}^{n} u_{ij}^m\, x_j}{\sum_{j=1}^{n} u_{ij}^m}

where the updated membership is

    u_{ij} = \frac{1}{\sum_{k=1}^{c} \left(\frac{d_{ij}}{d_{kj}}\right)^{2/(m-1)}}
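Alternating these two updates until the memberships stabilize gives the FCM algorithm; a minimal sketch, assuming m > 1 (the random initialization, fixed iteration count, and small distance floor are our own choices):

    import numpy as np

    def fuzzy_c_means(x, c, m=2.0, n_iter=100, seed=0):
        # FCM; x has shape (n, d), m > 1. Returns centers (c, d) and U (c, n).
        rng = np.random.default_rng(seed)
        x = np.asarray(x, dtype=float)
        u = rng.random((c, len(x)))
        u = u / u.sum(axis=0)                    # enforce sum_i u_ij = 1
        for _ in range(n_iter):
            um = u ** m
            centers = um @ x / um.sum(axis=1, keepdims=True)   # c_i update
            d = np.linalg.norm(x[None, :, :] - centers[:, None, :], axis=2)
            d = np.maximum(d, 1e-12)             # avoid division by zero
            ratio = (d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0))
            u = 1.0 / ratio.sum(axis=1)          # u_ij update
        return centers, u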
C-means Algorithm (FCM/HCM)
Hierarchical Clustering
Agglomerative clustering (bottom up)
1. Begin with n clusters, each containing one sample
2. Merge the most similar two clusters into one.
3. Repeat the previous step until done
Divisive clustering (top down)
[Figure: an agglomerative merging sequence on samples 1–7: {1,7} and {5,6} are formed first, then {3,5,6}, then {1,3,5,6,7} and {2,4}, and finally {1,2,3,4,5,6,7}]
The most natural representation of hierarchical clustering is a corresponding tree, called a dendrogram, which shows how the samples are grouped
Demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html
Single-linkage algorithm (minimum method):

    D_{min}(C_i, C_j) = \min_{a \in C_i,\, b \in C_j} d(a, b)

Complete-linkage algorithm (maximum method):

    D_{max}(C_i, C_j) = \max_{a \in C_i,\, b \in C_j} d(a, b)

Average-linkage algorithm (average method):

    D_{ave}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{a \in C_i,\, b \in C_j} d(a, b)
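These three cluster distances translate directly into code; a minimal sketch where each cluster is a sequence of points and d is the Euclidean distance:

    import numpy as np

    def d_min(ci, cj):
        # single linkage: minimum pairwise distance between clusters
        return min(np.linalg.norm(a - b) for a in ci for b in cj)

    def d_max(ci, cj):
        # complete linkage: maximum pairwise distance
        return max(np.linalg.norm(a - b) for a in ci for b in cj)

    def d_ave(ci, cj):
        # average linkage: mean over all pairs
        return sum(np.linalg.norm(a - b) for a in ci for b in cj) / (len(ci) * len(cj))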
Ward's method (minimum-variance method):

    D_{Ward}(C_i, C_j) = \sum_{j=1}^{d} \sigma_j^2 = \sum_{j=1}^{d} \sum_{i=1}^{m} (x_{ij} - \mu_j)^2

d: the number of features
m: the number of samples in Ci and Cj
μj: the mean of the j-th feature over those m samples
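With any of these cluster distances, agglomerative clustering is a simple loop that repeatedly merges the closest pair of clusters; a minimal sketch (single linkage by default; the sample points are illustrative):

    import numpy as np

    def agglomerative(points, target=1, dist=None):
        # Merge the closest pair of clusters until `target` clusters remain;
        # records the merge sequence (cf. the merging example earlier).
        if dist is None:  # single linkage by default
            dist = lambda ci, cj: min(np.linalg.norm(points[a] - points[b])
                                      for a in ci for b in cj)
        clusters = [{i} for i in range(len(points))]   # one sample per cluster
        merges = []
        while len(clusters) > target:
            # find the most similar (closest) pair of clusters
            i, j = min(((a, b) for a in range(len(clusters))
                        for b in range(a + 1, len(clusters))),
                       key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
            clusters[i] |= clusters[j]
            merges.append(sorted(clusters[i]))
            del clusters[j]
        return clusters, merges

    points = np.array([[0.0], [0.1], [1.0], [1.1], [5.0]])
    print(agglomerative(points)[1])   # the sequence of merged clusters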
Single-linkage algorithm [example figure]
Complete-linkage algorithm [example figure]