8. Unsupervised Analysis


Pattern Recognition: Unsupervised Analysis

Previously, all our training samples were labeled: these samples were said to be “supervised”

We now investigate a number of “unsupervised” procedures which use unlabeled samples

Collecting and labeling a large set of sample patterns can be costly

We can train with a large amount of (less expensive) unlabeled data and only then use supervision to label the groupings found. This is appropriate for large "data mining" applications where the contents of a large database are not known beforehand


This is also appropriate in many applications where the characteristics of the patterns can change slowly with time

Improved performance can be achieved if classifiers running in an unsupervised mode are used

We can use unsupervised methods to identify features that will then be useful for classification

We gain some insight into the nature (or structure) of the data

Mixture Densities & Identifiability

• We shall begin with the assumption that the functional forms for the underlying probability densities are known and that the only thing that must be learned is the value of an unknown parameter vector

• We make the following assumptions:

- The samples come from a known number c of classes

- The prior probabilities P(ω_j) for each class are known (j = 1, …, c)

- The forms of P(x | ω_j, θ_j) (j = 1, …, c) are known

- The values of the c parameter vectors θ_1, θ_2, …, θ_c are unknown


• The category labels are unknown

The samples are then drawn from the density

$$P(x \mid \theta) = \sum_{j=1}^{c} \underbrace{P(x \mid \omega_j, \theta_j)}_{\text{component densities}} \, \underbrace{P(\omega_j)}_{\text{mixing parameters}}, \qquad \theta = (\theta_1, \theta_2, \ldots, \theta_c)^t$$

This density function is called a mixture density. Our goal will be to use samples drawn from this mixture density to estimate the unknown parameter vector θ. Once θ is known, we can decompose the mixture into its components and use a MAP classifier on the derived densities.
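To make this concrete, here is a minimal Python sketch of evaluating a mixture density and applying a MAP classifier once θ is known. The two-component one-dimensional Gaussian mixture, its unit variances, and all parameter values are hypothetical illustrations, not taken from the text.

```python
import math

# Hypothetical two-component 1-D Gaussian mixture with unit variances,
# chosen only to illustrate P(x|theta) = sum_j P(x|w_j, theta_j) P(w_j).
PRIORS = [0.3, 0.7]   # P(w_1), P(w_2): mixing parameters, assumed known
MEANS = [-2.0, 2.0]   # theta_1, theta_2: the parameters we would estimate

def component_density(x, mean):
    """P(x | w_j, theta_j): normal density with unit variance."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi)

def mixture_density(x):
    """P(x | theta): prior-weighted sum of the component densities."""
    return sum(p * component_density(x, m) for p, m in zip(PRIORS, MEANS))

def map_class(x):
    """Once theta is known, pick the class maximizing P(w_j) P(x|w_j, theta_j)."""
    scores = [p * component_density(x, m) for p, m in zip(PRIORS, MEANS)]
    return scores.index(max(scores))

print(mixture_density(0.0))
print(map_class(-1.5), map_class(3.0))
```

The MAP step is exactly the "decompose the mixture into its components" idea: each prior-weighted component density acts as a discriminant function.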


• Definition

- A density p(x | θ) is said to be identifiable if θ ≠ θ′ implies that there exists an x such that:

p(x | θ) ≠ p(x | θ′)


As a simple example, consider the case where x is binary and P(x | θ) is the mixture:

$$P(x \mid \theta) = \frac{1}{2}\theta_1^{x}(1-\theta_1)^{1-x} + \frac{1}{2}\theta_2^{x}(1-\theta_2)^{1-x} = \begin{cases} \dfrac{1}{2}(\theta_1 + \theta_2) & \text{if } x = 1 \\[4pt] 1 - \dfrac{1}{2}(\theta_1 + \theta_2) & \text{if } x = 0 \end{cases}$$

Assume that P(x = 1 | θ) = 0.6 ⇒ P(x = 0 | θ) = 0.4. By substituting these probability values, we obtain θ_1 + θ_2 = 1.2: the data determine the sum of the parameters, but not θ_1 and θ_2 individually.
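The unidentifiability can be checked numerically: any two parameter vectors with the same sum θ_1 + θ_2 induce exactly the same distribution on x. A small sketch (the particular θ values are arbitrary illustrations):

```python
def binary_mixture(x, t1, t2):
    """P(x|theta) = 1/2 t1^x (1-t1)^(1-x) + 1/2 t2^x (1-t2)^(1-x), x in {0,1}."""
    return 0.5 * t1**x * (1 - t1)**(1 - x) + 0.5 * t2**x * (1 - t2)**(1 - x)

# (0.5, 0.7) and (0.9, 0.3) both have theta_1 + theta_2 = 1.2,
# so they produce identical distributions: the mixture is unidentifiable.
for x in (0, 1):
    print(x, binary_mixture(x, 0.5, 0.7), binary_mixture(x, 0.9, 0.3))
```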


• Thus, we have a case in which the mixture distribution is completely unidentifiable, and therefore unsupervised learning is impossible

• For discrete distributions, if there are too many components in the mixture, there may be more unknowns than independent equations, and identifiability can become a serious problem!


• While it can be shown that mixtures of normal densities are usually identifiable, the parameters in the simple mixture density

$$P(x \mid \theta) = \frac{P(\omega_1)}{\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}(x - \theta_1)^2\right] + \frac{P(\omega_2)}{\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}(x - \theta_2)^2\right]$$

cannot be uniquely identified if P(ω_1) = P(ω_2) (we cannot recover a unique θ even from an infinite amount of data!)

• θ = (θ_1, θ_2) and θ = (θ_2, θ_1) are two possible vectors that can be interchanged without affecting P(x | θ)

• Identifiability can be a problem; in what follows, we always assume that the densities we are dealing with are identifiable!


ML Estimates

Suppose that we have a set of n unlabeled samples drawn independently from the mixture density

$$p(x \mid \theta) = \sum_{j=1}^{c} p(x \mid \omega_j, \theta_j)\, P(\omega_j)$$

(θ is fixed but unknown!). The log-likelihood of the observed samples is

$$l = \sum_{k=1}^{n} \ln p(x_k \mid \theta)$$

The gradient of the log-likelihood with respect to θ_i is:

$$\nabla_{\theta_i} l = \sum_{k=1}^{n} P(\omega_i \mid x_k, \theta)\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \theta_i)$$

with

$$P(\omega_i \mid x_k, \theta) = \frac{p(x_k \mid \omega_i, \theta_i)\, P(\omega_i)}{p(x_k \mid \theta)}$$


Since the gradient must vanish at the value of θ_i that maximizes the log-likelihood l, the ML estimate must satisfy the conditions:

$$\sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\theta)\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \hat\theta_i) = 0, \qquad i = 1, \ldots, c$$


By including the prior probabilities as unknown variables, we finally obtain:

$$\hat P(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\theta)$$

and

$$\sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\theta)\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \hat\theta_i) = 0$$

where

$$\hat P(\omega_i \mid x_k, \hat\theta) = \frac{p(x_k \mid \omega_i, \hat\theta_i)\, \hat P(\omega_i)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \hat\theta_j)\, \hat P(\omega_j)}$$


Applications to Normal Mixtures

Case 1 = simplest case

Case | μ_i | Σ_i | P(ω_i) | c
-----|-----|-----|--------|---
  1  |  ?  |  ×  |   ×    | ×
  2  |  ?  |  ?  |   ?    | ×
  3  |  ?  |  ?  |   ?    | ?

(? = unknown, × = known)


• Case 1: Unknown mean vectors

Here p(x | ω_i, μ_i) is normal with unknown mean μ_i and known covariance Σ_i, so

$$\ln p(x \mid \omega_i, \mu_i) = -\ln\!\left[(2\pi)^{d/2}\,|\Sigma_i|^{1/2}\right] - \frac{1}{2}(x - \mu_i)^t \Sigma_i^{-1}(x - \mu_i)$$

and

$$\nabla_{\mu_i} \ln p(x \mid \omega_i, \mu_i) = \Sigma_i^{-1}(x - \mu_i)$$


• Case 1: Unknown mean vectors (continued)

ML estimate:

$$\hat\mu_i = \frac{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat\mu)\, x_k}{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat\mu)} \qquad (1)$$

P(ω_i | x_k, μ̂) is the fraction of those samples having value x_k that come from the ith class, and μ̂_i is the average of the samples coming from the ith class.


• Unfortunately, (1) does not give μ̂_i explicitly

• However, if we have some way of obtaining good initial estimates μ̂_i(0) for the unknown means, (1) can be seen as an iterative process for improving the estimates:

$$\hat\mu_i(j+1) = \frac{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat\mu(j))\, x_k}{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat\mu(j))}$$

• This is a gradient ascent for maximizing the log-likelihood function
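The iterative process can be sketched in a few lines of Python. Everything below (two 1-D components, unit variances, equal known priors, sample size, starting guesses) is an illustrative assumption, not part of the text.

```python
import math
import random

random.seed(0)

# Hypothetical setup for Case 1: two 1-D components with known unit
# variances and known equal priors; only the means are unknown.
PRIORS = [0.5, 0.5]
TRUE_MEANS = [-2.0, 2.0]
samples = [random.gauss(random.choice(TRUE_MEANS), 1.0) for _ in range(200)]

def posterior(x, means):
    """P(w_i | x_k, mu_hat) for each component, via Bayes rule on the mixture."""
    w = [p * math.exp(-0.5 * (x - m) ** 2) for p, m in zip(PRIORS, means)]
    s = sum(w)
    return [wi / s for wi in w]

# Iterate equation (1): each new mean is the posterior-weighted sample average.
means = [-1.0, 1.0]  # initial estimates; good starting values matter
for _ in range(50):
    post = [posterior(x, means) for x in samples]
    means = [sum(p[i] * x for p, x in zip(post, samples)) /
             sum(p[i] for p in post)
             for i in range(2)]

print(means)  # estimates should approach the true means -2 and +2
```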


• Example: Consider the simple two-component one-dimensional normal mixture

$$p(x \mid \mu_1, \mu_2) = \frac{1}{3\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}(x - \mu_1)^2\right] + \frac{2}{3\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}(x - \mu_2)^2\right]$$

(2 clusters!)

Let's set μ_1 = −2, μ_2 = 2 and draw 25 samples sequentially from this mixture.


The log-likelihood function is:

$$l(\mu_1, \mu_2) = \sum_{k=1}^{25} \ln p(x_k \mid \mu_1, \mu_2)$$

• The maximum value of l occurs at estimates of (μ_1, μ_2) that are not far from the true values μ_1 = −2 and μ_2 = +2

• There is another peak that has almost the same height, as can be seen from the following figure:

[Figure: contour plot of the log-likelihood l(μ_1, μ_2) showing the two nearly equal peaks]

• This mixture of normal densities is identifiable

• When the mixture density is not identifiable, the ML solution is not unique

• Case 2: All parameters unknown

• If no constraints are placed on the covariance matrix, the ML principle yields singular solutions. As an example, let p(x | μ, σ²) be the two-component normal mixture:

$$p(x \mid \mu, \sigma^2) = \frac{1}{2\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right] + \frac{1}{2\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}x^2\right]$$


Suppose μ = x_1; therefore:

$$p(x_1 \mid \mu, \sigma^2) = \frac{1}{2\sqrt{2\pi}}\left[\frac{1}{\sigma} + \exp\!\left(-\frac{1}{2}x_1^2\right)\right]$$

For the rest of the samples:

$$p(x_k \mid \mu, \sigma^2) \ge \frac{1}{2\sqrt{2\pi}} \exp\!\left(-\frac{1}{2}x_k^2\right)$$

Finally,

$$p(x_1, \ldots, x_n \mid \mu, \sigma^2) \ge \left[\frac{1}{\sigma} + \exp\!\left(-\frac{1}{2}x_1^2\right)\right] \frac{1}{(2\sqrt{2\pi})^n} \exp\!\left[-\frac{1}{2}\sum_{k=2}^{n} x_k^2\right] \xrightarrow[\ \sigma \to 0\ ]{} \infty$$

The likelihood can therefore be made arbitrarily large, and the maximum-likelihood solution becomes singular.


• Adding an assumption: consider the largest of the finite local maxima of the likelihood function and use ML estimation. We obtain the following iterative scheme:

$$\hat P(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\theta)$$

$$\hat\mu_i = \frac{\sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\theta)\, x_k}{\sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\theta)}$$

$$\hat\Sigma_i = \frac{\sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\theta)\,(x_k - \hat\mu_i)(x_k - \hat\mu_i)^t}{\sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\theta)}$$


Where:

$$\hat P(\omega_i \mid x_k, \hat\theta) = \frac{|\hat\Sigma_i|^{-1/2} \exp\!\left[-\frac{1}{2}(x_k - \hat\mu_i)^t \hat\Sigma_i^{-1}(x_k - \hat\mu_i)\right] \hat P(\omega_i)}{\sum_{j=1}^{c} |\hat\Sigma_j|^{-1/2} \exp\!\left[-\frac{1}{2}(x_k - \hat\mu_j)^t \hat\Sigma_j^{-1}(x_k - \hat\mu_j)\right] \hat P(\omega_j)}$$
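The iterative scheme above can be exercised on synthetic data. Below is a minimal 1-D sketch (all data, seeds, and starting values are made up for illustration); in 1-D the covariance update reduces to a posterior-weighted variance.

```python
import math
import random

random.seed(1)

# Illustrative 1-D data from two hypothetical components.
data = ([random.gauss(-3.0, 1.0) for _ in range(150)] +
        [random.gauss(3.0, 0.5) for _ in range(150)])

def normal(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

priors, mus, vars_ = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
for _ in range(60):
    # P_hat(w_i | x_k, theta_hat): normalized prior-weighted component densities
    post = []
    for x in data:
        w = [p * normal(x, m, v) for p, m, v in zip(priors, mus, vars_)]
        s = sum(w)
        post.append([wi / s for wi in w])
    for i in range(2):
        n_i = sum(p[i] for p in post)
        priors[i] = n_i / len(data)                        # P_hat(w_i)
        mus[i] = sum(p[i] * x for p, x in zip(post, data)) / n_i
        vars_[i] = sum(p[i] * (x - mus[i]) ** 2
                       for p, x in zip(post, data)) / n_i  # Sigma_hat_i (1-D)

print(priors, mus, vars_)
```

Each pass recomputes the posteriors with the current parameters and then re-estimates priors, means, and variances from them, exactly mirroring the three update equations.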


Clustering

What is a Cluster? A cluster is a collection of data points with similar properties.

Partitional clustering
• C-means algorithm (hard)
• ISODATA algorithm
• Fuzzy C-means clustering

Hierarchical clustering
• Single-linkage algorithm (minimum method)
• Complete-linkage algorithm (maximum method)
• Average-linkage algorithm
• Minimum-variance method (Ward's method)


C-Means Clustering

• Goal: find the c mean vectors μ_1, μ_2, …, μ_c

• Replace the squared Mahalanobis distance used in the previous discussion by the squared Euclidean distance ‖x_k − μ̂_i‖²

• Find the mean μ̂_m nearest to x_k and approximate P̂(ω_i | x_k, θ̂) as:

$$\hat P(\omega_i \mid x_k, \hat\theta) \approx \begin{cases} 1 & \text{if } i = m \\ 0 & \text{otherwise} \end{cases}$$


• Use the iterative scheme to find μ̂_1, μ̂_2, …, μ̂_c

• If n is the known number of patterns and c the desired number of clusters, the C-means algorithm is:

Begin
    initialize n, c, μ_1, μ_2, …, μ_c (randomly selected)
    do
        classify the n samples according to the nearest μ_i
        recompute μ_i
    until no change in μ_i
    return μ_1, μ_2, …, μ_c
End


C-means algorithm:

1. Begin with C cluster centers.
2. For each sample, find the cluster center nearest to it, and put the sample in the cluster represented by that center.
3. If no samples changed clusters, stop.
4. Recompute the cluster centers of the altered clusters and go back to step 2.

Properties:
• The number of clusters C must be given in advance.
• The goal is to minimize the squared error, but the algorithm can end up in a local minimum.

Demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
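The steps above can be sketched as runnable code. This is a 1-D illustration; for simplicity it uses a deterministic evenly spread initialization rather than the random selection in the pseudocode, and the data and seed are arbitrary.

```python
import random

random.seed(2)

def c_means(samples, c, iters=100):
    """Hard C-means sketch: assign each sample to the nearest center, then
    recompute each center as its cluster average; stop when no sample
    changes clusters. 1-D data, squared Euclidean distance."""
    lo, hi = min(samples), max(samples)
    centers = [lo + (hi - lo) * (i + 0.5) / c for i in range(c)]  # spread inits
    assign = None
    for _ in range(iters):
        new_assign = [min(range(c), key=lambda i: (x - centers[i]) ** 2)
                      for x in samples]
        if new_assign == assign:   # step 3: no sample changed clusters
            break
        assign = new_assign
        for i in range(c):         # step 4: recompute centers of altered clusters
            members = [x for x, a in zip(samples, assign) if a == i]
            if members:
                centers[i] = sum(members) / len(members)
    return centers, assign

data = ([random.gauss(0.0, 0.5) for _ in range(50)] +
        [random.gauss(5.0, 0.5) for _ in range(50)])
centers, labels = c_means(data, 2)
print(sorted(centers))  # roughly [0, 5] for this well-separated data
```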


Clustering: ISODATA

Similar to C-means, with some enhancements:
• Clusters with too few elements are discarded.
• Clusters are merged if the number of clusters grows too large or if clusters are too close together.
• A cluster is split if the number of clusters is too small or if the cluster contains very dissimilar samples.

Properties:
• The number of clusters C is not given exactly in advance.
• The algorithm may require extensive computation.


Fuzzy C-Means Clustering


Solution Spaces for Clustering

Constrained c-Partition Matrices


Partition of Data


Defuzzification of Fuzzy Partitions

Conversion via max membership and α-cuts


Fuzzy c-Means (FCM) Clustering

Let's now assume that a sample can belong to all the clusters. Define u_ij, the membership of the jth data point in the ith cluster, where

$$\sum_{i=1}^{c} u_{ij} = 1, \qquad \forall j = 1, \ldots, n$$

The cost function (or objective function) for FCM can then be written as

$$J(U, c_1, \ldots, c_c) = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m}\, d_{ij}^{2}$$

where u_ij is between 0 and 1; c_i is the cluster center of fuzzy group i; d_ij = ‖c_i − x_j‖ is the Euclidean distance between the ith cluster center and the jth data point; and m ∈ [1, ∞) is a weighting exponent.


The updated cluster center (mean vector) is

$$c_i = \frac{\sum_{j=1}^{n} u_{ij}^{m}\, x_j}{\sum_{j=1}^{n} u_{ij}^{m}}$$

where

$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left(d_{ij} / d_{kj}\right)^{2/(m-1)}}$$
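The two update formulas alternate until the centers stabilize. A minimal 1-D sketch follows; the data, the evenly spread initialization, and the small epsilon guarding division by zero are all illustrative choices, not part of the text.

```python
import random

random.seed(3)

def fcm(samples, c, m=2.0, iters=100):
    """Fuzzy c-means sketch: alternate the membership update
    u_ij = 1 / sum_k (d_ij/d_kj)^(2/(m-1)) and the center update
    c_i = sum_j u_ij^m x_j / sum_j u_ij^m. 1-D Euclidean distance."""
    lo, hi = min(samples), max(samples)
    centers = [lo + (hi - lo) * (i + 0.5) / c for i in range(c)]
    u = []
    for _ in range(iters):
        u = []
        for x in samples:
            d = [abs(x - ci) + 1e-12 for ci in centers]  # avoid division by zero
            u.append([1.0 / sum((d[i] / d[k]) ** (2 / (m - 1)) for k in range(c))
                      for i in range(c)])
        centers = [sum(uj[i] ** m * x for uj, x in zip(u, samples)) /
                   sum(uj[i] ** m for uj in u)
                   for i in range(c)]
    return centers, u

data = ([random.gauss(0.0, 0.3) for _ in range(40)] +
        [random.gauss(4.0, 0.3) for _ in range(40)])
centers, memberships = fcm(data, 2)
print(sorted(centers))
```

Unlike hard C-means, every sample retains a graded membership in every cluster; each row of `memberships` sums to 1, as required by the constraint on u_ij.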


C-means Algorithm (FCM/HCM)


Hierarchical Clustering

Agglomerative clustering (bottom up)

1. Begin with n clusters, each containing one sample.
2. Merge the two most similar clusters into one.
3. Repeat the previous step until done.

Divisive clustering (top down)

[Figure: agglomerative merge sequence on samples 1–7: {1,7}, {5,6}, {2,4}, then {3,5,6}, {1,3,5,6,7}, and finally {1,2,3,4,5,6,7}]


The most natural representation of hierarchical clustering is a corresponding tree, called a dendrogram, which shows how the samples are grouped.


Demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html


Single-linkage algorithm (minimum method)

$$D_{\min}(C_i, C_j) = \min_{a \in C_i,\, b \in C_j} d(a, b)$$

Complete-linkage algorithm (maximum method)

$$D_{\max}(C_i, C_j) = \max_{a \in C_i,\, b \in C_j} d(a, b)$$

Average-linkage algorithm (average method)

$$D_{\text{ave}}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{a \in C_i,\, b \in C_j} d(a, b)$$
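The three linkage distances translate directly into code. A short 1-D sketch, with made-up clusters and absolute difference as the distance d(a, b):

```python
def d(a, b):
    """Distance between two samples; 1-D Euclidean for illustration."""
    return abs(a - b)

def d_min(ci, cj):
    """Single linkage: distance between the closest pair across clusters."""
    return min(d(a, b) for a in ci for b in cj)

def d_max(ci, cj):
    """Complete linkage: distance between the farthest pair across clusters."""
    return max(d(a, b) for a in ci for b in cj)

def d_ave(ci, cj):
    """Average linkage: mean of all pairwise cross-cluster distances."""
    return sum(d(a, b) for a in ci for b in cj) / (len(ci) * len(cj))

ci, cj = [0.0, 1.0], [3.0, 5.0]
print(d_min(ci, cj), d_max(ci, cj), d_ave(ci, cj))  # → 2.0 5.0 3.5
```

An agglomerative algorithm simply merges, at each step, the pair of clusters minimizing whichever of these criteria was chosen.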


Ward's method (minimum-variance method)

$$D_{\text{Ward}}(C_i, C_j) = \sum_{j=1}^{d} \sigma_j^2, \qquad \sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} (x_{ij} - \mu_j)^2$$

d: the number of features
m: the number of samples in C_i and C_j



Single-linkage algorithm



Complete-linkage algorithm