Page 1:

Clustering

Professor Ameet Talwalkar

Slide Credit: Professor Fei Sha

11/20/15

Page 2:

Outline

1 Administration

2 Brief review of last lecture

3 Clustering

Page 3:

Grading

Grades for the midterm and project proposal will be available by next Tuesday

HW5 grades available next Tuesday or Thursday

For the midterm, we will not be giving physical exams back, but you can see them during Nikos’ office hours

Page 4:

HW6

Last homework assignment

I will post it by next Tuesday

Due on the last day of class

Page 5:

Outline

1 Administration

2 Brief review of last lecture

3 Clustering

Page 6:

Neural Networks – Basic Idea

Learning nonlinear basis functions and classifiers

Hidden layers are nonlinear mappings from input features to a new representation

Output layers use the new representations for classification and regression

Learning parameters

Backpropagation = stochastic gradient descent

Page 7:

Summary of the course so far

Supervised learning has been our focus

Setup: given a training dataset {x_n, y_n}_{n=1}^N, we learn a function h(x) to predict x’s true value y (i.e., regression or classification)

Linear vs. nonlinear features
1 Linear: h(x) depends on w^T x
2 Nonlinear: h(x) depends on w^T φ(x), which in turn depends on a kernel function k(x_m, x_n) = φ(x_m)^T φ(x_n)

Loss function
1 Squared loss: least squares for regression (minimizing the residual sum of squares)
2 Logistic loss: logistic regression
3 Exponential loss: AdaBoost
4 Margin-based loss: support vector machines

Principles of estimation
1 Point estimate: maximum likelihood, regularized likelihood

Page 8:

cont’d

Optimization
1 Methods: gradient descent, Newton’s method
2 Convex optimization: global optimum vs. local optimum
3 Lagrange duality: primal and dual formulations

Learning theory
1 Difference between training error and generalization error
2 Overfitting, bias and variance tradeoff
3 Regularization: various regularized models

Page 9:

Supervised versus Unsupervised Learning

Supervised Learning: learning from labeled observations

Labels ‘teach’ the algorithm to learn a mapping from observations to labels

Classification, Regression

Unsupervised Learning: learning from unlabeled observations

The learning algorithm must find latent structure from the features alone

Can be a goal in itself (discover hidden patterns, exploratory analysis)

Can be a means to an end (preprocessing for a supervised task)

Clustering (Today)

Dimensionality Reduction: transform an initial feature representation into a more concise representation (Next)

Page 11:

Outline

1 Administration

2 Brief review of last lecture

3 Clustering
K-means
Gaussian mixture models

Page 12:

Clustering

Setup: Given D = {x_n}_{n=1}^N and K, we want to output

{µ_k}_{k=1}^K: prototypes of clusters

A(x_n) ∈ {1, 2, . . . , K}: the cluster membership, i.e., the cluster ID assigned to x_n

Toy Example: Cluster data into two clusters.

[Figure: two scatter plots, (a) and (i), with both axes running from −2 to 2: the raw data, and the data after being clustered into two clusters.]

Applications

Identify communities within social networks

Find topics in news stories

Group similar sequences into gene families

Page 14:

K-means example

[Figure: nine scatter plots, (a) through (i), with both axes running from −2 to 2, showing successive iterations of K-means on 2D data with two clusters.]

Page 19:

K-means clustering

Intuition: Data points assigned to cluster k should be close to µ_k, the prototype.

Distortion measure (clustering objective function, cost function)

J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|x_n - \mu_k\|_2^2

where r_{nk} ∈ {0, 1} is an indicator variable

r_{nk} = 1 if and only if A(x_n) = k
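To make the objective concrete, here is a minimal NumPy sketch (my illustration, not code from the lecture) that evaluates J for given data, prototypes, and hard assignments; the name `distortion` and the array layout are assumptions.

```python
import numpy as np

def distortion(X, mu, assignments):
    """Distortion J = sum_n sum_k r_nk ||x_n - mu_k||_2^2.

    X           : (N, d) array of data points
    mu          : (K, d) array of cluster prototypes
    assignments : (N,) integer array with A(x_n) in {0, ..., K-1}
    """
    # r_nk is 1 only for the assigned cluster, so the double sum reduces to
    # summing each point's squared distance to its own prototype.
    diffs = X - mu[assignments]
    return float(np.sum(diffs ** 2))

# Tiny usage example: two well-separated clusters in 2D.
X = np.array([[-2.0, -2.0], [-1.8, -2.1], [2.0, 2.0], [2.1, 1.9]])
mu = np.array([[-1.9, -2.05], [2.05, 1.95]])
assignments = np.array([0, 0, 1, 1])
print(distortion(X, mu, assignments))  # small, since points sit near their prototypes
```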

Page 21:

Algorithm

Minimize the distortion measure by alternating optimization between {r_nk} and {µ_k}

Step 0 Initialize {µ_k} to some values

Step 1 Assume the current value of {µ_k} fixed, and minimize J over {r_nk}, which leads to the following cluster assignment rule

r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \|x_n - \mu_j\|_2^2 \\ 0 & \text{otherwise} \end{cases}

Step 2 Assume the current value of {r_nk} fixed, and minimize J over {µ_k}, which leads to the following rule for updating the prototypes of the clusters

\mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}

Step 3 Determine whether to stop or return to Step 1
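Below is a rough NumPy sketch of Steps 0–3 (my own illustration, not the course's reference implementation); it initializes prototypes to random data points and stops when the assignments no longer change.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Alternating minimization of the distortion J (Steps 0-3 above)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Step 0: initialize prototypes to K distinct data points.
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)
    assignments = np.full(N, -1)
    for _ in range(max_iter):
        # Step 1: with {mu_k} fixed, assign each point to its nearest prototype.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (N, K)
        new_assignments = dists.argmin(axis=1)
        # Step 3: stop once the assignments stop changing.
        if np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Step 2: with {r_nk} fixed, set each prototype to the mean of its points.
        for k in range(K):
            members = X[assignments == k]
            if len(members) > 0:  # keep the old prototype if a cluster is empty
                mu[k] = members.mean(axis=0)
    return mu, assignments

# Usage on two synthetic blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.3, size=(50, 2)), rng.normal(2, 0.3, size=(50, 2))])
mu, a = kmeans(X, K=2)
print(mu)
```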

Page 24:

Remarks

The prototype µ_k is the mean of the data points assigned to cluster k, hence ‘K-means’

The procedure reduces J in both Step 1 and Step 2 and thus makes improvements on each iteration

No guarantee we find the global solution; the quality of the local optimum depends on the initial values at Step 0 (k-means++ is a neat approximation algorithm)
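Since k-means++ is only name-dropped on the slide, here is a rough sketch of its seeding idea (my own illustration): each new prototype is sampled with probability proportional to its squared distance from the prototypes chosen so far, and the result can replace the random initialization in Step 0.

```python
import numpy as np

def kmeanspp_init(X, K, seed=0):
    """D^2-weighted seeding: spread the initial prototypes out over the data."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    centers = [X[rng.integers(N)]]  # first prototype chosen uniformly at random
    for _ in range(1, K):
        # Squared distance from every point to its nearest already-chosen prototype.
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.asarray(centers)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # Sample the next prototype proportionally to those squared distances.
        centers.append(X[rng.choice(N, p=d2 / d2.sum())])
    return np.asarray(centers)
```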

Page 25:

Application: vector quantization

Replace each data point with its associated prototype µ_k

In other words, compress the data points into i) a codebook of all the prototypes; ii) a list of indices into the codebook for the data points

Lossy compression, especially for small K

[Figure] Clustering pixels and vector quantizing them. From left to right: original image, quantized with large K, medium K, and a small K. Details are missing due to the higher compression (smaller K).
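As a rough illustration of vector quantization (not code from the lecture), the sketch below clusters the RGB values of an image with scikit-learn's KMeans and replaces every pixel by its prototype color; the `image` array here is a random stand-in for a real photo.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_image(image, K):
    """image: (H, W, 3) array of RGB values; returns its K-color reconstruction."""
    H, W, C = image.shape
    pixels = image.reshape(-1, C)                      # one data point per pixel
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(pixels)
    codebook = km.cluster_centers_                     # the K prototype colors
    indices = km.labels_                               # codebook index for every pixel
    return codebook[indices].reshape(H, W, C)          # lossy reconstruction

# Usage with a random "image"; with a real photo, small K visibly loses detail.
image = np.random.default_rng(0).random((32, 32, 3))
print(quantize_image(image, K=4).shape)
```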

Page 26:

Probabilistic interpretation of clustering?

We can impose a probabilistic interpretation of our intuition that points stay close to their cluster centers

How can we model p(x) to reflect this?

[Figure (b): 2D data on the unit square]

Data points seem to form 3 clusters

We cannot model p(x) with simple, known distributions

E.g., the data is not a Gaussian because we have 3 distinct concentrated regions

Page 28:

Gaussian mixture models: intuition

[Figure (a): the same 2D data on the unit square]

We can model each region with a distinct distribution

Common to use Gaussians, i.e., Gaussian mixture models (GMMs) or mixtures of Gaussians (MoGs).

We don’t know the cluster assignments (labels) or the parameters of the Gaussians or mixture components!

We need to learn them all from our unlabeled data D = {x_n}_{n=1}^N

Page 30:

Gaussian mixture models: formal definition

A Gaussian mixture model has the following density function for x

p(x) = \sum_{k=1}^{K} \omega_k N(x \mid \mu_k, \Sigma_k)

K: the number of Gaussians; they are called (mixture) components

µ_k and Σ_k: mean and covariance matrix of the k-th component

ω_k: mixture weights – they represent how much each component contributes to the final distribution (priors). They satisfy two properties:

\forall\, k, \; \omega_k > 0, \quad \text{and} \quad \sum_k \omega_k = 1

These properties ensure that p(x) is a properly normalized probability density function.
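To make the definition concrete, here is a small sketch (mine, with arbitrary parameter values) that evaluates p(x) for a two-component GMM using SciPy's multivariate normal density.

```python
import numpy as np
from scipy.stats import multivariate_normal

# An arbitrary 2-component GMM in 2D: weights sum to 1, each component has a mean and covariance.
omega = np.array([0.3, 0.7])
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]

def gmm_density(x):
    """p(x) = sum_k omega_k * N(x | mu_k, Sigma_k)."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=S)
               for w, m, S in zip(omega, means, covs))

print(gmm_density(np.array([1.0, 1.0])))
```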

Page 33:

GMM as the marginal distribution of a joint distribution

Consider the following joint distribution

p(x, z) = p(z)\, p(x \mid z)

where z is a discrete random variable taking values between 1 and K. Denote

\omega_k = p(z = k)

Now, assume the conditional distributions are Gaussian distributions

p(x \mid z = k) = N(x \mid \mu_k, \Sigma_k)

Then, the marginal distribution of x is

p(x) = \sum_{k=1}^{K} \omega_k N(x \mid \mu_k, \Sigma_k)

Namely, the Gaussian mixture model
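This latent-variable view also gives a simple generative recipe: draw z from the mixture weights, then draw x from the selected Gaussian. Below is a minimal sketch with illustrative parameter values (NumPy only).

```python
import numpy as np

rng = np.random.default_rng(0)
omega = np.array([0.5, 0.3, 0.2])                     # omega_k = p(z = k)
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
covs = np.stack([0.5 * np.eye(2)] * 3)

def sample_gmm(n):
    """Sample n points by first drawing z, then x given z."""
    z = rng.choice(len(omega), size=n, p=omega)       # latent component labels
    x = np.array([rng.multivariate_normal(means[k], covs[k]) for k in z])
    return x, z

X, Z = sample_gmm(5)
print(Z, X.shape)   # marginally over z, the x samples follow the GMM density p(x)
```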

Page 36:

GMMs: example

[Figure (a): the 2D data on the unit square, colored red, blue, and green by region]

The conditional distributions of x given z (representing color) are

p(x \mid z = \text{red}) = N(x \mid \mu_1, \Sigma_1)

p(x \mid z = \text{blue}) = N(x \mid \mu_2, \Sigma_2)

p(x \mid z = \text{green}) = N(x \mid \mu_3, \Sigma_3)

[Figure (b): the same region of the unit square]

The marginal distribution is thus

p(x) = p(\text{red}) N(x \mid \mu_1, \Sigma_1) + p(\text{blue}) N(x \mid \mu_2, \Sigma_2) + p(\text{green}) N(x \mid \mu_3, \Sigma_3)

Page 38:

Parameter estimation for Gaussian mixture models

The parameters in GMMs are θ = {ω_k, µ_k, Σ_k}_{k=1}^K. To estimate them, consider the simple (and unrealistic) case first.

We have labels z: If we assume z is observed for every x, then our estimation problem is easier to solve. Our training data is augmented:

D′ = {x_n, z_n}_{n=1}^N

z_n denotes the region where x_n comes from. D′ is the complete data and D the incomplete data. How can we learn our parameters?

Given D′, the maximum likelihood estimate of θ is given by

\theta = \arg\max_\theta \log p(D') = \arg\max_\theta \sum_n \log p(x_n, z_n)

Page 40:

Parameter estimation for GMMs: complete data

The complete likelihood is decomposable

\sum_n \log p(x_n, z_n) = \sum_n \log p(z_n)\, p(x_n \mid z_n) = \sum_k \sum_{n: z_n = k} \log p(z_n)\, p(x_n \mid z_n)

where we have grouped the data by its value of z_n. Let us introduce a binary variable γ_nk ∈ {0, 1} to indicate whether z_n = k. We then have

\sum_n \log p(x_n, z_n) = \sum_k \sum_n \gamma_{nk} \log p(z = k)\, p(x_n \mid z = k)

We use a “dummy” variable z to denote all the possible cluster assignment values for x_n

D′ specifies this value in the complete data setting

Page 42:

Parameter estimation for GMMs: complete data

From our previous discussion, we have

\sum_n \log p(x_n, z_n) = \sum_k \sum_n \gamma_{nk} \left[ \log \omega_k + \log N(x_n \mid \mu_k, \Sigma_k) \right]

Regrouping, we have

\sum_n \log p(x_n, z_n) = \sum_k \sum_n \gamma_{nk} \log \omega_k + \sum_k \left\{ \sum_n \gamma_{nk} \log N(x_n \mid \mu_k, \Sigma_k) \right\}

The term inside the braces depends only on the k-th component’s parameters. It is now easy to show (left as an exercise) that the MLE is:

\omega_k = \frac{\sum_n \gamma_{nk}}{\sum_k \sum_n \gamma_{nk}}, \quad \mu_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, x_n, \quad \Sigma_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^T

What’s the intuition?
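With labels available, the MLE above amounts to per-component counting, averaging, and covariance estimation. A rough sketch (my own; it assumes every component has at least one labeled point):

```python
import numpy as np

def gmm_mle_complete(X, z, K):
    """MLE of (omega_k, mu_k, Sigma_k) from complete data {(x_n, z_n)}, z_n in {0,...,K-1}."""
    N, d = X.shape
    omega = np.zeros(K)
    mu = np.zeros((K, d))
    Sigma = np.zeros((K, d, d))
    for k in range(K):
        Xk = X[z == k]                                 # the points with gamma_nk = 1
        omega[k] = len(Xk) / N                         # fraction of points in component k
        mu[k] = Xk.mean(axis=0)                        # their mean
        centered = Xk - mu[k]
        Sigma[k] = centered.T @ centered / len(Xk)     # their (MLE) covariance
    return omega, mu, Sigma
```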

Page 44:

Intuition

Since γ_nk is binary, the previous solution is nothing but:

For ω_k: count the number of data points whose z_n is k and divide by the total number of data points (note that \sum_k \sum_n \gamma_{nk} = N)

For µ_k: take all the data points whose z_n is k and compute their mean

For Σ_k: take all the data points whose z_n is k and compute their covariance matrix

This intuition is going to help us develop an algorithm for estimating θ when we do not know z_n (incomplete data).

Page 45:

Parameter estimation for GMMs: incomplete data

When z_n is not given, we can guess it via the posterior probability

p(z_n = k \mid x_n) = \frac{p(x_n \mid z_n = k)\, p(z_n = k)}{p(x_n)} = \frac{p(x_n \mid z_n = k)\, p(z_n = k)}{\sum_{k'=1}^{K} p(x_n \mid z_n = k')\, p(z_n = k')}

To compute the posterior probability, we need to know the parameters θ!

Let’s pretend we know the values of the parameters so we can compute the posterior probabilities.

How is that going to help us?
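In code, this guess is just Bayes' rule applied to every data point; below is a sketch (mine) that uses SciPy for the Gaussian densities, with `omega`, `means`, and `covs` standing in for the pretended-known θ.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, omega, means, covs):
    """gamma_nk = p(z_n = k | x_n) under the current parameters theta."""
    N, K = X.shape[0], len(omega)
    gamma = np.zeros((N, K))
    for k in range(K):
        # Numerator of Bayes' rule: p(x_n | z_n = k) * p(z_n = k), for all n at once.
        gamma[:, k] = omega[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
    gamma /= gamma.sum(axis=1, keepdims=True)          # divide by p(x_n)
    return gamma
```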

Page 47:

Estimation with soft γ_nk

We define γ_nk = p(z_n = k | x_n)

Recall that γ_nk used to be binary

Now it is a “soft” assignment of x_n to the k-th component

Each x_n is assigned to a component fractionally, according to p(z_n = k | x_n)

We now get the same expressions for the MLE as before!

\omega_k = \frac{\sum_n \gamma_{nk}}{\sum_k \sum_n \gamma_{nk}}, \quad \mu_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, x_n, \quad \Sigma_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^T

But remember, we’re ‘cheating’ by using θ to compute γ_nk!

Page 50:

Iterative procedure

We can alternate between estimating γ_nk and using the estimated γ_nk to compute the parameters (same idea as with K-means!)

Step 0: initialize θ with some values (random or otherwise)

Step 1: compute γ_nk using the current θ

Step 2: update θ using the just-computed γ_nk

Step 3: go back to Step 1

Questions:

Is this procedure reasonable, i.e., are we optimizing a sensible criterion?

Will this procedure converge?

The answers lie in the EM algorithm, a powerful procedure for model estimation with missing (unobserved) data (next lecture).
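Putting the two steps together gives the alternating procedure below, a sketch written by analogy with the K-means loop earlier (the EM view in the next lecture is what justifies it); it runs a fixed number of rounds for simplicity and adds a small ridge to each covariance for numerical stability.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm(X, K, n_iter=50, seed=0):
    """Alternate between soft assignments gamma_nk and the soft-count parameter updates."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Step 0: initialize theta with rough values.
    omega = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    for _ in range(n_iter):
        # Step 1: compute gamma_nk = p(z_n = k | x_n) using the current theta.
        gamma = np.column_stack([
            omega[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)
        ])
        gamma /= gamma.sum(axis=1, keepdims=True)
        # Step 2: update theta using the just-computed gamma_nk (the soft-count MLE).
        Nk = gamma.sum(axis=0)
        omega = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            centered = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * centered).T @ centered / Nk[k] + 1e-6 * np.eye(d)
    return omega, mu, Sigma
```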