
Page 1:

Machine Learning

Gaussian Mixture Models

Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349, Fall 2012

Page 2:

Discriminative vs Generative Models

• Discriminative: Just learn a decision boundary between your sets.

Support Vector Machines

• Generative: Learn enough about your sets to be able to make new examples that would be set members

Gaussian Mixture Models


Page 3:

The Generative Model POV

• Assume the data was generated from a process we can model as a probability distribution

• Learn that probability distribution

• Once learned, use the probability distribution to:
  • “Make” new examples
  • Classify data we haven’t seen before


Page 4:

A non-parametric distribution is not feasible

• Let’s probabilistically model ML student heights.

• Ruler has 200 marks (100 to 300 cm)

• How many probabilities to learn?

• How many students in the class?

• What if the ruler is continuous?

[Figure: histogram of machine learning students’ heights, 150–200 cm]

Page 5:

Learning a Parametric Distribution

[Figure: a probability density curve over height (cm), 150–200]

p(x | Θ) ≡ the probability of x, given the parameters Θ of a model M

• Pick a parametric model (e.g., a Gaussian)
• Learn just a few parameter values

Page 6:

Using Generative Models for Classification

[Figure: two Gaussians over height (cm), 150–240 (machine learning students and NBA players), whose means and variances were learned from data]

New person. Which class does he belong to?

Answer: the class that calls him most probable.
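A minimal sketch of this decision rule in Python, assuming SciPy is available; the means and standard deviations below are made-up stand-ins for values learned from data:

```python
from scipy.stats import norm

# Hypothetical class models "learned from data" (the numbers are made up):
students = norm(loc=178.0, scale=7.0)   # machine learning students
nba = norm(loc=200.0, scale=8.0)        # NBA players

height = 195.0  # the new person

# Pick the class whose learned density gives this height the higher probability.
label = "ML student" if students.pdf(height) > nba.pdf(height) else "NBA player"
print(label)  # "NBA player" for these made-up parameters
```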

Page 7:

Learning a Gaussian Distribution

[Figure: a Gaussian probability density curve over height (cm), 150–200]

p(x | Θ) ≡ the probability of x, given the parameters Θ of a model M

Θ ≡ {µ, σ}  ← the parameters we must learn

M ≡ ( 1 / ((2π)^{1/2} σ) ) e^{−(x−µ)² / (2σ²)}

This is the “normal” Gaussian distribution, often denoted N.

Page 8:

Goal: Find the best Gaussian

• Hypothesis space is Gaussian distributions.
• Find the parameters that maximize the probability of observing the data.

X ≡ {x_1, ..., x_n}

Θ* = argmax_Θ p(X | Θ), where each Θ ≡ {µ, σ}

Page 9:

Some math

Θ* = argmax_Θ p(X | Θ), where each Θ ≡ {µ, σ}

p(X | Θ) = ∏_{i=1}^{n} p(x_i | Θ)

...if we can assume all the x_i are i.i.d.

Page 10:

Numbers getting smaller

p(X | Θ) = ∏_{i=1}^{n} p(x_i | Θ)

What happens as n grows? Problem?

We get underflow if n is, say, 500.

Because log is monotonically increasing, maximizing log p(X | Θ) = ∑_{i=1}^{n} log p(x_i | Θ) gives the same Θ* and solves the underflow.
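A quick NumPy/SciPy sketch of the underflow problem and the log-space fix (the data and the parameters are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=175.0, scale=7.0, size=500)   # n = 500 observations
mu, sigma = 175.0, 7.0

# Naive product of densities: every factor is well below 1, so the product underflows to 0.0.
naive = np.prod(norm.pdf(x, loc=mu, scale=sigma))

# Sum of log-densities: same argmax over the parameters, but no underflow.
loglik = np.sum(norm.logpdf(x, loc=mu, scale=sigma))

print(naive)   # 0.0
print(loglik)  # a finite negative number
```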

Page 11:

Remember what we’re maximizing

Θ* ≡ argmax_Θ p(X | Θ) = argmax_Θ ∑_{i=1}^{n} log p(x_i | Θ)

Fitting the Gaussian into this...

log p(x | Θ) = log( e^{−(x−µ)² / (2σ²)} / ((2π)^{1/2} σ) )

Page 12:

Some math gets you…

log( e^{−(x−µ)² / (2σ²)} / ((2π)^{1/2} σ) ) = log( e^{−(x−µ)² / (2σ²)} ) − log( (2π)^{1/2} σ )

= −(x−µ)² / (2σ²) − log σ − log (2π)^{1/2}

Plug this back into the equation from slide 11.

Page 13:

…which gives us

Θ* ≡ argmax_Θ p(X | Θ) = argmax_Θ ∑_{i=1}^{n} log p(x_i | Θ)

= argmax_Θ ∑_{i=1}^{n} ( −(x_i−µ)² / (2σ²) − log σ )

(The constant −log (2π)^{1/2} does not depend on Θ, so it drops out of the argmax.)

Page 14:

Maximizing Log-likelihood

• To find the best parameters, take the partial derivatives with respect to the parameters and set them to 0.

• The result is a closed-form solution

Θ* = argmax_{Θ = {µ, σ}} ∑_{i=1}^{n} ( −(x_i−µ)² / (2σ²) − log σ )

µ = (1/n) ∑_{i=1}^{n} x_i        σ² = (1/n) ∑_{i=1}^{n} (x_i − µ)²
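A minimal NumPy sketch of this closed-form solution; the function name and the example heights are made up, not from the slides:

```python
import numpy as np

def fit_gaussian_mle(x):
    """Closed-form maximum-likelihood estimates for a 1-D Gaussian.

    Note: this is the MLE variance (divide by n), matching the slide,
    not the unbiased sample variance (divide by n - 1).
    """
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma2 = np.mean((x - mu) ** 2)
    return mu, sigma2

mu, sigma2 = fit_gaussian_mle([168.0, 172.5, 181.0, 175.2, 169.9])
print(mu, np.sqrt(sigma2))
```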

Page 15:

What if…

• …the data distribution can’t be well represented by a single Gaussian?

• Can we model more complex distributions using multiple Gaussians?


Page 16:

Gaussian Mixture Model (GMM)

[Figure: heights (cm), 150–240, of machine learning students & NBA players, fit with two Gaussian components]

Model the distribution as a mixture of Gaussians:

P(x) = ∑_{j=1}^{K} P(z_j) P(x | z_j)

z_j is a Boolean saying whether Gaussian j “made” x; x is the observed value.

Page 17:

What are we optimizing?

P(x) = ∑_{j=1}^{K} P(z_j) P(x | z_j)

Notating P(z_j) as a weight w_j and using the Normal (a.k.a. Gaussian) distribution N(µ_j, σ_j²) gives us...

P(x) = ∑_{j=1}^{K} w_j N(x | µ_j, σ_j²)

This gives 3 variables per Gaussian to optimize: w_j, µ_j, σ_j, such that ∑_{j=1}^{K} w_j = 1.
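A small Python sketch of this mixture density; the function name gmm_pdf and the example numbers are made up, and SciPy's norm is assumed available:

```python
import numpy as np
from scipy.stats import norm

def gmm_pdf(x, weights, means, stds):
    """P(x) = sum_j w_j * N(x | mu_j, sigma_j^2) for a 1-D Gaussian mixture."""
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0), "mixture weights must sum to 1"
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # One column per Gaussian component, one row per query point.
    comps = np.stack([w * norm.pdf(x, loc=m, scale=s)
                      for w, m, s in zip(weights, means, stds)], axis=1)
    return comps.sum(axis=1)

# Made-up mixture: students around 175 cm, NBA players around 200 cm.
print(gmm_pdf([170.0, 200.0], weights=[0.8, 0.2], means=[175.0, 200.0], stds=[7.0, 8.0]))
```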

Page 18:

Bad news: No closed form solution.

Θ* ≡ argmax_Θ p(X | Θ) = argmax_Θ ∑_{i=1}^{n} log p(x_i | Θ)

= argmax_Θ ∑_{i=1}^{n} log( ∑_{j=1}^{K} w_j N(x_i | µ_j, σ_j²) )

Page 19:

Expectation Maximization (EM)

• Solution: The EM algorithm

• EM updates model parameters iteratively.

• After each iteration, the likelihood the model would generate the observed data increases (or at least it doesn’t decrease).

• EM algorithm always converges to a local optimum.


Page 20:

EM Algorithm Summary


• Initialize the parameters

• E step: calculate the likelihood that a model with these parameters generated the data

• M step: update the parameters to increase the likelihood from the E step

• Repeat E & M steps until convergence to a local optimum.

Page 21:

EM for GMM – Initialization

• Choose the number of Gaussian components K. K should be much less than the number of data points, to avoid overfitting.

• (Randomly) select parameters for each Gaussian j: w_j, µ_j, σ_j

...such that ∑_{j=1}^{K} w_j = 1

Page 22:

EM for GMM – Expectation step

The responsibility γ_{j,n} of Gaussian j for observation x_n is defined as...

γ_{j,n} ≡ p(z_j | x_n) = p(x_n | z_j) p(z_j) / p(x_n)

= p(x_n | z_j) p(z_j) / ∑_{k=1}^{K} p(z_k) p(x_n | z_k)

= w_j N(x_n | µ_j, σ_j²) / ∑_{k=1}^{K} w_k N(x_n | µ_k, σ_k²)
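A minimal NumPy/SciPy sketch of this E step for a 1-D GMM; the function name e_step and its argument layout are assumptions, not from the slides:

```python
import numpy as np
from scipy.stats import norm

def e_step(x, w, mu, sigma):
    """Responsibilities gamma[n, j] = p(z_j | x_n) for a 1-D GMM.

    x: data of shape (N,); w, mu, sigma: weights, means, std devs, each of shape (K,).
    """
    x = np.asarray(x, dtype=float)[:, None]          # shape (N, 1)
    numer = w * norm.pdf(x, loc=mu, scale=sigma)     # w_j N(x_n | mu_j, sigma_j^2), shape (N, K)
    return numer / numer.sum(axis=1, keepdims=True)  # each row sums to 1
```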

Page 23:

EM for GMM – Expectation step

Define the responsibility Γ_j of Gaussian j for all the observed data as...

Γ_j ≡ ∑_{n=1}^{N} γ_{j,n}

You can think of Γ_j / N as the proportion of the data explained by Gaussian j.

Page 24:

EM for GMM – Maximization step

Update our parameters as follows...

new w_j = Γ_j / N

new µ_j = ( ∑_{i=1}^{N} γ_{j,i} x_i ) / Γ_j

new σ_j² = ( ∑_{i=1}^{N} γ_{j,i} (x_i − µ_j)² ) / Γ_j
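A matching NumPy sketch of the M step; again, m_step and its argument layout are assumptions, and it consumes the responsibilities produced by the e_step sketch above:

```python
import numpy as np

def m_step(x, gamma):
    """Update (w, mu, sigma) of a 1-D GMM from responsibilities gamma of shape (N, K)."""
    x = np.asarray(x, dtype=float)
    Gamma = gamma.sum(axis=0)                        # Γ_j: total responsibility of component j
    w = Gamma / len(x)                               # new w_j = Γ_j / N
    mu = (gamma * x[:, None]).sum(axis=0) / Gamma    # new µ_j
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Gamma  # new σ_j²
    return w, mu, np.sqrt(var)
```

Iterating e_step and m_step, and stopping once the log-likelihood stops improving, implements the loop summarized on slide 20.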

Page 25:

Why does this work?

• We need to prove that, as our model parameters are adjusted, the likelihood of the data never goes down (it is monotonically non-decreasing)

• This is the part where I point you to the textbook


Page 26:

What happens if…

• If I initialize each Gaussian distribution to have a mean equal to the location of a data point…

• …And I allow sigma to go to 0 for any Gaussian?

• What is one (probably bad) solution for the local optimization algorithm?


Page 27:

What if…

• …our data isn’t just scalars, but each data point has multiple dimensions?

• Can we generalize to multiple dimensions?

• We need to define a covariance matrix.


Page 28:

Covariance Matrix

Given a d-dimensional random variable vector X = [X_1, ..., X_d], the covariance matrix, denoted Σ (confusing, eh?), is defined as...

Σ ≡
[ E[(X_1−µ_1)(X_1−µ_1)]   E[(X_1−µ_1)(X_2−µ_2)]   …   E[(X_1−µ_1)(X_d−µ_d)] ]
[ E[(X_2−µ_2)(X_1−µ_1)]   E[(X_2−µ_2)(X_2−µ_2)]   …   E[(X_2−µ_2)(X_d−µ_d)] ]
[           ⋮                        ⋮              ⋱              ⋮         ]
[ E[(X_d−µ_d)(X_1−µ_1)]   E[(X_d−µ_d)(X_2−µ_2)]   …   E[(X_d−µ_d)(X_d−µ_d)] ]

This is a generalization of the one-dimensional variance of a scalar random variable X:

σ² = var(X) = E[(X − µ)²]
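A small NumPy check of this definition; np.cov with bias=True uses the 1/N convention that matches the expectations above (the data here are random and only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                   # 1000 samples of a 3-dimensional variable

mu = X.mean(axis=0)
Sigma_manual = (X - mu).T @ (X - mu) / len(X)    # entry (i, j) estimates E[(X_i - µ_i)(X_j - µ_j)]
Sigma_np = np.cov(X, rowvar=False, bias=True)    # bias=True divides by N instead of N - 1

print(np.allclose(Sigma_manual, Sigma_np))       # True
```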

Page 29:

Multivariate Gaussian Mixture

[Figure: a two-dimensional Gaussian mixture plotted over the first and second dimensions]

P(X) = ∑_{j=1}^{K} w_j N(X | µ_j, Σ_j), where X and the µ_j are d-dimensional vectors

Given d dimensions and K Gaussians, how many parameters?

The d-by-d covariance matrix Σ describes the shape and orientation of an ellipse.
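The slide leaves the parameter count as a question; one common way to count it, sketched below, treats the weights as K − 1 free parameters because they must sum to 1 (the function name is made up):

```python
def gmm_param_count(d, K):
    """Free parameters of a K-component, d-dimensional GMM with full covariances:
    K mean vectors of length d, K symmetric d x d covariance matrices
    (d * (d + 1) / 2 free entries each), and K - 1 free mixture weights."""
    return K * d + K * d * (d + 1) // 2 + (K - 1)

print(gmm_param_count(d=2, K=3))  # 17
```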

Page 30:

Example: Initialization

(Illustration from Andrew Moore's tutorial slides on GMM)

Page 31:

(Illustration from Andrew Moore's tutorial slides on GMM)

After Iteration #1

Page 32:

(Illustration from Andrew Moore's tutorial slides on GMM)

After Iteration #2

Page 33:

(Illustration from Andrew Moore's tutorial slides on GMM)

After Iteration #3

Page 34:

(Illustration from Andrew Moore's tutorial slides on GMM)

After Iteration #4

Page 35:

(Illustration from Andrew Moore's tutorial slides on GMM)

After Iteration #5

Page 36:

(Illustration from Andrew Moore's tutorial slides on GMM)

After Iteration #6

Page 37:

(Illustration from Andrew Moore's tutorial slides on GMM)

After Iteration #20

Page 38:

GMM Remarks

• GMM is powerful: any density function can be approximated arbitrarily well by a GMM with enough components.

• If the number of components K is too large, the data will be overfitted.
  – Likelihood increases with K (illustrated in the sketch after this list).
  – Extreme case: N Gaussians for N data points, with variances → 0; then the likelihood → ∞.

• How to choose K?
  – Use domain knowledge.
  – Validate through visualization.
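A small scikit-learn sketch of the "likelihood increases with K" point above, on made-up data (assumes scikit-learn is installed):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(175, 7, 300), rng.normal(200, 8, 60)]).reshape(-1, 1)

# The training log-likelihood typically keeps climbing as K grows,
# which is why it cannot be used on its own to choose K.
for K in (1, 2, 4, 8, 16):
    gmm = GaussianMixture(n_components=K, random_state=0).fit(x)
    print(K, len(x) * gmm.score(x))  # score() returns the mean log-likelihood per sample
```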

Page 39:

GMM is a “soft” version of K-means

• Similarities
  – K needs to be specified.
  – Converges to a local optimum.
  – Initialization affects the final result.
  – One would want to try different initializations.

• Differences
  – GMM assigns “soft” labels to instances.
  – GMM considers variances in addition to means.


Page 40:

GMM for Classification


(illustration from Leon Bottou’s slides on EM)

• Given training data with multiple classes…
  1) Model the training data for each class with a GMM
  2) Classify a new point by estimating the probability that each class generated the point
  3) Pick the class with the highest probability as the label
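A minimal per-class-GMM classifier following these three steps, sketched with scikit-learn; the helper names fit_class_gmms and classify are made up:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_gmms(X, y, n_components=2):
    """Step 1: fit one GMM per class label."""
    return {c: GaussianMixture(n_components=n_components, random_state=0).fit(X[y == c])
            for c in np.unique(y)}

def classify(gmms, X_new):
    """Steps 2-3: pick the class whose GMM assigns each new point the highest (log-)likelihood.
    (For very unbalanced classes you might also add log class priors.)"""
    classes = sorted(gmms)
    log_liks = np.column_stack([gmms[c].score_samples(X_new) for c in classes])
    return np.array(classes)[np.argmax(log_liks, axis=1)]

# Toy usage with made-up heights: class 0 = students, class 1 = NBA players.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(175, 7, (300, 1)), rng.normal(200, 8, (60, 1))])
y = np.array([0] * 300 + [1] * 60)
gmms = fit_class_gmms(X, y, n_components=1)
print(classify(gmms, np.array([[172.0], [205.0]])))  # expect [0 1]
```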

Page 41:

GMM for Regression


(illustration from Leon Bottou’s slides on EM)

Given a dataset D = {⟨x_1, y_1⟩, ..., ⟨x_n, y_n⟩}, where y_i ∈ ℝ and x_i is a vector of d dimensions...

Learn a (d+1)-dimensional GMM over ⟨x, y⟩. Then compute f(x) = E[y | x].
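A sketch of this recipe for the d = 1 case: fit a scikit-learn GMM on the joint ⟨x, y⟩ pairs, then average each component's conditional mean of y given x, weighted by how responsible that component is for the observed x. The function gmm_regress and the toy data are made up; the per-component conditional mean uses the standard Gaussian conditioning formula.

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

def gmm_regress(gmm, x):
    """E[y | x] from a GaussianMixture fit on 2-D points whose columns are [x, y]."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    mu_x, mu_y = gmm.means_[:, 0], gmm.means_[:, 1]             # per-component means
    Sxx = gmm.covariances_[:, 0, 0]                             # var of x per component
    Syx = gmm.covariances_[:, 1, 0]                             # cov(y, x) per component
    # Responsibility of each component for the observed x (marginalizing out y).
    resp = gmm.weights_ * norm.pdf(x[:, None], loc=mu_x, scale=np.sqrt(Sxx))  # (N, K)
    resp /= resp.sum(axis=1, keepdims=True)
    # Per-component conditional mean: mu_y + Syx / Sxx * (x - mu_x).
    cond_mean = mu_y + Syx / Sxx * (x[:, None] - mu_x)          # (N, K)
    return (resp * cond_mean).sum(axis=1)

# Toy usage: noisy sine curve, approximated by an 8-component GMM over <x, y>.
rng = np.random.default_rng(0)
xs = rng.uniform(0, 10, 400)
ys = np.sin(xs) + rng.normal(0, 0.1, xs.size)
gmm = GaussianMixture(n_components=8, random_state=0).fit(np.column_stack([xs, ys]))
print(gmm_regress(gmm, [2.0, 5.0]))  # roughly sin(2) and sin(5)
```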