
236608 Visual Recognition Tutorial

Tutorial 4

• Maximum likelihood – an example
• Maximum likelihood – another example
• Bayesian estimation
• Expectation Maximization Algorithm
• Jensen's inequality
• EM for a mixture model

Bayesian Estimation: General Theory

• Bayesian learning considers θ (the parameter vector to be estimated) to be a random variable.

• Before we observe the data, the parameters are described by a prior, which is typically very broad. Once we have observed the data, we can use Bayes' formula to find the posterior. Since some values of the parameters are more consistent with the data than others, the posterior is narrower than the prior. This is Bayesian learning.

Bayesian parametric estimation

• The density function for x, given the training data set X^{(n)} = \{x_1, \ldots, x_N\} (defined in Lecture 2), is

  p(x \mid X^{(n)}) = \int p(x, \theta \mid X^{(n)}) \, d\theta

• From the definition of conditional probability densities,

  p(x, \theta \mid X^{(n)}) = p(x \mid \theta, X^{(n)}) \, p(\theta \mid X^{(n)})

• The first factor is independent of X^{(n)}, since it is just our assumed form for the parameterized density:

  p(x \mid \theta, X^{(n)}) = p(x \mid \theta)

• Therefore

  p(x \mid X^{(n)}) = \int p(x \mid \theta) \, p(\theta \mid X^{(n)}) \, d\theta
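The integral above can be approximated numerically. Below is a minimal sketch (not part of the original tutorial) that evaluates p(x | X^{(n)}) on a grid of θ values for a one-dimensional Gaussian likelihood with known unit variance and a broad Gaussian prior; the data, the grid, the prior width, and the function name `predictive_density` are all assumptions made for illustration.

```python
import numpy as np

def predictive_density(x, data, theta_grid, prior, sigma=1.0):
    """Approximate p(x | X^(n)) = integral of p(x | theta) p(theta | X^(n)) d(theta) on a grid."""
    dtheta = theta_grid[1] - theta_grid[0]
    # Log-likelihood of the observed data for every grid value of theta (Gaussian, known sigma)
    log_lik = np.sum(-0.5 * ((data[:, None] - theta_grid[None, :]) / sigma) ** 2, axis=0)
    post = prior * np.exp(log_lik - log_lik.max())        # unnormalized posterior p(theta | X^(n))
    post /= post.sum() * dtheta                           # normalize over the grid
    lik_x = np.exp(-0.5 * ((x - theta_grid) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.sum(lik_x * post) * dtheta                  # the weighted average over theta

theta_grid = np.linspace(-10.0, 10.0, 2001)
prior = np.exp(-0.5 * (theta_grid / 5.0) ** 2)            # broad (unnormalized) Gaussian prior
data = np.array([1.2, 0.8, 1.5, 0.9])
print(predictive_density(1.0, data, theta_grid, prior))
```

The prior's normalization constant does not matter here, because the posterior is renormalized on the grid.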

• Instead of choosing a specific value for θ, the Bayesian approach performs a weighted average over all values of θ.

• If the weighting factor p(\theta \mid X^{(n)}), which is the posterior of θ, peaks very sharply about some value \hat{\theta}, we obtain

  p(x \mid X^{(n)}) \approx p(x \mid \hat{\theta})

• Thus the optimal estimator is the most likely value of θ given the data and the prior of θ.

Bayesian decision making

• Suppose we know the distribution of possible values of θ, that is, a prior p_0(\theta).

• Suppose we also have a loss function \lambda(\theta, \hat{\theta}) which measures the penalty for estimating \hat{\theta} when the actual value is θ.

• Then we may formulate the estimation problem as Bayesian decision making: choose the value of \hat{\theta} which minimizes the risk

  R[\hat{\theta} \mid X^{(n)}] = \int p(\theta \mid X^{(n)}) \, \lambda(\theta, \hat{\theta}) \, d\theta

• Note that the loss function is usually continuous.

Maximum A-Posteriori (MAP) Estimation

• Let us look at the 0/1 loss function

  \lambda(\theta, \hat{\theta}) = \begin{cases} 0 & \text{if } \hat{\theta} = \theta \\ 1 & \text{if } \hat{\theta} \neq \theta \end{cases}

  For this loss the optimal estimator is the most likely value of θ given the data and the prior of θ. This "most likely value" is given by

  \hat{\theta} = \arg\max_{\theta} p(\theta \mid X^{(n)})
             = \arg\max_{\theta} \frac{p_0(\theta) \, p(X^{(n)} \mid \theta)}{p(X^{(n)})}
             = \arg\max_{\theta} \frac{p_0(\theta) \, p(X^{(n)} \mid \theta)}{\int p(X^{(n)} \mid \theta') \, p_0(\theta') \, d\theta'}

• Since the data are i.i.d.,

  p(X^{(n)} \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)

• We can disregard the normalizing factor p(X^{(n)}) when looking for the maximum.

MAP - continued

• So, the \hat{\theta} we are looking for is

  \hat{\theta} = \arg\max_{\theta} \Big[ p_0(\theta) \prod_{i=1}^{n} p(x_i \mid \theta) \Big]
             = \arg\max_{\theta} \log\!\Big[ p_0(\theta) \prod_{i=1}^{n} p(x_i \mid \theta) \Big]   (log is monotonically increasing)
             = \arg\max_{\theta} \Big[ \log p_0(\theta) + \sum_{i=1}^{n} \log p(x_i \mid \theta) \Big]
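To make the recipe concrete, here is a small sketch (my own illustration, not from the slides) that finds the maximizer of \log p_0(\theta) + \sum_i \log p(x_i \mid \theta) on a grid for Bernoulli coin-flip data with a Beta(2,2) prior; the data, the prior, and the grid resolution are assumptions.

```python
import numpy as np

x = np.array([1, 1, 0, 1, 1, 0, 1, 1])           # coin flips: 1 = heads, 0 = tails

theta = np.linspace(1e-3, 1 - 1e-3, 1999)        # grid over the Bernoulli parameter
log_prior = np.log(theta) + np.log(1 - theta)    # Beta(2,2) prior, up to an additive constant
log_lik = x.sum() * np.log(theta) + (len(x) - x.sum()) * np.log(1 - theta)

theta_map = theta[np.argmax(log_prior + log_lik)]   # argmax of log p0(theta) + sum_i log p(x_i | theta)
theta_ml = theta[np.argmax(log_lik)]                # dropping the prior gives the ML estimate
print(theta_map, theta_ml)                          # MAP (about 0.7) is pulled toward 0.5; ML = 6/8
```

With more data the likelihood term dominates the fixed prior term, which is exactly the point made next.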

Maximum likelihood

• In the MAP estimator, the larger n (the size of the data set), the less important \log p_0(\theta) becomes in the expression

  \log p_0(\theta) + \sum_{i=1}^{n} \log p(x_i \mid \theta)

• This can motivate us to omit the prior. What we get is the maximum likelihood (ML) method.

• Informally: we do not use any prior knowledge about the parameters; we seek the values that "explain" the data in the best way:

  \hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid \theta)

• \log p(X^{(n)} \mid \theta) is the log-likelihood of θ with respect to X^{(n)}.

• We seek a maximum of the likelihood function, of the log-likelihood, or of any monotonically increasing function of them.

Maximum likelihood – an example

• Let us find the ML estimator for the parameter θ of the exponential density function

  p(x \mid \theta) = \frac{1}{\theta} \exp\!\left(-\frac{x}{\theta}\right), \qquad x \ge 0

  Since

  \hat{\theta} = \arg\max_{\theta} p(X^{(n)} \mid \theta)
             = \arg\max_{\theta} \prod_{i=1}^{n} \frac{1}{\theta} e^{-x_i/\theta}
             = \arg\max_{\theta} \sum_{i=1}^{n} \ln\!\left( \frac{1}{\theta} e^{-x_i/\theta} \right),

  we are actually looking for the maximum of the log-likelihood.

• Observe:

  \frac{d}{d\theta} \ln\!\left( \frac{1}{\theta} e^{-x_i/\theta} \right)
    = \frac{d}{d\theta} \left( -\ln\theta - \frac{x_i}{\theta} \right)
    = -\frac{1}{\theta} + \frac{x_i}{\theta^2}

• The maximum is achieved where

  \sum_{i=1}^{n} \left( -\frac{1}{\theta} + \frac{x_i}{\theta^2} \right)
    = -\frac{n}{\theta} + \frac{1}{\theta^2} \sum_{i=1}^{n} x_i = 0
  \quad\Longrightarrow\quad
  \hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} x_i

• We have got the empirical mean (average).
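A quick numerical check of this result (my own sketch, not from the slides; the true parameter, sample size and grid are arbitrary choices): maximizing the log-likelihood over a grid of θ values lands very close to the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=3.0, size=1000)        # samples with true theta = 3

def log_likelihood(theta, x):
    # sum_i ln((1/theta) exp(-x_i/theta)) = -n ln(theta) - sum(x) / theta
    return -len(x) * np.log(theta) - x.sum() / theta

thetas = np.linspace(0.5, 10.0, 2000)
theta_hat = thetas[np.argmax([log_likelihood(t, x) for t in thetas])]
print(theta_hat, x.mean())                       # both close to 3
```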

Maximum likelihood – another example

• Let us find the ML estimator for the parameter θ of the (Laplace) density

  p(x \mid \theta) = \frac{1}{2} e^{-|x - \theta|}

  \hat{\theta} = \arg\max_{\theta} p(X^{(n)} \mid \theta)
             = \arg\max_{\theta} \prod_{i=1}^{n} \frac{1}{2} e^{-|x_i - \theta|}
             = \arg\max_{\theta} \sum_{i=1}^{n} \ln\!\left( \frac{1}{2} e^{-|x_i - \theta|} \right)

• Observe:

  \frac{d}{d\theta} \ln\!\left( \frac{1}{2} e^{-|x_i - \theta|} \right)
    = \frac{d}{d\theta} \left( -\ln 2 - |x_i - \theta| \right)
    = \operatorname{sign}(x_i - \theta)

• The maximum is at the \hat{\theta} where

  \sum_{i=1}^{n} \operatorname{sign}(x_i - \hat{\theta}) = 0

• This is the median of the sampled data.
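The same kind of numerical check works here (again my own sketch with arbitrary data): the grid maximizer of the Laplace log-likelihood essentially coincides with the sample median.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.laplace(loc=2.0, size=501)               # samples with true theta = 2

def log_likelihood(theta, x):
    # sum_i ln(0.5 exp(-|x_i - theta|)) = -n ln(2) - sum_i |x_i - theta|
    return -len(x) * np.log(2.0) - np.abs(x - theta).sum()

thetas = np.linspace(-2.0, 6.0, 4001)
theta_hat = thetas[np.argmax([log_likelihood(t, x) for t in thetas])]
print(theta_hat, np.median(x))                   # grid maximizer vs. sample median
```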

Bayesian estimation - revisited

• We saw the Bayesian estimator for the 0/1 loss function (MAP). What happens when we assume other loss functions?

• Example 1: \lambda(\theta, \hat{\theta}) = |\theta - \hat{\theta}| (θ is unidimensional). The total Bayesian risk here is

  R[\hat{\theta} \mid X^{(n)}] = \int p(\theta \mid X^{(n)}) \, |\theta - \hat{\theta}| \, d\theta
    = \int_{-\infty}^{\hat{\theta}} p(\theta \mid X^{(n)}) (\hat{\theta} - \theta) \, d\theta
    + \int_{\hat{\theta}}^{\infty} p(\theta \mid X^{(n)}) (\theta - \hat{\theta}) \, d\theta

• We seek its minimum:

  \frac{d R[\hat{\theta} \mid X^{(n)}]}{d\hat{\theta}}
    = \int_{-\infty}^{\hat{\theta}} p(\theta \mid X^{(n)}) \, d\theta
    - \int_{\hat{\theta}}^{\infty} p(\theta \mid X^{(n)}) \, d\theta

Bayesian estimation - continued

• At the \hat{\theta} which is a solution we have

  \int_{-\infty}^{\hat{\theta}} p(\theta \mid X^{(n)}) \, d\theta
    = \int_{\hat{\theta}}^{\infty} p(\theta \mid X^{(n)}) \, d\theta

• That is, for \lambda(\theta, \hat{\theta}) = |\theta - \hat{\theta}| the optimal Bayesian estimator for the parameter is the median of the distribution p(\theta \mid X^{(n)}).

• Example 2: \lambda(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2 (squared error). The total Bayesian risk is

  R[\hat{\theta} \mid X^{(n)}] = \int p(\theta \mid X^{(n)}) \, (\theta - \hat{\theta})^2 \, d\theta

• Again, in order to find the minimum, we set the derivative equal to 0:

  \frac{d R[\hat{\theta} \mid X^{(n)}]}{d\hat{\theta}}
    = \int 2 (\hat{\theta} - \theta) \, p(\theta \mid X^{(n)}) \, d\theta
    = 2 \hat{\theta} \int p(\theta \mid X^{(n)}) \, d\theta
    - 2 \int \theta \, p(\theta \mid X^{(n)}) \, d\theta
    = 2 \hat{\theta} - 2 E[\theta \mid X^{(n)}] = 0

• The optimal estimator here is \hat{\theta} = E[\theta \mid X^{(n)}], the conditional expectation of θ given the data X^{(n)}.
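These three results (0/1 loss gives the posterior mode, absolute loss gives the posterior median, squared loss gives the posterior mean) are easy to check numerically. The sketch below is my own illustration on a discretized, deliberately skewed "posterior"; the grid and the toy posterior shape are assumptions.

```python
import numpy as np

theta = np.linspace(0.0, 10.0, 2001)
post = np.exp(-0.5 * ((theta - 3.0) / 0.8) ** 2) + 0.3 * np.exp(-0.5 * ((theta - 6.0) / 1.5) ** 2)
post /= post.sum()                                        # discretized p(theta | X^(n))

def bayes_risk(loss):
    # R[est | X^(n)] = sum_theta p(theta | X^(n)) * loss(theta, est), for every candidate est
    return np.array([(post * loss(theta, e)).sum() for e in theta])

est_abs = theta[np.argmin(bayes_risk(lambda t, e: np.abs(t - e)))]
est_sq = theta[np.argmin(bayes_risk(lambda t, e: (t - e) ** 2))]

cdf = np.cumsum(post)
print(est_abs, theta[np.searchsorted(cdf, 0.5)])          # minimizer of |.| risk vs. posterior median
print(est_sq, (post * theta).sum())                       # minimizer of squared risk vs. posterior mean
print(theta[np.argmax(post)])                             # the mode, i.e. the 0/1-loss (MAP) estimate
```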

EM Algorithm

• EM is an iterative technique designed for probabilistic models.

• We have two sample spaces:
  – X, which is observed
  – Y, which is missing

• A vector of parameters θ gives a distribution of X.

• We should find

  \theta_{ML} = \arg\max_{\theta} P(X \mid \theta)
  or
  \theta_{MAP} = \arg\max_{\theta} P(\theta \mid X)

• The problem is that calculating

  P(X \mid \theta) = \int P(X, Y \mid \theta) \, dy

  is difficult, but calculating P(X, Y \mid \theta) is relatively easy.

• We define

  Q(\theta \mid \theta') = E_{P(Y \mid X, \theta')} \big[ \log P(X, Y \mid \theta) \big]

• The algorithm cycles between two steps:

  E: compute Q(\theta \mid \theta')

  M: \theta_{m+1} = \arg\max_{\theta} Q(\theta \mid \theta_m)
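As a schematic sketch (mine, not from the slides), the whole procedure is just an alternation of these two steps; the callables `e_step` and `m_step`, and the fixed iteration count, are assumptions. A concrete instantiation for a two-Gaussian mixture is given further below.

```python
from typing import Any, Callable

def em(theta0: Any,
       e_step: Callable[[Any], Any],
       m_step: Callable[[Any], Any],
       n_iter: int = 100) -> Any:
    """Generic EM loop.

    e_step(theta) should return whatever is needed to evaluate Q(. | theta),
    e.g. the posterior over the hidden variables P(Y | X, theta);
    m_step(...) should return argmax over theta of Q(theta | theta_m).
    """
    theta = theta0
    for _ in range(n_iter):
        expectations = e_step(theta)     # E step
        theta = m_step(expectations)     # M step
    return theta
```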

• [Figure: maximizing a function using a lower-bound approximation (EM) vs. a linear approximation.]

• Gradient descent makes a linear approximation to the objective function (O.F.); Newton's method makes a quadratic approximation. But the optimal step size is not known.

• EM instead makes a local approximation that is a lower bound (l.b.) to the O.F.

• Choosing a new guess to maximize the lower bound will always be an improvement, provided the gradient is not zero.

• Thus there are two steps: E – compute a l.b., M – maximize the l.b.

• The bound used by EM follows from Jensen's inequality.

Jensen's inequality

• A function f : (a, b) \to \mathbb{R} is convex over (a, b) if

  \forall x_1, x_2 \in (a, b), \; \lambda \in [0, 1]: \quad
  f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2)

• [Figure: a convex function vs. a concave function.]

• Jensen's inequality: for a convex function f,

  E[f(X)] \ge f(E[X])

• For a discrete random variable with two mass points (p_1 + p_2 = 1, p_i \in [0, 1], E[X] = p_1 x_1 + p_2 x_2):

  E[f(X)] = p_1 f(x_1) + p_2 f(x_2) \ge f(p_1 x_1 + p_2 x_2) = f(E[X])

• Assume Jensen's inequality holds for k - 1 mass points. Then

  E[f(X)] = \sum_{i=1}^{k} p_i f(x_i)
          = p_k f(x_k) + (1 - p_k) \sum_{i=1}^{k-1} \frac{p_i}{1 - p_k} f(x_i)

  \ge p_k f(x_k) + (1 - p_k) f\!\left( \sum_{i=1}^{k-1} \frac{p_i}{1 - p_k} x_i \right)   (due to the induction assumption)

  \ge f\!\left( p_k x_k + (1 - p_k) \sum_{i=1}^{k-1} \frac{p_i}{1 - p_k} x_i \right)
   = f\!\left( \sum_{i=1}^{k} p_i x_i \right) = f(E[X])   (due to convexity)
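A short numerical check of the inequality (my own illustration; the distribution, the convex function f(x) = x^2, and the random seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
k = 5
p = rng.random(k); p /= p.sum()        # a random discrete distribution with k mass points
x = rng.normal(size=k)
f = lambda v: v ** 2                   # a convex function

lhs = (p * f(x)).sum()                 # E[f(X)]
rhs = f((p * x).sum())                 # f(E[X])
assert lhs >= rhs                      # Jensen's inequality
print(lhs, rhs)
```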

EM Algorithm

• Since log is concave, Jensen's inequality gives \log(E[g]) \ge E[\log(g)], i.e.

  \log\!\Big( \sum_j q_j \, g(j) \Big) \ge \sum_j q_j \log g(j),

  provided

  \sum_j q_j = 1, \qquad q_j \ge 0, \qquad g(j) \ge 0.

• Applying this to g(j)/q_j gives the bound used below:

  \log\!\Big( \sum_j g(j) \Big)
    = \log\!\Big( \sum_j q_j \frac{g(j)}{q_j} \Big)
    \ge \sum_j q_j \log \frac{g(j)}{q_j}

The General EM Algorithm

• We want to maximize the function

  f(\theta) = p(X \mid \theta)   (2)

  where X is a matrix of observed data. If f(\theta) is simple, we find the maximum by equating its gradient to zero. But if f(\theta) is a mixture (of simple functions),

  f(\theta) = p(X \mid \theta) = \sum_{Y} p(Y, X \mid \theta),   (3)

  this is difficult. This is the situation for EM.

• Given a guess for θ, find a lower bound for f(\theta) with a function g(\theta, q) parameterized by free variables q(y).

• The lower bound:

  f(\theta) = \sum_{y} p(y, X \mid \theta)
            = \sum_{y} q(y) \, \frac{p(y, X \mid \theta)}{q(y)}
            \ge \prod_{y} \left( \frac{p(y, X \mid \theta)}{q(y)} \right)^{q(y)}
            = g(\theta, q),

  provided \sum_{y} q(y) = 1.

• Define

  G(\theta, q) = \log g(\theta, q)
               = \sum_{y} q(y) \log p(y, X \mid \theta) - \sum_{y} q(y) \log q(y)

• If we want the lower bound g(\theta, q) to touch f at the current guess for θ, we choose q to maximize G(\theta, q).

• Adding a Lagrange multiplier λ for the constraint on q gives

  G(\theta, q) = \lambda \Big( 1 - \sum_{y} q(y) \Big)
               + \sum_{y} q(y) \log p(y, X \mid \theta)
               - \sum_{y} q(y) \log q(y)

  \frac{\partial G}{\partial q(y)} = -\lambda + \log p(y, X \mid \theta) - \log q(y) - 1 = 0

  \Longrightarrow \qquad
  q(y) = \frac{p(y, X \mid \theta)}{\sum_{y'} p(y', X \mid \theta)} = p(y \mid X, \theta)

• For this choice the bound becomes

  g(\theta, q) = \prod_{y} \left( \frac{p(y, X \mid \theta)}{q(y)} \right)^{q(y)}
             = \prod_{y} \left( \frac{p(X \mid \theta) \, p(y \mid X, \theta)}{p(y \mid X, \theta)} \right)^{q(y)}
             = p(X \mid \theta)^{\sum_y q(y)} = p(X \mid \theta)   (9)

• So indeed it touches the objective f(\theta).
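A tiny numerical illustration of (9) (my own, with made-up numbers for the joint p(y, X | θ) of a binary hidden variable): with q(y) set to the posterior, the bound g(θ, q) equals p(X | θ) exactly, while any other q gives a strictly smaller value.

```python
import numpy as np

p_joint = np.array([0.12, 0.28])            # p(y=0, X | theta), p(y=1, X | theta) for a fixed theta

f = p_joint.sum()                           # f(theta) = sum_y p(y, X | theta) = p(X | theta)
q = p_joint / f                             # optimal q(y) = p(y | X, theta)
g = np.prod((p_joint / q) ** q)             # g(theta, q) = prod_y (p(y, X | theta) / q(y))^{q(y)}
print(f, g)                                 # identical: the bound touches f

q_bad = np.array([0.5, 0.5])                # any other valid q
print(np.prod((p_joint / q_bad) ** q_bad))  # strictly smaller than f
```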

• Finding q to get a good bound is the "E" step.

• To get the next guess for θ we maximize the bound over θ (this is the "M" step). It is problem-dependent. The relevant term of G is

  \sum_{y} q(y) \log p(y, X \mid \theta) = E_{q(y)} \big[ \log p(y, X \mid \theta) \big]

• It may be difficult, and it is not strictly necessary, to maximize the bound over θ; merely improving it suffices. This is sometimes called "generalized EM".

• It is clear from the figure that the derivative of g at the current guess is identical to the derivative of f.

EM for a mixture model

• We have a mixture of two one-dimensional Gaussians (k = 2).

• Let the mixture coefficients be equal: \alpha_1 = \alpha_2 = 0.5.

• Let the variances be \sigma_1 = \sigma_2 = 1, so the components are N(\mu_1, 1) and N(\mu_2, 1):

  P(x \mid \theta) = \frac{1}{2} N_{\mu_1}(x) + \frac{1}{2} N_{\mu_2}(x)

• The problem is to find \theta = (\mu_1, \mu_2).

• We have a sample set x^{(n)} = (x_1, \ldots, x_n).

• To use the EM algorithm we define hidden random variables (indicators)

  z_{i,j} = \begin{cases} 1 & \text{if } x_i \text{ was chosen from } N_{\mu_j} \\ 0 & \text{otherwise} \end{cases}

• Thus for every i we have z_{i,1} + z_{i,2} = 1.

• We collect all the hidden variables into Z = \{ z_{i,j} \}, \; i = 1, \ldots, n, \; j = 1, 2.

• The aim is to calculate Q and to maximize it.

• For every x_i we have

  P(x_i, z_{i,1}, z_{i,2} \mid \theta)
    = \frac{1}{2} \cdot \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{1}{2} \sum_{j=1}^{2} z_{i,j} (x_i - \mu_j)^2 \right)

• From the i.i.d. assumption for the sample set we have

  \log P(x^{(n)}, Z \mid \theta)
    = \sum_{i=1}^{n} \log P(x_i, z_{i,1}, z_{i,2} \mid \theta)
    = \sum_{i=1}^{n} \left[ \log \frac{1}{2\sqrt{2\pi}} - \frac{1}{2} \sum_{j=1}^{2} z_{i,j} (x_i - \mu_j)^2 \right]

• We see that this expression is linear in the z_{i,j}.

• STEP E: we want to calculate the expected value of this expression relative to P(Z \mid x^{(n)}, \theta'):

  E_{P(Z \mid x^{(n)}, \theta')} \big[ \log P(x^{(n)}, Z \mid \theta) \big]
    = \sum_{i=1}^{n} \left[ \log \frac{1}{2\sqrt{2\pi}}
      - \frac{1}{2} \sum_{j=1}^{2} E_{P(Z \mid x^{(n)}, \theta')}[z_{i,j}] \, (x_i - \mu_j)^2 \right]

• The required expectations (with \mu'_1, \mu'_2 taken from the current guess \theta') are

  E[z_{i,j}] = P(z_{i,j} = 1 \mid x_i, \theta') \cdot 1 + P(z_{i,j} = 0 \mid x_i, \theta') \cdot 0
             = P(z_{i,j} = 1 \mid x_i, \theta')
             = P(x_i \text{ was chosen from } N_{\mu'_j})
             = \frac{ e^{-\frac{1}{2}(x_i - \mu'_j)^2} }{ e^{-\frac{1}{2}(x_i - \mu'_1)^2} + e^{-\frac{1}{2}(x_i - \mu'_2)^2} }

• STEP M:

  \theta_{new} = (\mu_1, \mu_2)
    = \arg\max_{\theta} Q(\theta \mid \theta')
    = \arg\min_{\theta} \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{2} E[z_{i,j}] \, (x_i - \mu_j)^2

• Differentiating with respect to \mu_j and equating to zero we have

  \sum_{i=1}^{n} E[z_{i,j}] \, (x_i - \mu_j) = 0

• Thus

  \mu_j = \frac{ \sum_{i=1}^{n} E[z_{i,j}] \, x_i }{ \sum_{i=1}^{n} E[z_{i,j}] }
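Putting the E and M steps above together gives a complete EM loop for this two-component model. The sketch below is my own illustration; the synthetic data, the initial guess, and the fixed number of iterations are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
# Data drawn from the assumed model: equal weights, unit variances, true means -2 and 3
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 300)])

mu = np.array([-1.0, 1.0])                       # initial guess for (mu_1, mu_2)
for _ in range(50):
    # E step: E[z_ij] = exp(-(x_i - mu_j)^2 / 2) / sum_j' exp(-(x_i - mu_j')^2 / 2)
    w = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
    w /= w.sum(axis=1, keepdims=True)
    # M step: mu_j = sum_i E[z_ij] x_i / sum_i E[z_ij]
    mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)

print(mu)                                        # close to (-2, 3)
```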

EM mixture of Gaussians

• In what follows we use j instead of y, because the missing variables are discrete in this example.

• The model density is a linear combination of component densities p(x \mid j, \theta):

  p(x \mid \theta) = \sum_{j=1}^{M} p(x \mid j, \theta) \, P(j)   (11)

  where M is the number of basis functions (a parameter of the model) and the P(j) are mixing parameters. They are, in fact, the prior probabilities of a data point having been generated from component j of the mixture.

• They satisfy

  \sum_{j=1}^{M} P(j) = 1, \qquad 0 \le P(j) \le 1   (12)

• The component density functions p(x \mid j) are normalized:

  \int p(x \mid j) \, dx = 1

• We shall use Gaussians for p(x \mid j):

  p(x \mid j) = \frac{1}{(2\pi\sigma_j^2)^{d/2}} \exp\!\left( -\frac{\| x - \mu_j \|^2}{2\sigma_j^2} \right)

• We should find P(j), \mu_j and \sigma_j.

• STEP E: calculate

  q(j) = p(j \mid x, \theta^{old})

  which enters

  Q(\theta^{new}, \theta^{old}) = E_{q(j)} \big[ \log P(j, x \mid \theta^{new}) \big]

  (see formulas (8) and (10)).

• We have

  p(j \mid x, \theta) = \frac{ p(x \mid j, \theta) \, P(j) }{ p(x \mid \theta) },
  \qquad
  P(j, x \mid \theta) = P(j) \, p(x \mid j, \theta)

  so that, summed over the data set,

  Q(\theta^{new}, \theta^{old})
    = \sum_{i=1}^{N} \sum_{j=1}^{M} p(j \mid x^i, \theta^{old})
      \left[ \log P^{new}(j) - \frac{ \| x^i - \mu_j^{new} \|^2 }{ 2 (\sigma_j^{new})^2 } - d \log \sigma_j^{new} \right]   (17)

• We maximize (17) subject to the constraint (12):

  \tilde{Q} = Q + \lambda \Big( 1 - \sum_{j=1}^{M} P^{new}(j) \Big)   (18)

• STEP M: setting the derivative of (18) with respect to P^{new}(j) to zero gives

  \sum_{i=1}^{N} \frac{ p(j \mid x^i, \theta^{old}) }{ P^{new}(j) } - \lambda = 0

• Thus

  \sum_{i=1}^{N} p(j \mid x^i, \theta^{old}) = \lambda \, P^{new}(j)   (20)

• Using (12) we get

  \lambda = \sum_{j=1}^{M} \sum_{i=1}^{N} p(j \mid x^i, \theta^{old}) = N   (21)

• So from (21) and (20),

  P^{new}(j) = \frac{1}{N} \sum_{i=1}^{N} p(j \mid x^i, \theta^{old})

EM mixture model. General case

• Calculating the derivatives of (18) with respect to \mu_j^{new} and \sigma_j^{new} in the same way, we obtain (together with the result above for the mixing parameters):

  P^{new}(j) = \frac{ \sum_{i=1}^{N} p(j \mid x^i, \theta^{old}) }
                    { \sum_{j=1}^{M} \sum_{i=1}^{N} p(j \mid x^i, \theta^{old}) }   (22)

  \mu_j^{new} = \frac{ \sum_{i=1}^{N} p(j \mid x^i, \theta^{old}) \, x^i }
                     { \sum_{i=1}^{N} p(j \mid x^i, \theta^{old}) }   (23)

  (\sigma_j^{new})^2 = \frac{1}{d} \cdot
    \frac{ \sum_{i=1}^{N} p(j \mid x^i, \theta^{old}) \, \| x^i - \mu_j^{new} \|^2 }
         { \sum_{i=1}^{N} p(j \mid x^i, \theta^{old}) }   (24)

• Algorithm for calculating p(x) (formula (11)):

  begin
      initialize P(j), \mu_j, \sigma_j^2
      do (a fixed number of times)
          update P(j), \mu_j, \sigma_j^2 with formulas (22), (23), (24)
      for every x, return p(x) from formula (11)
  end
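The sketch below turns this outline into runnable code for d-dimensional data with spherical Gaussian components, using updates (22), (23), (24) for the M step. It is my own illustration, not the tutorial's code; the initialization scheme, the synthetic data, the number of components M, and the iteration count are assumptions.

```python
import numpy as np

def em_gaussian_mixture(X, M, n_iter=100, seed=0):
    """EM for a mixture of M spherical Gaussians; X has shape (N, d).

    Returns the mixing parameters P(j), the means mu_j and the variances sigma_j^2.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    P = np.full(M, 1.0 / M)                                   # mixing parameters P(j)
    mu = X[rng.choice(N, size=M, replace=False)]              # means initialized at random data points
    sigma2 = np.full(M, X.var())                              # spherical variances sigma_j^2

    for _ in range(n_iter):
        # E step: responsibilities p(j | x^i, theta_old)
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)          # ||x^i - mu_j||^2, shape (N, M)
        log_comp = -0.5 * sq / sigma2 - 0.5 * d * np.log(2 * np.pi * sigma2)
        log_post = np.log(P) + log_comp
        log_post -= log_post.max(axis=1, keepdims=True)                   # for numerical stability
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)

        # M step: formulas (22), (23), (24)
        Nj = resp.sum(axis=0)                                  # sum_i p(j | x^i, theta_old)
        P = Nj / N                                             # (22)
        mu = (resp.T @ X) / Nj[:, None]                        # (23)
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        sigma2 = (resp * sq).sum(axis=0) / (d * Nj)            # (24)

    return P, mu, sigma2

# Usage on synthetic 2-D data (assumed for illustration)
rng = np.random.default_rng(4)
X = np.vstack([rng.normal([-3.0, 0.0], 1.0, size=(200, 2)),
               rng.normal([3.0, 2.0], 0.5, size=(200, 2))])
print(em_gaussian_mixture(X, M=2))
```

The returned P(j), \mu_j and \sigma_j^2 can then be plugged into formula (11) to evaluate p(x) at any point x.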