Bayesian Estimation (BE) • Bayesian Parameter Estimation: Gaussian Case • Bayesian Parameter Estimation: General Theory • Problems of Dimensionality


Page 1: (title slide)

Page 2:

Chapter 3: Maximum-Likelihood and Bayesian Parameter Estimation (part 2)

Bayesian Estimation (BE)
Bayesian Parameter Estimation: Gaussian Case
Bayesian Parameter Estimation: General Theory
Problems of Dimensionality
Computational Complexity
Component Analysis and Discriminants
Hidden Markov Models

Page 3:

Pattern Classification, Chapter 3

Bayesian Estimation (Bayesian learning applied to pattern classification problems)

In MLE, θ was assumed fixed; in BE, θ is a random variable. The computation of the posterior probabilities P(ω_i | x) lies at the heart of Bayesian classification.

Goal: compute P(ω_i | x, D)

Given the sample D, Bayes formula can be written as:

$$P(\omega_i \mid x, D) = \frac{P(x \mid \omega_i, D)\, P(\omega_i \mid D)}{\sum_{j=1}^{c} P(x \mid \omega_j, D)\, P(\omega_j \mid D)}$$
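The normalization in the Bayes formula above can be sketched numerically. A minimal sketch, with hypothetical class-conditional values and priors:

```python
# Minimal sketch of the Bayes formula: posterior class probabilities from
# class-conditional densities and priors. All numbers are hypothetical.

def posterior(cond, priors):
    """P(w_i | x, D) = P(x|w_i,D) P(w_i|D) / sum_j P(x|w_j,D) P(w_j|D)."""
    joint = [c * p for c, p in zip(cond, priors)]
    total = sum(joint)                 # the denominator (normalizer)
    return [j / total for j in joint]

cond = [0.3, 0.1]      # hypothetical P(x | w_i, D) at some point x
priors = [0.5, 0.5]    # hypothetical P(w_i | D)
post = posterior(cond, priors)
print(post)            # posteriors sum to 1
```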

Page 4:

To demonstrate the preceding equation, use:

$$P(x, D \mid \omega_i) = P(x \mid \omega_i, D)\, P(D \mid \omega_i)$$

$$P(x \mid D) = \sum_{j} P(x, \omega_j \mid D)$$

$$P(\omega_i) = P(\omega_i \mid D) \quad \text{(the training sample provides this!)}$$

Thus:

$$P(\omega_i \mid x, D) = \frac{P(x \mid \omega_i, D)\, P(\omega_i)}{\sum_{j=1}^{c} P(x \mid \omega_j, D)\, P(\omega_j)}$$

Page 5:

Bayesian Parameter Estimation: Gaussian Case

Goal: estimate μ using the a-posteriori density P(μ | D)

The univariate case: P(μ | D); μ is the only unknown parameter (μ₀ and σ₀ are known!)

$$P(x \mid \mu) \sim N(\mu, \sigma^2) \qquad P(\mu) \sim N(\mu_0, \sigma_0^2)$$

Page 6:

Reproducing density

$$P(\mu \mid D) = \frac{P(D \mid \mu)\, P(\mu)}{\int P(D \mid \mu)\, P(\mu)\, d\mu} = \alpha \prod_{k=1}^{n} P(x_k \mid \mu)\, P(\mu) \qquad (1)$$

$$P(\mu \mid D) \sim N(\mu_n, \sigma_n^2) \qquad (2)$$

Identifying (1) and (2) yields:

$$\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0
\qquad \text{and} \qquad
\sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2},$$

where $\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$ is the sample mean.
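The closed-form update for μ_n and σ_n² above can be sketched directly. A minimal sketch, with a hypothetical sample and prior:

```python
# Sketch of the reproducing-density update: posterior parameters mu_n and
# sigma_n^2 for a Gaussian likelihood with known variance sigma^2 and a
# Gaussian prior N(mu0, sigma0^2). The data values are hypothetical.

def gaussian_posterior(xs, sigma2, mu0, sigma0_2):
    n = len(xs)
    mu_hat = sum(xs) / n                     # sample mean, mu_hat_n
    denom = n * sigma0_2 + sigma2
    mu_n = (n * sigma0_2 / denom) * mu_hat + (sigma2 / denom) * mu0
    sigma_n2 = (sigma0_2 * sigma2) / denom   # posterior variance shrinks with n
    return mu_n, sigma_n2

xs = [1.0, 2.0, 3.0]                         # hypothetical sample D
mu_n, sigma_n2 = gaussian_posterior(xs, sigma2=1.0, mu0=0.0, sigma0_2=1.0)
print(mu_n, sigma_n2)
```

Note how μ_n is a convex combination of the sample mean and the prior mean: more data pulls the estimate toward the sample mean.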

Page 7: (figure slide, no recoverable text)

Page 8:

The univariate case: P(x | D)

P(μ | D) has been computed; P(x | D) remains to be computed!

$$P(x \mid D) = \int P(x \mid \mu)\, P(\mu \mid D)\, d\mu \quad \text{is Gaussian:} \quad P(x \mid D) \sim N(\mu_n, \sigma^2 + \sigma_n^2)$$

It provides the desired class-conditional density P(x | D_j, ω_j). Using P(x | D_j, ω_j) together with P(ω_j) and Bayes formula, we obtain the Bayesian classification rule:

$$\max_{\omega_j} P(\omega_j \mid x, D) = \max_{\omega_j} P(x \mid \omega_j, D_j)\, P(\omega_j)$$
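The predictive density above is just a Gaussian whose variance is widened from σ² to σ² + σ_n². A minimal sketch, assuming μ_n and σ_n² have already been computed:

```python
# Sketch of the predictive density P(x | D) ~ N(mu_n, sigma^2 + sigma_n^2).
# The parameter values below are hypothetical.
import math

def predictive_pdf(x, mu_n, sigma2, sigma_n2):
    var = sigma2 + sigma_n2          # predictive variance: widened by sigma_n^2
    return math.exp(-(x - mu_n) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

p = predictive_pdf(x=1.5, mu_n=1.5, sigma2=1.0, sigma_n2=0.25)
print(p)  # value at the peak of the predictive density
```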

Page 9:

Bayesian Parameter Estimation: General Theory

The P(x | D) computation can be applied to any situation in which the unknown density can be parametrized. The basic assumptions are:

• The form of P(x | θ) is assumed known, but the value of θ is not known exactly
• Our knowledge about θ is assumed to be contained in a known prior density P(θ)
• The rest of our knowledge is contained in a set D of n random variables x₁, x₂, …, xₙ drawn independently according to the unknown P(x)

Page 10:

The basic problem is: "Compute the posterior density P(θ | D)", then "Derive P(x | D)".

Using Bayes formula, we have:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{\int P(D \mid \theta)\, P(\theta)\, d\theta},$$

and by the independence assumption:

$$P(D \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta)$$
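The general recipe above can be carried out numerically on a grid when no closed form exists. A minimal sketch, with an illustrative Bernoulli likelihood, a uniform prior, and a hypothetical sample:

```python
# Numerical sketch of the general theory: compute P(theta | D) on a grid via
# P(D | theta) = prod_k P(x_k | theta), then normalize. The Bernoulli
# likelihood, uniform prior, and data are illustrative assumptions.

thetas = [i / 100 for i in range(1, 100)]   # grid over (0, 1)
prior = [1.0 for _ in thetas]               # uniform prior P(theta)
data = [1, 0, 1, 1]                         # hypothetical Bernoulli sample D

def likelihood(theta, xs):
    p = 1.0
    for x in xs:                            # independence: product over x_k
        p *= theta if x == 1 else (1 - theta)
    return p

unnorm = [likelihood(t, data) * pr for t, pr in zip(thetas, prior)]
z = sum(unnorm)                             # grid approximation of the integral
post = [u / z for u in unnorm]
mode = thetas[max(range(len(post)), key=lambda i: post[i])]
print(mode)  # posterior mode, near the sample frequency 3/4
```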

Page 11:

Problems of Dimensionality

Problems involving 50 or 100 features (binary valued). Classification accuracy depends upon the dimensionality and the amount of training data.

Case of two classes, multivariate normal with the same covariance:

$$P(\text{error}) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2}\, du, \qquad \lim_{r \to \infty} P(\text{error}) = 0,$$

where $r^2 = (\mu_1 - \mu_2)^t\, \Sigma^{-1} (\mu_1 - \mu_2)$ is the squared Mahalanobis distance between the class means.
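The Gaussian tail integral above has a closed form in terms of the complementary error function, since the tail from r/2 to infinity equals erfc(r / (2√2)) / 2. A minimal sketch:

```python
# Sketch of the error-rate integral: the standard-normal tail beyond r/2,
# written via math.erfc so no numerical integration is needed.
import math

def bayes_error(r):
    # (1/sqrt(2*pi)) * integral_{r/2}^{inf} exp(-u^2/2) du
    return 0.5 * math.erfc(r / (2 * math.sqrt(2)))

print(bayes_error(0.0))    # r = 0: classes coincide, error = 0.5
print(bayes_error(100.0))  # large r: error tends to 0
```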

Page 12:

If the features are independent, then $\Sigma = \text{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2)$ and

$$r^2 = \sum_{i=1}^{d} \left(\frac{\mu_{i1} - \mu_{i2}}{\sigma_i}\right)^2$$

The most useful features are the ones for which the difference between the means is large relative to the standard deviation.

It has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance: we have the wrong model!
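The independent-features formula above decomposes r² into per-feature contributions, which makes the "useful feature" criterion concrete. A minimal sketch with hypothetical means and standard deviations:

```python
# Sketch of r^2 for a diagonal covariance: a sum of per-feature terms
# ((mu_i1 - mu_i2) / sigma_i)^2. All values are hypothetical.

def r_squared(mu1, mu2, sigmas):
    return sum(((a - b) / s) ** 2 for a, b, s in zip(mu1, mu2, sigmas))

r2 = r_squared(mu1=[0.0, 0.0], mu2=[3.0, 4.0], sigmas=[1.0, 1.0])
print(r2)  # 9 + 16 = 25

# A feature with equal class means contributes nothing to the separation:
r2_extra = r_squared([0.0, 0.0, 5.0], [3.0, 4.0, 5.0], [1.0, 1.0, 1.0])
print(r2_extra)  # still 25
```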

Page 13: (figure slide, no recoverable text)

Page 14:

Computational Complexity

Our design methodology is affected by computational difficulty.

"big oh" notation: f(x) = O(h(x)), "big oh of h(x)", if:

$$\exists\, (c_0, x_0) \text{ such that } |f(x)| \le c_0\, |h(x)| \quad \forall\, x > x_0$$

(An upper bound on f(x) grows no worse than h(x) for sufficiently large x!)

Example: f(x) = 2 + 3x + 4x^2, g(x) = x^2, so f(x) = O(x^2).
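The definition above asks for witness constants (c₀, x₀). A minimal sketch that checks a concrete pair for the example f(x) = 2 + 3x + 4x²:

```python
# Sketch of the big-oh definition: exhibit constants (c0, x0) witnessing
# f(x) = 2 + 3x + 4x^2 = O(x^2), and spot-check them over a finite range.

def f(x):
    return 2 + 3 * x + 4 * x ** 2

c0, x0 = 9, 1   # for x > 1: 2 <= 2x^2 and 3x <= 3x^2, so f(x) <= 9x^2
ok = all(f(x) <= c0 * x * x for x in range(x0 + 1, 1000))
print(ok)
```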

Page 15:

"big oh" is not unique! f(x) = O(x^2); f(x) = O(x^3); f(x) = O(x^4)

"big theta" notation: f(x) = Θ(h(x)) if:

$$\exists\, (x_0, c_1, c_2) \text{ such that } 0 \le c_1\, h(x) \le f(x) \le c_2\, h(x) \quad \forall\, x > x_0$$

For f(x) = 2 + 3x + 4x^2: f(x) = Θ(x^2) but f(x) ≠ Θ(x^3).

Page 16:

Complexity of the ML Estimation

Gaussian priors in d dimensions, classifier with n training samples for each of c classes.

For each category, we have to compute the discriminant function

$$g(x) = -\frac{1}{2}(x - \hat{\mu})^t\, \hat{\Sigma}^{-1} (x - \hat{\mu}) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat{\Sigma}| + \ln P(\omega),$$

where estimating $\hat{\mu}$ costs O(d·n), estimating $\hat{\Sigma}$ costs O(d²·n), the constant term is O(1), and the prior term is O(n).

Total = O(d²·n). Total for c classes = O(c·d²·n) = O(d²·n).

Cost increases when d and n are large!
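The cost breakdown above can be made concrete for d = 2, where the 2×2 inverse and determinant can be written out by hand. A minimal sketch with hypothetical data and a hypothetical prior:

```python
# Sketch of the ML discriminant g(x) for d = 2. Estimating mu_hat is O(d n),
# Sigma_hat is O(d^2 n), and each evaluation of g(x) is O(d^2).
# The sample points and prior are hypothetical.
import math

xs = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]   # n = 4 samples
n = len(xs)
mu = [sum(p[i] for p in xs) / n for i in range(2)]       # O(d n)
S = [[sum((p[i] - mu[i]) * (p[j] - mu[j]) for p in xs) / n
      for j in range(2)] for i in range(2)]              # O(d^2 n), ML estimate

det = S[0][0] * S[1][1] - S[0][1] * S[1][0]              # 2x2 determinant
Sinv = [[ S[1][1] / det, -S[0][1] / det],
        [-S[1][0] / det,  S[0][0] / det]]                # 2x2 inverse

def g(x, prior=0.5):                                     # O(d^2) per evaluation
    d = [x[0] - mu[0], x[1] - mu[1]]
    q = sum(d[i] * Sinv[i][j] * d[j] for i in range(2) for j in range(2))
    return -0.5 * q - math.log(2 * math.pi) - 0.5 * math.log(det) + math.log(prior)

print(g((1.0, 1.0)))  # at the sample mean, the quadratic term vanishes
```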

Page 17:

Component Analysis and Discriminants

Combine features in order to reduce the dimension of the feature space. Linear combinations are simple to compute and tractable. Project high-dimensional data onto a lower-dimensional space.

Two classical approaches for finding "optimal" linear transformations:

• PCA (Principal Component Analysis): the projection that best represents the data in a least-squares sense
• MDA (Multiple Discriminant Analysis): the projection that best separates the data in a least-squares sense

Page 18:

Hidden Markov Models: Markov Chains

Goal: make a sequence of decisions. Processes that unfold in time: the state at time t is influenced by the state at time t−1.

Applications: speech recognition, gesture recognition, parts-of-speech tagging, DNA sequencing, …

Any temporal process without memory: ω^T = {ω(1), ω(2), ω(3), …, ω(T)} is a sequence of states. We might have ω^6 = {ω₁, ω₄, ω₂, ω₂, ω₁, ω₄}.

The system can revisit a state at different steps, and not every state need be visited.

Page 19:

First-order Markov models

The production of any sequence is described by the transition probabilities

$$P(\omega_j(t+1) \mid \omega_i(t)) = a_{ij}$$

Page 20: (figure slide, no recoverable text)

Page 21:

θ = (a_ij, T)

$$P(\omega^T \mid \theta) = a_{14}\, a_{42}\, a_{22}\, a_{21}\, a_{14}\, P(\omega(1) = \omega_i)$$

Example: speech recognition, the "production of spoken words".

Production of the word "pattern", represented by phonemes:

/p/ /a/ /tt/ /er/ /n/ // ( // = silent state)

Transitions from /p/ to /a/, /a/ to /tt/, /tt/ to /er/, /er/ to /n/, and /n/ to a silent state.
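The sequence probability above is just the initial-state probability times the product of transition probabilities along the path. A minimal sketch for the ω^6 example, with a hypothetical transition matrix:

```python
# Sketch of P(omega^T | theta) for a first-order Markov chain: the product
# of a_ij along the state sequence. The transition matrix is hypothetical;
# states are indexed 1..4, so row/column 0 is unused padding.

a = [[0.0] * 5,
     [0.0, 0.2, 0.3, 0.1, 0.4],   # transitions from state 1
     [0.0, 0.3, 0.2, 0.2, 0.3],   # transitions from state 2
     [0.0, 0.1, 0.4, 0.4, 0.1],   # transitions from state 3
     [0.0, 0.2, 0.5, 0.2, 0.1]]   # transitions from state 4

def sequence_prob(seq, a, p_init):
    p = p_init                      # P(omega(1) = first state)
    for s, t in zip(seq, seq[1:]):  # multiply a_st over consecutive pairs
        p *= a[s][t]
    return p

seq = [1, 4, 2, 2, 1, 4]            # the omega^6 example from the slides
p = sequence_prob(seq, a, p_init=1.0)
print(p)  # a14 * a42 * a22 * a21 * a14
```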