CS434a/541a: Pattern Recognition. Prof. Olga Veksler. Lecture 4


Page 1

CS434a/541a: Pattern Recognition
Prof. Olga Veksler

Lecture 4

Page 2: Outline

- Normal random variable
- Properties
- Discriminant functions

Page 3: Why Normal Random Variables?

- Analytically tractable
- Works well when the observation comes from a single prototype (µ) corrupted by noise
- Is an optimal distribution of the data for many classifiers used in practice

Page 4: The Univariate Normal Density

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right]$$

where:
- µ = mean (or expected value) of x
- σ² = variance

- x is a scalar (has dimension 1)
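A small numerical sketch of this density (not part of the original slides), assuming NumPy; it evaluates the formula at a grid of points and checks that the density integrates to roughly 1:

```python
import numpy as np

def univariate_normal_pdf(x, mu, sigma):
    """Evaluate p(x) = 1/(sqrt(2*pi)*sigma) * exp(-0.5*((x-mu)/sigma)**2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Sanity check: the density should integrate to ~1 over a wide interval.
xs = np.linspace(-10, 10, 10001)
print(np.trapz(univariate_normal_pdf(xs, mu=0.0, sigma=1.5), xs))  # ~1.0
```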

Page 5

(figure slide; no text content recovered)

Page 6: Several Features

- What if we have several features x1, x2, …, xd
  - each normally distributed
  - may have different means
  - may have different variances
  - may be dependent or independent of each other
- How do we model their joint distribution?

Page 7: The Multivariate Normal Density

- The multivariate normal density in d dimensions is:

$$p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\exp\left[-\frac{1}{2}(x-\mu)^{t}\,\Sigma^{-1}(x-\mu)\right]$$

where

$$x = [x_1, x_2, \dots, x_d]^{t}, \qquad \mu = [\mu_1, \mu_2, \dots, \mu_d]^{t}, \qquad
\Sigma = \begin{bmatrix} \sigma_1^2 & \cdots & \sigma_{1d} \\ \vdots & \ddots & \vdots \\ \sigma_{d1} & \cdots & \sigma_d^2 \end{bmatrix}$$

  (σ1d is the covariance of x1 and xd; Σ^{-1} is the inverse of Σ; |Σ| is the determinant of Σ)

- Each xi is N(µi, σi²)
  - to prove this, integrate out all other features from the joint density
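A hedged NumPy sketch of evaluating the multivariate density above (illustrative values, not from the slides):

```python
import numpy as np

def multivariate_normal_pdf(x, mu, Sigma):
    """Evaluate the d-dimensional normal density at x using the formula above."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    exponent = -0.5 * diff @ np.linalg.inv(Sigma) @ diff
    return norm_const * np.exp(exponent)

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
print(multivariate_normal_pdf(np.array([0.2, -0.1]), mu, Sigma))
```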

Page 8: More on Σ

- Σ plays a role similar to the role that σ² plays in one dimension
- From Σ we can find out:
  1. The individual variances of the features x1, x2, …, xd (the diagonal entries σi²)
  2. Whether features xi and xj are
     - independent: σij = 0
     - positively correlated: σij > 0
     - negatively correlated: σij < 0

Page 9: The Multivariate Normal Density

- If Σ is diagonal, then the features x1, …, xd are independent, and the joint density factors into a product of univariate densities:

$$p(x) = \prod_{i=1}^{d}\frac{1}{\sqrt{2\pi}\,\sigma_i}\exp\left[-\frac{(x_i-\mu_i)^{2}}{2\sigma_i^{2}}\right],
\qquad \text{e.g.}\quad \Sigma = \begin{bmatrix} \sigma_1^2 & 0 & 0 \\ 0 & \sigma_2^2 & 0 \\ 0 & 0 & \sigma_3^2 \end{bmatrix}$$
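A quick numerical check of this factorization (an illustrative addition, not part of the lecture), comparing the joint formula against the product of univariate densities for a diagonal Σ:

```python
import numpy as np

# Numerical check of the factorization for a diagonal Sigma.
mu = np.array([1.0, -2.0, 0.5])
sigmas = np.array([1.0, 2.0, 0.5])
Sigma = np.diag(sigmas ** 2)
x = np.array([0.3, -1.0, 1.2])

diff = x - mu
joint = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / (
    (2 * np.pi) ** (len(x) / 2) * np.sqrt(np.linalg.det(Sigma)))
product = np.prod(np.exp(-0.5 * (diff / sigmas) ** 2) / (np.sqrt(2 * np.pi) * sigmas))
print(joint, product)  # the two values agree when Sigma is diagonal
```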

Page 10: The Multivariate Normal Density

$$p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\exp\left[-\frac{1}{2}(x-\mu)^{t}\,\Sigma^{-1}(x-\mu)\right]$$

- The factor 1/((2π)^{d/2} |Σ|^{1/2}) is a normalizing constant
- The exponent -(1/2)(x - µ)^t Σ^{-1} (x - µ) is a scalar s (a single number); the closer s is to 0, the larger p(x) is
- Written out for d = 3:

$$p(x) = c\cdot\exp\left(-\frac{1}{2}\begin{bmatrix}x_1-\mu_1 & x_2-\mu_2 & x_3-\mu_3\end{bmatrix}\begin{bmatrix}\sigma_1^2 & \sigma_{12} & \sigma_{13}\\ \sigma_{12} & \sigma_2^2 & \sigma_{23}\\ \sigma_{13} & \sigma_{23} & \sigma_3^2\end{bmatrix}^{-1}\begin{bmatrix}x_1-\mu_1\\ x_2-\mu_2\\ x_3-\mu_3\end{bmatrix}\right)$$

- Thus p(x) is larger for smaller (x - µ)^t Σ^{-1} (x - µ)

Page 11: The quadratic form (x - µ)^t Σ^{-1} (x - µ)

- Σ is positive semidefinite (x^t Σ x >= 0 for all x)
- If x^t Σ x = 0 for some nonzero x, then det(Σ) = 0 and p(x) is not defined; this case is not interesting:
  1. one feature is constant (has zero variance), or
  2. two components of the feature vector are multiples of each other
- So we will assume Σ is positive definite (x^t Σ x > 0 for all nonzero x)
- If Σ is positive definite, then so is Σ^{-1}

Page 12: The quadratic form (x - µ)^t Σ^{-1} (x - µ)

- A symmetric positive definite d x d matrix has d real positive eigenvalues, and its d eigenvectors can be chosen orthogonal
- Thus if Φ is the matrix whose columns are the normalized eigenvectors of Σ, then Φ^{-1} = Φ^t
- ΣΦ = ΦΛ, where Λ is the diagonal matrix with the corresponding eigenvalues on the diagonal
- Thus Σ = ΦΛΦ^t and Σ^{-1} = ΦΛ^{-1}Φ^t
- If Λ^{-1/2} denotes the diagonal matrix such that Λ^{-1/2} Λ^{-1/2} = Λ^{-1}, then

$$\Sigma^{-1} = \Phi\Lambda^{-1/2}\Lambda^{-1/2}\Phi^{t} = \left(\Lambda^{-1/2}\Phi^{t}\right)^{t}\left(\Lambda^{-1/2}\Phi^{t}\right) = M^{t}M$$

Page 13: The quadratic form (x - µ)^t Σ^{-1} (x - µ)

- Thus

$$(x-\mu)^{t}\Sigma^{-1}(x-\mu) = (x-\mu)^{t}M^{t}M(x-\mu) = \big(M(x-\mu)\big)^{t}\big(M(x-\mu)\big) = \lVert M(x-\mu)\rVert^{2}$$

  where M = Λ^{-1/2} Φ^t (Φ^t acts as a rotation matrix, Λ^{-1/2} as a scaling matrix)

- Points x which satisfy ||M(x - µ)||² = const lie on an ellipse
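An illustrative NumPy sketch (the numbers are made up, not from the slides) that builds M from the eigendecomposition of Σ and confirms the identity above:

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
mu = np.array([1.0, -1.0])
x = np.array([2.5, 0.3])

eigvals, Phi = np.linalg.eigh(Sigma)        # columns of Phi are orthonormal eigenvectors
M = np.diag(eigvals ** -0.5) @ Phi.T        # M = Lambda^{-1/2} Phi^t

quad_form = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
mapped = M @ (x - mu)
print(quad_form, mapped @ mapped)           # the two numbers should match
```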

Page 14: Euclidean vs. Mahalanobis distance

(figure: two contour plots around µ)
- (x - µ)^t (x - µ) is the usual (Euclidean) squared distance between x and µ; points x at equal Euclidean distance from µ lie on a circle
- (x - µ)^t Σ^{-1} (x - µ) is the squared Mahalanobis distance between x and µ; points x at equal Mahalanobis distance from µ lie on an ellipse: Σ stretches circles into ellipses
- The axes of the ellipse are the eigenvectors of Σ; p(x) decreases slowly along one eigenvector direction and fast along the other

Page 15: 2-d Multivariate Normal Density

(figure)
- Can you see much in this graph?
- At most you can see that the mean is around [0, 0], but you can't really tell if x1 and x2 are correlated

Page 16: 2-d Multivariate Normal Density

(figure)
- How about this graph?

Page 17: 2-d Multivariate Normal Density

(figure: level curves, with the mean µ marked)
- Level curves graph
  - p(x) is constant along each contour
  - like a topographic map of the 3-d surface
- Now we can see much more
  - x1 and x2 are independent
  - σ1² and σ2² are equal

Page 18: 2-d Multivariate Normal Density

(figure: four level-curve plots)

$$\mu=[0,0],\ \Sigma=\begin{bmatrix}1&0\\0&1\end{bmatrix} \qquad
\mu=[1,1],\ \Sigma=\begin{bmatrix}1&0\\0&1\end{bmatrix} \qquad
\mu=[0,0],\ \Sigma=\begin{bmatrix}1&0\\0&4\end{bmatrix} \qquad
\mu=[0,0],\ \Sigma=\begin{bmatrix}4&0\\0&1\end{bmatrix}$$

Page 19: 2-d Multivariate Normal Density

(figure: level-curve plots with µ = [0, 0])

$$\Sigma=\begin{bmatrix}1&0.5\\0.5&1\end{bmatrix} \quad
\Sigma=\begin{bmatrix}1&0.9\\0.9&1\end{bmatrix} \quad
\Sigma=\begin{bmatrix}1&0.9\\0.9&4\end{bmatrix} \quad
\Sigma=\begin{bmatrix}1&-0.5\\-0.5&1\end{bmatrix} \quad
\Sigma=\begin{bmatrix}1&-0.9\\-0.9&1\end{bmatrix} \quad
\Sigma=\begin{bmatrix}1&-0.9\\-0.9&4\end{bmatrix}$$

Page 20: The Multivariate Normal Density

- If X has density N(µ, Σ), then A^t X has density N(A^t µ, A^t Σ A)
- Thus X can be transformed into a spherical normal variable (the covariance of a spherical density is the identity matrix I) with the whitening transform

$$A_w = \Phi\Lambda^{-1/2}$$

(figure: X with Σ = [[1, 0.9], [0.9, 1]] is mapped by the whitening transform to a variable with Σ = I)
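A short sketch of the whitening transform with NumPy, using the Σ shown in the figure; the variable names are mine, not the lecture's:

```python
import numpy as np

Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])

eigvals, Phi = np.linalg.eigh(Sigma)
A_w = Phi @ np.diag(eigvals ** -0.5)   # whitening transform A_w = Phi Lambda^{-1/2}

# After the transform the covariance becomes the identity matrix.
print(A_w.T @ Sigma @ A_w)             # ~ [[1, 0], [0, 1]]
```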

Page 21: Discriminant Functions

- A classifier can be viewed as a network which computes m discriminant functions and selects the category corresponding to the largest discriminant

(figure: features x1, x2, x3, …, xd feed into discriminant functions g1(x), g2(x), …, gm(x); the class giving the maximum is selected)

- gi(x) can be replaced with f(gi(x)) for any monotonically increasing function f; the results will be unchanged

Page 22: Discriminant Functions

- The minimum error-rate classification is achieved by the discriminant function

  gi(x) = P(ci | x) = P(x | ci) P(ci) / P(x)

- Since P(x) does not depend on the class, an equivalent discriminant function is

  gi(x) = P(x | ci) P(ci)

- For normal densities it is convenient to take logarithms. Since the logarithm is a monotonically increasing function, an equivalent discriminant function is

  gi(x) = ln P(x | ci) + ln P(ci)

Page 23: Discriminant Functions for the Normal Density

- Suppose that for class ci the class-conditional density p(x | ci) is N(µi, Σi):

$$p(x\mid c_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}}\exp\left[-\frac{1}{2}(x-\mu_i)^{t}\,\Sigma_i^{-1}(x-\mu_i)\right]$$

- Discriminant function: gi(x) = ln P(x | ci) + ln P(ci)
- Plug in p(x | ci) and P(ci) to get

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^{t}\Sigma_i^{-1}(x-\mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(c_i)$$

- The term (d/2) ln 2π is constant for all i, so it can be dropped:

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^{t}\Sigma_i^{-1}(x-\mu_i) - \frac{1}{2}\ln|\Sigma_i| + \ln P(c_i)$$
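A minimal sketch (with assumed class parameters, not from the slides) of evaluating this discriminant for each class and picking the class with the largest value:

```python
import numpy as np

# g_i(x) = -0.5 (x-mu)^t Sigma^{-1} (x-mu) - 0.5 ln|Sigma| + ln P(c_i)
def g(x, mu, Sigma, prior):
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

# (mu, Sigma, prior) per class -- made-up numbers for illustration.
params = [
    (np.array([0.0, 0.0]), np.eye(2), 0.5),
    (np.array([3.0, 3.0]), np.array([[2.0, 0.5], [0.5, 1.0]]), 0.5),
]
x = np.array([2.0, 2.5])
print(np.argmax([g(x, m, S, p) for m, S, p in params]))  # index of the predicted class
```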

Page 24: Case Σi = σ²I

- That is

$$\Sigma_i = \begin{bmatrix} \sigma^2 & 0 & 0 \\ 0 & \sigma^2 & 0 \\ 0 & 0 & \sigma^2 \end{bmatrix}
= \sigma^{2}\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

- In this case the features x1, x2, …, xd are independent, with different means and equal variances σ²

(figure: circular level curves around µ)

Page 25: Case Σi = σ²I

- Discriminant function:

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^{t}\Sigma_i^{-1}(x-\mu_i) - \frac{1}{2}\ln|\Sigma_i| + \ln P(c_i)$$

- det(Σi) = σ^{2d} and Σi^{-1} = (1/σ²) I:

$$\Sigma_i^{-1} = \begin{bmatrix} 1/\sigma^2 & 0 & 0 \\ 0 & 1/\sigma^2 & 0 \\ 0 & 0 & 1/\sigma^2 \end{bmatrix}$$

- Can simplify the discriminant function:

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^{t}\,\frac{I}{\sigma^{2}}\,(x-\mu_i) - \frac{1}{2}\ln\sigma^{2d} + \ln P(c_i)$$

  The term -(1/2) ln σ^{2d} is constant for all i, so drop it:

$$g_i(x) = -\frac{1}{2\sigma^{2}}(x-\mu_i)^{t}(x-\mu_i) + \ln P(c_i) = -\frac{1}{2\sigma^{2}}\lVert x-\mu_i\rVert^{2} + \ln P(c_i)$$

Page 26: Case Σi = σ²I, Geometric Interpretation

- If ln P(ci) = ln P(cj) for all i, j, then effectively

  gi(x) = -||x - µi||²

  and x is classified to the nearest mean; the decision regions form a Voronoi diagram: points in each cell are closer to the mean in that cell than to any other mean

- If ln P(ci) is not equal to ln P(cj), then

  gi(x) = -(1/(2σ²)) ||x - µi||² + ln P(ci)

  and the boundaries shift away from the means of the more probable classes

(figure: decision regions for c1, c2, c3 around the means µ1, µ2, µ3; a sample x falls in the region of c3)

Page 27: Case Σi = σ²I

$$g_i(x) = -\frac{1}{2\sigma^{2}}(x-\mu_i)^{t}(x-\mu_i) + \ln P(c_i)
= -\frac{1}{2\sigma^{2}}\left(x^{t}x - 2\mu_i^{t}x + \mu_i^{t}\mu_i\right) + \ln P(c_i)$$

- The term x^t x is constant for all classes, so it can be dropped:

$$g_i(x) = -\frac{1}{2\sigma^{2}}\left(-2\mu_i^{t}x + \mu_i^{t}\mu_i\right) + \ln P(c_i)
= \underbrace{\frac{\mu_i^{t}}{\sigma^{2}}}_{w_i^{t}}\,x + \underbrace{\left(-\frac{\mu_i^{t}\mu_i}{2\sigma^{2}} + \ln P(c_i)\right)}_{w_{i0}}
= w_i^{t}x + w_{i0}$$

- The discriminant function is linear

Page 28: Case Σi = σ²I

$$g_i(x) = \underbrace{w_i^{t}x}_{\text{linear in } x,\ \ w_i^{t}x=\sum_{j=1}^{d}w_{ij}x_j} + \underbrace{w_{i0}}_{\text{constant in } x}$$

- Thus the discriminant function is linear
- Therefore the decision boundaries gi(x) = gj(x) are linear:
  - lines if x has dimension 2
  - planes if x has dimension 3
  - hyperplanes if x has dimension larger than 3
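A small sketch of building such a linear discriminant for the Σi = σ²I case; the numbers below are placeholders, not from the slides:

```python
import numpy as np

# For Sigma_i = sigma^2 I the discriminant reduces to g_i(x) = w_i^t x + w_i0.
def linear_discriminant(mu, sigma_sq, prior):
    w = mu / sigma_sq
    w0 = -mu @ mu / (2 * sigma_sq) + np.log(prior)
    return w, w0

w, w0 = linear_discriminant(np.array([1.0, 2.0]), sigma_sq=3.0, prior=0.25)
x = np.array([0.5, 0.5])
print(w @ x + w0)  # value of g_i(x)
```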

Page 29: Case Σi = σ²I: Example

- 3 classes, each a 2-dimensional Gaussian with

$$\mu_1=\begin{bmatrix}1\\2\end{bmatrix},\quad \mu_2=\begin{bmatrix}4\\6\end{bmatrix},\quad \mu_3=\begin{bmatrix}-2\\4\end{bmatrix},\qquad
\Sigma_1=\Sigma_2=\Sigma_3=\begin{bmatrix}3&0\\0&3\end{bmatrix}$$

- Priors P(c1) = P(c2) = 1/4 and P(c3) = 1/2
- The discriminant function is

$$g_i(x) = \frac{\mu_i^{t}}{\sigma^{2}}\,x + \left(-\frac{\mu_i^{t}\mu_i}{2\sigma^{2}} + \ln P(c_i)\right)$$

- Plug in the parameters for each class:

$$g_1(x) = \tfrac{1}{3}\begin{bmatrix}1&2\end{bmatrix}x + \left(-\tfrac{5}{6} - 1.38\right)$$

$$g_2(x) = \tfrac{1}{3}\begin{bmatrix}4&6\end{bmatrix}x + \left(-\tfrac{52}{6} - 1.38\right)$$

$$g_3(x) = \tfrac{1}{3}\begin{bmatrix}-2&4\end{bmatrix}x + \left(-\tfrac{20}{6} - 0.69\right)$$

Page 30: Case Σi = σ²I: Example

- Need to find out where each gi(x) is the largest, for i = 1, 2, 3
- The boundaries can be found by solving gi(x) = gj(x) for i, j = 1, 2, 3
- Let's take g1(x) = g2(x) first:

$$\tfrac{1}{3}\begin{bmatrix}1&2\end{bmatrix}x + \left(-\tfrac{5}{6}-1.38\right) = \tfrac{1}{3}\begin{bmatrix}4&6\end{bmatrix}x + \left(-\tfrac{52}{6}-1.38\right)$$

- Simplifying:

$$\begin{bmatrix}-1&-\tfrac{4}{3}\end{bmatrix}\begin{bmatrix}x_1\\x_2\end{bmatrix} = -\tfrac{47}{6}
\qquad\Longleftrightarrow\qquad -x_1 - \tfrac{4}{3}x_2 = -\tfrac{47}{6} \quad \text{(a line equation)}$$

Page 31: Case Σi = σ²I: Example

- Next solve g2(x) = g3(x):

$$2x_1 + \tfrac{2}{3}x_2 = 6.02$$

- Then solve g1(x) = g3(x):

$$x_1 - \tfrac{2}{3}x_2 = -1.81$$

- And finally solve g1(x) = g2(x) = g3(x); the three boundary lines meet at

$$x_1 = 1.4 \quad\text{and}\quad x_2 = 4.82$$
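An illustrative check of this example (not in the original slides): evaluating the three discriminants at the reported intersection point; the values should agree up to the rounding used above:

```python
import numpy as np

mus = [np.array([1.0, 2.0]), np.array([4.0, 6.0]), np.array([-2.0, 4.0])]
priors = [0.25, 0.25, 0.5]
sigma_sq = 3.0

def g(x, mu, prior):
    return mu @ x / sigma_sq - mu @ mu / (2 * sigma_sq) + np.log(prior)

x = np.array([1.4, 4.82])
print([round(g(x, m, p), 2) for m, p in zip(mus, priors)])
# all three values are approximately equal at the intersection point
```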

Page 32: Case Σi = σ²I: Example

- Priors P(c1) = P(c2) = 1/4 and P(c3) = 1/2

(figure: the three decision regions for c1, c2, c3, bounded by the lines g1(x) = g2(x), g2(x) = g3(x), and g1(x) = g3(x))

- The lines connecting the means are perpendicular to the decision boundaries

Page 33: Case Σi = Σ

- Covariance matrices are equal but otherwise arbitrary, for example

$$\Sigma_i = \begin{bmatrix}1&0.5\\0.5&1\end{bmatrix}$$

- In this case the features x1, x2, …, xd are not necessarily independent

Page 34: Case Σi = Σ

- Discriminant function:

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^{t}\Sigma^{-1}(x-\mu_i) - \frac{1}{2}\ln|\Sigma| + \ln P(c_i)$$

- The term -(1/2) ln|Σ| is constant for all classes, so the discriminant function becomes

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^{t}\Sigma^{-1}(x-\mu_i) + \ln P(c_i)$$

  where (x - µi)^t Σ^{-1} (x - µi) is the squared Mahalanobis distance between x and µi

- Squared Mahalanobis distance:

$$\lVert x-y\rVert^{2}_{\Sigma^{-1}} = (x-y)^{t}\Sigma^{-1}(x-y)$$

- If Σ = I, the Mahalanobis distance becomes the usual Euclidean distance:

$$\lVert x-y\rVert^{2}_{I} = (x-y)^{t}(x-y)$$

Page 35: Euclidean vs. Mahalanobis Distances

(figure)
- ||x - µ||² = (x - µ)^t (x - µ): points x at equal Euclidean distance from µ lie on a circle
- ||x - µ||²_{Σ^{-1}} = (x - µ)^t Σ^{-1} (x - µ): points x at equal Mahalanobis distance from µ lie on an ellipse; Σ stretches circles into ellipses, and the ellipse axes are the eigenvectors of Σ

Page 36: Case Σi = Σ, Geometric Interpretation

- If ln P(ci) = ln P(cj) for all i, j, then effectively

  gi(x) = -||x - µi||²_{Σ^{-1}}

  and x is classified to the nearest mean under the Mahalanobis distance: points in each cell are closer to the mean in that cell than to any other mean under Mahalanobis distance

- If ln P(ci) is not equal to ln P(cj), then

  gi(x) = -(1/2)(x - µi)^t Σ^{-1} (x - µi) + ln P(ci)

  and the boundaries shift away from the means of the more probable classes

(figure: decision regions for c1, c2, c3 around the means µ1, µ2, µ3, for equal and unequal priors)

Page 37: Case Σi = Σ

- Can simplify the discriminant function:

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^{t}\Sigma^{-1}(x-\mu_i) + \ln P(c_i)$$

$$= -\frac{1}{2}\left(x^{t}\Sigma^{-1}x - x^{t}\Sigma^{-1}\mu_i - \mu_i^{t}\Sigma^{-1}x + \mu_i^{t}\Sigma^{-1}\mu_i\right) + \ln P(c_i)$$

$$= -\frac{1}{2}\left(x^{t}\Sigma^{-1}x - 2\mu_i^{t}\Sigma^{-1}x + \mu_i^{t}\Sigma^{-1}\mu_i\right) + \ln P(c_i)$$

- The term x^t Σ^{-1} x is constant for all classes, so drop it:

$$g_i(x) = \mu_i^{t}\Sigma^{-1}x + \left(\ln P(c_i) - \frac{1}{2}\mu_i^{t}\Sigma^{-1}\mu_i\right) = w_i^{t}x + w_{i0}$$

- Thus in this case the discriminant is also linear

Page 38: Case Σi = Σ: Example

- 3 classes, each a 2-dimensional Gaussian with

$$\mu_1=\begin{bmatrix}1\\2\end{bmatrix},\quad \mu_2=\begin{bmatrix}-1\\5\end{bmatrix},\quad \mu_3=\begin{bmatrix}-2\\4\end{bmatrix},\qquad
\Sigma_1=\Sigma_2=\Sigma_3=\begin{bmatrix}1&-1.5\\-1.5&4\end{bmatrix}$$

- Priors P(c1) = P(c2) = 1/4 and P(c3) = 1/2
- Again the boundaries can be found by solving gi(x) = gj(x) for i, j = 1, 2, 3

Page 39: Case Σi = Σ: Example

- Let's solve gj(x) = gi(x) in general first:

$$\mu_j^{t}\Sigma^{-1}x + \left(\ln P(c_j) - \frac{1}{2}\mu_j^{t}\Sigma^{-1}\mu_j\right)
= \mu_i^{t}\Sigma^{-1}x + \left(\ln P(c_i) - \frac{1}{2}\mu_i^{t}\Sigma^{-1}\mu_i\right)$$

- Regrouping the terms:

$$\left(\mu_j^{t}\Sigma^{-1} - \mu_i^{t}\Sigma^{-1}\right)x
= \ln\frac{P(c_i)}{P(c_j)} + \frac{1}{2}\left(\mu_j^{t}\Sigma^{-1}\mu_j - \mu_i^{t}\Sigma^{-1}\mu_i\right)$$

  (the left-hand side is a row vector times x; the right-hand side is a scalar)

- We get the line where gj(x) = gi(x)

Page 40: Case Σi = Σ: Example

$$\left(\mu_j^{t}\Sigma^{-1} - \mu_i^{t}\Sigma^{-1}\right)x
= \ln\frac{P(c_i)}{P(c_j)} + \frac{1}{2}\left(\mu_j^{t}\Sigma^{-1}\mu_j - \mu_i^{t}\Sigma^{-1}\mu_i\right)$$

- Now substitute for i, j = 1, 2:

$$\begin{bmatrix}-2&0\end{bmatrix}x = 0 \qquad\Longrightarrow\qquad x_1 = 0$$

- Now substitute for i, j = 2, 3:

$$\begin{bmatrix}-3.14&-1.43\end{bmatrix}x = -2.41 \qquad\Longrightarrow\qquad 3.14\,x_1 + 1.43\,x_2 = 2.41$$

- Now substitute for i, j = 1, 3:

$$\begin{bmatrix}-5.14&-1.43\end{bmatrix}x = -2.41 \qquad\Longrightarrow\qquad 5.14\,x_1 + 1.43\,x_2 = 2.41$$
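A sketch (not from the slides) that recomputes these boundary lines directly from the example's parameters, using the general formula from Page 39:

```python
import numpy as np

mus = [np.array([1.0, 2.0]), np.array([-1.0, 5.0]), np.array([-2.0, 4.0])]
priors = [0.25, 0.25, 0.5]
Sigma_inv = np.linalg.inv(np.array([[1.0, -1.5],
                                    [-1.5, 4.0]]))

# Boundary g_i(x) = g_j(x):
#   (mu_j - mu_i)^t Sigma^{-1} x = ln(P_i/P_j) + 0.5 (mu_j^t S^{-1} mu_j - mu_i^t S^{-1} mu_i)
def boundary(i, j):
    coeff = (mus[j] - mus[i]) @ Sigma_inv
    rhs = (np.log(priors[i] / priors[j])
           + 0.5 * (mus[j] @ Sigma_inv @ mus[j] - mus[i] @ Sigma_inv @ mus[i]))
    return coeff, rhs

for i, j in [(0, 1), (1, 2), (0, 2)]:
    print(boundary(i, j))
# ~([-2, 0], 0), ([-3.14, -1.43], -2.41), ([-5.14, -1.43], -2.41)
```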

Page 41: Case Σi = Σ: Example

- Priors P(c1) = P(c2) = 1/4 and P(c3) = 1/2

(figure: decision regions for c1, c2, c3)

- The lines connecting the means are not in general perpendicular to the decision boundaries

Page 42: General Case: Σi are arbitrary

- Covariance matrices for each class are arbitrary, for example

$$\Sigma_i = \begin{bmatrix}1&0.5\\0.5&1\end{bmatrix}, \qquad \Sigma_j = \begin{bmatrix}1&-0.9\\-0.9&4\end{bmatrix}$$

- In this case the features x1, x2, …, xd are not necessarily independent

Page 43: General Case: Σi are arbitrary

- From the previous discussion,

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^{t}\Sigma_i^{-1}(x-\mu_i) - \frac{1}{2}\ln|\Sigma_i| + \ln P(c_i)$$

- This can't be simplified (no term is constant across all classes), but we can rearrange it:

$$g_i(x) = -\frac{1}{2}\left(x^{t}\Sigma_i^{-1}x - 2\mu_i^{t}\Sigma_i^{-1}x + \mu_i^{t}\Sigma_i^{-1}\mu_i\right) - \frac{1}{2}\ln|\Sigma_i| + \ln P(c_i)$$

$$g_i(x) = x^{t}\underbrace{\left(-\frac{1}{2}\Sigma_i^{-1}\right)}_{W_i}x
+ \underbrace{\left(\Sigma_i^{-1}\mu_i\right)^{t}}_{w_i^{t}}x
+ \underbrace{\left(-\frac{1}{2}\mu_i^{t}\Sigma_i^{-1}\mu_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(c_i)\right)}_{w_{i0}}$$

$$g_i(x) = x^{t}W_i x + w_i^{t}x + w_{i0}$$

Page 44: General Case: Σi are arbitrary

$$g_i(x) = \underbrace{x^{t}W_i x}_{\text{quadratic in } x,\ \ x^{t}Wx=\sum_{k=1}^{d}\sum_{l=1}^{d}w_{kl}x_k x_l}
+ \underbrace{w_i^{t}x}_{\text{linear in } x}
+ \underbrace{w_{i0}}_{\text{constant in } x}$$

- Thus the discriminant function is quadratic
- Therefore the decision boundaries are quadratic (ellipses and paraboloids)
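A compact sketch of the quadratic discriminant gi(x) = x^t Wi x + wi^t x + wi0; the class parameters below are placeholders, not the example on the next slide:

```python
import numpy as np

def quadratic_discriminant(mu, Sigma, prior):
    """Return g_i as a callable built from W_i, w_i, and w_i0."""
    Sigma_inv = np.linalg.inv(Sigma)
    W = -0.5 * Sigma_inv
    w = Sigma_inv @ mu
    w0 = (-0.5 * mu @ Sigma_inv @ mu
          - 0.5 * np.log(np.linalg.det(Sigma))
          + np.log(prior))
    return lambda x: x @ W @ x + w @ x + w0

g1 = quadratic_discriminant(np.array([0.0, 0.0]),
                            np.array([[1.0, 0.5], [0.5, 1.0]]), prior=0.5)
print(g1(np.array([1.0, -1.0])))
```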

Page 45: General Case: Σi are arbitrary: Example

- 3 classes, each a 2-dimensional Gaussian with

$$\mu_1=\begin{bmatrix}-1\\3\end{bmatrix},\quad \mu_2=\begin{bmatrix}0\\6\end{bmatrix},\quad \mu_3=\begin{bmatrix}-2\\4\end{bmatrix}$$

$$\Sigma_1=\begin{bmatrix}1&-0.5\\-0.5&2\end{bmatrix},\quad \Sigma_2=\begin{bmatrix}2&-2\\-2&7\end{bmatrix},\quad \Sigma_3=\begin{bmatrix}1&1.5\\1.5&3\end{bmatrix}$$

- Priors: P(c1) = P(c2) = 1/4 and P(c3) = 1/2
- Again the boundaries can be found by solving gi(x) = gj(x) for i, j = 1, 2, 3, with

$$g_i(x) = x^{t}\left(-\frac{1}{2}\Sigma_i^{-1}\right)x + \left(\Sigma_i^{-1}\mu_i\right)^{t}x
+ \left(-\frac{1}{2}\mu_i^{t}\Sigma_i^{-1}\mu_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(c_i)\right)$$

- This requires solving a set of quadratic inequalities in 2 variables

Page 46: General Case: Σi are arbitrary: Example

(figure: quadratic decision boundaries and regions for c1, c2, c3 for the parameters above; the region for c1 appears in two places)

Page 47: Important Points

- The Bayes classifier for normally distributed classes is in general quadratic
- If the covariance matrices are equal and proportional to the identity matrix, the Bayes classifier is linear
  - If, in addition, the priors on the classes are equal, the Bayes classifier is the minimum Euclidean distance classifier
- If the covariance matrices are equal, the Bayes classifier is linear
  - If, in addition, the priors on the classes are equal, the Bayes classifier is the minimum Mahalanobis distance classifier
- Popular classifiers (minimum Euclidean and minimum Mahalanobis distance) are optimal only if the distribution of the data is appropriate (normal)