
Page 1

Aaron Bobick, School of Interactive Computing

CS 7616 Pattern Recognition: Linear, Linear, Linear…

Page 2

Administrivia

• First problem set will be out tonight (Thurs 1/23). Due in a bit more than one week: Sunday, Feb 2 (touchdown…), 11:55pm.
• General description: for a trio of data sets (one common, one from the sets we provide, one from those sets or your own), use parametric density estimation for normal densities to find the best result. Use both MLE methods and Bayes.
• But the next one may be out before this one is due.

Page 3

Today brought to you by…
• Some materials borrowed from Jie Lu, Joy, Lucian @ CMU, Geoff Hinton (U Toronto), and Reza Shadmehr (Hopkins)

Page 4

Outline for “today”
• We have seen linear discriminants arise in the case of normal distributions. (When?)
• Now we’ll approach from another way:
  • Linear regression – really least squares
  • The “hat” operator
  • From regression to classification: the indicator matrix
• Logistic regression – which is not regression but classification
• Reduced-rank linear discriminants – Fisher Linear Discriminant Analysis

Page 5

Jumping ahead…
• Last time: regression and some discussion of discriminants from normal distributions.
• This time: logistic regression and Fisher LDA

Page 6

First regression
• Let $X = (X_1, X_2, \ldots, X_p)^T$ be a random vector. Unfortunately, $\mathbf{x}_i$ is the $i$th sample vector. Let $y_i$ be a real value associated with $\mathbf{x}_i$.
• Let us assume we want to build a predictor of $y$ based upon a linear model.
• Choose $\beta$ such that the residual is smallest, i.e. minimize the residual sum of squares:
$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2$$

Page 7

Linear regression
• Easy to do with vector notation: let $\mathbf{X}$ be an $N \times (p+1)$ matrix where each row is $(1, \mathbf{x}_i^T)$ (why $p+1$?). Let $\mathbf{y}$ be an $N$-long column vector of outputs. Then:
$$\mathrm{RSS}(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta)$$
• Want to minimize this. How? Differentiate:
$$\frac{\partial\,\mathrm{RSS}}{\partial\beta} = -2\,\mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta)$$

Page 8

Continuing…
• Setting the derivative to zero: $\mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta) = 0$
• Solving: $\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
• Predicting: for a new point $x_0$, $\hat{y}_0 = x_0^T\hat{\beta}$
• Could now predict the original $y$’s:
$$\hat{\mathbf{y}} = \mathbf{X}\hat{\beta} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{H}\mathbf{y}$$
• The matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is called $\mathbf{H}$ for “hat”: it puts the hat on $\mathbf{y}$.
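A minimal numpy sketch of the algebra above (variable names and the synthetic data are my own, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 20, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])  # rows are (1, x_i)
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # beta-hat = (X^T X)^{-1} X^T y
H = X @ np.linalg.solve(X.T @ X, X.T)         # hat matrix H = X (X^T X)^{-1} X^T
y_hat = H @ y                                 # H "puts the hat on" y
assert np.allclose(y_hat, X @ beta_hat)
```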

Page 9

Two views of regression

Page 10

Linear Methods for Classification
• What are they? Methods that give linear decision boundaries between classes: $\{x : \beta_0 + \beta_1^T x = 0\}$
• How to define decision boundaries? Two classes of methods:
  • Model discriminant functions $\delta_k(x)$ for each class as linear
  • Model the boundaries between classes as linear

Page 11

Two Classes of Linear Methods
• Model discriminant functions $\delta_k(x)$ for each class as linear; choose the $k$ for which $\delta_k(x)$ is largest. Different models/methods:
  • Linear regression fit to the class indicator variables
  • Linear discriminant analysis (LDA)
  • Logistic regression (LOGREG)
• Model the boundaries between classes as linear (will be discussed later in class):
  • Perceptron
  • Support vector classifier (SVM)

Page 12

Linear Regression Fit to the Class Indicator Variables
• Linear model for the $k$th indicator response variable:
$$\hat{f}_k(x) = \hat{\beta}_{k0} + \hat{\beta}_k^T x$$
• Decision boundary between classes $k$ and $l$ is the set of points:
$$\{x : \hat{f}_k(x) = \hat{f}_l(x)\} = \{x : (\hat{\beta}_{k0} - \hat{\beta}_{l0}) + (\hat{\beta}_k - \hat{\beta}_l)^T x = 0\}$$
• Linear discriminant function for class $k$: $\delta_k(x) = \hat{f}_k(x)$

Page 13

Linear Regression Fit to the Class Indicator Variables
• Let $Y$ be a vector whose $k$th element $Y_k$ is 1 if the class of the corresponding input is $k$, and zero otherwise. This vector $Y$ is an indicator vector.
• For a set of $N$ training points we can stack the $Y$’s into an $N \times K$ matrix $\mathbf{Y}$ such that each row is the $Y$ for a single input. Each column is then a different indicator function to be learned – a different regression problem.

Page 14

Linear Regression Fit to the Class Indicator Variables
• Best linear fit: for a single column we know how to solve this:
$$\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
• So for the stacked $\mathbf{Y}$:
$$\hat{\mathbf{B}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$$

Page 15

Linear Regression Fit to the Class Indicator Variables
• So given the matrix of weights $\hat{\mathbf{B}}$ (just columns of $\hat{\beta}$’s):
• Compute the discriminant functions as a row vector: $\hat{f}(x)^T = (1, x^T)\,\hat{\mathbf{B}}$
• And choose class $k$ for whichever $\hat{f}_k(x)$ is largest
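A compact numpy sketch of this indicator-matrix classifier (function and variable names are mine; labels are assumed to be integers 0…K-1):

```python
import numpy as np

def fit_indicator_regression(X, labels, K):
    """Regress the N x K indicator matrix Y on the inputs; B is (p+1) x K."""
    N = X.shape[0]
    Xa = np.hstack([np.ones((N, 1)), X])         # augment each row with a leading 1
    Y = np.zeros((N, K))
    Y[np.arange(N), labels] = 1.0                # one indicator column per class
    return np.linalg.solve(Xa.T @ Xa, Xa.T @ Y)  # B = (X^T X)^{-1} X^T Y

def predict_class(B, X):
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(Xa @ B, axis=1)             # choose the largest f_k(x)
```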

Page 16

Linear Regression Fit to the Class Indicator Variables
• So why is this a good idea? Or is it?
• This is actually a sum-of-squares approach: define the class indicator as a target value of 1 or 0. The goal is to fit each class target function as well as possible.
• How well does it work?
  • Pretty well when $K = 2$ (number of classes)
  • But…

Page 17

Linear Regression Fit to the Class Indicator Variables
• Problem: when $K \geq 3$, classes can be masked by others
  • Because of the rigid nature of the regression model

Page 18

Linear Regression Fit to the Class Indicator Variables
(Figure: indicator fits using linear vs. quadratic polynomials.)

Page 19

Linear Discriminant Analysis (Common Covariance Matrix $\Sigma$)
• Model the class-conditional density of $X$ in class $k$ as multivariate Gaussian:
$$f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}}\exp\left(-\tfrac{1}{2}(x - \mu_k)^T\Sigma^{-1}(x - \mu_k)\right)$$
• Class posterior:
$$\Pr(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{l=1}^{K} f_l(x)\,\pi_l}$$
• Decision boundary is the set of points:
$$\{x : \Pr(G = k \mid X = x) = \Pr(G = l \mid X = x)\} = \left\{x : \log\frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = 0\right\}$$
$$= \left\{x : \log\frac{\pi_k}{\pi_l} - \tfrac{1}{2}(\mu_k + \mu_l)^T\Sigma^{-1}(\mu_k - \mu_l) + x^T\Sigma^{-1}(\mu_k - \mu_l) = 0\right\}$$

Page 20

Linear Discriminant Analysis (Common $\Sigma$) con’t
• Linear discriminant function for class $k$:
$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k$$
• Classify to the class with the largest value for its $\delta_k(x)$:
$$G(x) = \arg\max_k \delta_k(x)$$
• Parameter estimation – objective function:
$$\arg\max_\beta \sum_{i=1}^{N}\log\Pr(x_i, y_i) = \arg\max_\beta \sum_{i=1}^{N}\log\big[\Pr(x_i \mid y_i)\,\Pr(y_i)\big]$$
• Estimated parameters:
$$\hat{\pi}_k = N_k / N, \qquad \hat{\mu}_k = \sum_{g_i = k} x_i / N_k, \qquad \hat{\Sigma} = \sum_{k=1}^{K}\sum_{g_i = k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T / (N - K)$$
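A minimal numpy sketch of these estimates and the resulting discriminant rule (function names are mine; labels are assumed to be integers 0…K-1):

```python
import numpy as np

def fit_lda(X, labels, K):
    N, p = X.shape
    pi = np.array([(labels == k).mean() for k in range(K)])         # pi_k = N_k / N
    mu = np.array([X[labels == k].mean(axis=0) for k in range(K)])  # class means
    Sigma = np.zeros((p, p))
    for k in range(K):                                              # pooled covariance
        D = X[labels == k] - mu[k]
        Sigma += D.T @ D
    return pi, mu, Sigma / (N - K)

def lda_predict(X, pi, mu, Sigma):
    Sinv_mu = np.linalg.solve(Sigma, mu.T)  # column k holds Sigma^{-1} mu_k
    # delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k
    deltas = X @ Sinv_mu - 0.5 * np.sum(mu.T * Sinv_mu, axis=0) + np.log(pi)
    return np.argmax(deltas, axis=1)
```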

Page 21

More on being linear…

Page 22

The planar decision surface in data-space for the simple linear discriminant function:
$$\mathbf{w}^T\mathbf{x} + w_0 \geq 0$$

Page 23

Gaussian Linear Discriminant Analysis with Common Covariance Matrix (GDA)
• Model the class-conditional density of $X$ in class $k$ as multivariate Gaussian:
$$f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}}\exp\left(-\tfrac{1}{2}(x - \mu_k)^T\Sigma^{-1}(x - \mu_k)\right)$$
• Class posterior:
$$\Pr(C_k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{l=1}^{K} f_l(x)\,\pi_l}$$
• Decision boundary is the set of points:
$$\{x : \Pr(C_k \mid X = x) = \Pr(C_l \mid X = x)\} = \left\{x : \log\frac{\Pr(C_k \mid X = x)}{\Pr(C_l \mid X = x)} = 0\right\}$$
$$= \left\{x : \log\frac{\Pr(C_k)}{\Pr(C_l)} - \tfrac{1}{2}(\mu_k + \mu_l)^T\Sigma^{-1}(\mu_k - \mu_l) + x^T\Sigma^{-1}(\mu_k - \mu_l) = 0\right\}$$

Page 24

Gaussian Linear Discriminant Analysis with Common Covariance Matrix (GDA)
• Linear discriminant function for class $k$:
$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\Pr(C_k)$$
• Classify to the class with the largest value for its $\delta_k(x)$
• Parameter estimation (where $y_i$ is the class of $\mathbf{x}_i$) – objective function:
$$\arg\max_\beta \sum_{i=1}^{N}\log\Pr(x_i, y_i) = \arg\max_\beta \sum_{i=1}^{N}\log\big[\Pr(x_i \mid y_i)\,\Pr(y_i)\big]$$
• MLE estimated parameters:
$$\widehat{\Pr}(C_k) = N_k / N, \qquad \hat{\mu}_k = \sum_{y_i = k} x_i / N_k, \qquad \hat{\Sigma} = \sum_{k=1}^{K}\sum_{y_i = k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T / (N - K)$$

Page 25

Logistic Regression
• To compute the posterior, we modeled the right-hand side of the equation below by assuming the densities were Gaussians and computing their parameters (or used a kernel estimate of the density):
$$\hat{P}(C_\lambda \mid \mathbf{x}) = \frac{\hat{p}(\mathbf{x} \mid C_\lambda)\,\hat{P}(C_\lambda)}{\hat{p}(\mathbf{x})}$$
• In logistic regression, we want to directly model the posterior as a function of the variable $\mathbf{x}$:
$$\hat{P}(C \mid \mathbf{x}) = g(\mathbf{x})$$
• In practice, when there are $k$ classes to classify, we model:
$$P(C_1 \mid \mathbf{x}) = g_1(\mathbf{x}), \quad \ldots, \quad P(C_k \mid \mathbf{x}) = g_k(\mathbf{x})$$

Page 26

Classification by maximizing the posterior distribution
• In this example we assume that the two class distributions have equal variance. Suppose we want to classify a person as male or female based on height.
• Height is normally distributed in the population of men and in the population of women, with different means and similar variances. Let $y$ be an indicator variable for being female. Then the conditional distribution of $x$ (the height) becomes:
$$p(x \mid y=1) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2\sigma^2}(x - \mu_f)^2\right), \qquad p(x \mid y=0) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2\sigma^2}(x - \mu_m)^2\right)$$
• What we have: $p(x \mid y=0)$, $p(x \mid y=1)$, and $P(y=1) = q$
• What we want: $P(y=1 \mid x)$

Page 27

Posterior probability for classification when we have two classes ($q = \Pr(C_1)$):
$$P(y=1 \mid x) = \frac{P(y=1)\,p(x \mid y=1)}{P(y=1)\,p(x \mid y=1) + P(y=0)\,p(x \mid y=0)}$$
$$= \frac{q\exp\left(-\frac{1}{2\sigma^2}(x - \mu_f)^2\right)}{q\exp\left(-\frac{1}{2\sigma^2}(x - \mu_f)^2\right) + (1 - q)\exp\left(-\frac{1}{2\sigma^2}(x - \mu_m)^2\right)}$$
$$= \frac{1}{1 + \frac{1-q}{q}\exp\left(\frac{1}{2\sigma^2}(x - \mu_f)^2 - \frac{1}{2\sigma^2}(x - \mu_m)^2\right)}$$
$$= \frac{1}{1 + \exp\left(\log\frac{1-q}{q} + \frac{1}{2\sigma^2}\big[(x - \mu_f)^2 - (x - \mu_m)^2\big]\right)}$$
$$= \frac{1}{1 + \exp\left(\log\frac{1-q}{q} - \frac{\mu_f - \mu_m}{\sigma^2}\,x + \frac{\mu_f^2 - \mu_m^2}{2\sigma^2}\right)}$$

Page 28

Computing the posterior
• Computing the probability that the subject is female, given that we observed height $x$:
$$P(y=1 \mid x) = \frac{1}{1 + \exp\left(\log\frac{1-q}{q} - \frac{\mu_f - \mu_m}{\sigma^2}\,x + \frac{\mu_f^2 - \mu_m^2}{2\sigma^2}\right)}$$
with
$$\mu_m = 176\ \text{cm}, \quad \mu_f = 166\ \text{cm}, \quad \sigma = 12\ \text{cm}, \quad p(y=1) = q = 0.5$$
(Figure: the class densities $p(x \mid y=1)$ and $p(x \mid y=0)$, and the posterior $P(y=1 \mid x)$ falling from 1 to 0 as height $x$ runs from 120 to 220.)
• The posterior is a logistic function: in the denominator, $x$ appears linearly inside the exponential.
• So if we assume that the class membership densities $p(x \mid y)$ are normal with equal variance, then the posterior probability will be a logistic function.
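A small sketch reproducing this posterior with the numbers above (names are mine; plotting is omitted):

```python
import numpy as np

mu_m, mu_f, sigma, q = 176.0, 166.0, 12.0, 0.5   # values from the slide

def posterior_female(x):
    """P(y=1 | x) for equal-variance Gaussian classes: logistic in x."""
    a = (np.log((1 - q) / q)
         - (mu_f - mu_m) / sigma**2 * x
         + (mu_f**2 - mu_m**2) / (2 * sigma**2))
    return 1.0 / (1.0 + np.exp(a))

x = np.linspace(120, 220, 5)
print(np.round(posterior_female(x), 3))  # decreases with height; 0.5 at x = 171
```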

Page 29

Posterior probability for classification when we have two classes ($q = \Pr(C_1)$), repeating the derivation: the exponent is linear in $x$, so the posterior has the form
$$P(y=1 \mid x) = \frac{1}{1 + \exp\left(\log\frac{1-q}{q} - \frac{\mu_f - \mu_m}{\sigma^2}\,x + \frac{\mu_f^2 - \mu_m^2}{2\sigma^2}\right)} = \frac{1}{1 + \exp\big(-(a_0 + \mathbf{a}^T\mathbf{x})\big)}$$

Page 30

Logistic regression classification
• Assumption of equal variance among the class densities implies a linear decision boundary: $a_0 + \mathbf{a}^T\mathbf{x} = 0$
(Figure: two classes in the $(x_1, x_2)$ plane; the boundary is the line $a_0 + \mathbf{a}^T\mathbf{x} = 0$, with Class 0 on one side.)
$$P(y^{(i)}=1 \mid \mathbf{x}^{(i)}) = \frac{1}{1 + \exp\big(-(a_0 + \mathbf{a}^T\mathbf{x}^{(i)})\big)}$$
$$P(y^{(i)}=0 \mid \mathbf{x}^{(i)}) = 1 - P(y^{(i)}=1 \mid \mathbf{x}^{(i)}) = \frac{\exp\big(-(a_0 + \mathbf{a}^T\mathbf{x}^{(i)})\big)}{1 + \exp\big(-(a_0 + \mathbf{a}^T\mathbf{x}^{(i)})\big)}$$
• Classify $y^{(i)} = 1$ if
$$\log\frac{P(y^{(i)}=1 \mid \mathbf{x}^{(i)})}{P(y^{(i)}=0 \mid \mathbf{x}^{(i)})} = a_0 + \mathbf{a}^T\mathbf{x}^{(i)} > 0$$

Page 31

Logistic regression: problem statement
• Given training data $D = \{(\mathbf{x}^{(1)}, y^{(1)}), \ldots, (\mathbf{x}^{(N)}, y^{(N)})\}$ with $y^{(i)} \in \{0, 1\}$, and assuming equal variance among the clusters, model:
$$P(y^{(i)}=1 \mid \mathbf{x}^{(i)}) = q^{(i)} = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x}^{(i)})}, \qquad P(y^{(i)}=0 \mid \mathbf{x}^{(i)}) = 1 - q^{(i)}$$
where (in two dimensions, say) $\mathbf{x}^{(i)} = (1, x_1^{(i)}, x_2^{(i)})^T$ and $\mathbf{w} = (w_0, w_1, w_2)^T$.
• Each $y^{(i)}$ is Bernoulli:
$$p(y^{(i)} \mid \mathbf{x}^{(i)}) = (q^{(i)})^{y^{(i)}}\,(1 - q^{(i)})^{1 - y^{(i)}}$$
so the likelihood of the training labels is:
$$p(y^{(1)}, \ldots, y^{(N)} \mid \mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}) = \prod_{i=1}^{N}(q^{(i)})^{y^{(i)}}\,(1 - q^{(i)})^{1 - y^{(i)}}$$
• The goal is to find parameters $\mathbf{w}$ that maximize the log-likelihood of the training data:
$$l(D; \mathbf{w}) = \sum_{i=1}^{N}\left[y^{(i)}\log q^{(i)} + (1 - y^{(i)})\log(1 - q^{(i)})\right]$$

Page 32

Some useful properties of the logistic function
$$q = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}, \qquad 0 < q < 1$$
$$\frac{1 - q}{q} = \exp(-\mathbf{w}^T\mathbf{x}) \qquad\Longleftrightarrow\qquad \log\frac{q}{1 - q} = \mathbf{w}^T\mathbf{x}$$
$$\frac{dq}{d(\mathbf{w}^T\mathbf{x})} = \frac{\exp(-\mathbf{w}^T\mathbf{x})}{\big(1 + \exp(-\mathbf{w}^T\mathbf{x})\big)^2} = q(1 - q)$$

Page 33

Online algorithm for logistic regression: gradient ascent (new data, or iterate)
• Differentiate the log-likelihood using the chain rule:
$$l(D; \mathbf{w}) = \sum_{i=1}^{N}\left[y^{(i)}\log q^{(i)} + (1 - y^{(i)})\log(1 - q^{(i)})\right]$$
$$\frac{dl}{dq^{(i)}} = \frac{y^{(i)}}{q^{(i)}} - \frac{1 - y^{(i)}}{1 - q^{(i)}} = \frac{y^{(i)} - q^{(i)}}{q^{(i)}(1 - q^{(i)})}, \qquad \frac{dq^{(i)}}{d\mathbf{w}} = q^{(i)}(1 - q^{(i)})\,\mathbf{x}^{(i)}$$
$$\frac{dl}{d\mathbf{w}} = \sum_{i=1}^{N}\frac{dl}{dq^{(i)}}\,\frac{dq^{(i)}}{d\mathbf{w}} = \sum_{i=1}^{N}\big(y^{(i)} - q^{(i)}\big)\,\mathbf{x}^{(i)}$$
• Gradient-ascent update:
$$\mathbf{w}^{(i+1)} = \mathbf{w}^{(i)} + \eta\,\big(y^{(i)} - q^{(i)}\big)\,\mathbf{x}^{(i)}$$
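A minimal numpy sketch of this online update (names are mine; rows of X are assumed to be augmented with a leading 1):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_online(X, y, eta=0.1, epochs=100):
    """Stochastic gradient ascent on the Bernoulli log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            q = sigmoid(w @ xi)       # q^(i) = P(y=1 | x^(i))
            w += eta * (yi - q) * xi  # w <- w + eta (y - q) x
    return w
```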

Page 34

Batch algorithm: Iteratively Re-weighted Least Squares
• First derivative: stack the inputs as the rows of $X$, with $\mathbf{y} \equiv (y^{(1)}, \ldots, y^{(N)})^T$ and $\mathbf{q} \equiv (q^{(1)}, \ldots, q^{(N)})^T$:
$$\frac{dl}{d\mathbf{w}} = \sum_{i=1}^{N}\big(y^{(i)} - q^{(i)}\big)\,\mathbf{x}^{(i)} = X^T(\mathbf{y} - \mathbf{q})$$
• Second derivative:
$$\frac{d^2 l}{d\mathbf{w}\,d\mathbf{w}^T} = -\sum_{i=1}^{N} q^{(i)}(1 - q^{(i)})\,\mathbf{x}^{(i)}\mathbf{x}^{(i)T} = -X^T Q X, \qquad Q = \begin{pmatrix} q^{(1)}(1-q^{(1)}) & 0 & \cdots & 0 \\ 0 & q^{(2)}(1-q^{(2)}) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & q^{(N)}(1-q^{(N)}) \end{pmatrix}$$

Page 35

Iteratively Re-weighted Least Squares
• With
$$P(y^{(i)}=1 \mid \mathbf{x}^{(i)}) = q^{(i)} = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x}^{(i)})}, \qquad \frac{dl}{d\mathbf{w}} = X^T(\mathbf{y} - \mathbf{q}), \qquad \frac{d^2 l}{d\mathbf{w}\,d\mathbf{w}^T} = -X^T Q X$$
the Newton (IRLS) update is:
$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + (X^T Q X)^{-1} X^T(\mathbf{y} - \mathbf{q})$$
(Figure: $\frac{1}{q(1-q)}$ plotted against $q$ – the sensitivity to error is smallest where the model is uncertain ($q \approx 0.5$) and grows where it is certain ($q$ near 0 or 1).)
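A sketch of the IRLS iteration above (names are mine; the small ridge term for numerical safety is my addition, not the slide’s):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_irls(X, y, iters=10, ridge=1e-8):
    """Newton / IRLS updates: w <- w + (X^T Q X)^{-1} X^T (y - q)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        q = sigmoid(X @ w)
        Q = np.diag(q * (1 - q))             # diagonal weight matrix
        H = X.T @ Q @ X + ridge * np.eye(d)  # regularized X^T Q X
        w = w + np.linalg.solve(H, X.T @ (y - q))
    return w
```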

Page 36

Modeling the posterior with unequal variance
• Repeating the two-class derivation, but now with class variances $\sigma_1^2 \neq \sigma_2^2$:
$$P(y=1 \mid x) = \frac{P(y=1)\,p(x \mid y=1)}{P(y=1)\,p(x \mid y=1) + P(y=0)\,p(x \mid y=0)} = \frac{\frac{q}{\sigma_1\sqrt{2\pi}}\exp\left(-\frac{1}{2\sigma_1^2}(x - \mu_1)^2\right)}{\frac{q}{\sigma_1\sqrt{2\pi}}\exp\left(-\frac{1}{2\sigma_1^2}(x - \mu_1)^2\right) + \frac{1-q}{\sigma_2\sqrt{2\pi}}\exp\left(-\frac{1}{2\sigma_2^2}(x - \mu_2)^2\right)}$$
$$= \frac{1}{1 + \exp\left(\log\frac{1-q}{q} + \log\frac{\sigma_1}{\sigma_2} + \frac{1}{2\sigma_1^2}(x - \mu_1)^2 - \frac{1}{2\sigma_2^2}(x - \mu_2)^2\right)} = \frac{1}{1 + \exp\left(w_0 + w_1 x + w_2 x^2\right)}$$
• With unequal variances the $x^2$ terms no longer cancel, so $x$ appears quadratically inside the exponential.

Page 37

Logistic regression with basis functions
• By using non-linear bases, we can deal with clusters having unequal variance. Given $D = \{(\mathbf{x}^{(1)}, y^{(1)}), \ldots, (\mathbf{x}^{(N)}, y^{(N)})\}$ with $y \in \{0, 1\}$, model:
$$P(y^{(i)}=1 \mid \mathbf{x}^{(i)}) = q^{(i)} = \frac{1}{1 + \exp\big(-\mathbf{w}^T\mathbf{g}(\mathbf{x}^{(i)})\big)}$$
for a fixed vector of basis functions, e.g. $\mathbf{g}(\mathbf{x}) = (1, x_1, x_2, x_1^2, x_2^2)^T$, stacked into the design matrix:
$$X = \begin{pmatrix} \mathbf{g}(\mathbf{x}^{(1)})^T \\ \vdots \\ \mathbf{g}(\mathbf{x}^{(N)})^T \end{pmatrix}$$
• The IRLS update is unchanged:
$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + (X^T Q X)^{-1} X^T(\mathbf{y} - \mathbf{q})$$
(Figure: estimated posterior probability over the $(x_1, x_2)$ plane.)
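Reusing the IRLS sketch above, a basis expansion only changes the design matrix (this quadratic feature map is an illustrative assumption, not necessarily the one used in the lecture):

```python
import numpy as np

def quad_features(X):
    """g(x) = (1, x1, x2, x1^2, x2^2), applied row-wise to an N x 2 array."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x2**2])

# w = logreg_irls(quad_features(X_raw), y)  # same IRLS update, new design matrix
```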

Page 38

Logistic function for multiple classes with equal variance
• Rather than modeling the posterior directly, let us pick the posterior for one class as our reference and model the ratio of the posterior for each other class with respect to that class. Suppose we have $k$ classes, all Gaussian with shared covariance $\Sigma$:
$$\frac{P(y=1 \mid \mathbf{x})}{P(y=k \mid \mathbf{x})} = \frac{P(y=1)\,p(\mathbf{x} \mid y=1)}{P(y=k)\,p(\mathbf{x} \mid y=k)} = \frac{q_1\exp\left(-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_1)^T\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu}_1)\right)}{q_k\exp\left(-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^T\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu}_k)\right)}$$
$$= \exp\left(\log\frac{q_1}{q_k} - \tfrac{1}{2}\boldsymbol{\mu}_1^T\Sigma^{-1}\boldsymbol{\mu}_1 + \tfrac{1}{2}\boldsymbol{\mu}_k^T\Sigma^{-1}\boldsymbol{\mu}_k + (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_k)^T\Sigma^{-1}\mathbf{x}\right)$$
• So the log posterior ratio is linear in $\mathbf{x}$:
$$\log\frac{P(y=1 \mid \mathbf{x})}{P(y=k \mid \mathbf{x})} = a_1 + \mathbf{w}_1^T\mathbf{x}$$

Page 39

Logistic function for multiple classes with equal variance: soft-max
• For each class $i = 1, \ldots, k-1$, define
$$m_i \equiv \exp\left(a_i + \mathbf{w}_i^T\mathbf{x}\right) = \frac{P(y=i \mid \mathbf{x})}{P(y=k \mid \mathbf{x})}$$
so that $P(y=i \mid \mathbf{x}) = m_i\,P(y=k \mid \mathbf{x})$. Since the posteriors sum to one:
$$\sum_{i=1}^{k-1} m_i\,P(y=k \mid \mathbf{x}) + P(y=k \mid \mathbf{x}) = 1 \quad\Longrightarrow\quad P(y=k \mid \mathbf{x}) = \frac{1}{1 + \sum_{i=1}^{k-1} m_i}$$
$$P(y=i \mid \mathbf{x}) = \frac{m_i}{1 + \sum_{j=1}^{k-1} m_j}$$
• A “soft-max” function.
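A numpy sketch of this construction (names are mine; the last class is the reference):

```python
import numpy as np

def softmax_posteriors(x, A, W):
    """Posteriors for k classes: A is (k-1,), W is (k-1, d)."""
    m = np.exp(A + W @ x)               # m_i = exp(a_i + w_i . x), i = 1..k-1
    p_ref = 1.0 / (1.0 + m.sum())       # P(y = k | x), the reference class
    return np.append(m * p_ref, p_ref)  # all k posteriors; they sum to 1
```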

Page 40

Classification of multiple classes with equal variance
(Figures: the prior-weighted class densities $p(x \mid y=i)\,P(y=i)$ for $i = 1, 2, 3$; the mixture $p(x) = \sum_{i=1}^{3} p(x \mid y=i)\,P(y=i)$; the resulting posterior probabilities; and the log ratios $\log\frac{P(y=1 \mid x)}{P(y=3 \mid x)}$ and $\log\frac{P(y=2 \mid x)}{P(y=3 \mid x)}$, which are linear in $x$.)


Page 42

Fisher’s linear discriminant
• A simple linear discriminant function is a projection of the data down to 1-D.
  • So choose the projection that gives the best separation of the classes. What do we mean by “best separation”?
• An obvious direction to choose is the direction of the line joining the class means.
  • But if the main direction of variance in each class is not orthogonal to this line, this will not give good separation (see the next figure).
• Fisher’s method chooses the direction that maximizes the ratio of between-class variance to within-class variance.
  • This is the direction in which the projected points contain the most information about class membership (under Gaussian assumptions).

Page 43

A picture showing the advantage of Fisher’s linear discriminant: when projected onto the line joining the class means, the classes are not well separated. Fisher chooses a direction that makes the projected classes much tighter, even though their projected means are less far apart.

Page 44

(Fisher) Discriminant Analysis
• Discriminant analysis seeks directions that are efficient for discrimination.
• Consider the problem of projecting data from $d$ dimensions onto a line, with the hope that we can optimize the orientation of the line to minimize error.
• Consider a set of $N$ $d$-dimensional samples $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$, with $n_1$ samples in the subset $D_1$ labeled $\omega_1$ and $n_2$ in the subset $D_2$ labeled $\omega_2$.
• Define a linear combination of the components of $\mathbf{x}$: $y = \mathbf{w}^T\mathbf{x}$, which yields a corresponding set of $N$ samples $y_1, y_2, \ldots, y_N$ divided into $Y_1$ and $Y_2$.
• Our challenge is to find the $\mathbf{w}$ that “maximizes separation”.
• This can be done by considering the ratio of the between-class scatter to the within-class scatter.

Page 45

Separation of the Means and Scatter
• Define a sample mean for class $i$:
$$\mathbf{m}_i = \frac{1}{n_i}\sum_{\mathbf{x} \in D_i}\mathbf{x}$$
• The sample mean for the projected points is:
$$\tilde{m}_i = \frac{1}{n_i}\sum_{y \in Y_i} y = \frac{1}{n_i}\sum_{\mathbf{x} \in D_i}\mathbf{w}^T\mathbf{x} = \mathbf{w}^T\mathbf{m}_i$$
• The sample mean for the projected points is just the projection of the mean (which is expected since this is a linear transformation).
• It follows that the distance between the projected means is:
$$|\tilde{m}_1 - \tilde{m}_2| = |\mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)|$$

Page 46

Separation of the Means and Scatter
• Define a scatter for the projected samples of class $i$:
$$\tilde{s}_i^2 = \sum_{y \in Y_i}(y - \tilde{m}_i)^2$$
• An estimate of the variance of the pooled data is:
$$\frac{1}{n}\left(\tilde{s}_1^2 + \tilde{s}_2^2\right)$$
and $\tilde{s}_1^2 + \tilde{s}_2^2$ is called the within-class scatter.

Page 47

Fisher Linear Discriminant and Scatter
• The Fisher linear discriminant maximizes the criterion:
$$J(\mathbf{w}) = \frac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
• Define a scatter matrix for class $i$:
$$\mathbf{S}_i = \sum_{\mathbf{x} \in D_i}(\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T$$
• The total within-class scatter is: $\mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2$
• We can write the scatter for the projected samples as:
$$\tilde{s}_i^2 = \sum_{\mathbf{x} \in D_i}\left(\mathbf{w}^T\mathbf{x} - \mathbf{w}^T\mathbf{m}_i\right)^2 = \sum_{\mathbf{x} \in D_i}\mathbf{w}^T(\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T\mathbf{w} = \mathbf{w}^T\mathbf{S}_i\mathbf{w}$$
• So the sum of the scatters can be written as:
$$\tilde{s}_1^2 + \tilde{s}_2^2 = \mathbf{w}^T\mathbf{S}_W\mathbf{w}$$

Page 48

Separation of the Projected Means
• The separation of the projected means obeys:
$$(\tilde{m}_1 - \tilde{m}_2)^2 = \left(\mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)\right)^2 = \mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w} = \mathbf{w}^T\mathbf{S}_B\mathbf{w}$$
• where the between-class scatter, $\mathbf{S}_B$, is given by:
$$\mathbf{S}_B = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T$$
• $\mathbf{S}_W$ is the within-class scatter and is proportional to the covariance of the pooled data.
• $\mathbf{S}_B$, the between-class scatter, is symmetric and positive semidefinite, but because it is the outer product of two vectors, its rank is at most one.
• This implies that for any $\mathbf{w}$, $\mathbf{S}_B\mathbf{w}$ is in the direction of $\mathbf{m}_1 - \mathbf{m}_2$.
• The criterion function $J(\mathbf{w})$ can be written as:
$$J(\mathbf{w}) = \frac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}$$

Page 49

Linear Discriminant Analysis
• This ratio is well known as the generalized Rayleigh quotient, and has the well-known property that the vector $\mathbf{w}$ that maximizes $J(\mathbf{w})$ must satisfy:
$$\mathbf{S}_B\mathbf{w} = \lambda\,\mathbf{S}_W\mathbf{w}$$

Page 50

Proof of Fisher
• Show that $J(\mathbf{w})$ is maximized when $\mathbf{S}_B\mathbf{w} = \lambda\,\mathbf{S}_W\mathbf{w}$:
$$J = \frac{f(\mathbf{w})}{g(\mathbf{w})} = \frac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}$$
• At a maximum the gradient vanishes: $\nabla J = (g\,\nabla f - f\,\nabla g)/g^2 = 0$, so $\nabla f = (f/g)\,\nabla g$. With $\nabla f = 2\mathbf{S}_B\mathbf{w}$ and $\nabla g = 2\mathbf{S}_W\mathbf{w}$ this gives:
$$\mathbf{S}_B\mathbf{w} = \lambda\,\mathbf{S}_W\mathbf{w}, \qquad \lambda = J(\mathbf{w})$$

Page 51

Linear Discriminant Analysis
• This ratio is well known as the generalized Rayleigh quotient, and has the well-known property that the vector $\mathbf{w}$ that maximizes $J(\cdot)$ must satisfy:
$$\mathbf{S}_B\mathbf{w} = \lambda\,\mathbf{S}_W\mathbf{w}$$
• Recall: $\mathbf{S}_B\mathbf{w}$ is in the direction of $\mathbf{m}_1 - \mathbf{m}_2$, so the solution is:
$$\mathbf{w} = \mathbf{S}_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$$
This is Fisher’s linear discriminant.
• This solution maps the $d$-dimensional problem to a one-dimensional problem (in this case).
• From earlier, when the conditional densities $p(x \mid \omega_i)$ are multivariate Gaussian with equal covariances, the optimal decision boundary is given by:
$$\mathbf{w}^T\mathbf{x} + w_0 = 0, \qquad \mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$
where $w_0$ is related to the prior probabilities.
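A two-class numpy sketch of Fisher’s discriminant as derived above (names are mine):

```python
import numpy as np

def fisher_direction(X1, X2):
    """w = S_W^{-1} (m1 - m2), with each class given as a row-wise sample matrix."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)         # class scatter matrices
    S2 = (X2 - m2).T @ (X2 - m2)
    Sw = S1 + S2                         # within-class scatter
    return np.linalg.solve(Sw, m1 - m2)  # projection direction w

# Project with y = X @ w, then threshold y to classify.
```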

Page 52

Suppose K>2?

Page 53

Multiple Discriminant Analysis
• For the $c$-class problem in a $d$-dimensional space, the natural generalization involves $c-1$ discriminant functions.
• The within-class scatter is defined as:
$$\mathbf{S}_W = \sum_{i=1}^{c}\mathbf{S}_i = \sum_{i=1}^{c}\sum_{\mathbf{x} \in D_i}(\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T$$
• Define a total mean vector $\mathbf{m}$:
$$\mathbf{m} = \frac{1}{n}\sum_{i=1}^{c} n_i\,\mathbf{m}_i$$
• and a total scatter matrix $\mathbf{S}_T$ by:
$$\mathbf{S}_T = \sum_{\mathbf{x}}(\mathbf{x} - \mathbf{m})(\mathbf{x} - \mathbf{m})^T$$
• The total scatter is related to the within-class scatter:
$$\mathbf{S}_T = \mathbf{S}_W + \mathbf{S}_B, \qquad \mathbf{S}_B = \sum_{i=1}^{c} n_i(\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^T$$
• We have $c-1$ discriminant functions of the form:
$$y_i = \mathbf{w}_i^T\mathbf{x}, \quad i = 1, 2, \ldots, c-1, \qquad\text{i.e.}\quad \mathbf{y} = \mathbf{W}^T\mathbf{x}$$

Page 54

Multiple Discriminant Analysis (Cont.)
• The criterion function is:
$$J(\mathbf{W}) = \frac{|\mathbf{W}^T\mathbf{S}_B\mathbf{W}|}{|\mathbf{W}^T\mathbf{S}_W\mathbf{W}|}$$
where $\mathbf{S}_W$ is as before (pooled covariance) but $\mathbf{S}_B$ is now the scatter of the $c$ class centers, of rank $c-1$:
$$\mathbf{S}_B = \sum_{i=1}^{c} n_i(\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^T$$
• The solution to maximizing $J(\mathbf{W})$ is once again found via an eigenvalue decomposition:
$$|\mathbf{S}_B - \lambda_i\mathbf{S}_W| = 0 \qquad\text{and}\qquad (\mathbf{S}_B - \lambda_i\mathbf{S}_W)\,\mathbf{w}_i = 0$$
• Because $\mathbf{S}_B$ is the sum of $c$ matrices of rank one or less, and because only $c-1$ of these are independent, $\mathbf{S}_B$ is of rank $c-1$ or less. (See Hastie, chapter 4.)
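A sketch of that generalized eigenproblem using scipy (names are mine; $S_W$ is assumed invertible):

```python
import numpy as np
from scipy.linalg import eigh

def multiclass_lda_directions(Sb, Sw, n_dirs):
    """Solve S_B w = lambda S_W w and keep the top n_dirs eigenvectors."""
    evals, evecs = eigh(Sb, Sw)      # generalized symmetric eigenproblem
    order = np.argsort(evals)[::-1]  # largest eigenvalues first
    return evecs[:, order[:n_dirs]]  # columns of W; at most c-1 are useful
```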

Page 55

Spreading out the centers

Page 56

Multi-Fisher
• When the data is well behaved, multi-class Fisher (FLDA?) can work well.
• Maybe you’ll try it???

Page 57

Some leftover time?

Page 58

Discriminant functions for N > 2 classes
• One possibility is to use N two-way discriminant functions.
  • Each function discriminates one class from the rest.
• Another possibility is to use N(N−1)/2 two-way discriminant functions.
  • Each function discriminates between two particular classes.
• Both of these methods have problems.

Page 59

Problems with multi-class discriminant functions
• More than one good answer
• Two-way preferences need not be transitive!

Page 60

A simple solution
• Use N discriminant functions $f_1(x), f_2(x), \ldots, f_k(x), \ldots$ and pick the max.
• This is guaranteed to give consistent and convex decision regions if the $f_k(x)$ are linear:
$$f_k(\mathbf{x}_A) > f_j(\mathbf{x}_A) \ \text{and}\ f_k(\mathbf{x}_B) > f_j(\mathbf{x}_B)$$
implies, for positive $\alpha$, that:
$$f_k\big(\alpha\,\mathbf{x}_A + (1-\alpha)\,\mathbf{x}_B\big) > f_j\big(\alpha\,\mathbf{x}_A + (1-\alpha)\,\mathbf{x}_B\big)$$

Page 61

More time?

Page 62

A way of thinking about the role of the inverse covariance matrix
• If the Gaussian is spherical we don’t need to worry about the covariance matrix.
• So we could start by transforming the data space to make the Gaussian spherical.
  • This is called “whitening” the data.
  • It pre-multiplies by the matrix square root of the inverse covariance matrix.
• In the transformed space, the weight vector is just the difference between the transformed means:
$$\mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)\ \text{for}\ \mathbf{w}^T\mathbf{x} \quad\text{gives the same value as}\quad \mathbf{w}_{\text{aff}} = \Sigma^{-1/2}\boldsymbol{\mu}_1 - \Sigma^{-1/2}\boldsymbol{\mu}_0\ \text{for}\ \mathbf{w}_{\text{aff}}^T\mathbf{x}_{\text{aff}},\ \text{where}\ \mathbf{x}_{\text{aff}} = \Sigma^{-1/2}\mathbf{x}$$
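A small numpy check of this identity (names are mine; the matrix square root comes from an eigendecomposition):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3 * np.eye(3)                  # symmetric positive-definite covariance
mu0, mu1, x = rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)

evals, V = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = V @ np.diag(evals**-0.5) @ V.T  # Sigma^{-1/2}

w = np.linalg.solve(Sigma, mu1 - mu0)            # w = Sigma^{-1} (mu1 - mu0)
w_aff = Sigma_inv_sqrt @ (mu1 - mu0)             # weight vector in the whitened space
x_aff = Sigma_inv_sqrt @ x                       # whitened data point
assert np.isclose(w @ x, w_aff @ x_aff)          # same discriminant value
```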

Page 63

Two ways to train a set of class-specific generative models
• Generative approach: train each model separately to fit the input vectors of that class.
  • Different models can be trained on different cores.
  • It is easy to add a new class without retraining all the other classes.
  • These are significant advantages when the models are harder to train than the simple linear models considered here.
• Discriminative approach: train all of the parameters of both models to maximize the probability of getting the labels right.

Page 64

An example where the two types of training behave very differently
(Figure: two class Gaussians with their decision boundary. What happens to the decision boundary if we add a new red point far to the right? A new Gaussian is fit to the red class.)
• For generative fitting, the red mean moves rightwards but the decision boundary moves leftwards! If you really believe it’s Gaussian data, this is sensible.