Machine Learning 10-601
Recitation 6, Sep 30, 2009
Oznur Tastan
Outline
• Multivariate Gaussians
• Logistic regression
Multivariate Gaussians (also called the "multinormal distribution" or "multivariate normal distribution")
Univariate case: a single mean $\mu$ and variance $\sigma^2$:

$$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Multivariate case: a vector of observations $x$, a vector of means $\mu$, and a covariance matrix $\Sigma$:

$$N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{m/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

where $m$ is the dimension of $x$ and $|\Sigma|$ is the determinant of $\Sigma$.

Multivariate Gaussians

In both the univariate and multivariate cases, the leading factor is a normalization constant that does not depend on $x$; only the exponent depends on $x$, and its quadratic form $(x-\mu)^T \Sigma^{-1} (x-\mu)$ is positive.
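As a quick numerical cross-check (a Python/NumPy sketch, not part of the original slides), the multivariate formula can be implemented directly and compared against the univariate one in dimension m = 1:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """N(x | mu, Sigma): the multivariate Gaussian density above."""
    m = len(x)                                  # dimension of x
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (m / 2) * np.linalg.det(Sigma) ** 0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

def uni_pdf(x, mu, var):
    """N(x | mu, sigma^2): the univariate Gaussian density above."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# In one dimension the two formulas agree (Sigma is the 1x1 matrix [var]):
print(np.isclose(mvn_pdf(np.array([0.3]), np.array([0.0]), np.array([[2.0]])),
                 uni_pdf(0.3, 0.0, 2.0)))
```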
The mean vector

$$\mu = E(x) = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_m \end{bmatrix}$$
Covariance of two random variables

Recall that for two random variables $x_i$, $x_j$:

$$\mathrm{Cov}(x_i, x_j) = \sigma_{ij}^2 = E[(x_i - \mu_i)(x_j - \mu_j)] = E(x_i x_j) - E(x_i)E(x_j)$$
The covariance matrix

$$\Sigma = E[(x - \mu)(x - \mu)^T] \qquad (T \text{ is the transpose operator})$$

$$= E\begin{bmatrix}
(x_1 - \mu_1)(x_1 - \mu_1) & \cdots & (x_1 - \mu_1)(x_m - \mu_m) \\
\vdots & \ddots & \vdots \\
(x_m - \mu_m)(x_1 - \mu_1) & \cdots & (x_m - \mu_m)(x_m - \mu_m)
\end{bmatrix}
= \begin{bmatrix}
\sigma_{11}^2 & \cdots & \sigma_{1m}^2 \\
\vdots & \ddots & \vdots \\
\sigma_{m1}^2 & \cdots & \sigma_{mm}^2
\end{bmatrix}$$

The diagonal entries are the variances: $\mathrm{Var}(x_m) = \mathrm{Cov}(x_m, x_m)$.
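The definition $\Sigma = E[(x-\mu)(x-\mu)^T]$ can also be checked empirically; below is a small sketch (Python/NumPy, not from the slides; the mixing matrix is an arbitrary choice) that estimates it from samples and compares against NumPy's built-in covariance:

```python
import numpy as np

# Draw correlated 2-D samples (the mixing matrix is an arbitrary choice).
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 2)) @ np.array([[1.0, 0.0], [0.5, 1.0]])

mu = X.mean(axis=0)               # the mean vector E(x)
diff = X - mu
Sigma = diff.T @ diff / len(X)    # E[(x - mu)(x - mu)^T], estimated from samples

# The diagonal holds the variances; this matches NumPy's covariance
# (with bias=True, i.e. the same 1/N normalization used here).
print(np.allclose(Sigma, np.cov(X.T, bias=True)))
```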
An example: the 2-variate case

[Figure: the pdf of a 2-variate Gaussian, with its covariance matrix $\Sigma$ and determinant $|\Sigma|$ shown.]
An example: the 2-variate case

Recall that in the general case independence implies uncorrelatedness, but uncorrelatedness does not necessarily imply independence. Multivariate Gaussians are a special case where uncorrelatedness implies independence as well.

[Figure: with a diagonal covariance matrix the joint pdf factorizes into two independent Gaussians, so the variables are independent.]
Diagonal covariance matrix

A diagonal matrix is an $m \times m$ matrix whose off-diagonal terms are zero:

$$\Sigma = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}, \qquad \sigma_{ij}^2 = E[(x_i - \mu_i)(x_j - \mu_j)] = 0 \text{ for } i \neq j$$

If all the variables are independent of each other, the covariance matrix will be a diagonal one. For Gaussians the reverse is also true: if the covariance matrix is diagonal, the variables are independent.
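A small numerical sketch of the factorization claim (Python/NumPy, an addition, not from the slides; the variances and the test point are made-up values): with a diagonal covariance, the joint density equals the product of the univariate densities.

```python
import numpy as np

s1, s2 = 1.5, 0.5
mu = np.array([0.0, 0.0])
Sigma = np.diag([s1**2, s2**2])          # diagonal covariance matrix
x = np.array([0.7, -0.2])

# Joint 2-variate Gaussian density at x:
d = x - mu
joint = (np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d)
         / (2 * np.pi * np.sqrt(np.linalg.det(Sigma))))

def g(v, m, s):
    """Univariate Gaussian density N(v | m, s^2)."""
    return np.exp(-(v - m) ** 2 / (2 * s**2)) / (np.sqrt(2 * np.pi) * s)

# The joint factorizes into the product of two independent Gaussians:
print(np.isclose(joint, g(x[0], 0.0, s1) * g(x[1], 0.0, s2)))
```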
Gaussian Intuitions: Size of Σ

[Figure: three Gaussians, each with mean $\mu = [0\;0]$ and covariance $\Sigma = I$, $\Sigma = 0.6I$, and $\Sigma = 2I$, where $I$ is the identity matrix.]

As Σ becomes larger, the Gaussian becomes more spread out.
Gaussian Intuitions: Off-diagonal Σ

As the off-diagonal entries increase, there is more correlation between the value of x and the value of y.

Gaussian Intuitions: off-diagonal and diagonal

[Figure: decreasing the off-diagonal entries (#1-2); increasing the variance of one dimension on the diagonal (#3).]
Isocontours
Isocontours example

We have shown the pdf of a 2-variate Gaussian with a diagonal covariance matrix. Now, for some constant c, let's find the isocontour: the set of points x where p(x) = c.

Isocontours continued

Setting p(x) = c and solving for $x_1$, $x_2$ gives

$$\frac{(x_1 - \mu_1)^2}{r_1^2} + \frac{(x_2 - \mu_2)^2}{r_2^2} = 1$$

the equation of an ellipse centered on $(\mu_1, \mu_2)$ with axis lengths $2r_1$ and $2r_2$. We had started with a diagonal covariance matrix; in the diagonal-covariance case the ellipses will be axis-aligned.
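To illustrate the isocontour claim numerically (a Python/NumPy sketch, not from the slides; the mean, variances, and scale c are made-up values): points on an axis-aligned ellipse around μ with semi-axes $r_1 = c\,\sigma_1$, $r_2 = c\,\sigma_2$ all have the same density under a diagonal-covariance Gaussian.

```python
import numpy as np

mu = np.array([1.0, -1.0])
s1, s2 = 2.0, 0.5
Sigma = np.diag([s1**2, s2**2])        # diagonal covariance

def pdf(x):
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) / (2 * np.pi * s1 * s2)

# Parametrize the ellipse (x1-mu1)^2/r1^2 + (x2-mu2)^2/r2^2 = 1
# with r1 = c*s1 and r2 = c*s2:
c = 1.3
thetas = np.linspace(0.0, 2 * np.pi, 9)
vals = [pdf(mu + np.array([c * s1 * np.cos(t), c * s2 * np.sin(t)]))
        for t in thetas]
print(np.allclose(vals, vals[0]))      # same density all around the contour
```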
Don’t confuse Multivariate Gaussians with Mixtures of Gaussians
Mixture of Gaussians:

$$p(x) = \sum_{k=1}^{K} \pi_k \, N(x \mid \mu_k, \Sigma_k)$$

where each $N(x \mid \mu_k, \Sigma_k)$ is a component and $\pi_k$ is its mixing coefficient (the figure uses K = 3).
Logistic regression
Linear regression: the outcome variable Y is continuous.
Logistic regression: the outcome variable Y is binary.
Logistic function (sigmoid function)

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Notice that σ(z) is always bounded between [0, 1] (a nice property): as z increases, σ(z) approaches 1, and as z decreases, σ(z) approaches 0. The term $e^{-z}$ lies in $[0, \infty)$.
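A minimal sketch of the sigmoid and the properties just listed (Python, not from the slides):

```python
import math

def sigmoid(z):
    """The logistic function: sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))            # 0.5, the midpoint
print(sigmoid(10) > 0.999)   # approaches 1 as z increases
print(sigmoid(-10) < 0.001)  # approaches 0 as z decreases
```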
Logistic regression

Learn a function to map X values to Y given data $(X_1, Y_1), \ldots, (X_N, Y_N)$.

The function we try to learn is P(Y|X). X can be continuous or discrete; Y is discrete.
Logistic regression

$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\left(w_0 + \sum_{i=1}^{N} w_i X_i\right)}$$

$$P(Y = 0 \mid X) = 1 - P(Y = 1 \mid X) = \frac{\exp\left(w_0 + \sum_{i=1}^{N} w_i X_i\right)}{1 + \exp\left(w_0 + \sum_{i=1}^{N} w_i X_i\right)}$$
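As a sanity check (a Python sketch, not from the slides; the weights and inputs are made-up numbers), the two expressions are valid probabilities and sum to 1 for any X and w:

```python
import math

def p_y1(x, w0, w):
    """P(Y=1|X) = 1 / (1 + exp(w0 + sum_i w_i X_i))."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(z))

def p_y0(x, w0, w):
    """P(Y=0|X) = exp(z) / (1 + exp(z)) with the same z."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return math.exp(z) / (1.0 + math.exp(z))

x, w0, w = [1.0, -2.0], 0.5, [0.3, 0.8]    # arbitrary made-up values
total = p_y0(x, w0, w) + p_y1(x, w0, w)
print(abs(total - 1.0) < 1e-12)             # the two probabilities sum to 1
```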
Classification

If the ratio

$$\frac{P(Y = 0 \mid X)}{P(Y = 1 \mid X)} > 1$$

then Y = 0 is more probable than Y = 1 given X. Substituting the expressions for $P(Y = 1 \mid X)$ and $P(Y = 0 \mid X)$, the ratio simplifies to

$$\frac{P(Y = 0 \mid X)}{P(Y = 1 \mid X)} = \exp\left(w_0 + \sum_{i=1}^{N} w_i X_i\right)$$

Taking the log of both sides gives the classification rule: predict Y = 0 if

$$w_0 + \sum_{i=1}^{N} w_i X_i > 0$$
Logistic regression is a linear classifier

$$w_0 + \sum_{i=1}^{N} w_i X_i > 0 \;\Rightarrow\; Y = 0$$

$$w_0 + \sum_{i=1}^{N} w_i X_i < 0 \;\Rightarrow\; Y = 1$$

The decision boundary is the hyperplane $w_0 + \sum_{i=1}^{N} w_i X_i = 0$, where $P(Y = 0 \mid X) = P(Y = 1 \mid X)$.
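The linear decision rule can be sketched in a few lines (Python, an illustration, not from the slides; the example weights are made-up values):

```python
def predict(x, w0, w):
    """Predict Y=0 if w0 + sum_i w_i x_i > 0, else Y=1 (the rule above)."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 0 if z > 0 else 1

# With w0 = 2, w1 = -1 the decision boundary sits at X1 = 2:
print(predict([0.0], 2.0, [-1.0]))   # 0  (X1 < 2 falls on the Y=0 side)
print(predict([3.0], 2.0, [-1.0]))   # 1  (X1 > 2 falls on the Y=1 side)
```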
Classification

[Figure: two plots of $P(Y = 1 \mid X) = 1/(1 + \exp(w_0 + w_1 X_1))$ against $X_1$.]

Left: $w_0 = +2$, $w_1 = -1$. Notice the curve is 0.5 when $X_1 = 2$, since $2 + (-1)\cdot 2 = 0$. To check that $w_0 = +2$, evaluate at $X_1 = 0$: the value is $1/(1 + e^2) \approx 0.1$. The region $X_1 < 2$ is classified as Y = 0.

Right: $w_0 = 0$, $w_1 = -1$. The curve is 0.5 when $X_1 = 0$, since $0 + (-1)\cdot 0 = 0$. The region $X_1 < 0$ is classified as Y = 0.
Estimating the parameters

Given data $(X_1, Y_1), \ldots, (X_N, Y_N)$,

Objective: train the model to get the w that maximizes the conditional likelihood:

$$w = \arg\max_w \prod_{i=1}^{N} P(Y_i \mid X_i, w)$$
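One standard way to carry out this maximization is batch gradient ascent on the conditional log-likelihood. Below is a hedged sketch (Python, not from the slides; the learning rate, step count, and toy data are made up), using the convention from the earlier slides that $P(Y = 0 \mid X) = \exp(z)/(1+\exp(z))$ with $z = w_0 + \sum_i w_i x_i$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(X, Y, eta=0.1, steps=2000):
    """Batch gradient ascent on sum_l ln P(Y_l | X_l, w),
    with P(Y=0|X) = exp(z)/(1+exp(z)) = sigmoid(z)."""
    n = len(X[0])
    w0, w = 0.0, [0.0] * n
    for _ in range(steps):
        g0, g = 0.0, [0.0] * n
        for x, y in zip(X, Y):
            z = w0 + sum(wi * xi for wi, xi in zip(w, x))
            err = (1 if y == 0 else 0) - sigmoid(z)   # indicator(Y=0) - P(Y=0|x,w)
            g0 += err
            for i in range(n):
                g[i] += err * x[i]
        w0 += eta * g0                                # ascend the gradient
        for i in range(n):
            w[i] += eta * g[i]
    return w0, w

# Made-up 1-D toy data where Y=0 for small X1 (boundary near X1 = 2):
X = [[0.0], [1.0], [1.5], [2.5], [3.0], [4.0]]
Y = [0, 0, 0, 1, 1, 1]
w0, w = fit(X, Y)
print(all((0 if w0 + w[0] * x[0] > 0 else 1) == y for x, y in zip(X, Y)))
```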
Difference between Naïve Bayes and Logistic Regression

The loss function! They optimize different functions and so obtain different solutions:

Naïve Bayes: argmax P(X|Y) P(Y)
Logistic Regression: argmax P(Y|X)
Naïve Bayes and Logistic Regression
• Have a look at Tom Mitchell's book chapter: http://www.cs.cmu.edu/%7Etom/mlbook/NBayesLogReg.pdf (linked under the Sep 23 Lecture Readings as well)
Some Matlab tips for the last question in HW3

• The logical function might be useful for dividing the data into splits. An example of logical in use (please read the Matlab help):

S = X(logical(X(:,1)==1), :)

This will also work: S = X(X(:,1)==1, :)

This subsets the portion of the X matrix where the first column has the value 1 and puts it in matrix S (like Data > Filter in Excel).
• Matlab has functions for mean, std, sum, inv, log2.
• Scaling data to zero mean and unit variance: shift by the mean (subtract the mean from every element of the vector), then scale so that it has variance = 1 (divide every element of the vector by the standard deviation).
• To do that on matrices you will need the repmat function; have a look at it, otherwise the sizes of the matrices will not match.
• For elementwise multiplication use .*
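The zero-mean/unit-variance scaling described above, sketched in Python/NumPy for comparison (broadcasting plays the role repmat plays in Matlab; the data matrix is made up, and this is a side note, not part of the original tips):

```python
import numpy as np

# Made-up data: rows are examples, columns are features.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Subtract each column's mean, divide by each column's standard deviation;
# broadcasting aligns the per-column statistics against the full matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(Z.mean(axis=0), 0.0))  # zero mean in every column
print(np.allclose(Z.std(axis=0), 1.0))   # unit variance in every column
```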
References
• http://www.stanford.edu/class/cs224s/lec/224s.09.lec10.pdf
• http://www.cs.cmu.edu/%7Etom/mlbook/NBayesLogReg.pdf
• Carlos Guestrin lecture notes
• Andrew Ng lecture notes