Machine Learning 10-601
Recitation 6, Sep 30, 2009
Oznur Tastan
Outline
• Multivariate Gaussians
• Logistic regression
Multivariate Gaussians (also called the "multinormal distribution" or "multivariate normal distribution")
Univariate case: a single mean $\mu$ and variance $\sigma^2$:

$$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Multivariate case: a vector of observations $x$, a vector of means $\mu$, and a covariance matrix $\Sigma$:

$$N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{m/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

where $m$ is the dimension of $x$ and $|\Sigma|$ is the determinant of $\Sigma$.

Multivariate Gaussians

In both the univariate and multivariate cases, the leading factor is a normalization constant that does not depend on $x$; only the exponent depends on $x$, and its quadratic form $(x-\mu)^T \Sigma^{-1} (x-\mu)$ is positive.
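As a quick numerical cross-check (a Python/NumPy sketch, not part of the original slides), the multivariate formula can be implemented directly and compared against the univariate one in dimension m = 1:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """N(x | mu, Sigma): the multivariate Gaussian density above."""
    m = len(x)                                  # dimension of x
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (m / 2) * np.linalg.det(Sigma) ** 0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

def uni_pdf(x, mu, var):
    """N(x | mu, sigma^2): the univariate Gaussian density above."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# In one dimension the two formulas agree (Sigma is the 1x1 matrix [var]):
print(np.isclose(mvn_pdf(np.array([0.3]), np.array([0.0]), np.array([[2.0]])),
                 uni_pdf(0.3, 0.0, 2.0)))
```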
The mean vector

$$\mu = E(x) = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_m \end{bmatrix}$$
Covariance of two random variables

Recall that for two random variables $x_i$, $x_j$:

$$\mathrm{Cov}(x_i, x_j) = \sigma_{ij}^2 = E[(x_i - \mu_i)(x_j - \mu_j)] = E(x_i x_j) - E(x_i)E(x_j)$$
The covariance matrix

$$\Sigma = E[(x - \mu)(x - \mu)^T] \qquad (T \text{ is the transpose operator})$$

$$= E\begin{bmatrix}
(x_1 - \mu_1)(x_1 - \mu_1) & \cdots & (x_1 - \mu_1)(x_m - \mu_m) \\
\vdots & \ddots & \vdots \\
(x_m - \mu_m)(x_1 - \mu_1) & \cdots & (x_m - \mu_m)(x_m - \mu_m)
\end{bmatrix}
= \begin{bmatrix}
\sigma_{11}^2 & \cdots & \sigma_{1m}^2 \\
\vdots & \ddots & \vdots \\
\sigma_{m1}^2 & \cdots & \sigma_{mm}^2
\end{bmatrix}$$

The diagonal entries are the variances: $\mathrm{Var}(x_m) = \mathrm{Cov}(x_m, x_m)$.
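The definition $\Sigma = E[(x-\mu)(x-\mu)^T]$ can also be checked empirically; below is a small sketch (Python/NumPy, not from the slides; the mixing matrix is an arbitrary choice) that estimates it from samples and compares against NumPy's built-in covariance:

```python
import numpy as np

# Draw correlated 2-D samples (the mixing matrix is an arbitrary choice).
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 2)) @ np.array([[1.0, 0.0], [0.5, 1.0]])

mu = X.mean(axis=0)               # the mean vector E(x)
diff = X - mu
Sigma = diff.T @ diff / len(X)    # E[(x - mu)(x - mu)^T], estimated from samples

# The diagonal holds the variances; this matches NumPy's covariance
# (with bias=True, i.e. the same 1/N normalization used here).
print(np.allclose(Sigma, np.cov(X.T, bias=True)))
```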
An example: the 2-variate case

[Figure: the pdf of a 2-variate Gaussian, with its covariance matrix $\Sigma$ and determinant $|\Sigma|$ shown.]
An example: the 2-variate case

Recall that in the general case independence implies uncorrelatedness, but uncorrelatedness does not necessarily imply independence. Multivariate Gaussians are a special case where uncorrelatedness implies independence as well.

[Figure: with a diagonal covariance matrix the joint pdf factorizes into two independent Gaussians, so the variables are independent.]
Diagonal covariance matrix

A diagonal matrix is an $m \times m$ matrix whose off-diagonal terms are zero:

$$\Sigma = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}, \qquad \sigma_{ij}^2 = E[(x_i - \mu_i)(x_j - \mu_j)] = 0 \text{ for } i \neq j$$

If all the variables are independent of each other, the covariance matrix will be a diagonal one. For Gaussians the reverse is also true: if the covariance matrix is diagonal, the variables are independent.
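A small numerical sketch of the factorization claim (Python/NumPy, an addition, not from the slides; the variances and the test point are made-up values): with a diagonal covariance, the joint density equals the product of the univariate densities.

```python
import numpy as np

s1, s2 = 1.5, 0.5
mu = np.array([0.0, 0.0])
Sigma = np.diag([s1**2, s2**2])          # diagonal covariance matrix
x = np.array([0.7, -0.2])

# Joint 2-variate Gaussian density at x:
d = x - mu
joint = (np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d)
         / (2 * np.pi * np.sqrt(np.linalg.det(Sigma))))

def g(v, m, s):
    """Univariate Gaussian density N(v | m, s^2)."""
    return np.exp(-(v - m) ** 2 / (2 * s**2)) / (np.sqrt(2 * np.pi) * s)

# The joint factorizes into the product of two independent Gaussians:
print(np.isclose(joint, g(x[0], 0.0, s1) * g(x[1], 0.0, s2)))
```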
Gaussian Intuitions: Size of Σ

[Figure: three Gaussians, each with mean $\mu = [0\;0]$ and covariance $\Sigma = I$, $\Sigma = 0.6I$, and $\Sigma = 2I$, where $I$ is the identity matrix.]

As Σ becomes larger, the Gaussian becomes more spread out.
Gaussian Intuitions: Off-diagonal Σ

As the off-diagonal entries increase, there is more correlation between the value of x and the value of y.

Gaussian Intuitions: off-diagonal and diagonal

[Figure: decreasing the off-diagonal entries (#1-2); increasing the variance of one dimension on the diagonal (#3).]
Isocontours
Isocontours example

We have shown the pdf of a 2-variate Gaussian with a diagonal covariance matrix. Now, for some constant c, let's find the isocontour: the set of points x where p(x) = c.

Isocontours continued

Setting p(x) = c and solving for $x_1$, $x_2$ gives

$$\frac{(x_1 - \mu_1)^2}{r_1^2} + \frac{(x_2 - \mu_2)^2}{r_2^2} = 1$$

the equation of an ellipse centered on $(\mu_1, \mu_2)$ with axis lengths $2r_1$ and $2r_2$. We had started with a diagonal covariance matrix; in the diagonal-covariance case the ellipses will be axis-aligned.
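To illustrate the isocontour claim numerically (a Python/NumPy sketch, not from the slides; the mean, variances, and scale c are made-up values): points on an axis-aligned ellipse around μ with semi-axes $r_1 = c\,\sigma_1$, $r_2 = c\,\sigma_2$ all have the same density under a diagonal-covariance Gaussian.

```python
import numpy as np

mu = np.array([1.0, -1.0])
s1, s2 = 2.0, 0.5
Sigma = np.diag([s1**2, s2**2])        # diagonal covariance

def pdf(x):
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) / (2 * np.pi * s1 * s2)

# Parametrize the ellipse (x1-mu1)^2/r1^2 + (x2-mu2)^2/r2^2 = 1
# with r1 = c*s1 and r2 = c*s2:
c = 1.3
thetas = np.linspace(0.0, 2 * np.pi, 9)
vals = [pdf(mu + np.array([c * s1 * np.cos(t), c * s2 * np.sin(t)]))
        for t in thetas]
print(np.allclose(vals, vals[0]))      # same density all around the contour
```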
Don’t confuse Multivariate Gaussians with Mixtures of Gaussians
Mixture of Gaussians:

$$p(x) = \sum_{k=1}^{K} \pi_k \, N(x \mid \mu_k, \Sigma_k)$$

where each $N(x \mid \mu_k, \Sigma_k)$ is a component and $\pi_k$ is its mixing coefficient (the figure uses K = 3).
Logistic regression
Linear regression: the outcome variable Y is continuous.
Logistic regression: the outcome variable Y is binary.
Logistic function (sigmoid function)

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Notice that σ(z) is always bounded between [0, 1] (a nice property): as z increases, σ(z) approaches 1, and as z decreases, σ(z) approaches 0. The term $e^{-z}$ lies in $[0, \infty)$.
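A minimal sketch of the sigmoid and the properties just listed (Python, not from the slides):

```python
import math

def sigmoid(z):
    """The logistic function: sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))            # 0.5, the midpoint
print(sigmoid(10) > 0.999)   # approaches 1 as z increases
print(sigmoid(-10) < 0.001)  # approaches 0 as z decreases
```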
Logistic regression

Learn a function to map X values to Y given data $(X_1, Y_1), \ldots, (X_N, Y_N)$.

The function we try to learn is P(Y|X). X can be continuous or discrete; Y is discrete.
Logistic regression

$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\left(w_0 + \sum_{i=1}^{N} w_i X_i\right)}$$

$$P(Y = 0 \mid X) = 1 - P(Y = 1 \mid X) = \frac{\exp\left(w_0 + \sum_{i=1}^{N} w_i X_i\right)}{1 + \exp\left(w_0 + \sum_{i=1}^{N} w_i X_i\right)}$$
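As a sanity check (a Python sketch, not from the slides; the weights and inputs are made-up numbers), the two expressions are valid probabilities and sum to 1 for any X and w:

```python
import math

def p_y1(x, w0, w):
    """P(Y=1|X) = 1 / (1 + exp(w0 + sum_i w_i X_i))."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(z))

def p_y0(x, w0, w):
    """P(Y=0|X) = exp(z) / (1 + exp(z)) with the same z."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return math.exp(z) / (1.0 + math.exp(z))

x, w0, w = [1.0, -2.0], 0.5, [0.3, 0.8]    # arbitrary made-up values
total = p_y0(x, w0, w) + p_y1(x, w0, w)
print(abs(total - 1.0) < 1e-12)             # the two probabilities sum to 1
```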
Classification

If the ratio

$$\frac{P(Y = 0 \mid X)}{P(Y = 1 \mid X)} > 1$$

then Y = 0 is more probable than Y = 1 given X. Substituting the expressions for $P(Y = 1 \mid X)$ and $P(Y = 0 \mid X)$, the ratio simplifies to

$$\frac{P(Y = 0 \mid X)}{P(Y = 1 \mid X)} = \exp\left(w_0 + \sum_{i=1}^{N} w_i X_i\right)$$

Taking the log of both sides gives the classification rule: predict Y = 0 if

$$w_0 + \sum_{i=1}^{N} w_i X_i > 0$$
Logistic regression is a linear classifier

$$w_0 + \sum_{i=1}^{N} w_i X_i > 0 \;\Rightarrow\; Y = 0$$

$$w_0 + \sum_{i=1}^{N} w_i X_i < 0 \;\Rightarrow\; Y = 1$$

The decision boundary is the hyperplane $w_0 + \sum_{i=1}^{N} w_i X_i = 0$, where $P(Y = 0 \mid X) = P(Y = 1 \mid X)$.
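The linear decision rule can be sketched in a few lines (Python, an illustration, not from the slides; the example weights are made-up values):

```python
def predict(x, w0, w):
    """Predict Y=0 if w0 + sum_i w_i x_i > 0, else Y=1 (the rule above)."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 0 if z > 0 else 1

# With w0 = 2, w1 = -1 the decision boundary sits at X1 = 2:
print(predict([0.0], 2.0, [-1.0]))   # 0  (X1 < 2 falls on the Y=0 side)
print(predict([3.0], 2.0, [-1.0]))   # 1  (X1 > 2 falls on the Y=1 side)
```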
Classification

[Figure: two plots of $P(Y = 1 \mid X) = 1/(1 + \exp(w_0 + w_1 X_1))$ against $X_1$.]

Left: $w_0 = +2$, $w_1 = -1$. Notice the curve is 0.5 when $X_1 = 2$, since $2 + (-1)\cdot 2 = 0$. To check that $w_0 = +2$, evaluate at $X_1 = 0$: the value is $1/(1 + e^2) \approx 0.1$. The region $X_1 < 2$ is classified as Y = 0.

Right: $w_0 = 0$, $w_1 = -1$. The curve is 0.5 when $X_1 = 0$, since $0 + (-1)\cdot 0 = 0$. The region $X_1 < 0$ is classified as Y = 0.
Estimating the parameters

Given data $(X_1, Y_1), \ldots, (X_N, Y_N)$,

Objective: train the model to get the w that maximizes the conditional likelihood:

$$w = \arg\max_w \prod_{i=1}^{N} P(Y_i \mid X_i, w)$$
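One standard way to carry out this maximization is batch gradient ascent on the conditional log-likelihood. Below is a hedged sketch (Python, not from the slides; the learning rate, step count, and toy data are made up), using the convention from the earlier slides that $P(Y = 0 \mid X) = \exp(z)/(1+\exp(z))$ with $z = w_0 + \sum_i w_i x_i$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(X, Y, eta=0.1, steps=2000):
    """Batch gradient ascent on sum_l ln P(Y_l | X_l, w),
    with P(Y=0|X) = exp(z)/(1+exp(z)) = sigmoid(z)."""
    n = len(X[0])
    w0, w = 0.0, [0.0] * n
    for _ in range(steps):
        g0, g = 0.0, [0.0] * n
        for x, y in zip(X, Y):
            z = w0 + sum(wi * xi for wi, xi in zip(w, x))
            err = (1 if y == 0 else 0) - sigmoid(z)   # indicator(Y=0) - P(Y=0|x,w)
            g0 += err
            for i in range(n):
                g[i] += err * x[i]
        w0 += eta * g0                                # ascend the gradient
        for i in range(n):
            w[i] += eta * g[i]
    return w0, w

# Made-up 1-D toy data where Y=0 for small X1 (boundary near X1 = 2):
X = [[0.0], [1.0], [1.5], [2.5], [3.0], [4.0]]
Y = [0, 0, 0, 1, 1, 1]
w0, w = fit(X, Y)
print(all((0 if w0 + w[0] * x[0] > 0 else 1) == y for x, y in zip(X, Y)))
```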
Difference between Naïve Bayes and Logistic Regression

The loss function! They optimize different functions and so obtain different solutions:

Naïve Bayes: argmax P(X|Y) P(Y)
Logistic Regression: argmax P(Y|X)
Naïve Bayes and Logistic Regression
• Have a look at Tom Mitchell's book chapter: http://www.cs.cmu.edu/%7Etom/mlbook/NBayesLogReg.pdf (linked under the Sep 23 Lecture Readings as well)
Some Matlab tips for the last question in HW3

• The logical function might be useful for dividing the data into splits. An example of logical in use (please read the Matlab help):

S = X(logical(X(:,1)==1), :)

This will also work: S = X(X(:,1)==1, :)

This subsets the portion of the X matrix where the first column has the value 1 and puts it in matrix S (like Data > Filter in Excel).
• Matlab has functions for mean, std, sum, inv, log2.
• Scaling data to zero mean and unit variance: shift by the mean (subtract the mean from every element of the vector), then scale so that it has variance = 1 (divide every element of the vector by the standard deviation).
• To do that on matrices you will need the repmat function; have a look at it, otherwise the sizes of the matrices will not match.
• For elementwise multiplication use .*
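The zero-mean/unit-variance scaling described above, sketched in Python/NumPy for comparison (broadcasting plays the role repmat plays in Matlab; the data matrix is made up, and this is a side note, not part of the original tips):

```python
import numpy as np

# Made-up data: rows are examples, columns are features.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Subtract each column's mean, divide by each column's standard deviation;
# broadcasting aligns the per-column statistics against the full matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(Z.mean(axis=0), 0.0))  # zero mean in every column
print(np.allclose(Z.std(axis=0), 1.0))   # unit variance in every column
```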
References
• http://www.stanford.edu/class/cs224s/lec/224s.09.lec10.pdf
• http://www.cs.cmu.edu/%7Etom/mlbook/NBayesLogReg.pdf
• Carlos Guestrin lecture notes
• Andrew Ng lecture notes