Linear Methods for Classification
Based on Chapter 4 of Hastie, Tibshirani, and Friedman
David Madigan
Predictive Modeling
Goal: learn a mapping y = f(x; θ)
Need: 1. A model structure
2. A score function
3. An optimization strategy
Categorical y ∈ {c1,…,cm}: classification
Real-valued y: regression
Note: usually assume {c1,…,cm} are mutually exclusive and exhaustive
Probabilistic Classification
Let p(ck) = prob. that a randomly chosen object comes from ck
Objects from ck have: p(x | ck, θk) (e.g., MVN)
Then: p(ck | x) ∝ p(x | ck, θk) p(ck)
Bayes Error Rate: $p_B^* = \int \left(1 - \max_k p(c_k \mid x)\right) p(x)\, dx$
•Lower bound on the error rate of any classifier
[Figure: example class-conditional densities; here the Bayes error rate is about 6%]
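The integral above can be evaluated numerically in simple cases. The sketch below (not from the slides) computes the Bayes error rate for two hypothetical 1-D Gaussian classes; the means, variances, and priors are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Two hypothetical classes: p(x | c_k) Gaussian, equal priors (assumptions).
priors = [0.5, 0.5]
densities = [norm(0.0, 1.0), norm(2.5, 1.0)]

x = np.linspace(-6.0, 9.0, 20001)                   # integration grid
joint = np.array([pk * d.pdf(x) for pk, d in zip(priors, densities)])
marginal = joint.sum(axis=0)                        # p(x)
post_max = joint.max(axis=0) / marginal             # max_k p(c_k | x)

bayes_error = np.trapz((1.0 - post_max) * marginal, x)
print(f"Bayes error rate: {bayes_error:.4f}")       # ~0.106 for these settings
```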
Classifier Types
Discriminative: model p(ck | x )
- e.g. logistic regression, CART
Generative: model p(x | ck, θk)
- e.g. “Bayesian classifiers”, LDA
Regression for Binary Classification
•Can fit a linear regression model to a 0/1 response
•Predicted values are not necessarily between zero and one
[Figure: linear regression fit to a 0/1 response, x vs. y (data: zeroOneR.txt)]
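A quick way to see this: the sketch below (illustrative, with synthetic data standing in for zeroOneR.txt) fits ordinary least squares to a 0/1 response and checks the range of the fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = (x + rng.normal(0, 0.8, 200) > 0).astype(float)   # synthetic 0/1 response

X = np.column_stack([np.ones_like(x), x])             # intercept + slope
beta, *_ = np.linalg.lstsq(X, y, rcond=None)          # OLS fit

fitted = X @ beta
print("fitted range:", fitted.min(), "to", fitted.max())
# The range typically extends below 0 and above 1: not valid probabilities.
```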
•With p > 1, the decision boundary is linear, e.g. 0.5 = b0 + b1x1 + b2x2
[Figure: linear decision boundary in the (x1, x2) plane]
Linear Discriminant Analysis
K classes; X an n × p data matrix.
p(ck | x) ∝ p(x | ck, θk) p(ck)
Could model each class density as multivariate normal:
$p(x \mid c_k) = \frac{1}{(2\pi)^{p/2}\,|\Sigma_k|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right)$
LDA assumes Σk = Σ for all k. Then:
$\log \frac{p(c_k \mid x)}{p(c_l \mid x)} = \log \frac{p(c_k)}{p(c_l)} - \tfrac{1}{2}(\mu_k + \mu_l)^T \Sigma^{-1}(\mu_k - \mu_l) + x^T \Sigma^{-1}(\mu_k - \mu_l)$
This is linear in x.
Linear Discriminant Analysis (cont.)
It follows that the classifier should predict $\arg\max_k \delta_k(x)$, where
$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \tfrac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log p(c_k)$
is the “linear discriminant function”.
If we don’t assume the Σk’s are identical, we get Quadratic DA:
$\delta_k(x) = -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k) + \log p(c_k)$
Linear Discriminant Analysis (cont.)
Can estimate the LDA parameters via maximum likelihood:
$\hat{\mu}_k = \sum_{i:\, y_i = k} x_i / N_k$
$\hat{p}(c_k) = N_k / N$
$\hat{\Sigma} = \sum_{k=1}^{K} \sum_{i:\, y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T / (N - K)$
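A minimal sketch of these estimates and the discriminant rule, assuming X is an n × p NumPy array and y holds integer class labels 0, …, K−1 (names and shapes are my assumptions, not from the slides):

```python
import numpy as np

def lda_fit(X, y):
    n, p = X.shape
    classes = np.unique(y)
    mu = np.array([X[y == k].mean(axis=0) for k in classes])   # mu_k hat
    priors = np.array([np.mean(y == k) for k in classes])      # N_k / N
    Sigma = np.zeros((p, p))
    for k, m in zip(classes, mu):
        R = X[y == k] - m
        Sigma += R.T @ R                                       # within-class scatter
    Sigma /= n - len(classes)                                  # pooled covariance
    return mu, priors, Sigma

def lda_predict(X, mu, priors, Sigma):
    Sinv = np.linalg.inv(Sigma)
    # delta_k(x) = x^T Sinv mu_k - (1/2) mu_k^T Sinv mu_k + log p(c_k)
    deltas = X @ Sinv @ mu.T - 0.5 * np.sum((mu @ Sinv) * mu, axis=1) + np.log(priors)
    return np.argmax(deltas, axis=1)                           # argmax_k delta_k(x)
```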
[Figure: LDA vs. QDA decision boundaries]
LDA (cont.)
•Fisher is optimal if the classes are MVN with a common covariance matrix
•Computational complexity O(mp²n)
Logistic Regression
Note that LDA is linear in x:
$\log \frac{p(c_k \mid x)}{p(c_0 \mid x)} = \log \frac{p(c_k)}{p(c_0)} - \tfrac{1}{2}(\mu_k + \mu_0)^T \Sigma^{-1}(\mu_k - \mu_0) + x^T \Sigma^{-1}(\mu_k - \mu_0) = \alpha_{k0} + \alpha_k^T x$
Linear logistic regression looks the same:
$\log \frac{p(c_k \mid x)}{p(c_0 \mid x)} = \beta_{k0} + \beta_k^T x$
But the estimation procedure for the coefficients is different: LDA maximizes the joint likelihood [y, X]; logistic regression maximizes the conditional likelihood [y | X]. The two usually give similar predictions.
Logistic Regression MLE
For the two-class case, the likelihood is:
$l(\beta) = \sum_{i=1}^{n} \left[ y_i \log p(x_i; \beta) + (1 - y_i) \log(1 - p(x_i; \beta)) \right]$
Since $\log \frac{p(x; \beta)}{1 - p(x; \beta)} = \beta^T x$, we have $\log p(x; \beta) = \beta^T x - \log(1 + \exp(\beta^T x))$, so:
$l(\beta) = \sum_{i=1}^{n} \left[ y_i \beta^T x_i - \log(1 + \exp(\beta^T x_i)) \right]$
To maximize, we need to solve the (non-linear) score equations:
$\frac{dl}{d\beta} = \sum_{i=1}^{n} x_i \left( y_i - p(x_i; \beta) \right) = 0$
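These equations have no closed form; Newton-Raphson is the standard approach. A minimal sketch, assuming X is an n × p array with an intercept column and y is a 0/1 vector:

```python
import numpy as np

def logistic_mle(X, y, iters=25):
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))      # p(x_i; beta)
        score = X.T @ (y - p)                    # dl/dbeta, as above
        W = p * (1.0 - p)
        hessian = -(X * W[:, None]).T @ X        # d^2 l / dbeta dbeta^T
        beta -= np.linalg.solve(hessian, score)  # Newton step
    return beta
```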
Logistic Regression Modeling
South African Heart Disease Example (y = MI)

           Coef.   S.E.   Z score (Wald)
Intercept  -4.130  0.964  -4.285
sbp         0.006  0.006   1.023
tobacco     0.080  0.026   3.034
ldl         0.185  0.057   3.219
famhist     0.939  0.225   4.178
obesity    -0.035  0.029  -1.187
alcohol     0.001  0.004   0.136
age         0.043  0.010   4.184
Regularized Logistic Regression
•Ridge/LASSO logistic regression
•Successful implementation with over 100,000 predictor variables
•Can also regularize discriminant analysis
$\hat{w} = \arg\inf_w \; \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp(-y_i\, w^T x_i)\right) + \lambda \sum_j |w_j|$
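A minimal sketch of a regularized fit by gradient descent, using labels y_i ∈ {−1, +1} as in the objective above. I substitute a ridge (squared) penalty so the gradient is smooth; the L1 penalty in the objective would need, e.g., proximal steps. The step size and lambda are illustrative assumptions.

```python
import numpy as np

def ridge_logistic(X, y, lam=0.1, step=0.1, iters=1000):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        s = 1.0 / (1.0 + np.exp(y * (X @ w)))             # sigma(-y_i w^T x_i)
        grad = -(X * (y * s)[:, None]).mean(axis=0) + 2.0 * lam * w
        w -= step * grad                                  # gradient descent step
    return w
```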
Simple Two-Class Perceptron
Define: $h(x) = \sum_{j=1}^{p} w_j x_j$
Classify as class 1 if h(x) > 0, class 2 otherwise
Score function: # misclassification errors on training data
For training, replace class-2 xj’s by −xj; now need h(x) > 0 for all training points
Initialize weight vector w
Repeat one or more times:
  For each training data point xi:
    If point correctly classified, do nothing
    Else w ← w + xi
Guaranteed to converge to a separating hyperplane (if one exists)
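A minimal sketch of this loop, assuming the class-2 rows of X have already been negated so every row should satisfy wᵀxᵢ > 0:

```python
import numpy as np

def perceptron(X, passes=10):
    w = np.zeros(X.shape[1])        # initialize weight vector
    for _ in range(passes):
        for x_i in X:
            if w @ x_i <= 0:        # misclassified (or on the boundary)
                w += x_i            # update: w <- w + x_i
    return w
```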
“Optimal” Hyperplane
The “optimal” hyperplane separates the two classes and maximizes the distance to the closest point from either class.
Finding this hyperplane is a convex optimization problem.
This notion plays an important role in support vector machines.
[Figure: separating hyperplane wᵀx + b = 0 with margin hyperplanes wᵀx + b = ±1; their distances from the origin are |1 − b| / ||w|| and |−1 − b| / ||w||, and the margin width is 2 / ||w||]
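For separable data, a linear support vector machine with a very large cost parameter approximates this hard-margin hyperplane. A minimal sketch using scikit-learn (the data here is synthetic and illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)),   # class 0 cluster
               rng.normal(2, 0.5, (20, 2))])   # class 1 cluster
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))
```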