Day 6 - Logistic Regression (continued)

Agenda:
•Classification and logistic regression
◦Training binary classifiers
◦Evaluating classifiers
◦Training multiclass classifiers
•More thoughts on the Square Capital example and whether to approach the problem as regression or classification
Parametric Approach: Choose a model for f with unknown parameters. Estimate the parameters.

Parametric Regression: predict a continuous value.
Model: f_θ : R^d → R, y = f_θ(x) + noise.
Given {(x_i, y_i)}_{i=1..n}, find θ.

Parametric Classification: predict membership in a category.
Model: y = f_θ(x) + noise.
Given {(x_i, y_i)}_{i=1..n}, find θ.
The learned classifier partitions the input space along a decision boundary.
Approach for Estimating θ: select a model for f with parameters θ, then minimize the loss between the training labels and the predictions on the training data.
Binary Classification in 2D with Logistic Regression

min_θ Σ_{i=1..n} L(y_i, ŷ(x_i; θ)), where L is the prediction loss function.
Example: linear regression uses L(y, ŷ) = (y − ŷ)², the square loss (ℓ2 loss).

Training data: {(x_i, y_i)}_{i=1..n}, with points of class 0 and class 1 plotted in the (x1, x2) plane.

Model: ŷ = σ(θ0 + θ1 x1 + θ2 x2), where σ is the sigmoid (logistic) function σ(z) = 1/(1 + e^(−z)).
ŷ ∈ (0, 1) is the confidence that x belongs to class 1. The width of the sigmoid's transition region reflects the certainty of the classifier.

Solve min_θ Σ_{i=1..n} L(y_i, ŷ(x_i; θ)) for θ.

Predict: for a new sample x, predict class 1 if ŷ ≥ 1/2 and class 0 if ŷ < 1/2.
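The model and decision rule above can be sketched in a few lines of Python (the parameter values here are made up for illustration, not trained):

```python
import math

def sigmoid(z):
    # Logistic function: maps any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def predict_confidence(x1, x2, theta):
    # theta = (theta0, theta1, theta2); returns the confidence of class 1.
    theta0, theta1, theta2 = theta
    return sigmoid(theta0 + theta1 * x1 + theta2 * x2)

def predict_class(x1, x2, theta):
    # Predict class 1 if y_hat >= 1/2, class 0 otherwise.
    return 1 if predict_confidence(x1, x2, theta) >= 0.5 else 0

theta = (-1.0, 2.0, 2.0)  # illustrative, untrained parameters
print(predict_confidence(1.0, 1.0, theta))  # sigmoid(3) ≈ 0.953
print(predict_class(1.0, 1.0, theta))       # 1
print(predict_class(0.0, 0.0, theta))       # sigmoid(-1) ≈ 0.269 -> 0
```

Note that the decision boundary ŷ = 1/2 corresponds to θ0 + θ1 x1 + θ2 x2 = 0, a line in the (x1, x2) plane.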
Question: Consider a binary classification problem. What property of the data would lead logistic regression to learn a good classifier?

The two classes are linearly separable. If your data is not linearly separable, can you find a representation in which it is? (This is the classic machine learning perspective.) With representation learning, you specifically try to learn a representation in which the classes are linearly separable.
One common choice of loss is the log loss:
L(y, ŷ) = −y log ŷ − (1 − y) log(1 − ŷ),
where the label y is binary and the prediction ŷ is continuous.
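A quick sketch of this loss in Python, showing that confident correct predictions are cheap and confident wrong predictions are expensive (the example values are mine):

```python
import math

def log_loss(y, y_hat):
    # y is a binary label (0 or 1); y_hat is a continuous prediction in (0, 1).
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

print(log_loss(1, 0.99))  # confident and correct: ≈ 0.01
print(log_loss(1, 0.01))  # confident and wrong: ≈ 4.61
print(log_loss(0, 0.01))  # confident and correct: ≈ 0.01
```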
Question: With the following data, does a minimizer of the optimization problem even exist?

min_θ Σ_{i=1..n} L(y_i, ŷ(x_i; θ)), where L(y, ŷ) = −y log ŷ − (1 − y) log(1 − ŷ) and ŷ = σ(θ0 + θ1 x1 + θ2 x2).

(The data shown has class 0 and class 1 linearly separable.)

No. The labels are discrete (0 or 1), but the outputs are continuous and never actually equal 0 or 1. For any parameter θ that separates the data, doubling θ gives a more confident classifier with strictly smaller loss, so the infimum is never attained.
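The "double θ" argument can be checked numerically. On linearly separable toy data (made up here, in 1D with no bias term), scaling θ up strictly decreases the total log loss, so the loss never bottoms out at a finite θ:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def total_log_loss(theta, data):
    # Sum of log losses over the training set for a 1D model y_hat = sigmoid(theta * x).
    loss = 0.0
    for x, y in data:
        y_hat = sigmoid(theta * x)
        loss += -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)
    return loss

# Separable 1D data: negative x -> class 0, positive x -> class 1.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

for theta in [1.0, 2.0, 4.0, 8.0]:
    print(theta, total_log_loss(theta, data))  # loss strictly decreases as theta grows
```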
Evaluating Classifiers

Activity: Someone invents a test for a rare disease that affects 0.1% of the population. The test has accuracy 99.9%. Are you convinced this is a good test? No. The trivial predictor (always predict "no disease") is just as accurate.

Confusion matrix counts: True Positives (TP), False Positives (FP), False Negatives (FN), True Negatives (TN).

Accuracy = (TP + TN) / (TP + TN + FP + FN)
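The arithmetic behind that answer, with a hypothetical population of one million people:

```python
# Disease prevalence 0.1%: 1,000 sick people out of 1,000,000.
population = 1_000_000
sick = 1_000
healthy = population - sick

# The trivial predictor "always predict no disease" gets every healthy
# person right and every sick person wrong.
tp, fn = 0, sick
tn, fp = healthy, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.999 -- matches the 99.9% test without detecting anyone
```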
Activity: You are building a binary classifier that detects whether a pedestrian is crossing within 30 feet of a self-driving car. If the detection is positive, the car applies the brakes. Would you rather have good precision and great recall, or good recall and great precision? Good precision and great recall: missing a pedestrian (a false negative) is far worse than braking unnecessarily (a false positive).

Precision = TP / (TP + FP): what fraction of predicted positives are real.
Recall = TP / (TP + FN): what fraction of real positives are correctly predicted.

We want high precision and high recall, but there is a trade-off between true positives and false positives, and between true negatives and false negatives.
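A sketch of these two metrics from confusion-matrix counts (the counts below are invented for illustration):

```python
def precision(tp, fp):
    # Fraction of predicted positives that are real positives.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of real positives that are correctly predicted.
    return tp / (tp + fn)

# Hypothetical pedestrian detector: 90 true detections, 30 false alarms,
# 2 missed pedestrians.
tp, fp, fn = 90, 30, 2
print(precision(tp, fp))  # 0.75   (good precision: some unnecessary braking)
print(recall(tp, fn))     # ≈ 0.978 (great recall: almost no missed pedestrians)
```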
Training data: {(x_i, y_i)}_{i=1..n}.
Model: ŷ = σ(θ0 + θ1 x1 + θ2 x2).
Predict: for a new sample x, predict class 1 if ŷ ≥ τ and class 0 if ŷ < τ. The threshold τ = 1/2 is conventional, but any value could be chosen.
Receiver Operating Characteristic (ROC) Curves

An ROC curve plots the true positive rate against the false positive rate. Each point on the curve is a classifier with a different threshold: at one extreme everything is declared positive (top right of the curve), at the other everything is declared negative (bottom left). The diagonal corresponds to a random classifier.

Comparing classifiers and Area Under the Curve (AUC): an ideal classifier has AUC = 1. It is also common to plot precision-recall curves.
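A small sketch of how an ROC curve is built by sweeping the decision threshold over the model's scores (the scores and labels here are invented; a library such as scikit-learn would normally do this):

```python
def roc_points(scores, labels):
    # One (FP rate, TP rate) point per threshold, sweeping from
    # "declare everything positive" to "declare everything negative".
    pos = sum(labels)
    neg = len(labels) - pos
    thresholds = [-float("inf")] + sorted(set(scores)) + [float("inf")]
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

scores = [0.1, 0.4, 0.35, 0.8]  # model confidences for class 1
labels = [0, 0, 1, 1]           # true classes
print(roc_points(scores, labels))  # runs from (1, 1) down to (0, 0)
```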
Multiclass Classification

Consider a 3-class classification problem.

Training data: {(x_i, y_i)}_{i=1..n}, where x_i ∈ R^d for all i (a bias term is included in x) and y_i is a one-hot encoding of the class:
y_i = (1, 0, 0) if x_i is of class 1,
y_i = (0, 1, 0) if x_i is of class 2,
y_i = (0, 0, 1) if x_i is of class 3.

Model: ŷ = softmax(z1, z2, z3), where
z1 = θ1 · x, θ1 ∈ R^d,
z2 = θ2 · x, θ2 ∈ R^d,
z3 = θ3 · x, θ3 ∈ R^d.
The z_j are called logits, and they are linear in the parameters of the model.

softmax(z1, z2, z3)_i = e^(z_i) / (e^(z1) + e^(z2) + e^(z3)), so the outputs are positive and add up to 1.
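A sketch of the softmax computation (the max-subtraction is a standard numerical-stability trick that does not change the result, since it cancels in the ratio):

```python
import math

def softmax(z):
    # Exponentiate and normalize so the outputs are positive and sum to 1.
    m = max(z)  # subtracting the max avoids overflow; the ratios are unchanged
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # ≈ [0.659, 0.242, 0.099]
print(sum(probs))  # 1.0
```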
What will the decision boundary look like for 3-class classification? The plane is partitioned into three regions (class 1, class 2, class 3) by piecewise-linear boundaries.

Train: min_θ Σ_{i=1..n} L(y_i, softmax(θ1 · x_i, θ2 · x_i, θ3 · x_i)).

Log loss: L(y, ŷ) = −Σ_{c=1..3} y_c log ŷ_c, where y, ŷ ∈ R^3.

Predict: given x, compute ŷ ∈ R^3 and predict class i, where i = argmax_j ŷ_j.
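Putting the pieces together, a sketch of the 3-class prediction step (the parameter values are made up for illustration; actually training them would use the minimization above):

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def predict(x, thetas):
    # Logits are linear in the parameters: z_j = theta_j . x
    logits = [sum(t * xi for t, xi in zip(theta, x)) for theta in thetas]
    y_hat = softmax(logits)
    # Predict the class with the largest predicted probability (argmax),
    # using 1-based class labels as in the notes.
    cls = max(range(len(y_hat)), key=lambda j: y_hat[j]) + 1
    return cls, y_hat

# Illustrative parameters for 3 classes, x in R^3 (last entry is the bias term).
thetas = [[2.0, 0.0, -1.0], [0.0, 2.0, -1.0], [-1.0, -1.0, 1.0]]
cls, y_hat = predict([1.0, 0.0, 1.0], thetas)
print(cls)    # 1 (logit for class 1 is largest)
print(y_hat)  # three probabilities summing to 1
```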
CS 6140: Machine Learning — Fall 2021— Paul Hand
HW 3
Due: Wednesday October 6, 2021 at 2:30 PM Eastern time via Gradescope.
Names: [Put Your Name(s) Here]
You can submit this homework either by yourself or in a group of 2. You may consult any and all resources. Make sure to justify your answers. If you are working alone, you may either write your responses in LaTeX or you may write them by hand and take a photograph of them. If you are working in a group of 2, you must type your responses in LaTeX. You are encouraged to use Overleaf. Create a new project and replace the tex code with the tex file of this document, which you can find on the course website. To share the document with your partner, click Share > Turn on link sharing, and send the link to your partner. When you upload your solutions to Gradescope, make sure to tag each problem with the correct page or image.
Question 1. Linear regression with multivariate responses.

Consider training data $\{(x^{(i)}, y^{(i)})\}_{i=1\ldots n}$, where $x^{(i)} \in \mathbb{R}^d$ and $y^{(i)} \in \mathbb{R}^k$. Consider a model $y = Ax$, where $A \in \mathbb{R}^{k \times d}$ is unknown. Estimate $A$ by solving least squares linear regression
$$\min_A \sum_{i=1}^{n} \| y^{(i)} - A x^{(i)} \|^2.$$
(a) Find $A$ in the case of training data
$$\left\{ \left( \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix} \right), \left( \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix} \right), \left( \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 2 \\ 3 \\ 1 \end{pmatrix} \right) \right\}.$$
You may use a computer to perform linear algebra. Hint: the problem can be simplified by observing that each output dimension can be computed separately from the others. If you use this fact, justify it in your response.
Response:
(b) Consider the case of generic training data. Let $Y$ be the $k \times n$ matrix such that $Y_{ji} = y^{(i)}_j$. Let $X$ be the $n \times d$ matrix where $X_{ij} = x^{(i)}_j$. Provide a formula for the least squares estimate of $A$. Make sure to check that the matrix dimensions match in any matrix products that appear in your answer. Use the same hint as in part (a).
Response:
(c) Show that any prediction under this learned model is a linear combination of the response values $(y^{(1)}, \ldots, y^{(n)})$. That is, for the $A$ in part (b), show that $Ax \in \operatorname{span}(y^{(1)}, \ldots, y^{(n)})$ for any $x$. You may assume that $X$ is rank $d$.
Response:
Question 2. Logistic Regression

Consider training data $\{(x_i, y_i)\}_{i=1\ldots n}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$. Consider the logistic data model $y = \sigma(\theta \cdot x)$, where $x \in \mathbb{R}^d$, $\theta \in \mathbb{R}^d$, and $\sigma$ is the logistic function $\sigma(z) = e^z / (e^z + 1)$.
(a) Show that $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.
Response:
(b) Let $f(\theta) = \sum_{i=1}^{n} -y_i \log \hat{y}_i - (1 - y_i) \log(1 - \hat{y}_i)$, where $\hat{y}_i = \sigma(\theta \cdot x_i)$. Compute $\nabla f(\theta)$. Use the fact in part (a) to simplify your answer.
Response:
(c) If $M = \sum_{i=1}^{n} x_i x_i^t$, show that $z^t M z \geq 0$ for any $z \in \mathbb{R}^d$.
Response:
(d) Using a summation and vector and/or matrix products, write down a formula for the Hessian, $H$, of $f$ with respect to $\theta$. Show that $z^t H z \geq 0$ for any $z \in \mathbb{R}^d$.
Response: