
Day 6 - Logistic Regression (continued)


Agenda:

  • Classification and Logistic Regression
    ◦ Training binary classifiers
    ◦ Evaluating classifiers
    ◦ Training multiclass classifiers

  • More thoughts on the Square Capital example and whether to approach the problem as regression or classification

Parametric Approach: Choose a model for f with unknown parameters. Estimate the parameters.

Parametric Regression: predict a continuous value.

  Model: $y = f_\theta(x) + \text{noise}$, where $f_\theta: \mathbb{R}^d \to \mathbb{R}$.
  Given training data $\{(x^{(i)}, y^{(i)})\}_{i=1\dots n}$, find $\hat\theta$.

Parametric Classification: predict membership in a category.

  Model: $y = f_\theta(x) + \text{noise}$.
  Given training data $\{(x^{(i)}, y^{(i)})\}_{i=1\dots n}$, find $\hat\theta$.
  The learned classifier defines a decision boundary between the classes.

Approach for Estimating $\theta$:

  • Select a model for $f$ with parameters $\theta$.
  • Minimize the loss between training labels and predictions on the training data.

Binary Classification in 2D with logistic regression

Train by minimizing the total loss over the training data:
$$\min_\theta \sum_{i=1}^n L\big(y^{(i)}, f_\theta(x^{(i)})\big),$$
where $L$ is the prediction loss function. Example: linear regression uses the square loss $L(y, \hat y) = (y - \hat y)^2$ (or the $\ell_1$ loss $|y - \hat y|$).

Training data: $\{(x^{(i)}, y^{(i)})\}_{i=1\dots n}$, with $y^{(i)} \in \{0, 1\}$.

[Figure: training data in 2D with axes $x_1$ and $x_2$, showing class 1 points and class 0 points.]

Model: $\hat y = \sigma(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$, where $\sigma$ is the sigmoid (logistic) function
$$\sigma(z) = \frac{1}{1 + e^{-z}}.$$
$\hat y$ is the confidence that $x$ belongs to class 1.

[Figure: plot of the sigmoid function increasing from 0 to 1; its width reflects the certainty of the classifier.]

Solve $\min_\theta \sum_{i=1}^n L\big(y^{(i)}, \hat y^{(i)}\big)$ for $\hat\theta$.

Predict: for a new sample $x$, predict class 1 if $\hat y \ge 1/2$ and class 0 if $\hat y < 1/2$.
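A minimal sketch of the above in code: 2D logistic regression trained by gradient descent on the log loss. The synthetic data, learning rate, and iteration count are made up for illustration; NumPy is assumed.

```python
import numpy as np

# Two made-up 2D clusters, one per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-1, -1], 1.0, size=(50, 2)),
               rng.normal([+1, +1], 1.0, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# theta = (theta_0, theta_1, theta_2); prepend a column of 1s so the
# bias term theta_0 is handled uniformly.
Xb = np.hstack([np.ones((X.shape[0], 1)), X])
theta = np.zeros(3)

lr = 0.1
for _ in range(2000):
    y_hat = sigmoid(Xb @ theta)            # confidence that each point is class 1
    grad = Xb.T @ (y_hat - y) / len(y)     # gradient of the mean log loss
    theta -= lr * grad

# Predict: class 1 if the confidence is at least 1/2, else class 0.
x_new = np.array([1.0, 0.5, 0.5])          # [1, x1, x2]
pred = int(sigmoid(x_new @ theta) >= 0.5)
print(theta, pred)
```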

Question: Consider a binary classification problem. What property of the data would lead logistic regression to learn a good classifier?

The two classes are linearly separable. If your data is not linearly separable, can you find a representation in which it is? (This is the classic machine learning perspective.) With representation learning, you specifically try to learn a representation in which the classes are linearly separable.

One choice of loss is the log loss:
$$L(y, \hat y) = -y \log \hat y - (1 - y) \log(1 - \hat y),$$
where $y \in \{0, 1\}$ is the binary label and $\hat y \in (0, 1)$ is the continuous prediction.
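A small sketch of the log loss on a single (label, prediction) pair; the example values are made up and NumPy is assumed.

```python
import numpy as np

# Log loss for a binary label y in {0, 1} and a continuous confidence y_hat in (0, 1).
def log_loss(y, y_hat):
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

print(log_loss(1, 0.9))   # confident and correct -> small loss (about 0.105)
print(log_loss(1, 0.1))   # confident and wrong   -> large loss (about 2.303)
```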

Question: With the following data, does a minimizer to the optimization problem even exist?

$$\min_\theta \sum_{i=1}^n L\big(y^{(i)}, \hat y^{(i)}\big), \qquad L(y, \hat y) = -y \log \hat y - (1 - y) \log(1 - \hat y), \qquad \hat y = \sigma(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$$

[Figure: 2D training data that is linearly separable, with class 1 points on one side and class 0 points on the other.]

The labels are discrete (0 or 1), while the outputs are continuous (never actually equal to 0 or 1). For any parameter $\theta$ that separates the classes, you could double $\theta$ and get a more confident classifier with lower loss, so the loss can always be decreased further and no minimizer exists.
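A sketch of this effect on a tiny, made-up, linearly separable dataset (NumPy assumed): repeatedly doubling $\theta$ keeps shrinking the log loss toward zero, so no minimizer is ever reached.

```python
import numpy as np

# Made-up separable data: x = -2, -1 are class 0 and x = 1, 2 are class 1.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])  # columns: [1, x]
y = np.array([0.0, 0.0, 1.0, 1.0])

def mean_log_loss(theta):
    y_hat = 1.0 / (1.0 + np.exp(-X @ theta))
    return np.mean(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))

theta = np.array([0.0, 1.0])  # any direction that separates the classes
for k in range(5):
    print(2**k, mean_log_loss((2**k) * theta))  # loss keeps decreasing toward 0
```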

Evaluating Classifiers

Activity: Someone invents a test for a rare disease that affects 0.1% of the population. The test has accuracy 99.9%. Are you convinced this is a good test? No. The trivial predictor (always predict no) is just as accurate.

Confusion matrix counts: True Positive (TP), False Positive (FP), False Negative (FN), True Negative (TN).

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

Activity: You are building a binary classifier that detects whether a pedestrian is crossing the sidewalk within 30 feet of a self-driving car. If the detection is positive, the car puts on the brakes. Would you rather have good precision and great recall, or good recall and great precision? Good precision and great recall.

Precision $= \dfrac{TP}{TP + FP}$: what fraction of predicted positives are real.

Recall $= \dfrac{TP}{TP + FN}$: what fraction of real positives are correctly predicted.

Want high precision and high recall. There is a trade-off between True Positives and False Positives, and between True Negatives and False Negatives.
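A sketch of these metrics computed from confusion-matrix counts; the counts are made up to mimic a rare-positive setting like the disease-test activity.

```python
# Made-up confusion-matrix counts for illustration.
tp, fp, fn, tn = 8, 2, 12, 978

accuracy  = (tp + tn) / (tp + fp + fn + tn)   # fraction of all predictions that are correct
precision = tp / (tp + fp)                    # fraction of predicted positives that are real
recall    = tp / (tp + fn)                    # fraction of real positives that are found

print(accuracy, precision, recall)            # 0.986, 0.8, 0.4
```

Note how the accuracy looks excellent even though the recall is poor, which is exactly why accuracy alone is misleading for rare classes.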

Training data: $\{(x^{(i)}, y^{(i)})\}_{i=1\dots n}$, $y^{(i)} \in \{0, 1\}$.

Model: $\hat y = \sigma(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$.

Predict: for a new sample $x$, predict class 1 if $\hat y \ge t$ and class 0 if $\hat y < t$, where the threshold $t$ (previously $1/2$) could be chosen to be any value.

Receiver Operating Characteristic Curves

Each point on the ROC curve is a classifier with a different threshold.

[Figure: ROC curve with false positive rate on the horizontal axis and true positive rate on the vertical axis; the threshold that declares everything positive sits at the top-right corner, the threshold that declares everything negative sits at the bottom-left corner, and the diagonal corresponds to a random classifier.]

Comparing classifiers and Area Under the Curve (AUC): we want AUC close to 1. It is also common to plot precision-recall curves.
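A sketch of tracing an ROC curve and computing AUC by sweeping the decision threshold over a classifier's confidence scores. The labels and scores are made up, and scikit-learn is an assumed helper library, not part of the lecture notes.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score  # assumed library

# Made-up labels and classifier confidences for illustration.
y_true  = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.4, 0.45, 0.6, 0.7, 0.8, 0.55, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FP rate, TP rate) point per threshold
auc = roc_auc_score(y_true, y_score)               # want this close to 1
print(auc)
```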

Multiclass Classification

Consider a 3-class classification problem.

Training data: $\{(x^{(i)}, y^{(i)})\}_{i=1\dots n}$, with $x^{(i)} \in \mathbb{R}^d$ for all $i$ (a bias term can be included by appending a 1 to $x$).

One-hot encoding of the labels:
$$y^{(i)} \in \{(1,0,0),\ (0,1,0),\ (0,0,1)\}: \quad (1,0,0) \text{ if class 1}, \quad (0,1,0) \text{ if class 2}, \quad (0,0,1) \text{ if class 3}.$$

Model: $\hat y = \mathrm{softmax}(z_1, z_2, z_3)$, where the logits are
$$z_1 = \theta_1 \cdot x, \quad z_2 = \theta_2 \cdot x, \quad z_3 = \theta_3 \cdot x, \qquad \theta_1, \theta_2, \theta_3 \in \mathbb{R}^d.$$

The logits are linear in the parameters of the model.

$$\mathrm{softmax}(z_1, z_2, z_3)_c = \frac{e^{z_c}}{\sum_{c'} e^{z_{c'}}}, \qquad \text{so the components add up to 1.}$$

What will decision boundary look like for 3-class classification?

Train:
$$\min_{\theta_1, \theta_2, \theta_3} \sum_{i=1}^n L\big(y^{(i)}, \mathrm{softmax}(\theta_1 \cdot x^{(i)}, \theta_2 \cdot x^{(i)}, \theta_3 \cdot x^{(i)})\big)$$

Log loss: $L(y, \hat y) = -\sum_{c=1}^3 y_c \log \hat y_c$, where $y, \hat y \in \mathbb{R}^3$.

Prediction: given $x$, compute $\hat y \in \mathbb{R}^3$ and predict the class $\hat c = \arg\max_c \hat y_c$.

[Figure: 2D decision regions for class 1, class 2, and class 3 with axes $x_1$ and $x_2$; the decision boundaries between regions are linear.]
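A minimal sketch of the 3-class softmax model at prediction time: linear logits $z_c = \theta_c \cdot x$, softmax to get confidences that sum to 1, then argmax. The parameter values and input are made up, and NumPy is assumed.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Made-up parameters for a 3-class problem with d = 2 features.
theta = np.array([[ 1.0, -0.5],   # theta_1
                  [-0.2,  0.8],   # theta_2
                  [ 0.1,  0.1]])  # theta_3
x = np.array([0.5, 1.0])

z = theta @ x                          # logits (z_1, z_2, z_3), linear in the parameters
y_hat = softmax(z)                     # confidences, which sum to 1
pred_class = int(np.argmax(y_hat)) + 1 # predict the class with the largest confidence
print(y_hat, pred_class)
```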

CS 6140: Machine Learning — Fall 2021 — Paul Hand

HW 3
Due: Wednesday October 6, 2021 at 2:30 PM Eastern time via Gradescope.

Names: [Put Your Name(s) Here]

You can submit this homework either by yourself or in a group of 2. You may consult any and all resources. Make sure to justify your answers. If you are working alone, you may either write your responses in LaTeX or you may write them by hand and take a photograph of them. If you are working in a group of 2, you must type your responses in LaTeX. You are encouraged to use Overleaf. Create a new project and replace the tex code with the tex file of this document, which you can find on the course website. To share the document with your partner, click Share > Turn on link sharing, and send the link to your partner. When you upload your solutions to Gradescope, make sure to tag each problem with the correct page or image.

Question 1. Linear regression with multivariate responses.

Consider training data $\{(x^{(i)}, y^{(i)})\}_{i=1\dots n}$, where $x^{(i)} \in \mathbb{R}^d$ and $y^{(i)} \in \mathbb{R}^k$. Consider a model $y = Ax$, where $A \in \mathbb{R}^{k \times d}$ is unknown. Estimate $A$ by solving least squares linear regression

$$\min_A \sum_{i=1}^n \left\| y^{(i)} - A x^{(i)} \right\|^2.$$

(a) Find $A$ in the case of training data
$$\left\{ \left( \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix} \right), \left( \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix} \right), \left( \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 2 \\ 3 \\ 1 \end{pmatrix} \right) \right\}.$$
You may use a computer to perform linear algebra. Hint: the problem can be simplified by observing that each output dimension can be computed separately from the others. If you use this fact, justify it in your response.

Response:

(b) Consider the case of generic training data. Let $Y$ be the $k \times n$ matrix such that $Y_{ji} = y^{(i)}_j$. Let $X$ be the $n \times d$ matrix where $X_{ij} = x^{(i)}_j$. Provide a formula for the least squares estimate of $A$. Make sure to check that the matrix dimensions match in any matrix products that appear in your answer. Use the same hint as in part (a).

Response:

(c) Show that any prediction under this learned model is a linear combination of the response values $(y^{(1)}, \dots, y^{(n)})$. That is, for the $A$ in part (b), show that $Ax \in \mathrm{span}(y^{(1)}, \dots, y^{(n)})$ for any $x$. You may assume that $X$ is rank $d$.

Response:

Question 2. Logistic Regression


Consider training data $\{(x_i, y_i)\}_{i=1\dots n}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$. Consider the logistic data model $y = \sigma(\theta \cdot x)$, where $x \in \mathbb{R}^d$, $\theta \in \mathbb{R}^d$, and $\sigma$ is the logistic function $\sigma(z) = e^z / (e^z + 1)$.

(a) Show that $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.

Response:

(b) Let $f(\theta) = \sum_{i=1}^n -y_i \log \hat y_i - (1 - y_i) \log(1 - \hat y_i)$, where $\hat y_i = \sigma(\theta \cdot x_i)$. Compute $\nabla f(\theta)$. Use the fact in part (a) to simplify your answer.

Response:

(c) If $M = \sum_{i=1}^n x_i x_i^T$, show that $z^T M z \ge 0$ for any $z \in \mathbb{R}^d$.

Response:

(d) Using a summation and vector and/or matrix products, write down a formula for the Hessian, $H$, of $f$ with respect to $\theta$. Show that $z^T H z \ge 0$ for any $z \in \mathbb{R}^d$.

Response:
