
Page 1:

Deep Learning Explained
Module 2: Logistic Regression

Sayan D. Pathak, Ph.D., Principal ML Scientist, Microsoft

Roland Fernandez, Senior Researcher, Microsoft

Page 2:

Module Outline

Application: OCR with MNIST data

Model: Logistic Regression

Concepts: loss, minibatch, train-test-predict workflow

Page 3:

MNIST Handwritten Digits (OCR)

• Data set of handwritten digits (0-9) with:
  ✓ 60,000 training images
  ✓ 10,000 test images

• Each image is 28 x 28 pixels

[Figure: sample handwritten digit images with their corresponding labels]
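As a minimal numpy sketch of these shapes (the arrays below are zero-filled stand-ins; loading the real data is out of scope here):

    import numpy as np

    # Stand-ins for the MNIST data: 60,000 training and 10,000 test
    # images, each 28 x 28 grayscale pixels.
    train_images = np.zeros((60000, 28, 28), dtype=np.float32)
    test_images = np.zeros((10000, 28, 28), dtype=np.float32)

    # Each image is flattened into a 784-dimensional feature vector.
    X_train = train_images.reshape(60000, 28 * 28)   # (60000, 784)
    X_test = test_images.reshape(10000, 28 * 28)     # (10000, 784)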

Page 4:

Logistic Regression

[Figure: a 28 x 28 pixel image is flattened into 784 pixels, the input vector $\vec{x}$; the model maps it to 10 output values, e.g. 0.1 0.1 0.3 0.9 0.4 0.2 0.1 0.1 0.6 0.3, one per digit class 0-9.]

Model parameters: weights $\mathbf{W}$ (784 x 10) and bias $\vec{b}$ (dimension 10).

$$\vec{z} = \mathbf{W}\vec{x}^T + \vec{b}$$

A model that maps input features to discrete output classes, as opposed to linear regression, which predicts continuous values.
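A minimal numpy sketch of this mapping (names and values here are illustrative, not course code):

    import numpy as np

    rng = np.random.default_rng(0)

    x = rng.random(784)                        # one flattened image
    W = 0.01 * rng.standard_normal((784, 10))  # weights: 784 x 10
    b = np.zeros(10)                           # bias: one per class

    # The slide's z = W x^T + b; with W stored as 784 x 10 this is
    # the product x @ W plus the bias, giving one score per digit.
    z = x @ W + b                              # shape: (10,)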

Page 5:

Logistic Regression

[Figure: the 28 x 28 image is flattened into 784 pixels $\vec{x}$; a summation node computes the evidence for class 0.]

$$S_0 = \sum^{784} (\text{weights} \times \text{pixels}) = \vec{w}_0 \cdot \vec{x}^T$$

Page 6:

Logistic Regression

[Figure: a second summation node computes the evidence for class 1 from the same 784 pixels $\vec{x}$.]

$$S_0 = \sum^{784} (\text{weights} \times \text{pixels}) = \vec{w}_0 \cdot \vec{x}^T$$

$$S_1 = \sum^{784} (\text{weights} \times \text{pixels}) = \vec{w}_1 \cdot \vec{x}^T$$

Page 7:

Logistic Regression

[Figure: one summation node per class 0-9; the weights $\mathbf{W}$ (784 x 10) stack the per-class weight vectors $\vec{w}_0, \ldots, \vec{w}_9$.]

$$S_j = \sum^{784} (\text{weights} \times \text{pixels}) = \vec{w}_j \cdot \vec{x}^T, \quad j = 0, \ldots, 9$$

Stacking all 10 sums gives $\vec{z} = \mathbf{W}\vec{x}^T$.

Page 8:

Logistic Regression

[Figure: the full model. The 784 pixels $\vec{x}$ feed 10 summation nodes with weights $\mathbf{W}$ (784 x 10) and bias $\vec{b}$ (dimension 10); the outputs, e.g. 0.1 0.1 0.3 0.9 0.4 0.2 0.1 0.1 0.6 0.3, come from a sigmoid.]

$$\vec{z} = \mathbf{W}\vec{x}^T + \vec{b}$$

Activation function: sigmoid = map to (0-1) range.


Page 10:

Logistic Regression with Softmax

[Figure: the same model with no activation on the summation nodes (pass-through); the raw scores $z_0, \ldots, z_9$ feed a softmax instead.]

$$\vec{z} = \mathbf{W}\vec{x}^T + \vec{b}$$

Softmax turns the scores into predicted probabilities $\vec{p}$:

$$p_i = \frac{e^{z_i}}{\sum_{j=0}^{9} e^{z_j}}$$

Example predicted probabilities (p): 0.08 0.08 0.10 0.17 0.11 0.09 0.08 0.08 0.13 0.01
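A minimal numpy sketch of the softmax (the z values below are made up):

    import numpy as np

    def softmax(z):
        # Subtract the max for numerical stability; softmax is
        # invariant to shifting all scores by a constant.
        e = np.exp(z - np.max(z))
        return e / e.sum()

    z = np.array([1.0, 1.0, 1.2, 1.7, 1.3, 1.1, 1.0, 1.0, 1.4, 0.5])
    p = softmax(z)      # predicted probabilities; p.sum() == 1.0
    print(p.round(2))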

Page 11:

Loss Function

[Figure: a 28 x 28 pixel image passes through the model $(\mathbf{w}, b)$ to produce predicted probabilities $\vec{p}$, which are compared against the one-hot encoded label $\vec{y}$.]

Loss functions:

Squared error: $\text{se} = \sum_{j=0}^{9} (y_j - p_j)^2$

Cross-entropy error: $\text{ce} = -\sum_{j=0}^{9} y_j \log p_j$

Example for the digit 3:
One-hot encoded label (Y): 0 0 0 1 0 0 0 0 0 0
Predicted probabilities (p): 0.08 0.08 0.10 0.17 0.11 0.09 0.08 0.08 0.13 0.01
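Both losses, computed in numpy for the slide's example (the digit 3):

    import numpy as np

    y = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0], dtype=float)
    p = np.array([0.08, 0.08, 0.10, 0.17, 0.11, 0.09,
                  0.08, 0.08, 0.13, 0.01])

    se = np.sum((y - p) ** 2)     # squared error
    ce = -np.sum(y * np.log(p))   # cross entropy: only the true-class
                                  # term survives, i.e. -log(0.17)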

Page 12:

Train Workflow

[Figure: the training loop. A data sampler draws features (x) and labels (Y) from the training data; the model z(params) produces predictions; the loss is computed against Y; the learner updates the params; reporting plots loss over iterations; the loop repeats while "train more?" is yes.]

Page 13:

Train Workflow

[Figure: a mini-batch of 128 samples from the MNIST training set. Input features X: 128 x 784. One-hot encoded labels Y: 128 x 10; e.g. the digits 3, 7, 8, 0 encode as the rows
0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0]

Model, with weights W (784 x 10) and bias b (dimension 10):

z = times(X, W) + b, i.e. $\mathbf{z} = \mathbf{W}X^T + \vec{b}$

Loss: cross_entropy_with_softmax(z, Y)

Error (optional): classification_error(z, Y)

Trainer(model, (loss, error), learner)
Trainer.train_minibatch({X, Y})

Learners (sgd, adagrad, etc.) are solvers that estimate W and b.
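A minimal CNTK sketch of this train step (assuming a CNTK 2.x API; exact schedule and learner signatures vary across releases, and the minibatch arrays are zero-filled stand-ins):

    import numpy as np
    import cntk as C

    x = C.input_variable(784)
    y = C.input_variable(10)

    # Model parameters: weights W (784 x 10) and bias b (dim 10).
    W = C.parameter(shape=(784, 10))
    b = C.parameter(shape=(10,))
    z = C.times(x, W) + b

    loss = C.cross_entropy_with_softmax(z, y)
    error = C.classification_error(z, y)

    lr = C.learning_rate_schedule(0.2, C.UnitType.minibatch)
    trainer = C.Trainer(z, (loss, error), [C.sgd(z.parameters, lr)])

    # One update on a stand-in minibatch of 128 samples.
    X_mb = np.zeros((128, 784), dtype=np.float32)
    Y_mb = np.zeros((128, 10), dtype=np.float32)
    Y_mb[:, 0] = 1
    trainer.train_minibatch({x: X_mb, y: Y_mb})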

Page 14:

Learn the weights: Learners / Optimizers / Solvers

For 1 sample:

$$L_i = -\sum_{j=0}^{9} y_j^i \log p_j \quad \text{where } p_j = f(x^i; \theta)_j \text{ and } \theta \in (w, b)$$

For all samples (m = 60,000 images):

$$\text{Total loss} = \sum_{i=1}^{m} L_i\big(\theta; (x^i, y^i)\big)$$

Convex function: there is one and only one minimum.

Fig: courtesy http://codingwiththomas.blogspot.com/2012/09/particle-swarm-optimization.html
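A numpy sketch of why the total loss is expensive: a single evaluation touches all m samples (stand-in labels and predictions):

    import numpy as np

    m = 60000
    Y = np.zeros((m, 10)); Y[:, 3] = 1    # stand-in one-hot labels
    P = np.full((m, 10), 0.1)             # stand-in predictions

    # One value of theta requires a full pass over all m images.
    total_loss = np.sum(-np.sum(Y * np.log(P), axis=1))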

Page 15:

Gradient Descent

$$\theta' = \theta - \mu \, \mathrm{grad}(L; \theta)$$

where $\theta$ = model parameters and $\mu$ = learning rate.

Computing the total loss ($\sum_i^n L_i$) over a large data set is expensive and often redundant.

Refer to http://sebastianruder.com/optimizing-gradient-descent/ for details.
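The update rule in numpy, on a toy convex loss $L(\theta) = \theta^2$ (whose gradient is $2\theta$):

    import numpy as np

    theta = np.array([1.0])   # model parameter
    mu = 0.1                  # learning rate

    for _ in range(10):
        grad = 2 * theta              # grad(L; theta) for L = theta^2
        theta = theta - mu * grad     # theta' = theta - mu * grad
    # theta has moved toward the single minimum at 0.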

Page 16:

Stochastic Gradient Descent (SGD)

SGD: update the parameters for each (data, label) pair.

Mini-batch SGD: update the parameters for each mini-batch, a set of (data, label) pairs.

Refer to http://sebastianruder.com/optimizing-gradient-descent/ for details on different learners.
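A sketch of one epoch of mini-batch SGD (random stand-in data; the gradient function is a placeholder, since the point is the update schedule):

    import numpy as np

    rng = np.random.default_rng(0)
    X, Y = rng.random((60000, 784)), rng.random((60000, 10))
    theta = np.zeros((784, 10))
    mu, batch_size = 0.1, 128

    def grad_minibatch(theta, X_mb, Y_mb):
        return np.zeros_like(theta)   # placeholder for the real gradient

    order = rng.permutation(len(X))   # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        # One parameter update per mini-batch of (data, label) pairs.
        theta -= mu * grad_minibatch(theta, X[idx], Y[idx])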

Page 17:

Other learners

Momentum-SGD, Nesterov, Adagrad, Adadelta, Adam

Refer to http://sebastianruder.com/optimizing-gradient-descent/ for details on different learners.

Image by: Alec Radford

Page 18:

Validation Workflow

[Figure: the training loop (data sampler → model z(params) → loss → learner updates params, with reporting over iterations) runs alongside a validation loop: a second data sampler draws features (x) and labels (Y) from the validation data, the current model is evaluated, and reporting tracks validation loss until "more?" is no; the trained params then yield the final model.]
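A sketch of the pattern (placeholder loss and gradient on stand-in data; the point is the periodic validation check, not the model):

    import numpy as np

    rng = np.random.default_rng(0)
    X_tr, Y_tr = rng.random((1000, 784)), rng.random(1000)
    X_val, Y_val = rng.random((200, 784)), rng.random(200)
    theta, mu = np.zeros(784), 0.001

    def mean_loss(theta, X, Y):                # placeholder loss
        return float(np.mean((X @ theta - Y) ** 2))

    best_val = float("inf")
    for it in range(2000):
        i = rng.integers(len(X_tr))            # one training sample
        grad = 2 * X_tr[i] * (X_tr[i] @ theta - Y_tr[i])
        theta -= mu * grad                     # update params
        if it % 500 == 0:                      # validate periodically
            val = mean_loss(theta, X_val, Y_val)
            if val < best_val:
                best_val = val                 # else consider stopping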

Page 19:

Test Workflow

[Figure: a data sampler draws features (x) and labels (Y) from the test data; the final model with trained params is evaluated; reporting accumulates results while "test more?" is yes.]

Page 20:

Test Workflow

[Figure: a mini-batch of 32 samples from the MNIST test set. Input features X*: 32 x 784. One-hot encoded labels Y*: 32 x 10; e.g. the digits 3, 7, 8, 0 encode as the rows
0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0]

Model, with trained weights W* (784 x 10) and trained bias b* (dimension 10):

z = times(X*, W*) + b*, i.e. $\mathbf{z} = \mathbf{W}^* X^{*T} + \vec{b}^*$

Trainer.test_minibatch({X*, Y*})

Returns the classification error as the percentage of incorrectly labeled MNIST images.
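Continuing the CNTK sketch from the train workflow (same CNTK 2.x hedges apply; the test minibatch is a stand-in):

    import numpy as np
    import cntk as C

    x = C.input_variable(784)
    y = C.input_variable(10)
    z = C.times(x, C.parameter(shape=(784, 10))) + C.parameter(shape=(10,))
    loss = C.cross_entropy_with_softmax(z, y)
    error = C.classification_error(z, y)
    lr = C.learning_rate_schedule(0.2, C.UnitType.minibatch)
    trainer = C.Trainer(z, (loss, error), [C.sgd(z.parameters, lr)])

    X_test = np.zeros((32, 784), dtype=np.float32)   # stand-in X*
    Y_test = np.zeros((32, 10), dtype=np.float32)    # stand-in Y*
    Y_test[:, 3] = 1

    # Average classification error on the minibatch, i.e. the
    # fraction of incorrectly labeled images.
    err = trainer.test_minibatch({x: X_test, y: Y_test})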

Page 21:

Prediction Workflow

[Figure: any MNIST image, here a 9, as input feature (new X: 1 x 784), fed to the trained model (W, b).]

Model.eval(new X)

Predicted softmax probabilities (predicted_label):
0.02 0.09 0.03 0.03 0.01 0.02 0.02 0.06 0.02 0.70

[ numpy.argmax(predicted_label) for predicted_label in predicted_labels ]

[9]
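A minimal CNTK sketch of this prediction step (untrained stand-in parameters, so the output is meaningless; the shapes are the point):

    import numpy as np
    import cntk as C

    x = C.input_variable(784)
    z = C.times(x, C.parameter(shape=(784, 10))) + C.parameter(shape=(10,))
    model = C.softmax(z)              # eval returns probabilities

    new_X = np.zeros((1, 784), dtype=np.float32)   # one stand-in image
    predicted_labels = model.eval({x: new_X})      # shape: (1, 10)
    print([int(np.argmax(p)) for p in predicted_labels])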

Page 22:

Prediction Workflow

[Figure: a batch of MNIST images (9, 5, 8, …, 2) as input features (new X: 25 x 784), fed to the trained model (W, b).]

Model.eval(new X)

Predicted softmax probabilities (predicted_label)

[ numpy.argmax(predicted_label) for predicted_label in predicted_labels ]

[9, 5, 8, …, 2]