Lecture 1 Recap

Prof. Leal-Taixé and Prof. Niessner

Machine learning

• Supervised learning, unsupervised learning, and reinforcement learning (an agent interacting with an environment).

Nearest Neighbor

• Classify a query point with the label of its nearest training sample according to a distance metric, e.g. NN classifier = dog.
• Decision boundaries of the nearest-neighbor classifier (figure courtesy of Stanford course cs231n).
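
The following is a minimal NumPy sketch of the 1-nearest-neighbor classifier described above; the toy data, labels, and the use of Euclidean distance are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def nn_classify(train_X, train_y, query_X):
    """Label each query point with the label of its nearest training sample
    (Euclidean distance); a minimal sketch of the 1-NN classifier."""
    # Pairwise squared distances between queries and training samples
    dists = ((query_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=-1)
    return train_y[dists.argmin(axis=1)]

# Toy example: two 2D clusters labeled "cat" (0) and "dog" (1)
train_X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
train_y = np.array([0, 0, 1, 1])
print(nn_classify(train_X, train_y, np.array([[0.95, 0.9]])))  # -> [1], i.e. "dog"
```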

Linear Regression

• Supervised learning
• Find a linear model that explains a target given the inputs

Training and testing

• Training: data points, i.e. inputs (e.g. images, measurements) and labels (e.g. cat/dog), are given to a learner, which produces the model parameters.
• Testing: new data points are given to a predictor, which uses the learned model parameters to output an estimation.
• Linear regression is one instance of this learner/predictor setup.

Linear prediction

• A linear model is expressed in the form
  ŷ_i = θ_0 + Σ_{j=1}^{d} x_{ij} θ_j
  where x_{i1}, …, x_{id} are the d inputs (input data, features), θ_1, …, θ_d are the weights (model parameters), and θ_0 is the bias (paired with a constant input of 1).

Linear prediction: example

• Target: the temperature of a building.
• Input features: outside temperature, number of people, sun exposure (%), level of humidity.
• The MODEL combines the input features, weighted by the model parameters, into the prediction.
• How do we obtain the model?
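
As a concrete illustration of the linear prediction above, here is a minimal NumPy sketch using the temperature example; the feature values and model parameters are invented purely for illustration.

```python
import numpy as np

# Invented example features: outside temperature (°C), number of people,
# sun exposure (%), level of humidity (%)
x = np.array([15.0, 10.0, 60.0, 40.0])

# Invented model parameters: theta_0 is the bias, the rest weight each feature
theta_0 = 5.0
theta = np.array([0.4, 0.3, 0.05, 0.02])

# Linear prediction: y_hat = theta_0 + sum_j x_j * theta_j
y_hat = theta_0 + x @ theta
print(y_hat)  # predicted temperature of the building
```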

How to obtain the model?

• Data points and model parameters produce an estimation, which is compared against the labels (ground truth) by a loss function; an optimization method then updates the model parameters.
• Loss function: measures how good my estimation is (how good my model is) and tells the optimization method how to make it better.
• Optimization: changes the model in order to decrease the loss (i.e. to improve my estimation).

Linear regression: loss function

• Prediction: the temperature of the building, compared against the measured ground truth for each training sample.
• The function we are minimizing is also called the objective function, the cost function, or the energy.

Optimization: linear least squares

• Linear least squares: an approach to fit a mathematical model to the data.
• Convex problem; there exists a closed-form solution that is unique.
• Objective over the n training samples, where the estimation comes from the linear model:
  J(θ) = Σ_{i=1}^{n} (ŷ_i − y_i)²,  with  ŷ_i = θ_0 + Σ_{j=1}^{d} x_{ij} θ_j
• Matrix notation: stack the n training samples (each input vector has size d + 1, including the constant 1 for the bias) into a matrix X and the labels into a vector y, so that
  J(θ) = ‖Xθ − y‖²
• Thursday April 19th: exercise session, more on matrix notation.

Optimization: linear least squares

• The objective is convex, so the optimum is where the gradient with respect to θ vanishes:
  ∇_θ J(θ) = 2Xᵀ(Xθ − y) = 0  ⟹  θ = (XᵀX)⁻¹ Xᵀ y
• No theta on the right-hand side ("No theta there!"), so the solution is computed directly from the data. More info on Thursday!
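
A minimal NumPy sketch of the closed-form least-squares solution just stated, on synthetic data (all numbers invented); solving the normal equations with np.linalg.solve is one of several ways to evaluate θ = (XᵀX)⁻¹Xᵀy.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 4
X_raw = rng.normal(size=(n, d))
X = np.hstack([np.ones((n, 1)), X_raw])          # prepend a column of ones for the bias
true_theta = np.array([5.0, 0.4, 0.3, 0.05, 0.02])
y = X @ true_theta + rng.normal(scale=0.1, size=n)

# Normal equations: theta = (X^T X)^(-1) X^T y (solve instead of inverting explicitly)
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)  # close to true_theta
```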

Optimization: example

• Inputs: outside temperature, number of people, … Output: temperature of the building.

Is this the best estimate?

• Is the least squares estimate the best estimate we can obtain?

Maximum Likelihood

Maximum Likelihood Estimate

• Assume the observations come from a true underlying distribution that belongs to a parametric family of distributions, controlled by a parameter θ.
• A method of estimating the parameters of a statistical model given observations: find the parameter values that maximize the likelihood of making the observations given the parameters,
  θ_MLE = argmax_θ Π_{i=1}^{n} p(y_i | x_i, θ)
  where the product runs over the training samples, assumed independent.
• Logarithmic property: the logarithm is monotonically increasing, so maximizing the likelihood is equivalent to maximizing the log-likelihood, which turns the product into a sum:
  θ_MLE = argmax_θ Σ_{i=1}^{n} log p(y_i | x_i, θ)

Back to linear regression

• Assuming the labels are generated by the linear model plus Gaussian (Normal) noise,
  y_i = x_i θ + ε_i,  ε_i ∼ N(0, σ²),
  each observation follows a Gaussian with mean x_i θ:
  p(y_i | x_i, θ) = N(y_i | x_i θ, σ²)
• Original optimization problem: maximize the log-likelihood of the training samples under this model.
• In matrix notation, canceling the log against the exponential of the Gaussian, maximizing the log-likelihood reduces to minimizing the squared error ‖Xθ − y‖².
• How can we find the estimate of theta?
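
The following sketch (synthetic data, assumed known noise level σ) numerically illustrates the statement above: under Gaussian noise, the negative log-likelihood and the squared error differ only by a constant, so they share the same minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                      # 50 training samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)
sigma = 0.3                                       # assumed (known) noise standard deviation

def neg_log_likelihood(theta):
    # -sum_i log N(y_i | x_i theta, sigma^2)
    residual = X @ theta - y
    return 0.5 * np.sum(residual**2) / sigma**2 + len(y) * np.log(sigma * np.sqrt(2 * np.pi))

def squared_error(theta):
    return np.sum((X @ theta - y) ** 2)

# For any parameter vector, the two objectives differ by the same constant,
# so they have the same minimizer.
theta_a, theta_b = rng.normal(size=3), rng.normal(size=3)
print(neg_log_likelihood(theta_a) - squared_error(theta_a) / (2 * sigma**2))
print(neg_log_likelihood(theta_b) - squared_error(theta_b) / (2 * sigma**2))  # same value
```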

Linear regression

• The Maximum Likelihood Estimate (MLE) corresponds to the Least Squares Estimate (given the assumptions).
• Introduced the concepts of loss function and optimization to obtain the best model for regression.

Image classification

Regression vs. Classification

• Regression: predicts a continuous output value (e.g. the temperature of a room).
• Classification: predicts a discrete value
  – Binary classification: output is either 0 or 1
  – Multi-class classification: output is one of a set of N classes

Logistic regression

• Example: a CAT classifier (cat vs. not cat).

Sigmoid for binary predictions

• The sigmoid σ(z) = 1 / (1 + e^{−z}) squashes its input into the range (0, 1), so its output can be interpreted as a probability.
• Spoiler alert: a sigmoid on top of a linear model is a 1-layer Neural Network.
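
A minimal NumPy sketch of the sigmoid for binary predictions; the example scores and the 0.5 decision threshold are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1), so the output can be read as a probability
    return 1.0 / (1.0 + np.exp(-z))

# Invented example: scores from a linear model, turned into P(cat | image)
score = np.array([-2.0, 0.0, 3.0])
prob_cat = sigmoid(score)
print(prob_cat)                       # roughly [0.12, 0.5, 0.95]
print((prob_cat > 0.5).astype(int))   # binary predictions
```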

Logistic regression

• Probability of a binary output: model the label y_i ∈ {0, 1} as a Bernoulli trial (the same model used for coin flips),
  p(y_i = 1 | x_i, θ) = ŷ_i  and  p(y_i = 0 | x_i, θ) = 1 − ŷ_i,  with  ŷ_i = σ(x_i θ)

Logistic regression: loss function

• Apply the Maximum Likelihood Estimate to the Bernoulli probability of the binary output:
  θ_MLE = argmax_θ Σ_{i=1}^{n} [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ],  with  ŷ_i = σ(x_i θ)
• Maximize! For a sample with y_i = 1 we want y_i log ŷ_i to be large; since the logarithm is a monotonically increasing function, we also want ŷ_i to be large (1 is the largest value my estimate can take!).
• For a sample with y_i = 0 we want (1 − y_i) log(1 − ŷ_i) to be large, so we want ŷ_i to be small (0 is the smallest value my estimate can take!).
• The negative of this objective is referred to as the cross-entropy loss. It is related to the multi-class loss you will see in this course (also called softmax loss).
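
A minimal NumPy sketch of the resulting (binary) cross-entropy loss; the labels, predicted probabilities, and the clipping constant used to avoid log(0) are illustrative.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Negative Bernoulli log-likelihood, averaged over the samples."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

y = np.array([1, 0, 1, 1])                   # ground-truth labels
y_hat = np.array([0.9, 0.2, 0.7, 0.4])       # predicted probabilities sigma(x_i theta)
print(binary_cross_entropy(y, y_hat))
```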

Logistic regression: optimization

• Loss function per sample; cost function over the training set; we perform a minimization.
• No closed-form solution.
• Make use of an iterative method: gradient descent.

Gradient descent

Following the slope

• Start from an initialization and follow the slope of the DERIVATIVE downhill until reaching the optimum.

Gradient steps

• From derivative to gradient: in multiple dimensions, the gradient ∇_θ J(θ) points in the direction of greatest increase of the function.
• Gradient steps are taken in the direction of the negative gradient, scaled by the learning rate:
  θ^{k+1} = θ^k − α ∇_θ J(θ^k)
• A SMALL learning rate makes slow but steady progress; a LARGE learning rate can overshoot the optimum.
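
A minimal sketch of the gradient step above as a plain Python loop; the toy objective, learning rate, and number of steps are arbitrary illustrative choices.

```python
import numpy as np

def gradient_descent(grad_fn, theta0, learning_rate=0.1, num_steps=100):
    """Repeatedly step in the direction of the negative gradient."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_steps):
        theta = theta - learning_rate * grad_fn(theta)
    return theta

# Toy example: minimize f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
print(gradient_descent(lambda t: 2 * (t - 3.0), theta0=[0.0]))  # -> close to [3.]
```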

Convergence

• What is the gradient when we reach the optimum? It is zero, so the updates stop there.
• Depending on the initialization, gradient descent is not guaranteed to reach the (global) optimum.

Numerical gradient

• Approximate
• Slow evaluation

Analytical gradient

• Exact and fast
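
To make "approximate, slow evaluation" concrete, here is a minimal sketch of a numerical gradient via central finite differences, compared against the exact analytical gradient of a toy function; the step size h is a typical but arbitrary choice.

```python
import numpy as np

def numerical_gradient(f, theta, h=1e-5):
    """Central finite differences: two evaluations of f per dimension (slow, approximate)."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = h
        grad[j] = (f(theta + e) - f(theta - e)) / (2 * h)
    return grad

f = lambda t: np.sum(t ** 2)          # toy function with analytical gradient 2 * t
theta = np.array([1.0, -2.0, 0.5])
print(numerical_gradient(f, theta))   # approximately [2., -4., 1.]
print(2 * theta)                      # exact analytical gradient
```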

Gradient descent for least squares

• Analytical gradient: recall linear regression, where J(θ) = ‖Xθ − y‖². The gradient is ∇_θ J(θ) = 2Xᵀ(Xθ − y), and each iteration takes a step in that (negative) direction.
• The problem is convex, so gradient descent always converges to the same solution.

Gradient descent for logistic regression

• How to compute this gradient is explained in the following lectures, when these topics are introduced:
  – Computational graphs
  – Backpropagation of the gradients
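
Although the derivation of this gradient is deferred to later lectures, a minimal NumPy sketch is given below; it assumes the standard gradient of the mean cross-entropy loss for logistic regression, Xᵀ(σ(Xθ) − y)/n, and uses synthetic data with arbitrary learning rate and step count.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])   # bias column + 2 features
true_theta = np.array([-0.5, 2.0, -1.0])
y = (sigmoid(X @ true_theta) > rng.uniform(size=200)).astype(float)

theta = np.zeros(3)
learning_rate = 0.1
for _ in range(1000):
    grad = X.T @ (sigmoid(X @ theta) - y) / len(y)   # gradient of the mean cross-entropy loss
    theta -= learning_rate * grad
print(theta)  # roughly recovers true_theta
```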

Regularization

• A regularization term is added to the loss. Common choices:
  – L2 regularization
  – L1 regularization
  – Max norm regularization
  – Dropout

Regularization: example

• Input: 3 features.
• Two linear classifiers that give the same result: one ignores 2 of the features, the other takes information from all features.
• L2 regularization adds the squared norm of the weights to the loss before minimization; L1 regularization adds the sum of their absolute values instead.
• L1 regularization enforces sparsity (favoring the classifier that ignores 2 features).
• L2 regularization enforces that the weights have similar values (favoring the classifier that takes information from all features).
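
A minimal sketch of adding an L1 or L2 penalty to a loss, with two weight vectors in the spirit of the 3-feature example above (the weights and the regularization strength lam are invented): both give the same output on x = (1, 1, 1), but the L2 penalty is smaller for the spread-out weights.

```python
import numpy as np

def regularized_loss(data_loss, theta, lam=0.1, kind="L2"):
    """Add a weight penalty to the data loss; lam controls the regularization strength."""
    if kind == "L2":
        penalty = np.sum(theta ** 2)        # encourages small, similar-valued weights
    elif kind == "L1":
        penalty = np.sum(np.abs(theta))     # encourages sparse weights
    else:
        raise ValueError(kind)
    return data_loss + lam * penalty

x = np.array([1.0, 1.0, 1.0])
w_sparse = np.array([1.0, 0.0, 0.0])               # ignores 2 features
w_spread = np.array([1.0 / 3, 1.0 / 3, 1.0 / 3])   # uses all features
print(w_sparse @ x, w_spread @ x)                  # same output for both classifiers
print(regularized_loss(0.0, w_sparse, kind="L2"),  # 0.1 * 1.0
      regularized_loss(0.0, w_spread, kind="L2"))  # 0.1 * 1/3 -> L2 prefers the spread-out weights
```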

Regularization: effect

• A dog classifier takes different inputs: furry, has two eyes, has a tail, has paws, has two ears.
• L1 regularization will focus all the attention on a few key features.
• L2 regularization will take all the information into account to make decisions.

Regularization

• What is the goal of regularization? What happens to the training error? (Credits: University of Washington)
• Regularization is any strategy that aims to lower the validation error, possibly at the cost of increasing the training error.

Beyond linear

• This lecture: linear models. Next lectures: going beyond linear models.

Next lectures

• Next Monday: Introduction to Neural Networks
• This Thursday: Math refresh #2, particularly useful for the jump to NN!

Useful references

• General Machine Learning book (recommended):
  – Pattern Recognition and Machine Learning. C. Bishop.