
Roadmap Intro, model, cost Gradient descent

Deep Learning I: Gradient Descent

Hinrich Schütze

Center for Information and Language Processing, LMU Munich

2017-07-19

Schütze (LMU Munich): Gradient descent 1 / 40


Overview

1 Roadmap

2 Intro, model, cost

3 Gradient descent




word2vec skipgram

predict, based on input word, a context word


word2vec parameter estimation:

Historical development vs. presentation in this lecture

Mikolov et al. (2013) introduce word2vec, estimating parameters by gradient descent. (today)

Still the learning algorithm used by default and in most cases.

Levy and Goldberg (2014) show near-equivalence to a particular type of matrix factorization. (yesterday)

Important because it links two important bodies of research: neural networks and distributional semantics.


Gradient descent (GD)

Gradient descent is a learning algorithm.

Given:

a hypothesis space (or model family)

an objective function or cost function

a training set

Gradient descent (GD) finds a set of parameters, i.e., a member of the hypothesis space (a specific model), that performs well on the objective for the training set.

GD is arguably the most important learning algorithm, notably in deep learning.


Simple learning problem → word2vec

Gradient descent for housing prices

Gradient descent for word2vec skipgram


Inverted Classroom: Andrew Ng, “Machine Learning”

http://coursera.org


Outline

1 Roadmap

2 Intro, model, cost

3 Gradient descent


Intro, model, cost: Andrew Ng videos

Introduction

Supervised learning

Model and cost function

Model representation

Cost function

Cost function, intuition 1

Cost function, intuition 2


Housing prices in Portland

input variable x      output variable y
size (feet²)          price ($) in 1000s

2104                  460
1416                  232
1534                  315
 852                  178

We will use m for the number of training examples.


Setup to learn a housing price predictor using GD

Next: Setup for word2vec skipgram

Hypothesis: hθ(x) = θ0 + θ1x

Parameters: θ = (θ0, θ1)

Cost function: J(θ0, θ1) = (1/2m) Σ_{i=1}^m (hθ(x^(i)) − y^(i))²

Objective: minimize_{θ0, θ1} J(θ0, θ1)
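To make the setup concrete, here is a minimal Python sketch (my own illustration, not from the lecture) that evaluates this cost function on the Portland housing data:

```python
def cost(theta0, theta1, xs, ys):
    """J(θ0, θ1) = (1/2m) Σ (hθ(x^(i)) − y^(i))², with hθ(x) = θ0 + θ1·x."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Sizes in feet² and prices in $1000s, from the housing table above.
xs = [2104, 1416, 1534, 852]
ys = [460, 232, 315, 178]

print(cost(0.0, 0.2, xs, ys))  # cost of the hypothesis hθ(x) = 0.2·x
```

Gradient descent will search for the (θ0, θ1) that makes this value as small as possible.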


graph: hypothesis, parameters


Basic idea: gradient descent finds “close” hypothesis

Choose θ0, θ1 so that, for our training examples (x, y), hθ(x) is close to y.


Cost as a function of θ1 (for θ0 = 0)

Left: housing price y as a function of square footage x

Right: cost J(θ1) as a function of θ1


one-variable case


Cost as a function of θ0, θ1

Left: housing price y as a function of square footage x

Right (contour plot): cost J(θ0, θ1) as a function of (θ0, θ1)


Hypothesis (left)

Point corresponding to its cost in contour (right)


Surface plot (instead of contour plot)


Outline

1 Roadmap

2 Intro, model, cost

3 Gradient descent


Gradient descent: Andrew Ng videos

Gradient descent

Gradient descent intuition

Gradient descent for linear regression


Gradient descent: Basic idea

Start with some values of the parameters, e.g., θ0 = 0.3, θ1 = −2

(These initial values are randomly chosen.)

Keep changing θ0, θ1 to reduce J(θ0, θ1)

Hopefully, we will end up at a minimum of J(θ0, θ1).

“keep changing”: how exactly do we do that?


Gradient descent: One step (one variable)

Repeat until convergence:

θ1 := θ1 − α (d/dθ1) J(θ1)

Or: Repeat until convergence:

θ1 := θ1 − α J′(θ1)

α is the learning rate.
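A minimal Python sketch of this update rule (the quadratic toy cost and all names are my own, not from the lecture): repeatedly apply θ1 := θ1 − α J′(θ1).

```python
def gradient_descent_1d(theta, alpha, deriv, steps):
    """Repeat θ := θ − α·J′(θ) for a fixed number of steps."""
    for _ in range(steps):
        theta = theta - alpha * deriv(theta)
    return theta

# Toy cost J(θ1) = (θ1 − 3)² with derivative J′(θ1) = 2(θ1 − 3); minimum at θ1 = 3.
theta = gradient_descent_1d(theta=0.0, alpha=0.1, deriv=lambda t: 2 * (t - 3), steps=100)
print(round(theta, 4))  # → 3.0
```

Each step moves against the sign of the slope, so θ1 slides downhill toward the minimum.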


positive/negative slope


Exercise: Gradient descent

Suppose you have arrived at a point where the gradient is 0

Draw an example of this situation

Mark the current point at which the gradient is zero and the point that gradient descent will move to next.


Gradient descent: One step (> 1 variables)

Repeat until convergence:

θ0 := θ0 − α ∂/∂θ0 J(θ0, . . . , θk)

θ1 := θ1 − α ∂/∂θ1 J(θ0, . . . , θk)

. . .

θk := θk − α ∂/∂θk J(θ0, . . . , θk)

α is the learning rate.
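A small Python sketch of one such step (the toy cost, gradient, and names are my own, not from the lecture); note that all components are updated simultaneously, from the gradient at the old θ:

```python
def gd_step(theta, alpha, grad):
    """One simultaneous update θj := θj − α ∂J/∂θj for all j."""
    g = grad(theta)  # full gradient at the *current* θ, before any component changes
    return [t - alpha * gj for t, gj in zip(theta, g)]

# Toy cost J(θ0, θ1) = θ0² + θ1² with gradient (2θ0, 2θ1); minimum at (0, 0).
grad = lambda th: [2 * th[0], 2 * th[1]]
theta = [4.0, -2.0]
for _ in range(200):
    theta = gd_step(theta, 0.1, grad)
print(theta)  # both components shrink toward 0
```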


Path down to minimum: Two parameters


Surface plot (instead of contour plot)


Gradient descent: One step (regression)

Repeat until convergence:

θ0 := θ0 − α (1/m) Σ_{i=1}^m (hθ(x^(i)) − y^(i))

θ1 := θ1 − α (1/m) Σ_{i=1}^m [(hθ(x^(i)) − y^(i)) · x^(i)]

α is the learning rate.
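Run on the Portland housing data, these updates look as follows; a sketch under one assumption that is not in the slides: the size feature is rescaled to thousands of feet² so that a simple fixed learning rate converges.

```python
xs = [2.104, 1.416, 1.534, 0.852]  # size in 1000s of feet² (rescaled)
ys = [460.0, 232.0, 315.0, 178.0]  # price in $1000s
m = len(xs)

def cost(t0, t1):
    return sum((t0 + t1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

t0, t1, alpha = 0.0, 0.0, 0.1
for _ in range(5000):
    # Both partial derivatives use the same (old) t0, t1: simultaneous update.
    d0 = sum(t0 + t1 * x - y for x, y in zip(xs, ys)) / m
    d1 = sum((t0 + t1 * x - y) * x for x, y in zip(xs, ys)) / m
    t0, t1 = t0 - alpha * d0, t1 - alpha * d1

print(round(cost(0.0, 0.0)), round(cost(t0, t1)))  # cost drops sharply
```

With the raw feet² values, a learning rate this large would diverge; that is why the rescaling (or a much smaller α) is needed.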


derivatives


Learning rate α

If α is too small: gradient descent is very slow.

If α is too large: gradient descent overshoots; it doesn’t converge or even (!) diverges.
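A toy illustration of both failure modes (the cost and the numbers are my own, not from the lecture), on J(θ) = θ² where J′(θ) = 2θ, so one step multiplies θ by (1 − 2α):

```python
def run(alpha, steps=20, theta=1.0):
    """Gradient descent on J(θ) = θ²: θ := θ − α·2θ."""
    for _ in range(steps):
        theta -= alpha * 2 * theta
    return theta

print(abs(run(0.01)))  # too small: after 20 steps still far from the minimum 0
print(abs(run(0.4)))   # reasonable: essentially at the minimum
print(abs(run(1.1)))   # too large: |θ| grows every step → divergence
```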


divergence


Different minima: Exercise

Depending on the starting point, we can arrive at different minima here. Find two starting points that will result in different minima.



Gradient descent: batch vs stochastic

So far: batch gradient descent

Recall the cost function:

J(θ0, θ1) = (1/2m) Σ_{i=1}^m (hθ(x^(i)) − y^(i))²

We sum over the entire training set . . .

. . . and compute the gradient for the entire training set.


Stochastic gradient descent

Instead of computing the gradient for the entire training set,

we compute it for one training example

or for a minibatch of k training examples.

E.g., k ∈ {10, 50, 100, 500, 1000}

We usually add randomization,

e.g., shuffling the training set.

This is called stochastic gradient descent.
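A minimal minibatch SGD sketch (the data, batch size, and all names are my own illustration, not from the lecture): shuffle the training set each epoch, then do one update per batch of k examples.

```python
import random

def sgd(data, alpha=0.1, k=2, epochs=2000, seed=0):
    """Minibatch SGD for hθ(x) = θ0 + θ1·x with squared-error cost."""
    rng = random.Random(seed)
    t0 = t1 = 0.0
    for _ in range(epochs):
        rng.shuffle(data)                      # randomization: shuffle each epoch
        for start in range(0, len(data), k):   # one update per minibatch
            batch = data[start:start + k]
            d0 = sum(t0 + t1 * x - y for x, y in batch) / len(batch)
            d1 = sum((t0 + t1 * x - y) * x for x, y in batch) / len(batch)
            t0, t1 = t0 - alpha * d0, t1 - alpha * d1
    return t0, t1

# Ten points lying exactly on the line y = 1 + 2x.
data = [(i / 10, 1 + 2 * (i / 10)) for i in range(10)]
t0, t1 = sgd(data)
print(round(t0, 2), round(t1, 2))  # recovers θ0 ≈ 1, θ1 ≈ 2
```

Each minibatch gradient is a noisy estimate of the full-batch gradient, but the updates are far cheaper, so many more of them fit into the same compute budget.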


Gradient descent: stochastic vs batch

Advantages of stochastic gradient descent

More “user-friendly” for very large training sets; converges faster for most hard problems; often converges to a better optimum.

Advantages of batch gradient descent

Is easier to justify as doing the right thing (e.g., no bad move based on an outlier training example); converges faster for some problems; requires far fewer parameter updates per epoch.


Exercise

Prepare a short (perhaps three-sentence) summary of how gradient descent works.

Present this summary to your neighbor.

You may want to refer to these notions:

Hypothesis hθ(x) = θ0 + θ1x

Parameters θ = (θ0, θ1)

Cost function J(θ0, θ1) = (1/2m) Σ_{i=1}^m (hθ(x^(i)) − y^(i))²

Derivative of the cost function

Objective: minimize_{θ0, θ1} J(θ0, θ1)


Gradient descent: Summary

n-dimensional parameter space for n parameters

Height (dimension n + 1) is the cost → the surface we move on.

A series of steps, each step in the steepest direction down.

The derivative gives us the steepest direction down.

The learning rate determines how big the steps are we take.

We will eventually converge (stop) at a minimum (valley).

There is no guarantee we will reach the global minimum.
