Gradient Descent Optimization
o Definition
o Mathematical calculation of gradient
o Matrix interpretation of gradient computation
Minimizing Loss
§ In order to train, we need to minimize loss
– How do we do this?
§ Key ideas:
– Use gradient descent
– Computing the gradient using the chain rule, adjoint gradient, back propagation
θ* = argmin_θ Loss(θ)
!"# = %& '#
(
'# !"#estimate
Training data
!# known- ) (Loss*#
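To make the setup concrete, here is a minimal Python/numpy sketch (not from the slides; the linear model, data sizes, and noise level are assumptions) of training data, an inference function f_θ, and the loss that training minimizes:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # training inputs x_i
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true + 0.1 * rng.normal(size=100)     # known outputs y_i

def f(theta, X):
    # Hypothetical inference function f_theta(x): a simple linear model
    return X @ theta

def loss(theta, X, Y):
    # Squared-error loss over the training data
    return np.mean((Y - f(theta, X)) ** 2)

# theta* = argmin_theta Loss(theta); gradient descent (next slides) finds it iteratively.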
What is Gradient Descent?
§ Gradient descent:
– The simplest (but surprisingly effective) approach
– Move directly downhill
§ What is the downhill direction?
  d = −∇Loss(θ)
§ Gradient descent algorithm

Gradient Descent (GD) Algorithm:
Repeat until converged {
    d ← −∇Loss(θ)
    θ ← θ + α dᵀ
}
(The gradient ∇Loss(θ) is a 1×p row vector, so d is a row vector and must be transposed in the update.)
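A minimal numpy sketch of the GD loop above (assumptions: a linear model so the gradient can be written in closed form, synthetic data, and a hand-picked step size α = 0.05):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def grad_loss(theta):
    # Row-vector gradient of mean((Y - X theta)^2) for the linear model
    return (-2.0 / len(Y)) * ((Y - X @ theta) @ X)

theta = np.zeros(3)
alpha = 0.05                              # step size
for _ in range(200):                      # "repeat until converged" (fixed budget here)
    d = -grad_loss(theta)                 # downhill direction (1 x p row vector)
    theta = theta + alpha * d             # theta <- theta + alpha d^T (transpose implicit for 1-D arrays)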
Gradient Descent Picture
[Figure: "Gradient Ascent: N = 100, Step Size = 0.06" — contour plot showing the GD steps from the starting point to the ending point at the minimum, with the gradient direction marked]
§ The GD update step:
    d ← −∇Loss(θ)
    θ ← θ + α dᵀ

[Figure: the function and the GD updates for the 1D and 2D cases]
Gradient Step Size
§ How large should α be?
– α too small ⇒ slow convergence
– α too large ⇒ unstable
– Often there is no good choice!
[Figures: "Gradient Ascent: N = 100" trajectories for three step sizes]
– Step Size = 0.02: too small ⇒ slow convergence
– Step Size = 0.18: too large ⇒ oscillate
– Step Size = 0.06: Goldilocks? ⇒ good enough
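A tiny hedged sketch of the same step-size trade-off on a 1-D quadratic Loss(θ) = θ² (a hypothetical example, not the function plotted above):

def run_gd(alpha, steps=100, theta0=1.5):
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * 2.0 * theta   # gradient of theta^2 is 2 * theta
    return theta

print(run_gd(0.01))   # too small: after 100 steps theta is still well away from the minimum at 0
print(run_gd(0.30))   # reasonable: converges quickly
print(run_gd(1.10))   # too large: the iterates oscillate and blow up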
Steepest Descent
§ Use line search to compute the best α
§ Steepest Descent ⇒ Too Expensive
[Figure: "Steepest Ascent: N = 10" — contour plot of the steepest descent trajectory]
[Figure: line search — the loss along the search direction as a function of α, with its minimum at α*]

Steepest Descent Algorithm:
Repeat until converged {
    d ← −∇Loss(θ)
    α* ← argmin_α Loss(θ + α dᵀ)
    θ ← θ + α* dᵀ
}
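A minimal sketch of the steepest descent loop above; the line search here is a crude scan over a fixed grid of α values (an assumption — the slides do not specify how the 1-D minimization is carried out):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def loss(theta):
    return np.mean((Y - X @ theta) ** 2)

def grad_loss(theta):
    return (-2.0 / len(Y)) * ((Y - X @ theta) @ X)

theta = np.zeros(3)
alphas = np.linspace(1e-4, 1.0, 200)            # candidate step sizes for the line search
for _ in range(10):                             # N = 10, as in the figure
    d = -grad_loss(theta)
    alpha_star = min(alphas, key=lambda a: loss(theta + a * d))   # argmin_alpha Loss(theta + alpha d^T)
    theta = theta + alpha_star * d
    # each step needs many extra loss evaluations -- this is why steepest descent is expensive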
Coordinate Descent
§ Update one parameter at a time
– Removes the problem of selecting a step size
– Each update can be very fast, but lots of updates
[Figure: "Coordinate Ascent: N = 10" — contour plot of the coordinate descent trajectory, with axis-aligned steps]
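A minimal sketch of the coordinate descent idea above for the same least-squares setup (assumption: each 1-D minimization is done in closed form, which is possible for a linear model; in general any 1-D solver could be used):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

theta = np.zeros(3)
for _ in range(10):                              # outer sweeps over all coordinates
    for j in range(len(theta)):                  # update one parameter at a time -- no step size to pick
        r = Y - X @ theta                        # current residual
        theta[j] += (X[:, j] @ r) / (X[:, j] @ X[:, j])   # exact 1-D minimizer for least squares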
Slow Convergence of Gradient Descent
§ Very sensitive to the condition number of the problem
– No good choice of step size
§ Newton's method: correct for the local second derivative
– "Sphere" the ellipse
– Too much computation; too difficult to implement
§ Alternative methods
– Preconditioning: easy, but tends to be ad hoc, not so robust
– Momentum: more later
[Figure: contour plots for condition number σ_max/σ_min = 2 and σ_max/σ_min = 10 — the direction of the largest singular value σ_max requires a small step size, while the direction of the smallest singular value σ_min requires a large step size]
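A small sketch of the condition-number problem on a hypothetical quadratic (the condition number is taken as 100 here, larger than in the figures, just to make the effect stand out):

import numpy as np

H = np.diag([1.0, 100.0])                  # Hessian with condition number sigma_max / sigma_min = 100
theta = np.array([1.5, 1.5])
alpha = 0.019                              # stability requires alpha < 2 / sigma_max = 0.02 ...
for _ in range(100):
    theta = theta - alpha * (H @ theta)    # gradient of 0.5 theta^T H theta is H theta
print(theta)                               # ... so the shallow direction is still far from 0 after 100 steps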
Computing the Loss Gradient
§ Use chain rule to compute the loss gradient
∇"#$%& ' = ∇"1*+,-.
/012, − 4" 5, 6
= 1*+,-.
/01∇" 2, − 4" 5, 6
= 2*+,-.
/012, − 4" 5,
8 ∇" 2, − 4" 5,
= −2* +
,-.
9012, − 4" 5,
8 ∇"4" 5,
What does this mean?
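Before unpacking that, a quick numerical sanity check may help. The sketch below (a hypothetical tanh model and random data, not from the slides) compares the final expression for ∇_θ Loss(θ) with a central-difference estimate:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                     # M = 20 training inputs
Y = rng.normal(size=20)                          # scalar outputs (N_o = 1)

def f(theta, x):
    return np.tanh(x @ theta)                    # tiny nonlinear inference function

def f_grad(theta, x):
    return (1 - np.tanh(x @ theta) ** 2) * x     # grad_theta f_theta(x), a 1 x p row vector

def loss(theta):
    return np.mean([(y - f(theta, x)) ** 2 for x, y in zip(X, Y)])

theta = np.array([0.3, -0.7])
# Derived formula: grad Loss = -(2/M) * sum_i (y_i - f_theta(x_i)) * grad_theta f_theta(x_i)
analytic = -(2.0 / len(Y)) * sum((y - f(theta, x)) * f_grad(theta, x) for x, y in zip(X, Y))
eps = 1e-6
numeric = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps) for e in np.eye(2)])
print(np.allclose(analytic, numeric, atol=1e-6)) # True: the chain-rule result matches finite differences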
Interpretation of Loss Gradient
§ Loss gradient computation requires:
– Sum over training data: big sum, but straightforward.
– Prediction error: easy to compute.
– Gradient of inference function: this is the difficult part.
  • Most challenging part to compute.
  • Enabled by automatic differentiation built into modern domain-specific languages (DSLs) such as PyTorch, TensorFlow, and others.
  • For NNs this is known as back propagation.
−∇_θ Loss(θ) = (2/M) Σ_{i=1}^{M} (y_i − f_θ(x_i))ᵀ ∇_θ f_θ(x_i)
– the sum runs over the training data
– (y_i − f_θ(x_i)) is the prediction error
– ∇_θ f_θ(x_i) is the gradient of the inference function
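Since the slide points to automatic differentiation in frameworks such as PyTorch, here is a minimal hedged PyTorch sketch (hypothetical model and data) in which back propagation produces the loss gradient without any hand derivation:

import torch

torch.manual_seed(0)
X = torch.randn(100, 3)                          # training inputs
Y = torch.randn(100, 1)                          # known outputs

theta = torch.zeros(3, 1, requires_grad=True)    # parameters of a hypothetical model
pred = torch.tanh(X @ theta)                     # inference function f_theta(x)
loss = torch.mean((Y - pred) ** 2)               # squared-error loss over the training data

loss.backward()                                  # back propagation (reverse-mode autodiff)
print(theta.grad.shape)                          # torch.Size([3, 1]): the loss gradient w.r.t. theta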
Matrix Interpretation
§ Since y_i = f_θ(x_i) + error, the (transposed) error vector is
  e_iᵀ = y_i − f_θ(x_i)   (a 1×N_o row vector, where N_o is the output dimension)
§ Then the update to the parameter vector is the p×1 column vector
  −∇_θ Loss(θ)ᵀ   (p = dimension of the parameter vector)
Function Gradient
§ The inference function gradient, ∇_θ f_θ(x), is given by the N_o×p matrix
  [∇_θ f_θ(x)]_{ij} = ∂[f_θ(x)]_i / ∂θ_j
  where i indexes the output dimension and j indexes the parameter dimension.
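To make the indexing concrete, a small sketch (hypothetical two-output, three-parameter function) that builds the N_o×p matrix entry by entry with finite differences:

import numpy as np

N_o, p = 2, 3

def f(theta, x):
    # Hypothetical inference function with N_o = 2 outputs and p = 3 parameters
    return np.array([np.tanh(theta[0] * x[0] + theta[1] * x[1]), theta[2] * x[0]])

theta = np.array([0.5, -0.3, 1.2])
x = np.array([1.0, 2.0])
eps = 1e-6

J = np.zeros((N_o, p))                           # function gradient grad_theta f_theta(x)
for j in range(p):                               # j indexes the parameter dimension
    e = np.zeros(p); e[j] = eps
    J[:, j] = (f(theta + e, x) - f(theta - e, x)) / (2 * eps)   # column j; rows i index the output dimension
print(J.shape)                                   # (2, 3) = N_o x p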
Forward vs Backward Propagation
§ Forward gradient: push a parameter perturbation forward through the function
  δŷ ← ∇_θ f_θ(x) δθ    ((N_o×p)(p×1) → N_o×1)
§ Backward (adjoint) gradient: pull an output sensitivity backward through the function
  δθᵀ ← δŷᵀ ∇_θ f_θ(x)    ((1×N_o)(N_o×p) → 1×p)
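A shape-only sketch of the two products above, with an explicit Jacobian standing in for ∇_θ f_θ(x) (real back propagation never forms this matrix explicitly; this is just to show the dimensions):

import numpy as np

N_o, p = 2, 5
J = np.random.default_rng(0).normal(size=(N_o, p))   # function gradient, N_o x p

d_theta = np.ones((p, 1))        # forward mode: a parameter perturbation ...
d_y = J @ d_theta                # ... pushed forward: (N_o x p)(p x 1) -> N_o x 1

y_bar = np.ones((1, N_o))        # backward (adjoint) mode: an output sensitivity ...
theta_bar = y_bar @ J            # ... pulled back:     (1 x N_o)(N_o x p) -> 1 x p

print(d_y.shape, theta_bar.shape)    # (2, 1) (1, 5)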
Loss Gradient Computation
§ Equation is
  −∇_θ Loss(θ) = (2/M) Σ_{i=1}^{M} (y_i − f_θ(x_i))ᵀ ∇_θ f_θ(x_i)
  (sum over training data; prediction error; gradient of function)
§ Looks like
  (2/M) Σ_{i=1}^{M} [1×N_o prediction error] [N_o×p function gradient] = 1×p row vector
Update Direction for Supervised Training
§ d is given by
  d = (2/M) Σ_{i=1}^{M} (y_i − f_θ(x_i))ᵀ ∇_θ f_θ(x_i)    ((1×N_o)(N_o×p) → 1×p)
§ Gradient step: θ ← θ + α dᵀ

[Diagram: training pairs (x_i, y_i) — x_i is passed through f_θ to produce ŷ_i, which is compared with y_i to compute the loss]
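Putting the pieces together, a minimal end-to-end sketch (hypothetical tanh model, synthetic data, hand-picked α, and a finite-difference Jacobian standing in for the autodiff/back-propagation machinery) that forms d from prediction errors and function gradients and then takes the gradient step:

import numpy as np

rng = np.random.default_rng(0)
M, N_o, n_in = 50, 2, 3                           # examples, output dim, input dim
p = n_in * N_o                                    # number of parameters (a flattened weight matrix)
X = rng.normal(size=(M, n_in))
Y = np.tanh(X @ rng.normal(size=(n_in, N_o)))     # training pairs (x_i, y_i)

def f(theta, x):                                  # hypothetical inference function f_theta(x)
    return np.tanh(x @ theta.reshape(n_in, N_o))

def f_grad(theta, x):                             # N_o x p function gradient via finite differences
    J = np.zeros((N_o, p))
    for j in range(p):
        e = np.zeros(p); e[j] = 1e-6
        J[:, j] = (f(theta + e, x) - f(theta - e, x)) / 2e-6
    return J

theta, alpha = np.zeros(p), 0.1
for _ in range(100):
    d = np.zeros((1, p))                          # update direction, 1 x p row vector
    for x, y in zip(X, Y):
        err = (y - f(theta, x)).reshape(1, N_o)   # prediction error, 1 x N_o
        d += err @ f_grad(theta, x)               # (1 x N_o)(N_o x p) -> 1 x p
    d *= 2.0 / M
    theta = theta + alpha * d.ravel()             # gradient step: theta <- theta + alpha d^T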