Gradient Descent Optimization
o Definition
o Mathematical calculation of gradient
o Matrix interpretation of gradient computation
Minimizing Loss
§ In order to train, we need to minimize loss
– How do we do this?
§ Key ideas:
– Use gradient descent
– Computing the gradient using the chain rule, adjoint gradient, back propagation
θ* = argmin_θ Loss(θ)
!"# = %& '#
(
'# !"#estimate
Training data
!# known- ) (Loss*#
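To make the setup concrete, here is a minimal Python/numpy sketch (not from the slides; the linear model, data sizes, and noise level are assumptions) of training data, an inference function f_θ, and the loss that training minimizes:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # training inputs x_i
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true + 0.1 * rng.normal(size=100)     # known outputs y_i

def f(theta, X):
    # Hypothetical inference function f_theta(x): a simple linear model
    return X @ theta

def loss(theta, X, Y):
    # Squared-error loss over the training data
    return np.mean((Y - f(theta, X)) ** 2)

# theta* = argmin_theta Loss(theta); gradient descent (next slides) finds it iteratively.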
What is Gradient Descent?
§ Gradient descent:
– The simplest (but surprisingly effective) approach
– Move directly downhill
§ What is the downhill direction?
  d = −∇Loss(θ)
§ Gradient descent algorithm

Gradient Descent (GD) Algorithm:
Repeat until converged {
    d ← −∇Loss(θ)
    θ ← θ + α dᵀ
}
(The gradient ∇Loss(θ) is a 1×p row vector, so d is a row vector and must be transposed in the update.)
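A minimal numpy sketch of the GD loop above (assumptions: a linear model so the gradient can be written in closed form, synthetic data, and a hand-picked step size α = 0.05):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def grad_loss(theta):
    # Row-vector gradient of mean((Y - X theta)^2) for the linear model
    return (-2.0 / len(Y)) * ((Y - X @ theta) @ X)

theta = np.zeros(3)
alpha = 0.05                              # step size
for _ in range(200):                      # "repeat until converged" (fixed budget here)
    d = -grad_loss(theta)                 # downhill direction (1 x p row vector)
    theta = theta + alpha * d             # theta <- theta + alpha d^T (transpose implicit for 1-D arrays)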
Gradient Descent Picture
[Figure: "Gradient Ascent: N = 100, Step Size = 0.06" — contour plot showing the GD steps from the starting point to the ending point at the minimum, with the gradient direction marked]
§ The GD update step:
    d ← −∇Loss(θ)
    θ ← θ + α dᵀ

[Figure: the function and the GD updates for the 1D and 2D cases]
Gradient Step Size
§ How large should α be?
– α too small ⇒ slow convergence
– α too large ⇒ unstable
– Often there is no good choice!
[Figures: "Gradient Ascent: N = 100" trajectories for three step sizes]
– Step Size = 0.02: too small ⇒ slow convergence
– Step Size = 0.18: too large ⇒ oscillate
– Step Size = 0.06: Goldilocks? ⇒ good enough
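A tiny hedged sketch of the same step-size trade-off on a 1-D quadratic Loss(θ) = θ² (a hypothetical example, not the function plotted above):

def run_gd(alpha, steps=100, theta0=1.5):
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * 2.0 * theta   # gradient of theta^2 is 2 * theta
    return theta

print(run_gd(0.01))   # too small: after 100 steps theta is still well away from the minimum at 0
print(run_gd(0.30))   # reasonable: converges quickly
print(run_gd(1.10))   # too large: the iterates oscillate and blow up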
Steepest Descent
§ Use line search to compute the best α
§ Steepest Descent ⇒ Too Expensive
[Figure: "Steepest Ascent: N = 10" — contour plot of the steepest descent trajectory]
[Figure: line search — the loss along the search direction as a function of α, with its minimum at α*]

Steepest Descent Algorithm:
Repeat until converged {
    d ← −∇Loss(θ)
    α* ← argmin_α Loss(θ + α dᵀ)
    θ ← θ + α* dᵀ
}
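A minimal sketch of the steepest descent loop above; the line search here is a crude scan over a fixed grid of α values (an assumption — the slides do not specify how the 1-D minimization is carried out):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def loss(theta):
    return np.mean((Y - X @ theta) ** 2)

def grad_loss(theta):
    return (-2.0 / len(Y)) * ((Y - X @ theta) @ X)

theta = np.zeros(3)
alphas = np.linspace(1e-4, 1.0, 200)            # candidate step sizes for the line search
for _ in range(10):                             # N = 10, as in the figure
    d = -grad_loss(theta)
    alpha_star = min(alphas, key=lambda a: loss(theta + a * d))   # argmin_alpha Loss(theta + alpha d^T)
    theta = theta + alpha_star * d
    # each step needs many extra loss evaluations -- this is why steepest descent is expensive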
Coordinate Descent
§ Update one parameter at a time
– Removes the problem of selecting a step size
– Each update can be very fast, but lots of updates
[Figure: "Coordinate Ascent: N = 10" — contour plot of the coordinate descent trajectory, with axis-aligned steps]
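A minimal sketch of the coordinate descent idea above for the same least-squares setup (assumption: each 1-D minimization is done in closed form, which is possible for a linear model; in general any 1-D solver could be used):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

theta = np.zeros(3)
for _ in range(10):                              # outer sweeps over all coordinates
    for j in range(len(theta)):                  # update one parameter at a time -- no step size to pick
        r = Y - X @ theta                        # current residual
        theta[j] += (X[:, j] @ r) / (X[:, j] @ X[:, j])   # exact 1-D minimizer for least squares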
Slow Convergence of Gradient Descent
§ Very sensitive to the condition number of the problem
– No good choice of step size
§ Newton's method: correct for the local second derivative
– "Sphere" the ellipse
– Too much computation; too difficult to implement
§ Alternative methods
– Preconditioning: easy, but tends to be ad hoc, not so robust
– Momentum: more later
[Figure: contour plots for condition number σ_max/σ_min = 2 and σ_max/σ_min = 10 — the direction of the largest singular value σ_max requires a small step size, while the direction of the smallest singular value σ_min requires a large step size]
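A small sketch of the condition-number problem on a hypothetical quadratic (the condition number is taken as 100 here, larger than in the figures, just to make the effect stand out):

import numpy as np

H = np.diag([1.0, 100.0])                  # Hessian with condition number sigma_max / sigma_min = 100
theta = np.array([1.5, 1.5])
alpha = 0.019                              # stability requires alpha < 2 / sigma_max = 0.02 ...
for _ in range(100):
    theta = theta - alpha * (H @ theta)    # gradient of 0.5 theta^T H theta is H theta
print(theta)                               # ... so the shallow direction is still far from 0 after 100 steps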
Computing the Loss Gradient
§ Use chain rule to compute the loss gradient
∇"#$%& ' = ∇"1*+,-.
/012, − 4" 5, 6
= 1*+,-.
/01∇" 2, − 4" 5, 6
= 2*+,-.
/012, − 4" 5,
8 ∇" 2, − 4" 5,
= −2* +
,-.
9012, − 4" 5,
8 ∇"4" 5,
What does this mean?
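Before unpacking that, a quick numerical sanity check may help. The sketch below (a hypothetical tanh model and random data, not from the slides) compares the final expression for ∇_θ Loss(θ) with a central-difference estimate:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                     # M = 20 training inputs
Y = rng.normal(size=20)                          # scalar outputs (N_o = 1)

def f(theta, x):
    return np.tanh(x @ theta)                    # tiny nonlinear inference function

def f_grad(theta, x):
    return (1 - np.tanh(x @ theta) ** 2) * x     # grad_theta f_theta(x), a 1 x p row vector

def loss(theta):
    return np.mean([(y - f(theta, x)) ** 2 for x, y in zip(X, Y)])

theta = np.array([0.3, -0.7])
# Derived formula: grad Loss = -(2/M) * sum_i (y_i - f_theta(x_i)) * grad_theta f_theta(x_i)
analytic = -(2.0 / len(Y)) * sum((y - f(theta, x)) * f_grad(theta, x) for x, y in zip(X, Y))
eps = 1e-6
numeric = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps) for e in np.eye(2)])
print(np.allclose(analytic, numeric, atol=1e-6)) # True: the chain-rule result matches finite differences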
Interpretation of Loss Gradient
§ Loss gradient computation requires:
– Sum over training data: big sum, but straightforward.
– Prediction error: easy to compute.
– Gradient of inference function: this is the difficult part.
  • Most challenging part to compute.
  • Enabled by automatic differentiation built into modern domain-specific languages (DSLs) such as PyTorch, TensorFlow, and others.
  • For NNs this is known as back propagation.
−∇_θ Loss(θ) = (2/M) Σ_{i=1}^{M} (y_i − f_θ(x_i))ᵀ ∇_θ f_θ(x_i)
– the sum runs over the training data
– (y_i − f_θ(x_i)) is the prediction error
– ∇_θ f_θ(x_i) is the gradient of the inference function
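Since the slide points to automatic differentiation in frameworks such as PyTorch, here is a minimal hedged PyTorch sketch (hypothetical model and data) in which back propagation produces the loss gradient without any hand derivation:

import torch

torch.manual_seed(0)
X = torch.randn(100, 3)                          # training inputs
Y = torch.randn(100, 1)                          # known outputs

theta = torch.zeros(3, 1, requires_grad=True)    # parameters of a hypothetical model
pred = torch.tanh(X @ theta)                     # inference function f_theta(x)
loss = torch.mean((Y - pred) ** 2)               # squared-error loss over the training data

loss.backward()                                  # back propagation (reverse-mode autodiff)
print(theta.grad.shape)                          # torch.Size([3, 1]): the loss gradient w.r.t. theta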
Matrix Interpretation
§ Since y_i = f_θ(x_i) + error, the (transposed) error vector is
  e_iᵀ = y_i − f_θ(x_i)   (a 1×N_o row vector, where N_o is the output dimension)
§ Then the update to the parameter vector is the p×1 column vector
  −∇_θ Loss(θ)ᵀ   (p = dimension of the parameter vector)
Function Gradient
§ The inference function gradient, ∇_θ f_θ(x), is given by the N_o×p matrix
  [∇_θ f_θ(x)]_{ij} = ∂[f_θ(x)]_i / ∂θ_j
  where i indexes the output dimension and j indexes the parameter dimension.
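To make the indexing concrete, a small sketch (hypothetical two-output, three-parameter function) that builds the N_o×p matrix entry by entry with finite differences:

import numpy as np

N_o, p = 2, 3

def f(theta, x):
    # Hypothetical inference function with N_o = 2 outputs and p = 3 parameters
    return np.array([np.tanh(theta[0] * x[0] + theta[1] * x[1]), theta[2] * x[0]])

theta = np.array([0.5, -0.3, 1.2])
x = np.array([1.0, 2.0])
eps = 1e-6

J = np.zeros((N_o, p))                           # function gradient grad_theta f_theta(x)
for j in range(p):                               # j indexes the parameter dimension
    e = np.zeros(p); e[j] = eps
    J[:, j] = (f(theta + e, x) - f(theta - e, x)) / (2 * eps)   # column j; rows i index the output dimension
print(J.shape)                                   # (2, 3) = N_o x p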
Forward vs Backward Propagation
§ Forward gradient: push a parameter perturbation forward through the function
  δŷ ← ∇_θ f_θ(x) δθ    ((N_o×p)(p×1) → N_o×1)
§ Backward (adjoint) gradient: pull an output sensitivity backward through the function
  δθᵀ ← δŷᵀ ∇_θ f_θ(x)    ((1×N_o)(N_o×p) → 1×p)
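A shape-only sketch of the two products above, with an explicit Jacobian standing in for ∇_θ f_θ(x) (real back propagation never forms this matrix explicitly; this is just to show the dimensions):

import numpy as np

N_o, p = 2, 5
J = np.random.default_rng(0).normal(size=(N_o, p))   # function gradient, N_o x p

d_theta = np.ones((p, 1))        # forward mode: a parameter perturbation ...
d_y = J @ d_theta                # ... pushed forward: (N_o x p)(p x 1) -> N_o x 1

y_bar = np.ones((1, N_o))        # backward (adjoint) mode: an output sensitivity ...
theta_bar = y_bar @ J            # ... pulled back:     (1 x N_o)(N_o x p) -> 1 x p

print(d_y.shape, theta_bar.shape)    # (2, 1) (1, 5)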
Loss Gradient Computation
§ Equation is
  −∇_θ Loss(θ) = (2/M) Σ_{i=1}^{M} (y_i − f_θ(x_i))ᵀ ∇_θ f_θ(x_i)
  (sum over training data; prediction error; gradient of function)
§ Looks like
  (2/M) Σ_{i=1}^{M} [1×N_o prediction error] [N_o×p function gradient] = 1×p row vector
Update Direction for Supervised Training
§ d is given by
  d = (2/M) Σ_{i=1}^{M} (y_i − f_θ(x_i))ᵀ ∇_θ f_θ(x_i)    ((1×N_o)(N_o×p) → 1×p)
§ Gradient step: θ ← θ + α dᵀ

[Diagram: training pairs (x_i, y_i) — x_i is passed through f_θ to produce ŷ_i, which is compared with y_i to compute the loss]
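Putting the pieces together, a minimal end-to-end sketch (hypothetical tanh model, synthetic data, hand-picked α, and a finite-difference Jacobian standing in for the autodiff/back-propagation machinery) that forms d from prediction errors and function gradients and then takes the gradient step:

import numpy as np

rng = np.random.default_rng(0)
M, N_o, n_in = 50, 2, 3                           # examples, output dim, input dim
p = n_in * N_o                                    # number of parameters (a flattened weight matrix)
X = rng.normal(size=(M, n_in))
Y = np.tanh(X @ rng.normal(size=(n_in, N_o)))     # training pairs (x_i, y_i)

def f(theta, x):                                  # hypothetical inference function f_theta(x)
    return np.tanh(x @ theta.reshape(n_in, N_o))

def f_grad(theta, x):                             # N_o x p function gradient via finite differences
    J = np.zeros((N_o, p))
    for j in range(p):
        e = np.zeros(p); e[j] = 1e-6
        J[:, j] = (f(theta + e, x) - f(theta - e, x)) / 2e-6
    return J

theta, alpha = np.zeros(p), 0.1
for _ in range(100):
    d = np.zeros((1, p))                          # update direction, 1 x p row vector
    for x, y in zip(X, Y):
        err = (y - f(theta, x)).reshape(1, N_o)   # prediction error, 1 x N_o
        d += err @ f_grad(theta, x)               # (1 x N_o)(N_o x p) -> 1 x p
    d *= 2.0 / M
    theta = theta + alpha * d.ravel()             # gradient step: theta <- theta + alpha d^T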