[course site]
Verónica Vilaplana [email protected]
Associate Professor, Universitat Politècnica de Catalunya (Technical University of Catalonia)
Optimization for neural network training
Day 4 Lecture 1
#DLUPC
Previously in DLAI…
• Multilayer perceptron
• Training: (stochastic / mini-batch) gradient descent
• Backpropagation

but… What type of optimization problem? Do local minima and saddle points cause problems? Does gradient descent perform well? How to set the learning rate? How to initialize weights? How does batch size affect training?
Index
• Optimization for a machine learning task; difference between learning and pure optimization
  • Expected and empirical risk
  • Surrogate loss functions and early stopping
  • Batch and mini-batch algorithms
• Challenges
  • Local minima
  • Saddle points and other flat regions
  • Cliffs and exploding gradients
• Practical algorithms
  • Stochastic Gradient Descent
  • Momentum
  • Nesterov Momentum
  • Learning rate
  • Adaptive learning rates: AdaGrad, RMSProp, Adam
  • Approximate second-order methods
• Parameter initialization
• Batch Normalization
Differences between learning and pure optimization
Optimization for NN training
• Goal: find the parameters θ that minimize the expected risk (generalization error)

  J(θ) = E_{(x,y)∼p_data} L(f(x;θ), y)

  • x input, f(x;θ) predicted output, y target output, E expectation
  • p_data true (unknown) data distribution
  • L loss function (how wrong predictions are)

• But we only have a training set of samples: we minimize the empirical risk, the average loss on a finite dataset D

  J(θ) = E_{(x,y)∼p̂_data} L(f(x;θ), y) = (1/|D|) Σ_{(x^(i),y^(i))∈D} L(f(x^(i);θ), y^(i))

  where p̂_data is the empirical distribution and |D| is the number of examples in D
Surrogate loss
• Often minimizing the real loss is intractable
  • e.g. the 0-1 loss (0 if correctly classified, 1 if not) is intractable even for linear classifiers (Marcotte 1992)
• Minimize a surrogate loss instead
  • e.g. the negative log-likelihood as a surrogate for the 0-1 loss
• Sometimes the surrogate loss allows learning more
  • the test 0-1 loss keeps decreasing even after the training 0-1 loss is zero
  • by further pushing the classes apart from each other
0-1 loss (blue) and surrogate losses (square, hinge, logistic)
0-1 loss (blue) and negative log-likelihood (red)

Surrogate loss functions
Binary classifier vs. multiclass classifier:

Probabilistic classifier
• Binary: outputs the probability of class 1, g(x) ≈ P(y=1|x); the probability for class 0 is 1−g(x)
  Binary cross-entropy loss: L(g(x), y) = −( y log g(x) + (1−y) log(1−g(x)) )
  Decision function: f(x) = I[g(x) > 0.5]
• Multiclass: outputs a vector of probabilities g(x) ≈ ( P(y=0|x), ..., P(y=m−1|x) )
  Negative conditional log-likelihood loss: L(g(x), y) = −log g(x)_y
  Decision function: f(x) = argmax_k g(x)_k

Non-probabilistic classifier
• Binary: outputs a «score» g(x) for class 1; the score for the other class is −g(x)
  Hinge loss: L(g(x), t) = max(0, 1 − t·g(x)), where t = 2y−1
  Decision function: f(x) = I[g(x) > 0]
• Multiclass: outputs a vector g(x) of real-valued scores for the m classes
  Multiclass margin loss: L(g(x), y) = max(0, 1 + max_{k≠y} g(x)_k − g(x)_y)
  Decision function: f(x) = argmax_k g(x)_k
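A minimal NumPy sketch of these losses; the function names and the clipping constant are my own choices, not from the slides:

```python
import numpy as np

def binary_cross_entropy(g_x, y, eps=1e-12):
    """Binary cross-entropy for a probabilistic classifier g(x) ≈ P(y=1|x)."""
    g_x = np.clip(g_x, eps, 1 - eps)               # avoid log(0)
    return -(y * np.log(g_x) + (1 - y) * np.log(1 - g_x))

def hinge_loss(g_x, y):
    """Hinge loss for a score-based classifier; labels y in {0,1} mapped to t in {-1,+1}."""
    t = 2 * y - 1
    return np.maximum(0.0, 1.0 - t * g_x)

def multiclass_margin_loss(scores, y):
    """Multiclass margin loss max(0, 1 + max_{k!=y} s_k - s_y) for one example."""
    correct = scores[y]
    other = np.delete(scores, y)                   # scores of all classes except the target
    return max(0.0, 1.0 + other.max() - correct)
```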
Early stopping
• Training algorithms usually do not halt at a local minimum
• Early stopping:
  • based on the true underlying loss (e.g. the 0-1 loss) measured on a validation set
  • # training steps = hyperparameter controlling the effective capacity of the model
  • simple and effective; must keep a copy of the best parameters
  • acts as a regularizer (Bishop 1995, …)
Training error decreases steadily; validation error begins to increase. Return the parameters at the point with the lowest validation error.
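A sketch of this procedure, assuming placeholder routines `train_one_epoch(model)` and `validate(model)` and a `model.parameters` attribute (none of these names come from the slides):

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validate, patience=10, max_epochs=200):
    """Stop when validation error has not improved for `patience` epochs;
    return the parameters that achieved the lowest validation error."""
    best_error = float("inf")
    best_params = copy.deepcopy(model.parameters)   # keep a copy of the best parameters
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_error = validate(model)                 # e.g. 0-1 loss on the validation set
        if val_error < best_error:
            best_error = val_error
            best_params = copy.deepcopy(model.parameters)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                               # # steps controls effective capacity
    model.parameters = best_params
    return model
```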
Batch and mini-batch algorithms
• In most optimization methods used in ML, the objective function decomposes as a sum over the training set
• Gradient descent: given examples {x^(i)}_{i=1..m} from the training set with corresponding targets {y^(i)}_{i=1..m},

  ∇_θ J(θ) = E_{(x,y)∼p̂_data} ∇_θ L(f(x;θ), y) = (1/m) Σ_i ∇_θ L(f(x^(i);θ), y^(i))

• Using the complete training set can be very expensive (the gain from using more samples is less than linear: the standard error of the mean drops proportionally to sqrt(m), and the training set may be redundant): use a subset of the training set
• How many samples in each update step?
  • Deterministic or batch gradient methods: process all training samples in a large batch
  • Stochastic methods: use a single example at a time
    • online methods: samples are drawn from a stream of continually created samples
  • Mini-batch stochastic methods: use several (not all) samples
Batch and mini-batch algorithms
Mini-batch size?
• Larger batches: more accurate estimate of the gradient, but less than linear return
• Very small batches: multicore architectures are under-utilized
• If samples are processed in parallel: memory scales with batch size
• Smaller batches provide noisier gradient estimates
• Small batches may offer a regularizing effect (they add noise)
  • but may require a small learning rate
  • may increase the number of steps for convergence
• Mini-batches should be selected randomly (shuffle the samples) for an unbiased estimate of the gradients (see the sketch below)
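A minimal sketch of random mini-batch selection; `X` and `y` are the full training arrays and the function name is my own:

```python
import numpy as np

def iterate_minibatches(X, y, batch_size):
    """Yield randomly shuffled mini-batches, giving unbiased gradient estimates."""
    rng = np.random.default_rng()
    indices = rng.permutation(len(X))              # shuffle the samples
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        yield X[batch], y[batch]
```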
Challenges in NN optimization
Local minima
• Convex optimization
  • any local minimum is a global minimum
  • there are several optimization algorithms (polynomial-time)
• Non-convex optimization
  • the objective function in deep networks is non-convex
  • deep models may have several local minima
  • but this is not necessarily a major problem!
Local minima and saddle points
• Critical points: points where ∇_x f(x) = 0, for f: ℝ^n → ℝ
• For high-dimensional loss functions, local minima are rare compared to saddle points
• Hessian matrix: H_ij = ∂²f / (∂x_i ∂x_j), real and symmetric, with an eigenvector/eigenvalue decomposition
• Intuition: eigenvalues of the Hessian matrix
  • local minimum/maximum: all positive / all negative eigenvalues: exponentially unlikely as n grows
  • saddle points: both positive and negative eigenvalues

Dauphin et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014
Local minima and saddle points
• It is believed that for many problems, including learning deep nets, almost all local minima have a function value very similar to that of the global optimum
• Finding a local minimum is good enough
• For many random functions, local minima are more likely to have low cost than high cost

Value of local minima found by running SGD for 200 iterations on a simplified version of MNIST from different initial starting points. As the number of parameters increases, local minima tend to cluster more tightly.

Dauphin et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014
Saddle points
• How to escape from saddle points?
• First order methods
  • initially attracted to saddle points, but unless they hit one exactly, they are repelled when close
  • hitting a critical point exactly is unlikely (the estimated gradient is noisy)
  • saddle points are very unstable: noise (stochastic gradient descent) helps convergence, the trajectory escapes quickly
• Second order methods
  • Newton's method can jump to saddle points (where the gradient is 0)

SGD tends to oscillate between slowly approaching a saddle point and quickly escaping from it. Slide credit: K. McGuinness
Other difficulties
• Cliffs and exploding gradients
  • Nets with many layers / recurrent nets can contain very steep regions (cliffs, resulting from the multiplication of several parameters): gradient descent can move the parameters too far, jumping off the cliff (solution: gradient clipping)
• Long term dependencies
  • the computational graph becomes very deep: vanishing and exploding gradients
Algorithms
Stochastic Gradient Descent (SGD)
• Most used algorithm for deep learning
• Do not confuse with deterministic gradient descent: stochastic methods use mini-batches

Algorithm
• Require: learning rate α, initial parameter θ
• while stopping criterion not met do
  • sample a mini-batch of m examples {x^(i)}_{i=1..m} from the training set with corresponding targets {y^(i)}_{i=1..m}
  • compute gradient estimate: g ← (1/m) Σ_i ∇_θ L(f(x^(i);θ), y^(i))
  • apply update: θ ← θ − α g
• end while
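A minimal NumPy sketch of this loop; `loss_grad(theta, X, y)` is an assumed placeholder that returns the mini-batch gradient of the loss with respect to θ:

```python
import numpy as np

def sgd(theta, loss_grad, X, y, lr=0.01, batch_size=128, epochs=10):
    """Plain mini-batch SGD: theta <- theta - lr * g."""
    rng = np.random.default_rng()
    for _ in range(epochs):
        indices = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = indices[start:start + batch_size]
            g = loss_grad(theta, X[batch], y[batch])   # gradient estimate on the mini-batch
            theta = theta - lr * g                     # apply update
    return theta
```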
Momentum
• Designed to accelerate learning, especially for high curvature, small but consistent gradients, or noisy gradients
• Momentum aims to solve two problems: poor conditioning of the Hessian matrix and variance in the stochastic gradient

Contour lines = a quadratic loss with poor conditioning of the Hessian. Path (red) followed by SGD (left) and by momentum (right).
Momentum
• New variable v (velocity): the direction and speed at which the parameters move, an exponentially decaying average of the negative gradient

Algorithm
• Require: learning rate α, initial parameter θ, momentum parameter λ, initial velocity v
• Update rule:
  • compute gradient estimate: g ← (1/m) Σ_i ∇_θ L(f(x^(i);θ), y^(i))
  • compute velocity update: v ← λv − αg
  • apply update: θ ← θ + v
• Typical values: λ = 0.5, 0.9, 0.99 (in [0,1))
• The size of the step depends on how large and how aligned a sequence of gradients is
• Read the physical analogy in the Deep Learning book (Goodfellow et al.)
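A sketch of the momentum update, reusing the placeholder `loss_grad` and an iterable of mini-batches (e.g. produced by `iterate_minibatches` above):

```python
import numpy as np

def sgd_momentum(theta, loss_grad, minibatches, lr=0.01, momentum=0.9):
    """SGD with momentum: v <- momentum*v - lr*g ; theta <- theta + v."""
    v = np.zeros_like(theta)                       # initial velocity
    for X_batch, y_batch in minibatches:
        g = loss_grad(theta, X_batch, y_batch)     # gradient estimate
        v = momentum * v - lr * g                  # velocity update
        theta = theta + v                          # parameter update
    return theta
```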
Nesterov accelerated gradient (NAG)
• A variant of momentum, where the gradient is evaluated after the current velocity is applied:
  • approximate where the parameters will be on the next time step using the current velocity
  • update the velocity using the gradient at the point where we predict the parameters will be

Algorithm
• Require: learning rate α, initial parameter θ, momentum parameter λ, initial velocity v
• Update:
  • apply interim update: θ̃ ← θ + λv
  • compute gradient (at the interim point): g ← (1/m) Σ_i ∇_θ̃ L(f(x^(i);θ̃), y^(i))
  • compute velocity update: v ← λv − αg
  • apply update: θ ← θ + v
• Interpretation: adds a correction factor to momentum
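A sketch of the Nesterov variant under the same assumptions; note that the gradient is taken at the interim (look-ahead) point:

```python
import numpy as np

def sgd_nesterov(theta, loss_grad, minibatches, lr=0.01, momentum=0.9):
    """Nesterov momentum: evaluate the gradient at the interim point theta + momentum*v."""
    v = np.zeros_like(theta)
    for X_batch, y_batch in minibatches:
        theta_interim = theta + momentum * v            # interim (look-ahead) update
        g = loss_grad(theta_interim, X_batch, y_batch)  # gradient at the interim point
        v = momentum * v - lr * g                       # velocity update
        theta = theta + v                               # apply update
    return theta
```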
Nesterov accelerated gradient (NAG)

(Figure: standard momentum computes the gradient ∇L(w_t) at the current location w_t, while Nesterov computes ∇L(w_t + γv_t) at the location predicted from the velocity alone, w_t + γv_t, before forming the new velocity v_{t+1}.) Slide credit: K. McGuinness
SGD: learning rate
• The learning rate is a crucial parameter for SGD
  • Too large: overshoots the local minimum, the loss increases
  • Too small: makes very slow progress, can get stuck
  • Good learning rate: makes steady progress toward the local minimum
• In practice it is necessary to gradually decrease the learning rate (t = iteration number):
  • step decay (e.g. decay by half every few epochs)
  • exponential decay: α = α_0 e^(−kt)
  • 1/t decay: α = α_0 / (1 + kt)
• Sufficient conditions for convergence: Σ_{t=1}^∞ α_t = ∞ and Σ_{t=1}^∞ α_t² < ∞
• Usually: adapt the learning rate by monitoring learning curves that plot the objective function as a function of time (more of an art than a science!)
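A small sketch of the decay schedules listed above, with α₀ and k as hyperparameters to tune (the function names are my own):

```python
import numpy as np

def step_decay(alpha0, t, drop=0.5, epochs_per_drop=10):
    """Halve the learning rate every `epochs_per_drop` epochs."""
    return alpha0 * drop ** (t // epochs_per_drop)

def exponential_decay(alpha0, t, k=0.01):
    """alpha = alpha0 * exp(-k t)."""
    return alpha0 * np.exp(-k * t)

def inverse_time_decay(alpha0, t, k=0.01):
    """alpha = alpha0 / (1 + k t)."""
    return alpha0 / (1.0 + k * t)
```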
Adaptive learning rates
• The learning rate is one of the hyperparameters that is most difficult to set; it has a significant impact on model performance
• The cost is often sensitive to some directions in parameter space and insensitive to others
  • Momentum/Nesterov mitigate this issue but introduce another hyperparameter
• Solution: use a separate learning rate for each parameter and automatically adapt it through the course of learning
• Algorithms (mini-batch based)
  • AdaGrad
  • RMSProp
  • Adam
  • RMSProp with Nesterov momentum
AdaGrad
• Adapts the learning rate of each parameter based on the sizes of previous updates:
  • scales updates to be larger for parameters that are updated less
  • scales updates to be smaller for parameters that are updated more
• The net effect is greater progress in the more gently sloped directions of parameter space
• Desirable theoretical properties, but empirically (for deep models) it can result in a premature and excessive decrease of the effective learning rate

• Require: learning rate α, initial parameter θ, small constant δ (e.g. 10⁻⁷) for numerical stability
• Update:
  • compute gradient: g ← (1/m) Σ_i ∇_θ L(f(x^(i);θ), y^(i))
  • accumulate squared gradient: r ← r + g ⊙ g (sum of all previous squared gradients)
  • compute update: Δθ ← −(α / (δ + √r)) ⊙ g (updates inversely proportional to the square root of the sum)
  • apply update: θ ← θ + Δθ

Duchi et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR 2011
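A sketch of the AdaGrad accumulation and update; all operations are elementwise on the parameter vector, and `loss_grad` and the mini-batch iterable are the same placeholders as above:

```python
import numpy as np

def adagrad(theta, loss_grad, minibatches, lr=0.01, delta=1e-7):
    """AdaGrad: per-parameter learning rates scaled by accumulated squared gradients."""
    r = np.zeros_like(theta)                          # accumulated squared gradients
    for X_batch, y_batch in minibatches:
        g = loss_grad(theta, X_batch, y_batch)
        r = r + g * g                                 # accumulate squared gradient
        theta = theta - lr * g / (delta + np.sqrt(r)) # smaller steps for frequently updated params
    return theta
```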
Root Mean Square Propagation (RMSProp)
• Modifies AdaGrad to perform better on non-convex surfaces, where AdaGrad's learning rates decay too aggressively
• Changes the gradient accumulation into an exponentially decaying average of the sum of squares of gradients
• Require: learning rate α, initial parameter θ, decay rate ρ, small constant δ (e.g. 10⁻⁷) for numerical stability
• Update:
  • compute gradient: g ← (1/m) Σ_i ∇_θ L(f(x^(i);θ), y^(i))
  • accumulate squared gradient: r ← ρr + (1−ρ) g ⊙ g
  • compute update: Δθ ← −(α / √(δ + r)) ⊙ g
  • apply update: θ ← θ + Δθ

• It can be combined with Nesterov momentum

Geoff Hinton, unpublished
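A parallel sketch of the RMSProp update, with the same placeholders as the AdaGrad sketch:

```python
import numpy as np

def rmsprop(theta, loss_grad, minibatches, lr=0.001, rho=0.9, delta=1e-7):
    """RMSProp: exponentially decaying average of squared gradients."""
    r = np.zeros_like(theta)
    for X_batch, y_batch in minibatches:
        g = loss_grad(theta, X_batch, y_batch)
        r = rho * r + (1 - rho) * g * g               # decaying average of squared gradients
        theta = theta - lr * g / np.sqrt(delta + r)   # per-parameter adaptive step
    return theta
```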
ADAptive Moments (Adam)
• Combination of RMSProp and momentum:
  • keeps a decaying average of both the first-order moment of the gradient (momentum) and the second-order moment (RMSProp)
  • includes bias corrections (of the first and second moments) to account for their initialization at the origin

• Update (at step t):
  • compute gradient: g ← (1/m) Σ_i ∇_θ L(f(x^(i);θ), y^(i))
  • update biased first moment estimate: s ← ρ_1 s + (1−ρ_1) g
  • update biased second moment estimate: r ← ρ_2 r + (1−ρ_2) g ⊙ g
  • correct biases: ŝ ← s / (1−ρ_1^t), r̂ ← r / (1−ρ_2^t)
  • compute update (operations applied elementwise): Δθ ← −α ŝ / (δ + √r̂)
  • apply update: θ ← θ + Δθ
• Default values: δ = 10⁻⁸, ρ_1 = 0.9, ρ_2 = 0.999

Kingma et al. Adam: a Method for Stochastic Optimization. ICLR 2015
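A sketch of the Adam loop with the default values above (same placeholder `loss_grad` and mini-batch iterable):

```python
import numpy as np

def adam(theta, loss_grad, minibatches, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """Adam: bias-corrected decaying averages of the gradient and its square."""
    s = np.zeros_like(theta)                          # first moment
    r = np.zeros_like(theta)                          # second moment
    t = 0
    for X_batch, y_batch in minibatches:
        t += 1
        g = loss_grad(theta, X_batch, y_batch)
        s = rho1 * s + (1 - rho1) * g                 # biased first moment
        r = rho2 * r + (1 - rho2) * g * g             # biased second moment
        s_hat = s / (1 - rho1 ** t)                   # bias corrections
        r_hat = r / (1 - rho2 ** t)
        theta = theta - lr * s_hat / (np.sqrt(r_hat) + delta)
    return theta
```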
Example: test function

Beale's function. Image credit: Alec Radford.
Example: saddle point
Image credit: Alec Radford.
Second order optimization

(Figure: first order vs. second order approximation)
Second order optimization
• Second order Taylor expansion:

  J(θ) ≈ J(θ_0) + (θ−θ_0)^T ∇_θ J(θ_0) + (1/2)(θ−θ_0)^T H (θ−θ_0)

• Solving for the critical point we obtain the Newton parameter update:

  θ* = θ_0 − H⁻¹ ∇_θ J(θ_0)

• Problem: the Hessian has O(N²) elements, and inverting H is O(N³) (N = number of parameters, typically millions)
• Alternatives:
  • Quasi-Newton methods (BFGS, Broyden-Fletcher-Goldfarb-Shanno): instead of inverting the Hessian, approximate the inverse Hessian with rank-1 updates over time, O(N²) each
  • L-BFGS (limited-memory BFGS): does not form/store the full inverse Hessian
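A tiny sketch of a single Newton step for a small parameter vector; in practice N is far too large for this to be feasible:

```python
import numpy as np

def newton_step(theta0, grad, hessian):
    """theta* = theta0 - H^{-1} grad; solve the linear system instead of inverting H."""
    return theta0 - np.linalg.solve(hessian, grad)
```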
Parameter initialization
• Weights
  • Can't initialize the weights to 0 (gradients would be 0)
  • Can't initialize all weights to the same value (all hidden units in a layer would always behave the same; need to break the symmetry)
  • Small random numbers, e.g. from a uniform or Gaussian distribution such as N(0, 10⁻²)
    • if the weights start too small, the signal shrinks as it passes through each layer until it is too tiny to be useful
  • Calibrating the variances with 1/sqrt(n) (Xavier initialization)
    • each neuron: w = randn(n) / sqrt(n), with n inputs
  • He initialization (for ReLU activations): sqrt(2/n)
    • each neuron: w = randn(n) * sqrt(2.0/n), with n inputs
• Biases
  • initialize all to 0 (except the output unit for skewed distributions; 0.01 to avoid saturating ReLUs)
• Alternative: initialize using machine learning, i.e. parameters learned by an unsupervised model trained on the same inputs, or trained on an unrelated task
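A sketch of the Xavier and He schemes for one fully connected layer; the layer sizes below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    """Xavier initialization: weight variance scaled by 1/n_in."""
    return rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)

def he_init(n_in, n_out):
    """He initialization (for ReLU activations): weight variance scaled by 2/n_in."""
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

W1 = he_init(784, 256)          # weights: small random numbers, symmetry broken
b1 = np.zeros(256)              # biases: initialize to 0
```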
Batch normalization
• As learning progresses, the distribution of the layer inputs changes due to the parameter updates (internal covariate shift)
• This can result in most inputs being in the non-linear regime of the activation function, slowing down learning
• Batch normalization is a technique to reduce this effect
• Explicitly force the layer activations to have zero mean and unit variance w.r.t. running batch estimates
• Adds a learnable scale and bias term to allow the network to still use the nonlinearity

Ioffe and Szegedy, 2015. "Batch normalization: accelerating deep network training by reducing internal covariate shift"
(Figure: batch norm inserted before each nonlinearity: FC / Conv → Batch norm → ReLU → FC / Conv → Batch norm → ReLU)
Batch normalization
• Can be applied to any input or hidden layer
• For a mini-batch X of N activations of a layer with D dimensions:
  1. compute the empirical mean and variance for each dimension
  2. normalize: x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)])

• Note: normalization can reduce the expressive power of the network (e.g. normalizing the inputs of a sigmoid would constrain them to its linear regime)
• So let the network learn the identity: scale and shift, y^(k) = γ^(k) x̂^(k) + β^(k)
• To recover the identity mapping the network can learn γ^(k) = √(Var[x^(k)]) and β^(k) = E[x^(k)]
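A NumPy sketch of the batch norm forward pass for a mini-batch of shape (N, D), including the running statistics used at test time (described on the next slides); all names are my own choices:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      momentum=0.9, eps=1e-5, training=True):
    """Normalize each dimension of x (shape N x D), then apply a learnable scale and shift."""
    if training:
        mean = x.mean(axis=0)                       # empirical mean per dimension
        var = x.var(axis=0)                         # empirical variance per dimension
        # update the running estimates used at test time
        running_mean = momentum * running_mean + (1 - momentum) * mean
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mean, var = running_mean, running_var       # fixed statistics at test time
    x_hat = (x - mean) / np.sqrt(var + eps)         # normalize
    y = gamma * x_hat + beta                        # learnable scale and shift
    return y, running_mean, running_var
```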
Batch normalization

1. Improves gradient flow through the network
2. Allows higher learning rates
3. Reduces the strong dependency on initialization
4. Reduces the need for regularization
Batch normalization

At test time BN layers function differently:
1. The mean and std are not computed on the batch.
2. Instead, a single fixed empirical mean and std of the activations, computed during training, is used (they can be estimated with running averages).
Summary
• Optimization for NN training is different from pure optimization:
  • GD with mini-batches
  • early stopping
  • non-convex surface, local minima and saddle points
• The learning rate has a significant impact on model performance
• Several extensions to SGD can improve convergence
• Adaptive learning-rate methods are likely to achieve the best results
  • RMSProp, Adam
• Weight initialization: He, w = randn(n) * sqrt(2/n)
• Batch normalization to reduce the internal covariate shift
Bibliography
• Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
• Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2015). The loss surfaces of multilayer networks. In AISTATS.
• Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941.
• Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
• Goodfellow, I. J., Vinyals, O., and Saxe, A. M. (2015). Qualitatively characterizing neural network optimization problems. In International Conference on Learning Representations.
• Hinton, G. (2012). Neural networks for machine learning. Coursera, video lectures.
• Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks, 1(4):295–307.
• Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
• Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations.