Deep Learning and Optimization Methods
Stefan Kühn
Join me on XING
Data Science Meetup Hamburg - July 27th, 2017
Contents
1 Training Deep Networks
2 Training and Learning
3 The Toolbox of Optimization Methods
4 Takeaways
Deep Learning
Neural Networks - Universal Approximation Theorem
A 1-hidden-layer feed-forward neural net with a finite number of parameters can approximate any continuous function on compact subsets of R^n.

Questions:
Why do we need deep learning at all?
- it is only a theoretical result
- approximation by piecewise constant functions (not what you might want for classification/regression)
Why are deep nets harder to train than shallow nets?
- More parameters to be learned by training?
- More hyperparameters to be set before training?
- Numerical issues?
disclaimer — ideas stolen from Martens, Sutskever, Bengio et al. and many more —
Example: RNNs
Recurrent Neural Nets
Extremely powerful for modeling sequential data, e.g. time series, but extremely hard to train (somewhat less hard for LSTMs/GRUs).

Main Advantages:
Qualitatively: Flexible and rich model class
Practically: Gradients easily computed by Backpropagation Through Time (BPTT)

Main Problems:
Qualitatively: Learning long-term dependencies
Practically: Gradient-based methods struggle when the separation between input and target output is large
Example: RNNs
Recurrent Neural Nets
Highly volatile relationship between parameters and hidden states

Indicators:
Vanishing/exploding gradients (see the numerical sketch below)
Internal covariate shift

Remedies:
ReLU
'Careful' initialization
Small stepsizes
(Recurrent) Batch Normalization
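A minimal numerical sketch (not from the slides) of the vanishing/exploding-gradient effect: during backpropagation through time the gradient is multiplied by the recurrent Jacobian at every step, so its norm shrinks or grows geometrically with the temporal distance between input and target. The orthogonal weight matrix, the scaling factors and the sequence length are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
h_dim, steps = 50, 100                             # hidden size and sequence length (illustrative)

for scale in (0.8, 1.2):                           # recurrent weights scaled below / above 1
    Q, _ = np.linalg.qr(rng.standard_normal((h_dim, h_dim)))
    W = scale * Q                                  # all singular values equal to `scale`
    grad = rng.standard_normal(h_dim)              # gradient arriving at the last time step
    for _ in range(steps):                         # backprop through time, nonlinearity omitted
        grad = W.T @ grad
    print(f"scale={scale}: gradient norm after {steps} steps = {np.linalg.norm(grad):.2e}")
```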
Example: RNNs
Recurrent Neural Nets and LSTM
Hochreiter/Schmidhuber proposed changing the RNN architecture by adding Long Short-Term Memory units.

Vanishing/exploding gradients?
Fixed linear dynamics, no longer problematic.

Any open questions?
Gradient-based training works better with LSTMs.
LSTMs can compensate for one deficiency of gradient-based learning, but is this the only one?
Most problems are related to specific numerical issues.
2 Training and Learning
Trade-offs between Optimization and Learning
Computational complexity becomes the limiting factor when one envisions large amounts of training data. [Bottou, Bousquet]

Underlying Idea
Approximate optimization algorithms might be sufficient for learning purposes. [Bottou, Bousquet]

Implications:
Small-scale: Trade-off between approximation error and estimation error
Large-scale: Computational complexity dominates

Long story short:
The best optimization methods might not be the best learning methods!
Empirical results
Empirical evidence that SGD is a better learner than it is an optimizer.
RCV1, text classification, see e.g. Bottou, Stochastic Gradient Descent Tricks
3 The Toolbox of Optimization Methods
Gradient Descent
Minimize a given function f:

min f(x),  x ∈ R^n

Direction of Steepest Descent, the negative gradient:

d = −∇f(x)

Update in step k:

x_{k+1} = x_k − α ∇f(x_k)

Properties:
always a descent direction, no test needed
locally optimal, globally convergent
works with inexact line search, e.g. Armijo's rule
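A minimal sketch of the update above on a toy quadratic; the objective, the fixed stepsize and the stopping tolerance are illustrative assumptions (a line search such as Armijo's rule would choose α adaptively).

```python
import numpy as np

# toy objective: f(x) = 0.5 * x^T A x - b^T x, with gradient A x - b
A = np.array([[3.0, 0.2], [0.2, 1.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

x = np.zeros(2)
alpha = 0.1                           # fixed stepsize (illustrative)
for k in range(200):
    g = grad(x)
    if np.linalg.norm(g) < 1e-8:      # stop when the gradient is numerically zero
        break
    x = x - alpha * g                 # x_{k+1} = x_k - alpha * grad f(x_k)

print(k, x, f(x))
```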
Stochastic Gradient Descent
Setting:

f(x) := Σ_i f_i(x),  ∇f(x) := Σ_i ∇f_i(x),  i = 1, ..., m  (m = number of training examples)

Choose i and update in step k:

x_{k+1} = x_k − α ∇f_i(x_k)
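A minimal sketch of the stochastic update for a least-squares objective of the form f(x) = Σ_i f_i(x); the synthetic data, stepsize and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 1000, 5                                   # m training examples, n parameters
A = rng.standard_normal((m, n))                  # one row a_i per example
y = A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)

def grad_i(x, i):
    """Gradient of the single-example loss f_i(x) = 0.5 * (a_i^T x - y_i)^2."""
    return (A[i] @ x - y[i]) * A[i]

x, alpha = np.zeros(n), 0.01
for k in range(10_000):
    i = rng.integers(m)                          # pick one training example
    x = x - alpha * grad_i(x, i)                 # x_{k+1} = x_k - alpha * grad f_i(x_k)

print(np.linalg.norm(A @ x - y) / np.sqrt(m))    # root-mean-square residual
```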
Shortcomings of Gradient Descent
local: only local information used
especially: no curvature information used
greedy: prefers high-curvature directions
scale invariant: no
James Martens, Deep learning via Hessian-free optimization
Momentum
Update in step k
z_{k+1} = β z_k + ∇f(x_k)
x_{k+1} = x_k − α z_{k+1}

Properties for a quadratic convex objective:
the condition number κ effectively improves by a square root
stepsizes can be twice as long
rate of convergence (√κ − 1)/(√κ + 1) instead of (κ − 1)/(κ + 1)
can diverge if β is not properly chosen/adapted

Gabriel Goh, Why momentum really works
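The two-line update above, written out as a minimal sketch on an ill-conditioned toy quadratic; α and β are illustrative values (β ≈ 0.9 is a common heuristic), not a recommendation from the slides.

```python
import numpy as np

A = np.array([[10.0, 0.0], [0.0, 1.0]])    # ill-conditioned quadratic, kappa = 10
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b

x, z = np.zeros(2), np.zeros(2)
alpha, beta = 0.1, 0.9                     # beta must be chosen/adapted with care, otherwise divergence
for k in range(300):
    z = beta * z + grad(x)                 # z_{k+1} = beta * z_k + grad f(x_k)
    x = x - alpha * z                      # x_{k+1} = x_k - alpha * z_{k+1}

print(x, np.linalg.norm(grad(x)))
```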
Momentum
D E M O
Adam
Properties:
combines several clever tricks (from Momentum, RMSprop, AdaGrad)
has some similarities to Trust Region methods
empirically proven - best in class (personal opinion)
Kingma, Ba Adam: A method for stochastic optimization
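A minimal sketch of the Adam update from Kingma & Ba on a toy quadratic; β1, β2 and ε are the defaults from the paper, while the stepsize, the objective and the iteration count are illustrative assumptions (in practice the gradient would come from a mini-batch).

```python
import numpy as np

A = np.array([[10.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b                          # stands in for a (mini-batch) gradient

x = np.zeros(2)
m, v = np.zeros(2), np.zeros(2)                     # first and second moment estimates
alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8   # beta1, beta2, eps: paper defaults
for t in range(1, 2001):
    g = grad(x)
    m = beta1 * m + (1 - beta1) * g                 # momentum-like first moment
    v = beta2 * v + (1 - beta2) * g**2              # RMSprop/AdaGrad-like second moment
    m_hat = m / (1 - beta1**t)                      # bias correction
    v_hat = v / (1 - beta2**t)
    x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)  # per-coordinate scaled step

print(x, np.linalg.norm(grad(x)))
```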
SGD, Momentum and more
D E M O
L-BFGS and Nonlinear CG
Observations so far:
The better the method, the more parameters to tune.
All better methods try to incorporate curvature information.
Why not do so directly?

L-BFGS
Quasi-Newton method, builds an approximation of the (inverse) Hessian and scales the gradient accordingly.

Nonlinear CG
Informally speaking, Nonlinear CG tries to solve a quadratic approximation of the function.

No surprise: they also work with minibatches.
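In practice one rarely implements these methods by hand; a minimal sketch using SciPy's L-BFGS implementation on the Rosenbrock test function, which only stands in for a training loss here (Nonlinear CG is available the same way via method="CG"):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.zeros(10)
res = minimize(rosen, x0, jac=rosen_der,     # analytic gradient, as with backprop
               method="L-BFGS-B",            # limited-memory quasi-Newton
               options={"maxiter": 500})
print(res.fun, res.nit)                      # final value and number of iterations

# Nonlinear CG on the same problem:
res_cg = minimize(rosen, x0, jac=rosen_der, method="CG")
print(res_cg.fun, res_cg.nit)
```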
Empirical results
Empirical evidence for better optimizers being better learners.
MNIST, handwritten digit recognition, from Ng et al., On Optimization Methods for Deep Learning
Truncated Newton: Hessian-Free Optimization
Main ideas:
Approximate not the Hessian H itself, but the matrix-vector product Hd.
Use finite differences instead of the exact Hessian.
Use damping.
Use the linear CG method for solving the quadratic approximation.
Use a clever mini-batch strategy for large data sets.
(A sketch of the finite-difference Hessian-vector product and the CG solve follows below.)
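A minimal sketch (not Martens' full algorithm) of two of these ingredients: the Hessian-vector product Hd approximated by finite differences of the gradient, and linear CG applied to the damped quadratic model. The toy quadratic, the damping value and the iteration limits are illustrative assumptions.

```python
import numpy as np

def hessian_vec(grad, x, d, eps=1e-6):
    """Finite-difference approximation of H d: (grad(x + eps*d) - grad(x)) / eps."""
    return (grad(x + eps * d) - grad(x)) / eps

def cg(apply_H, g, damping=1e-2, iters=50, tol=1e-10):
    """Linear CG for (H + damping*I) p = -g, i.e. minimize the damped quadratic model."""
    p = np.zeros_like(g)
    r = -g                                   # residual for the starting point p = 0
    d = r.copy()
    for _ in range(iters):
        Hd = apply_H(d) + damping * d
        step = (r @ r) / (d @ Hd)
        p = p + step * d
        r_new = r - step * Hd
        if np.linalg.norm(r_new) < tol:
            break
        d = r_new + (r_new @ r_new) / (r @ r) * d
        r = r_new
    return p

# toy usage: quadratic objective with Hessian A, so the result is one (damped) Newton step
A = np.array([[3.0, 0.2], [0.2, 1.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b
x = np.zeros(2)
p = cg(lambda d: hessian_vec(grad, x, d), grad(x))
print(x + p)
```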
Empirical test on pathological problems
Main results:
The addition problem is known to be effectively impossible for gradient descent; HF solved it.
Basic RNN cells are used, no specialized architectures (LSTMs etc.).

(Martens/Sutskever (2011), Hochreiter/Schmidhuber (1997))
4 Takeaways
Summary
In the long run, the biggest bottleneck will be the sequential parts of an algorithm. That's why the number of iterations needs to be small. SGD and its successors tend to need many more iterations, and they cannot benefit as much from higher parallelism (GPUs).

But whatever you do/prefer/choose:
At least use successors of SGD: Momentum, Adam etc.
Look for generic approaches instead of more and more specialized and manually fine-tuned solutions.
Key aspects:
- Initialization
- Adaptive choice of stepsizes/momentum/...
- Scaling of the gradient
Resources
Overview of Gradient Descent methods
Why momentum really works
Adam - A Method for Stochastic Optimization
Andrew Ng et al. on L-BFGS and CG outperforming SGD
Lecture Slides: Neural Networks for Machine Learning - Hinton et al.
On the importance of initialization and momentum in deep learning
Data-Science-Blog: Summary article in preparation (Stefan Kühn)
The Neural Network Zoo