FAST, SCALABLE DEEP LEARNING ENSEMBLES

Alan Mosca


Page 1

FAST, SCALABLE DEEP LEARNING ENSEMBLES

Alan Mosca

Page 2

WHO AM I?

• Machine Learning Researcher at Birkbeck, UoL

• Previous life as a Quant

• Stream Computing platforms at Sendence

• Co-founder Malet Labs

• Consulting on Deep Learning projects

Page 3

WHAT YOU’RE GOING TO SEE

• Background on DL, research and scaling

• Background on Ensembles

• DL Ensembles

• “Second-gen” DL Ensembles

Page 4

DEEP LEARNING

Page 5

DEEP LEARNING TODAY

• Training is very slow

• Models are resource and data hungry

• Exceptional results (if done right)

Page 6

SOME THINGS RESEARCHERS FOCUS ON

• New algorithms

• Better (faster) optimization methods

• Techniques and “tricks” to improve and speed up optimization

• Generating artificial training data through transformations and noise

• Studying the behaviour of models with introspection techniques (“why does it work?”)

Page 7

TOOLS FOR SPEED

• Massive clusters

• GPUs (mostly NVIDIA)

• Other dedicated hardware (TPUs, ASICs, etc.)

Page 8

DEEP LEARNING ON THE GPU

• CUDA is the main underlying tool

• CUDA wrappers: Theano, Keras, TensorFlow, pyLearn2, Torch, Caffe

Page 9

SCALE

Buy many GPUs, then use them to run a massive network with better results

• Expensive coordination effort

• Very tight coupling between GPUs (high bandwidth)

• Hard to get right and verify

• Doesn’t scale easily across multiple machines

Page 10

MULTI-GPU

• Some libraries/systems support running on multi-GPU platforms (Caffe, TensorFlow, DIGITS)

• Bandwidth between machines is usually limited: use InfiniBand!

• NVIDIA developed GPUDirect to use RDMA over InfiniBand directly between GPUs

Page 11

OR: SCALE WITH MULTIPLE MODELS

• Run many diverse networks as base learners

• Create an aggregate model of all the networks

• Loose coupling: scales easily across machines

• In many cases it’s the only way to go above a certain accuracy threshold

• Requires some thought and design

Page 12

WHAT’S AN ENSEMBLE?

Page 13

ENSEMBLES

• Aggregation of ML methods that solve the same problem: same input type, same output type

• Usually studied from the point of view of learning theory

• Treat the base learner as a “black box”

Page 14

AN ENSEMBLE

[Diagram: four base classifiers C1–C4 receive the same image and predict "dinosaur", "dinosaur", "hat", "banana"; an aggregator combines the votes and the ensemble outputs "dinosaur".]
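To make the aggregation concrete, here is a minimal majority-vote aggregator in Python, using the labels from the diagram above (the function name is ours, not from the talk):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most ensemble members."""
    return Counter(predictions).most_common(1)[0][0]

# The four base classifiers' votes from the diagram:
print(majority_vote(["dinosaur", "dinosaur", "hat", "banana"]))  # dinosaur
```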

Page 15

ENSEMBLES: CAN REMOVE MEMBERS

[Diagram: the same ensemble with one member removed; the remaining votes still aggregate to "dinosaur": members can be dropped without breaking the model.]

Page 16

EXAMPLE

Page 17

WEAK CLASSIFIERS CREATE STRONG CLASSIFIERS

• Weak classifier: any classifier that does better than random guessing

• Diverse weak classifiers can be combined to reach arbitrarily good classification (strong classifier), given enough training data and enough diversity

Page 18

EXAMPLE

Page 19

NOT AN ENSEMBLE

Page 20

NOT AN ENSEMBLE

[Diagram: separate networks for obstacles, people, lidar and signs feed an LSTM that produces driving decisions; each part solves a different sub-problem, so this is a pipeline, not an ensemble.]

Page 21

ENSEMBLE METHODS

Page 22

COMPONENTS OF AN ENSEMBLE METHOD

• Training set generation

• Diversity generation

• Aggregation

Page 23

TRAINING AN ENSEMBLE

Training set

T1 T2 Tn

C1 C2 Cn

JOIN


Page 25

BAGGING

• Resample the training set randomly to generate N new training sets (Bootstrapping)

• Train a new classifier for each training set (for DL we use a new random initialization)

• Aggregate by Averaging or Majority Voting
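A minimal sketch of bagging with Keras-style models; make_model() is a hypothetical factory for the base network (any small CNN would do), not code from the talk:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

def make_model(input_dim=3072, n_classes=10):
    # Stand-in base learner; in practice this would be a small CNN.
    model = Sequential()
    model.add(Dense(128, activation='relu', input_dim=input_dim))
    model.add(Dense(n_classes, activation='softmax'))
    model.compile(optimizer='sgd', loss='categorical_crossentropy')
    return model

def train_bagging(x_train, y_train, n_members=3, epochs=50):
    # y_train is assumed one-hot encoded.
    members = []
    n = len(x_train)
    for _ in range(n_members):
        idx = np.random.choice(n, size=n, replace=True)  # bootstrap resample
        model = make_model()                             # fresh random initialization
        model.fit(x_train[idx], y_train[idx], epochs=epochs, verbose=0)
        members.append(model)
    return members

def predict_bagging(members, x):
    # Aggregate by averaging the members' softmax outputs.
    return np.mean([m.predict(x) for m in members], axis=0)
```

Because the members are independent, each iteration of the loop can run on a different GPU or machine: exactly the loose coupling the previous slide advertises.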

Page 26

BOOSTING

• Train the first member on the entire training set

• Repeat at each round:

  • Identify the errors on the training set made by the previous classifier

  • Change the resample weights to emphasize the "harder" examples

  • Generate a more specialized classifier with the new training set

  • Generate a weight for aggregation (the importance factor)

• Weighted-averaging aggregator
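The heart of the loop is the reweighting step. A minimal AdaBoost.M1-style update in numpy (the names are ours; the talk's tooling, Toupee, has a full implementation):

```python
import numpy as np

def boosting_round_update(weights, mispredicted):
    """One AdaBoost.M1-style reweighting step.

    weights      -- current resample weights over the training examples
    mispredicted -- boolean array, True where the previous member erred
    Assumes 0 < err < 0.5 (better than random, not perfect).
    """
    err = weights[mispredicted].sum() / weights.sum()
    alpha = np.log((1.0 - err) / err)               # the member's importance factor
    new_w = weights * np.exp(alpha * mispredicted)  # emphasize the "harder" examples
    return new_w / new_w.sum(), alpha
```

The returned alpha feeds the weighted-averaging aggregator, so members that were accurate on their round count for more.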

Page 27

BOOSTING VS BAGGING

• The weighted resample is more “directed” towards improving accuracy

• More sensitive to class imbalance

• Sequential (boosting) vs parallel (bagging) training

Page 28

DEEP LEARNING ENSEMBLES

• A nice way to win Kaggle competitions (and Netflix prize)

• The Ensemble part always seems to be an afterthought

• The Deep Learning model is still treated as a black box

• Training cost: slow × N

Page 29

EXAMPLE

• Small CNN on CIFAR-10, 50 epochs: ~84%

• Bagging 3× of the same network: ~86%

• Trained in 3× the time of a single network

Page 30

SECOND-GENERATIONDEEP LEARNING ENSEMBLES

Page 31

SECOND-GEN ENSEMBLES

• Stop treating the model as a “black box”

• Use properties of the model to improve the ensemble

• Specific to a DL algorithm, but gives faster training and better results

Page 32

SECOND-GEN: AN EXAMPLE

• Based on AdaBoost

• Uses properties of CNNs

• Faster training at each round

Page 33

CNN REFRESHER

• A particular type of ANN that uses the convolution operation, along with weight sharing, to pass a common function (a feature map) over an input with spatial locality.

Input (4×4):

0 0 0 0
0 1 1 0
0 1 1 0
0 0 0 0

Feature map (3×3):

0 1 0
0 1 1
0 0 0

Output (2×2):

2 1
3 2

[Figure: a CNN classifying an image, with softmax output 0.60 dinosaur, 0.18 cup, 0.12 grass, 0.10 hat]
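A quick numpy check of the example above: sliding the 3×3 feature map over the 4×4 input (cross-correlation, as most DL libraries implement "convolution") reproduces the 2×2 output.

```python
import numpy as np

x = np.array([[0, 0, 0, 0],
              [0, 1, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 0]])
k = np.array([[0, 1, 0],
              [0, 1, 1],
              [0, 0, 0]])

# Dot each 3x3 window of the input with the feature map.
out = np.array([[(x[i:i+3, j:j+3] * k).sum() for j in range(2)]
                for i in range(2)])
print(out)  # [[2 1]
            #  [3 2]]
```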

Page 34

TRANSFER OF LEARNING IN CNNS

[Figure: two networks trained on different tasks, one outputting 0.72 cat, 0.18 telephone, 0.10 car and the other 0.60 dinosaur, 0.18 cup, 0.12 grass, 0.10 hat; the early convolutional layers learn features that transfer between tasks.]

Page 35

DEEP INCREMENTAL BOOSTING

1. The first round of boosting runs as a normal network on the full training set

2. At each round, a new dataset is sampled, in the same fashion as AdaBoost

Page 36

DEEP INCREMENTAL BOOSTING

3. The first N layers are copied from one round to the next, reducing the amount of time needed to learn

[Figure: the new round's network, seeded with the copied layers, refines its predictions, e.g. from 0.62 cat to 0.72 cat.]

Page 37

DEEP INCREMENTAL BOOSTING

4. The network is extended by a single layer, to give it capacity to learn the “corrections” in the new training set

[Figure: the first N layers are carried over and a new layer is appended, sharpening the predictions, e.g. from 0.72 cat to 0.77 cat.]
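A hedged Keras-style sketch of steps 3–4; grow_network and the layer sizes are our illustration, not the talk's code (the real implementation lives in Toupee):

```python
from keras.models import Sequential
from keras.layers import Dense

def grow_network(prev_model, n_copy, n_classes=10):
    """Build the next boosting round's network: copy the first n_copy
    trained layers from the previous round, then append one fresh layer
    and a fresh output layer."""
    model = Sequential()
    for old in prev_model.layers[:n_copy]:
        new = type(old).from_config(old.get_config())
        model.add(new)
        new.set_weights(old.get_weights())  # carry the learned weights over
    model.add(Dense(128, activation='relu'))  # new capacity for the "corrections"
    model.add(Dense(n_classes, activation='softmax'))
    model.compile(optimizer='sgd', loss='categorical_crossentropy')
    return model
```

Because the copied layers start from already-useful weights, each round needs far fewer epochs than training from scratch, which is where the 1.3× figure on the next slide comes from.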

Page 38

RESULTS

• Same CNN on CIFAR-10: ~84%

• 3× DIB: ~89% (bagging was ~86%)

• Trained in 1.3× the time of a single network

• Still have to run 3 networks at test time

Page 39

DISTILLATION

• Train a cumbersome model (for example an Ensemble)

• Train a smaller model on the output of the large one

• Use the small one as a cheaper approximation

Page 40

DISTILLATION OF ENSEMBLES

• Train a single network with the same shape as one of the ensemble members, to mimic the entire ensemble

• If done with a large enough ensemble, it can also serve as a regularization method
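A minimal sketch, reusing the hypothetical make_model() and members from the bagging example above; Hinton-style distillation also softens the targets with a temperature, which we omit here:

```python
import numpy as np

def distil(members, x_train, epochs=50):
    """Train one member-sized network to mimic the whole ensemble."""
    # Soft targets: the ensemble's averaged softmax output on the training set.
    soft_targets = np.mean([m.predict(x_train) for m in members], axis=0)
    student = make_model()  # same shape as a single ensemble member
    student.fit(x_train, soft_targets, epochs=epochs, verbose=0)
    return student
```

At test time only the student runs, which is what collapses the GPU memory numbers two slides below back to the single-network figure.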

Page 41

QUICK EXPERIMENT

• Very short training (~10 mins) for each network

• Overfit MNIST

• Underfit CIFAR-10

• Distill AdaBoost and Bagging (10 members)

Page 42

RESULTS (SMALL NETS)

                    MNIST    CIFAR-10
Single              0.66%    26.77%
AdaBoost            0.63%    24.61%
AdaBoost Distilled  0.52%    24.05%
Bagging             0.59%    22.30%
Bagging Distilled   0.55%    23.65%

(Test error; lower is better.)

Page 43

GPU MEMORY USAGE

            Single   Bagging   Distilled
MNIST       130M     1.4G      130M
CIFAR-10    450M     4.5G      450M

Page 44

APPROXIMATE ENSEMBLES: RESIDUAL NETWORKS

Residual Networks can also be thought of as an approximation of an ensemble, if we unroll the entire tree of paths. We can even remove some paths!


Look familiar?
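A minimal residual block in the Keras 2 functional API makes the unrolling concrete (the sizes are arbitrary):

```python
from keras.layers import Input, Dense, add
from keras.models import Model

def residual_block(x, width=64):
    f = Dense(width, activation='relu')(x)
    return add([x, f])  # y = x + F(x): identity path plus residual path

inp = Input(shape=(64,))
out = residual_block(residual_block(inp))
model = Model(inp, out)
# Unrolled: out = x + f1(x) + f2(x + f1(x)). Every block doubles the number
# of input-to-output paths, and removing a path still leaves a working model,
# just as removing a member leaves a working ensemble.
```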

Page 46

BOOSTED RESIDUAL NETWORKS

[Diagram: DIB's incrementally grown plain network vs. BRN's incrementally grown Residual Network]

Build a Residual Network incrementally and keep all the intermediate networks to make an ensemble

Page 47

BRN: RESULTS

[Chart: accuracy (%) over boosting rounds 1–10 for DIB and BRN, on a scale from 99.32% to 99.5%.]

Page 48

BRN: RESULTS

• ~9.5% error on CIFAR-10 with no dataset augmentation, in less than 8 hours and a combined total of 190 epochs (100 + 9 × 10)

• ~31% error on CIFAR-100 in the same time

Page 49

TAKEAWAYS

• Deep Learning isn’t just about making bigger networks, we can also use many networks

• Ensembles don’t have to treat the members as black boxes

• There are methods to train DL ensembles quickly and to use them on small hardware

• This will emerge as a standard methodology as hardware becomes faster and memory grows

Page 50

RESOURCES AND TOOLS

Page 51

SOFTWARE

• Toupee (github.com/nitbix/toupee): library for DL ensembles and toolkit for running DL (and ensemble) experiments

• Keras fork (github.com/nitbix/keras)

• Contributions welcome! (and encouraged, especially documentation)

Page 52

MY RECENT PAPERS

• "Deep Incremental Boosting", A. Mosca & G. D. Magoulas, GCAI 2016

• "Boosted Residual Networks", A. Mosca & G. D. Magoulas, submitted to ICLR 2017

• "Regularizing Deep Learning Ensembles by Distillation", A. Mosca & G. D. Magoulas, CIMA Workshop at ECAI 2016

• "Fast Training of Convolutional Networks with Weight-wise Adaptive Learning Rates", A. Mosca & G. D. Magoulas, submitted to ESANN 2017

Page 53

OTHER LEARNING RESOURCES

• Kaggle ensembling guide: http://mlwave.com/kaggle-ensembling-guide/

• Ensemble methods and Deep Learning: https://fenix.tecnico.ulisboa.pt/downloadFile/563568428719666/licao_21.pdf