FAST, SCALABLE DEEP LEARNING ENSEMBLES

Alan Mosca


Page 1

FAST, SCALABLE DEEP LEARNING ENSEMBLES

Alan Mosca

Page 2

WHO AM I?

• Machine Learning Researcher at Birkbeck, UoL

• Previous life as a Quant

• Stream Computing platforms at Sendence

• Co-founder Malet Labs

• Consulting on Deep Learning projects

Page 3

WHAT YOU’RE GOING TO SEE

• Background on DL, research and scaling

• Background on Ensembles

• DL Ensembles

• “Second-gen” DL Ensembles

Page 4

DEEP LEARNING

Page 5

DEEP LEARNING TODAY

• Training is very slow

• Models are resource and data hungry

• Exceptional results (if done right)

Page 6

SOME THINGS RESEARCHERS FOCUS ON

• New algorithms

• Better (faster) optimization methods

• Techniques and “tricks” to improve and speed up optimization

• Generating artificial training data through transformations and noise

• Studying the behaviour of models with introspection techniques (“why does it work?”)

Page 7

TOOLS FOR SPEED

• Massive clusters

• GPUs (mostly NVIDIA)

• Other dedicated hardware (TPUs, ASICs, etc.)

Page 8

DEEP LEARNING ON THE GPU

• CUDA is the main underlying tool

• CUDA wrappers: Theano, Keras, TensorFlow, pyLearn2, Torch, Caffe

Page 9

SCALE

Buy many GPUs, then use them to run a massive network with better results

• Expensive coordination effort

• Very tight coupling between GPUs (high bandwidth)

• Hard to get right and verify

• Doesn’t scale easily across multiple machines

Page 10

MULTI-GPU

• Some libraries/systems support running on multi-GPU platforms (Caffe, TensorFlow, DIGITS)

• Bandwidth between machines is usually limited: use InfiniBand!

• NVIDIA developed GPUDirect to use RDMA over InfiniBand directly between GPUs

Page 11

OR: SCALE WITH MULTIPLE MODELS

• Run many diverse networks as base learners

• Create an aggregate model of all the networks

• Loose coupling: scales easily across machines

• In many cases it’s the only way to go above a certain accuracy threshold

• Requires some thought and design

Page 12

WHAT’S AN ENSEMBLE?

Page 13

ENSEMBLES

• Aggregation of ML methods that solve the same problem: same input type, same output type

• Usually studied from the point of view of learning theory

• Treat the base learner as a “black box”

Page 14

AN ENSEMBLE

[Diagram: four base classifiers C1–C4 receive the same image and predict "dinosaur", "dinosaur", "hat", "banana"; an aggregator combines the votes and the ensemble outputs "dinosaur".]
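To make the aggregation concrete, here is a minimal majority-vote aggregator in Python, using the labels from the diagram above (the function name is ours, not from the talk):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most ensemble members."""
    return Counter(predictions).most_common(1)[0][0]

# The four base classifiers' votes from the diagram:
print(majority_vote(["dinosaur", "dinosaur", "hat", "banana"]))  # dinosaur
```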

Page 15

ENSEMBLES: CAN REMOVE MEMBERS

[Diagram: the same ensemble with one member removed; the remaining votes still aggregate to "dinosaur": members can be dropped without breaking the model.]

Page 16

EXAMPLE

Page 17

WEAK CLASSIFIERS CREATE STRONG CLASSIFIERS

• Weak classifier: any classifier that does better than random guessing

• Diverse weak classifiers can be combined to reach arbitrarily good classification (strong classifier), given enough training data and enough diversity

Page 18

EXAMPLE

Page 19

NOT AN ENSEMBLE

Page 20

NOT AN ENSEMBLE

[Diagram: separate networks for obstacles, people, lidar and signs feed an LSTM that produces driving decisions; each part solves a different sub-problem, so this is a pipeline, not an ensemble.]

Page 21

ENSEMBLE METHODS

Page 22

COMPONENTS OF AN ENSEMBLE METHOD

• Training set generation

• Diversity generation

• Aggregation

Page 23

TRAINING AN ENSEMBLE

Training set

T1 T2 Tn

C1 C2 Cn

JOIN


Page 25

BAGGING

• Resample the training set randomly to generate N new training sets (Bootstrapping)

• Train a new classifier for each training set (for DL we use a new random initialization)

• Aggregate by Averaging or Majority Voting
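A minimal sketch of bagging with Keras-style models; make_model() is a hypothetical factory for the base network (any small CNN would do), not code from the talk:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

def make_model(input_dim=3072, n_classes=10):
    # Stand-in base learner; in practice this would be a small CNN.
    model = Sequential()
    model.add(Dense(128, activation='relu', input_dim=input_dim))
    model.add(Dense(n_classes, activation='softmax'))
    model.compile(optimizer='sgd', loss='categorical_crossentropy')
    return model

def train_bagging(x_train, y_train, n_members=3, epochs=50):
    # y_train is assumed one-hot encoded.
    members = []
    n = len(x_train)
    for _ in range(n_members):
        idx = np.random.choice(n, size=n, replace=True)  # bootstrap resample
        model = make_model()                             # fresh random initialization
        model.fit(x_train[idx], y_train[idx], epochs=epochs, verbose=0)
        members.append(model)
    return members

def predict_bagging(members, x):
    # Aggregate by averaging the members' softmax outputs.
    return np.mean([m.predict(x) for m in members], axis=0)
```

Because the members are independent, each iteration of the loop can run on a different GPU or machine: exactly the loose coupling the previous slide advertises.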

Page 26

BOOSTING

• Train the first member on the entire training set

• Repeat at each round:

  • Identify the errors on the training set made by the previous classifier

  • Change the resample weights to emphasize the "harder" examples

  • Generate a more specialized classifier with the new training set

  • Generate a weight for aggregation (the importance factor)

• Weighted-averaging aggregator
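The heart of the loop is the reweighting step. A minimal AdaBoost.M1-style update in numpy (the names are ours; the talk's tooling, Toupee, has a full implementation):

```python
import numpy as np

def boosting_round_update(weights, mispredicted):
    """One AdaBoost.M1-style reweighting step.

    weights      -- current resample weights over the training examples
    mispredicted -- boolean array, True where the previous member erred
    Assumes 0 < err < 0.5 (better than random, not perfect).
    """
    err = weights[mispredicted].sum() / weights.sum()
    alpha = np.log((1.0 - err) / err)               # the member's importance factor
    new_w = weights * np.exp(alpha * mispredicted)  # emphasize the "harder" examples
    return new_w / new_w.sum(), alpha
```

The returned alpha feeds the weighted-averaging aggregator, so members that were accurate on their round count for more.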

Page 27

BOOSTING VS BAGGING

• The weighted resample is more “directed” towards improving accuracy

• More sensitive to class imbalance

• Sequential (boosting) vs parallel (bagging) training

Page 28

DEEP LEARNING ENSEMBLES

• A nice way to win Kaggle competitions (and Netflix prize)

• The Ensemble part always seems to be an afterthought

• The Deep Learning model is still treated as a black box

• Training cost: slow × N

Page 29

EXAMPLE

• Small CNN on CIFAR-10, 50 epochs: ~84%

• Bagging 3× of the same network: ~86%

• Trained in 3× the time of a single network

Page 30

SECOND-GENERATIONDEEP LEARNING ENSEMBLES

Page 31

SECOND-GEN ENSEMBLES

• Stop treating the model as a “black box”

• Use properties of the model to improve the ensemble

• Specific to a DL algorithm, but gives faster training and better results

Page 32

SECOND-GEN: AN EXAMPLE

• Based on AdaBoost

• Uses properties of CNNs

• Faster training at each round

Page 33

CNN REFRESHER

• A particular type of ANN that uses the convolution operation, along with weight sharing, to pass a common function (a feature map) over an input with spatial locality.

Input (4×4):

0 0 0 0
0 1 1 0
0 1 1 0
0 0 0 0

Feature map (3×3):

0 1 0
0 1 1
0 0 0

Output (2×2):

2 1
3 2

[Figure: a CNN classifying an image, with softmax output 0.60 dinosaur, 0.18 cup, 0.12 grass, 0.10 hat]
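A quick numpy check of the example above: sliding the 3×3 feature map over the 4×4 input (cross-correlation, as most DL libraries implement "convolution") reproduces the 2×2 output.

```python
import numpy as np

x = np.array([[0, 0, 0, 0],
              [0, 1, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 0]])
k = np.array([[0, 1, 0],
              [0, 1, 1],
              [0, 0, 0]])

# Dot each 3x3 window of the input with the feature map.
out = np.array([[(x[i:i+3, j:j+3] * k).sum() for j in range(2)]
                for i in range(2)])
print(out)  # [[2 1]
            #  [3 2]]
```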

Page 34

TRANSFER OF LEARNING IN CNNS

[Figure: two networks trained on different tasks, one outputting 0.72 cat, 0.18 telephone, 0.10 car and the other 0.60 dinosaur, 0.18 cup, 0.12 grass, 0.10 hat; the early convolutional layers learn features that transfer between tasks.]

Page 35

DEEP INCREMENTAL BOOSTING

1. The first round of boosting runs as a normal network on the full training set

2. At each round, a new dataset is sampled, in the same fashion as AdaBoost

Page 36

DEEP INCREMENTAL BOOSTING

3. The first N layers are copied from one round to the next, reducing the amount of time needed to learn

[Figure: the new round's network, seeded with the copied layers, refines its predictions, e.g. from 0.62 cat to 0.72 cat.]

Page 37

DEEP INCREMENTAL BOOSTING

4. The network is extended by a single layer, to give it capacity to learn the “corrections” in the new training set

[Figure: the first N layers are carried over and a new layer is appended, sharpening the predictions, e.g. from 0.72 cat to 0.77 cat.]
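A hedged Keras-style sketch of steps 3–4; grow_network and the layer sizes are our illustration, not the talk's code (the real implementation lives in Toupee):

```python
from keras.models import Sequential
from keras.layers import Dense

def grow_network(prev_model, n_copy, n_classes=10):
    """Build the next boosting round's network: copy the first n_copy
    trained layers from the previous round, then append one fresh layer
    and a fresh output layer."""
    model = Sequential()
    for old in prev_model.layers[:n_copy]:
        new = type(old).from_config(old.get_config())
        model.add(new)
        new.set_weights(old.get_weights())  # carry the learned weights over
    model.add(Dense(128, activation='relu'))  # new capacity for the "corrections"
    model.add(Dense(n_classes, activation='softmax'))
    model.compile(optimizer='sgd', loss='categorical_crossentropy')
    return model
```

Because the copied layers start from already-useful weights, each round needs far fewer epochs than training from scratch, which is where the 1.3× figure on the next slide comes from.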

Page 38

RESULTS

• Same CNN on CIFAR-10: ~84%

• 3× DIB: ~89% (bagging was ~86%)

• Trained in 1.3× the time of a single network

• Still have to run 3 networks at test time

Page 39

DISTILLATION

• Train a cumbersome model (for example an Ensemble)

• Train a smaller model on the output of the large one

• Use the small one as a cheaper approximation

Page 40

DISTILLATION OF ENSEMBLES

• Train a single network with the same shape as one of the ensemble members, to mimic the entire ensemble

• If done with a large enough ensemble, it can also serve as a regularization method
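A minimal sketch, reusing the hypothetical make_model() and members from the bagging example above; Hinton-style distillation also softens the targets with a temperature, which we omit here:

```python
import numpy as np

def distil(members, x_train, epochs=50):
    """Train one member-sized network to mimic the whole ensemble."""
    # Soft targets: the ensemble's averaged softmax output on the training set.
    soft_targets = np.mean([m.predict(x_train) for m in members], axis=0)
    student = make_model()  # same shape as a single ensemble member
    student.fit(x_train, soft_targets, epochs=epochs, verbose=0)
    return student
```

At test time only the student runs, which is what collapses the GPU memory numbers two slides below back to the single-network figure.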

Page 41

QUICK EXPERIMENT

• Very short training (~10 mins) for each network

• Overfit MNIST

• Underfit CIFAR-10

• Distill AdaBoost and Bagging (10 members)

Page 42

RESULTS (SMALL NETS)

                    MNIST    CIFAR-10
Single              0.66%    26.77%
AdaBoost            0.63%    24.61%
AdaBoost Distilled  0.52%    24.05%
Bagging             0.59%    22.30%
Bagging Distilled   0.55%    23.65%

(Test error; lower is better.)

Page 43

GPU MEMORY USAGE

            Single   Bagging   Distilled
MNIST       130M     1.4G      130M
CIFAR-10    450M     4.5G      450M

Page 44

APPROXIMATE ENSEMBLES: RESIDUAL NETWORKS

Residual Networks can also be thought of as an approximation of an ensemble, if we unroll the entire tree of paths. We can even remove some paths!


Look familiar?
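A minimal residual block in the Keras 2 functional API makes the unrolling concrete (the sizes are arbitrary):

```python
from keras.layers import Input, Dense, add
from keras.models import Model

def residual_block(x, width=64):
    f = Dense(width, activation='relu')(x)
    return add([x, f])  # y = x + F(x): identity path plus residual path

inp = Input(shape=(64,))
out = residual_block(residual_block(inp))
model = Model(inp, out)
# Unrolled: out = x + f1(x) + f2(x + f1(x)). Every block doubles the number
# of input-to-output paths, and removing a path still leaves a working model,
# just as removing a member leaves a working ensemble.
```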

Page 46

BOOSTED RESIDUAL NETWORKS

[Diagram: DIB's incrementally grown plain network vs. BRN's incrementally grown Residual Network]

Build a Residual Network incrementally and keep all the intermediate networks to make an ensemble

Page 47

BRN: RESULTS

[Chart: accuracy (%) over boosting rounds 1–10 for DIB and BRN, on a scale from 99.32% to 99.5%.]

Page 48

BRN: RESULTS

• ~9.5% error on CIFAR-10 with no dataset augmentation, in less than 8 hours and a combined total of 190 epochs (100 + 9 × 10)

• ~31% error on CIFAR-100 in the same time

Page 49

TAKEAWAYS

• Deep Learning isn’t just about making bigger networks, we can also use many networks

• Ensembles don’t have to treat the members as black boxes

• There are methods to train DL ensembles quickly and to use them on small hardware

• This will emerge as a standard methodology as hardware becomes faster and memory grows

Page 50

RESOURCES AND TOOLS

Page 51

SOFTWARE

• Toupee (github.com/nitbix/toupee): library for DL ensembles and toolkit for running DL (and ensemble) experiments

• Keras fork (github.com/nitbix/keras)

• Contributions welcome! (and encouraged, especially documentation)

Page 52

MY RECENT PAPERS

• "Deep Incremental Boosting", A. Mosca & G. D. Magoulas, GCAI 2016

• "Boosted Residual Networks", A. Mosca & G. D. Magoulas, submitted to ICLR 2017

• "Regularizing Deep Learning Ensembles by Distillation", A. Mosca & G. D. Magoulas, CIMA Workshop at ECAI 2016

• "Fast Training of Convolutional Networks with Weight-wise Adaptive Learning Rates", A. Mosca & G. D. Magoulas, submitted to ESANN 2017

Page 53

OTHER LEARNING RESOURCES

• Kaggle ensembling guide: http://mlwave.com/kaggle-ensembling-guide/

• Ensemble methods and Deep Learning: https://fenix.tecnico.ulisboa.pt/downloadFile/563568428719666/licao_21.pdf