Don’t Decay the Learning Rate, Increase the Batch Size

Samuel L. Smith, Pieter-Jan Kindermans, Quoc V. Le

December 9th 2017

Google Brain

slsmith@

Proprietary + Confidential

Three related questions:

● What properties control generalization?

● How should we tune SGD hyper-parameters?

● Can we train efficiently with large batches? (> 50,000 examples)

Small batches out-generalize large batches (at constant learning rate)

As observed by:

“On Large Batch Training…”, Keskar et al. (2017)

Which minimum is best?

Bayesian model comparison

Probability ratio of two competing models

Prior probability ratio of the models. Usually 1.

The evidence ratio!
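The equation these three labels annotate did not survive extraction. A plausible reconstruction, assuming the standard form of Bayesian model comparison and writing M_1 and M_2 for the two models, {x} for the training inputs and {y} for the labels (notation assumed here, not taken from the slide), is:

```latex
\frac{P(M_1 \mid \{y\}, \{x\})}{P(M_2 \mid \{y\}, \{x\})}
  = \frac{P(\{y\} \mid \{x\}; M_1)}{P(\{y\} \mid \{x\}; M_2)}
    \cdot \frac{P(M_1)}{P(M_2)}
```

The left-hand side is the probability ratio of the two models, the first factor on the right is the evidence ratio, and the second factor is the prior ratio, which equals 1 when neither model is favored in advance.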

The Bayesian evidence (Gaussian approximation)

λ_i is the i-th Hessian eigenvalue

λ is the L2 regularization parameter

Evidence for a minimum

Depth of the minimum

Width of the minimum

Invariant to changes in model parameterization (sharp minima can’t generalize!)
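To make the depth/width trade-off concrete, here is a minimal sketch. It assumes the Gaussian approximation has the form ln P ≈ -(C(ω_0) + ½ Σ_i ln(λ_i / λ)), as given in arXiv:1710.06451, where C(ω_0) is the regularized cost at the minimum. The function name `log_evidence` and the toy numbers are hypothetical and are not taken from the slides.

```python
import math

def log_evidence(cost_at_minimum, hessian_eigenvalues, l2_lambda):
    """Approximate log evidence of a minimum (assumed Gaussian/Laplace form).

    cost_at_minimum     -- regularized cost C(w0): the "depth" term
    hessian_eigenvalues -- positive eigenvalues of the Hessian at the minimum
    l2_lambda           -- L2 regularization parameter
    """
    # "Width" term: large eigenvalues (a sharp minimum) are penalized.
    occam_factor = 0.5 * sum(math.log(ev / l2_lambda) for ev in hessian_eigenvalues)
    return -(cost_at_minimum + occam_factor)

# Toy comparison with made-up numbers: a deep but sharp minimum
# versus a shallower but much broader one.
deep_sharp = log_evidence(1.0, [100.0, 100.0], l2_lambda=1.0)
shallow_broad = log_evidence(2.0, [2.0, 2.0], l2_lambda=1.0)
```

With these numbers the shallower, broader minimum ends up with the higher evidence, which is the sense in which depth alone does not decide which minimum is best.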

Which minimum is best?

Generalization is a weighted combination of:

1) Depth

2) Width

SGD should not minimize the cost function

It should maximize the evidence

The SGD gradient update

Noise

True gradient

Batch size
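The update equation on these slides was lost in extraction, but the labels describe a minibatch gradient as the true gradient plus noise that depends on the batch size. The sketch below illustrates that decomposition on made-up data for a single scalar parameter. All names and numbers are hypothetical.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Toy per-example gradients of one scalar parameter (made-up data).
per_example_grads = [rng.gauss(1.0, 2.0) for _ in range(10_000)]
true_gradient = statistics.fmean(per_example_grads)

def minibatch_gradient(batch_size):
    """Average gradient over a random batch drawn without replacement."""
    batch = rng.sample(per_example_grads, batch_size)
    return statistics.fmean(batch)

def noise_std(batch_size, trials=500):
    """Empirical standard deviation of (minibatch gradient - true gradient)."""
    errors = [minibatch_gradient(batch_size) - true_gradient for _ in range(trials)]
    return statistics.pstdev(errors)

# Noise level for a few batch sizes; larger batches give less noise.
noise_by_batch = {b: noise_std(b) for b in (10, 100, 1000)}
```

In this toy setting the noise shrinks roughly as one over the square root of the batch size, and it vanishes when the batch is the whole training set.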

How to choose the batch size? (at constant learning rate)

Too little noise (big batches)

Too much noise (small batches)

Just right!

There should be an optimum batch size

As predicted!

Defining the SGD “noise scale”

SGD integrates an underlying stochastic differential equation

After a little math:

Prediction:
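The formulas on this slide are missing from the extracted text. The sketch below assumes the noise scale given in the accompanying paper (arXiv:1711.00489), g = ε(N/B - 1) ≈ εN/B for B much smaller than N, where ε is the learning rate, N the training set size and B the batch size. If there is an optimal g, the batch size that achieves it then scales as B ∝ εN. Function names and numbers are illustrative only.

```python
def noise_scale(learning_rate, train_set_size, batch_size):
    """Assumed SGD noise scale g = eps * (N / B - 1)."""
    return learning_rate * (train_set_size / batch_size - 1.0)

def batch_size_for_noise(learning_rate, train_set_size, target_noise):
    """Invert g = eps * (N / B - 1) to get B = eps * N / (g + eps)."""
    return learning_rate * train_set_size / (target_noise + learning_rate)

# Made-up example: 50,000 training examples, learning rate 0.1, batch size 128.
g = noise_scale(0.1, 50_000, 128)

# Doubling the learning rate at the same noise scale roughly doubles the batch size.
doubled_lr_batch = batch_size_for_noise(0.2, 50_000, g)
```

Here `doubled_lr_batch` comes out close to 256, consistent with the batch size scaling linearly with the learning rate while B stays much smaller than N.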

Consequences

1) We can linearly scale batch size and learning rate
    ■ “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”, Goyal et al. (2017)

2) We expect training sets to grow over time
    ■ Suggests batch sizes will rise

What about momentum?
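The momentum slides consist of equations that were lost in extraction. As a hedged sketch, assume the momentum form of the noise scale given in arXiv:1711.00489, g ≈ εN / (B(1 - m)), where m is the momentum coefficient. Under that assumption, raising m lets the batch size grow by a factor of 1/(1 - m) at the same noise scale. The function name and numbers are hypothetical.

```python
import math

def momentum_noise_scale(learning_rate, train_set_size, batch_size, momentum):
    """Assumed noise scale for SGD with momentum: g = eps * N / (B * (1 - m))."""
    return learning_rate * train_set_size / (batch_size * (1.0 - momentum))

# Made-up example: going from m = 0.9 to m = 0.975 shrinks (1 - m) by 4x,
# so a 4x larger batch keeps the noise scale unchanged.
g_small_batch = momentum_noise_scale(0.1, 50_000, 128, momentum=0.9)
g_large_batch = momentum_noise_scale(0.1, 50_000, 512, momentum=0.975)
same_noise = math.isclose(g_small_batch, g_large_batch, rel_tol=1e-9)
```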

Decaying learning rate and increasing batch size are equivalent

We can choose any combination of learning rate ε and batch size B with the same noise scale g (so long as ε isn’t too large).

Three equivalent schedules:

Wide ResNet on CIFAR-10
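The schedule plot itself is not in the extracted text. The sketch below shows one way three such schedules could be written so that the ratio of learning rate to batch size is identical at every epoch: decay the learning rate, increase the batch size, or a hybrid that first increases the batch size and then decays the learning rate. The starting values, the factor of 5, and the milestone epochs are illustrative assumptions, not values read from the slide.

```python
import math

def schedule(epoch, mode, base_lr=0.1, base_batch=128, factor=5,
             milestones=(60, 120, 160)):
    """Return (learning_rate, batch_size) at a given epoch.

    mode is "decay_lr", "increase_batch", or "hybrid".
    """
    steps = sum(epoch >= m for m in milestones)  # milestones passed so far
    if mode == "decay_lr":
        batch_steps = 0
    elif mode == "increase_batch":
        batch_steps = steps
    elif mode == "hybrid":
        batch_steps = min(steps, 1)  # grow the batch once, then decay the rate
    else:
        raise ValueError(f"unknown mode: {mode}")
    lr_steps = steps - batch_steps
    return base_lr / factor ** lr_steps, base_batch * factor ** batch_steps

# At every epoch the three modes share the same learning_rate / batch_size.
ratios_at_130 = [lr / batch for lr, batch in
                 (schedule(130, m) for m in ("decay_lr", "hybrid", "increase_batch"))]
```

Because the noise scale is approximately proportional to ε/B, matching that ratio is what makes the schedules equivalent in this picture. The batch-increasing schedules reach the end of training with fewer parameter updates.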

Training curves:

Ghost batch norm, Hoffer et al. (2017)

Computational cost constant, but parallelizable

Test curves for four optimizers: momentum, Nesterov momentum, vanilla SGD, Adam

Towards large batch training:

Typical speed-up: 10-100X

Why does momentum scaling reduce test accuracy?

“Accumulation” stores moving average of gradients

Larger momentum equals longer memory

The gradient changes too slowly as we explore the parameter space
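The "longer memory" point can be illustrated with a small sketch. It assumes the common accumulation rule A ← mA + gradient (one convention among several; the slides' exact update is not in the extracted text) and counts how many updates a single gradient contribution needs to decay below 1/e of its initial size. Names are hypothetical.

```python
import math

def momentum_step(accumulation, gradient, momentum):
    """One update of the gradient accumulation: A <- m * A + gradient."""
    return momentum * accumulation + gradient

def steps_to_forget(momentum, threshold=math.exp(-1)):
    """Updates until a unit gradient's contribution decays below the threshold."""
    accumulation = momentum_step(0.0, 1.0, momentum)  # inject one unit gradient
    steps = 0
    while accumulation > threshold:
        accumulation = momentum_step(accumulation, 0.0, momentum)  # no new gradient
        steps += 1
    return steps

memory_at_0_9 = steps_to_forget(0.9)      # about 1 / (1 - 0.9) updates
memory_at_0_975 = steps_to_forget(0.975)  # about 1 / (1 - 0.975) updates
```

The memory grows roughly as 1/(1 - m) updates, so a larger momentum coefficient means the accumulation keeps following stale gradients for longer as the parameters move.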

Training ImageNet in under 2500 updates!

Inception-ResNet-V2

Original implementation: ~400,000 updates

“ImageNet in one hour”, Goyal et al. (2017) (learning rate scaling): ~14,000 updates

79% accuracy in under 6000 updates

77% accuracy in under 2500 updates

Batches of 65,536 images

● “A Bayesian Perspective on Generalization and Stochastic Gradient Descent”, Samuel L. Smith and Quoc V. Le, arXiv:1710.06451

● “Don’t Decay the Learning Rate, Increase the Batch Size”, Samuel L. Smith*, Pieter-Jan Kindermans* and Quoc V. Le (*equal contribution), arXiv:1711.00489

● “Stochastic Gradient Descent as Approximate Bayesian Inference”, Stephan Mandt, Matthew D. Hoffman and David M. Blei, arXiv:1704.04289

Thank You!

slsmith@

pikinder@

qvl@