Transcript
Page 1: Entropy-SGD: biasing gradient descent into wide valleys (vision.ucla.edu/~pratikac/pub/chaudhari.choromanska.ea.iclr17...)

Entropy-SGD: biasing gradient descent into wide valleys

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, Riccardo Zecchina

Motivation

Hessian of small-LeNet at an optimum

[Figure: histograms of the eigenvalues of the Hessian (frequency on a log scale vs. eigenvalue), with a zoom on the negative part of the spectrum; the negative tail is short]

Empirical validation

All-CNN on CIFAR-10

[Figure: test error (%) and cross-entropy loss vs. epochs × L; final values 7.71% error and 0.0353 loss for SGD vs. 7.81% error and 0.0336 loss for Entropy-SGD]

PTB and char-RNN

[Figure: perplexity vs. epochs × L; PTB word-level: 81.43 (test 78.6) for SGD vs. 80.116 (test 77.656) for Entropy-SGD; char-RNN: 1.224 (test 1.226) for Adam vs. 1.213 (test 1.217) for Entropy-Adam]

Local entropy amplifies wide minima

Discrete Perceptrons

‣ What is the shape of the energy landscape?

‣ Reinforce SGD with properties of the loss function

‣ Does geometry connect to generalization vs. the complexity of training?

Modify the loss function

[Figure: a one-dimensional loss f(x) together with its local entropy F(x, 10^3) and F(x, 2 × 10^4); the original global minimum x̂_f of f sits in a sharp valley, while the new global minimum x̂_F of the local entropy moves to the wide valley around x_candidate]
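Numerically, this picture can be reproduced directly from the convolution definition of local entropy, $F(x,\gamma) = -\log\big[G_\gamma * e^{-f}\big]$, that appears further down in this transcript. The double-well loss, the grid, and the two scope values in the sketch below are my own illustrative choices, not the paper's.

```python
import numpy as np

# Toy 1-D loss with a sharp, slightly deeper valley near x = -0.5 and a wide,
# slightly shallower valley near x = 1.0 (purely illustrative).
xs = np.linspace(-2.0, 3.0, 5001)
dx = xs[1] - xs[0]
f = (1.0
     - 0.9 * np.exp(-((xs - 1.0) ** 2) / (2 * 0.25))      # wide valley
     - 1.0 * np.exp(-((xs + 0.5) ** 2) / (2 * 0.001)))    # sharp valley

def local_entropy(f_vals, gamma):
    """F(x, gamma) = -log( G_gamma * exp(-f) ), via discrete Gaussian convolution."""
    sigma = np.sqrt(gamma)
    ks = np.arange(-4 * sigma, 4 * sigma + dx, dx)
    kernel = np.exp(-ks ** 2 / (2 * gamma))
    kernel /= kernel.sum() * dx                            # normalized Gaussian kernel
    return -np.log(np.convolve(np.exp(-f_vals), kernel, mode="same") * dx)

print("argmin of f           :", round(xs[np.argmin(f)], 3))   # the sharp valley
for gamma in (1e-4, 0.09):                                      # small vs. large "scope"
    F = local_entropy(f, gamma)
    print(f"argmin of F(., {gamma:g}):", round(xs[np.argmin(F)], 3))
# For the larger scope the minimizer moves from the sharp valley to the wide one.
```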

‣ Modified energy landscape is smoother by a factor $1/(1+\gamma c)$, provided there exists $c > 0$ such that $\lambda(\nabla^2 f(x)) \notin [-2\gamma^{-1}, c]$

Theorem (bound the generalization error using stability; Hardt et al., '15): if $f(x)$ is $\alpha$-Lipschitz and $\beta$-smooth, then

$\varepsilon_{\text{Entropy-SGD}} \;\lesssim\; (\alpha\, T)^{\,\beta\left[1 - \frac{1}{1+\gamma c}\right]} \;\le\; \varepsilon_{\text{SGD}}$
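To see where the $1/(1+\gamma c)$ smoothing factor can come from, here is a short calculation for the special case of a quadratic loss; treating $f$ as exactly quadratic (with a positive-definite shifted Hessian so the integral converges) is an illustrative assumption of this sketch, not part of the poster's statement.

```latex
% Sketch: local entropy of a quadratic loss f(x) = (1/2) x^T H x,
% assuming H + gamma^{-1} I is positive definite so the integral converges.
\begin{align*}
F(x,\gamma)
  &= -\log \int \exp\!\Big(-\tfrac{1}{2}\, x'^{\top} H\, x'
       \;-\; \tfrac{1}{2\gamma}\,\lVert x - x' \rVert^{2}\Big)\, \mathrm{d}x' \\
  &= \tfrac{1}{2}\, x^{\top} H\,(I + \gamma H)^{-1} x \;+\; \mathrm{const}.
\end{align*}
% Each eigenvalue mu of the Hessian of f thus becomes mu / (1 + gamma mu) in the
% Hessian of F: curvature of size c is damped by the factor 1 / (1 + gamma c),
% while near-flat directions (mu close to 0) are left essentially unchanged.
```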

‣ Simulated annealing fails (Braunstein, Zecchina '05; Baldassi et al. '16)

$F(x, d) = \log\,\Big|\big\{\, x' : \#\text{mistakes}(x') = 0,\ \lVert x - x' \rVert = d \,\big\}\Big|$

dense clusters provably generalize better, and are absent in the standard replica analysis

isolated solutions cause a slow-down

‣ Local entropy counts the number of solutions in a neighborhood
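As a concrete reading of the counting definition of F(x, d) above, here is a brute-force sketch for a tiny binary perceptron; the random ±1 patterns, the sizes, and the use of Hamming distance as the distance d are my own illustrative choices.

```python
import itertools
import numpy as np

# Tiny binary perceptron: w in {-1,+1}^n classifies a pattern xi via sign(w . xi).
# Local entropy of a solution w0 at distance d (counting definition from the poster):
#   F(w0, d) = log |{ w' : #mistakes(w') = 0, distance(w0, w') = d }|
# Here distance is the Hamming distance, evaluated by exhaustive enumeration.

rng = np.random.default_rng(0)
n, p = 11, 8                                   # toy sizes: n weights, p random patterns
patterns = rng.choice([-1, 1], size=(p, n))
labels = rng.choice([-1, 1], size=p)

def mistakes(w):
    return int(np.sum(np.sign(patterns @ w) != labels))

# Enumerate all 2^n weight vectors and keep the zero-error ones.
solutions = [np.array(w) for w in itertools.product([-1, 1], repeat=n)
             if mistakes(np.array(w)) == 0]

if not solutions:
    print("no zero-error solutions for this draw; try another seed")
else:
    w0 = solutions[0]                          # reference solution
    for d in range(n + 1):
        count = sum(int(np.sum(w != w0)) == d for w in solutions)
        if count:
            print(f"d = {d:2d}   #solutions = {count:4d}   F(w0, d) = {np.log(count):.2f}")
```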

‣ Smooth the original loss f using a convolution with a Gaussian kernel G_γ of "scope" γ, the "local entropy":

$F(x, \gamma) = -\log\big[\, G_\gamma * e^{-f(x)} \,\big]$

‣ Gradient: the deviation of x from the expected value of a local Gibbs distribution,

$\nabla F(x, \gamma) = \gamma^{-1}\big( x - \langle x' \rangle \big), \qquad \langle x' \rangle = \frac{1}{Z(x)} \int x' \exp\!\Big( -f(x') - \frac{1}{2\gamma}\,\lVert x - x' \rVert^{2} \Big)\, \mathrm{d}x'$

where the quadratic coupling focuses the average on a neighborhood of x

‣ Estimate the gradient using MCMC (Langevin dynamics): extremely general and scalable; decrease the scope γ with training iterations (Baldassi et al., '15)
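Putting the pieces together, here is a minimal sketch of the resulting update in plain NumPy: an inner loop of Langevin dynamics estimates ⟨x'⟩ through an exponential moving average, and the outer step descends along γ⁻¹(x − ⟨x'⟩). The toy loss, step sizes, loop length, averaging factor, and reduced noise scale are all my illustrative assumptions, not the authors' settings.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_f(x):
    # Gradient of a toy non-convex loss f(x) = x^4 - 3 x^2 + 0.3 x
    # (an illustrative stand-in for a network's training loss).
    return 4 * x**3 - 6 * x + 0.3

def entropy_sgd_step(x, gamma=0.03, eta=0.05, eta_inner=0.01, L=20, alpha=0.75, noise=0.01):
    """One outer update: estimate <x'> with L steps of Langevin dynamics on the
    local Gibbs measure exp(-f(x') - ||x - x'||^2 / (2 gamma)), then move x
    along the local-entropy gradient grad F(x, gamma) = (x - <x'>) / gamma."""
    x_prime = x.copy()
    mu = x.copy()                                      # running estimate of <x'>
    for _ in range(L):
        g = grad_f(x_prime) + (x_prime - x) / gamma    # drift of the local Gibbs measure
        x_prime = (x_prime - eta_inner * g
                   + noise * np.sqrt(2 * eta_inner) * rng.normal(size=x.shape))
        mu = alpha * mu + (1 - alpha) * x_prime        # exponential moving average
    return x - eta * (x - mu) / gamma                  # descend along grad F(x, gamma)

x = np.array([1.6])
for _ in range(200):
    x = entropy_sgd_step(x)
print("x after 200 Entropy-SGD-style updates:", x)
```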
