
Page 1

A picture of the energy landscape of deep neural networks

Pratik Chaudhari

December 15, 2017


UCLA VISION LAB

Page 2


What is the shape?

How should it inform the optimization?

$$\hat{y}(x; w) = \sigma\big(w^p\, \sigma(w^{p-1}(\cdots\, \sigma(w^1 x))\cdots)\big)$$

$$w^* = \operatorname*{argmin}_w\; \underbrace{\sum_{(x,y)\,\in\, D}\; \sum_{i=1}^{k} -\mathbf{1}\{y=i\}\, \log \hat{y}_i(x; w)}_{\triangleq\, f(w)}$$

(S)GD

Many, many variants: AdaGrad, SVRG, rmsprop, Adam, Eve, APPA, Catalyst, Natasha, Katyusha…

$$w_{k+1} = w_k - \eta\, \nabla f_b(w_k)$$

“variance reduction”
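As a point of reference, here is a minimal NumPy sketch of the plain SGD update above; `grad_f_batch` is a hypothetical mini-batch gradient oracle, not code from the talk.

```python
import numpy as np

def sgd(w, grad_f_batch, lr=0.1, steps=100):
    """Plain SGD: w_{k+1} = w_k - lr * grad_f_b(w_k)."""
    for _ in range(steps):
        w = w - lr * grad_f_batch(w)  # stochastic mini-batch gradient of f
    return w

# Toy usage on a quadratic with noisy gradients.
rng = np.random.default_rng(0)
print(sgd(np.ones(3), lambda w: 2 * w + 0.1 * rng.standard_normal(w.shape)))
```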

Page 3

SGD (Lee et al. '16): mere GD always finds local minima

Ge et al. '15; Sun et al. '15, '16: can escape strict saddle points

Saxe et al. '14: orthogonal initializations

variance reduction

Schmidt et al. '13; Defazio et al. '14

AdaGrad (Duchi et al. '10)

What do we know about the energy landscape?


multiple, equivalent minima

linear networks

Duvenaud et al. '14: deep GPs

Soudry & Carmon '16: piecewise linear

Haeffele & Vidal '15: matrix/tensor factorization

Baldi & Hornik '89: PCA

our work; Hardt et al. '15; Goodfellow & Vinyals '15

generalization: can easily get zero training error, yet generalize poorly

paradox between a “benign” energy landscape and delicate training algorithms

empirical results

all local minima are close to the global minimum

Dauphin et al., '14

saddle points slow down SGD

statistical physics

binary perceptron

Bray & Dean '07; Fyodorov & Williams '07; Choromanska et al. '15; Chaudhari & Soatto '15: Gaussian random fields, spin glasses

many descending directions at high energy

Baldassi et al. ’15

Page 4

Motivation from the Hessian


[Histograms of Hessian eigenvalues: the full spectrum (eigenvalues from −5 to 40, frequency on a log scale) and a zoom on the negative part (eigenvalues from −0.5 to 0.0).]

Short negative tail

Page 5

How do we exploit this?


Magnify the energy landscape and smooth with a kernel

$$w^* = \operatorname*{argmin}_w f(w) = \operatorname*{argmax}_w e^{-f(w)} \approx \operatorname*{argmin}_w\, -\log\big(G_\gamma * e^{-f}\big)(w)$$

$G_\gamma$: Gaussian kernel of variance $\gamma$; the smoothing focuses on the neighborhood of $w$.
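A small 1-D illustration of this smoothing, assuming a toy loss of my own choosing; `scipy.ndimage.gaussian_filter1d` plays the role of convolution with $G_\gamma$.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Toy 1-D loss (not from the talk): a double well with ripples on top.
x = np.linspace(-2.0, 2.0, 4001)
f = x**4 - x**2 + 0.1 * np.sin(25 * x)
dx = x[1] - x[0]

# -log(G_gamma * exp(-f)): convolve exp(-f) with a Gaussian kernel, take -log.
sigma = 0.2  # kernel width in x-units
f_smooth = -np.log(gaussian_filter1d(np.exp(-f), sigma=sigma / dx))

print("argmin f:        ", x[np.argmin(f)])
print("argmin smoothed: ", x[np.argmin(f_smooth)])
```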

Page 6

[1-D illustration: the original loss f(x) and the smoothed losses F(x, 10³) and F(x, 2 × 10⁴); the candidate point x, the minimizer x̂_F of F, and the minimizer x̂_f of f are marked.]

A physics interpretation


new global minimum

original global minimum

Our new loss is “local entropy” (Baldassi et al. '15, '16)

$$f_\gamma(w) = -\log \int_{w'} \exp\!\Big(-f(w') - \frac{1}{2\gamma}\|w - w'\|^2\Big)\, dw'$$

Page 7

Minimizing local entropy


‣ Solve

‣ Estimate the gradient using MCMC

can be applied to general deep networks

$$w^* = \operatorname*{argmin}_w f_\gamma(w)$$

‣ Gradient of local entropy

$$\nabla f_\gamma(w) = \gamma^{-1}\big(w - \langle w' \rangle\big)$$

$\langle \cdot \rangle$ denotes an expectation over a local Gibbs distribution:

$$\langle w' \rangle = Z(w, \gamma)^{-1} \int_{w'} w'\, \exp\!\Big(-f(w') - \frac{1}{2\gamma}\|w - w'\|^2\Big)\, dw'$$
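A schematic sketch of one such step: the expectation ⟨w'⟩ is estimated with a short run of stochastic gradient Langevin dynamics, then w moves along γ⁻¹(w − ⟨w'⟩). Step sizes, loop length, and temperature below are illustrative placeholders, not the talk's hyper-parameters.

```python
import numpy as np

def entropy_sgd_step(w, grad_f, gamma=0.1, eta=0.1, inner_eta=0.01,
                     inner_steps=20, temp=1e-4, rng=np.random.default_rng(0)):
    """Estimate <w'> under the local Gibbs distribution with Langevin dynamics,
    then take one step along the local-entropy gradient gamma^{-1} (w - <w'>)."""
    w_prime, mean = w.copy(), w.copy()
    for k in range(1, inner_steps + 1):
        g = grad_f(w_prime) + (w_prime - w) / gamma   # gradient of the Gibbs energy
        w_prime = (w_prime - inner_eta * g
                   + np.sqrt(2 * inner_eta * temp) * rng.standard_normal(w.shape))
        mean += (w_prime - mean) / k                  # running average of w'
    return w - eta * (w - mean) / gamma               # outer update
```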

Page 8

Medium-scale CNN


‣ All-CNN-BN on CIFAR-10

‣ Do not see much plateauing of training or validation loss

[Plots: % validation error and training cross-entropy loss vs. epochs × L for SGD and Entropy-SGD; final errors 7.71% and 7.81%, final cross-entropy losses 0.0353 and 0.0336. The x-axis (epochs × L) is proportional to wall-clock time.]

Page 9


A PDE interpretation

‣ Local entropy is the solution of a Hamilton-Jacobi equation

original loss as the initial condition

$$u_t = -\frac{1}{2}\|\nabla u\|^2 + \frac{1}{2\beta}\,\Delta u, \qquad u(w, 0) = f(w)$$

‣ Stochastic control interpretation

quadratic penalty for greedy gradient descent

$$dw = -\alpha(s)\, ds + dB(s), \quad t \le s \le T, \qquad w(t) = w$$

$$C\big(w(\cdot), \alpha(\cdot)\big) = \mathbb{E}\left[ f(w(T)) + \frac{1}{2}\int_t^T \|\alpha(s)\|^2\, ds \right]$$

$$u(w, t) = \min_{\alpha(\cdot)} C\big(w(\cdot), \alpha(\cdot)\big), \qquad \alpha(w, t) = \nabla u(w, t)$$

$$f_\gamma(w) = u(w, \gamma)$$
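For intuition, the viscous HJ equation above can be integrated numerically in 1-D with a naive explicit scheme. This is a toy sketch (grid, time step, and β are arbitrary choices of mine), not the solver used in the work.

```python
import numpy as np

def viscous_hj_1d(f_vals, dx, T, beta=1.0):
    """Explicit scheme for u_t = -1/2 |u_x|^2 + 1/(2 beta) u_xx, u(., 0) = f."""
    u = f_vals.astype(float).copy()
    dt = 0.2 * beta * dx**2                # conservative step for the diffusion term
    for _ in range(int(T / dt)):
        ux = np.gradient(u, dx)            # central differences
        uxx = np.gradient(ux, dx)
        u = u + dt * (-0.5 * ux**2 + 0.5 / beta * uxx)
    return u

x = np.linspace(-2.0, 2.0, 401)
u_T = viscous_hj_1d(x**4 - x**2 + 0.1 * np.sin(25 * x), x[1] - x[0], T=0.1)
```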

Page 10

New PDEs


‣ Use the non-viscous HJ equation: drop the viscous term $\frac{1}{2\beta}\Delta u$ to get

$$u_t = -\frac{1}{2}\|\nabla u\|^2$$

‣ Hopf-Lax formula gives the solution: the “inf-convolution” or Moreau envelope

$$u(w, t) = \inf_{w'}\left\{ f(w') + \frac{1}{2t}\|w - w'\|^2 \right\}$$

‣ This has a few magical properties…

‣ Simple formula for the gradient (proximal point iteration):

$$p^* = \nabla f(w - t\, p^*), \qquad \nabla u(w, t) = p^*$$
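One naive way to use this identity: iterate the fixed-point equation for p*, then take a proximal step. A sketch under the assumption that t is small relative to the curvature of f, so the iteration converges; the toy loss and step sizes are placeholders.

```python
import numpy as np

def hopf_lax_grad(w, grad_f, t=0.05, iters=50):
    """Fixed-point iteration p <- grad f(w - t p); its limit is grad u(w, t)."""
    p = grad_f(w)
    for _ in range(iters):
        p = grad_f(w - t * p)
    return p

# One proximal-point step on a toy double-well loss f(w) = w^4 - w^2.
grad_f = lambda w: 4 * w**3 - 2 * w
w = np.array([0.9])
w_next = w - 0.05 * hopf_lax_grad(w, grad_f, t=0.05)
```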

Page 11

Smoothing using PDEs


[1-D illustration: the loss f(x) with its smoothings u_viscous-HJ(x, T) and u_non-viscous-HJ(x, T); the initial density (where SGD gets stuck) and the final densities under the non-viscous and viscous HJ smoothings.]

Page 12

Distributed training algorithms


‣ A continuous-time view of local entropy

$$dw = -\gamma^{-1}(w - w')\, ds$$

with the fast variable

$$dw' = -\frac{1}{\epsilon}\left[\nabla f(w') + \frac{1}{\gamma}(w' - w)\right] ds + \frac{1}{\sqrt{\epsilon}}\, dB(s)$$

‣ Elastic-SGD (Zhang et al. '15)

$$\operatorname*{argmin}_{w,\, w'_1, \ldots, w'_p}\; \sum_{k=1}^{p} f(w'_k) + \frac{1}{2\gamma}\|w'_k - w\|^2$$

“homogenized” dynamics as ε → 0

‣ If w’(s) is very fast, w(s) only sees its average

$$dw = -\gamma^{-1}\big(w - \langle w' \rangle\big)\, ds, \qquad \rho^\infty(w'; w) \propto \exp\!\Big(-f(w') - \frac{1}{2\gamma}\|w' - w\|^2\Big)$$

$$\nabla f_\gamma(w) = \gamma^{-1}\big(w - \langle w' \rangle\big)$$
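A serial toy simulation of the Elastic-SGD objective above: p workers do SGD on f plus the elastic term, while the shared center w drifts toward them. The hyper-parameters and toy loss are placeholder choices of mine, not the talk's.

```python
import numpy as np

def elastic_sgd(grad_f, w0, p=4, gamma=0.1, lr=0.01, rounds=200,
                rng=np.random.default_rng(0)):
    """Workers w'_k minimize f(w'_k) + ||w'_k - w||^2 / (2 gamma); w tracks them."""
    w = w0.copy()
    workers = [w0 + 0.1 * rng.standard_normal(w0.shape) for _ in range(p)]
    for _ in range(rounds):
        for k in range(p):
            g = grad_f(workers[k]) + (workers[k] - w) / gamma  # elastic coupling
            workers[k] = workers[k] - lr * g
        w = w + lr * sum(wk - w for wk in workers) / gamma     # center follows workers
    return w

print(elastic_sgd(lambda w: 4 * w**3 - 2 * w, np.array([1.2])))
```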

Page 13

Wide-ResNet on CIFAR-10 / 100

Page 14

WTH is implicit regularization?


Why is SGD so special?

‣ Fokker-Planck equation and optimal transportation

$$\rho^* = \operatorname*{argmin}_\rho\; \int \Phi(w)\, \rho(w)\, dw + \beta^{-1} \int \rho \log \rho\; dw$$

Information bottleneck, Bayesian inference, large batch-sizes, sampling techniques, hyper-parameter choices, neural architecture search, ….

Many, many variants: AdaGrad, SVRG, SAG, rmsprop, Adam, Eve, APPA, Catalyst, Natasha, Katyusha…

‣ Stochastic differential equation

$$dw = -\nabla f(w)\, \underbrace{dt}_{\triangleq\, \eta} + \sqrt{2\beta^{-1} D(w)}\; dB(t), \qquad \beta^{-1} = \frac{\eta}{2b}$$
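A minimal Euler-Maruyama discretization of this SDE on a toy loss, with isotropic D for simplicity and placeholder step sizes; the talk's point is precisely that the real D(w) is highly non-isotropic, which this sketch does not capture.

```python
import numpy as np

def sgd_sde(grad_f, w0, eta=0.1, b=32, D=1.0, steps=2000,
            rng=np.random.default_rng(0)):
    """Euler-Maruyama for dw = -grad f(w) dt + sqrt(2 beta^{-1} D) dB, beta^{-1} = eta/(2b)."""
    beta_inv = eta / (2 * b)
    w, dt = w0.copy(), eta                 # one SGD step corresponds to time eta
    for _ in range(steps):
        noise = np.sqrt(2 * beta_inv * D * dt) * rng.standard_normal(w.shape)
        w = w - dt * grad_f(w) + noise
    return w

print(sgd_sde(lambda w: 4 * w**3 - 2 * w, np.array([1.2])))
```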

Page 15


SGD does not minimize $f(w)$; what is $\Phi(w)$?

‣ Deep networks induce highly non-isotropic noise

CIFAR-10

$\lambda(D) = 0.27 \pm 0.84$, $\operatorname{rank}(D) = 0.34\%$

$\operatorname{div} j(x) = 0$

‣ Leads to deterministic, Hamiltonian dynamics in SGD

$$\dot{x} = j(x)$$

Most likely trajectories of SGD are closed loops

saddle-point

‣ Deep networks have out-of-equilibrium distribution

$$\rho^*(w) \not\propto e^{-\beta f(w)}, \qquad \rho^*(w) \propto e^{-\beta\, \Phi(w)}$$

Page 16


Summary

‣ Techniques from control and physics are interpretable, and also lead to state-of-the-art algorithms

‣ Control has powerful tools to make inroads into understanding and improving deep networks

PDEs, stochastic control, stability of limit cycles, Fokker-Planck equations, continuous-time analysis…

input-output stability, reinforcement learning & optimal control

‣ Deep learning is powerful, and quite “easy” to get into; even the fundamentals are unknown and debated.

Page 17

Thank You!


Joint work with

www.pratikac.info

Stefano Soatto

Guillaume Carlier, Adam Oberman, Stanley Osher

Yann LeCun, Anna Choromanska, Ameet Talwalkar

Carlo Baldassi, Christian Borgs, Jennifer Chayes, Riccardo Zecchina