Synaptic [classical and quantum] fluctuations as a recipe for …cnls.lanl.gov/external/piml/Carlo Balsassi.pdf · 2018-04-23 · Synaptic [classical and quantum] fluctuations as

Synaptic [classical and quantum] fluctuationsas a recipe for robust and efficient

neural network training

R. Zecchina

F. GeraceA. IngrossoC. LucibelloL. SagliettiE. Tartaglione

C. BorgsJ. Chayes

H.J. Kappen

P. ChaudhariS. Soatto

Carlo BaldassiBocconi University, Milano, Italy

Outline (1/4)

?inputs

output

Simple (but surprisingly rich) neuronal models → Stat. Phys. Analysis

More complex neural networks, realistic data → numerical experiments

(replica method)

Outline (2/4)

Learning as an optimization problem

Highly non-convex landscape, many sharp local minima, isolated global minima

Should be hopeless, yet efficient heuristics exist

energy

Outline (3/4)

Large deviation analysis → hidden dense region (high local entropy)

energy

Hidden (eq. analysis ignores it)Dense → robust → generalizes wellAccessible to simple algorithms

Outline (4/4)

High-density states can be targeted by explicitly designed algorithms

But some processes are attracted to them too. In particular:1) Quantum annealing2) Simple gradient on stochastic synapses

energy

Motivation, in brief

● Deep learning models: impressive results (super-human in some cases), very versatile

● Would benefit from: improved theoretical understanding, reduced resources usage, dedicated hardware

● Biological neurons: low-precision, quite noisy

Simplified neuronal models

● Simplest model neuron: perceptron (one layer, discretized time — no dynamics, no Dale's principle...)

?inputs

output


● Simplest model neuron: perceptron (building block of DNNs)

?inputs

output



● Learning as an optimization problem: minimize misclassifications

?inputs

output desired output



● Learning as an optimization problem: minimize misclassifications

● Random i.i.d. patterns, large N → stat. phys. tools → phase transitions (e.g. SAT/UNSAT)

?inputs

output desired output

Equilibrium statistical physics ofneural networks

● Errors ~ energy → Gibbs measure → Stat. Phys. Tools

● SAT/UNSAT phase transition (there is a “critical capacity” αc)

Neural net factor graph(interactions ~ patterns)

Variables(synaptic weights)

Patterns(inputs+des. outputs)

Emin

=0 Emin

>0

Binary (±1) synapses, equilibrium results

● Good capacity: αc = 0.83

● But nasty optimization landscape: typical solutions are isolated, and immersed in exponentially many local minima

● Yet, simple efficient heuristics exists up to αmax ~ 0.75. They find non-isolated solutions, i.e. atypical

● Need to move away from equilibrium analysis

W. Krauth, M. Mézard, J. Phys. France, 1989H. Huang, Y. Kabashima, Phys. Rev. E, 2014

A. Braunstein, R. Zecchina, PRL, 2006C. Baldassi, A. Braunstein, et al, PNAS, 2007

C. Baldassi, J. Stat. Phys., 2009C. Baldassi, A. Braunstein, J. Stat. Mech., 2015

space of configurations

energy landscape

Large deviation analysis: local entropy● Idea: skew the statistical analysis towards non-isolated regions

● Given a configuration, we want to weigh it by how many solutions surround it — local entropy

C. Baldassi, A. Ingrosso, C. Lucibello, L. Saglietti, R. Zecchina, PRL, 2015

Gibbs distribution with a different

energy

Large deviation analysis: Results Summary

● Ultra-dense regions exist at least up to some α ~ 0.77 (good agreement with heuristic solvers)

● Better generalization properties (quasi-bayesian) [note: dense regions are not planted, they are structural]

● Same phenomenology also in with arbitrary number of synaptic states

– capacity saturates fast with number of states → high-precision is not needed (with the right algorithms)

● Same phenomenology also in deeper networks with more realistic datasets

C. Baldassi, A. Ingrosso, C. Lucibello, L. Saglietti, R. Zecchina, PRL, 2015C. Baldassi, F. Gerace, C. Lucibello, L. Saglietti, R. Zecchina, Phys. Rev. E, 2016

energy

dense regionIsolated optima(much more numerous than depicted here)

energylocal entropy

large d

energylocal entropy

intermediate d

energylocal entropy

small d

energylocal entropy

large to small d = "focusing" or "scoping" process

Exploring loc.entropy by interacting replicas● Assume that the parameter y is integer and transform the partition

function:

C. Baldassi, C. Borgs, J. Chayes, A. Ingrosso, C. Lucibello, L. Saglietti, R. Zecchina, PNAS, 2016

Local free entropy

reference(“center”) y replicas

interaction

Large deviation measure

Exploring loc.entropy by interacting replicas● Assume that the parameter y is integer and transform the partition

function: simple recipe to extend existing algorithms


reference

y replicas

interaction

original factor graph

We call this the Robust Ensemble

(RE)

Replicated variants of existing algorithms

● Replicated Simulated Annealing → doesn't get stuck → exponential speed-up w.r.t. non-replicated version

● Replicated Belief Propagation → efficient solver + semi-analytical tool for studying dense states + link with reinforcement heuristic

● Replicated Stochastic Gradient Descent (SGD) → dramatic improvement in capacity and speed w.r.t. non-replicated SGD

● (All of these were tested on 2-layer networks: same scenario as the 1-layer perceptron)


Deep neural networks

● Replicated SGD very deeply related to EASGD (Zhang et al. 2015) (11-layers, convolutional architecture, real-life images [ImageNet])

● Entropy-SGD (2016): use Langevin dynamics to estimate local entropy, with excellent results (convolutional networks and recurrent networks, realistic inputs)

● Parallel Local Entropy (Parle) (2017): state-of-the-art results on an array of tasks, but ~3x faster, distributed learning

● Theoretical and numerical evidence from other research groups: minima are not all the same, wide minima (dense states) generalize better

S. Zhang, A. Choromanska, Y. LeCun, NIPS 2015P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, R. Zecchina, ICLR 2016

P. Chaudhari, C. Baldassi, R. Zecchina, S. Soatto, A. Talwalkar, A. Oberman (arXiv 1707.00424), SysML 2018

Quantum annealing

● Quantum annealing strategy: use quantum fluctuations (rather than thermal fluctuations) to overcome energetic barriers

– Classical energy function + quantum perturbation, slowly send the perturbation to zero

● So far: unclear if "true" QA really helps, compared to standard annealing, in any relevant concrete scenario

classicalpart

transverse field(send Γ to 0)

Suzuki-Trotter transformation

● Partition function transformation → "effective" replicated classical Hamiltonian (with infinite replicas, y→∞)

replicatedclassical

partinteraction

Γ→0 ⟺ γ→∞

● Can be simulated with MCMC (finite y) → Quantum Simulated Annealing (QSA)

Quantum annealing vs Robust ensemble

● Effective Hamiltonian after Suzuki-Trotter transformation: very similar to the robust ensemble description...

originalfactor graph

QA

RE

C. Baldassi, R. Zecchina, PNAS 2018 (in press, arXiv:1706.08470)

QSA on binary neural networks study● Analytical calculations + numerical experiments + comparison with

true QA in small instances

● Ends up in the dense states (exponential speed-up w.r.t. thermal annealing – a physical device would work in ~O(1)...)

● QA lowers kinetic energy by delocalizing → favors dense regions


(DWave-like)

Geometric structure is essential


Stochastic synapses (overview)

● Consider a network with binary stochastic synapses, each controlled by a single parameter (a magnetization)

● At each presentation of a pattern, the actual values of the synapses are extracted at random

● Maximum likelihood approach (gradient descent, simulated annealing...)

● Natural way to enforce robustness: the network must try to get in the middle of a dense region where almost all solutions are good.

● …and indeed it does → ends up in high local-entropy states

[This is no accident, the intuition is corroborated by a formal similarity with the RE setting]

C. Baldassi, F. Gerace, H.J. Kappen, C. Lucibello, L. Saglietti, E. Tartaglione, R. Zecchina, arXiv:1710.09825, 2017

Stochastic synapses


Standard:

Bayes:

Our approach:

This is the log-likelihood of a stochastic network where for each pattern (x,y) we sample the synapses W.

More complicated then Standard, less then Bayes. We get some of the goodies of the Bayesian approach (encoding structural priors, implicit robustness) while in the end we solve the Standard problem.

It can be understood as a relaxation of the standard problem. Near the end of the training we have Now we can use SGD to train binary networks!

Stochastic binary perceptron


For large N the log-likelihood becomes

Smoothed-out landscape: we can do Gradient Descent on [or any other simple algorithm, actually]

Eventually, we can get to a polarized configuration which is in the middle of a dense region (verified analytically and numerically).

Stochastic binary multi-layer networks

C. Baldassi, C.Borgs, J. Cheyes, F. Gerace, C. Lucibello, L. Saglietti, E. Tartaglione, R. Zecchina, in preparation, 2018

● The procedure can be applied to more complex architectures (e.g. 3 or more hidden layers).

● However, in order to be efficient, we must introduce an approximation (uncorrelated neurons) which can be quite crude.

– In cases where we can keep the correlations under control, the results are indeed very good (work in progress…)

● Or we could estimate the loglikelihood by sampling (also WIP...)

Conclusions, future directions

● A wide family of stochastic processes is attracted to out-of-equilibrium states with peculiar (and useful) geometric properties in the space of configurations. Can we extend and generalize these results?

● Learning with low precision synapses can be made extremely simple, and the performances are very good: can we improve existing networks/design new neural hardware?

● Accessible states in other problems – by exploring the robust ensemble we can find dense regions as if they were typical

● Applications in inference?

● We want to address unsupervised learning (automatic feature extraction) as well [we have preliminary results, on hold due to time constraints...]

Thanks

R. ZecchinaC. Baldassi

F. GeraceA. IngrossoC. LucibelloL. SagliettiE. Tartaglione

C. BorgsJ. Chayes

H.J. Kappen

P. ChaudhariS. Soatto

Documents

Synaptic [classical and quantum] fluctuations as a recipe for …cnls.lanl.gov/external/piml/Carlo Balsassi.pdf · 2018-04-23 · Synaptic [classical and quantum] fluctuations as