Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
Synaptic [classical and quantum] fluctuationsas a recipe for robust and efficient
neural network training
R. Zecchina
F. GeraceA. IngrossoC. LucibelloL. SagliettiE. Tartaglione
C. BorgsJ. Chayes
H.J. Kappen
P. ChaudhariS. Soatto
Carlo BaldassiBocconi University, Milano, Italy
Outline (1/4)
?inputs
output
Simple (but surprisingly rich) neuronal models → Stat. Phys. Analysis
More complex neural networks, realistic data → numerical experiments
(replica method)
Outline (2/4)
Learning as an optimization problem
Highly non-convex landscape, many sharp local minima, isolated global minima
Should be hopeless, yet efficient heuristics exist
energy
Outline (3/4)
Large deviation analysis → hidden dense region (high local entropy)
energy
Hidden (eq. analysis ignores it)Dense → robust → generalizes wellAccessible to simple algorithms
Outline (4/4)
High-density states can be targeted by explicitly designed algorithms
But some processes are attracted to them too. In particular:1) Quantum annealing2) Simple gradient on stochastic synapses
energy
Motivation, in brief
● Deep learning models: impressive results (super-human in some cases), very versatile
● Would benefit from: improved theoretical understanding, reduced resources usage, dedicated hardware
● Biological neurons: low-precision, quite noisy
Simplified neuronal models
● Simplest model neuron: perceptron (one layer, discretized time — no dynamics, no Dale's principle...)
?inputs
output
Simplified neuronal models
● Simplest model neuron: perceptron (building block of DNNs)
?inputs
output
Simplified neuronal models
● Simplest model neuron: perceptron (building block of DNNs)
● Learning as an optimization problem: minimize misclassifications
?inputs
output desired output
Simplified neuronal models
● Simplest model neuron: perceptron (building block of DNNs)
● Learning as an optimization problem: minimize misclassifications
● Random i.i.d. patterns, large N → stat. phys. tools → phase transitions (e.g. SAT/UNSAT)
?inputs
output desired output
Equilibrium statistical physics ofneural networks
● Errors ~ energy → Gibbs measure → Stat. Phys. Tools
● SAT/UNSAT phase transition (there is a “critical capacity” αc)
Neural net factor graph(interactions ~ patterns)
Variables(synaptic weights)
Patterns(inputs+des. outputs)
Emin
=0 Emin
>0
Binary (±1) synapses, equilibrium results
● Good capacity: αc = 0.83
● But nasty optimization landscape: typical solutions are isolated, and immersed in exponentially many local minima
● Yet, simple efficient heuristics exists up to αmax ~ 0.75. They find non-isolated solutions, i.e. atypical
● Need to move away from equilibrium analysis
W. Krauth, M. Mézard, J. Phys. France, 1989H. Huang, Y. Kabashima, Phys. Rev. E, 2014
A. Braunstein, R. Zecchina, PRL, 2006C. Baldassi, A. Braunstein, et al, PNAS, 2007
C. Baldassi, J. Stat. Phys., 2009C. Baldassi, A. Braunstein, J. Stat. Mech., 2015
space of configurations
energy landscape
Large deviation analysis: local entropy● Idea: skew the statistical analysis towards non-isolated regions
● Given a configuration, we want to weigh it by how many solutions surround it — local entropy
C. Baldassi, A. Ingrosso, C. Lucibello, L. Saglietti, R. Zecchina, PRL, 2015
Gibbs distribution with a different
energy
Large deviation analysis: Results Summary
● Ultra-dense regions exist at least up to some α ~ 0.77 (good agreement with heuristic solvers)
● Better generalization properties (quasi-bayesian) [note: dense regions are not planted, they are structural]
● Same phenomenology also in with arbitrary number of synaptic states
– capacity saturates fast with number of states → high-precision is not needed (with the right algorithms)
● Same phenomenology also in deeper networks with more realistic datasets
C. Baldassi, A. Ingrosso, C. Lucibello, L. Saglietti, R. Zecchina, PRL, 2015C. Baldassi, F. Gerace, C. Lucibello, L. Saglietti, R. Zecchina, Phys. Rev. E, 2016
energy
dense regionIsolated optima(much more numerous than depicted here)
energylocal entropy
large d
energylocal entropy
intermediate d
energylocal entropy
small d
energylocal entropy
large to small d = "focusing" or "scoping" process
Exploring loc.entropy by interacting replicas● Assume that the parameter y is integer and transform the partition
function:
C. Baldassi, C. Borgs, J. Chayes, A. Ingrosso, C. Lucibello, L. Saglietti, R. Zecchina, PNAS, 2016
Local free entropy
reference(“center”) y replicas
interaction
Large deviation measure
Exploring loc.entropy by interacting replicas● Assume that the parameter y is integer and transform the partition
function: simple recipe to extend existing algorithms
C. Baldassi, C. Borgs, J. Chayes, A. Ingrosso, C. Lucibello, L. Saglietti, R. Zecchina, PNAS, 2016
reference
y replicas
interaction
original factor graph
We call this the Robust Ensemble
(RE)
Replicated variants of existing algorithms
● Replicated Simulated Annealing → doesn't get stuck → exponential speed-up w.r.t. non-replicated version
● Replicated Belief Propagation → efficient solver + semi-analytical tool for studying dense states + link with reinforcement heuristic
● Replicated Stochastic Gradient Descent (SGD) → dramatic improvement in capacity and speed w.r.t. non-replicated SGD
● (All of these were tested on 2-layer networks: same scenario as the 1-layer perceptron)
C. Baldassi, C. Borgs, J. Chayes, A. Ingrosso, C. Lucibello, L. Saglietti, R. Zecchina, PNAS, 2016
Deep neural networks
● Replicated SGD very deeply related to EASGD (Zhang et al. 2015) (11-layers, convolutional architecture, real-life images [ImageNet])
● Entropy-SGD (2016): use Langevin dynamics to estimate local entropy, with excellent results (convolutional networks and recurrent networks, realistic inputs)
● Parallel Local Entropy (Parle) (2017): state-of-the-art results on an array of tasks, but ~3x faster, distributed learning
● Theoretical and numerical evidence from other research groups: minima are not all the same, wide minima (dense states) generalize better
S. Zhang, A. Choromanska, Y. LeCun, NIPS 2015P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, R. Zecchina, ICLR 2016
P. Chaudhari, C. Baldassi, R. Zecchina, S. Soatto, A. Talwalkar, A. Oberman (arXiv 1707.00424), SysML 2018
Quantum annealing
● Quantum annealing strategy: use quantum fluctuations (rather than thermal fluctuations) to overcome energetic barriers
– Classical energy function + quantum perturbation, slowly send the perturbation to zero
● So far: unclear if "true" QA really helps, compared to standard annealing, in any relevant concrete scenario
classicalpart
transverse field(send Γ to 0)
Suzuki-Trotter transformation
● Partition function transformation → "effective" replicated classical Hamiltonian (with infinite replicas, y→∞)
replicatedclassical
partinteraction
Γ→0 ⟺ γ→∞
● Can be simulated with MCMC (finite y) → Quantum Simulated Annealing (QSA)
Quantum annealing vs Robust ensemble
● Effective Hamiltonian after Suzuki-Trotter transformation: very similar to the robust ensemble description...
originalfactor graph
QA
RE
C. Baldassi, R. Zecchina, PNAS 2018 (in press, arXiv:1706.08470)
QSA on binary neural networks study● Analytical calculations + numerical experiments + comparison with
true QA in small instances
● Ends up in the dense states (exponential speed-up w.r.t. thermal annealing – a physical device would work in ~O(1)...)
● QA lowers kinetic energy by delocalizing → favors dense regions
C. Baldassi, R. Zecchina, PNAS 2018 (in press, arXiv:1706.08470)
(DWave-like)
Geometric structure is essential
C. Baldassi, R. Zecchina, PNAS 2018 (in press, arXiv:1706.08470)
Stochastic synapses (overview)
● Consider a network with binary stochastic synapses, each controlled by a single parameter (a magnetization)
● At each presentation of a pattern, the actual values of the synapses are extracted at random
● Maximum likelihood approach (gradient descent, simulated annealing...)
● Natural way to enforce robustness: the network must try to get in the middle of a dense region where almost all solutions are good.
● …and indeed it does → ends up in high local-entropy states
[This is no accident, the intuition is corroborated by a formal similarity with the RE setting]
C. Baldassi, F. Gerace, H.J. Kappen, C. Lucibello, L. Saglietti, E. Tartaglione, R. Zecchina, arXiv:1710.09825, 2017
Stochastic synapses
C. Baldassi, F. Gerace, H.J. Kappen, C. Lucibello, L. Saglietti, E. Tartaglione, R. Zecchina, arXiv:1710.09825, 2017
Standard:
Bayes:
Our approach:
This is the log-likelihood of a stochastic network where for each pattern (x,y) we sample the synapses W.
More complicated then Standard, less then Bayes. We get some of the goodies of the Bayesian approach (encoding structural priors, implicit robustness) while in the end we solve the Standard problem.
It can be understood as a relaxation of the standard problem. Near the end of the training we have Now we can use SGD to train binary networks!
Stochastic binary perceptron
C. Baldassi, F. Gerace, H.J. Kappen, C. Lucibello, L. Saglietti, E. Tartaglione, R. Zecchina, arXiv:1710.09825, 2017
For large N the log-likelihood becomes
Smoothed-out landscape: we can do Gradient Descent on [or any other simple algorithm, actually]
Eventually, we can get to a polarized configuration which is in the middle of a dense region (verified analytically and numerically).
Stochastic binary multi-layer networks
C. Baldassi, C.Borgs, J. Cheyes, F. Gerace, C. Lucibello, L. Saglietti, E. Tartaglione, R. Zecchina, in preparation, 2018
● The procedure can be applied to more complex architectures (e.g. 3 or more hidden layers).
● However, in order to be efficient, we must introduce an approximation (uncorrelated neurons) which can be quite crude.
– In cases where we can keep the correlations under control, the results are indeed very good (work in progress…)
● Or we could estimate the loglikelihood by sampling (also WIP...)
Conclusions, future directions
● A wide family of stochastic processes is attracted to out-of-equilibrium states with peculiar (and useful) geometric properties in the space of configurations. Can we extend and generalize these results?
● Learning with low precision synapses can be made extremely simple, and the performances are very good: can we improve existing networks/design new neural hardware?
● Accessible states in other problems – by exploring the robust ensemble we can find dense regions as if they were typical
● Applications in inference?
● We want to address unsupervised learning (automatic feature extraction) as well [we have preliminary results, on hold due to time constraints...]
Thanks
R. ZecchinaC. Baldassi
F. GeraceA. IngrossoC. LucibelloL. SagliettiE. Tartaglione
C. BorgsJ. Chayes
H.J. Kappen
P. ChaudhariS. Soatto