AVLSI
Architectures for Neuromorphic Computing
Rajit Manohar
Computer Systems Lab, Yale University
http://csl.yale.edu/ · http://avlsi.csl.yale.edu/
The context: microelectronics scaling
• It’s been a great ride…
❖ … but sequential programs don’t speed up each year like they used to in the “good old days.”
• Computation demand is growing!
❖ Massive amounts of data are being collected by cheap, ubiquitous sensors.
❖ ~1.5B smartphones (with cameras) shipped in 2017.*
❖ ~0.75B monthly active users on Instagram in 2017.*
❖ Modern machine learning depends on massive amounts of data.
Data collected by: M. Horowitz, F. Labonte, O. Shacham, K. Olukotun and others; extrapolations by C. Moore
*Kleiner-Perkins 2018 Internet Trends
Parallelism to the rescue?
• Some algorithms just aren’t parallel
❖ “Unfortunately, for most interesting algorithms, […] no architecture is scalable […]” -- Agarwal et al. (CACM 1991)
• But maybe we’re going about this the wrong way…
❖ Physical systems, by their very nature, are massively parallel.
❖ Can we build computing systems inspired by physical ones?
Neuromorphic computing*
• Philosophical motivation
❖ Understand thought, consciousness
• Biological motivation
❖ Understand the brain through engineering
• Computational motivation
❖ Real-time vision, speech, pattern recognition, …
*term coined by Carver Mead
“Neuro” = neural “-morphic” = “having the shape, form, or structure”
Neuromorphics 101
• Basic computation
❖ Weighted input spikes are accumulated on a capacitor
❖ The neuron is implemented as a “threshold detector”
❖ On an output spike, the state of the neuron is reset (with a refractory period)
• ~1,000 to 10,000 synapses per neuron
• Classical approach
❖ Mixed-signal design: analog neuron and synapse circuits, digital asynchronous communication
[Figure: neuron schematic — dendrite, synapse (weight), leak, threshold detector (+/−), reset, axon]
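The accumulate / threshold / reset loop above can be sketched as a discrete-time leaky integrate-and-fire model. This is an illustrative software model of the dynamics, not the analog circuit; the leak factor, threshold, and input values are assumptions, and the refractory period is omitted for brevity:

```python
def lif_step(v, weighted_input, leak=0.9, threshold=1.0, v_reset=0.0):
    """One discrete time step of a leaky integrate-and-fire neuron.

    v              -- membrane potential (state of the "capacitor")
    weighted_input -- sum of weight * spike over all input synapses this step
    Returns the new potential and 1 if an output spike was emitted, else 0.
    """
    v = leak * v + weighted_input     # leaky accumulation of weighted spikes
    if v >= threshold:                # the "threshold detector"
        return v_reset, 1             # reset the state on an output spike
    return v, 0

# Drive the neuron with a constant weighted input and count output spikes.
v, n_spikes = 0.0, 0
for _ in range(100):
    v, s = lif_step(v, weighted_input=0.2)
    n_spikes += s
```

With these illustrative parameters the neuron settles into a regular firing pattern; a mixed-signal implementation realizes the same dynamics with charge on a capacitor and a comparator.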
General purpose neuromorphic systems
• Core components
❖ Set of neurons + synapses from the network being modeled, mapped to hardware
❖ Synapses can be made “superposable”
❖ Routing network handles spike communication between hardware elements
• Time-multiplexing
❖ Common hardware for computation
❖ Per-neuron/per-synapse state
[Figure: tiled array of synapse hardware + neuron hardware blocks connected by a routing network]
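The time-multiplexing idea can be sketched in software: one shared update routine (the common “neuron hardware”) walks over an array of per-neuron state, and spikes leave the core by index, the way an address-event router would carry them. The leak and threshold values are illustrative assumptions:

```python
def core_tick(potentials, inputs, leak=0.9, threshold=1.0):
    """One time step of a time-multiplexed neuron core.

    potentials -- per-neuron membrane state, stored in shared memory
    inputs     -- per-neuron accumulated synaptic input for this step
    Returns the indices of neurons that spiked (the "address events").
    """
    spikes = []
    for i in range(len(potentials)):      # one shared datapath, many neurons
        potentials[i] = leak * potentials[i] + inputs[i]
        if potentials[i] >= threshold:
            potentials[i] = 0.0           # reset the stored state on a spike
            spikes.append(i)              # spike identified only by its address
    return spikes
```

For example, `core_tick([0.95, 0.0, 0.5], [0.2, 0.1, 0.2])` fires only neuron 0; the same arithmetic serves every logical neuron, which is what lets one physical circuit model thousands of them.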
• SyNAPSE project: 2008 to 2015
❖ Analog neurons, digital spikes and routing
❖ Special materials for synapses (oxides, phase change memory, magnetic tunneling junctions, nanotraps)
❖ Embrace uncertainty in manufacturing technology; no compensation for variability
❖ Errors in communication/neuron okay because the system will compensate
[Timeline: GoldenGate (2008) → TrueNorth (2014)]
[Figure: raster plot of neuron index vs. time (ms), comparing chip and simulation]
Fig. 5. The spiking dynamics of the chip exactly match a software simulation when configured with the same parameters and recurrent connectivity. Spikes are plotted as dots for measured chip data and circles for simulation.
distribution of pixel correlations in the digits (60,000 images). After learning these weights, we trained 10 linear classifiers on the outputs of the hidden units using supervised learning. Finally, we tested how well the network classifies digits on out-of-sample data (10,000 images), and achieved 94% accuracy.
To map the RBM onto our neurosynaptic core, we make the following choices: First, we represent the 256 hidden units with our integrate-and-fire neurons. Next, we represent each visible unit using two axons, one for positive (excitatory) connections and the other for negative (inhibitory) connections, accounting for 968 of 1024 axons. Then, we cast the 484×256 real-valued weight matrix into two 484×256 binary matrices, one representing the positive connections (taking the highest 15% of the positive weights), and the other representing the negative connections (taking the lowest 15% of the weights). Finally, the synaptic values and thresholds of each neuron are adjusted to normalize the sum total input in the real-valued case with the sum total input of the binary case.
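The real-to-binary weight casting described above can be sketched as follows. The 15% fractions and matrix shape come from the text; using quantiles as the cutoffs is my reading of “highest/lowest 15%”, and the final per-neuron threshold normalization the paper mentions is omitted here:

```python
import numpy as np

def binarize_weights(w, frac=0.15):
    """Cast a real-valued weight matrix into two binary matrices:
    excitatory connections from the highest `frac` of the positive
    weights, inhibitory connections from the lowest `frac` of all weights."""
    pos_cut = np.quantile(w[w > 0], 1.0 - frac)  # cutoff: top 15% of positive weights
    neg_cut = np.quantile(w, frac)               # cutoff: lowest 15% of all weights
    w_pos = (w >= pos_cut).astype(np.uint8)      # positive (excitatory) axon matrix
    w_neg = (w <= neg_cut).astype(np.uint8)      # negative (inhibitory) axon matrix
    return w_pos, w_neg

rng = np.random.default_rng(0)
w = rng.normal(size=(484, 256))   # stand-in for the learned RBM weights
w_pos, w_neg = binarize_weights(w)
```

Per-neuron thresholds would then be rescaled so the total binary input matches the total real-valued input, as the paper describes.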
Following the example from [6], we are able to implement the RBM using spiking neurons by imposing a global inhibitory rhythm that clocks network dynamics. In the first phase of the rhythm (no inhibition), hidden units accumulate synaptic inputs driven by the pixels, and spike when they detect a relevant feature; these spikes correspond to binary activity of a conventional (non-spiking) RBM in a single update. In the second phase of the rhythm, the strong inhibition resets all membrane potentials to 0. By sending the outputs of the hidden units to the same linear classifier as before (implemented off-chip) we achieve 89% accuracy for out-of-sample data (see Fig. 6 for one trial).
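A minimal sketch of the two-phase rhythm, assuming the whole phase-1 accumulation is collapsed into a single step (the chip integrates spikes over the phase); the shapes follow the mapping above, and the per-unit thresholds are free parameters:

```python
import numpy as np

def rhythm_cycle(pixels, w_pos, w_neg, thresholds):
    """One cycle of the global inhibitory rhythm.

    Phase 1 (no inhibition): hidden units integrate excitatory minus
    inhibitory pixel-driven input and spike if they cross threshold;
    the spike pattern is the binary hidden activity of one RBM update.
    Phase 2 (strong inhibition): all membrane potentials reset to 0.
    """
    v = pixels @ w_pos - pixels @ w_neg   # phase 1: accumulate synaptic input
    hidden = (v >= thresholds)            # spike = "relevant feature detected"
    v = np.zeros_like(v)                  # phase 2: global inhibitory reset
    return hidden.astype(np.uint8), v

# Tiny example: 3 pixels driving 2 hidden units through +/- axons.
pixels = np.array([1.0, 0.0, 1.0])
w_pos = np.array([[1, 0], [0, 1], [1, 0]], dtype=float)
w_neg = np.array([[0, 1], [0, 0], [0, 0]], dtype=float)
hidden, v = rhythm_cycle(pixels, w_pos, w_neg, thresholds=np.array([1.5, 0.5]))
```

The binary `hidden` vector is what the off-chip linear classifier consumes, one vector per rhythm cycle.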
Our simple mapping from real-valued to binary weights shows that the performance of the RBM does not decrease significantly, and suggests that more sophisticated algorithms, such as deep Boltzmann machines, will also perform well in hardware despite binary weights.
V. DISCUSSION
A long-standing goal in the neuromorphic community is to create a compact, modular block that combines neurons, large synaptic fanout, and addressable inputs. Our breakthrough neurosynaptic core, with digital neurons, crossbar synapses, and address-events for communication, is the first of its kind
[Figure: 22 × 22 visible units driving 968 axons × 256 neurons through crossbar synapses, plus 10 label neurons; example input predicted as “3”]
Fig. 6. (left) Pixels represent visible units, which drive spike activity on excitatory (+) and inhibitory (−) axons. (middle) 16 × 16 grid of neurons spike in response to the digit stimulus. Spikes are indicated as black squares, and encode the digit as a set of features. (right) An off-chip linear classifier trained on the features, and the resulting activation. Here, the classifier predicts that 3 is the most likely digit, whereas 6 is the least likely.
to achieve this long-standing goal in working silicon. The key new component of our design is the embedded crossbar array, which allows us to implement synaptic fanout without resorting to off-chip memory that can create an I/O bottleneck. By bypassing this critical bottleneck, it is now possible to build a large on-chip network of neurosynaptic cores, creating an ultra-low power neural fabric that can support a wide array of real-time applications that are one-to-one with software.
Looking forward, to build a human-scale system with 10^14 synapses (distributed across many chips), our next focus is to tackle the formidable but tractable challenges of density, passive power, and active power for inter-core communication.
ACKNOWLEDGMENT
This research was sponsored by DARPA under contract No. HR0011-09-C-0002. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of DARPA or the U.S. Government. Authors thank William Risk and Sue Gilson for project management, Scott Fairbanks, Scott Hall, Kavita Prasad, Ben Hill, Ben Johnson, Rob Karmazin, Carlos Otero, and Jon Tse for physical design, Steven Esser and Greg Corrado for their work on digital neurons, and Andrew Cassidy and Rodrigo Alvarez for developing software for data collection.
REFERENCES
[1] R. Ananthanarayanan, S. K. Esser, H. D. Simon, and D. S. Modha, “The cat is out of the bag: cortical simulations with 10^9 neurons, 10^13 synapses,” in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. New York, NY, USA: ACM, 2009, pp. 63:1–63:12.
[2] C. Mead, Analog VLSI and neural systems. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1989.
[3] R. Manohar, “A case for asynchronous computer architecture,” in Proceedings of the ISCA Workshop on Complexity-Effective Design, 2000.
[4] N. Imam and R. Manohar, “Address-event communication using token-ring mutual exclusion,” in Proceedings of the IEEE International Symposium on Asynchronous Circuits and Systems, 2011.
[5] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, July 2006. [Online]. Available: http://dx.doi.org/10.1126/science.1127647
[6] P. Merolla, T. Ursell, and J. Arthur, “The thermodynamic temperature ofa rhythmic spiking network,” CoRR, vol. abs/1009.5473, 2010.
Long-term collaboration with IBM Research
Current state-of-the-art
• IBM/Cornell “TrueNorth” chip (2014)
❖ ~25 pJ/synaptic operation
❖ 65 mW for 1M neurons, 256M synapses
❖ 28nm technology
❖ QDI + bundled-data asynchronous digital logic
• Intel “Loihi” chip (2018)
❖ ~24 pJ/synaptic operation
❖ Integrated on-chip learning support
❖ Microprocessors for management
❖ 14nm technology
❖ QDI + bundled-data asynchronous digital logic
• Stanford/Yale “Braindrop” (2019)
❖ ~0.4 pJ/effective synaptic operation
❖ Support for “NEF” programming model
❖ 28nm FDSOI
❖ QDI digital logic, synchronous I/O, and analog circuits for neurons and synapses
Challenges: energy-efficiency
• Biological neural systems
❖ ~20 fJ/synaptic operation
• TrueNorth/Loihi
❖ ~20 pJ/synaptic operation
• How do we close the gap?
❖ Many, many proposals (new devices, materials, etc.) for better synapses and neurons
❖ Reality
‣ ~30-50% of power is in spike communication/storage -- Amdahl strikes again!
‣ Best case: reduce to 7-10 pJ, even after overcoming all the technical obstacles!
❖ Many proposals with significantly lower energy reported
‣ … but not for a system, just for small devices/components
Challenges: design and simulation
• ACT open-source language
❖ Integrated representation of
‣ properties
‣ functional model
‣ behavioral description
‣ gates
‣ transistors
‣ geometry
❖ Type system
‣ relates different levels of abstraction
‣ inheritance for extensible modules
• Integrated commercial event-based and custom asynchronous digital simulator
[Figure: tool flow linking design (.act), expanded design, technology mapping, new cell generation, characterizer (.lib), floorplan, placement, routing, asynchronous static timing engine, and layout finishing; files exchanged include .act, .spice, .lef, .rect, .def, .v, .spef, and .gds; a layout editor handles cell layout, with translation to proprietary commands]
Prof. Keshav Pingali
part of the DARPA ERI initiative
[Figure: comparison of two commercial circuit simulators — performance ratio vs. difference in delay (%)]
Challenges: programmability and algorithms
• How do we best utilize this computation model?
❖ … in a general-purpose framework?
• What’s the right “programming language”?
• Current solutions
❖ Use learning/training and artificial neural networks
❖ Use hand-crafted solutions
❖ Time-averaged spike rate is used to represent a value
Number of “spike slots” required so that the receiver’s estimate v̂ of the sender’s value v ∈ [0, 1] satisfies

    max_{v ∈ [0,1]} Pr_{v̂}[ |v − v̂| > ε ] ≤ δ

  ε (bits)   δ=0.05   δ=0.10   δ=0.25
  1          28       20       8
  2          176      126      56
  3          848      592      288
  4          3670     2582     1248
  5          15211    10731    5227

[Figure: sender transmits a value v as spikes; receiver decodes an estimate v̂]
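The table quantifies how expensive rate coding is: the number of spike slots grows roughly quadratically in the resolution. A Monte Carlo sketch, assuming the simplest scheme where each slot carries an independent Bernoulli(v) spike and the receiver decodes the empirical spike rate (the slide’s exact encoding may differ, so these estimates need not reproduce the table; trial count and seed are arbitrary):

```python
import random

def rate_code_error_prob(v, n_slots, eps, trials=20000, seed=1):
    """Estimate Pr[|v - v_hat| > eps] for rate coding: the sender emits a
    spike in each slot with probability v, and the receiver decodes the
    value as v_hat = (spike count) / n_slots."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        count = sum(rng.random() < v for _ in range(n_slots))
        if abs(v - count / n_slots) > eps:
            errors += 1
    return errors / trials

# v = 0.5 is (near) the hardest value to transmit; eps = 0.25 is ~1 bit.
p_few  = rate_code_error_prob(0.5, n_slots=8,  eps=0.25)
p_many = rate_code_error_prob(0.5, n_slots=28, eps=0.25)
```

More slots drive the worst-case error probability down, which is exactly the communication cost the challenge bullets point at.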
Summary
• Neuromorphic systems
❖ Biologically inspired, naturally parallel approach
❖ Various attempts to create programmable platforms
• Biological systems are an existence proof
❖ … we need to better understand how they compute
• Challenges
❖ What are efficient ways to compute in this framework?
❖ How do we reduce the cost of communication and storage?
❖ Is there a different abstraction, beyond simply emulating biology?
Acknowledgments
• Major project collaborators
❖ TrueNorth: Dharmendra Modha, John Arthur, Paul Merolla, …
❖ Braindrop: Kwabena Boahen, Alex Neckar, Sam Fok, Ben Benjamin, …
• Group members
❖ Filipp Akopyan, Nabil Imam, Saber Moradi, Carlos Otero, Jon Tse
• Many, many members of the neuromorphic community
❖ Andreas Andreou, Gert Cauwenberghs, Tobi Delbruck, Shih-Chii Liu, …
• Sponsors
❖ DARPA, ONR, AFRL
‣ Todd Hylton