
Fast, Energy-Efficient, Robust, and Reproducible Mixed-Signal Neuromorphic Classifier Based on Embedded NOR Flash Memory Technology

X. Guo¹*, F. Merrikh Bayat¹*, M. Bavandpour¹, M. Klachko¹, M. R. Mahmoodi¹, M. Prezioso¹, K. K. Likharev²#, and D. B. Strukov¹&

¹UC Santa Barbara, Santa Barbara, CA 93106-9560, U.S.A.; ²Stony Brook University, Stony Brook, NY 11794-3800, U.S.A. *These authors contributed equally to this work. #[email protected], &[email protected]

Abstract---We have designed, fabricated, and tested a prototype mixed-signal, 28×28-binary-input, 10-output, 3-layer neuromorphic network based on embedded nonvolatile floating-gate cell arrays redesigned from a commercial 180-nm NOR flash memory. Each array performs a very fast and energy-efficient analog vector-by-matrix multiplication, which is the bottleneck for signal propagation in neuromorphic networks. All functional components of the prototype circuit, including 2 synaptic arrays with 101,780 floating-gate synaptic cells, 74 analog neurons, and the peripheral circuitry for weight adjustment and I/O operations, have a total area below 1 mm². Its testing on the MNIST benchmark set has shown a classification fidelity of 94.65%, close to the 96.2% obtained in simulation. The classification of one pattern takes <1 μs and ~20 nJ of energy; both numbers are >10³× better than those of the 28-nm IBM TrueNorth digital chip for the same task at a similar fidelity. Estimates show that this performance may be further improved using a better neuron design and a more advanced memory technology, leading to a >10²× advantage in speed and a >10⁴× advantage in energy efficiency over state-of-the-art purely digital circuits for the classification of large, complex patterns. Experimental results for the chip-to-chip statistics, long-term drift, and temperature sensitivity show no evident showstoppers on the way toward practical deep neuromorphic networks with unprecedented performance.

I. INTRODUCTION

The idea of using nonvolatile floating-gate memory cells in analog neuromorphic networks has been around for almost 30 years [1]. Fig. 1a shows the concept of fast and energy-efficient analog vector-by-matrix multiplication (VMM), the bottleneck for signal propagation in neuromorphic networks. Until recently, implementations of such circuits were based on “synaptic transistors” [2-4], which can be fabricated using standard CMOS technology. While sophisticated, energy-efficient systems have been demonstrated [3, 4] using this approach, synaptic transistors have relatively large areas (~10³ F², where F is the minimum feature size [4]), leading to larger time delays and energy consumption.
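In the notation of Fig. 1a, each array computes the vector-by-matrix product

    I_j = \sum_{i=1}^{N} G_{ij} V_i ,

where the V_i are the input voltages, the G_ij are the effective conductances set by the stored floating-gate charges, and the I_j are the currents collected on the array's output lines (the per-column indexing j is ours, for clarity).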

Fortunately, by now the nonvolatile floating-gate memory cells have been highly optimized and scaled down all the way to F ~ 20 nm, and may be embedded into CMOS integrated circuits [5]. These cells are quite suitable to serve as adjustable synapses in neuromorphic networks, provided that the memory arrays are redesigned to allow for individual, precise adjustment of the memory state of each device. Recently, such a modification was performed for the 180-nm ESF1 [6, 7] (Fig. 2) and the 55-nm ESF3 [8] embedded commercial NOR flash memory technologies of SST Inc. [7], with good prospects for scaling down to at least F = 28 nm. Though such modification nearly triples the cell area, it is still at least 10× smaller, in terms of F², than that of synaptic transistors [4]. In these terms, the modified cell area is also comparable to that of the 1T1R cells of emerging nonvolatile (e.g., memristive) technologies [9]. Moreover, due to the internal gain of their cells, floating-gate arrays for the analog VMM (Fig. 1a) have an additional advantage over memristive arrays (Fig. 1b) in terms of the necessary gain and energy consumption of the peripheral circuitry.

The main result reported in this paper is an experimental demonstration of a reproducible, stable, and robust neuromorphic network, based on redesigned floating-gate memory arrays, which can perform high-fidelity classification of images of the standard MNIST benchmark, with record-breaking speed and energy efficiency.

II. NETWORK DESIGN

Our design uses the energy-saving gate coupling [1, 4] of the peripheral and array cells, which works well in the subthreshold mode, with a nearly exponential dependence of the memory cell's drain current I_DS on the gate voltage V_GS (Fig. 3). The sub-1% current fluctuations across a ~10⁵× dynamic range of subthreshold operation (Figs. 3, 4a,c) and the low variations of the program and erase voltages (Fig. 5) enable precise VMM operation of such gate-coupled arrays (Fig. 4b). (Other features of the modified ESF1/3 cell arrays, including their long-term analog retention and fast weight tuning with ~0.3% accuracy, were reported in Refs. [6-8].)
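To make the gate-coupling idea concrete, here is a minimal derivation, assuming ideal exponential subthreshold behavior and a common slope factor n for the peripheral and array cells (a simplification; in the real cells the slope depends on the memory state, as shown by the β data in the inset of Fig. 3):

    I_{DS} \approx I_0 \exp\!\left(\frac{V_{GS}}{n V_T}\right), \qquad V_T = kT/q \approx 26\ \mathrm{mV}.

A diode-connected peripheral cell converts its input current into a gate voltage, V_{GS} = n V_T \ln(I_{in}/I_0^{(p)}); the array cell sharing that gate line then outputs

    I_{out} = I_0^{(a)} \exp\!\left(\frac{V_{GS}}{n V_T}\right) = \frac{I_0^{(a)}}{I_0^{(p)}}\, I_{in} \equiv w\, I_{in},

so the weight w is set by the ratio of the two cells' memory states and, to first order, the temperature-dependent factor n V_T cancels, consistent with the linear transfer functions of Fig. 4b.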

The implemented neuromorphic network is a 3-layer (one-hidden-layer) multilayer perceptron (MLP) with 784 inputs, representing the 28×28 black-and-white pixels of the input patterns, 64 hidden-layer neurons with a rectified-tanh activation function, and 10 output neurons (Fig. 6). The differential synaptic coupling between the neuron layers is provided by two crossbar arrays with a total of 2×[(28×28+1)×64 + (64+1)×10] = 101,780 floating-gate memory cells.
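A minimal NumPy sketch of the network's functional model may help fix the dimensions (the function names and the exact rectified-tanh form are our illustrative assumptions, not taken from the chip):

    import numpy as np

    def rectified_tanh(x):
        # Assumed form of the hidden-layer activation: tanh, clipped at zero.
        return np.maximum(0.0, np.tanh(x))

    def classify(x, W1, W2):
        # x  : flattened 28x28 binary pattern, shape (784,)
        # W1 : shape (785, 64); the last row acts as the bias input
        # W2 : shape (65, 10); the last row acts as the bias input
        # On chip, each weight is realized as the difference of two
        # floating-gate cell currents, which doubles the cell count.
        h = rectified_tanh(np.append(x, 1.0) @ W1)  # 64 hidden neurons
        y = np.append(h, 1.0) @ W2                  # 10 linear outputs
        return int(np.argmax(y))

    # Cell-count check from the text: two cells per differential weight.
    assert 2 * ((28 * 28 + 1) * 64 + (64 + 1) * 10) == 101780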

The proper synaptic weights were calculated in an external computer using the standard error backpropagation algorithm with a specific cost function, maximizing the voltage difference between the correct and the second-largest network output. Such training, and a differential gate-coupled design, enable very low sensitivity to weight tuning errors, time drift, and temperature changes, which was confirmed by both simulations (Fig. 7) and testing (Figs. 10-13). The weights were “imported” into the circuit, i.e. used to tune the memory state of each array cell to the proper analog value, by utilizing an on-chip peripheral circuitry. In the first experiments reported here, only ~30% of the cells were tuned (Fig. 8) and the weight import accuracy for a single cell tuning was limited to 5%, to decrease the import time.
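The text states the cost function's goal but not its exact form; one plausible margin-style realization (a hypothetical sketch, with an arbitrary unit margin) would be:

    import numpy as np

    def margin_cost(outputs, label, margin=1.0):
        # Penalize a small gap between the correct output voltage and the
        # runner-up; minimizing this cost maximizes the difference between
        # the correct and the second-largest network output.
        correct = outputs[label]
        runner_up = np.max(np.delete(outputs, label))
        return max(0.0, margin - (correct - runner_up))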

The input pattern bits are shifted serially into a 785-bit register before each classification; to start the classification, the bits are read out in parallel into the network. The mixed-signal VMM in the first crossbar array is implemented by applying the input voltages (either 4.2 V or 0 V) directly to the gates of the array cell transistors (Fig. 6b). The resulting output currents are passed, through differential amplifiers, to the activation function circuit. The fully analog VMM in the second array is implemented using the gate-coupled approach [1, 4] (Fig. 6c). To minimize the error due to the subthreshold slope's dependence on the memory state (see the inset of Fig. 3), we have used higher gate voltages (1.1 V to 2.7 V), limited only by technology restrictions. The 10 voltage outputs of the network have been measured externally.

III. EXPERIMENTAL RESULTS

We tested three chips with the same classifier network. As Fig. 9 shows, the average tuning accuracy for the tested chips was 4.4%, 5.6%, and 3.6%. Some of the already tuned cells were disturbed beyond the target accuracy during the subsequent weight import, because of the half-select disturb effect and noise. At this preliminary testing stage, these cells were not re-tuned, because even for such a rather crude weight import, the experimentally measured classification fidelity on the MNIST benchmark test set (94.7%, 94.1%, and 94.2%, respectively, for the three tested chips) is already remarkably close to the simulated value (96.2%) for the same network (Fig. 10), and not too far from the maximum fidelity (97.7%) of a perceptron of this size. (We have also obtained preliminary high-fidelity results for the CIFAR-10 set, using our chip in the time-division-multiplexing (TDM) mode in the first layer of the convolutional network described in Ref. [10]; see Fig. 15.)

As Fig. 11 shows, the cell conductances slightly decreased (by 13% on average) over a 7-month storage period. Though this drift had a noticeable impact on the output voltages of the circuit (Fig. 12), the classification fidelity remained unchanged at 94.7%. Very encouragingly, the classification performance does not suffer at temperatures elevated up to 130°C, even without additional efforts to make the circuit temperature-insensitive (Fig. 13).

However, the most important experimental result is that the already mentioned high fidelity of the MNIST set classification has been achieved at an ultralow (sub-20-nJ) energy consumption per average classified pattern (Fig. 14a), and an average classification time below 1 μs (Fig. 14b).
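The sub-20-nJ figure can be cross-checked against the static supply currents and the delay reported in Fig. 14 (a back-of-the-envelope estimate using the figure's data):

    E \approx 2\,\Delta t \left( V_{DS} \langle I_{DS} \rangle + V_{DD} I_{DD} \right) \approx 2 \times 0.45\ \mu\mathrm{s} \times (1.05\ \mathrm{V} \times 2.9\ \mathrm{mA} + 2.7\ \mathrm{V} \times 5.65\ \mathrm{mA}) \approx 16\ \mathrm{nJ},

where the first term is the static dissipation of the memory cell arrays and the second that of the neurons.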

IV. DISCUSSION AND SUMMARY

The achieved speed and energy efficiency are much better than those demonstrated, for the same task, by any digital network we are aware of. In particular, both numbers are >10³× better than those obtained with the 28-nm IBM TrueNorth digital chip [11]. There are still considerable untapped reserves in our design. For example, current mirrors in the neuron circuits may strongly decrease their (currently dominant) contribution to latency and energy. Our estimates (Fig. 16) show that using such mirrors, and the more advanced 55-nm ESF3 memory technology of the same company [7, 10], may enable at least a ~100× advantage in operation speed, and an enormous, >10⁴× advantage in energy efficiency, over state-of-the-art purely digital (GPU and custom) circuits [12], for the classification of large, complex patterns using deep-learning networks; see, e.g., Ref. [13].
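As a sanity check of the Fig. 16 estimates (our inference from the quoted numbers, not a claim made in the text): with the 3,025-step time-division multiplexing, the ~1×10⁻⁴ s per-pattern figure for the ESF1 implementation implies

    t_{step} \approx \frac{1 \times 10^{-4}\ \mathrm{s}}{3025} \approx 33\ \mathrm{ns}

per multiplexed VMM step, roughly an order of magnitude below the ~0.45-μs delay of the present prototype (Fig. 14b), consistent with the expected gains from faster, current-mirror-based neurons.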

ACKNOWLEDGMENT

This work was supported by DARPA’s UPSIDE program (HR0011-13-C-0051UPSIDE) via BAE Systems, and a Google Faculty award.

REFERENCES
[1] C. Mead, Analog VLSI and Neural Systems, Addison-Wesley, 1989.
[2] C. Diorio, P. Hasler, A. Minch, and C. A. Mead, “A single-transistor silicon synapse”, IEEE Trans. Elec. Dev., vol. 43, pp. 1972-1980, 1996.
[3] S. Chakrabartty and G. Cauwenberghs, “Sub-microwatt analog VLSI trainable pattern classifier”, IEEE JSSC, vol. 42, pp. 1169-1179, 2007.
[4] J. Hasler and H. Marr, “Finding a roadmap to achieve large neuromorphic hardware systems”, Front. Neurosci., vol. 7, art. 118, 2013.
[5] “Superflash Technology Overview”, SST, Inc.; available online at www.sst.com/technology/sst-superflash-technology.
[6] F. Merrikh Bayat et al., “Redesigning commercial floating-gate memory for analog computing applications”, in: Proc. ISCAS'15, Lisbon, Portugal, May 2015, pp. 1921-1924.
[7] F. Merrikh Bayat et al., “Model-based high-precision tuning of NOR flash memory cells for analog computing applications”, in: Proc. DRC'16, Newark, DE, June 2016.
[8] X. Guo et al., “Temperature-insensitive analog vector-by-matrix multiplier based on 55 nm NOR flash memory cells”, in: Proc. CICC'17, Austin, TX, Apr.-May 2017.
[9] D. B. Strukov and H. Kohlstedt, “Resistive switching phenomena in thin films: Materials, devices, and applications”, MRS Bulletin, vol. 37, pp. 108-114, 2012.
[10] E. H. Lee and S. S. Wong, “A 2.5 GHz 7.7 TOPS/W switched-capacitor matrix multiplier with co-designed local memory in 40 nm”, in: Proc. ISSCC'16, San Francisco, CA, Feb. 2016, pp. 418-420.
[11] S. K. Esser et al., “Backpropagation for energy-efficient neuromorphic computing”, in: Proc. NIPS'15, Montreal, Canada, Dec. 2015, pp. 1117-1125.
[12] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks”, in: Proc. ISSCC'16, 2016, pp. 262-263.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks”, in: Proc. NIPS'12, Lake Tahoe, NV, Dec. 2012, pp. 1097-1105.

FIGURE CAPTIONS

Fig. 1. Analog VMM circuits with (a) floating-gate and (b) memristive cells. Note the difference in opamp gain requirements.

Fig. 2. SST's 180-nm ESF1 NOR flash memory: (a) Cross-section of the two-cell “supercell” (schematically); (b) schematics of a 4-supercell fragment of the redesigned array [6, 7], with the gate lines routed vertically to the source lines, to enable single-cell erase; (c) TEM cross-section image of one memory cell; (d) top-view SEM image of four memory supercells.

Fig. 3. Drain current of an ESF1 cell, at VDS = 1 V, as a function of the gate voltage, for several memory states, from fully erased to fully programmed. Inset: the log slope β, measured at IDS = 10 nA, as a function of the memory state (shown as the corresponding gate voltage).

Fig. 4. (a) The relative r.m.s. variation and the full (peak-to-valley) swing of the current for memory cells within a 10×10 array, tuned to different states, measured in the gate-coupled configuration (inset) during a 27+ hour interval. (b) Transfer function IOUT = w·IIN of a gate-coupled VMM cell (inset), tuned to four different states (w = 1, 0.5, 0.33, 0.25). (c) Random telegraph noise of a memory cell at a current of 4 nA.

Fig. 5. Histograms of the (a) erase (source) and (b) program (gate) voltage thresholds for 100 ESF1 memory cells, over 30 runs.

Fig. 6. Implementation of the multilayer perceptron network: (a) Graph representation of the 3-layer network; (b, c) circuit diagrams of 2×2 slices of the (b) 1st and (c) 2nd layers. The hidden-layer neuron consists of a differential summing amplifier and an activation function (rectified-tanh) circuit, while the output-layer neurons do not implement an activation function. (d) The breakdown of the network's active component areas (total: 0.78 mm²), and (e) a micrograph of the fabricated 5 mm × 5 mm multilayer perceptron chip.

Fig. 7. The simulated classification fidelity of the implemented network as a function of (a) the weight import precision, (b) the weight drift, and (c) temperature, all for the set of weights used in the experiment. The weight variations were modeled by adding, to each optimized value, normally distributed fluctuations with: on panel (a), zero mean and the shown standard deviation; on panel (b), the shown mean and a 5% standard deviation. The temperature dependence shown on panel (c) was simulated using the experimental results reported in Ref. [8].

Fig. 8. Histogram of the synaptic weights used in the experiment, 32,139 of them non-zero (measured as the cell currents at VD = 2.7 / 2.7 V, VS = 1.65 / 1.1 V, and VG = 4.2 / 2.7 V in the 1st / 2nd array).

Fig. 9. Synaptic weight import (i.e. cell tuning) statistics: the measured cell currents, at VG = 2.5 V and VDS = 1 V, vs. the target currents (computed at the external network training), with import STD = 5.8%, 6.7%, and 4.1% for chips #1, #2, and #3, respectively. The dashed red lines correspond to perfect tuning.

Fig. 10. Classification fidelity statistics: histograms of the largest output voltages of each class, measured for all 10,000 MNIST test patterns, for the three different chips. The correct outputs (red bars) always dominate. Note the logarithmic vertical scale.

Fig. 11. The freshly tuned weights (cell currents) vs. those measured 7 months later, without re-tuning (mean drift = 13%).

Fig. 12. Histograms of the relative changes of the output voltages for all 10,000 MNIST test set patterns, shown as the normalized difference between the originally measured values and those measured 7 months later, for chip #1.

Fig. 13. Measured classification fidelity as a function of the ambient temperature.

Fig. 14. Physical performance: (a) Histogram of the experimentally measured total currents flowing into the circuit, characterizing the static power consumption of both memory cell arrays (VDS = 1.05 V, ⟨IDS⟩ = 2.9 mA), for all patterns of the MNIST test set; the inset lists the pattern-independent static current of the neurons (VDD = 2.7 V, IDD = 5.65 mA), so that Energy/pattern ≤ 2Δt(VDS⟨IDS⟩ + VDD·IDD) ≈ 20 nJ. (b) The typical signal dynamics after an abrupt turn-on of the voltage shifter's power supply, measured concurrently at the network input, at the output of a sample hidden-layer neuron, and at all network outputs, giving Δt ≈ 0.45 μs. (The actual input voltage is 10× larger.) The oscillatory behavior of the outputs is a result of a suboptimal phase-stability design of the operational amplifiers. Until it has been improved, and the input circuit sped up, only a sub-1-μs average time delay of the network may be claimed, though it is probably closer to 0.5 μs.

Fig. 15. Experimentally measured pre-activation inputs for 10,000 CIFAR-10 test images, for a particular feature map in the first layer of the network described in Ref. [10], corresponding to an 84% top-3 classification fidelity (vs. the 84.8% simulated for the 6-bit weight precision, and the 85.7% for the full precision).

Fig. 16. Speed and energy consumption of signal propagation through the convolutional (dominating) part of a large deep network [13], with ~0.65×10⁶ neurons, at its various implementations. The estimates for floating-gate networks take into account the 55×55 = 3,025-step time-division multiplexing (natural for this particular network), and the experimental values of the subthreshold current slope of the cells (see the inset in Fig. 3). The underlying data (AlexNet [13] single-pattern classification):

                 Digital circuits [12]          Mixed-signal floating-gate circuits (estimates)
                 GPU 28 nm      ASIC 65 nm      ESF1 180 nm     ESF3 55 nm
    time (s)     1.5×10⁻²       2.9×10⁻²        ~1×10⁻⁴         ~6×10⁻⁵
    energy (J)   1.5×10⁻¹       0.8×10⁻²        ~3×10⁻⁷         ~2×10⁻⁷

Additional panel (chip #4): measured vs. simulated output voltages, with ⟨error⟩ = (1/160,000)×Σ|(Vmeas - Vsim)/Vsim| = 0.07.