
Energy-Efficient Time-Domain Vector-by-Matrix Multiplier for Neurocomputing and Beyond

Mohammad Bavandpour, Mohammad Reza Mahmoodi, and Dmitri B. Strukov
UC Santa Barbara, Department of Electrical and Computer Engineering, Santa Barbara, CA, 93106, U.S.A

{mbavandpour,mrmahmoodi,strukov}@ece.ucsb.edu

ABSTRACT

We propose an extremely energy-efficient mixed-signal approach for performing vector-by-matrix multiplication in the time domain. In such an implementation, multi-bit values of the input and output vector elements are represented with time-encoded digital signals, while multi-bit matrix weights are realized with current sources, e.g. transistors biased in the subthreshold regime. With our approach, multipliers can be chained together to implement large-scale circuits completely in the time domain. Multiplier operation does not rely on energy-taxing static currents, which are typical for the peripheral and input/output conversion circuits of conventional mixed-signal implementations. As a case study, we have designed a multilayer perceptron, based on two layers of 10 × 10 four-quadrant vector-by-matrix multipliers, in a 55-nm process with embedded NOR flash memory technology, which allows for compact implementation of adjustable current sources. Our analysis, based on memory cell measurements, shows that at high computing speed the drain-induced barrier lowering is a major factor limiting multiplier precision to ∼6 bit. Post-layout estimates for a conservative 6-bit digital input/output N × N multiplier designed in the 55 nm process, including I/O circuitry for converting between digital and time-domain representations, show ∼7 fJ/Op for N > 200, which can be lowered well below 1 fJ/Op with a more optimal and aggressive design.

1 INTRODUCTION

The need for computing power is steadily increasing across all computing domains, and has been accelerating rapidly in part due to the emergence of machine learning, bioinformatics, and internet-of-things applications. With conventional device technology scaling slowing down and many traditional approaches to improving computing power reaching their limits, the focus is increasingly on heterogeneous computing with application-specific circuits and specialized processors [1, 2].

Vector-by-matrix multiplication (VMM) is one of the most common operations in many computing applications and, therefore, the development of efficient hardware for it is of the utmost importance. (We use the VMM acronym to refer to both vector-by-matrix multiplication and the vector-by-matrix multiplier.) Indeed, VMM is a core computation in virtually any neuromorphic network [3] and many signal processing algorithms [4, 5]. For example, VMM is by far the most critical operation of deep-learning convolutional classifiers [6, 7]. Theoretical studies have shown that sub-8-bit precision is typically sufficient for inference computation [8], which is why the most recent high-performance graphics processors support 8-bit fixed-point arithmetic [9]. Low precision is also adequate for lossy compression algorithms, e.g. those based on the discrete cosine transform, which rely heavily on VMM operations [10]. In

another recent study, dense low-precision VMM accelerators were proposed to dramatically improve the performance and energy efficiency of linear sparse system solvers, which may take months of processing time on conventional modern supercomputers [11]. In such solvers, the solution is first approximated with dense low-precision VMM accelerators and then iteratively refined using traditional digital processors.

The most promising implementations of low- to medium-precision VMMs are arguably based on analog and mixed-signal circuits [12]. In a current-mode implementation, the multi-bit inputs are encoded as analog voltages/currents, or digital voltage pulses [13], which are applied to one set of (e.g., row) electrodes of an array of adjustable-conductance cross-point devices, such as memristors [7, 13] or floating-gate memories [5, 14, 15], while the VMM outputs are represented by the currents flowing into the column electrodes. The main drawback of this approach is its energy-hungry and area-demanding peripheral circuits, which, e.g., rely on large static currents to provide an accurate virtual ground in the memristor implementation.

In principle, the switched-capacitor approach does not have this deficiency, since the computation is performed only by moving charges between capacitors [16]. Unfortunately, the benefits of such potentially extremely energy-efficient computation are largely negated at the interface, since reading the output, or cascading multiple VMMs, requires energy-hungry ADCs and/or analog buffers. A perhaps even more serious drawback, significantly impacting the density (and, as a result, other performance metrics), is the lack of adjustable capacitors. Instead, each cross-point element is typically implemented with a set of fixed binary-weighted capacitors coupled to a number of switches (transistors), which are controlled by the digitally stored weight.

In this paper we propose to perform vector-by-matrix multiplication in the time domain, combining the configurability and high density of the current-mode implementation with the energy efficiency of the switched-capacitor VMM, while avoiding the costly I/O conversion of the latter. Our approach draws inspiration from prior work on time-domain computing [17-21], but differs in several important aspects. The main difference with respect to Refs. [19-21] is that our approach allows for precise four-quadrant VMM using analog inputs and weights. Unlike the work presented in Ref. [17], there is no weight-dependent scaling factor in the time-encoded outputs, making it possible to chain multipliers together to implement functional large-scale circuits completely in the time domain. Inputs were encoded in the duration of digital pulses in recent work [13]. However, that approach shares many of the problems of current-mode VMMs, most importantly the costly I/O conversion circuits. Moreover, due to the resistive synapses considered in Ref. [13], the output line voltage must be pinned for accurate integration, which

arXiv:1711.10673v1 [cs.AR] 29 Nov 2017


Figure 1: The main idea of the time-domain vector-by-matrix multiplier: (a) Circuit and (b) timing diagrams explaining the operation of the single-quadrant multiplier, assuming VOFF = 0 so that the inputs and outputs are conveniently described with the Heaviside function. Note that panel (a) does not show the bias input I_0. (c) Four-quadrant multiplier circuit diagram, showing for clarity only one matrix weight, implemented with four current sources I_i^{++}, I_i^{--}, I_i^{-+}, and I_i^{+-}. (d) Timing diagram showing an example of a positive output of the four-quadrant multiplier. The bottom diagram for f(V_Σ) corresponds to the case-study circuit implementation (Fig. 2).

results in large static power consumption. Finally, in this paper we present preliminary post-layout performance results for a simple though representative circuit. Our estimates are based on the layout of the circuit in a 55 nm process with embedded NOR flash floating-gate memory technology, which is very promising for the proposed time-domain computing.

2 TIME-DOMAIN VECTOR-BY-MATRIX MULTIPLIER

2.1 Single-Quadrant Dot-Product Operation

Let us first focus on implementing the N-element time-domain dot-product (weighted-sum) operation of the form

y = \frac{1}{N w_{max}} \sum_{i=1}^{N} w_i x_i, \quad (1)

with non-negative inputs x_i and output y, and with weights w_i in the range [0, w_max]. Note that, for convenience of chaining multiple operations, the output range matches that of the input due to the normalization.

The i-th input and the dot-product output are time-encoded, respectively, with digital voltages V_i and V_Σ, such that:

V_i = \begin{cases} V_{OFF}, & 0 \le t < t_i \le T \\ V_{ON}, & t_i \le t < 2T \end{cases} \quad (2)

V_Σ = \begin{cases} V_{OFF}, & T \le t < T + t_Σ \le 2T \\ V_{ON}, & T + t_Σ \le t < 3T \end{cases} \quad (3)

Here, the value T − t_i ∝ x_i represents a multi-bit (analog) input, which is defined within the time window [0, T], while the value T − t_Σ ∝ y represents a multi-bit (analog) output observed within the time window [T, 2T]. With such definitions, the maximum values of the inputs and the output are always equal to T, their minimum values are 0, while V_i ≡ V_ON for T ≤ t < 2T and V_Σ ≡ V_ON for 2T ≤ t < 3T.
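As a minimal sketch of this encoding (with an arbitrary window T; the helper names are illustrative, not from the paper), a value in [0, T] maps to a rising-edge time and back:

```python
# Sketch of the time encoding of Eqs. (2)-(3); T and the helper names
# are illustrative, not part of the paper's implementation.
T = 1.0  # encoding window length (arbitrary units)

def encode(x, T=T):
    """Input x in [0, T] -> rising-edge time t_i in [0, T] (Eq. 2)."""
    assert 0.0 <= x <= T
    return T - x  # larger values switch earlier within the window

def decode(edge_time, T=T):
    """Output edge at T + t_sigma in [T, 2T] -> value y = T - t_sigma (Eq. 3)."""
    assert T <= edge_time <= 2 * T
    return 2 * T - edge_time
```

For example, the maximum input x = T corresponds to an edge at t = 0, and an output edge at t = 2T encodes y = 0.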

The time-encoded voltage inputs are applied to the array's row electrodes, which connect to the control inputs of the adjustable current sources, i.e. the gate terminals of transistors (Fig. 1a). The V_OFF input is assumed to turn off the current source (transistor). On the other hand, application of the V_ON voltage initiates a constant current 0 ≤ I_i ≤ I_max, specific to the programmed value of the i-th current source, flowing into the column electrode. This current charges the capacitor C, which is comprised of the capacitance of the column (drain) line and an intentionally added external capacitor. When the capacitor voltage V_C reaches the threshold V_TH, the output of the digital buffer switches from V_OFF to V_ON at time T + t_Σ, which is effectively the time-encoded output of the dot-product operation.

Assuming for simplicity that V_C(t = 0) = V_OFF = 0 and that the currents injected by the current sources depend negligibly on V_C (e.g., achieved by biasing the cross-point transistors in the subthreshold regime), the charging dynamics of the capacitor are given by the dot product of the currents I_i H(t − t_i), where H is the Heaviside function, with their corresponding time intervals t − t_i, i.e.

C V_C(t) = \sum_{i=0}^{N} I_i H(t - t_i)(t - t_i), \quad 0 \le t \le 2T. \quad (4)

Before deriving the expression for the output T − t_Σ, let us note two important conditions imposed by our assumptions on the minimum and maximum values of the output. First, an additional (weight-dependent) bias current I_0 is added to the sum in Eq. 4, with its turn-on time always t_0 = 0, to make the output T − t_Σ equal to 0 for the smallest possible value of the dot product, represented by t_{1,2,...,N} = T and I_{1,2,...,N} = 0. Secondly, the largest possible output T − t_Σ = T, which corresponds to the largest values of the inputs T − t_{1,2,...,N} = T and current sources I_{1,2,...,N} = I_max, is ensured by

I_{max} = \frac{C V_{TH}}{N T}. \quad (5)

Using V_C = V_TH in Eq. 4 and noting that H is always 1 for t_Σ ≥ 0, the relation between the time-encoded output and the inputs is described similarly to Eq. 1 if we assume that the currents I_i are

I_i = \frac{I_{max} C V_{TH} w_i}{2 C V_{TH} w_{max} - I_{max} T \sum_{i=1}^{N} w_i}, \quad i = 1, 2, ... N, \quad (6)



and the bias current is

I_0 = \frac{1}{2} \left( \frac{C V_{TH}}{T} - \sum_{i=1}^{N} I_i \right) \equiv \frac{1}{2} \left( N I_{max} - \sum_{i=1}^{N} I_i \right). \quad (7)
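The consistency of Eqs. (1)-(7) can be checked numerically. The following sketch, with arbitrary illustrative parameter values (not from the paper's implementation), solves Eq. (4) for the switching time and compares the decoded output against Eq. (1):

```python
import numpy as np

# Numerical check of Eqs. (1)-(7); all parameter values are illustrative.
N, T, C, VTH, wmax = 4, 1.0, 1.0, 1.0, 1.0
Imax = C * VTH / (N * T)                          # Eq. (5)

w = np.array([0.2, 0.5, 0.9, 0.0])                # weights in [0, wmax]
x = np.array([0.7, 0.3, 1.0, 0.6]) * T            # inputs in [0, T]
t = T - x                                         # rising-edge times, Eq. (2)

# Weight currents (Eq. 6) and bias current (Eq. 7)
I = Imax * C * VTH * w / (2 * C * VTH * wmax - Imax * T * w.sum())
I0 = 0.5 * (C * VTH / T - I.sum())

# All sources are on for t >= T, so Eq. (4) at V_C = VTH becomes
# C*VTH = I0*(T + ts) + sum_i I_i*(T + ts - t_i); solve for ts:
ts = (C * VTH - I0 * T - (I * (T - t)).sum()) / (I0 + I.sum())

y_time = T - ts                                   # time-encoded output
y_exact = (w * x).sum() / (N * wmax)              # Eq. (1)
```

With these values the time-encoded result matches the ideal dot product to machine precision, confirming that the normalization of Eqs. (6)-(7) removes any weight-dependent scaling.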

2.2 Four-Quadrant Multiplier

The extension of the proposed time-domain dot-product computation to a time-domain VMM is achieved by utilizing an array of current-source elements and performing multiple weighted-sum operations in parallel (Fig. 1a). The implementation of the four-quadrant multiplier, in which inputs, outputs, and weights can be of both polarities, is shown in Figure 1c. In this differential-style implementation, dedicated wires are used for the time-encoded positive and negative inputs/outputs, while each weight is represented by four current sources. The magnitude of the computed output is encoded by the delay, just as discussed for the single-quadrant dot-product operation, while the output's sign is explicitly defined by the corresponding wire of a pair. To multiply an input by a positive weight, I_i^{++} and I_i^{--} are set to the same value according to Eq. 6, with the other current-source pair set to I_i^{+-} = I_i^{-+} = 0; the opposite holds for multiplication by negative weights. Note that with such an implementation, in the most general case, the input/output values are represented by time-encoded voltage pulses on both the positive and negative wires of the pair, with, e.g., t_Σ^+ − t_Σ^- < 0 corresponding to a positive output (Fig. 1d), and a negative (or zero, when t_Σ^+ − t_Σ^- = 0) output otherwise.

The output of the proposed time-domain VMM is always computed within a fixed 2T-long window. Its maximum and minimum values are independent of the particular set of utilized weights and always correspond to T and 2T, which is different from prior proposals [17]. This property of our approach is convenient for implementing large-scale circuits completely in the time domain. For example, VMM outputs can be supplied directly to another VMM block or some other time-domain circuitry, e.g. one implementing Race Logic [22].
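The differential weight mapping described above can be sketched as follows (the helper name is illustrative): a signed weight simply selects which pair of the four current sources is active:

```python
# Illustrative mapping of a signed weight onto the four current sources
# (I++, I--, I+-, I-+) of the differential four-quadrant scheme.
def map_weight(w):
    """Return (I_pp, I_mm, I_pm, I_mp) scale factors for signed weight w."""
    if w >= 0:
        return (w, w, 0.0, 0.0)    # positive weight: straight pair active
    return (0.0, 0.0, -w, -w)      # negative weight: cross pair active
```

The output sign is then read out from which wire of the output pair switches first, i.e. from the sign of t_Σ^- − t_Σ^+.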

3 CASE STUDY: PERCEPTRON NETWORK

3.1 Design Methodology

The design methodology for implementing larger circuits is demonstrated on the example of a specific circuit comprised of two VMMs and a nonlinear ("rectify-linear") function block, which is representative of a multi-layer perceptron network (Fig. 2a, b). Figure 2c shows the gate-level implementation of the VMM and rectify-linear circuits, which is suitable for pipelined operation. Specifically, the thresholding (digital buffer) is implemented with an S-R latch. The rectify-linear functionality is realized with just one AND gate, which takes its inputs from the two latches serving a differential pair. The AND gate generates a voltage pulse of duration t_Σ^- − t_Σ^+ for positive outputs from the first VMM (see, e.g., Fig. 1d) or stays at zero voltage for negative ones.

It is important to note that in the considered implementation of the rectify-linear function, the inputs to the second VMM are encoded not in the rising-edge time of the voltage pulse, i.e. T − t_i, but rather in its duration. Specifically, in this case the input is encoded by a t_i-long voltage pulse in phase I (i.e. the [0, T] time interval), and always a T-long voltage pulse during phase II (the [T, 2T] time

Figure 2: Two-layer perceptron network: (a) Block diagram of the considered circuit with (b) the specific rectify-linear function. (c) Implementation details of the VMM and rectify-linear circuits. Panel (c) shows only the peripheral circuitry required for a four-quadrant dot-product operation, which involves two rows of adjustable current sources (floating-gate transistors) and two S-R latches (highlighted with a yellow background). (d) Timing diagram for pipelined operation (shown schematically). Here τf is the combined propagation delay of the S-R latch and the rectify-linear circuit. The dashed red arrows show the flow of information between the two VMMs. Note that the inputs to the VMMs are always high during phase II.

interval). Such pulse-duration encoding is a more general case of the scheme presented in Section 2.1. Indeed, in our approach, each product term in the dot-product computation is contributed by the total charge injected into an output capacitor by one current source, which in turn is proportional to the total time that the current source is on. In the scheme discussed in Section 2.1, the voltage pulse always ends at 2T, so that encoding in the rising-edge time of a pulse is equivalent to encoding in the pulse duration.

Additional pass gates, one per output line, are controlled by RESET signals (Fig. 2c) and are used to pre-charge the output capacitor before starting a new computation. Controlled by the SET signal, the output OR gate is used to decouple computations in two adjacent VMMs. Specifically, the OR gate and SET signal allow generating phase II's T-long pulses applied to the second VMM and, at the same time, pre-charging and starting a new phase I computation in the first VMM. Using appropriate periodic synchronous SET and RESET signals, pipelined operation with period 2T + τreset is established, where τreset is the time needed to pre-charge the output capacitor (Fig. 2d).
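Functionally, the latch-plus-AND stage computes a ReLU in the time domain: for a differential output edge pair it produces a pulse of duration t_Σ^- − t_Σ^+ when that difference is positive, and no pulse otherwise. A one-line sketch (names illustrative):

```python
# Time-domain rectify-linear sketch: the latch+AND stage yields a pulse
# of duration max(t_minus - t_plus, 0) for each differential edge pair.
def relu_pulse(t_plus, t_minus):
    """Pulse duration fed to the second VMM for a differential edge pair."""
    return max(t_minus - t_plus, 0.0)
```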

Though the slope of the activation function is equal to one in the considered implementation, it can easily be controlled by appropriate scaling of the VMM weights (in either one or both VMMs). Also, because of the strictly positive inputs, in principle only a two-quadrant



Figure 3: Layout of the implemented 10 × 10 × 10 network (labeled layout dimensions: 360 µm × 120 µm). Here, label A denotes the 10 × 20 supercell array, B/C show the column/row program and erase circuitry, D is one "neuron" block, which includes a conservative 0.4 pF MOSCAP output capacitor, an S-R latch based on (W = 120 nm, L = 900 nm) transistors, pass gates, and the rectify-linear/pipelining circuit, and E is the output multiplexer. All clock and control signals are generated externally.

multiplier is needed for the second layer, which is easily implemented by removing all input wires carrying negative values from the four-quadrant design (Fig. 1c).

3.2 Embedded NOR Flash Memory Implementation

We have designed the two-layer perceptron network, based on two 10 × 10 four-quadrant multipliers, in a 55 nm CMOS process with modified embedded ESF3 NOR flash memory technology [23]. In this technology, the erase gate lines in the memory cell matrix were rerouted (Fig. 2c) to enable precise individual tuning of the floating-gate (FG) cells' conductances. Details of the redesigned structure, the static and dynamic I-V characteristics, analog retention, and noise of the floating-gate transistors, as well as the results of high-precision tuning experiments, can be found in Ref. [15].

The network, which features 10 inputs and 10 hidden-layer/output neurons, is implemented with two identical 10 × 20 arrays of supercells, CMOS circuits for the pipelined VMM operation and rectify-linear transfer function as described in the previous subsection, as well as CMOS circuitry for programming and erasure of the FG cells (Fig. 3). During operation, all FG transistors are biased in the subthreshold regime. The input voltages are applied to the control gate lines, while the output currents are supplied by the drain lines. Because the FG transistors are N-type, the output lines are not discharged to ground with the RESET signal, but rather charged to VRESET = VTH + ΔVD, i.e. ΔVD above the threshold voltage VTH of the S-R latch. In this case, the VMM operation is performed by sinking currents via the adjustable current sources based on FG memory cells.

4 DESIGN TRADEOFFS AND PERFORMANCE ESTIMATES

Obviously, understanding the true potential of the proposed time-domain computing would require choosing optimal operating conditions and carefully tuning the CMOS circuit parameters. Furthermore, the optimal solution will differ depending on the specific optimization goals and input constraints, such as the VMM size and precision of operation. Here, we discuss important tradeoffs and factors at play in the optimization process, focusing specifically on

Figure 4: Relative output error shown as (a) a function of the maximum drain current Imax and select gate voltage, assuming a drain voltage variation ΔVD = 0.2 V, and (b) a function of the drain voltage at VSG = 0.8 V. The error in both panels was calculated by tuning cells to different memory states and experimentally measuring the changes in subthreshold drain current. In all cases, VCG = 1.2 V and VS = VEG = 0 V.

the computing precision. We then provide preliminary estimates for area, performance, and energy efficiency.

4.1 Precision

There are a number of factors that may limit the computing precision. The weight precision is affected by the tuning accuracy, drift of the analog memory state, and drain current fluctuations due to intrinsic cell noise. These issues can be further aggravated by variations in ambient temperature. It has previously been shown that, at least in small-scale current-mode VMM circuits based on similar 55-nm flash technology, even without any optimization, all these factors combined may allow up to 8 bits of effective precision for the majority of the weights [15]. We expect that, just as for current-mode VMM circuits, the temperature sensitivity will be improved by the differential design and the utilization of higher drain currents, which is generally desired for an optimal design.

Our analysis shows that for the considered memory technology, the main challenge for VMM precision is the non-negligible dependence of the FG transistor subthreshold currents on the drain voltage (Fig. 4), due to drain-induced barrier lowering (DIBL). To cope with this problem, we minimized the relative output error, Error = |Imax(VRESET) − Imax(VRESET − ΔVD)| / Imax(VRESET). In general, Error depends on the pre-charged drain voltage VRESET, the voltage swing on the drain line ΔVD = VRESET − VTH, the control and select gate voltages, and the maximum drain current utilized for weight encoding. In our initial study, we assumed that VRESET = 0.7



V, which cannot be too small, due to ID ∝ 1 − exp(−VD/VT) in the subthreshold regime (with VS = 0 V), but otherwise has a weak impact on Error. Furthermore, for simplicity, we assumed that ΔVD = 0.2 V, which cannot be too low because of static and short-circuit leakage in CMOS gates (see below), and that VCG = 1.2 V, which is the standard CMOS logic voltage in the 55 nm process. We found that the drain current is especially sensitive to the select gate voltage, with a distinct optimum at VSG ∼ 0.8 V (Fig. 4a). This is apparently due to a shorter effective channel length, and hence more severe DIBL, at higher VSG, and a voltage-divider effect at lower VSG. Furthermore, the drain dependence is smallest at higher drain currents, Imax ∼ 1 µA (Fig. 4a), though it is naturally bounded by the upper limit of subthreshold conduction (Fig. 4b). At such optimal conditions, Error can be less than 2% (Fig. 4a), which ensures at least 5 bits of computing precision.

As with all FG-based analog computing, the majority of process variations, a typical concern for any analog or mixed-signal circuit, can be efficiently compensated by adjusting the currents of the FG devices, provided that such variations can be properly characterized. This, e.g., includes VTH mismatches (which can be up to 20 mV rms for the implemented S-R latch according to our Monte Carlo simulations). The only input-dependent error that cannot be easily compensated is due to variations in the slope of the transfer characteristics of the S-R latch gates. However, this does not seem to be a serious issue for our design, given that the threshold for the drain voltage is always crossed at the same voltage slew rate. Also, note that variations in the subthreshold slope are not important for our circuit because of the digital input voltages.

VMM precision can also be impacted by factors similar to those of the switched-capacitor approach, including leakage via OFF-state FG devices, channel charge injection from the pass transistors during the RESET phase, and capacitive coupling of the drain lines. Fortunately, the dominant coupling, between the D and CG lines, is input-independent and can again be compensated by adjusting the weights.
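As a rough rule of thumb (one common convention, not necessarily the paper's exact definition), a relative output error ε bounds the effective precision to about log2(1/ε) bits:

```python
import math

# Back-of-envelope conversion from relative output error to effective
# bits of precision (a common convention; illustrative only).
def effective_bits(rel_error):
    return math.log2(1.0 / rel_error)
```

Under this convention, the sub-2% error discussed above gives log2(50) ≈ 5.6, consistent with the "at least 5 bits" estimate.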

4.2 Latency, Energy, and Area

For a given VMM size N, the latency (i.e. 2T) is decreased by using a higher Imax and/or reducing C. The obvious limitations to both are the degradation in precision, as discussed in the previous section, and the intrinsic parasitics of the array, most importantly the drain line capacitance Cdrain. Our post-layout analysis shows that for the implemented circuit and optimal Imax, the latency per single bit of computing precision is 2T0 ≤ 1 ns, and roughly 2T0 · 2^p for higher precision p, e.g., ∼100 ns for a 6-bit VMM. These estimates are somewhat pessimistic because of the conservative choice of C ≈ 200 Cdrain = 0.04 pF per input, which ensures a <0.5% voltage drop on the drain line due to capacitive coupling. (Note that in the implemented VMM circuit the intrinsic delay does not scale with array size because of the utilized bootstrapping technique.)
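The quoted latency scaling can be summarized as 2T = 2T0 · 2^p; a trivial sketch with the per-bit figure from the text (the function name is illustrative):

```python
# Latency scaling sketch per the post-layout estimate: 2*T0 <= 1 ns per
# bit of precision, so a p-bit VMM takes roughly 2*T0*2**p (illustrative).
def latency_ns(p, two_T0_ns=1.0):
    """Approximate VMM latency in ns for p bits of precision."""
    return two_T0_ns * 2 ** p
```

For p = 6 this gives 64 ns, of the same order as the ∼100 ns quoted for the 6-bit VMM.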

The energy per operation comprises the dynamic component of charging/discharging the control gate and drain lines as well as the external output capacitors, and the static component, including vdd-to-ground and short-circuit leakage in the digital logic. Naturally, CMOS leakage currents are suppressed exponentially by

Figure 5: (a) Energy and (b) area per operation and their breakdowns for a 6-bit digital-input digital-output time-domain VMM as a function of its size, for the conservative case with C ≈ 200 Cdrain (approaching ∼7 fJ/Op and ∼3.7 µm²/Op at large N). In particular, the bar charts show the contributions of the I/O circuitry for converting between digital and time-domain representations (red columns), the neuron circuitry excluding the capacitor (cyan columns), and the memory cell array and external capacitor (remaining white). Erasure/programming circuitry was not included in the area estimates because it can be shared among multiple VMMs. N is incremented by 50, except for the first bar, which shows data for a 10 × 10 VMM.

increasing drain voltage swing, and can be further reduced by lower-ing CMOS transistor currents (i.e. increasing length to width ratio),while still keeping propagation delay τf negligible as compared toT .�e increase in the drain swing, however, have negative impact onVMM precision (Fig. 4b) and dynamic energy. Determining optimalvalue of ∆VD and by how much CMOS transistor currents can bereduced without negatively impacting precision is important futureresearch. �e estimates, based on the implemented layout (withrather suboptimal value of C), ∆VD = 0.2 V, and N = 10, showthat the total energy is about 5.44 pJ for 10 × 10 VMM, or equiva-lently 38.6 TOps/J, with the static energy contributes roughly 65%of the total budget. �e energy-e�ciency signi�cantly improvesfor larger VMMs, e.g. reaching ∼ 120 TOps/J for N = 100, dueto reduced contribution of static energy component. It becomeseven larger, potentially reaching 150 TOps/J for N = 1000, at whichpoint it is completely dominated by dynamic energy related tocharging/discharging external capacitor.

The area breakdown by circuit component was accurately evaluated from the layout (Fig. 3). Clearly, because of the rather small implemented VMMs, the peripheral circuitry dominates, with one neuron block occupying ∼1.5× larger area than the whole 10 × 20 supercell array. However, for larger and more practical array sizes (e.g., N > 200), the area is completely dominated by the memory array and the external capacitors, which occupy ∼25% and ∼75%, respectively, of the total area for the conservative design.



In some cases, e.g., convolutional layers in deep neural networks, the same matrix of weights (kernels) is utilized repeatedly to perform a large number of multiplications. To increase density, VMM operations are then performed using a time-division multiplexing scheme, which necessitates storing temporary results and, for our approach, performing conversion between digital and time-domain representations. Fortunately, the conversion circuitry for the proposed VMM can be very efficient due to the digital nature of the time-encoded input/output signals. We have designed such circuitry, in which the input conversion is performed with a shared counter and a simple comparator-latch to create a time-modulated pulse, while the pulse-encoded outputs are converted to digital signals with the help of a shared counter and a multi-bit register. Figure 5 summarizes the energy and area for a time-domain multiplier based on the conservative design, in particular showing that the overhead of the I/O conversion circuitry drops quickly and becomes negligible as the VMM size increases.
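The shared-counter conversion scheme described above can be sketched behaviorally as follows (a functional model only; the function names are ours, and clocking and latch details are abstracted away):

```python
def digital_to_pulse(value, p):
    """Input conversion: a shared counter sweeps 0..2**p - 1 while a
    comparator-latch holds the output high for 'value' clock cycles,
    producing a time-modulated pulse."""
    return [1 if t < value else 0 for t in range(2 ** p)]

def pulse_to_digital(pulse):
    """Output conversion: a multi-bit register latches the shared counter
    at the falling edge of the pulse; counting the high cycles is an
    equivalent behavioral description."""
    return sum(pulse)
```

In this model the round trip is exact for any p-bit code, which is what makes the digital-to-time and time-to-digital interfaces of the proposed scheme essentially lossless.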

With a more advanced design, which will require more detailed simulations, the external capacitor can be significantly scaled down or eliminated completely. (For example, the capacitive coupling can be efficiently suppressed by adding dummy input lines and using differential input signaling.) In this case, the latency and energy will be limited by the intrinsic parasitics of the memory cell array and, e.g., can be below 2 ns and 1 fJ per operation, respectively, for a 6-bit 1000 × 1000 VMM. Moreover, the energy and latency are expected to improve with scaling of CMOS technology (decreasing proportionally to the feature size).

The proposed approach compares very favorably with previously reported work, such as a current-based 180-nm FG/CMOS VMM with measured 5,670 GOps/J [14], a current-based 180-nm CMOS 3-bit VMM with estimated 6,390 GOps/J [12], a switched-capacitor 40-nm CMOS 3-bit VMM with measured 7,700 GOps/J [16], a memristive 22-nm 4-bit VMM with estimated 60,000 GOps/J [7], and a ReRAM-based 14-nm 8-bit VMM with estimated 181.8 TOps/J [13]. Note that the most impressive reported energy efficiency numbers do not account for the costly I/O conversion specific to these designs.

5 SUMMARY

We have proposed a novel time-domain approach for performing vector-by-matrix computation and presented a design methodology for implementing larger-scale circuits, based on the proposed multiplier, completely in the time domain. As a case study, we have designed a simple multilayer perceptron network, which involves two layers of 10 × 10 four-quadrant vector-by-matrix multipliers, in a 55-nm process with embedded NOR flash memory technology. In our performance study, we have focused on a detailed characterization of the most important factor limiting the precision, and then discussed the key trade-offs and performance metrics. The post-layout estimates for the conservative design, which also includes the I/O circuitry for converting between digital and time-domain representations, show > 150 TOps/J energy efficiency at > 5-bit computing precision. A much higher energy efficiency, exceeding the POps/J threshold, can potentially be achieved by using a more aggressive CMOS technology and a more advanced design, though this opportunity requires further investigation.

REFERENCES

[1] Norman P. Jouppi, Cliff Young, et al. In-datacenter performance analysis of a tensor processing unit. arXiv preprint arXiv:1704.04760, 2017.

[2] Ikuo Magaki, Moein Khazraee, et al. ASIC clouds: Specializing the datacenter. Proc. ISCA'16, pages 178–190, June 2016.

[3] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521:436–444, 2015.

[4] Manuel de la Guia Solaz and Richard Conway. Razor based programmable truncated multiply and accumulate, energy-reduction for efficient digital signal processing. IEEE TVLSI, 23:189–193, 2014.

[5] Tyson S. Hall, Christopher M. Twigg, et al. Large-scale field-programmable analog arrays for analog signal processing. IEEE TCAS I, 52(11):2298–2307, 2005.

[6] Bert Moons, Roel Uytterhoeven, et al. Envision: A 0.26-to-10 TOps/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28 nm FDSOI. Proc. ISSCC'17, pages 246–257, Feb. 2017.

[7] Miao Hu, John Paul Strachan, et al. Dot-product engine for neuromorphic computing: Programming 1T1M crossbar to accelerate matrix-vector multiplication. Proc. DAC'16, pages 1–6, June 2016.

[8] Itay Hubara, Matthieu Courbariaux, et al. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, Sep. 2016.

[9] NVIDIA. Tesla P40 datasheet. http://images.nvidia.com/content/pdf/tesla/184427-Tesla-P40-Datasheet-NV-Final-Letter-Web.pdf.

[10] A. M. Raid, W. M. Khedr, et al. JPEG image compression using discrete cosine transform - a survey. arXiv preprint arXiv:1405.6147, May 2014.

[11] Isaac Richter, Kamil Pas, et al. Memristive accelerator for extreme scale linear solvers. Proc. GOMACTech'15, Mar. 2015.

[12] Jonathan Binas, Daniel Neil, et al. Precise deep neural network computation on imprecise low-power analog hardware. arXiv preprint arXiv:1606.07786, June 2016.

[13] Matthew J. Marinella, Sapan Agarwal, et al. Multiscale co-design analysis of energy, latency, area, and accuracy of a ReRAM analog neural training accelerator. arXiv preprint arXiv:1707.09952, 2017.

[14] Farnood Merrikh-Bayat, Xinjie Guo, et al. Sub-1-us, sub-20-nJ pattern classification in a mixed-signal circuit based on embedded 180-nm floating-gate memory cell arrays. arXiv preprint arXiv:1610.02091, Oct. 2016.

[15] Xinjie Guo, Farnood M. Bayat, et al. Temperature-insensitive analog vector-by-matrix multiplier based on 55 nm NOR flash memory cells. Proc. CICC'17, Apr. 2017.

[16] Edward H. Lee and S. Simon Wong. A 2.5 GHz 7.7 TOps/W switched-capacitor matrix multiplier with co-designed local memory in 40 nm. Proc. ISSCC'16, pages 418–419, Jan. 2016.

[17] Vishnu Ravinuthula, Vaibhav Garg, et al. Time-mode circuits for analog computation. Int. J. Circuit Theory Appl., 37:631–659, 2009.

[18] Wolfgang Maass and Christopher M. Bishop. Pulsed Neural Networks. MIT Press, 1st edition, 2001.

[19] Takashi Tohara, Haichao Liang, et al. Silicon nanodisk array with a fin field-effect transistor for time-domain weighted sum calculation toward massively parallel spiking neural networks. Applied Physics Express, 9, 2016, art. 034201.

[20] Takashi Morie, Haichao Liang, et al. Spike-based time-domain weighted-sum calculation using nanodevices for low power operation. Proc. IEEE Nano'16, pages 390–392, Aug. 2016.

[21] Daisuke Miyashita, Shouhei Kousai, et al. A neuromorphic chip optimized for deep learning and CMOS technology with time-domain analog and digital mixed-signal processing. IEEE JSSC, in press, 2017.

[22] Advait Madhavan, Timothy Sherwood, and Dmitri Strukov. A 4-mm² 180-nm-CMOS 15-giga-cell-updates-per-second DNA sequence alignment engine based on asynchronous race conditions. Proc. CICC'17, Apr. 2017.

[23] SST. SST's ESF-3 technology. http://www.sst.com/technology/superflash-technology.aspx.
