A Reconfigurable Low Power High Throughput Streaming Architecture for Big Data Processing

Raqibul Hasan, Tarek Taha, and Zahangir Alom
Email: {hasanm1, tarek.taha, alomm1}@udayton.edu
Department of Electrical and Computer Engineering, University of Dayton, Ohio, USA

Abstract—General purpose computing systems are used for a large variety of applications. The extensive support for flexibility in these systems limits their energy efficiency. Given that big data applications are among the main emerging workloads for computing systems, specialized architectures for big data processing are needed to enable low power and high throughput execution. Many big data applications are particularly focused on classification and clustering tasks. In this paper we propose a multicore heterogeneous architecture for big data processing. The system can execute key machine learning algorithms such as deep neural networks, autoencoders, and k-means clustering. Memristor crossbars are utilized to provide low power, high throughput execution of neural networks. The system has both training and recognition (evaluation of new inputs) capabilities. The proposed system can be used for classification, unsupervised clustering, dimensionality reduction, feature extraction, and anomaly detection applications. The system level area and power benefits of the specialized architecture are compared with the NVIDIA Tesla K20 GPGPU. Our experimental evaluations show that the proposed architecture can provide four to six orders of magnitude greater energy efficiency than GPGPUs for big data processing.

Keywords—Low power architecture; memristor crossbars; autoencoder; unsupervised training; big data.

I. INTRODUCTION

With the increasingly large volumes of data being generated in almost all fields, including commerce, science, medicine, security, and social networks, one of the key application drivers of future computing systems is processing these data to decipher patterns within them and help make better decisions. This process is typically described as big data analytics and involves machine learning to analyze and categorize the data.

Two broad categories of machine learning algorithms are supervised and unsupervised learning. In supervised learning, we know exactly what classes of data exist and a sample of data for each class is available (this is described as labeled data). Supervised machine learning algorithms are trained on these labeled datasets, and once trained, the algorithms can sort new unlabeled data into the existing set of classes.

A more challenging, and perhaps more common, form of big data analytics involves unlabeled data. In this case the machine learning algorithms need to examine large volumes of data to uncover hidden patterns, unknown correlations, and other useful information. Some of the most commonly used unsupervised approaches include autoencoders, restricted Boltzmann machines, self-organizing maps, and k-means clustering.

Big data processing using machine learning algorithms has traditionally been run on clusters of GPUs. The major problems with these systems are their high power consumption and the need for higher throughput. Due to the limits of Dennard scaling and Moore’s Law, alternative architectures are gaining importance. Several studies have examined low power, high throughput architectures for machine learning applications. Examples include DaDianNao [1] and PuDianNao [2], both of which accelerate machine learning algorithms through digital circuits. In PuDianNao, neuron synaptic weights are stored in off-chip memory, which requires data to move back and forth between the off-chip memory and the processing chip during training. In DaDianNao, neuron synaptic weights are stored in eDRAM and are later brought into Neural Functional Units for execution. Input data are initially stored in the central eDRAM. This system utilizes dynamic wormhole routing for data transfers.

This paper presents a memristor crossbar based multicore architecture to implement unsupervised training of streaming data. It is a heterogeneous architecture consisting of memristor neural cores, a digital unit for clustering, a RISC core for system configuration and initialization, and an on-chip routing network for data communication. The key benefits of this approach are high throughput at very low power and area consumption. In this paper we utilize memristor crossbars for neural network based system design and examine both supervised and unsupervised learning on the system. The proposed system can be used for classification, unsupervised clustering, dimensionality reduction, feature extraction, and anomaly detection applications.

Memristors [3,4] have received significant attention as a potential building block for neuromorphic systems [5,6]. In these systems, memristors are used in a crossbar structure to efficiently evaluate multiply-add operations in parallel in the analog domain (these are the dominant operations in neural networks). This enables highly dense neuromorphic systems with great computational efficiency [7]. We use memristor crossbars because they provide high synaptic weight density and parallel analog processing at very low energy. Memristor based synapses are 146 times denser than traditional digital SRAM synapses (assuming 8 bit precision in both cases). In this system, processing happens at the physical location of the data, which significantly reduces both data transfer energy and functional unit energy consumption.

The most recent results for memristor based neuromorphic systems can be seen in [12-15]. These works do not examine multicore designs, and system level evaluations are not studied. The works in [12,13] examined only linearly separable problems. The work in [16] examined memristor based neural networks utilizing four memristors per synapse, whereas the proposed system utilizes two memristors per synapse. They propose deploying the memristor based system as a neural accelerator attached to a RISC system, while the proposed system is a standalone embedded processing architecture. Furthermore, the proposed system has on-chip unsupervised learning capability.

Detailed circuit level simulations (in SPICE) of memristor crossbars were carried out to verify the neural operations. Both the programming and the recognition phases of the neural networks were examined. These simulations utilized an accurate memristor device model and take low level circuit details such as wire resistance, capacitance, and driver circuits into consideration. We have examined the power, area, and performance of the proposed multicore system and compared them with a GPU based system. Our results indicate that the memristor based architecture can provide up to six orders of magnitude more energy efficiency than a GPU for the selected benchmarks.

The rest of the paper is organized as follows: Section II presents the overall heterogeneous multicore architecture. Section III describes the memristor device, the neuron circuit design, the autoencoder, and the memristor based autoencoder design. Section IV describes the designs of the heterogeneous cores. Sections V and VI describe the experimental setup and results respectively. Section VII describes related work in the area, and finally Section VIII summarizes our work.

II. SYSTEM OVERVIEW

In this paper we propose an energy efficient multicore heterogeneous architecture to accelerate both supervised and unsupervised machine learning applications. The architecture implements the backpropagation training algorithm, autoencoders, deep neural networks, and the k-means clustering algorithm. These are some of the fundamental algorithms in deep learning and are used for big data processing.

A common approach to unsupervised learning is clustering, which uses dimensionality reduction and feature extraction as preprocessing. Dimensionality reduction allows high dimensional data to be represented using a smaller set of features, and thus allows a simpler clustering algorithm to process an unlabeled dataset. In this paper, we implement the commonly used autoencoder algorithm for dimensionality reduction and the k-means algorithm for clustering. The autoencoder is based on multilayered neural networks and thus is implemented using the memristor based neural processing cores. As the input data can be of high dimensionality (such as a high resolution image), multiple neural cores are needed to implement the autoencoder.

We utilize the back-propagation algorithm for training the autoencoder. Due to the use of the autoencoder, a reduced dimension k-means implementation is sufficient, and hence we utilize a single digital core to implement k-means clustering. Deep neural network applications for supervised training can also be implemented in the proposed system. For deep networks, the autoencoder is used for unsupervised layer-wise pre-training, and supervised fine-tuning is performed on the pre-trained weights utilizing the back-propagation algorithm. The system has both training and inference capabilities. It can be used for classification, unsupervised clustering, dimensionality reduction, feature extraction, and anomaly detection applications.
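
As an illustration of this pipeline, the sketch below pushes an input through an encoder stage and then assigns the reduced feature vector to a cluster by Manhattan distance, mirroring the division of work between the neural cores and the clustering core. This is a minimal software analogue with placeholder sizes and random weights, and a tanh stand-in for the neuron activation; none of the values are taken from the hardware design.

    import numpy as np

    def encode(x, W_enc, b_enc):
        # Autoencoder encoder stage: reduce an input vector to a small feature vector
        # (on the hardware this is a pass through the memristor neural cores).
        return np.tanh(W_enc @ x + b_enc)

    def assign_cluster(feature, centers):
        # k-means assignment on the reduced features (the digital clustering core's job).
        dists = np.abs(centers - feature).sum(axis=1)   # Manhattan distance to each center
        return int(np.argmin(dists))

    # Illustrative sizes only: 784-dimensional input, 20 features, 10 clusters.
    rng = np.random.default_rng(0)
    W_enc, b_enc = 0.01 * rng.standard_normal((20, 784)), np.zeros(20)
    centers = rng.standard_normal((10, 20))
    x = rng.random(784)
    print(assign_cluster(encode(x, W_enc, b_enc), centers))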

Fig. 1. Heterogeneous architecture including the memristor based multicore system as one of the processing components. The proposed multicore system contains several neural cores (NC) and one clustering core connected through a 2-D mesh routing network (R: router).

Fig. 1 shows the overall system architecture, which includes memristor crossbar based neural cores (NC) and a digital unit for clustering. Memristor crossbar neural cores are utilized to provide low power, high throughput execution of neural networks. The cores in the system are connected through an on-chip routing network (R). The neural cores receive their inputs from the main memory holding the training data through the on-chip routers and a buffer between the routing network and main memory, as shown in Fig. 1.

Since the proposed architecture is geared towards machine learning applications (including big data processing), the training data will be used multiple times during the training phase. As a result, access to a high bandwidth memory system is needed, and thus we propose the use of 3D stacked DRAM. Using through-silicon vias (TSVs) reduces memory access time and energy, thereby reducing overall system power. Efficient access to the main memory is provided through a DMA controller that is initialized by a RISC processing core. A single issue pipelined RISC core is used to reduce power consumption.


An on-chip routing network is needed to transfer neuron outputs among cores in a multicore system. In feed-forward neural networks, the outputs of a neuron layer are sent to the following layer after every iteration (as opposed to a spiking network, where outputs are sent only if a neuron fires). This means that the communication between neurons is deterministic, and hence a static routing network can be used for the core-to-core communications. In this study, we assumed a static routing network as this consumes less power than a dynamic routing network. The network is statically time multiplexed between cores for exchanging multiple neuron outputs.

SRAM based static routing is utilized to facilitate re-programmability of the switches. Fig. 2 shows the routing switch design. Note that the switch allows outputs from a core to be routed back into the same core to implement recurrent networks or multi-layer networks where all the neuron layers are within the same core.

Fig. 2. SRAM based static routing switch. Each blue circle in the left part of the figure represents the 8×8 SRAM based switch shown in the middle (assuming an 8-bit network bus).

III. AUTOENCODER AND MEMRISTOR NEURON CIRCUIT

A. Memristor Devices

The memristor device was first theorized in 1971 by Dr. Leon Chua [3]. Memristors are essentially nanoscale variable resistors, whose resistance is modified if a voltage greater than a write threshold voltage (Vth) is applied [17]. Their resistance can therefore be read by applying a voltage below Vth across them without changing the device state. Several research groups have demonstrated memristive behavior using different materials. One such device, composed of layers of HfOx and AlOx [18], has a high on state resistance (RON≈10 kΩ), a very high resistance ratio (ROFF/RON≈1000), and a write threshold voltage of about 1.3 V. Physical memristors can be laid out in a high density grid known as a crossbar. The schematic and layout of a memristor crossbar can be seen in Fig. 3. A memristor in the crossbar structure occupies a 4F2 area (where F is the device feature size). This is 36 times smaller than an SRAM memory cell. A memristor crossbar can perform many multiply-add operations in parallel, and the conductances of multiple memristors in a crossbar can be updated in parallel. Multiply-add operations are the dominant operations in neural networks, and training of neural networks requires iterative updates of the synaptic weights. As a consequence, memristors have great potential as synaptic elements in a neural network based system design.
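
The parallel multiply-add behavior can be summarized numerically: with input voltages driven onto the rows and the columns held at virtual ground, each column current is the dot product of the input vector and that column's conductances. The sketch below is an idealized illustration (it ignores wire resistance, sneak paths, and device non-linearity), with made-up voltage and conductance values.

    import numpy as np

    # Idealized crossbar: rows carry input voltages, columns are virtually grounded,
    # so column current I_j = sum_i V_i * G[i, j] (Ohm's law plus Kirchhoff's current law).
    V = np.array([0.2, -0.1, 0.3])             # row input voltages (V)
    G = np.array([[1.0e-4, 2.0e-5],
                  [5.0e-5, 1.0e-4],
                  [8.0e-5, 3.0e-5]])           # memristor conductances (S), 3 rows x 2 columns
    I = V @ G                                   # every column's dot product evaluated in parallel
    print(I)                                    # column currents (A)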

Fig. 3. (a) Memristor crossbar schematic and (b) memristor crossbar layout.

B. Neuron circuit

Fig. 4 shows the block diagram of a neuron. The neuron performs two types of operations: (1) a dot product of the inputs x1,…,xn and the weights w1,…,wn, and (2) the evaluation of an activation function. The dot product operation is shown in Eq. (1), where i=1,…,n. The activation function of the neuron is shown in Eq. (2). In a multi-layer neural network, a non-linear activation function is desired (e.g. f(x) = tan-1(x)).

Fig. 4. Neuron block diagram.

DPj = Σi=1,…,n xi wij (1)

yj = f(DPj) (2)

In this paper we have utilized memristors as neuron synaptic weights. The circuit in Fig. 5 shows the memristor based neuron circuit design utilized in this paper. In this circuit, each data input is connected to two virtually grounded op-amps (operational amplifiers) through a pair of memristors. For a given row, if the conductance of the memristor connected to the first column (σA+) is higher than the conductance of the memristor connected to the second column (σA−), then the pair of memristors represents a positive synaptic weight. In the inverse situation, when σA+ < σA−, the memristor pair represents a negative synaptic weight. As the op-amps in Fig. 5 virtually ground the two neuron columns, the currents through the first and second columns can be calculated as (AσA+ + ⋯ + βσβ+) and (AσA− + ⋯ + βσβ−) respectively.


Fig. 5. Circuit diagram for a single memristor based neuron.

Assume that

DPj = 4Rf[(AσA+ + ⋯ + βσβ+) − (AσA− + ⋯ + βσβ−)]
    = 4Rf[A(σA+ − σA−) + ⋯ + β(σβ+ − σβ−)]

The output yj of the op-amp connected directly to the second column represents the neuron output. When the power rails of the op-amps, VDD and VSS, are set to 0.5 V and −0.5 V respectively, the neuron circuit implements the activation function h(x) in Eq. (3), where x = 4Rf[A(σA+ − σA−) + ⋯ + β(σβ+ − σβ−)]. This implies that the neuron output can be expressed as h(DPj). Fig. 6 shows that h(x) closely approximates the activation function f(x) = 1/(1+e−x) − 0.5. The values of VDD and VSS are chosen such that no memristor gets a voltage greater than Vth across it during evaluation.

h(x) = x/4 if |x| < 2, and h(x) saturates at ±0.5 otherwise (3)

Fig. 6. Plot of functions f(x) and h(x).
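
To make the mapping from Eqs. (1)-(3) concrete, the sketch below evaluates one neuron from paired conductances: the synaptic weight is the scaled conductance difference, and the op-amp output is clipped at the ±0.5 V rails. The resistance, conductance, and input values are illustrative and are not taken from the SPICE setup.

    import numpy as np

    R_f = 20e3                                      # feedback resistance (illustrative)
    x = np.array([0.3, -0.2, 0.1, 1.0])             # inputs A, B, C and the bias input beta (V)
    g_plus  = np.array([9e-5, 2e-5, 6e-5, 4e-5])    # first-column conductances, sigma+ (S)
    g_minus = np.array([1e-5, 7e-5, 3e-5, 4e-5])    # second-column conductances, sigma- (S)

    DP = 4 * R_f * np.dot(x, g_plus - g_minus)      # Eq. (1) as a scaled column-current difference

    def h(v):
        # Eq. (3): linear region v/4, clipped at the +/-0.5 V op-amp rails
        return np.clip(v / 4.0, -0.5, 0.5)

    y = h(DP)                                       # neuron output, approximates 1/(1+e^-DP) - 0.5
    print(DP, y)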

C. Autoencoder

Several big data applications are particularly focused on classification and clustering tasks. The robustness of such systems depends on how well they can extract important features from the raw data. For big data processing we are interested in a generic feature extraction mechanism that works for a variety of applications. The autoencoder [19] is a popular dimensionality reduction and feature extraction approach that automatically learns features from unlabeled data. In other words, it is an unsupervised approach and no supervision is needed.

The architecture of the autoencoder is similar to a multi-layer neural network, as shown in Fig. 7. An autoencoder tries to learn the function hW,b(x) ≈ x. That is, it tries to learn an approximation to the identity function, such that the output x’ is similar to the input x. By placing constraints on the network, such as limiting the number of hidden units, we can discover useful patterns within the data.

Fig. 7. Two layer network having four inputs, three hidden neurons, and four output neurons.

As a concrete example, suppose the inputs x are the pixel intensity values from a 10×10 image (100 pixels), so n = 100. An autoencoder determining a compressed representation of these data has 100 inputs, 50 neurons in the hidden layer, and 100 output neurons. The network is trained such that the output layer neurons generate the same values as the applied inputs. Since there are only 50 hidden units, the network is forced to learn a compressed representation of the input. This paper utilizes memristor devices to develop novel, extremely low power implementations of on-chip training circuitry for unsupervised training of the autoencoder.
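
A compact software analogue of this 100-50-100 example is shown below: a plain forward pass and reconstruction error, with random placeholder weights and a shifted logistic activation matching the range of h(x). On the hardware, these weights live in the memristor crossbars and the reconstruction error drives the on-chip training circuitry.

    import numpy as np

    rng = np.random.default_rng(1)
    n_in, n_hidden = 100, 50                        # 10x10 image flattened, 50 hidden units
    W1, b1 = 0.05 * rng.standard_normal((n_hidden, n_in)), np.zeros(n_hidden)
    W2, b2 = 0.05 * rng.standard_normal((n_in, n_hidden)), np.zeros(n_in)

    def act(z):
        return 1.0 / (1.0 + np.exp(-z)) - 0.5       # shifted logistic, the function h(x) approximates

    x = rng.random(n_in) - 0.5                      # pixel intensities, centered to the activation range
    hidden = act(W1 @ x + b1)                       # 50-dimensional compressed representation
    x_rec = act(W2 @ hidden + b2)                   # reconstruction of the 100 inputs
    print(np.mean((x - x_rec) ** 2))                # reconstruction error that training drives down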

D. Memristor Based Neural Network Implementation

The structures of a multi-layer neural network and an autoencoder are similar: both can be viewed as feed forward neural networks. The training of the two systems is different. The autoencoder is trained layer by layer. The training of each layer is similar to training a two layer neural network, where a temporarily added second layer tries to learn the inputs applied to the first layer (detailed in the previous subsection). Fig. 7 shows a simple two layer feed forward neural network with four inputs, four outputs, and three hidden layer neurons. Fig. 8 shows a memristor crossbar based circuit that can be used to evaluate the neural network in Fig. 7. There are two memristor crossbars in this circuit, each representing a layer of neurons. Each crossbar utilizes the neuron circuit shown in Fig. 5.

In Fig. 8, the first layer of neurons is implemented using a 5×6 crossbar. The second layer of four neurons is implemented using a 4×8 memristor crossbar, where 3 of the inputs come from the 3 neuron outputs of the first crossbar. One additional input is used for the bias. By applying the inputs to a crossbar, an entire layer of neurons can be processed in parallel within one cycle.


Fig. 8. Schematic of the neural network shown in Fig. 7 for the forward pass.

E. The Training Algorithm

In order to provide proper functionality, a multi-layer neural network needs to be trained using a training algorithm. Back-propagation (BP) [20] and variants of the BP algorithm are widely used for training such networks. The stochastic BP algorithm is used to train the memristor based multi-layer neural network and is described below:

1) Initialize the memristors with high random resistances.

2) For each input pattern x:

i) Apply the input pattern x to the crossbar circuit and evaluate the DPj values and outputs (yj) of all neurons (hidden neurons and output neurons).

ii) For each output layer neuron j, calculate the error, δj, between the neuron output (yj) and the target output (tj):

δj = (tj − yj) (4)

iii) Back propagate the error for each hidden layer neuron j:

δj = Σk δk wk,j (5)

where neuron k is connected to the previous layer neuron j.

iv) Determine the amount, Δw, that each neuron’s synapses should be changed (2η is the learning rate and f is the activation function):

Δwj = 2η × δj × f′(DPj) × x (6)

If the error in the output layer has not converged to a sufficiently small value, go to step 2.
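
The steps above can be written compactly for a two layer network. The sketch below is a plain software rendering of Eqs. (4)-(6) under the hardware activation of Eq. (3); it omits biases and the discretization and pulse based weight programming described in the next subsection, and the sizes and weights are placeholders.

    import numpy as np

    def f(v):  return np.clip(v / 4.0, -0.5, 0.5)            # activation h(x) from Eq. (3)
    def fp(v): return np.where(np.abs(v) < 2.0, 0.25, 0.0)   # its derivative, f'(DP)

    def train_step(x, t, W1, W2, lr=0.01):                   # lr plays the role of 2*eta in Eq. (6)
        dp1 = W1 @ x;  y1 = f(dp1)                            # step 2.i: forward pass, layer 1
        dp2 = W2 @ y1; y2 = f(dp2)                            # step 2.i: forward pass, layer 2
        d2 = t - y2                                           # step 2.ii, Eq. (4): output errors
        d1 = W2.T @ d2                                        # step 2.iii, Eq. (5): hidden errors
        W2 += lr * np.outer(d2 * fp(dp2), y1)                 # step 2.iv, Eq. (6)
        W1 += lr * np.outer(d1 * fp(dp1), x)
        return W1, W2

    rng = np.random.default_rng(4)
    W1, W2 = 0.1 * rng.standard_normal((3, 4)), 0.1 * rng.standard_normal((4, 3))
    W1, W2 = train_step(rng.random(4) - 0.5, np.array([0.2, -0.1, 0.3, 0.0]), W1, W2)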

F. Circuit Implementation of the Training Algorithm

The circuit implementation of the training algorithm for the neural network in Fig. 8 can be broken down into the following major steps:

1. Apply inputs to layer 1 and record the layer 2 neuron output errors.
2. Back-propagate the layer 2 errors through the second layer weights and record the layer 1 errors.
3. Update the synaptic weights.

The circuit implementations of these steps are detailed below:

Step 1: A set of inputs is applied to the layer 1 neurons, and the layer 2 neuron outputs are measured. This process is shown in Fig. 8. Errors are discretized into 8 bit representations (one sign bit and 7 bits for magnitude).

Step 2: The layer 2 errors (δL2,1 and δL2,2) are applied to the layer 2 weights after conversion from digital to analog form, as shown in Fig. 9, to generate the layer 1 errors (δL1,1 to δL1,6). The memristor crossbar in Fig. 9 is the same as the layer 2 crossbar in Fig. 8. Assume that the synaptic weight associated with input i and neuron j (a second layer neuron) is wij = σij+ − σij− for i=1,2,3 and j=1,2,…,4. In the backward phase we want to evaluate

δL1,i = Σj wij δL2,j for i=1,2,3 and j=1,2,…,4
      = Σj (σij+ − σij−) δL2,j
      = Σj σij+ δL2,j − Σj σij− δL2,j (7)

The circuit in Fig. 9 essentially evaluates the same operations as Eq. (7), applying both δL2,j and −δL2,j to the crossbar columns for j=1,2,3. The back propagated errors are discretized using ADCs and are stored in buffers. To reduce the training circuit overhead, we can multiplex the back propagated error generation circuit as shown in Fig. 10. In this circuit, by enabling the appropriate pass transistors, the back propagated errors are sequentially generated and stored in the buffer. Access to the pass transistors is controlled by a shift register.

Fig. 9. Schematic of the neural network shown in Fig. 7 for back propagating errors to layer 1.

Fig. 10. Implementing the back propagation phase by multiplexing the error generation circuit.

Step 3: The weight update procedures for both layers are similar. They take the neuron inputs, errors, and intermediate neuron outputs (DPj) to generate a set of training pulses. The training unit essentially implements Eq. (6). To update a synaptic weight by an amount Δw, the conductance of the memristor connected to the first column of the corresponding neuron is updated by an amount Δw/2 (where Δw/2 = η × δj × f′(DPj) × xi) and the memristor connected to the second column of the corresponding neuron by an amount −Δw/2. We describe the weight update procedure for the first columns of the neurons (odd crossbar columns in Fig. 8). The weight update procedure for the second columns of the neurons (even columns) is similar to the procedure for the first columns, except that the neuron errors (δj) are multiplied by −1.

In Eq. (6) we need to evaluate the derivative of the activation function at the dot product of the neuron inputs and weights (DPj). The DPj value of neuron j is essentially the scaled difference of the currents through the two columns implementing the neuron (see Section III(B)). Applying the inputs to the neuron again, the difference of the column currents is stored in the buffer after conversion into digital form. The derivative of the activation function (f′) is evaluated using a lookup table. A digital multiplier is utilized to multiply the neuron error (δj) and f′(DPj). The value δj × f′(DPj) is converted into analog form and applied to the training pulse generation unit (see Fig. 11(b)). For training, pulses of variable amplitude and variable duration are produced. The amplitude of the training signal is modulated by the neuron input (xi) and is applied to the row wire connecting to the desired memristor (see Fig. 11(a)). The duration of the training pulse is modulated by η × δj × f′(DPj) and is applied to the column wire connecting to the desired memristor (see Fig. 11(b)). The combined effect of the two voltages applied across the memristor updates the conductance by an amount proportional to η × δj × f′(DPj) × xi.
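
A small sketch of the digital side of this step is given below: it computes the target change Δw from Eq. (6) for each synapse of one neuron and splits it as +Δw/2 and −Δw/2 across the memristor pair. The analog pulse amplitudes and durations that actually realize these changes are not modeled, and all numbers are illustrative.

    import numpy as np

    def conductance_updates(x, delta_j, dp_j, eta=0.01):
        # Target per-synapse weight change from Eq. (6): delta_w = 2*eta*delta_j*f'(DP_j)*x_i,
        # realized as +delta_w/2 on the first-column memristor and -delta_w/2 on the second.
        fprime = 0.25 if abs(dp_j) < 2.0 else 0.0      # derivative of h(x) from Eq. (3)
        delta_w = 2.0 * eta * delta_j * fprime * np.asarray(x, dtype=float)
        return delta_w / 2.0, -delta_w / 2.0           # (first column, second column) changes

    dG_plus, dG_minus = conductance_updates(x=[0.3, -0.2, 1.0], delta_j=0.1, dp_j=0.4)
    print(dG_plus, dG_minus)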

Fig. 11. Training pulse generation module. Here the value of Vb is 1.2 V and V∆1 is a triangular wave whose duration determines the learning rate.

IV. HETEROGENEOUS CORES

The proposed system has a RISC core, a digital core for clustering, and memristor neural cores. These cores are connected through an on-chip routing network. The memristor neural cores can be used to implement autoencoders and feed forward neural networks of different configurations.

A. Memristor Neural Core

Fig. 12 shows the architecture of a single memristor based neural core. It consists of a memristor crossbar of size 400×200, input and output buffers, a training unit, and a control unit. Our objective was to use as large a crossbar as possible, because this enables more computations to be done in parallel. Experimenting with different crossbar sizes, we observed that a 400×200 crossbar is affected very little by sneak paths for the memristor device considered (high resistance values). The control unit manages the input and output buffers and interacts with the routing switch associated with the core. The control unit is implemented as a finite state machine and thus has low overhead. Processing in the core is analog, and the entire core processes an input in one cycle. The neural cores communicate with each other in digital form, as it is expensive to exchange and latch analog signals. Neuron outputs are discretized using a three bit ADC for each neuron and are stored in the output buffer for routing. Inputs come to the neural core through the routing network in digital form and are stored in the input buffer. The inputs are converted into analog form and applied to the memristor crossbar.

Fig. 12. Single memristor neural core architecture.

B. Digital Clustering Core

We have designed a digital core for performing k-means clustering on the feature representation output of the autoencoder. The core can be configured to generate up to 32 clusters, and the maximum input dimension is 32 after dimensionality reduction using the autoencoder. This system performs clustering based on Manhattan distance calculations.

Assume that our data dimension is n and the number of clusters is m. For each element of a data sample, the corresponding Manhattan distances to the current m cluster centers are evaluated in parallel and are accumulated in the dist. registers (Fig. 13). If a subtractor output (in the first row of subtractors) is negative, it is subtracted from the corresponding dist. register; otherwise it is added. After n iterations (one for each element of the input sample), the Manhattan distances between the input sample and the current cluster centers are in the dist. registers. In the next m cycles, the minimum distance value and the corresponding cluster center index are evaluated using the circuit shown at the right of the figure.

The system has another set of registers for each cluster center (Fig. 13) to store the accumulated values of the data samples belonging to that cluster. One counter is kept for each cluster to track how many data samples belong to that cluster. The data sample is accumulated in the center accumulator register corresponding to the minimum distance cluster. To minimize hardware costs and processing time, this operation is overlapped with the distance calculation step for the next data pattern in Fig. 13. When all the training data samples have been assigned to clusters based on the minimum distances, the new cluster centers are evaluated by dividing the center accumulator registers by the corresponding sample counter values. After this, the cluster centers are updated and operations for the next epoch are performed based on the new centers.
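
The behavior of the core over one epoch can be summarized in software as below. This is a functional sketch of the accumulate-and-divide flow only; the hardware performs the per-dimension subtract/accumulate serially over n cycles, searches the minimum over m cycles, and overlaps the accumulation with the distance calculation for the next sample. Data sizes here are arbitrary examples.

    import numpy as np

    def kmeans_epoch(data, centers):
        # data: (num_samples, n), centers: (m, n); one epoch of Manhattan-distance k-means.
        m, n = centers.shape
        accum = np.zeros((m, n))                # "center accumulator" registers
        counts = np.zeros(m, dtype=int)         # per-cluster sample counters
        for sample in data:
            dists = np.abs(sample - centers).sum(axis=1)   # dist. registers after n iterations
            k = int(np.argmin(dists))                      # minimum distance index search
            accum[k] += sample
            counts[k] += 1
        nonempty = counts > 0                   # new centers = accumulator / sample counter
        centers[nonempty] = accum[nonempty] / counts[nonempty][:, None]
        return centers

    rng = np.random.default_rng(2)
    new_centers = kmeans_epoch(rng.random((100, 20)), rng.random((8, 20)))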


Fig. 13. Digital clustering core design implementing the k-means clustering algorithm.

V. EXPERIMENTAL SETUP

A. Applications and Datasets

We have examined the MNIST [21] and ISOLET [22] datasets for classification and clustering tasks. The classification applications utilized deep networks where the autoencoder was used for unsupervised layer-wise pre-training of the networks. For the clustering applications, dimensionality reduction and feature extraction are performed using the autoencoder. After that, the k-means clustering algorithm is applied to the data generated by the autoencoder. Autoencoder based anomaly detection was examined on the KDD dataset [23]. Table I shows the neural network configurations for the different datasets and applications.

Table I: Neural network configurations.

Application               Dataset   Configuration
Anomaly detection         KDD       41→15→41
Classification            MNIST     784→300→200→100→10
                          ISOLET    617→2000→1000→500→250→26
Dimensionality reduction  MNIST     784→300→200→100→20
                          ISOLET    617→2000→1000→500→250→20

B. Mapping Neural Networks to Cores

The neural hardware is not able to time multiplex neurons, as their synaptic weights are stored directly within the neural circuits. Hence a neural network’s structure may need to be modified to fit into a neural core. In cases where the networks were significantly smaller than the neural core memory, multiple neural layers were mapped to a core. In this case, the layers executed in a pipelined manner, where the outputs of layer 1 were fed back into layer 2 on the same core through the core’s routing switch.

When a software network layer was too large to fit into a core (either because it needed too many inputs or it had too many neurons), the network layer was split amongst multiple cores. Splitting a layer across multiple cores due to a large number of output neurons is trivial. When there were too many inputs per neuron for a core, each neuron was split into multiple smaller neurons as shown in Fig. 14. When splitting a neuron, the network needs to be trained based on the new network topology. As the network topology is determined prior to training (based on the neural hardware architecture), the split neuron weights are trained correctly.
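
The sketch below illustrates the splitting of Fig. 14 for a neuron whose fan-in exceeds the 400-input limit of a core: the input vector is partitioned into slices handled by sub-neurons on different cores, and a small combining neuron merges their outputs. It is a functional illustration with made-up weights; as noted above, the split topology is what actually gets trained.

    import numpy as np

    def h(v):
        return np.clip(v / 4.0, -0.5, 0.5)                  # hardware activation from Eq. (3)

    def split_neuron(x, sub_weights, combine_weights):
        # Each sub-neuron sees only its slice of the inputs; because every sub-neuron applies
        # the activation, the split network is not numerically identical to one big neuron,
        # which is why training is performed directly on the split topology.
        outputs, start = [], 0
        for w in sub_weights:                                # one weight slice per sub-neuron/core
            outputs.append(h(np.dot(w, x[start:start + len(w)])))
            start += len(w)
        return h(np.dot(combine_weights, outputs))           # combining neuron merges partial results

    rng = np.random.default_rng(3)
    x = rng.random(800) - 0.5                                # 800 inputs, above the 400-input limit
    subs = [0.05 * rng.standard_normal(400), 0.05 * rng.standard_normal(400)]
    print(split_neuron(x, subs, combine_weights=np.ones(2)))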

Fig. 14. Splitting a neuron into multiple smaller neurons.

C. Area Power Calculations

The area, power, and timing of the SRAM array used in the clustering core were calculated using CACTI [24] with the low operating power transistor option. We assumed a 45 nm process in our area and power calculations. Components of a typical cache that would not be needed in the neural core (such as the tag array and tag comparator) were not included in the calculations. The clustering core also requires adders, registers, multiplexers, and a control unit. The power of these basic components was determined through SPICE simulations. A frequency of 200 MHz was assumed for the digital system to keep power consumption low.

For the memristor cores, detailed SPICE simulations were used for power and timing calculations of the analog circuits (drivers, crossbar, and activation function circuits). These simulations considered the wire resistance and capacitance within the crossbar as well. The results show that the crossbar required 20 ns to be evaluated. As the memristor crossbars evaluate all neurons in one step, the majority of the time in these systems is spent transferring neuron outputs between cores through the routing network. We assumed that the routing network runs at a 200 MHz clock, resulting in 4 cycles needed for crossbar processing. The routing link power was calculated using Orion [25] (assuming 8 bits per link). Off-chip I/O energy was also considered as described in Section II. Data transfer energy via TSV was assumed to be 0.05 pJ/bit [26].

VI. RESULTS

A. Supervised Training Result

We have performed detailed simulations of the memristor crossbar based neural network systems. MATLAB (R2014a) and SPICE (LTspice IV) were used to develop a simulation framework for applying the training algorithm to a multi-layer neural network circuit. SPICE was mainly used for detailed analog simulations of the memristor crossbar array, and MATLAB was used to simulate the rest of the system. The circuits were trained by applying input patterns one by one until the errors were below the desired levels.

Simulation of the memristor device used an accurate model of the device published in [27]. The memristor device simulated for this circuit was published in [18], and the switching characteristics for the model are displayed in Fig. 15. This device was chosen for its high minimum resistance value and large resistance ratio. According to the data presented in [18], this device has a minimum resistance of 10 kΩ, a resistance ratio of 10³, and the full resistance range of the device can be switched in 20 μs by applying 2.5 V across the device.

Fig. 15. Simulation results displaying the input voltage and current waveforms for the memristor model [27] that was based on the device in [18]. The following parameter values were used in the model to obtain this result: Vp=1.3V, Vn=1.3V, Ap=5800, An=5800, xp=0.9995, xn=0.9995, αp=3, αn=3, a1=0.002, a2=0.002, b=0.05, x0=0.001.

SPICE simulations of large memristor crossbars are very time consuming. Therefore, we examined the on-chip supervised training approach with the Iris dataset [28], which utilizes crossbars of manageable sizes. A two layer network with four inputs, ten hidden neurons, and one output neuron was utilized. Fig. 16 shows the MATLAB-SPICE training simulation on the Iris dataset. It shows that the neural network was able to learn the desired classifiers.

Fig. 16. Learning curve for classification on the Iris dataset.

B. Unsupervised Training Result

Unsupervised training of the autoencoder was performed on the Iris dataset. The network configuration utilized was 4→2→4. After training, the two hidden neuron outputs give the feature representation of the corresponding input data in a reduced dimension of two. Fig. 17 shows the distribution of the data of the three different classes (setosa, versicolor, virginica) in the feature space. We can observe that data belonging to the same class appear close together in the feature space and could potentially be linearly separated.

Fig. 17. Distribution of the data of different classes in the feature space.

C. Anomaly Detection

We have examined autoencoder based anomaly detection on the KDD dataset. The network configuration for this application was 39→15→39. As the SPICE simulation of this network size takes a very long time, we performed this experiment in MATLAB. The autoencoder was trained using only normal data packets. It is expected that during evaluation, the differences between input and reconstruction will be smaller for normal data packets than for attack packets. The network was trained with only 5292 normal packets (no anomalous packets were used during training).

Figures 18 and 19 show the distributions of the distances between the inputs and the corresponding reconstructions for the normal packets and the attack packets respectively. From Fig. 20 it is seen that the system can detect about 96.6% of the anomalous packets with a 4% false detection rate. Similar approaches could be used for other big data applications where the objective is to detect anomalous patterns in large volumes of continuously streaming data in real time at low power.
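
A software sketch of this detection rule is given below: score each packet by the distance between its features and the autoencoder reconstruction, then flag it if the score exceeds a threshold chosen from the distribution over normal traffic. The distance metric and the percentile used for the threshold are illustrative choices, not values from the experiments above.

    import numpy as np

    def reconstruction_distance(x, encode, decode):
        # Distance between a packet's feature vector and its autoencoder reconstruction.
        return np.linalg.norm(x - decode(encode(x)))

    def is_anomalous(x, encode, decode, threshold):
        # The autoencoder is trained on normal traffic only, so poorly reconstructed packets
        # are flagged; the threshold trades detection rate against false detections.
        return reconstruction_distance(x, encode, decode) > threshold

    # Example threshold selection from held-out normal packets (illustrative percentile):
    # normal_dists = [reconstruction_distance(p, encode, decode) for p in normal_packets]
    # threshold = np.percentile(normal_dists, 96)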

Fig. 18. Distance between original data and reconstructed data for normal packets.

Fig. 19. Distance between original data and reconstructed data for attack packets.

Fig. 20. Anomaly detection rate for the test dataset for different decision parameters.

D. Impact of System Constraints on Application Accuracy

The hardware neural network training circuit differs from software implementations in the following aspects:

- Limited precision of the discretized neuron errors and DPj values.
- Limited precision of the discretized neuron output values.
- Each neuron can have a maximum of 400 synapses.

Fig. 21 compares the accuracies obtained from MATLAB implementations of the applications that consider these hardware system constraints against implementations that do not. It is seen that even with the system constraints enforced, the applications still give competitive performance.


Fig. 21. Impact of memristor system constraints (3 bit neuron output and 8 bit neuron error precision) on application accuracy.

E. Single Core Area and Power

The area of the digital clustering core is 0.039 mm2 and its power consumption is 1.36 mW. The training of 1000 samples for one epoch in this core takes 0.32 μs.

The memristor neural core configuration is 400×100, i.e., it can take a maximum of 400 inputs and can process a maximum of 100 neurons. The area of a single memristor neural core is 0.0163 mm2. Table II shows the power and timing of a single memristor core at different steps of execution. The RISC core is used only for configuring the cores, routers, and DMA engine. As a result, we assume that the RISC core is turned off during the actual training and evaluation phases.

Table II: Memristor core timing and power for different execution steps.

Execution step              Time (us)  Power (mW)
Forward pass (recognition)  0.27       0.794
Backward pass               0.80       0.706
Weight update               1.00       6.513
Control unit                -          0.0004

F. System Level Evaluations

Total system area: The whole multicore system includes 144 memristor neural cores, one digital clustering core, one RISC core, one DMA controller, 4 kB of input buffer, and 1 kB of output buffer. The RISC core area was evaluated using McPAT [36] and came to 0.52 mm2. The total system area was 2.94 mm2.

Energy efficiency: We have compared the throughput and energy benefits of the proposed architecture over an Nvidia Tesla K20 GPU. The GPU consumes 225 W of power and has an area of 561 mm2 in a 28 nm process. Figures 22 and 23 show the throughput and energy efficiencies respectively of the proposed system over the GPU for different applications during training. For training, the proposed architecture provides up to 30× speedup and four to six orders of magnitude more energy efficiency than the GPU.

Fig. 22. Application speedup over GPU for training.

Fig. 23. Energy efficiency of the proposed system over GPU for training.

Figures 24 and 25 show the throughput and energy efficiencies of the proposed system over the GPU for different applications during evaluation of new inputs. For recognition, the proposed architecture provides up to 50× speedup and five to six orders of magnitude more energy efficiency than the GPU.

Fig. 24. Application speedup over GPU for recognition.

Fig. 25. Energy efficiency of the proposed system over GPU for evaluation of a new input.

Table III shows the number of cores used, the time, and the energy for a single training data item in one iteration. Table IV shows the evaluation time and energy for a single test data item. In both the training and recognition phases, the computation energy dominated the total system energy consumption.

Table III: Number of memristor cores used, time, and energy for a single training input in the proposed architecture.

Application     # of memristor cores  Time (us)  Compute energy (J)  IO energy (J)  Total energy (J)
Mnist_class     57                    7.29       4.18E-07            8.48E-09       4.26E-07
Mnist_AE        57                    17.99      8.37E-07            8.57E-09       8.45E-07
Mnist_kmeans    1                     0.42       9.67E-10            4.47E-12       9.71E-10
Isolate_AE      132                   24.41      1.97E-06            2.68E-08       1.99E-06
Isolate_kmeans  1                     0.42       9.67E-10            4.47E-12       9.71E-10
Isolet_class    132                   8.86       9.67E-07            2.67E-08       9.94E-07
KDD_anomaly     1                     4.15       7.33E-09            4.51E-09       1.18E-08
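
As a consistency check between Tables II and III (our own arithmetic using only the figures reported above): one training iteration on a single memristor core costs roughly 0.27 µs × 0.794 mW + 0.80 µs × 0.706 mW + 1.00 µs × 6.513 mW ≈ 7.3 nJ, and with 57 active cores for MNIST classification this gives 57 × 7.3 nJ ≈ 4.2×10⁻⁷ J, in line with the 4.18E-07 J compute energy listed for Mnist_class.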

Table IV: Recognition time and energy for one input in the proposed architecture.

Application     Time (us)  Compute energy (J)  IO energy (J)  Total energy (J)
Mnist_class     0.77       1.42E-08            8.43E-09       2.26E-08
Mnist_AE        0.77       1.42E-08            8.43E-09       2.26E-08
Mnist_kmeans    0.32       8.89E-10            3.69E-12       8.93E-10
Isolate_AE      0.77       3.28E-08            2.67E-08       5.94E-08
Isolate_kmeans  0.32       8.89E-10            3.69E-12       8.93E-10
Isolet_class    0.77       3.28E-08            2.67E-08       5.94E-08
KDD_anomaly     0.77       2.48E-10            4.48E-09       4.73E-09


VII. RELATED WORK

A. Digital Systems

High performance computing platforms can simulate a large number of neurons, but are very expensive from a power consumption standpoint. Specialized architectures [29-32] can significantly reduce the power consumption of neural network applications while still providing high performance.

IBM’s TrueNorth chip consists of 5.4 billion transistors [33]. It has 4,096 neurosynaptic cores interconnected via an intra-chip network that integrates one million programmable spiking neurons and 256 million configurable synapses. The basic building block is a core, a self-contained neural network with 256 input lines (axons) and 256 outputs (neurons) connected via 256×256 directed, programmable synaptic connections. With a 400×240-pixel video input at 30 frames per second, the chip consumes 63 mW. Key differences between this system and ours are that TrueNorth uses spiking neurons, has asynchronous communications, and allows the SRAM arrays to enter ultra low leakage modes. We examine the impact of using leakage-less SRAM arrays in our study.

DaDianNao and PuDianNao [1,2] are accelerators for deep neural networks (DNNs) and convolutional neural networks (CNNs). These systems are based on a fully digital design. In PuDianNao [2], neuron synaptic weights are stored in off-chip memory, which requires data movement during training back and forth between the off-chip memory and the processing chip. In DaDianNao [1], neuron synaptic weights are stored in eDRAM and are later brought into a neural functional unit for execution. Input data are initially stored in a central eDRAM. This system utilized dynamic wormhole routing for data transfers. It was able to achieve a 150× energy reduction over GPUs. ShiDianNao is an accelerator only for CNNs. As CNNs utilize shared weights, a small on-chip SRAM cache was able to store the whole network’s weights.

St. Amant et al. [34] presented a compilation approach where a user annotates a code segment for conversion to a neural network. This study examined the transformation of several kernels, including FFT, JPEG, K-means, and the Sobel edge detector, into equivalent neural networks. In this work they utilized analog neural cores to accelerate a RISC processor running general purpose code. Our approach differs in that we examine standalone neural cores for embedded applications and use memristors for the processing.

Belhadj et al. [35] proposed multicore architectures for spiking neurons and evaluated them for embedded signal processing applications. They stored synaptic weights in digital form and used digital to analog converters to process the neurons. There are significant differences in the neuron circuit designs and neural algorithms compared to our system.

B. Memristor Based Systems

The most recent results for memristor based neuromorphic systems can be seen in [12-15]. Alibart [12] and Prezioso [13] examined only linearly separable problems. Li [15] demonstrates in-situ training of memristor crossbar based neural networks for nonlinear classifier designs. They proposed having two copies of the same synaptic weights, one for the forward pass and another, transposed version for the backward pass. However, it is practically difficult to maintain an exact copy of a memristor crossbar because of memristor device stochasticity. Soudry et al. [14] proposed an implementation of gradient descent based training on memristor crossbar neural networks. They utilized two transistors and one memristor per synapse. Their synaptic weight precision is less than that of the two memristors per synapse neuron design utilized in this paper.

Liu [16] examined memristor based neural networks utilizing four memristors per synapse, while the proposed system utilizes two memristors per synapse. They proposed deploying the system as a neural accelerator with a RISC processor. The proposed system processes data coming directly from a 3D stacked sensor chip.

VIII. CONCLUSION

In this paper we have proposed a multicore heterogeneous architecture for big data processing. This system has the capability to process key machine learning algorithms such as deep neural networks, autoencoders, and k-means clustering. The system has both training and recognition (evaluation of new inputs) capabilities. We utilized memristor crossbars to implement energy efficient neural cores. Our experimental evaluations show that the proposed architecture can provide four to six orders of magnitude more energy efficiency than GPUs for big data processing.

REFERENCES

[1] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A Machine-Learning Supercomputer," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47), Washington, DC, USA, pp. 609-622, 2014.
[2] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Temam, X. Feng, X. Zhou, and Y. Chen, "PuDianNao: A Polyvalent Machine Learning Accelerator," in Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15), New York, NY, USA, pp. 369-381, 2015.
[3] L. O. Chua, "Memristor—The Missing Circuit Element," IEEE Transactions on Circuit Theory, vol. 18, no. 5, pp. 507-519, 1971.
[4] D. Chabi, W. Zhao, D. Querlioz, and J.-O. Klein, "Robust Neural Logic Block (NLB) Based on Memristor Crossbar Array," IEEE/ACM International Symposium on Nanoscale Architectures, pp. 137-143, 2011.
[5] C. Zamarreño-Ramos, L. A. Camuñas-Mesa, J. A. Pérez-Carrasco, T. Masquelier, T. Serrano-Gotarredona, and B. Linares-Barranco, "On spike-timing-dependent-plasticity, memristive devices, and building a self-learning visual cortex," Frontiers in Neuroscience, Neuromorphic Engineering, vol. 5, article 26, pp. 1-22, Mar. 2011.
[6] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams, "The missing memristor found," Nature, vol. 453, pp. 80-83, 2008.
[7] T. M. Taha, R. Hasan, C. Yakopcic, and M. R. McLean, "Exploring the Design Space of Specialized Multicore Neural Processors," IEEE International Joint Conference on Neural Networks (IJCNN), 2013.
[8] M. Hu, H. Li, Y. Chen, Q. Wu, G. S. Rose, and R. W. Linderman, "Memristor Crossbar-Based Neuromorphic Computing System: A Case Study," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 10, pp. 1864-1878, 2014.
[9] A. M. Sheri, A. Rafique, W. Pedrycz, and M. Jeon, "Contrastive divergence for memristor-based restricted Boltzmann machine," Engineering Applications of Artificial Intelligence, vol. 37, pp. 336-342, 2015.
[10] D. Soudry, D. D. Castro, A. Gal, A. Kolodny, and S. Kvatinsky, "Memristor-Based Multilayer Neural Networks With Online Gradient Descent Training," IEEE Transactions on Neural Networks and Learning Systems, accepted.
[11] X. Liu et al., "A Heterogeneous Computing System with Memristor-Based Neuromorphic Accelerators," HPEC, 2014.
[12] F. Alibart, E. Zamanidoost, and D. B. Strukov, "Pattern classification by memristive crossbar circuits with ex-situ and in-situ training," Nature Communications, 2013.
[13] M. Prezioso, F. Merrikh-Bayat, B. D. Hoskins, G. C. Adam, K. K. Likharev, and D. B. Strukov, "Training and operation of an integrated neuromorphic network based on metal-oxide memristors," Nature, vol. 521, no. 7550, pp. 61-64, 2015.
[14] D. Soudry, D. D. Castro, A. Gal, A. Kolodny, and S. Kvatinsky, "Memristor-Based Multilayer Neural Networks With Online Gradient Descent Training," IEEE Transactions on Neural Networks and Learning Systems, issue 99, 2015.
[15] B. Li, Y. Wang, Y. Wang, Y. Chen, and H. Yang, "Training itself: Mixed-signal training acceleration for memristor-based neural network," 19th Asia and South Pacific Design Automation Conference (ASP-DAC), 20-23 Jan. 2014.
[16] X. Liu, M. Mao, B. Liu, H. Li, Y. Chen, B. Li, Y. Wang, H. Jiang, M. Barnell, Q. Wu, and J. Yang, "RENO: A high-efficient reconfigurable neuromorphic computing accelerator design," 52nd ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1-6, 8-12 June 2015.
[17] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, "NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 7, pp. 994-1007, July 2012.
[18] S. Yu, Y. Wu, and H.-S. P. Wong, "Investigating the switching dynamics and multilevel capability of bipolar metal oxide resistive switching memory," Applied Physics Letters, vol. 98, 103514, 2011.
[19] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," The Journal of Machine Learning Research, vol. 11, pp. 3371-3408, 2010.
[20] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach (2nd Edition), Prentice Hall, 2002, ISBN-13: 978-0137903955.
[21] http://yann.lecun.com/exdb/mnist/
[22] https://archive.ics.uci.edu/ml/datasets/ISOLET
[23] https://kdd.ics.uci.edu/
[24] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, "Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0," in Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 40), Washington, DC, USA, pp. 3-14, 2007.
[25] A. B. Kahng, B. Li, L. S. Peh, and K. Samadi, "ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration," Design, Automation & Test in Europe Conference & Exhibition, pp. 423-428, 20-24 April 2009.
[26] J. Gorner, S. Hoppner, D. Walter, M. Haas, D. Plettemeier, and R. Schuffny, "An energy efficient multi-bit TSV transmitter using capacitive coupling," 21st IEEE International Conference on Electronics, Circuits and Systems (ICECS), pp. 650-653, 7-10 Dec. 2014.
[27] C. Yakopcic, T. M. Taha, G. Subramanyam, and R. E. Pino, "Memristor SPICE Model and Crossbar Simulation Based on Devices with Nanosecond Switching Time," IEEE International Joint Conference on Neural Networks (IJCNN), August 2013.
[28] https://archive.ics.uci.edu/ml/datasets/Iris
[29] S. B. Furber, S. Temple, and A. D. Brown, "High-Performance Computing for Systems of Spiking Neurons," Proceedings of the AISB'06 workshop on GC5: Architecture of Brain and Mind, vol. 2, pp. 29-36, Bristol, April 2006.
[30] J. Schemmel, J. Fieres, and K. Meier, "Wafer-Scale Integration of Analog Neural Networks," IEEE International Joint Conference on Neural Networks (IJCNN), 2008.
[31] P. Merolla, J. Arthur, F. Akopyan, N. Imam, R. Manohar, and D. S. Modha, "A digital neurosynaptic core using embedded crossbar memory with 45pJ per spike in 45nm," IEEE Custom Integrated Circuits Conference (CICC), pp. 1-4, 19-21 Sept. 2011.
[32] J. V. Arthur, P. A. Merolla, F. Akopyan, R. Alvarez, A. Cassidy, S. Chandra, S. K. Esser, N. Imam, W. Risk, D. B. D. Rubin, R. Manohar, and D. S. Modha, "Building block of a programmable neuromorphic substrate: A digital neurosynaptic core," The International Joint Conference on Neural Networks (IJCNN), pp. 1-8, June 2012.
[33] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, B. Brezzo, I. Vo, S. K. Esser, R. Appuswamy, B. Taba, A. Amir, M. D. Flickner, W. P. Risk, R. Manohar, and D. S. Modha, "A million spiking-neuron integrated circuit with a scalable communication network and interface," Science, vol. 345, no. 6197, pp. 668-673, 2014.
[34] R. St. Amant, A. Yazdanbakhsh, J. Park, B. Thwaites, H. Esmaeilzadeh, A. Hassibi, L. Ceze, and D. Burger, "General-purpose code acceleration with limited-precision analog computation," in Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA '14), Piscataway, NJ, USA, pp. 505-516, 2014.
[35] B. Belhadj, A. Joubert, Z. Li, R. Héliot, and O. Temam, "Continuous real-world inputs can open up alternative accelerator designs," SIGARCH Computer Architecture News, vol. 41, no. 3, June 2013.
[36] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42), pp. 469-480, 12-16 Dec. 2009.