SnaPEA: Predictive Early Activation for Reducing Computation in Deep Convolutional Neural Networks
Vahideh Akhlaghi∗ Amir Yazdanbakhsh∗† Kambiz Samadi‡ Rajesh K. Gupta Hadi Esmaeilzadeh
Alternative Computing Technologies (ACT) Lab
†Georgia Institute of Technology ‡Qualcomm Technologies, Inc. University of California, San Diego
Abstract— Deep Convolutional Neural Networks (CNNs) perform billions of operations for classifying a single input. To reduce these computations, this paper offers a solution that leverages a combination of runtime information and the algorithmic structure of CNNs. Specifically, in numerous modern CNNs, the outputs of compute-heavy convolution operations are fed to activation units that output zero if their input is negative. By exploiting this unique algorithmic property, we propose a predictive early activation technique, dubbed SnaPEA. This technique cuts the computation of convolution operations short if it determines that the output will be negative. SnaPEA can operate in two distinct modes, exact and predictive. In the exact mode, with no loss in classification accuracy, SnaPEA statically re-orders the weights based on their signs and periodically performs a single-bit sign check on the partial sum. Once the partial sum drops below zero, the rest of the computations can simply be ignored, since the output value will be zero in any case. In the predictive mode, which trades classification accuracy for larger savings, SnaPEA speculatively cuts the computations short even earlier than the exact mode. To control the accuracy, we develop a multi-variable optimization algorithm that thresholds the degree of speculation. As such, the proposed algorithm exposes a knob to gracefully navigate the trade-offs between classification accuracy and computation reduction. Compared to a state-of-the-art CNN accelerator, SnaPEA in the exact mode yields, on average, 28% speedup and 16% energy reduction in various modern CNNs without affecting their classification accuracy. With 3% loss in classification accuracy, on average, 67.8% of the convolutional layers can operate in the predictive mode. The average speedup and energy saving of these layers are 2.02× and 1.89×, respectively. The benefits grow to a maximum of 3.59× speedup and 3.14× energy reduction. Compared to static pruning approaches, which are complementary to the dynamic approach of SnaPEA, our proposed technique offers up to 63% speedup and 49% energy reduction across the convolution layers with no loss in classification accuracy.
Keywords-Deep Neural Networks; DNN; Convolutional Neural Networks; CNN; Accelerators; Computation Reduction; Predictive Early Activation; Approximate Computing
I. INTRODUCTION
Deep Convolutional Neural Networks (CNNs) are among the
most widely used families of machine learning methods and have
had a transformative effect on a wide range of applications. CNNs
require ample amounts of computation even for a single input
query. For instance, assigning a label to a relatively small RGB
image (224×224×3) from the ImageNet database [1] requires
billions of multiply-and-accumulate operations [2]–[4]. This paper
aims to reduce this copious amount of computation by exploiting
both their runtime information and algorithmic structure. In
convolutional layers of many modern CNNs, each convolution
operation is commonly followed by an activation function called a
Rectified Linear Unit (ReLU) that returns zero for negative inputs
and yields the input itself for the positive ones. We observe that a
∗Vahideh Akhlaghi and Amir Yazdanbakhsh contributed equally.
Figure 1: Fraction of activation input values that are negative. (Bars: AlexNet, GoogLeNet, SqueezeNet, VGGNet, Average; y-axis: negative inputs to the activation layers, 0% to 100%.)
large fraction of ReLU outputs are zero, indicating a large number
of negative convolution outputs. Figure 1 illustrates this trend
among several modern CNNs where ReLU nullifies 42%-68% of
inputs. In addition, comparing the outputs of intermediate convolu-
tional layers for different input images shows the zero values vary
spatially across the images. Figure 2 illustrates this insight across
two images passing through GoogLeNet [5]. The highlighted
differences in the output of the intermediate convolutional layer
attest to the varying spatial distribution of zeros. Harnessing these
insights, we devise SnaPEA, a holistic software-hardware solution,
that cuts a large fraction of the computations short by identifying
the zero intermediate values earlier during the runtime.
SnaPEA operates in two distinct modes, namely exact and
predictive. In the exact mode, in which the classification accu-
racy remains unchanged, SnaPEA detects the zero values by
static re-ordering of weights along with a low-overhead sign-bit
monitoring of partial sums. A negative partial sum triggers early
termination of convolution operations. SnaPEA, in the predictive
mode, trades off the classification accuracy for larger computation
savings by predicting the zero values. Predictive mode results
in earlier termination of the convolution operations compared
to the exact mode, further reducing the amount of computation.
Notwithstanding the higher benefits of predictive mode, an undisci-
plined prediction of zero values leads to significant loss compared
to the nominal CNN classification accuracy. To minimize this
loss while maximizing the reduction in computation, we pro-
pose a co-designed hardware-software solution that (1) statically
pre-arranges the weights, (2) determines a threshold for triggering
predictive early activation, and (3) uses a low-overhead runtime
monitoring mechanism to apply the early activation. As such,
SnaPEA makes the following contributions:
1) SnaPEA leverages the algorithmic structure of CNNs to reduce their computation. This work provides an insight
that the amount of computation in CNNs can be significantly
reduced by using a combination of runtime information along
SnaPEA: Snappy Predictive Early Activation
2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture
2575-713X/18/$31.00 ©2018 IEEE DOI 10.1109/ISCA.2018.00061
Figure 2: GoogLeNet [5], in which the intermediate feature maps for two input images are magnified. The ellipses on the intermediate feature maps highlight the varying spatial distribution of non-zero values for distinct input images. (Legend: Convolution, ReLU, Pooling, Concatenation, Normalization, Softmax.)
with the algorithmic structure of CNNs, which feeds many
negative inputs to the activation function.
2) SnaPEA is a runtime technique that cuts the CNN computations short. Exploiting the aforementioned insight,
this paper devises an exact runtime approach that relies on
a single-bit sign-check to cut the computation short without
losing any accuracy. In addition, SnaPEA comes with a
predictive mode that speculates on the outcome of sign-check
and terminates the computation even earlier, trading off
accuracy for less computation.
3) SnaPEA provides a hardware-software solution to control the accuracy trade-offs. We develop a multi-variable
optimization algorithm that systematically thresholds the
degree of speculation based on the sensitivity of the CNN
output to each layer. The threshold becomes a knob for
controlling the accuracy-computation tradeoff.
To evaluate the effectiveness of the proposed technique, we
evaluate it on a number of modern CNNs. In the exact mode,
which has no effect on the classification accuracy, SnaPEA,
on average, delivers 28% (maximum of 74%) speedup and
16% (maximum of 51%) energy reduction over EYERISS [2], a
state-of-the-art CNN accelerator. With 3% loss in classification
accuracy, on average, 67.8% of the convolutional layers can
operate in the predictive mode. The average speedup and energy
saving of the layers in the predictive mode over EYERISS are
2.02× and 1.89×, respectively. GoogLeNet sees the maximum
benefit of 3.59× speedup and 3.14× energy reduction. Finally,
we evaluate the benefits of SnaPEA along with static pruning
techniques using the already pruned SqueezeNet CNN [6]. In
the exact mode, SqueezeNet achieves 30% speedup and 15%
energy reduction with no loss of accuracy, demonstrating the
complementary nature of SnaPEA’s dynamic approach to the
static pruning techniques. Overall, these benefits suggest that
coalescing runtime information with algorithmic insights can lead
to new avenues for reducing the heavy computations of CNNs.
II. SNAPEA HARDWARE-SOFTWARE SOLUTION
SnaPEA provides a hardware-software solution to reduce the
computation in a given CNN. The software part of SnaPEA, illus-
trated in Figure 3, is comprised of two distinct passes: one for the
exact mode, and the other for the predictive mode. In the latter pass,
the solution finds the thresholds for speculation while considering
the acceptable loss in accuracy. In both cases, the task is to reorder
weights of the convolution kernels, depending on the operating
mode. To utilize these transformations, SnaPEA comes with
an accelerator design that can efficiently execute the CNN with
reordered convolution weights with support for early termination
of convolution. This section overviews the hardware and software
components of SnaPEA.
A. SnaPEA Software Workflow
Figure 3 depicts the software workflow of SnaPEA which takes
a CNN model, an acceptable accuracy loss, and an optimization
dataset as its inputs. The CNN goes through the multiple passes of
this workflow. The first pass, called Convolution Layer Extraction,
elicits the convolution kernels of the CNN. Then, the weights
of each kernel are re-ordered through the remaining passes,
depending on the operating mode, exact or predictive.
Software workflow in the exact mode. To develop this flow,
we leverage the observation that in CNNs with ReLU
activation layers, the inputs to the convolution layers are non-negative.
Consequently, in these layers, the partial sum cannot become
negative while Multiply-Accumulate (MAC) operations are performed
with the positive subset of the weights. Only the
remaining MAC operations, with the negative subset of the weights,
can turn the convolution output negative. Given this insight, in
the exact mode, the Sign-Based Weight Reordering pass reorders the
weights of convolution kernels based on their sign such that the
positive subset are followed by the negative subset. The reordering
enables SnaPEA to first perform MAC with the positive subset
and then cut the computation and apply activation function earlier
Figure 3: Software workflow for SnaPEA.
in the case of observing a negative partial output during the
computation with negative weights.
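The exact-mode idea above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `reorder_by_sign` and `exact_mode_conv` are hypothetical, a convolution window is flattened to a 1D list, the bias term is ignored, and all inputs are assumed to be non-negative (outputs of a preceding ReLU).

```python
def reorder_by_sign(weights):
    """Return (reordered_weights, index_buffer): non-negative weights first, then negative.

    index_buffer[j] is the position of the j-th reordered weight in the original
    kernel, so the matching input can still be fetched after reordering.
    """
    # sorted() is stable, so weights keep their relative order within each sign class.
    order = sorted(range(len(weights)), key=lambda i: weights[i] < 0)
    return [weights[i] for i in order], order


def exact_mode_conv(weights, inputs):
    """Compute ReLU(sum(w * x)) with early termination; returns (output, macs_performed).

    Assumes every input is non-negative (i.e., it is the output of a ReLU).
    """
    reordered, index = reorder_by_sign(weights)
    partial_sum = 0.0
    for macs, (w, i) in enumerate(zip(reordered, index), start=1):
        partial_sum += w * inputs[i]
        # The sign check only matters once the negative weights are reached:
        # from here on the partial sum can only decrease.
        if w < 0 and partial_sum < 0:
            return 0.0, macs  # ReLU would output zero anyway
    return max(partial_sum, 0.0), len(weights)


# Five weights, but only three MACs are needed to know the output is zero.
print(exact_mode_conv([-4, 1, -2, 0.5, -3], [3, 1, 2, 2, 5]))
```

Because the remaining weights are all negative and the inputs are non-negative, the partial sum can never recover once it drops below zero, which is why the early exit does not change the post-ReLU result.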
Software workflow in the predictive mode. To reduce the
computations further, SnaPEA, in the predictive mode, speculates on
the sign of the convolution outputs before starting to go through
the negative weights. A thresholding mechanism controls the
aggressiveness of the speculation. The intuition is that if the partial
output of a convolution after a certain number of MAC operations
is less than a threshold, the final convolution output will likely be
negative. In this mode, since SnaPEA may misspeculate a positive
convolution output as negative, the final classification accuracy
may decline. Therefore, to utilize this intuition effectively, the
software part of SnaPEA needs to deliberately determine: (1) a
threshold value and (2) its associated number of MAC operations,
such that the loss in the classification accuracy remains below the
acceptable level while the computation reduction is maximized.
These two speculation parameters need to be determined for as
many layers as possible to maximize the benefits. To determine
a proper set of parameters, SnaPEA formulates the problem as
a multi-variable constrained optimization problem, and provides
a greedy algorithm to solve it (see Section IV for more details).
The algorithm is run by the software part on the Optimization Dataset through the following three passes. This triad of passes serves
to manage the complexity of accounting for the combined effects
of the layers without an exponential explosion of the search space.
First, the software statically runs a characterization pass, named
Kernel Profiling, that measures the sensitivity of the accuracy to
the imprecision introduced in each kernel in isolation. According
to this sensitivity, the Kernel Profiling pass determines a set of
speculation parameters for each kernel. Then, the next pass (Local Optimization) consolidates the kernel parameters of each layer and
identifies a set of speculation parameters for the layer. This pass
also considers the effects of speculation in each layer in isolation.
Finally, the Global Optimization pass iteratively adjusts the specula-
tion parameters of all layers such that the cross-layer effect yields
an acceptable accuracy with the maximal computation reduction.
The optimization algorithm runs once offline and does not impose
additional runtime overhead during the execution of CNNs.
Based on the obtained speculation parameters for the entire
network, the weights of each kernel are reordered by the Weight
Reordering pass. This pass reorders the kernel weights by placing
the ones determined by the speculation parameters ahead of the
others. Then, the remaining weights are reordered based on the
same procedure used for the Sign-Based Weight Reordering pass,
which puts the negative weights after the positive ones. Finally,
these reordered weights determine the execution of the CNN on
the SnaPEA hardware.
B. SnaPEA Hardware Architecture
The SnaPEA architecture comprises a number of identical
Processing Engines (PEs), each of which is designed to compute a
convolution using the reordered weights. To support computation
with the reorderings, each PE is equipped with an index buffer
that holds the indices of weights in the original kernel. The PE
uses this index buffer to fetch the corresponding input value
for each weight. This design is necessary because SnaPEA can
reorder the weights but cannot tamper with the order of the inputs
or activation. Section V expounds this design. The following
provides an overview of the execution flow of a single convolution
window in the exact and predictive modes.
Convolution execution flow in the exact mode. The PE first
performs the operations of the positive weights. For the negative
weights, the PE probes the sign of each partial sum value before
proceeding to the next MAC operation. As soon as the partial sum
becomes negative, the PE terminates the convolution early and
triggers the early activation. Once the early activation is triggered,
the PE is free to perform the computations of another convolution
window. The sign-bit check merely requires a single AND gate,
a low overhead addition to the PE.
Convolution execution flow in the predictive mode. In the
predictive mode, each PE speculates the sign of the convolution
output by comparing the partial sums with a threshold value
after performing a pre-determined number of MAC operations.
As mentioned, both the threshold and the number of operations
are determined in the SnaPEA software workflow. If the partial
result is less than the threshold, the PE can speculatively terminate
the convolution and compute the activation early. That is, the PE
outputs a zero for the current convolution window. To support
this speculative execution, each PE is equipped with a unit called
Predictive Activation Unit (PAU) (See Section V).
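The execution flow of one convolution window can be modeled in software as follows. This is a hedged sketch of the behavior described above, not a description of the actual hardware: `pe_convolution_window` and its parameter names are hypothetical, the weights are assumed to be laid out as speculation weights, then remaining positive weights, then remaining negative weights, and the speculation test uses "less than or equal to" to match Equation (1) in Section IV.

```python
def pe_convolution_window(reordered_weights, index_buffer, window_inputs,
                          threshold, num_spec_ops):
    """Software model of one compute lane; returns (activation, macs_performed).

    index_buffer[j] is the original kernel position of reordered_weights[j];
    window_inputs stays in its original order and is never reordered.
    """
    partial_sum = 0.0
    for macs, (w, idx) in enumerate(zip(reordered_weights, index_buffer), start=1):
        partial_sum += w * window_inputs[idx]  # index buffer selects the matching input
        if macs == num_spec_ops and partial_sum <= threshold:
            return 0.0, macs  # predictive early activation (speculative)
        if macs > num_spec_ops and w < 0 and partial_sum < 0:
            return 0.0, macs  # exact sign check in the negative tail
    return max(partial_sum, 0.0), len(reordered_weights)


weights = [3, -2, 0.5, -1]  # [speculation: 3, -2 | positive: 0.5 | negative: -1]
indices = [2, 1, 0, 3]      # positions of those weights in the original kernel
print(pe_convolution_window(weights, indices, [4, 1, 0, 2], -1.0, 2))  # speculation fires
print(pe_convolution_window(weights, indices, [4, 1, 0, 2], -5.0, 2))  # falls back to sign check
```

The sign check is applied only after the speculation weights and only on negative weights, because a negative partial sum reached earlier could still be lifted by the remaining positive weights.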
Figure 4: A 1×3 convolution in (a) unaltered, (b) exact, and (c) predictive modes. In the latter two, the weights and their corresponding inputs are reordered. The white boxes highlight the operations that are cut.
(a) Weights (-5, +1, -1), inputs (+1, +2, +6): CONV yields -9, ReLU yields 0.
(b) Weights (+1, -5, -1), inputs (+2, +1, +6): CONV stops at partial sum -3, SnaPEA outputs 0.
(c) Weights (+1, -5, -1), inputs (+2, +1, +6): CONV stops at partial sum +2, SnaPEA feeds -1 to ReLU and outputs 0.
III. COMPUTATION REDUCTION IN SNAPEA
Figure 4 demonstrates how SnaPEA reduces the computation with
an example of a 1×3 convolution. Figure 4a performs the unaltered
convolution in which all of the MAC operations are performed
and yields “-9” as the output. Figure 4b illustrates convolution in
the exact mode. In this mode, SnaPEA reorders the weights based
on their sign, and starts the computation with the positive weights.
The computation is terminated after performing only two MAC
operations as the result is already negative, “-3”. The simple sign
check stops the computation. Although the partial sum after two
MAC operations (“-3”) has not reached the final convolution output
(“-9”), it will be converted to zero by the following ReLU operation.
As such, the result is the same as that of the unaltered convolution.
Therefore, the exact SnaPEA does not change the final output
after ReLU and does not lead to accuracy degradation.
Figure 4c illustrates how predictive mode cuts the operations ear-
lier than the exact mode. As shown, after performing the MAC op-
erations on only one weight, SnaPEA predicts that the convolution
value will eventually be negative. Even though the corresponding
partial sum value is positive (“+2”), SnaPEA speculatively triggers
the ReLU function early with a negative value (e.g., “-1”) and puts
out zero. This speculation reduces the computation from two MAC
operations in the exact mode to one. In real-world CNNs, convolution is most often
3D and requires a relatively large number of MAC operations as
depicted in Figure 5a. Using these methods, SnaPEA can forgo
a significant number of the MAC operations, as illustrated in Figure 5b.
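The arithmetic of the three cases in Figure 4 can be written out explicitly. The values below are read from the figure, taking the non-negative row (+1, +2, +6) as the inputs and (-5, +1, -1) as the weights. The threshold $Th$ is not given in the figure; any $Th \ge 2$ is assumed here purely for illustration.

```latex
\begin{align*}
\text{unaltered:}\quad  & (-5)(1) + (1)(2) + (-1)(6) = -9
  && \Rightarrow\ \mathrm{ReLU}(-9) = 0 \quad (3\ \text{MACs})\\
\text{exact:}\quad      & (1)(2) + (-5)(1) = -3 < 0
  && \Rightarrow\ \text{stop, output } 0 \quad (2\ \text{MACs})\\
\text{predictive:}\quad & (1)(2) = +2 \le Th
  && \Rightarrow\ \text{stop, output } 0 \quad (1\ \text{MAC})
\end{align*}
```

In the exact case the skipped term $(-1)(6)$ could only make the sum more negative, so the output is unchanged. In the predictive case the stop is a guess that happens to be correct for this window.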
IV. SNAPEA SOFTWARE OPTIMIZATION
The significant computation reduction provided by the predictive
mode comes at the price of a loss in classification
accuracy due to misspeculating positive outputs as negative ones.
To avoid an unacceptable loss while maximizing the computation
reduction, the predictive pass in the software part of SnaPEA
aims to systematically control the degree of speculation by
properly determining the speculation parameters. To determine
the parameters, the predictive pass formulates the problem as a
constrained optimization problem, and uses a greedy algorithm
to solve it. In this section, we first elaborate on the speculation
Figure 5: (a) The unaltered 3D convolution where all the MAC operations (bubbles) are carried out. (b) The same convolution with SnaPEA, where a significant number of operations are eliminated, delineated by the white bubbles. (Both panels show Conv, computed as $\sum_{i=0}^{n} x_i \times w_i$, followed by ReLU.)
parameters, and then explain the problem formulation and the
algorithm to determine the parameters.
A. Speculation Parameters
As mentioned in Section II-A, speculation on the sign of a
convolution output is performed by comparing the partial result
of a set of MAC operations with a threshold value. Therefore,
the threshold value and its associated set of operations are the
parameters that control the degree of speculation. The threshold is
merely a value that the software must determine for
the controlled speculation. However, to determine a proper set of
operations, the software needs to select the proper weights. One
approach to selecting the weights would be to sort them in
descending order of their absolute values, and select those with the largest
magnitudes as the set of operations for performing the speculation.
In this approach, although the contributions of both positive and
negative weights are taken into account, the classification accuracy
drastically declines. The reason is that selecting the weights with
the larger magnitude ignores the contributions of input values
which are, to a large degree, random and data dependent.
To mitigate the mentioned issue, SnaPEA sorts the weights
in ascending order, partitions them into a number of smaller
groups, and selects the weight with the largest magnitude from
each group. This approach enables even the smallest weights to
appear in the set of operations for the speculation; consequently,
the smaller weights that may couple with large input values have
an opportunity to contribute to the speculation. In this approach,
to select a proper set of operations, the software only needs to
determine the number of groups. This means that the number of
groups can be exploited as an indicator of a set of operations in the
speculation parameters. Accordingly, we denote the speculation
parameters of all kernels in all layers of a CNN as (Th,N), in
which Th is a list of threshold values and N is a list of the number
of groups for selecting the corresponding operations.
B. Problem Formulation
The problem of finding the speculation parameters (i.e., (Th,N)) to maximize the computation reduction with an acceptable loss can
be formulated as an optimization problem. In order to formulate
the problem, we measure the computation reduction by subtracting
the number of MAC operations that are performed by SnaPEA
from the one performed by an unaltered CNN. However, since
the number of MAC operations in the unaltered CNN is constant
across various inputs, maximizing the computation reduction
becomes equivalent to minimizing the number of MAC operations
performed by SnaPEA. Accordingly, we define a function that
calculates the number of MAC operations in SnaPEA as follows.
Let $o^d_{l,k}$ be the result of a single convolution window obtained
by kernel $k$ in layer $l$ with the speculation parameters $Th^k_l$ and $N^k_l$
for the input image $d$. The number of MAC operations to compute
$o^d_{l,k}$ can be calculated by the function $Op$ shown in (1). Let us assume
that the reordered weights are stored in a 1D array such that the
$N^k_l$ speculation weights are placed at the beginning of the array,
followed by the remaining positive weights and then the remaining
negative weights at the end. The function in (1) returns
$N^k_l$ if the value of the partial sum after performing $N^k_l$ operations (i.e.,
$PartialSum_{N^k_l}$) is less than or equal to the threshold value $Th^k_l$. Otherwise,
the number of operations is determined by checking the sign of
the partial sum value obtained by performing operations with the
negative weights (i.e., $PartialSum_{w^-}$). If a non-positive partial sum
is observed, the function returns the index of the corresponding
negative weight in the array (i.e., $Idx_{w^-}$). If none of the above
cases occurs (last case in (1)), the number of operations is set to the
total number of weights in the kernel. The total number of weights
of the kernel is $C_{in,l} \times D^k_l \times D^k_l$, in which $C_{in,l}$ is the number of
input channels of the layer $l$, and $D^k_l$ is the kernel width.
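$$
Op(o^d_{l,k}, Th^k_l, N^k_l) =
\begin{cases}
N^k_l, & \text{if } PartialSum_{N^k_l} \le Th^k_l,\\
Idx_{w^-}, & \text{if } PartialSum_{N^k_l} > Th^k_l \text{ and } PartialSum_{w^-} \le 0,\\
C_{in,l} \times D^k_l \times D^k_l, & \text{otherwise}
\end{cases}
\tag{1}
$$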
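The three cases of (1) translate directly into a small function over running partial sums. The sketch below is illustrative only: `op_count` is a hypothetical name, the inputs are assumed to be already aligned with the reordered weights, and $Idx_{w^-}$ is interpreted as a 1-based position so that it equals the number of MAC operations performed.

```python
from itertools import accumulate


def op_count(reordered_weights, inputs, threshold, num_spec_ops):
    """Number of MACs for one window, following the three cases of (1).

    The weight layout is [speculation weights | remaining positive | remaining negative],
    and inputs[j] is the input that multiplies reordered_weights[j].
    """
    partial_sums = list(accumulate(w * x for w, x in zip(reordered_weights, inputs)))
    # Case 1: speculation fires after the first num_spec_ops operations.
    if partial_sums[num_spec_ops - 1] <= threshold:
        return num_spec_ops
    # Case 2: first negative weight after the speculation weights at which
    # the partial sum is no longer positive.
    for pos in range(num_spec_ops, len(reordered_weights)):
        if reordered_weights[pos] < 0 and partial_sums[pos] <= 0:
            return pos + 1
    # Case 3: no early termination, every weight in the kernel is used.
    return len(reordered_weights)


w = [3, -2, 0.5, -1, -4]
print(op_count(w, [0, 1, 4, 3, 5], -1, 2))  # case 1
print(op_count(w, [0, 1, 4, 3, 5], -5, 2))  # case 2
print(op_count(w, [2, 0, 4, 0, 0], -1, 2))  # case 3
```

Summing this count over all windows, kernels, layers, and images gives the quantity that the optimization problem seeks to minimize.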
The amount of computation to produce all the convolution
outputs is the sum of the number of MAC operations required
to produce each individual output. Based on this definition, the
problem is translated into finding the speculation parameters that
minimize the total number of MAC operations and meet the constraint
on the accuracy loss, which can be formulated as the following
constrained optimization problem.
Let $L$ be the set of all the layers in a given CNN, $K_l$ the set of all
the kernels in layer $l$, $D$ an optimization dataset, $\varepsilon$ an acceptable
accuracy loss, $Th^k_l$ and $N^k_l$ the speculation parameters of kernel $k$
of layer $l$, $O^d_{l,k}$ the outputs of the convolution generated by kernel
$k$ in layer $l$ for the input image $d$ from $D$, and $Accuracy_{CNN}$ and
$Accuracy_{SnaPEA}$ the classification accuracy of the CNN and the
classification accuracy obtained by SnaPEA, respectively. Now,
$(Th,N)$ can be determined by solving the following problem:
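$$
\begin{aligned}
\min_{Th,N}\quad & \sum_{d \in D} \sum_{l \in L} \sum_{k \in K_l} \sum_{o \in O^d_{l,k}} Op(o, Th^k_l, N^k_l)\\
\text{subject to}\quad & Accuracy_{CNN} - Accuracy_{SnaPEA} \le \varepsilon
\end{aligned}
\tag{2}
$$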
C. Finding the Speculation Parameters
In order to solve the optimization problem formulated as (2), we
devise a greedy algorithm (i.e., Algorithm 1). The algorithm takes
Algorithm 1 Finding the threshold value and its associated number of operations for all kernels in a CNN
Inputs: CNN: a CNN model, D: an optimization dataset, ε: acceptable loss in classification accuracy
Outputs: ParamCNN: speculation parameters (Th,N) for the CNN

// Analyze each kernel individually
function KERNELPROFILINGPASS(CNN, D, ε)
    Initialize ParamK[l][k] ← ∅
    for ∀ layer l in CNN do
        for ∀ kernel k in layer l do
            for a set of values (th, n) do
                op, err = Simulate(CNN, D, k, th, n)
                if err ≤ ε then
                    ParamK[l][k].append((th, n, op))
            Sort ParamK[l][k] based on op
    return ParamK

// Local Optimizer to find a set of params for each layer individually
function LOCALOPTIMIZATIONPASS(CNN, D, ε, ParamK)
    for layer l in CNN do
        for t in range(0, T) do
            for k in layer l do
                param[k] = ParamK[l][k][t]
            op, err = Simulate(CNN, D, ε, param)
            if err ≤ ε then
                ParamL[l].append((param, op, err))
    return ParamL

// Parameter tuning to accommodate for cross-kernel effect
function ADJUSTPARAM(CNN, ParamCNN, ParamL)
    for ∀ layer l in CNN do
        for ∀ t in range(len(ParamL[l])) do
            meritL[l][t] = −(ParamL[l][t][2] − ParamCNN[l][2]) / (ParamL[l][t][1] − ParamCNN[l][1])
    l, t = Argmax(meritL)
    return (l, t)

// Global Optimizer to find the parameters for the entire network
function GLOBALOPTIMIZATIONPASS(CNN, D, ε, ParamL)
    for ∀ layer l in CNN do ParamCNN[l] = ParamL[l][0]
    err = Simulate(CNN, D, ParamCNN)
    while err > ε do
        l, t = ADJUSTPARAM(CNN, ParamCNN, ParamL)
        ParamCNN[l] = ParamL[l][t]
        remove ParamL[l][t] from ParamL[l]
        err = Simulate(CNN, D, ε, ParamCNN)
    return ParamCNN

Initialize ParamCNN[l] ← ∅
ParamK = KERNELPROFILINGPASS(CNN, D, ε)
ParamL = LOCALOPTIMIZATIONPASS(CNN, D, ε, ParamK)
ParamCNN = GLOBALOPTIMIZATIONPASS(CNN, D, ε, ParamL)
a CNN, an optimization dataset D, and an acceptable accuracy loss
ε and returns a list named ParamCNN that stores the value of the
speculation parameters (Th,N). The algorithm first characterizes
the sensitivity of the CNN to the speculation performed in each
kernel in isolation. Then, it adjusts the speculation parameters for
all the kernels through a greedy search such that they cooperatively
minimize the computation while keeping the loss less than ε.
Accordingly, we break the algorithm into two main stages (i.e.,
the profiling and the optimization stage) as follows:
Profiling stage. Function KernelProfilingPass in Algorithm 1 profiles
the number of operations (op) and the accuracy loss (err)
corresponding to various values of $(Th^k_l, N^k_l)$ for the kernel k in
layer l. The exact mode of each kernel is also included in the
profiling results by setting (0,1) as one of the values for its (th,n). The
process is repeated for all the kernels in the CNN. The profiling
results with an acceptable accuracy loss are accumulated in
a list called ParamK. Each sub-list ParamK[l][k] in the list ParamK
is sorted in ascending order based on the value of op.
Optimization stage. The optimization stage evaluates the
combined effects of kernels and determines the proper speculation
parameters for them. To avoid the complexity of evaluating the
combined effects, the optimization stage consists of two functions:
LocalOptimizationPass and GlobalOptimizationPass. The function
LocalOptimizationPass in Algorithm 1 aims to evaluate the
combined effects of kernels in each layer when the speculation
is performed in the layer in isolation. Then, the function identifies
a set of speculation parameters for each individual layer separately
that leads to acceptable accuracy with minimum operations. To do
this, the function LocalOptimizationPass generates T configurations
for layer l such that in the t-th configuration, the speculation
parameters of kernel k are set to the t-th profiled parameters from the
sorted list ParamK[l][k]. The configurations yielding an acceptable
accuracy are selected as the set of configurations for the layer l.
The acceptable configurations of all layers are populated in a list
called ParamL, and passed to the next function.
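The way the T per-layer configurations are assembled from the sorted kernel profiles can be sketched as below. This is a simplified illustration: `local_configurations` is a hypothetical name, the accuracy filtering step (which needs a simulator) is left out, and a kernel with fewer than t+1 profiled entries is assumed to reuse its last entry, a case the text does not address.

```python
def local_configurations(param_k_layer, num_configs):
    """Build `num_configs` candidate configurations for one layer.

    param_k_layer maps each kernel to its profiled (th, n, op) tuples, sorted by op.
    Each list is assumed non-empty, since the exact-mode entry is always profiled.
    In configuration t every kernel uses its t-th profiled entry.
    """
    configs = []
    for t in range(num_configs):
        # Assumption: a kernel with fewer than t+1 entries reuses its last one.
        config = {k: entries[min(t, len(entries) - 1)]
                  for k, entries in param_k_layer.items()}
        configs.append(config)
    return configs


profiles = {"k0": [(0.0, 1, 5), (-0.5, 2, 8)], "k1": [(0.1, 1, 4)]}
for config in local_configurations(profiles, 2):
    print(config)
```

Each configuration returned here would then be simulated on the layer in isolation, and only those within the accuracy budget would be kept for that layer.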
The second function, GlobalOptimizationPass, evaluates the
effect of speculation performed in all the layers simultaneously and
adjusts their speculation parameters with respect to the cross-layer
effect on the classification accuracy and computation reduction.
The output of the function is the final speculation parameters for
all the kernels in the CNN which is stored in the list ParamCNN.
To find the final parameters, the function first initializes the
ParamCNN by setting the speculation parameters of each layer
l to ParamL[l][0]. This initialization leads to the maximum
computation reduction given the configurations stored in ParamL.
However, the accuracy loss obtained by the initial setting may
not be acceptable. If the desired accuracy is met, the
current parameters in ParamCNN are returned. Otherwise, the
parameters are adjusted iteratively until the accuracy loss no longer
exceeds ε. In each iteration, the parameters of interest are
those that yield a large improvement in the classification
accuracy for a small increase in the number of operations.
Hence, we define a merit value as −Δerr/Δop, where
the larger the reduction in err and the smaller the increase in op are,
the larger the merit is. Accordingly, the function GlobalOptimizationPass selects
the configuration with the maximum merit value among all
the configurations in ParamL and updates the corresponding
speculation parameters in the list ParamCNN.
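The merit-driven loop can be sketched in Python as follows. This is a toy model rather than the authors' implementation: `global_optimization_pass` and `simulate` are hypothetical names, `simulate` stands in for a full accuracy evaluation of the CNN, configurations that do not add operations are skipped to avoid dividing by zero, and the loop stops if the candidates run out. The example assumes, purely for illustration, that per-layer losses (in percentage points) add up.

```python
def global_optimization_pass(param_l, simulate, eps):
    """Greedy cross-layer tuning; returns (chosen configuration per layer, final loss).

    param_l[l] is a list of (param, op, err) tuples for layer l, cheapest first.
    simulate(config) returns the accuracy loss when all layers run with `config`.
    """
    candidates = {l: list(cfgs[1:]) for l, cfgs in param_l.items()}
    param_cnn = {l: cfgs[0] for l, cfgs in param_l.items()}  # most aggressive start
    err = simulate(param_cnn)
    while err > eps:
        best, best_merit = None, float("-inf")
        for l, cfgs in candidates.items():
            _, cur_op, cur_err = param_cnn[l]
            for t, (_, op, e) in enumerate(cfgs):
                d_op = op - cur_op
                if d_op <= 0:
                    continue  # only consider moves that add operations
                merit = -(e - cur_err) / d_op  # merit = -delta_err / delta_op
                if merit > best_merit:
                    best, best_merit = (l, t), merit
        if best is None:
            break  # candidates exhausted; caller should check the returned loss
        l, t = best
        param_cnn[l] = candidates[l].pop(t)
        err = simulate(param_cnn)
    return param_cnn, err


param_l = {
    "conv1": [("a", 10, 4), ("b", 14, 1), ("c", 20, 0)],
    "conv2": [("x", 30, 3), ("y", 50, 2), ("z", 60, 0)],
}
toy_simulate = lambda cfg: sum(e for _, _, e in cfg.values())
print(global_optimization_pass(param_l, toy_simulate, 3))
```

In this toy run the loop first relaxes conv1 from "a" to "b" (merit 0.75), then to "c", at which point the loss meets the budget while conv2 keeps its cheapest configuration.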
V. ARCHITECTURE DESIGN FOR SNAPEA
SnaPEA provides an accelerator architecture in order to effi-
ciently execute the CNN with the transformed convolution opera-
tions. Modern CNNs consist of several back-to-back layers includ-
ing convolution, ReLU activation, pooling, and fully-connected.
To provide an end-to-end solution, the accelerator architecture
consists of several units to execute the computation of all layers in
the CNN. To execute CNNs efficiently, the architecture
specifically targets the hardware of the convolution
layers, for the following two reasons. The first reason is that
the computation of the convolution layers dominates the overall
runtime of modern CNNs [2], [3], [7]–[10]. The second reason
is to execute the convolutions with the reordered weights and to
support the predictive early activation at the hardware level. To
perform the computations of the fully-connected layers, the same
hardware unit designed for the convolution layers is employed.
The fully-connected layers are mainly used to perform the actual
classification. CNNs usually have a much smaller number (i.e., one or
two) of fully-connected layers, placed at the final stage of the network,
compared to convolution layers. For example, GoogLeNet has 57
convolution layers and only one fully-connected layer. On average,
the computation of fully-connected layers accounts for ≈1% of
the total number of computations performed in CNNs [2], [3], [8].
Therefore, using the same hardware unit for the fully-connected
layers has virtually no impact on the total runtime of the CNNs.
Finally, the SnaPEA architecture consists of dedicated units to
support the computations of ReLU activation and pooling layers
as well.
Figure 6 (a) illustrates the high-level block diagram of the
proposed accelerator architecture. The accelerator consists of a 2D
array of identical Processing Engines (PEs). Each PE is equipped
with an input and output buffer that communicates with the
off-chip memory. The kernel weights and the inputs, both coming
from the off-chip memory, are stored in dedicated buffers
within each PE. In the following, we explain each unit of the
accelerator architecture in more detail.
Processing Engine (PE). Figure 6 (b) depicts the microarchitecture
of one PE in the SnaPEA architecture. Each PE comprises
multiple compute lanes, a weight and index buffer, an input/output
buffer, and multiple Predictive Activation Units (PAUs). Each
compute lane consists of one dedicated Multiply-and-Accumulate
(MAC) unit and one PAU. The weight, index, and input/output buffers
are shared across all the compute lanes within each PE. The
computation of a convolution layer in each PE starts upon receiving
a block of input features, their corresponding weights, and the
weight indices from the off-chip memory. In every cycle, the PE
controller reads one weight value from the weight buffer and
broadcasts it to all the compute lanes (MAC units). The PE
controller also reads one weight index from the index buffer and
sends it to the input buffer. Upon receiving the index, the input
buffer reads a set of values (one value per MAC unit) and sends them
to the MAC units for processing. Each compute lane is dedicated to
performing all the computations of one convolution window. That is,
each MAC unit multiplies one input by one weight for its convolution
window and sends the result to the accumulation register, which
accumulates the partial sum for that window. At the same time, the
PAU checks the value of the partial sum to determine whether further
computations for the convolution window are required. If the PAU
determines that no further computations are required, it data-gates
the corresponding multiplier and accumulator to save energy. This
process continues until either all the computations for the current
convolution window are performed or the PAU decides to apply the
activation early.
Figure 6: (a) The overall structure of the SnaPEA architecture and its multilevel memory hierarchy, containing an off-chip memory and a distributed on-chip buffer for inputs and outputs. (b) The microarchitecture of each PE. The weights are shared across the compute lanes.
Weight and index buffers. The weight buffer contains the weight
values of the convolution kernels in the pre-determined order
(see Section IV). The weights are ordered offline and loaded
into the memory in that order. Since the ordering of
the weights is changed, we also add an index buffer to
properly index the input buffer. Each index is used to load the
matching value from the input buffer. In every cycle, the controller
fetches one weight from the weight buffer and broadcasts it to all
the compute lanes. Simultaneously, the controller reads an index and
sends it to the input buffer to read the corresponding input values.
The input buffer delivers one input to each compute lane, so that
the lanes perform one multiplication each for adjacent convolution
windows.
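The interplay between the reordered weights and the index buffer can be sketched in a few lines. This is a minimal software sketch, not the hardware design: the function names `reorder_by_sign` and `broadcast_mac`, the choice to treat zero as non-negative, and the sample values are assumptions made for illustration. It only shows that indexing the inputs through the index buffer preserves the result of each window.

```python
from typing import List, Sequence, Tuple


def reorder_by_sign(weights: Sequence[float]) -> Tuple[List[float], List[int]]:
    """Return (reordered weights, index buffer) with non-negative weights first.

    The index buffer stores, for each reordered weight, its original position,
    which is also the position of the matching input inside a window.
    """
    # sorted() is stable, so weights of the same sign keep their relative order.
    order = sorted(range(len(weights)), key=lambda i: weights[i] < 0)
    return [weights[i] for i in order], order


def broadcast_mac(weights: Sequence[float], index: Sequence[int],
                  windows: Sequence[Sequence[float]]) -> List[float]:
    """Each step reads one weight and one index and shares them with all lanes."""
    acc = [0.0] * len(windows)              # one accumulator per compute lane
    for w, i in zip(weights, index):        # one (weight, index) pair per step
        for lane, window in enumerate(windows):
            acc[lane] += w * window[i]      # every lane works on its own window
    return acc


kernel = [0.5, -1.0, 2.0, -0.25]
ordered, index = reorder_by_sign(kernel)
sums = broadcast_mac(ordered, index, [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]])
print(ordered, index, sums)
```

Because the index travels with the weight, a single index read serves every lane of the PE, which is why the buffers can be shared across lanes.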
Input/Output Buffers. The input buffer holds a portion of the input
data for each convolution layer. Upon completion of all the
computations, the results are written into the output buffer. We use
one physical buffer for inputs and outputs; the buffer
is logically divided into two sub-buffers that hold the input and
output data of each layer. The logical partitioning allows us to use
either sub-buffer as an input or an output buffer. The results
of a layer l stored in the output buffer may be used as the input of
the next layer l+1. In this case, the roles of the two sub-buffers
are logically swapped without spending additional cycles on data
transfers.
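The logical swap can be illustrated with a small software analogy. The class name `PingPongBuffer` and its interface are hypothetical; the sketch only shows that exchanging roles changes which half is read and which is written, without copying any data.

```python
class PingPongBuffer:
    """One physical buffer split into two halves whose roles can be exchanged."""

    def __init__(self, size: int) -> None:
        self._banks = [[0.0] * size, [0.0] * size]
        self._input_bank = 0                     # which half currently holds inputs

    @property
    def input(self) -> list:
        return self._banks[self._input_bank]

    @property
    def output(self) -> list:
        return self._banks[1 - self._input_bank]

    def swap(self) -> None:
        """Exchange the roles of the two halves; no element is moved."""
        self._input_bank = 1 - self._input_bank


buf = PingPongBuffer(2)
buf.output[0] = 7.0      # layer l writes a result
buf.swap()               # layer l+1 now reads what layer l wrote
print(buf.input[0])
```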
Predictive Activation Unit (PAU). Figure 7 illustrates the
microarchitecture of the Predictive Activation Unit (PAU). One
PAU is added to each compute lane to support the convolution
operations in the exact and predictive modes. Performing the
convolution operations in the exact mode only requires checking
the sign of the partial sum during the MAC operations with
the negative weights. Accordingly, in the exact mode, the signal
Predict is set to zero, which allows the sign bit of the partial sum
stored in the register Acc Reg to determine the termination of the
convolution operations. Once the sign bit becomes one, the signal
Terminate is asserted and notifies the controller to terminate the
rest of the computations for the underlying convolution window.
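A behavioral sketch of the exact mode for a single convolution window is given below. It is not the hardware implementation: the function name `exact_mode_window` and the sample values are illustrative, and the sketch assumes that the inputs are non-negative (for example, outputs of a preceding ReLU) and that the weights are already ordered with the non-negative ones first.

```python
from typing import Sequence, Tuple


def exact_mode_window(weights: Sequence[float], index: Sequence[int],
                      inputs: Sequence[float]) -> Tuple[float, int]:
    """Return (ReLU output, number of MAC operations actually performed)."""
    acc = 0.0
    for count, (w, i) in enumerate(zip(weights, index), start=1):
        acc += w * inputs[i]
        # The sign check is only needed while processing negative weights.
        if w < 0 and acc < 0:
            return 0.0, count            # Terminate: the output is zero anyway
    return max(acc, 0.0), len(weights)   # regular ReLU at the end of the window


# Early termination: the partial sum turns negative on the third MAC.
print(exact_mode_window([2.0, 1.0, -3.0, -1.0], [0, 1, 2, 3], [1.0, 1.0, 2.0, 5.0]))
# No termination: the partial sum stays positive through all four MACs.
print(exact_mode_window([2.0, 1.0, -3.0, -1.0], [0, 1, 2, 3], [1.0, 1.0, 0.5, 0.5]))
```

With non-negative inputs, a partial sum that is already negative can only decrease further under the remaining negative weights, so the skipped operations cannot change the ReLU output.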
In the predictive mode, the sign of the convolution output is
speculated using a threshold value (th) and its associated number
of operations (n), both of which are statically determined by the
software part (see Algorithm 1). To perform speculation, the PAU
first compares the partial sum, coming from the accumulator
register, against the threshold value after the pre-determined
number of MAC operations. At this time, the controller sets the
signal Predict to one. If the partial sum is less than the
threshold, the PAU predicts that the final value of this
convolution window will eventually become negative. In this case,
the PAU performs the following tasks: (1) it notifies the controller
that no further computations are required for this convolution
window and (2) it performs the early ReLU activation and sends zero
to the output buffer. If the partial sum is larger than the
threshold, the compute lane continues the computations for the
convolution window normally until it reaches the negative weights.
The next check on the partial sum begins when the MAC operations
with the negative weights start. Here, the signal Predict is
de-asserted, and the PAU performs a simple one-bit sign check on
the partial sum after each MAC operation, as in the exact mode.
Once the sign bit becomes one, the PAU terminates the convolution
operations of the current window and sends a zero value to the
output.
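The two checks of the predictive mode can be sketched behaviorally as follows. The function name `predictive_mode_window` and the sample values are assumptions for illustration; th and n follow the notation above, and the weights are assumed to be ordered so that the non-negative ones come first. The first call shows a mis-prediction, where a small positive output is squashed to zero.

```python
from typing import Sequence, Tuple


def predictive_mode_window(weights: Sequence[float], index: Sequence[int],
                           inputs: Sequence[float], th: float, n: int
                           ) -> Tuple[float, int]:
    """Return (output, number of MAC operations actually performed)."""
    acc = 0.0
    for count, (w, i) in enumerate(zip(weights, index), start=1):
        acc += w * inputs[i]
        # Speculative check (Predict = 1): one threshold comparison after n MACs.
        if count == n and acc < th:
            return 0.0, count
        # Exact check (Predict = 0): sign test while processing negative weights.
        if w < 0 and acc < 0:
            return 0.0, count
    return max(acc, 0.0), len(weights)


weights, index, inputs = [2.0, 1.0, -3.0], [0, 1, 2], [0.1, 0.1, 0.05]
# Aggressive threshold: terminates after two MACs although the true output is positive.
print(predictive_mode_window(weights, index, inputs, th=0.5, n=2))
# Conservative threshold: the window runs to completion.
print(predictive_mode_window(weights, index, inputs, th=0.1, n=2))
```

Raising th saves more operations but squashes more positive outputs, which is the trade-off that the thresholds are tuned to control.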
Dynamically checking the partial sums
might lead to idle compute lanes. A lane that terminates early
remains idle until the rest of the lanes finish the computations of
their assigned convolution windows. Accordingly, increasing the
number of compute lanes may leave more lanes idle despite
providing higher parallelism across the convolution windows.
In Section VI, we evaluate the effect of increasing the number of
compute lanes on the idle cycles and, in turn, on the performance
and energy savings.
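The idle time can be quantified with a simple model. This is a simplified sketch that assumes one MAC operation per lane per cycle and that a group of windows completes only when its slowest lane finishes; the function names and the sample counts are illustrative.

```python
from typing import Sequence


def idle_cycles(macs_per_lane: Sequence[int]) -> int:
    """Cycles lanes spend waiting for the slowest lane of the group."""
    longest = max(macs_per_lane)
    return sum(longest - m for m in macs_per_lane)


def lane_utilization(macs_per_lane: Sequence[int]) -> float:
    """Fraction of lane-cycles that perform useful MAC operations."""
    longest = max(macs_per_lane)
    if longest == 0:
        return 1.0
    return sum(macs_per_lane) / (len(macs_per_lane) * longest)


# Two windows terminate after 3 MACs while two need all 9 MACs.
print(idle_cycles([9, 3, 9, 3]), lane_utilization([9, 3, 9, 3]))
```

In this model, adding lanes raises the chance that at least one window in the group runs to completion, so utilization tends to drop as the group grows.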
Pooling unit. Once the computations of a group of convolution
windows complete, the PE performs the pooling operation on the
results. Once done, the PE writes the results back into the output
buffer. These results are either used in the computations of the
next layers of the CNN or written back to the off-chip memory if
no further computations are required.
Figure 7: Predictive Activation Unit (PAU). The Predict signal determines the PAU operation mode (exact or predictive). The Terminate signal, once asserted, terminates the computation early.
Organization of PEs. As shown in Figure 6 (a), the SnaPEA
architecture contains multiple identical PEs organized in a 2D array.
The PEs are logically grouped both vertically and horizontally. The
input data are partitioned across the horizontal groups and the
kernels are partitioned across the vertical groups. The PEs in the
same horizontal and vertical groups work on the same portion of the
input data and kernels, respectively. Before the computation starts,
a portion of the input data is broadcast to all the PEs within the
same horizontal group. Similarly, one or more kernels are broadcast
to the PEs within the same vertical group. After the input and
kernel data are distributed, the PEs start and proceed with their
computations independently of the other PEs. Once the computations
of all the PEs within the same horizontal group end, the on-chip
buffer delivers the next portion of input data. In this
partitioning, some of the PEs may finish their computations earlier
than other PEs within the same horizontal group. These PEs remain
idle until all the other PEs complete their computations for all the
assigned kernels and the input data portion. This synchronization
mechanism reduces the cost of repeated data broadcasts among the PEs
while having a small impact on the performance. We evaluate the
impact of this synchronization mechanism in Section VI-B by
analyzing the sensitivity of performance to the number of compute
lanes per PE.
VI. EVALUATION
A. Methodology
Workloads. We use several popular medium- to large-scale dense
CNN workloads. We also include SqueezeNet [6], which maintains
AlexNet-level accuracy with 50× fewer parameters through a
static pruning approach. The fewer parameters in SqueezeNet
are attained using iterative pruning and re-training of the
convolution weights. Table I summarizes the evaluated networks
and their most pertinent parameters: model size,
number of convolution layers (Conv.), number of fully-connected
layers (FC), and the baseline classification accuracy. In all of the
evaluations, we use the ILSVRC-2012 [1] validation dataset.
System setup. We use Caffe v1.0 [11] to run the pre-trained
networks on a GPU. We compile Caffe using NVCC v8.0.62 and
GCC v4.8.4 with maximum architecture-specific and compiler
optimizations enabled. We configure Caffe to use Nvidia cuDNN
v6.0, a highly tuned GPU-accelerated deep neural network library.
Training/testing datasets. To learn the threshold values and
their associated set of operations for each kernel, we implement
Algorithm 1 through updating the data of convolutional layers in
Caffe v1.0. We uniformly sample a subset of images from each of
the 1,000 classes in ImageNet [1] to obtain the training and testing
datasets for the proposed algorithm. The uniform sampling among
all the classes enables us to cover images from distinct classes
during the training and testing phases of Algorithm 1.
Table I: Workloads, their release year, model size, number of convolution (Conv.) and fully-connected (FC) layers, and baseline classification accuracy. The model size shows the size of weights in megabytes.
Network      Year   Model Size (MB)   # of Conv. Layers   # of FC Layers   Classification Accuracy
AlexNet      2012   224               5                   3                72.6%
GoogLeNet    2015   54                57                  1                84.4%
SqueezeNet   2016   6                 26                  1                74.1%
VGGNet       2014   554               13                  3                83.0%
Table II: SnaPEA and EYERISS [2] design parameters and area breakdown.
Group    Component                     SnaPEA Size   SnaPEA Area (mm2)   EYERISS Size   EYERISS Area (mm2)
PE       # Compute Lanes / PE          4             0.012               1              0.003
PE       Partial Sum Register          N/A           0                   48 B           0.002
PE       Input Register                N/A           0                   24 B           0.001
PE       Weight Buffer                 0.5 KB        0.014               0.5 KB         0.014
PE       Index Buffer                  0.5 KB        0.007               N/A            0
PE       Input / Output RAM            20 KB         0.250               N/A            0
PE       Predictive Activation Units   4             0.008               N/A            0
Accel.   Number of PEs                 64            18.62               256            4.94
Accel.   Global Buffer                 N/A           0                   1.25 MB        12.9
         Total Area                                  18.6                               17.8
Architecture design and synthesis. We implement the
microarchitectural units of the proposed architecture including the
controllers, PEs, predictive activation unit (PAU), and registers
in Verilog. We use Synopsys Design Compiler (L-2016.03-SP5) and
a TSMC 45-nm standard-cell library to synthesize the proposed
architecture and obtain the area, delay, and energy numbers of the
logic hardware units.
Table III: Absolute and relative energy comparison for different components of the SnaPEA architecture along with the off-chip memory access energy cost. PE energy includes the cost of the Predictive Activation Unit (PAU).
Operation                Energy (pJ/Bit)   Relative Cost
Register File Access     0.20              1.0
16-bit Fixed Point PE    0.30              1.5
Inter-PE Communication   0.40              2.0
Global Buffer Access     1.20              6.0
DDR4 Memory Access       15.00             75.0
Figure 8: Overall (a) speedup and (b) energy reduction with exact mode.
SnaPEA and baseline architecture configurations. In this
paper, we explore an 8×8 array of PEs in SnaPEA, each with
four compute lanes, for a total of 256 MAC units. However, the
SnaPEA architecture can be scaled up to larger numbers of PEs.
Table II lists the major architectural parameters of the SnaPEA
design. We add a weight buffer and an index buffer, each 0.5 KB,
to every PE. Both the weight and index buffers are shared across all
the compute lanes within each PE. Each PE is also equipped with
a 20 KB buffer that is evenly divided between input and output.
The total capacity of these buffers is therefore 1.25 MB. Similar to
the weight and index buffers, the input and output buffers are
shared across all the compute lanes within a PE. Sharing the on-chip
memories across multiple PEs enables us to reduce the overhead
of index buffers. We size the input and output buffer so that the
activations of all the CNN models, except VGGNet, fit within
these on-chip buffers. This sizing eliminates the need to drain
and fill the on-chip buffers during the execution. For VGGNet,
which has deeper and larger layers, however, SnaPEA has to spill
the activations to memory during execution. We account for
the overhead of spilling the data to the off-chip memory in our
experiments. For the baseline architecture, we use the EYERISS [2]
accelerator. Table II shows the major architectural components of
EYERISS. To have the same peak throughput in both accelerators,
we configure EYERISS to have the same number of MAC units
(256) as ours. In addition, we allocate the same on-chip memory
size (1.25 MB) to both accelerators. The frequency of both
accelerators is fixed to 500 MHz. Table II summarizes the area of
the major microarchitectural components in SnaPEA and EYERISS.
Overall, the SnaPEA accelerator needs ≈4.5% more area than
the EYERISS architecture with the specified configurations
(Table II). This increase in area is mainly attributed to the added
predictive activation units (PAUs) in the PEs and the controllers.
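The configuration figures quoted above can be cross-checked with a few lines of arithmetic. The numbers are taken from the text and Table II; the variable names are illustrative, and summing the per-PE component areas to obtain the array area is an assumption about how the table entries relate to each other.

```python
# Array organization stated in the text.
pes = 8 * 8
lanes_per_pe = 4
mac_units = pes * lanes_per_pe                       # total MAC units

# On-chip input/output buffer capacity: 20 KB per PE.
io_buffer_kb_per_pe = 20
total_io_buffer_mb = pes * io_buffer_kb_per_pe / 1024

# Per-PE area from Table II: lanes, weight buffer, index buffer, I/O RAM, PAUs.
pe_area_mm2 = 0.012 + 0.014 + 0.007 + 0.250 + 0.008
array_area_mm2 = pes * pe_area_mm2

# Relative area overhead from the two total areas in Table II.
area_overhead_pct = (18.6 / 17.8 - 1.0) * 100.0

print(mac_units, total_io_buffer_mb,
      round(array_area_mm2, 2), round(area_overhead_pct, 1))
```

Under these assumptions, the sketch yields 256 MAC units, 1.25 MB of input/output buffering, roughly 18.62 mm2 for the PE array, and an area overhead of about 4.5%, in line with the values reported above.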
Energy measurements. Table III lists the energy consumption
of the SnaPEA microarchitectural units. For the hardware units, we
use the synthesis results with TSMC 45-nm and the numbers reported in
TETRIS [8], which uses the same technology node and has a PE
architecture similar to that of EYERISS. We include the energy
overhead of the predictive activation unit in the energy cost of the
PE (second row in Table III). However, for the baseline architecture
(EYERISS), we exclude the energy consumption of the predictive
activation unit and use a relative cost of 1.0 in the evaluations.
We use the publicly available Micron DDR4 system power
calculator [12] to estimate the energy cost of accesses to the
off-chip memory.
Figure 9: Overall (a) speedup and (b) energy reduction with SnaPEA over EYERISS [2] in the predictive mode. The classification accuracy drop is kept within 3% of the baseline value.
Cycle-level microarchitecture simulation. We develop a
cycle-level microarchitectural simulator that closely models the
architecture of the EYERISS and SnaPEA hardware to measure the
performance and energy savings of both designs. We integrate
the microarchitectural components explained in Section V into the
simulator in a cycle-level manner. To measure the energy savings,
we use the synthesis results and the reported energy numbers
from some of the recent works [2], [8], [13]. Furthermore, we
use CACTI-P [14] to calculate the area and power of the register
files and on-chip buffers. Where the technology nodes differ,
we scale the area, delay, and energy numbers to make them
consistent with our synthesis flow. We integrate the delay and
energy numbers collected from the aforementioned sources into our
cycle-level simulator. The simulator takes the configuration of a
CNN architecture as input and generates an event log for each
hardware component. Finally, using the generated event log along
with the integrated delay and energy numbers, the simulator reports
the number of cycles and the energy for the whole network.
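The last step, turning an event log into an energy figure, might look like the sketch below. It is a hedged illustration and not the simulator used in this work: the event names, the function names, and the assumption that every event moves or processes one 16-bit word are hypothetical, while the per-bit energies are the values listed in Table III.

```python
from typing import Dict

# Energy per bit in pJ, as listed in Table III. The dictionary keys are
# hypothetical event names chosen for this sketch.
ENERGY_PJ_PER_BIT: Dict[str, float] = {
    "register_file": 0.20,
    "pe": 0.30,
    "inter_pe": 0.40,
    "global_buffer": 1.20,
    "dram": 15.00,
}


def total_energy_pj(event_counts: Dict[str, int], bits_per_event: int = 16) -> float:
    """Sum the energy of all logged events, assuming a fixed word width per event."""
    return sum(ENERGY_PJ_PER_BIT[name] * bits_per_event * count
               for name, count in event_counts.items())


def relative_cost(name: str) -> float:
    """Cost of an event relative to a register file access."""
    return ENERGY_PJ_PER_BIT[name] / ENERGY_PJ_PER_BIT["register_file"]


log = {"pe": 100, "dram": 2}     # hypothetical event log of one component
print(total_energy_pj(log), relative_cost("dram"))
```

In this toy log, two off-chip accesses cost about as much energy as one hundred PE operations, which is why skipped computations that also avoid memory traffic matter for the energy results.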
B. Experimental Results
Overall benefits in the exact mode. Figure 8 illustrates the
speedup and energy reduction when the predictive activation is
disabled (i.e., exact mode). In this mode, the SnaPEA hardware
applies the early activation only when the value of the partial sum
drops below zero (see Section V). As there is no prediction,
the CNN classification accuracy does not degrade. In
this setting, SnaPEA delivers, on average, 1.3× speedup and
1.16× energy reduction over EYERISS. Even
for SqueezeNet [6], a statically pruned convolutional neural
network, SnaPEA yields 1.3× speedup and 1.14× energy reduction.
These savings for SqueezeNet show that static pruning techniques
are complementary to the dynamic approach of SnaPEA. Overall, the
results show the practicality of SnaPEA in delivering speedup
and energy reductions even in the pure exact mode, in which the
CNN classification accuracy remains unchanged (Table I).
Overall benefits in predictive mode. Figure 9a illustrates the
overall performance improvement of SnaPEA over EYERISS in
the predictive mode while keeping the classification accuracy
within 3% of its baseline value (see Table I). In this
configuration, the predictive activation units (PAUs) might
mis-predict a positive activation value as negative. The error
injected in the convolutional layers may in turn lead to a drop in
the final classification accuracy. The highest speedup
(2.08×) is observed in GoogLeNet, in which a large fraction of the
features are negative, and hence the saving is larger.
Figure 10: Speedup of convolutional layers in each network for the predictive mode when the degradation in classification accuracy is set to ≤ 3%. (Plot: the y-axis is Speedup, from 1.00× to 3.50×; the x-axis groups are AlexNet, GoogLeNet, SqueezeNet, and VGGNet; the labeled layers are conv4, conv3, inception_4e/5x5_reduce, inception_4e/1x1, fire5/squeeze1x1, fire6/expand3x3, conv4_2, and conv5_3.)
Figure 9b illustrates the energy reduction with SnaPEA in
predictive mode over EYERISS [2]. As in the simulation
settings for speedup, the degradation in classification accuracy
is kept within 3%. Among all the CNN models,
GoogLeNet enjoys the highest energy reduction (1.63×).
For SqueezeNet [6], a statically pruned CNN model, our
technique yields 1.80× speedup and 1.42× energy reduction.
This result confirms the effectiveness of SnaPEA,
even for statically pruned models [6], in exploiting the
runtime information to provide significant savings.
Figure 10 illustrates the speedup of convolutional layers
in different networks when the accuracy drop is set to 3%. The
widest range of speedup is observed in GoogLeNet, in which
the maximum speedup is 3.59×, achieved by convolution layer
inception_4e/1x1, and the minimum speedup is 17%, achieved by
the layer inception_4e/5x5_reduce.
Moreover, to keep the accuracy drop acceptable, only a fraction of
the convolutional layers operate in the predictive mode; these
layers are selected by the software part.
Table IV summarizes the percentage of convolutional layers that
operate in the predictive mode in each network when the accuracy
drop is set to 3%. The average speedup and energy saving across
those layers are also reported in the table. The results show that,
on average, 67.8% of the convolutional layers operate in the
predictive mode, and the average speedup and energy saving
across these layers are 2.02× and 1.89×, respectively.
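As a quick cross-check, the reported averages appear consistent with an unweighted arithmetic mean over the four networks in Table IV. That the averages were computed this way is an assumption of the sketch below; the per-network values are copied from Table IV and the names are illustrative.

```python
# (percentage of layers in predictive mode, average speedup, average energy reduction)
table_iv = {
    "AlexNet":    (60.0, 2.11, 1.97),
    "GoogLeNet":  (84.21, 2.17, 2.04),
    "SqueezeNet": (65.38, 1.94, 1.84),
    "VGGNet":     (61.50, 1.87, 1.73),
}


def column_mean(column: int) -> float:
    """Unweighted mean of one column of Table IV across the networks."""
    return sum(row[column] for row in table_iv.values()) / len(table_iv)


print(column_mean(0), column_mean(1), column_mean(2))
```

The three means land within rounding of the 67.8%, 2.02×, and 1.89× quoted in the text.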
Table IV: The percentage of convolution layers that operate in the predictive mode when the classification accuracy drop is set to ≤3%. The last two columns show the average speedup and energy reduction across these convolution layers.
Network      % of Convolution Layers   Average Speedup   Average Energy Reduction
AlexNet      60.0%                     2.11×             1.97×
GoogLeNet    84.21%                    2.17×             2.04×
SqueezeNet   65.38%                    1.94×             1.84×
VGGNet       61.50%                    1.87×             1.73×
Table V: True negative and false negative rate in predictive mode when classification accuracy drop is set to ≤ 3%.
Network      True Negative Rate   False Negative Rate
AlexNet      61.84%               21.39%
GoogLeNet    66.36%               28.37%
SqueezeNet   49.32%               16.69%
VGGNet       47.54%               15.21%
Prediction accuracy. We study how effective the predictive mode
is in predicting the negative values. Table V shows the average
true negative and false negative rates across all the convolutional
layers in the studied CNN models. The true negative rate measures
the proportion of negative values that are correctly identified as
negative. Applying early activation on these values does not have
any effect on the final classification accuracy. The false negative
rate measures the proportion of the positive values that are
mis-predicted as negative and squashed to zero, which might lead
to degradation in the final classification accuracy. On average, the
true (false) negative rate of our proposed prediction mechanism
is 56.26% (20.41%). Due to our optimization technique (see
Algorithm 1), on average, more than 86% of the errors occur
on small positive values. Small positive values in the
activations generally have only a slight effect on the final
classification accuracy, mainly because each
convolutional layer is commonly followed by a max-pooling
layer, in which the small values are filtered out. The high true
negative rate enables us to apply the activation on the negative
values early and significantly reduce the ineffectual operations.
Furthermore, the high true negative rate along with the modest false
negative rate shows the capability of SnaPEA to use
the runtime information to predict the negative values while
confining the injected errors mainly to small positive values.
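The two rates, as defined in this section, can be computed from the full convolution outputs and the early-termination decisions. The function name `negative_rates`, the argument layout, and the sample values are illustrative assumptions; outputs that are exactly zero are ignored in this sketch.

```python
from typing import Sequence, Tuple


def negative_rates(conv_outputs: Sequence[float],
                   flagged_negative: Sequence[bool]) -> Tuple[float, float]:
    """Return (true negative rate, false negative rate) as defined in the text.

    conv_outputs holds the exact pre-activation values; flagged_negative marks
    the windows for which the output was predicted negative and set to zero.
    """
    neg = [f for y, f in zip(conv_outputs, flagged_negative) if y < 0]
    pos = [f for y, f in zip(conv_outputs, flagged_negative) if y > 0]
    tnr = sum(neg) / len(neg) if neg else 0.0   # negatives correctly identified
    fnr = sum(pos) / len(pos) if pos else 0.0   # positives squashed to zero
    return tnr, fnr


outputs = [-1.0, -2.0, 0.5, 3.0, -0.1, 0.2]
flags = [True, False, True, False, True, False]
print(negative_rates(outputs, flags))
```

In the sample, two of three negative outputs are caught early and one of three positive outputs (a small one, 0.5) is mis-predicted.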
Sensitivity to the degree of speculation. To study the effect of
our proposed predictive early activation technique, Figure 11
illustrates the speedup with SnaPEA over EYERISS [2] when
the classification accuracy loss varies from 0% to 3%. The 0%
classification accuracy loss corresponds to using no prediction
mechanism (exact mode). The remaining classification accuracy
loss levels (i.e., 1.0%, 2.0%, 3.0%) correspond to using the
predictive early activation mechanism (predictive mode). In fact,
supporting distinct levels of loss in the classification accuracy
is one of the contributions of our work. The proposed predictive
early activation technique exposes a knob for the user to
gracefully navigate the trade-offs between CNN classification
accuracy and performance and efficiency gains. On average, SnaPEA
delivers 1.28×, 1.38×, 1.63×, and 1.9× speedup when we relax the
constraint on the acceptable degradation of classification accuracy
to 0.0%, 1.0%, 2.0%, and 3.0%, respectively. As we increase the
acceptable degradation in the classification accuracy, all the
evaluated CNNs enjoy a boost in speedup and energy reduction.
Figure 11: Speedup vs. loss in the CNN classification accuracy. Each bar indicates the speedup when the acceptable degradation in the classification accuracy is 0% (pure exact mode), 1% (predictive mode), 2.0% (predictive mode), and 3.0% (predictive mode), respectively.
Figure 12: Sensitivity of speedup with SnaPEA over EYERISS to the number of compute lanes per PE. Each bar indicates the speedup when the number of compute lanes per PE is altered by a different factor (acceptable classification accuracy drop ≤3%).
Sensitivity to the number of compute lanes. Figure 12 illustrates
the impact of varying the number of compute lanes within each
PE on the speedup with SnaPEA over EYERISS. We present the
results for the predictive mode when the maximum loss in the
CNN classification accuracy is set to 3%. The second bar (Default)
shows the speedup of the default SnaPEA system (i.e., four
compute lanes) over EYERISS with the same number of compute
elements. The rest of the bars (first, third, and fourth) show the
speedup of SnaPEA when the number of compute lanes per
PE is scaled uniformly across all the PEs by a factor of one half,
two, and four, respectively. Increasing the number of compute lanes
potentially increases the parallelism across different
convolution windows. However, due to the synchronization
overhead between the compute lanes of each PE (see Section V,
Organization of PEs), the improvements diminish. The results
show that increasing the number of lanes by two times and four
times hurts the performance by ≈36% and ≈45%, respectively.
Also, if we halve the number of lanes, the performance
decreases by ≈26%. The main reason for this behavior is
the uneven amount of computation performed by each
compute lane. In contrast to EYERISS [2], in SnaPEA the number
of operations in each convolution window varies due to its runtime
early activation. Therefore, increasing the number of arithmetic
units reduces the utilization of the compute lanes and diminishes
the benefit of higher parallelism.
VII. RELATED WORK
SnaPEA is fundamentally different from the prior studies in
three major ways: (1) we exploit the inherent algorithmic structure
of CNNs and runtime information to judiciously perform early
activation and skip ineffectual computations, (2) we expose a knob
that enables the user to gracefully navigate the trade-offs between
the classification accuracy, performance, and energy efficiency,
and (3) we study the rich and unexplored area of task skipping
in the domain of deep convolutional neural networks and conjoin
these two disjoint lines of research in SnaPEA. Below, we discuss
the most related works.
CNN accelerators. Several accelerators for convolutional neural
networks have been proposed [2], [7]–[10], [15]–[23]. In some
of the most recent works [2], [20], [23], 2D spatial architectures
have been proposed to match the convolution dataflow and
maximize the data reuse. TETRIS [8] and Neurocube [15] have
almost the same compute engines as the previous CNN accelerators.
However, these works studied the challenges and opportunities
of designing efficient CNN accelerators in a 3D-stacked memory
setting. None of these accelerators evaluated the benefits of
performing early activation in the convolution operation.
Pruning techniques. A handful of studies [4], [6], [24]–[26]
proposed various static pruning techniques to reduce the overhead
of computation in deep convolutional neural networks. These static
pruning techniques are agnostic to the dynamically generated zeros
whose locations in the activation layer vary from one image to
another. As our results show, SnaPEA is complementary to these
techniques and further improves on the benefits of static pruning.
Furthermore, several architectures have also been
proposed [7], [9], [17]–[19] to exploit the sparsity in the input
activations and/or weights to improve the efficiency of the
accelerator. In one of the most recent works, SCNN [7] designs an
accelerator that exploits the sparsity in both the activations and
weights. The novel dataflow proposed in SCNN maximizes the data
reuse in the sparse activations and weights. Our work is orthogonal
to the previous efforts that focused on exploiting the sparsity in
CNN accelerators. SnaPEA takes a different approach from prior
designs by judiciously re-ordering the MAC operations in a sliding
window and performing the early activation in convolutional windows.
Task skipping. A handful of research efforts [27]–[34] have
looked into task skipping in various domains. In one of the most
recent efforts [29], Sidiroglou et al. proposed loop perforation, in
which accuracy is traded for improved performance.
In their proposal, they algorithmically transform the critical
loops in the program and only execute a subset of their iterations.
PredictiveNet [34] proposes a skipping mechanism for CNNs.
It first performs the computations on the most-significant bits
and then speculatively decides whether to perform the computation
on the least-significant bits. In contrast, SnaPEA completely skips
the computations of a significant fraction of the operations. As
such, SnaPEA not only reduces the computation cost, but also
reduces the number of accesses to the on-chip buffers. Although
SnaPEA takes inspiration from the prior proposals in task skipping,
it uniquely applies the task skipping mechanism in the domain
of deep convolutional neural networks in order to effectively
eliminate the ineffectual data transfers and computations.
VIII. CONCLUSION
Traditionally, layers of deep neural networks have been thought
to work in isolation, merely handing each other their results.
However, our work took a different approach by considering the most
common sequence of layers in emerging deep networks to reduce
the amount of computation. As such, SnaPEA has devised a
predictive early activation that operates in two distinct modes,
namely exact and predictive. In the exact mode, in which the nominal
classification accuracy remains unchanged, SnaPEA uses a
combination of static re-ordering of the weights and a low-overhead
sign check to determine when to terminate the computation. SnaPEA
further improves the performance and efficiency of convolution
operations in the predictive mode by speculatively cutting the
computation of a convolution operation short if it predicts that its
output is negative, immediately applying the activation. Compared to
a recent CNN accelerator, SnaPEA in the exact mode yields 28%
speedup (maximum of 74%) and 16% energy reduction (maximum of 51%)
across various modern CNNs without affecting their classification
accuracy. With 3% loss in classification accuracy, on average,
67.8% of the convolutional layers operate in the predictive mode,
and the average speedup and energy saving across these layers are
2.02× and 1.89×, respectively. The significant gains due to the
computation and memory access reduction across several modern
CNNs show the effectiveness of our approach, which conjoins runtime
information and algorithmic insights in a unified accelerator.
IX. ACKNOWLEDGMENTS
We thank our shepherd, Philip Stanley-Marbell, and the
members of the Alternative Computing Technologies (ACT) Lab,
Michael Brzozowski, Jake Sacks, Hardik Sharma, Divya Mahajan,
and Jongse Park for insightful comments and discussions. Amir
Yazdanbakhsh was partly supported by a Microsoft Research
PhD Fellowship. This work was supported in part by NSF awards
CNS#1703812, ECCS#1609823, CCF#1553192, and CCF#1029783, an
Air Force Office of Scientific Research (AFOSR) Young
Investigator Program (YIP) award #FA9550-17-1-0274, and gifts
from Google, Microsoft, and Qualcomm.
REFERENCES
[1] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNetLarge Scale Visual Recognition Challenge,” IJCV, 2015.
[2] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Architecture forEnergy-Efficient Dataflow for Convolutional Neural Networks,” in ISCA,2016.
[3] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra,and H. Esmaeilzadeh, “From High-Level Deep Neural Models to FPGAs,”in MICRO, 2016.
[4] S. Han, H. Mao, and W. J. Dally, “Deep Compression: Compressing DeepNeural Networks with Pruning, Trained Quantization, and Huffman Coding,”in ICLR, 2016.
[5] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,V. Vanhoucke, and A. Rabinovich, “Going Deeper with Convolutions,” inCVPR, 2015.
[6] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, andK. Keutzer, “SqueezeNet: AlexNet-level Accuracy with 50x Fewer Param-eters and¡0.5 MB Model Size,” arXiv preprint arXiv:1602.07360, 2016.
[7] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany,J. Emer, S. W. Keckler, and W. J. Dally, “SCNN: An Accelerator forCompressed-sparse Convolutional Neural Networks,” in ISCA, 2017.
[8] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “TETRIS: Scalableand Efficient Neural Network Acceleration with 3D Memory,” in ASPLOS,2017.
[9] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, andA. Moshovos, “Cnvlutin: Ineffectual-Neuron-Free Deep Neural NetworkComputing,” in ISCA, 2016.
[10] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam,“DianNao: A Small-footprint High-throughput Accelerator for UbiquitousMachine-learning,” in ASPLOS, 2014.
[11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,S. Guadarrama, and T. Darrell, “Caffe: Convolutional Architecture for FastFeature Embedding,” arXiv preprint arXiv:1408.5093, 2014.
[12] “DDR4 Spec - Micron Technology, Inc.” https://goo.gl/9Xo51F.[13] S. Galal, Energy Efficient Floating-Point Unit Design. PhD thesis, The
Department of Electrical Engineering of Stanford University, 2012.[14] S. Li, K. Chen, J. H. Ahn, J. B. Brockman, and N. P. Jouppi, “CACTI-P:
Architecture-level Modeling for SRAM-based Structures with AdvancedLeakage Reduction Techniques,” in ICCAD, 2011.
[15] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay,“NeuroCube: A Programmable Digital Neuromorphic Architecture withHigh-Density 3D Memory,” in ISCA, 2016.
[16] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos,“Stripes: Bit-serial Deep Neural Network Computing,” in MICRO, 2016.
[17] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee,J. M. Hernndez-Lobato, G. Y. Wei, and D. Brooks, “Minerva: EnablingLow-Power, Highly-Accurate Deep Neural Network Accelerators,” in ISCA,2016.
[18] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, andY. Chen, “Cambricon-X: An Accelerator for Sparse Neural Networks,” inMICRO, 2016.
[19] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally,“EIE: Efficient Inference Engine on Compressed Deep Neural Network,”in ISCA, 2016.
[20] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, andO. Temam, “ShiDianNao: Shifting Vision Processing Closer to the Sensor,”in ISCA, 2015.
[21] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Teman, X. Feng, X. Zhou,and Y. Chen, “PuDianNao: A Polyvalent Machine Learning Accelerator,”in ASPLOS, 2015.
[22] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “OptimizingFPGA-based Accelerator Design for Deep Convolutional Neural Networks,”in FPGA, 2015.
[23] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun,“NeuFlow: A Runtime Reconfigurable Dataflow Processor for Vision,” inCVPR Workshops, 2011.
[24] H. Mao, S. Han, J. Pool, W. Li, X. Liu, Y. Wang, and W. J. Dally, “Exploringthe Regularity of Sparse Structure in Convolutional Neural Networks,”CoRR, 2017.
[25] H. Alemdar, V. Leroy, A. Prost-Boucle, and F. Petrot, “Ternary NeuralNetworks for Resource-efficient AI Applications,” in IJCNN, 2017.
[26] Y. He, X. Zhang, and J. Sun, “Channel Pruning for Accelerating Very DeepNeural Networks,” arXiv preprint arXiv:1707.06168, 2017.
[27] A. Yazdanbakhsh, G. Pekhimenko, B. Thwaites, H. Esmaeilzadeh, O. Mutlu,and T. C. Mowry, “RFVP: Rollback-Free Value Prediction with Safe toApproximate Loads,” in TACO, 2015.
[28] S. Misailovic, D. M. Roy, and M. C. Rinard, “Probabilistically AccurateProgram Transformations,” in SAS, 2011.
[29] S. Sidiroglou-Douskos, S. Misailovic, H. Hoffmann, and M. Rinard,“Managing Performance vs. Accuracy Trade-offs with Loop Perforation,”in FSE, 2011.
[30] S. Misailovic, S. Sidiroglou, H. Hoffman, and M. Rinard, “Quality ofService Profiling,” in ICSE, 2010.
[31] H. Hoffmann, S. Misailovic, S. Sidiroglou, A. Agarwal, and M. Rinard, “Us-ing Code Perforation to Improve Performance, Reduce Energy Consumption,and Respond to Failures,” Tech. Rep. MIT-CSAIL-TR-2009-042, MIT, 2009.
[32] M. C. Rinard, “Using Early Phase Termination to Eliminate Load Imbalancesat Barrier Synchronization Points,” in OOPSLA, 2007.
[33] M. Rinard, “Probabilistic Accuracy Bounds for Fault-tolerant Computationsthat Discard Tasks,” in ICS, 2006.
[34] Y. Lin, C. Sakr, Y. Kim, and N. Shanbhag, “PredictiveNet: AnEnergy-efficient Convolutional Neural Network via Zero Prediction,” inISCAS, 2017.
673