
Improving Energy Gains of Inexact DSP Hardware Through Reciprocative Error Compensation

Avinash Lingamneni
Dept. of ECE, Rice University
Houston, TX 77005, [email protected]

Arindam Basu
School of EEE, Nanyang Technological University
50 Nanyang Ave, Singapore 639798

Christian Enz
Wireless & Integrated Systems Division, CSEM SA
Neuchatel, CH-2002, Switzerland

Krishna V Palem
Dept. of CS, Rice University
Houston, TX 77005, [email protected]

Christian Piguet
Wireless & Integrated Systems Division, CSEM SA
Neuchatel, CH-2002, Switzerland

ABSTRACT

We present a zero hardware-overhead design approach called reciprocative error compensation (REC) that significantly enhances the energy-accuracy trade-off gains in inexact signal processing datapaths by using a two-pronged approach: (a) deliberately redesigning the basic arithmetic blocks to effectively compensate for each other’s (expected) error through inexact logic minimization, and (b) “reshaping” the response waveforms of the systems being designed to further reduce any residual error. We apply REC to several DSP primitives such as the FFT and FIR filter blocks, and show that this approach delivers 2-3 orders of magnitude lower (expected) error and more than an order of magnitude lower Signal-to-Noise Ratio (SNR) loss (in dB) than previously proposed inexact design techniques, while yielding similar energy gains. Post-layout comparisons in a 65nm process technology show that our REC approach achieves up to 73% energy savings (with corresponding delay and area savings of up to 16% and 62% respectively) when compared to an existing exact DSP implementation, while trading a relatively small loss in SNR of less than 1.5 dB.

1. INTRODUCTION AND BACKGROUND

A large class of emerging applications, in particular embedded, multimedia, DSP systems and Recognition, Mining and Synthesis (RMS) workloads, can tolerate varying amounts of error, yet remain potentially useful. This is primarily attributed to the fact that the information produced by these systems is consumed by our senses, which possess “cognitive filling” capabilities, and to the absence of a unique and well-defined “golden” result in the end. Hence, in these systems, error (either caused probabilistically due to inherent variations/perturbations [12] or introduced explicitly [13]) has been viewed as a commodity that can be traded for lowering of hardware costs such as the energy consumed, delay and/or area.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC’13, May 29 - June 07 2013, Austin, TX, USA. Copyright 2013 ACM 978-1-4503-2071-9/13/05 ...$15.00.

However, the existing work in inexact design has a few major shortcomings: (a) the study of efficient ways to compose the building blocks to realize a system so as to maximize the hardware savings remains largely unaddressed, and straightforward approaches more often than not lead to prohibitive accumulation of error as we combine atomic blocks to realize complex systems; (b) algorithmically well-founded approaches to navigate the combinatorially explosive design space of these inexact logic functions have been lacking; and (c) taking advantage of the knowledge of the underlying inexactness to redesign the DSP system or, more specifically, the algorithm it implements, has seldom been done. Overcoming these shortcomings has the potential for achieving more gains for a given error when compared to naive designs and approaches.

Figure 1: A framework for inexact DSP systems through Reciprocative Error Compensation.

In this paper, we present an approach called reciprocative error compensation (shown in Figure 1) to address and overcome these shortcomings. The REC approach provides for efficient datapath realizations by advocating a redesign of the basic building blocks through careful selection of “compensation buddies” for individual computational blocks, thereby enabling an effective compensation effect to overcome the first shortcoming of the existing works. Considering the second shortcoming, we tap into the substantial research that has been done in the domain of logic synthesis and optimization over the last couple of decades and demonstrate that traditional logic synthesis principles that were largely inapplicable in the context of traditional datapath design can in fact be extremely well-suited to guide the design space exploration of the inexact datapaths. Using this as a basis, we then propose a system-level technique, termed waveform shaping, that effectively reshapes the frequency (and phase) response curves of the DSP datapaths we are designing, to further reduce the error and so address the third shortcoming. As the waveform shaping is applied at design time, and only involves modifying the coefficients of the DSP blocks to be stored in the memories, it does not incur any hardware overheads. Thus, succinctly, REC consists of: (i) an inexact logic minimization step based on compensation buddy identification, followed by (ii) a system-level DSP error compensation step based on waveform shaping. We demonstrate the efficacy of our approach by applying it to widely-used signal processing circuits: a pipelined complex Fast Fourier Transform (FFT) and a Finite Impulse Response (FIR) filter.

The rest of the paper is organized as follows: we present a brief review of the previously published inexact design techniques in Section 2. We then develop our novel REC technique in Section 3, which includes a mathematical analysis of the drawbacks of existing approaches and the identification of compensation buddies for a node (or nodes), leading to a general algorithm for finding them in a given DSP datapath graph. In Section 4, we provide details of the second step of REC, which provides a system-level compensation in DSP subsystems through waveform shaping. In Section 5, we describe the experimental framework and DSP architectures used for demonstrating the efficacy of the REC framework along with the results and analysis. We conclude the paper and identify some possible directions for future work in Section 6.

2. RELATED WORK

A plethora of papers, as outlined in [10] and [11], have started taking advantage of the principle of trading accuracy of the hardware designs in exchange for significant resource savings in a wide variety of error-resilient applications. This was originally advocated in the works of Palem [18, 19]. Most of the early work focused on utilizing overscaling of physical parameters (such as supply voltage) to control and vary the energy-accuracy tradeoffs. Approaches such as voltage overscaling suffered from inefficient amortization of the hardware overheads in the targeted embedded systems [17, 1]. However, in recent years, there has been a shift of focus to innovations at the architectural level [1, 4, 17, 20, 9] that provided zero hardware overhead implementations. These techniques advocated reduction in the logic density either by pruning/deletion of a component with a low significance (determined by its contribution to the output being computed) [1, 3], or by transformations to “similar” logic that consumes less energy [15, 5, 2]. While, by definition, pruning allowed the definitive inclusion or exclusion of a collection of gates and thus offered a coarse-grained outcome, transformation of logical structures to “similar” lower-cost counterparts afforded more fine-grained control over the energy-accuracy tradeoffs due to a much larger associated design space. The latter techniques, termed inexact logic minimization, involved logic-level manipulation of Boolean functions [15, 5, 2] using the notion of intentional bit flips or cube-swaps (an intentional forcing of the output from 0 → 1 or 1 → 0 for certain input vectors to assist in further logic minimization). The value of these architectural-layer approaches has been demonstrated through specific arithmetic structures such as adders or multipliers [16, 20, 5, 1, 17, 2]. However, efficient techniques for realizing inexact circuits of larger scale, such as signal processing primitives using these smaller nodes as building blocks, have not been explored. An important part of the problem is that most of these previously proposed techniques do not scale efficiently as we combine simpler systems, such as adders and multipliers, to realize more complex designs. Furthermore, most of these existing techniques either provide a unidirectional design space [1, 17] leading to accumulative error or have a narrow and restrictive design space [16, 20]. On the other hand, the inexact logic minimization techniques provided a richer design space, but overlooked the inherent compensation ability that could be gleaned from these logic transformations and relied on absolute or worst-case statistical measures as the guiding principles, which more often than not lead to pessimistic, under-designed inexact systems.

In this paper, we focus on building upon and improving these known techniques through a scalable REC approach, resulting in efficient design of large-scale datapath networks from the DSP domain. In particular, we leverage the empirical evidence that inexact logic minimization yields error distributions that are (approximately) Gaussian and therefore, minimizing the mean value of these error distributions (when variance is kept constant) will improve the SNR in the context of DSP hardware.
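The link between a smaller mean error and a better SNR can be made explicit with the standard decomposition of error power. The following is a short worked relation, not taken from the paper: $P_s$ (the signal power) is an assumed symbol, and $\mu$ and $\sigma^2$ denote the mean and variance of the output error $\xi$.

\[
E[\xi^2] = \sigma^2 + \mu^2, \qquad \mathrm{SNR}_{\mathrm{dB}} = 10 \log_{10} \frac{P_s}{\sigma^2 + \mu^2}
\]

With $\sigma^2$ held fixed, any reduction in $|\mu|$ lowers the noise power in the denominator and therefore raises the SNR.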

3. INEXACT LOGIC MINIMIZATION THROUGH RECIPROCATIVE ERROR COMPENSATION

3.1 Drawbacks of Existing Techniques for Systems Composed of Inexact Blocks

To demonstrate the rapid degradation in the quality (accuracy) of the results using the existing techniques as the scale of the composition increases, we present in this subsection a mathematical analysis of a linear network of adders, as found in DSP datapaths and shown in Figure 2, and in the subsequent subsection empirical results for more complex networks with multipliers.

Figure 2: A typical composition of datapath elements (a chain of nodes 1 to n, where node i has fan-in f_i and contributes error i to the output).

Let there be $n$ nodes in this path, with a node at level $i$ having $f_i$ inputs (a fan-in of $f_i$) and an error represented by a random variable $\xi_i$ with a probability distribution function $g(\xi_i)$. The expected value of $\xi_i$ is denoted by $\mu_i = E(\xi_i) = \sum \xi_i \cdot g(\xi_i)$, with a variance of $\sigma_i^2 = \sum (\xi_i - \mu_i)^2 \cdot g(\xi_i)$.

Let the error at the primary output of the network be denoted by $\hat{\xi}$, with its expected value given by

\[
\hat{\mu} = \mu_n + f_n \cdot \mu_{n-1} + \cdots + \Big(\prod_{j=2}^{n} f_j\Big) \cdot \mu_1 \tag{1}
\]

For ease of exposition, let us consider a simple case where all the nodes have the same error properties, i.e., $g(\xi_i) = g(\xi_j)$ for all nodes $1 \le i \le j \le n$ with a mean $\mu$, and a uniform number of inputs (fan-in) $M$ with $M \in \mathbb{N}$, $M > 1$, i.e., $f_i = M$ for all nodes $1 \le i \le n$. Then equation (1) reduces to

\[
\hat{\mu} = \mu \cdot \Big(\sum_{i=0}^{n-1} M^i\Big) = \mu \cdot \frac{M^n - 1}{M - 1} \tag{2}
\]

showing an exponential increase in the expected error value with increasing number of nodes.
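The growth described by Equations (1) and (2) can be checked numerically. The following is a minimal sketch, assuming per-node means and fan-ins are given as plain lists; the function names and the sample values of $\mu$ and $M$ are illustrative and do not come from the paper.

```python
from math import prod

def expected_output_error(mu, fanin):
    """Evaluate Equation (1).

    mu[i] and fanin[i] hold the mean error and fan-in of node i+1.
    fanin[0] does not affect the result because node 1 has no
    upstream inexact node.
    """
    total = 0.0
    for i in range(len(mu)):
        # Node i+1 is amplified by the fan-ins of all downstream nodes.
        gain = prod(fanin[i + 1:])
        total += gain * mu[i]
    return total

def closed_form(mu, M, n):
    """Equation (2): identical nodes with mean mu and uniform fan-in M > 1."""
    return mu * (M**n - 1) / (M - 1)

# Illustrative values: the direct sum and the closed form agree,
# and both grow exponentially with the path depth n.
mu, M = 0.01, 2
for n in (2, 4, 8, 16):
    print(n, expected_output_error([mu] * n, [M] * n), closed_form(mu, M, n))
```

The per-node gain is computed directly from the product of downstream fan-ins, so the same function also covers the non-uniform case of Equation (1).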

3.2 Nullifying Expected Error Through Compensation Buddies

To avoid such an exponential error increase, through our REC approach, we will find a compensation buddy for each inexact node in the circuit. Formally:

Definition 1. Given a node $i$ with its associated error represented by a random variable with mean $\mu_i$, it is said to be a compensation buddy for the set of its input (fan-in) node(s) $\Phi$ iff

\[
\mu_i = -\sum_{j \in \Phi} \mu_j .
\]

Using this definition of compensation buddies in Equation (1), we get

\[
\hat{\mu} = -f_n \cdot \mu_{n-1} + f_n \cdot \mu_{n-1} + \cdots + \Big(\prod_{j=3}^{n} f_j\Big) \cdot (-f_2 \cdot \mu_1) + \Big(\prod_{j=2}^{n} f_j\Big) \cdot \mu_1 = 0 \tag{3}
\]

Without loss of generality, we can assume that n is even; the case where n is odd can be handled by inserting an exact node n. Hence, the proposed REC technique theoretically achieves a zero expected error value using compensation buddies, wherever such buddies exist for the design of interest. However, while such a theoretical limit is possible, the practicality of this approach relies on the ability to find appropriate compensation buddies, which we address in the next subsection. An illustrative example highlighting the distinction in the error profiles of the conventional and REC-based inexact logic minimization techniques is shown in Section 8.1 of the appendix.
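The cancellation in Equation (3) can be reproduced on a small chain. The sketch below assumes, as in the linear network of Figure 2, that all $f_i$ inputs of node $i$ carry the mean error of node $i-1$, so that a buddy node satisfies $\mu_i = -f_i \cdot \mu_{i-1}$. The function names and the numeric values are hypothetical.

```python
from math import prod

def expected_output_error(mu, fanin):
    """Equation (1): sum of node means weighted by downstream fan-ins."""
    return sum(prod(fanin[i + 1:]) * mu[i] for i in range(len(mu)))

def assign_buddies(mu_free, fanin):
    """Make every even-numbered node (1-indexed) a compensation buddy.

    mu_free lists the means of the odd-numbered (freely designed) nodes.
    Requires an even number of nodes, matching the assumption in the text.
    """
    assert len(fanin) % 2 == 0 and len(mu_free) == len(fanin) // 2
    mu = []
    for idx, f in enumerate(fanin):
        if idx % 2 == 0:
            mu.append(mu_free[idx // 2])      # nodes 1, 3, 5, ...
        else:
            mu.append(-f * mu[idx - 1])       # nodes 2, 4, 6, ... (Definition 1)
    return mu

fanin = [2, 2, 3, 2]                          # illustrative fan-ins
unidirectional = [0.01, 0.01, 0.02, 0.02]     # all errors of the same sign
compensated = assign_buddies([0.01, 0.02], fanin)

print(expected_output_error(unidirectional, fanin))
print(expected_output_error(compensated, fanin))
```

Each buddy cancels the amplified contribution of the node feeding it, so the pairs (1, 2), (3, 4), and so on sum to zero independently of the fan-ins further downstream.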

3.3 Finding the Compensation Buddy

To understand approaches to identifying compensation buddies, let us consider a circuit that computes a completely-specified Boolean function $\mathcal{F} : B^n \to B^m$ that maps an n-input Boolean vector $x = \langle x_1, x_2, \ldots, x_n \rangle$ to an m-output Boolean vector $y = \langle y_1, y_2, \ldots, y_m \rangle$ with an associated hardware cost $C_{\mathcal{F}}$. The goal of inexact logic minimization is to find a Boolean function $\mathcal{F}' : B^{n'} \to B^{m'}$, where $n' \le n$ and $m' \le m$, such that its cost $C_{\mathcal{F}'}$ is minimal subject to

\[
\frac{\sum_{\forall \vec{i} \in I} |\mathcal{F}(\vec{i}) - \mathcal{F}'(\vec{i})|}{T} \le Er_{th}
\]

where $I$ is a set of test vectors of size $T \gg 0$ applied to this circuit and $Er_{th}$ represents the tolerable error (here denoted by the average error) in the given circuit. For an output $y_i \in y$, we define the on-set and off-set of $\mathcal{F}$ as $x_{ON} \subseteq B^n$ such that $\mathcal{F}(x_{ON}) = 1$ and $x_{OFF} \subseteq B^n$ such that $\mathcal{F}(x_{OFF}) = 0$ respectively. Inexact logic minimization relies on swapping cubes between the $x_{ON}$ and $x_{OFF}$ sets so as to expand the cube size(s), leading to a decrease in the logic complexity, measured by the reduction in the number of literals or basic gate equivalents. This is the key difference between traditional logic optimization techniques, which reduce redundant logic to a minimal logic form while maintaining functional equivalence, and inexact logic minimization, which reduces a minimal logic even further by relaxing the requirement of functional equivalence between the minimized and the original Boolean formulae. By convention, we denote the inexact nodes having an expected or average output value less than their corresponding correct nodes as positive error nodes and, if not, we designate them as negative error nodes. For those familiar with the usage of the hypercube representation for logic minimization, essentially, the dominance of $x_{ON} \to x_{OFF}$ cube-swaps leads to positive error nodes while the dominance of $x_{OFF} \to x_{ON}$ cube-swaps provides negative error nodes. An example of this convention on a 1-bit full adder node is shown in Figure 3.

Figure 3: Example of an exact node (1-bit full adder) and its corresponding inexact compensation buddies.
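The sign convention for positive and negative error nodes can be illustrated on a 1-bit full adder. This is a minimal sketch under stated assumptions: inputs are taken as uniformly distributed, the error is computed as exact value minus inexact value (so that an output lower than the exact one counts as positive, as in the text), and the particular cube-swaps chosen here are arbitrary examples rather than the configurations of Figure 3. All function names are hypothetical.

```python
from itertools import product

def exact_full_adder():
    """Truth table mapping (a, b, cin) to (carry, sum)."""
    table = {}
    for a, b, c in product((0, 1), repeat=3):
        total = a + b + c
        table[(a, b, c)] = (total >> 1, total & 1)
    return table

def swap_cubes(table, on_to_off=(), off_to_on=()):
    """Flip selected entries; each entry is (input_vector, output_bit),
    with output_bit 0 for the carry and 1 for the sum."""
    new = {k: list(v) for k, v in table.items()}
    for vec, bit in on_to_off:
        assert new[vec][bit] == 1      # must currently be in the on-set
        new[vec][bit] = 0
    for vec, bit in off_to_on:
        assert new[vec][bit] == 0      # must currently be in the off-set
        new[vec][bit] = 1
    return {k: tuple(v) for k, v in new.items()}

def mean_error(exact, inexact):
    """Average of (exact value - inexact value) over all input vectors."""
    diffs = [(2 * exact[v][0] + exact[v][1]) - (2 * inexact[v][0] + inexact[v][1])
             for v in exact]
    return sum(diffs) / len(diffs)

exact = exact_full_adder()
# ON -> OFF swap on the sum bit for input 111: output drops from 3 to 2.
positive_node = swap_cubes(exact, on_to_off=[((1, 1, 1), 1)])
# OFF -> ON swap on the sum bit for input 000: output rises from 0 to 1.
negative_node = swap_cubes(exact, off_to_on=[((0, 0, 0), 1)])

print(mean_error(exact, positive_node))   # positive mean error
print(mean_error(exact, negative_node))   # negative mean error
```

A design flow in the spirit of REC would pair nodes of opposite sign so that their expected errors offset each other according to Definition 1.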

3.4 A General Algorithm for Design Space Exploration of REC-based Inexact Logic Minimization

A full-fledged design space exploration using inexact logic minimization is computationally intractable due to a combinatorial explosion of the solution space for a given accuracy constraint. As an example, for an n-input m-output function, in the worst case (such as the XOR-dominated functions common in most datapath circuits), we get $\sum_{i=0}^{k/2} C(2k, i)$ possible logic functions to evaluate for each output, where $k = 2^n$ is the cardinality of the set of input vector combinations. This can lead to a doubly exponential complexity bound of $\Theta(2^{m \cdot 2^n})$ in the number of input variables, a very large number indeed. However, in this paper, by restricting ourselves to XOR-dominated arithmetic circuits (adders and multipliers) and their use in building symmetrically-structured DSP blocks (FFTs and FIR filters), the solution space is quite dense with elements having similar energy-error tradeoff gains. Hence, rather than having to do a full search to find a global minimum, we employ greedy significance-guided heuristics (for example, alternating positive- and negative-error configurations) combined with randomized optimization schemes (for example, branch and bound, simulated annealing or stochastic gradient descent) and are able to achieve solutions of good quality. As in past work [13, 4], we use an output-significance driven assignment for the datapath circuits, i.e., nodes feeding outputs with higher (binary) significance are assigned higher significance values. A pseudo-code of the proposed algorithm along with a brief description of its important steps is provided in Algorithm 1 in the Appendix Section 8.2.

3.5 Notes on Extending the Reach of Logic Synthesis

Historically, datapath circuits did not lend themselves favorably to traditional logic synthesis techniques given the dominance of XOR gates that did not exhibit the adjacency properties (thereby offering no possibilities of cube expansion and hence logic reduction). Therefore, most of the prior approaches to designing the best datapath circuits involved custom design as opposed to automatic logic synthesis guided approaches. However, we have made initial strides towards extending the reach of logic synthesis through inexact design to include datapath design. For doing so, we rely on the foundational principles from Quine-McCluskey’s work [14] in identifying adjacency of the minterms as an important criterion to decide the sequence of inexact logic minimizations. For example, to identify the candidate cubes for cube swapping between the $x_{ON}$ and $x_{OFF}$ sets in step IV of Algorithm 1, we greedily select only those cubes that are adjacent to at least one cube in the original sets, satisfying the following necessary and sufficient conditions:

Necessary and Sufficient Conditions for Cube-Swapping

(i) The Hamming weights of the swapped cube and the original cube should differ by 1.
(ii) The difference between the decimal values of the swapped cube index and the original cube index should be a power of 2.
(iii) If the Hamming weight of one cube is greater than that of the other, then its corresponding decimal index should be greater as well.

These conditions guarantee that the swapped cubes are indeed adjacent to at least one other cube in the original subsets, thereby ensuring a reduction in logic complexity by increasing the cube cover (and thus reducing the total number of literals needed to represent these cubes).
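The three conditions can be checked mechanically against the usual definition of adjacency (two minterms whose binary indices differ in exactly one bit). The following is a small sketch with hypothetical function names, where cubes are represented by their integer minterm indices.

```python
def hamming_weight(x):
    """Number of 1 bits in the minterm index x."""
    return bin(x).count("1")

def satisfies_swap_conditions(a, b):
    """Conditions (i)-(iii) for minterm indices a and b."""
    wa, wb = hamming_weight(a), hamming_weight(b)
    diff = abs(a - b)
    cond_i = abs(wa - wb) == 1                        # weights differ by 1
    cond_ii = diff > 0 and (diff & (diff - 1)) == 0   # difference is a power of 2
    cond_iii = (wa > wb) == (a > b)                   # heavier cube has larger index
    return cond_i and cond_ii and cond_iii

def differ_in_one_bit(a, b):
    """Reference definition of adjacency: XOR has exactly one bit set."""
    x = a ^ b
    return x != 0 and (x & (x - 1)) == 0

# Exhaustive comparison over all pairs of 6-variable minterms.
mismatches = [(a, b) for a in range(64) for b in range(64)
              if satisfies_swap_conditions(a, b) != differ_in_one_bit(a, b)]
print(len(mismatches))
```

Condition (iii) is what rules out pairs such as 3 (011) and 4 (100), whose indices differ by a power of 2 and whose weights differ by 1, but which are not adjacent.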

3.6 Results and Analysis for Datapath Networks

In this subsection, we focus our efforts on quantifying the gains in error reduction of the proposed REC approach for inexact logic minimization over conventional techniques under similar energy, delay and area constraints. We use two quality metrics for comparison: expected error and SNR (dB) loss, the latter calculated as the absolute value of the difference between the signal-to-noise ratio computed by the exact circuits and that computed by their inexact relaxation. We do this through empirical comparisons using various datapath elements (adders and multipliers) in a pipelined 64- to 256-point radix-$2^2$ FFT network [8]. We employ two types of inexact configurations of this FFT network: (i) inexact multipliers only, and (ii) both inexact adders and multipliers. For the adders, a combination of parallel prefix (Kogge-Stone) and ripple carry adders was used, while for the multiplier networks, we used the standard truncated array multipliers with variable correction [6]. We used 16-bit uniformly random input vectors in the range [-1, 1) generated by Matlab and incorporated operand scaling at each node as appropriate to maintain a constant bit-width. As evident from the results in Table 1, the proposed REC technique reduces the mean error by two to three orders of magnitude, and the degradation in SNR by one order of magnitude, when compared to existing techniques under iso-energy conditions.

Table 1: Average error, variance, and SNR loss results at the outputs of the FFT network using existing inexact designs (denoted by Conv.) and the proposed REC technique (denoted by Prop.)

Inexact Multipliers in FFT Blocks

Path     Average error        Variance           SNR (dB) Loss
Depth    Conv.     Prop.      Conv.    Prop.     Conv.    Prop.
3        0.0125    0.0023     0.165    0.076     0.66     0.055
6        0.036     0.0004     0.22     0.11      1.11     0.09
9        0.051     0.0003     0.037    0.016     4.45     0.39

Inexact Adders & Multipliers in FFT Blocks

Path     Average error        Variance           SNR (dB) Loss
Depth    Conv.     Prop.      Conv.    Prop.     Conv.    Prop.
3        0.014     0.0048     0.033    0.017     0.74     0.06
6        0.036     0.0004     0.041    0.025     1.25     0.13
11       0.042     0.0015     0.043    0.019     7.78     1.09

4. WAVEFORM SHAPING TO ENHANCE RECIPROCATIVE ERROR COMPENSATION

The second contribution of this paper involves further enhancing the REC approach by utilizing knowledge of the underlying properties of the algorithm to provide for system-level compensation. We demonstrate this in the context of DSP systems whose system-level model is shown in Figure 4(a). As observed in this figure, a typical DSP system model consists of two key components: a computational datapath and a coefficient memory (for storing the system parameters). The existing inexact techniques modify the computational datapath but use pre-computed parameters (coefficients) that were optimized for an exact DSP algorithm. However, if we could alter the DSP system parameters based on the knowledge of the underlying inexact datapath elements, we would potentially minimize the overall error in the system.

In this paper, we achieve this by casting the DSP networks into equivalent neuronal models and employing a variation of the gradient descent based backpropagation algorithm called the Normalized Least-Mean-Square (NLMS) algorithm [7], as shown in Figure 4(b). In DSP systems, this is referred to as waveform shaping of the frequency and phase response characteristics. As this approach only involves modifying the coefficients of the DSP blocks that will be stored in the memories, it is applied at design time and hence does not incur any hardware overheads. For the equivalence between the neuronal models and the actual DSP hardware implementations, refer to Section 8.3 of the Appendix. In the following subsection, we provide a neuronal model for the DSP blocks, and show that it can be solved efficiently using the NLMS algorithm to modify the inexact DSP system parameters.

4.1 Neuronal Model for DSP Blocks

The behavior of an exact DSP block can be modeled as a linear neuron as follows:

\[
\Im : \{\vec{x}(i), \vec{t}(i)\}; \quad i = 1, 2, 3, \ldots, k
\]

where $\vec{x}(i) = [x_1(i), x_2(i), \ldots, x_n(i)]^T$ is an n-dimensional input vector and $\vec{t}(i) = [t_1(i), t_2(i), \ldots, t_k(i)]^T$ is the desired k-dimensional output vector at time (or for sample) $i$. Similar to a supervised machine learning algorithm, we have two phases: a feedforward computation that propagates the input vectors across the inexact blocks to the outputs using an initial set of weights obtained from the exact DSP block, and an error propagation mode that backpropagates the error values from the output nodes to iteratively adjust the weights for the inexact block to minimize the targeted error metric.

Figure 4: A DSP system level model for (a) conventional inexact circuits (b) proposed REC-based inexact circuits.

To elaborate, in the first phase, we compute the output of the inexact system as follows:

\[
y(i) = \sum_{k=1}^{n} w_k \cdot x_k(i) = \vec{x}^T(i) \cdot \vec{w}
\]

where $\vec{w} = [w_1, w_2, \ldots, w_n]^T$ is an n-dimensional coefficient vector. We compute the error of the inexact neuron as

\[
e(i) = t(i) - y(i) = t(i) - \vec{x}^T(i) \cdot \vec{w} \tag{4}
\]

Since the targeted DSP blocks (FIR/FFT primitives) can be modeled as single-layer linear neurons (refer to supplemental Section 8.3 for details), we employ the NLMS algorithm [7]. Without loss of generality, we limit ourselves to presenting the NLMS algorithm on a single neuron in this section, but the algorithm can be used to iteratively optimize multiple neurons in a single layer (e.g., the FFT neuronal model shown in supplemental Section 8.3). The NLMS algorithm solves the constrained optimization problem of minimizing the squared Euclidean norm of the change in the weight vector, where $n$ now denotes the iteration index, and can be described as follows:

\[
\text{Minimize } \|\Delta\vec{w}(n)\|^2 = \|\vec{w}(n+1) - \vec{w}(n)\|^2 \quad \text{subject to: } \vec{x}^T(n) \cdot \vec{w}(n+1) = t(n)
\]

We can find a solution to this optimization problem using Lagrange multipliers [7], leading to the weight update function

\[
\vec{w}(n+1) = \vec{w}(n) + \mu \, \frac{e(n)\,\vec{x}(n)}{\vec{x}^T(n)\,\vec{x}(n)} \tag{5}
\]

where $\mu$ here denotes the adaptation step size. We refer the reader to the pseudocode provided in Algorithm 2 in Section 8.4 of the appendix for more details.
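The update in Equation (5) can be exercised on a toy example. The sketch below is not Algorithm 2 from the appendix. It assumes a hypothetical inexact datapath in which tap k scales its product by a fixed factor `gains[k]`, a deliberately simple stand-in for real inexact multipliers, and it uses illustrative 4-tap coefficients, a step size of 0.5, and 2000 white-noise samples each for training and testing. All names are assumptions made for illustration.

```python
import random

def make_data(seed, count, w_exact):
    """White-noise input windows and the exact FIR outputs used as targets."""
    rng = random.Random(seed)
    n = len(w_exact)
    noise = [rng.uniform(-1.0, 1.0) for _ in range(count + n - 1)]
    samples = [noise[i:i + n] for i in range(count)]
    targets = [sum(w * x for w, x in zip(w_exact, win)) for win in samples]
    return samples, targets

def inexact_output(w, x, gains):
    """Hypothetical inexact datapath: tap k scales its product by gains[k]."""
    return sum(g * wk * xk for g, wk, xk in zip(gains, w, x))

def nlms_shape(w_init, gains, samples, targets, mu=0.5, eps=1e-12):
    """Apply the Equation (5) update once per training sample."""
    w = list(w_init)
    for x, t in zip(samples, targets):
        e = t - inexact_output(w, x, gains)        # error as in Equation (4)
        norm = sum(xk * xk for xk in x) + eps      # x^T x, guarded against zero
        w = [wk + mu * e * xk / norm for wk, xk in zip(w, x)]
    return w

def mse(w, gains, samples, targets):
    """Mean square error of the inexact datapath against the exact targets."""
    errs = [(t - inexact_output(w, x, gains)) ** 2 for x, t in zip(samples, targets)]
    return sum(errs) / len(errs)

w_exact = [0.2, 0.5, 0.5, 0.2]        # illustrative coefficients for an exact filter
gains = [0.90, 0.95, 0.85, 0.90]      # assumed per-tap inexactness
train_x, train_t = make_data(1, 2000, w_exact)
test_x, test_t = make_data(2, 2000, w_exact)

w_shaped = nlms_shape(w_exact, gains, train_x, train_t)
print("MSE with exact coefficients: ", mse(w_exact, gains, test_x, test_t))
print("MSE with shaped coefficients:", mse(w_shaped, gains, test_x, test_t))
```

Only the stored coefficients change between the two runs; the (modeled) inexact datapath is untouched, which mirrors the zero hardware-overhead argument made for waveform shaping.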

5. APPLICATION TO DSP DATAPATHS

5.1 Implementation Framework

The desired DSP blocks have been described in VHDL and synthesized using the Cadence RTL Compiler with industry-standard 65nm (low power, low leakage) technology libraries. The place and route of the synthesized designs has been done using Cadence SoC Encounter, and the post-layout SPICE netlists (with parasitics) have been extracted and simulated using the Mentor Graphics ADiT fast-SPICE simulator to calculate the power and delay values accurately. This framework is tied into in-house C and Matlab based simulators that have been extended to be compatible with Algorithms 1 and 2 to determine the desired error metrics.

5.2 Hardware Implementation of DSP Blocks Synthesized with REC

For the hardware implementation, we consider a pipelined 256-point radix-$2^2$ complex FFT [8] that is capable of processing 16-bit complex data to demonstrate the effectiveness of the proposed REC approach. We opt for the single-path delay feedback (SDF) architecture with a continuous sampling of inputs. Additionally, we have also implemented a data broadcast structure based 30-tap low-pass FIR filter with 16-bit inputs.

[Plot: Percentage of Energy Savings (y-axis) versus SNR Loss in dB (x-axis).]

Figure 5: Energy Savings vs SNR Loss (dB) plot for the REC-based inexact logic minimization technique on a pipelined 256-point radix-$2^2$ complex FFT and 30-tap FIR filter with the conventional (exact) circuits as a comparison baseline.

In Figure 5, we show the plots of the energy savings of the proposed REC-based inexact logic minimization approach as a function of SNR loss with reference to the conventional “exact” implementation of the 256-point FFT and 30-tap FIR filter. As evident from this figure, the proposed approach achieves between 12% and 73% energy savings (a multiplicative factor of 1.14–3.7), with corresponding delay and area savings between 2%–16% and 15%–62% respectively, when compared to the exact implementations, with an associated loss in SNR of 0.006 dB up to 1.5 dB. Another interesting and crucial aspect of the proposed technique is the ability to maintain these savings over the conventional exact implementations even at scaled supply voltages. Owing to the adoption of an aggressive pipelined strategy, the delay savings from the proposed approach are not as significant as the energy and area savings. This is because our randomized approach synthesized datapath blocks with non-uniform levels of logic minimization, wherein the presence of even a single “not-so-minimized” block in a pipeline stage would inflate the critical path delay.
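The quoted multiplicative factors follow from the percentage savings. As a short worked check (the symbol $s$ for the fractional energy saving is introduced here only for illustration), an inexact design that saves a fraction $s$ consumes $(1 - s)$ of the exact energy, so the exact design uses $1/(1-s)$ times as much:

\[
\frac{E_{exact}}{E_{inexact}} = \frac{1}{1 - s}, \qquad \frac{1}{1 - 0.12} \approx 1.14, \qquad \frac{1}{1 - 0.73} \approx 3.7
\]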

Turning to the idea of optimizing the application parameters, the waveform shaping technique further enhances the scope of our approach in the context of DSP applications.

[Plots: two panels, “Output of an inexact FIR filter using Waveform Shaping” and “Output of a conventional inexact FIR filter”, each showing Signal Value versus Time Index (0 to 5000) for the desired output, the inexact output and the error. Annotated values: SNR Loss (in dB) = 0.166 and 4.67; Mean Square Error = 0.0435 and 0.0079.]

Figure 6: The output of a conventional inexact and a waveform-shaped inexact band-pass FIR filter for 5000 samples of white noise input.

The results of the waveform shaping when applied to an inexact 30-tap band-pass FIR filter, using two different sets of 5000 white noise samples for training and testing, are shown in Figure 6. The proposed waveform shaping technique is able to effectively compensate for most of the residual error from inexact logic minimization in the DSP blocks. To provide the reader some intuition as to how the “shaped” system response curves look, we have included the frequency and phase responses in Figure 10 of the Appendix.

6. CONCLUSION AND FUTURE WORK

We have presented a novel zero-hardware overhead design technique called reciprocative error compensation (REC) to enable inexact logic synthesis that scales to the level of subsystems from the DSP domain. To the best of our knowledge, this is the first time that the feasibility of inexact systems is being demonstrated at this scale. Our technical contributions are two-tiered: (a) synthesis of basic building blocks using inexact logic minimization to compensate for each other’s error through REC, and (b) a system-level DSP error compensation methodology through waveform shaping. Through the application of the proposed techniques on a pipelined Fast Fourier Transform (FFT) architecture, we show that 2-3 orders of magnitude lower (expected) error and more than an order of magnitude lower Signal-to-Noise Ratio (SNR) loss (in dB) can be achieved over the previously proposed inexact design techniques under iso-energy conditions. The gains over conventional exact FFT and FIR filter implementations include up to 73% energy savings (a multiplicative factor of 1.14–3.7 across the design points) at a minimal SNR loss of less than 1.5 dB. The corresponding delay and area savings are up to 16% and 62% respectively. Future work includes developing efficient CAD algorithms and optimization frameworks for further enhancing the gains of our proposed framework and applying this framework to a broader set of applications such as those embodying RMS workloads.

7. REFERENCES

[1] A. Lingamneni et al. Energy parsimonious circuit design through probabilistic pruning. In Proc. of DATE, pages 764–769, Mar 2011.

[2] A. Lingamneni et al. Parsimonious circuit design for error-tolerant applications through probabilistic logic minimization. In Proc. of PATMOS, pages 204–213, 2011.

[3] A. Lingamneni et al. Algorithmic methodologies for ultra-efficient inexact architectures for sustaining technology scaling. In Proc. of ACM Intl. Conference on Computing Frontiers, pages 3–12, May 2012.

[4] A. Lingamneni et al. Synthesizing parsimonious inexact circuits through probabilistic design techniques. ACM Transactions on Embedded Computing Systems, 2013.

[5] D. Shin et al. Approximate logic synthesis for error tolerant applications. In Proc. of DATE, pages 957–960, 2010.

[6] E.J. King et al. Data-dependent truncation scheme for parallel multipliers. Asilomar Conf. on Signals, Systems & Computers, 1997.

[7] S. Haykin. Adaptive Filter Theory. Prentice-Hall, Inc., Englewood Cliffs, NJ, 2002.

[8] S. He and M. Torkelson. A new approach to pipeline FFT processor. In Proc. of Parallel Processing Symposium, pages 766–770, 1996.

[9] J. Huang et al. A methodology for energy-quality tradeoff using imprecise hardware. In the 49th DAC, pages 504–509, 2012.

[10] K. Palem et al. What to do about the end of Moore’s law, probably! In the 49th DAC, pages 924–929, 2012.

[11] K. Palem et al. Ten years of building broken chips: The physics and engineering of inexact computing. ACM Transactions on Embedded Computing Systems, 2013.

[12] K.V. Palem et al. Sustaining Moore’s law in embedded computing through probabilistic and approximate design: retrospects and prospects. In Proc. of CASES, pages 1–10, 2009.

[13] L.N.B. Chakrapani et al. Highly energy and performance efficient embedded computing through approximately correct arithmetic: A mathematical foundation and preliminary experimental validation. In Proc. of IEEE/ACM CASES, pages 187–196, 2008.

[14] E. McCluskey Jr. Minimization of Boolean functions. Bell System Technical Journal, 1956.

[15] M.R. Choudhury et al. Approximate logic circuits for low overhead, non-intrusive concurrent error detection. In Proc. of DATE, pages 903–908, Mar 2008.

[16] N. Zhu et al. Design of low-power high-speed truncation-error-tolerant adder and its application in digital signal processing. IEEE Trans. Very Large Scale Integration Sys., 18(8):1225–1229, 2010.

[17] P. Kulkarni et al. Trading accuracy for power with an underdesigned multiplier architecture. In Proc. of Intl. Conf. on VLSI Design, pages 346–351, 2011.

[18] K. V. Palem. Energy aware algorithm design via probabilistic computing: From algorithms and models to Moore’s law and novel (semiconductor) devices. In Proc. of CASES, pages 113–116, 2003.

[19] K. V. Palem. Energy aware computing through probabilistic switching: A study of limits. IEEE Transactions on Computers, 54(9):1123–1137, 2005.

[20] V. Gupta et al. IMPACT: imprecise adders for low-power approximate computing. In Proc. of ISLPED, pages 409–414, 2011.

8. SUPPLEMENTAL MATERIAL

8.1 An Illustrative Example of Conventional and REC-based Error Profiles

As a simple example, consider Figure 7(a), which shows the average error over 2000 uniformly random data samples in the range [-1, 1) when 14 conventional inexact full adder nodes are used in a simple 16-bit ripple carry adder. The figure shows a unidirectional, positive error profile, which would lead to accumulation of error when such adders are used in larger structures. If, on the other hand, these inexact nodes can be designed to have an error profile similar to Figure 7(b) (which has a 20X lower average error) by using the concept of compensation buddies, error accumulation can be prevented even in larger datapath networks.

[Figure 7 plots: two panels, (a) Positive Error Nodes and (b) REC Nodes. x-axis: Input Test Vectors (400 to 2000); y-axis: Average Error at the Output (ticks from 0 to 0.2 on either side of zero).]

Figure 7: Graph showing the average error in ripple carry adders using conventional and REC-based full adder nodes. Uniformly random 16-bit inputs in the range [-1, 1) were used.
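The difference between the two error profiles can be illustrated with a toy simulation. The sketch below is not a gate-level model of an inexact full adder: it assumes each node simply adds an error of random magnitude with a fixed sign, and the names `average_output_error`, `max_err` and the alternating sign pattern are hypothetical choices made for illustration only.

```python
import random

def average_output_error(node_signs, n_samples=2000, max_err=0.01, seed=0):
    """Average signed error at the output of a chain of inexact nodes.

    Each node contributes an error of random magnitude in [0, max_err)
    whose sign is fixed by node_signs (+1 or -1). This is a toy stand-in
    for an inexact full adder, not a circuit-accurate model.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += sum(sign * rng.uniform(0.0, max_err) for sign in node_signs)
    return total / n_samples

# 14 inexact nodes, matching the node count in the example above.
conventional = [+1] * 14      # every node errs in the positive direction
rec_like = [+1, -1] * 7       # each positive node is paired with a negative one

err_conv = average_output_error(conventional)
err_rec = average_output_error(rec_like)
print(f"conventional: {err_conv:+.4f}  REC-like: {err_rec:+.4f}")
```

With same-signed nodes the individual errors add up, while with paired signs they largely cancel on average, which is the qualitative behaviour the two panels of Figure 7 contrast.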

8.2 Pseudo-code of the REC-based Inexact Logic Minimization Algorithm

Algorithm 1 gives pseudo-code for the proposed REC-based inexact logic minimization algorithm, along with a brief description of its important steps.

8.3 Some Notes on the Neuronal Models for the DSP Blocks

As described in Section 4, an FIR filter can be modeled as a single neuron (a perceptron) with a linear activation function, as shown in Figure 8. While an FIR filter maps trivially onto a single multi-input neuron, which is not far from its circuit implementation, the neuronal model of an FFT is less intuitive because it is very different from typical hardware implementations, for example the pipelined radix-2^2 implementation chosen for this paper. Hence, we would like to make a clear distinction between the actual hardware implementation of the FFT [8] and the neuronal model of the FFT shown in Figure 9, which is only used for applying the proposed idea of waveform shaping.

Given the independent and iterative application of the NLMS optimization to determine the weights, the coefficient sharing that was feasible with the exact FFT twiddle factors is no longer possible, as each of the weights can iteratively converge to a different value. It should be noted, however, that the area overhead of adding rows to the memory bank that stores the coefficients is negligible compared to the area gains that the proposed FFT achieves. If area

Algorithm 1: Pseudo-code for applying REC-based inexact logic minimization to a circuit represented as a Directed Acyclic Graph G

// Inputs: circuit graph G, primary output significance array S_o, error threshold Er_th
// Outputs: circuit graph after the REC approach, G_REC

// Step I: Build the graph with node clustering. Nodes with fanin and fanout between 3 and 10 are clustered together.
G_c ← ClusterNode(G)

// Step II: Identify the set of all unique node types in the clustered circuit graph.
N_u ← Uniquify(G_c)

// Step III: Use sensitivity analysis w.r.t. the output significance array to determine the significance of each node cluster.
S ← ComputeSignificance(G_c, N_u, S_o)

// Step IV: Local exploration. Compute the design space (both positive and negative configurations) of each unique node cluster using a greedy approach with incremental (up to L) cube swaps, or until the node is reduced to a wire.
for j ∈ {pos, neg} do                  // positive- or negative-error
    for i ← 1 to |N_u| do              // for each n_i ∈ N_u
        ErCfg_{j,i} ← {φ}
        while (n_i ≠ wire) do
            ErCfg_{j,i} ← GreedyCubeSwap(n_i, j, L)
        end while
    end for
end for

// Step V: Global exploration. Compute the positive- or negative-error configuration assignment for each node cluster across the dataflow graph G_c.
Cost(G_REC) = Cost(G)
(N_pos, N_neg) ← BipartitePartition(G_s)    // DSP graphs can be partitioned into bipartite subsets due to their symmetry
while iteration < bound do
    G' ← RandomErrCfg(N_pos, ErCfg_pos, S)
         s.t. if s_k > s_l, then ErCfg_k < ErCfg_l, ∀ k, l ∈ N_u
    G' ← CompensationBuddy(N_neg, ErCfg_neg)    // apply the definition of compensation buddy to determine the error configuration of the negative-error nodes
    if Cost(G') < Cost(G_min) && Er(G') < Er_th then
        G_REC ← G'
    end if
end while
return G_REC
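The global exploration of Step V can be sketched in Python as follows. This is only an illustrative toy under several assumptions: a configuration is reduced to a signed integer "error level" per node, the buddy rule (a negative node cancels its strongest positive neighbour) is a placeholder rather than the definition used by the REC approach, the significance ordering constraint is omitted, and `global_exploration`, `toy_cost` and `toy_error` are hypothetical names.

```python
import random
from collections import deque

def bipartite_partition(adj):
    """Two-colour the (undirected) graph by BFS.

    adj maps every node to a list of neighbours and must be symmetric.
    Returns (pos, neg) node sets, or None if no two-colouring exists.
    """
    colour = {}
    for start in adj:
        if start in colour:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in colour:
                    colour[v] = 1 - colour[u]
                    queue.append(v)
                elif colour[v] == colour[u]:
                    return None
    pos = {n for n, c in colour.items() if c == 0}
    neg = {n for n, c in colour.items() if c == 1}
    return pos, neg

def global_exploration(adj, levels, cost, error, er_th, bound=200, seed=0):
    """Random search loosely following Step V.

    A configuration maps node -> signed error level (0 means exact).
    cost and error are caller-supplied evaluators (hypothetical here).
    """
    parts = bipartite_partition(adj)
    if parts is None:
        raise ValueError("graph is not bipartite")
    pos, neg = parts
    rng = random.Random(seed)
    best = {n: 0 for n in adj}                 # start from the exact circuit
    for _ in range(bound):
        cfg = {n: rng.randrange(levels) for n in pos}
        # ASSUMED buddy rule: a negative node cancels its strongest
        # positive neighbour. The actual definition may differ.
        for n in neg:
            cfg[n] = -max((cfg[m] for m in adj[n]), default=0)
        if error(cfg) < er_th and cost(cfg) < cost(best):
            best = cfg
    return best

# Toy usage on a 4-node chain a - b - c - d.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
toy_cost = lambda cfg: sum(4 - abs(v) for v in cfg.values())   # bolder = cheaper
toy_error = lambda cfg: abs(sum(cfg.values()))                 # net signed error
best_cfg = global_exploration(chain, levels=4, cost=toy_cost,
                              error=toy_error, er_th=1)
```

The search only ever accepts a candidate that is both cheaper than the best configuration found so far and below the error threshold, so the returned configuration is never worse than the exact starting point.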

[Figure 8 diagram. Labels: FIR Filter; inputs x[n], x[n-1], ..., x[1]; weights w0, w1, ..., wn-1; Multiplier; Carry Save Adder; Linear Function; output y[n].]

Figure 8: A linear neuronal model representing an FIR circuit.
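The single-neuron view of an FIR filter in Figure 8 can be written out directly. The sketch below assumes zero initial conditions (samples before the start of the signal are treated as zero) and uses hypothetical helper names `fir_neuron` and `fir_filter`; it models the arithmetic only, not the multiplier and carry save adder hardware.

```python
def fir_neuron(weights, window):
    """Single linear neuron: weighted sum of its inputs, identity activation."""
    return sum(w * x for w, x in zip(weights, window))

def fir_filter(weights, signal):
    """Slide the neuron over the signal.

    At time n the neuron sees [x[n], x[n-1], ..., x[n-N+1]], where N is the
    number of taps. Samples before the start of the signal are taken as zero.
    """
    n_taps = len(weights)
    padded = [0.0] * (n_taps - 1) + list(signal)
    out = []
    for n in range(len(signal)):
        window = padded[n:n + n_taps][::-1]   # newest sample first
        out.append(fir_neuron(weights, window))
    return out

# The impulse response of the neuron is its weight vector.
impulse_response = fir_filter([1.0, 2.0, 3.0], [1.0, 0.0, 0.0])
smoothed = fir_filter([0.5, 0.5], [1.0, 1.0, 1.0])
```

Because the activation is linear, adjusting the weights of this neuron is the same as adjusting the filter coefficients, which is what makes the neuronal model usable for waveform shaping.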

[Figure 9 diagram. Labels: Fast Fourier Transform; inputs x1, x2, ..., xN; outputs X1, X2, ..., Xk; weight labels w1, w2, ..., wN; w2, w4, ..., w2N; wk, wk2, ..., wkN.]

Figure 9: A linear neuronal model representing an FFT circuit.

is still a primary concern in a design, one possible option is to still apply the NLMS algorithm and store the average values of the modified twiddle factors.
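The neuronal model of Figure 9 and the coefficient-sharing argument can be made concrete with a small sketch. It assumes the standard zero-indexed DFT definition, with one linear neuron per output and exact weights exp(-2jπkn/N); the function names are hypothetical, and this is the model used for waveform shaping, not the pipelined hardware FFT.

```python
import cmath

def twiddle_weights(n_points):
    """Row k holds the exact weights exp(-2j*pi*k*n/N) feeding output neuron X_k."""
    return [[cmath.exp(-2j * cmath.pi * k * n / n_points) for n in range(n_points)]
            for k in range(n_points)]

def dft_layer(weights, x):
    """One linear neuron per output: X_k = sum_n weights[k][n] * x[n]."""
    return [sum(w * xn for w, xn in zip(row, x)) for row in weights]

def distinct_weights(weights, digits=9):
    """Number of distinct coefficient values, after rounding away float noise."""
    return len({(round(w.real, digits), round(w.imag, digits))
                for row in weights for w in row})

W = twiddle_weights(4)
spectrum = dft_layer(W, [1.0, 1.0, 1.0, 1.0])   # constant input
shared = distinct_weights(W)                     # exact twiddles repeat heavily
```

For N = 4 the 16 exact weights take only 4 distinct values, which is the sharing that a coefficient memory can exploit. Once each weight is tuned independently, all 16 entries may differ, hence the additional memory rows discussed above.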

8.4 Pseudo-code for the REC-based Waveform Shaping Algorithm

Algorithm 2 gives pseudo-code for the proposed REC-based waveform shaping algorithm, along with a brief description of its important steps.

Algorithm 2: NLMS-based Coefficient Update

Training Vectors:
    Input vectors = $\vec{x}(n)$
    Number of neurons = $k$
    Exact circuit weight vectors = $\vec{w}_j(n)$
    Exact circuit response = $t_j(n)$
    for $j = 1, 2, \ldots, k$.

User Selected Parameters: $\mu_j$

Initial Conditions: $\vec{w}(0) = \vec{w}_{exact}$

NLMS Algorithm:
for n = 1, 2, 3, ... loop
    for j = 1, 2, ..., k loop
        $e_j(n) = t_j(n) - \vec{w}_j(n) \cdot \vec{x}^T(n)$
        $\vec{w}_j(n+1) = \vec{w}_j(n) + \dfrac{\mu_j \, e_j(n) \, \vec{x}(n)}{\vec{x}^T(n) \, \vec{x}(n)}$
    end for
end for
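A minimal Python sketch of the update in Algorithm 2 for a single neuron is given below. Two things here are assumptions and not part of the algorithm as stated: the `forward` callable, which stands in for the inexact datapath whose output is compared against the exact response, and the small `eps` term guarding against division by zero. The 10% gain loss in `lossy_forward` is a made-up error model chosen only because its ideal compensation is easy to check.

```python
import random

def nlms_train(w_exact, forward, samples, mu=0.5, eps=1e-12):
    """Shape the weights of one neuron so forward(w, x) tracks the exact response.

    w_exact : exact coefficient vector; also the initial condition w(0)
    forward : hypothetical model of the inexact datapath, forward(w, x) -> output
    samples : iterable of training input vectors x(n)
    """
    w = list(w_exact)
    for x in samples:
        t = sum(wi * xi for wi, xi in zip(w_exact, x))   # exact response t(n)
        e = t - forward(w, x)                            # error e(n)
        norm = sum(xi * xi for xi in x) + eps            # x^T(n) x(n)
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
    return w

def lossy_forward(w, x):
    """Toy inexact datapath: the weighted sum systematically loses 10% of its gain."""
    return 0.9 * sum(wi * xi for wi, xi in zip(w, x))

rng = random.Random(1)
samples = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(500)]
w_exact = [0.2, -0.5, 0.3]
w_shaped = nlms_train(w_exact, lossy_forward, samples)
```

In this toy setting the shaped weights move toward `w_exact / 0.9`, that is, the coefficients absorb the systematic error of the datapath. For k neurons the same loop would be run independently per neuron, each with its own step size.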

8.5 Response Plots for the Waveform-shaped FIR Filter

In this section, we provide the frequency and phase response curves of the 30-tap bandpass FIR filter used in Figure 6, before and after applying the waveform-shaping technique.

[Figure 10 plots: two panels against Normalized Frequency (×π rad/sample), 0 to 1. Phase (degrees), axis from −2000 to 500; Magnitude (dB), axis from −200 to 0. Legend in each panel: Conventional Exact Filter, Waveform-Shaped Inexact Filter.]

Figure 10: The frequency and phase responses of a conventional exact FIR filter and a waveform-shaped inexact FIR filter.
