Design of High Performance Multiply-Accumulate Computation...

Design of High Performance Multiply-AccumulateComputation Unit

S.Ahish1, Y.B.N.Kumar2, Dheeraj Sharma3, M.H.Vasantha4

Department of Electronics and Communication

National Institute Of Technology Goa

Email:ahish27@gmail.com1, nithin.shastri@nitgoa.ac.in2, dheeraj24482@nitgoa.ac.in3, vasanthmh@nitgoa.ac.in4

Abstract—In Digital Signal Processing (DSP), Multiply-Accumulate Computation (MAC) unit plays a very importantrole and lies in the critical path. Multiplier is one of the mostimportant block in MAC unit. The overall performance of theMAC unit depends on the resources used by the multiplier.Therefore, this paper describes the design of a Partial ProductReduction Block (PPRB) that is used in the implementation ofmultiplier having better area, delay and power performances.PPRB reduces the partial products row wise by using differentmulti-bit adder blocks instead of conventional coloumn wisereduction. MAC unit consisting of the multiplier realized usingthe proposed partial product reduction technique has a delayreduction of 46%, power consumption is reduced by 39% andarea requirement is reduced by 17% when compared to MACunit realised using conventional multiplier architecture.

Index Terms— Carry-lookahead adder, brent-kung adder,wallace tree, booth multiplier, multiply-accumulate unit.

I. INTRODUCTION

The multiplication-accumulation is main computational ker-

nel and is considered as one of the fundamental operations in

DSP [1]. MAC unit is an integral part of DSP architecture,

hence decides the pace of the general framework that is it

generally lies in the critical path. Creating a high performance

MAC unit is pivotal for continuous DSP applications. Addi-

tionally with perpetually expanding interest for the compact

electronic items, an electronic segment with low power and

less area requirement is very much necessary for market

stand point of view. Hence designing of a MAC unit with

high speed performance, less area requirement and low power

consumption becomes an important aspect in real time video

coding and DSP systems [2].

The critical path delays and hardware complexities of

Multiplier-Accumulation units are investigated to derive a

high performance MAC [3]. In order to improve the power,

delay, speed performances of the MAC unit, the performance

parameters of the multiplier which constitutes the major part

of the MAC unit has to be improved. Much work has been

done on advanced multiplication algorithms and designs [1].

Improving the performance of the multiplier means to re-

duce the resources used by partial product reduction block.

Carry propagation is time consuming, hence performing two

different carry propagations in the same MAC circuit is

inefficient. This can be overcome by bolstering the multiplier

yield back to the input of the PP unit decrease tree hinders

the requirement for a traditional accumulate adder [4], [5],

[6]. Accumulationis consequently taken care of by the last

adder of the multiplier, and just one convey spreading stage

is needed. The issue is that this advancement just applies to

one-cycle Macs, where the long discriminating deferral is a

constraining component in many applications. The multiplier

architecture proposed in this paper is based on the basic

algorithm for multiplication, extraordinary paper and pencil

approach [7] and passes through three fundamental stages: 1)

Partial Product (PP) generation, 2) Partial product reduction,

and 3) Final (carry-propagated) addition. The partial product

reduction block is the resource extensive. In this work, parallel

prefix circuits that take n inputs and produce the outputs has

been used to realize powerful adders [8],[9], these adders are

used to realize the partial product reduction block to get a

multiplier with improved performance [10].

The paper is organised as follows: Section II addresses the

existing methods of multiplier implementation. The Section

III consists of the description of the proposed partial product

reduction block along with brief description about the Brent-

Kung adder and the CLA which will be used in the proposed

architecture. Section IV describes the simulation setup used

for the experiment and the results of the experiment. Finally,

conclusions will be drawn in Section V.

A. General MAC unit operation [11]:

The multiply-accumulation operation is one of the most

used operation in DSP architecture. In order to realize the

expressions such as

y[n] =∑

x[k]h[n− k] (1)

First the multiplication operation for different values of x

and h should be performed and then add the products to get

the output y[n]. Instead of waiting for multiplication results

to become available, addition operation can be computed in

parallel with the multiplication using MAC operation. The

hardware used for this is called as MAC unit. The general

expression representing the MAC operation is as given below

y[n+ 1] = y[n] + x[n+ 1] ∗ h[n+ 1] (2)

where x[i] is the multiplier and h[i] is the multiplicand each

of n-bit size. The basic block diagram of the MAC unit is as

shown in the Fig. 1.

Fig. 1. Conventional MAC unit block diagram.

The multiplier block consists of the partial product generator

block and the partial product reduction block which reduces

the obtained partial products into the final product by the

process of addition. The output of multiplier is provided as

one of the input to the main adder block. The output of the

main adder block is stored in the register. The content of the

register is fed back to the main adder block as one of the

inputs during the next iteration.

II. EXISTING METHODS

A. Booth’s Alogorithm [12] & Wallace Tree Multiplier [7]:

Booth’s algorithm is a multiplication algorithm which also

works for two’s complement numbers. In this method, al-

gorithm makes use of 2 LSB bits to determine whether the

previous step output should be added/subtracted from the one

of the input or retained as it is. The main disadvantage of

basic booth alogorithm is that it requires variable number of

add/subtract operations and it also requires shift operations

between two consecutive add/subtract operations.

The implementation of Wallace Tree multiplier follows a

systematic way of reducing the number of partial products.

Wallace multipliers have a reduction delay that scales loga-

rithmically with the input size. In Wallace Tree construction

modeling, all the bits of all the partial products in every section

are included together by a set of counters in parallel without

propagating any carry. The main disadvantage of Wallace tree

multiplier architecture is that it has a irregular structure, which

makes designing of layout difficult. Furthermore, regardless

of multiplicand size all adder blocks are active, making this

architecture not economical.

III. PROPOSED ARCHITECTURE

A. Brent-Kung Adder [9]:

Brent kung adder belongs to parallel prefix adders family.

Parallel Prefix Adder as the terminology describes that the

prefix outcome of the execution of the operation depends on

the initial inputs. Parallel in this term is defined as the process

which involves the execution of an operation in parallel. This

is done by segmentation into smaller pieces that computed

in parallel. Then all bits of the sum will begin the process

concurrently. In Parallel Prefix adder case, binary addition is

Fig. 2. Carry Look-Ahead Adder Block Diagram.

usually expressed in terms of carry generation signal Gi , carry

propagation signal Pi , carry signal Ci , and sum signal Si ,

at each bit position (i=1 to n) , all these signals can be obtain

by using the equations below

G[i] = a[i] ∧ b[i] (3a)

P [i] = a[i]⊕ b[i] (3b)

C0 = 0 (3c)

C[i] = (a[i] ∧ b[i]) ∨ (a[i] ∧ C[i− 1]) ∨ (b[i] ∧ C[i− 1])(3d)

S[i] = a[i]⊕ b[i]⊕ C[i− 1], i = 1, ..., n. (3e)

B. Carry-Lookahead Adder [?]:

Look-ahead carry calculation accelerate the operation to per-

form addition, in light of the fact that in this carry calculation

for the following stages is computed ahead of time focused

around information signals. Carry look ahead relies on upon

two things: Calculating for every digit position, whether that

position is going to generate a carry if one roll in from the right

and consolidating these figured qualities to have the capacity

to conclude rapidly whether, for each one gathering of digits,

that gathering is going to generate a carry that comes in from

the right. The basic block diagram of CLA is as shown in Fig.

916 2015 IEEE International Advance Computing Conference (IACC)

a7 a6 a5 a4 a3 a2 a1 a0 × b7 b6 b5 b4 b3 b2 b1 b0

0 a7b0 a6b0 a5b0 a4b0 a3b0 a2b0 a1b0 a0b0a7b1 a6b1 a5b1 a4b1 a3b1 a2b1 a1b1 a0b1

p15 p14 p13 p12 p11 p10 p9 p8 p7 p6 p5 p4 p3 p2 p1 p0

Fig. 3. Illustration of 8-bit multiplication and stage one reduction process.

0 0 c11 s17 s16 s15 s14 s13 s12 s11 s10c12 s27 s26 s25 s24 s23 s22 s21 s20 a0b2

0 0 c13 s37 s36 s35 s34 s33 s32 s31 s30 a0b4c14 s47 s46 s45 s44 s43 s42 s41 s40 a0b6

p15 p14 p13 p12 p11 p10 p9 p8 p7 p6 p5 p4 p3 p2 p1 p0

Fig. 4. Illustration of stage two reduction process.

2. and the expressions for sum (Si) and carry (Ci) outputs are

as given below.

G[i] = a[i] ∧ b[i] (4a)

P [i] = a[i]⊕ b[i] (4b)

C[i+ 1] = (a[i] ∧ b[i]) ∨ (a[i] ∧ C[i]) ∨ (b[i] ∧ C[i]) (4c)

S[i] = P [i]⊕ C[i], i = 0, ..., n. (4d)

where Gi is the Carry Generate and Pi is the Carry Propagate

signals.

C. Proposed Partial Product Reduction Block:

Partial Product Reduction Block (PPRB) lies in the critical

path of multiplier. Hence the overall performance of the

multiplier depends on the PPRB performance. Consider 8-bit

multiplication process as shown in the Fig. 3. for the purpose

of explanation.Once the partial products are obtained concatenate the

alternate rows of partial products with zeros and the sum of

partial products is implemented using the 8-bit CLA/Brent-

Kung at this stage. This is performed for all the sets as

indicated in the Fig. 3. It can observed that the partial products

a0b0, a0b2, a0b4, a0b6 are left out during the addition in the

first round. However, the term a0b0 can be directly taken as p0i.e, the LSB of the product of multiplication. The remaining

terms that is a0b2, a0b4, a0b6 are considered for the addition

process during the further stages of partial product reduction.

The bit width of the adders used for partial product reduction

is different for different stages.Fig. 4. shows the stage-2 partial product reduction process

which considers the rows obtained from stage-1 reduction. Let

the sum and carry outputs of the stage-1 be denoted as Ski

and Ck . The output of the 1st block of stage-1 corresponds

to S1i appended with the carry C1. Similarly, S2i and C2

corresponds to 2nd block and so on. The rows resulting from

stage-2 reduction i.e, the result of adding of the rows enclosed

by the boxes depicted in the Fig. 6. is provided as input to

stage-3 reduction. The output of stage-3 reduction is nothing

but the product of multiplication.

IV. SIMULATION SETUP AND RESULTS

Each block is designed and simulated using Verilog coding

and the blocks are combined to form the MAC unit. The

verilog code is synthesized using the Cadence RTL compiler

and simulation is carried out by using Simvision. The working

of the MAC unit is verified for different input values.

The performance parameters of the 16-bit multiplier imple-

mented by using different method are compared by using the

Table I. It can be observed that the area requirement of the

multiplier implemented with CLA is 29% less when compared

to booth multiplier implementation, 8.25% less when com-

pared to multiplier implemented using Brent-Kung adder. The

power requirement of the multiplier implemented with CLA is

52% less when compared to booth multiplier implementation,

4.38% less when compared to multiplier implemented using

Brent-Kung adder. The delay performance of the multiplier

implemented with CLA is 54% less when compared to booth

multiplier implementation, 7.81% less when compared to

multiplier implemented using Brent-Kung adder.

The Table II is used for the comparison of performance 16-

bit MAC unit usig different multiplier implementation and can

be summarized as follows: Area requirement of the MAC unit

using [i] is 21.8% less when compared to MAC unit using

booth multiplier implementation, 6.28% less when compared

2015 IEEE International Advance Computing Conference (IACC) 917

TABLE IPERFORMANCE OF DIFFERENT MULTIPLIER IMPLEMENTATION

Particulars Booth Multiplier Multiplier using CLA[i] Multiplier using Brent-Kung adder[ii]

Instances 826 963 995Area 6562.786 4653.432 5071.853

Leakage Power(uW) 7.301 3.343 3.973Dynamic Power(mW) 0.29 0.1392 0.1451

Total Power(mW) 0.2973 0.1425 0.149Arrival(ns) 12.731 5.864 6.361

TABLE IIPERFORMANCE OF DIFFERENT MAC UNIT IMPLEMENTATIONS

Particulars MAC with Booth Multiplier MAC with Multiplier using CLA MAC with Multiplier using Brent-Kung adder

Instances 1022 1175 1209Area 8108.755 6335.582 6760.354

Leakage Power(uW) 8.463 4.561 5.203Dynamic Power(mW) 0.371545 0.222 0.22922

Total Power(mW) 0.38 0.227 0.234425.Arrival(ns) 13.594 6.972 7.365

Fig. 5. (a) Area performance of different MAC implementation, (b) Delayperformance of different MAC implementation.

0.4Fig. 6. Power consumed by different MAC units

to MAC unit implemented using [ii] . The power consumption

of the MAC unit using [i] is 40.26% less when compared to

MAC with booth multiplier implementation, 3.16% less when

compared to MAC unit implemented using [ii]. The delay

performance of the MAC unit using [i] is 48.71% less when

compared to MAC unit using booth multiplier implementation,

5.34% less when compared MAC unit implemented using [ii].

The comparison of the area, delay and power performances of

different MAC unit implementations is illustrated graphically

in Fig. 5 and Fig. 6, respectively.

V. CONCLUSION

In this paper, the design and implementation of MAC unit

by using Verilog coding has been presented. The MAC unit

discussed can perform 16-bit unsigned operations. In this

work, the multiplier is realized by using different bit-width

Carry-Lookahead Adder (CLA) and Brent-Kung adder. The

MAC unit implemented with the multiplier using the proposed

partial product reduction block achieves better delay, power

and area performace when compared to MAC unit consisting

of the conventional Booth Multiplier.

REFERENCES

[1] A. Farooqui and V. Oklobdzija, “General Data-Path Organization of aMAC Unit for VLSI Implementation of DSP Processors,” in Proc. IEEEIntl. Symposium on Circuits and Systems, May 1998, pp. 260263.

[2] O. Chen, N. Y. Shen, and C. C. Shen, “A Low-Power MultiplicationAccumulation Calculation Unit for Multimedia Applications,” in Proc.IEEE Intl. Conference on Acoustics, Speech, and Signal Processing,Apr. 2003, pp. II6458.

[3] L. H. Chen, L. H. Chen, T. Y. Wang, and Y. C. Ma, “A MultiplicationAccumulation Computation Unit with Optimized Compressors and Min-imized Switching Activities,” in Proc. IEEE Intl. Symposium on Circuitsand Systems, May 2005, pp. 61186121.

[4] A. Abdelgawad, “Low Power Multiply Accumulate Unit (MAC) forFuture Wireless Sensor Networks,” in Proc. IEEE Sensors App. Sym-posium, Feb. 2013, pp. 129132.

[5] A. Abdelgawad and M. Bayoumi, “High Speed and Area-Efficient Multi-ply Accumulate (MAC) Unit for Digital Signal Processing Applications,”in Proc. IEEE Intl. Symposium on Circuits and Systems, May 2007, pp.3199 3202.

[6] T. T. Hoang, M. Sjalander, and P. Larsson-Edefors, “A High-Speed,Energy-Efficient Two-Cycle Multiply-Accumulate (MAC) Architectureand its Application to a Double-Throughput MAC Unit,” IEEE Trans.on Circuits and Systems, vol. 57, no. 12, pp. 30733081, Dec 2010.

[7] C. Wallace, “A Suggestion for a Fast Multiplier,” IEEE Trans. onElectronics and Computers, vol. EC-13, no. 1, pp. 1417, Feb. 1964.

[8] M. S. Schmookler and A. Weinberger, “High Speed Decimal Addition,”IEEE Trans. on Computers, vol. C-20, no. 8, pp. 862866, Aug. 1971.

[9] R. P. Brent and H. Kung, “A Regular Layout for Parallel Adders,” IEEETrans. on Computers, vol. C-31, no. 3, pp. 260264, Mar. 1982.

[10] W. Chu, A. Unwala, P. Wu, and E. Swartzlander, “Implementation of aHigh Speed Multiplier using Carry Lookahead Adders,” in Proc. IEEEAsilomar Conf. on Signals, Systems and Computers, Nov. 2013, pp.400404.

[11] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, “Discrete-time SignalProcessing (2Nd Ed.),” Upper Saddle River, NJ, USA: Prentice-Hall,Inc., 1999.

[12] A. Booth, “A Signed Binary Multiplication Technique,”Quarterly Journalof Mechanics and Applied Mathematics, vol. 4, no. 2, pp. 236 240, Jun.1951.

918 2015 IEEE International Advance Computing Conference (IACC)

Design of High Performance Multiply-Accumulate Computation...

Documents

Xilinx DSP 1 Xilinx Core Solutions Group DSP. Traditional DSP: DSP Processors Multiply Add Single MAC – One MAC (Multiply Accumulate) – Time-Shared –

Low Complexity Multiply-Accumulate Units for Convolutional ...the pseudo code in Figure 1. The convolution operator has an input image of dimensions IH ×IW and C channels and is convolved

Weird Operators with Big Benefits Heterogenous Bitwidth ...Josh Fromm. Network Binarization Multiply-accumulate becomes xnor-popcount. 5-30x theoretical speedup. 32x weight memory

DTFT and Digital Filters 11-Apr Digital Filters10 Digital Filters Types FIR From H(z): Filter becomes a “multiply, accumulate, and delay” system: IIR • Impulse response function

ENTAUR DDS TO SERVER ROCESSOR - Centaur Technology · 2019-12-10 · include floating-point multiply-accumulate (MAC) units; a third FP/AVX unit handles FP and crypto (AES) divide

EMBEDDED SYSTEM BASICS AND APPLICATION · • DSP as a GPP is a single chip VLSI unit. • It includes the computational capabilities of microprocessor and multiply & accumulate units

numerical computation of schwarz-christoffel transformations and slit maps for multiply connected

MatRaptor: A Sparse-Sparse Matrix Multiplication Accelerator … · 2020. 10. 1. · of multiply-accumulate (MAC) operations performed when a single byte of data is read/written from/to

ColdFire TCP/IP Stack Deep Dive - NXP Semiconductorscache.freescale.com/files/microcontrollers/doc/train_ref_material...• 57 MIPS V2 Core with Enhanced Multiply and Accumulate for

TMS320C54x Digital Filters - TI.com TMS320C54x Digital Filters 3 Example 1. FIR Implementation Using Circular Addressing Mode With a Multiply and Accumulate (MAC) Instruction

Accumulate Capital

INJNTU.COM INJNTU.COM INJNTU · Q)ln the TMS320C5X Processor the MACD Instruction corresponds to --> Multiply Accumulate With Data shift Q)One of the most common operations required

MULTIPLY YOUR WELLNESS - SUREMED Health · To upgrade to Multiply Provider or Multiply Premier call us on 0861 100 789 ... 25% off 5% off ... 4 + Multiply cards to multiply your Smart

Benchmarking a DSP processor - DiVA portal18815/FULLTEXT01.pdf · Maybe the most common feature is the MAC unit. MAC is an abbreviation for multiply and accumulate. This is very useful

Speculate To Accumulate Booklet

Strand: Computation Topic: 5.C.1 and 5.C.2 - Multiply and ... · Strand: Computation Topic: 5.C.1 and 5.C.2 - Multiply and Divide Whole Numbers Level: Fifth Grade Score 4.0 In addition

Careers at accumulate #culturecode

Approximate signed multipliers for multiply-accumulate circuitsessay.utwente.nl/75243/1/oedzes_johan_BA_EEMCS.pdf · 2018-06-29 · 2 MULTIPLIER ANALYSIS 2 Multiplier analysis 2.1

Storage I/O Summary - Stanford Universityweb.stanford.edu/class/ee282h/handouts/Handout40.pdfStorage I/O Summary ... MIPS » MDMX ... Single cycle multiply accumulate (multiple busses

MODIFIED MULTIPLY-ACCUMULATE ARCHITECTURE …journals-sathyabama.com/archives/ies/IES-Int-Journal-Oct-2008-14.pdf · undoubtedly having the challenges in low-power VLSI Design. The