
[IEEE APCCAS 2006 - 2006 IEEE Asia Pacific Conference on Circuits and Systems - Singapore (2006.12.4-2006.12.7)] APCCAS 2006 - 2006 IEEE Asia Pacific Conference on Circuits and Systems


A Reconfigurable Multi-Modulus Modulo Multiplier

Shibu Menon and Chip-Hong Chang

Centre for High Performance Embedded Systems, Nanyang Technological University, 50 Nanyang Drive, Research Techno Plaza, 3rd Storey, Border X Block, Singapore 637553

Abstract: This paper presents a novel architecture for a multi-modulus reconfigurable modulo multiplier with moduli selectable from the set $\{2^n-1, 2^n, 2^n+1\}$. The efficient unification of the bottleneck modulo $2^n+1$ multiplication leads to its performance nearly matching that of the modulo $2^n-1$ multiplier. The proposed modulo $2^n+1$ multiplier is well suited for applications in multi-modulus reconfigurable architectures. The reconfigurability is achieved by a universal structure of Multi-Operand Modulo Adder (MOMA) and two-operand modulo adder. The advantages and applications of a low-complexity configurable architecture for modulo multiplication that incurs little speed penalty compared to the single-function case are presented. Area and timing overheads are inevitably incurred by the multiplexers required for switching between the moduli. They represent a minor penalty to be paid for the agile reconfigurability. Synthesis results in TSMC 0.18 µm CMOS technology demonstrate the efficacy of the proposed architectures.

Keywords: RNS, modulo multiplier, multi-modulus system

I. INTRODUCTION

Applications of modulo arithmetic extend to fields as varied as Residue Number System (RNS) applications, fault-tolerant computer systems, Fermat number transforms and cryptography [1]. Modulo addition and multiplication form the basis for most modulo arithmetic units [2]. The efficient implementation of modulo adders and multipliers is thus a cornerstone of efficient modulo arithmetic implementation. Special moduli sets consisting of moduli of the types $2^n \pm 1$ have advantages in terms of their efficient implementation of modulo arithmetic units and reverse converters for RNS applications [3]. Efficient architectures for the special moduli set have also been used in cryptography applications like IDEA [2].

Of the three major types of special modulo channels, the $2^n+1$ channel is the most critical, as it does not provide the efficiency or regularity of structure provided by the $2^n-1$ and $2^n$ channels, owing to the large amount of superfluous computation introduced by the extra bit. The diminished-one system [4] seeks to work around this by diminishing the input operands by one and relying on special circuitry for the detection of zero. Efficient multipliers which approach the path delay of $2^n$ multipliers have been designed in the diminished-one system [5]. However, these systems do not consider the incrementers and decrementers required at the outset for conversion to and from the normal modulo number system. Modulo $2^n+1$ multipliers based on normal modulo representations have been presented in [2] and [6]. The system in [6] suffers from the requirement of a correction unit to obtain a legitimate $2^n+1$ number. The system in [2], on the other hand, has a complex partial product generation unit that is not well suited for reconfigurability and an extra multiplexer on the datapath. This paper proposes an architecture similar to the one proposed in [5], but for a non-diminished-one system. Advantages in terms of delay as compared to other reported modulo multipliers in normal representation [2, 6] are presented.

Reconfigurable computing in the RNS domain is a rather new field. Most research in this field has focused on Variable Word Length (VWL) RNS processors [7, 8], which provide configurability in terms of dynamic range by providing the ability to turn off channels that are not required. While this is a worthwhile feature, the ability to reconfigure the architecture for different moduli sets can go further and find other applications. An interesting example would be the case of special four and five moduli sets [3], where the low-latency paths of the residue arithmetic unit can be made reconfigurable to save area and power without adding overhead in terms of delay. Another interesting application for reconfigurable multi-modulus modulo multipliers would be in fault-tolerant RNS systems of the type shown in [9], where redundant channels are perfect candidates for reconfigurability. Additionally, having reconfigurable modulo arithmetic units provides flexibility in the design of RNS processors [10] or cryptoprocessors catering for multiple security protocols.

This paper proposes a novel reconfigurable modulo multiplier architecture based on sharing of common resources for the special three moduli set $\{2^n-1, 2^n, 2^n+1\}$. Section II presents the algorithm and architecture of the novel modulo $2^n+1$ multiplier. Section III builds on this architecture to further propose a multi-modulus reconfigurable architecture. Performance tradeoffs from the area/delay perspective are analyzed and presented in Section IV.

II. MODULO $2^n+1$ MULTIPLIER

The targeted reconfigurable multiplier calls for a modulo $2^n+1$ architecture that approaches the latency performance of the modulo $2^n-1$ multiplier and maintains a similar architecture to minimize area overhead. Diminished-one multipliers need incrementers to convert the product to the normal modulo representation used by the other moduli, and are thus not suitable for reconfigurable architectures.

Among modulo multipliers that work on normal RNS representation [2, 6], some deficiencies are observed. In the case of [6], they arise from the additional latency incurred by the final converters used to obtain the modulo $2^n+1$ representation, as well as the usage of ordinary carry propagation adders (CPAs) in the final stages. The architecture in [2], on the other hand, has the additional area overhead of a $2^n$ correction unit and the additional latency of the multiplexer for selection of the correction terms.

1-4244-0387-1/06/$20.00 ©2006 IEEE

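A. Proposed architecture

Let $A = \sum_{i=0}^{n} a_i 2^i$ and $B = \sum_{j=0}^{n} b_j 2^j$ be two modulo $2^n+1$ numbers. Their modulo $2^n+1$ product can be represented as:

$$
\begin{aligned}
\left|A \times B\right|_{2^n+1} &= \left|\sum_{i=0}^{n}\sum_{j=0}^{n} a_i b_j 2^{i+j}\right|_{2^n+1} \\
&= \left|\underbrace{\sum_{i=0}^{n-1}\sum_{j=0}^{n-1} a_i b_j 2^{i+j}}_{\text{I}} + \underbrace{a_n\sum_{j=0}^{n-1} b_j 2^{n+j} + b_n\sum_{i=0}^{n-1} a_i 2^{n+i}}_{\text{II}} + \underbrace{a_n b_n 2^{2n}}_{\text{III}}\right|_{2^n+1}
\end{aligned}
\tag{1}
$$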

The term II of (1) can be represented as

$$
\left|\sum_{k=0}^{n-1}\left(a_n b_k + a_k b_n\right) 2^{n+k}\right|_{2^n+1}
\tag{2}
$$

By using the following property of modulo $2^n+1$ arithmetic,

$$
\left|a_i b_j 2^{k}\right|_{2^n+1} = \left|\overline{a_i b_j}\, 2^{k-n} + 2^{k}\right|_{2^n+1}
\tag{3}
$$

where $k = i + j$ and $k \ge n$,

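(2) is reducible to:

$$
\begin{aligned}
\left|\sum_{k=0}^{n-1}\left(a_n b_k \lor a_k b_n\right) 2^{n+k}\right|_{2^n+1}
&= \left|\sum_{k=0}^{n-1}\left[a_n \lor b_n\right]\left[a_k \lor b_k\right] 2^{n+k}\right|_{2^n+1} \\
&= \left|\sum_{k=0}^{n-1} sq_k\, 2^{n+k}\right|_{2^n+1} \\
&= \left|\sum_{k=0}^{n-1}\left(\overline{sq_k}\, 2^{k} + 2^{n+k}\right)\right|_{2^n+1} \\
&= \left|\sum_{k=0}^{n-1}\overline{sq_k}\, 2^{k} + \sum_{m=n}^{2n-1} 2^{m}\right|_{2^n+1} \\
&= \left|\sum_{k=0}^{n-1}\overline{sq_k}\, 2^{k} + 2\right|_{2^n+1}
\end{aligned}
\tag{4}
$$

where $sq_k = (a_n \lor b_n)(a_k \lor b_k)$. The sum in (2) can be replaced by a logical OR because $a_n = 1$ implies $a_k = 0$ for all $k < n$ (and likewise for $B$), so $a_n b_k$ and $a_k b_n$ are never both 1.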
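The reduction of term II can be checked numerically. The following is a minimal sketch, not part of the original paper: the function names `term_ii_direct` and `term_ii_reduced` are hypothetical, and the check assumes valid modulo $2^n+1$ operands in the range 0 to $2^n$, so that bit $n$ is set only for the value $2^n$.

```python
def term_ii_direct(a, b, n):
    """Term II as written in (2): sum of (a_n*b_k + a_k*b_n) * 2^(n+k), reduced mod 2^n+1."""
    modulus = (1 << n) + 1
    a_n, b_n = (a >> n) & 1, (b >> n) & 1
    total = 0
    for k in range(n):
        a_k, b_k = (a >> k) & 1, (b >> k) & 1
        total += (a_n * b_k + a_k * b_n) << (n + k)
    return total % modulus


def term_ii_reduced(a, b, n):
    """Last line of (4): sum of NOT(sq_k) * 2^k plus the constant 2, reduced mod 2^n+1."""
    modulus = (1 << n) + 1
    hi = ((a >> n) | (b >> n)) & 1          # a_n OR b_n
    total = 2
    for k in range(n):
        sq_k = hi & (((a >> k) | (b >> k)) & 1)
        total += (sq_k ^ 1) << k            # complemented bit at weight 2^k
    return total % modulus


def forms_agree(n):
    """Exhaustively compare both forms over all valid operand pairs."""
    values = range((1 << n) + 1)            # 0 .. 2^n inclusive
    return all(term_ii_direct(a, b, n) == term_ii_reduced(a, b, n)
               for a in values for b in values)
```

Calling `forms_agree(n)` for small word lengths exercises every legal operand pair, including the special value $2^n$.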

The term III of (1) can be expressed as:

$$
\left|a_n b_n 2^{2n}\right|_{2^n+1} = \left|a_n b_n 2^{0}\right|_{2^n+1}
\tag{5}
$$

The term I of (1) can be further simplified as:

$$
\left|\sum_{i=0}^{n-1}\sum_{j=0}^{n-1-i} a_i b_j 2^{i+j} + \underbrace{\sum_{i=1}^{n-1}\sum_{j=n-i}^{n-1} a_i b_j 2^{i+j}}_{\text{A}}\right|_{2^n+1}
\tag{6}
$$

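For the term A of (6), since $i + j \ge n$,

$$
\begin{aligned}
&\left|\sum_{i=1}^{n-1}\sum_{j=n-i}^{n-1}\left(\overline{a_i b_j}\, 2^{i+j-n} + 2^{i+j}\right)\right|_{2^n+1} \\
&\quad= \left|\sum_{i=1}^{n-1}\sum_{j=n-i}^{n-1}\overline{a_i b_j}\, 2^{i+j-n} + 2^{n}\left(2^{n} - 1 - n\right)\right|_{2^n+1}
\end{aligned}
\tag{7}
$$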
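The constant in (7) comes from summing the $2^{i+j}$ terms over the wrapped-around index range. The intermediate steps are not spelled out in the text; one way to write them, as a worked sketch, is:

```latex
\sum_{i=1}^{n-1}\sum_{j=n-i}^{n-1} 2^{i+j}
  = \sum_{i=1}^{n-1} 2^{i}\left(2^{n} - 2^{n-i}\right)
  = 2^{n}\sum_{i=1}^{n-1}\left(2^{i} - 1\right)
  = 2^{n}\left(2^{n} - 1 - n\right)
```

For example, with $n = 3$ the wrapped terms are $2^3$ (for $i=1$) and $2^3 + 2^4$ (for $i=2$), which sum to $32 = 2^3(2^3 - 1 - 3)$.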

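Thus, modulo $2^n+1$ multiplication can be represented as:

$$
\begin{aligned}
\left|A \times B\right|_{2^n+1} = \Bigg|&\sum_{i=0}^{n-1}\sum_{j=0}^{n-1-i} a_i b_j 2^{i+j} + \sum_{i=1}^{n-1}\sum_{j=n-i}^{n-1}\overline{a_i b_j}\, 2^{i+j-n} \\
&+ \sum_{k=0}^{n-1}\overline{sq_k}\, 2^{k} + a_n b_n 2^{0} + 2 + 2^{n}\left(2^{n} - 1 - n\right)\Bigg|_{2^n+1}
\end{aligned}
\tag{8}
$$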

The complemented end-around carry (CEAC) [11] for the carry-save addition of (8) can be represented mathematically as:

$$
\left|c_n 2^{n}\right|_{2^n+1} = \left|2^{n} + \overline{c_n}\right|_{2^n+1}
\tag{9}
$$
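The identity behind the complemented end-around carry can be justified in one line, using only $2^n \equiv -1 \pmod{2^n+1}$ and $\overline{c_n} = 1 - c_n$ for a single bit. A short worked step, added here as a sketch rather than taken from the paper:

```latex
c_n 2^{n} \equiv -c_n = (1 - c_n) - 1 = \overline{c_n} - 1 \equiv \overline{c_n} + 2^{n} \pmod{2^{n}+1}
```

In other words, feeding the complemented carry-out back into the least significant position leaves the result short by $2^n$, which is why each CEAC needs a $2^n$ correction.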

Thus, every CEAC adds a $2^n$ term to the result. For an input of $n+3$ terms, the number of CSA units required in a Wallace tree is $n+1$. Thus, the correction factor terms can be aggregated to:

$$
\left|2^{n}\left(2^{n} - 1 - n\right) + (n+1)2^{n}\right|_{2^n+1} = \left|2^{2n}\right|_{2^n+1} = 1
\tag{10}
$$

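Thus, modulo $2^n+1$ multiplication can be reduced to:

$$
\begin{aligned}
\left|A \times B\right|_{2^n+1} &= \left|\sum_{i=0}^{n-1}\sum_{j=0}^{n-1-i} a_i b_j 2^{i+j} + \sum_{i=1}^{n-1}\sum_{j=n-i}^{n-1}\overline{a_i b_j}\, 2^{i+j-n} + \sum_{k=0}^{n-1}\overline{sq_k}\, 2^{k} + a_n b_n 2^{0} + 2 + 1\right|_{2^n+1} \\
&= \left|PP + \overline{sq} + a_n b_n 2^{0} + 2 + 1\right|_{2^n+1}
\end{aligned}
\tag{11}
$$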

The architecture of the modulo $2^n+1$ multiplier which implements (11) is shown in Figure 1. The arrangement of equations for the proposed modulo $2^n+1$ multiplier necessitates an $(n+3)$-operand CSA tree with CEAC for the MOMA unit.
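A bit-level software model can help confirm that the operand arrangement and the aggregated constant work together. The sketch below is illustrative only and is not the authors' implementation: the function names are hypothetical, the CSA tree is reduced sequentially rather than as a Wallace tree (which changes the delay, not the value), the final two-operand addition is modelled as a plain modular addition, and the constant operand is taken as $2 + 1 = 3$ (assuming $n \ge 2$ so that it fits in $n$ bits).

```python
def csa_ceac(x, y, z, n):
    """One carry-save stage on n-bit operands with complemented end-around carry."""
    mask = (1 << n) - 1
    s = x ^ y ^ z
    c = (x & y) | (x & z) | (y & z)
    carry_out = (c >> (n - 1)) & 1
    # Shift the carry vector left; the bit of weight 2^n re-enters at bit 0, complemented.
    return s, ((c << 1) & mask) | (carry_out ^ 1)


def moma_operands(a, b, n):
    """Build the n+3 operands of the multi-operand addition for modulo 2^n+1."""
    bit = lambda v, i: (v >> i) & 1
    rows = []
    for i in range(n):
        row = 0
        for j in range(n):
            p = bit(a, i) & bit(b, j)
            if i + j < n:
                row |= p << (i + j)              # in-range partial-product bit
            else:
                row |= (p ^ 1) << (i + j - n)    # wrapped bit, complemented
        rows.append(row)
    hi = bit(a, n) | bit(b, n)
    sq_bar = 0
    for k in range(n):
        sq_k = hi & (bit(a, k) | bit(b, k))
        sq_bar |= (sq_k ^ 1) << k
    return rows + [sq_bar, bit(a, n) & bit(b, n), 3]


def mul_mod_2n_plus_1(a, b, n):
    """Multiply two modulo 2^n+1 numbers (0 .. 2^n) via n+1 CSA stages with CEAC."""
    ops = moma_operands(a, b, n)
    while len(ops) > 2:
        s, c = csa_ceac(ops[0], ops[1], ops[2], n)
        ops = ops[3:] + [s, c]
    return (ops[0] + ops[1]) % ((1 << n) + 1)
```

Comparing `mul_mod_2n_plus_1(a, b, n)` against `(a * b) % (2**n + 1)` for every operand pair at a small word length is a quick way to exercise the whole derivation, including the special operand value $2^n$.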

It should be noted that the proposed architecture requires three additional terms to be added compared to [2]. This widens the MOMA unit somewhat without necessarily affecting its depth and latency. Extra delay is incurred only when the number of operands falls in a range such that adding three more terms causes an extra level of adders to be inserted in the tree. An example is the 8-bit modulo multiplier. For the modulo $2^n$ and $2^n-1$ cases, an 8-operand MOMA would be needed and the number of levels would be 4. The need to add three terms for the modulo $2^n+1$ case leads to a 5-level MOMA. Thus an additional stage of CSA delay is incurred. The architecture in [2] is thus expected to have a slight advantage in the delay of the MOMA for $(n+1)$ operands, which is one level less than for $(n+3)$ operands. However, the proposed architecture does not require the multiplexer for the $2^n$ correction. The elimination of this multiplexer is expected to offset the speedup in [2] due to the situational reduction of one level of delay in the MOMA. For the situations where the number of operands calls for the same number of CSA stages in the MOMA, the proposed architecture is expected to show a distinct performance gain. Moreover, the proposed architecture is more amenable to resource sharing than the simplified multiplexer for the generation of partial product terms in [2], which cannot be shared with the partial product generator for the modulo $2^n-1$ operation. These advantages of the proposed architecture will be borne out by the results of Section IV.
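The level counts quoted above can be reproduced with a small helper. This is a sketch under the usual Wallace-tree assumption that every level groups operands in threes and replaces each group by two outputs; the function name `csa_levels` is hypothetical.

```python
def csa_levels(operands):
    """Number of carry-save adder levels needed to reduce `operands` rows to two."""
    levels = 0
    while operands > 2:
        operands -= operands // 3   # each full group of three rows becomes two rows
        levels += 1
    return levels


# Operand counts for the 8-bit case discussed in the text.
levels_8_operands = csa_levels(8)    # modulo 2^n and 2^n-1
levels_9_operands = csa_levels(9)    # n+1 operands, as in [2]
levels_11_operands = csa_levels(11)  # n+3 operands, proposed
```

Under this model, 8 and 9 operands both need 4 levels and 11 operands need 5, matching the 8-bit example. For $n = 16$ and $n = 32$ the same helper gives equal depths for $n$ and $n+3$ operands, which is the situation in which no extra CSA level is incurred.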

[Figure 1 block diagram: operands A and B enter the PPG unit, whose outputs feed the MOMA (CSA tree with CEAC, $n+3$ operands), followed by the two-operand adder that produces the output.]

Figure 1. Architecture for modulo $2^n+1$ multiplication

Figure 2. Configurable architecture for the PPG

III. RECONFIGURABLE MODULO MULTIPLIER

A modulo multiplier has three main units: a Partial Product Generator (PPG), a Multi-Operand Modulo Adder (MOMA) and a final two-input modulo adder. For an efficient architecture flexible enough to be configurable, as much sharing of resources as possible is required.

The PPG for the case of $n = 3$ is shown in Figure 2. This unit uses multiplexers to choose between the relevant partial products for a particular modulus. The adoption of the multiplier architecture proposed earlier in the paper simplifies the design of this unit compared with that proposed in [2].
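The multiplexing idea can be sketched in software. The model below is an assumption about what the multiplexers select, inferred from the derivation in Section II rather than read off Figure 2: partial-product bits with $i + j < n$ are common to all three moduli, while the wrapped-around bits are passed unchanged for $2^n-1$, dropped for $2^n$, and complemented for $2^n+1$. The function name `ppg_rows` and the string labels for the modulus are hypothetical.

```python
def ppg_rows(a, b, n, modulus):
    """Generate n partial-product rows for modulus '2^n-1', '2^n' or '2^n+1'.

    Only the n low-order bits of a and b are used here; the extra bit of the
    modulo 2^n+1 operands is handled by separate terms in the derivation.
    """
    rows = []
    for i in range(n):
        row = 0
        for j in range(n):
            p = ((a >> i) & 1) & ((b >> j) & 1)
            if i + j < n:
                row |= p << (i + j)              # shared by all three moduli
            elif modulus == "2^n-1":
                row |= p << (i + j - n)          # wrap around, since 2^n = 1
            elif modulus == "2^n+1":
                row |= (p ^ 1) << (i + j - n)    # wrap around and complement
            # modulus "2^n": bits of weight 2^n and above are discarded
        rows.append(row)
    return rows
```

For operands below $2^n$, the rows sum to the product modulo $2^n-1$ and modulo $2^n$ directly. For $2^n+1$ the rows sum to the product only after adding the constant $n + 2$, which is what $2^n(2^n - 1 - n)$ reduces to modulo $2^n+1$; in the architecture this constant is folded into the correction terms.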

The MOMA consists of CCSA (Composite Carry Save) units with selectable CEAC/EAC [11], as shown in Figure 3. The modulus selection signal essentially chooses between the complemented and the uncomplemented end-around carry. A diminished-one two-operand prefix adder is used as the final two-operand carry propagation adder. This leads to a very simple shared architecture at this level, based on [12] and shown in Figure 4. Multiplexing is needed only at the level of the end-around carry unit.
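The selectable end-around carry can be illustrated with a single carry-save stage. This is a minimal sketch, not the CCSA cell of [11]: the function name `csa_stage` and the boolean `complemented` flag (standing in for the modulus selection signal) are assumptions, and only the $2^n-1$ (EAC) and $2^n+1$ (CEAC) cases are modelled.

```python
def csa_stage(x, y, z, n, complemented):
    """Carry-save stage on n-bit operands with a selectable end-around carry.

    complemented=False: plain EAC, preserves the sum modulo 2^n-1.
    complemented=True:  CEAC, the result is short by 2^n modulo 2^n+1.
    """
    mask = (1 << n) - 1
    s = x ^ y ^ z
    c = (x & y) | (x & z) | (y & z)
    carry_out = (c >> (n - 1)) & 1
    wrapped = carry_out ^ 1 if complemented else carry_out   # the only multiplexed bit
    return s, ((c << 1) & mask) | wrapped
```

The two modes differ only in the single bit that re-enters at the least significant position, which is consistent with the statement that multiplexing is confined to the end-around carry.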

[Figure 3 diagram: CCSA tree with inputs PP0 to PP7 and $sq_k$.]

Figure 3. CCSA tree for MOMA

Figure 4. Shared architecture for two-operand modulo addition

IV. IMPLEMENTATION AND RESULTS

The architectures proposed in this paper have been implemented in TSMC 0.18 µm CMOS standard-cell technology as a proof of concept. The architectures were coded in VHDL and functionally simulated using MODELSIM SE. Synopsys Design Compiler (v2001.08) was used for the synthesis. The synthesis constraints included a virtual clock with an operating frequency of 10 MHz. The output ports were connected to a load equivalent to four buffer cells. A medium optimization effort was used.

All architectures were implemented for different bit-widths ($n$ = 8, 16 and 32) and with different prefix structures [2] for the two-operand adder, i.e., SKL (Sklansky), KS (Kogge-Stone) and BK (Brent-Kung). Firstly, modulo $2^n+1$ multipliers for the proposed architecture and the architecture of [2] were implemented; the area/timing results are provided in Tables I and II, respectively. The proposed architecture, while taking up more area because of the increased width of the MOMA, shows comparable latency for small bit-widths and better performance than that of [2] for large bit-widths.

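TABLE I. AREA/TIMING RESULTS FOR THE PROPOSED $2^n+1$ MULTIPLIER

| Proposed | n | SKL | BK | KS |
|---|---|---|---|---|
| Area (µm²) | 8 | 21235.75 | 20320.98 | 20889.80 |
| Area (µm²) | 16 | 67236.73 | 67552.77 | 69555.17 |
| Area (µm²) | 32 | 236569.6 | 228273.2 | 244076.0 |
| Delay (ns) | 8 | 3.05 | 3.12 | 3.11 |
| Delay (ns) | 16 | 3.69 | 3.77 | 3.70 |
| Delay (ns) | 32 | 4.63 | 4.77 | 4.74 |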

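TABLE II. AREA/TIMING RESULTS FOR THE $2^n+1$ MULTIPLIER PROPOSED IN [2]

| [2] | n | SKL | BK | KS |
|---|---|---|---|---|
| Area (µm²) | 8 | 17097.69 | 16918.07 | 18355.05 |
| Area (µm²) | 16 | 63580.88 | 60321.05 | 62137.30 |
| Area (µm²) | 32 | 222009.36 | 222768.22 | 234678.90 |
| Delay (ns) | 8 | 2.96 | 2.88 | 2.74 |
| Delay (ns) | 16 | 3.68 | 3.69 | 3.83 |
| Delay (ns) | 32 | 4.86 | 4.89 | 4.78 |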

Modulo $2^n-1$ multipliers using the standard architectures [2] were also implemented to gauge the difference in performance from the proposed $2^n+1$ multipliers. Table III shows the area/timing results for the implementation of the modulo $2^n-1$ multipliers. As can be seen, the performance of the modulo $2^n+1$ multipliers is comparable to that of the corresponding modulo $2^n-1$ multipliers.

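TABLE III. AREA/TIMING RESULTS FOR THE $2^n-1$ MULTIPLIER

| $2^n-1$ multiplier | n | SKL | BK | KS |
|---|---|---|---|---|
| Area (µm²) | 8 | 13491.9 | 14676.1 | 14516.4 |
| Area (µm²) | 16 | 53970.9 | 57550.1 | 57630.0 |
| Area (µm²) | 32 | 212326.2 | 207007.0 | 227360.1 |
| Delay (ns) | 8 | 2.70 | 2.69 | 2.70 |
| Delay (ns) | 16 | 3.61 | 3.55 | 3.63 |
| Delay (ns) | 32 | 4.61 | 4.69 | 4.60 |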

The proposed reconfigurable multiplier was implemented and the area/timing results are presented in Table IV.

TABLE IV. AREA/TIMING RESULTS FOR THE RECONFIGURABLE MULTIPLIER

| Reconfigurable multiplier | n | SKL | BK | KS |
|---|---|---|---|---|
| Area (µm²) | 8 | 25413.7 | 26567.9 | 24948.0 |
| Area (µm²) | 16 | 86220.4 | 86649.6 | 84590.5 |
| Area (µm²) | 32 | 279023.0 | 272033.1 | 274080.8 |
| Delay (ns) | 8 | 3.22 | 3.22 | 3.28 |
| Delay (ns) | 16 | 3.85 | 3.92 | 3.83 |
| Delay (ns) | 32 | 4.76 | 4.93 | 5.06 |
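The area figures in Tables I, III and IV can be compared directly. The short sketch below uses only the SKL columns as transcribed from those tables; the dictionary and function names are hypothetical, and the comparison against the sum of a separate $2^n+1$ and $2^n-1$ multiplier is one possible reading of "combination of individual multipliers".

```python
# SKL area figures in square micrometres, copied from Tables I, III and IV.
AREA_2N_PLUS_1 = {8: 21235.75, 16: 67236.73, 32: 236569.6}   # Table I (proposed)
AREA_2N_MINUS_1 = {8: 13491.9, 16: 53970.9, 32: 212326.2}    # Table III
AREA_RECONFIGURABLE = {8: 25413.7, 16: 86220.4, 32: 279023.0}  # Table IV


def area_ratio(n):
    """Reconfigurable area divided by the summed area of the two separate multipliers."""
    separate = AREA_2N_PLUS_1[n] + AREA_2N_MINUS_1[n]
    return AREA_RECONFIGURABLE[n] / separate


ratios = {n: round(area_ratio(n), 3) for n in (8, 16, 32)}
```

With these figures the ratio lies between roughly 0.62 and 0.73 for the three word lengths, and the reconfigurable unit additionally covers the modulo $2^n$ case.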

As can be seen from this table, the price paid for the advantages of reconfigurability is very minimal in terms of the timing overhead and would be acceptable in special applications where it is useful. The area cost of such a reconfigurable multiplier is much smaller than the combination of individual $2^n-1$ and $2^n+1$ multipliers.

V. CONCLUSION

Reconfigurable multi-modulus modulo multipliers find applications in various specialized RNS systems. The main challenge in the design of a reconfigurable architecture is to minimize the latency tradeoff while reducing the area overhead due to redundant circuitry. For the targeted reconfigurable triple-modulus multiplier, the most critical task is to architect a modulo $2^n+1$ multiplier which approaches the latency of a $2^n-1$ multiplier and yet possesses a similar architecture that can be unified with the $2^n-1$ multiplier to maximize sharing of circuitry. This paper presented a novel modulo $2^n+1$ multiplier based on normal RNS representation whose delay is comparable to, or even better than, that of some existing multipliers, while maintaining good architectural polymorphism with the same operation in the two other modulo channels. A reconfigurable multi-modulus architecture that uses multiplexers for switching between shared circuitry has been synthesized using 0.18 µm CMOS standard-cell libraries to substantiate its figures of merit.

REFERENCES

[1] P. V. Ananda Mohan, Residue Number Systems: Algorithms and Architectures. Kluwer Academic Publishers, 2002.

[2] R. Zimmermann, "Efficient VLSI implementation of modulo ($2^n \pm 1$) addition and multiplication," in Proc. 14th IEEE Symp. Computer Arithmetic, pp. 158-167, Apr. 1999.

[3] B. Cao, C. H. Chang and T. Srikanthan, "An efficient reverse converter for the 4-moduli set {$2^n-1$, $2^n$, $2^n+1$, $2^{2n}+1$} based on the new Chinese Remainder Theorem," IEEE Trans. Circuits Syst. I, vol. 50, no. 10, pp. 1296-1303, Oct. 2003.

[4] L. M. Leibowitz, "A simplified binary arithmetic for the Fermat Number Transform," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 24, no. 5, pp. 356-359, Oct. 1976.

[5] C. Efstathiou, H. Vergos, G. Dimitrakopoulos and D. Nikolos, "Efficient diminished-1 modulo $2^n+1$ multipliers," IEEE Trans. Computers, vol. 54, no. 4, pp. 491-496, Apr. 2005.

[6] A. Wrzyszcz and D. Milford, "A new modulo $2^a+1$ multiplier," in Proc. IEEE Int. Conf. on Computer Design (ICCD 1993), pp. 614-617, Oct. 1993.

[7] G. C. Cardarilli et al., "Residue number system reconfigurable datapath," in Proc. IEEE Int. Symp. on Circuits and Syst., vol. 2, pp. 756-759, May 2002.

[8] W. K. Jenkins, B. A. Schnaufer and A. J. Mansen, "Combined system-level redundancy and modular arithmetic for fault tolerant digital signal processing," in Proc. IEEE Int. Symp. on Computer Arithmetic, pp. 28-35, June 1993.

[9] A. P. Preethy, D. Radhakrishnan and A. Omondi, "Fault-tolerance scheme for an RNS MAC: performance and cost analysis," in Proc. IEEE Int. Symp. on Circuits and Syst. (ISCAS 2001), vol. 2, pp. 717-720, May 2001.

[10] V. Paliouras, K. Karagianni and T. Stouraitis, "A low-complexity combinatorial RNS multiplier," IEEE Trans. Circuits and Syst. II, vol. 48, no. 7, pp. 675-683, July 2001.

[11] C. H. Chang, S. Menon, B. Cao and T. Srikanthan, "A configurable dual moduli multi-operand modulo adder," in Proc. IEEE Int. Symp. on Circuits and Syst. (ISCAS 2005), vol. 2, pp. 1630-1633, May 2005.

[12] C. Efstathiou, D. Nikolos and J. Kalamatianos, "Area-time efficient modulo $2^n-1$ adder design," IEEE Trans. Circuits and Syst., vol. 41, pp. 463-467, July 1994.