
[IEEE 2009 IEEE International Symposium on Circuits and Systems - ISCAS 2009 - Taipei, Taiwan (2009.05.24-2009.05.27)] 2009 IEEE International Symposium on Circuits and Systems - Fixed


Fixed and Variable Multi-modulus Squarer Architectures for Triple Moduli Base of RNS

Ramya Muralidharan and Chip-Hong Chang Centre for High Performance Embedded Systems and Centre for Integrated Circuits and Systems,

Nanyang Technological University, Singapore

Abstract: The performance of RNS relies heavily on efficient implementation of residue arithmetic units. In this paper, efficient multi-modulus squarer architectures for the moduli $2^n - 1$, $2^n$ and $2^n + 1$ are presented. Two variants of multi-modulus squarer architectures, i.e., fixed and variable multi-modulus architectures, are proposed. Synthesis results based on a TSMC 0.18 μm CMOS standard cell implementation demonstrate the performance trade-off between the various designs. Compared to the single-modulus architecture for n = 24, the fixed multi-modulus architecture provides area and power savings of 5% and 10%, respectively, with similar delay. On the other hand, for the same n, the variable multi-modulus architecture reduces the area and power dissipation by 50% and 18%, respectively, at the expense of a 15% increase in delay.

I. INTRODUCTION

Residue Number System (RNS) is an unconventional number representation that is widely employed in addition-multiplication intensive applications like digital filters, FFT/DFT and cryptography [1, 2]. The well-established modulo arithmetic forms the foundation of RNS. Hence, to maximize the advantages offered by RNS, design of efficient modulo arithmetic circuits is imperative.

Architecture that performs the same function for more than one modulus is defined as multi-modulus architecture (MMA) [3]. Compared to conventional single-modulus architecture (SMA), MMA exploits hardware reuse and results in substantial area savings. MMA is particularly employed in Variable Word Length (VWL) and fault-tolerant RNS processors. In VWL RNS processors, MMA facilitates selection of appropriate moduli to satisfy the desired dynamic range. The redundant channels employed in fault-tolerant processors are ideal candidates for MMA. In fact, MMA is widely employed in fault-tolerant techniques such as Serial-by-Modulus Dual Modular Redundancy (SBM-DMR) and Compute Until Correct (CUC) [4].

MMA is classified into fixed and variable multi-modulus architectures [3]. Fixed Multi-modulus Architecture (FMA) performs the modulo operation with respect to multiple moduli simultaneously, thereby maintaining parallelism among the moduli. However, in Variable Multi-modulus Architecture (VMA), modulo operations are performed serially, resulting in greater hardware savings.

Owing to the complexity of RNS arithmetic units and residue-to-binary converters, special moduli of the forms $2^n \pm 1$ are preferred over general moduli. The triple moduli set $\{2^n - 1, 2^n, 2^n + 1\}$ is widely used as the base of an RNS by itself or as part of supersets like $\{2^n - 1, 2^n, 2^n + 1, 2^{n-1} - 1\}$, $\{2^n - 1, 2^n, 2^n + 1, 2^{n+1} - 1\}$ and $\{2^n - 1, 2^n, 2^n + 1, 2^{n+1} + 1\}$ [5, 6]. In this paper, we propose simplified multi-modulus squarer architectures for the three moduli $2^n - 1$, $2^n$ and $2^n + 1$.

It should be highlighted that the proposed architecture employs weighted binary representation instead of the prevalent diminished-1 representation for modulo $2^n + 1$ arithmetic. Use of diminished-1 representation necessitates input conversion from weighted binary to diminished-1 representation and vice versa for the output conversion. Using a different representation for only one of the moduli is inefficient in MMA. Therefore, the consistent use of weighted binary representation results in lower area and delay overheads [7].

The rest of the paper is organized as follows. Section II introduces the basics of the triple moduli base of RNS. The proposed variable and fixed multi-modulus squarer architectures are detailed in Sections III and IV, respectively. The comparison of performance metrics based on standard cell implementation is presented in Section V. Section VI concludes the paper.

II. BACKGROUND

RNS is defined by a set of relatively prime moduli $\{m_1, m_2, \ldots, m_k\}$ called the base. An integer $A < \prod_{i=1}^{k} m_i$ is represented as a unique k-tuple $\{a_1, a_2, \ldots, a_k\}$, where $a_i$ is the residue of A modulo $m_i$. In RNS, computations are performed in independent residue channels, where the operands in each channel are the residues of the corresponding modulus from the base as follows.

$$C = A \circ B = \{c_1, c_2, \ldots, c_k\} \tag{1}$$

$$c_i = \left| a_i \circ b_i \right|_{m_i} \tag{2}$$

978-1-4244-3828-0/09/$25.00 ©2009 IEEE

where $\circ$ denotes addition, multiplication or other equivalent arithmetic operations like squaring and multiply-accumulate operation [1].
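As a minimal sketch of (1) and (2), the following Python fragment converts two integers to the triple moduli base and operates on them channel by channel. The helper names `to_rns` and `rns_op` and the choice n = 4 are illustrative assumptions, not part of the paper.

```python
from math import prod

def to_rns(value, base):
    """Return the residue of value for each modulus in the base."""
    return tuple(value % m for m in base)

def rns_op(a_res, b_res, base, op):
    """Apply op independently in every residue channel, as in (2)."""
    return tuple(op(a, b) % m for a, b, m in zip(a_res, b_res, base))

n = 4                                  # illustrative word length
base = (2**n - 1, 2**n, 2**n + 1)      # pairwise relatively prime moduli
dynamic_range = prod(base)             # integers below this are uniquely represented

A, B = 37, 58
a_res, b_res = to_rns(A, base), to_rns(B, base)

sum_res = rns_op(a_res, b_res, base, lambda x, y: x + y)
mul_res = rns_op(a_res, b_res, base, lambda x, y: x * y)
sq_res = rns_op(a_res, a_res, base, lambda x, y: x * y)   # squaring as a special case
```

No carries cross from one channel to another, which is what allows each channel to be built as a separate (or shared) modulo arithmetic unit.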

Modulo squaring is an indispensable operation in many DSP and cryptographic applications. For example, modulo exponentiation in predominant cryptographic algorithms like RSA, is typically implemented using repeated modulo squaring and modulo multiplication [2]. Even though a modulo multiplier can be used to perform modulo squaring, a dedicated squarer unit is preferred due to the area saved and the improved performance in an iterative computation environment. To maximally reuse the resources in an RNS processor, it is of interest to investigate efficient multi-modulus squarer architectures that can be used for all k moduli of the base of an RNS.
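The role of squaring in modulo exponentiation can be sketched with the textbook right-to-left square-and-multiply loop. This is a software illustration under the assumption of a single residue channel with modulus `m`; the function names are hypothetical and the code is not taken from [2].

```python
def mod_square(x, m):
    """Dedicated squaring step: the operation a squarer unit would perform."""
    return (x * x) % m

def mod_pow(base, exponent, m):
    """Right-to-left binary exponentiation using repeated modulo squaring."""
    result = 1 % m
    acc = base % m
    while exponent > 0:
        if exponent & 1:                 # multiply only when the exponent bit is set
            result = (result * acc) % m
        acc = mod_square(acc, m)         # one squaring per exponent bit
        exponent >>= 1
    return result
```

The loop performs one squaring for every bit of the exponent but a multiplication only for the set bits, which is why a squarer that is cheaper than a general multiplier pays off in iterative computations.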

Typically, modulo squaring corresponding to generalized moduli is implemented with look-up tables, which is highly inefficient for large word-lengths. Moduli of the forms $2^n - 1$, $2^n$ and $2^n + 1$ possess unique number theoretic properties, making the triple moduli set $\{2^n - 1, 2^n, 2^n + 1\}$ a preferred base of RNS. The following Lemma 1 illustrates the periodicity of the powers-of-2 series for the modulus $2^n - 1$. Similar properties for the modulus $2^n + 1$ are shown in Lemmas 2 and 3. In the proposed multi-modulus squarer architectures, these number theoretic properties will be leveraged for efficient memoryless implementations.

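Lemma 1: For any non-negative integer s,

$$\left| 2^{s \cdot n + i} \right|_{2^n - 1} = \left| 2^{i} \right|_{2^n - 1} \tag{3}$$

Lemma 2: For any non-negative integer s,

$$\left| 2^{2 s \cdot n + i} \right|_{2^n + 1} = \left| 2^{i} \right|_{2^n + 1} \tag{4}$$

Lemma 3: For any non-negative integer s,

$$\left| 2^{s \cdot n + i} \right|_{2^n + 1} = \begin{cases} \left| 2^{i} \right|_{2^n + 1} & \text{if } s \text{ is even} \\ \left| -2^{i} \right|_{2^n + 1} & \text{if } s \text{ is odd} \end{cases} \tag{5}$$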
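The three lemmas can be checked numerically for small word lengths. The sketch below uses Python's built-in three-argument `pow`; the function name `check_lemmas` and the tested ranges are arbitrary choices made for illustration.

```python
def check_lemmas(n, max_s=6):
    """Return True if (3), (4) and (5) hold for 0 <= s < max_s and 0 <= i < n."""
    m_minus, m_plus = 2**n - 1, 2**n + 1
    for s in range(max_s):
        for i in range(n):
            # Lemma 1: powers of 2 repeat with period n modulo 2^n - 1.
            if pow(2, s * n + i, m_minus) != pow(2, i, m_minus):
                return False
            # Lemma 2: powers of 2 repeat with period 2n modulo 2^n + 1.
            if pow(2, 2 * s * n + i, m_plus) != pow(2, i, m_plus):
                return False
            # Lemma 3: each extra factor 2^n contributes a sign change modulo 2^n + 1.
            sign = 1 if s % 2 == 0 else -1
            if pow(2, s * n + i, m_plus) != (sign * 2**i) % m_plus:
                return False
    return True
```

All three follow from $2^n \equiv 1 \pmod{2^n - 1}$ and $2^n \equiv -1 \pmod{2^n + 1}$.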

In the subsequent sections, multi-modulus squarer designs that cater to all three moduli of the triple moduli base, $\{2^n - 1, 2^n, 2^n + 1\}$, will be presented.

III. VMA FOR MODULO SQUARING

The structure of a modulo squarer has three components: the partial product generator array, the Carry Save Adder (CSA) tree and the final two-operand modulo adder [8].

A. Partial Product Generator Array

The partial product generator array generates the partial product bits according to the simplified partial product matrix for modulo squaring. In order to realize VMA for modulo squaring, the partial product matrices for mod $2^n - 1$ and mod $2^n + 1$ squaring are analyzed independently, where mod is the abbreviation for modulo.

For mod $2^n - 1$ squaring of $A = \sum_{j=0}^{n-1} a_j \cdot 2^j$, the binary weights of the partial product bits range from $2^0$ to $2^{2n-2}$. The partial product matrix simplification involves reducing the binary weight of the partial product bits using (3). The simplified partial product matrix for $\left| A^2 \right|_{2^n - 1}$ with n = 5 is illustrated in Fig. 1 [8]. In Fig. 1, the binary weight of the partial product bits in each column is indicated in the first row.

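$$\begin{array}{ccccc}
2^4 & 2^3 & 2^2 & 2^1 & 2^0 \\ \hline
a_1 \cdot a_2 & a_0 \cdot a_2 & a_0 \cdot a_1 & a_1 \cdot a_4 & a_0 \cdot a_4 \\
a_0 \cdot a_3 & a_3 \cdot a_4 & a_2 \cdot a_4 & a_2 \cdot a_3 & a_1 \cdot a_3 \\
a_2 & a_4 & a_1 & a_3 & a_0
\end{array}$$

Figure 1. Partial product matrix for mod $2^5 - 1$ squaring.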
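The column assignment in Fig. 1 can be reproduced programmatically. In the sketch below, a squared bit $a_i \cdot a_i = a_i$ has weight $2^{2i}$ and a doubled cross term $2 \cdot a_i \cdot a_j$ has weight $2^{i+j+1}$; every exponent is then reduced modulo n using (3). The function names are hypothetical.

```python
def pp_matrix_mod_2n_minus_1(n):
    """Map each column exponent w (weight 2^w) to the bit-index tuples placed there."""
    cols = {w: [] for w in range(n)}
    for i in range(n):
        cols[(2 * i) % n].append((i,))              # a_i * a_i = a_i, weight 2^(2i)
        for j in range(i + 1, n):
            cols[(i + j + 1) % n].append((i, j))    # 2 * a_i * a_j, weight 2^(i+j+1)
    return cols

def square_from_matrix(a, n):
    """Sum the simplified matrix for operand a and reduce modulo 2^n - 1."""
    bits = [(a >> k) & 1 for k in range(n)]
    total = 0
    for w, terms in pp_matrix_mod_2n_minus_1(n).items():
        for term in terms:
            bit = 1
            for k in term:
                bit &= bits[k]                      # AND of the participating bits
            total += bit << w
    return total % (2**n - 1)
```

For n = 5, `pp_matrix_mod_2n_minus_1(5)[4]` contains the index tuples for $a_2$, $a_1 \cdot a_2$ and $a_0 \cdot a_3$, matching the leftmost column of Fig. 1.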

Lemma 4: For $j \in (0, n - 1)$,

$$a_n \cdot a_j = 0 \tag{6}$$

where $A = \sum_{j=0}^{n} a_j \cdot 2^j$.

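From Lemma 3, it follows that

$$\begin{aligned}
\left| a \cdot 2^{n+i} \right|_{2^n + 1} &= \left| -a \cdot 2^{i} \right|_{2^n + 1} = \left| \left( 2^n + 1 - a \right) 2^{i} \right|_{2^n + 1} \\
&= \left| 2^{n+i} + \bar{a} \cdot 2^{i} \right|_{2^n + 1}
\end{aligned} \tag{7}$$

Thus the binary weight of a can be reduced from $2^{n+i}$ to $2^i$, but a has to be complemented and a correction factor of $2^{n+i}$ needs to be included.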

For mod $2^n + 1$ squaring, the partial product matrix of $\left| A^2 \right|_{2^n + 1}$ can be simplified using (4), (6) and (7) [7]. The simplified partial product matrix for n = 5 is shown in Fig. 2.

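$$\begin{array}{ccccc}
2^4 & 2^3 & 2^2 & 2^1 & 2^0 \\ \hline
a_1 \cdot a_2 & a_0 \cdot a_2 & a_0 \cdot a_1 & \overline{a_1 \cdot a_4} & \overline{a_0 \cdot a_4} \\
a_0 \cdot a_3 & \overline{a_3 \cdot a_4} & \overline{a_2 \cdot a_4} & \overline{a_2 \cdot a_3} & \overline{a_1 \cdot a_3} \\
a_2 & \overline{a_4} & a_1 & \overline{a_3} & a_0 + a_5
\end{array}$$

Figure 2. Partial product matrix for mod $2^5 + 1$ squaring.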
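The simplification behind Fig. 2 can be modelled as follows. Every partial product bit of weight $2^{n+i}$ is complemented and moved to weight $2^i$, and the correction factor $2^{n+i}$ from (7) is accumulated in `k_corr`. This is only a sketch of the matrix simplification: the constant K of [7] also accounts for carry reinsertion, which is not modelled here, and the function name is an assumption.

```python
def square_mod_2n_plus_1(a, n):
    """Return (|a^2| mod 2^n + 1, accumulated matrix correction) for 0 <= a <= 2^n."""
    m = 2**n + 1
    if not 0 <= a <= 2**n:
        raise ValueError("operand must be a valid residue modulo 2^n + 1")
    bits = [(a >> k) & 1 for k in range(n + 1)]
    column_sum = bits[n]        # a_n^2 * 2^(2n) reduces to a_n * 2^0 by (4)
    k_corr = 0                  # sum of the correction factors 2^(n+i) from (7)
    for i in range(n):          # cross terms with a_n vanish by (6)
        terms = [(bits[i], 2 * i)]
        terms += [(bits[i] & bits[j], i + j + 1) for j in range(i + 1, n)]
        for bit, weight in terms:
            if weight < n:
                column_sum += bit << weight
            else:
                column_sum += (bit ^ 1) << (weight - n)   # complemented bit at 2^i
                k_corr += 1 << weight                     # correction factor 2^(n+i)
    return (column_sum + k_corr) % m, k_corr % m
```

Because `k_corr` depends only on n and not on the operand, it behaves as a constant that can be added as one more row of the matrix.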

For mod $2^n$ squaring, the partial product matrix can be obtained in a straightforward manner by directly truncating the partial product bits of binary weights $2^n$ to $2^{2n-2}$.

The partial product generator array of the proposed VMA utilizes multiplexers (MUX) to compute modulus-dependent partial product bits. By analyzing the simplified partial product matrices, the number of MUXs required for even and odd n is $\frac{n}{2} \cdot \left( \frac{n}{2} + 1 \right)$ and $(n - 1) \cdot \left( \frac{n - 1}{4} + 1 \right)$, respectively.
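One way to read these counts, offered here as an interpretation rather than the authors' derivation, is that a MUX is needed for every partial product bit whose original weight is $2^n$ or higher, since such a bit is wrapped around for mod $2^n - 1$, truncated for mod $2^n$ and complemented for mod $2^n + 1$. The sketch below counts those bits and compares the result with the closed-form expressions; both function names are hypothetical.

```python
def modulus_dependent_bits(n):
    """Count partial product bits of an n-bit squarer with weight 2^n or higher."""
    squares = sum(1 for i in range(n) if 2 * i >= n)
    cross = sum(1 for i in range(n) for j in range(i + 1, n) if i + j + 1 >= n)
    return squares + cross

def mux_count_formula(n):
    """Closed-form MUX count quoted in the text for even and odd n."""
    if n % 2 == 0:
        return (n // 2) * (n // 2 + 1)
    return (n - 1) * (n + 3) // 4      # equals (n - 1) * ((n - 1) / 4 + 1)
```

For n = 5 both functions give 8, which is the number of complemented entries in Fig. 2.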


B. CSA Tree

Analogous to the partial product generator array, the CSA tree of each individual modulus is considered first. In mod $2^n - 1$ squaring, the binary weight of the carries generated by the CSAs in the most significant bit (msb) positions is reduced from $2^n$ to $2^0$ by (3). In other words, the carries generated in the msb positions are reinserted into the least significant bit (lsb) positions, resulting in a regular structure of CSA tree with End Around Carry (EAC). However, in the mod $2^n + 1$ squarer, by (7), the reinserted carries are complemented, which leads to a CSA tree with Complemented End Around Carry (CEAC) structure. In the proposed VMA, MUXs are employed in the carry feedback paths to implement a CSA tree with either no carry, EAC or CEAC reinsertion for the three moduli $2^n$, $2^n - 1$ and $2^n + 1$, respectively. Therefore, the number of MUXs in the CSA tree is equal to the number of reinserted carries and is given by $(n/2) - 1$ and $(n - 1)/2$ for even and odd n, respectively.
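The three carry reinsertion options can be illustrated with a behavioral model of a single n-bit CSA stage. The `mode` argument stands in for the MUX select signal; its string values and the function name `csa` are assumptions made for this sketch.

```python
def csa(x, y, z, n, mode):
    """One n-bit carry-save stage; mode is 'none' (2^n), 'eac' (2^n - 1) or 'ceac' (2^n + 1)."""
    mask = (1 << n) - 1
    s = x ^ y ^ z                              # bitwise sum
    c = (x & y) | (x & z) | (y & z)            # bitwise carry (majority)
    c_out = (c >> (n - 1)) & 1                 # carry leaving the msb position
    c = (c << 1) & mask                        # shift carries left, lsb is now free
    if mode == "eac":
        c |= c_out                             # reinsert the carry at the lsb
    elif mode == "ceac":
        c |= c_out ^ 1                         # reinsert the complemented carry
    return s, c
```

In the CEAC case the sum and carry words together exceed the true sum by one modulo $2^n + 1$, because $c \cdot 2^n \equiv \bar{c} - 1$. This per-stage offset is the kind of contribution that has to be absorbed into the correction constant.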

In the mod $2^n + 1$ squarer, the partial product matrix simplification and the carry reinsertion introduce a correction factor as shown in (7). The correction factor has been derived independently for even and odd n and proved to be a constant K [7]. In the proposed VMA, K is taken to be one of the partial products to be summed by the CSA tree. It is set to zero for the mod $2^n - 1$ and mod $2^n$ squarers.

C. Two-operand modulo adder

Efficient mod $2^n - 1$ adders using different parallel prefix structures have been proposed in [9, 10]. The fundamental implementation of a mod $2^n - 1$ adder uses an integer adder structure followed by an extra prefix level for carry reinsertion. A similar implementation of a diminished-1 mod $2^n + 1$ adder has been proposed in [9], where the reinserted carry is complemented. When the inputs X and Y to a diminished-1 mod $2^n + 1$ adder are in weighted binary instead of diminished-1 representation, the corresponding output is $\left| X + Y + 1 \right|_{2^n + 1}$. Hence, the output can be corrected by reducing the correction factor K by one. The two-operand modulo adder of the proposed VMA employs a Sklansky parallel prefix structure and a MUX in the carry reinsertion path.
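The effect of carry reinsertion in the final adder can be sketched at the behavioral level, ignoring the parallel prefix structure entirely. Both functions below are illustrative assumptions: the first models an end-around-carry adder, the second an n-bit adder with complemented carry reinsertion fed with weighted binary operands. The second produces only n output bits, so the residue $2^n$ appears as zero; how that case is handled is not covered by this sketch.

```python
def add_mod_2n_minus_1(x, y, n):
    """End-around-carry addition of two n-bit operands modulo 2^n - 1."""
    mask = (1 << n) - 1
    t = x + y
    t = (t & mask) + (t >> n)          # reinsert the carry-out at the lsb
    return 0 if t == mask else t       # the all-ones word is a second encoding of zero

def ceac_add(x, y, n):
    """n-bit addition with the complemented carry-out reinserted at the lsb."""
    mask = (1 << n) - 1
    t = x + y
    c_out = t >> n
    return (t + (c_out ^ 1)) & mask
```

With weighted binary inputs, `ceac_add` returns the low n bits of $|X + Y + 1|_{2^n + 1}$, which is the extra one that the text compensates for by reducing K.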

The VMA for modulo squaring for n = 5 is illustrated in Fig. 3. The input to the VMA is A and the outputs are $\left| A^2 \right|_{2^n - 1}$, $\left| A^2 \right|_{2^n}$ and $\left| A^2 \right|_{2^n + 1}$.

Figure 3. VMA for modulo squaring.

IV. FMA FOR MODULO SQUARING

In FMA, each component consists of a common computation unit that handles operations that are moduli independent. The common computation unit operates in conjunction with the moduli-specific computation units for moduli-dependent computations. For modulo squaring, the FMA technique is applicable only to the partial product generator stage and not to the subsequent stages, i.e., the CSA tree and the two-operand modulo adder [3]. This is because, unlike the partial product generator array, the inputs to the subsequent stages are not identical for the three moduli.

The proposed fixed multi-modulus squarer architecture consists of a common partial product generator unit that generates partial product bits common to all three moduli. The number of common partial product bits is $n^2/4$ and $(n^2 + 3)/4$ for even and odd n, respectively. The CSA tree and the two-operand modulo adder of FMA are the same as those of the corresponding SMA. Fig. 4 depicts the proposed FMA using the standard dot-notation [11]. The hollow dots represent the partial product bits generated by the common partial product generator unit. To differentiate from the sum-carry output pair of a CSA, two dots connected by a dashed line are employed to signify the sum-CEAC output pair of a CSA for mod $2^n + 1$ arithmetic. Similar to the VMA, the input to the FMA is A and the outputs are $\left| A^2 \right|_{2^n - 1}$, $\left| A^2 \right|_{2^n}$ and $\left| A^2 \right|_{2^n + 1}$.

Figure 4. FMA for modulo squaring.
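The sharing idea behind the FMA can be sketched behaviorally: partial product bits of weight below $2^n$ are generated once, while the remaining bits are folded differently per modulus. The model below assumes an n-bit operand and, for brevity, subtracts the folded bits for mod $2^n + 1$ instead of complementing them and adding a correction constant; the two are arithmetically equivalent by (7). All function names are hypothetical.

```python
def partial_products(a, n):
    """Yield (bit, weight exponent) for every partial product of an n-bit operand squared."""
    bits = [(a >> k) & 1 for k in range(n)]
    for i in range(n):
        yield bits[i], 2 * i                       # a_i * a_i = a_i
        for j in range(i + 1, n):
            yield bits[i] & bits[j], i + j + 1     # 2 * a_i * a_j

def fma_square(a, n):
    """Return (|a^2| mod 2^n - 1, |a^2| mod 2^n, |a^2| mod 2^n + 1)."""
    common = 0    # bits of weight below 2^n, shared by all three channels
    folded = 0    # bits of weight 2^(n+i), moved down to weight 2^i
    for bit, weight in partial_products(a, n):
        if weight < n:
            common += bit << weight
        else:
            folded += bit << (weight - n)
    return ((common + folded) % (2**n - 1),        # 2^n is congruent to +1
            common % 2**n,                         # high-weight bits are truncated
            (common - folded) % (2**n + 1))        # 2^n is congruent to -1

def common_bit_count(n):
    """Number of partial product bits with weight below 2^n."""
    return sum(1 for _, w in partial_products(0, n) if w < n)
```

Counting the shared bits this way reproduces $n^2/4$ for even n and $(n^2 + 3)/4$ for odd n.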

V. PERFORMANCE ANALYSIS OF PROPOSED MULTI-MODULUS SQUARER ARCHITECTURES

The proposed VMA and FMA for modulo squaring have been coded in VHDL and synthesized by Synopsys Design Compiler (V-2004.06-SP2) using the TSMC 0.18 μm 1.8 V CMOS standard cell library. Furthermore, conventional SMAs for parallel execution of mod $2^n - 1$, $2^n$ and $2^n + 1$ squaring were also implemented. The designs were optimized for minimum achievable area and delay independently under the same nominal synthesis environment, i.e., 25 °C and 1.8 V. The area-optimized and delay-optimized synthesis results are shown in Tables I and II, respectively.

In the above comparisons, the area of SMA is the combined area of the mod $2^n - 1$, $2^n$ and $2^n + 1$ squarers while the delay is the maximum of the three channel delays. Compared to SMA, FMA reduces the area and delay marginally by the hardware sharing in the partial product generator array. On the other hand, VMA reduces the area complexity by at least 30% when compared to both SMA and FMA. For n = 24, by comparing the performances of VMA with SMA, an area reduction of 50% with 15% increase in the critical path delay was observed.

TABLE I. AREA-OPTIMIZED SYNTHESIS RESULTS

| n | SMA Area (μm²) | SMA Delay (ns) | FMA Area (μm²) | FMA Delay (ns) | VMA Area (μm²) | VMA Delay (ns) |
|---|---|---|---|---|---|---|
| 4 | 1865 | 1.48 | 1812 | 1.48 | 1274 | 2.24 |
| 8 | 7613 | 2.36 | 7294 | 2.36 | 4367 | 3.17 |
| 12 | 17269 | 3.30 | 16472 | 3.30 | 9180 | 4.11 |
| 16 | 31014 | 3.48 | 29525 | 3.49 | 15870 | 4.35 |
| 20 | 48146 | 4.17 | 45751 | 4.26 | 24162 | 4.99 |
| 24 | 69334 | 4.52 | 65822 | 4.52 | 34295 | 5.46 |

TABLE II. DELAY-OPTIMIZED SYNTHESIS RESULTS

| n | SMA Area (μm²) | SMA Delay (ns) | FMA Area (μm²) | FMA Delay (ns) | VMA Area (μm²) | VMA Delay (ns) |
|---|---|---|---|---|---|---|
| 4 | 4170 | 0.95 | 2877 | 0.92 | 2907 | 1.28 |
| 8 | 16089 | 1.40 | 12267 | 1.38 | 10248 | 1.72 |
| 12 | 34048 | 1.73 | 25852 | 1.75 | 18108 | 2.16 |
| 16 | 60104 | 1.82 | 46998 | 1.80 | 30226 | 2.14 |
| 20 | 90716 | 2.16 | 72755 | 2.16 | 48103 | 2.48 |
| 24 | 128896 | 2.20 | 103920 | 2.20 | 67409 | 2.54 |
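The n = 24 percentages quoted in the text can be recomputed from the table entries. The short calculation below uses only values from Tables I and II; the variable and function names are illustrative.

```python
# (area in um^2, delay in ns) for n = 24, copied from Tables I and II.
area_opt = {"SMA": (69334, 4.52), "FMA": (65822, 4.52), "VMA": (34295, 5.46)}
delay_opt = {"SMA": (128896, 2.20), "FMA": (103920, 2.20), "VMA": (67409, 2.54)}

def saving(reference, value):
    """Percentage reduction of value relative to reference."""
    return 100.0 * (reference - value) / reference

def increase(reference, value):
    """Percentage increase of value relative to reference."""
    return 100.0 * (value - reference) / reference

vma_area_saving = saving(area_opt["SMA"][0], area_opt["VMA"][0])
fma_area_saving = saving(area_opt["SMA"][0], area_opt["FMA"][0])
vma_delay_increase_area_opt = increase(area_opt["SMA"][1], area_opt["VMA"][1])
vma_delay_increase_delay_opt = increase(delay_opt["SMA"][1], delay_opt["VMA"][1])
```

With these inputs the VMA area saving is about 50.5% and the FMA area saving about 5.1%, both from the area-optimized table. The VMA delay increase is about 15.5% in the delay-optimized table and about 20.8% in the area-optimized one, so the quoted 15% appears to correspond to the delay-optimized results.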

Furthermore, the average power consumption was simulated using the Monte Carlo statistical model [12, 13]. In this method, randomly generated input patterns are applied to the circuit and the power dissipation is measured by Synopsys Power Compiler. The input patterns are fed continuously until the computed average dynamic power has converged to within a tolerable error determined by a given confidence level. Table III tabulates the average dynamic power dissipation with 99.9% confidence that the error is below 3%. FMA and VMA exhibit lower dynamic power dissipation than their corresponding SMA. For example, by comparing the designs of FMA and SMA for n = 24, the saving in dynamic power dissipation is slightly less than 10%. For the same n, the dynamic power dissipation of VMA is lower than that of SMA by 18%.
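The stopping rule of such a Monte Carlo estimate can be sketched as follows. This is not the flow of [12, 13]: the power measurement is replaced by a hypothetical proxy (the number of output bits that toggle between consecutive random inputs of a mod $2^n - 1$ squarer), and the normal-approximation confidence interval is an assumption of this sketch.

```python
import random
from math import sqrt
from statistics import NormalDist

def monte_carlo_mean(sample, confidence=0.999, max_rel_error=0.03,
                     min_samples=30, max_samples=1_000_000):
    """Draw samples until the confidence interval half-width is below max_rel_error * mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)    # two-sided critical value
    count, mean, m2 = 0, 0.0, 0.0
    while count < max_samples:
        x = sample()
        count += 1
        delta = x - mean                               # Welford's running mean and variance
        mean += delta / count
        m2 += delta * (x - mean)
        if count >= min_samples and mean > 0:
            half_width = z * sqrt(m2 / (count - 1)) / sqrt(count)
            if half_width / mean < max_rel_error:
                break
    return mean, count

# Hypothetical stand-in for a power measurement: toggling output bits of a squarer.
rng = random.Random(0)
n_bits = 8
modulus = 2**n_bits - 1
state = {"prev": 0}

def toggle_sample():
    a = rng.randrange(2**n_bits)
    cur = (a * a) % modulus
    toggles = bin(state["prev"] ^ cur).count("1")
    state["prev"] = cur
    return toggles

estimate, samples_used = monte_carlo_mean(toggle_sample)
```

A tighter error bound or a higher confidence level increases the number of input patterns needed before the loop stops.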

VI. CONCLUSION

Special moduli of the forms $2^n$ and $2^n \pm 1$ are used extensively in RNS applications. Multi-modulus architectures for mod $2^n - 1$, $2^n$ and $2^n + 1$ squaring that operate in parallel and in serial, i.e., FMA and VMA, respectively, have been proposed in this paper. The performances of VMA and FMA were evaluated against that of the conventional single-modulus squarer architectures. Synthesis results reveal that FMA leads to marginal savings in area and power consumption. In contrast, VMA reduces the area and the dynamic power substantially with some compromise on delay.

TABLE III. COMPARISON OF DYNAMIC POWER DISSIPATION

| n | SMA Power (μW) | FMA Power (μW) | VMA Power (μW) |
|---|---|---|---|
| 4 | 32.009 | 26.006 | 24.137 |
| 8 | 98.727 | 82.003 | 72.411 |
| 12 | 200.229 | 176.077 | 155.312 |
| 16 | 336.705 | 300.651 | 262.845 |
| 20 | 504.654 | 459.237 | 425.645 |
| 24 | 702.593 | 633.114 | 578.183 |

REFERENCES

[1] M. A. Soderstrand, W. K. Jenkins, G. A. Jullien and F. J. Taylor, Residue Number System Arithmetic: Modern Applications in Digital Signal Processing, IEEE Press, New York, 1986.

[2] J.-C. Bajard and L. Imbert, “A full RNS implementation of RSA,” IEEE Trans. on Computers, vol. 53, no. 6, pp. 769-774, June 2004.

[3] V. Paliouras and T. Stouraitis, “Multifunction architectures for RNS processors,” IEEE Trans. on Circuits and Systems-II, vol. 46, no. 8, pp. 1041-1054, Aug. 1999.

[4] W. K. Jenkins, B. A. Schnaufer and A. J. Mansen, “Combined system level redundancy and modular arithmetic for fault tolerant digital signal processing,” Proc. IEEE Symp. on Computer Arithmetic, Windsor, Canada, pp. 28-35, July 1993.

[5] B. Cao, T. Srikanthan and C. H. Chang, “Efficient reverse converters for four-moduli sets $\{2^n - 1, 2^n, 2^n + 1, 2^{n+1} - 1\}$ and $\{2^n - 1, 2^n, 2^n + 1, 2^{n-1} - 1\}$,” IEE Proc. Comput. Digit. Tech., vol. 152, no. 5, pp. 687-696, Sep. 2005.

[6] M. Bhardwaj, T. Srikanthan and C. T. Clarke, “A reverse converter for the 4-moduli superset $\{2^n - 1, 2^n, 2^n + 1, 2^{n+1} + 1\}$,” Proc. IEEE Symp. on Computer Arithmetic, Adelaide, Australia, pp. 168-175, Apr. 1999.

[7] R. Muralidharan, C. H. Chang and C. C. Jong, “A low complexity modulo $(2^n + 1)$ squarer design,” IEEE Asia-Pacific Conf. on Circuits and Systems, Macau, China, pp. 1296-1299, Nov. 2008.

[8] S. Piestrak, “Design of squarers modulo A with low-level pipelining,” IEEE Trans. on Circuits and Systems-II, vol. 49, no. 1, pp. 31-41, Jan. 2002.

[9] R. Zimmermann, “Efficient VLSI implementation of modulo $(2^n \pm 1)$ addition and multiplication,” Proc. IEEE Symp. on Computer Arithmetic, Adelaide, Australia, pp. 158-167, Apr. 1999.

[10] L. Kalampoukas, D. Nikolos, C. Efstathiou, H. T. Vergos and J. Kalamatianos, “High-speed parallel-prefix modulo $(2^n - 1)$ adders,” IEEE Trans. on Computers, vol. 49, no. 7, pp. 673-680, July 2000.

[11] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, Oxford University Press, New York, 2002.

[12] R. Burch, F. N. Najm, P. Yang and T. N. Trick, “A Monte Carlo approach for power estimation,” IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 1, no. 1, pp. 63-71, Mar. 1993.

[13] R. K. Satzoda, C. H. Chang and T. Srikanthan, “Monte Carlo statistical analysis for dynamic power simulation of RTL design using Synopsys Power Compiler,” Synopsys Users’ Group Conference, Singapore, June 2006. Online: http://snug-universal.org/papers/papers.htm.
