
[IEEE 2010 IEEE International Symposium on Circuits and Systems - ISCAS 2010 - Paris, France (2010.05.30-2010.06.2)] Proceedings of 2010 IEEE International Symposium on Circuits and


Fast hard multiple generators for radix-8 Booth encoded modulo $2^n-1$ and modulo $2^n+1$ multipliers

Ramya Muralidharan and Chip-Hong Chang Centre for High Performance Embedded Systems, Nanyang Technological University, Singapore

Abstract— Hard multiple generation is the bottleneck operation in radix-8 Booth encoded modulo $2^n-1$ and modulo $2^n+1$ multipliers. In this paper, fast hard multiple generators for the moduli $2^n-1$ and $2^n+1$ are proposed. They are implemented as parallel-prefix structures based on the simplified carry equations. Synthesis results based on TSMC 0.18μm, 1.8V CMOS standard-cell library show that the proposed modulo $2^n-1$ hard multiple generator reduces the critical path delay of the fastest general-purpose modulo $2^n-1$ adder by 12% and 10% for n = 8 and n = 64, respectively. Compared to the smallest modulo $2^n-1$ adder, the proposed design leads to 19% and 12% savings in silicon area for n = 8 and n = 64, respectively. The proposed modulo $2^n+1$ hard multiple generator also has the least critical path delay among the existing modulo $2^n+1$ adders.

I. INTRODUCTION

Residue Number System (RNS) is an unconventional number representation that is based on modulo arithmetic. It is widely employed in applications like digital filters, convolvers and cryptography [1]. The performance of an RNS-based datapath relies heavily on the efficient implementation of the RNS-to-binary converters and the modulo arithmetic units, in particular adders and multipliers.

High-speed modulo multipliers using Booth encoding for partial product generation have been proposed in [2], [3]. The Booth encoding technique reduces the number of partial products to be generated and accumulated, thereby minimizing the associated hardware. The radix-4 Booth encoding is most prevalent as all modulo-reduced partial products can be generated by mere shifting and negation. Greater savings in area and dynamic power dissipation are feasible for large word-length multipliers by increasing the radix beyond four [4], [5]. In the radix-8 Booth encoding, the number of partial products is reduced by two-thirds. However, this reduction in the number of partial products comes at the expense of increased complexity in their generation. Specifically, the generation of the modulo-reduced hard multiple, $|+3X|_m$, involves modulo addition of $X$ and $2X$, which results in a long carry-propagation chain. The delay in generating $|+3X|_m$ increases the critical path delay of the multiplier considerably, thus undermining the advantages of higher radix Booth encoding.

In this paper, efficient modulo $2^n-1$ and modulo $2^n+1$ hard multiple generators for computation of $|3 \cdot X|_{2^n-1}$ and $|3 \cdot X|_{2^n+1}$, respectively, are proposed. The carry equations are reformulated by considering modulo $2^n-1$ and modulo $2^n+1$ addition of $X$ and $2X$ independently. These carry equations are simplified using the number theoretic properties of modulo arithmetic. The proposed parallel-prefix implementations of modulo $2^n-1$ and modulo $2^n+1$ hard multiple generators employ only $\lceil \log_2 n \rceil - 1$ prefix levels, where $\lceil a \rceil$ represents the smallest integer greater than or equal to $a$. They are by far the fastest application-specific adders for this purpose.

II. PRELIMINARY ON PARALLEL-PREFIX MODULO ADDERS

The carry computation in binary addition of $A = \sum_{i=0}^{n-1} a_i \cdot 2^i$ and $B = \sum_{i=0}^{n-1} b_i \cdot 2^i$ is a classic prefix problem as the carry-out, $c_i$, from bit position $i$ is a function of $\{a_0 \ldots a_i\}$ and $\{b_0 \ldots b_i\}$ as shown in (1).

$$c_i = a_i \cdot b_i + (a_i + b_i) \cdot c_{i-1} \qquad (1)$$

The parallel-prefix implementation of (1) involves pre-processing and prefix computation stages. A post-processing stage is needed for the generation of the sum bits, $s_i$, from $c_i$. The computations involved in each of the three stages are:

Pre-processing:

$$g_i = a_i \cdot b_i, \quad p_i = a_i + b_i, \quad h_i = a_i \oplus b_i \qquad (2)$$

where $g_i$, $p_i$ and $h_i$ are the generate, propagate and half-sum signals, respectively, at bit position $i$.

Prefix computation:

For $i = 0$:

$$(G_{0:0}, P_{0:0}) = (g_0, p_0) \qquad (3)$$

For $i = 1$ to $n-1$:

$$(G_{i:0}, P_{i:0}) = (g_i, p_i) \bullet (G_{i-1:0}, P_{i-1:0}) = (g_i, p_i) \bullet (g_{i-1}, p_{i-1}) \bullet \cdots \bullet (g_0, p_0) \qquad (4)$$

where $(g, p) \bullet (g', p') = (g + p \cdot g', \; p \cdot p')$.

Post-processing:

$$c_i = G_{i:0}, \quad s_i = h_i \oplus c_{i-1} \qquad (5)$$

Fig. 1 illustrates an 8-bit parallel-prefix binary adder using the Kogge-Stone structure. In Fig. 1, the nodes □ and ◊ represent the pre-processing and post-processing units, respectively. The solid node (●) performs the prefix computation and the hollow node (○) acts as a buffer.
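The three stages in (2)-(5) can be sketched as a bit-level software model. The Python below is only an illustration under stated assumptions: the function names `prefix_op` and `kogge_stone_add` are hypothetical, and the loop mimics the Kogge-Stone wiring (doubling distance at each level) rather than describing any particular hardware netlist.

```python
def prefix_op(left, right):
    """(g, p) . (g', p') = (g + p.g', p.p'), the operator defined with (4)."""
    g, p = left
    g2, p2 = right
    return (g | (p & g2), p & p2)


def kogge_stone_add(a, b, n=8):
    """Return (sum mod 2^n, carry-out) of two n-bit operands."""
    # Pre-processing (2): generate/propagate pairs and half-sum bits.
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) | (b >> i)) & 1) for i in range(n)]
    h = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]

    # Prefix computation (3)-(4): after the level with distance d,
    # gp[i] covers bit positions i down to max(0, i - 2*d + 1).
    d = 1
    while d < n:
        gp = [prefix_op(gp[i], gp[i - d]) if i >= d else gp[i] for i in range(n)]
        d *= 2

    # Post-processing (5): c_i = G_{i:0}, s_i = h_i xor c_{i-1} (no carry into bit 0).
    c = [g for g, _ in gp]
    s = sum((h[i] ^ (c[i - 1] if i else 0)) << i for i in range(n))
    return s, c[n - 1]
```

For example, `kogge_stone_add(200, 100)` returns the low 8 bits of 300 together with the carry-out bit.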

978-1-4244-5309-2/10/$26.00 ©2010 IEEE 717


Figure 1. Parallel-prefix binary adder

A modulo $2^n-1$ addition with dual representation of zero, i.e., $\underbrace{0 \ldots 0}_{n}$ and $\underbrace{1 \ldots 1}_{n}$, is equivalent to an $n$-bit end-around-carry addition given by (6).

$$|A + B|_{2^n-1} = |A + B + c_{out}|_{2^n} \qquad (6)$$

By replacing $c_{out}$ with $G_{n-1:0}$, the carry equation for modulo $2^n-1$ addition was modified to (7) in [6].

$$c_i = (g_i, p_i) \bullet \cdots \bullet (g_0, p_0) \bullet (g_{n-1}, p_{n-1}) \bullet \cdots \bullet (g_{i+1}, p_{i+1}) \qquad (7)$$

Parallel-prefix implementations of modulo $2^n-1$ adders using (6) and (7) have been proposed in [2], [6] and [7].
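As a small arithmetic check of (6), the following Python sketch adds the carry-out of the plain $n$-bit sum back into the least significant position. The function name `add_mod_2n_minus_1` is an assumption made for illustration. Note that the all-ones word can appear as a result and stands for zero, matching the dual representation of zero mentioned above.

```python
def add_mod_2n_minus_1(a, b, n=8):
    """End-around-carry addition following (6): |A + B + c_out| mod 2^n."""
    mask = (1 << n) - 1
    total = a + b
    c_out = total >> n            # carry-out of the plain n-bit addition
    return (total + c_out) & mask
```

With n = 8, `add_mod_2n_minus_1(200, 100)` gives 45, which equals 300 mod 255.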

In diminished-1 representation, modulo $2^n+1$ addition is realized by an $n$-bit complemented-end-around-carry addition.

$$|A + B - 1|_{2^n+1} = |A' + B' + 1|_{2^n+1} = |A' + B' + \overline{c_{out}}|_{2^n} \qquad (8)$$

where $A'$ and $B'$ are the diminished-1 representations of the weighted-binary summands $A$ and $B$, respectively.

Analogous to modulo $2^n-1$ arithmetic, in [8], the carry equation of modulo $2^n+1$ addition was modified to

$$c_i = (g_i, p_i) \bullet \cdots \bullet (g_0, p_0) \bullet \overline{(g_{n-1}, p_{n-1}) \bullet \cdots \bullet (g_{i+1}, p_{i+1})} \qquad (9)$$

Parallel-prefix modulo $2^n+1$ adders have been proposed in [2], [8] and [9].
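The complemented-end-around-carry addition of (8) can be sketched in the same way. The Python below is an illustrative model, not the adders of [2], [8] or [9]; the name `add_dim1_mod_2n_plus_1` is hypothetical. It assumes both operands are non-zero residues given in diminished-1 form and that a zero result (when $A + B$ is a multiple of $2^n+1$) is handled separately, since a zero value has no $n$-bit diminished-1 code.

```python
def add_dim1_mod_2n_plus_1(a_dim, b_dim, n=8):
    """Complemented end-around-carry addition following (8).

    a_dim and b_dim are A - 1 and B - 1; the return value is S - 1 with
    S = |A + B| mod 2^n + 1, assuming S is not zero.
    """
    mask = (1 << n) - 1
    total = a_dim + b_dim
    c_out = total >> n                     # carry-out of the plain n-bit addition
    return (total + (1 - c_out)) & mask    # add the complemented carry-out
```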

III. PROPOSED MODULO $2^n-1$ AND MODULO $2^n+1$ HARD MULTIPLE GENERATORS

In this section, high-speed application-specific adder designs that compute the hard multiple of the radix-8 Booth encoded modulo $2^n-1$ and modulo $2^n+1$ multipliers are proposed. n = 8 is used for the purpose of illustration but the derivation is applicable to any n.

A. Modulus $2^n-1$

Property 1 shows that modulo $2^n-1$ multiplication of $X = \sum_{i=0}^{n-1} x_i \cdot 2^i$ by a power-of-two term, $2^j$, can be simplified to a circular-left-shift (CLS) operation of $X$ by $j$ bits.

Property 1: For $j < n$,

$$|2^j \cdot X|_{2^n-1} = CLS(X, j) \qquad (10)$$
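Property 1 is easy to confirm numerically. In the short Python sketch below, `cls` is an illustrative helper name. The comparison is made modulo $2^n-1$ because the all-ones word is the second code for zero.

```python
def cls(x, j, n):
    """Circular left shift of the n-bit word x by j positions (0 <= j < n)."""
    mask = (1 << n) - 1
    return ((x << j) | (x >> (n - j))) & mask


# Exhaustive check of (10) for n = 8.
n = 8
for x in range(1 << n):
    for j in range(n):
        assert cls(x, j, n) % ((1 << n) - 1) == (x << j) % ((1 << n) - 1)
```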

The hard multiple $|3 \cdot X|_{2^n-1}$ is computed by the modulo $2^n-1$ addition of $X$ and $|2 \cdot X|_{2^n-1}$. Let the representation of the multiplicand $X$ for n = 8 be $(x_7x_6x_5x_4x_3x_2x_1x_0)$. By Property 1, $|2 \cdot X|_{2^8-1}$ is given by $(x_6x_5x_4x_3x_2x_1x_0x_7)$. The generate and propagate signals for modulo $2^8-1$ addition of $(x_7x_6x_5x_4x_3x_2x_1x_0)$ and $(x_6x_5x_4x_3x_2x_1x_0x_7)$ are given by:

$$(g_0, p_0) = (x_0 \cdot x_7, \; x_0 + x_7), \; \ldots, \; (g_7, p_7) = (x_7 \cdot x_6, \; x_7 + x_6) \qquad (11)$$

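From (11), it can be shown that for $1 \le i \le 7$, $p_i \cdot g_{i-1} = g_{i-1}$. Furthermore,

$$p_0 \cdot g_7 = (x_0 + x_7) \cdot (x_7 \cdot x_6) = x_7 \cdot x_6 = g_7 \qquad (12)$$

$c_0$ can be calculated using (7) as

$$\begin{aligned}
c_0 &= (g_0, p_0) \bullet (g_7, p_7) \bullet \cdots \bullet (g_1, p_1) \\
&= g_0 + p_0 \cdot g_7 + p_0 \cdot p_7 \cdot g_6 + p_0 \cdot p_7 \cdot p_6 \cdot g_5 \\
&\quad + p_0 \cdot p_7 \cdot p_6 \cdot p_5 \cdot g_4 + p_0 \cdot p_7 \cdot p_6 \cdot p_5 \cdot p_4 \cdot g_3 \\
&\quad + p_0 \cdot p_7 \cdot p_6 \cdot p_5 \cdot p_4 \cdot p_3 \cdot g_2 \\
&\quad + p_0 \cdot p_7 \cdot p_6 \cdot p_5 \cdot p_4 \cdot p_3 \cdot p_2 \cdot g_1
\end{aligned} \qquad (13)$$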

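By replacing $p_0 \cdot g_7$ and $p_i \cdot g_{i-1}$ with $g_7$ and $g_{i-1}$, respectively, (13) can be simplified to

$$\begin{aligned}
c_0 &= (g_0 + g_7) + p_0 \cdot p_7 \cdot (g_6 + g_5) \\
&\quad + p_0 \cdot p_7 \cdot p_6 \cdot p_5 \cdot (g_4 + g_3) \\
&\quad + p_0 \cdot p_7 \cdot p_6 \cdot p_5 \cdot p_4 \cdot p_3 \cdot (g_2 + g_1)
\end{aligned} \qquad (14)$$

$$c_0 = g^*_0 + p^*_0 \cdot g^*_6 + p^*_0 \cdot p^*_6 \cdot g^*_4 + p^*_0 \cdot p^*_6 \cdot p^*_4 \cdot g^*_2 \qquad (15)$$

where

$$g^*_i = \begin{cases}
g_0 + g_7 = x_7 \cdot (x_0 + x_6) & \text{if } i = 0 \\
g_1 + g_0 = x_0 \cdot (x_1 + x_7) & \text{if } i = 1 \\
g_i + g_{i-1} = x_{i-1} \cdot (x_i + x_{i-2}) & \text{if } 2 \le i \le 7
\end{cases}$$

$$p^*_i = \begin{cases}
p_0 \cdot p_7 = x_7 + x_0 \cdot x_6 & \text{if } i = 0 \\
p_1 \cdot p_0 = x_0 + x_1 \cdot x_7 & \text{if } i = 1 \\
p_i \cdot p_{i-1} = x_{i-1} + x_i \cdot x_{i-2} & \text{if } 2 \le i \le 7
\end{cases} \qquad (16)$$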

Hence, $c_0$ can be computed by the following prefix operation:

$$c_0 = (g^*_0, p^*_0) \bullet (g^*_6, p^*_6) \bullet (g^*_4, p^*_4) \bullet (g^*_2, p^*_2) \qquad (17)$$
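The simplification leading to (17) can be checked with a short bit-level model. The Python sketch below is illustrative only: the function name `hard_multiple_mod_2n_minus_1` is hypothetical, the prefix chain is evaluated serially instead of as a tree, and it assumes that the chain of (17) is applied at every bit position with indices taken cyclically. It also assumes an even $n$, so that the bit positions pair up as they do for n = 8.

```python
def hard_multiple_mod_2n_minus_1(x, n=8):
    """Return |3X| mod 2^n - 1 (the all-ones word may stand for zero); n even."""
    def bit(i):
        return (x >> (i % n)) & 1          # cyclic bit index

    # Pre-processing, following (16): OR-AND and AND-OR forms, plus half-sum bits.
    g = [bit(i - 1) & (bit(i) | bit(i - 2)) for i in range(n)]
    p = [bit(i - 1) | (bit(i) & bit(i - 2)) for i in range(n)]
    h = [bit(i) ^ bit(i - 1) for i in range(n)]

    # Carry chain: (g*_i, p*_i) . (g*_{i-2}, p*_{i-2}) . ... with n/2 terms.
    c = []
    for i in range(n):
        G, P = g[i], p[i]
        for k in range(2, n, 2):
            j = (i - k) % n
            G, P = G | (P & g[j]), P & p[j]
        c.append(G)

    # Post-processing: s_i = h_i xor c_{i-1}; c[-1] is c_{n-1} (end-around carry).
    return sum((h[i] ^ c[i - 1]) << i for i in range(n))
```

For instance, `hard_multiple_mod_2n_minus_1(0xC0)` returns 66, which is 576 mod 255.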

In general, the carry, $c_i$, of $|X + 2 \cdot X|_{2^8-1}$ is given by:

$$c_i = (g^*_i, p^*_i) \bullet (g^*_{|i-2|_8}, p^*_{|i-2|_8}) \bullet (g^*_{|i-4|_8}, p^*_{|i-4|_8}) \bullet (g^*_{|i-6|_8}, p^*_{|i-6|_8}) \qquad (18)$$

Fig. 2 illustrates the proposed parallel-prefix modulo $2^n-1$ hard multiple generator. In the pre-processing stage, the modified generate $g^*_i$ and propagate $p^*_i$ signals are computed using OR-AND and AND-OR logic circuitries, respectively, along with the half-sum bits $h_i$. This operation is denoted by ■ to indicate that it is different from the □ operator of Fig. 1. The carry $c_i$ is computed from the $(g^*_i, p^*_i)$ pair using the prefix operation in $\lceil \log_2 n \rceil - 1$ levels. Finally, the post-processing stage computes the sum bits $s_i$ by the XOR of $h_i$ and $c_{i-1}$.

B. Modulus $2^n+1$

Let $X' = X - 1$ be the diminished-1 representation of $X = \sum_{i=0}^{n} x_i \cdot 2^i$. Property 2 shows that the diminished-1 representation of the product of $X$ and a power-of-two term, $2^j$, can be determined by a complementary-circular-left-shift (CCLS) of $X'$ by $j$ bits.


Figure 2. Proposed modulo $2^n-1$ hard multiple generator

Property 2: For $j < n$, if $P = |2^j \cdot X|_{2^n+1}$,

$$P' = CCLS(X', j) \qquad (19)$$

Analogous to the $2^n-1$ modulus, let the diminished-1 representation of the multiplicand, $X'$, be $(x'_7x'_6x'_5x'_4x'_3x'_2x'_1x'_0)$. By Property 2, the diminished-1 representation of $|2 \cdot X|_{2^8+1}$ is $(x'_6x'_5x'_4x'_3x'_2x'_1x'_0\overline{x'_7})$. The generate and propagate signals for the modulo $2^8+1$ addition of $(x'_7x'_6x'_5x'_4x'_3x'_2x'_1x'_0)$ and $(x'_6x'_5x'_4x'_3x'_2x'_1x'_0\overline{x'_7})$ are given by:

$$(g_0, p_0) = (x'_0 \cdot \overline{x'_7}, \; x'_0 + \overline{x'_7}), \; \ldots, \; (g_7, p_7) = (x'_7 \cdot x'_6, \; x'_7 + x'_6) \qquad (20)$$

From (9), $c_0$ can be written as

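$$\begin{aligned}
c_0 &= (g_0, p_0) \bullet \overline{(g_7, p_7) \bullet (g_6, p_6) \bullet \cdots \bullet (g_1, p_1)} \\
&= g_0 + p_0 \cdot \overline{p_7} + p_0 \cdot \overline{g_7} \cdot \overline{p_6} + p_0 \cdot \overline{g_7} \cdot \overline{g_6} \cdot \overline{p_5} \\
&\quad + p_0 \cdot \overline{g_7} \cdot \overline{g_6} \cdot \overline{g_5} \cdot \overline{p_4} + p_0 \cdot \overline{g_7} \cdot \overline{g_6} \cdot \overline{g_5} \cdot \overline{g_4} \cdot \overline{p_3} \\
&\quad + p_0 \cdot \overline{g_7} \cdot \overline{g_6} \cdot \overline{g_5} \cdot \overline{g_4} \cdot \overline{g_3} \cdot \overline{p_2} \\
&\quad + p_0 \cdot \overline{g_7} \cdot \overline{g_6} \cdot \overline{g_5} \cdot \overline{g_4} \cdot \overline{g_3} \cdot \overline{g_2} \cdot \overline{p_1}
\end{aligned} \qquad (21)$$

Since $p_i \cdot g_{i-1} = g_{i-1}$ and $\overline{g_i} \cdot \overline{p_{i-1}} = \overline{p_{i-1}}$ for $1 \le i \le 7$, and $p_0 \cdot \overline{p_7} = \overline{p_7}$, (21) can be simplified to:

$$\begin{aligned}
c_0 &= (g_0 + \overline{p_7}) + p_0 \cdot \overline{g_7} \cdot (\overline{p_6} + \overline{p_5}) \\
&\quad + p_0 \cdot \overline{g_7} \cdot \overline{g_6} \cdot \overline{g_5} \cdot (\overline{p_4} + \overline{p_3}) \\
&\quad + p_0 \cdot \overline{g_7} \cdot \overline{g_6} \cdot \overline{g_5} \cdot \overline{g_4} \cdot \overline{g_3} \cdot (\overline{p_2} + \overline{p_1}) \\
&= g^*_0 + p^*_0 \cdot \overline{p^*_6} + p^*_0 \cdot \overline{g^*_6} \cdot \overline{p^*_4} + p^*_0 \cdot \overline{g^*_6} \cdot \overline{g^*_4} \cdot \overline{p^*_2}
\end{aligned} \qquad (22)$$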

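where

$$g^*_i = \begin{cases}
g_0 + \overline{p_7} = \overline{x'_7} \cdot (x'_0 + \overline{x'_6}) & \text{if } i = 0 \\
g_1 + g_0 = x'_0 \cdot (x'_1 + \overline{x'_7}) & \text{if } i = 1 \\
g_i + g_{i-1} = x'_{i-1} \cdot (x'_i + x'_{i-2}) & \text{if } 2 \le i \le 7
\end{cases}$$

$$p^*_i = \begin{cases}
p_0 \cdot \overline{g_7} = \overline{x'_7} + x'_0 \cdot \overline{x'_6} & \text{if } i = 0 \\
p_1 \cdot p_0 = x'_0 + x'_1 \cdot \overline{x'_7} & \text{if } i = 1 \\
p_i \cdot p_{i-1} = x'_{i-1} + x'_i \cdot x'_{i-2} & \text{if } 2 \le i \le 7
\end{cases} \qquad (23)$$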

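Using the prefix notation, (22) can be rewritten as

$$c_0 = (g^*_0, p^*_0) \bullet (\overline{p^*_6}, \overline{g^*_6}) \bullet (\overline{p^*_4}, \overline{g^*_4}) \bullet (\overline{p^*_2}, \overline{g^*_2}) \qquad (24)$$

In general, $c_i$ can be expressed as follows.

$$c_i = (a_i, b_i) \bullet (a_{i-2}, b_{i-2}) \bullet (a_{i-4}, b_{i-4}) \bullet (a_{i-6}, b_{i-6}) \qquad (25)$$

where

$$(a_i, b_i) = \begin{cases}
(g^*_i, \; p^*_i) & \text{if } 0 \le i \le 7 \\
(\overline{p^*_{i \bmod 8}}, \; \overline{g^*_{i \bmod 8}}) & \text{if } -6 \le i \le -1
\end{cases} \qquad (26)$$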
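The carry formulation of (23), (25) and (26) can be exercised with a bit-level model. The Python below is a sketch under several assumptions that are not taken from the paper: the function name `hard_multiple_mod_2n_plus_1` is hypothetical, the prefix chain is evaluated serially, $n$ is assumed even, and the carry into bit 0 is taken as the complement of $c_{n-1}$, in line with the complemented-end-around-carry addition of (8). The input is the diminished-1 value $X' = X - 1$ of a non-zero residue.

```python
def hard_multiple_mod_2n_plus_1(x_dim, n=8):
    """Diminished-1 value of |3X| mod 2^n + 1, given x_dim = X - 1 (n even)."""
    def bit(i):
        return (x_dim >> i) & 1

    def shifted(i):
        # Bit i of CCLS(X', 1): x'_{i-1}, with the complement of x'_{n-1} at bit 0.
        return bit(i - 1) if i > 0 else 1 - bit(n - 1)

    # Generate, propagate and half-sum bits of X' + CCLS(X', 1), as in (20).
    g = [bit(i) & shifted(i) for i in range(n)]
    p = [bit(i) | shifted(i) for i in range(n)]
    h = [bit(i) ^ shifted(i) for i in range(n)]

    # Modified signals of (23); position 0 absorbs the complemented bit 7 terms.
    gs = [g[0] | (1 - p[n - 1])] + [g[i] | g[i - 1] for i in range(1, n)]
    ps = [p[0] & (1 - g[n - 1])] + [p[i] & p[i - 1] for i in range(1, n)]

    def ab(i):
        # (26): negative indices use the swapped, complemented pair.
        if i >= 0:
            return gs[i], ps[i]
        return 1 - ps[i % n], 1 - gs[i % n]

    # (25): c_i = (a_i, b_i) . (a_{i-2}, b_{i-2}) . ... with n/2 terms.
    c = []
    for i in range(n):
        G, P = ab(i)
        for k in range(2, n, 2):
            g2, p2 = ab(i - k)
            G, P = G | (P & g2), P & p2
        c.append(G)

    # Assumed post-processing: carry into bit 0 is the complement of c_{n-1}.
    carry_in = [1 - c[n - 1]] + c[:n - 1]
    return sum((h[i] ^ carry_in[i]) << i for i in range(n))
```

For n = 8 the result plus one matches $3X \bmod 257$ for every $X$ from 1 to 256, which is what the exhaustive check in the accompanying test confirms.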

Fig. 3 shows the parallel-prefix implementation of the proposed modulo $2^n+1$ hard multiple generator. In Fig. 3, only the prefix computation stage with $\lceil \log_2 n \rceil - 1$ levels is shown. The pre-processing and post-processing stages are similar to the corresponding units of Fig. 2. The dashed line indicates that the modified generate and propagate signals $(g^*_i, p^*_i)$ are pair-wise swapped and complemented, i.e., $(\overline{p^*_i}, \overline{g^*_i})$. The number of additional prefix operators in each prefix level $i$, $1 \le i \le \lceil \log_2 n \rceil - 1$, is $n - 2^{i+1}$.

Figure 3. Proposed modulo $2^n+1$ hard multiple generator

IV. PERFORMANCE COMPARISON

In this section, the performance of the proposed modulo $2^n-1$ hard multiple generator is compared against the parallel-prefix modulo $2^n-1$ adders of [2], [6] and [7]. Similarly, the performance of the proposed modulo $2^n+1$ hard multiple generator is compared against modulo $2^n+1$ adders of [2], [8] and the fast parallel-prefix adder (FPP) of [9].

The performances of the various designs are first evaluated analytically using the unit-gate model [10]. In this model, the area and delay of a two-input logic gate are considered as one unit, except XOR and XNOR, the area and delay of which are considered as two units each. Tables I and II summarize the area and delay model of the various modulo $2^n-1$ and modulo $2^n+1$ adders, respectively. The metric $k_i$ is obtained for each design after factoring out the common factor $f_i$ from the area and delay expressions. For example, the area of the proposed modulo $2^n-1$ hard multiple generator is equal to $\sum_{i=1}^{3} k_i \cdot f_i = (3n) \cdot \lceil \log_2 n \rceil + (5) \cdot n + (0) \cdot 1$. Table I shows that the proposed modulo $2^n-1$ hard multiple generator has the least area and delay. From Table II, the proposed modulo $2^n+1$ hard multiple generator has the least delay.

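TABLE I. UNIT-GATE AREA AND DELAY OF MODULO $2^n-1$ ADDERS (entries are the coefficients $k_i$)

| $f_i$ | Proposed Area | Proposed Delay | [2] Area | [2] Delay | [6] Area | [6] Delay | [7] Area | [7] Delay |
|---|---|---|---|---|---|---|---|---|
| $\lceil \log_2 n \rceil$ | 3n | 2 | 3n | 2 | 3n | 2 | 3n | 2 |
| $n$ | 5 | 0 | 6 | 0 | 6 | 0 | 10 | 0 |
| 1 | 0 | 2 | 3 | 6 | 0 | 4 | 0 | 4 |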
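To make the reading of Table I concrete, the Python sketch below evaluates $\sum k_i \cdot f_i$ for each design. The dictionary layout and the name `unit_gate` are illustrative choices; the coefficients are copied from Table I as read here.

```python
from math import ceil, log2

# (area, delay) expressions assembled from the Table I coefficients,
# with L = ceil(log2 n).
TABLE_I = {
    "Proposed": (lambda n, L: 3 * n * L + 5 * n,     lambda L: 2 * L + 2),
    "[2]":      (lambda n, L: 3 * n * L + 6 * n + 3, lambda L: 2 * L + 6),
    "[6]":      (lambda n, L: 3 * n * L + 6 * n,     lambda L: 2 * L + 4),
    "[7]":      (lambda n, L: 3 * n * L + 10 * n,    lambda L: 2 * L + 4),
}


def unit_gate(design, n):
    """Return (area, delay) in unit gates for a design of Table I."""
    L = ceil(log2(n))
    area, delay = TABLE_I[design]
    return area(n, L), delay(L)


for name in TABLE_I:
    print(name, unit_gate(name, 8))
```

For n = 8 the proposed generator evaluates to 112 unit gates of area and 8 unit delays, against 120 and 10 for the design of [6].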


TABLE II. UNIT-GATE AREA AND DELAY OF MODULO $2^n+1$ ADDERS (entries are the coefficients $k_i$)

| $f_i$ | Proposed Area | Proposed Delay | [2] Area | [2] Delay | [8] Area | [8] Delay | [9] Area | [9] Delay |
|---|---|---|---|---|---|---|---|---|
| $\lceil \log_2 n \rceil$ | 6n | 2 | 3n | 2 | 4.5n | 2 | 6n | 2 |
| $n$ | −4 | 0 | 6 | 0 | 1.5 | 0 | 1 | 0 |
| 1 | 12 | 2 | 3 | 6 | 6 | 4 | 12 | 4 |

For an accurate evaluation, the proposed hard multiple generators and the modulo adders of [2], [6]-[9] were described in VHDL, synthesized by Synopsys Design Compiler and mapped to the TSMC 0.18μm 1.8V CMOS standard-cell library. The individual designs were optimized for minimum achievable area and delay independently under the same nominal synthesis design environment of 25°C and 1.8V. The area and delay constrained synthesis results for the modulus $2^n-1$ are shown in Tables III and IV, respectively. Both area (in μm²) and delay (in ns) are listed under each type of optimization constraint to ensure that the secondary performance metric is not compromised.

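TABLE III. AREA CONSTRAINED RESULTS FOR MODULO $2^n-1$ ADDERS (area in μm², delay in ns)

| n | Proposed Area | Proposed Delay | [2] Area | [2] Delay | [6] Area | [6] Delay | [7] Area | [7] Delay |
|---|---|---|---|---|---|---|---|---|
| 8 | 1144 | 0.73 | 1443 | 1.07 | 1410 | 0.72 | 1437 | 0.80 |
| 16 | 2820 | 0.89 | 3386 | 1.46 | 3353 | 0.88 | 3406 | 0.95 |
| 24 | 5029 | 1.04 | 5595 | 1.75 | 5827 | 1.03 | 5907 | 1.11 |
| 32 | 6706 | 1.04 | 7803 | 2.04 | 7770 | 1.03 | 7876 | 1.11 |
| 40 | 9713 | 1.19 | 10278 | 2.18 | 11043 | 1.18 | 11176 | 1.26 |
| 48 | 11655 | 1.19 | 12753 | 2.49 | 13252 | 1.18 | 13412 | 1.26 |
| 56 | 13598 | 1.19 | 15228 | 2.63 | 15461 | 1.18 | 15647 | 1.26 |
| 64 | 15540 | 1.19 | 17703 | 2.94 | 17669 | 1.18 | 17882 | 1.26 |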

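TABLE IV. DELAY CONSTRAINED RESULTS FOR MODULO $2^n-1$ ADDERS (area in μm², delay in ns)

| n | Proposed Area | Proposed Delay | [2] Area | [2] Delay | [6] Area | [6] Delay | [7] Area | [7] Delay |
|---|---|---|---|---|---|---|---|---|
| 8 | 2408 | 0.43 | 2750 | 0.64 | 2737 | 0.51 | 3233 | 0.49 |
| 16 | 5498 | 0.53 | 5907 | 0.75 | 6533 | 0.61 | 6895 | 0.61 |
| 24 | 9590 | 0.65 | 9829 | 0.83 | 10907 | 0.73 | 12008 | 0.72 |
| 32 | 12377 | 0.65 | 13950 | 0.92 | 13960 | 0.73 | 15577 | 0.73 |
| 40 | 17676 | 0.76 | 17696 | 0.99 | 20384 | 0.84 | 21056 | 0.83 |
| 48 | 20863 | 0.76 | 21731 | 0.97 | 22928 | 0.84 | 25496 | 0.85 |
| 56 | 25277 | 0.77 | 25503 | 0.98 | 26814 | 0.84 | 29711 | 0.84 |
| 64 | 29086 | 0.76 | 30177 | 1.02 | 30739 | 0.84 | 33224 | 0.84 |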

From Table III, compared with [2], [6] and [7], the proposed modulo $2^n-1$ hard multiple generator saves at least 19% and 12% of area for n = 8 and n = 64, respectively. From Table IV, for n = 64, the critical path delay is reduced by 25% over [2] and 10% over [6] and [7].
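The quoted percentages follow directly from the table entries. As a small worked check, the Python sketch below recomputes them from the n = 8 and n = 64 rows of Table III and the n = 64 row of Table IV; the helper name `saving` is illustrative.

```python
def saving(proposed, reference):
    """Percentage reduction of the proposed design relative to a reference design."""
    return 100.0 * (reference - proposed) / reference


# Area (um^2) from Table III and delay (ns) from Table IV.
area_n8 = {"Proposed": 1144, "[2]": 1443, "[6]": 1410, "[7]": 1437}
area_n64 = {"Proposed": 15540, "[2]": 17703, "[6]": 17669, "[7]": 17882}
delay_n64 = {"Proposed": 0.76, "[2]": 1.02, "[6]": 0.84, "[7]": 0.84}

for label, row in (("area, n=8", area_n8), ("area, n=64", area_n64), ("delay, n=64", delay_n64)):
    for ref in ("[2]", "[6]", "[7]"):
        print(f"{label} vs {ref}: {saving(row['Proposed'], row[ref]):.1f}%")
```

The smallest area saving at each word length is obtained against [6], the smallest of the three reference adders.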

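TABLE V. AREA CONSTRAINED RESULTS FOR MODULO $2^n+1$ ADDERS (area in μm², delay in ns)

| n | Proposed Area | Proposed Delay | [2] Area | [2] Delay | [8] Area | [8] Delay | [9] Area | [9] Delay |
|---|---|---|---|---|---|---|---|---|
| 8 | 1407 | 0.81 | 1493 | 1.13 | 1443 | 0.77 | 1696 | 0.72 |
| 16 | 3765 | 0.96 | 3479 | 1.49 | 3629 | 0.92 | 4347 | 0.87 |
| 24 | 6679 | 1.11 | 5754 | 1.77 | 6370 | 1.07 | 7554 | 1.03 |
| 32 | 9556 | 1.11 | 7993 | 2.11 | 8808 | 1.07 | 10724 | 1.03 |
| 40 | 13571 | 1.27 | 10571 | 2.35 | 12384 | 1.23 | 15032 | 1.18 |
| 48 | 16445 | 1.27 | 13072 | 2.61 | 15085 | 1.23 | 18198 | 1.18 |
| 56 | 19319 | 1.27 | 15574 | 2.81 | 19382 | 1.25 | 21365 | 1.19 |
| 64 | 23258 | 1.27 | 18075 | 3.18 | 20753 | 1.23 | 25596 | 1.18 |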

The area and delay constrained synthesis results are presented in Tables V and VI, respectively, for the modulo $2^n+1$ adders. From Table VI, the proposed modulo $2^n+1$ hard multiple generator has the least critical path delay for all n. For an intermediate value of n such as 32, the proposed design reduces the critical path delay by 25% with an area overhead of 20% over [2] and by 8% with an area overhead of 8% over [8]. Compared to the FPP design of [9], area and delay savings of 10% each are achieved for n = 32 since the proposed hard multiple generator employs simpler pre-processing and post-processing units.

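TABLE VI. DELAY CONSTRAINED RESULTS FOR MODULO $2^n+1$ ADDERS (area in μm², delay in ns)

| n | Proposed Area | Proposed Delay | [2] Area | [2] Delay | [8] Area | [8] Delay | [9] Area | [9] Delay |
|---|---|---|---|---|---|---|---|---|
| 8 | 2807 | 0.45 | 2883 | 0.63 | 3003 | 0.51 | 3419 | 0.51 |
| 16 | 6676 | 0.57 | 6493 | 0.76 | 7554 | 0.63 | 8748 | 0.63 |
| 24 | 11938 | 0.68 | 10760 | 0.88 | 12301 | 0.74 | 13724 | 0.73 |
| 32 | 17254 | 0.68 | 14017 | 0.91 | 17087 | 0.74 | 19881 | 0.75 |
| 40 | 23660 | 0.77 | 18385 | 0.97 | 23920 | 0.83 | 28078 | 0.83 |
| 48 | 28816 | 0.77 | 21957 | 1.02 | 27672 | 0.85 | 32242 | 0.85 |
| 56 | 32455 | 0.77 | 26910 | 0.99 | 35732 | 0.85 | 37724 | 0.86 |
| 64 | 38932 | 0.80 | 31068 | 1.04 | 37372 | 0.85 | 46200 | 0.86 |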

V. CONCLUSION

Application-specific adder designs that compute the hard multiple of the radix-8 Booth encoded modulo $2^n-1$ and modulo $2^n+1$ multipliers were proposed. The parallel-prefix implementations of the proposed modulo hard multiple generators were derived from the modified carry equations. Both designs compute the corresponding modulo-reduced hard multiple in only $\lceil \log_2 n \rceil - 1$ prefix levels. The critical path delay of the proposed designs was shown to be the least among the existing modulo adders based on the unit-gate model and synthesis results.

REFERENCES

[1] M. A. Soderstrand, W. K. Jenkins, G. A. Jullien and F. J. Taylor, Residue Number System Arithmetic: Modern Applications in Digital Signal Processing, IEEE Press, New York, 1986.

[2] R. Zimmermann, "Efficient VLSI implementation of modulo $(2^n \pm 1)$ addition and multiplication," in Proc. IEEE Symp. on Computer Arithmetic, Adelaide, Australia, pp. 158-167, Apr. 1999.

[3] C. Efstathiou, H. T. Vergos and D. Nikolos, "Modified Booth modulo $2^n-1$ multipliers," IEEE Trans. on Computers, vol. 53, no. 3, pp. 370-374, Mar. 2004.

[4] B. S. Cherkauer and E. G. Friedman, "A hybrid radix-4/radix-8 low power signed multiplier architecture," IEEE Trans. on Circuits and Syst. II, vol. 44, no. 8, pp. 656-659, Aug. 1997.

[5] P.-M. Seidel, L. D. McFearin and D. W. Matula, "Secondary radix recodings for higher radix multipliers," IEEE Trans. on Computers, vol. 54, no. 2, pp. 111-123, Feb. 2005.

[6] L. Kalampoukas et al., "High-speed parallel-prefix modulo $(2^n-1)$ adders," IEEE Trans. on Computers, vol. 49, no. 7, pp. 673-680, July 2000.

[7] G. Dimitrakopoulos et al., "New architectures for modulo $2^n-1$ adders," in Proc. IEEE Int. Conf. on Electronics, Circuits and Systems, Gammarth, Tunisia, pp. 1-4, Dec. 2005.

[8] H. T. Vergos, C. Efstathiou and D. Nikolos, "Diminished-one modulo $2^n+1$ adder design," IEEE Trans. on Computers, vol. 51, no. 12, pp. 1389-1399, Dec. 2002.

[9] H. T. Vergos and C. Efstathiou, "Efficient modulo $2^n+1$ adder architectures," Integration, the VLSI Journal, vol. 42, no. 2, pp. 149-157, Feb. 2009.

[10] A. Tyagi, "A reduced-area scheme for carry-select adders," IEEE Trans. on Computers, vol. 42, no. 10, pp. 1163-1170, Oct. 1993.
