[IEEE 2011 IEEE International Symposium on Circuits and Systems (ISCAS) - Rio de Janeiro, Brazil (2011.05.15-2011.05.18)] 2011 IEEE International Symposium of Circuits and Systems

A New RNS Scaler for $\{2^n - 1, 2^n, 2^n + 1\}$

Jeremy Yung Shern Low and Chip Hong Chang
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore.


Abstract: This paper presents an efficient RNS scaling algorithm for the balanced special moduli set $\{2^n - 1, 2^n, 2^n + 1\}$. By exploiting the relationship between the scaling constant and the residues of the three-moduli set using the New Chinese Remainder Theorem I (New CRT-I), the complicated modulo reduction operations for large integer scaling in RNS can be greatly simplified. The scaling constant has been chosen as $2^n(2^n + 1)$ such that all residues of the scaled integer are identical and equal to the scaled integer output. This is particularly useful as no expensive and slow residue-to-binary converter is required for interfacing with a conventional number system after the digital signal processing and scaling in the RNS domain. The scaling error occurs only conditionally and is proven to be at most unity. The proposed design can be implemented entirely with full adders, with complexity commensurate with a multi-operand modulo $2^n - 1$ adder. Its area-time complexity is at least 86% lower than one of the fastest ROM-based scaler designs for the same moduli set over a wide dynamic range of 15 bits and above.

I. INTRODUCTION

We are living in a ubiquitous computing world surrounded by sophisticated portable electronic devices such as cell phones, digital audio players, digital cameras, camcorders, Global Positioning System (GPS), radio-frequency identification tags, etc. These contemporary devices that support pervasive computing capitalize on advanced integrated circuit technology to enable system integration on silicon for device miniaturization. The demand for increasing integration density presents significant challenges to circuit designers due to conflicting design criteria such as chip size, performance, versatility and power budget. The throughput rates of real-time application-specific digital signal processing (DSP) functions, such as the Discrete Cosine Transform (DCT), Fast Fourier Transform (FFT) and FIR filter, are particularly constrained by computationally intensive datapaths dominated by multiple multiplication and addition operations. While a long carry propagation chain is inevitable in these operators under the conventional binary number system, the Residue Number System (RNS) emerges as an appealing alternative. In RNS, a large-wordlength integer is decomposed into smaller-wordlength numbers in different modulo channels, and arithmetic operations such as addition, subtraction and multiplication of these numbers can be carried out in parallel without carry propagation across different modulo channels. By introducing subword-level parallelism into the datapath, the system throughput rate can be increased. From the power dissipation perspective, for the same operand length, RNS can tolerate a larger reduction in supply voltage than the corresponding weighted Binary Number System (BNS) architecture under the same delay constraint, which results in a quadratic reduction in dynamic power dissipation [1], [2]. These features of RNS are acclaimed for the design of inner product step processor (IPSP)-like architectures [3], [4], which encompass the generic architecture of the above digital signal processors (DCT, FFT and FIR filter).

RNS is not a panacea though. Due to its non-weighted nature, it suffers from a major drawback in intermodular¹ operations, such as sign detection, magnitude comparison, division, reverse conversion and scaling. These operations require relatively larger hardware area and longer delay if executed directly in the RNS domain. Unfortunately, scaling is one of the crucial operations in an IPSP because it ensures that the intermediate results of the operations before the final stage do not exceed the dynamic range of the system. Performing scaling in BNS within an RNS does not help to solve the problem because the required residue-to-binary conversion is also an intermodular operation. Typically, an RNS scaler is placed after each stage of RNS computation, which may consist of more than one arithmetic operation, to prevent overflow error. Consequently, the throughput bottleneck of an IPSP lies in the delay of the RNS scaler. Some research has been done to reduce the hardware cost and enhance the performance of the RNS scaler [5]-[9]. Generally, the RNS scaling problem is formulated either based on modulo reduction of the least integer function of division or by the Chinese Remainder Theorem (CRT). The former method usually involves a computationally intensive base extension operation [7], [9], although the outputs are free from scaling error. On the other hand, CRT-based approaches [5], [6], [8] usually produce a scaled integer with some fractional error. All the aforementioned RNS scaling algorithms are implemented using ROMs. ROM-based approaches have low throughput besides consuming more silicon area, especially when the number of moduli and the wordlengths of the moduli are large.

¹ Intermodular operations refer to operations in RNS that require the information of all residue channels in order to produce the correct output results.

978-1-4244-9474-3/11/$26.00 ©2011 IEEE

This paper presents a new RNS scaling algorithm for the celebrated three-moduli set $\{2^n - 1, 2^n, 2^n + 1\}$ using Theorem I of the New CRT (also known as New CRT-I) proposed in [10]. New CRT-I was originally used to simplify the reverse converter design of special moduli set RNS by reducing the size of the modulo operation [11]. In our proposed RNS scaling algorithm, New CRT-I is applied with a carefully selected scaling constant so that large integer scaling in RNS can be performed with simpler circuits than a scaler designed with the classical CRT. It is interesting to note that the resultant residue digits of the scaled output for all the modulus channels are identical and equal to the scaled integer output, with a scaling error of no more than one. To the best of our knowledge, this is the first RNS scaler designed based on the New CRT-I, and no other RNS scaler has the property that all the residue channels share a common output value upon scaling. Consequently, any scaled residue can be interfaced directly to the normal binary system without the need for residue-to-binary conversion. The proposed scaling algorithm lends itself to efficient hardware implementation and also minimizes the total number of scaling operations required in an IPSP architecture.

This paper is organized as follows: Section II provides the detailed derivation of the proposed RNS scaling algorithm. Error analysis is provided followed by a numerical example. Section III describes the architecture implementation of the proposed RNS scaler. The area and delay performances of the proposed architecture are evaluated and compared against the fastest RNS scaler for the same moduli set using a unit gate model. The paper is concluded in Section IV.

II. PROPOSED SCALING ALGORITHM

In RNS, an integer $X$ is represented by an $N$-tuple $(x_1, x_2, \ldots, x_N)$ with respect to a set of pairwise relatively prime numbers $\{m_1, m_2, \ldots, m_N\}$, where $x_i = |X|_{m_i}$, $i = 1, 2, \ldots, N$, and $|X|_{m_i}$ is defined as $X \bmod m_i$. The dynamic range of a moduli set $\{m_1, m_2, \ldots, m_N\}$ is given by $M = \prod_{i=1}^{N} m_i$.
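As a concrete illustration of this representation, the following minimal Python sketch converts an integer into its residue tuple and computes the dynamic range. The helper names `to_rns` and `dynamic_range` are illustrative only and do not come from the paper.

```python
from math import prod


def to_rns(X, moduli):
    """Return the residue tuple (x_1, ..., x_N) of X for the given moduli."""
    return tuple(X % m for m in moduli)


def dynamic_range(moduli):
    """Return M, the product of all moduli."""
    return prod(moduli)


n = 5
moduli = (2**n - 1, 2**n, 2**n + 1)   # the set {31, 32, 33}
print(to_rns(5678, moduli))           # (5, 14, 2)
print(dynamic_range(moduli))          # 32736
```

Any integer in the range [0, M - 1] maps to a distinct residue tuple, which is what allows the channels to be processed independently.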

New Chinese Remainder Theorem I (New CRT-I) [10], [11]: Given the residue number $(x_1, x_2, \ldots, x_n)$ with respect to a moduli set $\{P_1, P_2, \ldots, P_n\}$, the integer $X$ in normal binary representation can be computed by:

$$X = \left| x_1 + k_1 P_1 (x_2 - x_1) + k_2 P_1 P_2 (x_3 - x_2) + \cdots + k_{n-1} P_1 P_2 \cdots P_{n-1} (x_n - x_{n-1}) \right|_{P_1 P_2 \cdots P_n} \tag{1}$$

where $|k_1 P_1|_{P_2 \cdots P_n} = 1$, $|k_2 P_1 P_2|_{P_3 \cdots P_n} = 1$, ..., $|k_{n-1} P_1 \cdots P_{n-1}|_{P_n} = 1$.
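To make the structure of (1) concrete, here is a small Python sketch of New CRT-I for an arbitrary list of pairwise coprime moduli. The function name `new_crt1` is hypothetical, and the multiplicative inverses are obtained with Python's built-in three-argument `pow` (Python 3.8 or later); that is a software convenience assumed for this sketch, not something prescribed by [10].

```python
from math import prod


def new_crt1(residues, moduli):
    """Reconstruct X from its residues using the New CRT-I sum in (1)."""
    M = prod(moduli)
    X = residues[0]
    head = 1                            # running product P_1 * ... * P_i
    for i in range(1, len(moduli)):
        head *= moduli[i - 1]
        tail = M // head                # P_{i+1} * ... * P_n
        k = pow(head, -1, tail)         # k_i with |k_i * P_1...P_i|_tail = 1
        X += k * head * (residues[i] - residues[i - 1])
    return X % M


moduli = (33, 32, 31)                   # (P_1, P_2, P_3) for n = 5
residues = tuple(5678 % P for P in moduli)
print(new_crt1(residues, moduli))       # 5678
```

Each term only involves the difference of two adjacent residues, which is the feature the scaling algorithm exploits later.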

Theorem 1 [11]: For a three-moduli set $\{P_1, P_2, P_3\}$, $X$ is related to its residue digits by:

$$X = x_1 + P_1 \left| k_1 (x_2 - x_1) + k_2 P_2 (x_3 - x_2) \right|_{P_2 P_3} \tag{2}$$

where $|k_1 P_1|_{P_2 P_3} = 1$ and $|k_2 P_1 P_2|_{P_3} = 1$.

Let $P_1 = 2^n + 1$, $P_2 = 2^n$, $P_3 = 2^n - 1$. The expressions for $k_1$ and $k_2$ can be derived as follows:

$$k_1 = 2^{2n-1} - 2^n + 1 \tag{3}$$

$$k_2 = 2^{n-1} \tag{4}$$
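The closed forms (3) and (4) can be checked numerically against the defining congruences of Theorem 1. The sketch below is only a sanity check under the assignment $P_1 = 2^n + 1$, $P_2 = 2^n$, $P_3 = 2^n - 1$; the function name `multiplicative_constants` is illustrative, and the loop starts at n = 2 because $P_3 = 1$ for n = 1.

```python
def multiplicative_constants(n):
    """Return (k1, k2) computed directly from their defining congruences."""
    P1, P2, P3 = 2**n + 1, 2**n, 2**n - 1
    k1 = pow(P1, -1, P2 * P3)        # |k1 * P1| modulo P2*P3 equals 1
    k2 = pow(P1 * P2, -1, P3)        # |k2 * P1 * P2| modulo P3 equals 1
    return k1, k2


for n in range(2, 12):
    k1, k2 = multiplicative_constants(n)
    assert k1 == 2**(2 * n - 1) - 2**n + 1   # closed form (3)
    assert k2 == 2**(n - 1)                  # closed form (4)

print(multiplicative_constants(5))           # (481, 16)
```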

By substituting (3), (4) and the corresponding expressions for $P_i$, $i$ = 1, 2 and 3, into (2), we have

$$X = x_1 + (2^n + 1) \cdot \left| (2^{2n-1} - 2^n + 1)(x_2 - x_1) + 2^{n-1} \cdot 2^n (x_3 - x_2) \right|_{2^n (2^n - 1)} \tag{5}$$
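Equation (5) is a complete reverse conversion formula, so it can be exercised directly. The following Python sketch evaluates it with ordinary integer arithmetic; the function name `reverse_convert` is an assumption of this example, and the residues are ordered as $x_1 = |X|_{2^n+1}$, $x_2 = |X|_{2^n}$, $x_3 = |X|_{2^n-1}$.

```python
def reverse_convert(x1, x2, x3, n):
    """Evaluate (5) for residues taken modulo 2^n+1, 2^n and 2^n-1."""
    P = 2**n * (2**n - 1)
    inner = ((2**(2 * n - 1) - 2**n + 1) * (x2 - x1)
             + 2**(n - 1) * 2**n * (x3 - x2))
    return x1 + (2**n + 1) * (inner % P)


n, X = 5, 5678
x1, x2, x3 = X % (2**n + 1), X % 2**n, X % (2**n - 1)
print(reverse_convert(x1, x2, x3, n))   # 5678
```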

The following axioms and properties are used for the derivation of our scaling algorithm [12].

Axiom 1: $|A X|_{AB} = A \, |X|_B$

Axiom 2: $|X \pm Y|_m = \big| |X|_m \pm |Y|_m \big|_m$

Axiom 3: $|X \cdot Y|_m = \big| |X|_m \cdot |Y|_m \big|_m$
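The three axioms are standard identities of modular arithmetic. The short Python sketch below spot-checks them over a small grid of values; it is a numerical illustration rather than a proof, and the function names are hypothetical. Python's `%` operator returns a non-negative result for a positive modulus, which matches the $|\cdot|_m$ notation used here.

```python
def axiom1(A, B, X):
    """|A*X| modulo A*B equals A times |X| modulo B."""
    return (A * X) % (A * B) == A * (X % B)


def axiom2(X, Y, m):
    """Addition and subtraction may be reduced operand by operand."""
    return ((X + Y) % m == (X % m + Y % m) % m
            and (X - Y) % m == (X % m - Y % m) % m)


def axiom3(X, Y, m):
    """Multiplication may be reduced operand by operand."""
    return (X * Y) % m == ((X % m) * (Y % m)) % m


checks = [
    axiom1(A, B, X) and axiom2(X, Y, B) and axiom3(X, Y, B)
    for A in range(1, 6) for B in range(1, 8)
    for X in range(-20, 21) for Y in (-7, 0, 3, 11)
]
print(all(checks))   # True
```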

Property 1: Multiplying an $n$-bit binary number $x$ by the $r$-th power of 2 in modulo $2^n - 1$ is equivalent to a circular left shift operation,

$$\left| 2^r x \right|_{2^n - 1} = CLS_n(x, r) \tag{6}$$

where $CLS_n(x, r)$ denotes a circular shift of the $n$-bit binary number $x$ by $r$ bits to the left.

Property 2:

$$\left| -2^r x \right|_{2^n - 1} = \left| 2^r (2^n - 1 - x) \right|_{2^n - 1} = \left| 2^r \bar{x} \right|_{2^n - 1} = CLS_n(\bar{x}, r) \tag{7}$$

where $\bar{x}$ is the 1's complement of the binary number $x$.
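Properties 1 and 2 are what turn the multiplications by powers of two into pure wiring. The Python sketch below models the circular left shift on an $n$-bit word and checks both properties exhaustively for a given $n$. The names `cls` and `properties_hold` are illustrative. Results are compared modulo $2^n - 1$ because the all-ones word and the all-zeros word both represent zero in this modulus.

```python
def cls(x, r, n):
    """Circular left shift of the n-bit word x by r bit positions."""
    mask = (1 << n) - 1
    r %= n
    return ((x << r) | (x >> (n - r))) & mask


def properties_hold(n):
    """Check Property 1 and Property 2 for every n-bit x and every shift r."""
    m = (1 << n) - 1
    for x in range(1 << n):
        x_bar = m ^ x                                  # 1's complement of x
        for r in range(n):
            if cls(x, r, n) % m != (x << r) % m:       # Property 1
                return False
            if cls(x_bar, r, n) % m != (-(x << r)) % m:  # Property 2
                return False
    return True


print(cls(0b00101, 4, 5))    # 18, i.e. |16 * 5| modulo 31
print(properties_hold(5))    # True
```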

By definition, scaling an integer variable $X$ by a constant integer $k$ can be obtained by dividing both sides of (5) by $k$. The scaling factor $k$ has been chosen to be $2^n(2^n + 1)$ such that the logic expression, and hence its hardware complexity, for the RNS scaling operation can be greatly simplified as shown below.

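Using Theorem 1 and Axiom 1, we have the scaled integer $Y$ given by:

$$
\begin{aligned}
Y &= \left\lfloor \frac{X}{2^n (2^n + 1)} \right\rfloor \\
&= \left\lfloor \underbrace{\frac{x_1}{2^n (2^n + 1)}}_{\approx 0} + \frac{\left| (2^{2n-1} - 2^n + 1)(x_2 - x_1) + 2^{2n-1} (x_3 - x_2) \right|_P}{2^n} \right\rfloor \\
&\cong \left\lfloor \frac{1}{2^n} \left| (2^{2n-1} - 2^n + 1)(x_2 - x_1) + 2^{2n-1} (x_3 - x_2) \right|_P \right\rfloor
\end{aligned} \tag{8}
$$

where $P = 2^n (2^n - 1)$.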

$$
\begin{aligned}
Y &= \left| \left( \frac{2^{2n-1} - 2^n + 1}{2^n} \right) (x_2 - x_1) + 2^{n-1} (x_3 - x_2) \right|_{2^n - 1} \\
&\cong \left| (2^{n-1} - 1)(x_2 - x_1) + 2^{n-1} (x_3 - x_2) \right|_{2^n - 1}
\end{aligned} \tag{9}
$$

Since $2^n - 1$ is the smallest modulus of the three-moduli set, it is trivial to note that taking the modulo operations with respect to $2^n$ and $2^n + 1$ on the scaled integer output $Y$ produces the same result. In other words, the scaled residue digits are identical and equal to the scaled integer computed by (9), i.e., $y_i = |Y|_{m_i} = Y$ for $i$ = 1, 2, and 3. In what follows, we will show that the error due to the RNS scaling is reasonably small.
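A minimal Python sketch of the final expression in (9) is given below to show that the scaled value is the same in every residue channel. The function name `scale` is hypothetical, and the residues are ordered as $x_1 = |X|_{2^n+1}$, $x_2 = |X|_{2^n}$, $x_3 = |X|_{2^n-1}$, following the assignment of $P_1$, $P_2$, $P_3$ used in the derivation.

```python
def scale(x1, x2, x3, n):
    """Evaluate the final expression of (9) modulo 2^n - 1."""
    m = 2**n - 1
    return ((2**(n - 1) - 1) * (x2 - x1) + 2**(n - 1) * (x3 - x2)) % m


n, X = 5, 5678
x1, x2, x3 = X % (2**n + 1), X % 2**n, X % (2**n - 1)
Y = scale(x1, x2, x3, n)
# Y is below 2^n - 1, so reducing it by any of the three moduli leaves it unchanged.
residues = (Y % (2**n + 1), Y % 2**n, Y % (2**n - 1))
print(Y, residues)   # 5 (5, 5, 5)
```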

The actual scaled output, i.e., before the truncation in (9), is given by (8) and is rewritten here for ease of exposition.

$$Y_{exact} = \left\lfloor \left| \left( 2^{n-1} - 1 + \frac{1}{2^n} \right) (x_2 - x_1) + 2^{n-1} (x_3 - x_2) \right|_{2^n - 1} \right\rfloor \tag{10}$$

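Let $k_1' = 2^{n-1} - 1$, $k_2' = 2^{n-1}$, $x_1' = x_2 - x_1$ and $x_2' = x_3 - x_2$, then

$$
\begin{aligned}
Y_{exact} &= \left\lfloor \left| \left( k_1' + \frac{1}{2^n} \right) x_1' + k_2' x_2' \right|_{2^n - 1} \right\rfloor \\
&= \left\lfloor \left| k_1' x_1' + k_2' x_2' + \frac{x_1'}{2^n} \right|_{2^n - 1} \right\rfloor \\
&= \left\lfloor \left| k_1' x_1' + k_2' x_2' + c \right|_{2^n - 1} \right\rfloor
\end{aligned} \tag{11}
$$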

The scaled output computed by (9) can be expressed as:

$$Y = \left| k_1' x_1' + k_2' x_2' \right|_{2^n - 1} \tag{12}$$

By definition of the modulo operation, $|x|_m = x - \mu m$, where $\mu$ is an integer. With modulus $2^n - 1$, $Y_{exact}$ and $Y$ can be written as:

$$Y_{exact} = \left\lfloor k_1' x_1' + k_2' x_2' + c - \mu (2^n - 1) \right\rfloor \tag{13}$$

$$Y = k_1' x_1' + k_2' x_2' - \mu (2^n - 1) \tag{14}$$

Since $-2^n \le x_2 - x_1 \le 2^n - 1$, the term $c = (x_2 - x_1)/2^n$ satisfies $-1 \le c < 1$. When $x_2 < x_1$ (i.e., $-1 \le c < 0$), the floor function on $Y_{exact}$ produces

$$Y_{exact} = k_1' x_1' + k_2' x_2' - \mu (2^n - 1) - 1 \tag{15}$$

As a result, the difference between $Y$ and $Y_{exact}$ is unity. Similarly, it can be proved that there is no difference between $Y$ and $Y_{exact}$ for $x_2 \ge x_1$. In summary, the proposed RNS scaling algorithm introduces a unity error when $x_2 < x_1$.
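The error behaviour can be examined exhaustively for small $n$. The Python sketch below walks over the whole dynamic range, computes the scaled value from the final expression of (9), and groups the observed error by whether $x_2 < x_1$. The function name `scaling_errors` is hypothetical. As an assumption of this sketch, the difference is taken modulo $2^n - 1$, so that a wrap-around of the result at the top of the dynamic range is also counted as a unit error.

```python
def scaling_errors(n):
    """Map the condition (x2 < x1) to the set of errors |Y - floor(X/k)| mod 2^n-1."""
    P1, P2, P3 = 2**n + 1, 2**n, 2**n - 1
    k = P1 * P2                                   # scaling constant 2^n (2^n + 1)
    errors = {}
    for X in range(P1 * P2 * P3):
        x1, x2, x3 = X % P1, X % P2, X % P3
        Y = ((2**(n - 1) - 1) * (x2 - x1) + 2**(n - 1) * (x3 - x2)) % P3
        err = (Y - X // k) % P3
        errors.setdefault(x2 < x1, set()).add(err)
    return errors


print(scaling_errors(4))   # {False: {0}, True: {1}}
```

In this check the output is exact whenever $x_2 \ge x_1$ and is off by one (modulo $2^n - 1$) otherwise, in line with the analysis above.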

The following numerical example is used to illustrate our proposed RNS scaling algorithm. Let $n = 5$, $X = 5678 \equiv (5, 14, 2)$ with respect to the moduli set {31, 32, 33} and the actual scaled value, $Y = \lfloor 5678 / (33 \times 32) \rfloor = 5$. Table I shows the calculation of the RNS scaling operation step by step based on (9). It can be seen from Table I that the scaled value calculated by the proposed algorithm tallies with the actual value, and the residue representation of 5 in {31, 32, 33} is exactly (5, 5, 5).

TABLE I. COMPUTATION TRACES OF X = 5678 SCALED BY K = 33 × 32 IN RNS {31, 32, 33}.

| Quantity | Value |
|---|---|
| Moduli $(P_3, P_2, P_1)$ | (31, 32, 33) |
| Residues $(x_3, x_2, x_1)$ | (5, 14, 2) |
| $x_2 - x_1$ | 12 |
| $x_3 - x_2$ | -9 |
| $2^{n-1} - 1$ | 15 |
| $2^{n-1}$ | 16 |
| $(2^{n-1} - 1)(x_2 - x_1)$ | 180 |
| $2^{n-1}(x_3 - x_2)$ | -144 |
| $(2^{n-1} - 1)(x_2 - x_1) + 2^{n-1}(x_3 - x_2)$ | 36 |
| $\left| (2^{n-1} - 1)(x_2 - x_1) + 2^{n-1}(x_3 - x_2) \right|_{2^n - 1}$ | 5 |
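The intermediate values of Table I can be reproduced with a few lines of Python. The sketch below is only a trace of (9) for the example; the function name `trace` and the dictionary keys are chosen for this illustration, with $x_1$, $x_2$, $x_3$ taken modulo $2^n + 1$, $2^n$ and $2^n - 1$ respectively.

```python
def trace(X, n):
    """Return the intermediate quantities of (9) for the integer X."""
    P1, P2, P3 = 2**n + 1, 2**n, 2**n - 1
    x1, x2, x3 = X % P1, X % P2, X % P3
    d1, d2 = x2 - x1, x3 - x2
    t1 = (2**(n - 1) - 1) * d1
    t2 = 2**(n - 1) * d2
    return {
        "x2 - x1": d1,
        "x3 - x2": d2,
        "term 1": t1,
        "term 2": t2,
        "sum": t1 + t2,
        "Y": (t1 + t2) % P3,
    }


for name, value in trace(5678, 5).items():
    print(name, value)
```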

III. IMPLEMENTATION AND PERFORMANCE EVALUATION

Applying Properties 1 and 2 to the expanded expression of (9), the following simplified expression can be obtained.

$$
\begin{aligned}
Y &= \left| (2^{n-1} - 1)(x_2 - x_1) + 2^{n-1}(x_3 - x_2) \right|_{2^n - 1} \\
&= \left| 2^{n-1} x_2 - x_2 - 2^{n-1} x_1 + x_1 + 2^{n-1} x_3 - 2^{n-1} x_2 \right|_{2^n - 1} \\
&= \left| 2^{n-1} x_3 - x_2 - 2^{n-1} x_1 + x_1 \right|_{2^n - 1} \\
&= \left| 2^{n-1} \bar{x}_1 + x_1 + \bar{x}_2 + 2^{n-1} x_3 \right|_{2^n - 1}
\end{aligned} \tag{16}
$$

Thus, the complex intermodular RNS scaling operation has been greatly reduced to an operation as simple as one multi-operand modulo addition, as shown in the final expression of (16). The architecture is shown in Fig. 1. Since $2^n - 1$ is the smallest modulus of the moduli set, no further modulo reduction is needed to generate the residue digits of the scaled value $Y$.
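The final expression of (16) can be modelled in software as a sum of four $n$-bit operands formed only by complementing and rotating bits. The Python sketch below does this with plain integer addition instead of the carry-save adders of Fig. 1. Two points are assumptions of the sketch and are not taken from the paper: the $(n+1)$-bit residue $x_1$ is first folded into $n$ bits by reducing it modulo $2^n - 1$, and the names `cls` and `scale_adder_form` are hypothetical.

```python
def cls(x, r, n):
    """Circular left shift of the n-bit word x by r bit positions."""
    mask = (1 << n) - 1
    r %= n
    return ((x << r) | (x >> (n - r))) & mask


def scale_adder_form(x1, x2, x3, n):
    """Evaluate the last line of (16) using complements and rotations only."""
    m = (1 << n) - 1
    x1_n = x1 % m                        # assumption: fold (n+1)-bit x1 into n bits
    operands = [
        cls(m ^ x1_n, n - 1, n),         # 2^(n-1) * complement(x1)
        x1_n,                            # x1
        m ^ x2,                          # complement(x2)
        cls(x3, n - 1, n),               # 2^(n-1) * x3
    ]
    return sum(operands) % m


print(scale_adder_form(2, 14, 5, 5))     # 5
```

For the running example the four operands are 30, 2, 17 and 18, whose sum 67 reduces to 5 modulo 31, the same value obtained from (9).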

The RNS scaling algorithm proposed by Garcia [9] has been recognized as one of the fastest methods formulated based on the modulo reduction of the least integer function of division. To compare the area and time complexity of our proposed scaler with that of [9], we adopted the analytical ROM model in [8] to estimate the number of transistors, $A_{ROM}$, required to realize the LUT of [9] as well as its unit gate delay, $T_{ROM}$.

For a $2^n \times p$ ROM [8]:

/2

(2 )

/2 /2

2 ( / 2 1) 2

2 ( / 2 2) (2 1)

nn n

ROM p

n p

A n p

p n p

⎡ ⎤⎢ ⎥×

⎢ ⎥ ⎢ ⎥⎣ ⎦ ⎣ ⎦

= + +⎡ ⎤⎢ ⎥

+ + + +⎢ ⎥⎣ ⎦ (17)

$$T_{ROM(2^n \times p)} = \left( 1 + \lceil \log_2 n \rceil + \lceil n/2 \rceil \right) \cdot t_{NAND} \tag{18}$$

where $t_{NAND}$ is the delay of a two-input NAND gate.

Figure 1. Architecture of New CRT-I-based RNS scaler for three-moduli set $\{2^n - 1, 2^n, 2^n + 1\}$ with scaling constant $k = 2^n(2^n + 1)$.

In unit gate delay model, a two-input monotonic gate, such as an AND or a NAND gate, is said to have one unit gate area (approximated to 6 transistors for static CMOS implementation) and one unit gate delay. The estimated number of transistors (A), unit gate delay (T) and the AT product of both designs for n = 5 to 8 (representing a dynamic range of approximately 15 to 24 bits) are tabulated in Table II.

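TABLE II. COMPARISON OF AREA-TIME COMPLEXITY

| n | #transistors (A), Proposed | #transistors (A), [9] | Delay (T), Proposed | Delay (T), [9] | AT, Proposed | AT, [9] |
|---|---|---|---|---|---|---|
| 5 | 960 | 8089 | 15 | 13 | 14400 | 105157 |
| 6 | 1152 | 38966 | 15 | 15 | 17280 | 584490 |
| 7 | 1344 | 110655 | 15 | 17 | 20160 | 1881135 |
| 8 | 1536 | 512136 | 15 | 18 | 23040 | 9218448 |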

From Table II, the number of transistors required to implement our proposed design is at least 8 times smaller than that of [9]. This is because our proposed design consists of only two $n$-bit CSAs and one modulo $2^n - 1$ adder, as indicated in Fig. 1. The number of transistors required to implement these combinational logic blocks increases gradually with $n$, whereas the sizes of the LUTs used by the scaler of [9] increase exponentially with $n$. The unit gate delay of our proposed design is relatively low and constant, while that of [9] increases gradually with $n$. This is because our design requires only two levels of CSA for all values of $n$, and the modulo adder used [13] consists of a parallel prefix structure and some simple combinational logic gates, whose delay depends logarithmically on $n$. Overall, the AT complexity of our proposed design is at least 86.3% lower than that of [9].
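The summary figures quoted above follow directly from the A and T columns of Table II. The Python sketch below recomputes the area ratio and the AT reduction for each $n$; the dictionary `TABLE_II` simply transcribes the table, and the function names are illustrative.

```python
# Rows of Table II: n -> (A proposed, A of [9], T proposed, T of [9])
TABLE_II = {
    5: (960, 8089, 15, 13),
    6: (1152, 38966, 15, 15),
    7: (1344, 110655, 15, 17),
    8: (1536, 512136, 15, 18),
}


def area_ratio(a_prop, a_ref):
    """How many times smaller the proposed design is in transistor count."""
    return a_ref / a_prop


def at_reduction(a_prop, a_ref, t_prop, t_ref):
    """Fractional reduction in the area-time product relative to [9]."""
    return 1.0 - (a_prop * t_prop) / (a_ref * t_ref)


worst_ratio = min(area_ratio(ap, ar) for ap, ar, _, _ in TABLE_II.values())
worst_cut = min(at_reduction(*row) for row in TABLE_II.values())
print(round(worst_ratio, 2), round(100 * worst_cut, 1))   # 8.43 86.3
```

The smallest advantage occurs at $n = 5$, which is where the "at least 8 times" and "at least 86.3%" figures come from.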

IV. CONCLUSION

Scaling is an important and yet difficult operation in RNS. This paper proposes a simple and fast RNS scaler for the balanced special-moduli set $\{2^n - 1, 2^n, 2^n + 1\}$. Unlike other RNS scaling algorithms, which are typically developed based on classical CRT or by base-extension with ROM-based implementation, our design is based on the New CRT-I formulation and can be implemented completely by full adders. Only one residue channel is required to produce the scaled output as all residues share the same scaled value as the scaled integer with the chosen scaling factor. The proposed RNS scaler consumes very low hardware with speed comparable to the fastest ROM-based design for the same dynamic range. The delay of the proposed design is logarithmically dependent on the wordlength of the modulus, $n$, and hence is relatively constant over a wide dynamic range.

REFERENCES

[1] M. N. Mahesh and M. Mehendale, “Low power realization of Residue Number System based FIR filters,” in Proc. of 13th Int. Conf. on VLSI Design, Jan. 2000, pp. 30-33.

[2] G. C. Cardarilli, A. D. Re, A. Nannarelli, and M. Re, “Low power and low leakage implementation of RNS FIR filters,” in 39th Asilomar Conf. on Signals, Syst. and Computer, 2005, pp. 1620-1624.

[3] R. Conway and J. Nelson, “Improved RNS FIR filter architectures,” IEEE Trans. on Circuits and Syst. II: Analog and Digital Signal Processing, vol. 51, no. 1, pp. 26-28, Jan. 2004.

[4] G. L. Bernocchi, G. C. Cardarilli, A. Nannarelli, and M. Re, “Low power adaptive filter based on RNS components,” in Proc. 2007 IEEE Int. Symp. on Circuits and Syst., New Orleans, USA, May 2007, pp. 3211-3214.

[5] M. Griffin, F. Taylor, and M. Sousa, “New scaling algorithm for the Chinese Remainder Theorem,” in Conf. Rec. Twenty-Second Asilomar Conf. on Signals, Syst. and Computers, 1988, pp. 375-378.

[6] M. A. P. Shenoy and R. Kumaresan, “A fast and accurate RNS scaling technique for high speed signal processing,” IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 37, no. 6, pp. 929-937, Jun. 1989.

[7] F. Barsi and M. C. Pinotti, “Fast base extension and precise scaling in RNS for look-up table implementations,” IEEE Trans. on Signal Processing, vol. 43, no. 10, pp. 2427-2430, Oct. 1995.

[8] Z. D. Ulman and M. Czyzak, “Highly parallel, fast scaling of numbers in nonredundant residue arithmetic,” IEEE Trans. on Signal Processing, vol. 46, no. 2, pp. 487-496, Feb. 1998.

[9] A. Garcia and A. Lloris, “A look-up scheme for scaling in the RNS,” IEEE Trans. on Computers, vol. 48, no. 7, pp. 748-751, Jul. 1999.

[10] Y. Wang, “New Chinese remainder theorems,” in Proc. 32nd Asilomar Conf. Signals, Systems, Computers, vol. 1, 1998, pp. 165-171.

[11] Y. Wang, X. Song, M. Aboulhamid, and H. Shen, “Adder based residue to binary converters for $(2^n - 1, 2^n, 2^n + 1)$,” IEEE Trans. on Signal Processing, vol. 50, no. 7, pp. 1772-1779, Jul. 2002.

[12] B. Cao, C. H. Chang, and T. Srikanthan, “A residue-to-binary converter for a new five-moduli set,” IEEE Trans. on Cir. and Syst.-I, vol. 54, no. 5, pp. 1041-1049, May 2007.

[13] G. Dimitrakopoulos, D. G. Nikolos, H. T. Vergos, D. Nikolos, and C. Efstathiou, “New architectures for modulo $2^N - 1$ adders,” in Proc. of the IEEE Int. Conf. on Electronics, Circuits, and Syst., Gammarth, Tunisia, Dec. 2005, pp. 1-4.

[Figure 1 block labels: inputs $x_1$, $x_2$, $x_3$; bit reshuffling; two $n$-bit CSAs with EAC; mod $2^n - 1$ adder; $n$-bit output $Y = y_1 = y_2 = y_3$.]
