High-performance, low-power architecture for scalable radix 2 Montgomery modular multiplication algorithm

152 CAN. J. ELECT. COMPUT. ENG., VOL. 34, NO. 4, FALL 2009

I Introduction

Much of telecommunications traffic requires the exchange of private information, e.g., electronic mail, electronic banking, medical databases, and electronic commerce. Some of the hardware components in these systems, such as smart cards [15] and hand-helds, are restricted in area and power resources. A key to accommodating these restrictions is maximizing the efficiency of the embedded cryptography. With broadening demand for secure communications, this is becoming an increasingly important design criterion for such devices.

Many cryptographic applications, such as the encryption/decryption operations of the RSA algorithm [16], the Digital Signature Standard [2], the Diffie-Hellman key exchange algorithm [5], and elliptic curve cryptography [7], make extensive use of modular multiplication and modular exponentiation. The modular exponentiation operation applies the modular multiplication operation repeatedly [10], [6], [4], [14], [1]. The performance of these cryptographic applications therefore depends strongly on the efficiency of the modular multiplication operation, which is why gains in this area benefit the design of the devices referred to above.
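As a minimal illustration of this dependence, the following Python sketch performs left-to-right binary (square-and-multiply) exponentiation and counts the modular multiplications it issues. The function name `modexp_count` and the counter are assumptions introduced here for illustration; they do not come from the cited works.

```python
def modexp_count(base, exponent, m):
    """Left-to-right square-and-multiply.

    Returns (base**exponent mod m, number of modular multiplications used).
    """
    result = 1 % m
    mults = 0
    for bit in bin(exponent)[2:]:          # scan exponent bits, MSB first
        result = (result * result) % m     # one squaring per exponent bit
        mults += 1
        if bit == "1":
            result = (result * base) % m   # one extra multiply per set bit
            mults += 1
    return result, mults


value, count = modexp_count(7, 65537, 1000003)
print(value, count)
```

Every exponent bit costs at least one modular multiplication, so for exponents of a thousand bits or more the cost of a single modular multiplication dominates the total run time.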

There are several approaches for computing the modular multiplication operation. The most efficient approach is the Montgomery Modular Multiplication (MM) algorithm [11], [28]. The main advantage of this algorithm over ordinary modular multiplication algorithms is that the modulus reduction of the partial product is done by shift operations that are easy to implement in hardware.
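The shift-based reduction can be sketched with a bit-serial radix 2 Montgomery multiplication on plain Python integers. This is an illustrative sketch, not the architecture of this paper: the function name `montgomery_radix2` is hypothetical, and the code assumes an odd modulus `m` below `2**n` and a multiplicand `b` smaller than `m`.

```python
def montgomery_radix2(a, b, m, n):
    """Bit-serial radix 2 Montgomery product.

    Returns s with s * 2**n congruent to a * b (mod m) and s < 2 * m.
    Assumes m is odd, 0 <= a < 2**n and 0 <= b < m.
    """
    assert m % 2 == 1 and 0 <= a < (1 << n) and 0 <= b < m
    s = 0
    for j in range(n):
        s += ((a >> j) & 1) * b   # add multiplicand if multiplier bit j is set
        if s & 1:                 # odd partial product: add the odd modulus
            s += m
        s >>= 1                   # exact halving, implemented as a right shift
    return s


s = montgomery_radix2(123, 200, 239, 8)
print(s, (s * 2**8 - 123 * 200) % 239)  # second value is 0 for a valid result
```

No trial division appears in the loop: adding the modulus only when the partial product is odd guarantees that the right shift never discards a nonzero bit.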

A number of fixed-precision Montgomery modular multipliers have been presented in [24], [9], [27], [12], [26], [18]. The hardware architectures of these modular multipliers were designed to deal with a fixed number of bits.

Several papers have been published on scalable Montgomery multipliers. The most important of these are by A. F. Tenca and Ç. Koç [20], [21], [23], [17]. In these publications, they introduced the word-based Montgomery multiplication algorithm to achieve scalability. However, they did not consider their hardware from the low-power-consumption point of view.

The goal of this work is to present a new processor array architecture for the scalable radix 2 Montgomery modular multiplication algorithm and to compare this architecture with the previous one extracted by Ç. Koç [8] in terms of speed, area, and power consumption.

This paper is organized as follows: Section II presents the Scalable Multiple-Word Radix 2 Montgomery Multiplication (MWR2MM) algorithm [8]. Section III describes the new processor array architecture and briefly discusses the previous architecture of Ç. Koç [8]. Section IV compares the two processor array architectures in terms of area, maximum speed, and power consumption. Finally, Section V concludes the paper.


This paper presents a new processor array architecture for the scalable radix 2 Montgomery modular multiplication algorithm. In this architecture, the multiplicand and the modulus words are allocated to each processing element rather than pipelined between the processing elements as in the previous architecture extracted by Ç. Koç. Also, the multiplier bits are fed serially to the first processing element of the processor array every odd clock cycle. By analyzing this architecture, we found that it has better performance (in terms of area and speed) and lower power consumption than the previous architecture extracted by Ç. Koç.


Keywords: processor array; Montgomery multiplication; cryptography; secure communications; low power modular multipliers

Atef Ibrahim, Fayez Gebali, Hamed El-Simary and Amin Nassar *


* Fayez Gebali is with the Department of Electrical and Computer Engineering, University of Victoria, Victoria, British Columbia V8W 3P6, Canada. Atef Ibrahim is with the same department at the University of Victoria, as well as the Microelectronics Department of Electronics Research Institute, Egypt; E-mails: {fayez, atef}@ece.uvic.ca.

Amin Nassar is also with Cairo University, in the Electronics and Electrical Communications Department; E-mail: [email protected]. Hamed El-Simary is with the Electronics Research Institute, Cairo, Egypt and also the College of Computer Engineering and Science, King Saud University, Saudi Arabia; E-mail: [email protected].


II MWR2MM algorithm

The notation used in this paper is as follows:

• $M$: modulus.

• $m_j$: a single bit of $M$ at position $j$.

• $A$: multiplier operand.

• $a_j$: a single bit of $A$ at position $j$.

• $B$: multiplicand operand.

• $n$: operand's precision.

• $R$: a constant (called the Montgomery parameter), $R = 2^n$.

• $qa_j$: coefficient that determines the multiples of the multiplicand $B$ ($qa_j \times B$).

• $qm_j$: coefficient that determines the multiples of the modulus $M$ ($qm_j \times M$).

• $S$: intermediate partial product, or final result of modular multiplication.

• $w$: word size (in number of bits) of either $B$, $M$ or $S$.

• $e = \lceil n/w \rceil$: number of words in either $B$, $M$ or $S$.

• $C_a, C_b$: carry bits.

• $(B^{(e-1)}, \ldots, B^{(1)}, B^{(0)})$: word vector of $B$.

• $(M^{(e-1)}, \ldots, M^{(1)}, M^{(0)})$: word vector of $M$.

• $(S^{(e-1)}, \ldots, S^{(1)}, S^{(0)})$: word vector of $S$.

• $S_{j(k-1 \ldots 0)}^{(i)}$: bits $k-1$ to $0$ from the $i$-th word of $S$ at iteration $j$.
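The word-vector notation can be made concrete with a short Python sketch that splits an operand into $e$ words of $w$ bits, least significant word first, so that list index `i` corresponds to the superscript $(i)$. The helper names `to_words` and `from_words` are hypothetical and introduced only for this illustration.

```python
def to_words(x, n, w):
    """Split an n-bit integer x into e = ceil(n / w) words of w bits each.

    Index i of the returned list holds word x^(i); index 0 is least significant.
    """
    e = -(-n // w)                      # integer ceiling of n / w
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(e)]


def from_words(words, w):
    """Reassemble the integer from its word vector."""
    return sum(word << (w * i) for i, word in enumerate(words))


words = to_words(0b10110, n=5, w=2)
print(words, from_words(words, 2))
```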

Fig. 1 shows the steps of the MWR2MM algorithm [8]. The final reduction step of the algorithm can be skipped, since several previously published techniques [25], [13], [3] avoid the subtraction in this step.

III Hardware implementation

In this section, we will extract the Dependency Graph (DG) of the MWR2MM algorithm and discuss the new processor array architecture (referred to in this paper as Design1) that implements this algorithm. Also, we will briefly mention the previous processor array architecture (referred to in this paper as Design2) described by [8], [22].

III.A Obtaining the algorithm dependency graph

The MWR2MM algorithm, Fig. 1, can be easily defined on a two-dimensional (2D) domain since there are two indices $(j, i)$. The DG is shown in Fig. 2. The computation domain is the convex hull in the 2D space, where the algorithm operations are defined as indicated by the greyed and black circles in the 2D plane. From Fig. 2 we notice that the black circle operation corresponds to the computation of algorithm steps 3, 4, 5 and 6; the greyed circle operation corresponds to the computation of algorithm steps 8, 9 and 10. Also, from this figure we notice that the input variables $B^{(i)}$ and $M^{(i)}$ are represented by horizontal lines, variables $qa_j$ and $qm_j$ are represented by vertical lines, and the output variable $S_j$ is represented by the diagonal lines.
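The node set of the DG can be enumerated with a few lines of Python. This is only a sketch of the index domain described above: the function name `dependency_graph` and the string labels are assumptions made for illustration, and the sketch does not model the edges of the graph.

```python
def dependency_graph(n, e):
    """Map every DG node (j, i) to the algorithm steps it executes.

    j is the multiplier bit index (0 .. n-1), i the word index (0 .. e-1).
    """
    nodes = {}
    for j in range(n):
        for i in range(e):
            if i == 0:
                nodes[(j, i)] = "black: steps 3-6"    # first word, generates qa_j, qm_j
            else:
                nodes[(j, i)] = "grey: steps 8-10"    # remaining words
    return nodes


dg = dependency_graph(n=5, e=5)   # the n = 5, w = 1 case of Fig. 2
print(len(dg), sum(label.startswith("black") for label in dg.values()))
```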

By using the methodology reported by the second author [29], we can explore all possible processor arrays of the scalable MWR2MM algorithm, Fig. 1. This methodology allows for four processor array configurations. Two of these processor arrays are suitable for efficient hardware implementation (Design1 and Design2); the other two are complex designs that are not suitable for hardware implementation, so we do not consider them further. Design1 (illustrated in Figs. 3, 4, 5 and 6) has not been reported before by any author, while Design2 (illustrated in Figs. 7 and 8) has been previously reported by Ç. Koç [8], [22].

Figure 2: MWR2MM algorithm dependency graph for $n = 5$ and $w = 1$.

Scalable MWR2MM algorithm

1. $S = 0$

2. for $j = 0$ to $n-1$ do

3.     $qa_j = a_j$

4.     $(C_a, S_j^{(0)}) = S_{j-1}^{(0)} + (qa_j \times B^{(0)})$

5.     $qm_j = S_{j(0)}^{(0)}$

6.     $(C_b, S_j^{(0)}) = S_j^{(0)} + (qm_j \times M^{(0)})$

7.     for $i = 1$ to $e-1$ do

8.         $(C_a, S_j^{(i)}) = S_{j-1}^{(i)} + (qa_j \times B^{(i)})$

9.         $(C_b, S_j^{(i)}) = S_j^{(i)} + (qm_j \times M^{(i)})$

10.         $S_j^{(i-1)} = (S_{j(0)}^{(i)},\ S_{j(w-1 \ldots 1)}^{(i-1)})$

11.     end for;

12.     $C_a = C_a$ or $C_b$

13.     $S_j^{(e-1)} = (C_a,\ S_{j(w-1 \ldots 1)}^{(e-1)})$

14. end for;

Figure 1: Scalable MWR2MM algorithm.
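The steps of Fig. 1 can be exercised in software with the following Python sketch, whose comments refer to the step numbers of the figure. It is a behavioural sketch under stated assumptions, not the hardware of this paper. The carries $C_a$ and $C_b$ are added explicitly as carry-ins of steps 8 and 9, as in the word-based formulation of [8]. The word count is chosen as $\lceil (n+2)/w \rceil$ so that the unshifted partial product always fits, which is an assumption of this sketch. The function name `mwr2mm` is hypothetical.

```python
def mwr2mm(A, B, M, n, w):
    """Word-serial radix 2 Montgomery product.

    Returns S with S * 2**n congruent to A * B (mod M) and S < 2 * M.
    Assumes M odd, M < 2**n, 0 <= A < 2**n and 0 <= B < M.
    """
    mask = (1 << w) - 1
    e = (n + 2 + w - 1) // w                     # assumption: room for S + B + M
    Bw = [(B >> (w * i)) & mask for i in range(e)]
    Mw = [(M >> (w * i)) & mask for i in range(e)]
    S = [0] * e                                  # step 1
    for j in range(n):                           # step 2
        qa = (A >> j) & 1                        # step 3
        t = S[0] + qa * Bw[0]                    # step 4
        ca, s0 = t >> w, t & mask
        qm = s0 & 1                              # step 5
        t = s0 + qm * Mw[0]                      # step 6
        cb, S[0] = t >> w, t & mask
        for i in range(1, e):                    # step 7
            t = S[i] + qa * Bw[i] + ca           # step 8, with carry-in Ca
            ca, si = t >> w, t & mask
            t = si + qm * Mw[i] + cb             # step 9, with carry-in Cb
            cb, S[i] = t >> w, t & mask
            # step 10: shift word i-1 right, pulling in the LSB of word i
            S[i - 1] = ((S[i] & 1) << (w - 1)) | (S[i - 1] >> 1)
        ca = ca | cb                             # step 12
        S[e - 1] = (ca << (w - 1)) | (S[e - 1] >> 1)   # step 13
    return sum(word << (w * i) for i, word in enumerate(S))


r = mwr2mm(123, 200, 239, n=8, w=4)
print(r, (r * 2**8 - 123 * 200) % 239)  # second value is 0 for a valid result
```

The result is independent of the word size $w$, which is the property that makes the algorithm scalable: the same operands can be processed by datapaths of different widths.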


III.B Design1

By using the DG of Fig. 2 and the method reported by the second author [29], we can obtain the new processor array architecture of the MWR2MM algorithm shown in Fig. 3. The processor array consists of $z = \lceil (e+1)/2 \rceil$ Processing Elements (PEs). Input words $B^{(2i)}$, $B^{(2i+1)}$, $M^{(2i)}$, $M^{(2i+1)}$ are allocated to processor PE $i$ and input $a_j$ is allocated to processor PE0; $qa_j$, $qm_j$ are generated inside PE0 and pipelined to the next PEs with higher indices. The intermediate output words $S_j^{(i)}$ of each PE are pipelined between adjacent PEs. A tristate buffer at the output of each PE ensures that it is the only output fed to the output bus.
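The allocation of words to PEs can be sketched as follows. The sketch assumes that the word indices run from 0 to $e$, with word $e$ being the all-zero extension word mentioned for the last PE, and that the number of PEs is the ceiling of $(e+1)/2$. The function name `design1_allocation` is hypothetical.

```python
def design1_allocation(e):
    """Return, for each PE i, the indices of the words (2i, 2i+1) it holds.

    Word indices run from 0 to e; the last PE holds a single word
    when e + 1 is odd.
    """
    z = (e + 2) // 2                     # integer ceiling of (e + 1) / 2
    return [tuple(k for k in (2 * i, 2 * i + 1) if k <= e) for i in range(z)]


print(design1_allocation(5))   # e + 1 even: every PE holds two words
print(design1_allocation(4))   # e + 1 odd: the last PE holds one word
```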

1) First PE architecture: Fig. 4 shows the block diagram of the first Processing Element (PE0). The main functional blocks are the two Carry Save Adders (CSAs) that perform steps 4, 6, 8, and 9 in the MWR2MM algorithm, Fig. 1. We used CSAs in our hardware implementation to keep the addition operations fast by preventing the propagation of the output carries. So, the intermediate partial product $S$ is represented in carry-save form as two bit vectors: SS (sum vector) and SC (carry vector). PE0 computes the first words $S_j^{(0)}$ (SSr, SCr) of the partial product $S$, and after a latency of $n$ clock cycles (last_cycl_in = 1) the result words $S_{n-1}^{(0)}$ ($OSS^{(0)}$, $OSC^{(0)}$) will be available on the output bus through the tristate buffers (Fig. 4).
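The carry-save principle can be illustrated with a small Python model of a 3:2 compressor and of the CSA1/CSA2 cascade. This is a behavioural sketch on unbounded integers: word-width truncation, the one-bit right shift and the registers of the PE are deliberately left out. The function names `csa` and `pe_add` are hypothetical.

```python
def csa(x, y, z):
    """3:2 carry-save adder: returns (sum_vector, carry_vector).

    Each bit position is handled independently, so no carry propagates
    along the word; sum_vector + carry_vector equals x + y + z.
    """
    sum_vec = x ^ y ^ z
    carry_vec = ((x & y) | (x & z) | (y & z)) << 1
    return sum_vec, carry_vec


def pe_add(ss, sc, qa_b, qm_m):
    """Cascade of CSA1 (adds qa*B) and CSA2 (adds qm*M) on a carry-save S."""
    sum_a, carry_a = csa(ss, sc, qa_b)      # CSA1 outputs sumA, carryA
    return csa(sum_a, carry_a, qm_m)        # CSA2 outputs sumB, carryB


sum_b, carry_b = pe_add(0b1011, 0b0110, 0b1101, 0b1111)
print(sum_b + carry_b == 0b1011 + 0b0110 + 0b1101 + 0b1111)
```

Because the sum and carry vectors are only combined at the very end, each addition costs a single full-adder delay regardless of the word size.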

As can be seen from the MWR2MM algorithm, Fig. 1, the multiplier $a_j$ is scanned one bit at a time, every odd clock cycle, to find the coefficient digit $qa_j$. The coefficient $qa_j$ controls a multiplexer (Fig. 4) and effectively generates $qa_j \times B^{(0)}$ in the odd clock cycles (odd_cycl = 1) and $qa_j \times B^{(1)}$ in the even clock cycles (odd_cycl = 0). Another multiplexer (controlled by signal odd_cycl) is used to select between $B^{(0)}$ and $B^{(1)}$ as inputs to the multiplexer controlled by coefficient $qa_j$ (Fig. 4). The word $qa_j \times B^{(0)}$ and the two words of $S_{j-1}^{(0)}$, (SSr, SCr) after being delayed by one clock cycle through the DFFs, are the inputs to CSA1 during odd clock cycles. On the other hand, the word $qa_j \times B^{(1)}$ and the two words of $S_{j-1}^{(1)}$, (SSf_in, SCf_in) from PE1, are the inputs to CSA1 during even clock cycles. Two multiplexers (controlled by control signal odd_cycl) are used to select between the words (SSr, SCr), after being delayed by one clock cycle, and (SSf_in, SCf_in) of the partial product $S$ (Fig. 4), as inputs to CSA1. The outputs of CSA1 are the two words sumA and carryA. The two words ($sumA^{(0)}$, $carryA^{(0)}$) from CSA1 and the word $qm_j \times M^{(0)}$ are the inputs to CSA2 during odd clock cycles. On the other hand, the two words from CSA1 ($sumA^{(1)}$, $carryA^{(1)}$) and $qm_j \times M^{(1)}$ are the inputs to CSA2 during even clock cycles. The outputs of CSA2 are the two words sumB and carryB.

In step 10 of the MWR2MM algorithm, Fig. 1, the partial product $S$ is right-shifted by one bit; this is done by storing the significant bits of sumB and carryB into DFFRs (DFFs with asynchronous reset). The right-shifted bit must be made zero before the shifting operation happens to avoid data loss. We can do that by adding $qm_j$ times the modulus $M$ to the partial product $S$ (steps 6 and 9). $qm_j$ is the Least Significant Bit (LSB) of the word sumA, denoted sumA(0). This bit is stored during the odd clock cycles (odd_cycl = 1) to be applied to the words of $S$ during the even clock cycles (odd_cycl = 0) (Fig. 4).

The two carry signals Car_out and Cbr_out are propagated to the next PE, where they are used as carry inputs for CSA1 and CSA2, respectively.

The two signals $qa_j$, $qm_j$ are propagated through DFFRs to the next PE.

2) Intermediate PE architecture: Fig. 5 shows the block diagram of the intermediate Processing Element (PE $i$). This PE computes the words $S_j^{(i)}$ (SSq, SCq), $S_j^{(i+1)}$ (SSr, SCr), $1 \le i \le e-2$, of the partial product $S$. The coefficient $qa_j$, received from the previous PE, controls a multiplexer and effectively generates $qa_j \times B^{(2i)}$ in the odd clock cycles (odd_cycl = 1) and $qa_j \times B^{(2i+1)}$ in the even clock cycles (odd_cycl = 0). Another multiplexer is used to select between $B^{(2i)}$ and $B^{(2i+1)}$ as inputs to the multiplexer controlled by $qa_j$ (Fig. 5). The word $qa_j \times B^{(2i)}$ and the two words of $S_{j-1}^{(2i)}$, (SSr, SCr) after being delayed by one clock cycle, are the inputs to CSA1 during odd clock cycles. On the other hand, the word $qa_j \times B^{(2i+1)}$ and the two words of $S_{j-1}^{(2i+1)}$, (SSf_in, SCf_in) from PE $i+1$, are the inputs to CSA1 during even clock cycles. Two multiplexers (controlled by signal odd_cycl) are used to select between the two words $S_{j-1}^{(2i)}$, (SSr, SCr) after being delayed by one clock cycle, and $S_{j-1}^{(2i+1)}$, (SSf_in, SCf_in) from the next PE, of the partial product $S$ as inputs to CSA1 (Fig. 5). The other two multiplexers (controlled by signal odd_cycl_delay) are used to select between the words $S_{j-1}^{(2i)}$, (SSr, SCr), and $S_{j-1}^{(2i-1)}$ (SSq_in, SCq_in) as inputs to the intermediate DFFRs (Fig. 5). The control signal odd_cycl_delay is the control signal odd_cycl delayed by one clock cycle. The outputs of CSA1 are the two words sumA and carryA. The two words ($sumA^{(2i)}$, $carryA^{(2i)}$) from CSA1 and the word $qm_j \times M^{(2i)}$ are the inputs to CSA2 during odd clock cycles.

On the other hand, the two words ($sumA^{(2i+1)}$, $carryA^{(2i+1)}$) from CSA1 and $qm_j \times M^{(2i+1)}$ are the inputs to CSA2 during even clock cycles. The outputs of CSA2 are the two words sumB and carryB. The significant bits of sumB and carryB are stored into DFFRs to implement the right shift operation. The result words $S_{n-1}^{(i)}$ ($OSS^{(i)}$, $OSC^{(i)}$) and $S_{n-1}^{(i+1)}$ ($OSS^{(i+1)}$, $OSC^{(i+1)}$) will be available on the output bus through tristate buffers after latencies of $(n+i)$ and $(n+i+1)$ clock cycles, respectively. The last PE (PE $z$) is the same as this intermediate PE if the number of operand words $e$ is odd (i.e., $e+1$ is even), while it can be simplified if $e$ is even (i.e., $e+1$ is odd), as will be seen in the next subsection.

Figure 3: Design1 processor array architecture.

Figure 4: Design1 first PE architecture.

Figure 5: Design1 intermediate PE architecture.

3) Last PE architecture when $(e+1)$ is odd: In this case the previous PE will operate only on the input words $B^{(e)} = 0$, $M^{(e)} = 0$, and $S^{(e)} = 0$. Since all of these inputs have a value of zero, the previous PE block diagram can be reduced to the PE block diagram shown in Fig. 6. This PE is only used to propagate the last carries $C_a$, $C_b$ during the last cycle of computation. The last result word $S_{n-1}^{(e-1)}$ ($OSS^{(e-1)}$, $OSC^{(e-1)}$) will be available on the output bus through tristate buffers after a latency of $(n+e-1)$.

We notice from the previous PEs of Design1 that the two signals $qa_j$ and $qm_j$ are delayed by two clock cycles in each PE, except the last one (PE $z$), before propagating to the next PE. This is done by propagating these coefficients through DFFRs.
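The effect of this two-cycle delay on the schedule can be sketched in Python. The sketch rests on assumptions that are not stated explicitly in the text: clock cycles are numbered from 1, multiplier bit $j$ enters PE0 in odd cycle $2j+1$, and each PE occupies one odd and one even cycle per multiplier bit. The function names are hypothetical. Under these assumptions the last PE finishes the last bit after $2n + 2(z-1)$ cycles, which agrees with the cycle count given for Design1 in Equation 4 of Section IV.

```python
def pe_start_cycle(j, i):
    """Odd clock cycle (1-based) in which PE i starts working on multiplier bit j.

    Assumes bit j enters PE0 in cycle 2j + 1 and each PE adds a two-cycle delay.
    """
    return 2 * j + 1 + 2 * i


def total_cycles(n, z):
    """Cycle in which the last PE (index z-1) finishes the last bit (j = n-1)."""
    # each bit occupies an odd cycle and the following even cycle in a PE
    return pe_start_cycle(n - 1, z - 1) + 1


print(total_cycles(1024, 4), 2 * 1024 + 2 * (4 - 1))
```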

III.C Design2

Fig. 7 shows the processor array architecture obtained by [8], [22]. The processor array consists of $z = \lceil (e+1)/2 \rceil$ PEs. Input words $B^{(i)}$, $M^{(i)}$ are pipelined through the PEs and input $a_j$ is allocated to each PE; $qa_j$, $qm_j$ are generated and used inside each PE. The intermediate output words $S_j^{(i)}$ of each PE are pipelined to the next PE with higher index. Fig. 8 shows the block diagram of the Design2 PE [8], [22].

IV Designs comparison

In this section, we compare the two designs, Design1 and Design2, in terms of area, speed, and power consumption.

IV.A Area estimation

The area of each design mainly depends on two design parameters: the number of architecture stages ($z$) and the word size ($w$) of the operands. The following area estimates for the basic functional elements of each PE are given by [19] in terms of 2-input NOR gates (the tristate buffer area, $A_{Tristate}$, was estimated by using the same target technology). [19] sets the target technology to AMI05 fast auto (0.5 µm CMOS with hierarchy preserved) provided in the ASIC Design Kit (ADK) from Mentor Graphics Corporation.

• $A_{DFF}$ (D flip-flop area) = 4.79,

• $A_{DFFR}$ (D flip-flop with asynchronous reset) = 5.92,

• $A_{REG}$ (D flip-flop with asynchronous reset and load enable) = 7.97,

• $A_{Mux}$ (2-input multiplexer area) = 1.4,

• $A_{FA}$ (full adder area) = 6,

• $A_{CSA}(w)$ ($w$-bit CSA area) = $w \times A_{FA}$,

• $A_{Tristate}$ (tristate buffer area) = 0.8.

Figure 7: Design2 processor array architecture.

Figure 8: Design2 MWR2MM PE architecture.

Figure 6: Design1 last PE architecture when $(e+1)$ is odd.


Using these area estimates for the basic elements, we can estimate the total area of Design1 and Design2 as:

• Design1 area estimation when $(e+1)$ is even: $A_{Design1} = 51.08zw + 50.56z - 5.40w - 30.42$ (1)

• Design1 area estimation when $(e+1)$ is odd: $A_{Design1} = 51.08zw + 50.56z - 43.04w - 39.52$ (2)

• Design2 area estimation: $A_{Design2} = 57.64zw + 52.46z$ (3)

Since $n$ and $w$ are chosen even in most cryptosystems, $(e+1)$ is odd in the cases of interest, so we use only Equation 2 in our comparison between the two designs.

Table 1 was constructed using Equations 2 and 3 and compares the total estimated area, in number of NOR gates, needed for Design1 and Design2, for $n = 1024$ and for different values of $z$ and $w$. Table 2 compares the total area, in number of CLBs, obtained from synthesizing the two designs using the Leonardo synthesis tool, from Mentor Graphics Corporation, for xc3s1600e-4fg400 (Spartan 3E) as a target technology and for $n = 1024$ and different values of $z$ and $w$. From Tables 1 and 2, we conclude that Design1 has a significant gain in reducing area over Design2 for different values of $z$ and $w$.

IV.B Computation time estimation

The total computation time for each design is equal to the product of the number of clock cycles it takes and the clock period. The critical path delay (clock period) depends on the number of PEs ($z$) and the word size ($w$) of the operands. It increases as the number of PEs ($z$) and/or the word size ($w$) increases (due to increases in the parasitic resistance and capacitance).

Table 3 shows the values of the critical path delay (clock period), as a function of the number of PEs ($z$) and the word size ($w$) of the operands, for Design1 and Design2. These values are obtained from synthesizing the VHDL code of each design using the Leonardo synthesis tool, from Mentor Graphics Corporation, for xc3s1600e-4fg400 (Spartan 3E) as a target technology.

Equations 4 and 5 represent the total number of clock cycles needed for Design1 and Design2, respectively:

$TC_{Design1} = 2n + 2(z-1)$ (4)

$TC_{Design2} = \lceil n/z \rceil (2z+1) + e + 1$ (5)

Table 4 compares the total computation time, in µs, needed for Design1 and Design2 for $n = 1024$ and different values of $z$ and $w$. We conclude from Table 4 that Design1 has a small gain in reducing the total computation time over Design2 for different values of $z$ and $w$.

IV.C Power estimation

Table 5 shows the values of power consumption in mW for the two designs for $n = 1024$ and different values of $z$ and $w$. These values are obtained using the “XPower Analyzer” tool from Xilinx Corporation.

Table 1: Comparison between the total estimated area in number of NOR gates for MWR2MM Design1 and Design2 with $n = 1024$, $P = \left| \frac{A_{Design1} - A_{Design2}}{A_{Design2}} \right|$.

Table 2: Comparison between the total area in number of CLBs for MWR2MM Design1 and Design2 with $n = 1024$, $P = \left| \frac{A_{Design1} - A_{Design2}}{A_{Design2}} \right|$.

Table 3: Critical path delay (ns) for Design1 and Design2 of MWR2MM processor arrays with $n = 1024$.

Table 4: Comparison between the total computation time (µs) for MWR2MM Design1 and Design2 with $n = 1024$, $Q = \left| \frac{T_{Design1} - T_{Design2}}{T_{Design2}} \right|$.

Table 5: Comparison between the power consumption (mW) for MWR2MM Design1 and Design2 with $n = 1024$ (at 10 MHz and 1.8 V), $R = \left| \frac{P_{Design1} - P_{Design2}}{P_{Design2}} \right|$.
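The closed-form estimates of this section can be evaluated with the short Python sketch below. It encodes Equations 1 to 5 and the relative-gain measure used in the table captions. The function names are hypothetical, the ceiling in Equation 5 is a reading of the printed formula that should be treated as an assumption, and the sketch does not reproduce the synthesized values of Tables 2 to 5.

```python
def area_design1(z, w, e_plus_1_even=False):
    """Design1 area in 2-input NOR gates: Eq. (1) if e+1 is even, else Eq. (2)."""
    if e_plus_1_even:
        return 51.08 * z * w + 50.56 * z - 5.40 * w - 30.42
    return 51.08 * z * w + 50.56 * z - 43.04 * w - 39.52


def area_design2(z, w):
    """Design2 area in 2-input NOR gates, Eq. (3)."""
    return 57.64 * z * w + 52.46 * z


def cycles_design1(n, z):
    """Total clock cycles of Design1, Eq. (4)."""
    return 2 * n + 2 * (z - 1)


def cycles_design2(n, z, e):
    """Total clock cycles of Design2, Eq. (5); the ceiling is an assumption."""
    return -(-n // z) * (2 * z + 1) + e + 1


def relative_gain(design1_value, design2_value):
    """Relative difference |D1 - D2| / D2, as in the table captions."""
    return abs(design1_value - design2_value) / design2_value


z, w, n = 4, 8, 1024
e = -(-n // w)                      # number of words, ceil(n / w)
print(area_design1(z, w), area_design2(z, w))
print(relative_gain(area_design1(z, w), area_design2(z, w)))
print(cycles_design1(n, z), cycles_design2(n, z, e))
```

Multiplying a cycle count by the corresponding critical path delay of Table 3 gives the total computation time compared in Table 4.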


The power estimates use the xc3s1600e-5fg484 (Spartan 3E) FPGA as target technology. The improvement of Design1 over Design2 is also shown in Table 5. We notice from Table 5 that Design1 has a reasonable gain in reducing power consumption over Design2. This reduction in power consumption achieved by Design1 can be attributed to its reduced area, for different values of $z$ and $w$, as seen in Table 2.

V Conclusion

This paper presented a new processor array architecture for the scalable MWR2MM algorithm. In this architecture, the multiplicand and the modulus words are allocated to each processing element rather than pipelined between the processing elements as in the previous architecture extracted by Ç. Koç, and the multiplier bits are fed serially to the first processing element of the processor array every odd clock cycle. We analyzed the two architectures (Design1 and Design2) and compared them in terms of area, speed and power consumption. We found that the new architecture has better performance and lower power consumption than the previous architecture extracted by Ç. Koç.

References

[1] J. Bajard, L. Didier, and P. Kornerup, “An RNS Montgomery modular multiplication algorithm,” in IEEE Micro, 47(7):766--776, July 1998.

[2] National Institute of Standards and Technology, “Digital signature standard (DSS),” in FIPS PUB 186-2, Jan. 2000.

[3] G. Hachez and J. Quisquater, “Montgomery exponentiation with no final subtractions: Improved results,” in Ç. Koç, D. Naccache, and C. Paar, editors, Cryptographic Hardware and Embedded Systems, Lecture Notes in Computer Science No. 1965, Springer, Berlin, Germany, pages 293--301, 2000.

[4] T. Hamano, “O(n)-depth circuit algorithm for modular exponentiation,” in IEEE 12th Symp. on Computer Arithmetic, IEEE Computer Society Press, Los Alamitos, CA, pages 188--192, 1995.

[5] W. Diffie and M. Hellman, “New directions in cryptography,” in IEEE Trans. on Information Theory, 22(6):644--654, Nov. 1976.

[6] B. Kaliski, Ç. Koç, and T. Acar, “Analysing and comparing Montgomery multiplication algorithms,” in IEEE Micro, 16(3):26--33, June 1996.

[7] N. Koblitz, “Elliptic curve cryptosystems,” in Mathematics of Computation, 48(177):203--209, Jan. 1987.

[8] Ç. Koç and A. Tenca, “A word-based algorithm and architecture for Montgomery multiplication,” in Ç. Koç, D. Naccache, and C. Paar, editors, Cryptographic Hardware and Embedded Systems, Lecture Notes in Computer Science No. 1717, Springer, Berlin, Germany, pages 94--108, 1999.

[9] P. Kornerup, “A systolic, linear-array multiplier for a class of right-shift algorithms,” in IEEE Trans. on Computers, 43(8):892--898, August 1994.

[10] A. Menezes, Applications on Finite Fields, Kluwer Academic Publishers, Boston, MA, pages 113--116, 1993.

[11] P. Montgomery, “Modular multiplication without trial division,” in Mathematics of Computation, 44(170):519--521, April 1985.

[12] P. Noel, X. Wang, and T. Kwasniewski, “Low power design techniques for a Montgomery modular multiplier,” in Proc. of International Symp. Intelligent Signal Processing and Communication Systems, pages 449--452, Dec. 2005.

[13] H. Orup, “Simplifying quotient determination in high-radix modular multiplication,” in Proc. 12th IEEE Symp. Computer Arithmetic, pages 193--199, July 1995.

[14] C. Paar and T. Blum, “Montgomery modular exponentiation on reconfigurable hardware,” in IEEE 14th Symp. on Computer Arithmetic, IEEE Computer Society Press, Los Alamitos, CA, pages 70--77, 1999.

[15] D. Raihi and L. Naccache, “Cryptographic smart cards,” in IEEE Micro, 16(3):14--23, June 1996.

[16] R. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signatures and public-key cryptosystems,” in Comm. ACM, 21(2):120--126, Feb. 1978.

[17] E. Savas, A. Tenca, M. Ciftcibasi, and Ç. Koç, “Multiplier architectures for GF(p) and GF($2^n$),” in IEEE Proc. on Computers and Digital Techniques, 151(2):147--160, March 2004.

[18] N. Takagi, “A radix 4 modular multiplication hardware algorithm for modular exponentiation,” in IEEE Trans. on Computers, 41(8):949--955, August 1992.

[19] L. Tawalbeh and A. Tenca, “Radix-4 ASIC design of a scalable Montgomery modular multiplier using encoding techniques,” Master’s thesis, Oregon State University, USA, 2002.

[20] A. Tenca and Ç. Koç, “A scalable architecture for modular multiplication based on Montgomery’s algorithm,” in IEEE Trans. on Computers, 52(9):1215--1221, Sep. 2003.

[21] A. Tenca, E. Savas, and Ç. Koç, “A design framework for scalable and unified architectures that perform multiplication in GF(p) and GF($2^m$),” in International Journal of Computer Research, 13(1):68--83, 2004.

[22] G. Todorov and A. Tenca, “ASIC design, implementation and analysis of a scalable high-radix Montgomery multiplier,” Master’s thesis, Oregon State University, USA, 2000.

[23] T. Todorov, A. Tenca, and Ç. Koç, “High-radix design of a scalable modular multiplier,” in Ç. Koç, D. Naccache, and C. Paar, editors, Cryptographic Hardware and Embedded Systems, Lecture Notes in Computer Science No. 2162, Springer Verlag, Berlin, Germany, pages 189--205, 2001.

[24] C. Walter, “Systolic modular multiplication,” in IEEE Trans. on Computers, 42(3):376--378, March 1993.

[25] C. Walter, “Montgomery exponentiation needs no final subtractions,” in Electronics Letters, 35(21):1831--1832, Oct. 1999.

[26] C. Walter, “Improved linear systolic array for fast modular exponentiation,” in IEEE Proc. on Computers Digital Technique, 147(5):323--328, Sep. 2000.

[27] S. Wang, W. Tsai, and C. Shung, “Two systolic architectures for Montgomery multiplication,” in IEEE Trans. on VLSI Systems, 8(1):103--107, Feb. 2000.

[28] T. Yanik, E. Savas, and Ç. Koç, “Incomplete reduction in modular arithmetic,” in Mathematics of Computation, 149(2):46--54, March 2002.

[29] F. El-Guibaly and A. Tawfik, “Mapping 3D IIR digital filter onto systolic arrays,” in Multidimensional Systems and Signal Processing, vol. 7, no. 1, pp. 7--26, Jan. 1996.

Atef Ibrahim received the BSc degree in Electronics Engineering from Mansoura University, Egypt, in 1998 and the MSc degree in Electronics and Electrical Communications from Cairo University, Egypt, in 2004. He received the PhD from the Electronics and Electrical Communications Department of Cairo University, Egypt, in 2009. He is currently a post-doctoral visitor student in the Electrical and Computer Engineering Department of the University of Victoria, Canada, and is also with the Microelectronics Department of the Electronics Research Institute, Egypt. His research interests include computer arithmetic, cryptography, and VLSI design.

Fayez Gebali received the BSc degree in electrical engineering (first class honors) from Cairo University, the BSc degree in mathematics (first class honors) from Ain Shams University, and the PhD degree in electrical engineering from the University of British Columbia, where he was a holder of an NSERC postgraduate scholarship. Dr. Gebali is a professor of computer engineering at the University of Victoria. His research interests include multicore processors, computer communications, and computer arithmetic. He is a senior member of IEEE.

Hamed El-Simary is a professor at the VLSI Department, Electronics Research Institute, Cairo, Egypt, and is currently on leave, working as a professor at the College of Computer Engineering and Science, King Saud University, Alkharj, Saudi Arabia. His research interests include low-power circuit design and computer architecture.

Amin Nassar is a professor at the Electronics and Electrical Communications Department, Faculty of Engineering, Cairo University, Cairo, Egypt. His research interests include computer-aided design, microprocessors and interfacing, and industrial electronics.
