High performance scalable elliptic curve cryptosystem processor for Koblitz curves

Microprocessors and Microsystems 37 (2013) 394–406


K.C. Cinnati Loi, Seok-Bum Ko ⇑

Department of Electrical and Computer Engineering, University of Saskatchewan, Saskatoon, SK, Canada

Article info

Article history: Available online 6 April 2013

Keywords: FPGA; Elliptic Curve Cryptography (ECC); Koblitz curves; Scalable ECC processor; Finite field arithmetic

0141-9331/$ - see front matter Crown Copyright © 2013 Published by Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.micpro.2013.03.003

⇑ Corresponding author. Tel.: +1 306 966 5456. E-mail address: [email protected] (S.-B. Ko).

Abstract

A scalable elliptic curve cryptography (ECC) processor is presented in this paper. The proposed ECC processor supports all five Koblitz curves recommended by the National Institute of Standards and Technology (NIST) without the need to reconfigure the FPGA. The paper proposes a finite field arithmetic unit (FFAU) that reduces the number of clock cycles required to compute the elliptic curve point multiplication (ECPM) operation for ECC. The paper also presents an improved point addition (PADD) algorithm to take advantage of the novel FFAU architecture. A scalable ECC processor (ECP) that is implemented completely in hardware and makes use of the novel PADD algorithm and FFAU is also presented.

The design is synthesized and implemented for a target Virtex-4 XC4VFX12 FPGA. It uses 2431 slices, 1219 slice registers, 3815 four-input look-up tables (LUT) and can run at a maximum frequency of 155.376 MHz. The proposed design is the fastest scalable ECP that supports all five Koblitz curves known to the authors, as it evaluates the ECPM for K-163 in 0.273 ms, K-233 in 0.604 ms, K-283 in 0.735 ms, K-409 in 1.926 ms and K-571 in 4.335 ms. The proposed design is suitable for server-side security applications where both high speed and scalability are important design factors.

Crown Copyright © 2013 Published by Elsevier B.V. All rights reserved.

1. Introduction

Elliptic Curve Cryptosystems (ECC) were independently proposed by Miller [1] and Koblitz [2] in the 1980s. The advantage of ECC over the more commonly used Rivest, Shamir, Adleman (RSA) [3] algorithm for public-key cryptography is the reduced key size that ECC allows for, while providing a similar level of security. The shorter key sizes allow implementations of ECC to be more efficient, either in terms of higher throughput or lower area. Due to its many advantages, ECC has been adopted by many standards, such as NIST [4], SEC [5], and FIPS 186-3 [6]. The most time-consuming operation in ECC protocols is the elliptic curve point multiplication (ECPM). Thus, much of the research in the current literature focuses on optimizing the ECPM operation. Due to their ability for quick prototyping and ease of reconfiguration, field-programmable gate arrays (FPGAs) are a popular choice of implementation platform for ECC processors, which act as coprocessors that run alongside the software implementing the protocol-level operations. This paper proposes an architecture for a scalable ECC processor (ECP) using the FPGA as a target platform.

Koblitz curves are a special type of elliptic curve in GF(2^m) that replaces the point doubling (PDBL) operations with finite field squaring (FFSQ) operations, which are much simpler, by converting the scalar multiplier, k, into τ-non-adjacent form (τNAF) [7]. This paper focuses only on the implementation of an ECC processor for Koblitz curves, but the processor architecture can be adapted for pseudo-random curves as well.

In 2009, Hassan and Benaissa [8] proposed a scalable ECP that uses the PicoBlaze soft-core microcontroller in Xilinx FPGAs to implement the design using the hardware/software co-design (HSC) approach. The motivation is to design a processor that can handle the elliptic curves up to the 193-bit key size suggested in Standards for Efficient Cryptography Group (SECG) [5] without the need to reconfigure the hardware. The design goal is to reduce area consumption for area-constrained platforms, such as RFID, mobile handsets, smart cards, and wireless sensor networks [8]. Since then, Hassan and Benaissa have also proposed scalable designs that support curves up to 571 bits recommended by the National Institute of Standards and Technology (NIST) [9,10], also for area-constrained environments. The advantage of a scalable design is its ability to modify the key size at run-time, which is applicable for security protocols like IPSec, where the ECC parameters are negotiated at run-time [11].

HSC has the advantage of flexibility, since it allows software to implement the complex control of the data flow. However, the use of the PicoBlaze microcontroller is also one of its main limitations. The microcontroller causes the processor to require many clock cycles to load and store intermediate values. Furthermore, the proposed designs in [8–10], due to their low-area approach, result in ECC processors that have long latencies in the calculation of the ECPM. The latencies presented in [8–10] are in the order of tens and hundreds of milliseconds, whereas the most efficient non-scalable designs have latencies in the order of microseconds [12,13].

Table 1. Variables used in Section 2.1.

| Variable | Description |
|---|---|
| m | Key length or length of finite field vector (163, 233, 283, 409, 571) |
| w | Digit size (32) |
| s | Number of digits = ⌈m/w⌉ |
| i, j | Indexes |
| A(t), B(t), Z(t) | A finite field element in polynomial basis representation; A(t) = (a_{m-1}, ..., a_1, a_0), where a_i is binary, and similarly for B(t) and Z(t) |
| C(t) | Result of A(t) · B(t); has length 2m - 1 |
| P(t) | Irreducible polynomial |
| A = (A_{s-1}, ..., A_1, A_0) | Each A_i is a binary vector of size w, and similarly for B and Z |
| U, V | Binary vectors of size w; (U,V) is the concatenation of the two vectors |
| T_0, ..., T_3 | Temporary binary vectors of size w |

The scalable designs in [8–10] are proposed for area-constrained platforms. However, scalability is a feature that is more important on the server side. The server-side hardware must be able to support multiple key sizes in order to handle a large number of requests and must be able to handle them quickly. Thus, a high-speed design is also preferred for use in server-side applications.

The long latencies of the HSC approach presented in [8–10] are not suitable for server-side requirements, as Cohen and Parhi [14] suggest that server-side applications may require at least 100 ECPMs per second, i.e. at most 10 ms per ECPM. Thus, in this paper, the complete design is implemented in hardware in order to show that by implementing the control signals and the temporary storage in hardware, the timing performance improves immensely with only a slight increase in hardware utilization. The proposed design can support all five Koblitz curves recommended by NIST [4], which are the target curves used in [10]. Furthermore, the timing improvement of the proposed design is not simply a result of transferring the software portions in [10] to hardware. This paper also proposes a novel architecture of a scalable finite field arithmetic unit (FFAU), which optimizes the speed performance of the ECPM operation, and proposes a modified point addition (PADD) algorithm to utilize the proposed FFAU.

The main contribution of this paper is the proposed architecture of a novel scalable FFAU to optimize the ECPM. By modifying the PADD algorithm to utilize the FFAU, a more efficient scalable ECP architecture is presented. The proposed design is suitable for server-side applications due to its higher speed and its scalability, as it supports all the Koblitz curves (K-163, K-233, K-283, K-409, K-571) recommended by NIST [4]. Furthermore, to the authors' best knowledge, this paper presents the first scalable ECP implemented completely in hardware that can support all five Koblitz curves (i.e. up to 571 bits) recommended by NIST [4].

The rest of this paper is organized as follows: Section 2 reviews the basics of finite field operations and elliptic curve cryptography in Koblitz curves, and describes the novel point addition algorithm proposed in this paper; Section 3 discusses the hardware architecture and implementation of the scalable ECP; Section 4 discusses the implementation results; and Section 5 concludes the paper.

2. Elliptic curve cryptography in Koblitz curves

This section is organized into two subsections. Firstly, the basics of finite field operations are reviewed, including the algorithms used by the scalable ECP proposed in this paper. A revised reduction algorithm is used to output the result of the FFAU least-significant-digit (LSD)-first. Secondly, the basics of elliptic curve cryptography in Koblitz curves are reviewed and a novel PADD algorithm that improves the hardware implementation of the proposed scalable ECP is presented.

2.1. Scalable finite field operations

The elliptic curves discussed in this paper are defined over binary fields using polynomial basis. Thus, binary finite field (FF) operations are the fundamental building blocks in implementing ECC operations. These FF operations include FF addition (FFADD), FF squaring (FFSQ), FF multiplication (FFMULT) and FF inversion (FFINV). Out of these operations, FFADD is the most trivial and can be implemented using a bit-wise exclusive-OR (XOR) operation. FFINV is the most complex operation. However, using the Itoh–Tsujii algorithm [15], FFINV can be accomplished with a series of FFMULT and FFSQ. Thus, the most complicated operations that need to be implemented in a FF arithmetic unit (FFAU) are FFMULT and FFSQ.
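The reduction of FFINV to FFMULT and FFSQ can be illustrated in software. The following is a minimal Python sketch, not the authors' hardware: it assumes the K-163 field, uses a simple bit-serial multiplier in place of the FFAU, and the function names `ff_mul`, `ff_sq` and `ff_inv` are hypothetical. It computes the inverse as a^(2^m - 2) with an Itoh–Tsujii style addition chain.

```python
M = 163
# Irreducible polynomial assumed for K-163: t^163 + t^7 + t^6 + t^3 + 1
P_POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def ff_mul(a, b):
    """FFMULT in GF(2^163): bit-serial shift-and-add with interleaved reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a              # FFADD is a bit-wise XOR
        b >>= 1
        a <<= 1
        if a >> M:              # degree reached m: XOR in P(t) to reduce
            a ^= P_POLY
    return r

def ff_sq(a):
    """FFSQ, computed here with the multiplier for brevity."""
    return ff_mul(a, a)

def ff_inv(a):
    """FFINV as a^(2^m - 2), built only from FFMULT and FFSQ."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse")
    beta, k = a, 1                      # invariant: beta = a^(2^k - 1)
    for bit in bin(M - 1)[3:]:          # bits of m - 1 after the leading 1
        t = beta
        for _ in range(k):
            t = ff_sq(t)                # t = beta^(2^k)
        beta, k = ff_mul(t, beta), 2 * k
        if bit == "1":
            beta, k = ff_mul(ff_sq(beta), a), k + 1
    return ff_sq(beta)                  # (a^(2^(m-1) - 1))^2 = a^(2^m - 2)
```

In this sketch the squarings dominate the operation count and only a handful of multiplications are needed, which is why an FFAU that supports FFMULT and FFSQ is sufficient for inversion.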

In this section, upper case variable names are used for vectors and lower case variable names are used for scalar values. The variables used in this section and their respective descriptions are listed in Table 1.

In general, hardware FFMULT implementations can be divided into three categories: bit-serial, bit-parallel and digit-serial. Bit-serial implementations consume the least hardware resources and are scalable, but require m clock cycles per multiplication [16]. Bit-parallel implementations require only 1 clock cycle, but generally require more hardware resources and are more difficult to adapt to multiple fields simultaneously [17]. Digit-serial implementations are a compromise between the bit-serial and bit-parallel implementations [18]. They allow the multiplier to support fields of different bit lengths by simply processing a different number of digits, so they are very suitable for a scalable design like the one presented in this paper.

Digit-serial FFMULT can be defined as follows:

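$$
\begin{aligned}
Z(t) &= [A(t) \cdot B(t)] \bmod P(t) = \left[\sum_{i=0}^{s-1} a_i t^{iw} \cdot \sum_{j=0}^{s-1} b_j t^{jw}\right] \bmod P(t) \\
&= \left[\sum_{i=0}^{s-1} \sum_{j=0}^{s-1} a_i b_j t^{(i+j)w}\right] \bmod P(t) = C(t) \bmod P(t)
\end{aligned}
\tag{1}
$$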

The implementation of (1) requires s^2 w-bit multiplications. In this paper, FFMULT is implemented using the Comba algorithm [19] and w is chosen to be 32. The Comba algorithm is a type of digit-serial multiplication, and it is chosen in this paper to implement the scalable FFAU because it outputs the result from the least-significant digit (LSD) to the most-significant digit (MSD), as shown in Algorithm 1. Since it outputs the results LSD-first, it can be combined with the reduction unit to evaluate the modulo P(t) portion of the FFMULT operation.


Algorithm 1. Comba multiplication for GF(2^m) (Modified from Algorithm 2 in [20])

INPUT: A = (A_{s-1}, ..., A_1, A_0) and B = (B_{s-1}, ..., B_1, B_0)
OUTPUT: Z = A · B = (Z_{2s-1}, ..., Z_1, Z_0)

(U,V) ← (0,0)
for i from 0 to s - 1 do
    for j from 0 to i do
        (U,V) ← (U,V) + A_j · B_{i-j}
    end for
    Z_i ← V
    V ← U, U ← 0
end for
for i from s to 2s - 2 do
    for j from i - s + 1 to s - 1 do
        (U,V) ← (U,V) + A_j · B_{i-j}
    end for
    Z_i ← V
    V ← U, U ← 0
end for
Z_{2s-1} ← V
return Z = (Z_{2s-1}, ..., Z_1, Z_0)
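The column-wise structure of Algorithm 1 can be sketched in Python as follows. This is an illustrative software model under the assumption w = 32, with Python integers standing in for the (U,V) register; the helper names `clmul`, `to_digits`, `from_digits` and `comba_mul` are hypothetical and do not come from the paper.

```python
W = 32                       # digit size w used in the paper
MASK = (1 << W) - 1

def clmul(a, b):
    """Carry-less product of two polynomials over GF(2), stored as integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def to_digits(x, s):
    """Split x into s digits of W bits, least-significant digit first."""
    return [(x >> (W * i)) & MASK for i in range(s)]

def from_digits(digits):
    """Inverse of to_digits."""
    x = 0
    for i, d in enumerate(digits):
        x |= d << (W * i)
    return x

def comba_mul(a_digits, b_digits):
    """Unreduced product, produced one output digit at a time (LSD first)."""
    s = len(a_digits)
    z = [0] * (2 * s)
    uv = 0                               # (U,V): V is the low W bits
    for i in range(2 * s - 1):
        # both loops of Algorithm 1: j runs over every index with 0 <= j, i - j < s
        for j in range(max(0, i - s + 1), min(i, s - 1) + 1):
            uv ^= clmul(a_digits[j], b_digits[i - j])
        z[i] = uv & MASK                 # Z_i <- V
        uv >>= W                         # V <- U, U <- 0
    z[2 * s - 1] = uv & MASK             # Z_{2s-1} <- V
    return z
```

Because digit Z_i is final as soon as column i has been accumulated, a reduction stage can start consuming the upper digits while later columns are still being computed.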

FFSQ can be implemented using the FFMULT operation, but it can be implemented much more simply by using the following property:

$$
A(t)^2 = \left(a_{m-1} t^{2m-2} + \cdots + a_1 t^2 + a_0\right) \bmod P(t), \tag{2}
$$

which is simply interleaving 0 bits and operand bits.

The modulo P(t) operation, required by both FFMULT and FFSQ, is called reduction. The reduction operation used in this paper is based on the reduction algorithms presented in Algorithms 2.41–2.45 in [21]. However, the reduction used in this paper evaluates LSD-first, instead of MSD-first as shown in [21], so that it can be combined with FFMULT using the Comba algorithm. During the Comba algorithm for FFMULT, the first for loop in Algorithm 1 produces the lower s digits of the product, which are not reduced. The second for loop generates the upper s digits, which are reduced. Similarly, FFSQ is performed digit-wise, so each digit is squared by interleaving 0s between its bits and only the upper s digits need to be reduced. Due to the change to LSD-first reduction, additional steps must be performed to reduce the result of FFMULT or FFSQ. For example, in the 233-bit reduction, s = ⌈233/32⌉ = 8 and, according to Algorithm 2.42 in [21], the 8th digit is reduced by adding a shifted version of the 8th digit to the 0th, 1st, 3rd and 4th digits, the 9th digit is shifted and added to the 1st, 2nd, 4th and 5th digits, and so on. When reducing the 12th digit, it is shifted and added to the 4th, 5th, 7th and 8th digits, but the 8th digit has already been reduced before and needs to be reduced again because it is beyond the boundary of the reduced result of s digits. The reduction of the 13th, 14th and 15th digits has the same problem.

Furthermore, in this paper, the output of the FFMULT and FFSQ is never completely reduced, in order to improve efficiency. Since FFMULT and FFSQ are performed digit-wise, the reduction is only completed to the border of the digit. For example, in 233-bit mode, s = ⌈233/32⌉ = 8, so the reduction operation reduces the digits (Z_15, ..., Z_9, Z_8), but does not reduce the 23 most-significant bits (MSB) in Z_7, making all the intermediate results 8 × 32 = 256 bits instead of 233 bits. The final reduction step, which reduces the 23 MSBs of Z_7, occurs after all the calculations are performed and just before the result is output. The proposed reduction algorithm is shown in Algorithms 2–6, whose steps occur after the Z_i ← V operation in the second for loop in Algorithm 1. The final reduction is not shown as it is the same as the final steps in Algorithms 2.41–2.45 in [21].

Algorithm 2. Reduction by P(t) = t^163 + t^7 + t^6 + t^3 + 1

INPUT: Z = (Z_11, ..., Z_1, Z_0)
OUTPUT: Z mod P(t) = (Z_5, ..., Z_1, Z_0) (without final reduction)

for i from 6 to 11 do
    if i ≤ 9 then
        (Z_{i-4}, Z_{i-5}, Z_{i-6}) = (Z_{i-4}, Z_{i-5}, Z_{i-6}) + (Z_i ≪ 29) + (Z_i ≪ 32) + (Z_i ≪ 35) + (Z_i ≪ 36)
    else if i = 10 then
        (T_0, Z_5, Z_4) = (0, Z_5, Z_4) + (Z_i ≪ 29) + (Z_i ≪ 32) + (Z_i ≪ 35) + (Z_i ≪ 36)
        (Z_2, Z_1, Z_0) = (Z_2, Z_1, Z_0) + (T_0 ≪ 29) + (T_0 ≪ 32) + (T_0 ≪ 35) + (T_0 ≪ 36)
    else // i = 11
        (T_1, T_0, Z_5) = (0, 0, Z_5) + (Z_i ≪ 29) + (Z_i ≪ 32) + (Z_i ≪ 35) + (Z_i ≪ 36)
        (Z_2, Z_1, Z_0) = (Z_2, Z_1, Z_0) + (T_0 ≪ 29) + (T_0 ≪ 32) + (T_0 ≪ 35) + (T_0 ≪ 36)
        (Z_3, Z_2, Z_1) = (Z_3, Z_2, Z_1) + (T_1 ≪ 29) + (T_1 ≪ 32) + (T_1 ≪ 35) + (T_1 ≪ 36)
    end if
end for
return Z = (Z_5, ..., Z_1, Z_0)

Algorithm 3. Reduction by P(t) = t^233 + t^74 + 1

INPUT: Z = (Z_15, ..., Z_1, Z_0)
OUTPUT: Z mod P(t) = (Z_7, ..., Z_1, Z_0) (without final reduction)

for i from 8 to 15 do
    if i ≤ 11 then
        (Z_{i-4}, Z_{i-5}, Z_{i-6}, Z_{i-7}, Z_{i-8}) = (Z_{i-4}, Z_{i-5}, Z_{i-6}, Z_{i-7}, Z_{i-8}) + (Z_i ≪ 23) + (Z_i ≪ 97)
    else if i = 12 then
        (T_0, Z_7, Z_6, Z_5, Z_4) = (0, Z_7, Z_6, Z_5, Z_4) + (Z_i ≪ 23) + (Z_i ≪ 97)
        (Z_4, Z_3, Z_2, Z_1, Z_0) = (Z_4, Z_3, Z_2, Z_1, Z_0) + (T_0 ≪ 23) + (T_0 ≪ 97)
    else if i = 13 then
        (T_1, T_0, Z_7, Z_6, Z_5) = (0, 0, Z_7, Z_6, Z_5) + (Z_i ≪ 23) + (Z_i ≪ 97)
        (Z_4, Z_3, Z_2, Z_1, Z_0) = (Z_4, Z_3, Z_2, Z_1, Z_0) + (T_0 ≪ 23) + (T_0 ≪ 97)
        (Z_5, Z_4, Z_3, Z_2, Z_1) = (Z_5, Z_4, Z_3, Z_2, Z_1) + (T_1 ≪ 23) + (T_1 ≪ 97)
    else if i = 14 then
        (T_2, T_1, T_0, Z_7, Z_6) = (0, 0, 0, Z_7, Z_6) + (Z_i ≪ 23) + (Z_i ≪ 97)
        (Z_4, Z_3, Z_2, Z_1, Z_0) = (Z_4, Z_3, Z_2, Z_1, Z_0) + (T_0 ≪ 23) + (T_0 ≪ 97)
        (Z_5, Z_4, Z_3, Z_2, Z_1) = (Z_5, Z_4, Z_3, Z_2, Z_1) + (T_1 ≪ 23) + (T_1 ≪ 97)
        (Z_6, Z_5, Z_4, Z_3, Z_2) = (Z_6, Z_5, Z_4, Z_3, Z_2) + (T_2 ≪ 23) + (T_2 ≪ 97)
    else // i = 15
        (T_3, T_2, T_1, T_0, Z_7) = (0, 0, 0, 0, Z_7) + (Z_i ≪ 23) + (Z_i ≪ 97)
        (Z_4, Z_3, Z_2, Z_1, Z_0) = (Z_4, Z_3, Z_2, Z_1, Z_0) + (T_0 ≪ 23) + (T_0 ≪ 97)
        (Z_5, Z_4, Z_3, Z_2, Z_1) = (Z_5, Z_4, Z_3, Z_2, Z_1) + (T_1 ≪ 23) + (T_1 ≪ 97)
        (Z_6, Z_5, Z_4, Z_3, Z_2) = (Z_6, Z_5, Z_4, Z_3, Z_2) + (T_2 ≪ 23) + (T_2 ≪ 97)
        (Z_7, Z_6, Z_5, Z_4, Z_3) = (Z_7, Z_6, Z_5, Z_4, Z_3) + (T_3 ≪ 23) + (T_3 ≪ 97)
    end if
end for
return Z = (Z_7, ..., Z_1, Z_0)
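The behaviour of an LSD-first reduction that stops at the digit boundary, including the second reduction of digits that spill past digit s - 1, can be modelled in a few lines of Python. This is a generic software sketch under the assumption w = 32, not a transcription of the hardware: the function names are hypothetical, the whole product is held in one integer, and spilled digits are handled with a small work list rather than with the temporaries T_0, ..., T_3.

```python
W = 32
MASK = (1 << W) - 1

def poly_mod(z, m, low_exps):
    """Reference: full reduction modulo P(t) = t^m + sum of t^e for e in low_exps."""
    p = 1 << m
    for e in low_exps:
        p |= 1 << e
    while z.bit_length() > m:
        z ^= p << (z.bit_length() - 1 - m)
    return z

def reduce_to_digit_boundary(z, m, low_exps):
    """Reduce a 2s-digit product to s digits, upper digits processed LSD first."""
    s = -(-m // W)                       # s = ceil(m / w)
    width = W * s
    # t^(w*s) = t^(w*s - m) * t^m, so each exponent e gives a shift of w*s - m + e
    shifts = [width - m + e for e in low_exps]
    low = z & ((1 << width) - 1)         # digits Z_0 .. Z_{s-1}
    for i in range(s, 2 * s):
        pending = [((z >> (W * i)) & MASK, i - s)]
        while pending:
            digit, offset = pending.pop()
            for sh in shifts:
                low ^= digit << (W * offset + sh)
            # anything beyond digit s-1 must be reduced a second time
            overflow, low = low >> width, low & ((1 << width) - 1)
            k = 0
            while overflow:
                pending.append((overflow & MASK, k))
                overflow >>= W
                k += 1
    return low
```

For m = 233 the result fits in 8 × 32 = 256 bits and is congruent to the input modulo P(t), but it is not necessarily the canonical 233-bit representative, which matches the deferred final reduction described above.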

Table 2. Variables used in Section 2.2.

| Variable | Description |
|---|---|
| E_a | An elliptic curve |
| x, y | Elliptic curve equation variables or affine coordinates of a point; binary vectors of size m |
| P, Q | Points on an elliptic curve; represented in either affine or projective coordinates |
| k | Scalar multiplier in scalar multiplication; vector of size m represented in binary or τ-non-adjacent form (τNAF) |
| X, Y, Z | Lopez–Dahab coordinates of a point; binary vectors of size m |
| A, ..., G | Temporary binary vectors of size m |
| T_1, T_2, T_3 | Temporary binary vectors of size m |
| (u_{l-1}, ..., u_1, u_0) | Vector resulting from taking τNAF(k); l is assumed to be equal to m in this paper |




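Algorithm 4. Reduction by P(t) = t^283 + t^12 + t^7 + t^5 + 1

INPUT: Z = (Z_17, ..., Z_1, Z_0)
OUTPUT: Z mod P(t) = (Z_8, ..., Z_1, Z_0) (without final reduction)

for i from 9 to 17 do
    if i ≤ 16 then
        (Z_{i-8}, Z_{i-9}) = (Z_{i-8}, Z_{i-9}) + (Z_i ≪ 5) + (Z_i ≪ 10) + (Z_i ≪ 12) + (Z_i ≪ 17)
    else // i = 17
        (T_0, Z_8) = (0, Z_8) + (Z_i ≪ 5) + (Z_i ≪ 10) + (Z_i ≪ 12) + (Z_i ≪ 17)
        (Z_1, Z_0) = (Z_1, Z_0) + (T_0 ≪ 5) + (T_0 ≪ 10) + (T_0 ≪ 12) + (T_0 ≪ 17)
    end if
end for
return Z = (Z_8, ..., Z_1, Z_0)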


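Algorithm 5. Reduction by P(t) = t^409 + t^87 + 1

INPUT: Z = (Z_25, ..., Z_1, Z_0)
OUTPUT: Z mod P(t) = (Z_12, ..., Z_1, Z_0) (without final reduction)

for i from 13 to 25 do
    if i ≤ 22 then
        (Z_{i-10}, Z_{i-11}, Z_{i-12}, Z_{i-13}) = (Z_{i-10}, Z_{i-11}, Z_{i-12}, Z_{i-13}) + (Z_i ≪ 7) + (Z_i ≪ 94)
    else if i = 23 then
        (T_0, Z_12, Z_11, Z_10) = (0, Z_12, Z_11, Z_10) + (Z_i ≪ 7) + (Z_i ≪ 94)
        (Z_3, Z_2, Z_1, Z_0) = (Z_3, Z_2, Z_1, Z_0) + (T_0 ≪ 7) + (T_0 ≪ 94)
    else if i = 24 then
        (T_1, T_0, Z_12, Z_11) = (0, 0, Z_12, Z_11) + (Z_i ≪ 7) + (Z_i ≪ 94)
        (Z_3, Z_2, Z_1, Z_0) = (Z_3, Z_2, Z_1, Z_0) + (T_0 ≪ 7) + (T_0 ≪ 94)
        (Z_4, Z_3, Z_2, Z_1) = (Z_4, Z_3, Z_2, Z_1) + (T_1 ≪ 7) + (T_1 ≪ 94)
    else // i = 25
        (T_2, T_1, T_0, Z_12) = (0, 0, 0, Z_12) + (Z_i ≪ 7) + (Z_i ≪ 94)
        (Z_3, Z_2, Z_1, Z_0) = (Z_3, Z_2, Z_1, Z_0) + (T_0 ≪ 7) + (T_0 ≪ 94)
        (Z_4, Z_3, Z_2, Z_1) = (Z_4, Z_3, Z_2, Z_1) + (T_1 ≪ 7) + (T_1 ≪ 94)
        (Z_5, Z_4, Z_3, Z_2) = (Z_5, Z_4, Z_3, Z_2) + (T_2 ≪ 7) + (T_2 ≪ 94)
    end if
end for
return Z = (Z_12, ..., Z_1, Z_0)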




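Algorithm 6. Reduction by P(t) = t^571 + t^10 + t^5 + t^2 + 1

INPUT: Z = (Z_35, ..., Z_1, Z_0)
OUTPUT: Z mod P(t) = (Z_17, ..., Z_1, Z_0) (without final reduction)

for i from 18 to 35 do
    if i ≤ 34 then
        (Z_{i-17}, Z_{i-18}) = (Z_{i-17}, Z_{i-18}) + (Z_i ≪ 5) + (Z_i ≪ 7) + (Z_i ≪ 10) + (Z_i ≪ 15)
    else // i = 35
        (T_0, Z_17) = (0, Z_17) + (Z_i ≪ 5) + (Z_i ≪ 7) + (Z_i ≪ 10) + (Z_i ≪ 15)
        (Z_1, Z_0) = (Z_1, Z_0) + (T_0 ≪ 5) + (T_0 ≪ 7) + (T_0 ≪ 10) + (T_0 ≪ 15)
    end if
end for
return Z = (Z_17, ..., Z_1, Z_0)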
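The shift amounts that appear in Algorithms 2–6 follow directly from the reduction polynomials and the digit size. The short Python sketch below derives them under the assumption w = 32; the names `KOBLITZ_FIELDS`, `reduction_shifts` and `digit_span` are introduced here for illustration only.

```python
W = 32

# Reduction polynomials P(t) = t^m + sum of t^e over the listed exponents e
KOBLITZ_FIELDS = {
    163: (7, 6, 3, 0),
    233: (74, 0),
    283: (12, 7, 5, 0),
    409: (87, 0),
    571: (10, 5, 2, 0),
}

def reduction_shifts(m):
    """Left-shift amounts applied to an upper digit when it is folded back."""
    s = -(-m // W)                       # s = ceil(m / w)
    return sorted(W * s - m + e for e in KOBLITZ_FIELDS[m])

def digit_span(m):
    """Number of consecutive w-bit digits covered by one shifted digit."""
    return (max(reduction_shifts(m)) + W - 1) // W + 1

for m in sorted(KOBLITZ_FIELDS):
    print(m, reduction_shifts(m), digit_span(m))
```

The widest span occurs for m = 233, where a shifted digit covers five 32-bit digits; the other four fields cover two, three or four digits.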


2.2. Point multiplication on Koblitz curves

For clarity, the variables used in this section are described in Table 2.

Koblitz curves [22] are a special type of elliptic curve defined over GF(2^m) and they have the following form:

$$
E_a : y^2 + xy = x^3 + ax^2 + 1 \tag{3}
$$

where a = 0 or 1. The main operation in ECC, including Koblitz curves, is the elliptic curve point multiplication (ECPM). Given a point, P, defined on the curve E_a and an integer, k, ECPM is defined as follows:

$$
Q = kP = \underbrace{P + P + \cdots + P}_{k\ \text{times}} \tag{4}
$$

where Q is the resultant point, which is also on the curve E_a. One way of computing the ECPM is the double-and-add operation, where a sequence of point doubling (PDBL) and point addition (PADD) operations is evaluated based on the binary representation of k, as shown in Algorithm 7. PDBL is the operation Q ← 2Q and PADD is the operation Q ← Q + P in Algorithm 7. In this section, operations between two points or between a point and a scalar are point operations (i.e. PDBL, PADD or ECPM) and operations between two binary vectors are finite field operations (i.e. FFADD, FFMULT or FFSQ). If the points P and Q are represented using affine coordinates, i.e. using two coordinates (x, y), each PDBL and PADD would require an FFINV, which is the most complex FF operation as mentioned above. In order to make PDBL and PADD more efficient, the use of Lopez–Dahab (LD) coordinates has been suggested [21], which removes the need for FFINV in every PDBL and PADD operation by using three coordinates (X, Y, Z), where x = X/Z and y = Y/Z^2. In [21], the authors explain that by mixing LD coordinates and affine coordinates, PADD can be simplified to 8 FFMULT, 5 FFSQ and 9 FFADD, whereas PDBL using LD coordinates requires 4 FFMULT, 5 FFSQ and 4 FFADD.

Algorithm 7. Left-to-right point multiplication (Modified from Algorithm 3.27 in [21])

INPUT: k = (k_{t-1}, ..., k_1, k_0), P – a point on E_a
OUTPUT: Q = kP

Q ← ∞
for i from t - 1 down to 0 do
    Q ← 2Q
    if k_i = 1 then
        Q ← Q + P
    end if
end for
return Q
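The control flow of Algorithm 7 does not depend on how points are represented, so it can be sketched with an abstract group. In the following Python sketch the function `double_and_add` and its parameters are hypothetical; the usage example substitutes integer addition for point addition purely to make the block self-contained, and the counters show how many PDBL and PADD steps a given scalar triggers.

```python
def double_and_add(k, p, add, identity):
    """Left-to-right double-and-add over any group given by `add` and `identity`.

    Returns the result together with the number of doublings and additions.
    """
    q = identity
    doubles = adds = 0
    for i in range(k.bit_length() - 1, -1, -1):
        q = add(q, q)                    # PDBL: Q <- 2Q
        doubles += 1
        if (k >> i) & 1:
            q = add(q, p)                # PADD: Q <- Q + P
            adds += 1
    return q, doubles, adds

# Integers under addition stand in for curve points here (an assumption
# made only for illustration): kP becomes the ordinary product k * p.
result, doubles, adds = double_and_add(45, 7, lambda u, v: u + v, 0)
print(result, doubles, adds)
```

One doubling is performed per scalar bit, while the number of additions equals the number of 1 bits, which is why a sparse representation of k and a cheap doubling are both attractive.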

In Koblitz curves, the double-and-add operation can be further improved by converting the scalar, k, into τ-non-adjacent form (τNAF), which rewrites k into the form k = Σ_{i=0}^{l-1} u_i τ^i, where u_i ∈ {0, ±1} and u_{l-1} ≠ 0. Using LD coordinates and the τNAF representation of k, Algorithm 7 is modified as shown in Algorithm 8. The major differences between Algorithms 7 and 8 are that the latter performs Q ← τQ for PDBL, which is simply an FFSQ on each coordinate, and that it needs to add or subtract a point in PADD. The subtraction can be done efficiently because the negative of a point with affine coordinates (x, y) is given by -(x, y) = (x, x + y). Thus, if P is given by coordinates (x, y), Q ← Q - P is computed as Q ← Q + P1, where P1 = (x, x + y).
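The negation formula can be checked directly against (3). The following short derivation relies only on the characteristic-2 identities (x + y)^2 = x^2 + y^2 and x^2 + x^2 = 0, and shows that (x, x + y) satisfies the curve equation whenever (x, y) does:

$$
(x + y)^2 + x(x + y) = x^2 + y^2 + x^2 + xy = y^2 + xy = x^3 + ax^2 + 1.
$$

Since the two points share the same x coordinate, the line through them is vertical, which is consistent with their sum being the point at infinity.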

The implementation of the τNAF conversion algorithm is out of the scope of the paper. Interested readers can refer to [7] and [23]. Instead, the test vectors for k are generated randomly and offline. As pointed out in [10], the hardware cost of the conversion algorithm is extremely high, especially if it needs to support all five Koblitz curves.
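For readers who want to experiment with the representation in software, the following Python sketch shows one textbook way of producing τNAF digits and checking them. It is not the procedure used to generate the test vectors in this paper. It assumes the standard relation τ^2 = μτ - 2 with μ = (-1)^(1-a), represents elements of Z[τ] as pairs (r0, r1) meaning r0 + r1·τ, and omits the reduction step covered in the references above, so the digit strings it returns are typically longer than the l = m digits assumed here. The names `tnaf` and `evaluate` are hypothetical.

```python
def tnaf(k, mu):
    """Digits u_0, u_1, ... in {0, 1, -1} with k = sum(u_i * tau^i)."""
    r0, r1 = k, 0
    digits = []
    while r0 != 0 or r1 != 0:
        if r0 & 1:
            u = 2 - ((r0 - 2 * r1) % 4)  # +1 or -1, chosen so the next digit is 0
            r0 -= u
        else:
            u = 0
        digits.append(u)
        # exact division of r0 + r1*tau by tau (r0 is even at this point)
        r0, r1 = r1 + mu * (r0 // 2), -(r0 // 2)
    return digits

def evaluate(digits, mu):
    """Horner evaluation of sum(u_i * tau^i); returns the pair (r0, r1)."""
    r0, r1 = 0, 0
    for u in reversed(digits):
        # multiply by tau using tau^2 = mu*tau - 2, then add the digit
        r0, r1 = -2 * r1 + u, r0 + mu * r1
    return r0, r1

digits = tnaf(3, 1)
print(digits, evaluate(digits, 1))
```

Each nonzero digit is followed by a zero digit, which is the non-adjacency property that keeps the number of PADD operations low.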

Algorithm 8. τNAF point multiplication on Koblitz curves (Modified from Algorithm 3.66 in [21])

INPUT: k – a binary integer, P(x, y) – a point on E_a
OUTPUT: Q = kP

Compute τNAF(k) = Σ_{i=0}^{l-1} u_i τ^i
// Perform the first point addition of Q ← ∞ ± P
if u_{l-1} = 1 then
    Q(X3, Y3, Z3) ← P(x, y)
else
    Q(X3, Y3, Z3) ← P(x, x + y)
end if
for i from l - 2 down to 0 do
    // Perform PDBL (Q ← τQ)
    Q(X3, Y3, Z3) ← Q(X3^2, Y3^2, Z3^2)
    // Perform PADD
    if u_i = 1 then
        Q(X3, Y3, Z3) ← Q(X3, Y3, Z3) + P(x, y)
    end if
    if u_i = -1 then
        Q(X3, Y3, Z3) ← Q(X3, Y3, Z3) + P(x, x + y)
    end if
end for
return Q(x3, y3) ← Q(X3/Z3, Y3/Z3^2)

Since the point doubling operation is simplified to a series of FFSQ, the focus of point multiplication algorithms on Koblitz curves is on the PADD operation. In [21], the PADD operation (X3, Y3, Z3) ← (X1, Y1, Z1) + (X2, Y2), where (X1, Y1, Z1) is a point in LD coordinates, (X2, Y2) is a point in affine coordinates and (X3, Y3, Z3) are the LD coordinates of the point resulting from the addition, is given by:

$$
\begin{aligned}
A &\leftarrow Y_2 \cdot Z_1^2 + Y_1 \\
B &\leftarrow X_2 \cdot Z_1 + X_1 \\
C &\leftarrow Z_1 \cdot B \\
D &\leftarrow B^2 \cdot (C + aZ_1^2) \\
Z_3 &\leftarrow C^2 \\
E &\leftarrow A \cdot C \\
X_3 &\leftarrow A^2 + D + E \\
F &\leftarrow X_3 + X_2 \cdot Z_3 \\
G &\leftarrow (X_2 + Y_2) \cdot Z_3^2 \\
Y_3 &\leftarrow (E + Z_3) \cdot F + G
\end{aligned}
\tag{5}
$$

In order to optimize the point addition for a scalable hardwareimplementation and to reduce the complexity of the control signals,a novel PADD algorithm has been developed.

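$$
\begin{aligned}
T_1 &\leftarrow (0 + X_2) \cdot Z_3 + X_3 \\
X_3 &\leftarrow (0 + Z_3) \cdot T_1 + 0 \\
T_3 &\leftarrow Z_3^2 + 0 \\
Y_3 &\leftarrow (0 + Y_2) \cdot T_3 + Y_3 \\
Z_3 &\leftarrow X_3^2 + 0 \\
T_2 &\leftarrow (0 + Y_3) \cdot X_3 + 0 \\
T_1 &\leftarrow T_1^2 + 0 \\
X_3 &\leftarrow (aT_3 + X_3) \cdot T_1 + T_2 \\
X_3 &\leftarrow Y_3^2 + X_3 \\
T_1 &\leftarrow (0 + X_2) \cdot Z_3 + X_3 \\
Y_3 &\leftarrow Z_3^2 + 0 \\
Y_3 &\leftarrow (X_2 + Y_2) \cdot Y_3 + 0 \\
Y_3 &\leftarrow (T_2 + Z_3) \cdot T_1 + Y_3
\end{aligned}
\tag{6}
$$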

The novel PADD algorithm only requires operations of the following two formats: Z = (C + A) · B + D (MULT mode) and Z = B^2 + D (SQ mode). These modes can share most of the hardware resources because of the similarities in the expressions. In MULT mode, the argument A is added to C, using XOR gates, before performing the multiplication with B using the Comba algorithm described in Algorithm 1, whereas in SQ mode, the B argument is squared by simply interleaving 0 bits as described in (2). The reduction operation, which is required but not shown in (6), is common to both modes, so the hardware resources are shared. Finally, the addition of the D argument, which uses XOR gates, is also common to both modes. The proposed PADD algorithm facilitates the implementation of the processor, since it is able to utilize the scalable FFAU more effectively, as will be shown in Section 3.
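That the thirteen in-place instructions of (6) compute the same point as (5) can be checked numerically. The Python sketch below is an illustrative software model, assuming the K-163 field and a simple bit-serial multiplier; `ffau_mult` and `ffau_sq` are hypothetical stand-ins for the two FFAU modes, and the inputs in the usage check are arbitrary field elements rather than points on the curve, since the equivalence is purely algebraic.

```python
M = 163
# Irreducible polynomial assumed for K-163: t^163 + t^7 + t^6 + t^3 + 1
P_POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def ff_mul(a, b):
    """FFMULT in GF(2^163): bit-serial shift-and-add with interleaved reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= P_POLY
    return r

def ffau_mult(c, a, b, d):
    """MULT mode: Z = (C + A) * B + D."""
    return ff_mul(c ^ a, b) ^ d

def ffau_sq(b, d):
    """SQ mode: Z = B^2 + D."""
    return ff_mul(b, b) ^ d

def padd_reference(X1, Y1, Z1, X2, Y2, a):
    """Mixed LD/affine point addition following (5)."""
    Z1sq = ff_mul(Z1, Z1)
    A = ff_mul(Y2, Z1sq) ^ Y1
    B = ff_mul(X2, Z1) ^ X1
    C = ff_mul(Z1, B)
    D = ff_mul(ff_mul(B, B), C ^ (Z1sq if a else 0))
    Z3 = ff_mul(C, C)
    E = ff_mul(A, C)
    X3 = ff_mul(A, A) ^ D ^ E
    F = X3 ^ ff_mul(X2, Z3)
    G = ff_mul(X2 ^ Y2, ff_mul(Z3, Z3))
    Y3 = ff_mul(E ^ Z3, F) ^ G
    return X3, Y3, Z3

def padd_ffau(X3, Y3, Z3, X2, Y2, a):
    """The same addition as the in-place sequence (6); Q enters as (X3, Y3, Z3)."""
    T1 = ffau_mult(0, X2, Z3, X3)                # B
    X3 = ffau_mult(0, Z3, T1, 0)                 # C
    T3 = ffau_sq(Z3, 0)                          # Z1^2
    Y3 = ffau_mult(0, Y2, T3, Y3)                # A
    Z3 = ffau_sq(X3, 0)                          # new Z3 = C^2
    T2 = ffau_mult(0, Y3, X3, 0)                 # E
    T1 = ffau_sq(T1, 0)                          # B^2
    X3 = ffau_mult(T3 if a else 0, X3, T1, T2)   # D + E
    X3 = ffau_sq(Y3, X3)                         # new X3 = A^2 + D + E
    T1 = ffau_mult(0, X2, Z3, X3)                # F
    Y3 = ffau_sq(Z3, 0)                          # Z3^2
    Y3 = ffau_mult(X2, Y2, Y3, 0)                # G
    Y3 = ffau_mult(T2, Z3, T1, Y3)               # new Y3 = (E + Z3) * F + G
    return X3, Y3, Z3
```

The sequence uses eight MULT-mode and five SQ-mode instructions and needs only three temporaries besides the coordinates of Q.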

3. Design and architecture of ECC processor

This section is divided into two subsections. First, the designand architecture of the FFAU is presented. It implements the twoformats of the operations given in (6). Second, the architecture ofthe entire scalable ECP, which uses the FFAU, is presented.

3.1. Finite Field Arithmetic Unit (FFAU)

The FFAU computes the two formats of operations required by PADD and PDBL, namely Z = (C + A) · B + D (MULT mode) and Z = B^2 + D (SQ mode). Fig. 1 shows the top-level block diagram of the FFAU, which shows how the two operation formats are handled. In Fig. 1, the two FFADD modules (‘+’ in Fig. 1) are implemented as two-input XOR gates and do not require extra clock cycles as they are pure combinational logic. Furthermore, FFADD is integrated into FFMULT and FFSQ without affecting the critical path of the processor. Fig. 2 shows the details of the ‘MULT_SQ unit’ in the FFAU, which calculates the FFMULT and FFSQ operations.

As shown in Fig. 2, the inputs ‘A’ and ‘B’ are fed into the module and stored in dual-port RAMs. These RAMs need to be 32 bits wide and 18 words deep, since the FFMULT and FFSQ operations use 32-bit operands and the largest number of digits that the ECP supports is s = ⌈571/32⌉ = 18. The values from the RAM are read out and multiplied or squared depending on the mode selected. The ‘multiplier unit’ (‘x’ in Fig. 2) is a 32-bit Karatsuba–Ofman multiplier [24], which is purely combinational logic. The ‘SQ unit’ implements (2) by interleaving the 32-bit input with 0s. In other words, it inputs a 32-bit binary vector (a_31, a_30, ..., a_1, a_0) and outputs (a_31, 0, a_30, 0, ..., 0, a_1, 0, a_0).
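The Karatsuba–Ofman idea behind the 32-bit ‘multiplier unit’ can be sketched recursively in Python. This is an assumption-laden illustration: the recursion depth, the 4-bit base case and the function names are choices made here, and the paper does not state how the combinational multiplier is partitioned.

```python
def clmul_schoolbook(a, b):
    """Reference carry-less multiplication over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def karatsuba_clmul(a, b, bits=32):
    """Carry-less product of two `bits`-wide operands using three half-size products."""
    if bits <= 4:
        return clmul_schoolbook(a, b)
    half = bits // 2
    mask = (1 << half) - 1
    a0, a1 = a & mask, a >> half
    b0, b1 = b & mask, b >> half
    lo = karatsuba_clmul(a0, b0, half)
    hi = karatsuba_clmul(a1, b1, half)
    # over GF(2): a0*b1 + a1*b0 = (a0 + a1)*(b0 + b1) + a0*b0 + a1*b1
    mid = karatsuba_clmul(a0 ^ a1, b0 ^ b1, half) ^ lo ^ hi
    return lo ^ (mid << half) ^ (hi << bits)
```

Because addition in GF(2) is an XOR, the recombination step has no carries, which is what makes this decomposition attractive for purely combinational logic.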

In MULT mode, the 32-bit digits of the operands are stored in ‘RAM A’ and ‘RAM B’. The digits are read out to the ‘multiplier unit’ according to the indexes in the for loops of the Comba algorithm in Algorithm 1. The multiplexer is set to select the output of the FFADD unit (‘+’ in Fig. 2) to accumulate in the 63-bit ‘UV register’, which corresponds to the operation (U,V) ← (U,V) + A_j · B_{i-j} in Algorithm 1. Once the inner loop in Algorithm 1 is completed, the least-significant 32 bits of the ‘UV register’ are sent to the right hand side of Fig. 2 for reduction.

Fig. 1. Block diagram of the FFAU.

In SQ mode, the 32-bit digits of the operands are stored in ‘RAMA’ and the multiplexer selects the output of the ‘SQ unit’ to store inthe ‘UV register’. Since there is no further processing required, theleast-significant 32 bits of ‘UV register’ can be immediately sent tothe right hand side of Fig. 2 for reduction.
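The digit-wise squaring performed in SQ mode can be modelled in Python as follows. The sketch assumes w = 32 and uses hypothetical names (`sq_unit`, `ff_square_unreduced`); each input digit expands into 64 bits, i.e. two output digits, and the result is checked against a generic carry-less multiplication.

```python
W = 32
MASK = (1 << W) - 1

def clmul(a, b):
    """Reference carry-less multiplication over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def sq_unit(digit):
    """Interleave the bits of one w-bit digit with 0s, as described for (2)."""
    out = 0
    for i in range(W):
        out |= ((digit >> i) & 1) << (2 * i)
    return out

def ff_square_unreduced(a, s):
    """Square an s-digit element digit by digit; no reduction is applied."""
    z = 0
    for i in range(s):
        digit = (a >> (W * i)) & MASK
        z |= sq_unit(digit) << (2 * W * i)   # digit i lands in output digits 2i and 2i+1
    return z
```

No accumulation across digits is needed, because the square of a sum equals the sum of the squares in characteristic 2; this is why each squared digit can be forwarded to the reduction logic immediately.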

As a result, the logic on the left side of Fig. 2 implements the Comba algorithm and (2) completely. The output is sent to the right side of Fig. 2, which implements the reduction portion of FFMULT and FFSQ. ‘RAM C’ is the same size as ‘RAM A’ and ‘RAM B’ and stores the digits of the intermediate and final results of FFMULT or FFSQ.

Fig. 2. Block diagram of the MULT_SQ unit.

The ‘MSD Shift Unit’ implements the shifting portion of the reduction as shown in Algorithms 2–6 and its block diagram is shown in Fig. 3. Inside the dashed lines in Fig. 3 are vectors of 0 bits with lengths indicated by the numerical value. Each block performs the necessary shifting to reduce modulo P(t) according to Algorithms 2–6. The operations in three of the five blocks are four-input XOR operations. The output of the ‘MSD Shift Unit’ is 5 × 32 = 160 bits wide because, among the five types of curves, the reduction that shifts and adds the most digits is that of K-233 in Algorithm 3, which affects five 32-bit digits. Thus, ‘Shift Reg 1’ and ‘Shift Reg 2’, which are shift registers, need to be 160 bits wide. All the additions in Fig. 2 are implemented as two-input XORs. There are two ‘Shift Regs’ in Fig. 2 because, as previously mentioned, the LSD-first reduction algorithm requires some upper digits to be reduced a second time. The second reduction is handled by ‘Shift Reg 2’, and the input of the ‘MSD Shift Unit’ is selected to be the least-significant 32 bits of ‘Shift Reg 1’ by the multiplexer.

The two finite field adders on the right hand side of Fig. 2 add the output of the ‘MSD Shift Unit’ to the value currently held in the ‘Shift Regs’, which merges the shifted terms in Algorithms 2–6 with the registered digits.

Fig. 3. Block diagram of the MSD Shift Unit.

3.2. Scalable ECC Processor (ECP)

The block diagram of the scalable ECP is shown in Fig. 4. The scalable ECP computes Algorithm 8 after the τNAF(k) computation, so it requires as inputs the point P with affine coordinates x1 and y1 and the τNAF-converted value of k, which has a magnitude and a sign. The outputs of the scalable ECP are the affine coordinates, x3 and y3, of the resultant point Q = kP. For simplicity, the finite state machine (FSM) and some control signals are not shown. Table 3 lists the signals used in Fig. 4 and their descriptions. The inputs x1 and y1 are 32-bit buses, where the values are input digit-by-digit, so it takes s = ⌈m/w⌉ clock cycles to input all the values. The k and ks signals are also 32-bit buses, so they are input digit-by-digit simultaneously with x1 and y1. The resultant coordinates x3 and y3 are output on 32-bit buses, so they are also output digit-by-digit, which takes s clock cycles.

Fig. 4. Block diagram of the scalable ECC processor.

Table 3. Signals used in Figs. 4 and 6.

| Name | Width | I/O/Int (a) | Description |
|---|---|---|---|
| clk | 1 | I | clock (not shown) |
| reset | 1 | I | reset (not shown) |
| x1 | 32 | I | x coordinate of the input point P |
| y1 | 32 | I | y coordinate of the input point P |
| load | 1 | I | load signal to trigger the start of processing (not shown) |
| mode_sel | 3 | I | select among the 5 curves defined in NIST [4] (not shown) |
| k | 32 | I | magnitude of τNAF(k) in Algorithm 8 |
| ks | 32 | I | sign of τNAF(k) in Algorithm 8 |
| x3 | 32 | O | x coordinate of the output point Q |
| y3 | 32 | O | y coordinate of the output point Q |
| ready | 1 | O | indicates the ECP is ready for the next operation (not shown) |
| done | 1 | O | indicates the beginning of outputting digits (not shown) |
| PC | 4 | Int | program counter that is reset to 0 every time the FSM enters PDBL or PADD |
| cur_k | 1 | Int | the magnitude of the current τNAF(k) bit being processed |
| k_count | 10 | Int | counter for the current τNAF(k) bit; counts down from m - 1 to 0 |

(a) Input Signal (I); Output Signal (O); Internal Signal (Int).

The core of the processor is the FFAU described in Section 3.1. The inputs to the core of the processor are chosen by multiplexers (MUX), which select the arguments of the operation. The temporary results are stored in a RAM. In addition to the x and y inputs, the RAM needs to store xy = x + y and X3, Y3, Z3, T1, T2, and T3, which are used in (6). Each variable that is stored in the RAM is 32 bits wide and 18 words deep, similar to ‘RAM C’ in Fig. 2, so the total size of the RAM in Fig. 4 is 18 × (32 × 9) = 18 × 288 bits. The output of the FFAU is connected back to the inputs of the RAM through a set of MUXes. The MUXes at the input of the RAM select which variable to insert into the RAM on each cycle. The select signals of the MUXes are asserted by the controller and depend on the value of the program counter and the current state in the FSM.

Fig. 5. Block diagram of the Final Shift Unit.

Fig. 6. FSM of the scalable ECC processor.

Table 4. Clock cycles of ECPM.

| m | t_MULT | t_SQ | t_INV | t_PADD | t_PDBL | t_P2AC | t_ECPM |
|---|---|---|---|---|---|---|---|
| 163 | 46 | 20 | 3654 | 25,272 | 9720 | 7420 | 42,419 |
| 233 | 78 | 28 | 7276 | 59,592 | 19,488 | 14,736 | 93,825 |
| 283 | 89 | 24 | 7747 | 78,208 | 20,304 | 15,696 | 114,218 |
| 409 | 181 | 36 | 16,679 | 221,408 | 44,064 | 33,756 | 299,242 |
| 571 | 332 | 42 | 28,256 | 544,540 | 71,820 | 57,218 | 673,597 |

Table 5. FPGA resource utilization comparison.

| Work | FPGA | Slice registers | LUT | Slices | BRAM |
|---|---|---|---|---|---|
| 2009 [8] | Spartan-3 XC3S200 | 626 | 1841 | 1041 | 4 |
| 2010 [9] | Spartan-3 XC3S200 | 650 | 2025 | 1127 | 4 |
| 2011 [11] | Virtex-4 XC4VFX12 | n/a | n/a | 2405–3528 | 16 |
| 2006 [25] | Virtex-E XCV2000E | 475 | 2556 | n/a | 8 |
| 2010 [10] | Spartan-3 XC3S200 | 913 | 2028 | 1278 | 4 |
| This work | Spartan-3 XC3S400 | 1232 | 3850 | 2220 | 8 |
| This work | Virtex-4 XC4VFX12 | 1219 | 3815 | 2431 | 8 |
| This work | Virtex-E XCV2000E | 1641 | 4320 | 2856 | 18 |

At the end of the evaluation of the ECPM, namely after $Q(x_3, y_3) \leftarrow Q(X_3/Z_3,\ Y_3/Z_3^2)$ in Algorithm 8, the result of the FFAU is used by the ‘Final Shift Unit’, which computes the reduction of the MSBs of the MSD that was not performed in the FFAU, as explained in Section 2.1. The values are stored in two registers (‘Shifted X Reg’ and ‘Shifted Y Reg’), one for x3 and another one for y3, which are added to the final result as it is being output from the system. The ‘Final Shift Unit’ is similar to the ‘MSD Shift Unit’ in Fig. 3, but it does not use the full 32-bit input and the lengths of the zero-bit vectors are different. The block diagram of the ‘Final Shift Unit’ is shown in Fig. 5.

Table 6. Timing performance comparison.

Work       ECC curves               Algorithm
2009 [8]   SEC [5]                  Montgomery–Lopez–Dahab
2010 [9]   NIST Random Curves [4]   Montgomery–Lopez–Dahab
2011 [11]  NIST Random Curves [4]   Affine Double-and-Add
2006 [25]  NIST Random Curves [4]   Mixed Coordinates
2010 [10]  NIST Koblitz Curves [4]  Mixed Coordinates
This work  NIST Koblitz Curves [4]  Mixed Coordinates

The FSM of the scalable ECP is shown in Fig. 6. The processor resets to the IDLE state. When the load signal is detected, the FSM moves into the LOAD state, where it stays for only one clock cycle; that is, the calculations begin while some of the input values are still being read in. The PDBL, PADD, ISQ, IMULT, and FMULT state transitions are all evaluated when an FFAU instruction is complete. The value of the program counter (PC) increments when an FFAU instruction is complete and resets to 0 when entering the PDBL and PADD states. The PDBL state performs three FFSQ operations, on X3, Y3 and Z3, respectively. Once these are completed (PC = 2), PC resets to 0; if the current bit of k is 0, PDBL restarts, otherwise the FSM moves to the PADD state to compute the PADD instructions and returns to the PDBL state when PC is 12, which means that all 13 instructions in (6) have been executed, and PC resets to 0. At the completion of either the PDBL or the PADD state, if all the bits in k have been processed (i.e. k_count = 0), the FSM moves to the ISQ state, which initiates the Itoh–Tsujii algorithm [15] for FFINV. From this point on, the PC and k_count values are no longer used. The ISQ state computes a series of FFSQ operations, and the IMULT state computes one FFMULT and returns to the ISQ state. The number of times the ISQ and IMULT states cycle depends on the selected field (m). Once the FFINV is complete, the FSM moves to the FMULT state, which computes the final multiplication $X_3 \times 1/Z_3$ the first time and $Y_3 \times 1/Z_3^2$ the second time. Thus, the first time the FSM exits the FMULT state, it goes back to ISQ and computes $Z_3^2$ and its inverse. The second time the FSM exits FMULT, it moves to the WAIT state, where it waits for the FFAU to finish outputting the result and the processor outputs the final result. The FSM allows the processor to move immediately back to the LOAD state if the load signal is detected in the WAIT state; otherwise it returns to the IDLE state.
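The state sequence described above can be traced with a small behavioural model. The following Python sketch is only an illustration of the transitions named in the text, not the authors' hardware description. Several details are assumptions made here because the text does not state them: that LOAD is followed directly by PDBL, that the inversion loop leaves through ISQ (the last Itoh–Tsujii step being a squaring), and the hypothetical names `next_state`, `trace`, `last_bit`, `inv_done`, `second_fmult` and `inv_mults`. Clock cycles and PC values are not modelled; each list entry stands for one visit to a state.

```python
from enum import Enum, auto


class S(Enum):
    IDLE = auto()
    LOAD = auto()
    PDBL = auto()
    PADD = auto()
    ISQ = auto()
    IMULT = auto()
    FMULT = auto()
    WAIT = auto()


def next_state(state, *, load=False, cur_k=0, last_bit=False,
               inv_done=False, second_fmult=False):
    """Transition taken when the current state finishes (illustrative only)."""
    if state is S.IDLE:
        return S.LOAD if load else S.IDLE
    if state is S.LOAD:
        return S.PDBL                      # assumption: LOAD lasts one cycle, then PDBL
    if state is S.PDBL:
        if cur_k:
            return S.PADD                  # current sNAF digit is non-zero
        return S.ISQ if last_bit else S.PDBL
    if state is S.PADD:
        return S.ISQ if last_bit else S.PDBL
    if state is S.ISQ:
        return S.FMULT if inv_done else S.IMULT   # assumption: inversion exits via ISQ
    if state is S.IMULT:
        return S.ISQ
    if state is S.FMULT:
        return S.WAIT if second_fmult else S.ISQ
    if state is S.WAIT:
        return S.LOAD if load else S.IDLE
    raise ValueError(state)


def trace(k_bits, inv_mults=1):
    """State names visited for one ECPM; k_bits are sNAF magnitudes, first digit first."""
    seq = [S.IDLE]
    state = next_state(S.IDLE, load=True)
    seq.append(state)
    state = next_state(state)
    seq.append(state)
    for i, bit in enumerate(k_bits):
        last = i == len(k_bits) - 1
        state = next_state(state, cur_k=bit, last_bit=last)
        seq.append(state)
        if state is S.PADD:
            state = next_state(state, last_bit=last)
            seq.append(state)
    for second in (False, True):           # 1/Z3 first, then 1/Z3^2
        for _ in range(inv_mults):
            state = next_state(state)      # ISQ -> IMULT
            seq.append(state)
            state = next_state(state)      # IMULT -> ISQ
            seq.append(state)
        state = next_state(state, inv_done=True)
        seq.append(state)
        state = next_state(state, second_fmult=second)
        seq.append(state)
    state = next_state(state)              # WAIT -> IDLE when no new load is pending
    seq.append(state)
    return [s.name for s in seq]
```

Calling `trace([1, 0])` walks through one doubling followed by an addition, one doubling alone, two inversion passes and the final WAIT. In the real processor `inv_mults` would be replaced by the field-dependent Itoh–Tsujii schedule.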

Using the design of the scalable ECP described above, Table 4 shows the number of clock cycles required for computing each of the operations. FFMULT requires one clock cycle for loading, $s^2$ clock cycles for the Comba multiplication, $2 + \mathit{reduc}$ clock cycles for reduction, where reduc depends on the bit length, and three clock cycles for pipelining ($t_{\mathrm{MULT}} = 1 + s^2 + 2 + \mathit{reduc} + 3$). FFSQ requires one clock cycle for loading, $2s$ clock cycles for digit-wise squaring, reduc clock cycles for reduction and three clock cycles for pipelining ($t_{\mathrm{SQ}} = 1 + 2s + \mathit{reduc} + 3$). FFINV uses the Itoh–Tsujii algorithm, which requires $(m - 1)$ FFSQ and $\lfloor \log_2(m - 1) \rfloor + h(m - 1) - 1$ FFMULT, where $h(m - 1)$ is the Hamming weight of $(m - 1)$ ($t_{\mathrm{INV}} = (m - 1)\,t_{\mathrm{SQ}} + (\lfloor \log_2(m - 1) \rfloor + h(m - 1) - 1)\,t_{\mathrm{MULT}}$). The values in brackets in Fig. 6 represent the number of clock cycles required to run each state.
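The operation counts quoted for FFINV can be checked in software. The sketch below is a minimal Python illustration, assuming a polynomial-basis representation of GF(2^163) with the NIST reduction polynomial x^163 + x^7 + x^6 + x^3 + 1 (taken from the NIST recommendation [4], not restated in this section). It uses one standard addition chain for the Itoh–Tsujii inversion; the exact schedule used by the processor is not given here, and the names `gf_mul` and `itoh_tsujii_inverse` are hypothetical.

```python
M = 163
# NIST reduction polynomial for the 163-bit binary field (assumption: polynomial basis).
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def gf_mul(a, b):
    """Multiply two field elements: carry-less product, then reduction modulo POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    for i in range(r.bit_length() - 1, M - 1, -1):
        if (r >> i) & 1:
            r ^= POLY << (i - M)
    return r


def itoh_tsujii_inverse(a):
    """Return (a^-1, number of squarings, number of multiplications)."""
    if a == 0:
        raise ZeroDivisionError("zero has no inverse")
    n_sq = n_mul = 0
    beta, k = a, 1                      # invariant: beta = a^(2^k - 1)
    for bit in bin(M - 1)[3:]:          # bits of m - 1 below the leading one
        t = beta
        for _ in range(k):              # t = beta^(2^k)
            t = gf_mul(t, t)
            n_sq += 1
        beta = gf_mul(t, beta)          # beta = a^(2^(2k) - 1)
        n_mul += 1
        k *= 2
        if bit == "1":
            beta = gf_mul(gf_mul(beta, beta), a)   # beta = a^(2^(k+1) - 1)
            n_sq += 1
            n_mul += 1
            k += 1
    inv = gf_mul(beta, beta)            # a^(2^m - 2) = a^-1
    n_sq += 1
    return inv, n_sq, n_mul
```

For m = 163 this chain uses 162 squarings and 9 multiplications, which matches (m − 1) FFSQ and ⌊log2(162)⌋ + h(162) − 1 = 7 + 3 − 1 FFMULT.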

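Table 6 (continued).

Work       FPGA               Max. freq. (MHz)  m    Latency (ms)
2009 [8]   Spartan-3 XC3S200  63.3              113  14
2009 [8]   Spartan-3 XC3S200  63.3              131  24
2009 [8]   Spartan-3 XC3S200  63.3              163  38
2009 [8]   Spartan-3 XC3S200  63.3              193  56
2010 [9]   Spartan-3 XC3S200  68.3              163  38
2010 [9]   Spartan-3 XC3S200  68.3              233  73.4
2010 [9]   Spartan-3 XC3S200  68.3              283  104
2010 [9]   Spartan-3 XC3S200  68.3              409  251
2010 [9]   Spartan-3 XC3S200  68.3              571  287.4
2011 [11]  Virtex-4 XC4VFX12  136               113  0.52
2011 [11]  Virtex-4 XC4VFX12  139               131  0.69
2011 [11]  Virtex-4 XC4VFX12  145               163  1.07
2006 [25]  Virtex-E XCV2000E  150               163  1.95
2006 [25]  Virtex-E XCV2000E  150               233  6.0
2006 [25]  Virtex-E XCV2000E  150               283  6.48
2010 [10]  Spartan-3 XC3S200  90                163  15.5
2010 [10]  Spartan-3 XC3S200  90                283  45.1
2010 [10]  Spartan-3 XC3S200  90                571  121.4
This work  Spartan-3 XC3S400  93.084            163  0.456
This work  Spartan-3 XC3S400  93.084            233  1.008
This work  Spartan-3 XC3S400  93.084            283  1.227
This work  Spartan-3 XC3S400  93.084            409  3.215
This work  Spartan-3 XC3S400  93.084            571  7.236
This work  Virtex-4 XC4VFX12  155.376           163  0.273
This work  Virtex-4 XC4VFX12  155.376           233  0.604
This work  Virtex-4 XC4VFX12  155.376           283  0.735
This work  Virtex-4 XC4VFX12  155.376           409  1.926
This work  Virtex-4 XC4VFX12  155.376           571  4.335
This work  Virtex-E XCV2000E  43.983            163  0.964
This work  Virtex-E XCV2000E  43.983            233  2.133
This work  Virtex-E XCV2000E  43.983            283  2.597
This work  Virtex-E XCV2000E  43.983            409  6.804
This work  Virtex-E XCV2000E  43.983            571  15.315

Table 7. Performance metrics comparison.

Work       FPGA               m    Efficiency (ECPM/s/100 slices)  ATP (slices × s)
2009 [8]   Spartan-3 XC3S200  113  6.862                           14.574
2009 [8]   Spartan-3 XC3S200  131  4.003                           24.984
2009 [8]   Spartan-3 XC3S200  163  2.528                           39.558
2009 [8]   Spartan-3 XC3S200  193  1.715                           58.296
2010 [9]   Spartan-3 XC3S200  163  2.335                           42.826
2010 [9]   Spartan-3 XC3S200  233  1.209                           82.722
2010 [9]   Spartan-3 XC3S200  283  0.853                           117.208
2010 [9]   Spartan-3 XC3S200  409  0.354                           282.877
2010 [9]   Spartan-3 XC3S200  571  0.309                           323.900
2011 [11]  Virtex-4 XC4VFX12  113  79.962                          1.251
2011 [11]  Virtex-4 XC4VFX12  131  50.480                          1.981
2011 [11]  Virtex-4 XC4VFX12  163  26.490                          3.775
2006 [25]  Virtex-E XCV2000E  163  n/a                             n/a
2006 [25]  Virtex-E XCV2000E  233  n/a                             n/a
2006 [25]  Virtex-E XCV2000E  283  n/a                             n/a
2010 [10]  Spartan-3 XC3S200  163  5.048                           19.809
2010 [10]  Spartan-3 XC3S200  283  1.735                           57.638
2010 [10]  Spartan-3 XC3S200  571  0.645                           155.149
This work  Spartan-3 XC3S400  163  98.783                          1.012
This work  Spartan-3 XC3S400  233  44.688                          2.238
This work  Spartan-3 XC3S400  283  36.712                          2.724
This work  Spartan-3 XC3S400  409  14.011                          7.137
This work  Spartan-3 XC3S400  571  6.225                           16.064
This work  Virtex-4 XC4VFX12  163  150.679                         0.664
This work  Virtex-4 XC4VFX12  233  68.105                          1.468
This work  Virtex-4 XC4VFX12  283  55.966                          1.787
This work  Virtex-4 XC4VFX12  409  21.358                          4.682
This work  Virtex-4 XC4VFX12  571  9.489                           10.538
This work  Virtex-E XCV2000E  163  36.322                          2.753
This work  Virtex-E XCV2000E  233  16.415                          6.092
This work  Virtex-E XCV2000E  283  13.482                          7.417
This work  Virtex-E XCV2000E  409  5.146                           19.432
This work  Virtex-E XCV2000E  571  2.286                           43.740

Fig. 7. Plot of the latency comparison for implementations on Spartan-3.

The PADD state is only entered when cur_k is 1, so the number of times the PADD state is entered depends on the Hamming weight of sNAF (k). Since the average Hamming weight of k in sNAF is m/3, the total number of clock cycles required for PADD is on average $t_{\mathrm{PADD}} = (8\,t_{\mathrm{MULT}} + 5\,t_{\mathrm{SQ}}) \times \lceil (m - 1)/3 \rceil$. PDBL needs to run $(m - 1)$ times, so it requires $t_{\mathrm{PDBL}} = (m - 1) \times 3\,t_{\mathrm{SQ}}$ clock cycles. The projective to affine coordinate conversion (P2AC) requires $t_{\mathrm{P2AC}} = 2\,t_{\mathrm{INV}} + t_{\mathrm{SQ}} + 2\,t_{\mathrm{MULT}}$ clock cycles. Finally, the complete ECPM requires $t_{\mathrm{ECPM}} = 1 + t_{\mathrm{PADD}} + t_{\mathrm{PDBL}} + t_{\mathrm{P2AC}} + s$ clock cycles, where 1 clock cycle is spent in the LOAD state and s clock cycles in the WAIT state.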
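The cycle-count expressions above can be evaluated directly to reproduce Table 4. The Python sketch below is a worked check rather than part of the design. Two inputs are assumptions: s is taken to be the number of 32-bit digits, ⌈m/32⌉ (consistent with the 18-word RAM depth for m = 571), and the per-field `REDUC` values are not stated in this section but were back-solved from the tMULT and tSQ columns of Table 4. The function names are hypothetical.

```python
W = 32  # digit size in bits

# Reduction cycle counts per field size; assumption: back-solved from Table 4.
REDUC = {163: 4, 233: 8, 283: 2, 409: 6, 571: 2}


def ceil_div(a, b):
    return -(-a // b)


def cycles(m):
    """Clock cycles of each operation for field size m, following the formulas in the text."""
    s = ceil_div(m, W)
    reduc = REDUC[m]
    t_mult = 1 + s * s + 2 + reduc + 3
    t_sq = 1 + 2 * s + reduc + 3
    # floor(log2(m - 1)) + h(m - 1) - 1 multiplications in the Itoh-Tsujii inversion
    inv_mults = ((m - 1).bit_length() - 1) + bin(m - 1).count("1") - 1
    t_inv = (m - 1) * t_sq + inv_mults * t_mult
    t_padd = (8 * t_mult + 5 * t_sq) * ceil_div(m - 1, 3)
    t_pdbl = (m - 1) * 3 * t_sq
    t_p2ac = 2 * t_inv + t_sq + 2 * t_mult
    t_ecpm = 1 + t_padd + t_pdbl + t_p2ac + s
    return {"MULT": t_mult, "SQ": t_sq, "INV": t_inv, "PADD": t_padd,
            "PDBL": t_pdbl, "P2AC": t_p2ac, "ECPM": t_ecpm}


def latency_ms(m, f_mhz):
    """ECPM latency in milliseconds at a clock frequency given in MHz."""
    return cycles(m)["ECPM"] / (f_mhz * 1e3)
```

With these inputs, `cycles(163)["ECPM"]` evaluates to 42,419 cycles, and dividing by the 155.376 MHz Virtex-4 clock gives about 0.273 ms, in agreement with Table 6.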

4. Implementation Results

The proposed scalable ECP has been implemented using the Xilinx ISE 11.5 and Xilinx ISE 9.1i software. The target FPGAs selected were the Spartan-3 XC3S400, Virtex-4 XC4VFX12 and Virtex-E XCV2000E, in order to compare with other designs in the current literature. The XC3S400 FPGA was chosen instead of the XC3S200 because there are not enough resources to fully implement the proposed design on the XC3S200. However, the two FPGAs are in the same family, so they can still be compared. The hardware utilization results are shown in Table 5, along with the hardware utilization of other designs in the current literature that also present scalable ECP designs. The hardware utilizations for the proposed design in Table 5 are obtained post-place-and-route. The comparison of the hardware utilization among the designs presented in Table 5 is discussed in combination with the timing performances below.

Fig. 8. Plot of the efficiency comparison for implementations on Spartan-3.

Fig. 9. Plot of the ATP comparison for implementations on Spartan-3.

The timing performance and comparison to other designs in the current literature are shown in Table 6. Due to the different FPGAs used in literature, the hardware utilizations and timing performances are summarized in Table 7 using two performance metrics: efficiency and area-time product (ATP). In Table 7, efficiency is given by the number of ECPM calculations per second per 100 slices, so a larger value corresponds to a better performance. ATP is given by the number of slices multiplied by the latency, so better performance yields a lower ATP value. According to Table 7, the proposed design outperforms the ECPs in current literature in both the efficiency and ATP metrics.
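As a worked example of the two metrics, the short Python sketch below recomputes a few Table 7 entries from the slice counts in Table 5 and the latencies in Table 6. The function names `efficiency` and `atp` are hypothetical, and latencies are passed in milliseconds as they appear in Table 6.

```python
def efficiency(latency_ms, slices):
    """ECPM operations per second per 100 slices."""
    ecpm_per_second = 1000.0 / latency_ms
    return ecpm_per_second / (slices / 100.0)


def atp(latency_ms, slices):
    """Area-time product in slices x seconds."""
    return slices * (latency_ms / 1000.0)


# This work, Virtex-4 XC4VFX12, m = 163: 2431 slices, 0.273 ms
print(round(efficiency(0.273, 2431), 3), round(atp(0.273, 2431), 3))
# Design in [10], Spartan-3 XC3S200, m = 163: 1278 slices, 15.5 ms
print(round(efficiency(15.5, 1278), 3), round(atp(15.5, 1278), 3))
```

The two metrics carry the same information in reciprocal form (efficiency × ATP = 100), so a design that improves one necessarily improves the other.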

As previously mentioned, the designs in [8–10] use the HSC approach. More specifically, the design proposed in [10] is the main comparison target for the design proposed in this paper because it is the only design that supports Koblitz curves.

Instead of using the PicoBlaze soft-core microcontroller in the FPGA to implement a majority of the control signals, the entire proposed scalable ECP is implemented in hardware. As shown in Table 6, [8,9] implement different ECC curves, so the designs cannot be fairly compared to the proposed design. Only [10] implements the same NIST Koblitz curves as implemented in this paper. Even though the target FPGA is different, a comparison of the implementation results between the current design and the design in [10] can still be discussed because the FPGA is in the same family. Tables 5 and 6 show that even though the number of slices used is increased by 73.7%, the latency of the ECPM calculation is decreased by a factor of 16.8 for the 571-bit mode, 36.8 for the 283-bit mode and 34.0 for the 163-bit mode. Furthermore, the proposed design implements all five Koblitz curves recommended by NIST [4], whereas [10] only implements the three curves whose reduction algorithms are more similar to one another: K-163, K-283 and K-571 all use a pentanomial with low-order terms as the reduction polynomial, whereas K-233 and K-409 use trinomials with a higher-order middle term. As shown in Figs. 3 and 5, if only K-163, K-283, and K-571 were implemented, the reduction would always combine four shifted terms together, whereas if K-233 and K-409 are also implemented, the reduction hardware architecture needs to include the concatenation of inputs as well.
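The difference between the two families of reduction polynomials can be seen in a small software model. The Python sketch below is an illustration only: it folds the high part of a product back into the low m bits, using one shifted copy per low-order term of the NIST reduction polynomial. The exponents are those of the NIST binary fields [4] and are not restated in this section; the sketch works on whole integers rather than on 32-bit digits as the hardware does, and the names `LOW_TERMS`, `shifted_terms` and `reduce_mod_f` are hypothetical.

```python
# Exponents of the terms below x^m in each NIST reduction polynomial.
LOW_TERMS = {
    163: (7, 6, 3, 0),    # pentanomial
    233: (74, 0),         # trinomial
    283: (12, 7, 5, 0),   # pentanomial
    409: (87, 0),         # trinomial
    571: (10, 5, 2, 0),   # pentanomial
}


def shifted_terms(m):
    """Number of shifted copies of the high part combined in one folding step."""
    return len(LOW_TERMS[m])


def reduce_mod_f(c, m):
    """Reduce the binary polynomial c (bits of an int) modulo the degree-m polynomial."""
    mask = (1 << m) - 1
    while c >> m:
        high = c >> m          # part of c with degree >= m
        c &= mask
        for e in LOW_TERMS[m]:
            c ^= high << e     # x^m is congruent to the sum of the low-order terms
    return c
```

For the pentanomials every shift is smaller than one 32-bit digit, so each fold combines four closely aligned copies of the high part. For the trinomials the shift by 74 or 87 bits spans more than two digits, which is one way to see why the hardware needs a different arrangement of its inputs for K-233 and K-409.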

There are two major reasons for the improvement in latency. First, the novel PADD algorithm reduces the number of clock cycles used by the FFADD operation, which reduces the total number of clock cycles. Second, since all the temporary variables are stored in a RAM in hardware, there is no need for the store and load operations that would be required in a microcontroller environment. In addition, the maximum frequency of the proposed design is higher than in [10], even with the added complexity of all the control signals being implemented in hardware. To the authors’ best knowledge, the proposed design is the fastest scalable ECP in current literature that supports all five Koblitz curves recommended by NIST [4], which range up to 571 bits. Even though the latency does not reach the microsecond range achieved in [12,13], the scalability of the proposed design supports a wider range of security requirements without the need to reconfigure the hardware to adapt to the curve being used.

As a visual aid, the latency, efficiency and ATP comparisons for [8–10] and the design in this paper are plotted in Figs. 7–9, respectively. In the plots, lower latency and ATP values and a higher efficiency value represent better performance. Figs. 7–9 clearly demonstrate that the proposed design outperforms the designs presented in [8–10].

The comparison with the other scalable ECP designs in the current literature is explained in the remainder of this section. The comparisons may not be completely fair because the designs support different elliptic curves. However, they demonstrate the improved performance obtained by using the scalable ECP architecture based on the proposed FFAU.

The design proposed in [11] is also an HSC scalable ECP, in which the authors use the on-chip PowerPC in the Virtex-4 FX series to build the system. However, the portion that computes the ECPM must be dynamically reconfigured at run time, which explains the range of hardware utilization for different curves and the range of maximum frequencies. Furthermore, the design is only implemented up to 163 bits, and estimates are provided only up to 283 bits. Nevertheless, the design proposed in this paper has been implemented for the same XC4VFX12 FPGA: its hardware utilization is lower than that of the 163-bit implementation in [11] (2431 slices compared to 3528 slices), and its 163-bit mode is 3.9 times faster.

The design proposed in [25] is a scalable ECP completely built in hardware, as is the design proposed in this paper. However, the architecture of the design in [25] resembles that of a general-purpose processor, in that it uses the pipeline stages of instruction fetch, decode, execute and write back. The design proposed in this paper is more closely optimized for the instructions required by the ECPM operation, which results in its better timing performance. Moreover, the design proposed in [25] only supports curves up to 283 bits, whereas the proposed design supports curves up to 571 bits.

5. Conclusions

This paper proposes a scalable ECP that can support all five Koblitz curves recommended by NIST [4] without reconfiguring the hardware. The proposed design is completely implemented in hardware in order to decrease the computational latency of the ECPM. Compared to the state-of-the-art designs in the current literature, the proposed design consumes only 73% more hardware than the only other scalable ECP known to the authors that also supports NIST Koblitz curves [10], yet it achieves a 16- to 36-fold decrease in latency. The decrease in latency is advantageous in server-side applications where higher speeds are required to handle the large volume of requests. Moreover, the scalability provided by the proposed design is important for server-side applications to be able to establish connections with different users requiring different security levels.

A novel PADD algorithm is presented in this paper, which takes advantage of the FFAU architecture also proposed in this paper. The FFAU architecture eliminates the need to spend extra clock cycles to compute FFADD operations by integrating them into the FFMULT and FFSQ operations. Furthermore, a revised reduction algorithm is proposed that processes digits LSD-first in order to work efficiently with FFMULT and FFSQ. The paper also presents a scalable ECP architecture completely implemented in hardware that makes use of the FFAU and the novel PADD algorithm. Furthermore, implementing the scalable ECP completely in hardware, as opposed to taking the HSC approach, also eliminates the clock cycles required to load and store temporary values into registers. In the proposed design, the writing of the intermediate values happens concurrently with the FFAU computations.

For future work, the Lopez–Dahab algorithm [26] for point multiplication on the pseudo-random curves recommended by NIST [4] can be examined to evaluate whether the novel FFAU architecture can also be applied to that algorithm to improve the performance of scalable ECPs that support pseudo-random curves. Other future work includes examining whether the novel PADD algorithm can benefit from using more than one FFAU, i.e. examining the effects of parallelization in order to further improve the latency, and increasing the digit size from w = 32 to 64 or even 128 to examine the area-time trade-off of the FFAU architecture and its effect on the implementation of the scalable ECP.

Acknowledgement

The current research is supported by the University of Saskatchewan College of Graduate Studies and Research Dean’s Scholarship.

References

[1] V. Miller, Use of elliptic curves in cryptography, in: CRYPTO85: Proceedings of the Advances in Cryptology, Springer-Verlag, 1986, pp. 417–426.

[2] N. Koblitz, Elliptic curve cryptosystems, Mathematics of Computation 48 (177) (1987) 203–209.

[3] R. Rivest, A. Shamir, L. Adleman, A method for obtaining digital signatures and public-key cryptosystems, Communications of the ACM 21 (2) (1978) 120–126.

[4] National Institute of Standards and Technology, Recommended Elliptic Curves for Federal Government Use, July 1999.

[5] Standards for Efficient Cryptography, Section 2: Recommended Elliptic Curve Domain Parameters, July 2000.

[6] Federal Information Processing Standard, FIPS PUB 186-3: Digital Signature Standard (DSS), June 2009.

[7] J. Solinas, Efficient arithmetic on Koblitz curves, Designs, Codes and Cryptography 19 (2000) 195–249.

[8] M. Hassan, M. Benaissa, Low area – scalable hardware/software co-design for elliptic curve cryptography, in: 3rd International Conference on New Technologies, Mobility and Security (NTMS), 2009, pp. 1–5.

[9] M. Hassan, M. Benaissa, A scalable hardware/software co-design for elliptic curve cryptography on PicoBlaze microcontroller, in: Proceedings of 2010 IEEE International Symposium on Circuits and Systems (ISCAS), 2010, pp. 2111–2114.

[10] M. Hassan, M. Benaissa, Flexible hardware/software co-design for scalable elliptic curve cryptography for low-resource applications, in: 21st IEEE International Conference on Application-specific Systems Architectures and Processors (ASAP), 2010, pp. 285–288.

[11] M. Morales-Sandoval, C. Feregrino-Uribe, R. Cumplido, I. Algredo-Badillo, A reconfigurable GF(2^m) elliptic curve cryptographic coprocessor, in: 2011 VII Southern Conference on Programmable Logic (SPL), 2011, pp. 209–214.












[12] Y. Zhang, D. Chen, Y. Choi, L. Chen, S. Ko, A high performance ECC hardware implementation with instruction-level parallelism over GF(2^163), Microprocessors and Microsystems 34 (6) (2010) 228–236.

[13] K. Järvinen, J. Skytta, Fast point multiplication on Koblitz curves: parallelization method and implementations, Microprocessors and Microsystems 33 (2009) 106–116.

[14] A.E. Cohen, K.K. Parhi, Fast reconfigurable elliptic curve cryptography acceleration for GF(2^m) on 32 bits processors, Journal of Signal Processing Systems 60 (1) (2010) 31–45.

[15] T. Itoh, S. Tsujii, A fast algorithm for computing multiplicative inverses in GF(2^m) using normal bases, Information and Computation 78 (3) (1988) 171–177.

[16] G.N. Selimis, A.P. Fournaris, H.E. Michail, O. Koufopavlou, Improved throughput bit-serial multiplier for GF(2^m) fields, Mathematics of Computation 48 (177) (1987) 243–264.

[17] A. Reyhani-Masoleh, M.A. Hasan, Low complexity bit parallel architectures for polynomial basis multiplication over GF(2^m), IEEE Transactions on Computers 53 (8) (2004) 945–959.

[18] C.H. Kim, C.P. Hong, S. Kwon, A digit-serial multiplier for finite field GF(2^m), IEEE Transactions on VLSI Systems 13 (4) (2005) 476–483.

[19] P.G. Comba, Exponentiation cryptosystems on the IBM PC, IBM Systems Journal 29 (4) (1990) 526–538.

[20] J. Großchädl, S. Tillich, P. Ienne, L. Pozzi, A.K. Verma, When instruction set extensions change algorithm design: a study in elliptic curve cryptography, in: 4th Workshop on Application Specific Processors, 2005, pp. 2–9.

[21] D. Hankerson, A. Menezes, S. Vanstone, Guide to Elliptic Curve Cryptography, Springer Verlag, New York, NY, USA, 2004.

[22] N. Koblitz, CM-curves with good cryptographic properties, in: CRYPTO91: Proceedings of the Advances in Cryptology, Lecture Notes in Computer Science, vol. 576, Springer, 1991, pp. 279–287.

[23] B.B. Brumley, K.U. Järvinen, Conversion algorithms and implementations for Koblitz curve cryptography, IEEE Transactions on Computers 59 (1) (2010) 81–92.

[24] A. Karatsuba, Y. Ofman, Multiplication of multi-digit numbers on automata, Soviet Physics Doklady 7 (1963) 595–596.

[25] M. Benaissa, W.M. Lim, Design of flexible GF(2^m) elliptic curve cryptography processors, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 14 (6) (2006) 659–662.

[26] J. Lopez, R. Dahab, Fast multiplication on elliptic curves over GF(2^m) without precomputation, in: CHES99: Proceedings of the First International Workshop on Cryptographic Hardware and Embedded Systems, Springer Verlag, 1999, pp. 316–327.

K.C. Cinnati Loi received his dual B.Sc. in Electrical Engineering and in Computer Science in 2008 from the University of Saskatchewan, Canada. He received his M.Sc. at the University of Saskatchewan in 2010. He is currently a Ph.D. candidate at the University of Saskatchewan. His research interests are hardware implementation of cryptosystems, high performance FPGA applications and hardware/software co-design.

Seok-Bum Ko received his Ph.D. in Electrical and Computer Engineering at the University of Rhode Island, USA in 2002. He is currently an associate professor in Electrical and Computer Engineering at the University of Saskatchewan, Canada. His research interests include computer arithmetic, digital design automation, and computer architecture. Dr. Ko is a senior member of the IEEE Computer Society.



















































