Finite Field Multiplication Using Reordered Normal Basis Multiplier

Fayez Gebali, Senior Member, IEEE, and Turki Al-Somani, Member, IEEE

Abstract—We present in this paper affine linear and nonlinear techniques for design space exploration of finite-field multiplication using the reordered normal basis. Fifteen basic designs are possible using these linear techniques, in close agreement with results previously published using ad hoc techniques. However, the major contribution of this paper is the introduction of nonlinear techniques that allow the designer to control the workload per processor and the communication requirements between processors. We also present models for the performance of processor arrays implementing the finite field multiplier. Performance includes system area, delay and power consumption. The main parameters affecting performance are the number of bits processed in parallel per processor and the hardware details that determine how strongly each performance figure depends on the number of bits processed in parallel.

Index Terms—Concurrency, parallel algorithms, optimal normal basis, finite field, affine linear scheduling and projection, nonlinear scheduling and projection.

1 INTRODUCTION

Efficient finite field modular multiplication is important in elliptic curve cryptosystems (ECC) [1]. An optimal normal basis (ONB) [2] used to represent the field elements provides savings in the number of arithmetic operations when the Massey-Omura finite field multiplication scheme is used [3]. The efficiency of operation depends greatly on the choice of the basis used for field element representation and on how the operations are implemented in hardware or software. Since ECC is suitable for embedded and mobile devices, power becomes a decisive factor due to limited battery capacity. In the finite field GF(2^m), polynomial basis (PB) and normal basis (NB) are the most commonly used bases [4], [5]. Normal basis is more suitable for hardware implementations than polynomial basis since operations in normal basis representation consist mainly of rotation, shifting and exclusive-ORing, which can be efficiently implemented in hardware [6].

• F. Gebali is with the Department of Electrical and Computer Engineering, University of Victoria, Victoria, BC, V8W 3P6, Canada. E-mail: see http://www.ece.uvic.ca/~fayez/info/contact.html

• T. Al-Somani is with the Department of Computer Engineering, Umm Al-Qura University, P.O. Box 715, Makkah 21955, Saudi Arabia. E-mail: [email protected]

In normal basis, multiplication can be modeled as a matrix multiplication where two input vectors are multiplied by a multiplication matrix resulting in the output product bits. The number of ones inside the multiplication matrix is referred to as the normal basis complexity and is equal to 2m − 1 for an optimal normal basis (ONB) [2]. An optimal normal basis is one with the minimum possible number of non-zero elements in the multiplication matrix.

Two types of optimal normal bases have been found, referred to as optimal normal basis type I and type II [2]. The reordered normal basis is a certain permutation of a type II optimal normal basis [7], [8], [9]. Sunar and Koç in [8] provided an efficient full bit-parallel type II optimal normal basis multiplier using the reordered normal basis. This paper explores several systolic arrays for the Sunar-Koç multiplication algorithm using a systematic technique that combines affine and nonlinear Processing Element (PE) scheduling and assignment of computations to processors.

2 SUNAR-KOÇ OPTIMAL NORMAL BASIS TYPE II MULTIPLIER

The elements of the finite field GF(2^m) are usually represented by m-dimensional vectors where the components of each vector are binary bits over GF(2) [10], [11], [12].

The basis for representing any element in GF(2^m) is a set of m linearly independent vectors. A normal basis for GF(2^m) is represented by the set of m elements

{β^{2^0}, β^{2^1}, β^{2^2}, · · · , β^{2^{m−1}}}

where β ∈ GF(2^m) is the normal element.

In normal basis, an element A ∈ GF(2^m) can be uniquely represented in the form A = Σ_{i=0}^{m−1} a_i β^{2^i}, where a_i ∈ {0, 1}. Sunar and Koç in [8] reported an efficient Type II optimal normal basis multiplier. The main idea of the work reported in [8] was to convert the two operands into their equivalent representation in another basis, perform the multiplication in that basis, and convert the product back to the normal basis.

The optimal normal basis of Type II for the field GF(2^m) is constructed by expressing the normal element β in the form:

β = γ + γ^{−1}    (1)



where γ is a primitive (2m + 1)-st root of unity [8], which implies

γ^{2m+1} = 1 and γ^i ≠ 1 for 1 ≤ i < 2m + 1    (2)

Sunar and Koç showed that the basis for GF(2^m) can be expressed by the set of m elements

{γ + γ^{−1}, γ^2 + γ^{−2}, γ^{2^2} + γ^{−2^2}, · · · , γ^{2^{m−1}} + γ^{−2^{m−1}}}    (3)

Two elements A and B ∈ GF(2^m) can be represented in the new basis as:

A = Σ_{i=1}^{m} a_i (γ^i + γ^{−i}),    B = Σ_{j=1}^{m} b_j (γ^j + γ^{−j})    (4)

The product C = A · B is written as:

C = X + Y + Z    (5)

where

X = Σ_{1 ≤ i, j ≤ m, i ≠ j} a_i b_j (γ^{i−j} + γ^{−(i−j)})    (6)

Y = Σ_{i=1}^{m} Σ_{j=1}^{m−i} a_i b_j (γ^{i+j} + γ^{−(i+j)})    (7)

Z = Σ_{i=1}^{m} Σ_{j=m−i+1}^{m} a_i b_j (γ^{i+j} + γ^{−(i+j)})    (8)

The term X has the property that the exponent (i − j) of γ is already within the proper range, i.e., −m ≤ (i − j) ≤ m for all i, j ∈ [1, m].

The term Y is also within the proper range: the two indices span i = 1, 2, · · · , m and j = 1, 2, · · · , m − i, so their sum k = i + j always lies within the range 1 ≤ k ≤ m. The basis elements of Z, however, are all out of range. Thus, the identity in Eq. (2) is used to bring them to the proper range by modifying Eq. (8) as:

Z = Σ_{i=1}^{m} Σ_{j=m−i+1}^{m} a_i b_j (γ^{i+j} + γ^{−(i+j)})
  = Σ_{i=1}^{m} Σ_{j=m−i+1}^{m} a_i b_j (γ^{2m+1−(i+j)} + γ^{−(2m+1−(i+j))})    (9)

Therefore, if k = i + j > m, γ^k is replaced by γ^{2m+1−k}.
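Equations (5)-(9) translate directly into a bit-level procedure. The following Python sketch illustrates one way to compute the product coefficients in the permuted basis; it assumes the operand coefficients a_1..a_m and b_1..b_m are already given in that basis (basis conversion is not shown), and the function name and data layout are illustrative, not taken from the paper.

    def rnb_multiply(a, b, m):
        """Multiply A = sum a_i (g^i + g^-i) by B = sum b_j (g^j + g^-j)
        over GF(2), following Eqs. (6)-(9). a and b are 1-indexed lists of
        length m + 1 holding the operand bits; a[0] and b[0] are unused."""
        c = [0] * (m + 1)               # product coefficients c_1 .. c_m
        for i in range(1, m + 1):
            for j in range(1, m + 1):
                if a[i] & b[j]:
                    d = abs(i - j)      # "difference" term of X, Eq. (6)
                    if d != 0:          # i == j gives g^0 + g^0 = 0 over GF(2)
                        c[d] ^= 1
                    s = i + j           # "sum" term of Y or Z, Eqs. (7)-(8)
                    if s > m:           # wrap around using g^(2m+1) = 1, Eq. (9)
                        s = 2 * m + 1 - s
                    c[s] ^= 1
        return c[1:]                    # [c_1, ..., c_m]

    # Example with m = 5 (illustrative bit patterns only):
    a = [0, 1, 0, 1, 1, 0]
    b = [0, 0, 1, 1, 0, 1]
    print(rnb_multiply(a, b, 5))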

3 PARALLELIZATION OF THE TYPE II MULTIPLIER ALGORITHM

Several techniques were proposed for hardware implementations of iterative algorithms [13], [14], [15] but were limited to simple two-dimensional (2D) algorithms such as matrix-vector multiplication. The first author proposed a formal algebraic procedure for processor array implementation starting from a regular iterative algorithm with arbitrary dimensions [16], [17] and an arbitrary number of variables. The algorithm is expressed as a convex hull or computation domain D in a space whose dimensionality matches the number of indices of the algorithm. In that technique, D expresses the dependences of the variables on the algorithm indices, in direct contrast with the directed graph of earlier techniques. The dependence of each algorithm variable is expressed as a dependence matrix A, and its nullspace is used to design the hardware that implements the algorithm. In this paper, we extend the technique proposed in [16] to explore more implementations using nonlinear operators that augment the linear task scheduling and projection operators used earlier.

3.1 Type II Multiplier Algorithm Analysis

The multiplier algorithm in (6), (7) and (8) has two input variables A and B and three output variables X, Y and Z. The bounds on the indices effectively result in a convex hull D ⊂ Z^2 whose facets are defined by the bounds on the algorithm indices. D can be visualized in a 2-D plane, so we can use a simplified version of the techniques reported in [17], [18], [19], [20]. The dependence graph of the three variables is shown in Fig. 1 and indicates that all nodes in the graph are used to do useful computations.

Fig. 1. Merged dependence graph for output variables X, Y and Z in D when m = 5.

Some of the nodes in the graph do more than one operation on the incoming data in order to produce two types of partial results. For example, nodes near the top left corner are responsible for producing output samples x and y. Similarly, nodes near the bottom right corner are responsible for producing output samples x and z.
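As a rough illustration of this node classification, the short sketch below (a hypothetical helper, not from the paper) tags every point (i, j) of the domain D with the partial results it contributes to, following the index bounds of Eqs. (6)-(8).

    def classify_nodes(m):
        """Tag each node (i, j) of D with the partial results it produces."""
        tags = {}
        for i in range(1, m + 1):
            for j in range(1, m + 1):
                t = []
                if i != j:              # contributes to X, Eq. (6)
                    t.append('X')
                if i + j <= m:          # contributes to Y, Eq. (7)
                    t.append('Y')
                else:                   # contributes to Z, Eq. (8), wrapped by Eq. (9)
                    t.append('Z')
                tags[(i, j)] = t
        return tags

    # For m = 5: nodes with small i + j produce x and y partial results,
    # while nodes with large i + j produce x and z partial results.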

4 DATA SCHEDULING

We use an affine scheduling function such that a point p = [i j]^t ∈ D is associated with the time value

t(p) = sp − s = i s1 + j s2 − s    (10)



where s = [s1 s2] is the scheduling vector and s is a scalar constant. We have several options for the scheduling vector:

s1 = [1 1],  s2 = [1 0],  s3 = [0 1],  s4 = [1 −1],  s5 = [−1 1]

Fig. 2 is the resulting directed acyclic graph (DAG) associated with the scheduling function s2. All points lying on a horizontal line are associated with the same time index value.

Fig. 2. Directed acyclic graph (DAG) associated with the scheduling function s2 when m = 5.

Nonlinear scheduling overcomes the limitations of affine scheduling. Let us start by modifying the linear scheduling operation in Eq. (10) into the nonlinear scheduling operation:

t(p) = ⌊(sp − s) / α⌋    (11)

Using this artifact, we increase the workload that must be performed at each time step by the factor α. The number of output samples produced at each time step is increased by the same factor. Fig. 3 shows the new nonlinear timing scheme for the case when α = 2. The grayed regions indicate equitemporal time domains. The total number of iterations is reduced to ⌈m/α⌉ = 3 instead of 5 when m = 5.
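A small sketch may make the two schedules concrete. The code below is illustrative only; the scalar offset s = 1 and the use of scheduling vector s2 = [1 0] are assumptions chosen to reproduce the time values 0..4 of Fig. 2 and the grouping of Fig. 3.

    def affine_time(p, s_vec, s_scalar=1):
        """Affine schedule, Eq. (10): t(p) = s.p - s."""
        i, j = p
        return s_vec[0] * i + s_vec[1] * j - s_scalar

    def nonlinear_time(p, s_vec, alpha, s_scalar=1):
        """Nonlinear schedule, Eq. (11): t(p) = floor((s.p - s) / alpha)."""
        return affine_time(p, s_vec, s_scalar) // alpha

    # With s2 = [1, 0], m = 5 and alpha = 2, the five affine time steps 0..4
    # collapse into ceil(m / alpha) = 3 equitemporal groups 0, 1, 2 (Fig. 3).
    m, alpha, s2 = 5, 2, (1, 0)
    groups = {}
    for i in range(1, m + 1):
        for j in range(1, m + 1):
            groups.setdefault(nonlinear_time((i, j), s2, alpha), []).append((i, j))
    print(sorted(groups))               # [0, 1, 2]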

5 NODE PROJECTION

In this section we discuss how we can assign a PE to each node in the DAG. We define linear projection using the projection direction d. If two points in the DAG lie along the projection direction such that p2 = p1 + e d, then these two points will be mapped onto the same point in the processor array space. Once we decide on a projection direction, we can find the corresponding projection matrix P whose nullvector is d [16], [17]. A point p ∈ D will be projected to a point p̄ in the processor array space using the operation p̄ = Pp. The projection vectors associated with each scheduling function are illustrated in Table 1.

Fig. 3. Nonlinear scheduling for the different nodes in the modulo multiplier algorithm for the case m = 5 and α = 2 and the scheduling function s2.

Linear projection operations suffer from some limitations. The number of nodes allocated to a PE is fixed and cannot be modified. Furthermore, the total number of PEs is also fixed and cannot be modified. These limitations are overcome by using nonlinear node projection. Reference [17] discussed the need to develop nonlinear node projection. Let us start by defining the nonlinear projection operation:

p̄ = ⌊Pp / β⌋    (12)

where β is a positive integer. The nonlinear projection operation in (12) produces ⌈m/β⌉ PEs.

Fig. 4 shows the nonlinear projection scheme for the case when β = 2 and a choice of projection direction d12, d21 or d43. We see that at each time step two samples of X′′ and Y are produced. The total number of PEs is now ⌈m/β⌉ = 3 instead of 5 when m = 5. Fortunately, the interprocessor communication does not increase at all: each PE still sends two bits to the neighbouring PEs.
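The following sketch applies the nonlinear projection of Eq. (12) to the m = 5 domain. It is illustrative only; the choice P = [0 1] as a projection matrix whose nullvector is d21 = [1 0]^t is an assumption.

    def nonlinear_project(p, P_row, beta):
        """Nonlinear projection, Eq. (12): pbar = floor(P.p / beta)."""
        i, j = p
        return (P_row[0] * i + P_row[1] * j) // beta

    # With m = 5, beta = 2 and P = [0 1] (nullvector d21 = [1 0]^t),
    # the 25 nodes of D map onto ceil(m / beta) = 3 PEs, as in Fig. 4.
    m, beta, P = 5, 2, (0, 1)
    pe_of_node = {(i, j): nonlinear_project((i, j), P, beta)
                  for i in range(1, m + 1) for j in range(1, m + 1)}
    print(sorted(set(pe_of_node.values())))   # [0, 1, 2]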

6 PERFORMANCE MODELING OF MULTIPLIER STRUCTURES

We are interested here in obtaining performance figures for area, power, and delay for reordered normal basis multipliers. Several type II optimal normal basis or reordered normal basis multipliers were reported in [3], [9], [21], [22], [23], [24], [25], [26]. The results indicated uniform complexity when counting the number of elementary gates. We discuss here the effect of clustering processors using our nonlinear scheduling and projection operations.

The area of the PE can be estimated as

A(PE) = A0 + βA1    (13)



TABLE 1
The projection vectors associated with each scheduling vector.

Scheduling vector     Possible projection directions
s1 = [ 1  1 ]         d11 = [ 1  1 ]t    d12 = [ 1  0 ]t    d13 = [ 0  1 ]t
s2 = [ 1  0 ]         d21 = [ 1  0 ]t    d22 = [ 1 −1 ]t    d23 = [ 1  1 ]t
s3 = [ 0  1 ]         d31 = [ 0  1 ]t    d32 = [ −1  1 ]t   d33 = [ 1  1 ]t
s4 = [ 1 −1 ]         d41 = [ 1 −1 ]t    d42 = [ 0 −1 ]t    d43 = [ 1  0 ]t
s5 = [ −1  1 ]        d51 = [ −1  1 ]t   d52 = [ −1  0 ]t   d53 = [ 0  1 ]t

Fig. 4. Nonlinear projection for the case m = 5 and β = 2 and choice of d12, d21 or d43.

where A0 is the portion of the PE area that does not depend on the processor workload and A1 is the PE area dedicated to storing or processing a single bit (i.e. when β = 1). A0 is due to the area of the control section, and A1 is due to the areas of the ALU, register bank and I/O sections.

The power consumption of the PE at each iteration step can be estimated as

P(PE) = P0 + αβP1    (14)

where P0 is the portion of the PE power budget that does not depend on the workload, such as the control section; P0 could be due to overhead tasks that do not depend on how much data is being processed. P1 is the portion of the PE power budget needed to process or access a single bit of data.

Although finite-field arithmetic does not involve carry propagation, the reordered normal basis multiplication scheme requires propagation of the sum signal as shown in Fig. 1. Therefore, the PE delay during each iteration step depends on how many bits are being processed by each PE, i.e.

D(PE) = D0 + αD1    (15)

where D0 is the portion of the PE delay that does not depend on the number of bits being processed at each iteration, which could be due to access from the local memory or registers, and D1 is the PE delay to operate on a single bit.

6.1 Estimating Processor Array Performance

We assume the basic structure for the reordered normal basis multiplier to be a processor array where each processor has a small set of registers to store its own data and each pair of neighbouring processors has a shared cache to exchange data. Figure 5 shows the structure of the processor array that implements the reordered normal basis multiplication.

Fig. 5. Processor array to implement the reordered normal basis multiplication.

The area of the processor array is the sum of the PE areas plus the areas associated with the neighbour-to-neighbour links:

A(array) = n(PE) [A(PE) + 2ab]    (16)

where n(PE) is the number of PEs in the processor array and ab is the area of a single-bit link between the processors. Typically ab ≪ A(PE).



The power consumed by the processor array at each iteration is given by

P(array) = n(PE) [P(PE) + 2pb]    (17)

where pb is the power required to send a bit from one processor to its neighbour.

The total time required by the processor array to complete the multiplication operation is given by

D(array) = ni [D(PE) + db]    (18)

where ni is the total number of iterations and db is the time delay to communicate between the neighbouring PEs.
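The PE-level and array-level models of Eqs. (13)-(18) are simple enough to evaluate directly. Below is a minimal Python sketch; all constants in the example evaluation are assumed for illustration and do not come from the paper.

    import math

    def pe_area(A0, A1, beta):              # Eq. (13)
        return A0 + beta * A1

    def pe_power(P0, P1, alpha, beta):      # Eq. (14)
        return P0 + alpha * beta * P1

    def pe_delay(D0, D1, alpha):            # Eq. (15)
        return D0 + alpha * D1

    def array_area(n_pe, A_pe, a_b):        # Eq. (16)
        return n_pe * (A_pe + 2 * a_b)

    def array_power(n_pe, P_pe, p_b):       # Eq. (17)
        return n_pe * (P_pe + 2 * p_b)

    def array_delay(n_iter, D_pe, d_b):     # Eq. (18)
        return n_iter * (D_pe + d_b)

    # Illustrative evaluation (assumed values): m = 163, alpha = 1, beta = 8
    m, alpha, beta = 163, 1, 8
    n_pe, n_iter = math.ceil(m / beta), math.ceil(m / alpha)
    total_area = array_area(n_pe, pe_area(10.0, 1.0, beta), 0.5)
    total_delay = array_delay(n_iter, pe_delay(1.0, 0.2, alpha), 0.2)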

6.2 Comparing Areas for Standard vs. Nonlinear Multipliers

From (16), the standard structures reported in the literature have a total area given by:

As = m (A0 + A1 + 2ab)    (19)

From (16), the total processor area when nonlinear data scheduling and task projection are used is given by:

An = ⌈m/β⌉ [A(PE) + 2ab] = ⌈m/β⌉ (A0 + βA1 + 2ab)    (20)

where the area of the interconnection network is ignored since it will be the same for either type of processor array used. The ratio As/An is given by:

As/An ≈ β (1 + ra + εa) / (1 + βra + εa)    (21)

where ra = A1/A0 and εa = 2ab/A0 ≪ 1.

Figure 6 shows the area improvement according to the above equation for increasing values of the ratio ra and εa = 0.1. Four different cases were considered: β = 2 (dotted line), β = 4 (dash-dot line), β = 8 (dashed line) and β = 128 (solid line).

Fig. 6. Area improvement for nonlinear vs. standard multipliers for different β values: β = 2 (dotted line), β = 4 (dash-dot line), β = 8 (dashed line) and β = 128 (solid line).

It is apparent that nonlinear projection leads to area reduction when the ratio A1/A0 ≪ 1. This is probably true for most processors since the ALU area needed to process or store one bit is much smaller than the area needed to store all the data and implement the control unit for any processor.

6.3 Comparing Power for Standard vs. Nonlinear Multipliers

From (17), the power consumed per iteration by the standard processors reported in the literature is given by:

Ps = m (P0 + P1 + 2pb)    (22)

According to [27], pb for off-chip communication is roughly 50 times larger. However, when all the PEs are integrated in one chip we get pb ≈ P1. Thus we have

Ps = m (P0 + 3P1)    (23)

From (17), the power consumption of the nonlinear design is:

Pn = (m/β) [P0 + (αβ + 2)P1]    (24)

The ratio Ps/Pn is given by

Ps/Pn ≈ β (1 + 3rp) / (1 + (αβ + 2)rp)    (25)

where rp = P1/P0.

Figure 7 shows the power improvement according to the above equation for increasing values of β. Four different cases were considered: α = 1 (dotted line), α = 2 (dash-dot line), α = 4 (dashed line) and α = 8 (solid line), with rp = 1.

Fig. 7. Power consumption improvement for nonlinear vs. standard multipliers for different α values: α = 1 (dotted line), α = 2 (dash-dot line), α = 4 (dashed line) and α = 8 (solid line).

It is surprising that nonlinear projection leads to dramatic power reduction when α = 1, i.e. linear data scheduling, and when β ≫ 1. We also notice that the parameter α > 1 actually results in more power consumption.



6.4 Comparing Delay for Standard vs. Nonlinear Multipliers

From (18), the delay of the standard processors reported in the literature is given by:

Ds = m (D0 + D1 + db)    (26)

where the delay to communicate with the shared memory is ignored since it is present for both the standard and nonlinear architectures. Since we assumed all PEs to lie on the same chip, we can assume db ≈ D1 and we have

Ds = m (D0 + 2D1)    (27)

From (18), the delay of the nonlinear design is:

Dn = ⌈m/α⌉ [D0 + (α + 1)D1]    (28)

The ratio Ds/Dn is given by

Ds/Dn = α (1 + 2rd) / (1 + (α + 1)rd)    (29)

where rd = D1/D0.

Figure 8 shows the delay improvement according to the above equation for increasing values of the ratio rd. Four different cases were considered: α = 2 (dotted line), α = 8 (dash-dot line), α = 64 (dashed line) and α = 256 (solid line).

Fig. 8. Delay improvement for nonlinear vs. standard multipliers for different α values: α = 2 (dotted line), α = 8 (dash-dot line), α = 64 (dashed line) and α = 256 (solid line).

It is apparent that nonlinear scheduling leads to delay reduction for high values of α, especially for values of rd < 1. However, that case is rare and typically rd ≫ 1. Therefore, it is safe to take α = 1 or 2 at the most. Such choices also lead to better power consumption ratios, as shown in Section 6.3.
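For completeness, the three improvement ratios of Eqs. (21), (25) and (29) can be evaluated directly. The helper names and sample parameter values below are illustrative only, with ra = A1/A0, rp = P1/P0 and rd = D1/D0 as defined above.

    def area_ratio(beta, ra, eps_a):        # Eq. (21): As / An
        return beta * (1 + ra + eps_a) / (1 + beta * ra + eps_a)

    def power_ratio(alpha, beta, rp):       # Eq. (25): Ps / Pn
        return beta * (1 + 3 * rp) / (1 + (alpha * beta + 2) * rp)

    def delay_ratio(alpha, rd):             # Eq. (29): Ds / Dn
        return alpha * (1 + 2 * rd) / (1 + (alpha + 1) * rd)

    # Example with assumed parameters: beta = 8, alpha = 1, ra = 0.1,
    # eps_a = 0.1, rp = 1, rd = 1
    print(area_ratio(8, 0.1, 0.1), power_ratio(1, 8, 1), delay_ratio(1, 1))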

7 CONCLUSIONS

We illustrated how affine linear and nonlinear data scheduling and projection techniques are used for design space exploration of the Sunar-Koç optimal normal basis type II multiplication algorithm. The nonlinear scheduling and projection techniques can control the workload of each PE or thread during each iteration step and can also control the interprocessor communication. The area, power, and delay performance of the processor array were studied. The most important conclusion is that using the nonlinear projection operation with the parameter β ≫ 1 gives better system area and power performance.

REFERENCES

[1] N. Koblitz, "Elliptic curve cryptosystems," Mathematics of Computation, vol. 48, pp. 203–209, 1987.

[2] R. C. Mullin, I. M. Onyszchuk, S. A. Vanstone, and R. Wilson, "Optimal normal bases in GF(p^n)," Discrete Applied Mathematics, vol. 22, pp. 149–161, 1988/89.

[3] J. Massey and J. Omura, Computational Method and Apparatus for Finite Field Arithmetic. US Patent No. 4,587,627, 1986.

[4] A. Menezes, Elliptic Curve Public Key Cryptosystems. Boston, MA: Kluwer Academic Publishers, 1993.

[5] A. Menezes, I. Blake, X. Gao, R. Mullin, S. Vanstone, and T. Yaghoobian, Applications of Finite Fields. Boston, MA: Kluwer Academic Publishers, 1993.

[6] R. Lidl and H. Niederreiter, Introduction to Finite Fields and Their Applications. Cambridge, UK: Cambridge University Press, 1994.

[7] S. Gao and S. Vanstone, "On orders of optimal normal basis generators," Mathematics of Computation, vol. 64, no. 2, pp. 1227–1233, 1995.

[8] B. Sunar and Ç. K. Koç, "An efficient optimal normal basis type II multiplier," IEEE Transactions on Computers, vol. 50, no. 1, pp. 83–88, Jan. 2001.

[9] S. Gao and G. Sobelman, "Improved VLSI designs for multiplication and inversion in GF(2^m) over normal bases," in Proc. 13th Ann. IEEE Intl. ASIC/SOC Conf., 2000, pp. 97–101.

[10] A. Lenstra and E. Verheul, "Selecting cryptographic key sizes," Journal of Cryptology, vol. 14, no. 4, pp. 255–293, Sep. 2001.

[11] A. Wander, N. Gura, H. Eberle, V. Gupta, and S. Chang, "Energy analysis for public-key cryptography for wireless sensor networks," in IEEE PerCom'05, Pisa, Italy, Mar. 2005.

[12] L. Batina, N. Mentens, K. Sakiyama, B. Preneel, and I. Verbauwhede, "Low-cost elliptic curve cryptography for wireless sensor networks," in Proceedings of the 3rd European Workshop on Security and Privacy in Ad Hoc and Sensor Networks (ESAS 2006), Hamburg, Germany, Sep. 2006.

[13] S. Rao and T. Kailath, "Regular iterative algorithms and their implementation on processor arrays," Proc. IEEE, vol. 76, no. 3, pp. 259–269, Mar. 1988.

[14] S. Kung, VLSI Array Processors. Englewood Cliffs, NJ: Prentice-Hall, 1988.

[15] E. Abdel-Raheem, "Design and VLSI implementation of multirate filter banks," Ph.D. dissertation, Dept. of Electrical and Computer Eng., Univ. of Victoria, 1995.

[16] F. El-Guibaly and A. Tawfik, "Mapping 3D IIR digital filter onto systolic arrays," Multidimensional Systems and Signal Processing, vol. 7, no. 1, pp. 7–26, Jan. 1996.

[17] F. Gebali, Algorithms and Parallel Computers. New York: John Wiley, 2011.

[18] M. Fayed, "A security coprocessor for next generation IP telephony architecture, abstraction, and strategies," Ph.D. dissertation, Dept. of Electrical and Computer Eng., Univ. of Victoria, Victoria, BC, V8W 3P6, 2007.

[19] M. Fayed, M. W. El-Kharashi, and F. Gebali, "A high-speed, low-area processor array architecture for multiplication and squaring over GF(2^m)," in Proceedings of the Second IEEE International Design and Test Workshop (IDT 2007), Y. Zorian, H. ElTahawy, A. Ivanov, and A. Salem, Eds., Cairo, Egypt, 2007, pp. 226–231.

[20] F. Gebali and A. Rafiq, "Processor array architectures for deep packet classification," IEEE Transactions on Parallel and Distributed Computing, vol. 17, no. 3, pp. 241–252, 2006.

[21] W. Geiselmann and D. Gollmann, "Symmetry and duality in normal basis multiplication," in Proceedings of Applied Algebra, Algebraic Algorithms, and Error Correcting Codes Symposium, Jul. 1998, pp. 230–238.


[22] M. Feng, "A VLSI architecture for fast inversion in GF(2^m)," IEEE Transactions on Computers, vol. 38, pp. 1383–1386, 1989.

[23] G. B. Agnew, R. C. Mullin, I. Onyszchuk, and S. A. Vanstone, "An implementation for a fast public key cryptosystem," J. Cryptology, vol. 3, pp. 63–79, 1991.

[24] A. Reyhani-Masoleh and M. Hasan, "Efficient digit-serial normal basis multipliers over GF(2^m)," IEEE Transactions on Computers, Special Issue on Cryptographic Hardware and Embedded Systems, vol. 52, no. 4, pp. 428–439, 2003.

[25] ——, "Low complexity word-level sequential normal basis multipliers," IEEE Transactions on Computers, vol. 54, no. 2, pp. 98–110, 2005.

[26] H. Wu, A. Hasan, I. Blake, and S. Gao, "Finite field multiplier using redundant representation," IEEE Transactions on Computers, vol. 51, no. 11, pp. 1306–1316, 2002.

[27] D. Liu and C. Svensson, "Power consumption estimation in CMOS VLSI chips," IEEE Journal of Solid-State Circuits, vol. 29, no. 6, pp. 663–670, Jun. 1994.
