10
110 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 62, NO. 1, JANUARY 2015 High-Throughput Finite Field Multipliers Using Redundant Basis for FPGA and ASIC Implementations Jiafeng Xie, Pramod Kumar Meher, Senior Member, IEEE, and Zhi-Hong Mao, Senior Member, IEEE Abstract—Redundant basis (RB) multipliers over Galois Field ( ) have gained huge popularity in elliptic curve cryptog- raphy (ECC) mainly because of their negligible hardware cost for squaring and modular reduction. In this paper, we have proposed a novel recursive decomposition algorithm for RB multiplication to obtain high-throughput digit-serial implementation. Through ef- cient projection of signal-ow graph (SFG) of the proposed algo- rithm, a highly regular processor-space ow-graph (PSFG) is de- rived. By identifying suitable cut-sets, we have modied the PSFG suitably and performed efcient feed-forward cut-set retiming to derive three novel multipliers which not only involve signicantly less time-complexity than the existing ones but also require less area and less power consumption compared with the others. Both theoretical analysis and synthesis results conrm the efciency of proposed multipliers over the existing ones. The synthesis results for eld programmable gate array (FPGA) and application spe- cic integrated circuit (ASIC) realization of the proposed designs and competing existing designs are compared. It is shown that the proposed high-throughput structures are the best among the corre- sponding designs, for FPGA and ASIC implementation. It is shown that the proposed designs can achieve up to 94% and 60% savings of area-delay-power product (ADPP) on FPGA and ASIC imple- mentation over the best of the existing designs, respectively. Index Terms—ASIC, digit-serial, nite eld multiplication, FPGA, high-throughput, redundant basis. I. INTRODUCTION F INITE FIELD multiplication over Galois Field ( ) is a basic operation frequently encountered in modern cryp- tographic systems such as the elliptic curve cryptography (ECC) and error control coding [1]–[3]. Moreover, multiplication over a nite eld can be used further to perform other eld opera- tions, e.g., division, exponentiation, and inversion [4]–[6]. Mul- tiplication over can be implemented on a general pur- pose machine, but it is expensive to use a general purpose ma- chine to implement cryptographic systems in cost-sensitive con- sumer products. Besides, a low-end microprocessor cannot meet the real-time requirement of different applications since word- Manuscript received January 20, 2014; revised May 10, 2014; accepted July 30, 2014. Date of publication October 01, 2014; date of current version January 06, 2015. This work was supported by the National Science Foundation under Grant CNS-0964079. This paper was recommended by Associate Editor C.P. Ravikumar. J. Xie and Z.-H. Mao are with the Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA 15261 USA (e-mail: [email protected], [email protected]). P. K. Meher is with the School of Computer Engineering, Nanyang Techno- logical University, Singapore 639798 (e-mail: [email protected]). Color versions of one or more of the gures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identier 10.1109/TCSI.2014.2349577 length of these processors is too small compared with the order of typical nite elds used in cryptographic systems. Most of the real-time applications, therefore, need hardware implementation of nite eld arithmetic operations for the benets like low-cost and high-throughput rate. The choice of basis to represent eld elements, namely the polynomial basis, normal basis, triangular basis and redundant basis (RB) has a major impact on the performance of the arith- metic circuits [7]–[9]. The multipliers based on RB [6], [10] have gained signicant attention in recent years due to their several advantages. Not only do they offer free squaring, as normal basis does, but also involve lower computational complexity and can be implemented in highly regular computing structures [10]–[14]. Several digit-level serial/parallel structures for RB multiplier over have been reported in the last years [10]–[14] after its introduction by Wu et al. [10]. An efcient serial/par- allel multiplier using redundant representation has been pre- sented in [10]. A bit-serial word-parallel (BSWP) architecture for RB multiplier has been reported by Namin et al. [11]. Sev- eral other RB multipliers also have been developed by the same authors in [12]–[14] for reducing the complexity of implemen- tation and for high-speed realization. We nd that the hardware utilization efciency and throughput of existing structures of [12]–[14] can be improved by efcient design of algorithm and architecture. In this paper, we aim at presenting efcient digit-level serial/parallel designs for high-throughput nite eld multiplica- tion over based on RB. We have proposed an efcient recursive decomposition scheme for digit-level RB multiplica- tion, and based on that we have derived parallel algorithms for high throughput digit-serial multiplication. We have mapped the algorithm to three different high-speed architectures by mapping the parallel algorithm to a regular 2-dimensional signal-ow graph (SFG) array, followed by suitable projection of SFG to 1-dimensional processor-space ow graph (PSFG), and the choice of feed-forward cut-set to enhance the throughput rate. Our proposed digit-serial multipliers involve signicantly less area-time-power complexities than the corresponding existing designs. Field programmable gate array (FPGA) has evolved as a mainstream dedicated computing platform. FPGAs however do not have abundant number of registers to be used in the multiplier. Therefore, we have modied the proposed algorithm and architecture for reduction of register-complexity particularly for the implementation of RB multipliers on FPGA platform. 1549-8328 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

110 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …kresttechnology.com/krest-academic-projects/krest-major... · 2015-07-22 · 110 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I:

  • Upload
    others

  • View
    6

  • Download
    1

Embed Size (px)

Citation preview

Page 1: 110 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …kresttechnology.com/krest-academic-projects/krest-major... · 2015-07-22 · 110 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I:

110 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 62, NO. 1, JANUARY 2015

High-Throughput Finite Field MultipliersUsing Redundant Basis for FPGA

and ASIC ImplementationsJiafeng Xie, Pramod Kumar Meher, Senior Member, IEEE, and Zhi-Hong Mao, Senior Member, IEEE

Abstract—Redundant basis (RB) multipliers over Galois Field( ) have gained huge popularity in elliptic curve cryptog-raphy (ECC) mainly because of their negligible hardware cost forsquaring andmodular reduction. In this paper, we have proposed anovel recursive decomposition algorithm for RB multiplication toobtain high-throughput digit-serial implementation. Through effi-cient projection of signal-flow graph (SFG) of the proposed algo-rithm, a highly regular processor-space flow-graph (PSFG) is de-rived. By identifying suitable cut-sets, we have modified the PSFGsuitably and performed efficient feed-forward cut-set retiming toderive three novel multipliers which not only involve significantlyless time-complexity than the existing ones but also require lessarea and less power consumption compared with the others. Boththeoretical analysis and synthesis results confirm the efficiency ofproposed multipliers over the existing ones. The synthesis resultsfor field programmable gate array (FPGA) and application spe-cific integrated circuit (ASIC) realization of the proposed designsand competing existing designs are compared. It is shown that theproposed high-throughput structures are the best among the corre-sponding designs, for FPGA andASIC implementation. It is shownthat the proposed designs can achieve up to 94% and 60% savingsof area-delay-power product (ADPP) on FPGA and ASIC imple-mentation over the best of the existing designs, respectively.

Index Terms—ASIC, digit-serial, finite field multiplication,FPGA, high-throughput, redundant basis.

I. INTRODUCTION

F INITE FIELD multiplication over Galois Field ( )is a basic operation frequently encountered in modern cryp-

tographic systems such as the elliptic curve cryptography (ECC)and error control coding [1]–[3]. Moreover, multiplication overa finite field can be used further to perform other field opera-tions, e.g., division, exponentiation, and inversion [4]–[6]. Mul-tiplication over can be implemented on a general pur-pose machine, but it is expensive to use a general purpose ma-chine to implement cryptographic systems in cost-sensitive con-sumer products. Besides, a low-end microprocessor cannot meetthe real-time requirement of different applications since word-

Manuscript received January 20, 2014; revised May 10, 2014; accepted July30, 2014. Date of publication October 01, 2014; date of current version January06, 2015. This work was supported by the National Science Foundation underGrant CNS-0964079. This paper was recommended by Associate Editor C.P.Ravikumar.J. Xie and Z.-H. Mao are with the Department of Electrical and Computer

Engineering, University of Pittsburgh, Pittsburgh, PA 15261 USA (e-mail:[email protected], [email protected]).P. K. Meher is with the School of Computer Engineering, Nanyang Techno-

logical University, Singapore 639798 (e-mail: [email protected]).Color versions of one or more of the figures in this paper are available online

at http://ieeexplore.ieee.org.Digital Object Identifier 10.1109/TCSI.2014.2349577

length of these processors is too small compared with the orderof typical finite fields used in cryptographic systems. Most of thereal-time applications, therefore, need hardware implementationof finite field arithmetic operations for the benefits like low-costand high-throughput rate.The choice of basis to represent field elements, namely the

polynomial basis, normal basis, triangular basis and redundantbasis (RB) has a major impact on the performance of the arith-metic circuits [7]–[9]. The multipliers based on RB [6], [10] havegained significant attention in recent years due to their severaladvantages. Not only do they offer free squaring, as normalbasis does, but also involve lower computational complexityand can be implemented in highly regular computing structures[10]–[14].Several digit-level serial/parallel structures for RB multiplier

over have been reported in the last years [10]–[14]after its introduction by Wu et al. [10]. An efficient serial/par-allel multiplier using redundant representation has been pre-sented in [10]. A bit-serial word-parallel (BSWP) architecturefor RB multiplier has been reported by Namin et al. [11]. Sev-eral other RB multipliers also have been developed by the sameauthors in [12]–[14] for reducing the complexity of implemen-tation and for high-speed realization. We find that the hardwareutilization efficiency and throughput of existing structures of[12]–[14] can be improved by efficient design of algorithm andarchitecture.In this paper, we aim at presenting efficient digit-level

serial/parallel designs for high-throughput finite field multiplica-tion over based on RB. We have proposed an efficientrecursive decomposition scheme for digit-level RB multiplica-tion, and based on that we have derived parallel algorithms forhigh throughput digit-serial multiplication. We have mapped thealgorithm to three different high-speed architectures by mappingthe parallel algorithm to a regular 2-dimensional signal-flowgraph (SFG) array, followed by suitable projection of SFG to1-dimensional processor-space flow graph (PSFG), and thechoice of feed-forward cut-set to enhance the throughput rate.Our proposed digit-serial multipliers involve significantly lessarea-time-power complexities than the corresponding existingdesigns. Field programmable gate array (FPGA) has evolved asa mainstream dedicated computing platform. FPGAs howeverdo not have abundant number of registers to be used in themultiplier. Therefore, we have modified the proposed algorithmand architecture for reduction of register-complexity particularlyfor the implementation of RB multipliers on FPGA platform.

1549-8328 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Page 2: 110 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …kresttechnology.com/krest-academic-projects/krest-major... · 2015-07-22 · 110 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I:

XIE et al.: HIGH-THROUGHPUT FINITE FIELD MULTIPLIERS USING REDUNDANT BASIS FOR FPGA AND ASIC IMPLEMENTATIONS 111

Apart from these we also present a low critical-path digit-serialRB multiplier for very high throughput applications.The rest of the paper is organized as follows: The proposed

algorithm for finite field RB multiplication over is pre-sented in Section II. The proposed structures for high-throughputdigit-serial realization of the multiplications are derived fromthe proposed algorithm in Section III. In Section IV, we havediscussed the estimation of hardware and time complexitiesalong with the comparative performance of proposed designsover the recent competing designs. Conclusions are presented inSection V.

II. MATHEMATICAL FORMULATION

A. Brief Review of Existing Digit-Serial RB Multiplier

Assuming to be a primitive th root of unity, elements infinite field can be represented in the form:

(1)

where , for , such that the setis defined as the RB for finite field elements,

where is a positive integer not less than [6], [10].For a finite field , when is prime and 2 is

a primitive root modulo , there exists a type I optimalnormal basis (ONB) [10], where is an element of , and

.Let be expressed in RB representation as

(2)

(3)

where . Let be the product of and , whichcan be expressed as follows

(4)

where denotes modulo reduction. Define, where , we have

(5)

In the recently proposed RB multipliers of [13] and [14], bothoperands and are decomposed into a number of blocks toachieve digit-serial multiplication, and after that the partial prod-ucts corresponding to these blocks are added together to obtainthe desired product word. The existing digit-serial RB multipli-cation algorithm is stated as follows:

Existing Algorithm Existing digit-serial multiplication algorithm[13] and [14]

Inputs:

,

and

are two RB representation elements, for .

Output:

1. Initialization: for and

2. For all values of , do

3. For all values of , do

4. For to , do

5.

6. End For

7. End For

8.

9. End For

Step 5 of existing algorithm refers to the computation ofdigit-wise partial products for the digit-serial multiplicationwhere operands and are decomposed into a number ofdigits, and step 8 refers to the addition of those partial productsto compute the product word.Although the existing algorithm of [13] and [14] is the most

efficient one out of all reported algorithms for digit-serial mul-tiplication, we find that the hardware utilization efficiency andthroughput of existing structures of [13]-[14] could be improvedfurther by efficient design of algorithm and architecture. Particu-larly, due to its larger number of digit-wise partial products (step5), and extra time for addition of those partial products (step 8),which not only increases the average computation time (ACT) toperform the multiplication but also involves extra hardware re-sources for storage and addition of larger number of partial prod-ucts.

B. Proposed Digit-Serial RB Multiplication Algorithm

Alternatively, we canwrite (5) in a bit-levelmatrix-vector formas:

......

.... . .

......

(6)

Page 3: 110 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …kresttechnology.com/krest-academic-projects/krest-major... · 2015-07-22 · 110 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I:

112 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 62, NO. 1, JANUARY 2015

Looking at the structure of bit matrix in (6), we can definebit-shifted (reduced) forms of operand as follows

(7)

where

(8)

The recursions on (8) can be extended further to have

(9)

where .Let and be two integers such that , where

. For simplicity of discussion, we assume1 , anddecompose the input operand into number of bit-vectorsfor , as follows:

(10)

Similarly, we can generate number of shifted operand vec-tors for , as follows:

(11)

The product given by the bit-level matrix-vectorproduct in (6) can be decomposed into inner-products of vec-tors and for as:

(12)

where denotes

(13)

Note that each for is a -point bit-vectorand each for is bit-shifted forms ofoperand . From (12) and (13) we can find that the desired mul-tiplication can be performed by cycles of successive accumula-tion of for , while each can be computedas . The proposed digit-serial multipli-cation algorithm based on (12) and (13) is described in Algorithm1.

1When , we can append number of zeros to each of the operandsto satisfy the condition .

Algorithm 1 Proposed digit-serial multiplication algorithm

Inputs: and are the pair of elements (in RB representation)in to be multiplied.

Output:

1. Initialization step

1.1

2. Multiplication step

2.1. For to

2.2. For to

2.3.

2.4. End For

2.5. End For

3. Final step

3.1.

where step 2.3 refers to the digit-serial multiplication process.According to our proposed algorithm, we generate less numberof partial products, and partial products are accumulated as soonas they are computed, which not only shortens the ACT, but alsosignificantly reduces register and adder complexities of proposedstructures over that of existing ones in [13] and [14].

III. DERIVATION OF PROPOSED HIGH-THROUGHPUTSTRUCTURES FOR RB MULTIPLIERS

In this section, we derive the proposed multipliers from theSFG of the proposed Algorithm 1.

A. Proposed Structure-I

According to (12) and (13), the RBmultiplication can be repre-sented by the 2-dimensional SFG (shown in Fig. 1) consisting ofparallel arrays, where each array consists of bit-shifting

nodes (S node), multiplication nodes (M nodes) andaddition nodes (A nodes). There are two types of S nodes (S-Inode and S-II node). Function of S nodes is depicted in Fig. 1(b),where S-I node performs circular bit-shifting by one position andS-II node performs circular bit-shifting by positions for the de-gree reduction requirement. Functions of M nodes and A nodesare depicted in Fig. 1(c) and 1(d), respectively. Each of the Mnodes performs an AND operation of a bit of serial-input operandwith bit-shifted form of operand , while each of the A nodes

performs an XOR operation. The final addition of the output ofarrays of Fig. 1 can be performed by bit-by-bit XOR of the

operands in number of A nodes as depicted in Fig. 1. Thedesired product word is obtained after the addition of paralleloutput of the arrays.For digit-serial realization of RB multiplier, the SFG of Fig. 1

can be projected along -direction to obtain a PSFG as shown inFig. 2, where input bits are loaded in parallel to multiplicationnodes during each cycle period. The functions of nodes of PSFGare the same as those of corresponding nodes in the SFG of Fig. 1except an extra add-accumulation (AA) node. The function of theAA node is, as described in Fig. 2(b), to execute the accumulationoperation for cycles to yield the desired result thereafter.

Page 4: 110 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …kresttechnology.com/krest-academic-projects/krest-major... · 2015-07-22 · 110 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I:

XIE et al.: HIGH-THROUGHPUT FINITE FIELD MULTIPLIERS USING REDUNDANT BASIS FOR FPGA AND ASIC IMPLEMENTATIONS 113

Fig. 1. Signal-flow graph (SFG) for parallel realization of RB multiplicationover based on (12) and (13). (a) The proposed SFG. (b) Functionaldescription of S node, where S-I node performs circular bit-shifting of one posi-tion and S-II node performs circular bit-shifting by positions. (c) Functionaldescription of M node. (d) Functional description of A node.

Fig. 2. Processor-space flow graph (PSFG) of digit-serial realization of finitefield RB multiplication over . (a) The proposed PSFG. (b) Functionaldescription of add-accumulation (AA) node.

For efficient realization of a digit-serial RB multiplier, we canperform feed-forward cut-set retiming in a regular interval in thePSFG as shown in Fig. 3. As a result of cut-set retiming of theFig. 3, the minimum duration of each clock period is reduced to

, where and denote the delay of an AND gateand an XOR gate, respectively.The PSFG of Fig. 3, is mapped to the high-throughput digit-se-

rial RBmultiplier (shown in Fig. 4), referred to as proposed struc-ture-I (PS-I). PS-I contains three modules, namely the bit-permu-tationmodule (BPM), partial product generationmodule (PPGM)

Fig. 3. Cut-set retiming of PSFG of finite field RB multiplication over, where “ ” denotes delay.

Fig. 4. Proposed structure-I (PS-I) for RB multiplier, where “R” denotes a reg-ister cell. (a) Detailed structure of the RB multiplier. (b) Structure of the bit-per-mutation module (BPM). (c) Structure of the AND cell in the partial productgeneration module (PPGM). (d) Structure of the XOR cell in the PPGM. (e)Structure of the register cell in the PPGM. (f) Structure of the finite field accu-mulator.

and finite field accumulator module. The BPM of Fig. 4 per-forms rewiring of bits of operand to feed its output to partialproduct generation units (PPGU)s according to the S nodes ofPSFG of Fig. 3, as shown in Fig. 4(b). The AND cell, XOR celland register cell of PPGM perform the function of M node, Anode and delay imposed by the retiming of PSFG of Fig. 3, re-spectively. Structures and functions of AND cell, XOR cell andregister cell are shown in Fig. 4(c), (d), and (e), respectively. Theinput operands are fed to PPGU in staggered manner to meet the

Page 5: 110 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …kresttechnology.com/krest-academic-projects/krest-major... · 2015-07-22 · 110 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I:

114 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 62, NO. 1, JANUARY 2015

Fig. 5. PS-I for RB multiplier when . (a) Proposed cut-set retiming ofPSFG when . (b) Detailed internal structure of merged regular PPGU. (c)Corresponding PS-I for the case .

timing requirement in systolic pipeline. The accumulator consistsof parallel bit-level accumulation cells [as shown in Fig. 4(f)].The newly received input is then added with the previously accu-mulated result and the result is stored in the register cell to be usedduring the next cycle. The duration of minimum cycle period ofthe PS-I is . The proposed digit-serial design gives thefirst output of desired product cycles after the pair ofoperands are fed to the structure, while the successive output areproduced at the interval of cycles thereafter.

B. Modification of Proposed Structure-IFor any integer value of , we can have , where

and . Without loss of generality, for simplicity ofdiscussion, we can assume . The approach proposed here for

however can be easily extended to the cases where .Define , and , such that (13) can berewritten as

(14)

Based on (14), we can modify the retiming of PSFG of Fig. 3to derive suitable digit-level architecture for RB multiplier over

. For example, to obtain the proposed structure for, a pair of S nodes, a pair of M nodes and a pair of A nodesof the PSFG of Fig. 3 can be merged to form a macro-node asshown within the dashed-lines in Fig. 5. Each of these macro-nodes can be implemented by a new PPGU to obtain a PPGM of

PPGUs. Accordingly, two regular PPGUs in the structure ofFig. 4 can be emerged into a new regular PPGU as shown in Fig.5(b), which consists of two AND cells and two XOR cells (thefirst PPGU requires only one XOR cell). The functions of ANDcell, XOR cell and register cell are the same as those describedin Fig. 4. The critical path of the structure of Fig. 5(c) amountsto . The first output of desired product is availablefrom this structure after a latency of cycles, while thesuccessive outputs are available thereafter in each cycles of

Fig. 6. Proposed structure-II (PS-II) for RB multiplier, where “R” denotes aregister cell. (a) Modified PSFG. (b) Structure of RB multiplier.

Fig. 7. Novel cut-set retiming of PSFG and its corresponding structure: PS-III.(a) Cut-set retiming. (b) BPM and PPGM of PS-III.

duration . The technique used to derive the structurefor may be extended for any value of , to obtain a structureconsisting of PPGUs.The technique based on (14) can significantly reduce the reg-

ister complexity of the proposed structure, since consecutive

Page 6: 110 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …kresttechnology.com/krest-academic-projects/krest-major... · 2015-07-22 · 110 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I:

XIE et al.: HIGH-THROUGHPUT FINITE FIELD MULTIPLIERS USING REDUNDANT BASIS FOR FPGA AND ASIC IMPLEMENTATIONS 115

TABLE ICOMPARISON OF AREA-TIME COMPLEXITIES OF DIGIT-SERIAL RB MULTIPLIERS

: Time duration of critical-path. : is the number of bits of operand fed to each PPGU during each cycle period.

TABLE IICOMPARISON OF AREA-TIME COMPLEXITIES OF DIFFERENT MULTIPLIERS WHERE THERE EXIST A TYPE I ONB

: Time duration of critical-path.

: The structure listed here refers to the AND-efficient digit-serial (AEDS) in [20].

: The structure listed here refers to the XOR-efficient digit-serial (XEDS) in [20].

: The structure listed here refers to the -sequential multipliers with parallel output I ( -SMPOI) in [21].

: Referring to the -sequential multipliers with parallel output II ( -SMPOII) in [21].

: The structure listed here refers to type I ONB structure.

: . : . : .

: . : . : .

: . : . : .

: . : .

: The structure listed here can be seen as the case of , where is the number of bits of operand fed to each PPGU during each cycle period.

: The structure listed here refers to the case where can be any value .

PPGUs of the PS-I can be merged together to formunits to be processed concurrently. This strategy is quite usefulfor FPGA-based implementation since the value of can bechosen appropriately, such that the PSFG nodes selected to beprocessed in a cycle can be mapped to a basic unit of FPGA withlow register complexity.

C. Proposed Structure-II

We can further transform the PSFG of Fig. 3 to reduce the latencyand hardware complexity of PS-I. To obtain the proposed struc-ture, serially-connected A nodes of the PSFG of Fig. 3 aremerged into a pipeline form of A nodes as shown within

the dashed-box in Fig. 6(a). These pipelined A nodes can be im-plemented by a pipelined XOR tree, as shown in Fig. 6(b). Sinceall the AND cells can be processed in parallel, there is no need ofusing extra “0”s on the input path to meet the timing requirementin systolic pipeline. The critical path and throughput of PS-II arethe same as those of PS-I. Similarly, PS-II can be easily extendedto larger values of to have low register-complexity structures.

D. Proposed Structure-III

Since the S nodes of Fig. 3 perform only the bit-shifting op-erations they do not involve any time consumption. Therefore,we can introduce a novel cut-set retiming to reduce the critical-

Page 7: 110 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …kresttechnology.com/krest-academic-projects/krest-major... · 2015-07-22 · 110 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I:

116 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 62, NO. 1, JANUARY 2015

Fig. 8. Comparisons of keymetrics of various structures for . (a) Com-parisons of area-complexity (number of ALUT). (b) Comparisons of maximumfrequency (MHz). (c) Comparisons of power consumption (mW).

path further, as shown in Fig. 7(a). It can be observed that thecut-set retiming allows to perform the bit-addition and bit-mul-tiplication concurrently, so that the critical-path is reduced to

, i.e., the throughputof thedesign is increased.The proposed high-throughput structure (PS-III) of RBmultiplierthusderived isshowninFig.7(b). It consistsof PPGUs,andeach PPGU consists of one AND cell, one XOR cell and two reg-ister cells. The proposed structure yields thefirst output of desiredresult cycles after thefirst input is fed to the structure,while the successiveoutputs areavailable ineach cycles.

IV. AREA-TIME-POWER COMPLEXITIES

A. Complexities of PS-I, PS-II, and PS-III

PS-I requires PPGUs, where each of the regularPPGUs consists of XOR gates and AND gates. The finite

Fig. 9. Comparisons of area-delay-power complexities of various structuresfor (delay refers to the ACT of a structure. Area, delay and power aremeasured in number of ALUT, s and mW, respectively). (a) Comparisonsof area-delay product (ADP). (b) Comparisons of power-delay product (PDP).(c) Comparisons of area-delay-power product (ADPP).

field accumulator requires XOR gates and bit-registers. Theproposed design in total requires XOR gates, AND gatesand bit-registers. After a latency of cycles,PS-I gives the desired output word in every cycles of duration

.PS-I for any value of consists of PPGUs. The complete

structure of the multiplier thus requires XOR gates, ANDgates and bit-registers. The latency of the structureamounts to cycles, where the duration of minimumcycle period is .PS-II has similar area-time complexities as those of PS-I ex-

cept that it involves less registers and lower latency than the latter.In total, PS-II requires registers and yields its first result

Page 8: 110 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …kresttechnology.com/krest-academic-projects/krest-major... · 2015-07-22 · 110 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I:

XIE et al.: HIGH-THROUGHPUT FINITE FIELD MULTIPLIERS USING REDUNDANT BASIS FOR FPGA AND ASIC IMPLEMENTATIONS 117

TABLE IIIAREA-TIME-POWER COMPLEXITIES COMPARISON OF VARIOUS MULTIPLIERS

BASED ON FPGA IMPLEMENTATION

: Area, delay and power are measured in number of ALUT, s and

mW, respectively.

: Delay refers to the ACT of a structure.

after a latency of cycles. For any value of , PS-IIrequires registers.PS-III requires PPGUs, where each of the

regular PPGUs consists of XOR gates, AND gates andbit-registers. The proposed design in total would require XORgates and AND gates. Besides, it needs a total ofbit-registers. After a latency of cycles, PS-III givesthe desired output word in every cycles of duration .

B. Comparison with Existing Digit-Serial RB Multipliers

The area-time complexities of proposed structures and existingstructures of [10]–[14] for RB multiplier are listed in Table I. Forsimplicity of discussion, we refer PS-I of Fig. 4 and PS-II of Fig. 6as the case of , respectively.In [13] and [14], the authors have shown that their structures

outperform the previous structures in [10]–[12]. Therefore wecompare the performance of proposed structures only with thoseof [13] and [14]. PS-I and PS-II (for ), not only involveless time complexity (shorter ACT), but also have less XOR gatesthan those of existing designs of [13] and [14]. PS-I and PS-II (for

) require less registers, at the cost of a small increase incritical-path. And PS-III has the lowest time-complexity amongall the structures listed in Table I.

C. Comparison with Existing Digit-Serial Multipliers Havinga Type I ONB

The complexity of RB multiplier is almost the same as thatof type I ONB [10]. The area-time complexities of the proposedmultipliers and architectures of [10]–[14] and [18]–[22] (forwhich there exists a type I ONB) are shown in Table II. Notethat these complexities are estimated by substituting for ,according to the definition in [10].The authors in [13] and [14] have shown that their multipliers

outperform the previously proposed structures in [10]–[12] and[18]–[21]. Therefore we compare our proposed structures onlywith [13]-[14] and [22]. PS-I and PS-II not only require lessnumber of logic gates and registers ( number less XOR gatesand nearly number less registers), but also have shorter ACTcompared to the structure in [22].

D. Comparison of Synthesis Results for FPGA Implementation

We have used Altera Quartus II 12.0 and chosen Arria II GZEP2AGZ225FF35C3 FPGA device to synthesize the proposeddesigns as well as the existing competing designs. The key syn-thesis results are obtained, in terms of area, maximum frequencyand power consumption with respect to various , and .The number of adaptive look-up table (ALUT) is taken as thearea measure. For fair estimation, we have used the same inputdata and the same clock frequency (100 MHz) to obtain thesynthesis results using Quartus II PowerPlay Power Analyzer.The metrics, i.e., area, maximum usable clock frequency andpower consumption of various structures estimated from thesynthesis results are shown in Fig. 8. We have also estimatedthe area-delay product (ADP), power-delay product (PDP) andarea-delay-power product (ADPP) of proposed and existingstructures from the synthesis results as shown in Fig. 9, wherethe delay refers to the ACT estimated from the minimum dataarrival time.For a detailed comparison, we have listed the synthesis results

(area, delay, power, ADP, PDP, and ADPP) of proposed designs(PS-I/PS-II and PS-III) along with the best of the existing designsof [13]/[14] in Table III, for and ,

and , and and ,respectively.It can be seen that for FPGA implementation, the proposed

structures (except PS-III) outperform the existing designs. Asshown in Figs. 8, 9 and Table III, PS-I can provide a saving of upto94% of ADP and ADPP and 65% of PDP over the existing designof [13], for and . Besides, as shown inFig. 9 and Table III, as the value of increases, the ADP of pro-posed structures decreases. It is worth noting that the ALUT ofAltera FPGA devices can be mapped to logic function involvingmultiple Boolean operations, so that the number of synthesizedALUT decreases as increases. This feature also explains whythe ADP of PS-III is worse than others.

E. Comparison of Synthesis Results for ASIC Implementation

We have also synthesized the proposed structures and the ex-isting structures using Synopsys Design Compiler and North Car-olina State University’s 45 nm FreePDK [15] to obtain the area,time and power complexities of the designs. Using those logicalsynthesis results, we have plotted the area, delay and power con-sumption (at 1GHz) in Fig. 10, and we have calculated the ADP,PDP and ADPP of the designs (shown in Fig. 11).As shown in Figs. 10 and 11, proposed structures outperform

the existing designs. Basic structures of PS-I and PS-II ( )have the lowest ADP and PDP among all these structures. Asincreases, the ADP and PDP of proposed structures also increasea little while the ADPP decreases, so that the ADP and the PDPof PS-II are the lowest among all the structures for the case of{ , , and }, while the ADPP of PS-II isthe lowest for the case of { , , and }. Fora detailed comparison, synthesis results in terms of area-delay-power complexities of proposed designs of PS-II, and PS-III; andthe best of the existing designs [14] are listed in Table IV for thecase of { , , and }, { , , and

}, and { , , and }, respectively.As shown in Figs. 10 and 11 and Table IV, especially for the

case of { , , and }, PS-II can save at most18% ADP, 17.8% PDP and 60% ADPP over the existing designof [14], as shown in Fig. 11 and Table IV. Moreover, as shown in

Page 9: 110 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …kresttechnology.com/krest-academic-projects/krest-major... · 2015-07-22 · 110 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I:

118 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 62, NO. 1, JANUARY 2015

Fig. 10. Comparisons of key metrics of various structures for . (a)Comparisons of area-complexity ( ). (b) Comparisons of delay (ns). (c)Comparisons of power consumption (mW).

Table IV, PS-III of Fig. 7 has the lowest time complexity amongall the structures.

F. Design Selection

From Figs. 8, 9, 10, and 11, we find that PS-I and PS-II out-perform the other structures in both FPGA and ASIC platformsin terms of area, time and power complexities. Besides, becauseof their low area-time-power complexities and high throughputrate, PS-I and PS-II can be used in various real time applications.Especially for FPGA implementation, it is suggested to use eitherPS-I/II (for ) based on the area constraint and speedrequirement of applications. For ASIC implementation, PS-I andPS-II of Figs. 4 and 6 or PS-III of Fig. 7 are preferred for theirefficiency in area-time-power complexities. For applications re-quiring highest throughput, PS-III of Fig. 7 is the best choice. In

Fig. 11. Comparisons of area-delay-power complexities of various structuresfor (delay refers to the ACT of a structure. Area, delay and powerare measured in , ns and mW/GHz, respectively). (a) Comparisons of area-delay product (ADP). (b) Comparisons of power-delay product (PDP). (c) Com-parisons of area-delay-power product (ADPP).

summary, we can choose different structures according to the re-quirements of different application environments.

V. CONCLUSION

We have proposed a novel recursive decomposition algorithmfor RB multiplication to derive high-throughput digit-serial mul-tipliers. By suitable projection of SFG of proposed algorithm andidentifying suitable cut-sets for feed-forward cut-set retiming,three novel high-throughput digit-serial RB multipliers are de-rived to achieve significantly less area-time-power complexitiesthan the existing ones. Moreover, efficient structures with lowregister-count have been derived for area-constrained implemen-tation; and particularly for implementation in FPGA platformwhere registers are not abundant. The results of synthesis showthat proposed structures can achieve saving of up to 94% and

Page 10: 110 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: …kresttechnology.com/krest-academic-projects/krest-major... · 2015-07-22 · 110 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I:

XIE et al.: HIGH-THROUGHPUT FINITE FIELD MULTIPLIERS USING REDUNDANT BASIS FOR FPGA AND ASIC IMPLEMENTATIONS 119

TABLE IVAREA-TIME-POWER COMPLEXITIES COMPARISON OF VARIOUS MULTIPLIERS

BASED ON ASIC IMPLEMENTATION

: area, delay and power are measured in , ns and mW(1 GHz),

respectively.

: delay refers to the ACT of a structure.

60%, respectively, of ADPP for FPGA and ASIC implemen-tation, respectively, over the best of the existing designs. Theproposed structures have different area-time-power trade-offbehavior. Therefore, one out of the three proposed structurescan be chosen depending on the requirement of the applicationenvironments.

REFERENCES[1] I. Blake, G. Seroussi, and N. P. Smart, Elliptic Curves in Cryptography,

ser. London Mathematical Society Lecture Note Series.. Cambridge,U.K.: Cambridge Univ. Press, 1999.

[2] N. R. Murthy and M. N. S. Swamy, “Cryptographic applications ofbrahmaqupta-bha skara equation,” IEEE Trans. Circuits Syst. I, Reg.Papers, vol. 53, no. 7, pp. 1565–1571, 2006.

[3] L. Song and K. K. Parhi, “Low-energy digit-serial/parallel finite fieldmultipliers,” J. VLSI Digit. Process., vol. 19, pp. 149–C166, 1998.

[4] P. K. Meher, “On efficient implementation of accumulation in finitefield over and its applications,” IEEE Trans. Very Large ScaleIntegr. (VLSI) Syst., vol. 17, no. 4, pp. 541–550, 2009.

[5] L. Song, K. K. Parhi, I. Kuroda, and T. Nishitani, “Hardware/softwarecodesign of finite field datapath for low-energy Reed-Solomn codecs,”IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 8, no. 2, pp.160–172, Apr. 2000.

[6] G. Drolet, “A new representation of elements of finite fieldsyielding small complexity arithmetic circuits,” IEEE Trans. Comput.,vol. 47, no. 9, pp. 938–946, 1998.

[7] C.-Y. Lee, J.-S. Horng, I.-C. Jou, and E.-H. Lu, “Low-complexitybit-parallel systolic montgomery multipliers for special classes of

,” IEEE Trans. Comput., vol. 54, no. 9, pp. 1061–1070, Sep.2005.

[8] P. K. Meher, “Systolic and super-systolic multipliers for finite fieldbased on irreducible trinomials,” IEEE Trans. Circuits Syst.

I, Reg. Papers, vol. 55, no. 4, pp. 1031–1040, May 2008.[9] J. Xie, J. He, and P. K. Meher, “Low latency systolic montgomery mul-

tiplier for finite field based on pentanomials,” IEEE Trans.Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 2, pp. 385–389, Feb.2013.

[10] H. Wu, M. A. Hasan, I. F. Blake, and S. Gao, “Finite field multiplierusing redundant representation,” IEEE Trans. Comput., vol. 51, no. 11,pp. 1306–1316, Nov. 2002.

[11] A. H. Namin, H. Wu, and M. Ahmadi, “Comb architectures for finitefield multiplication in ,” IEEE Trans. Comput., vol. 56, no. 7, pp.909–916, Jul. 2007.

[12] A. H. Namin, H. Wu, and M. Ahmadi, “A new finite field multiplierusing redundat representation,” IEEE Trans. Comput., vol. 57, no. 5,pp. 716–720, May 2008.

[13] A. H. Namin, H. Wu, and M. Ahmadi, “A high-speed word level finitefield multiplier in using redundant representation,” IEEE Trans.Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 10, pp. 1546–1550,Oct. 2009.

[14] A. H. Namin, H. Wu, and M. Ahmadi, “An efficient finite field multi-plier using redundant representation,” ACM Trans. Embedded Comput.Sys., vol. 11, no. 2, Jul. 2012, Art. 31.

[15] North Carolina State University, 45 nm FreePDK wiki [Online]. Avail-able: http://www.eda.ncsu.edu/wiki/FreePDK45:Manual

[16] P. K. Meher, “Systolic and non-systolic scalable modular designs offinite field multipliers for Reed-Solomon Codec,” IEEE Trans. VeryLarge Scale Integr. (VLSI) Syst., vol. 17, no. 6, pp. 747–C757, Jun.2009.

[17] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Im-plementation. New York: Wiley, 1999.

[18] J. L. Massey and J. K. Omura, “Computational method and apparatusfor finite field arithmetic,” U.S. patent application, 1984.

[19] S. Gao and S. Vanstone, “On orders of optimal normal basis genera-tors,” Math. Comput., vol. 64, no. 2, pp. 1227–1233, 1995.

[20] A. Reyhani-Masoleh and M. A. Hasan, “Efficient digit-serial normalbasis multipliers over ,” IEEE Trans. Comput., vol. 52, no. 4,pp. 428–439, Apr. 2003.

[21] A. Reyhani-Masoleh and M. A. Hasan, “Low complexity word-levelsequential normal basis multipliers,” IEEE Trans. Comput., vol. 54,no. 2, pp. 98–C110, Feb. 2005.

[22] A. H. Namin, H. Wu, and M. Ahmadi, “A word-level finite field mul-tiplier using normal basis,” IEEE Trans. Comput., vol. 60, no. 6, pp.890–895, Jun. 2011.

Jiafeng Xie received the M.E. degree from CentralSouth University. His research interest includesVLSI cryptographic circuits design.

Pramod Kumar Meher (SM’03) received thePh.D. degree from Sambalpur University. Hisresearch interest includes design of dedicated, re-configurable architectures for computation-intensivealgorithms. Dr. Meher has served as AssociateEditor for the IEEE TRANSACTIONS ON CIRCUITSAND SYSTEMS—PART II: EXPRESS BRIEFS and IEEETRANSACTIONS ON CIRCUITS AND SYSTEMS—PARTI: REGULAR PAPERS during 2008–2013. Cur-rently, he is serving as Associate Editor for theIEEE TRANSACTIONS ON VERY LARGE SCALE

INTEGRATION (VLSI) SYSTEMS, and Circuits, Systems, and Signal Processing.

Zhi-Hong Mao (S’96–M’05–SM’09) receivedthe Ph.D. degree from the Massachusetts Instituteof Technology, Cambridge, MA, USA. He is anAssociate Professor and a William Kepler White-ford Faculty Fellow in University of Pittsburgh,Pittsburgh, PA, USA. Dr. Mao received the FacultyEarly Career Award of the NSF, the Andrew P.Sage Best Transactions Paper Award of the IEEESystems, Man, and Cybernetics Society in 2010,and the Outstanding Service Award as AssociateEditor of IEEE TRANSACTIONS ON INTELLIGENT

TRANSPORTATION SYSTEMS in 2013.