
IJVD: 3(1), 2012, pp. 21-26

ANALYSIS OF AN AREA EFFICIENT VLSI ARCHITECTURE FOR FLOATING POINT MULTIPLIER AND GALOIS FIELD MULTIPLIER*

Anbuselvi M. and Salivahanan S.

Department of Electronics and Communication Engineering, SSN College of Engineering, Rajiv Gandhi Salai, Kalavakkam-603110, India

This article deals with the VLSI architecture of floating point and Galois field multipliers, using a technique called wave-pipelining. Wave-pipelining is a circuit design technique that allows digital synchronous systems to be clocked at rates higher than conventional pipelining techniques allow. Wave-pipelining can improve the throughput of a logic circuit while avoiding some of the overheads of traditional pipelining. Multiplication plays a very important role in signal processing applications. On a VLSI platform, area consumption is judged by the number of gates required to realize the logic. Traditional multiplier structures are computation intensive and therefore use a large number of flip-flops and slices when realized in hardware. To reduce the area consumption, the technique of wave-pipelining has been incorporated, which also paves the way for a low power architecture. The concept has been verified with another kind of multiplier, namely the Galois field multiplier, which is used in coding theory and cryptography. The analysis of the designed architectures is done in Xilinx and Synopsys, targeted to 90nm technology.

Keywords: Digital design, Floating point, Galois field, Maximum rate pipelining, Multiplier, VLSI architecture, Wave-pipelining.

1. INTRODUCTION

With the advent of signal processing techniques and technologies, the computational complexity of the individual blocks of any application has been optimized. In general, the most common operation in any signal processing application is multiplication. Multiplication is computation intensive and consumes more power and area than the other arithmetic operations.

In an effort to improve the throughput of digital systems, the architecture for multipliers has been optimized using the pipelining technique. When a logic network is pipelined, synchronizing elements, either latches or registers, are inserted to partition the network into stages. Pipelining a circuit into N stages can speed up throughput by up to a factor of N. The inserted synchronizing elements increase the area and power consumption of the logic, and they add overhead in latency and cycle time. Conventional pipelined systems allow data to propagate from a register through the combinational network to another register prior to initiating the subsequent data transfer. Thus, the maximum operating frequency is determined by the maximum propagation delay through the longest pipeline stage. Wave-pipelining, or maximum rate pipelining, is a circuit design technique that allows digital systems to be clocked at rates higher than can be achieved with conventional pipelining techniques. Wave-pipelining relies on the predictable, finite signal propagation delay through combinational logic for virtual data storage.

Wave-pipelining of combinational circuits has been shown to achieve clock rates 2 to 7 times those possible for the same circuits with conventional pipelining. Unlike ordinary pipelining, wave-pipelining does not require internal clocked elements to increase throughput. Rather, knowledge of the signal propagation delay characteristics of the logic network is used at design time to manage the signal delays so as to ensure that operations do not interfere with their predecessor or successor computations [1, 2]. The synchronization of internal computations is achieved by balancing the inherent RC delays of combinational logic elements, thus allowing circuits to be pipelined at a very fine grain level. The rate at which logic can propagate through the circuit depends not on the longest path delay but on the difference between the longest and shortest path delays [3].

2. WAVE-PIPELINING

While improving the throughput of a logic circuit, traditional pipelining of VLSI systems results in overheads in latency, cycle time, area, and power consumption. Cycle time overhead results from the time required for signals to propagate out of the synchronizing elements, from the time required for signals to set up at the synchronizing elements before being stored in them, and from the unintentional clock skew in the arrival of the synchronizer clock signal. In wave-pipelining, by contrast, cycle time is determined by the variation in the propagation delay of the signals through the logic and by the input and output register delays.

Latency through a traditional pipeline is defined as the total elapsed time from the introduction of data at the input of the first stage of the pipeline to the arrival of the results computed from that data at the output of the final stage. Area and power overhead results from the additional transistors and wires used to implement the synchronizing latches or registers, and from the increased clock buffer area and power needed to drive the clock inputs of the synchronizers. The area and power overheads of a traditional pipeline are avoided in wave-pipelining since there are no separate synchronizers [4].

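$$T_{CK} \geq (D_{MAX} - D_{MIN}) + T_S + T_H + 2\Delta_{CK} \qquad (1)$$

where $D_{MAX}$ and $D_{MIN}$ are the maximum and minimum delays through the logic, $T_S$ and $T_H$ are the register setup and hold times, and $\Delta_{CK}$ is the clock skew.

According to equation (1), the minimum clock period grows with the difference between the maximum and minimum delay. Reducing this difference by buffer insertion allows the clock speed to be increased, thereby realizing the wave-pipelined circuit.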
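The effect of delay balancing on the clock period can be illustrated with a minimal Python sketch. It assumes the constraint takes the form T_CK >= (D_MAX - D_MIN) + T_S + T_H + 2 * skew; the function name and all delay values are hypothetical and chosen only for illustration, not taken from the synthesized design.

```python
def min_clock_period(d_max, d_min, t_setup, t_hold, clk_skew):
    """Lower bound on the clock period of a wave-pipelined block.

    Assumed form: T_CK >= (D_MAX - D_MIN) + T_S + T_H + 2 * skew.
    All arguments use the same time unit (nanoseconds here).
    """
    return (d_max - d_min) + t_setup + t_hold + 2 * clk_skew


# Hypothetical delays in nanoseconds (illustrative only).
before = min_clock_period(d_max=10.0, d_min=4.0,
                          t_setup=0.5, t_hold=0.3, clk_skew=0.1)

# Buffers pad the shortest path from 4 ns to 9 ns; the longest path is unchanged.
after = min_clock_period(d_max=10.0, d_min=9.0,
                         t_setup=0.5, t_hold=0.3, clk_skew=0.1)

print(f"bound before balancing: {before:.2f} ns")
print(f"bound after balancing:  {after:.2f} ns")
```

With these example numbers the bound shrinks because only the delay difference, not the longest path itself, enters the constraint.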

3. FLOATING POINT MULTIPLIER

IEEE 754 single precision is a standard format for floating-point representation. The floating-point representation is one way to represent real numbers. A floating-point number n is represented with an exponent e and a mantissa m, so that n = b^e × m, where b is the base (also called the radix). The three basic components are the sign, exponent, and mantissa, as shown in Figure 2. The IEEE 754 standard defines the sign with a single bit, the exponent with 8 bits and the mantissa with 23 bits.

Figure 1: Model of a Wave-pipelined Circuit

Figure 1 shows the model of a wave-pipelined circuit. There are no internal registers inside the logic block; flip-flops are inserted only at the input and output sides of the logic block. For the designed logic, the maximum and minimum delays are calculated. The technique of buffer insertion can be used to equalize the delays inside the logic block.

Figure 2: The Storage Layout for Single-precision Floating-point Binary

The floating-point format can represent a wide range of scale without losing precision, while the fixed-point format has a fixed window of representation. For example, in a 32-bit floating-point representation, magnitudes from 3.4 × 10^38 down to 1.4 × 10^-45 can be represented with ease. This is one of the reasons why floating-point representation is the most common solution. Floating-point representations also include special values such as infinity and Not-a-Number (NaN), e.g. the result of the square root of a negative number.
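The single-precision storage layout can be inspected with a short Python sketch that uses only the standard `struct` module. The helper name `decode_single` is hypothetical; the field widths (1, 8 and 23 bits) are those of IEEE 754 single precision.

```python
import struct


def decode_single(x):
    """Split a value, rounded to single precision, into (sign, exponent, mantissa)."""
    # Reinterpret the 4-byte big-endian float as an unsigned 32-bit integer.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF        # 23 bits, hidden leading 1 not stored
    return sign, exponent, mantissa


# -6.5 = -1.625 x 2^2, so the stored exponent is 2 + 127 = 129.
sign, exponent, mantissa = decode_single(-6.5)
print(sign, exponent, hex(mantissa))
```

The mantissa field holds only the fractional part; for normalized numbers the leading 1 is implied.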

The architecture of the floating point multiplier is shown in Figure 3. The sign bits of the multiplicand and the multiplier are XORed. The exponents of the multiplicand and the multiplier are added and normalized to obtain the exponent of the result. The mantissas of the multiplicand and the multiplier are multiplied and normalized to obtain the product term. Normalization is done to compensate for the loss in precision. Overflow in the product term is taken care of by the rounding logic. A parallel architecture for speeding up the computation has been addressed in the literature [5].
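The data path described above can be sketched behaviourally in Python. This is a simplified illustration, not the VHDL design: it assumes normalized single-precision operands, ignores zero, subnormal, infinity and NaN inputs as well as exponent overflow, and truncates the product instead of rounding it. All function names are hypothetical.

```python
import struct

BIAS = 127  # exponent bias of IEEE 754 single precision


def to_bits(x):
    """32-bit pattern of x rounded to single precision."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def from_bits(b):
    """Value encoded by a 32-bit single-precision pattern."""
    return struct.unpack(">f", struct.pack(">I", b))[0]


def fp_multiply(a_bits, b_bits):
    """Multiply two normalized operands given as 32-bit patterns (truncating)."""
    # Sign path: XOR of the two sign bits.
    sign = (a_bits >> 31) ^ (b_bits >> 31)
    # Exponent path: add the biased exponents and remove one bias.
    exp = ((a_bits >> 23) & 0xFF) + ((b_bits >> 23) & 0xFF) - BIAS
    # Mantissa path: restore the hidden leading 1, then multiply (48-bit product).
    man_a = (a_bits & 0x7FFFFF) | 0x800000
    man_b = (b_bits & 0x7FFFFF) | 0x800000
    product = man_a * man_b
    # Normalization: if the product is in [2, 4), shift right and bump the exponent.
    if product & (1 << 47):
        product >>= 1
        exp += 1
    # Keep the top 23 fraction bits (truncation stands in for the rounding logic).
    mantissa = (product >> 23) & 0x7FFFFF
    return (sign << 31) | ((exp & 0xFF) << 23) | mantissa


print(from_bits(fp_multiply(to_bits(1.5), to_bits(1.5))))
```

The three paths (sign, exponent, mantissa) are independent until the normalization step, which is what makes the structure attractive for pipelining.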


4. GALOIS FIELD MULTIPLIER

The need for portable circuits able to communicate at high bandwidths pushes the development of high speed and low power circuits. In this context, efficient Galois field GF(2^m) arithmetic blocks are desired in many fields such as error-control coding and cryptosystems. In error-control coding, GF(2^m) arithmetic, mainly field addition and multiplication, is the basis of Reed-Solomon encoding and decoding [6, 7].

In cryptographic applications, GF(2^m) arithmetic is largely used in elliptic-curve cryptosystems. In these applications, the building blocks that most influence system complexity and timing performance are the ones implementing the algebraic operations. The addition operation in GF(2^m) is equivalent to a simple bitwise XOR operation. The multiplication operation, on the other hand, requires larger and slower hardware. The multiplier design presented here has a small area, which makes it suitable for elliptic curve crypto processor design. Elliptic curve cryptosystems can therefore be used in applications that require small area and low power consumption, such as smart cards and cellular telephones. Different architectures of the Galois field multiplier are addressed in the literature [8], but a trade-off between area and speed always exists among the various architectures.

Figure 3: Floating Point Multiplier Architecture

This paper presents efficient hardware implementations of the Galois field multiplier. Figure 4 shows a basic 4-bit multiplier structure. The operands are as shown, with the multiplier residing in a 4-bit shift register, the multiplicand in a 4-bit register, the result in the middle (R(3) to R(0)), and an "irreducible polynomial" at the bottom. It is possible to load the multiplier and the multiplicand serially, and to have the irreducible polynomial arrive as part of the "power on" initialization process. As the operation proceeds, a common clock shifts the multiplier and the result registers. The irreducible polynomial and the multiplicand remain static.

Generally, all numbers in a Galois field are represented with 1s and 0s, and for GF(2^m) there are 2^m distinct symbols. For m = 4, there are 16 distinct symbols. Multiplication uses what is called polynomial form, so the arithmetic is similar to standard multiplication, except that if the result overflows the four-bit limit, it must be adjusted by subtracting the modulus, that is, the irreducible polynomial. The irreducible polynomial used here is x^4 + x + 1, which is represented by 10011 in binary [9]. The value of the multiplier (in bold) is incrementally placed in front of the parenthetic multiplicand, so successive bits of the multiplier can be read down that position from row to row. They arrive most significant bit first. Multiplying a number by one preserves the number, and multiplying by zero produces zero. Due to the large number of partial results that have 0000 in them, we don't see the effect of intermediate shifting.
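The shift-and-reduce multiplication described above can be modelled with a short Python sketch. It is a behavioural illustration under stated assumptions, not the register-level structure of Figure 4: the function names are hypothetical, m = 4, the irreducible polynomial is x^4 + x + 1 (binary 10011), and the multiplier bits are consumed most significant bit first.

```python
M = 4            # field GF(2^4)
POLY = 0b10011   # irreducible polynomial x^4 + x + 1


def gf_add(a, b):
    """Addition in GF(2^m) is a bitwise XOR."""
    return a ^ b


def gf_multiply(multiplicand, multiplier):
    """Serial multiplication in GF(2^4), multiplier bits taken MSB first."""
    result = 0
    for i in range(M - 1, -1, -1):
        result <<= 1                 # shift the result register
        if result & (1 << M):        # overflow past the four-bit limit
            result ^= POLY           # subtract (XOR) the irreducible polynomial
        if (multiplier >> i) & 1:    # current multiplier bit is 1
            result = gf_add(result, multiplicand)
    return result


# x * x^3 = x^4, which reduces to x + 1 (binary 0011).
print(format(gf_multiply(0b0010, 0b1000), "04b"))
```

Because subtraction and addition coincide in GF(2^m), the reduction step is a single XOR with the polynomial, which is why the hardware needs only AND and XOR gates plus the shift registers.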


5. SYNTHESIS

The architectures of the floating point multiplier and the Galois field multiplier are realized in VHDL. Logic verification has been performed using ModelSim. The designed structure is synthesized using the Xilinx ISE 9.1 tool. The synthesis report for the Spartan-3E FPGA is analyzed for the device utilization of the designed architecture. The floating point multiplier architecture shown above is designed with different numbers of pipelining stages.

Figure 4: A Basic 4-bit Galois Field Multiplier

Table 1: Device Utilization Summary for Floating Point Multiplier

Device utilized                   Single stage   Two stage   Three stage   Wave-pipelining
Logic utilization
  No. of slice flip flops                  601         751           846               456
  No. of 4 input LUTs                    2,481       2,491         2,494             2,409
Logic distribution
  No. of occupied slices                 1,548       1,614         1,645             1,463
  No. of slices                          1,548       1,614         1,645             1,463
  Total number of 4 input LUTs           2,658       2,669         2,660             2,594
Gate count                              24,115      25,366        26,080            22,505

The wave-pipelined architecture of the multiplier is designed by computing the maximum and minimum delays along the different paths inside the logic. The non-critical paths with the minimum delay are considered for delay equalization. Buffers are inserted on the appropriate paths, thereby reducing the difference between the maximum and minimum delay of the logic block.
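The delay equalization step can be illustrated with a minimal Python sketch. It assumes a single buffer type with a fixed delay no larger than the target spread, so that padded paths never exceed the critical path; the path names, the delays (in picoseconds) and the function name are hypothetical and not taken from the synthesized design.

```python
def buffers_needed(path_delays, buffer_delay, target_spread):
    """Number of buffers to insert on each path so that every path delay
    lies within target_spread of the longest path delay.

    Assumes integer delays and buffer_delay <= target_spread.
    """
    d_max = max(path_delays.values())
    plan = {}
    for name, delay in path_delays.items():
        shortfall = (d_max - target_spread) - delay
        # Ceiling division on integers; paths already in range need no buffers.
        plan[name] = max(0, -(-shortfall // buffer_delay))
    return plan


# Hypothetical path delays in picoseconds.
paths = {"critical": 10000, "mid": 7000, "short": 4000}
plan = buffers_needed(paths, buffer_delay=500, target_spread=1000)

# Path delays after padding, and the remaining max-min difference.
balanced = {name: paths[name] + plan[name] * 500 for name in paths}
spread = max(balanced.values()) - min(balanced.values())
print(plan, spread)
```

Only the non-critical paths receive buffers, so the longest path delay is unchanged while the difference between the maximum and minimum delay shrinks.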

The synthesis report for the floating point multiplier with different pipelining stages is shown in Table 1. The detailed synthesis report covers the device utilization, the timing and the total memory usage. From the device utilization report, the logic utilization in terms of the number of flip-flops, the number of lookup tables and the total gate count is analyzed.

The floating point multiplier is analyzed with different numbers of pipelining stages and compared with the wave-pipelined structure. The results in Table 1 show that the area consumption of the multiplier, in terms of the number of flip-flops or LUTs, is reduced with the wave-pipelining technique. This analysis is corroborated with the Synopsys tool, targeted to 90nm technology.
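The size of the reduction can be read off Table 1 with a few lines of Python. The figures below are copied from the single stage and wave-pipelining columns of Table 1; the helper name `percent_saving` is hypothetical.

```python
# Single stage vs. wave-pipelining values copied from Table 1.
table1 = {
    "slice flip flops": {"single": 601, "wave": 456},
    "4 input LUTs": {"single": 2481, "wave": 2409},
    "gate count": {"single": 24115, "wave": 22505},
}


def percent_saving(baseline, value):
    """Relative reduction of value with respect to baseline, in percent."""
    return 100.0 * (baseline - value) / baseline


for resource, counts in table1.items():
    saving = percent_saving(counts["single"], counts["wave"])
    print(f"{resource}: {saving:.1f}% fewer than single stage")
```

The flip-flop count shows the largest relative saving, which is consistent with wave-pipelining removing the internal synchronizing registers.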


We infer from Table 2 that the total area occupied by the logic increases with the number of pipelining stages. With the wave-pipelining technique, however, the throughput, latency and total area are reduced compared with the initial single stage architecture. The synthesis report for the Galois field multiplier with different pipelining stages is shown in Table 3. These results show that, when the architecture is targeted to a Xilinx FPGA, the area consumption of the multiplier, in terms of the number of flip-flops or LUTs, is reduced with wave-pipelining. This analysis is corroborated with the Synopsys tool, targeted to 90nm technology.

Table 2: Area Analysis for Floating Point Multiplier in Synopsys Tool

Area                            Single stage   Two stage   Three stage   Wave-pipelining
Combinational area (µm²)           150661.65   150580.46     150667.16         150147.00
Noncombinational area (µm²)        112808.18   133938.38     146806.13          85447.28
Net interconnect (µm²)             518903.89   536814.81     548018.05         494210.46
Total cell area (µm²)              263469.83   284518.84     297473.29         235594.28
Total area (µm²)                   782373.73   821333.65     845491.34         729804.74

The Galois field multiplier with different numbers of pipelining stages has also been analyzed in the Synopsys tool. The inference from Table 4 is that both area and power are reduced for the wave-pipelined architecture when compared with the other architectures of the GF multiplier. Thus the designed architecture is area efficient and power efficient.

Table 4: Area and Power Analysis for Galois Field Multiplier in Synopsys Tool

Area                            Single stage   Two stage   Three stage   Wave-pipelining
Combinational area (µm²)             6068.39     6317.15       6896.72           6268.39
Noncombinational area (µm²)         20458.36    23573.71      27772.66          17343.09
Net interconnect (µm²)              36447.86    39171.27      45318.04          33813.54
Total cell area (µm²)               26726.75    29890.86      34669.38          23611.40
Total area (µm²)                    63174.62    69062.14      79987.43          57424.95
Power                              708.08 µW   798.46 µW       1.23 mW         620.37 µW

Table 3: Device Utilization Summary for Galois Field Multiplier

Device utilized                   Single stage   Two stage   Three stage   Wave-pipelining
Logic utilization
  No. of slice                              88         102           115                75
  No. of slice flip flops                  121         144           167                98
Logic distribution
  No. of occupied slices                    93         105           117                80
  No. of slices                             93         105           117                80
  Total number of 4 input LUTs              87          88            86                87
Gate count                               1,712       1,902         2,074             1,528


6. CONCLUSION

This paper analyzes the performance of floating point and Galois field multipliers under wave-pipelining. Both architectures have been studied, and different numbers of pipelining stages have been implemented. The different architectures of both the floating point and the GF multiplier are also synthesized using the Synopsys tool, targeted to 90nm. It is found that the GF multiplier with the wave-pipelined structure is both area and power efficient. Hence wave-pipelining is found to be superior in terms of area and power when compared with the conventionally pipelined architectures. The same architectures can be designed with other wave-pipelining methods, such as logic restructuring and node collapsing.

REFERENCES

[1] Donald A. Joy and Maciej J. Ciesielski, “Clock Period Minimization with Wave Pipelining”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 12(14), April 1993.

[2] Fabian Klass, Maciej Ciesielski, Wayne P. Burleson and Wentai Liu, “Wave-Pipelining: A Tutorial and Research Survey”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 6(3), September 1998.

[3] G. Lakshminarayanan and B. Venkataramani, “Optimization Techniques for FPGA-Based Wave-pipelined DSP Blocks”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 13(7), July 2005.

[4] Ramalingam Sridhar and Xuguang Zhang, “Synchronization of Wave Pipelined Circuits”, IEEE 1994.

[5] Sanjiv Kumar Mangal, Raghavendra B. Deshmukh, M. Badghare and R. M. Patrikar, “FPGA Implementation of Low Power Parallel Multiplier”, 20th International Conference on VLSI Design (VLSID’07).

[6] Nick Iliev, James Stine, and Nathan Jachimiec, “Digital Finite-Field Multiplier for Reed-Solomon Channel Codes in GF(2^8) with Programmable Basis Polynomial”, IIT VLSI Lab.

[7] R. Lidl and H. Niederreiter, “Introduction to Finite Fields and Their Applications”, Cambridge Univ. Press, 1994.

[8] José Luis Imaña, “Bit-Parallel Arithmetic Implementations Over Finite Fields GF(2^m) with Reconfigurable Hardware”, pp. 337-356, Kluwer Academic, 2002.

[9] C. Yeh, I. S. Reed, and T. K. Truong, “Systolic Multipliers for Finite Fields GF(2^m)”, IEEE Trans. on Computers, C-33, pp. 357, 1984.
