Implementation and Comparison of Softcore Multiplier Architectures for FPGAs

Introduction and BackgroundMultiplier Architectures

ResultsConclusion

Implementation and Comparison of SoftcoreMultiplier Architectures for FPGAs

Shahid Abbas

Projektarbeit (Master of Science)Fachgebiet Digitaltechnik

Universtat Kassel

August 22, 2014

1 / 25


ResultsConclusion

Outline

1 Introduction and BackgroundMotivationFundamentals of Binary MultiplicationXilinx Virtex-6 SliceFloPoCo Library and Bit Heaps

2 Multiplier ArchitecturesTarget Specific ImplementationLUT-Based Multipliers

3 ResultsSimulationSynthesis Results

4 Conclusion

2 / 25


ResultsConclusion

Outline




4 Conclusion

2 / 25


ResultsConclusion

Outline




4 Conclusion

2 / 25


ResultsConclusion

Outline




4 Conclusion

2 / 25


ResultsConclusion

MotivationFundamentals of Binary MultiplicationXilinx Virtex-6 SliceFloPoCo Library and Bit Heaps

Motivations

Fast Multiplication for Signal Processing

Limited number of DSP Blocks in FPGA [1]

Fixed word size

Use big multiplier for small word size

Fixed allocation

Place and routing issues

Use of FPGA logic blocks for multiplier of any word size

Softcore multiplier that work in conjunction with DSP multipliers

3 / 25


ResultsConclusion


Motivations



Fixed word size


Fixed allocation




3 / 25


ResultsConclusion


Motivations



Fixed word size


Fixed allocation




3 / 25


ResultsConclusion


Motivations



Fixed word size


Fixed allocation




3 / 25


ResultsConclusion


Motivations



Fixed word size


Fixed allocation




3 / 25


ResultsConclusion


Motivations



Fixed word size


Fixed allocation




3 / 25


ResultsConclusion


Motivations



Fixed word size


Fixed allocation




3 / 25


ResultsConclusion


Motivations



Fixed word size


Fixed allocation




3 / 25


ResultsConclusion


Fundamentals of Binary Multiplication

1 Partial Products Calculation

2 Addition of Partial Products by proper shifting

A=A3

B=B3

20·B0·A

x

+

Step 1

Step 2

A0…

… B0

21·B1·A

22·B2·A

23·B3·A

+

+

=

Figure: Binary 4×4-bit Multiplication

4 / 25


ResultsConclusion


Fundamentals of Binary Multiplication

1 Partial Products Calculation

2 Addition of Partial Products by proper shifting

A=A3

B=B3

20·B0·A

x

+

Step 1

Step 2

A0…

… B0

21·B1·A

22·B2·A

23·B3·A

+

+

=

Figure: Binary 4×4-bit Multiplication

4 / 25


ResultsConclusion


Xilinx Virtex-6 Slice [2]

Configurable Logic Blocks (CLB) contains two slices

Each slice contains four Look-Up Tables (LUT), eight Flip-Flops, multiplexers and acarry-propagation logic.

Single or two outputs per LUT

0

1

0

1

0

10

1c_in

c_out

LUTLUTLUTLUT

Figure: Xilinx Virtex-6 Slice

5 / 25


ResultsConclusion






0

1

0

1

0

10

1c_in

c_out

LUTLUTLUTLUT


5 / 25


ResultsConclusion






0

1

0

1

0

10

1c_in

c_out

LUTLUTLUTLUT


5 / 25


ResultsConclusion


FloPoCo Library and Bit Heaps

Floating-Point Cores (FloPoCo), C++ framework for synthesizable VHDL code [3] [4].

Bit heap is a data-structure holds unevaluated sum of any number of bits weighted bypower of two [5].

Equally weighted bits aligned in column as order is irrelevant for sum

before first compression

1 0.530 ns

1 1.061 ns

before 3-bit height additions

before final addition

Figure: Bit-Heap Structure for 16×16-Bit Multiplier

6 / 25


ResultsConclusion







1 0.530 ns

1 1.061 ns




6 / 25


ResultsConclusion







1 0.530 ns

1 1.061 ns




6 / 25


ResultsConclusion

Target Specific ImplementationLUT-Based Multipliers

Target Specific Implementation [6]

Best Fit design in Logic Blocks = Better Performance

a b

c_out c_in

sum

0

1

Figure: Full Adder Implementation with Multiplexer and XOR-Gates

7 / 25


ResultsConclusion


Target Specific Implementation [6]

0

1

0

1

0

10

1c_in

c_out

S0S1S2S3

LUTLUTLUTLUT

a0b0a1b1a2b2a3b3a4b4a5b5a6b6a7b7

Figure: Slice configuration of 4-LUTs for Partial Product and Addition

8 / 25


ResultsConclusion


Target Specific Implementation (Automated)

vector < vector < pair < int, int >>>

0

00

0

0

c_in=0

c_out (to bit-heap)

Partial-Product Calculation

Re-arrangement

3-LUT Slice

4-LUT Slice

n 8-bit

m 8-bit

Before MultiplicationA

B

20·B0·A

21·B1·A

22·B2·A

23·B3·A

24·B4·A

25·B5·A

26·B6·A

27·B7·A

Figure: 8×8-bit Multiplier Implementation in FloPoCo9 / 25


ResultsConclusion


Target Specific Implementation (For 8×8-bit Multiplier)

Automated Implementation


Manual interconnection of Slices

Addition using Bit Heaps

Addition using Arithmetic Expressions

10 / 25


ResultsConclusion








10 / 25


ResultsConclusion








10 / 25


ResultsConclusion








10 / 25


ResultsConclusion








10 / 25


ResultsConclusion


Target Specific Implementation (Manual)

Re-arrangement

To b

it h

eap

To b

it h

eap

To b

it h

eap

To b

it h

eap

To b

it h

eap

AND-gate

Figure: 8×8-bit Multiplier with Manual Interconnection of Slices

11 / 25


ResultsConclusion


LUT-Based Multipliers [7] [5]

Multiplication of two numbers can be obtained by the bit shifted additions ofsmall multiplier result

A = 2nA1 + A0 (1)

B = 2nB1 + B0 (2)

A× B = 22nA1B1 + 2n(A1B0 + A0B1) + A0B0 (3)

A basic n ×m-bit multiplier can be instantiated multiple times

Add results of each instance through proper shifting

12 / 25


ResultsConclusion




A = 2nA1 + A0 (1)

B = 2nB1 + B0 (2)

A× B = 22nA1B1 + 2n(A1B0 + A0B1) + A0B0 (3)



12 / 25


ResultsConclusion




A = 2nA1 + A0 (1)

B = 2nB1 + B0 (2)

A× B = 22nA1B1 + 2n(A1B0 + A0B1) + A0B0 (3)



12 / 25


ResultsConclusion



3×3-LUT based Multipliers (Needs 6-LUTs for 6 output Bits)


1×4-LUT based Multipliers

33

6

A B

Y

Figure: 3×3-LUT Multiplier

3

5

A B

Y

2


13 / 25


ResultsConclusion






33

6

A B

Y


3

5

A B

Y

2


13 / 25


ResultsConclusion






33

6

A B

Y


3

5

A B

Y

2


13 / 25


ResultsConclusion


LUT-Based Multipliers (3×3-LUT based 8×7 Multiplier)

3x3 3x3

3x33x3

0 1 2 3 4 5

0

1

23

4

5

2x3

2x3

3x1 3x16 2x1

6 7

A

B

i ii

iii iv

v

vi

vii viii ix

Figure: 8×7-bit LUT-Multiplier Implementation in FloPoCo

AB = A0..2B0..2 + 23(A3..5B0..2 + A0..2B3..5) + 26(A6..7B0..2 + A3..5B3..5 + A0..2B6)

+ 29(A6..7B3..5 + A3..5B6) + 212A6..7B6

(4)

14 / 25


ResultsConclusion


LUT-Based Multipliers (3×3-LUT based 8×7 Multiplier)

3x3 3x3

3x33x3

0 1 2 3 4 5

0

1

23

4

5

2x3

2x3

3x1 3x16 2x1

6 7

A

B

i ii

iii iv

v

vi

vii viii ix

Figure: 8×7-bit LUT-Multiplier Implementation in FloPoCo

AB = A0..2B0..2 + 23(A3..5B0..2 + A0..2B3..5) + 26(A6..7B0..2 + A3..5B3..5 + A0..2B6)

+ 29(A6..7B3..5 + A3..5B6) + 212A6..7B6

(4)

14 / 25


ResultsConclusion

SimulationSynthesis Results

Simulation

1 Word sizes are even and equal

2 Word sizes are even and unequal.

3 Width of large word is even and other is odd

4 Width of large word is odd and other is even

5 Word sizes are odd and unequal

6 Word sizes are odd and equal

Eight Designs for every of above specifications

48-Designs for each architecture

Self-Checking testbenches were generated using FloPoCo function emulate(TestCase*tc)

TestBench 10000 option was used to generated 10000 random testcases during core-generation.

Simulation on ModelSim

15 / 25


ResultsConclusion


Simulation












15 / 25


ResultsConclusion


Simulation












15 / 25


ResultsConclusion


Simulation












15 / 25


ResultsConclusion


Simulation












15 / 25


ResultsConclusion


Simulation












15 / 25


ResultsConclusion


Simulation












15 / 25


ResultsConclusion


Simulation












15 / 25


ResultsConclusion


Simulation












15 / 25


ResultsConclusion


Simulation












15 / 25


ResultsConclusion


Simulation












15 / 25


ResultsConclusion


Synthesis Results

0 500 1000 1500 2000 2500 3000 3500 4000 45000

100

200

300

400

500

600

700

800

900

1000

Speed Vs Complexity (N X M)

Complexity (N X M)

Fre

quency (

MH

z)

fmax

= 906.62 MHz in Target Specific Multiplier

Target Specfic Multiplier

3x3 LUT Multiplier

1x4 LUT Multiplier

3x2 LUT Multiplier

Figure: Comparison of Architectures on the basis of Speed (for N×M-bit)

16 / 25


ResultsConclusion


Synthesis Results

0 10 20 30 40 50 60 700

100

200

300

400

500

600

700Speed Vs Complexity (N X N)

Complexity (N)

Fre

quency (

MH

z)


3x3 LUT Multiplier

1x4 LUT Multiplier

3x2 LUT Multiplier

Figure: Comparison of Architectures on the basis of Speed (for N×N-bit)

17 / 25


ResultsConclusion


Synthesis Results

0 500 1000 1500 2000 2500 3000 3500 4000 45000

200

400

600

800

1000

1200

1400

1600

1800Slice Usage Vs Complexity (N X M)

Complexity (N X M)

Num

ber

of S

lices


3x3 LUT Multiplier

1x4 LUT Multiplier

3x2 LUT Multiplier

Figure: Comparison of Architectures on the basis of Slice usage (for N×M-bit)

18 / 25


ResultsConclusion


Synthesis Results

0 10 20 30 40 50 60 700

200

400

600

800

1000

1200

1400

1600

1800Slice Usage Vs Complexity (N X N)

Complexity (N)

Num

ber

of S

lices


3x3 LUT Multiplier

1x4 LUT Multiplier

3x2 LUT Multiplier

Figure: Comparison of Architectures on the basis of Speed (for N×N-bit)

19 / 25


ResultsConclusion


Synthesis Results

Average Performance

Table: Average values of parameters for different architectures

Architecture No. of Flip-Flops No. of LUTs No. of Slices Frequency (MHz)

Target Specific 1144 1615 419 346.36

3×3-LUT 1422 1893 491 301.03

3×2-LUT 1730 1962 513 264.95

1×4-LUT 2019 2340 610 259.98

20 / 25


ResultsConclusion


Synthesis Results

Automatic Vs Manual Interconnection of Slices (8×8-bit)

Table: Automatic Vs Manual routing between Slices

No. of FFs No. of LUTs No. of Slices Frequency (MHz)

Automatic 56 74 21 686.81

Manual(Bit Heap) 22 74 20 256.61

Manual (Without Bit Heap) 40 59 16 414.08

21 / 25


ResultsConclusion


Synthesis Results

Automatic Vs Manual Interconnection of Slices (8×8-bit)




Figure: Bit Heap Structure for AutomaticInterconnection of Slices


1 0.530 ns



Figure: Bit Heap Structure for ManualInterconnection of Slices

22 / 25


ResultsConclusion

Conclusion

Fast multipliers with minimum resources can be implementedby choosing appropriate architecture.

Target Specific Implementation showed best results due toaverage fast speed and less consumption of resources.

Automated generation of this approach can modified withintroduction of AND-gate for corner elements.

Slice usage can be improved by their manual interconnection,with compromise over speed.

23 / 25


ResultsConclusion

Conclusion





23 / 25


ResultsConclusion

Conclusion





23 / 25


ResultsConclusion

Conclusion





23 / 25


ResultsConclusion

References

[1] Ian Kuon and J. Rose.Measuring the Gap Between FPGAs and ASICs.Computer-Aided Design of Integrated Circuits and Systems, 26:203–215, February 2007.

[2] Xilinx.Virtex-6 FPGA, Configurable Logic Block User Guide, UG364 (v1.2).http://www.xilinx.com/support/documentation/user_guides/ug364.pdf, 2012.

[3] F. de Dinechin and B. Pasca.Designing Custom Arithmetic Data Paths with FloPoCo.Design and Test of Computers, 28:18–27, 2011.

[4] Florent de Dinechin.Tutorial held at HiPEAC’2013 “Building Custom Arithmetic Operators with the FloPoCo Generator”.http://perso.citi-lab.fr/fdedinec/recherche/2013-HiPEAC-Tutorial-FloPoCo/flopoco-tutorial.pdf,2013.

[5] Brunie N., de Dinechin F., Istoan M., Sergent G., Illyes K., and Popa B.Arithmetic core generation using bit heaps.In Proc. IEEE FPL ’2013, pages 1–8, Porto, Portugal, 2–4, 2013.

[6] H. ParandehAfshar and P. Ienne.Measuring and Reducing the Performance Gap between Embedded and Soft Multipliers on FPGAs.In Proc. IEEE FPL ’2011, pages 225–231, Chania, Greece, 5–7, 2011.

[7] F. de Dinechin and B. Pasca.Large multipliers with fewer DSP blocks.In Proc. IEEE FPL ’2009, pages 225–231, Chania, Greece, Aug 31-Sept 2 2011.

24 / 25

http://www.xilinx.com/support/documentation/user_guides/ug364.pdf

http://perso.citi-lab.fr/fdedinec/recherche/2013-HiPEAC-Tutorial-FloPoCo/flopoco-tutorial.pdf


ResultsConclusion

Thanks for your attention !

25 / 25

Engineering

Implementation and Comparison of Softcore Multiplier Architectures for FPGAs