Resource Optimal Design of Large Multipliers for FPGAs. Martin Kumm*, Johannes Kappauf*, Matei Istoan† and Peter Zipf* (*University of Kassel, Germany; †University Lyon, France). 24th IEEE Symposium on Computer Arithmetic, 25.07.2017

Page 1: Resource Optimal Design of Large Multipliers for FPGAs

Resource Optimal Design of Large Multipliers for FPGAs

Martin Kumm*, Johannes Kappauf*, Matei Istoan† and Peter Zipf*

*University of Kassel, Germany
†University Lyon, France

24th IEEE Symposium on Computer Arithmetic

25.07.2017

Page 2: Resource Optimal Design of Large Multipliers for FPGAs

Motivation

Multiplication is a fundamental arithmetic operation.

Embedded multipliers in the FPGA fabric are limited in size (and quantity).

Larger multipliers can be decomposed into smaller multipliers, realized by DSP blocks or logic resources.

Question of interest: how can the decomposition be done in a resource-optimal way?

Page 3: Resource Optimal Design of Large Multipliers for FPGAs

Outline

1. How can the problem be formulated as a tiling problem?

2. What do the tiles look like?

3. How can the problem be solved?

Page 5: Resource Optimal Design of Large Multipliers for FPGAs

Multiplier Decomposition

A large multiplier can be decomposed into several smaller multipliers:

$$
A \times B = (A_H 2^{n} + A_L)(B_H 2^{m} + B_L)
= \underbrace{A_H B_H}_{M_4} 2^{n+m} + \underbrace{A_H B_L}_{M_3} 2^{n} + \underbrace{A_L B_H}_{M_2} 2^{m} + \underbrace{A_L B_L}_{M_1}
$$

Page 6: Resource Optimal Design of Large Multipliers for FPGAs

Multiplier Tiling

The multiplier can be graphically represented as an X×Y board which is tiled by smaller multipliers, represented as rectangles [de Dinechin 2009].

The required left shift of each tile can be obtained from the sum of its tile coordinates (x, y).

[Figure: 32×32 board tiled with the four multipliers M1 to M4 for n = m = 16 bit; both axes have ticks at 0, 16 and 32, with x increasing to the left.]

$$
A \times B = (A_H 2^{16} + A_L)(B_H 2^{16} + B_L)
= \underbrace{A_H B_H}_{M_4} 2^{32} + \underbrace{A_H B_L}_{M_3} 2^{16} + \underbrace{A_L B_H}_{M_2} 2^{16} + \underbrace{A_L B_L}_{M_1}
$$
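The shift rule can be checked numerically. The following is a minimal Python sketch, not the authors' implementation: the function name `tiled_product`, the tuple layout `(x, y, w, h)` and the convention that x indexes the bits of A and y the bits of B are assumptions made for illustration.

```python
def tiled_product(a, b, tiles):
    """Sum the partial products of all tiles.

    Each tile is (x, y, w, h): it multiplies bits x..x+w-1 of `a`
    by bits y..y+h-1 of `b`, and its result is shifted left by x + y.
    """
    total = 0
    for x, y, w, h in tiles:
        a_part = (a >> x) & ((1 << w) - 1)  # w-bit slice of a starting at bit x
        b_part = (b >> y) & ((1 << h) - 1)  # h-bit slice of b starting at bit y
        total += (a_part * b_part) << (x + y)
    return total


# 32x32 board covered by four 16x16 tiles, as in the example with n = m = 16
TILES_32 = [(0, 0, 16, 16), (0, 16, 16, 16), (16, 0, 16, 16), (16, 16, 16, 16)]

a, b = 0xDEADBEEF, 0x12345678
print(tiled_product(a, b, TILES_32) == a * b)
```

Any other complete, non-overlapping list of rectangles can be passed in the same way, so the same function can be used to sanity-check a tiling before it is turned into hardware.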

Page 7: Resource Optimal Design of Large Multipliers for FPGAs

Multiplier Tiling

A valid multiplier tiling is defined as follows:

The board must be completely covered, without overlaps between the tiles.

Overlaps with the border of the board are allowed.

[Figure: tiling of a 53×53 multiplier [de Dinechin 2009]; axis ticks at 0, 17, 24, 34, 41, 58 and 53 on both axes, with x increasing to the left.]
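The two validity rules can be expressed as a small check. This is a sketch under assumptions: rectangular tiles given as `(x, y, w, h)` and the helper name `is_valid_tiling` are illustrative choices, not part of the original slides.

```python
def is_valid_tiling(X, Y, tiles):
    """Return True if every cell of the X-by-Y board is covered exactly once.

    Tiles are (x, y, w, h). Cells of a tile that fall outside the board
    are ignored, which models the allowed overlap with the border.
    """
    covered = [[0] * Y for _ in range(X)]
    for x, y, w, h in tiles:
        for cx in range(x, x + w):
            for cy in range(y, y + h):
                if 0 <= cx < X and 0 <= cy < Y:
                    covered[cx][cy] += 1
    return all(count == 1 for column in covered for count in column)


# 3x3 board: the second 2x3 tile sticks out over the border, which is allowed
print(is_valid_tiling(3, 3, [(0, 0, 2, 3), (2, 0, 2, 3)]))
# Two tiles sharing a column of cells are rejected
print(is_valid_tiling(3, 3, [(0, 0, 2, 3), (1, 0, 2, 3)]))
```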

Page 9: Resource Optimal Design of Large Multipliers for FPGAs

Logic-based Tiles

Several LUT-based multipliers can be used:

3×3 multiplier, which can be mapped to six 6-input LUTs (LUT6) [Brunie 2013]

2×3 multiplier, which can be mapped to three LUT6 (realizing five LUT5) [Kumm 2015]

1×2 multiplier, which uses a single LUT6 (realizing two LUT5)

In addition, LUT/carry-chain multipliers are used:

Single row of an FPGA-optimized Baugh-Wooley multiplier [Parandeh-Afshar 2011]

Page 10: Resource Optimal Design of Large Multipliers for FPGAs

Shapes of the Logic-based Tiles

[Figure: shapes of the logic-based tiles in five panels: (a) 3×3; (b) 3×2 / 2×3; (c) 2×1 / 1×2; (d) k×2; (e) 2×k.]

Page 11: Resource Optimal Design of Large Multipliers for FPGAs

LUT Requirements in the Compressor Tree

[Figure: #LUTs (0 to 1,000) plotted against the number of input bits (0 to 1,600) for the multi-input addition and the x3 operation, together with the linear fit 0.65 × #bits.]

Page 12: Resource Optimal Design of Large Multipliers for FPGAs

Logic-based Multipliers

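The cost of a tile combines the LUTs of the multiplier itself (#LUT_m) and the compressor-tree LUTs needed for its output word of size w_s:

$$
\mathrm{cost}_s = \#\mathrm{LUT}_m + 0.65\, w_s
$$

To get the "quality" of a multiplier, an efficiency metric is defined as the benefit/cost ratio:

$$
E_s = \frac{\mathrm{area}_s}{\mathrm{cost}_s}
$$

| Shape | Tile area | Word size (w_s) | #LUT_m | Total cost (cost_s) | Efficiency (E_s) |
|-------|-----------|-----------------|--------|---------------------|------------------|
| 1×1 | 1 | 1 | 1 | 1.65 | 0.625 |
| 1×2 | 2 | 2 | 1 | 2.3 | 0.87 |
| 2×3 | 6 | 5 | 3 | 6.25 | 0.96 |
| 3×3 | 9 | 6 | 6 | 9.9 | 0.91 |
| 2×k | 2k | k + 2 | k + 1 | 1.65k + 2.3 | 2k / (1.65k + 2.3) (= 1.21 for k → ∞) |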
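The cost and efficiency figures for the 1×2, 2×3, 3×3 and 2×k tiles can be recomputed from the two formulas. The sketch below is illustrative only; the function names and the argument order are assumptions, while the constant 0.65 and the tile parameters come from the slides.

```python
COMPRESSOR_LUTS_PER_BIT = 0.65  # slope of the linear fit from the compressor-tree slide


def tile_cost(luts_m, word_size):
    """cost_s = #LUT_m + 0.65 * w_s"""
    return luts_m + COMPRESSOR_LUTS_PER_BIT * word_size


def efficiency(area, luts_m, word_size):
    """E_s = area_s / cost_s"""
    return area / tile_cost(luts_m, word_size)


def row_tile(k):
    """(area, #LUT_m, word size) of a 2 x k tile, as listed in the table."""
    return 2 * k, k + 1, k + 2


# (area, #LUT_m, word size) for three of the LUT-based tiles
TILES = {"1x2": (2, 1, 2), "2x3": (6, 3, 5), "3x3": (9, 6, 6)}

for name, (area, luts_m, ws) in TILES.items():
    print(name, round(tile_cost(luts_m, ws), 2), round(efficiency(area, luts_m, ws), 2))

# The 2 x k efficiency approaches 2 / 1.65 as k grows
print(round(efficiency(*row_tile(1000)), 2))
```

Because the 2×k tile spreads its fixed overhead over a growing area, it is the only tile in this set whose efficiency exceeds 1.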

Page 13: Resource Optimal Design of Large Multipliers for FPGAs

DSP-based Tiles

Xilinx DSP blocks contain 18×25 bit (signed) / 17×24 bit (unsigned) multipliers.

They also contain post-adders, which can be used to add a multiplier result that has already been obtained.

This reduces the size of the compressor tree.

Graphically, this can be represented as a so-called super-tile [Banescu 2010].

Page 14: Resource Optimal Design of Large Multipliers for FPGAs

Super-Tiles of Xilinx FPGAs

[Figure: twelve super-tile shapes, panels (a) to (l).]

Page 20: Resource Optimal Design of Large Multipliers for FPGAs

Formalizing the Problem

| Constant/Variable | Meaning |
|-------------------|---------|
| x, y ∈ N_0 | Coordinates |
| X, Y ∈ N_0 | Outer bounds of the multiplier to be designed |
| M_{x,y} ∈ {0, 1} | Shape of the multiplier to be designed; true when (x, y) is within the area of the multiplier |
| 𝒮 | Set of small multipliers with different shape |
| S = \|𝒮\| | Number of available smaller multipliers |
| s ∈ {0, 1, ..., S − 1} | Shape index of the smaller multiplier |
| m^s_{x,y} ∈ {0, 1} | Boolean constant describing each small multiplier; true when (x, y) is within the area of the multiplier of shape s |
| cost_s ∈ R | Cost of a small multiplier of shape s |
| d^s_{x,y} ∈ {0, 1} | Decision variable, which is true when a multiplier of shape s is placed at coordinate (x, y) |

Page 21: Resource Optimal Design of Large Multipliers for FPGAs

Specification of a Tile

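Setting

$$
m^0_{0,0} = m^0_{0,1} = m^0_{0,2} = m^0_{1,0} = m^0_{1,1} = 1
$$

with all other m's zero would define the following tile:

[Figure: the resulting tile on a grid with x from 0 to 2 and y from 0 to 3, x increasing to the left.]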
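One way to hold such a tile specification in software is as a set of coordinates. This is a minimal sketch; the names `SHAPE_0`, `m` and `render`, and the choice of a Python set, are assumptions for illustration. The text rendering follows the slide convention of x increasing to the left.

```python
# Cells (x, y) with m^0_{x,y} = 1, taken from the slide
SHAPE_0 = {(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)}


def m(shape, x, y):
    """Boolean constant m^s_{x,y}: 1 inside the tile, 0 everywhere else."""
    return 1 if (x, y) in shape else 0


def render(shape):
    """Draw the tile as text, one row per y value, with x increasing to the left."""
    width = max(x for x, _ in shape) + 1
    height = max(y for _, y in shape) + 1
    rows = []
    for y in range(height):
        rows.append("".join("#" if m(shape, x, y) else "." for x in reversed(range(width))))
    return "\n".join(rows)


print(render(SHAPE_0))
```

Non-rectangular shapes such as this one are what allow super-tiles and other irregular tiles to be described with the same constants.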

Page 22: Resource Optimal Design of Large Multipliers for FPGAs

ILP Formulation

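The multiplier tiling problem can be formulated as an integer linear programming (ILP) problem as follows:

$$
\text{minimize} \quad \sum_{s=0}^{S-1} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} \mathrm{cost}_s \, d^s_{x,y}
$$

$$
\text{subject to} \quad \sum_{s=0}^{S-1} \sum_{x'=0}^{X-1} \sum_{y'=0}^{Y-1} m^s_{x-x',\,y-y'} \, d^s_{x',y'} = 1 \quad \text{for } 0 \le x \le X,\ 0 \le y \le Y \text{ with } M_{x,y} = 1
$$

Here m^s is zero for any index outside the area of the tile of shape s, so only tiles that actually reach (x, y) contribute to the sum.

The ILP problem can be solved by using standard solvers.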
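For very small boards, the objective and the coverage constraint can be explored without an ILP solver. The sketch below swaps in a plain exhaustive search with a cost bound in place of an ILP solver; it assumes a fully rectangular board (M_{x,y} = 1 everywhere), non-negative costs, and tile anchors with non-negative coordinates. All function names are hypothetical.

```python
def rect(w, h):
    """Offsets (x, y) of a w-by-h rectangular tile, i.e. the cells with m^s_{x,y} = 1."""
    return {(x, y) for x in range(w) for y in range(h)}


def min_cost_tiling(X, Y, shapes, costs):
    """Exhaustively search for the cheapest exact cover of an X-by-Y board.

    shapes[s] is a set of offsets with m^s_{x,y} = 1 and costs[s] is cost_s.
    Returns (total cost, [(s, x0, y0), ...]), the placements with
    d^s_{x0,y0} = 1, or (inf, None) if no valid tiling exists.
    """
    best = [float("inf"), None]

    def search(covered, placed, cost):
        if cost >= best[0]:
            return  # bound: cannot beat the best tiling found so far
        free = [(x, y) for y in range(Y) for x in range(X) if (x, y) not in covered]
        if not free:
            best[0], best[1] = cost, list(placed)
            return
        cx, cy = free[0]  # this cell must be covered by exactly one tile
        for s, shape in enumerate(shapes):
            for dx, dy in shape:
                x0, y0 = cx - dx, cy - dy  # anchor that puts offset (dx, dy) on the cell
                if x0 < 0 or y0 < 0:
                    continue  # the sums in the formulation start at anchor 0
                cells = {(x0 + px, y0 + py) for px, py in shape}
                on_board = {(x, y) for x, y in cells if x < X and y < Y}
                if on_board & covered:
                    continue  # would cover a cell twice
                placed.append((s, x0, y0))
                search(covered | on_board, placed, cost + costs[s])
                placed.pop()

    search(frozenset(), [], 0.0)
    return best[0], best[1]


# 2x2 board with 1x1 tiles (cost 1.65) and 1x2 tiles in both orientations (cost 2.3)
print(min_cost_tiling(2, 2, [rect(1, 1), rect(1, 2), rect(2, 1)], [1.65, 2.3, 2.3]))
```

The search grows exponentially with the board size, which is why the slides rely on standard ILP solvers for the 24×24 to 64×64 cases.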

Page 23: Resource Optimal Design of Large Multipliers for FPGAs

ILP Formulation

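Graphical representation of the left-hand side of the ILP constraint:

[Figure: 6×6 grid with x and y from 0 to 5 (x increasing to the left), showing shape 0 placed at (1, 2). The cells are labelled with the following products:]

$$
m^0_{0,3}\, d^0_{1,2} = 0, \quad
m^0_{0,2}\, d^0_{1,2} = 1, \quad
m^0_{0,1}\, d^0_{1,2} = 1, \quad
m^0_{0,0}\, d^0_{1,2} = 1, \quad
m^0_{1,1}\, d^0_{1,2} = 1, \quad
m^0_{1,0}\, d^0_{1,2} = 1
$$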

Page 24: Resource Optimal Design of Large Multipliers for FPGAs

Additional DSP Constraint

The cost of DSP blocks is hard to compare with the cost of LUTs.

It is better to constrain the DSP count for a given application.

A single additional constraint can be used to specify the number of DSPs (#DSP):

$$
\sum_{s=0}^{S-1} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} D_s \, d^s_{x,y} = \#\mathrm{DSP}
$$

where D_s specifies the number of DSPs in multiplier shape s.
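The left-hand side of this constraint is simply a count over the placed tiles. A minimal sketch follows, assuming a tiling is stored as a list of `(s, x, y)` placements; the shape indices and the DSP counts in the example (a logic tile, a single-DSP tile and a two-DSP super-tile) are hypothetical.

```python
def dsp_count(placements, dsps_per_shape):
    """Left-hand side of the constraint: the sum of D_s over all placed tiles."""
    return sum(dsps_per_shape[s] for s, _x, _y in placements)


def meets_dsp_budget(placements, dsps_per_shape, n_dsp):
    """True if the tiling uses exactly n_dsp DSP blocks."""
    return dsp_count(placements, dsps_per_shape) == n_dsp


# Hypothetical shape table: 0 = logic tile, 1 = single DSP tile, 2 = super-tile of two DSPs
D = [0, 1, 2]
tiling = [(2, 0, 0), (1, 24, 0), (0, 0, 17)]

print(dsp_count(tiling, D))
print(meets_dsp_budget(tiling, D, 3))
```

Using an equality rather than an upper bound matches the slide and lets each DSP count be evaluated as its own optimization run.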

Page 25: Resource Optimal Design of Large Multipliers for FPGAs

Results

Four important cases were considered:

24×24 (single precision)

32×32

53×53 (double precision)

64×64

Each case was evaluated for varying DSP counts, up to a DSP-only implementation.

Page 26: Resource Optimal Design of Large Multipliers for FPGAs

Resulting Tilings 24/32 Bit

[Figure: resulting tilings in eight panels: 24×24 with 0, 1 and 2 DSPs; 32×32 with 0, 1, 2, 3 and 4 DSPs.]

Callouts on the 24/32 bit tilings:

Baugh-Wooley multiplier [Parandeh-Afshar 2011]

2×k and 1×2 tiles perform best for LUT-based multiplication

Efficient solution utilizing two super-tiles

Page 30: Resource Optimal Design of Large Multipliers for FPGAs

Resulting Tilings 53 Bit

[Figure: resulting tilings in five panels: 53×53 with 5, 6, 7, 8 and 9 DSPs.]

Callouts on the 53 bit tilings:

Pinwheel inside of a pinwheel

The logic multiplier consumes 1/4 of the area compared to the previous hand-optimized design [de Dinechin 2009]

Page 32: Resource Optimal Design of Large Multipliers for FPGAs

Resulting Tilings 64 Bit

[Figure: resulting tilings in five panels: 64×64 with 7, 8, 9, 10 and 11 DSPs.]

Page 33: Resource Optimal Design of Large Multipliers for FPGAs

Optimization & Synthesis Results

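| Mult. | Method | #DSP | Area (geom.) of logic mult. | Slices | Slice red. | f_clk [MHz] |
|-------|--------|------|-----------------------------|--------|------------|-------------|
| 24×24 | [Brunie 2013] | 1 | 216 | 65 | | 212.4 |
| 24×24 | proposed | 1 | 168 | 58 | 10.8% | 287.4 |
| 24×24 | [Brunie 2013] | 2 | 0 | 0 | | 418.9 |
| 24×24 | proposed | 2 | 0 | 0 | 0.0% | 418.9 |
| 32×32 | [Banescu 2010] | 0 | 1024 | 339 | | 275.8 |
| 32×32 | proposed | 0 | 1024 | 276 | 18.6% | 304.4 |
| 32×32 | [Brunie 2013] | 1 | 648 | 205 | | 192.8 |
| 32×32 | [Banescu 2010] | 1 | 616 | 234 | | 352.6 |
| 32×32 | proposed | 1 | 616 | 180 | 12.2% | 302.5 |
| 32×32 | [Brunie 2013] | 2 | 288 | 94 | | 270.1 |
| 32×32 | proposed | 2 | 256 | 82 | 12.8% | 338.0 |
| 32×32 | [Brunie 2013] | 3 | 135 | 75 | | 194.0 |
| 32×32 | [Banescu 2010] | 3 | 176 | 75 | | 426.6 |
| 32×32 | proposed | 3 | 64 | 44 | 41.3% | 314.5 |
| 32×32 | [Brunie 2013] | 4 | 0 | 17 | | 314.7 |
| 32×32 | [Banescu 2010] | 4 | 40 | 38 | | 379.4 |
| 32×32 | proposed | 4 | 0 | 13 | 23.5% | 181.7 |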
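The "Slice red." column appears to be the saving of the proposed design relative to the best (smallest) previous design with the same DSP count. That reading is an inference from the numbers, not a definition given on the slide; the sketch below, with a hypothetical helper name, reproduces several entries under this assumption.

```python
def slice_reduction(proposed_slices, previous_slices):
    """Relative saving versus the best (smallest) previous design, in percent."""
    best_previous = min(previous_slices)
    if best_previous == 0:
        return 0.0  # nothing to save when the reference design uses no slices
    return 100.0 * (best_previous - proposed_slices) / best_previous


# 32x32 with 1 DSP: [Brunie 2013] uses 205 slices, [Banescu 2010] 234, proposed 180
print(round(slice_reduction(180, [205, 234]), 1))
# 24x24 with 1 DSP: [Brunie 2013] uses 65 slices, proposed 58
print(round(slice_reduction(58, [65]), 1))
```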

Callouts on the 24/32 bit results:

Fewer slices because of a better logic-based multiplier/compressor tree

Fewer slices because of better super-tile usage

Page 36: Resource Optimal Design of Large Multipliers for FPGAs

Optimization & Synthesis Results

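| Mult. | Method | #DSP | Area (geom.) of logic mult. | Slices | Slice red. | f_clk [MHz] |
|-------|--------|------|-----------------------------|--------|------------|-------------|
| 53×53 | [Banescu 2010] | 5 | 1029 | 350 | | 298.2 |
| 53×53 | proposed | 5 | 769 | 295 | 15.7% | 313.2 |
| 53×53 | [Brunie 2013] | 6 | 468 | 196 | | 214.1 |
| 53×53 | [Banescu 2010] | 6 | 721 | 220 | | 298.2 |
| 53×53 | proposed | 6 | 361 | 180 | 8.2% | 263.2 |
| 53×53 | [Banescu 2010] | 7 | 313 | 223 | | 378.9 |
| 53×53 | proposed | 7 | 193 | 137 | 38.6% | 290.2 |
| 53×53 | [Banescu 2010] | 8 | 265 | 145 | | 356.4 |
| 53×53 | proposed | 8 | 25 | 81 | 44.1% | 272.7 |
| 53×53 | [Brunie 2013] | 9 | 162 | 125 | | 195.6 |
| 53×53 | [Banescu 2010] | 9 | 215 | 174 | | 255.8 |
| 53×53 | proposed | 9 | 0 | 72 | 42.4% | 348.8 |
| 64×64 | [Banescu 2010] | 7 | 1504 | 614 | | 245.0 |
| 64×64 | proposed | 7 | 1191 | 430 | 30.0% | 270.5 |
| 64×64 | [Brunie 2013] | 8 | 1188 | 420 | | 194.2 |
| 64×64 | [Banescu 2010] | 8 | 1096 | 449 | | 280.7 |
| 64×64 | proposed | 8 | 652 | 348 | 17.1% | 261.2 |
| 64×64 | [Banescu 2010] | 9 | 864 | 413 | | 262.9 |
| 64×64 | proposed | 9 | 475 | 217 | 47.5% | 249.6 |
| 64×64 | [Banescu 2010] | 10 | 592 | 341 | | 250.7 |
| 64×64 | proposed | 10 | 187 | 179 | 47.5% | 267.7 |
| 64×64 | [Brunie 2013] | 11 | 270 | 196 | | 162.8 |
| 64×64 | [Banescu 2010] | 11 | 592 | 268 | | 225.3 |
| 64×64 | proposed | 11 | 0 | 108 | 44.9% | 265.4 |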

Callout on the 53/64 bit results:

DSP-only solutions with fewer DSPs were found

Page 38: Resource Optimal Design of Large Multipliers for FPGAs

Conclusion

A method was proposed to optimally solve the multiplier tiling problem using ILP.

The method allows DSP resources to be traded against logic resources.

The problem is tractable for practical multiplier sizes.

Combined with carefully selected logic-based multipliers and DSP super-tiles, significant resource reductions could be achieved.

Page 39: Resource Optimal Design of Large Multipliers for FPGAs

Thank You!

References

[de Dinechin 2009] Large Multipliers with Fewer DSP Blocks, FPL 2009

[Kumm 2015] An Efficient Softcore Multiplier Architecture for Xilinx FPGAs, ARITH 2015

[Parandeh-Afshar 2011] Measuring and Reducing the Performance Gap between Embedded and Soft Multipliers on FPGAs, FPL 2011

[Banescu 2010] Multipliers for Floating-Point Double Precision and Beyond on FPGAs, SIGARCH 2010

[Brunie 2013] Arithmetic Core Generation Using Bit Heaps, FPL 2013

Page 41: Resource Optimal Design of Large Multipliers for FPGAs

Resulting LUT Cost

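24×24 (single precision floating point)

| #DSP | 2 | 1 | 0 |
|------|---|---|---|
| LUT cost | 31.2 | 179.95 | 502.8 |
| ΔLUT | – | 148.75 | 322.85 |
| CPU [s] | 22.7 | 129 | 8 |

32×32 (unsigned)

| #DSP | 4 | 3 | 2 | 1 | 0 |
|------|---|---|---|---|---|
| LUT cost | 57.85 | 119.2 | 256.8 | 567.95 | 881.6 |
| ΔLUT | – | 61.35 | 137.6 | 311.15 | 313.65 |
| CPU [s] | 146 | 320 | 187 | 382 | 19 |

53×53 (double precision floating point)

| #DSP | 9 | 8 | 7 | 6 | 5 |
|------|---|---|---|---|---|
| LUT cost | 144.3 | 164.45 | 307 | 450.5 | 759.7 |
| ΔLUT | – | 20.15 | 142.55 | 143.5 | 309.2 |
| CPU [s] | 1433 | 701 | 4331 | 2112 | 27215 |

64×64 (unsigned)

| #DSP | 11 | 10 | 9 | 8 | 7 |
|------|----|----|---|---|---|
| LUT cost | 198.25 | 354.8 | 570.7 | 862.5 | 1192.35 |
| ΔLUT | – | 156.55 | 215.9 | 291.15 | 329.9 |
| CPU [s] | 43031 | 81149 | 21382 | 54001 | TO |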
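The ΔLUT row appears to be the difference between neighbouring LUT costs, i.e. the extra LUT cost paid for each DSP block that is saved. This is an interpretation of the table rather than a stated definition; the helper name below is hypothetical, and the 32×32 row is used as the worked case.

```python
def lut_deltas(lut_costs):
    """Extra LUT cost incurred by each step to the next smaller DSP count."""
    return [round(b - a, 2) for a, b in zip(lut_costs, lut_costs[1:])]


# 32x32 row of the table, ordered from 4 DSPs down to 0 DSPs
LUT_COST_32 = [57.85, 119.2, 256.8, 567.95, 881.6]

print(lut_deltas(LUT_COST_32))
```

Read this way, the row shows directly how expensive it becomes, in LUTs, to give up one more DSP block.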

Page 42: Resource Optimal Design of Large Multipliers for FPGAs

Efficiency Comparison

[Figure: efficiency E (axis from 0.6 to 1.2) plotted against tile area (0 to 70) for the 2×k, 1×1, 1×2, 2×3 and 3×3 tiles.]

Page 43: Resource Optimal Design of Large Multipliers for FPGAs

Problem Shapes Considered

(a) Multi-input addition of 10 numbers with 10 bits each

(b) x3 operation for an input word size of 6 bits

Page 44: Resource Optimal Design of Large Multipliers for FPGAs

DSP-based Tiles

[Figure: block diagram of the Xilinx DSP48E1 block; labels include the 25×18 multiplier, the dual A, D and pre-adder stage, the dual B register and the 17-bit shift paths.]

Page 45: Resource Optimal Design of Large Multipliers for FPGAs

Previous Work

A Baugh-Wooley-like multiplier that can be efficiently mapped to FPGAs was proposed in [Parandeh-Afshar 2011].

Two partial products are generated and added using the carry chain.

A compressor tree for the already reduced partial products is still necessary.

[Figure: four LUTs connected to the carry logic.]

Callout on the figure: full adder

Page 47: Resource Optimal Design of Large Multipliers for FPGAs

Literature

[Walters 2014] Partial-Product Generation and Addition for Multiplication in FPGAs with 6-Input LUTs, ASILOMAR 2014

[Kumm 2015] An Efficient Softcore Multiplier Architecture for Xilinx FPGAs, ARITH 2015

[Walters 2016] Array Multipliers for High Throughput in Xilinx FPGAs with 6-Input LUTs, Computers, MDPI

[Parandeh-Afshar 2011] Measuring and Reducing the Performance Gap between Embedded and Soft Multipliers on FPGAs, FPL 2011

[de Dinechin 2009] Large Multipliers with Fewer DSP Blocks, FPL 2009

[Banescu 2010] Multipliers for Floating-Point Double Precision and Beyond on FPGAs, SIGARCH 2010

[Brunie 2013] Arithmetic Core Generation Using Bit Heaps, FPL 2013