FPGA Multipliers

Bogdan PASCA

projet Arénaire, ENS-Lyon/INRIA/CNRS/Université de Lyon, France

RAIM’11, February 7-10, 2011


Outline

Background & Context

Algorithmic techniques for reducing DSP count of large multipliers
  Karatsuba-Ofman algorithm
  Non-standard tilings
  Squarers
  Truncated multipliers

Conclusions


What’s an FPGA?

Field Programmable Gate Array

integrated circuit

has a regular architecture (hence array)

logic elements can be programmed to perform various functions


Modern FPGA Architecture

a set of configurable logic elements

on chip memory blocks

digital signal processing (DSP) blocks (including multipliers)

connected by a configurable wire network

all connected to outside world by I/O pins

[Figure: FPGA floorplan built up step by step: columns of RAM blocks, columns of DSP blocks, and LUT-based logic elements. The DSP block detail shows two 18-bit multiplier inputs and a 17-bit shift.]

What can we compute?

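[Figure: a 3-bit by 2-bit multiplier built from LUTs. LUTs combine one bit of x with one bit of y to produce the partial-product rows l2 l1 l0 and u2 u1 u0; full adders (FA) then sum the two rows into the product bits p4 ... p0.]

      x2 x1 x0
×        y1 y0
--------------
      l2 l1 l0
+  u2 u1 u0
--------------
p4 p3 p2 p1 p0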

l0 = y0 ∧ x0

l1 = y0 ∧ x1

l2 = y0 ∧ x2

u0 = y1 ∧ x0

u1 = y1 ∧ x1

u2 = y1 ∧ x2
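The partial-product equations above can be checked with a small software model. The sketch below is only an illustration: the function name `lut_multiply_3x2` is hypothetical, and the final addition stands in for the full adders of the figure.

```python
def lut_multiply_3x2(x: int, y: int) -> int:
    """Multiply a 3-bit x by a 2-bit y from AND-ed partial products."""
    assert 0 <= x < 8 and 0 <= y < 4
    xb = [(x >> i) & 1 for i in range(3)]  # x0, x1, x2
    yb = [(y >> i) & 1 for i in range(2)]  # y0, y1

    # One AND per partial-product bit: l_i = y0 AND x_i, u_i = y1 AND x_i
    l = [yb[0] & xb[i] for i in range(3)]
    u = [yb[1] & xb[i] for i in range(3)]

    lower = l[0] | (l[1] << 1) | (l[2] << 2)
    upper = u[0] | (u[1] << 1) | (u[2] << 2)

    # The u row sits one position to the left of the l row.
    # In hardware this sum is done by the full adders.
    return lower + (upper << 1)
```

The result always fits in the five bits p4 ... p0, since the largest product is 7 × 3 = 21.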


The need for DSP blocks

Multiplication in logic is expensive

n × n bit ≈ n^2 (partial products) + n(n − 1) (adder tree) LUTs

18 × 18 bit ≈ 324 LUTs + 306 LUTs = 630 LUTs

1 DSP block = 8 LEs (size on FPGA layout)

DSP blocks are a necessity in modern FPGAs
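The LUT estimate for a logic-only multiplier can be evaluated for any operand width. This is a minimal sketch of the rule of thumb given on the slide; the helper name `lut_cost` is hypothetical.

```python
def lut_cost(n: int):
    """Rough LUT count of an n x n-bit multiplier built in logic only.

    Returns (partial-product LUTs, adder-tree LUTs, total), following the
    slide's estimate of n^2 + n(n - 1).
    """
    partial_products = n * n
    adder_tree = n * (n - 1)
    return partial_products, adder_tree, partial_products + adder_tree


print(lut_cost(18))  # the 18 x 18-bit case quoted on the slide
```

With 1 DSP block occupying the area of about 8 LEs, the 630 LUTs of the 18 × 18-bit case show why a hard multiplier is far cheaper than its emulation in logic.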

[Figure: DSP block with 18-bit multiplier inputs A and B, 48-bit input C and 48-bit output P, with 17-bit shifts available in front of the adder.]


DSP-Hungry Applications

FPGA floating point performance – a pencil and paper evaluation [1]
→ DSP blocks are a scarce resource for accelerating double-precision (DP) applications.

Efficient reconfigurable design for pricing Asian options [2]
→ LUTs 46%, RAM 4%, DSP 100% (192)

Implementation and evaluation of an arithmetic pipeline on FLOPS-2D: multi-FPGA system [3]
→ a) LE 30%, DSP 86%, b) LE 52%, DSP 88%, c) LE 63%, DSP 100%

A temporal coding hardware implementation for spiking neural networks [4]
→ 16 PE: LE 22%, RAM 3%, DSP 74% (100/136)

Four recipes for saving DSPs

[1] D. Strenski (HPCWire, 2007)
[2] Anson H.T. Tse, David B. Thomas, K. H. Tsoi, Wayne Luk (HEART’10)
[3] H. Morisita, K. Inakagata, Y. Osana, N. Fujita, H. Amano (HEART’10)
[4] Marco Nuno-Maganda, Cesar Torres-Huitzil (HEART’10)


Perceiving Multiplications Visually

classical binary multiplication

all sub-products can be properly located inside the diamond

rotate the diamond to obtain a rectangle

[Figure: the partial-product diamond of a 6 × 6-bit multiplication XY, rotated into a rectangle with the bits of X on one axis and the bits of Y on the other (ticks at 0, 3 and 5). Example sub-products shown as rectangles: X_{2:0} Y_{2:0}, X_{5:3} Y_{5:3} and X_{3:1} Y_{4:3}. Splitting both operands into 3-bit chunks X_1, X_0 and Y_1, Y_0 gives four tiles.]

XY = X_0 Y_0 + 2^3 X_0 Y_1 + 2^3 X_1 Y_0 + 2^{3+3} X_1 Y_1
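The four-tile view of the 6 × 6-bit product can be reproduced numerically. The sketch below assumes 3-bit chunks as on the slide; the function name `four_tile_product` is hypothetical.

```python
def four_tile_product(x: int, y: int, k: int = 3) -> int:
    """Product of two 2k-bit numbers as the sum of four k x k-bit tiles."""
    mask = (1 << k) - 1
    x0, x1 = x & mask, x >> k  # low and high chunk of X
    y0, y1 = y & mask, y >> k  # low and high chunk of Y

    # Each tile is weighted by 2^(offset in X + offset in Y).
    return (
        x0 * y0
        + ((x0 * y1) << k)
        + ((x1 * y0) << k)
        + ((x1 * y1) << (2 * k))
    )
```

The position of a tile inside the rectangle fixes its weight, which is what later slides exploit when the tiles are no longer square.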


Karatsuba-Ofman algorithm

trading multiplications for additions


The Karatsuba-Ofman algorithm

Basic principle for two way splitting

split X and Y into two chunks:

X = 2^k X_1 + X_0 and Y = 2^k Y_1 + Y_0

computation goal: XY = 2^{2k} X_1 Y_1 + 2^k (X_1 Y_0 + X_0 Y_1) + X_0 Y_0

precompute D_X = X_1 − X_0 and D_Y = Y_1 − Y_0

make the observation: X_1 Y_0 + X_0 Y_1 = X_1 Y_1 + X_0 Y_0 − D_X D_Y

XY requires only 3 DSP blocks (X_1 Y_1, X_0 Y_0, D_X D_Y)

overhead: two k-bit and one 2k-bit subtraction

overhead ≪ DSP-block emulation
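The two-way splitting can be sketched in a few lines of Python. This is an illustrative software model, not the hardware architecture: the function name `karatsuba_2way` is hypothetical, and the default k = 17 is an assumption chosen to match the unsigned width of an 18-bit signed multiplier input.

```python
def karatsuba_2way(x: int, y: int, k: int = 17) -> int:
    """Multiply two 2k-bit unsigned numbers with three sub-products."""
    mask = (1 << k) - 1
    x0, x1 = x & mask, x >> k
    y0, y1 = y & mask, y >> k

    # Pre-subtractions (signed results, one bit wider than the chunks).
    dx = x1 - x0
    dy = y1 - y0

    # The only three multiplications: one DSP block each.
    p11 = x1 * y1
    p00 = x0 * y0
    pdd = dx * dy

    # X1*Y0 + X0*Y1 recovered without a fourth multiplication.
    middle = p11 + p00 - pdd

    return (p11 << (2 * k)) + (middle << k) + p00
```

The differences D_X and D_Y are signed, which is why a signed 18 × 18-bit multiplier can absorb them when the chunks are 17 bits wide.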


Visual Interpretation

[Figure: the Karatsuba-Ofman algorithm drawn on the multiplication rectangle, built up step by step for 2-way splitting (X_0, X_1 by Y_0, Y_1), 3-way splitting (X_0 to X_2 by Y_0 to Y_2) and 4-way splitting (X_0 to X_3 by Y_0 to Y_3).]


Implementation

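fairly straightforward starting from the equation:

XY = 2^{2k} X_1 Y_1 + 2^k (X_1 Y_1 + X_0 Y_0 − D_X D_Y) + X_0 Y_0

[Figure: 34x34-bit multiplier using Virtex-4 DSP48 blocks. The chunks X_0, X_1, Y_0 and Y_1 feed a chain of DSP48 blocks (input widths of 17 and 18 bits are shown) whose adders accumulate X_0 Y_0, then X_0 Y_0 − D_X D_Y, then X_1 Y_1 + X_0 Y_0 − D_X D_Y; the product P is assembled from these results.]

X_1 Y_1 + X_0 Y_0 − D_X D_Y is computed inside the DSP blocks

X_1 Y_1 must then be recovered with a subtraction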


Results

           latency   frequency (MHz)   slices [5]   DSPs
LogiCore      6           447              26         4
LogiCore      3           176              34         4
K-O-2         3           317              95         3

Table: 34x34-bit multipliers on Virtex-4

trade-off: 1 DSP (> 630 Logic Elements) for 138 Logic Elements

           latency   frequency (MHz)   slices [5]   DSPs
LogiCore     11           353             185         9
LogiCore      6           264             122         9
K-O-3         6           317             331         6

Table: 51x51-bit multipliers on Virtex-4

trade-off: 3 DSPs (> 1890 Logic Elements) for 292 Logic Elements

[5] On Virtex-4 devices, 1 slice = 2 Logic Elements
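The quoted trade-offs of 138 and 292 Logic Elements are consistent with the slice counts in the two tables. The sketch below shows the arithmetic under the assumption (not stated on the slide) that each Karatsuba-Ofman design is compared with the highest-frequency LogiCore row; the helper name is hypothetical.

```python
LES_PER_SLICE = 2  # footnote: on Virtex-4, 1 slice = 2 Logic Elements


def extra_logic_elements(slices_ko: int, slices_logicore: int) -> int:
    """Logic Elements paid by the Karatsuba-Ofman design over LogiCore."""
    return (slices_ko - slices_logicore) * LES_PER_SLICE


# 34x34-bit: K-O-2 (95 slices) against LogiCore with latency 6 (26 slices)
print(extra_logic_elements(95, 26))
# 51x51-bit: K-O-3 (331 slices) against LogiCore with latency 11 (185 slices)
print(extra_logic_elements(331, 185))
```

Both costs are well below the 630 LUTs that emulating a single 18 × 18-bit DSP multiplier in logic would require.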


Non-Standard tilings

new multiplication algorithms


Non-standard tilings

optimize the use of rectangular multipliers on Virtex-5 (25x18-bit signed)

classical decomposition may produce suboptimal results

chunk size for X is 24 bits
chunk size for Y is 17 bits
(unsigned chunks: one bit of each signed multiplier input is taken by the sign)

translate the operand decomposition into a tiling problem

[Figure: the 6 × 6-bit multiplication rectangle (bits of X on one axis, bits of Y on the other) covered by rectangular sub-products such as X_{2:0} Y_{2:0}, X_{5:3} Y_{5:3} and X_{3:1} Y_{4:3}. A tile whose corner sits at bit 1 of X and bit 3 of Y carries the weight 2^{3+1}.]

XY = 2^{3+1} X_{3:1} Y_{4:3}
   + 2^{1+5} X_{3:1} Y_5
   + 2^3 X_0 Y_{5:3}
   + X_{3:0} Y_{2:0}
   + 2^4 X_{5:4} Y_{5:0}
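A tiling is valid when its rectangles cover every partial-product bit exactly once, and the product is then the weighted sum of the tile products. The sketch below checks both properties for the five-tile 6 × 6-bit example; the helper names and the tuple layout `(x_hi, x_lo, y_hi, y_lo)` are hypothetical choices made for this illustration.

```python
def bits(v: int, hi: int, lo: int) -> int:
    """Bit field v[hi:lo], both ends inclusive."""
    return (v >> lo) & ((1 << (hi - lo + 1)) - 1)


# Tiles of the 6 x 6-bit example as (x_hi, x_lo, y_hi, y_lo).
TILES_6X6 = [
    (3, 1, 4, 3),  # 2^(3+1) X_{3:1} Y_{4:3}
    (3, 1, 5, 5),  # 2^(1+5) X_{3:1} Y_5
    (0, 0, 5, 3),  # 2^3     X_0     Y_{5:3}
    (3, 0, 2, 0),  #         X_{3:0} Y_{2:0}
    (5, 4, 5, 0),  # 2^4     X_{5:4} Y_{5:0}
]


def covers_exactly(tiles, n: int) -> bool:
    """True if the tiles cover the n x n rectangle with no gap or overlap."""
    seen = set()
    for x_hi, x_lo, y_hi, y_lo in tiles:
        for i in range(x_lo, x_hi + 1):
            for j in range(y_lo, y_hi + 1):
                if (i, j) in seen:
                    return False
                seen.add((i, j))
    return len(seen) == n * n


def tiled_product(x: int, y: int, tiles) -> int:
    """Sum of tile products, each weighted by 2^(x_lo + y_lo)."""
    return sum(
        (bits(x, x_hi, x_lo) * bits(y, y_hi, y_lo)) << (x_lo + y_lo)
        for x_hi, x_lo, y_hi, y_lo in tiles
    )
```

The same two checks apply to any candidate tiling, which makes it easy to experiment with tile shapes that match a given DSP block.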


Tilings

Performing a 53 × 53-bit multiplication on Virtex-5

[Figure: three tilings of the multiplication rectangle: (a) standard tiling, (b) LogiCore tiling, (c) proposed tiling, with tiles M1 to M8 placed on a 58 × 58 board with boundaries at 0, 17, 24, 34, 41 and 58.]

standard tiling ≡ classical decomposition (12 DSPs)

LogiCore 11.1 tiling uses 10 DSPs (4 DSPs used as 17x17-bit)

the proposed tiling needs 8 DSPs and a few LUTs
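The 12 DSPs of the classical decomposition follow directly from the chunk sizes. A minimal sketch, assuming 24-bit chunks on X and 17-bit chunks on Y as stated earlier in the talk (the function name is hypothetical):

```python
def classical_dsp_count(width_x: int, width_y: int,
                        chunk_x: int = 24, chunk_y: int = 17) -> int:
    """DSP blocks used when each operand is cut into fixed-size chunks."""
    chunks_x = -(-width_x // chunk_x)  # ceiling division
    chunks_y = -(-width_y // chunk_y)
    return chunks_x * chunks_y


print(classical_dsp_count(53, 53))  # 3 chunks of X times 4 chunks of Y
```

A 53-bit operand needs three 24-bit chunks or four 17-bit chunks, so part of the 72 × 68-bit area offered by the 12 blocks is wasted; the non-standard tilings recover that waste.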


Tiling Architecture - 53x53bit

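[Figure: the proposed tiling (c): tiles M1 to M8 arranged around a central 10 × 10 tile, with boundaries at 0, 17, 24, 34, 41 and 58.]

XY =          X_{0:23}  Y_{0:16}        (M1)
   + 2^{17} ( X_{0:23}  Y_{17:33}       (M2)
   + 2^{17} ( X_{0:16}  Y_{34:57}       (M3)
   + 2^{17}   X_{17:33} Y_{34:57} ))    (M4)
   + 2^{24} ( X_{24:40} Y_{0:23}        (M8)
   + 2^{17} ( X_{41:57} Y_{0:23}        (M7)
   + 2^{17} ( X_{34:57} Y_{24:40}       (M6)
   + 2^{17}   X_{34:57} Y_{41:57} )))   (M5)
   + 2^{48}   X_{24:33} Y_{24:33}

X_{24:33} Y_{24:33} (a 10x10-bit multiplier) is probably best implemented in LUTs.

the parenthesization makes the best use of the DSP48E internal adders (17-bit shifts)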
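The nested form of the equation can be evaluated as two chains in which every step shifts the running sum by 17 bits, mirroring the cascade of internal adders. The sketch below is a software model under that reading; the names `field` and `tiled_58x58` are hypothetical, and bit ranges follow the low:high order used in the equation.

```python
def field(v: int, lo: int, hi: int) -> int:
    """Bits lo..hi of v, inclusive (the X_{lo:hi} notation of the slide)."""
    return (v >> lo) & ((1 << (hi - lo + 1)) - 1)


def tiled_58x58(x: int, y: int) -> int:
    """58 x 58-bit product from eight 24x17-bit tiles plus one 10x10 tile."""
    m1 = field(x, 0, 23) * field(y, 0, 16)
    m2 = field(x, 0, 23) * field(y, 17, 33)
    m3 = field(x, 0, 16) * field(y, 34, 57)
    m4 = field(x, 17, 33) * field(y, 34, 57)
    m5 = field(x, 34, 57) * field(y, 41, 57)
    m6 = field(x, 34, 57) * field(y, 24, 40)
    m7 = field(x, 41, 57) * field(y, 0, 23)
    m8 = field(x, 24, 40) * field(y, 0, 23)
    centre = field(x, 24, 33) * field(y, 24, 33)  # 10x10 bits, LUT-based

    # Two chains of four tiles; each step is a 17-bit shift and an add.
    chain_a = m1 + ((m2 + ((m3 + (m4 << 17)) << 17)) << 17)
    chain_b = m8 + ((m7 + ((m6 + (m5 << 17)) << 17)) << 17)

    return chain_a + (chain_b << 24) + (centre << 48)
```

Every tile is 24 bits by 17 bits in one orientation or the other, so each fits one 25x18-bit signed multiplier used on unsigned chunks.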


Tiling Results

58x58 multipliers on Virtex-5 (5vlx50ff676-3) [6]

           latency   Freq. (MHz)   REGs   LUTs   DSPs
LogiCore     14         440         300    249    10
LogiCore      8         338         208    133    10
LogiCore      4          95         208     17    10
Tiling        4         366         247    388     8

Remarks

save 2 DSP48E for a few LUTs/REGs

huge latency saving at a comparable frequency

good use of the internal adders thanks to the 17-bit shifts

[6] Results for 53 bits are almost identical


Squarers

simple methods to save resources


Squarers

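appear in norms, statistical computations, polynomial evaluation...

a dedicated squarer saves as many DSP blocks as the Karatsuba-Ofman algorithm, but without its overhead∗.

Squaring with k = 17 on a Virtex-4

[Figure: 2 × 2 tiling of X^2 with the diagonal tiles X_0^2 and X_1^2 and the off-diagonal tile X_0 X_1 appearing twice.]

(2^k X_1 + X_0)^2 = 2^{2k} X_1^2 + 2 · 2^k X_1 X_0 + X_0^2

[Figure: 3 × 3 tiling of X^2 with the diagonal tiles X_0^2, X_1^2, X_2^2 and the off-diagonal tiles X_0 X_1, X_0 X_2, X_1 X_2 each appearing twice.]

(2^{2k} X_2 + 2^k X_1 + X_0)^2 = 2^{4k} X_2^2 + 2^{2k} X_1^2 + X_0^2
                               + 2 · 2^{3k} X_2 X_1
                               + 2 · 2^{2k} X_2 X_0
                               + 2 · 2^k X_1 X_0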
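The saving comes from computing each off-diagonal tile once and doubling it with a shift. The sketch below counts the multiplications this takes; it is an illustration only, and the function name `square_by_chunks` is hypothetical.

```python
def square_by_chunks(x: int, k: int = 17, chunks: int = 2):
    """Square a (chunks * k)-bit number, returning (square, multiplications)."""
    assert 0 <= x < (1 << (chunks * k))
    mask = (1 << k) - 1
    parts = [(x >> (i * k)) & mask for i in range(chunks)]

    total = 0
    mults = 0
    for i in range(chunks):
        # Diagonal tile X_i^2, weight 2^(2ik).
        total += (parts[i] * parts[i]) << (2 * i * k)
        mults += 1
        for j in range(i):
            # Off-diagonal tile X_i X_j occurs twice: multiply once,
            # then double through one extra bit of shift.
            total += (parts[i] * parts[j]) << ((i + j) * k + 1)
            mults += 1
    return total, mults
```

Two chunks need 3 multiplications instead of 4 and three chunks need 6 instead of 9, the same counts as the K-O-2 and K-O-3 multipliers, without the pre-subtractions.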


However ...

$(2^k X_1 + X_0)^2 = 2^{34} X_1^2 + 2^{18} X_1 X_0 + X_0^2$ (with k = 17)

shifts of 0, 18, 34 in the previous equation

the DSP48 blocks of the Virtex-4 allow shifts of 17, so the internal adders go unused

Workaround for ≤ 33-bit multiplications

rewrite equation:

$(2^{17} X_1 + X_0)^2 = 2^{34} X_1^2 + 2^{17} (2 X_1) X_0 + X_0^2$

compute $2 X_1$ by shifting $X_1$ by one bit before inputting it into the DSP48 block
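The rewritten equation can be sketched in Python as below. The function name `square_33bit` is hypothetical, and plain integer products stand in for DSP48 multiplications. The sketch also reflects one plausible reading of the ≤ 33-bit limit, stated here as an assumption: X1 must be at most 16 bits wide so that the doubled value 2·X1 still fits a 17-bit unsigned multiplier input.

```python
def square_33bit(x):
    """Square x (< 2**33) with all products aligned on multiples of 17 bits."""
    assert 0 <= x < (1 << 33)
    x0 = x & ((1 << 17) - 1)  # low 17 bits
    x1 = x >> 17              # high part, at most 16 bits
    x1_doubled = x1 << 1      # 2*X1, formed by a one-bit shift before the multiplier
    assert x1_doubled < (1 << 17)  # still a valid 17-bit operand (assumed limit)

    p0 = x0 * x0          # weight 2**0
    p1 = x1_doubled * x0  # weight 2**17 (instead of X1*X0 at weight 2**18)
    p2 = x1 * x1          # weight 2**34
    return p0 + (p1 << 17) + (p2 << 34)
```

All three products now sit at shifts of 0, 17 and 34, so each step of the summation is a 17-bit shift of the previous one.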


Results – 32-bit and 53-bit squarers on Virtex-4

bits            latency  frequency  slices  DSPs
32   LogiCore      6        489        59     4
32   LogiCore      3        176        34     4
32   Squarer       3        317        18     3
53   LogiCore     18        380       279    16
53   LogiCore      7        176       207    16
53   Squarer       7        317       332     6

DSPs saved without any overhead

impressive 10 DSPs saved for double precision squarer


Squarers on Virtex5 using tilings

the tiling technique can be extended to squaring

[Figure: two tilings of the 53-bit squarer board, one using tiles M1 to M6 and one using tiles M1 to M5.]

Issues

darker squares are computed twice and thus need to be removed.

thanks to symmetry, a diagonal multiplication of size n should consume only n(n + 1)/2 LUTs instead of $n^2$.
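The n(n + 1)/2 figure can be illustrated at the bit level. The sketch below is an assumption-laden model rather than a LUT mapping: it counts single-bit products as a proxy for logic cost, and the name `square_bits` is hypothetical. It uses the facts that x_i · x_i = x_i and that each off-diagonal pair x_i x_j appears twice in the full array.

```python
def square_bits(x, n):
    """Square an n-bit x from single-bit products, exploiting symmetry.

    Returns (result, number_of_bit_products).
    """
    bits = [(x >> i) & 1 for i in range(n)]
    total, count = 0, 0
    for i in range(n):
        # Diagonal: x_i * x_i = x_i, weight 2^(2i).
        total += bits[i] << (2 * i)
        count += 1
        for j in range(i + 1, n):
            # Off-diagonal pair counted once, doubled via the extra shift.
            total += (bits[i] & bits[j]) << (i + j + 1)
            count += 1
    return total, count
```

For n = 17 this uses 153 bit products, that is n(n + 1)/2, where the full array would need 289.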


Truncated multipliers


Truncated multipliers

Classical technique

reduce resources, delay, or power consumption

controlled accuracy degradation

[Figure: partial-product array of A × B and its sum, with labels u, v, k, d and n − k marking the column ranges.]

remove some of the least-significant d columns

keep the error smaller than $2^k$
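As a small illustration of column truncation, the following Python sketch sums only the partial-product bits that fall in columns d and above. The names `truncated_product` and `worst_case_error` are hypothetical, and unsigned n-bit operands are assumed. Column c (counting from 0) holds at most c + 1 bits, which gives the worst-case discarded value used in the second function.

```python
def truncated_product(a, b, n, d):
    """Sum the partial-product bits of a*b, skipping the d lowest columns."""
    total = 0
    for i in range(n):
        if not (a >> i) & 1:
            continue
        for j in range(n):
            # Bit a_i * b_j lives in column i + j; keep it only if i + j >= d.
            if (b >> j) & 1 and i + j >= d:
                total += 1 << (i + j)
    return total


def worst_case_error(d):
    """Largest value the d discarded columns can hold (valid for d <= n)."""
    return sum(i << (i - 1) for i in range(1, d + 1))
```

The difference `a * b - truncated_product(a, b, n, d)` is never negative and never exceeds `worst_case_error(d)`; the bound is reached when both operands are all ones.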


Error budget

[Figure: partial-product array of A × B and its sum, with labels u, v, k, d and n − k marking the column ranges.]

$E_{total} = E_{approx} + E_{round} \le 2^k$

$E_{round}$ – caused by rounding the (n − d)-bit result to n − k bits
  use compensation bit to center the error
  round to nearest bounds $E_{round} \le 2^{k-1}$

$E_{approx}$ – caused by the truncation of the d columns
  $0 \le E_{approx} \le \sum_{i=1}^{d} i \, 2^{i-1}$
  $E_{approx} < 2^{k-1} \rightarrow d = f(k)$

Precision   k     Discarded (d)
Single      23    18
Double      52    46
Quadruple   112   105
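The slide leaves f(k) implicit. One hedged reading, which happens to reproduce the three table rows, is sketched below: assume a compensation constant centers the truncation error so that its magnitude is half the worst-case column sum, and pick the largest d for which that half stays below $2^{k-1}$, that is, the full sum stays below $2^k$. The centering assumption and the function names are not stated on the slide.

```python
def worst_case_error(d):
    """Worst-case value of the d discarded columns: sum of i * 2^(i-1)."""
    return sum(i << (i - 1) for i in range(1, d + 1))


def max_discarded_columns(k):
    """Largest d whose centered truncation error stays below 2^(k-1).

    Assumes the error is centered, so the test is worst_case_error(d) < 2^k.
    """
    d = 0
    while worst_case_error(d + 1) < (1 << k):
        d += 1
    return d


if __name__ == "__main__":
    for name, k in [("Single", 23), ("Double", 52), ("Quadruple", 112)]:
        print(name, k, max_discarded_columns(k))
```

The sum has the closed form $(d - 1) 2^d + 1$, which makes the search cheap to check by hand for the three precisions.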


Tiling the truncated board

[Figure: three tilings of the truncated multiplier board using tiles M1 to M4, with the k and d column boundaries marked.]

Sol 1: tile and discard columns (save additions)

waste DSPs

Sol 2: use softcore multiplier (trade a DSP for logic)

Best: tile with softcore multipliers so that $E_{approx} \le 2^{k-1}$

use the extra precision for free


Reality Check – faithfully rounding

Mantissa multipliers for SP, DP, QP, Virtex4 (left) and Virtex5 (right)

FPGA     Prec.  Latency, Freq.        Resources
Virtex5  DP     6 cycles @ 414MHz     320LUT 302REG 5DSP
         QP     20 cycles @ 334MHz    2497LUT 2321REG 19DSP
         QP     14 cycles @ 245MHz    2249LUT 1576REG 19DSP
Virtex4  DP     11 cycles @ 368MHz    358sl. 7DSP
         QP     21 cycles @ 368MHz    1735sl. 26DSP

Virtex4
  DP: reduce DSPs from 10 to 7 while also reducing slice count
  QP: reduce DSPs from 49 to 26 without any slice penalty

Virtex5
  DP: reduce DSPs from 6 to 5, with roughly half the LUTs and REGs
  QP: reduce DSPs from 34 to 19 at a small increase in logic resources


Another point of view

(wE, wF) correctly rounded has the same accuracy as (wE, wF + 1) faithfully rounded
→ in FPGAs the extra bit comes for free∗

truncate multipliers when IEEE-754 compliance is not needed

function approximation by polynomial evaluation

log2(1 + x) (53-bit)

default                                    27 DSPs
optimized Horner                           23 DSPs
optimized Horner + truncated multipliers   11* DSPs


Conclusion

save DSPs by exploiting the flexibility of the FPGA

Karatsuba-Ofman reduces DSP cost at small price in logic elements

tiling techniques adapt better to asymmetric DSPs

dedicated squarers significantly reduce DSP count

control accuracy and save DSPs using truncated multipliers


Thank you for your attention !

http://flopoco.gforge.inria.fr/

Questions ?
