Ernest Jamro, Dept. of Electronics, AGH, Kraków

Hardware Implementation of Algorithms

Multipliers, convolvers

Multiplication

        1 0 0 1
x       1 0 1 1
        1 0 0 1
      1 0 0 1
    0 0 0 0
+ 1 0 0 1
  1 1 0 0 0 1 1

9 x 11 = 99
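The long multiplication above maps directly onto shifts and adds. A minimal Python sketch, assuming unsigned operands (the function name `shift_add_multiply` is illustrative and not taken from the slides):

```python
def shift_add_multiply(a, b):
    """Unsigned multiplication: add a shifted copy of a for every 1 bit of b."""
    product = 0
    shift = 0
    while b:
        if b & 1:                  # current multiplier bit is 1
            product += a << shift  # add the shifted partial product
        b >>= 1
        shift += 1
    return product


print(shift_add_multiply(0b1001, 0b1011))  # 9 x 11
```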

Parallel Array Multipliers

[Figure: 4 x 4 parallel array multiplier built from AND gates (&) and adder cells (& +). Inputs: a0..a3, b0..b3. Outputs: p0..p7.]

Each adder cell has inputs ai, bj, sl-1, ck-1 and outputs sl, ck:

$2c_k + s_l = s_{l-1} + (a_i \wedge b_j) + c_{k-1}$
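The cell equation can be checked with a small behavioural model. This is a sketch under the assumption of a ripple-carry arrangement, where each row adds one partial product to the running sum; the names `cell` and `array_multiply` are illustrative, and the real array wiring may differ.

```python
def cell(a_i, b_j, s_in, c_in):
    """One array cell: 2*c_out + s_out = s_in + (a_i AND b_j) + c_in."""
    total = s_in + (a_i & b_j) + c_in
    return total & 1, total >> 1  # (s_out, c_out)


def array_multiply(a, b, n=4):
    """Unsigned n x n multiplication built only from cell() calls."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> j) & 1 for j in range(n)]
    acc = [0] * (2 * n)            # running sum bits, LSB first
    for j in range(n):             # one row of cells per multiplier bit
        carry = 0
        for i in range(n):
            acc[i + j], carry = cell(a_bits[i], b_bits[j], acc[i + j], carry)
        acc[j + n] = carry         # carry out of the row
    return sum(bit << k for k, bit in enumerate(acc))
```

For a 4 x 4 multiplier this uses 16 cell evaluations and yields the 8 product bits p0..p7.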


FPGA, Built-in multiplier DSP48

Sequential Multiplier

[Figure: sequential multiplier. Labelled blocks: register, PISO (parallel-in serial-out) register, adder built from full adders (FA) and a half adder (HA), shift register built from flip-flops (FF). Inputs: A3..A0, B3..B0. Partial result bits: Pn,0..Pn,3.]
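A cycle-by-cycle sketch of a sequential multiplier, assuming that B leaves the PISO register one bit per clock (LSB first) and that the accumulator is shifted right after each addition. The exact register arrangement of the figure is not recoverable, so this is one common variant rather than the circuit on the slide.

```python
def sequential_multiply(a, b, n=4):
    """Shift-right sequential multiplier: one adder, n clock cycles."""
    acc = 0       # accumulator / shift register
    piso = b      # parallel-in serial-out register holding B
    for _ in range(n):
        b_bit = piso & 1           # bit shifted out of the PISO register
        piso >>= 1
        if b_bit:
            acc += a << n          # add A to the upper half of the accumulator
        acc >>= 1                  # shift the partial result right by one
    return acc
```

Only one adder is needed, at the price of n clock cycles per product.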

Wallace Tree Multiplier (with Carry Save Adders)

Carry save adders (CSA) are not recommended in FPGAs.
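For reference, a carry save adder reduces three operands to two without propagating carries. The sketch below assumes unsigned operands and reduces the partial products three at a time; a real Wallace tree performs these reductions in parallel levels, which this sequential loop does not model. Function names are illustrative.

```python
def carry_save_add(x, y, z):
    """3:2 compressor: returns (s, c) with s + c == x + y + z."""
    s = x ^ y ^ z                            # bitwise sum without carries
    c = ((x & y) | (x & z) | (y & z)) << 1   # carries, moved one position left
    return s, c


def csa_multiply(a, b, n=4):
    """Reduce the partial products with CSAs, then do one final addition."""
    terms = [(a << j) if (b >> j) & 1 else 0 for j in range(n)]
    while len(terms) > 2:
        s, c = carry_save_add(*terms[:3])
        terms = terms[3:] + [s, c]
    return sum(terms)                        # final carry-propagate addition
```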

Multiplication of Signed Numbers

Sign-magnitude representation

Standard unsigned multiplication of the magnitudes

Sign = Sign1 xor Sign2

Two's complement representation

$A = -a_{N-1} 2^{N-1} + \sum_{i=0}^{N-2} a_i 2^i$

C. R. Baugh and B. A. Wooley, “A two’s complement parallel array multiplication algorithm,” IEEE Trans. Comput., vol. C-22, pp. 1045–1047, Dec. 1973.

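$A \cdot B = a_{N-1} b_{N-1} 2^{2N-2} + \sum_{i=0}^{N-2} \sum_{j=0}^{N-2} a_i b_j 2^{i+j} - 2^{N-1} \sum_{i=0}^{N-2} a_i b_{N-1} 2^i - 2^{N-1} \sum_{i=0}^{N-2} a_{N-1} b_i 2^i$

The two negative sums are replaced by sums of complemented partial products plus a constant:

$-2^{N-1} \sum_{i=0}^{N-2} a_i b_{N-1} 2^i - 2^{N-1} \sum_{i=0}^{N-2} a_{N-1} b_i 2^i = 2^{N-1} \sum_{i=0}^{N-2} \overline{a_i b_{N-1}} 2^i + 2^{N-1} \sum_{i=0}^{N-2} \overline{a_{N-1} b_i} 2^i + 2^N - 2^{2N-1}$

$A \cdot B = a_{N-1} b_{N-1} 2^{2N-2} + \sum_{i=0}^{N-2} \sum_{j=0}^{N-2} a_i b_j 2^{i+j} + 2^{N-1} \sum_{i=0}^{N-2} \overline{a_i b_{N-1}} 2^i + 2^{N-1} \sum_{i=0}^{N-2} \overline{a_{N-1} b_i} 2^i + 2^N - 2^{2N-1}$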

(a1+a2)*(b1+b2)= a1b1+ a1b2+a2b1+a2b2
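The Baugh-Wooley form of the two's complement product can be checked numerically. The sketch below assumes the commonly used variant in which the partial products that contain exactly one sign bit are complemented (NAND) and the constant $2^N - 2^{2N-1}$ is added; the function name and the default N = 4 are illustrative.

```python
def baugh_wooley(a, b, n=4):
    """Two's complement product of n-bit a and b from (N)AND partial products."""
    def bit(x, i):
        return (x >> i) & 1   # Python ints behave as two's complement here

    result = (bit(a, n - 1) & bit(b, n - 1)) << (2 * n - 2)  # sign x sign
    for i in range(n - 1):
        for j in range(n - 1):
            result += (bit(a, i) & bit(b, j)) << (i + j)     # AND terms
    for i in range(n - 1):
        # complemented (NAND) terms that involve exactly one sign bit
        result += (1 - (bit(a, i) & bit(b, n - 1))) << (n - 1 + i)
        result += (1 - (bit(a, n - 1) & bit(b, i))) << (n - 1 + i)
    result += (1 << n) - (1 << (2 * n - 1))                  # constant term
    return result
```

Every partial product is now added with a positive weight, so the same adder array as in the unsigned case can be reused.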

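Two's Complement Multiplication

[Figure: 4 x 4 two's complement array multiplier. Same array as in the unsigned case, but six of the cells use NAND gates (!&) instead of AND gates (&), and a constant 1 is added. Inputs: a0..a3, b0..b3. Outputs: p0..p7.]

Each adder cell has inputs ai, bj, sl-1, ck-1 and outputs sl, ck:

$s_{l-1} + (a_i \wedge b_j) + c_{k-1} = 2c_k + s_l$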

Reduced-Width Multiplier

[Figure: 4 x 4 array multiplier with a truncation line drawn across the product bits. Inputs: a0..a3, b0..b3. Outputs: p0..p7.]

Each adder cell has inputs ai, bj, sl-1, ck-1 and outputs sl, ck:

$s_{l-1} + (a_i \wedge b_j) + c_{k-1} = 2c_k + s_l$

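Truncation Error Compensation

[Figure: truncated array multiplier in which the cells of the least significant columns are removed. Labelled inputs: a0..a3, b0..b2. Outputs: p3..p7.]

Each adder cell has inputs ai, bj, sl-1, ck-1 and outputs sl, ck:

$s_{l-1} + (a_i \wedge b_j) + c_{k-1} = 2c_k + s_l$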
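The effect of truncation and of a compensation term can be explored with a small model. The sketch below is an assumption-laden illustration, not the scheme from the figure: partial products in columns below `first_column` are never computed, the result is cut at `out_column`, and the compensation is a fixed constant added before the cut. All parameter names are hypothetical.

```python
def truncated_multiply(a, b, n=4, first_column=2, out_column=3, compensation=0):
    """Sum only the partial products in columns >= first_column, then truncate."""
    total = compensation
    for i in range(n):
        for j in range(n):
            if i + j >= first_column:
                total += (((a >> i) & 1) & ((b >> j) & 1)) << (i + j)
    return (total >> out_column) << out_column   # keep bits p[out_column] and up


def mean_error(compensation, n=4):
    """Average of (exact - truncated) over all unsigned n-bit operand pairs."""
    errors = [a * b - truncated_multiply(a, b, n, compensation=compensation)
              for a in range(1 << n) for b in range(1 << n)]
    return sum(errors) / len(errors)
```

Comparing `mean_error(0)` with `mean_error(4)` (a constant 1 added in column 2) shows how a fixed constant moves the average error towards zero.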

Constant Coefficient Multiplier

Look-Up Table (LUT): the input is the LUT address, the product is the LUT data.

Example: Y = 5*X

Address  Data
0        0
1        5
2        10
3        15
...

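LUT-Based Multiplier, Constant Coefficient C

$Y = C \cdot A = C \cdot A(0{:}3) + 2^4 \cdot C \cdot A(4{:}7)$

[Figure: the 8-bit input is split into two 4-bit parts that address LUT A and LUT B. Each LUT has a 12-bit output; an adder combines them into the 16-bit output.]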
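The split into two 4-bit look-ups can be sketched as follows, assuming an 8-bit unsigned input and an unsigned constant; `build_lut` and `lut_multiply` are illustrative names. Both LUTs hold the same contents, so one table is reused here.

```python
def build_lut(c, addr_bits=4):
    """LUT contents for constant c: data[address] = c * address."""
    return [c * address for address in range(1 << addr_bits)]


def lut_multiply(lut, a):
    """Y = C*A(0:3) + 2^4 * C*A(4:7) for an 8-bit input a."""
    low = lut[a & 0xF]           # product for the low nibble
    high = lut[(a >> 4) & 0xF]   # product for the high nibble
    return low + (high << 4)


lut5 = build_lut(5)
print(lut5[:4])                  # first entries of the Y = 5*X table
print(lut_multiply(lut5, 200))
```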

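Different ROM Sizes (input data width = 6 bits)

[Figure: three ways to build the multiplier from ROMs, each with a 6-bit input (in) and an adder driving the output (out).
a) two Mem16x1 blocks, input split into 2 + 4 bits
b) one Mem16x1 block addressed by 4 bits
c) one Mem32x1 block addressed by 5 bits]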

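Heterogeneous Memory Usage

Virtex memory configurations: 16x1, 32x1, 4kx1, 2kx2, 1kx4, 512x8, 256x16

Input data and coefficient width = 14

[Figure: 14-bit multiplier built from a mix of memories (256x16, 32x1 and 16x1 blocks) and an adder. The 14-bit input is split into two 7-bit parts; the partial results are 21 bits wide and the output is 28 bits wide.]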

Exchanging Distributed RAM for BRAM

[Figure: CLB versus BRAM implementation of the 14-bit multiplier from the previous slide. In the CLB version the 14-bit input (in) is split into 4 + 4 + 4 + 2 bits that address LUT16x1 and LUT2x1 memories, whose outputs are summed (+) to give the output (out).]

Area [CLB] for Different Input and Coefficient Widths K

[Plot: horizontal axis K = 4..24, vertical axis 0..12. Series: "Only CLB, scale 1:10", "# of BRAM", "Equivalent cost of 1 BRAM".]

MM (Multiplierless Multiplication)

• Binary Representation (BR), example: B = 14 = $1110_2$

M = A·B = (A<<1) + (A<<2) + (A<<3)

• Sub-structure Sharing (SS), example: B = 27 = $11011_2$

tmp = A + (A<<1)

M = A·B = tmp + (tmp<<3)

• Canonic Signed Digit (CSD)

digit set {0, 1, -1} (0: no operation, 1: addition, -1 (written $\bar{1}$): subtraction)

example: B = 7 = $111_2$ = $100\bar{1}_{CSD}$

binary: M = B·A = (A<<2) + (A<<1) + A

CSD: M = (A<<3) - A
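The three examples above translate one-to-one into shift-and-add code. A minimal sketch (function names are illustrative):

```python
def mul14_binary(a):
    """B = 14 = 1110b: one shifted copy of A per 1 bit (two additions)."""
    return (a << 1) + (a << 2) + (a << 3)


def mul27_shared(a):
    """B = 27 = 11011b: the pattern 11b (3*A) is computed once and reused."""
    tmp = a + (a << 1)            # 3*A
    return tmp + (tmp << 3)       # 3*A + 24*A


def mul7_binary(a):
    """B = 7 = 111b: two additions."""
    return (a << 2) + (a << 1) + a


def mul7_csd(a):
    """B = 7 = 100(-1) in CSD: a single subtraction."""
    return (a << 3) - a
```

Sub-structure sharing brings 27·A down from three additions to two, and CSD brings 7·A down from two additions to one subtraction.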

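Binary to CSD Conversion

Modified CSD (MCSD): insert the symbol $\bar{1}$ only if the total number of operations is reduced.

Coefficient  Binary (TC)  CSD                      MCSD
3            11           $10\bar{1}$              11
7            111          $100\bar{1}$             $100\bar{1}$
11           1011         $10\bar{1}0\bar{1}$      1011
23           10111        $10\bar{1}00\bar{1}$     $1100\bar{1}$

Standard algorithm (flowchart):

1. Start: $i = 0$, $c_0 = 0$, $b_n = b_{n-1}$
2. $c_{i+1} = b_{i+1} b_i \vee b_i c_i \vee b_{i+1} c_i$
3. $d_i = b_i + c_i - 2c_{i+1}$
4. $i = i + 1$
5. If $i < n$, go to step 2; otherwise stop.

[Flowchart: modified algorithm. Labels: start with i = 0, carry = false; test "(bi = 1 and carry) or (bi = 0 and not carry)", then di = 0; j = i + 1; tests on Q(i,j) < 2 and "not (Y < 0 and j = w)" (sign bit); di = 1, carry = false; di = -1, carry = true; i = i + 1; final test "carry and B > 0", then di = 1; stop.]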
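The standard conversion can be sketched in a few lines, assuming the carry recurrence is the majority function of $b_{i+1}$, $b_i$ and $c_i$ (the OR symbols are lost in the flowchart text, so this reading is an assumption). The function name `to_csd` is illustrative.

```python
def to_csd(b, n):
    """CSD digits d_0..d_{n-1} (LSB first) of an n-bit two's complement b."""
    bits = [(b >> i) & 1 for i in range(n)]
    bits.append(bits[n - 1])                  # b_n = b_{n-1} (sign extension)
    digits = []
    c = 0                                     # c_0 = 0
    for i in range(n):
        c_next = (bits[i + 1] & bits[i]) | (bits[i] & c) | (bits[i + 1] & c)
        digits.append(bits[i] + c - 2 * c_next)   # d_i is -1, 0 or 1
        c = c_next
    return digits


print(to_csd(7, 4))    # digits of 7, LSB first
print(to_csd(11, 5))   # digits of 11, LSB first
```

The word length n has to include a sign bit, so 7 needs n = 4 and 11 needs n = 5.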

Application of Different MM Techniques

[Plot: share of each technique (0% to 100%) versus K = 3..12. Series: CSD-SS, SS, CSD, BR.]

The MM Cost for Different Coefficients

[Plot: cost in CLBs (0..18) versus coefficient value (0..250).]

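FIR Filters

$y(i) = \sum_{k=0}^{N-1} h(k) \, x(i-k)$

[Figure: direct form. A delay module takes x(i) and provides x(i), x(i-1), ..., x(i-N+1) to an arithmetic module, which outputs y(i). Example: the input passes through two delay elements (z^-1); the samples are multiplied by the weights w2, w1, w0 and summed (+) to give the output.]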
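A direct-form FIR filter can be sketched as a delay line followed by a multiply-accumulate. The sketch assumes the delay line starts filled with zeros; `fir_direct` is an illustrative name.

```python
from collections import deque


def fir_direct(h, xs):
    """y(i) = sum over k of h(k) * x(i-k), with x(i) = 0 for i < 0."""
    delay = deque([0] * len(h), maxlen=len(h))   # x(i), x(i-1), ..., x(i-N+1)
    out = []
    for x in xs:
        delay.appendleft(x)                      # shift in the new sample
        out.append(sum(hk * xk for hk, xk in zip(h, delay)))
    return out


print(fir_direct([5, 13, 5], [1, 0, 0, 0]))      # impulse response
```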

FIR Filter (transposed form)

$y(i) = \sum_{k=0}^{N-1} h(k) \, x(i-k)$

[Figure: transposed form. The input x(i) is fed to all multipliers h(0), h(1), ..., h(N-1) at once; the products pass through a chain of delay elements (z^-1) and adders (+) to the output y(i). Example with three coefficients h(0), h(1), h(2).]
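In the transposed form the delay elements sit between the adders instead of in front of the multipliers. A minimal sketch, assuming registers that start at zero (`fir_transposed` is an illustrative name):

```python
def fir_transposed(h, xs):
    """Transposed-form FIR: every sample is multiplied by all coefficients at once."""
    n = len(h)
    regs = [0] * (n - 1)                  # z^-1 registers between the adders
    out = []
    for x in xs:
        products = [hk * x for hk in h]
        out.append(products[0] + (regs[0] if regs else 0))
        for k in range(n - 1):            # regs[k + 1] still holds its old value
            following = regs[k + 1] if k + 1 < n - 1 else 0
            regs[k] = products[k + 1] + following
    return out


print(fir_transposed([5, 13, 5], [1, 2, 3]))
```

All multipliers share the same input sample, which is what makes constant-coefficient sharing techniques applicable.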

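2D FIR

[Figure: 3 x 3 2D FIR filter. The input stream passes through two line buffers; each of the three rows contains two delay elements (z^-1), which gives the window

a(y+2,x+2)  a(y+2,x+1)  a(y+2,x)
a(y+1,x+2)  a(y+1,x+1)  a(y+1,x)
a(y,x+2)    a(y,x+1)    a(y,x)

The window samples are multiplied by the weights w2,2 .. w0,0 and summed (+) to give the output b(y+1,x+1).]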

Examples of 2D FIR Filters

Low-Pass      Sobel         Laplace
1  2  1       -1 -2 -1      1  1  1
2  4  2        0  0  0      1 -8  1
1  2  1        1  2  1      1  1  1
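The three kernels can be applied with a plain 3 x 3 sliding window. The sketch assumes the output pixel is the weighted sum of the 3 x 3 neighbourhood and that border pixels are simply skipped; the image is a list of rows and `fir2d` is an illustrative name.

```python
LOW_PASS = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
SOBEL = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
LAPLACE = [[1, 1, 1], [1, -8, 1], [1, 1, 1]]


def fir2d(image, w):
    """3 x 3 weighted window sum; output is 2 rows and 2 columns smaller."""
    rows, cols = len(image), len(image[0])
    return [[sum(w[j][k] * image[y + j][x + k]
                 for j in range(3) for k in range(3))
             for x in range(cols - 2)]
            for y in range(rows - 2)]


flat = [[1] * 4 for _ in range(4)]   # constant image
print(fir2d(flat, LOW_PASS))
print(fir2d(flat, LAPLACE))
```

On a constant image the low-pass kernel returns the sum of its weights (16), while the Sobel and Laplace kernels return 0 because their weights sum to zero.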

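FIR Filter N = 2, LUT-Based Multipliers

[Figure: two versions of a 2-tap FIR filter with an 8-bit input (In) and one delay element (z^-1). The input samples are split into 4-bit parts that address the look-up tables LUTM0, LUTL0, LUTM1, LUTL1 (12-bit outputs). In one version Adder0 and Adder1 belong to Multiplier 1 and Multiplier 2 and Adder2 sums the two products; in the other version Adder0, Adder1 and Adder2 form a single adders block. The output is 18 bits wide.]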

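FIR, Arithmetic in a Different Order: (Parallel) Distributed Arithmetic

$\sum_{i=0}^{N-1} h_i a_i = \sum_{i=0}^{N-1} h_i \sum_{j=0}^{L-1} 2^j a_{i,j} = \sum_{j=0}^{L-1} 2^j \sum_{i=0}^{N-1} h_i a_{i,j}$

where $h_i$ is a coefficient, $a_i$ is an input and $a_{i,j}$ are the different bits of the input.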

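Distributed Arithmetic

[Figure: L look-up tables, one per input bit position. LUT j is addressed by the bits a(0,j), a(1,j), ..., a(N-1,j) and outputs S_j, which is W_DAC bits wide. The outputs are shifted (S_1<<1, ..., S_{L-1}<<(L-1)) and summed.]

$S_j = \sum_{i=0}^{N-1} h_i a_{i,j}, \quad j = 0, 1, \ldots, L-1$

$\sum_{j=0}^{L-1} 2^j S_j = \sum_{j=0}^{L-1} 2^j \sum_{i=0}^{N-1} h_i a_{i,j}$

$W_{DAC} = K + \log_2(N+1)$

$W_{LC} = K + W_{IN}$

All inputs of one LUT have the same bit weight (smaller LUT widths).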
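Distributed arithmetic can be sketched with one shared table that is addressed by the same bit position of every input sample. The sketch assumes unsigned inputs of `bits` bits; `build_da_lut` and `da_inner_product` are illustrative names.

```python
def build_da_lut(h):
    """lut[address] = sum of h[i] over all i whose address bit i is set."""
    n = len(h)
    return [sum(h[i] for i in range(n) if (address >> i) & 1)
            for address in range(1 << n)]


def da_inner_product(h, samples, bits):
    """sum_i h[i] * samples[i], evaluated one input bit position at a time."""
    lut = build_da_lut(h)
    total = 0
    for j in range(bits):
        address = 0
        for i, a in enumerate(samples):
            address |= ((a >> j) & 1) << i   # bit j of every sample
        total += lut[address] << j           # S_j << j
    return total


print(da_inner_product([5, 13, 5], [3, 200, 17], bits=8))
```

The table has $2^N$ entries, so this form is practical only for a small number of taps per table.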

Linear Phase FIR Filters (symmetric: h(0) = h(N-1), h(1) = h(N-2), ...)
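Coefficient symmetry is commonly exploited by adding the two samples that share a coefficient before multiplying, which roughly halves the number of multiplications. The slide states only the symmetry, so the folding below is an assumed use of it; `fir_symmetric` is an illustrative name.

```python
def fir_symmetric(h, xs):
    """FIR with h(k) == h(N-1-k): one multiplication per coefficient pair."""
    n = len(h)
    assert all(h[k] == h[n - 1 - k] for k in range(n)), "h must be symmetric"
    history = [0] * n                      # x(i), x(i-1), ..., x(i-N+1)
    out = []
    for x in xs:
        history = [x] + history[:-1]
        acc = 0
        for k in range(n // 2):            # pre-add the samples sharing h(k)
            acc += h[k] * (history[k] + history[n - 1 - k])
        if n % 2:                          # middle tap of an odd-length filter
            acc += h[n // 2] * history[n // 2]
        out.append(acc)
    return out
```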


Example of Sub-structure Sharing for FIR Filters

$H(z) = 5 + 13z^{-1} + 5z^{-2} = 101_2 + 1101_2 z^{-1} + 101_2 z^{-2}$

Example 1:

$A = 5 = 101_2$ (temporary expression)

$H(z) = A + (1000_2 + A) z^{-1} + A z^{-2}$

Example 2:

$A = 1 + z^{-1}$

$H(z) = 5A + 8z^{-1} + 5z^{-2}$
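Example 1 can be sketched in code: the product 5·x is formed once per sample with a shift and an addition and is then reused for all three taps. Variable names are illustrative and the registers are assumed to start at zero.

```python
def fir_5_13_5_shared(xs):
    """H(z) = A + (8 + A) z^-1 + A z^-2, where A stands for multiplication by 5."""
    out = []
    x1 = 0          # x(i-1)
    p1 = p2 = 0     # 5*x(i-1) and 5*x(i-2)
    for x in xs:
        p = (x << 2) + x                          # A: 5*x(i), computed once
        out.append(p + ((x1 << 3) + p1) + p2)     # 13*x(i-1) = 8*x(i-1) + 5*x(i-1)
        x1, p1, p2 = x, p, p1
    return out


print(fir_5_13_5_shared([1, 0, 0, 0]))            # impulse response
```

The filter needs no multiplier: one addition forms 5·x and three more additions form the output.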

The END

Additional materials

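Fast Multiplication in FPGAs

[Figure: adder tree for an 8-bit operand a. The partial products $2^7 a_7 b$, $2^6 a_6 b$, ..., $2^0 a_0 b$ are formed by AND gates and added in pairs in the first adder level (AND +), for example $2^6 \cdot (2 \cdot a_7 \cdot b + a_6 \cdot b)$. Further adder levels (+) combine the partial sums, with optional pipeline registers between the levels.]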
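The adder tree can be sketched as a pairwise reduction of the partial products. The sketch assumes unsigned operands and ignores pipelining; `adder_tree_multiply` is an illustrative name.

```python
def adder_tree_multiply(a, b, n=8):
    """Sum the n partial products 2^i * a_i * b level by level, in pairs."""
    level = [(((a >> i) & 1) * b) << i for i in range(n)]
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)                      # pad an odd level
        level = [level[k] + level[k + 1]         # one adder per pair
                 for k in range(0, len(level), 2)]
    return level[0]


print(adder_tree_multiply(200, 250))
```

For n = 8 the tree has three adder levels (4, 2 and 1 adders), compared with a chain of seven adders in a linear arrangement.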

Multipliers in FPGAs

Fragment of Virtex Configurable Logic Block (CLB)

Example:

G4 - a7
G3 - b(i)
G2 - a6
G1 - b(i+1)

F4 - a7
F3 - b(i-1)
F2 - a6
F1 - b(i)

LUT G function: (a7 and b(i)) xor (a6 and b(i+1))
