A Bit-Serial Method of Improving Computational Efficiency of Dot-Products 1

Distributed Arithmetic

A Bit-Serial Method of Improving Computational Efficiency of Dot-Products

1

DA is a bit-serial technique to greatly reduce resource requirements for the dot product calculation

So-called because the resources are not easily recognizable: “Where’s the MAC module?”

Takes advantage of small tables of pre-computed coefficients and clever rearrangement of the math

What is Distributed Arithmetic?

2

In signal processing the most common operation is the dot product

DA lends itself well to FPGA implementation due its use of lookup tables

DA can reduce gate count by 50%-80% in signal processing arithmetic!

Why use Distributed Arithmetic?

3

It turns out that the dot product is used extensively in DSP (FIR, FFT, etc)

Recall that dot product is a sum of products:

Written as a summation:

Recall: The Dot Product

332211

3

2

1

321

xAxAxA

A

A

A

xxx

Axy

K

kkk xAy

0

4

Simple example: smoothing data via DSP (low-pass filter)

Accomplished with an FIR filter. General form:

So we could implement a “3-tap (K=4) moving average filter”:

Why is the Dot Product important?

]2[3

1]1[

3

1][

3

1][ nnnnh

1

0

][][K

kk knAnh

(In this special case, A1=A2=A3=0.33)

5

Recall the goal:

X is the filter input, (digital!), so let’s consider two’s complement representation (scaled x<1 for cleanliness)

Putting them together

Developing the Math

K

kkk xAy

1

1

10 2

N

n

nknkk bbx

K

k

N

n

nknkk bbAy

1

1

10 2

N – total bits

6

Expand the summation:

We can precompute all terms that depend on the input data (bk0..bkK) and store them in a ROM of size 2K+1

The x inputs can then be used to address the ROM directly: LUT!

Developing the Math

nK

k

N

n

K

kknkkk bAbAy

2)(

1

1

1 10

Since bkn is 0 or 1, this hasonly 2K possible values

Two possible values

7

Non-DA Hardware Implementation

Developing the Hardware

K

kkk xAy

0

8-bit Multiplier

8-bit Adder

)4(,,,,,, 4321 KDCBAxCCCCALet

Based on theoriginal equation

8

We said this is ‘bit-serial’ technique, so how can we perform multiplication?

Here, x is 4-bit input and A is 8-bit constant

The Scaling Accumulator Multiplier

ExampleMultiplication

x = 1011A = 1011001

1 10110010 0000000

1 1011001

1 +1011001

10010000101

Shift right by 1

Result register

xA

AND with 1 paralleland 1 serial input

9

So, now we substitute the scaling accumulator into our original design. Getting closer...


K

kkk xAy

010

Let’s rearrange the hardware to match our expanded eqn:


nK

k

N

n

K

kknkkk bAbAy

2)(

1

1

1 10

We first sum the products ofeach input bit and its constantThen we add and scale

each of those terms

11

Now recall that we had the clever idea to use pre-computed sums in a LUT for the bitwise addition


Address Data

0000 0

0001 C0

0010 C1

0011 C0+C1

... ...

1110 C0+C1+C2

1111 C0+C1+C2+C312

We need to accommodate the negative term, so we add one more address line to the LUT called Ts. ROM size now 2K+1

Ts is a timing signal. Ts =1 during sign bit time, 0 otherwise

We also need this bit to know when the final result is ready

HW Finishing Touches

nK

k

N

n

K

kknkkk bAbAy

2)(

1

1

1 10

Address Data

10000 0

10001 -C0

11111 -(C0+C1+C2+C3)

For all Ts = 1 the ROM contains the negative of the appropriate sum

13

Complete DA Hardware!This is an example of K=4DA dot-product hardware

ROM Size = 2K+1=25=32

Here is our scaling accumulator

Switch SWA in pos 2 after Ts=1,at which point y contains final result

14

Computes N-bit dot product in N cycles

Reduced area and high speed due to the ROM

However, requires 2K+1 size ROM (grows exponentially with input lines)

Input sizes often 16 bits -> Need 128K ROM!

Performance

15

Bit-serial means N-bit dot product requires N cycles... Slower than parallel?

N HW multipliers not generally practical due to large area\power!

Time-multiplexing your parallel HW multiplier means you lose the speed gain: N vs K

Example: K=8, N=8 takes the same time on time multiplexed parallel HW vs DA bit-serial

Distributed Arithmetic Speed

16

We can reduce the ROM size to 2K with some tricks

There are other math tricks to reduce the size further to 2K-1

Improving our HW: ROM size

Replace adder with adder/subtractor

Ts becomes control line foradder/subtractor

ROM size is reduced by half

17

Speed determined by serial nature of input – 1 BAAT We can expand the HW to do multi-bit at a time

Improving our HW: Speed

Introduce input as bitpairs x10x11, x12x13, etc

Shift LSB of pair result by 1

Shift accumulator feedback by 2

Requires 2 ROMs instead of 1

18

DA lends itself easily to DSP because of its easy application to the dot product

DA is easily implementable on FPGA because of the similar architecture-> LUTs (of course better on custom hardware)

DA is not limited to dot product; will work for any algorithm where pre-computed values can be leveraged

When to use Distributed Arithmetic

19

DA is a very efficient means of mechanizing the dot product

The use of DA can save 50-80% area over the parallel approach

Like everything, DA has tradeoffs:ROM size input linesSpeed area (multi ROM)

Conclusion

20

Application of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review. White, Stanley. IEEE ASSP Magazine July 1989(I pulled most of the basic talk info from here)

Parallel and Pipelined Architecture Designs for Distributed Arithmetic-Based Recursive Digital Filters. Hwang, H. and Su. C. IEEE Xplore VLSI Signal Processing IX, 1996 35-44(this has some slight remarks about bit parallel vs bit serial, also auto-regressive moving average filter example)

Distributed Arithmetic for Efficient Base-Band Processing in Real-Time GNSS Software Receivers. Waelchli, G et al. Journal of Electrical and Computer Engineering volume 2010(application to GPS)

An FPGA-Based Parallel Distributed Arithmetic Implementation of the 1-D Discrete Wavelet Transform. Al-Haj, Ali. Informatica 29 (2005) 241-247(DSP example using a Virtex FPGA)

21

References & Further Reading

Documents

A Bit-Serial Method of Improving Computational Efficiency of Dot-Products 1