FIR Filter Fits in an FPGA using a Bit Serial Approach structure implemented in FPGAs. To illustrate the advantages of bit serial designs for FPGAs , I examine the implementation of

1

FIR Filter Fits in an FPGA using a Bit Serial Approach

Raymond J. Andraka, Senior EngineerRaytheon Company, Missile Systems Division, Tewksbury MA 01876

INTRODUCTION

Early digital processors almost exclusively used bitserial architectures because of the high cost ofhardware. Bit serial machines have beensupplanted by parallel architectures mostly due tothe low cost of hardware today. As a result, bit se-rial solutions are often overlooked in applicationswhere they may be the better choice. This is es-pecially true when designing with Field Pro-grammable Gate Arrays. Many of the elements(Eg. multipliers) used in parallel structures will noteven fit in an FPGA. In those cases where an ele-ment fits, the routing resources often areinsufficient or the resulting design is to slow to bean attractive replacement for dedicated functionparts.

By returning to a bit-serial architecture, it is fre-quently possible to pack a relatively complex func-tion into a single FPGA. A throughput improve-ment may even be realized over an equivalentparallel structure implemented in FPGAs. Toillustrate the advantages of bit serial designs forFPGAs , I examine the implementation of an entireFIR filter in a single FPGA using only bit serialelements.

What is Bit Serial?Bit parallel designs process all of the bits of an in-put simultaneously at a significant hardware cost.In contrast, a bit serial structure processes the inputone bit at a time, generally using the results of theoperations on the first bits to influence the process-ing of subsequent bits. The advantage enjoyed bythe bit serial design is that all of the bits passthrough the same logic, resulting in a huge reduc-tion in the required hardware. Typically, the bitserial approach requires 1/nth of the hardwarerequired for the equivalent n-bit parallel design.The price of this logic reduction is that the serialhardware takes n clock cycles to execute, while theequivalent parallel structure executes in one. Thetime-hardware product, however, for the serialstructure is often smaller than for equivalentparallel designs because the logic delays betweenregisters are generally significantly smaller. Thismeans that the serial machine can operate at a

higher clock frequency. In the case of FPGAs,signal routing contributes significant propagationdelays and often uses up logic cells. The serialstructures tend to have very localized routing, oftento only one destination. In contrast, the parallelmachines usually need signals extended across thewidth of the processing element. The limited andslow routing resources in FPGAs make the serialprocessing elements even more attractive. In somecases, the overall throughput for a serial designimplemented in an FPGA can actually exceed thatof an equivalent parallel design in the same device.

FPGA selection and backgroundThe techniques described in this paper apply to anyFPGA, as well as to VLSI designs (which maybenefit from the same advantages-especially wherehigh data rates are not necessary). For the purposeof this project, the CLI 6000 series FPGAs made byConcurrent Logic Inc (CLI) were selected. TheseFPGAs are RAM based, and contain a 56 x 56array of logic cells interconnected via busses anddirect wires to nearest neighbors. Each cellbasically consists of a half adder with a D flip flopon the sum output and some extra logic to allowother functions to be programmed. The cell isprogrammed to one of 64 configurations by adedicated ram under the cell. The cells each have2 (3 for certain special functions) inputs and 2outputs, all of which are accessible from any sideof the cell. Each input and output can be directlywired to/from any of the cell’s four nearestneighbors, or to any of 4 local busses which extendto other cells in the row or column. The localbusses are broken into segments eight cells longand are connected across the breaks byprogrammable repeaters. The relatively simple cellin the CLI array is fairly well suited to the bit serialstructures.

THE BASIC BUILDING BLOCKS

The most basic functions required for nearly anysignal processor include addition, negation and de-lays. These blocks can then be used to constructthe more complicated structures such asmultipliers. In most cases, using a bit-serial ar-

2

chitecture simplifies the hardware required sinceall of the bits pass through a single bit wide el-ement. I will discuss the construction of thesebasic elements in the next paragraphs.

Bit Serial AdderA bit-serial adder is constructed using a full adderwith registers on both its carry and sum outputs.The registered carry output is wired back to thecarry input of that full adder. In operation, the twowords to be added together are simultaneouslyshifted least significant bit first into the remainingtwo inputs. The carry out from the addition ofeach bit is stored and then used in the summationof the next bit. In effect, the carries are held sta-tionary while the inputs are rippled past. Thisadder is also known as a Carry Save Adder becauseof the nature of its operation. The output of thecircuit is registered to allow bit pipelining. Theresulting latency is one bit time. The serial addermust be cleared before each word to avoid errors.The input words and output word are always equallength. As with parallel addition, the radix pointsof the input words must be aligned.

S A

B

S

C HA

A

B

S

C HA

D Q

D Q ^

Y

X

a) Carry-Save Adder Schematic

1

^ &

S

Y

^ &

B

B

A

B

PFDHA

PFDHA

^ &

X

PXOND

PINV

A

B B

B

A

Z

Z

b) Example Carry-Save Adder implementation.

Figure 1. Carry-Save Adder schematic and layout. Cellsare: PFDHA= half adder with a register on sum output,PXOND = non-registered half adder , PINV = inverter

(necessary to correct inverted half adder output).

D Q

D Q

^A S

+

a) Two's Complement Schematic

^

&

B

B

PFDHA

+

PFDOR

A B

S

A A

Z Z

b) Two's Complement Layout and Implementation(input A wired to two cells and must be a buss input)

Figure 2. Two's Complement schematic and layout.FPGA cells are: PFDHA= half adder with a register on

sum output, PFDOR = OR with register on output.

Bit Serial Two’s ComplementThe second elementary function required is a partwhich will compute the two's complement of theinput argument. Recalling the serial algorithm fortwo's complement (starting at least significant bitcopy each bit until first one is encountered then in-vert remaining bits) yields a simple solution. A se-rial two's complement circuit is shown in figure 2.The input should be presented least significant bitfirst. The carry (detect)flip-flop must be reset be-fore each input word, since it causes the XOR toinvert the input continuously after the first logicone is detected. The output is registered, so thefunction has a one bit latency.

DelaysThe remaining elementary function is the bit delay(a bit time is one clock cycle). The delay is usefulfor aligning words as well as for producing worddelays required by some algorithms. The delay issimply a D flip-flop inserted into the data path foreach bit of delay desired. Word delays are con-structed from a string of bit delays equal in lengthto the number of bits in the word. A sample layoutof a word delay for a filter is shown in figure 3.

3

FD

out (to next delay and to YIN bus)

FDFDFD

FD FD FD FD FD

in

FDFDFD

FD

figure 3. Example Word delay

HIGHER FUNCTIONS

The repertoire of functions required to implement aFIR filter is rounded out with a multiplier and acolumn adder, both of which are constructed fromthe elements already discussed. Other DSP func-tions may require additional functions.

MultiplierMultipliers are essential to most signal processingalgorithms. The simple serial by parallel multi-plier is particularly well suited for FPGA imple-mentation because all of its routing is to nearestneighbors with the exception of the input. Thenumber of cells is proportional to the number ofbits in the parallel input. One input of this multi-plier is parallel while the other is bit serial with theleast significant bit presented first. The output isbit serial, also with the least significant bit first.The architecture of a general serial by parallel mul-tiplier is shown in figure 4a. This multiplier per-forms the familiar shift-add algorithm: the parallelinput is multiplied in turn by each bit of the serialinput as it is presented, and each of those partialproducts is added to the shifted accumulation of theprevious products. The bitwise products are simplythe logical ANDs of the input bit with each of theparallel input bits. The shifting accumulator iseasily constructed by chaining a series of carry-save adders together so that inputs to theaccumulator are bit parallel and the sum isdownshifted on each operation. The serial outputis then taken from the output of the leastsignificant bit adder. The output bit has the sameweight as the previous serial input bit, yielding alatency of one bit. The number of bits in the outputis equal to the sum of the number of bits in each ofthe inputs. Since the serial input has to be of thesame length as the output, it is extended with signbits.

Multiplication of negative (two’s complement)numbers using an unmodified shift-add algorithmwill yield an error in the upper half of the product.This error is the result of the inputs not being signextended to account for the growth of the product(number of bits in the product is equal to the sumof the bits in the multiplicands). The serial inputof the parallel by serial multiplier does not sufferfrom this error since it must be extended to accountfor the growth in the product. The parallel (X) in-put will suffer if not corrected. Fortunately, thecorrection can be made without adding bits to themultiplier hardware by recognizing that the signextension of X, if taken alone, multiplies the serialinput by either zero or negative one. This result isshifted into the lower bits of the multiplier duringthe course of the multiplication. The signextension of the parallel input can be accomplishedby replacing the most significant adder of themultiplier by a two's complement stage. The inputof this stage is the bit product (AND) of the serialinput and the sign bit of the parallel input.

The analysis of the correction for negative inputsreveals that the X and Y inputs do not have to havethe same number of bits. The X input is essentiallysign extended infinitely by the two’s complementblock, and the Y input is dependent only upon thelength of the serial input. The hardware for themultiplier is independent of the precision of theserial input. It is therefore possible to save somehardware in cases where the parallel input does notneed the precision of the serial input. This fact -could be advantageous in the FIR filter, since lim-ited precision in the coefficients limits the place-ment of the filter’s zeros but does not otherwisecontribute to noise in the output.

The serial multiplier must be cleared before a newword is input to prevent errors. This is especiallytrue if the X (parallel) input is negative since thetwo’s complement circuit cannot self clear. Noother controls are required. Secondly, the serialinput has to be sign extended by the number of bitsin the parallel input (ie to make the number of bitsin the serial input equal to the number of bits in theoutput). The output is always a full precisionoutput.

4

YXXX X

&

A S P2's C

&

CSADDA

B

S

&

CSADDA

B

S

&

CSADDA

B

S

n-1 n-2 1 0

SerialOutput

Parallel Input

Serial Input

a) Signed Serial by Parallel Multiplier Schematic

Z

Y

A S PA S2's C CSADD

A

B

SCSADDA

B

SCSADDB

SerialOutputSerial

Input

Logic Zero

b) Signed Serial by Parallel Multiplier with fixed parallel input (value determined by connections to Yin or logic zero)

P

CSADDYZ

2's C CSADD CSADD CSADD CSADD

c) Multiplier Layout for Pitch=2. Shaded cells correspond to PXOND cell in adders(note adders have been flipped vertically with respect to the adder shown earlier)

Figure 4 Serial by Parallel Multiplier Architectures and Layouts

The partial products presented to the shifting accu-mulator are generated by the logical AND of theinput serial bit with each bit of the parallel input.If the parallel input is fixed, the AND gates can beeliminated by connecting the inputs to each adderdirectly to the Y (serial) input or to logic zerodepending upon the value of the corresponding bitin the fixed parallel input. The FPGA is RAMbased, so the value of the parallel input can still bechanged by reprogramming if this is done. Sincethe coefficients of an FIR filter are normallychanged infrequently with respect to the data rate, Ichose to take advantage of this simplification.This eliminates the AND gates used to generate theproduct, and more significantly, the logic requiredfor setting and holding the parallel input values.As a note, a further simplification of the fixedinput multiplier is possible by noting that theadders associated with zeros in the parallel inputreduce to a single delay flip-flop. I chose not toimplement that reduction to minimize the changesrequired in reprogramming coefficients to the FIRfilter. The simplified multiplier is shown in figure4b.

The n bit multiplier is constructed in the FPGA bystringing n-1 of the carry-save adders (CSADD)

and a two's complement together. One input ofeach CSADD is supplied by the previous stageoutput. The other inputs are supplied by one oftwo local busses (the Y input to the multiplier forones, or logic zero for zeros): the parallel input tothe multiplier is programmed by the local busconnections only. The layout used is shown infigure 4c. Alternatively, a multiplier with a 3 cellpitch may be created using a different layout forthe CSADD adders to allow more parallel inputbits to fit in the width of the FPGA.

Column AdderAn adder structure capable of simultaneouslyadding more than two inputs is a desirable func-tion. This is easily accomplished by a tree of serialadders. Each serial adder (carry save adder)combines two input streams into one output, henceeach level of the tree structure reduces the numberof serial streams by half, adding one bit time oflatency in the process. A column adderconstructed in this manner allows an arbitrarilylarge number of inputs to be summed togetherwithout a sacrifice in the bit rate. If an oddnumber of inputs exist in a level, the odd input canbe passed on to the next level via a register to keepthe alignment of the bits. If overflow is to beavoided, one bit of growth must be allowed for

5

each level in the adder. Since the input and outputmust have a similar number of bits, the input mustinclude extra sign (guard) bits to prevent overflow.The number of levels and hence the latency andnumber of guard bits for an n input column adderis equal to Log2(n) rounded up to an integer. Aswith single serial adders, the inputs and output arepresented least significant bit first. The columnadder architecture is illustrated in figure 5. AnFPGA implementation designed to match to a stackof 2 cell tall multipliers is shown in figure 6. Notethat the CSADD layouts were optimized for theapplication.

CSADDA

B

S

CSADDA

B

S

CSADDA

B

S

CSADDA

B

S

CSADDA

B

S

CSADDA

B

S

CSADDA

B

S

CSADDA

B

S

CSADDA

B

S

CSADDA

B

SD Q D Q

A0

A2

A4

A6

A8

A1

A3

A5

A7

A9

A10 S

Figure 5. Example Serial Column Adder Architecture

PUTTING IT ALL TOGETHER: AN FIR FILTER

The FIR filter is essentially a discrete convolutionof the input signal with a set of coefficients. Math-ematically, the filter can be defined as:

Y[k] = SnCi X[k-i]

The signal flow diagram shown in figure 7 illus-trates the algorithm and suggests an architectureusing the elements created above. The word delaysare inserted for the ‘Z’ delay blocks. Themultiplications shown in the flow graph corre-spond to the serial by parallel multipliers with theirparallel inputs programmed with the value of theassociated coefficient. A separate word delay andmultiplier are used for each tap in the filter. All ofthe summation blocks shown are combined and re-placed by a column adder with as many inputs asthere are taps. Figure 8 shows an example layout

and interconnect for a seven tap filter using thefunctions developed earlier.

A0

1^&

B

A

PFDHA PINV

B

A

Z

^&

PXOND

^&

PFDHA

Z

A

B

B

A

A

A

A

A

B

1^&

B

A

PFDHA PINV

B

A

Z

^&

PXOND

^&

PFDHA

Z

B

B

A

A1

A2

A3

S

CSADD for higher level Connected by busses

1^& B

A

PFDHA PINV

B

A

Z

^&

PXOND

^&

PFDHA

Z

A

B

A

S

Figure 6. Portion of Serial Column Adder layout de-signed to match the outputs of a stack of 2 cell highmultipliers . Each shaded cell group is a carry saveadder. This pattern is repeated to create the first twolevels of the column adder. The remaining levels areinserted in the 2 x 3 voids left in the resulting layout .

As each word is shifted into the filter (least signifi-cant bit first of course) it is fed to the first multi-plier where it is multiplied by C0. At the sametime, The input is fed to the delay chain where it isdelayed exactly one word time interval so it arrivesat the second multiplier on the same clock that thesecond word is presented to the first multiplier.The succeeding words eventually fill the delay sothat the each of the last n (n=number of taps)words received are simultaneously multiplied bythe appropriate coefficients. The outputs of themultipliers are fed to the column adder to performthe summation. The latency for the entire filter isLog2(# taps) +1 rounded up to the next integer.This reflects the latency of one for the multiplier

6

added to the column adder latency. The delays donot contribute to the latency figure, as they areused to provide the past inputs to multipliers.

inputZ

C0

Z

C1

Z

Cn-2 Cn-1

output

× ×××

Figure 7. FIR filter signal flow diagram.

Word Delay

Word Delay

Word Delay

Word Delay

Word Delay

Word Delay

sourcelogic zero

0COLUMN

ADDER

Serial Multiplier

SerialOutput

Serial Input

Serial Multiplier

Serial Multiplier

Serial Multiplier

Serial Multiplier

Serial Multiplier

Serial Multiplier

Figure 8. Serial FIR Filter as implemented in FPGA(example has 7 taps). The Coefficients are fixed inputs

to the serial by parallel multipliers.

Input, Output and ControlBoth the input and output of the filter are serialdata streams with each word presented leastsignificant bit first. The input words are of thesame length as the output. The word length isequal to the sum of the word sizes of the input(number of bits excluding sign extension requiredfor processor), the coefficients (length ofmultipliers) and the number of levels in the columnadder. The word size of the input need not be thesame as that of the coefficients. The input is signextended to bring it to the same length as theoutput. This requirement is due to the nature ofthe serial processing; the inputs need to be as longas the outputs. The extra few bits due to thecolumn adder allow the column sum to grow with-out overflow. The multiplier array must be resetbefore each new word begins to shift in. The de-lays, however, cannot be reset since they hold theold words. In the FPGA implementation, a localreset was wired to the array columns correspondingto to the multipliers and column adder instead ofthe global reset. That local reset was brought outas a control line in addition to the global resetwhich clears the filter.

More taps may be obtained by cascading twofilters. This is done by adding an extra word delayto the end of the delay chain to feed the serial inputof the second filter. The serial outputs of the twofilters are summed using a serial adder to obtainthe final output. This expansion scheme can beextended to create any number of taps by chainingchips together via the delay chain and summingtheir outputs with a column adder. That columnadder could easily be included one of the devices.

The filter coefficients are determined when the de-vice is programmed. The value of each bit of thecoefficient is determined by connections to a singlecell (except for the coefficient sign bits which havetwo connections), so to change a bit, only one cellneeds to be updated in the device program(assuming the device was routed in a predictablemanner). The bit values of the filter coefficientsare defined by the connections of each stage of themultiplier to the corresponding Y bus (for a 1) orlogic zero bus (for a 0). The coefficient bits areordered on the multipliers so that the mostsignificant bit corresponds to the input end of themultiplier. The coefficients are ordered in the FIRfilter so that C0 is at the input end of the filter.The CLI FPGA has a feature which allows partialreprogramming of the array while the remainingportion remains functional. This feature should al-low changing coefficients on the fly. Even if thisfeature is not usable, FIR coefficients are usuallychanged as a result of a change of context, wherethe outputs are meaningless during the changeanyway. According to the CLI data sheets,complete device reprogramming typically takesplace in 8ms.

DESIGN ASSESSMENT

The real-estate occupied by the filter is determinedmainly by the layout of the carry-save adders, thenumber of bits in the coefficients and the numberof taps in the filter. If the multipliers areconstructed of adders 3 cells wide and 2 cells tall(adders are chained horizontally), one device cancontain a 27 tap filter with 12 bit coefficients and17 bit inputs. Alternatively, if the multipliers aremade of 3 cell tall by 2 cell wide adders, onedevice will support up to 18 taps with 18 bit coef-ficients and 25 bit inputs. These are respectableresults for a single FPGA. Additional taps areeasily had by expanding the filter to multipledevices as discussed above.

7

Using the maximum timing parameters from theCLI device specification, I found the maximum in-ternal clock to clock delay in the FIR filter to beabout 30 ns (attributed to the bus transit time onthe Y inputs to the multipliers). This translates toa bit rate of about 33 Mhz. For the example 27 tapfilter with 12 bit coefficients and 16 bit input data(33 bit output), this means a data rate ap-proaching 1 million words per second.

Placement and RoutingThe regular structure and compactness of the ele-ments yields a remarkably high utilization ofaround 70%. This figure is arrived at by countingthe number of cells used as logic (not wire cells)and dividing by the total number of cells in therectangular area covered by the filter. As acomparison, a 25% utilization is considered good.The placement and routing of the cells is critical toobtaining the predicted data rate and logic density.Unfortunately, the automatic placement tool isincapable of producing satisfactory results, so handplacement is necessary. The automatic router doesa decent job provided it has a good placement tostart from. The routed solution, however, is notoptimum. Additionally, the auto route does notconnect the multiplier Y and Z busses in a pre-dictable manner, so partial reprogramming wouldnot be possible. These limitations could be over-come by writing a generator program which takesadvantage of the regular structure of the filter tocreate a placed and routed data base. Thegenerator would have the added benefit of drasti-cally reducing design time and probability oferrors.

CONCLUSIONS

In this paper, I have shown that it is possible topack a relatively complex digital signal processingfunction into an FPGA by using bit serial struc-tures. The cost of bit serial architectures in termsof more clock cycles can be offset to some degreeby the shorter delay paths between pipeline regis-ters. The resulting design is fast enough for manyapplications where a bit serial process may nothave been considered. The bit serial designphilosophy is extendable to other FPGAs, VLSIdesigns and other places, and is as applicable todayas it was for the early processors.

REFERENCES

[1] P. Denyer and D. Renshaw, VLSI SignalProcessing: A Bit Serial Approach. Selected Read-ings, Addison-Wesley, 1985.

[2] C.R. Rupp, Digital Functions and Proces-sors, work in progress, University of Massachusettsat Lowell, 1992.

[3] P.J. Graumann and L.E. Turner, Imple-menting Digital Signal Processing Algorithms us-ing Pipelined Bit-Serial Arithmetic and Field Pro-grammable Gate Arrays, FPGA ‘92, February1992.

[4] Concurrent Logic Inc., "CLI 6000 SeriesField Programmable Gate Arrays", Data Sheet,Concurrent Logic Inc., December 1991.

8

AUTHOR'S BIOGRAPHY

Raymond J. Andraka is a senior engineer for RaytheonCompany's Missile Systems Division, where he has beendesigning digital signal processors for radar systems andmanaging projects for the last 5 years. Ray graduatedfrom Lehigh University in December 1983 with hisBSEE. He then spent 41/2 years in the Air Forceresearching techniques for recovering data masked bynoise and managing development contracts beforejoining Raytheon. He recently earned his MSEE fromthe University of Massachusetts at Lowell. Ray enjoysflying, tinkering and spending time with his wife andtwo boys.

Documents

FIR Filter Fits in an FPGA using a Bit Serial Approach structure implemented in FPGAs. To illustrate the advantages of bit serial designs for FPGAs , I examine the implementation of