16 Tap FIR Filter

Design Objectives

• Store the 16 latest input values and the 16 impulse-response coefficients on-chip in registers.
• Use a clocked architecture to synchronize input and output values.
• Reduce the number of multipliers and adders needed, that is, optimize area, power, and cost.
• Achieve the above without compromising speed.

Design Objectives

• Future scalability for input data as well as coefficient bits.
• Signed or unsigned input data as well as coefficients.
• Fast MAC operation on signed or unsigned data with future scalability.
• Synchronization of input/output data.
• Configurable output precision.

Design Objectives

• 16 taps of delay line.
• 8 bits of input/output bit resolution.
• Burst mode of data transfer at the input, supporting 32 elements of the desired resolution in one burst.

Main issues of concern when designing an FIR filter:

• Sharp response
• Number of taps
• Numerical precision
• Fully parallel

Advantages and Disadvantages

• Advantages:
– Always stable (assuming a non-recursive implementation).
– Quantization noise is not much of a problem.
– Transients have a finite duration.

• Disadvantages:
– A high-order filter is generally needed to satisfy the stated specification, so more coefficients are needed, with more storage and computation.

Review of Discrete-Time Systems

Linear time-invariant (LTI) systems

Causal systems: for every input with x[k] = 0 for k < 0, the output satisfies y[k] = 0 for k < 0.

Impulse response: input 1, 0, 0, 0, ... -> output h[0], h[1], h[2], h[3], ...

General input: x[0], x[1], x[2], x[3], ... -> output y[0], y[1], y[2], y[3], ...

[Figure: system block with input x[k] and output y[k]]

With u[k] denoting the input sequence, the output is the convolution sum:

$$y[k] = \sum_i u[i]\, h[k-i] = u[k] * h[k]$$
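As a minimal sketch of the convolution sum above (the function name `convolve` and the sample values are illustrative assumptions, not part of the original design), the following computes y[k] for finite sequences and confirms that a unit impulse at the input reproduces the impulse response at the output:

```python
def convolve(u, h):
    """Return y[k] = sum_i u[i] * h[k - i] for finite sequences u and h."""
    y = [0] * (len(u) + len(h) - 1)
    for i, u_i in enumerate(u):
        for j, h_j in enumerate(h):
            # u[i] * h[j] contributes to output index k = i + j
            y[i + j] += u_i * h_j
    return y


# Hypothetical 3-coefficient impulse response, chosen only for illustration.
h = [3, 2, 1]
print(convolve([1, 0, 0, 0], h))  # [3, 2, 1, 0, 0, 0]
```

The double loop is deliberately the textbook form; a hardware filter evaluates the same sum one output sample per clocked step.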

Overview

FIR filter equation:

y[n] = x[n] * h[n]

where * denotes convolution, n is the sample index, and h[n] holds the N "taps" (coefficients) of the FIR filter.

For a 16-tap FIR filter:

$$y[n] = a_0 x[n] + a_1 x[n-1] + a_2 x[n-2] + a_3 x[n-3] + \cdots + a_{15} x[n-15]$$
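The 16-tap equation can be modelled in software as a register delay line that shifts once per input sample. This is only a behavioural sketch: the class name `Fir16` and the coefficient values are assumptions made for illustration, and it does not model the bit widths or clocking of the actual chip.

```python
from collections import deque


class Fir16:
    """Behavioural model of a 16-tap FIR filter with a register delay line."""

    def __init__(self, coeffs):
        if len(coeffs) != 16:
            raise ValueError("exactly 16 coefficients are required")
        self.coeffs = list(coeffs)
        # delay[0] holds x[n], delay[15] holds x[n-15]; all start cleared.
        self.delay = deque([0] * 16, maxlen=16)

    def step(self, x):
        """Shift in one input sample and return the new output sample."""
        self.delay.appendleft(x)  # the oldest sample falls off the far end
        return sum(a * d for a, d in zip(self.coeffs, self.delay))


# Illustrative coefficients a0..a15 = 1..16; an impulse reads them back.
fir = Fir16(range(1, 17))
print([fir.step(v) for v in [1] + [0] * 16])
```

Feeding a single 1 followed by zeros returns the coefficients in order and then 0, which is a quick sanity check for any tap ordering.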

Different Filter Representations

Difference equation:

$$y[k] = \tfrac{1}{2}\, y[k-1] + \tfrac{1}{8}\, y[k-2] + x[k]$$

Recursive computation needs y[-1] and y[-2]. For the filter to be LTI, y[-1] = 0 and y[-2] = 0.

Transfer function (assumes LTI system):

$$Y(z) = \tfrac{1}{2} z^{-1} Y(z) + \tfrac{1}{8} z^{-2} Y(z) + X(z)$$

$$H(z) = \frac{Y(z)}{X(z)} = \frac{1}{1 - \tfrac{1}{2} z^{-1} - \tfrac{1}{8} z^{-2}}$$

Block diagram representation:

[Figure: x[k] enters an adder whose output is y[k]; two unit delays produce y[k-1] and y[k-2], which are fed back to the adder through gains 1/2 and 1/8.]
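A short sketch of the recursive computation, assuming the difference equation reads y[k] = (1/2) y[k-1] + (1/8) y[k-2] + x[k] with zero initial conditions (the signs are taken from the feedback gains in the block diagram; the function name `recursive_filter` is illustrative). Exact fractions are used so the first few impulse-response samples can be checked by hand.

```python
from fractions import Fraction


def recursive_filter(x):
    """Evaluate y[k] = 1/2*y[k-1] + 1/8*y[k-2] + x[k] with y[-1] = y[-2] = 0."""
    y_prev1 = Fraction(0)  # y[k-1]
    y_prev2 = Fraction(0)  # y[k-2]
    out = []
    for x_k in x:
        y_k = Fraction(1, 2) * y_prev1 + Fraction(1, 8) * y_prev2 + x_k
        out.append(y_k)
        y_prev2, y_prev1 = y_prev1, y_k
    return out


# Impulse response: 1, 1/2, 3/8, 1/4, ...
print(recursive_filter([1, 0, 0, 0]))
```

Unlike the FIR case, this impulse response never becomes exactly zero, which is the contrast the slide is setting up.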

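Discrete-Time Systems

Z-transform:

$$H(z) = \sum_i h[i]\, z^{-i}$$

Example with a three-coefficient impulse response and a four-sample input:

$$\begin{bmatrix} y[0] \\ y[1] \\ y[2] \\ y[3] \\ y[4] \\ y[5] \end{bmatrix} = \begin{bmatrix} h[0] & 0 & 0 & 0 \\ h[1] & h[0] & 0 & 0 \\ h[2] & h[1] & h[0] & 0 \\ 0 & h[2] & h[1] & h[0] \\ 0 & 0 & h[2] & h[1] \\ 0 & 0 & 0 & h[2] \end{bmatrix} \cdot \begin{bmatrix} u[0] \\ u[1] \\ u[2] \\ u[3] \end{bmatrix}$$

Multiplying both sides on the left by the row vector of powers of z^{-1}:

$$\underbrace{\begin{bmatrix} 1 & z^{-1} & z^{-2} & \cdots & z^{-5} \end{bmatrix} \begin{bmatrix} y[0] \\ \vdots \\ y[5] \end{bmatrix}}_{Y(z)} = H(z) \cdot \underbrace{\begin{bmatrix} 1 & z^{-1} & z^{-2} & z^{-3} \end{bmatrix} \begin{bmatrix} u[0] \\ \vdots \\ u[3] \end{bmatrix}}_{U(z)}$$

With

$$Y(z) = \sum_i y[i]\, z^{-i}, \qquad U(z) = \sum_i u[i]\, z^{-i}$$

this gives

$$Y(z) = H(z) \cdot U(z)$$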

Discrete-Time Systems

"Popular" frequency responses for filter design:

low-pass (LP), high-pass (HP), band-pass (BP), band-stop, multi-band, ...

Digital Filter Specifications

For example, the magnitude response $|G(e^{j\omega})|$ of a digital lowpass filter may be given as indicated below.

[Figure: magnitude response specification of a digital lowpass filter]

Structured Streams

Hierarchical structures:

– Pipeline
– SplitJoin
– Feedback Loop

Different Strategies

Map one filter per tile and run forever.

Pros:
– No filter swapping overhead
– Reduced memory traffic
– Localized communication
– Tighter latencies
– Smaller live data set

Cons:
– Load balancing is critical
– Not good for dynamic behavior
– Requires # filters ≤ # processing elements

Discrete-Time Systems

"FIR filters" (finite impulse response):

$$H(z) = \frac{B(z)}{z^N} = b_0 + b_1 z^{-1} + \cdots + b_N z^{-N}$$

• Moving average filters (MA)
• N poles at the origin z = 0 (hence guaranteed stability)
• N zeros (zeros of B(z)), "all-zero" filters
• Corresponds to the difference equation

$$y[k] = b_0\, u[k] + b_1\, u[k-1] + \cdots + b_N\, u[k-N]$$

• Impulse response

$$h[0] = b_0,\ h[1] = b_1,\ \ldots,\ h[N] = b_N,\ h[N+1] = 0,\ \ldots$$

Speeding Up FIR Filter

FIR speed-up:

y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);
y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);
. . .
y(n) = c(0)x(n) + c(1)x(n-1) + c(2)x(n-2) + . . . + c(N-1)x(n-(N-1));

• Run the MAC at double frequency and read two 32-bit numbers.
• FIR filtering: two outputs in parallel.
• Two outputs = 4N reads, 2N MACs, 2 writes.

Direct Form Realization

[Figure: direct-form FIR filter. The input u[k] feeds a delay line producing u[k-1], u[k-2], u[k-3], u[k-4]; the taps are multiplied by b0 to b4 and summed by a chain of adders to give y[k].]

$$y[k] = b_0\, u[k] + b_1\, u[k-1] + \cdots + b_N\, u[k-N]$$

$$T_{Critical} = T_M + T_A\,(N-1), \qquad N = \text{number of taps}$$

$$T_{Clock} \ge T_{Critical}$$

Retiming FIR Filter Realizations

• Select a subgraph (shaded).
• Remove a delay element on all inbound arrows.
• Add a delay element on all outbound arrows.

[Figure: direct-form FIR filter with delay line u[k], u[k-1], ..., u[k-4], multipliers b0 to b4, adder chain, and output y[k]; the subgraph to be retimed is shaded.]

Retiming

[Figure: retimed FIR filter with input u[k], delayed samples u[k-1], u[k-2], u[k-3], multipliers b0 to b4, adders, and output y[k].]

Four Tap Direct Form Realization

[Figure: four-tap direct-form FIR filter with delay line u[k], u[k-1], u[k-2], u[k-3], multipliers b0 to b3, adders, and output y[k].]

$$y[k] = b_0\, u[k] + b_1\, u[k-1] + b_2\, u[k-2] + b_3\, u[k-3]$$

$$T_{Critical} = T_M + T_A \log(N), \qquad N = \text{number of taps}$$

$$T_{Clock} \ge T_{Critical}$$

Transposed Direct-Form Realization

[Figure: transposed direct-form FIR filter. The input u[k] is broadcast to multipliers b0 to b4; the products are accumulated through a chain of adders separated by delay elements, ending at y[k].]

$$y[k] = b_0\, u[k] + b_1\, u[k-1] + \cdots + b_N\, u[k-N]$$

$$T_{Critical} = T_M + T_A, \qquad N = \text{number of taps}$$

$$T_{Clock} \ge T_{Critical}$$
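To illustrate that the direct and transposed forms differ only in where the registers sit, the sketch below models both structures and compares their outputs. The function names and test coefficients are assumptions made for this example; delays are modelled as Python lists rather than clocked registers.

```python
def fir_direct(b, u):
    """Direct form: delay line on the input, then multiply and sum."""
    delay = [0] * len(b)
    out = []
    for u_k in u:
        delay = [u_k] + delay[:-1]  # shift the input delay line
        out.append(sum(b_i * d_i for b_i, d_i in zip(b, delay)))
    return out


def fir_transposed(b, u):
    """Transposed form: input broadcast to all multipliers, registers
    placed between the adders, so each adder output is registered."""
    state = [0] * (len(b) - 1)  # one register between adjacent adders
    out = []
    for u_k in u:
        products = [b_i * u_k for b_i in b]
        out.append(products[0] + (state[0] if state else 0))
        # Each register captures its tap product plus the next register.
        state = [
            products[i + 1] + (state[i + 1] if i + 1 < len(state) else 0)
            for i in range(len(state))
        ]
    return out


b = [1, -2, 3, 4, 5]        # illustrative coefficients b0..b4
u = [1, 2, 3, 4, 5, 6, 7]   # illustrative input samples
print(fir_direct(b, u) == fir_transposed(b, u))  # True
```

In the transposed form every path between registers contains one multiplier and one adder, which is why its critical path does not grow with the number of taps.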

Lattice Form Realizations

[Figure: two coupled tapped delay lines driven by u[k], u[k-1], u[k-2], u[k-3], u[k-4]. One line uses the coefficients in the order b0 to b4 and produces y[k]; the other uses them in reversed order and produces y~[k].]

FIR Filter Realizations

Lattice form

[Figure: lattice FIR filter with input u[k], four lattice stages with coefficients k0 to k3, output scaling b0, and outputs y[k] and y~[k].]

$$y[k] = b_0\, u[k] + b_1\, u[k-1] + \cdots + b_N\, u[k-N]$$

i.e. different software/hardware, same i/o behavior.

Efficient Direct Form Realization

[Figure: efficient direct-form realization with input u[k], multipliers b0 to b4, adders, and output y[k].]

Pin Diagram

[Figure: pin diagram of the 16-bit, 16-tap FIR filter block. Pins: x[0] ... x[15], a[0] ... a[15], Coeffin, Din, Clk, Reset, Vdd, Gnd, Drive, y[0] ... y[31].]

Synthesis using Synopsys Design Compiler

Initial target frequency: 100 MHz (typical)

Specifications

Input specifications:
• 16-bit unsigned integers for data inputs.
• 16-bit unsigned integers for coefficients.

Output specifications:
• 32-bit unsigned integer output.

System Components

• Memory
– Input and coefficient
• Control
– Mod-4 and mod-8 counters
– 3-to-8 decoder
– Combinational logic
• Multiplier
– Radix-8 Booth multiplier
– Multiplier register
• Adder
– 9-bit carry-save adder
– Adder register
• Output register

Specifications

Drive signal (output signal):
• Indicates that a new output is available.
• Inputs or coefficients are to be applied only when Drive is asserted.

Coefficients:
• Any coefficient change implies a new filter definition.
• The input memory is cleared, and new data must be entered.

Specifications

System clock:
• One clock cycle for the filter = 32 input clock pulses.
• One tap cycle = 8 input clock pulses, described as 8 phases.
• 4 such taps for each output.

System reset:
• Active high.

System Timing

[Timing diagram with the following rows:]
• Mod-8 counter states
• Input or coefficient memory enable
• Multiplier propagation delay
• Multiplier propagation delay
• Multiplier register enable
• Add register enable
• Output register enable

System Timing Strategy

• Two-phase clocking.
• Internal lower-frequency clocks are generated using mod-4 and mod-8 counters.
• Each state of the mod-4 counter is used for the computation of one filter tap.
• The output is available at the end of one cycle of the mod-4 counter.

2-Parallel FIR Filtering Structure

[Figure: 2-parallel FIR filter. Inputs x(2k) and x(2k+1) drive four subfilters (H0, H1, H0, H1); their outputs are combined by two adders, with a z^-2 delay (D) on one path, to produce y(2k) and y(2k+1).]

Hardware-Efficient 2-Parallel FIR Filter

$$Y_0 = X_0 H_0 + z^{-2} X_1 H_1$$

$$Y_1 = X_0 H_1 + X_1 H_0 = (H_0 + H_1)(X_0 + X_1) - H_0 X_0 - H_1 X_1$$

[Figure: hardware-efficient 2-parallel FIR filter. Inputs x(2k) and x(2k+1) drive three subfilters H0, H0+H1, and H1; adders and a z^-2 delay (D) combine their outputs into y(2k) and y(2k+1).]

Savings in the New Structure

Originally:
– 2N multiplications + 2(N-1) additions for two inputs

In the new structure:
– 3*(N/2) = 1.5N multiplications
– 3(N/2 - 1) + 4 = 1.5N + 1 additions
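The three-subfilter structure can be checked numerically. The sketch below is an assumption-laden software model (function names, list-based polyphase split, and sample values are all illustrative): it splits h and x into even and odd phases, forms Y1 from the (H0 + H1)(X0 + X1) product, applies the one-block delay to X1 H1 for Y0, and compares the interleaved result with plain convolution.

```python
def convolve(u, h):
    """Plain convolution, used both as subfilter and as reference."""
    y = [0] * (len(u) + len(h) - 1)
    for i, u_i in enumerate(u):
        for j, h_j in enumerate(h):
            y[i + j] += u_i * h_j
    return y


def fir_2parallel(x, h):
    """2-parallel FIR using three half-length subfilters instead of four."""
    if len(x) % 2 or len(h) % 2:
        raise ValueError("this sketch assumes even-length x and h")
    x0, x1 = x[0::2], x[1::2]  # X0 = x(2k), X1 = x(2k+1)
    h0, h1 = h[0::2], h[1::2]  # H0 = even taps, H1 = odd taps
    a = convolve(x0, h0)       # X0 H0
    b = convolve(x1, h1)       # X1 H1
    c = convolve([p + q for p, q in zip(x0, x1)],
                 [p + q for p, q in zip(h0, h1)])  # (X0+X1)(H0+H1)
    y1 = [c_k - a_k - b_k for c_k, a_k, b_k in zip(c, a, b)]
    # z^-2 at the sample rate is a one-block delay on X1 H1.
    y0 = [a_k + b_k for a_k, b_k in zip(a + [0], [0] + b)]
    y = []
    for k in range(len(y0)):   # interleave y(2k) and y(2k+1)
        y.append(y0[k])
        if k < len(y1):
            y.append(y1[k])
    return y


x = [1, 2, 3, 4, 5, 6]
h = [1, -1, 2, 3]
print(fir_2parallel(x, h) == convolve(x, h))  # True
```

Each of the three subfilters has N/2 taps, which is where the 1.5N multiplication count comes from.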

Design Flow: FIR 16-Tap Delay

[Figure: design-flow diagram]

Design steps:
• VHDL design entry
• Synthesis
• Floor planning
• Place & Route

Verification steps:
• Functional verification
• Timing verification
• Physical verification

Files exchanged between steps: EDIF, PDEF, SDF, parasitics.

http://www.synopsys.com/

The FIR Filter

In this implementation of the 16-tap FIR filter, the coefficients are represented as fixed-point 16-bit 2's complement numbers. It is assumed that either or both of the coefficients and data are fractional numbers.

FIR Filter (Critical Path)

In order to save area and improve the critical path performance, we decided to add the 12-bit sum and carry results of the multiplier during the accumulation operation. Therefore, the adder has to add three 12-bit numbers. To do that, the first stage of the adder is a 3-to-2 combiner, which is just a CSA (carry-save adder). The next stage is a CPA (carry-propagate adder) arranged in a static Manchester carry chain form. The chain is divided into four sections, each with three carry stages. Buffers are used between sections to reduce the overall delay.

Survey of Multipliers

Combinational multiplier: uses n adders and eliminates registers.

Multiplier Design

4×4 multiplication

                        X3    X2    X1    X0      multiplicand
                        Y3    Y2    Y1    Y0      multiplier
                        X3Y0  X2Y0  X1Y0  X0Y0
                  X3Y1  X2Y1  X1Y1  X0Y1          partial products (P.P.)
            X3Y2  X2Y2  X1Y2  X0Y2
      X3Y3  X2Y3  X1Y3  X0Y3
Z7    Z6    Z5    Z4    Z3    Z2    Z1    Z0      result

Radix-2 Unsigned Multiplication

Use a single n-bit adder, three registers (P, A, B), and a testing circuit for A_0.

Initialization: Place the unsigned numbers in registers A and B. Set P to zero.

1. If A_0 is 1, then register B, containing b_{n-1} b_{n-2} ... b_0, is added to P; otherwise 00...00 (nothing) is added to P. The sum is placed back into P.

2. Shift the register pair (P, A) one bit right. The last bit of A is shifted out (not used).

Steps 1 and 2 are repeated n times, after which the register pair (P, A) holds the 2n-bit product.
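The two steps above can be sketched as a small software model. This assumes the loop runs n times and that the carry-out of the n-bit adder is kept as an extra bit of P during the shift; the function name `shift_add_multiply` is illustrative.

```python
def shift_add_multiply(a, b, n):
    """Radix-2 unsigned multiply of two n-bit values using registers P, A, B."""
    mask = (1 << n) - 1
    if not (0 <= a <= mask and 0 <= b <= mask):
        raise ValueError("operands must fit in n bits")
    p, a_reg = 0, a
    for _ in range(n):
        # Step 1: test A_0 and conditionally add B to P.
        if a_reg & 1:
            p += b  # P may temporarily need n+1 bits (adder carry-out)
        # Step 2: shift the pair (P, A) right by one bit; the low bit of P
        # moves into the top of A, and the old A_0 is discarded.
        a_reg = (a_reg >> 1) | ((p & 1) << (n - 1))
        p >>= 1
    return (p << n) | a_reg  # (P, A) now holds the 2n-bit product


print(shift_add_multiply(6, 9, 4))  # 54
```

The 6 × 9 case matches the 0110 × 1001 example used for the array multiplier organization.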

Array Multiplier

• An array multiplier is an efficient layout of a combinational multiplier.
• Array multipliers may be pipelined to decrease the clock period at the expense of latency.

Array Multiplier Organization

          0 1 1 0    multiplicand
  x       1 0 0 1    multiplier
          0 1 1 0
  +     0 0 0 0
        0 0 1 1 0
  +   0 0 0 0
      0 0 0 1 1 0
  + 0 1 1 0
    0 1 1 0 1 1 0    product

The array is skewed to obtain a rectangular layout.

Unsigned Array Multiplier

[Figure: unsigned array multiplier. The first row forms x0y0, x1y0, x2y0, ..., xny0; each following row adds the partial products x0y1, x1y1, ... and x0y2, x1y2, ... with 0 fed into the unused adder inputs; the outputs are P0 ... P(2n-2), P(2n-1).]

Array Multiplier Organization

[Figure: array multiplier cell (a full adder FA with inputs Xi, Yi, Pin, Cin and outputs Pout, Cout) and the critical path through an array of N-1 partial-product rows and M-1 columns.]

t_mult ≈ (M-1) t_carry + (N-1) t_sum + t_and

For a small t_mult, both t_carry and t_sum matter, so it is beneficial to make t_carry = t_sum, for example with differential logic (DCVS).

Architecture of Array Multiplier

[Figure: 4×4 array multiplier with operand bits X3 ... X0 and Y0 ... Y3, AND gates (×), half adders (HA) and full adders, and product bits Z0 ... Z7.]

Advantages of Array Multiplier

Array multipliers:
– Partial product generation and accumulation are merged
– Identical cells
– High-rate pipelining

[Figure: 5×5 array multiplier with operand bits a4 ... a0 and x4 ... x0; the partial products a_i x_j are summed column by column into product bits p0 ... p9.]

Array Multiplier

– Array multiplier for unsigned numbers

[Figure: 5×5 unsigned array multiplier; rows of partial products a_i x_j, with 0 fed into the unused adder inputs, accumulate into product bits p9 ... p0.]

Array Multiplier for Two's Complement

• Type I cell
– ordinary full adder

• Type II cell
– x + y - z = 2c - s
  s = (x + y - z) mod 2
  c = [(x + y - z) + s] / 2
– type I cell with inverted z and s: z = 1 - z', s = 1 - s'

[Figure: type II cell with inputs x, y, z (z has weight -1) and outputs c, s.]

x  y  z  |  c  s
0  0  0  |  0  0
0  0  1  |  0  1
0  1  0  |  1  1
0  1  1  |  0  0
1  0  0  |  1  1
1  0  1  |  0  0
1  1  0  |  1  0
1  1  1  |  1  1
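The claim that a type II cell is an ordinary full adder with inverted z and s can be checked exhaustively. The sketch below is illustrative only (the function names are assumptions): it builds the type II cell from a full adder and confirms x + y - z = 2c - s for all eight input combinations.

```python
def full_adder(x, y, z):
    """Type I cell: returns (carry, sum) with x + y + z = 2*carry + sum."""
    total = x + y + z
    return total >> 1, total & 1


def type2_cell(x, y, z):
    """Type II cell built from a full adder with inverted z and s."""
    c, s_inverted = full_adder(x, y, 1 - z)  # feed z' = 1 - z
    return c, 1 - s_inverted                 # recover s = 1 - s'


for x in (0, 1):
    for y in (0, 1):
        for z in (0, 1):
            c, s = type2_cell(x, y, z)
            assert x + y - z == 2 * c - s
            print(x, y, z, "|", c, s)
```

The printed rows reproduce the truth table of the type II cell.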

Array Multiplier for Two's Complement

• Type II' cell:
– -x - y + z = -2c + s, that is, x + y - z = 2c - s
– identical to the type II cell

[Figure: type II' cell with inputs x, y, z and outputs c, s; weights of -1 and -2 are marked on the cell.]

Architecture of Carry-Save Multiplier

Carry-save multiplier:
• Carries propagate diagonally downwards instead of to the left.
• Requires an additional adder (the vector-merging adder).
• This final adder can be made very fast using a CLA or CSA scheme.

[Figure: 4×4 carry-save multiplier compared with a ripple-carry based multiplier.]

Architecture of Carry-Save Multiplier

[Figure: 4×4 carry-save multiplier showing the critical path and the vector-merging adder.]

t_mult = (N-1) t_carry + t_and + t_vma

where t_vma is the delay of the vector-merging adder.

Baugh-Wooley Multiplier

• Algorithm for two's-complement multiplication.
• Adjusts partial products to maximize the regularity of the multiplication array.
• Moves partial products with negative signs to the last steps; also adds the negation of partial products rather than subtracting.

Serial-Parallel Multiplier

• Used in serial-arithmetic operations.
• The multiplicand can be held in place by a register.
• The multiplier is shifted into the array.

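Serial Multiplier

[Figure: serial multiplier. X enters a serial-to-parallel register with reset; gates G1 and G2 feed a chain of N-1 full-adder stages (Ci, Co, S) with flip-flop delay elements; Y is applied serially. The result has M+N bits and takes M*N cycles.]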



[Figure: multiplier structure with inputs X and Y0, Y1, Y2, ..., Yn-1.]



[Figure: 4×4 multiplier array. Operand bits X3 ... X0 and Y0 ... Y3 form the partial products XiYj, which are summed into product bits P7 ... P0.]



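$$X = \sum_{i=0}^{m-1} X_i\, 2^i, \qquad Y = \sum_{j=0}^{n-1} Y_j\, 2^j$$

$$P = X \times Y = \left(\sum_{i=0}^{m-1} X_i\, 2^i\right)\left(\sum_{j=0}^{n-1} Y_j\, 2^j\right) = \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} X_i Y_j\, 2^{i+j} = \sum_{k=0}^{m+n-1} P_k\, 2^k$$

[Figure: multiplier cell (adder) with inputs Xi, Yi, Ci and outputs Pi+1, Ci+1.]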



The Architecture of the Booth Algorithm

The Booth multiplier:
– High-performance, low-power multiplier units are necessary in many situations, such as DSP systems.

Carry Save Addition

[Figure: 8-bit multiplier array with operand bits X7 ... X0 and Y0 ... Y7, built from rows of full adders (FA) in carry-save form, with a CLA adder as the final stage.]

Booth's Algorithm

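Booth Algorithm

1st order (radix-2):

$$XY = \sum_{i=0}^{n-1} (y_{i-1} - y_i)\, 2^{i}\, x$$

2nd order (radix-4):

$$XY = \sum_{i=0}^{n/2-1} (y_{2i-1} + y_{2i} - 2 y_{2i+1})\, 2^{2i}\, x$$

3rd order (radix-8):

$$XY = \sum_{i=0}^{n/3-1} (y_{3i-1} + y_{3i} + 2 y_{3i+1} - 4 y_{3i+2})\, 2^{3i}\, x$$

4th order (radix-16):

$$XY = \sum_{i=0}^{n/4-1} (y_{4i-1} + y_{4i} + 2 y_{4i+1} + 4 y_{4i+2} - 8 y_{4i+3})\, 2^{4i}\, x$$

with $y_{-1} = 0$ in every case.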

Booth Encoding

• Encode a number by taking groups of 3 bits, where each 3-bit group overlaps the next by 1 bit:

$$E_j = -2 B_{i+1} + B_i + B_{i-1}$$

• Consider a multiplier B with (n + 1) bits:
– Pad B with a 0 on the right to match the first term.
– If B has an odd number of bits, extend the sign: B_n B_n B_{n-1} ... B_0 0

Booth Multiplier

• Encoding scheme to reduce the number of stages in multiplication.
• Performs two bits of multiplication at once, so it requires half the stages.
• Each stage is slightly more complex than in a simple multiplier, but an adder/subtracter is almost as small and fast as an adder.

Booth Encoding

Two's-complement form of the multiplier:

$$y = -2^{n} y_{n} + 2^{n-1} y_{n-1} + 2^{n-2} y_{n-2} + \cdots$$

Rewrite using $2^{a} = 2^{a+1} - 2^{a}$:

$$y = 2^{n}(y_{n-1} - y_{n}) + 2^{n-1}(y_{n-2} - y_{n-1}) + 2^{n-2}(y_{n-3} - y_{n-2}) + \cdots$$

Consider the first two terms: by looking at three bits of y, we can determine whether to add 0, ±x, or ±2x to the partial product.

Booth Actions

y_i  y_{i-1}  y_{i-2}  |  increment
 0      0        0     |     0
 0      0        1     |     x
 0      1        0     |     x
 0      1        1     |    2x
 1      0        0     |   -2x
 1      0        1     |    -x
 1      1        0     |    -x
 1      1        1     |     0
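The action table can be exercised with a short behavioural sketch. It assumes an n-bit two's-complement multiplier with n even and an implicit bit of 0 to the right of the least significant bit; the names `BOOTH_ACTION`, `booth_radix4_digits`, and `booth_multiply` are illustrative and not taken from the design.

```python
# Maps (y_i, y_{i-1}, y_{i-2}) to the multiple of x that is added.
BOOTH_ACTION = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_radix4_digits(y, n):
    """Recode an n-bit two's-complement y into n/2 digits in {-2..2},
    least significant digit first."""
    if n % 2:
        raise ValueError("this sketch assumes an even bit width")
    if not -(1 << (n - 1)) <= y < (1 << (n - 1)):
        raise ValueError("y does not fit in n bits")
    bits = [(y >> i) & 1 for i in range(n)]  # two's-complement bits
    digits = []
    previous = 0  # implicit bit to the right of the LSB
    for i in range(0, n, 2):
        digits.append(BOOTH_ACTION[(bits[i + 1], bits[i], previous)])
        previous = bits[i + 1]  # groups overlap by one bit
    return digits


def booth_multiply(x, y, n):
    """Sum the selected multiples of x, each weighted by a power of 4."""
    return sum(d * x * 4 ** k for k, d in enumerate(booth_radix4_digits(y, n)))


print(booth_radix4_digits(11, 6))  # [-1, -1, 1]
print(booth_multiply(22, 11, 6))   # 242
```

Only n/2 partial products are formed, which is the "half the stages" saving; in hardware the multiples ±x and ±2x come from shifts and inversions rather than from multiplication.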

Booth Multiplier

[Figure: Booth multiplier block diagram. Operand bits x0 ... x8 pass through inverter/shift logic that produces x, 2x and their negatives; a Booth decoder driven by y0 ... y8 controls a selector; the selected partial products are summed by a Wallace tree followed by CLA adders.]

Array Multiplier Cell for Booth's Algorithm

[Figure: cell consisting of a MUX that selects among 0, (A)i, (-A)i, (2A)i, (-2A)i under control of select, feeding a full adder with inputs sin, cin and outputs sout, cout.]

Sign Extension Reduction

[Figure: four partial-product rows whose sign bits S0, S1, S2, S3 are each extended up to bit 7.]

The sign-extension bits sum to (modulo 2^8):

$$S_3(2^7 + 2^6) + S_2(2^7 + 2^6 + 2^5 + 2^4) + S_1(2^7 + \cdots + 2^2) + S_0(2^7 + \cdots + 2^0)$$

$$= S_3(2^7 + 2^7 - 2^6) + S_2(2^7 + 2^7 - 2^4) + S_1(2^7 + 2^7 - 2^2) + S_0(2^7 + 2^7 - 2^0)$$

$$= S_3(-2^6) + S_2(-2^4) + S_1(-2^2) + S_0(-2^0)$$

which is the 8-bit word

$$1\ \overline{S_3}\ 1\ \overline{S_2}\ 1\ \overline{S_1}\ 1\ \overline{S_0}\ +\ 1$$
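The reduction can be sanity-checked by brute force. The sketch below assumes an 8-bit field, four sign bits S3 ... S0 extended from bit positions 6, 4, 2 and 0, and that the reduced pattern uses the complemented sign bits (the overbars are a reconstruction, since they do not survive in the slide text). Function names are illustrative.

```python
def sign_extension_sum(s3, s2, s1, s0):
    """Sum of all replicated sign bits, modulo 2**8."""
    total = (
        s3 * sum(2 ** k for k in range(6, 8))
        + s2 * sum(2 ** k for k in range(4, 8))
        + s1 * sum(2 ** k for k in range(2, 8))
        + s0 * sum(2 ** k for k in range(0, 8))
    )
    return total % 256


def reduced_form(s3, s2, s1, s0):
    """Word 1 ~S3 1 ~S2 1 ~S1 1 ~S0, plus 1, modulo 2**8."""
    inv = lambda s: 1 - s
    word = (
        (1 << 7) | (inv(s3) << 6) | (1 << 5) | (inv(s2) << 4)
        | (1 << 3) | (inv(s1) << 2) | (1 << 1) | inv(s0)
    )
    return (word + 1) % 256


# Compare both forms for all 16 combinations of sign bits.
print(all(
    sign_extension_sum(a, b, c, d) == reduced_form(a, b, c, d)
    for a in (0, 1) for b in (0, 1) for c in (0, 1) for d in (0, 1)
))  # True
```

The reduced form needs one constant pattern and one inverted bit per row instead of a full run of replicated sign bits.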

Wallace Tree

• Reduces the depth of the adder chain.
• Built from carry-save adders:
– three inputs a, b, c
– produces two outputs y, z such that y + z = a + b + c
• Carry-save equations:
– y_i = parity(a_i, b_i, c_i)
– z_i = majority(a_i, b_i, c_i)
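A minimal bitwise sketch of one carry-save adder follows. One detail is made explicit here as an assumption about how the equations are meant: the majority (carry) bit of position i has weight 2^(i+1), so the carry word is shifted left by one before y + z = a + b + c holds. The function name is illustrative.

```python
def carry_save_add(a, b, c):
    """3-to-2 compressor on non-negative integers; returns (y, z)
    with y + z == a + b + c and no carry propagation between bits."""
    parity = a ^ b ^ c                        # per-bit sum
    majority = (a & b) | (a & c) | (b & c)    # per-bit carry
    return parity, majority << 1              # carry has twice the weight


y, z = carry_save_add(5, 6, 7)
print(y, z, y + z)  # 4 14 18
```

Because every bit position is computed independently, the delay of this stage is one full-adder delay regardless of word length; only the final adder propagates carries.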

Wallace Tree Structure

[Figure: Wallace tree structure]

7-bit Wallace Tree Addition

[Figure: 7-bit Wallace tree addition]

Wallace Tree Operation

• At each stage, i numbers are combined to form ceil(2i/3) sums.
• A final adder completes the summation.
• Wiring is more complex.
• A Booth-encoded Wallace tree multiplier can be built.
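The ceil(2i/3) rule gives the tree depth directly. The sketch below simply iterates the rule until two operands remain for the final adder; the function name `wallace_stages` is an assumption for this example, and the count is for the idealized rule rather than any specific layout.

```python
def wallace_stages(operands):
    """Number of carry-save stages needed to reduce `operands` numbers
    to the two that feed the final adder, using i -> ceil(2i/3)."""
    stages = 0
    while operands > 2:
        operands = (2 * operands + 2) // 3  # integer form of ceil(2i/3)
        stages += 1
    return stages


for n in (3, 4, 9, 16):
    print(n, wallace_stages(n))
```

The depth grows roughly logarithmically with the number of partial products, whereas a linear carry-save array needs a number of stages proportional to it.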

CSA vs. Wallace Tree

[Figure: six inputs (1 to 6) reduced to carry (C) and sum (S) outputs, once with a linear chain of full adders (CSA array) and once with full adders arranged as a Wallace tree.]

Radix-4 Modified Booth's Algorithm

A                      = 0 1 0 1 1 0   (22)
X                      = 0 0 1 0 1 1   (11)
Y (recoded multiplier) = 0 1 0 -1 0 -1

Partial-product rows (row layout lost in extraction):
1 1 0 0 1 0 1 0 1 0 0 1 0 1 0 1 1 1 0 1 1 0 1 0

Product = 0 0 1 1 1 1 0 0 1 0   (242)

Wallace-Tree

[Figure: inputs y0 ... y5 with carries Ci-1 in and Ci out, summed once by a linear chain of full adders (FA) and once by a Wallace tree of full adders, producing C and S.]

Collapse the chain of FAs for y0 to y5 (5 adder delays) into a Wallace tree (4 adder delays).

Floor Plan of Multiplier

1) Square floor plan

[Figure: square floor plan of a 4×4 multiplier with operands X3 ... X0 along one edge, Y0 ... Y3 along another, and outputs Z0 ... Z3 and Z4 ... Z7 on the remaining edges.]

Floor Plan of Multiplier

In the actual datapath:

[Figure: multiplier placement in the datapath with operands x and Y, LSB and MSB orientation marked, and candidate blocks M1, M2, or M3.]

Floor Plan

[Figure: chip floor plan with the following blocks: Adder, Add Reg, Out Reg, Multiplier, Multiplier Reg, Control Block, Coefficient Memory, Input Memory, Routing.]

Floor Planning

[Figure: floor planning result]

Results

Number of ports            34
Number of nets             157
Number of cells            32
Combinational area         24286.050781
Non-combinational area     14935.535156
Total area                 39221.585938

Power Consumption & Area

Cell internal power   = 419.5078 uW (57%)
Net switching power   = 315.0848 uW (43%)
Total dynamic power   = 734.5925 uW (100%)
Cell leakage power    = 248.1773 nW

Main Module

[Figure: main module]

Booth Multiplier

[Figure: Booth multiplier module]

Core Module

[Figure: core module]

Controller Module

[Figure: controller module]

Conclusion

• Good design experience.
• Using the parallel FIR filter realization reduced the number of multipliers and adders needed; therefore area shrank and power consumption was lowered.
• Timing strategies: using non-blocking assignments in Verilog reduced the number of states needed for the implementation.
• Partitioning the design into submodules made the design more manageable and easier to optimize.
• Performance optimization was reached with a slack time of +9.54.
