16 Tap FIR Filter
Omar F. Mousa / Chintan Daisa
Professor: Scott Wakefield
Design Objectives
• Store the 16 most recent input values and the 16 impulse-response coefficients in on-chip registers.
• Use a clocked architecture to synchronize input and output values.
• Reduce the number of multipliers and adders needed, in order to optimize area, power, and cost.
• Achieve the above without compromising speed.
Design Objectives
• Future scalability for input data bits as well as coefficient bits.
• Signed or unsigned input data as well as coefficients.
• Fast MAC operation on signed or unsigned data, with future scalability.
• Synchronization of input/output data.
• Configurable output precision.
Design Objectives
• 16 taps of delay line.
• 8 bits of input/output resolution.
• Burst mode of data transfer at the input, supporting 32 elements of the desired resolution in one burst.
• Main issues of concern when designing an FIR filter:
– Sharp response
– Number of taps
– Numerical precision
– Fully parallel
Advantages and Disadvantages
• Advantages:
– Always stable (assuming a non-recursive implementation).
– Quantization noise is not much of a problem.
– Transients have a finite duration.
• Disadvantages:
– A high-order filter is generally needed to satisfy the stated specification, so more coefficients are needed, with more storage and computation.
Review of discrete-time systems
• Linear time-invariant (LTI) systems
• Causal systems: for every input with x[k] = 0 for k < 0, the output satisfies y[k] = 0 for k < 0
• Impulse response:
– input 1, 0, 0, 0, ... -> output h[0], h[1], h[2], h[3], ...
– input x[0], x[1], x[2], x[3], ... -> output y[0], y[1], y[2], y[3], ...
• Convolution sum, with u[k] denoting the input: $y[k] = \sum_i u[i]\, h[k-i] = u[k] * h[k]$
[Figure: block diagram of a system with input x[k] and output y[k]]
Overview
• FIR filter equation: y[n] = x[n] * h[n], where * denotes convolution, n is the sample index, and the length of h[n] is the number of "taps" (coefficients) of the FIR filter.
• For a 16-tap FIR filter: $y[n] = a_0 x[n] + a_1 x[n-1] + a_2 x[n-2] + a_3 x[n-3] + \dots + a_{15} x[n-15]$
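The 16-tap sum above can be sketched in software as a register-style delay line followed by a multiply-accumulate loop. This is only an illustrative Python model, not the authors' hardware description; the function name `fir_direct` and the coefficient values are assumptions chosen for the example.

```python
def fir_direct(x, a):
    """Direct-form FIR: y[n] = sum_i a[i] * x[n-i], with x[n] = 0 for n < 0."""
    taps = len(a)
    delay_line = [0] * taps            # holds x[n], x[n-1], ..., x[n-taps+1]
    y = []
    for sample in x:
        # Shift the newest sample in, drop the oldest one.
        delay_line = [sample] + delay_line[:-1]
        # One multiply-accumulate per tap.
        y.append(sum(c * d for c, d in zip(a, delay_line)))
    return y

coeffs = list(range(1, 17))            # 16 illustrative coefficients a0..a15 (hypothetical values)
impulse = [1] + [0] * 19               # unit impulse followed by zeros
response = fir_direct(impulse, coeffs)
```

Feeding a unit impulse returns the coefficients themselves, followed by zeros, which is the finite impulse response the name refers to.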
Different Filter Representations
• Difference equation: $y[k] = \tfrac{1}{2}\, y[k-1] + \tfrac{1}{8}\, y[k-2] + x[k]$
– Recursive computation needs y[-1] and y[-2].
– For the filter to be LTI, y[-1] = 0 and y[-2] = 0.
• Transfer function (assumes an LTI system):
$Y(z) = \tfrac{1}{2} z^{-1} Y(z) + \tfrac{1}{8} z^{-2} Y(z) + X(z)$
$H(z) = \dfrac{Y(z)}{X(z)} = \dfrac{1}{1 - \tfrac{1}{2} z^{-1} - \tfrac{1}{8} z^{-2}}$
• Block diagram representation
[Figure: x[k] enters an adder producing y[k]; two unit delays give y[k-1] and y[k-2], which are fed back through gains 1/2 and 1/8]
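Discrete-Time Systems
• Z-transform:
$H(z) = \sum_i h[i]\, z^{-i}, \qquad U(z) = \sum_i u[i]\, z^{-i}, \qquad Y(z) = \sum_i y[i]\, z^{-i}$
• Example with a three-coefficient h and a four-sample input u:
$$\begin{bmatrix} y[0] \\ y[1] \\ y[2] \\ y[3] \\ y[4] \\ y[5] \end{bmatrix} = \begin{bmatrix} h[0] & 0 & 0 & 0 \\ h[1] & h[0] & 0 & 0 \\ h[2] & h[1] & h[0] & 0 \\ 0 & h[2] & h[1] & h[0] \\ 0 & 0 & h[2] & h[1] \\ 0 & 0 & 0 & h[2] \end{bmatrix} \begin{bmatrix} u[0] \\ u[1] \\ u[2] \\ u[3] \end{bmatrix}$$
• Multiplying both sides from the left by the row vector $[\,1 \;\; z^{-1} \;\; z^{-2} \;\cdots\; z^{-5}\,]$:
$$\underbrace{[\,1 \;\; z^{-1} \;\cdots\; z^{-5}\,] \begin{bmatrix} y[0] \\ \vdots \\ y[5] \end{bmatrix}}_{Y(z)} = H(z) \cdot \underbrace{[\,1 \;\; z^{-1} \;\; z^{-2} \;\; z^{-3}\,] \begin{bmatrix} u[0] \\ \vdots \\ u[3] \end{bmatrix}}_{U(z)}$$
• Hence $Y(z) = H(z) \cdot U(z)$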
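The relation Y(z) = H(z).U(z) says that filtering is polynomial multiplication in z^-1. The following Python sketch compares a polynomial product with the banded (Toeplitz) matrix product for a three-coefficient h and a four-sample u. The function names and the numeric values are assumptions made for this illustration.

```python
def convolve(h, u):
    """Coefficients of H(z) * U(z), i.e. the polynomial product in z^-1."""
    y = [0] * (len(h) + len(u) - 1)
    for i, hi in enumerate(h):
        for j, uj in enumerate(u):
            y[i + j] += hi * uj
    return y

def toeplitz_apply(h, u):
    """Row k of the banded matrix holds h[k - i] in column i (zero outside h)."""
    rows = len(h) + len(u) - 1
    y = []
    for k in range(rows):
        acc = 0
        for i in range(len(u)):
            if 0 <= k - i < len(h):
                acc += h[k - i] * u[i]
        y.append(acc)
    return y

h = [1, 2, 3]          # h[0], h[1], h[2] (illustrative values)
u = [4, 5, 6, 7]       # u[0] ... u[3]   (illustrative values)
y_poly = convolve(h, u)
y_matrix = toeplitz_apply(h, u)
```

Both routes give the same six output samples y[0] ... y[5].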
Discrete-Time Systems
• "Popular" frequency responses for filter design:
– low-pass (LP), high-pass (HP), band-pass (BP)
– band-stop, multi-band, ...
Digital Filter Specifications
• For example, the magnitude response $|G(e^{j\omega})|$ of a digital lowpass filter may be given as indicated below.
[Figure: magnitude response specification of a digital lowpass filter]
Structured Streams
• Hierarchical structures:
– Pipeline
– SplitJoin
– Feedback Loop
Different Strategies
• Map one filter per tile and run forever
• Pros:
– No filter swapping overhead
– Reduced memory traffic
– Localized communication
– Tighter latencies
– Smaller live data set
• Cons:
– Load balancing is critical
– Not good for dynamic behavior
– Requires # filters ≤ # processing elements
Discrete-Time Systems
• "FIR filters" (finite impulse response): $H(z) = \dfrac{B(z)}{z^N} = b_0 + b_1 z^{-1} + \dots + b_N z^{-N}$
• Moving average (MA) filters
• N poles at the origin z = 0 (hence guaranteed stability)
• N zeros (the zeros of B(z)): "all-zero" filters
• Corresponds to the difference equation $y[k] = b_0\, u[k] + b_1\, u[k-1] + \dots + b_N\, u[k-N]$
• Impulse response: $h[0] = b_0,\; h[1] = b_1,\; \dots,\; h[N] = b_N,\; h[N+1] = 0,\; \dots$
Speeding Up FIR Filter
• FIR speed-up
y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + ... + c(N-1)x(1-N);
y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + ... + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + ... + c(N-1)x(3-N);
...
y(n) = c(0)x(n) + c(1)x(n-1) + c(2)x(n-2) + ... + c(N-1)x(n-(N-1));
• Run the MAC at double frequency, read two 32-bit numbers
• FIR filtering: two outputs in parallel
• Two outputs = 4N reads, 2N MACs, 2 writes
Direct Form Realization
[Figure: direct-form realization; delay line u[k], u[k-1], u[k-2], u[k-3], u[k-4], multipliers b0 ... b4, adder chain producing y[k]]
$y[k] = b_0\, u[k] + b_1\, u[k-1] + \dots + b_N\, u[k-N]$
$T_{\text{Critical}} = T_M + (N-1)\, T_A$, where N is the number of taps
$T_{\text{Clock}} \ge T_{\text{Critical}}$
Retiming FIR Filter Realizations
• Select a subgraph (shaded).
• Remove a delay element on all inbound arrows.
• Add a delay element on all outbound arrows.
[Figure: direct-form filter with delay line u[k] ... u[k-4], multipliers b0 ... b4 and adder chain producing y[k]; the subgraph to be retimed is shaded]
Retiming
[Figure: retimed filter with signals u[k], u[k-1], u[k-2], u[k-3], multipliers b0 ... b4 and adders producing y[k]]
Four Tap Direct Form Realization
[Figure: four-tap direct-form realization; delay line u[k], u[k-1], u[k-2], u[k-3], multipliers b0 ... b3, adders producing y[k]]
$y[k] = b_0\, u[k] + b_1\, u[k-1] + b_2\, u[k-2] + b_3\, u[k-3]$
$T_{\text{Critical}} = T_M + T_A \log(N)$, where N is the number of taps
$T_{\text{Clock}} \ge T_{\text{Critical}}$
Transposed Direct-Form Realization
[Figure: transposed direct-form realization; u[k] is broadcast to multipliers b0 ... b4, and the delay elements sit between the adders that produce y[k]]
$y[k] = b_0\, u[k] + b_1\, u[k-1] + \dots + b_N\, u[k-N]$
$T_{\text{Critical}} = T_M + T_A$, where N is the number of taps
$T_{\text{Clock}} \ge T_{\text{Critical}}$
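To illustrate why the transposed form shortens the critical path while computing the same output, here is a small Python model in which each register holds a partial sum, so only one multiplication and one addition sit between registers. The function names, coefficients and input samples are assumptions for this sketch, not part of the original design.

```python
def fir_transposed(u, b):
    """Transposed direct form: the registers sit between the adders."""
    state = [0] * (len(b) - 1)         # state[i] feeds the adder of tap i
    y = []
    for sample in u:
        products = [c * sample for c in b]      # input is broadcast to all multipliers
        out = products[0] + (state[0] if state else 0)
        # Each register loads its tap product plus the next register's old value.
        new_state = []
        for i in range(len(state)):
            nxt = state[i + 1] if i + 1 < len(state) else 0
            new_state.append(products[i + 1] + nxt)
        state = new_state
        y.append(out)
    return y

def fir_direct_reference(u, b):
    """Plain evaluation of y[k] = sum_i b[i] * u[k-i], used as a cross-check."""
    return [sum(b[i] * u[k - i] for i in range(len(b)) if k - i >= 0)
            for k in range(len(u))]

b = [1, 2, 3, 4, 5]                    # b0 ... b4 (illustrative values)
u = [3, 1, 4, 1, 5, 9, 2, 6]           # illustrative input samples
y_t = fir_transposed(u, b)
y_d = fir_direct_reference(u, b)
```

The two output lists agree sample by sample; only the position of the delays differs.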
Lattice Form Realizations
[Figure: intermediate structure leading to the lattice form; delay line u[k] ... u[k-4], two rows of multipliers using coefficients b0 ... b4 in forward and reversed order, adders producing y[k] and the auxiliary output y~[k]]
FIR Filter Realizations
• Lattice form
[Figure: lattice realization with input u[k], cross-coupled stages using coefficients k0 ... k3, output multiplier b0, and outputs y[k] and y~[k]]
$y[k] = b_0\, u[k] + b_1\, u[k-1] + \dots + b_N\, u[k-N]$
i.e. different software/hardware, same input/output behavior
Efficient Direct Form Realization
• Efficient direct-form realization.
[Figure: efficient direct-form realization with input u[k], multipliers b0 ... b4, adders, and output y[k]]
Pin Diagram
[Figure: pin diagram of the 16-bit, 16-tap FIR filter block]
• Pins: Drive, y[0] ... y[31], x[0] ... x[15], a[0] ... a[15], Coeffin, Din, Clk, Reset, Vdd, Gnd
• Synthesis using Synopsys Design Compiler
• Initial target frequency: 100 MHz (typical)
Specifications
• Input specifications:
– 16-bit unsigned integers for data inputs.
– 16-bit unsigned integers for coefficients.
• Output specifications:
– 32-bit unsigned integer output.
System Components
• Memory: input and coefficient
• Control:
– Mod-4 and mod-8 counters
– 3-to-8 decoder
– Combinational logic
• Multiplier:
– Radix-8 Booth multiplier
– Multiplier register
• Adder:
– 9-bit carry-save adder
– Adder register
• Output register
Specifications
• Drive signal (output signal):
– Indicates that a new output is available.
– Inputs or coefficients are to be applied only when Drive is asserted.
• Coefficients:
– Any change to a coefficient implies a new filter definition.
– The input memory is cleared and new data must be entered.
Specifications
• System clock:
– One clock cycle for the filter = 32 input clock pulses.
– One tap cycle = 8 input clock pulses, described as 8 phases.
– 4 such taps for each output.
• System reset:
– Active high
System Timing
[Figure: timing diagram with the following traces]
• mod-8 counter states
• Input or coefficient memory enable
• Multiplier propagation delay (two traces)
• Multiplier register enable
• Add register enable
• Output register enable
System Timing Strategy
• Two-phase clocking
• Generation of internal lower-frequency clocks using mod-4 and mod-8 counters
• Each state of the mod-4 counter is used for the computation of one filter tap
• The output is available at the end of one cycle of the mod-4 counter
2-Parallel FIR Filtering Structure
[Figure: 2-parallel FIR structure; inputs x(2k) and x(2k+1) each pass through subfilters H0 and H1, the H1 branch fed by x(2k+1) is delayed by z^-2 (one block delay D), and adders form the outputs y(2k) and y(2k+1)]
Hardware-Efficient 2-Parallel FIR Filter
$Y_0 = X_0 H_0 + z^{-2} X_1 H_1$
$Y_1 = X_0 H_1 + X_1 H_0 = (H_0 + H_1)(X_0 + X_1) - H_0 X_0 - H_1 X_1$
[Figure: hardware-efficient 2-parallel structure; inputs x(2k) and x(2k+1), three subfilters H0, H0+H1 and H1, a z^-2 delay (D), and adders forming y(2k) and y(2k+1)]
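The identity for Y1 above trades one subfilter for a few extra additions. The Python sketch below checks it numerically: the even and odd phases of the input and of the coefficients are filtered with three half-length convolutions and compared against an ordinary full-rate convolution. The helper names and the sample values are assumptions for illustration, and the z^-2 delay appears as a one-sample shift because the phases run at half the sample rate.

```python
def conv(a, b):
    """Plain linear convolution of two lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def fast_two_parallel(x, h):
    """2-parallel FIR using three subfilters; x and h must have even length."""
    x0, x1 = x[0::2], x[1::2]          # x(2k), x(2k+1)
    h0, h1 = h[0::2], h[1::2]          # even and odd coefficients
    p00 = conv(h0, x0)                 # H0 X0
    p11 = conv(h1, x1)                 # H1 X1
    psum = conv([a + b for a, b in zip(h0, h1)],
                [a + b for a, b in zip(x0, x1)])   # (H0 + H1)(X0 + X1)
    # Y0 = H0 X0 + (one block delay) H1 X1
    y_even = [a + b for a, b in zip(p00 + [0], [0] + p11)]
    # Y1 = (H0 + H1)(X0 + X1) - H0 X0 - H1 X1
    y_odd = [s - a - b for s, a, b in zip(psum, p00, p11)]
    return y_even, y_odd

h = [1, 2, 3, 4]                       # illustrative coefficients
x = [5, 6, 7, 8, 9, 10]                # illustrative input block
y_even, y_odd = fast_two_parallel(x, h)
y_full = conv(h, x)
```

Interleaving `y_even` and `y_odd` reproduces `y_full`, while only three half-length subfilters were evaluated instead of four.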
Savings in the New Structure
• Originally:
– 2N multiplications + 2(N-1) additions for two inputs
• In the new structure:
– 3(N/2) = 1.5N multiplications
– 3(N/2 - 1) + 4 = 1.5N + 1 additions
Design Flow FIR 16 Tap Delay
[Figure: design-flow chart with the following steps]
• VHDL design entry
• Synthesis
• Floor planning
• Place & route
• Functional verification
• Timing verification
• Physical verification
• Files exchanged between steps: EDIF, PDEF, SDF, parasitics
The FIR Filter
In this implementation of the 16-tap FIR filter, the coefficients are represented as 16-bit fixed-point two's-complement numbers. It is assumed that either or both of the coefficients and the data are fractional numbers.
FIR Filter (Critical Path)
In order to save area and improve critical-path performance, we decided to add the 12-bit sum and carry results of the multiplier during the accumulation operation. The adder therefore has to add three 12-bit numbers. To do that, the first stage of the adder is a 3-to-2 combiner, which is simply a carry-save adder (CSA). The next stage is a carry-propagate adder (CPA) arranged as a static Manchester carry chain. The chain is divided into four sections of three carry stages each. Buffers are used between sections to reduce the overall delay.
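A bit-level sketch of the two adder stages described above: a 3-to-2 carry-save stage followed by a carry-propagate stage. This Python model only reproduces the arithmetic; it does not model the Manchester carry chain, its four sections, or the buffers. The names `WIDTH`, `carry_save_3to2`, `carry_propagate_add` and `add_three` are assumptions introduced for the example.

```python
WIDTH = 12
MASK = (1 << WIDTH) - 1

def carry_save_3to2(a, b, c):
    """3-to-2 combiner: three operands in, a sum vector and a carry vector out."""
    sum_vec = a ^ b ^ c                                  # bitwise parity
    carry_vec = ((a & b) | (a & c) | (b & c)) << 1       # bitwise majority, weight 2
    return sum_vec & MASK, carry_vec & MASK

def carry_propagate_add(a, b):
    """Ripple model of the CPA; the result is taken modulo 2**WIDTH."""
    result, carry = 0, 0
    for i in range(WIDTH):
        abit, bbit = (a >> i) & 1, (b >> i) & 1
        result |= (abit ^ bbit ^ carry) << i
        carry = (abit & bbit) | (carry & (abit ^ bbit))
    return result

def add_three(acc, mult_sum, mult_carry):
    """Accumulate the multiplier's sum and carry outputs in one pass."""
    s, c = carry_save_3to2(acc, mult_sum, mult_carry)
    return carry_propagate_add(s, c)

total = add_three(100, 200, 300)
wrapped = add_three(4000, 90, 10)      # exceeds 12 bits, so it wraps modulo 4096
```

Only the final stage propagates carries, which is why feeding the multiplier's sum and carry vectors directly into the accumulator removes one full carry-propagate addition from the path.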
Survey of Multiplier
• Combinational multiplier: uses n adders, eliminates registers.
Multiplier Design
4×4 multiplication
                    X3   X2   X1   X0   multiplicand
                    Y3   Y2   Y1   Y0   multiplier
                    X3Y0 X2Y0 X1Y0 X0Y0
               X3Y1 X2Y1 X1Y1 X0Y1      partial products (P.P.)
          X3Y2 X2Y2 X1Y2 X0Y2
     X3Y3 X2Y3 X1Y3 X0Y3
Z7   Z6   Z5   Z4   Z3   Z2   Z1   Z0   result
Radix-2 Unsigned Multiplication
• Use a single n-bit adder, three registers (P, A, B), and a testing circuit for A_0.
• Initialization: place the unsigned numbers in registers A and B. Set P to zero.
1. If A_0 is 1, then register B, containing b_{n-1} b_{n-2} ... b_0, is added to P; otherwise 00...00 (nothing) is added to P. The sum is placed back into P.
2. Shift the register pair (P, A) one bit right. The last bit of A is shifted out (not used).
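The two steps above can be sketched directly in Python, with integers standing in for the registers P, A and B. The function name `shift_add_multiply` and the test operands are assumptions for this illustration. Note that P briefly needs one extra bit to hold the adder's carry-out before the shift.

```python
def shift_add_multiply(a, b, n):
    """Multiply two unsigned n-bit numbers using registers P, A, B."""
    mask = (1 << n) - 1
    P, A, B = 0, a & mask, b & mask    # initialization: P = 0
    for _ in range(n):
        if A & 1:                      # step 1: test A0
            P = P + B                  # add B into P (carry-out kept in bit n)
        # Step 2: shift the pair (P, A) right by one bit.
        # The low bit of P enters the top of A; the low bit of A is dropped.
        A = (A >> 1) | ((P & 1) << (n - 1))
        P = P >> 1
    return (P << n) | A                # the 2n-bit product is held in (P, A)

product_6x9 = shift_add_multiply(6, 9, 4)
product_max = shift_add_multiply(15, 15, 4)
```

After n iterations the multiplier bits originally in A have all been shifted out and replaced by the low half of the product.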
Array Multiplier
• An array multiplier is an efficient layout of a combinational multiplier.
• Array multipliers may be pipelined to decrease the clock period at the expense of latency.
Array Multiplier Organization
        0 1 1 0   multiplicand
x       1 0 0 1   multiplier
        0 1 1 0
+     0 0 0 0
      0 0 1 1 0
+   0 0 0 0
    0 0 0 1 1 0
+ 0 1 1 0
  0 1 1 0 1 1 0   product
The array is skewed to obtain a rectangular layout.
Unsigned Array Multiplier
[Figure: unsigned array multiplier; the first row forms x0y0, x1y0, x2y0, ..., xny0, later rows add x0y1, x1y1, ... and x0y2, x1y2, ... with zero carry inputs, producing product bits P0 ... P(2n-2), P(2n-1)]
Array Multiplier Organization
[Figure: array multiplier cell built around a full adder (FA) with inputs Xi, Yi, Pin, Cin and outputs Pout, Cout; the critical path runs through M-1 cells in one direction and N-1 partial products in the other]
$t_{mult} \approx (M-1)\, t_{carry} + (N-1)\, t_{sum} + t_{and}$
• For small $t_{mult}$: $t_{carry} \approx t_{sum}$
• Beneficial to make $t_{carry} = t_{sum}$
• Differential logic (DCVS)
Architecture of Array Multiplier
[Figure: 4×4 array multiplier with inputs X3 ... X0 and Y0 ... Y3, AND gates, half adders (HA) and full adders, producing Z0 ... Z7]
Advantages of Array Multiplier
• Array multipliers:
– Partial product generation and accumulation are merged
– Identical cells
– High-rate pipelining
[Figure: 5×5 multiplication; partial products a_i x_j (i, j = 0 ... 4) arranged in columns and summed to give product bits p0 ... p9]
Array Multiplier
• Array multiplier for unsigned numbers
[Figure: 5×5 unsigned array multiplier; cells combine partial products a_i x_j row by row, with zero inputs on the first row, producing p0 ... p9]
Array Multiplier for Two's Complement
• Type I cell: ordinary full adder
• Type II cell: x + y - z = 2c - s
– s = (x + y - z) mod 2
– c = [(x + y - z) + s] / 2
– A type I cell with inverted z and s: z = 1 - z', s = 1 - s'
[Figure: type II cell with inputs x, y, z and outputs c, s; z has weight -1]
x y z | c s
0 0 0 | 0 0
0 0 1 | 0 1
0 1 0 | 1 1
0 1 1 | 0 0
1 0 0 | 1 1
1 0 1 | 0 0
1 1 0 | 1 0
1 1 1 | 1 1
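The claim that a type II cell is a full adder with z and s inverted can be checked exhaustively. The Python sketch below builds the cell that way and enumerates all eight input combinations. The function names are assumptions for this example.

```python
def full_adder(x, y, z):
    """Type I cell: x + y + z = 2c + s."""
    total = x + y + z
    return total >> 1, total & 1       # (c, s)

def type2_cell(x, y, z):
    """Type II cell built from a type I cell with z and s inverted."""
    c, s_inverted = full_adder(x, y, 1 - z)
    return c, 1 - s_inverted

# Rows of (x, y, z, c, s) in counting order.
table = [(x, y, z) + type2_cell(x, y, z)
         for x in (0, 1) for y in (0, 1) for z in (0, 1)]

# The defining relation of the type II cell must hold on every row.
relation_holds = all(x + y - z == 2 * c - s for x, y, z, c, s in table)
```

The relation follows from x + y + (1 - z) = 2c + (1 - s), which rearranges to x + y - z = 2c - s.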
Array Multiplier for Two's Complement
• Type II' cell:
– -x - y + z = -2c + s, which is equivalent to x + y - z = 2c - s
– Identical to the type II cell
[Figure: type II' cell with inputs x, y, z and outputs c, s; labelled weights -2 and -1]
Architecture of Carry-Save Multiplier
• Carry-save multiplier:
– Carry propagation runs diagonally downwards instead of to the left.
– Requires an additional adder (vector-merging adder).
– This final adder can be made very fast using a CLA or CSA scheme.
[Figure: 4×4 carry-save multiplier compared with a ripple-carry based multiplier]
Architecture of Carry-Save Multiplier
[Figure: 4×4 carry-save multiplier with its critical path and the vector-merging adder marked]
$t_{mult} = (N-1)\, t_{carry} + t_{and} + t_{vma}$
Baugh-Wooley Multiplier
• Algorithm for two's-complement multiplication.
• Adjusts partial products to maximize the regularity of the multiplication array.
• Moves partial products with negative signs to the last steps; also adds the negation of partial products rather than subtracting.
Serial-Parallel Multiplier
• Used in serial-arithmetic operations.
• The multiplicand can be held in place by a register.
• The multiplier is shifted into the array.
Serial-Parallel Multiplier
[Figure: serial multiplier; inputs X and Y pass through gates G1 and G2 into a full adder with carry in/out (Ci, Co) and a flip-flop delay element, followed by N-1 stages and a serial-to-parallel register with reset; M+N result bits, M*N cycles]
Serial-Parallel Multiplier
[Figure: serial-parallel multiplier with parallel inputs Y0, Y1, Y2, ..., Yn-1 and serial input X]
Serial-Parallel Multiplier
[Figure: 4×4 example with inputs X3 ... X0 and Y0 ... Y3; partial-product rows X3Y0 ... X0Y0, X3Y1 ... X0Y1, X3Y2 ... X0Y2, X3Y3 ... X0Y3 summed to give P7 ... P0]
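Serial-Parallel Multiplier
$X = \sum_{i=0}^{m-1} X_i\, 2^i, \qquad Y = \sum_{j=0}^{n-1} Y_j\, 2^j$
$P = X \cdot Y = \left( \sum_{i=0}^{m-1} X_i\, 2^i \right) \left( \sum_{j=0}^{n-1} Y_j\, 2^j \right) = \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} X_i Y_j\, 2^{i+j} = \sum_{k=0}^{m+n-1} P_k\, 2^k$
[Figure: adder cell with inputs Xi, Yi, carry in Ci, carry out Ci+1 and output Pi+1]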
The Architecture of the Booth Algorithm
• The Booth multiplier:
– High-performance, low-power multiplier units are necessary in many situations, such as DSP systems.
Carry Save Addition
[Figure: carry-save addition array with inputs X7 ... X0 and Y0 ... Y7, rows of full adders (FA), and a final CLA adder]
Booth's Algorithm
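Booth Algorithm
• 1st order (radix-2): $XY = \sum_{i=0}^{n-1} (y_{i-1} - y_i)\, 2^{i}\, x$
• 2nd order (radix-4): $XY = \sum_{i=0}^{n/2-1} (y_{2i-1} + y_{2i} - 2 y_{2i+1})\, 2^{2i}\, x$
• 3rd order (radix-8): $XY = \sum_{i=0}^{n/3-1} (y_{3i-1} + y_{3i} + 2 y_{3i+1} - 4 y_{3i+2})\, 2^{3i}\, x$
• 4th order (radix-16): $XY = \sum_{i=0}^{n/4-1} (y_{4i-1} + y_{4i} + 2 y_{4i+1} + 4 y_{4i+2} - 8 y_{4i+3})\, 2^{4i}\, x$
where $y_{-1} = 0$.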
Booth Encoding
• Encode a number by taking groups of 3 bits, where each 3-bit group overlaps the next by 1 bit.
• Consider a multiplier B with (n + 1) bits:
– Pad B with a 0 to match the first term.
– If B has an odd number of bits, then extend the sign: $B_n B_n B_{n-1} \dots B_0\, 0$
• Encoded digit: $E_j = -2 B_{i+1} + B_i + B_{i-1}$
Booth Multiplier
• Encoding scheme to reduce the number of stages in multiplication.
• Performs two bits of multiplication at once, so it requires half the stages.
• Each stage is slightly more complex than in a simple multiplier, but an adder/subtracter is almost as small and fast as an adder.
Booth Encoding
• Two's-complement form of the multiplier:
– $y = -2^n y_n + 2^{n-1} y_{n-1} + 2^{n-2} y_{n-2} + \dots$
• Rewrite using $2^a = 2^{a+1} - 2^a$:
– $y = 2^n (y_{n-1} - y_n) + 2^{n-1} (y_{n-2} - y_{n-1}) + 2^{n-2} (y_{n-3} - y_{n-2}) + \dots$
• Consider the first two terms: by looking at three bits of y, we can determine whether to add 0, x or 2x (or subtract x or 2x) to form the partial product.
Booth Actions
y_i  y_{i-1}  y_{i-2}  increment
0    0        0        0
0    0        1        x
0    1        0        x
0    1        1        2x
1    0        0        -2x
1    0        1        -x
1    1        0        -x
1    1        1        0
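The action table can be exercised with a short Python sketch of radix-4 Booth multiplication. It is a behavioural model only, not the hardware described in these slides; the names `BOOTH_ACTIONS` and `booth_radix4_multiply` are assumptions, and the three bits of each group are indexed here as (y_{i+1}, y_i, y_{i-1}) for even i, which corresponds to the table's columns shifted by one position.

```python
# Multiples of x selected by each overlapping 3-bit group (same rows as the table).
BOOTH_ACTIONS = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}

def booth_radix4_multiply(x, y, n):
    """Multiply x by an n-bit two's-complement multiplier y (n must be even)."""
    # Python integers behave like infinitely sign-extended two's complement,
    # so (y >> i) & 1 also yields the right bits for negative y.
    y_bits = [(y >> i) & 1 for i in range(n)]
    product = 0
    for i in range(0, n, 2):
        low = y_bits[i - 1] if i > 0 else 0        # padded 0 below the LSB
        group = (y_bits[i + 1], y_bits[i], low)
        product += BOOTH_ACTIONS[group] * x * (1 << i)
    return product

p1 = booth_radix4_multiply(22, 11, 6)
p2 = booth_radix4_multiply(7, -3, 4)
p3 = booth_radix4_multiply(-6, -8, 4)
```

Each loop iteration consumes two multiplier bits, so an n-bit multiplier needs n/2 partial products instead of n.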
Booth Multiplier
[Figure: Booth multiplier block diagram; multiplier bits y0 ... y8 feed Booth decoders, the multiplicand bits x0 ... x8 pass through inverter/shift logic and selectors choosing x or 2x, and the partial products are summed in a Wallace tree followed by CLA adders]
Array Multiplier Cell for Booth's Algorithm
[Figure: cell consisting of a MUX that selects among 0, (A)i, (-A)i, (2A)i, (-2A)i under control of select, followed by a full adder with inputs sin, cin and outputs sout, cout]
Sign Extension Reduction
[Figure: four partial-product rows with sign-extension bits; S0 repeated 8 times, S1 6 times, S2 4 times, S3 2 times]
Sum of the sign-extension bits:
$S_0 (2^0 + 2^1 + \dots + 2^7) + S_1 (2^2 + \dots + 2^7) + S_2 (2^4 + \dots + 2^7) + S_3 (2^6 + 2^7)$
$= S_0 (2^7 + 2^7 - 2^0) + S_1 (2^7 + 2^7 - 2^2) + S_2 (2^7 + 2^7 - 2^4) + S_3 (2^7 + 2^7 - 2^6)$
$= S_0 (-2^0) + S_1 (-2^2) + S_2 (-2^4) + S_3 (-2^6) \pmod{2^8}$
Resulting bit pattern, where $\bar{S}_k$ denotes the complement of $S_k$: $1\ \bar{S}_3\ 1\ \bar{S}_2\ 1\ \bar{S}_1\ 1\ \bar{S}_0$, plus 1
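The reduction can be checked by brute force over all sixteen combinations of the four sign bits. The Python sketch below assumes, as the arithmetic requires, that the sign bits in the compact pattern are complemented and that everything is taken modulo 2^8; the function names are hypothetical.

```python
from itertools import product as all_combinations

def extended_sign_sum(s):
    """Sum of the sign-extension bits of four rows, modulo 2**8."""
    total = 0
    for row, bit in enumerate(s):
        # Row k carries its sign bit from position 2k up to position 7.
        total += bit * sum(1 << e for e in range(2 * row, 8))
    return total % 256

def reduced_form(s):
    """Pattern 1 ~S3 1 ~S2 1 ~S1 1 ~S0, plus 1, modulo 2**8."""
    value = 1                                  # the trailing "+1"
    for row, bit in enumerate(s):
        value += (1 - bit) << (2 * row)        # complemented sign bit
        value += 1 << (2 * row + 1)            # constant 1 to its left
    return value % 256

matches = all(extended_sign_sum(s) == reduced_form(s)
              for s in all_combinations((0, 1), repeat=4))
```

With the reduced form, each row contributes only two extra bits instead of a full run of repeated sign bits.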
Wallace Tree
• Reduces the depth of the adder chain.
• Built from carry-save adders:
– three inputs a, b, c
– produces two outputs y, z such that y + z = a + b + c
• Carry-save equations:
– y_i = parity(a_i, b_i, c_i)
– z_i = majority(a_i, b_i, c_i), which carries the weight of the next bit position
Wallace Tree Structure
7-bit Wallace Tree Addition
Wallace Tree Operation
• At each stage, i numbers are combined to form ceil(2i/3) sums.
• A final adder completes the summation.
• Wiring is more complex.
• A Booth-encoded Wallace tree multiplier can be built.
CSA vs. Wallace Tree
[Figure: six operands (1 ... 6) reduced by a linear chain of full adders (FA) on one side and by a Wallace tree of full adders on the other, each producing carry C and sum S]
Radix-4 Modified Booth's Algorithm
A = 0 1 0 1 1 0 (22)
X = 0 0 1 0 1 1 (11)
Y (recoded multiplier) = 0 1 0 1 0 1
1 1 0 0 1 0 1 0 1 0 0 1 0 1 0 1 1 1 0 1 1 0 1 0 0 0 1 1 1 1 0 0 1 0
Wallace-Tree
[Figure: inputs y0 ... y5 summed by a chain of full adders (left) and by a Wallace tree of full adders (right), with carries C_{i-1} in and C_i out, producing C and S]
• Collapse the chain of FAs for y0 to y5 (5 adder delays) into a Wallace tree (4 adder delays).
Floor Plan of Multiplier
1) Square floor plan
[Figure: square floor plan of a 4×4 multiplier with inputs X3 ... X0 and Y0 ... Y3 and outputs Z0 ... Z3 and Z4 ... Z7]
Floor Plan of Multiplier
• In the actual datapath
[Figure: multiplier placement in the datapath with inputs x and Y, LSB and MSB positions marked, and blocks M1, M2 or M3]
Floor Plan
• Adder
• Add Reg
• Out Reg
• Multiplier
• Multiplier Reg
• Control Block
• Coefficient Memory
• Input Memory
• Routing
Floor Planning
Results
Cell
Number of Ports          34
Number of Nets           157
Number of Cells          32
Combinational Area       24286.050781
Non-Combinational Area   14935.535156
Total Area               39221.585938
Power Consumption & Area
• Cell Internal Power = 419.5078 uW (57%)
• Net Switching Power = 315.0848 uW (43%)
• Total Dynamic Power = 734.5925 uW (100%)
• Cell Leakage Power = 248.1773 nW
Main Module
Booth Multiplier
Core Module
Controller Module
Conclusion
• Good design experience.
• Using a parallel FIR filter realization reduced the number of multipliers and adders needed; as a result, area shrank and power consumption was lowered.
• Timing strategies: using non-blocking assignments in Verilog reduced the number of states needed for the implementation.
• Partitioning the design into submodules made the design more manageable and better optimized.
• Performance optimization was reached with a slack time of +9.54.