ELEC692 VLSI Signal Processing Architecture
Lecture 2: Pipelining and Parallel Processing
Techniques for improving performance
• Exploiting parallelism to improve performance
• Two ways
– Pipelining
  • Using pipeline latches to reduce the critical path delay
  • Can be exploited to increase the clock speed or sample speed, or to reduce power consumption at the same speed
– Parallel processing
  • Multiple outputs are computed in parallel in a clock period with parallel hardware
  • Effective sampling speed is increased by the level of parallelism
  • Can be used for the reduction of power consumption
Example of a 3-tap FIR filter
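$y(n) = A\,x(n) + B\,x(n-1) + C\,x(n-2)$

[Figure: direct-form 3-tap FIR filter. The input x(n) passes through two delay elements (D) to give x(n-1) and x(n-2); the three taps are multiplied by A, B and C and summed to produce y(n).]

Let $T_M$ be the delay of a multiplier and $T_A$ be the delay of an adder. The sampling period must satisfy

$T_{sample} \ge T_M + 2T_A$

so the sampling frequency is bounded by

$f_{sample} \le \frac{1}{T_M + 2T_A}$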
How can we improve the performance?
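Pipelining of FIR digital filter
• By adding latches

[Figure: original 3-tap FIR filter with input x(n), delayed samples x(n-1) and x(n-2), coefficients A, B, C and output y(n); critical path = $T_M + 2T_A$.]
[Figure: pipelined 3-tap FIR filter with two additional latches (D) and internal nodes labelled 1, 2 and 3; critical path = $T_M + T_A$.]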
Schedule of events in the pipeline
Clock   Input   Node 1          Node 2          Node 3    Output
0       x(0)    Ax(0)+Bx(-1)    -               -         -
1       x(1)    Ax(1)+Bx(0)     Ax(0)+Bx(-1)    Cx(-2)    y(0)
2       x(2)    Ax(2)+Bx(1)     Ax(1)+Bx(0)     Cx(-1)    y(1)
3       x(3)    Ax(3)+Bx(2)     Ax(2)+Bx(1)     Cx(0)     y(2)
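The schedule can be reproduced with a small clock-by-clock model. The sketch below is only an illustration: the function names `fir_direct` and `fir_pipelined`, the zero initial state, and the sample values are assumptions, not part of the lecture. It models node 1 as the combinational sum before the latch and nodes 2 and 3 as latch outputs, and checks that the pipelined filter yields the same samples as the direct form, one clock later.

```python
def fir_direct(x, A, B, C):
    """Reference: y(n) = A*x(n) + B*x(n-1) + C*x(n-2), with x(n) = 0 for n < 0."""
    def sample(n):
        return x[n] if n >= 0 else 0
    return [A * sample(n) + B * sample(n - 1) + C * sample(n - 2)
            for n in range(len(x))]


def fir_pipelined(x, A, B, C):
    """Clock-by-clock model of the pipelined filter; returns (outputs, schedule)."""
    x1 = x2 = 0          # delay line: x(n-1), x(n-2)
    node2 = node3 = 0    # pipeline latch outputs
    outputs, schedule = [], []
    for clk, xn in enumerate(list(x) + [0]):   # one extra clock flushes the pipeline
        node1 = A * xn + B * x1                # combinational sum before the latch
        y = node2 + node3                      # final adder uses latched values only
        schedule.append((clk, xn, node1, node2, node3, y))
        if clk >= 1:                           # the output at clock 0 is not yet valid
            outputs.append(y)
        # clock edge: update the pipeline latches, then the delay line
        node2, node3 = node1, C * x2
        x1, x2 = xn, x1
    return outputs, schedule


# Illustrative input and coefficients (assumed values).
outputs, schedule = fir_pipelined([1, 2, 3, 4], A=2, B=3, C=5)
reference = fir_direct([1, 2, 3, 4], A=2, B=3, C=5)
```

Each row of `schedule` has the same columns as the table: clock, input, node 1, node 2, node 3, output.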
Pipelining properties
• M-level pipelining needs M-1 more delay elements in any path from input to output
• Increase in speed with the following penalties
– Increase in system latency
– Increase in the number of latches
• Pipelining latches can only be placed across any feed-forward cutset of the graph (signal flow graph/DFG)
Cutset pipelining
• Cutset – a set of edges of a graph such that if these edges are removed from the graph, the graph becomes disjoint.
• Feed-forward cutset – a cutset in which the data move in the forward direction on all of its edges, e.g. the dotted line in the previous slide
• We can place latches on a feed-forward cutset of any FIR filter structure without affecting the functionality of the algorithm. Data move between the two disjoint sub-graphs only on the edges of the feed-forward cutset, so delaying or advancing the data movement along all edges of the cutset by the same amount of time does not change the behavior.
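[Figure: two sub-graphs SG1 and SG2 separated by a feed-forward cutset, with a delay D1 on each edge of the cutset.]
[Figure: a graph with nodes A1 to A6 and a feed-forward cutset marked.]
[Figure: the same graph with delays (D) on only some edges of the cutset. Not a valid pipelining.]
[Figure: the same graph with delays (D) on all edges of the cutset. Delays must be placed on all edges in the cutset. Critical path reduced to 2.]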
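The cutset rule can be checked mechanically. The sketch below is a minimal illustration under stated assumptions: the four-node graph is hypothetical (it is not the A1 to A6 graph of the slide), every node is assumed to take one time unit, and the helper names `crossing_edges`, `is_feed_forward_cutset` and `critical_path` are made up for this example. A cutset is described by the set of nodes on the input side (SG1); the crossing edges are feed-forward if they all leave SG1.

```python
from functools import lru_cache


def crossing_edges(edges, sg1):
    """Edges with exactly one end in SG1, i.e. the cutset induced by the partition."""
    return [(u, v) for u, v in edges if (u in sg1) != (v in sg1)]


def is_feed_forward_cutset(edges, sg1):
    """True if every crossing edge goes from SG1 to the other sub-graph."""
    cut = crossing_edges(edges, sg1)
    return bool(cut) and all(u in sg1 for u, _ in cut)


def critical_path(nodes, edges, delayed=(), node_time=1):
    """Longest delay-free path in an acyclic graph, counting node_time per node."""
    delayed = set(delayed)
    succ = {n: [] for n in nodes}
    for u, v in edges:
        if (u, v) not in delayed:      # a latch on an edge breaks the path there
            succ[u].append(v)

    @lru_cache(maxsize=None)
    def longest(n):
        return node_time + max((longest(m) for m in succ[n]), default=0)

    return max(longest(n) for n in nodes)


# Hypothetical graph: a -> b -> c -> d plus a bypass edge a -> c.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
cut = crossing_edges(edges, {"a", "b"})          # [("b", "c"), ("a", "c")]
before = critical_path(nodes, edges)             # 4 node delays
after = critical_path(nodes, edges, delayed=cut) # 2 node delays
```

Placing a latch on only one edge of the cut, for example `("b", "c")`, shortens the longest path to 3 but is not a valid pipelining, because the two paths into `c` are no longer delayed by the same amount.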
Data-broadcast structures
• The critical path of the original 3-tap FIR filter can be reduced without introducing pipelining latches by transposing the structure
• Transposition – reversing the direction of all the edges in a given SFG and interchanging the input and output ports preserves the functionality of the system.
[Figure: SFG of the FIR filter (left) and transposed SFG of the FIR filter (right). Each has input x(n), output y(n), two z^-1 edges and coefficient edges a, b, c.]
Data-broadcast structures
• Data-broadcasting structure based on transposed form where data are not stored but are broadcast to all the multipliers simultaneously.
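[Figure: data-broadcast 3-tap FIR filter. The input x(n) is broadcast to the multipliers C, B and A; two delay elements (D) sit on the adder path leading to y(n).]

Critical path delay = $T_M + T_A$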
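The equivalence of the data-broadcast form and the direct form can be checked numerically. The sketch below assumes zero initial state; the function names `fir_direct` and `fir_broadcast` and the test values are illustrative choices, not taken from the lecture. In the broadcast form the registers hold partial sums rather than past inputs, so each output needs only one multiplication and one addition after the last register.

```python
def fir_direct(x, A, B, C):
    """Direct form: y(n) = A*x(n) + B*x(n-1) + C*x(n-2), zero initial state."""
    x1 = x2 = 0                      # stored past inputs x(n-1), x(n-2)
    y = []
    for xn in x:
        y.append(A * xn + B * x1 + C * x2)   # one multiply, then two adds in series
        x1, x2 = xn, x1
    return y


def fir_broadcast(x, A, B, C):
    """Data-broadcast (transposed) form: x(n) feeds all three multipliers at once."""
    s1 = s2 = 0                      # registers on the adder path (partial sums)
    y = []
    for xn in x:
        y.append(A * xn + s1)        # one multiply and one add after the register
        s1, s2 = B * xn + s2, C * xn # next-state partial sums
    return y


# Illustrative values (assumed).
x = [1, 2, 3, 4]
y_direct = fir_direct(x, 2, 3, 5)
y_broadcast = fir_broadcast(x, 2, 3, 5)
```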
Fine-grain pipelining
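• Further break down the functional units by pipelining to increase performance.
• E.g. break down each multiplier into 2 smaller units, m1 and m2

[Figure: data-broadcast FIR filter with input x(n) and output y(n), in which each multiplier is split into m1 (6) and m2 (4) separated by a latch (D); the adders are labelled (2).]

Critical path delay = $T_{m2} + T_A$ = 4 + 2 = 6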
Parallel Processing
• Parallel processing and pipelining techniques are duals of each other
– Both exploit concurrency available in the computation
• Parallel processing – computed using duplicate hardware
A Parallel FIR System
• E.g. 3-tap FIR filter, single-input single-output (SISO) system
• $y(n) = a\,x(n) + b\,x(n-1) + c\,x(n-2)$
• A parallel system with 3 inputs per clock cycle, level of parallel processing L = 3.
– $y(3k) = a\,x(3k) + b\,x(3k-1) + c\,x(3k-2)$
– $y(3k+1) = a\,x(3k+1) + b\,x(3k) + c\,x(3k-1)$
– $y(3k+2) = a\,x(3k+2) + b\,x(3k+1) + c\,x(3k)$

[Figure: sequential system, a SISO block with input x(n) and output y(n).]
[Figure: 3-parallel system, a MIMO block with inputs x(3k), x(3k+1), x(3k+2) and outputs y(3k), y(3k+1), y(3k+2).]
A Parallel FIR System
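$T_{clk} \ge T_M + 2T_A$

$T_{iter} = T_{sample} = \frac{1}{L} T_{clk} \ge \frac{1}{3}(T_M + 2T_A)$

[Figure: 3-parallel FIR filter. The inputs x(3k+2), x(3k+1) and x(3k), together with two delay elements (D), feed three multiply-add rows with coefficients a, b, c that produce y(3k+2), y(3k+1) and y(3k).]

Parallel system: $T_{clk} \ne T_{sample}$
Pipelined system: $T_{clk} = T_{sample}$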
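The block equations of the 3-parallel filter can be sketched in a few lines. This is an illustration only: the function name `fir_3parallel`, the zero initial state and the sample values are assumptions. Each loop iteration stands for one clock period and produces three outputs, so the sampling period is one third of the clock period.

```python
def fir_3parallel(x, a, b, c):
    """3-parallel FIR: consume x(3k), x(3k+1), x(3k+2) per clock, emit three outputs."""
    assert len(x) % 3 == 0, "input length must be a multiple of L = 3"
    prev1 = prev2 = 0                    # x(3k-1) and x(3k-2), kept in the delay elements
    y = []
    for k in range(len(x) // 3):         # one iteration = one clock period
        x0, x1, x2 = x[3 * k: 3 * k + 3]
        y += [a * x0 + b * prev1 + c * prev2,   # y(3k)
              a * x1 + b * x0 + c * prev1,      # y(3k+1)
              a * x2 + b * x1 + c * x0]         # y(3k+2)
        prev1, prev2 = x2, x1            # become x(3k-1), x(3k-2) of the next block
    return y


# Illustrative values (assumed): six samples are processed in two clock periods.
y_par = fir_3parallel([1, 2, 3, 4, 5, 6], a=2, b=3, c=5)
```

The result matches what the sequential filter y(n) = a x(n) + b x(n-1) + c x(n-2) would produce in six clock periods.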
Complete parallel processing system
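[Figure: complete parallel processing system with L = 4. The serial input x(n) (sampling period = T/4) enters a serial-to-parallel converter (clock period = T/4), which delivers x(4k), x(4k+1), x(4k+2) and x(4k+3) to the MIMO system (clock period = T). The outputs y(4k), y(4k+1), y(4k+2) and y(4k+3) pass through a parallel-to-serial converter to give y(n).]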
When should we use parallel processing over pipelining?
• There is a fundamental limit to pipelining imposed by the input/output (I/O) bottlenecks.

[Figure: Chip 1 and Chip 2 connected through an output pad and an input pad; T_comm is the communication delay between the chips and T_computation is the computation delay inside a chip.]

• Communication bounded
– Communication time (input/output pad delay + wire delay) is larger than the computation delay.
– Pipelining can only be used to reduce the critical path computation delay.
– For a communication-bound system, this cannot help.
– So only parallel processing can be used to improve the performance.
– Further improvement can be achieved by combining pipelining and parallel processing.
Low Power Signal Processing
• Higher speed
• Low power
• Dynamic power consumption
  $P_{dyn} = C_{total} V_0^2 f$
• Propagation delay
  $T_{pd} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$
– $C_{charge}$: the capacitance to be charged/discharged
– $V_0$: supply voltage; $V_t$: threshold voltage
– $k$: technology parameter
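Pipelining for Low Power
• $P_{seq} = C_{total} V_0^2 f$
• After pipelining, the critical path is reduced, so we can use a lower supply voltage $V' = \beta V_0$. The new power is $P_{pip} = C_{total} \beta^2 V_0^2 f = \beta^2 P_{seq}$
• The power consumption reduction factor, $\beta$, can be found as follows:

[Figure: sequential critical path of length T_seq at supply V_0, and the pipelined critical path for M = 3, split into three segments labelled T_pipe, at supply βV_0.]

$T_{seq} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$

$T_{pipe} = \frac{(C_{charge}/M)\, \beta V_0}{k (\beta V_0 - V_t)^2}$

Setting $T_{seq} = T_{pipe}$ gives

$M (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$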
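The condition on β is a quadratic, so it can be solved in closed form. The sketch below is a minimal illustration: the function name `solve_beta` is made up, and the values M = 2, V0 = 5 V and Vt = 0.6 V are used only as sample inputs. Of the two roots, only the one that keeps the scaled supply βV0 above the threshold voltage is physically meaningful.

```python
import math


def solve_beta(M, v0, vt):
    """Solve M*(beta*v0 - vt)**2 == beta*(v0 - vt)**2 for the physically valid root."""
    # Expanded form: M*v0^2 * beta^2 - (2*M*v0*vt + (v0 - vt)^2) * beta + M*vt^2 = 0
    a = M * v0 ** 2
    b = -(2 * M * v0 * vt + (v0 - vt) ** 2)
    c = M * vt ** 2
    root = math.sqrt(b * b - 4 * a * c)
    candidates = [(-b + root) / (2 * a), (-b - root) / (2 * a)]
    # Keep only roots whose scaled supply stays above the threshold voltage.
    valid = [beta for beta in candidates if beta * v0 > vt]
    return max(valid)


# Sample inputs (assumed for illustration).
beta = solve_beta(M=2, v0=5.0, vt=0.6)
power_ratio = beta ** 2          # P_pip / P_seq
```

With these inputs `beta` comes out at about 0.603, so the pipelined filter would dissipate roughly 36% of the original power at the same sample rate.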
Example
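• Assume
– The capacitance of a multiplier, $C_M$, is 5 times that of an adder, $C_A$
– Fine-grain pipelining is used, with $C_{m1} = 3C_A$ and $C_{m2} = 2C_A$
– $V_{dd}$ = 5 V and $V_t$ = 0.6 V

[Figure: fine-grain pipelined filter with input x(n) and output y(n), multipliers split into m1 (6) and m2 (4) with latches (D) between them, and adders labelled (2).]
[Figure: original data-broadcast filter with input x(n), output y(n), multipliers C, B, A and two delay elements (D).]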
Solution
• For the original filter, $C_{charge} = C_M + C_A = 6C_A$
• For the pipelined filter, $C_{charge} = C_{m1} = C_{m2} + C_A = 3C_A$
• Now M = 2, so $2(5\beta - 0.6)^2 = \beta(5 - 0.6)^2$; solving this equation gives $\beta = 0.6033$
• The voltage of the pipelined filter is $V_{pipe} = \beta V_0 \approx 3$ V
• The power consumption ratio is $\beta^2 = 36.4\%$
Parallel Processing for low power
• In an L-parallel architecture, we can assume the charging capacitance remains the same, but the total capacitance (i.e. $C_{total}$) is increased L times.
• The clock frequency of the L-parallel architecture is reduced by a factor of L (i.e. $f = 1/(L \cdot T_{pd})$) to maintain the same sampling rate.
• The supply voltage can be reduced to $\beta V_0$, since more time is allowed to charge or discharge the same capacitance.
Parallel Processing for low power
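[Figure: sequential critical path of length T_seq at supply V_0, and the parallel critical paths for L = 3, each of length 3T_seq, at supply βV_0.]

$T_{seq} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$

$T_{parallel} = L\,T_{seq} = \frac{C_{charge}\, \beta V_0}{k (\beta V_0 - V_t)^2}$

$L (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$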
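Example: Reducing power by parallel processing
• Consider the following FIR filters

[Figure: sequential FIR filter with input x(n), three delay elements (D) and output y(n).]
[Figure: 2-parallel FIR filter with inputs x(2k) and x(2k+1), delay elements (D), and outputs y(2k) and y(2k+1).]

Assumptions:
– $C_M = 8C_A$
– $T_M = 8T_A$
– Both architectures operate at a sampling period of $9T_A$
– Supply voltage = 3.3 V and $V_t$ = 0.45 V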
Solution
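• Sequential: $C_{charge} = C_M + C_A = 9C_A$
• Parallel: $C_{charge} = C_M + 2C_A = 10C_A$

$T_{seq} = \frac{9 C_A V_0}{k (V_0 - V_t)^2}$, $T_{para} = \frac{10 C_A\, \beta V_0}{k (\beta V_0 - V_t)^2}$

$T_{seq} = T_{sample}$, $T_{para} = 2T_{seq} = 2T_{sample}$

$9 (\beta V_0 - V_t)^2 = 5\beta (V_0 - V_t)^2$

• Solving gives $\beta \approx 0.659$, so the power ratio is $\beta^2 \approx 0.434$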
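The same result can be cross-checked numerically without expanding the quadratic. The sketch below is an illustration under stated assumptions: the helper names `delay` and `solve_beta_bisect` are made up, capacitances are expressed in units of C_A, and the technology parameter k is set to 1 because it cancels in the comparison. It bisects on β until the parallel critical-path delay at supply βV0 equals L times the sequential delay at V0, using the fact that the delay falls monotonically as the supply rises above Vt.

```python
def delay(c_charge, v, vt, k=1.0):
    """Propagation delay T = C_charge * V / (k * (V - Vt)^2)."""
    return c_charge * v / (k * (v - vt) ** 2)


def solve_beta_bisect(c_seq, c_par, L, v0, vt, tol=1e-9):
    """Find beta with delay(c_par, beta*v0) == L * delay(c_seq, v0) by bisection."""
    target = L * delay(c_seq, v0, vt)
    lo, hi = vt / v0 + 1e-9, 1.0          # beta*v0 must stay above Vt
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if delay(c_par, mid * v0, vt) > target:
            lo = mid                       # too slow: the supply must be higher
        else:
            hi = mid                       # fast enough: try a lower supply
    return (lo + hi) / 2


# Values from the example: C_charge = 9 C_A (sequential), 10 C_A (2-parallel),
# V0 = 3.3 V, Vt = 0.45 V.
beta = solve_beta_bisect(c_seq=9, c_par=10, L=2, v0=3.3, vt=0.45)
power_ratio = beta ** 2
```

The bisection assumes the parallel circuit meets timing at the full supply (β = 1), which holds here because 10 C_A is less than 2 × 9 C_A.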