ELEC692 VLSI Signal Processing Architecture Lecture 2 Pipelining and Parallel Processing


Techniques for improving performance

• Exploit parallelism to improve performance
• Two ways
– Pipelining
  • Uses pipeline latches to reduce the critical path delay
  • Can be exploited either to increase the clock speed (sample speed) or to reduce power consumption at the same speed
– Parallel processing
  • Multiple outputs are computed in parallel in a clock period with parallel hardware
  • The effective sampling speed increases with the level of parallelism
  • Can also be used to reduce power consumption

Example of a 3-tap FIR filter

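$y(n) = A\,x(n) + B\,x(n-1) + C\,x(n-2)$

[Figure: direct-form 3-tap FIR filter. The input x(n) passes through two delay elements (D) to give x(n-1) and x(n-2); the three taps are multiplied by A, B and C and summed to give y(n).]

Let $T_M$ be the delay of a multiplier and $T_A$ the delay of an adder. The sampling period is bounded by

$T_{sample} \ge T_M + 2T_A$

so the sampling frequency is bounded by

$f_{sample} \le \dfrac{1}{T_M + 2T_A}$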
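A minimal Python sketch of the filter equation and of the sampling-rate bound may help here. The function names, the zero initial state (x(n) = 0 for n < 0) and the unit-time values T_M = 10, T_A = 2 are assumptions made for illustration only.

```python
def fir3_direct(x, A, B, C):
    """y(n) = A*x(n) + B*x(n-1) + C*x(n-2), assuming x(n) = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        x1 = x[n - 1] if n >= 1 else 0   # x(n-1)
        x2 = x[n - 2] if n >= 2 else 0   # x(n-2)
        y.append(A * x[n] + B * x1 + C * x2)
    return y


def max_sample_rate(t_mult, t_add):
    """Upper bound on f_sample for the direct form: 1 / (T_M + 2*T_A)."""
    return 1.0 / (t_mult + 2 * t_add)


# Illustrative values only (not taken from this slide).
print(fir3_direct([1, 2, 3, 4], A=1, B=2, C=3))
print(max_sample_rate(t_mult=10, t_add=2))
```

One multiplier and two adders lie on the longest path, which is why the bound contains T_M once and T_A twice.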

How can we improve the performance??

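Pipelining of FIR digital filter

• By adding latches

[Figure: left, the original 3-tap FIR filter, critical path = $T_M + 2T_A$. Right, the pipelined filter with two additional latches (D) placed on a feed-forward cutset and internal nodes labelled 1, 2 and 3, critical path = $T_M + T_A$.]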

Schedule of events in the pipeline

Clock  Input  Node 1        Node 2        Node 3   Output
0      x(0)   Ax(0)+Bx(-1)  -             -        -
1      x(1)   Ax(1)+Bx(0)   Ax(0)+Bx(-1)  Cx(-2)   y(0)
2      x(2)   Ax(2)+Bx(1)   Ax(1)+Bx(0)   Cx(-1)   y(1)
3      x(3)   Ax(3)+Bx(2)   Ax(2)+Bx(1)   Cx(0)    y(2)
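The schedule can be reproduced with a small clock-by-clock simulation. This is only a sketch: it assumes one pipeline latch after the first adder and one on the C branch (which matches the node values in the table), zero initial state, and hypothetical coefficient values.

```python
def pipelined_fir3_schedule(x, A, B, C):
    """Return (clock, node1, node2, node3, output) for each clock cycle."""
    x1 = x2 = 0        # delay line: x(n-1), x(n-2), assumed zero before n = 0
    latch_ab = None    # pipeline latch after the first adder (feeds node 2)
    latch_c = None     # pipeline latch on the C branch (feeds node 3)
    schedule = []
    for clock, xn in enumerate(x):
        node1 = A * xn + B * x1
        node2 = latch_ab
        node3 = latch_c
        out = None if node2 is None else node2 + node3
        schedule.append((clock, node1, node2, node3, out))
        # Clock edge: all registers update together.
        latch_ab, latch_c = node1, C * x2
        x2, x1 = x1, xn
    return schedule


for row in pipelined_fir3_schedule([1, 2, 3, 4], A=1, B=2, C=3):
    print(row)
```

The output column is empty at clock 0 and carries y(n-1) at clock n, which is the one-cycle latency added by the pipeline.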

Pipelining properties

• M-level pipelining needs M-1 more delay elements in any path from input to output

• The increase in speed comes with the following penalties
– Increase in system latency
– Increase in the number of latches

• Pipelining latches can only be placed across any feed-forward cutset of the graph (signal flow graph/DFG)

Cutset pipelining

• Cutset – a set of edges of a graph such that if these edges are removed from the graph, the graph becomes disjoint.

• Feed-forward cutset – a cutset in which the data move in the forward direction on all the edges of the cutset, e.g. the dotted line in the previous slide.

• We can place latches on a feed-forward cutset of any FIR filter structure without affecting the functionality of the algorithm. Data move between the two disjoint sub-graphs only on the edges of the feed-forward cutset, so delaying or advancing the data movement along all edges of the cutset by the same amount of time does not change the behavior.
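The cutset rule can be sketched in Python as follows. The graph representation (a dict mapping each edge (u, v) to its number of delays), the function names and the two small example graphs are hypothetical; the cutset is taken to be the set of edges crossing between a chosen sub-graph `sg1` and the rest of the graph.

```python
def cutset(edges, sg1):
    """Edges with exactly one endpoint in sg1 (removing them separates sg1)."""
    return [e for e in edges if (e[0] in sg1) != (e[1] in sg1)]


def is_feed_forward(edges, sg1):
    """True if every cutset edge crosses in the same direction."""
    cut = cutset(edges, sg1)
    leaving = [u in sg1 for (u, _v) in cut]
    return bool(cut) and (all(leaving) or not any(leaving))


def pipeline_cutset(edges, sg1, extra=1):
    """Add `extra` delays to ALL edges of a feed-forward cutset."""
    if not is_feed_forward(edges, sg1):
        raise ValueError("not a feed-forward cutset")
    cut = set(cutset(edges, sg1))
    return {e: d + (extra if e in cut else 0) for e, d in edges.items()}


# Hypothetical feed-forward graph: n1 -> {n2, n3} -> n4
ff_edges = {("n1", "n2"): 0, ("n1", "n3"): 0, ("n2", "n4"): 0, ("n3", "n4"): 0}
print(pipeline_cutset(ff_edges, {"n1", "n2"}))

# Hypothetical graph with a feedback loop: no feed-forward cutset splits n1 from n2.
fb_edges = {("n1", "n2"): 0, ("n2", "n1"): 1}
print(is_feed_forward(fb_edges, {"n1"}))
```

Adding the delay to every crossing edge, rather than to some of them, is what keeps the data on all paths between the two sub-graphs aligned.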

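[Figure: a graph split by a cutset into sub-graphs SG1 and SG2, with a delay D1 added on each edge of the feed-forward cutset.]

[Figure: example graph with nodes A1 to A6 and delay elements D, shown three times: the original graph; a version with delays on only some of the cutset edges, labelled "Not a valid pipelining"; and a valid version with delays on every cutset edge. Delays must be placed on all edges in the cutset. With the valid cutset the critical path is reduced to 2.]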

Data-broadcast structures

• The critical path of the original 3-tap FIR filter can be reduced without introducing pipelining latches by transposing the structure

• Transposition – reversing the direction of all the edges in a given SFG and interchanging the input and output ports preserves the functionality of the system.

[Figure: SFG of the FIR filter (left) and transposed SFG of the FIR filter (right). Each has input x(n), output y(n), two $z^{-1}$ delays and coefficients a, b, c.]

Data-broadcast structures

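• The data-broadcast structure is based on the transposed form: the input data are not stored but are broadcast to all the multipliers simultaneously.

[Figure: data-broadcast 3-tap FIR filter. x(n) feeds the multipliers C, B and A directly; two delay elements (D) sit between the adders on the path to y(n).]

Critical path delay = $T_M + T_A$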
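A short Python sketch can confirm that the data-broadcast form computes the same output as the direct form. The function names, the zero initial state and the sample values are assumptions for illustration.

```python
def fir3_direct(x, a, b, c):
    """Direct form: delay elements store past inputs x(n-1), x(n-2)."""
    x1 = x2 = 0
    y = []
    for xn in x:
        y.append(a * xn + b * x1 + c * x2)
        x2, x1 = x1, xn
    return y


def fir3_broadcast(x, a, b, c):
    """Data-broadcast (transposed) form: delay elements store partial sums."""
    r1 = r2 = 0   # r1 = b*x(n-1) + c*x(n-2), r2 = c*x(n-1)
    y = []
    for xn in x:
        y.append(a * xn + r1)   # only one multiplier and one adder on this path
        r1 = b * xn + r2        # uses the old r2, as a register would
        r2 = c * xn
    return y


x = [1, 2, 3, 4, 5]
print(fir3_direct(x, 1, 2, 3))
print(fir3_broadcast(x, 1, 2, 3))
```

Each value computed in a cycle depends on the current input and on one register only, which is where the critical path of one multiplier plus one adder comes from.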

Fine-grain pipelining

• Break the functional units down further by pipelining to increase performance.

• E.g. break each multiplier down into 2 smaller units

[Figure: data-broadcast FIR filter in which each multiplier is split into a unit m1 (delay 6) and a unit m2 (delay 4) separated by a latch (D); each adder has delay 2.]

Critical path delay = $T_{M2} + T_A$ = 4 + 2 = 6
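The critical paths of the three structures can be compared with a small longest-path computation over the edges that carry no delay element. This is a sketch under assumptions: the node names and graph encoding are hypothetical, the unit delays 6, 4 and 2 are read from the figure annotations, and T_M = 10 is inferred as 6 + 4.

```python
from functools import lru_cache


def critical_path(node_time, edges):
    """Longest total computation time along any path with no delay element.

    node_time: {node: computation time}; edges: {(u, v): number of delays}.
    Assumes the zero-delay edges form no cycle.
    """
    succ = {u: [] for u in node_time}
    for (u, v), delays in edges.items():
        if delays == 0:
            succ[u].append(v)

    @lru_cache(maxsize=None)
    def longest_from(u):
        return node_time[u] + max((longest_from(v) for v in succ[u]), default=0)

    return max(longest_from(u) for u in node_time)


T_M, T_A = 10, 2   # assumed: T_M = 6 + 4 and T_A = 2 from the figure labels
ADDERS = {"add1": T_A, "add2": T_A}

coarse_nodes = {"mA": T_M, "mB": T_M, "mC": T_M, **ADDERS}
direct_edges = {("mA", "add1"): 0, ("mB", "add1"): 0,
                ("add1", "add2"): 0, ("mC", "add2"): 0}
broadcast_edges = {("mC", "add1"): 1, ("mB", "add1"): 0,
                   ("add1", "add2"): 1, ("mA", "add2"): 0}

fine_nodes = {f"m1{t}": 6 for t in "ABC"}
fine_nodes.update({f"m2{t}": 4 for t in "ABC"})
fine_nodes.update(ADDERS)
fine_edges = {(f"m1{t}", f"m2{t}"): 1 for t in "ABC"}
fine_edges.update({("m2C", "add1"): 1, ("m2B", "add1"): 0,
                   ("add1", "add2"): 1, ("m2A", "add2"): 0})

print(critical_path(coarse_nodes, direct_edges))     # T_M + 2*T_A
print(critical_path(coarse_nodes, broadcast_edges))  # T_M + T_A
print(critical_path(fine_nodes, fine_edges))         # max(6, 4 + 2)
```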

Parallel Processing

• Parallel processing and pipelining techniques are duals of each other
– Both exploit the concurrency available in the computation

• Parallel processing – independent computations are carried out using duplicate hardware

A Parallel FIR System

• E.g. 3-tap FIR filter, single-input single-output (SISO) system
• $y(n) = a\,x(n) + b\,x(n-1) + c\,x(n-2)$
• A parallel system with 3 inputs per clock cycle, i.e. level of parallel processing L = 3
– $y(3k) = a\,x(3k) + b\,x(3k-1) + c\,x(3k-2)$
– $y(3k+1) = a\,x(3k+1) + b\,x(3k) + c\,x(3k-1)$
– $y(3k+2) = a\,x(3k+2) + b\,x(3k+1) + c\,x(3k)$

[Figure: sequential system, a SISO block with input x(n) and output y(n). 3-parallel system, a MIMO block with inputs x(3k), x(3k+1), x(3k+2) and outputs y(3k), y(3k+1), y(3k+2).]
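The three block equations can be checked against the sequential filter with a short Python sketch. The function names, the zero initial state and the test data are assumptions for illustration; each loop iteration of the parallel version stands for one clock cycle.

```python
def fir3_sequential(x, a, b, c):
    """SISO system: one output per clock cycle."""
    x1 = x2 = 0
    y = []
    for xn in x:
        y.append(a * xn + b * x1 + c * x2)
        x2, x1 = x1, xn
    return y


def fir3_parallel3(x, a, b, c):
    """3-parallel MIMO system: three inputs in, three outputs out per clock."""
    if len(x) % 3 != 0:
        raise ValueError("input length must be a multiple of L = 3")
    xm1 = xm2 = 0   # x(3k-1), x(3k-2): held from the previous block
    y = []
    for k in range(len(x) // 3):
        x0, x1, x2 = x[3 * k: 3 * k + 3]      # x(3k), x(3k+1), x(3k+2)
        y0 = a * x0 + b * xm1 + c * xm2       # y(3k)
        y1 = a * x1 + b * x0 + c * xm1        # y(3k+1)
        y2 = a * x2 + b * x1 + c * x0         # y(3k+2)
        y.extend([y0, y1, y2])
        xm1, xm2 = x2, x1                     # one block delay = 3 sample delays
    return y


x = [1, 2, 3, 4, 5, 6]
print(fir3_sequential(x, 1, 2, 3))
print(fir3_parallel3(x, 1, 2, 3))
```

Note that a delay element in the parallel system holds a value for one clock cycle, which corresponds to three sample periods.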

A Parallel FIR System

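$T_{clk} \ge T_M + 2T_A$

$T_{iter} = T_{sample} = \dfrac{1}{L}\,T_{clk} \ge \dfrac{1}{3}\,(T_M + 2T_A)$

Parallel system: $T_{clk} \ne T_{sample}$

Pipelined system: $T_{clk} = T_{sample}$

[Figure: 3-parallel FIR filter. The inputs x(3k+2), x(3k+1) and x(3k), together with two delay elements (D), feed three sets of multipliers a, b, c whose sums form y(3k+2), y(3k+1) and y(3k).]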

Complete parallel processing system

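[Figure: complete parallel processing system with block size 4. A serial-to-parallel converter takes x(n) (sampling period = T/4) and delivers x(4k), x(4k+1), x(4k+2), x(4k+3) to the MIMO system (clock period = T); a parallel-to-serial converter (clock period = T/4) turns y(4k), y(4k+1), y(4k+2), y(4k+3) back into the serial output y(n).]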
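The data flow of the complete system can be sketched in Python as below. The function names and the `mimo` callable are hypothetical placeholders, and the example MIMO block is a trivial memoryless one chosen only to show the plumbing.

```python
def serial_to_parallel(samples, L):
    """Group a serial stream into blocks of L samples (one block per slow clock)."""
    if len(samples) % L != 0:
        raise ValueError("number of samples must be a multiple of L")
    return [tuple(samples[i:i + L]) for i in range(0, len(samples), L)]


def parallel_to_serial(blocks):
    """Flatten blocks of L outputs back into a serial stream."""
    return [s for block in blocks for s in block]


def run_block_system(x, L, mimo):
    """Serial input -> S/P converter -> MIMO system -> P/S converter -> serial output."""
    blocks = serial_to_parallel(x, L)
    return parallel_to_serial([mimo(block) for block in blocks])


# Hypothetical memoryless MIMO block: doubles every sample of the block.
print(run_block_system(list(range(8)), 4, lambda block: tuple(2 * v for v in block)))
```

The converters run once per sample (period T/4 in the figure), while `mimo` is called once per block (period T).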

When should we use parallel processing instead of pipelining?

• There is a fundamental limit to pipelining imposed by the input/output (I/O) bottlenecks.

[Figure: Chip 1 drives Chip 2 through an output pad and an input pad. The communication delay between the chips is $T_{comm}$; the computation delay inside a chip is $T_{computation}$.]

• Communication bounded
– Communication time (input/output pad delay + wire delay) is larger than the computation delay.
– Pipelining can only reduce the critical path computation delay.
– For a communication-bound system, pipelining therefore cannot help.
– So only parallel processing can be used to improve the performance.
– Further improvement can be achieved by combining pipelining and parallel processing.

Low Power Signal Processing

• Higher speed
• Low power
• Dynamic power consumption: $P_{dyn} = C_{total}\,V_0^2\,f$
• Propagation delay: $T_{pd} = \dfrac{C_{charge}\,V_0}{k\,(V_0 - V_t)^2}$
– $C_{charge}$: the capacitance to be charged/discharged
– $V_0$: supply voltage; $V_t$: threshold voltage
– $k$: technology parameter
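The two formulas, P_dyn = C_total·V0²·f and T_pd = C_charge·V0 / (k·(V0 − Vt)²), can be written as small Python helpers to show the trade-off they imply. The function names and the numeric values (normalized capacitance, k = 1) are assumptions for illustration only.

```python
def dynamic_power(c_total, v0, f):
    """P_dyn = C_total * V0^2 * f."""
    return c_total * v0 ** 2 * f


def propagation_delay(c_charge, v0, vt, k=1.0):
    """T_pd = C_charge * V0 / (k * (V0 - Vt)^2); only valid for V0 > Vt."""
    if v0 <= vt:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return c_charge * v0 / (k * (v0 - vt) ** 2)


# Illustrative comparison of two supply voltages with everything else fixed.
for v in (5.0, 3.0):
    print(v, dynamic_power(1.0, v, 1.0), propagation_delay(1.0, v, vt=0.6))
```

Lowering the supply voltage reduces the power quadratically but lengthens the propagation delay; pipelining and parallel processing are used to absorb that extra delay.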

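Pipelining for Low Power

• $P_{seq} = C_{total}\,V_0^2\,f$

• After pipelining, the critical path is reduced, hence we can use a lower supply voltage $V' = \beta V_0$. The new power is $P_{pip} = C_{total}\,\beta^2 V_0^2\,f = \beta^2 P_{seq}$

• The power consumption reduction factor, $\beta$, can be found from the following:

[Figure: sequential critical path of length $T_{seq}$ at supply $V_0$; pipelined critical path when M = 3, three stages of length $T_{pipe}$ each, at supply $\beta V_0$.]

$T_{seq} = \dfrac{C_{charge}\,V_0}{k\,(V_0 - V_t)^2}$

$T_{pipe} = \dfrac{\dfrac{C_{charge}}{M}\,\beta V_0}{k\,(\beta V_0 - V_t)^2}$

$T_{pipe} = T_{seq} \;\Rightarrow\; M\,(\beta V_0 - V_t)^2 = \beta\,(V_0 - V_t)^2$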

Example

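• Assume
– The capacitance of a multiplier, $C_M$, is 5 times that of an adder, $C_A$
– Fine-grain pipelining is used, with $C_{m1} = 3C_A$ and $C_{m2} = 2C_A$
– Supply voltage $V_0$ = 5 V and $V_t$ = 0.6 V

[Figure: the fine-grain pipelined data-broadcast FIR filter (multipliers split into m1 with delay 6 and m2 with delay 4, adders with delay 2) and the original data-broadcast FIR filter with multipliers C, B, A.]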

Solution

• For the original filter, $C_{charge} = C_M + C_A = 6C_A$

• For the pipelined filter, $C_{charge} = C_{m1} = C_{m2} + C_A = 3C_A$

• Now M = 2, so we have $2\,(5\beta - 0.6)^2 = \beta\,(5 - 0.6)^2$. Solving this equation gives $\beta = 0.6033$

• The voltage of the pipelined filter is $V_{pipe} = \beta V_0 \approx 3$ V

• The power consumption ratio is $\beta^2$ = 36.4%
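The numbers in this solution can be checked by solving the quadratic M·(β·V0 − Vt)² = β·(V0 − Vt)² directly, where β denotes the supply-voltage scaling factor. The function name below is hypothetical; the sketch expands the equation into a·β² + b·β + c = 0 and keeps the larger root, because the product of the two roots is (Vt/V0)², so the smaller root would put the scaled supply below the threshold voltage.

```python
import math


def pipelining_beta(m, v0, vt):
    """Solve M*(beta*V0 - Vt)^2 = beta*(V0 - Vt)^2 for the usable root."""
    a = m * v0 ** 2
    b = -(2 * m * v0 * vt + (v0 - vt) ** 2)
    c = m * vt ** 2
    disc = b * b - 4 * a * c
    return (-b + math.sqrt(disc)) / (2 * a)   # larger root: beta*V0 > Vt


beta = pipelining_beta(m=2, v0=5.0, vt=0.6)
print(round(beta, 4))          # voltage scaling factor
print(round(beta * 5.0, 2))    # scaled supply voltage in volts
print(round(beta ** 2, 3))     # power ratio P_pip / P_seq
```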

Parallel Processing for low power

• In an L-parallel architecture, we can assume that the charging capacitance remains the same, but the total capacitance (i.e. $C_{total}$) is increased L times.

• The clock speed of the L-parallel architecture is reduced to 1/L of the original (i.e. $f = 1/(L\,T_{pd})$) to maintain the same sampling rate.

• The supply voltage can be reduced to $\beta V_0$, since more time is allowed to charge or discharge the same capacitance.

Parallel Processing for low power

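[Figure: sequential critical path of length $T_{seq}$ at supply $V_0$; parallel critical path when L = 3, each of the three copies with length $3T_{seq}$ at supply $\beta V_0$.]

$T_{seq} = \dfrac{C_{charge}\,V_0}{k\,(V_0 - V_t)^2}$

$T_{parallel} = L\,T_{seq} = \dfrac{C_{charge}\,\beta V_0}{k\,(\beta V_0 - V_t)^2}$

$L\,(\beta V_0 - V_t)^2 = \beta\,(V_0 - V_t)^2$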

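Example: Reducing power by parallel processing

• Consider the following FIR filters

[Figure: a sequential FIR filter with input x(n), output y(n) and three delay elements (D), and its 2-parallel version with inputs x(2k), x(2k+1) and outputs y(2k), y(2k+1).]

Assumptions:
- $C_M = 8C_A$
- $T_M = 8T_A$
- Both architectures operate at a sampling period of $9T_A$
- Supply voltage $V_0$ = 3.3 V and $V_t$ = 0.45 V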

Solution

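• Sequential: $C_{charge} = C_M + C_A = 9C_A$

• Parallel: $C_{charge} = C_M + 2C_A = 10C_A$

• Propagation delays:

$T_{seq} = \dfrac{9C_A\,V_0}{k\,(V_0 - V_t)^2}$

$T_{para} = \dfrac{10C_A\,\beta V_0}{k\,(\beta V_0 - V_t)^2}$

• With $T_{seq} = T_{sample}$ and $T_{para} = 2T_{seq} = 2T_{sample}$:

$9\,(\beta V_0 - V_t)^2 = 5\beta\,(V_0 - V_t)^2$

• Solving gives $\beta \approx 0.659$, so the power ratio is $\beta^2 \approx 0.434$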
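This example can also be checked numerically, without expanding the quadratic, by searching for the supply scaling factor β at which the parallel critical path (charging 10·C_A) takes exactly twice the sequential delay (charging 9·C_A). The sketch below uses bisection, which works because the delay grows monotonically as the supply voltage drops toward Vt. The function names and the normalization C_A = 1, k = 1 are assumptions; both cancel out of the result.

```python
def delay(c_charge, v, vt, k=1.0):
    """T_pd = C_charge * V / (k * (V - Vt)^2), valid for V > Vt."""
    return c_charge * v / (k * (v - vt) ** 2)


def solve_beta(target_delay, c_charge, v0, vt, tol=1e-10):
    """Bisection for beta in (Vt/V0, 1] with delay(c_charge, beta*V0) = target."""
    lo, hi = vt / v0, 1.0          # the delay tends to infinity as beta*V0 -> Vt
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if delay(c_charge, mid * v0, vt) > target_delay:
            lo = mid               # too slow: a higher voltage is needed
        else:
            hi = mid               # fast enough: try a lower voltage
    return hi


V0, VT = 3.3, 0.45
t_seq = delay(9.0, V0, VT)                      # sequential, C_charge = 9*C_A
beta = solve_beta(2 * t_seq, 10.0, V0, VT)      # 2-parallel, C_charge = 10*C_A
print(round(beta, 3), round(beta ** 2, 3))      # scaling factor, power ratio
```

The power ratio equals β² because the total capacitance doubles while the clock frequency halves, leaving only the voltage term.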
