ELEC692 VLSI Signal Processing Architecture Lecture 2 Pipelining and Parallel Processing



Techniques for improving performance

• Exploit parallelism to improve performance
• Two ways
  – Pipelining
    • Uses pipeline latches to reduce the critical path delay
    • Can be exploited to increase the clock speed or sample speed, or to reduce power consumption at the same speed
  – Parallel processing
    • Multiple outputs are computed in parallel in a clock period using parallel hardware
    • The effective sampling speed increases with the level of parallelism
    • Can also be used to reduce power consumption


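Example of a 3-tap FIR filter

$$y(n) = A\,x(n) + B\,x(n-1) + C\,x(n-2)$$

[Figure: direct-form 3-tap FIR filter. The input x(n) passes through two delay elements D to give x(n-1) and x(n-2); the three samples are multiplied by A, B and C and summed to produce y(n).]

Let T_M be the delay of a multiplier and T_A be the delay of an adder. The sampling period must satisfy

$$T_{sample} \ge T_M + 2T_A$$

so the sampling frequency is bounded by

$$f_{sample} \le \frac{1}{T_M + 2T_A}$$

How can we improve the performance?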
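
As a reference point for the techniques that follow, the filter equation and the sampling-period bound can be sketched in Python. The coefficient values, the input samples and the delay values (T_M = 10, T_A = 2 time units) are assumptions chosen for illustration, and the generalisation to `taps - 1` adders in series assumes a linear adder chain.

```python
def fir_direct(x, coeffs):
    """Direct-form FIR: y(n) = sum_i coeffs[i] * x(n - i), with x(n) = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0
        for i, c in enumerate(coeffs):
            if n - i >= 0:
                acc += c * x[n - i]
        y.append(acc)
    return y


def min_sample_period(t_mult, t_add, taps=3):
    """Critical path of the direct form: one multiplier plus (taps - 1) adders in series."""
    return t_mult + (taps - 1) * t_add


A, B, C = 1, 2, 3                                   # hypothetical coefficients
x = [1, 0, 0, 4]                                    # hypothetical input samples
y = fir_direct(x, (A, B, C))
t_sample = min_sample_period(t_mult=10, t_add=2)    # assumed delays: T_M + 2*T_A
```

With these assumed delays, the unpipelined filter cannot be clocked faster than once every T_M + 2T_A time units.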


Pipelining of FIR digital filter

• By adding latches

[Figure: left, the original 3-tap FIR filter, critical path = T_M + 2T_A; right, the pipelined filter with two additional latches D and marked nodes 1, 2 and 3, critical path = T_M + T_A.]

Schedule of events in the pipeline

| Clock | Input | Node 1 | Node 2 | Node 3 | Output |
|---|---|---|---|---|---|
| 0 | x(0) | Ax(0)+Bx(-1) | - | - | - |
| 1 | x(1) | Ax(1)+Bx(0) | Ax(0)+Bx(-1) | Cx(-2) | y(0) |
| 2 | x(2) | Ax(2)+Bx(1) | Ax(1)+Bx(0) | Cx(-1) | y(1) |
| 3 | x(3) | Ax(3)+Bx(2) | Ax(2)+Bx(1) | Cx(0) | y(2) |
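
The schedule can be reproduced with a small clock-by-clock simulation. This is a sketch under assumptions: node 2 is taken to be node 1 delayed by one clock and node 3 to be C·x(n-2) delayed by one clock, as the table suggests; samples before n = 0 are taken as zero, so the entries shown as "-" in the table come out as 0 here. The coefficient and input values are hypothetical.

```python
def pipelined_fir_schedule(x, A, B, C, clocks):
    """Simulate the pipelined 3-tap FIR clock by clock (zero initial state)."""
    def sample(n):
        return x[n] if 0 <= n < len(x) else 0

    rows = []
    for n in range(clocks):
        node1 = A * sample(n) + B * sample(n - 1)        # first adder output
        node2 = A * sample(n - 1) + B * sample(n - 2)    # node 1, one clock later
        node3 = C * sample(n - 3)                        # C*x(n-2), one clock later
        rows.append((n, node1, node2, node3, node2 + node3))
    return rows


rows = pipelined_fir_schedule([1, 2, 3, 4], A=1, B=2, C=3, clocks=4)
outputs = [row[4] for row in rows]   # output at clock n equals y(n - 1)
```

The output column lags the unpipelined filter by one clock, which is the latency penalty of the extra latch stage.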


Pipelining properties

• M-level pipelining needs M-1 more delay elements in any path from input to output

• The increase in speed comes with the following penalties
  – Increase in system latency
  – Increase in the number of latches

• Pipelining latches can only be placed across any feed-forward cutset of the graph (signal flow graph/DFG)


Cutset pipelining

• Cutset – a set of edges of a graph such that if these edges are removed from the graph, the graph becomes disjoint.

• Feed-forward cutset – a cutset in which the data move in the forward direction on all of its edges, e.g. the dotted line in the previous slide.

• We can place latches on a feed-forward cutset of any FIR filter structure without affecting the functionality of the algorithm. Data move between the two disjoint sub-graphs only on the feed-forward cutset, so delaying or advancing the data movement along all edges of the cutset by the same amount of time does not change the behavior.

[Figure: sub-graphs SG1 and SG2 joined by a cutset, with a delay D1 on each cutset edge.]


Feed-forward cutset

[Figure: a six-node graph (A1 to A6) and two attempts to pipeline it. One attempt is labelled "Not a valid pipelining"; the other places delays on every edge of a feed-forward cutset.]

• Must place delays on all edges in the cutset

• Critical path reduced to 2
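
The cutset rule can be checked mechanically. The sketch below is a simplified test, assuming a graph with a single input node and a single output node: it removes the candidate edges, collects everything still connected to the input (SG1), and accepts the cutset only if the output is cut off and every removed edge points from SG1 to the other sub-graph. The node names and the FIR edge list are hypothetical labels for the 3-tap filter, not taken from the slides.

```python
from collections import defaultdict, deque


def is_feed_forward_cutset(edges, cut, source, sink):
    """Return True if `cut` separates source from sink and every cut edge points forward."""
    cut = set(cut)
    # Undirected adjacency of the graph with the cut edges removed.
    adj = defaultdict(set)
    for u, v in edges:
        if (u, v) not in cut:
            adj[u].add(v)
            adj[v].add(u)
    # SG1 = every node still connected to the source.
    sg1, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in sg1:
                sg1.add(nxt)
                queue.append(nxt)
    if sink in sg1:
        return False   # graph is not split: an edge of the cutset is missing
    # Every cut edge must leave SG1 and enter the other sub-graph.
    return all(u in sg1 and v not in sg1 for u, v in cut)


# Hypothetical edge list for the direct-form 3-tap FIR filter.
fir_edges = [("x", "mA"), ("x", "d1"), ("d1", "mB"), ("d1", "d2"), ("d2", "mC"),
             ("mA", "add1"), ("mB", "add1"), ("add1", "add2"), ("mC", "add2"),
             ("add2", "y")]
valid = is_feed_forward_cutset(fir_edges, [("add1", "add2"), ("mC", "add2")], "x", "y")
partial = is_feed_forward_cutset(fir_edges, [("add1", "add2")], "x", "y")
```

Latching only the `add1 -> add2` edge is rejected because the path through multiplier C still connects input and output, which is the situation the "not a valid pipelining" case illustrates.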


Data-broadcast structures

• The critical path of the original 3-tap FIR filter can be reduced without introducing pipelining latches by transposing the structure

• Transposition – reversing the direction of all the edges in a given SFG and interchanging the input and output ports preserves the functionality of the system.

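[Figure: SFG of the FIR (left) and transposed SFG of the FIR (right). Both use two z^-1 delay elements and coefficients a, b, c; in the transposed SFG the edge directions are reversed and x(n) and y(n) are interchanged.]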


Data-broadcast structures

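• The data-broadcast structure is based on the transposed form: the input data are not stored but are broadcast to all the multipliers simultaneously.

[Figure: data-broadcast 3-tap FIR filter. x(n) feeds the multipliers C, B and A directly, and the two delay elements D lie on the adder path to y(n).]

Critical path delay = T_M + T_A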
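
A quick way to convince oneself that the data-broadcast form computes the same filter is to run both structures on the same input. In this sketch the delay elements of the direct form hold past inputs, while those of the broadcast form hold partial sums; the variable names, coefficients and input samples are illustrative assumptions.

```python
def fir_direct(x, A, B, C):
    """Direct form: the delays store past inputs x(n-1) and x(n-2)."""
    x1 = x2 = 0
    y = []
    for xn in x:
        y.append(A * xn + B * x1 + C * x2)
        x2, x1 = x1, xn
    return y


def fir_broadcast(x, A, B, C):
    """Data-broadcast form: x(n) feeds every multiplier, the delays store partial sums."""
    s1 = s2 = 0
    y = []
    for xn in x:
        y.append(A * xn + s1)    # output path: one multiplier and one adder
        s1 = B * xn + s2         # read at the next clock as B*x(n-1) + C*x(n-2)
        s2 = C * xn
    return y


x = [1, 2, 3, 4]                           # hypothetical input
y_direct = fir_direct(x, 1, 2, 3)
y_broadcast = fir_broadcast(x, 1, 2, 3)
```

Each value written in one clock is only read in the next, so no path between registers contains more than one multiplier and one adder.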


Fine-grain pipelining

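• Further break down the functional units by pipelining to increase performance.

• E.g. break down each multiplier into 2 smaller units

[Figure: data-broadcast FIR filter with each multiplier split into two units m1 and m2 separated by a latch D. Annotated delays: (6) for each m1, (4) for each m2, (2) for each adder.]

Critical path delay = T_M2 + T_A = 4 + 2 = 6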
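
For comparison with the data-broadcast structure, the arithmetic can be written out. This assumes the figure annotations are delays in common time units, so that the undivided multiplier has $T_M = T_{m1} + T_{m2} = 10$ and the adder has $T_A = 2$; the stage containing only m1 must also fit in the clock period.

```latex
\begin{aligned}
T_{\text{data-broadcast}} &= T_M + T_A = 10 + 2 = 12 \\
T_{\text{fine-grain}} &= \max\left(T_{m1},\; T_{m2} + T_A\right) = \max(6,\; 4 + 2) = 6
\end{aligned}
```

Under these assumed values the two stages are balanced, so the clock period is halved.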


Parallel Processing

• Parallel processing and pipelining techniques are duals of each other
  – Both exploit the concurrency available in the computation

• Parallel processing – the computation is performed using duplicate hardware


A Parallel FIR System

• E.g. 3-tap FIR filter, single-input single-output (SISO) system
  – y(n) = a x(n) + b x(n-1) + c x(n-2)
• A parallel system with 3 inputs per clock cycle, i.e. level of parallel processing L = 3
  – y(3k) = a x(3k) + b x(3k-1) + c x(3k-2)
  – y(3k+1) = a x(3k+1) + b x(3k) + c x(3k-1)
  – y(3k+2) = a x(3k+2) + b x(3k+1) + c x(3k)

[Figure: sequential system, a SISO block with input x(n) and output y(n); 3-parallel system, a MIMO block with inputs x(3k), x(3k+1), x(3k+2) and outputs y(3k), y(3k+1), y(3k+2).]
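
The three block equations can be checked against the sequential filter with a short sketch. Each loop iteration stands for one clock of the MIMO system; the two state variables play the role of the block delays that carry x(3k+2) and x(3k+1) into the next clock. The function names, coefficients and input samples are assumptions made for illustration.

```python
def fir_sequential(x, a, b, c):
    """SISO reference: y(n) = a*x(n) + b*x(n-1) + c*x(n-2), zero initial state."""
    def get(n):
        return x[n] if n >= 0 else 0
    return [a * get(n) + b * get(n - 1) + c * get(n - 2) for n in range(len(x))]


def fir_3_parallel(x, a, b, c):
    """MIMO version: one loop iteration is one clock and produces three outputs."""
    assert len(x) % 3 == 0, "sketch assumes whole blocks of 3 samples"
    prev1 = prev2 = 0        # x(3k-1) and x(3k-2), held in the two delay elements
    y = []
    for k in range(len(x) // 3):
        x0, x1, x2 = x[3 * k], x[3 * k + 1], x[3 * k + 2]
        y.append(a * x0 + b * prev1 + c * prev2)   # y(3k)
        y.append(a * x1 + b * x0 + c * prev1)      # y(3k+1)
        y.append(a * x2 + b * x1 + c * x0)         # y(3k+2)
        prev1, prev2 = x2, x1                      # become x(3k-1), x(3k-2) next clock
    return y


x = [1, 2, 3, 4, 5, 6]                   # hypothetical input
y_seq = fir_sequential(x, 1, 2, 3)
y_par = fir_3_parallel(x, 1, 2, 3)
```

Note that a delay element in the parallel system delays by one clock, which corresponds to three sample periods.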


A Parallel FIR System

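$$T_{clk} \ge T_M + 2T_A$$

$$T_{iter} = T_{sample} = \frac{1}{L}\,T_{clk} \ge \frac{1}{3}\,(T_M + 2T_A)$$

[Figure: 3-parallel FIR filter. The inputs x(3k+2), x(3k+1) and x(3k) feed three rows of multipliers: a, b, c for y(3k+2); c, a, b for y(3k+1); b, c, a for y(3k). Two delay elements D supply the samples from the previous block.]

• Parallel system: T_clk ≠ T_sample

• Pipelined system: T_clk = T_sample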


Complete parallel processing system

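[Figure: x(n), with sampling period T/4, enters a serial-to-parallel converter that produces x(4k), x(4k+1), x(4k+2) and x(4k+3). These feed the MIMO system (clock period T), whose outputs y(4k), y(4k+1), y(4k+2) and y(4k+3) pass through a parallel-to-serial converter to give y(n). The figure also carries the label "Clock period = T/4" for the serial side.]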
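
The data movement around the MIMO block can be sketched as list operations. This models only the grouping and ungrouping of samples, not the clocks; the function names are hypothetical and the MIMO stage used in the usage line is a memoryless placeholder, chosen only to exercise the wrapper.

```python
def serial_to_parallel(samples, L=4):
    """Group a serial stream into blocks [x(Lk), ..., x(Lk+L-1)], one block per slow clock."""
    assert len(samples) % L == 0, "sketch assumes a whole number of blocks"
    return [samples[i:i + L] for i in range(0, len(samples), L)]


def parallel_to_serial(blocks):
    """Flatten blocks back into a serial stream, one sample per fast clock."""
    return [s for block in blocks for s in block]


def run_block_system(samples, mimo, L=4):
    """Serial-to-parallel converter -> MIMO system -> parallel-to-serial converter."""
    return parallel_to_serial([mimo(block) for block in serial_to_parallel(samples, L)])


# Hypothetical memoryless MIMO stage that doubles each sample.
out = run_block_system(list(range(8)), lambda block: [2 * s for s in block])
```

A real MIMO filter would also carry state from one block to the next, which this placeholder stage does not.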


When should we use parallel over pipeline processing?

• There is a fundamental limit to pipelining imposed by the input/output (I/O) bottlenecks.

[Figure: Chip 1 and Chip 2 connected through an output pad and an input pad, with T_communication marked on the chip-to-chip link and T_computation marked inside the chip.]

• Communication bounded
  – The communication time (input/output pad delay + wire delay) is larger than the computation delay.
  – Pipelining can only be used to reduce the critical path computation delay.
  – For a communication-bound system, this cannot help.
  – So only parallel processing can be used to improve the performance.
  – Further improvement can be achieved by combining pipelining and parallel processing.


Low Power Signal Processing

• Higher speed
• Low power
• Dynamic power consumption

$$P_{dyn} = C_{total}\,V_0^2\,f$$

• Propagation delay

$$T_{pd} = \frac{C_{charge}\,V_0}{k\,(V_0 - V_t)^2}$$

  – C_charge: the capacitance to be charged/discharged
  – V_0: supply voltage; V_t: threshold voltage
  – k: technology parameter
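
The two formulas pull in opposite directions when the supply voltage is lowered, which a small numeric sketch makes visible. The function names are hypothetical, and the values used (unit capacitance, unit frequency, k = 1, V_t = 0.6 V, supplies of 5 V and 3 V) are normalised assumptions for illustration only.

```python
def dynamic_power(c_total, v0, f):
    """P_dyn = C_total * V0^2 * f."""
    return c_total * v0 ** 2 * f


def propagation_delay(c_charge, v0, vt, k):
    """T_pd = C_charge * V0 / (k * (V0 - Vt)^2); only meaningful for V0 > Vt."""
    if v0 <= vt:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return c_charge * v0 / (k * (v0 - vt) ** 2)


# Normalised comparison of a 5 V and a 3 V supply (assumed values).
power_ratio = dynamic_power(1.0, 3.0, 1.0) / dynamic_power(1.0, 5.0, 1.0)
delay_ratio = propagation_delay(1.0, 3.0, 0.6, 1.0) / propagation_delay(1.0, 5.0, 0.6, 1.0)
```

Lowering the voltage cuts power quadratically but lengthens the propagation delay, so the slack gained from pipelining or parallel processing is what makes the lower voltage usable.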


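Pipelining for Low Power

• P_seq = C_total V_0² f

• After pipelining, the critical path is reduced, hence we can use a lower voltage V' = βV_0. The new power is

$$P_{pip} = C_{total}\,\beta^2 V_0^2\,f = \beta^2 P_{seq}$$

• The power consumption reduction factor, β, can be found as follows:

[Figure: sequential critical path of length T_seq at supply V_0; pipelined critical path (M = 3) split into three segments of length T_pipe at supply βV_0.]

$$T_{seq} = \frac{C_{charge}\,V_0}{k\,(V_0 - V_t)^2}$$

$$T_{pipe} = \frac{\dfrac{C_{charge}}{M}\,\beta V_0}{k\,(\beta V_0 - V_t)^2}$$

$$T_{pipe} = T_{seq} \;\Rightarrow\; M\,(\beta V_0 - V_t)^2 = \beta\,(V_0 - V_t)^2$$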
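
The condition $M(\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$ is a quadratic in $\beta$. Expanding it, as a worked step that the slide leaves implicit, gives:

```latex
\begin{aligned}
M\left(\beta^2 V_0^2 - 2\beta V_0 V_t + V_t^2\right) &= \beta\,(V_0 - V_t)^2 \\
M V_0^2\,\beta^2 - \left[\,2 M V_0 V_t + (V_0 - V_t)^2\,\right]\beta + M V_t^2 &= 0
\end{aligned}
```

The product of the two roots is $(V_t/V_0)^2$, so exactly one root satisfies $\beta V_0 > V_t$; that larger root is the physically meaningful one.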


Example

• Assume

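  – The capacitance of a multiplier, C_M, is 5 times that of an adder, C_A
  – Fine-grain pipelining is used, with C_m1 = 3 C_A and C_m2 = 2 C_A
  – V_dd = 5 V and V_t = 0.6 V

[Figure: the fine-grain pipelined FIR filter (multipliers split into m1 and m2, with annotations (6), (4) and (2)) and the original data-broadcast FIR filter with multipliers C, B, A.]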


Solution

• For the original filter, C_charge = C_M + C_A = 6 C_A

• For the pipelined filter, C_charge = C_m1 = C_m2 + C_A = 3 C_A

• Now M = 2, so we have 2(5β - 0.6)² = β(5 - 0.6)². Solving this equation gives β = 0.6033

• The voltage of the pipelined filter is V_pipe = βV_0 ≈ 3 V

• The power consumption ratio is β² = 36.4%
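
The numbers in this solution can be reproduced by solving the quadratic M(βV_0 - V_t)² = β(V_0 - V_t)² directly. The function name is hypothetical; the sketch keeps the larger root, since the smaller one would put the supply below the threshold voltage.

```python
import math


def pipelining_beta(m, v0, vt):
    """Solve M*(beta*V0 - Vt)^2 = beta*(V0 - Vt)^2 for the root with beta*V0 > Vt."""
    a = m * v0 ** 2
    b = -(2 * m * v0 * vt + (v0 - vt) ** 2)
    c = m * vt ** 2
    disc = math.sqrt(b * b - 4 * a * c)
    # The smaller root gives beta*V0 < Vt, so keep the larger one.
    return (-b + disc) / (2 * a)


beta = pipelining_beta(m=2, v0=5.0, vt=0.6)
v_pipe = beta * 5.0          # supply voltage of the pipelined filter
power_ratio = beta ** 2      # P_pip / P_seq
```

Here M = 2 reflects the halving of the charging capacitance from 6 C_A to 3 C_A.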


Parallel Processing for low power

• In an L-parallel architecture, we can assume the charging capacitance remains the same, but the total capacitance (i.e. C_total) is increased L times.

• The clock speed of the L-parallel architecture is reduced to 1/L of the original (i.e. f = 1/(L·T_pd)) to maintain the same sampling rate.

• The supply voltage can be reduced to βV_0, since more time is allowed to charge or discharge the same capacitance.


Parallel Processing for low power

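[Figure: sequential critical path of length T_seq at supply V_0; parallel critical paths when L = 3, three copies each allowed 3T_seq at supply βV_0.]

$$T_{seq} = \frac{C_{charge}\,V_0}{k\,(V_0 - V_t)^2}$$

$$T_{parallel} = L\,T_{seq} = \frac{C_{charge}\,\beta V_0}{k\,(\beta V_0 - V_t)^2}$$

$$L\,(\beta V_0 - V_t)^2 = \beta\,(V_0 - V_t)^2$$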
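
Combining the three assumptions made for the L-parallel architecture (total capacitance $L\,C_{total}$, clock frequency $f/L$, supply voltage $\beta V_0$) with the dynamic power formula gives the power of the parallel system. This step is a direct substitution rather than something stated on the slide:

```latex
P_{para} = (L\,C_{total})\,(\beta V_0)^2\,\frac{f}{L}
         = \beta^2\,C_{total}\,V_0^2\,f
         = \beta^2\,P_{seq}
```

The factor $L$ cancels, so the saving has the same form as in the pipelined case and depends only on how far the voltage can be lowered.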


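Example: Reduce Power by parallel processing

• Consider the following FIR filters

[Figure: a sequential FIR filter with input x(n), output y(n) and three delay elements D, and its 2-parallel version with inputs x(2k), x(2k+1) and outputs y(2k), y(2k+1).]

Assumptions:
- C_M = 8 C_A
- T_M = 8 T_A
- Both architectures operate at a sampling period of 9 T_A
- Supply voltage = 3.3 V and V_t = 0.45 V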


Solution

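• C_charge
  – Sequential: C_charge = C_M + C_A = 9 C_A
  – Parallel: C_charge = C_M + 2 C_A = 10 C_A

$$T_{seq} = \frac{9\,C_A\,V_0}{k\,(V_0 - V_t)^2}, \qquad T_{para} = \frac{10\,C_A\,\beta V_0}{k\,(\beta V_0 - V_t)^2}$$

$$T_{seq} = T_{sample}, \qquad T_{para} = 2\,T_{sample} = 2\,T_{seq}$$

$$9\,(\beta V_0 - V_t)^2 = 5\,\beta\,(V_0 - V_t)^2$$

• Solving gives β = 0.659

• Power ratio β² = 0.434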
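
As a cross-check, β can also be found numerically by matching delays instead of solving the quadratic by hand. The sketch below bisects on β until the parallel path at βV_0 takes exactly twice the sequential delay. It is an illustration under assumptions: k and C_A are normalised to 1 (they cancel), the function names are hypothetical, and the bracketing relies on the parallel path meeting the target at full voltage.

```python
def delay(c_charge, v, vt, k=1.0):
    """T_pd = C_charge * V / (k * (V - Vt)^2), in arbitrary units (k normalised to 1)."""
    return c_charge * v / (k * (v - vt) ** 2)


def solve_beta(c_seq, c_par, level, v0, vt, tol=1e-9):
    """Bisect for beta so that the delay at beta*V0 equals level * sequential delay."""
    target = level * delay(c_seq, v0, vt)
    lo, hi = vt / v0, 1.0      # delay grows without bound near lo, is below target at hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if delay(c_par, mid * v0, vt) > target:
            lo = mid           # still too slow: a higher voltage is needed
        else:
            hi = mid
    return 0.5 * (lo + hi)


beta = solve_beta(c_seq=9, c_par=10, level=2, v0=3.3, vt=0.45)
v_para = beta * 3.3            # supply voltage of the 2-parallel filter
power_ratio = beta ** 2        # P_para / P_seq
```

Bisection works here because the delay decreases monotonically as the voltage rises above V_t.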