Pipe Parallel

  • 8/13/2019 Pipe Parallel

    Pipelining and Parallel Processing

    Introduction (1)

    The pipelining transformation leads to a reduction in the critical path, which can be exploited either to increase the clock speed (sample speed) or to reduce power consumption at the same speed.

    In parallel processing, multiple outputs are computed in parallel in one clock period. Therefore, the effective sampling speed is increased by the level of parallelism.

    Introduction (2)

    3-tap FIR digital filter

    y(n) = ax(n)+bx(n-1)+cx(n-2)

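    The critical path of this direct-form filter is one multiplication followed by two additions, where $T_M$ and $T_A$ denote the computation times of a multiplier and an adder:

    Sample period: $T_{sample} \ge T_M + 2T_A$

    Sampling frequency: $f_{sample} \le \frac{1}{T_M + 2T_A}$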
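    A minimal Python sketch of the direct-form 3-tap FIR above, assuming zero initial conditions ($x(n) = 0$ for $n < 0$). The helper names `fir3_direct` and `min_sample_period` are illustrative, not from the slides; the second function simply evaluates the critical path $T_M + 2T_A$.

```python
def fir3_direct(x, a, b, c):
    """Direct-form 3-tap FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2).

    Assumes zero initial conditions, i.e. x(n) = 0 for n < 0.
    """
    x1 = x2 = 0  # contents of the two delay elements: x(n-1), x(n-2)
    y = []
    for xn in x:
        y.append(a * xn + b * x1 + c * x2)
        x1, x2 = xn, x1  # shift the delay line
    return y


def min_sample_period(t_m, t_a):
    """Critical path of the direct form: one multiplier plus two adders."""
    return t_m + 2 * t_a


print(fir3_direct([1, 0, 0, 0], 2, 3, 4))  # [2, 3, 4, 0]: the impulse response is the coefficients
print(min_sample_period(10, 2))            # 14
```

    With $T_M = 10$ and $T_A = 2$ time units, the sample period cannot be shorter than 14 units.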

    Pipelining of FIR digital filter

    A pipelined implementation of the 3-tap FIR filter is obtained by introducing 2 additional latches.

    The critical path is reduced from $T_M + 2T_A$ to $T_M + T_A$.

    The two main drawbacks of pipelining are the increase in the number of latches and the increase in system latency.
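    The cycle-level Python model below illustrates this trade-off under an assumed latch placement: the two latches sit on the feed-forward cutset in front of the final adder, one on the edge carrying $ax(n) + bx(n-1)$ and one on the edge carrying $cx(n-2)$. Each stage then contains at most one multiplier and one adder, and every output appears one clock cycle later. The function name `fir3_pipelined` is hypothetical.

```python
def fir3_pipelined(x, a, b, c):
    """Cycle-level model of a 2-stage pipelined 3-tap FIR (one cycle of latency).

    Assumed latch placement: one latch on the edge carrying a*x(n) + b*x(n-1)
    and one on the edge carrying c*x(n-2), i.e. a feed-forward cutset in
    front of the final adder. Zero initial conditions everywhere.
    """
    x1 = x2 = 0          # delay line: x(n-1), x(n-2)
    lat_ab = lat_c = 0   # the 2 additional pipelining latches
    y = []
    for xn in x:
        # Stage 2: the final adder only sees latched values (delay T_A).
        y.append(lat_ab + lat_c)
        # Stage 1: at most one multiplier plus one adder (delay T_M + T_A).
        ab = a * xn + b * x1
        cx = c * x2
        # Clock edge: latch the stage-1 results and shift the delay line.
        lat_ab, lat_c = ab, cx
        x1, x2 = xn, x1
    return y


print(fir3_pipelined([1, 0, 0, 0, 0], 2, 3, 4))  # [0, 2, 3, 4, 0]
```

    Cycle $n$ produces $y(n-1)$: the shorter critical path is paid for with one extra cycle of latency.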

    Pipelining of FIR digital filter (2)

    The critical path (longest path) can be reduced by suitably placing pipelining latches in the architecture.

    Pipelining latches can only be placed across a feed-forward cutset of the graph. Two graph definitions are needed for pipelining:

    Cutset: A cutset is a set of edges of a graph such that, if these edges are removed from the graph, the graph becomes disjoint.

    Feed-forward cutset: A cutset is called a feed-forward cutset if the data move in the forward direction on all the edges of the cutset.

    To obtain a correctly pipelined circuit, pipelining latches must be inserted on all the edges of the feed-forward cutset.
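    The two definitions can be checked mechanically. The Python sketch below assumes a graph given as a node list plus directed edges `(u, v)`; it removes the candidate edges, requires the undirected remainder to fall into exactly two parts, and then requires every removed edge to point from the same part to the other. The node names and function names are illustrative assumptions, not taken from the slides.

```python
def components(nodes, edges):
    """Label the connected components of the undirected version of a directed graph."""
    neighbours = {n: set() for n in nodes}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    label, count = {}, 0
    for start in nodes:
        if start in label:
            continue
        label[start] = count
        stack = [start]
        while stack:  # iterative depth-first search
            node = stack.pop()
            for nxt in neighbours[node]:
                if nxt not in label:
                    label[nxt] = count
                    stack.append(nxt)
        count += 1
    return label, count


def is_feedforward_cutset(nodes, edges, cut):
    """True if removing `cut` splits the graph into two parts and every
    cut edge points from the same part to the other part."""
    remaining = [e for e in edges if e not in cut]
    label, count = components(nodes, remaining)
    if count != 2:
        return False  # not a cutset: the graph did not split into two parts
    directions = {(label[u], label[v]) for u, v in cut}
    return len(directions) == 1 and all(label[u] != label[v] for u, v in cut)


# Hypothetical SFG of the 3-tap FIR: delays d1, d2; multipliers m0..m2; adders a1, a2.
nodes = ["in", "d1", "d2", "m0", "m1", "m2", "a1", "a2", "out"]
edges = [("in", "d1"), ("d1", "d2"), ("in", "m0"), ("d1", "m1"), ("d2", "m2"),
         ("m0", "a1"), ("m1", "a1"), ("a1", "a2"), ("m2", "a2"), ("a2", "out")]
print(is_feedforward_cutset(nodes, edges, {("a1", "a2"), ("m2", "a2")}))  # True
print(is_feedforward_cutset(nodes, edges, {("a1", "a2")}))                # False: not a cutset
```

    A cut through a feedback loop fails the direction test, because one cut edge necessarily points back into the first part.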

    Pipelining of FIR digital filter (3)

    Signal-flow graph example

    Pipelining of FIR digital filter (4)

    Data-Broadcast Structures

    The critical path of the original 3-tap FIR filter can be reduced without introducing any pipelining latches by transposing the structure.

    Transposition theorem: Reversing the direction of all the edges in a given SFG (signal-flow graph) and interchanging the input and output ports preserves the functionality of the system.

    Pipelining of FIR digital filter (5)

    <SFG representation of the FIR filter>

    <Transposed SFG representation of the FIR filter>

    Pipelining of FIR digital filter (6)

    The transposed SFG representation leads to the data-broadcast structure, where data are not stored but are broadcast to all the multipliers simultaneously.
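    As a sketch of the transposition theorem applied to this filter, the Python function below models the transposed (data-broadcast) form. Each input sample is fed to all three multipliers in the same cycle, and the delay elements hold partial sums instead of past inputs, so the longest combinational path is one multiplier plus one adder ($T_M + T_A$). The state names `s1` and `s2` and the function name are illustrative; the outputs match $y(n) = ax(n) + bx(n-1) + cx(n-2)$ with zero initial conditions.

```python
def fir3_transposed(x, a, b, c):
    """Transposed (data-broadcast) 3-tap FIR with zero initial state."""
    s1 = s2 = 0  # delay elements holding partial sums
    y = []
    for xn in x:
        # x(n) is broadcast to all three multipliers at once.
        # s1 currently holds b*x(n-1) + c*x(n-2).
        y.append(a * xn + s1)
        # Clock edge: both delay elements update together (the old s2 is used on the right).
        s1, s2 = b * xn + s2, c * xn
    return y


print(fir3_transposed([1, 2, 3, 4, 5], 1, 2, 3))  # [1, 4, 10, 16, 22]
```

    The output sequence equals that of the direct form, which is what the transposition theorem guarantees; only the placement of the delays, and hence the critical path, differs.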

    Pipelining of FIR digital filter (7)

    Fine-Grain Pipelining: Let $T_M = 10$ units and $T_A = 2$ units, and let the desired clock period be $(T_M + T_A)/2 = 6$ units.

    In this case the multiplier is broken into 2 smaller units with processing times of 6 units and 4 units, respectively.

    By placing latches on the horizontal cutset across the multiplier, the desired clock speed can be achieved.
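    The arithmetic behind this example can be sketched in Python, assuming the path being split is one multiplier followed by one adder ($T_M + T_A = 12$ units) and that the new latch sits between the two multiplier parts. The function name `fine_grain_clock_period` is hypothetical.

```python
def fine_grain_clock_period(t_m, t_a, split):
    """Clock period after splitting the multiplier into two pipelined parts.

    Assumes the path being pipelined is one multiplier followed by one adder
    (T_M + T_A), with a latch placed between the two multiplier parts.
    """
    first, second = split
    if first + second != t_m:
        raise ValueError("the two parts must add up to the multiplier delay")
    # Stage 1: first part of the multiplier.
    # Stage 2: second part of the multiplier followed by the adder.
    return max(first, second + t_a)


T_M, T_A = 10, 2
print((T_M + T_A) / 2)                            # 6.0, the desired clock period
print(fine_grain_clock_period(T_M, T_A, (6, 4)))  # 6
print(fine_grain_clock_period(T_M, T_A, (5, 5)))  # 7
```

    An even 5/5 split is worse here because the adder sits in the second stage; the uneven 6/4 split balances both stages at 6 units.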

    (C) 1997-2006 by Yu Hen Hu

    Usage of Pipelined Processing

    By inserting latches or registers between combinational logic circuits, the critical path can be shortened. Consequences: a reduced clock cycle time and an increased clock frequency.

    Pipelining is suitable for DSP applications that have (practically) infinitely long data streams.

    Method to incorporate pipelining: cut-set retiming.

    Cut set: A cut set is a set of edges of a graph. If these edges are removed from the original graph, the remaining graph becomes two separate graphs.

    Retiming: The timing of an algorithm is readjusted while keeping the partial ordering of execution unchanged, so that the results remain correct.
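    A minimal Python sketch of cut-set retiming on a graph whose edges carry delay counts, assuming the usual rule: add $k$ delays to every cut edge going from one side to the other, remove $k$ delays from every cut edge going the opposite way, and reject any result that would need a negative number of delays. On a feed-forward cutset there are no reverse edges, so retiming by $k = 1$ amounts to inserting one pipelining latch per cut edge. The function name, node names, and dictionary representation are illustrative assumptions.

```python
def cutset_retime(delays, side_a, k):
    """Cut-set retiming of a graph given as {(u, v): number_of_delays}.

    side_a is the set of nodes on one side of the cutset. Edges from side_a
    to the other side gain k delays; edges in the opposite direction lose k.
    """
    retimed = {}
    for (u, v), w in delays.items():
        if u in side_a and v not in side_a:
            w += k  # forward cut edge
        elif u not in side_a and v in side_a:
            w -= k  # backward cut edge
        if w < 0:
            raise ValueError(f"edge {u}->{v} would need {w} delays")
        retimed[(u, v)] = w
    return retimed


# Hypothetical 3-tap FIR graph: input node x, multipliers m0..m2, adders a1, a2.
fir = {("x", "m0"): 0, ("x", "m1"): 1, ("x", "m2"): 2,
       ("m0", "a1"): 0, ("m1", "a1"): 0, ("a1", "a2"): 0, ("m2", "a2"): 0}
front = {"x", "m0", "m1", "m2", "a1"}
print(cutset_retime(fir, front, 1))  # one delay added on a1->a2 and on m2->a2
```

    Edges that do not cross the cutset keep their delays, which is why the partial ordering of execution, and hence the result, is preserved apart from the added latency.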

    Parallel Processing (1)

    Designing a Parallel FIR System

    To obtain a parallel processing structure, the SISO (single-input single-output) system must be converted into a MIMO (multiple-input multiple-output) system. For a block size of 3:

    y(3k) = ax(3k)+bx(3k-1)+cx(3k-2)

    y(3k+1) = ax(3k+1)+bx(3k)+cx(3k-1)

    y(3k+2) = ax(3k+2)+bx(3k+1)+cx(3k)

    Parallel processing systems are also referred to as block processing systems.
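    The three block equations translate directly into the Python sketch below: each iteration $k$ consumes three inputs and produces three outputs, and the only state carried between iterations is $x(3k-1)$ and $x(3k-2)$. In hardware the three outputs would come from three copies of the multiply-add logic working in the same clock cycle; here they are simply computed one after another. The function name `fir3_block3` is hypothetical.

```python
def fir3_block3(x, a, b, c):
    """3-parallel (block size 3) 3-tap FIR: three outputs per iteration k."""
    if len(x) % 3:
        raise ValueError("input length must be a multiple of the block size 3")
    y = []
    xm1 = xm2 = 0  # x(3k-1) and x(3k-2), carried over from the previous block
    for k in range(len(x) // 3):
        x0, x1, x2 = x[3 * k], x[3 * k + 1], x[3 * k + 2]  # x(3k), x(3k+1), x(3k+2)
        y.append(a * x0 + b * xm1 + c * xm2)  # y(3k)
        y.append(a * x1 + b * x0 + c * xm1)   # y(3k+1)
        y.append(a * x2 + b * x1 + c * x0)    # y(3k+2)
        xm1, xm2 = x2, x1  # block delay feeding the next iteration
    return y


print(fir3_block3([1, 2, 3, 4, 5, 6], 1, 2, 3))  # [1, 4, 10, 16, 22, 28]
```

    The result is identical to the single-output filter $y(n) = ax(n) + bx(n-1) + cx(n-2)$ run sample by sample.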

    Parallel Processing (2)

    Parallel processing architecture for a 3-tap FIR filter (with block size 3)

    Parallel Processing (3)

    The critical path of the parallel processing system remains unchanged, and the clock period must satisfy:

    $T_{clk} \ge T_M + 2T_A$

    But since 3 samples are processed in 1 clock cycle instead of 3, the iteration period is given by:

    $T_{iter} = T_{sample} = \frac{1}{L} T_{clk} \ge \frac{1}{3}(T_M + 2T_A)$

    where $L$ is the block size (here $L = 3$). In a pipelined system, $T_{clk} = T_{sample}$; in an $L$-parallel system, $T_{clk} = L \cdot T_{sample}$.

    Parallel Processing (4)

    Complete parallel processing system with block size 4

    Parallel Processing (5)

    Why do we use parallel processing when we can use pipelining?

    Because there is a fundamental limit to pipelining imposed by I/O bottlenecks.

    Pipelining can be combined with parallel processing to further increase the speed of the architecture. By combining parallel processing (block size $L$) with $M$-level pipelining, the sample period is reduced to:

    $T_{iter} = T_{sample} = \frac{1}{LM} T_{clk} = \frac{1}{6}(T_M + 2T_A)$ for $L = 3$ and $M = 2$

    Parallel processing can also be used to reduce power consumption by allowing slower clocks.
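    A small Python calculation makes the iteration-period formulas concrete. It assumes, as the formula above does, that $M$-level pipelining divides the original clock period $T_M + 2T_A$ by exactly $M$ (an idealization; real latch placement may not split the path evenly) and that an $L$-parallel structure produces $L$ outputs per clock. The function name `iteration_period` and the values $T_M = 10$, $T_A = 2$ are illustrative.

```python
def iteration_period(t_m, t_a, L=1, M=1):
    """Idealized iteration (sample) period of the 3-tap FIR.

    Assumes the original clock period is the direct-form critical path
    T_M + 2*T_A, that M-level pipelining divides it by M, and that an
    L-parallel (block) structure produces L outputs per clock cycle.
    """
    t_clk = t_m + 2 * t_a
    return t_clk / (L * M)


T_M, T_A = 10, 2
for L, M in [(1, 1), (1, 2), (3, 1), (3, 2)]:
    print(f"L={L}, M={M}: T_iter = {iteration_period(T_M, T_A, L, M):.3f}")
```

    This prints 14.000, 7.000, 4.667 and 2.333 time units for the sequential, pipelined, parallel, and combined cases.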

    Parallel Processing (6)

    < A chip set>

    Parallel Processing (7)

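    Pipelining and Parallel Processing for Low Power

    There are two main advantages of using pipelining and parallel processing:

    Higher speed

    Lower power

    For a CMOS circuit, the propagation delay can be written as:

    $T_{pd} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$

    where $C_{charge}$ is the capacitance to be charged or discharged in a single clock cycle, $V_0$ is the supply voltage, $V_t$ is the threshold voltage, and $k$ is a technology-dependent constant.

    The power consumption of a CMOS circuit can be estimated as:

    $P = C_{total} V_0^2 f$

    where $C_{total}$ is the total capacitance of the circuit and $f$ is the clock frequency.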
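    The two CMOS models can be written as plain Python functions to show the trade-off they encode: lowering $V_0$ cuts power quadratically but lengthens the propagation delay. The function names and the normalized values ($C = k = f = 1$, $V_t = 1$) are illustrative assumptions, not device data.

```python
def propagation_delay(c_charge, v0, vt, k):
    """CMOS propagation delay: T_pd = C_charge * V0 / (k * (V0 - Vt)**2)."""
    if v0 <= vt:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return c_charge * v0 / (k * (v0 - vt) ** 2)


def dynamic_power(c_total, v0, f):
    """CMOS power estimate: P = C_total * V0**2 * f."""
    return c_total * v0 ** 2 * f


# Normalized units chosen only for illustration.
print(propagation_delay(1.0, 5.0, 1.0, 1.0))  # 0.3125
print(propagation_delay(1.0, 2.5, 1.0, 1.0))  # slower at the lower supply voltage
print(dynamic_power(1.0, 5.0, 1.0))           # 25.0
print(dynamic_power(1.0, 2.5, 1.0))           # 6.25: halving V0 quarters the power
```

    Pipelining and parallel processing exploit exactly this: they create timing slack, which is then spent on a lower supply voltage.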

    Pipelining for Low power (1)

    Let $P_{seq} = C_{total} V_0^2 f$ represent the power consumed in the original filter, where $f = 1/T_{seq}$ and $T_{seq}$ is the clock period of the original sequential filter.

    In the M-level pipelined system, the critical path is reduced to 1/M of its original length, and the capacitance to be charged/discharged in a single clock cycle is reduced to $C_{charge}/M$.

    If the clock speed is kept the same, the supply voltage can be reduced to $\beta V_0$, where $0 < \beta < 1$.

    Pipelining for Low power (2)

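    The power consumption factor, $\beta$, can be determined by examining the relationship between the propagation delay of the original filter and that of the pipelined filter:

    $T_{seq} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$

    $T_{pipe} = \frac{(C_{charge}/M)\, \beta V_0}{k (\beta V_0 - V_t)^2}$

    Since the clock speed is unchanged, $T_{pipe} = T_{seq}$, which gives $M(\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$. The power consumption of the pipelined filter is then:

    $P_{pip} = C_{total} \beta^2 V_0^2 f = \beta^2 P_{seq}$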
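    The condition $M(\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$ is a quadratic in $\beta$, so it can be solved in closed form. The Python sketch below takes the larger root, because the smaller root would put $\beta V_0$ below the threshold voltage. The function name `pipelined_beta` and the values $M = 3$, $V_0 = 5$ V, $V_t = 0.6$ V are illustrative assumptions.

```python
import math


def pipelined_beta(M, v0, vt):
    """Solve M*(beta*v0 - vt)**2 == beta*(v0 - vt)**2 for the supply scaling beta.

    Expanding gives qa*beta**2 + qb*beta + qc = 0; the larger root is taken
    because the other root puts beta*v0 below the threshold voltage vt.
    """
    qa = M * v0 ** 2
    qb = -(2 * M * v0 * vt + (v0 - vt) ** 2)
    qc = M * vt ** 2
    disc = qb * qb - 4 * qa * qc  # positive whenever v0 > vt > 0 and M >= 1
    return (-qb + math.sqrt(disc)) / (2 * qa)


# Illustrative values only: M = 3, V0 = 5 V, Vt = 0.6 V.
beta = pipelined_beta(3, 5.0, 0.6)
print(round(beta, 3), round(beta ** 2, 3))  # 0.467 0.218 (beta and P_pip / P_seq)
```

    For $M = 1$ the function returns $\beta = 1$ (no voltage scaling), which is a quick sanity check on the root choice.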

    Parallel processing for Low power (1)

    Parallel processing, like pipelining, can reduce the power consumption of a system by allowing the supply voltage to be reduced.

    In an L-parallel system, the charging capacitance does not change, while the total capacitance is increased by L times.

    In order to maintain the same sample rate, the clock period of the L-parallel circuit must be increased to $L T_{seq}$, where $T_{seq}$ is the propagation delay of the sequential circuit.

    There is more time to charge the same capacitance, so the supply voltage can be reduced to $\beta V_0$.

    Parallel processing for Low power (2)

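    The propagation delay of the L-parallel system is given by:

    $L T_{seq} = \frac{C_{charge}\, \beta V_0}{k (\beta V_0 - V_t)^2}$, where $T_{seq} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$

    Equating the two expressions gives $L(\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$. With $L$ times the capacitance clocked at $f/L$, the power consumption is:

    $P_{par} = (L C_{total})(\beta V_0)^2 \frac{f}{L} = \beta^2 C_{total} V_0^2 f = \beta^2 P_{seq}$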
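    Instead of expanding the quadratic, $\beta$ can also be found numerically straight from the delay model, which mirrors the reasoning on this slide: search for the reduced supply at which the unchanged $C_{charge}$ takes exactly $L T_{seq}$ to switch. The Python sketch below uses bisection (a swapped-in numerical technique, chosen because the delay decreases monotonically in $V$ for $V > V_t$) and then evaluates the power ratio with $L$ times the capacitance at $f/L$. The function names and the values $L = 3$, $V_0 = 5$ V, $V_t = 0.6$ V are illustrative assumptions.

```python
def propagation_delay(c_charge, v, vt, k):
    """CMOS propagation delay: C_charge * V / (k * (V - Vt)**2)."""
    return c_charge * v / (k * (v - vt) ** 2)


def parallel_low_power(L, v0, vt, c_charge=1.0, k=1.0, iters=200):
    """Return (beta, P_par / P_seq) for an L-parallel filter at the same sample rate."""
    t_seq = propagation_delay(c_charge, v0, vt, k)
    # The delay is unbounded as beta*v0 approaches vt and equals t_seq at beta = 1.
    lo, hi = vt / v0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if propagation_delay(c_charge, mid * v0, vt, k) > L * t_seq:
            lo = mid  # too slow even with L*T_seq available: voltage must stay higher
        else:
            hi = mid
    beta = hi
    c_total, f = 1.0, 1.0  # normalized; both cancel in the ratio
    p_seq = c_total * v0 ** 2 * f
    p_par = (L * c_total) * (beta * v0) ** 2 * (f / L)
    return beta, p_par / p_seq


beta, ratio = parallel_low_power(3, 5.0, 0.6)
print(round(beta, 3), round(ratio, 3))  # 0.467 0.218
```

    Because the equation has the same form as in the pipelined case, $L = 3$ yields the same $\beta$ as $M = 3$, and the power ratio comes out as $\beta^2$.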

    Conclusions

    Pipelining

    Pipelining latches are placed across the feed-forward cutsets in the SFG, and the computation time of the critical path is reduced.

    The clock frequency can be increased, and hence the sampling rate is increased.

    Parallel processing

    The hardware of the original serial system is duplicated, and the resulting system is a MIMO parallel system.

    The clock frequency stays the same, and the sampling frequency is increased.

    Both schemes can be used for higher speed and for lower-power design (using a lower supply voltage).