Parallel Chapter3

7/27/2019 Parallel Chapter3


Chapter 3 Pipelining



3.1 Pipeline Model

Terminology

task

subtask stage

staging register

Total processing time for each task:

$T_{pl} = \sum_{i=1}^{k} (t_i + d_i)$, where $t_i$ is the processing time of stage $i$, $d_i$ is the delay introduced by its staging register, and $k$ is the number of stages.

3.1 Pipeline Model

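(continued)

Total processing time for each task when executed sequentially (no staging registers):

$T_{seq} = \sum_{i=1}^{k} t_i$

Pipeline cycle time: $t_{max} = \max(t_i + d_i)$, $1 \le i \le k$

Clock frequency $= 1 / t_{max}$

With balanced stages and a uniform staging-register delay $d$, the pipeline cycle time $t_{cyc}$ can be written as $t_{cyc} = T_{seq}/k + d$.

Speedup: $S = \dfrac{N \, T_{seq}}{(k + N - 1) \, t_{cyc}}$, where $N$ is the number of tasks.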

3.1 Pipeline Model

(continued)

If the staging-register delay is ignored and the processing times of all stages are the same, $t_{cyc} = T_{seq}/k$. Therefore, the ideal speedup becomes

$S_{ideal} = \dfrac{k N}{k + N - 1}$

If $N \gg k$, $S_{ideal} \approx k$.
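The speedup expressions can be checked numerically. The following is a minimal sketch, not taken from the original slides: the function names and the stage times are hypothetical, and the cycle time is assumed to be the slowest stage plus its register delay, as in the definition of the pipeline cycle time.

```python
def pipeline_speedup(t, d, n_tasks):
    """Speedup S = N*T_seq / ((k + N - 1) * t_cyc) of a k-stage pipeline.

    t[i] is the processing time of stage i and d[i] the delay of its
    staging register; the cycle time is taken as max(t_i + d_i).
    """
    k = len(t)
    t_seq = sum(t)                                # T_seq: time without pipelining
    t_cyc = max(ti + di for ti, di in zip(t, d))  # slowest stage sets the clock
    return n_tasks * t_seq / ((k + n_tasks - 1) * t_cyc)


def ideal_speedup(k, n_tasks):
    """S_ideal = k*N / (k + N - 1): balanced stages, no register delay."""
    return k * n_tasks / (k + n_tasks - 1)


# Hypothetical 4-stage pipeline, times in ns (illustrative values only).
balanced = pipeline_speedup([10, 10, 10, 10], [0, 0, 0, 0], n_tasks=100)
with_delay = pipeline_speedup([10, 10, 10, 10], [2, 2, 2, 2], n_tasks=100)
print(balanced, with_delay, ideal_speedup(4, 100))
```

With zero register delay and equal stage times the general formula reduces to the ideal one, and the speedup approaches k only as the number of tasks grows.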



3.1 Pipeline Model

(continued)

The total cost of the pipeline is given by $C = L \cdot k + C_p$, where $C_p = \sum_{i=1}^{k} c_i$, $c_i$ is the cost of stage $i$, and $L$ is the cost of each staging register.

To minimize the composite cost per unit computation rate, $k = \sqrt{\dfrac{T_{seq} \, C_p}{d \, L}}$.
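The optimum stage count can be sanity-checked numerically. This sketch rests on two assumptions that the slides do not state: the composite cost is the total cost multiplied by the cycle time $T_{seq}/k + d$, and $C_p$ does not change with $k$. All names and numbers are hypothetical.

```python
import math


def composite_cost(k, t_seq, d, reg_cost, stage_cost):
    """Assumed composite cost: (L*k + C_p) * (T_seq/k + d)."""
    return (reg_cost * k + stage_cost) * (t_seq / k + d)


def optimal_stage_count(t_seq, d, reg_cost, stage_cost):
    """k = sqrt(T_seq * C_p / (d * L)), the minimizer of composite_cost."""
    return math.sqrt(t_seq * stage_cost / (d * reg_cost))


# Illustrative values only: T_seq = 100, d = 1, L = 4, C_p = 100.
k_opt = optimal_stage_count(t_seq=100.0, d=1.0, reg_cost=4.0, stage_cost=100.0)
for k in (k_opt - 1, k_opt, k_opt + 1):
    print(k, composite_cost(k, 100.0, 1.0, 4.0, 100.0))
```

Under these assumptions, setting the derivative of the composite cost with respect to k to zero gives the square-root expression, and the printed costs on either side of the optimum are slightly higher.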



3.1 Pipeline Model

(continued)

In practice, making the delays of the pipeline stages equal is a complicated and time-consuming process.

For maximum performance, it is essential that the stages be close to balanced. This is done for commercial processors, although it is neither easy nor cheap.

Another problem with pipelines is the overhead of handling exceptions or interrupts. A deep pipeline increases the interrupt-handling overhead.

Pipeline Types

Pipeline types (Handler's classification):

Instruction pipelines
Stages: FI (fetch instruction), DI (decode instruction), CA (calculate address), FO (fetch operands), EX (execute), ST (store results)

Arithmetic pipelines

Processor pipelines: a cascade of processors, each executing a specific module of the application program.

Instruction pipeline

Reservation table (RT)
Row: stages
Column: pipeline cycles

The cycle time of an instruction pipeline is often determined by the stages that require memory access.

Control Hazard

Conditional branch instructions
The target address of the branch is known only after the condition has been evaluated.

Ways to resolve control hazards:
The pipeline is frozen (stalled) until the target address is known.
The pipeline predicts that the branch will not be taken.
The target instruction sequence is fetched into a buffer while the non-branch sequence is being fed into the pipeline.

Arithmetic pipelines

Floating-point addition
Consider S = A + B, where A = (Ea, Ma), B = (Eb, Mb), and S = (Es, Ms).

Addition steps (Figure 3.5):
Equalize the exponents
Add the mantissas
Normalize Ms and adjust Es to reflect the normalization
Round Ms
Renormalize Ms and adjust Es

Modified floating-point add pipeline (Figures 3.6 & 3.7)
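The five addition steps can be sketched in software, with one statement group per pipeline stage. This is a toy model rather than the hardware of Figures 3.5 to 3.7: it assumes base-2 operands written as (exponent, mantissa) with value mantissa * 2**exponent, a normalized mantissa in [0.5, 1), and a hypothetical precision of P = 8 mantissa bits.

```python
import math

P = 8  # assumed mantissa precision in bits (hypothetical)


def normalize(e, m):
    """Scale m into [0.5, 1) and adjust e so that m * 2**e is unchanged."""
    if m == 0:
        return 0, 0.0
    frac, shift = math.frexp(m)  # m == frac * 2**shift with 0.5 <= |frac| < 1
    return e + shift, frac


def round_mantissa(m):
    """Round the mantissa to P fractional bits (round to nearest)."""
    return round(m * 2**P) / 2**P


def fp_add(a, b):
    (ea, ma), (eb, mb) = a, b
    # Stage 1: equalize the exponents by shifting the smaller operand right.
    es = max(ea, eb)
    ma = ma / 2 ** (es - ea)
    mb = mb / 2 ** (es - eb)
    # Stage 2: add the mantissas.
    ms = ma + mb
    # Stage 3: normalize Ms and adjust Es.
    es, ms = normalize(es, ms)
    # Stage 4: round Ms.
    ms = round_mantissa(ms)
    # Stage 5: renormalize, since rounding can push Ms up to 1.0.
    es, ms = normalize(es, ms)
    return es, ms


print(fp_add((3, 0.75), (1, 0.5)))     # 6 + 1
print(fp_add((0, 0.999), (-20, 0.5)))  # rounding forces a renormalization
```

The second call shows why the final renormalization stage exists: rounding the mantissa up to 1.0 takes it out of the normalized range, so the exponent has to be adjusted once more.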



Arithmetic pipelines (cont.)

Floating-point multiplication
Consider P = A x B, where A = (Ea, Ma), B = (Eb, Mb), and P = (Ep, Mp).

Multiplication steps (Figure 3.8):
Add the exponents
Multiply the mantissas
Normalize Mp and adjust Ep
Round Mp
Renormalize Mp and adjust Ep

Modified floating-point multiply pipeline (Figure 3.9)

Arithmetic pipelines (cont.)

Multifunction pipeline
Performs more than one operation.
A control input is needed for proper operation of the multifunction pipeline.
Figure 3.10: floating-point adder/multiplier

Classification scheme by Ramamoorthy and Li

Functionality: unifunctional, multifunctional
Configuration: static, dynamic
Mode of operation: scalar, vector

3.2 Pipeline Control and Performance

To provide the maximum possible throughput, the pipeline must be kept full and flowing smoothly.

Two conditions affect the smooth flow of a pipeline:
the rate of input of data
data interlocks between the stages

Example 3.1: the pipeline completes one operation per cycle (once it is full)
Example 3.2: non-linear pipeline

Structural hazard

Caused by the non-availability of appropriate hardware.
One obvious way of avoiding a structural hazard is to insert additional hardware into the pipeline.

Example 3.3

Figure 3.12 depicts the operation of the pipeline.
In cycles 3, 4, 5, and 6, simultaneous accesses are needed.
If we assume that the machine has separate data and instruction caches, the problems in cycles 5 and 6 are solved.
One way to solve the problem in cycle 4 is to stall the ADD instruction (Figure 3.13).
The stalling process results in a degradation of pipeline performance.

Collision vectors

Initiation: the launching of an operation into the pipeline.
Latency: the number of cycles that elapse between two initiations.
Latency sequence: the latencies between successive initiations.
Collision: occurs if a stage in the pipeline is required to perform more than one task at any time.

Collision vectors (cont.)

Forbidden set: the set of all possible column distances between two entries on some row of the reservation table (RT).

The collision vector can be derived from the forbidden set F and can be used to control the initiation of operations in the pipeline.

$CV = (v_{n-1}, v_{n-2}, \ldots, v_2, v_1)$
$v_i = 1$ if $i$ is in the forbidden set; otherwise $v_i = 0$.
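The derivation of the forbidden set and the collision vector from a reservation table can be sketched as follows. The table is a hypothetical 3-stage, 5-cycle example (not one of the figures from the chapter), and the function names are illustrative.

```python
def forbidden_set(rt):
    """Column distances between any two marked entries on the same row."""
    forbidden = set()
    for row in rt:
        cols = [c for c, used in enumerate(row) if used]
        for i in range(len(cols)):
            for j in range(i + 1, len(cols)):
                forbidden.add(cols[j] - cols[i])
    return forbidden


def collision_vector(rt):
    """CV = (v_{n-1}, ..., v_1) as a bit string, n = number of RT columns."""
    n = len(rt[0])
    forbidden = forbidden_set(rt)
    return "".join("1" if i in forbidden else "0" for i in range(n - 1, 0, -1))


# Hypothetical reservation table: rows are stages, columns are cycles.
rt = [
    [1, 0, 0, 1, 0],  # stage 1 is used in cycles 0 and 3 -> distance 3
    [0, 1, 1, 0, 0],  # stage 2 is used in cycles 1 and 2 -> distance 1
    [0, 0, 0, 0, 1],  # stage 3 is used once -> no distance
]
print(sorted(forbidden_set(rt)), collision_vector(rt))
```

For this table the forbidden latencies are 1 and 3, so a second operation may follow the first after 2 cycles, or after 4 or more.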



Examples

Example 3.4

(a) Overlapped RT

(b) Collision Vector(CV)

Example 3.5 & 3.6

Collision case and no collision case



Control

How to control the initiation of operations in the pipeline using the CV:
Place the CV in a shift register.
If the LSB of the shift register is 1, do not initiate an operation at that cycle; shift the register right once, inserting 0 at the vacant MSB position.
If the LSB of the shift register is 0, initiate a new operation at that cycle; shift the register right once, inserting 0 at the vacant MSB position. Then, to superimpose the collisions caused by the new initiation on those of the earlier ones, perform a bit-by-bit OR of the original CV with the content of the shift register.
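The shift-register procedure can be simulated in a few lines. This sketch assumes the first operation is initiated at cycle 0, so that the register starts out holding the CV, and that a new operation is always available. The function name is hypothetical; the CV is passed as a bit string with the MSB first.

```python
def simulate_initiations(cv, cycles):
    """Return the cycles at which operations enter the pipeline.

    cv is the collision vector as a bit string (v_{n-1} ... v_1).
    Assumes an initiation at cycle 0 and a waiting operation every cycle.
    """
    cv_bits = int(cv, 2)
    reg = cv_bits          # register contents after the initiation at cycle 0
    initiations = [0]
    for cycle in range(1, cycles):
        lsb = reg & 1
        reg >>= 1          # shift right; 0 enters the vacant MSB position
        if lsb == 0:       # no collision: initiate and OR in the original CV
            initiations.append(cycle)
            reg |= cv_bits
    return initiations


print(simulate_initiations("00111", 9))  # CV quoted for Figure 3.11
print(simulate_initiations("0101", 7))   # hypothetical CV
```

For the CV (00111) the first three latencies are forbidden, so this controller settles into initiating one operation every 4 cycles.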



3.2.3 Performance

The CV of Figure 3.11: (00111)

Figure 3.15(a) shows the state transitions.



3.2.3 Performance

Average latency

simple cycle

greedy cycle

MAL (minimum average latency)
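A greedy cycle and its average latency can be computed by walking the state diagram. The sketch below assumes the usual definitions, which the slides only name: a state is the content of the shift register, a latency i is permitted when bit i of the state is 0 (any latency longer than the CV is always permitted), the next state is the state shifted right by i and ORed with the CV, and a greedy cycle always takes the smallest permitted latency. The function name and the second CV are illustrative.

```python
def greedy_cycle(cv):
    """Return (latency cycle, average latency) of the greedy strategy.

    cv is the collision vector as a bit string (v_{n-1} ... v_1).
    """
    bits = len(cv)
    cv_bits = int(cv, 2)
    state, first_seen, latencies = cv_bits, {}, []
    while state not in first_seen:
        first_seen[state] = len(latencies)
        # Smallest latency whose bit is 0; bits + 1 if every bit is set.
        lat = next(
            (l for l in range(1, bits + 1) if not (state >> (l - 1)) & 1),
            bits + 1,
        )
        latencies.append(lat)
        state = (state >> lat) | cv_bits
    cycle = latencies[first_seen[state]:]  # drop any lead-in before the loop
    return cycle, sum(cycle) / len(cycle)


print(greedy_cycle("00111"))    # CV quoted for Figure 3.11
print(greedy_cycle("1011010"))  # hypothetical CV
```

The average latency of a greedy cycle is only an upper bound on the MAL; finding the MAL itself requires comparing the simple cycles of the state diagram.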



3.2.4 Multifunction Pipelines

Figure 3.17

Vxx, Vxy, Vyx, Vyy



3.3 Other Pipeline Problems

Data interlock: due to the sharing of resources.

Data hazard

data forwarding

internal forwarding

write-read forwarding

read-read forwarding

write-write forwarding

load/store architectures versus memory/memory architectures

3.3 Other Pipeline Problems

(continued) Conditional Branches

branch prediction

delayed branch

branch-prediction buffer

branch history

multiple instruction buffers

Interrupts

precise interrupt scheme



3.4 Dynamic Pipelines

Instruction deferral
scoreboard
Tomasulo's algorithm

Performance evaluation
maximizing the total number of initiations per unit time
minimizing the total time required to handle a specific sequence of initiation table types

3.5 Example systems

CDC Star-100

CDC 6600

MIPS R-4000



3.6 Summaries

Three approaches have been tried to improve performance beyond the ideal CPI case:
superpipeline
superscalar
VLIW (Very Long Instruction Word)

End of Chapter 3
