7/27/2019 Parallel Chapter3
1/29
Chapter 3 Pipelining
3.1 Pipeline Model
Terminology
task
subtask stage
staging register
Total processing time for each task:
$T_{pl} = \sum_{i=1}^{k} (t_i + d_i)$, where $t_i$ is the processing time of stage $i$, $d_i$ is the delay introduced by its staging register, and $k$ is the number of stages.
3.1 Pipeline Model
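(continued)
Total processing time for each task in sequential (non-pipelined) execution:
$T_{seq} = \sum_{i=1}^{k} t_i$
Pipeline cycle time: $t_{max} = \max(t_i + d_i)$, $1 \le i \le k$
Clock frequency $= 1 / t_{max}$
The pipeline cycle time $t_{cyc}$ can also be written as $T_{seq}/k + d$, where $d$ is the staging register delay.
Speedup: $S = \dfrac{N \cdot T_{seq}}{(k + N - 1)\, t_{cyc}}$, where $N$ is the number of tasks.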
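The timing relations on this slide can be checked numerically. The following is a minimal Python sketch, assuming a linear pipeline, a cycle time equal to the slowest stage plus its register delay, and the speedup N·Tseq / ((k + N − 1)·tcyc). The function name `pipeline_metrics` and the stage times are hypothetical, chosen only for illustration.

```python
def pipeline_metrics(stage_times, register_delays, num_tasks):
    """Return (T_pl, T_seq, t_cyc, speedup) for a linear pipeline.

    stage_times[i] is t_i and register_delays[i] is d_i for stage i.
    """
    k = len(stage_times)
    # T_pl: time one task spends in the pipeline, including register delays.
    t_pl = sum(t + d for t, d in zip(stage_times, register_delays))
    # T_seq: time one task needs without pipelining (no staging registers).
    t_seq = sum(stage_times)
    # The slowest stage (plus its register) sets the pipeline cycle time.
    t_cyc = max(t + d for t, d in zip(stage_times, register_delays))
    # N tasks need N * T_seq sequentially and (k + N - 1) cycles when pipelined.
    speedup = (num_tasks * t_seq) / ((k + num_tasks - 1) * t_cyc)
    return t_pl, t_seq, t_cyc, speedup


# Hypothetical 4-stage pipeline, times in nanoseconds, 100 tasks.
t_pl, t_seq, t_cyc, speedup = pipeline_metrics(
    [8, 10, 9, 10], [1, 1, 1, 1], num_tasks=100
)
print(t_pl, t_seq, t_cyc, round(speedup, 3))
```

Because the stages are not perfectly balanced and each register adds delay, the speedup stays below the stage count k = 4.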
3.1 Pipeline Model
(continued)
If the staging register delay is ignored and the processing times of the stages are the same, $t_{cyc} = T_{seq}/k$. Therefore, $S_{ideal}$ becomes
$S_{ideal} = \dfrac{k N}{k + N - 1}$
As $N \to \infty$, $S_{ideal} \to k$.
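To see how quickly the ideal speedup approaches the number of stages, the expression kN / (k + N − 1) can be evaluated for growing N. This is a small illustrative sketch; `ideal_speedup` and the choice k = 5 are assumptions, not values from the slides.

```python
def ideal_speedup(k, n):
    """Ideal speedup of a balanced k-stage pipeline processing n tasks."""
    return (k * n) / (k + n - 1)


# A single task gains nothing; long task streams approach a speedup of k.
for n in (1, 10, 100, 10_000):
    print(n, round(ideal_speedup(5, n), 3))
```

The pipeline fill time of k − 1 cycles is amortized over the task stream, which is why the speedup only approaches k for large N.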
3.1 Pipeline Model
(continued)
The total cost of the pipeline is given by $C = L \cdot k + C_p$, where $C_p = \sum_{i=1}^{k} c_i$ and $L$ is the cost of each staging register.
To minimize the composite cost per computation rate, $k = \sqrt{\dfrac{T_{seq} \cdot C_p}{d \cdot L}}$.
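The optimal stage count can be sanity-checked numerically. The sketch below assumes that "cost per computation rate" means total cost C = L·k + Cp multiplied by the cycle time Tseq/k + d, and that Cp does not change with k; under those assumptions the minimum falls at k = sqrt(Tseq·Cp / (d·L)). All names and numbers are hypothetical.

```python
import math


def cost_per_rate(k, t_seq, d, c_p, reg_cost):
    """Pipeline cost (L*k + C_p) times cycle time (T_seq/k + d)."""
    return (reg_cost * k + c_p) * (t_seq / k + d)


def optimal_stage_count(t_seq, d, c_p, reg_cost):
    """Closed-form minimizer of cost_per_rate with respect to k."""
    return math.sqrt(t_seq * c_p / (d * reg_cost))


# Hypothetical values: T_seq = 64, d = 1, C_p = 100, L = 4.
params = dict(t_seq=64.0, d=1.0, c_p=100.0, reg_cost=4.0)
k_opt = optimal_stage_count(**params)

# Brute-force search over integer stage counts as a cross-check.
best_int = min(range(1, 201), key=lambda k: cost_per_rate(k, **params))
print(k_opt, best_int)
```

In practice k must be an integer and the stage logic cannot be divided arbitrarily, so the closed form is a guide rather than an exact design rule.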
3.1 Pipeline Model
(continued)
In practice, making the delays of the pipeline stages equal is a complicated and time-consuming process.
For maximum performance, it is essential that the stages be close to balanced. This is done for commercial processors, although it is neither easy nor cheap.
Another problem with pipelines is the overhead of handling exceptions or interrupts. A deep pipeline increases the interrupt handling overhead.
Pipeline Types (Handler's classification)
Instruction pipelines
FI, DI, CA, FO, EX, ST
Arithmetic pipelines
Processor pipelines: a cascade of processors, each executing a specific module in the application program.
Instruction pipeline
reservation table
Row : stages
Column : pipeline cycles
The cycle time of instruction pipelines is
often determined by the stages
requiring memory access.
Control Hazard
Conditional branch instructions
The target address of the branch is known only after the condition has been evaluated.
Ways to handle control hazards:
The pipeline is frozen until the branch is resolved.
The pipeline predicts that the branch will not be taken.
The target instruction sequence is fetched into a buffer while the nonbranch sequence is being fed into the pipeline.
Arithmetic pipelines
Floating point addition
Consider S = A + B, where A = (Ea, Ma), B = (Eb, Mb), and S = (Es, Ms).
Addition steps (Figure 3.5):
Equalize the exponents
Add mantissas
Normalize Ms and adjust Es for the sum normalization
Round Ms
Renormalize Ms and adjust Es
Modified floating point add pipeline (Figures 3.6 & 3.7)
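The five addition steps map naturally onto five pipeline stages. The following Python sketch walks one pair of operands through them. It is a simplified model, not the hardware of the figures: it assumes base-2 operands stored as (exponent, mantissa) with the mantissa a fraction in [0.5, 1), unbiased exponents, and a hypothetical 8-bit mantissa; `fp_add`, `normalize`, and `MANTISSA_BITS` are illustrative names.

```python
import math

MANTISSA_BITS = 8  # hypothetical mantissa precision


def normalize(e, m):
    """Scale m into [0.5, 1) and adjust e to keep m * 2**e unchanged."""
    if m == 0:
        return 0, 0.0
    frac, shift = math.frexp(m)  # m == frac * 2**shift
    return e + shift, frac


def fp_add(a, b):
    (ea, ma), (eb, mb) = a, b
    # Stage 1: equalize exponents by shifting the smaller operand right.
    es = max(ea, eb)
    ma = math.ldexp(ma, ea - es)
    mb = math.ldexp(mb, eb - es)
    # Stage 2: add mantissas.
    ms = ma + mb
    # Stage 3: normalize Ms and adjust Es.
    es, ms = normalize(es, ms)
    # Stage 4: round Ms to the available mantissa bits.
    scale = 1 << MANTISSA_BITS
    ms = round(ms * scale) / scale
    # Stage 5: renormalize, since rounding can push Ms up to 1.0.
    return normalize(es, ms)


# 6.0 + 1.0 = 7.0, i.e. 0.875 * 2**3
print(fp_add((3, 0.75), (1, 0.5)))
```

In a pipelined adder each stage would work on a different operand pair in the same cycle; the sketch only shows the data transformation that each stage performs.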
Arithmetic pipelines(cont.)
Floating point multiplication
Consider P = A x B, where A = (Ea, Ma), B = (Eb, Mb), and P = (Ep, Mp).
Multiplication steps (Figure 3.8):
Add exponents
Multiply mantissas
Normalize Mp and adjust Ep
Round Mp
Renormalize Mp and adjust Ep
Modified floating point multiply pipeline (Figure 3.9)
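The multiplication steps can be sketched the same way. As a simplified model, this assumes base-2 operands stored as (exponent, mantissa) with mantissas in [0.5, 1), unbiased exponents (so adding them needs no bias correction), and a hypothetical 8-bit mantissa; `fp_multiply` and its helper are illustrative names.

```python
import math

MANTISSA_BITS = 8  # hypothetical mantissa precision


def normalize(e, m):
    """Scale m into [0.5, 1) and adjust e to keep m * 2**e unchanged."""
    if m == 0:
        return 0, 0.0
    frac, shift = math.frexp(m)  # m == frac * 2**shift
    return e + shift, frac


def fp_multiply(a, b):
    (ea, ma), (eb, mb) = a, b
    ep = ea + eb                        # Stage 1: add exponents
    mp = ma * mb                        # Stage 2: multiply mantissas
    ep, mp = normalize(ep, mp)          # Stage 3: normalize Mp, adjust Ep
    scale = 1 << MANTISSA_BITS
    mp = round(mp * scale) / scale      # Stage 4: round Mp
    return normalize(ep, mp)            # Stage 5: renormalize


# 6.0 * 1.0 = 6.0, i.e. 0.75 * 2**3
print(fp_multiply((3, 0.75), (1, 0.5)))
```

The last three stages are the same as in addition, which is what makes a shared multifunction add/multiply pipeline attractive.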
Arithmetic pipelines(cont.)
Multifunction pipeline
To perform more than one operation
A control input is needed for proper
operation of the multifunction pipeline.
Figure 3.10: floating point adder/multiplier
Classification scheme by Ramamoorthy and Li
Functionality: unifunctional or multifunctional
Configuration: static or dynamic
Mode of operation: scalar or vector
3.2 Pipeline Control and Performance
To provide the maximum possible throughput, the pipeline must be kept full and flowing smoothly.
Two conditions affect the smooth flow of a pipeline:
the rate of input of data
data interlocks between the stages
Example 3.1: the pipeline completes one operation per cycle (once it is full)
Example 3.2: non-linear pipeline
Structural hazard
Due to the non-availability of
appropriate hardware
One obvious way of avoiding a structural hazard is to insert additional hardware into the pipeline.
Example 3.3
Figure 3.12 depicts the operation of the pipeline.
In cycles 3, 4, 5, and 6, simultaneous accesses are needed.
If we assume that the machine has separate data and instruction caches, the problems in cycles 5 and 6 are solved.
One way to solve the problem in cycle 4 is to stall the ADD instruction (Figure 3.13).
The stalling process results in a degradation of pipeline performance.
Collision vectors
Initiation: the launching of an operation into the pipeline.
Latency: the number of cycles that elapse between two initiations.
Latency sequence: the latencies between successive initiations.
Collision: occurs if a stage in the pipeline is required to perform more than one task at the same time.
Collision vectors(cont.)
Forbidden set: the set of all possible column distances between two entries on some row of the reservation table (RT).
The collision vector (CV) can be derived from the forbidden set F and can be used to control the initiation of operations in the pipeline.
$CV = (v_{n-1}, v_{n-2}, \ldots, v_2, v_1)$
$v_i = 1$ if $i$ is in the forbidden set, and $v_i = 0$ otherwise.
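Deriving the forbidden set and the collision vector from a reservation table is mechanical, as the following Python sketch shows. It assumes the table is given as one list of 0/1 marks per stage (rows) across cycles (columns), and that n is the number of columns; the table itself is hypothetical and is not one of the figures referenced in the slides.

```python
def forbidden_set(reservation_table):
    """Column distances between any two marks on the same row."""
    forbidden = set()
    for row in reservation_table:
        marked = [col for col, used in enumerate(row) if used]
        for i, first in enumerate(marked):
            for later in marked[i + 1:]:
                forbidden.add(later - first)
    return forbidden


def collision_vector(reservation_table):
    """Bits (v_{n-1}, ..., v_1) as a string; v_i = 1 if i is forbidden."""
    n = len(reservation_table[0])  # assumed: n = number of columns
    forbidden = forbidden_set(reservation_table)
    return "".join("1" if i in forbidden else "0" for i in range(n - 1, 0, -1))


# Hypothetical 3-stage, 5-cycle reservation table.
table = [
    [1, 0, 0, 1, 0],  # stage 1 is used in cycles 0 and 3 (distance 3)
    [0, 1, 1, 0, 0],  # stage 2 is used in cycles 1 and 2 (distance 1)
    [0, 0, 0, 0, 1],  # stage 3 is used once
]
print(forbidden_set(table), collision_vector(table))
```

A latency in the forbidden set would make two operations need the same stage in the same cycle, which is exactly the collision condition.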
Examples
Example 3.4
(a) Overlapped RT
(b) Collision Vector(CV)
Example 3.5 & 3.6
Collision case and no collision case
Control
How to control the initiation of operations in a pipeline using the CV:
Place the CV in a shift register.
If the LSB of the shift register is 1, do not initiate an operation at that cycle; shift the register right once, inserting 0 at the vacant MSB position.
If the LSB of the shift register is 0, initiate a new operation at that cycle; shift the register right once, inserting 0 at the vacant MSB position. To reflect the superposition of the new initiation on the earlier ones, perform a bit-by-bit OR of the original CV with the content of the shift register.
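The shift-register rule can be simulated directly. This is a minimal Python sketch, assuming the first operation is initiated at cycle 0 and the register is loaded with the CV at that moment; the function name `initiation_cycles` is hypothetical. The CV is written MSB first, so the rightmost bit is v1.

```python
def initiation_cycles(cv_bits, num_cycles):
    """Return the cycles at which the controller initiates an operation."""
    cv = int(cv_bits, 2)
    reg = cv          # register contents right after the initiation at cycle 0
    starts = [0]
    for cycle in range(1, num_cycles):
        can_start = (reg & 1) == 0   # LSB = 0 means no collision this cycle
        reg >>= 1                    # shift right; 0 enters at the MSB
        if can_start:
            reg |= cv                # superpose the new initiation
            starts.append(cycle)
    return starts


print(initiation_cycles("00111", 10))
print(initiation_cycles("0101", 6))
```

This controller always initiates at the earliest collision-free cycle, which is a greedy policy; a different policy would insert deliberate delays even when the LSB is 0.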
3.2.3 Performance
The CV of Figure 3.11: (00111)
Figure 3.15(a) shows the state transitions.
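A state-transition diagram of this kind can be generated from a collision vector. The sketch below assumes the usual construction: from a state, any latency p whose bit v_p is 0 is allowed, and the next state is the current state shifted right by p positions and ORed with the initial CV. The function name and the integer encoding of states are illustrative choices.

```python
def state_diagram(cv_bits):
    """Map each reachable state to {latency: next_state}.

    States are integers whose bit (p - 1) is v_p. The largest latency listed,
    len(cv_bits) + 1, stands for every latency longer than the vector.
    """
    width = len(cv_bits)
    cv = int(cv_bits, 2)
    transitions = {}
    pending = [cv]
    while pending:
        state = pending.pop()
        if state in transitions:
            continue
        arcs = {}
        for latency in range(1, width + 2):
            if (state >> (latency - 1)) & 1:
                continue  # this latency would cause a collision
            arcs[latency] = (state >> latency) | cv
        transitions[state] = arcs
        pending.extend(arcs.values())
    return transitions


print(state_diagram("00111"))
print(state_diagram("1001"))
```

For the collision vector (00111) quoted above, this construction yields a single state that loops back to itself for every latency of 4 or more.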
3.2.3 Performance
Average latency
simple cycle
greedy cycle
MAL (Minimum Average Latency)
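The terms above can be made concrete by tracing a greedy cycle, that is, always taking the smallest allowed latency from each state until a state repeats. The following Python sketch reports that cycle and its average latency. It assumes the shift-and-OR state construction, and `greedy_cycle` is a hypothetical name. The greedy cycle's average latency is an upper bound on the MAL and does not always equal it.

```python
def greedy_cycle(cv_bits):
    """Return (latency cycle, average latency) of the greedy strategy."""
    width = len(cv_bits)
    cv = int(cv_bits, 2)
    state = cv
    first_seen = {}   # state -> index into latencies where it was first left
    latencies = []
    while state not in first_seen:
        first_seen[state] = len(latencies)
        # Smallest latency p whose bit v_p is 0; width + 1 is always allowed.
        latency = next(
            p for p in range(1, width + 2) if not (state >> (p - 1)) & 1
        )
        latencies.append(latency)
        state = (state >> latency) | cv
    cycle = latencies[first_seen[state]:]
    return cycle, sum(cycle) / len(cycle)


print(greedy_cycle("00111"))
print(greedy_cycle("1001"))
```

A cycle with a single repeated latency, such as the one produced for (00111), is a constant-latency cycle; the second example alternates between two latencies.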
3.2.4 Multifunction Pipelines
Figure 3.17
Vxx, Vxy, Vyx, Vyy
3.3 Other Pipeline Problems
Data interlock: due to the sharing of resources.
Data hazard
data forwarding
internal forwarding
write-read forwarding
read-read forwarding
write-write forwarding
load/store architectures versus memory/memory architectures
3.3 Other Pipeline Problems
(continued)
Conditional branches
branch prediction
delayed branch
branch-prediction buffer
branch history
multiple instruction buffers
Interrupts
precise interrupt scheme
3.4 Dynamic Pipelines
Instruction deferral
scoreboard
Tomasulo's algorithm
Performance evaluation
maximizing the total number of initiations per unit time
minimizing the total time required to handle a specific sequence of initiation table types
3.5 Example systems
CDC Star-100
CDC 6600
MIPS R-4000
3.6 Summaries
Three approaches have been tried to improve the performance beyond the ideal CPI case:
superpipeline
superscalar
VLIW (Very Long Instruction Word)
End of Chapter 3