COSC 6385
Computer Architecture
- Pipelining (II)
Edgar Gabriel
Spring 2018
Performance evaluation of pipelines (I)
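General Speedup Formula:

\[
\mathrm{Speedup} = \frac{\mathrm{Time}_{org}}{\mathrm{Time}_{enh}}
= \frac{IC_{org} \times \mathrm{ClockCycle}_{org} \times CPI_{org}}{IC_{enh} \times \mathrm{ClockCycle}_{enh} \times CPI_{enh}}
\]

For a fixed application let's assume that $IC_{org} = IC_{enh}$:

\[
\mathrm{Speedup} = \frac{\mathrm{ClockCycle}_{org} \times CPI_{org}}{\mathrm{ClockCycle}_{enh} \times CPI_{enh}}
\]

If we additionally assume that the CPU has the same frequency, i.e. $\mathrm{ClockCycle}_{org} = \mathrm{ClockCycle}_{enh}$:

\[
\mathrm{Speedup} = \frac{CPI_{org}}{CPI_{enh}}
\]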
Performance evaluation of pipelines (II)
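If looking at individual classes of instructions:

\[
\mathrm{Speedup}_{overall} = \frac{\mathrm{Time}_{org}}{\mathrm{Time}_{enh}}
= \frac{\mathrm{ClockCycle}_{org} \times \sum_{i=1}^{n} IC_i \times CPI_i^{org}}{\mathrm{ClockCycle}_{enh} \times \sum_{i=1}^{n} IC_i \times CPI_i^{enh}}
\]

Assuming $IC_{total}$ is identical in both architectures, with

\[
f_i = \frac{IC_i}{IC_{total}}
\]

\[
\mathrm{Speedup}_{overall} = \frac{\mathrm{Time}_{org}}{\mathrm{Time}_{enh}}
= \frac{\mathrm{ClockCycle}_{org} \times \sum_{i=1}^{n} f_i \times CPI_i^{org}}{\mathrm{ClockCycle}_{enh} \times \sum_{i=1}^{n} f_i \times CPI_i^{enh}}
\]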
Comparing pipelined and non-pipelined execution
• An ideal pipeline produces one result per clock cycle
→ Ideal CPI_pipelined = 1
• using the average instruction execution time (AvIETime)

\[
\mathrm{Time}_{pipelined} = \frac{\mathrm{Time}_{non\_pipelined}}{\mathrm{no\_pipeline\_stages}}
\]

\[
\mathrm{Speedup} = \frac{\mathrm{Time}_{non\_pipelined}}{\mathrm{Time}_{pipelined}} = \mathrm{no\_pipeline\_stages}
\]

\[
\mathrm{Speedup} = \frac{\mathrm{AvIETime}_{non\_pipelined}}{\mathrm{AvIETime}_{pipelined}}
= \frac{CPI_{non\_pipelined}}{CPI_{pipelined}} \times \frac{\mathrm{ClockCycle}_{non\_pipelined}}{\mathrm{ClockCycle}_{pipelined}}
\]
Comparing pipelined and non-pipelined execution (II)
Realistic CPI_pipelined = Ideal CPI_pipelined + Pipeline stall cycles per instruction

Thus:

\[
\mathrm{Speedup} = \frac{\mathrm{AvIETime}_{non\_pipelined}}{\mathrm{AvIETime}_{pipelined}}
= \frac{CPI_{non\_pipelined}}{1 + \mathrm{PipelineStallCyclesPerInstr}} \times \frac{\mathrm{ClockCycle}_{non\_pipelined}}{\mathrm{ClockCycle}_{pipelined}}
\]

If ClockCycle is constant:

\[
\mathrm{Speedup} = \frac{CPI_{non\_pipelined}}{1 + \mathrm{PipelineStallCyclesPerInstr}}
\]
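As a quick numerical check of the speedup relation on this slide, the following minimal Python sketch evaluates it for an ideal pipelined CPI of 1. The function name and the sample values (a non-pipelined CPI of 5 and 0.25 stall cycles per instruction) are illustrative assumptions, not values from the slides.

```python
def pipeline_speedup(cpi_non_pipelined, stall_cycles_per_instr,
                     cc_non_pipelined=1.0, cc_pipelined=1.0):
    """Speedup = CPI_np / (1 + stalls) * CC_np / CC_p, assuming ideal CPI_pipelined = 1."""
    realistic_cpi = 1.0 + stall_cycles_per_instr
    return (cpi_non_pipelined / realistic_cpi) * (cc_non_pipelined / cc_pipelined)

# Hypothetical values: same clock cycle on both machines.
ideal = pipeline_speedup(5, 0.0)      # no stalls
stalled = pipeline_speedup(5, 0.25)   # 0.25 stall cycles per instruction
print(ideal, stalled)
```

With a constant clock cycle the second factor is 1, so stall cycles are the only thing that reduces the speedup below the non-pipelined CPI.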
Example I
• (A) Given a non-pipelined processor:
– 1 ns clock cycle time
– 4 cycles for ALU operations
– 4 cycles for branches
– 5 cycles for memory operations
• (B) Given also a pipelined processor
– 1.2 ns clock cycle time
• Both (A) and (B) have
– 40% ALU operations
– 40% branches
– 20% memory operations
• What is the speedup of (B) over (A) due to pipelining?
Example I
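For machine (A):

\[
\mathrm{AvIETime}_A = \mathrm{ClockCycle}_A \times \sum_{i=1}^{n} f_i \times CPI_i
= 1\,\mathrm{ns} \times (0.4 \times 4 + 0.4 \times 4 + 0.2 \times 5) = 4.2\,\mathrm{ns}
\]

For machine (B), assuming ideal CPI (= 1):

\[
\mathrm{AvIETime}_B = \mathrm{ClockCycle}_B \times \sum_{i=1}^{n} f_i \times CPI_i
= 1.2\,\mathrm{ns} \times (0.4 \times 1 + 0.4 \times 1 + 0.2 \times 1) = 1.2\,\mathrm{ns}
\]

Thus:

\[
\mathrm{Speedup} = \frac{\mathrm{AvIETime}_A}{\mathrm{AvIETime}_B} = \frac{4.2\,\mathrm{ns}}{1.2\,\mathrm{ns}} = 3.5
\]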
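The arithmetic of this example can be re-checked with a short Python sketch. It uses the instruction mix and cycle counts stated in the exercise (40% ALU, 40% branches, 20% memory operations); with that mix the weighted sum for machine (A) evaluates to 4.2 ns. The helper name av_ie_time is a hypothetical choice for illustration.

```python
def av_ie_time(clock_cycle_ns, mix):
    """Average instruction execution time: ClockCycle * sum(f_i * CPI_i)."""
    return clock_cycle_ns * sum(f * cpi for f, cpi in mix)

# (frequency, CPI) per instruction class: ALU, branch, memory
mix_a = [(0.4, 4), (0.4, 4), (0.2, 5)]   # non-pipelined machine (A)
mix_b = [(0.4, 1), (0.4, 1), (0.2, 1)]   # pipelined machine (B), ideal CPI = 1

time_a = av_ie_time(1.0, mix_a)   # 1 ns clock cycle
time_b = av_ie_time(1.2, mix_b)   # 1.2 ns clock cycle
speedup = time_a / time_b
print(f"A: {time_a:.1f} ns, B: {time_b:.1f} ns, speedup: {speedup:.2f}")
```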
Exceptions
• Instruction execution order is interrupted
• E.g.
– I/O device request
– Invoking an OS service from an application
– Tracing execution
– Breakpoint
– Integer or FP arithmetic anomaly (e.g. overflow)
– Page fault
– Misaligned memory access
– Memory protection violation
– Hardware malfunction
Classification of Exceptions
• Problems with pipelining:
– Different stages of the pipeline can raise exceptions
leading to a different order of exceptions compared to
the unpipelined case
• Classes of exceptions
1. Synchronous vs. Asynchronous:
2. User requested vs. Coerced
3. User maskable vs. user non-maskable
4. Within vs. between instructions
5. Resume vs. terminate
Exceptions
• Most problematic: exceptions raised within
instructions, where the instruction must be resumed
– Another program must be invoked to save the state of the
program
• Pipelines capable of handling exceptions are called
restartable
Pipeline stage | Possible exceptions
IF             | Page fault on instruction fetch; misaligned memory access; memory protection violation
ID             | Undefined or illegal opcode
EX             | Arithmetic exception
MEM            | Page fault on data fetch; misaligned memory access; memory protection violation
WB             | None
Exceptions
• An exception cannot be raised at the moment it occurs, since an earlier instruction may still cause an exception in a later cycle. Instead:
– A status vector associated with the instruction records the exception
– The status vector is carried along with the instruction
– Writing of data values is disabled once the status vector is set
– In WB the status vector is checked and the exception is handled
=> The exception of instruction i is handled before the exception of
instruction i+1
=> Since no data values are written back, the register file is not
changed -> the instruction can be repeated
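The status-vector scheme can be illustrated with a small Python sketch. It assumes a simple five-stage pipeline in which instruction k enters IF in cycle k and never stalls; the function name, the fault table and the exception names are hypothetical, and flushing and restarting the pipeline after the handler runs is not modelled.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def simulate(n_instr, faults):
    """faults: {(instr_index, stage_name): exception_name} (hypothetical input)."""
    status = [None] * n_instr           # one status vector per instruction
    detected, handled = [], []
    for cycle in range(n_instr + len(STAGES) - 1):
        for k in range(n_instr):        # instruction k enters IF in cycle k
            s = cycle - k
            if not 0 <= s < len(STAGES):
                continue                # instruction k is not in the pipeline
            stage = STAGES[s]
            exc = faults.get((k, stage))
            if exc and status[k] is None:
                status[k] = exc         # record only; do not raise yet
                detected.append((k, exc))
            if stage == "WB" and status[k] is not None:
                handled.append((k, status[k]))   # raised in program order
    return detected, handled

# Instruction 1 faults in IF one cycle before instruction 0 faults in MEM.
detected, handled = simulate(2, {(0, "MEM"): "data page fault",
                                 (1, "IF"): "instruction page fault"})
print("detected:", detected)
print("handled: ", handled)
```

The fault of the later instruction is detected first, but because handling is deferred to WB, the exceptions are handled in program order.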
Multi-cycle instructions
• Not all instructions take the same number of cycles to finish!
– Floating point instructions can take many cycles to complete
• Latency:
– number of intervening cycles between an instruction that
produces a result and instruction that uses the result
– Usually: depth of the EX stage -1
• Initiation interval:
– Number of cycles that must elapse between issuing two
operations of a given type
• Multi-cycle instructions/pipelines increase the probability of
WAW and RAW hazards
Example for a multi-cycle pipeline
[Figure: IF -> ID -> one of four execution paths -> MEM -> WB]
EX: integer unit (single stage)
M1 M2 M3 M4 M5 M6 M7: FP/Integer multiply unit
A1 A2 A3 A4: FP/Integer add unit
DIV: FP/Integer division (non pipelined)
Functional unit | Latency | Initiation interval
Integer ALU     | 0       | 1
Data memory     | 1       | 1
FP add          | 3       | 1
FP multiply     | 6       | 1
FP divide       | 24      | 25
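To make the definitions of latency and initiation interval concrete, the sketch below turns the table into stall-cycle estimates. It assumes a single-issue pipeline in which each intervening instruction hides one cycle of latency and nothing else stalls; the function names raw_stalls and structural_stalls are hypothetical.

```python
# (latency, initiation interval) per functional unit, taken from the table
UNITS = {
    "Integer ALU": (0, 1),
    "Data memory": (1, 1),
    "FP add": (3, 1),
    "FP multiply": (6, 1),
    "FP divide": (24, 25),
}

def raw_stalls(producer, distance=1):
    """Stalls for a consumer issued `distance` instructions after its producer."""
    latency, _ = UNITS[producer]
    return max(0, latency - (distance - 1))

def structural_stalls(unit, distance=1):
    """Stalls before a second operation of the same type can be issued."""
    _, interval = UNITS[unit]
    return max(0, interval - distance)

print(raw_stalls("FP multiply"))          # consumer directly follows an FP multiply
print(raw_stalls("FP add", distance=3))   # two independent instructions in between
print(structural_stalls("FP divide"))     # back-to-back divides on the non-pipelined unit
```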
Instruction level parallelism
• Exploit parallelism between independent instructions
– Limited by data dependencies
– Limited by branches
• Example:
– Each iteration of the loop is independent
– Exploitation of that fact is not trivial because of register
reuse!
for (i=0; i<n; i++ ) {
c[i] = a[i] + b[i];
}
Instruction level parallelism
• Data dependencies:
– True dependencies: instruction i produces a result required by instruction i+k, k>0 (RAW)
• sharing a register or a memory location
– Name dependencies: usage of the same register or memory location without data flow
• Antidependence: instruction i+k writes a register/memory location read by instruction i (WAR)
– No problem if not reordering instructions
• Output dependence: instruction i and instruction i+k write the same register/memory location (WAW)
– No problem if not reordering instructions
– Control dependencies: determine the ordering of an instruction i with respect to a branch
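The three data-dependence classes can be checked mechanically from the destination and source registers of two instructions. The following Python sketch is a minimal illustration; the tuple representation (destination, sources) and the function name classify are assumptions made here, and only register operands are considered, not memory locations.

```python
def classify(first, second):
    """Return the dependences of `second` on the earlier instruction `first`.

    Each instruction is (dest, sources), e.g. ("F0", ("F2", "F4")).
    """
    d1, s1 = first
    d2, s2 = second
    kinds = []
    if d1 in s2:
        kinds.append("RAW")   # true dependence: second reads what first writes
    if d2 in s1:
        kinds.append("WAR")   # antidependence: second writes what first reads
    if d1 == d2:
        kinds.append("WAW")   # output dependence: both write the same register
    return kinds

div = ("F0", ("F2", "F4"))     # DIV.D F0, F2, F4
add = ("F10", ("F0", "F8"))    # ADD.D F10, F0, F8
sub = ("F8", ("F8", "F14"))    # SUB.D F8, F8, F14

print(classify(div, add))   # ADD.D reads F0 produced by DIV.D
print(classify(add, sub))   # SUB.D overwrites F8 read by ADD.D
```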
Dynamic scheduling
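• Up to now
– Instructions are issued in program order
– If an instruction is stalled in the pipeline, no later instruction can proceed
DIV.D F0, F2, F4
ADD.D F10, F0, F8
SUB.D F12, F8, F14
– Here SUB.D does not depend on DIV.D or ADD.D, but it cannot proceed while ADD.D is stalled waiting for F0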
• In order to allow out-of-order execution, the ID stage is split into two parts:
– Instruction issue: decode instruction and check for structural hazards
– Read operands: Read operands if no data hazard
Dynamic scheduling
• Out-of-order execution introduces the possibility of WAR and WAW hazards
Program order:          Out-of-order execution:
DIV.D F0, F2, F4        DIV.D F0, F2, F4
ADD.D F10, F0, F8       SUB.D F8, F8, F14
SUB.D F8, F8, F14       ADD.D F10, F0, F8
• Out-of-order execution only improves performance if
– Multiple instructions can be executed at once
– Multiple functional units are available
• All instructions pass through the issue stage in order
• Instructions can be bypassed in the read-operand stage
• Algorithms allowing instructions to execute out-of-order
– Scoreboarding
– Tomasulo’s approach
Scoreboarding
• First implemented in the CDC 6600
• Assumption for the following slides:
– 2 multipliers
– 1 adder
– 1 divider
– 1 integer unit
• Each instruction goes through the scoreboard
– Scoreboard determines when an instruction can execute
– Scoreboard monitors usage of execution units
– Scoreboard monitors when a result can be written to the
destination register
Scoreboarding (II)
4 steps of Scoreboarding (replaces ID, EX and WB)
1. Issue: if functional unit is free and no other active
instruction has the same destination register
2. Read operands: Scoreboard monitors the availability of
operands.
3. Execution
4. Write result: if Execution done, Scoreboard checks for
WAR hazards and stalls the instruction if necessary.
Scoreboarding (III)
Scoreboard data structures:
• Instruction status: which of the four steps the instruction is in
• Functional unit status: status of a functional unit.
– Busy: indicates whether unit is busy or not
– Op: operation to be performed
– Fi: Destination register number
– Fj, Fk: Source register number
– Qj, Qk: Functional units producing source registers Fj, Fk
– Rj, Rk: Flags indicating whether Fj, Fk are ready. Set to NO
after operands are read.
• Register result status: which functional unit will write which
register
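The three scoreboard tables map naturally onto small record types. The Python sketch below is one possible representation, not the CDC 6600 implementation; the class names, the dictionary layout and the can_issue helper (which encodes the issue rule: functional unit free and no active instruction with the same destination register) are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FunctionalUnitStatus:
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None   # functional units producing Fj / Fk
    qk: Optional[str] = None
    rj: bool = False           # Fj / Fk ready and not yet read
    rk: bool = False

@dataclass
class Scoreboard:
    units: dict = field(default_factory=dict)               # name -> FunctionalUnitStatus
    instruction_status: dict = field(default_factory=dict)  # instruction -> current step
    register_result: dict = field(default_factory=dict)     # register -> unit that will write it

    def can_issue(self, unit, dest):
        """Issue rule: unit is free and no active instruction writes `dest`."""
        return not self.units[unit].busy and dest not in self.register_result

# State right after a load "L.D F6, 34(R2)" has been issued to the integer unit.
sb = Scoreboard(units={name: FunctionalUnitStatus() for name in
                       ("Integer", "Mult1", "Mult2", "Add", "Divide")})
sb.units["Integer"] = FunctionalUnitStatus(busy=True, op="Load", fi="F6", fk="R2", rk=True)
sb.instruction_status["L.D F6, 34(R2)"] = "Issue"
sb.register_result["F6"] = "Integer"

print(sb.can_issue("Integer", "F2"))   # integer unit is busy: structural hazard
print(sb.can_issue("Mult1", "F0"))     # free unit, no pending write to F0
```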
Scoreboarding example
L.D F6, 34(R2)
L.D F2, 45(R3)
MUL.D F0, F2, F4
SUB.D F8, F6, F2
DIV.D F10, F0, F6
ADD.D F6, F8, F2
Following slides are based on a lecture by Jelena Mirkovic,
University of Delaware
http://www.cis.udel.edu/~sunshine/courses/F04/CIS662/class10.pdf
Assumption:
ADD and SUB take 2 clock cycles
MULT takes 10 clock cycles
DIV takes 40 clock cycles
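Instruction status
Instruction       | Issue | Read operands | Execution complete | Write result
L.D F6, 34(R2)    | 1     |               |                    |
L.D F2, 45(R3)    |       |               |                    |
MUL.D F0, F2, F4  |       |               |                    |
SUB.D F8, F6, F2  |       |               |                    |
DIV.D F10, F0, F6 |       |               |                    |
ADD.D F6, F8, F2  |       |               |                    |
Functional unit status
Name    | Busy | Op   | Fi  | Fj | Fk | Qj      | Qk      | Rj  | Rk
Integer | Yes  | Load | F6  |    | R2 |         |         |     | Yes
Mult1   |      |      |     |    |    |         |         |     |
Mult2   |      |      |     |    |    |         |         |     |
Add     |      |      |     |    |    |         |         |     |
Divide  |      |      |     |    |    |         |         |     |
Register result status
   | F0    | F2      | F4 | F6      | F8  | F10 | F12 | … | F30
FU |       |         |    | Integer |     |     |     |   |
Time=1: Issue first load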
Instruction status
Instruction       | Issue | Read operands | Execution complete | Write result
L.D F6, 34(R2)    | 1     | 2             |                    |
L.D F2, 45(R3)    |       |               |                    |
MUL.D F0, F2, F4  |       |               |                    |
SUB.D F8, F6, F2  |       |               |                    |
DIV.D F10, F0, F6 |       |               |                    |
ADD.D F6, F8, F2  |       |               |                    |
Functional unit status
Name    | Busy | Op   | Fi  | Fj | Fk | Qj      | Qk      | Rj  | Rk
Integer | Yes  | Load | F6  |    | R2 |         |         |     | No
Mult1   |      |      |     |    |    |         |         |     |
Mult2   |      |      |     |    |    |         |         |     |
Add     |      |      |     |    |    |         |         |     |
Divide  |      |      |     |    |    |         |         |     |
Register result status
   | F0    | F2      | F4 | F6      | F8  | F10 | F12 | … | F30
FU |       |         |    | Integer |     |     |     |   |
Time=2: First load reads operands; second load cannot issue (structural hazard)
Instruction status
Instruction       | Issue | Read operands | Execution complete | Write result
L.D F6, 34(R2)    | 1     | 2             | 3                  |
L.D F2, 45(R3)    |       |               |                    |
MUL.D F0, F2, F4  |       |               |                    |
SUB.D F8, F6, F2  |       |               |                    |
DIV.D F10, F0, F6 |       |               |                    |
ADD.D F6, F8, F2  |       |               |                    |
Functional unit status
Name    | Busy | Op   | Fi  | Fj | Fk | Qj      | Qk      | Rj  | Rk
Integer | Yes  | Load | F6  |    | R2 |         |         |     | No
Mult1   |      |      |     |    |    |         |         |     |
Mult2   |      |      |     |    |    |         |         |     |
Add     |      |      |     |    |    |         |         |     |
Divide  |      |      |     |    |    |         |         |     |
Register result status
   | F0    | F2      | F4 | F6      | F8  | F10 | F12 | … | F30
FU |       |         |    | Integer |     |     |     |   |
Time=3: First load completes exec; second load cannot issue (SH)
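Instruction status
Instruction       | Issue | Read operands | Execution complete | Write result
L.D F6, 34(R2)    | 1     | 2             | 3                  | 4
L.D F2, 45(R3)    |       |               |                    |
MUL.D F0, F2, F4  |       |               |                    |
SUB.D F8, F6, F2  |       |               |                    |
DIV.D F10, F0, F6 |       |               |                    |
ADD.D F6, F8, F2  |       |               |                    |
Functional unit status
Name    | Busy | Op   | Fi  | Fj | Fk | Qj      | Qk      | Rj  | Rk
Integer |      |      |     |    |    |         |         |     |
Mult1   |      |      |     |    |    |         |         |     |
Mult2   |      |      |     |    |    |         |         |     |
Add     |      |      |     |    |    |         |         |     |
Divide  |      |      |     |    |    |         |         |     |
Register result status
   | F0    | F2      | F4 | F6      | F8  | F10 | F12 | … | F30
FU |       |         |    |         |     |     |     |   |
Time=4: First load writes result; second load cannot issue (SH)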
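Instruction status
Instruction       | Issue | Read operands | Execution complete | Write result
L.D F6, 34(R2)    | 1     | 2             | 3                  | 4
L.D F2, 45(R3)    | 5     |               |                    |
MUL.D F0, F2, F4  |       |               |                    |
SUB.D F8, F6, F2  |       |               |                    |
DIV.D F10, F0, F6 |       |               |                    |
ADD.D F6, F8, F2  |       |               |                    |
Functional unit status
Name    | Busy | Op   | Fi  | Fj | Fk | Qj      | Qk      | Rj  | Rk
Integer | Yes  | Load | F2  |    | R3 |         |         |     | Yes
Mult1   |      |      |     |    |    |         |         |     |
Mult2   |      |      |     |    |    |         |         |     |
Add     |      |      |     |    |    |         |         |     |
Divide  |      |      |     |    |    |         |         |     |
Register result status
   | F0    | F2      | F4 | F6      | F8  | F10 | F12 | … | F30
FU |       | Integer |    |         |     |     |     |   |
Time=5: Second load is issued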
Instruction status
Instruction       | Issue | Read operands | Execution complete | Write result
L.D F6, 34(R2)    | 1     | 2             | 3                  | 4
L.D F2, 45(R3)    | 5     | 6             |                    |
MUL.D F0, F2, F4  | 6     |               |                    |
SUB.D F8, F6, F2  |       |               |                    |
DIV.D F10, F0, F6 |       |               |                    |
ADD.D F6, F8, F2  |       |               |                    |
Functional unit status
Name    | Busy | Op   | Fi  | Fj | Fk | Qj      | Qk      | Rj  | Rk
Integer | Yes  | Load | F2  |    | R3 |         |         |     | No
Mult1   | Yes  | Mult | F0  | F2 | F4 | Integer |         | No  | Yes
Mult2   |      |      |     |    |    |         |         |     |
Add     |      |      |     |    |    |         |         |     |
Divide  |      |      |     |    |    |         |         |     |
Register result status
   | F0    | F2      | F4 | F6      | F8  | F10 | F12 | … | F30
FU | Mult1 | Integer |    |         |     |     |     |   |
Time=6: Second load reads operands; Mult is issued
Instruction status
Instruction       | Issue | Read operands | Execution complete | Write result
L.D F6, 34(R2)    | 1     | 2             | 3                  | 4
L.D F2, 45(R3)    | 5     | 6             | 7                  |
MUL.D F0, F2, F4  | 6     |               |                    |
SUB.D F8, F6, F2  | 7     |               |                    |
DIV.D F10, F0, F6 |       |               |                    |
ADD.D F6, F8, F2  |       |               |                    |
Functional unit status
Name    | Busy | Op   | Fi  | Fj | Fk | Qj      | Qk      | Rj  | Rk
Integer | Yes  | Load | F2  |    | R3 |         |         |     | No
Mult1   | Yes  | Mult | F0  | F2 | F4 | Integer |         | No  | Yes
Mult2   |      |      |     |    |    |         |         |     |
Add     | Yes  | Sub  | F8  | F6 | F2 |         | Integer | Yes | No
Divide  |      |      |     |    |    |         |         |     |
Register result status
   | F0    | F2      | F4 | F6      | F8  | F10 | F12 | … | F30
FU | Mult1 | Integer |    |         | Add |     |     |   |
Time=7: Second load completes exec; Mult is stalled waiting for F2; Sub is issued
Instruction status
Instruction       | Issue | Read operands | Execution complete | Write result
L.D F6, 34(R2)    | 1     | 2             | 3                  | 4
L.D F2, 45(R3)    | 5     | 6             | 7                  | 8
MUL.D F0, F2, F4  | 6     |               |                    |
SUB.D F8, F6, F2  | 7     |               |                    |
DIV.D F10, F0, F6 | 8     |               |                    |
ADD.D F6, F8, F2  |       |               |                    |
Functional unit status
Name    | Busy | Op   | Fi  | Fj | Fk | Qj      | Qk      | Rj  | Rk
Integer |      |      |     |    |    |         |         |     |
Mult1   | Yes  | Mult | F0  | F2 | F4 |         |         | Yes | Yes
Mult2   |      |      |     |    |    |         |         |     |
Add     | Yes  | Sub  | F8  | F6 | F2 |         |         | Yes | Yes
Divide  | Yes  | Div  | F10 | F0 | F6 | Mult1   |         | No  | Yes
Register result status
   | F0    | F2      | F4 | F6      | F8  | F10 | F12 | … | F30
FU | Mult1 |         |    |         | Add | Div |     |   |
Time=8: Second load writes result; Mult and Sub stalled (F2); Div is issued
Instruction status
Instruction       | Issue | Read operands | Execution complete | Write result
L.D F6, 34(R2)    | 1     | 2             | 3                  | 4
L.D F2, 45(R3)    | 5     | 6             | 7                  | 8
MUL.D F0, F2, F4  | 6     | 9             |                    |
SUB.D F8, F6, F2  | 7     | 9             |                    |
DIV.D F10, F0, F6 | 8     |               |                    |
ADD.D F6, F8, F2  |       |               |                    |
Functional unit status
Name    | Busy | Op   | Fi  | Fj | Fk | Qj      | Qk      | Rj  | Rk
Integer |      |      |     |    |    |         |         |     |
Mult1   | Yes  | Mult | F0  | F2 | F4 |         |         | No  | No
Mult2   |      |      |     |    |    |         |         |     |
Add     | Yes  | Sub  | F8  | F6 | F2 |         |         | No  | No
Divide  | Yes  | Div  | F10 | F0 | F6 | Mult1   |         | No  | Yes
Register result status
   | F0    | F2      | F4 | F6      | F8  | F10 | F12 | … | F30
FU | Mult1 |         |    |         | Add | Div |     |   |
Time=9: Mult and Sub read operands; Div stalled waiting for F0; Add not issued (SH)
Instruction status
Instruction       | Issue | Read operands | Execution complete | Write result
L.D F6, 34(R2)    | 1     | 2             | 3                  | 4
L.D F2, 45(R3)    | 5     | 6             | 7                  | 8
MUL.D F0, F2, F4  | 6     | 9             |                    |
SUB.D F8, F6, F2  | 7     | 9             |                    |
DIV.D F10, F0, F6 | 8     |               |                    |
ADD.D F6, F8, F2  |       |               |                    |
Functional unit status
Name    | Busy | Op   | Fi  | Fj | Fk | Qj      | Qk      | Rj  | Rk
Integer |      |      |     |    |    |         |         |     |
Mult1   | Yes  | Mult | F0  | F2 | F4 |         |         | No  | No
Mult2   |      |      |     |    |    |         |         |     |
Add     | Yes  | Sub  | F8  | F6 | F2 |         |         | No  | No
Divide  | Yes  | Div  | F10 | F0 | F6 | Mult1   |         | No  | Yes
Register result status
   | F0    | F2      | F4 | F6      | F8  | F10 | F12 | … | F30
FU | Mult1 |         |    |         | Add | Div |     |   |
Time=10: Mult executing (1 out of 10 cycles); Sub executing (1 out of 2 cycles); Div stalled (F0)
Instruction status
Instruction       | Issue | Read operands | Execution complete | Write result
L.D F6, 34(R2)    | 1     | 2             | 3                  | 4
L.D F2, 45(R3)    | 5     | 6             | 7                  | 8
MUL.D F0, F2, F4  | 6     | 9             |                    |
SUB.D F8, F6, F2  | 7     | 9             | 11                 |
DIV.D F10, F0, F6 | 8     |               |                    |
ADD.D F6, F8, F2  |       |               |                    |
Functional unit status
Name    | Busy | Op   | Fi  | Fj | Fk | Qj      | Qk      | Rj  | Rk
Integer |      |      |     |    |    |         |         |     |
Mult1   | Yes  | Mult | F0  | F2 | F4 |         |         | No  | No
Mult2   |      |      |     |    |    |         |         |     |
Add     | Yes  | Sub  | F8  | F6 | F2 |         |         | No  | No
Divide  | Yes  | Div  | F10 | F0 | F6 | Mult1   |         | No  | Yes
Register result status
   | F0    | F2      | F4 | F6      | F8  | F10 | F12 | … | F30
FU | Mult1 |         |    |         | Add | Div |     |   |
Time=11: Mult executing (2/10); Sub completes execution; Div stalled (F0)
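Instruction status
Instruction       | Issue | Read operands | Execution complete | Write result
L.D F6, 34(R2)    | 1     | 2             | 3                  | 4
L.D F2, 45(R3)    | 5     | 6             | 7                  | 8
MUL.D F0, F2, F4  | 6     | 9             |                    |
SUB.D F8, F6, F2  | 7     | 9             | 11                 | 12
DIV.D F10, F0, F6 | 8     |               |                    |
ADD.D F6, F8, F2  |       |               |                    |
Functional unit status
Name    | Busy | Op   | Fi  | Fj | Fk | Qj      | Qk      | Rj  | Rk
Integer |      |      |     |    |    |         |         |     |
Mult1   | Yes  | Mult | F0  | F2 | F4 |         |         | No  | No
Mult2   |      |      |     |    |    |         |         |     |
Add     |      |      |     |    |    |         |         |     |
Divide  | Yes  | Div  | F10 | F0 | F6 | Mult1   |         | No  | Yes
Register result status
   | F0    | F2      | F4 | F6      | F8  | F10 | F12 | … | F30
FU | Mult1 |         |    |         |     | Div |     |   |
Time=12: Mult executing (3/10); Sub writes result; Div stalled (F0)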
Instruction status
Instruction       | Issue | Read operands | Execution complete | Write result
L.D F6, 34(R2)    | 1     | 2             | 3                  | 4
L.D F2, 45(R3)    | 5     | 6             | 7                  | 8
MUL.D F0, F2, F4  | 6     | 9             |                    |
SUB.D F8, F6, F2  | 7     | 9             | 11                 | 12
DIV.D F10, F0, F6 | 8     |               |                    |
ADD.D F6, F8, F2  | 13    |               |                    |
Functional unit status
Name    | Busy | Op   | Fi  | Fj | Fk | Qj      | Qk      | Rj  | Rk
Integer |      |      |     |    |    |         |         |     |
Mult1   | Yes  | Mult | F0  | F2 | F4 |         |         | No  | No
Mult2   |      |      |     |    |    |         |         |     |
Add     | Yes  | Add  | F6  | F8 | F2 |         |         | Yes | Yes
Divide  | Yes  | Div  | F10 | F0 | F6 | Mult1   |         | No  | Yes
Register result status
   | F0    | F2      | F4 | F6      | F8  | F10 | F12 | … | F30
FU | Mult1 |         |    | Add     |     | Div |     |   |
Time=13: Mult executing (4/10); Div stalled (F0); Add issued
Instruction status
Instruction       | Issue | Read operands | Execution complete | Write result
L.D F6, 34(R2)    | 1     | 2             | 3                  | 4
L.D F2, 45(R3)    | 5     | 6             | 7                  | 8
MUL.D F0, F2, F4  | 6     | 9             |                    |
SUB.D F8, F6, F2  | 7     | 9             | 11                 | 12
DIV.D F10, F0, F6 | 8     |               |                    |
ADD.D F6, F8, F2  | 13    | 14            |                    |
Functional unit status
Name    | Busy | Op   | Fi  | Fj | Fk | Qj      | Qk      | Rj  | Rk
Integer |      |      |     |    |    |         |         |     |
Mult1   | Yes  | Mult | F0  | F2 | F4 |         |         | No  | No
Mult2   |      |      |     |    |    |         |         |     |
Add     | Yes  | Add  | F6  | F8 | F2 |         |         | No  | No
Divide  | Yes  | Div  | F10 | F0 | F6 | Mult1   |         | No  | Yes
Register result status
   | F0    | F2      | F4 | F6      | F8  | F10 | F12 | … | F30
FU | Mult1 |         |    | Add     |     | Div |     |   |
Time=14: Mult executing (5/10); Div stalled (F0); Add reads operands
Instruction status
Instruction       | Issue | Read operands | Execution complete | Write result
L.D F6, 34(R2)    | 1     | 2             | 3                  | 4
L.D F2, 45(R3)    | 5     | 6             | 7                  | 8
MUL.D F0, F2, F4  | 6     | 9             |                    |
SUB.D F8, F6, F2  | 7     | 9             | 11                 | 12
DIV.D F10, F0, F6 | 8     |               |                    |
ADD.D F6, F8, F2  | 13    | 14            |                    |
Functional unit status
Name    | Busy | Op   | Fi  | Fj | Fk | Qj      | Qk      | Rj  | Rk
Integer |      |      |     |    |    |         |         |     |
Mult1   | Yes  | Mult | F0  | F2 | F4 |         |         | No  | No
Mult2   |      |      |     |    |    |         |         |     |
Add     | Yes  | Add  | F6  | F8 | F2 |         |         | No  | No
Divide  | Yes  | Div  | F10 | F0 | F6 | Mult1   |         | No  | Yes
Register result status
   | F0    | F2      | F4 | F6      | F8  | F10 | F12 | … | F30
FU | Mult1 |         |    | Add     |     | Div |     |   |
Time=15: Mult executing (6/10); Div stalled (F0); Add executes (1 of 2 cycles)
Instruction status
Instruction       | Issue | Read operands | Execution complete | Write result
L.D F6, 34(R2)    | 1     | 2             | 3                  | 4
L.D F2, 45(R3)    | 5     | 6             | 7                  | 8
MUL.D F0, F2, F4  | 6     | 9             |                    |
SUB.D F8, F6, F2  | 7     | 9             | 11                 | 12
DIV.D F10, F0, F6 | 8     |               |                    |
ADD.D F6, F8, F2  | 13    | 14            | 16                 |
Functional unit status
Name    | Busy | Op   | Fi  | Fj | Fk | Qj      | Qk      | Rj  | Rk
Integer |      |      |     |    |    |         |         |     |
Mult1   | Yes  | Mult | F0  | F2 | F4 |         |         | No  | No
Mult2   |      |      |     |    |    |         |         |     |
Add     | Yes  | Add  | F6  | F8 | F2 |         |         | No  | No
Divide  | Yes  | Div  | F10 | F0 | F6 | Mult1   |         | No  | Yes
Register result status
   | F0    | F2      | F4 | F6      | F8  | F10 | F12 | … | F30
FU | Mult1 |         |    | Add     |     | Div |     |   |
Time=16: Mult executing (7/10); Div stalled (F0); Add completes exec
Instruction status
Instruction       | Issue | Read operands | Execution complete | Write result
L.D F6, 34(R2)    | 1     | 2             | 3                  | 4
L.D F2, 45(R3)    | 5     | 6             | 7                  | 8
MUL.D F0, F2, F4  | 6     | 9             |                    |
SUB.D F8, F6, F2  | 7     | 9             | 11                 | 12
DIV.D F10, F0, F6 | 8     |               |                    |
ADD.D F6, F8, F2  | 13    | 14            | 16                 |
Functional unit status
Name    | Busy | Op   | Fi  | Fj | Fk | Qj      | Qk      | Rj  | Rk
Integer |      |      |     |    |    |         |         |     |
Mult1   | Yes  | Mult | F0  | F2 | F4 |         |         | No  | No
Mult2   |      |      |     |    |    |         |         |     |
Add     | Yes  | Add  | F6  | F8 | F2 |         |         | No  | No
Divide  | Yes  | Div  | F10 | F0 | F6 | Mult1   |         | No  | Yes
Register result status
   | F0    | F2      | F4 | F6      | F8  | F10 | F12 | … | F30
FU | Mult1 |         |    | Add     |     | Div |     |   |
Time=17: Mult executing (8/10); Div stalled (F0); Add stalled (WAR hazard on F6)
Instruction status
Instruction       | Issue | Read operands | Execution complete | Write result
L.D F6, 34(R2)    | 1     | 2             | 3                  | 4
L.D F2, 45(R3)    | 5     | 6             | 7                  | 8
MUL.D F0, F2, F4  | 6     | 9             | 19                 |
SUB.D F8, F6, F2  | 7     | 9             | 11                 | 12
DIV.D F10, F0, F6 | 8     |               |                    |
ADD.D F6, F8, F2  | 13    | 14            | 16                 |
Functional unit status
Name    | Busy | Op   | Fi  | Fj | Fk | Qj      | Qk      | Rj  | Rk
Integer |      |      |     |    |    |         |         |     |
Mult1   | Yes  | Mult | F0  | F2 | F4 |         |         | No  | No
Mult2   |      |      |     |    |    |         |         |     |
Add     | Yes  | Add  | F6  | F8 | F2 |         |         | No  | No
Divide  | Yes  | Div  | F10 | F0 | F6 | Mult1   |         | No  | Yes
Register result status
   | F0    | F2      | F4 | F6      | F8  | F10 | F12 | … | F30
FU | Mult1 |         |    | Add     |     | Div |     |   |
Time=19: Mult completes exec; Div stalled (F0); Add stalled (WAR hazard on F6)
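Instruction status
Instruction       | Issue | Read operands | Execution complete | Write result
L.D F6, 34(R2)    | 1     | 2             | 3                  | 4
L.D F2, 45(R3)    | 5     | 6             | 7                  | 8
MUL.D F0, F2, F4  | 6     | 9             | 19                 | 20
SUB.D F8, F6, F2  | 7     | 9             | 11                 | 12
DIV.D F10, F0, F6 | 8     |               |                    |
ADD.D F6, F8, F2  | 13    | 14            | 16                 |
Functional unit status
Name    | Busy | Op   | Fi  | Fj | Fk | Qj      | Qk      | Rj  | Rk
Integer |      |      |     |    |    |         |         |     |
Mult1   |      |      |     |    |    |         |         |     |
Mult2   |      |      |     |    |    |         |         |     |
Add     | Yes  | Add  | F6  | F8 | F2 |         |         | No  | No
Divide  | Yes  | Div  | F10 | F0 | F6 |         |         | Yes | Yes
Register result status
   | F0    | F2      | F4 | F6      | F8  | F10 | F12 | … | F30
FU |       |         |    | Add     |     | Div |     |   |
Time=20: Mult writes result; Div stalled (F0); Add stalled (WAR hazard on F6)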
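Instruction status
Instruction       | Issue | Read operands | Execution complete | Write result
L.D F6, 34(R2)    | 1     | 2             | 3                  | 4
L.D F2, 45(R3)    | 5     | 6             | 7                  | 8
MUL.D F0, F2, F4  | 6     | 9             | 19                 | 20
SUB.D F8, F6, F2  | 7     | 9             | 11                 | 12
DIV.D F10, F0, F6 | 8     | 21            |                    |
ADD.D F6, F8, F2  | 13    | 14            | 16                 |
Functional unit status
Name    | Busy | Op   | Fi  | Fj | Fk | Qj      | Qk      | Rj  | Rk
Integer |      |      |     |    |    |         |         |     |
Mult1   |      |      |     |    |    |         |         |     |
Mult2   |      |      |     |    |    |         |         |     |
Add     | Yes  | Add  | F6  | F8 | F2 |         |         | No  | No
Divide  | Yes  | Div  | F10 | F0 | F6 |         |         | No  | No
Register result status
   | F0    | F2      | F4 | F6      | F8  | F10 | F12 | … | F30
FU |       |         |    | Add     |     | Div |     |   |
Time=21: Div reads operands; Add stalled (WAR hazard on F6)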
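Instruction status
Instruction       | Issue | Read operands | Execution complete | Write result
L.D F6, 34(R2)    | 1     | 2             | 3                  | 4
L.D F2, 45(R3)    | 5     | 6             | 7                  | 8
MUL.D F0, F2, F4  | 6     | 9             | 19                 | 20
SUB.D F8, F6, F2  | 7     | 9             | 11                 | 12
DIV.D F10, F0, F6 | 8     | 21            |                    |
ADD.D F6, F8, F2  | 13    | 14            | 16                 | 22
Functional unit status
Name    | Busy | Op   | Fi  | Fj | Fk | Qj      | Qk      | Rj  | Rk
Integer |      |      |     |    |    |         |         |     |
Mult1   |      |      |     |    |    |         |         |     |
Mult2   |      |      |     |    |    |         |         |     |
Add     |      |      |     |    |    |         |         |     |
Divide  | Yes  | Div  | F10 | F0 | F6 |         |         | No  | No
Register result status
   | F0    | F2      | F4 | F6      | F8  | F10 | F12 | … | F30
FU |       |         |    |         |     | Div |     |   |
Time=22: Div executes (1/40); Add writes result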
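Instruction status
Instruction       | Issue | Read operands | Execution complete | Write result
L.D F6, 34(R2)    | 1     | 2             | 3                  | 4
L.D F2, 45(R3)    | 5     | 6             | 7                  | 8
MUL.D F0, F2, F4  | 6     | 9             | 19                 | 20
SUB.D F8, F6, F2  | 7     | 9             | 11                 | 12
DIV.D F10, F0, F6 | 8     | 21            | 61                 |
ADD.D F6, F8, F2  | 13    | 14            | 16                 | 22
Functional unit status
Name    | Busy | Op   | Fi  | Fj | Fk | Qj      | Qk      | Rj  | Rk
Integer |      |      |     |    |    |         |         |     |
Mult1   |      |      |     |    |    |         |         |     |
Mult2   |      |      |     |    |    |         |         |     |
Add     |      |      |     |    |    |         |         |     |
Divide  | Yes  | Div  | F10 | F0 | F6 |         |         | No  | No
Register result status
   | F0    | F2      | F4 | F6      | F8  | F10 | F12 | … | F30
FU |       |         |    |         |     | Div |     |   |
Time=61: Div completes execution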
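Instruction status
Instruction       | Issue | Read operands | Execution complete | Write result
L.D F6, 34(R2)    | 1     | 2             | 3                  | 4
L.D F2, 45(R3)    | 5     | 6             | 7                  | 8
MUL.D F0, F2, F4  | 6     | 9             | 19                 | 20
SUB.D F8, F6, F2  | 7     | 9             | 11                 | 12
DIV.D F10, F0, F6 | 8     | 21            | 61                 | 62
ADD.D F6, F8, F2  | 13    | 14            | 16                 | 22
Functional unit status
Name    | Busy | Op   | Fi  | Fj | Fk | Qj      | Qk      | Rj  | Rk
Integer |      |      |     |    |    |         |         |     |
Mult1   |      |      |     |    |    |         |         |     |
Mult2   |      |      |     |    |    |         |         |     |
Add     |      |      |     |    |    |         |         |     |
Divide  |      |      |     |    |    |         |         |     |
Register result status
   | F0    | F2      | F4 | F6      | F8  | F10 | F12 | … | F30
FU |       |         |    |         |     |     |     |   |
Time=62: Div writes result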
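The whole walkthrough can be replayed with a compact Python sketch of the four scoreboard steps. It is a simplified model, not the CDC 6600 logic: it assumes one instruction is issued per cycle, that a step can only react to events from earlier cycles, and that a load needs 1 execution cycle (inferred from the Time=2 and Time=3 snapshots, since the stated assumptions only cover ADD/SUB, MULT and DIV). The names Instr, simulate, LATENCY and UNITS are hypothetical.

```python
from dataclasses import dataclass

LATENCY = {"Integer": 1, "Add": 2, "Mult": 10, "Divide": 40}  # execution cycles
UNITS = {"Integer": 1, "Add": 1, "Mult": 2, "Divide": 1}      # units per class

@dataclass
class Instr:
    text: str
    unit: str
    dest: str
    srcs: tuple
    issue: int = 0   # cycle in which each step happened; 0 means "not yet"
    read: int = 0
    done: int = 0
    write: int = 0

def simulate(program):
    cycle, next_issue = 0, 0
    while any(i.write == 0 for i in program):
        cycle += 1
        past = lambda t: 0 < t < cycle   # step happened in an earlier cycle
        for idx, ins in enumerate(program):
            earlier = program[:idx]
            if past(ins.done) and not ins.write:
                # Write result: wait until earlier readers of dest have read (WAR)
                if all(past(j.read) for j in earlier if ins.dest in j.srcs):
                    ins.write = cycle
            elif past(ins.issue) and not ins.read:
                # Read operands: wait until earlier producers have written (RAW)
                if all(past(j.write) for j in earlier if j.dest in ins.srcs):
                    ins.read = cycle
                    ins.done = cycle + LATENCY[ins.unit]
        if next_issue < len(program):
            # Issue in order: needs a free unit and no active writer of dest (WAW)
            ins = program[next_issue]
            active = [j for j in program[:next_issue] if not past(j.write)]
            busy = sum(j.unit == ins.unit for j in active)
            waw = any(j.dest == ins.dest for j in active)
            if busy < UNITS[ins.unit] and not waw:
                ins.issue = cycle
                next_issue += 1
    return program

program = [
    Instr("L.D F6, 34(R2)", "Integer", "F6", ("R2",)),
    Instr("L.D F2, 45(R3)", "Integer", "F2", ("R3",)),
    Instr("MUL.D F0, F2, F4", "Mult", "F0", ("F2", "F4")),
    Instr("SUB.D F8, F6, F2", "Add", "F8", ("F6", "F2")),
    Instr("DIV.D F10, F0, F6", "Divide", "F10", ("F0", "F6")),
    Instr("ADD.D F6, F8, F2", "Add", "F6", ("F8", "F2")),
]
timeline = [(i.issue, i.read, i.done, i.write) for i in simulate(program)]
for ins in program:
    print(f"{ins.text:18} {ins.issue:3} {ins.read:3} {ins.done:3} {ins.write:3}")
```

Under these assumptions the sketch yields the same cycles as the captions of the snapshots, including the structural stall of the second load, the late issue of ADD.D at cycle 13 and its delayed write at cycle 22 caused by the WAR hazard on F6.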
Scoreboarding (IV)
• Performance of scoreboarding depends on
– The amount of parallelism available among instructions
– Number of scoreboard entries
– Number and type of functional units
– Presence of antidependences and output dependences