DLX computer
Electronic Computers M
RISC architectures
• RISC vs CISC (Reduced Instruction Set Computer vs Complex Instruction Set Computer)
• In CISC architectures, 10% of the instructions are used in 90% of cases
• Waste of silicon
• Bottleneck: the bus
• Mid '80s: a new architecture, RISC
• Solution: reduce the number and the complexity of the instructions (fewer, simpler machine instructions)
• Fixed instruction format (simpler instruction decoders)
• Simpler control logic network, which leaves room for a larger number of on-chip registers
• Reduction of bus/memory accesses
• More machine instructions are needed for a given job, but this is (in many cases) more than compensated, in terms of time, by the reduction of bus accesses
• CISC and RISC are each the best solution in different application fields
• Nowadays both architectures coexist in the same processor: analysis at the end of the course
• A simplified RISC architecture: DLX (a teaching architecture closely modeled on the MIPS processors introduced in the '80s)
DLX (fixed) instruction format

Format R
  Field:  Op-code | Ra     | Rb     | Rc     | Op-code extension
  Width:  6 bit   | 5 bit  | 5 bit  | 5 bit  | 11 bit
  Bits:   31..26  | 25..21 | 20..16 | 15..11 | 10..0
  Arithmetic or logic instructions of the form Rd <= RS1 op RS2, or Set Condition between registers.

Format J
  Field:  Op-code | 26 bit (PC relative) offset
  Bits:   31..26  | 25..0
  Direct and unconditional control transfer (J and JAL).

Format I
  Field:  Op-code | Ra     | Rb     | 16 bit immediate operand
  Bits:   31..26  | 25..21 | 20..16 | 15..0
  Data transfer (Load, Store), conditional Branch, JR and JALR (control transfer via register), Set Condition and ALU with immediate operand.
  In LD and ALU instructions RS2 = destination; in ST instructions RS2 = source.
  RS1 is used as base address, or as ALU value for the immediate instructions.
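As a rough illustration of the three formats, the following Python sketch slices a 32-bit instruction word into the fields listed above. The helper names `bits` and `decode` and the dictionary keys are illustrative choices, not part of the DLX definition, and the opcode value in the example is arbitrary.

```python
def bits(word, hi, lo):
    """Return bits hi..lo (inclusive) of a 32-bit word as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode(word, fmt):
    """Split a 32-bit instruction word according to format 'R', 'I' or 'J'."""
    fields = {"opcode": bits(word, 31, 26)}
    if fmt == "R":
        fields.update(ra=bits(word, 25, 21), rb=bits(word, 20, 16),
                      rc=bits(word, 15, 11), ext=bits(word, 10, 0))
    elif fmt == "I":
        fields.update(ra=bits(word, 25, 21), rb=bits(word, 20, 16),
                      imm16=bits(word, 15, 0))
    elif fmt == "J":
        fields.update(offset26=bits(word, 25, 0))
    else:
        raise ValueError("unknown format: " + fmt)
    return fields


# Arbitrary example word (hypothetical opcode 0x23): Ra = 3, Rb = 5, immediate 0xFFFC.
word = (0x23 << 26) | (3 << 21) | (5 << 16) | 0xFFFC
print(decode(word, "I"))
```

Because the format is fixed, the opcode always sits in bits 31..26, which is what keeps the decoder simple.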
DLX non floating-point instructions
(31 x 32-bit registers R31…R1; R0 = 0 fixed; Ra and Rb can be any of the 32 registers)

Data Transfer
LW Ra, offset(Rb)
LB Ra, offset(Rb)
LBU Ra, offset(Rb)
LHU Ra, offset(Rb)
LH Ra, offset(Rb)
SW Ra, offset(Rb)
SH Ra, offset(Rb)
SB Ra, offset(Rb)
LHI Ra, value

Arithmetic/Logic
ADD Ra,Rb,Rc     ADDI Ra,Rb,value
ADDU Ra,Rb,Rc    ADDUI Ra,Rb,value
SUB Ra,Rb,Rc     SUBI Ra,Rb,value
SUBU Ra,Rb,Rc    SUBUI Ra,Rb,value
DIV Ra,Rb,Rc     DIVI Ra,Rb,value
MULU Ra,Rb,Rc    MULI Ra,Rb,value
SLL Ra,Rb,Rc     SLLI Ra,Rb,value
SHR Ra,Rb,Rc     SHRI Ra,Rb,value
SLA Ra,Rb,Rc     SLAI Ra,Rb,value
OR Ra,Rb,Rc      ORI Ra,Rb,value
XOR Ra,Rb,Rc     XORI Ra,Rb,value
AND Ra,Rb,Rc     ANDI Ra,Rb,value

Control
SETx Ra,Rb,Rc
SETIx Ra,Rb,value
BEQZ Ra, offset
BNEQZ Ra, offset
J offset
JR Ra
JL offset
JLR Ra

N.B.
• Postfix x (set condition) can be LT, GT, LE, GE, EQ, NE
• JL (via or not via register) -> Jump and link, saving PC in R31
• Offset is a value within the instruction
• Postfix I means «immediate» (value within the instruction)
• Postfix A means «arithmetic» (sign extension)
• Postfix U means «unsigned»
• Value is the immediate within the instruction
• No STACK registers
DLX ALU operations
Two data inputs, one data output plus flags.
S1, S2: ALU inputs (32 bit). OUT: ALU output (32 bit).

Operations
• S1 + S2
• S1 – S2
• S1 and S2
• S1 or S2
• S1 exor S2
• Left shift of S1 by S2 positions
• Right shift of S1 by S2 positions
• Arithmetic right shift of S1 by S2 positions
• S1
• S2
• 0
• 1

Output flags
• Zero
• Negative sign

The ALU is a combinatorial circuit!
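The ALU can be modelled as a pure function of its inputs, which is one way to read "combinatorial". The sketch below is only a behavioural model in Python; the operation names, the helper `to_signed` and the choice of using only the low 5 bits of S2 as shift amount are assumptions made for the example.

```python
MASK = 0xFFFFFFFF  # 32-bit data path


def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement integer."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


def alu(op, s1, s2):
    """Return (out, zero_flag, negative_flag) for the selected operation."""
    shift = s2 & 31  # assumption: only the low 5 bits of S2 give the shift amount
    results = {
        "add": s1 + s2,
        "sub": s1 - s2,
        "and": s1 & s2,
        "or": s1 | s2,
        "xor": s1 ^ s2,
        "sll": s1 << shift,
        "srl": (s1 & MASK) >> shift,   # logical right shift: zeros enter from the left
        "sra": to_signed(s1) >> shift,  # arithmetic right shift: the sign bit is replicated
        "s1": s1,
        "s2": s2,
        "zero": 0,
        "one": 1,
    }
    out = results[op] & MASK
    return out, out == 0, bool(out & 0x80000000)


print(alu("sub", 3, 5))                 # negative result sets the sign flag
print(hex(alu("sra", 0x80000000, 4)[0]))
```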
Abstract instruction execution
PC is the Program Counter. A and B are two scratchpad internal registers unknown to the programmer.

INSTRUCTION FETCH (wait while memory is not Ready): [REGINSTR] <= M[PC]
INSTRUCTION DECODE: [PC] <= [PC] + 4 ; [A] <= [Ra] ; [B] <= [Rb]
INSTRUCTION EXECUTION, according to the instruction class: Data transfer, ALU, Set, Jump, Branch
Then: Next Instruction
Example: LB (LOAD BYTE, format I)

LB Ra, offset(Rb)
Field:  Op-code | RS1    | RS2    | 16 bit immediate operand
Bits:   31..26  | 25..21 | 20..16 | 15..0        (bit 31: MSbit, bit 0: LSbit)

INSTRUCTION FETCH:   INSTR <= M[PC]
INSTRUCTION DECODE:  [PC] <= [PC] + 4 ; [A] <= [Ra] ; [B] <= [Rb]        ([A] = [RS1])
LOAD, byte address computation:  Byte Addr. <= [B] + (Instr15)16 ## Instr15..0
LOAD, byte in register:          [Ra] <= (M[Addr.]7)24 ## M[Addr.]7..0   (Dest. Reg. = RS2)

## => JOIN operator
(Instr15)16: sign extension. Instruction bit 15 (sign) is extended 16 times to the left. Instr15..0 is the instruction offset; the address is always 32 bit.
(M[Addr.]7)24: sign extension of the loaded byte!

Example: M[Addr]7..0 = A7H => (10100111)b
Sign-extended value written to the register <= FFFFFFA7H
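The two sign extensions used by LB (16-bit offset and loaded byte) can be checked with a few lines of Python. This is a minimal sketch: `sign_extend` and `load_byte` are hypothetical helper names, memory is modelled as a dictionary from byte address to byte value, and the base address and offset are made-up numbers; only the A7H example byte comes from the slide.

```python
MASK = 0xFFFFFFFF


def sign_extend(value, width):
    """Replicate bit (width-1) of value to the left, up to 32 bits."""
    sign_bit = 1 << (width - 1)
    value &= (1 << width) - 1
    return ((value ^ sign_bit) - sign_bit) & MASK


def load_byte(memory, base, instr):
    """LB: Addr <= base + (Instr15)16 ## Instr15..0, result <= (M[Addr]7)24 ## M[Addr]7..0."""
    offset = sign_extend(instr & 0xFFFF, 16)
    addr = (base + offset) & MASK  # the address is always 32 bit
    return sign_extend(memory[addr], 8)


# Hypothetical memory content: byte A7H at address 0x0FFC; offset field 0xFFFC means -4.
memory = {0x0FFC: 0xA7}
print(hex(load_byte(memory, 0x1000, 0xFFFC)))  # prints 0xffffffa7
```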
Sign extension: (IR15)16 ## IR15..0
[Figure: sign-extension network. IR bits 15-0 are passed unchanged; bits 31…16 are copies of IR15. Tri-state devices are controlled from the Control Unit.]
Data transfer instructions (I format)

Examples
LW Ra, offset(Rb)
LB Ra, offset(Rb)
LBU Ra, offset(Rb)   unsigned
LHU Ra, offset(Rb)   unsigned
SW Ra, offset(Rb)

Address computation (all cases):  Addr. <= [B] + (Instr15)16 ## Instr15..0

LB  (byte, signed):          [Ra] <= (M[Addr]7)24 ## M[Addr]7..0
LBU (byte, unsigned):        [Ra] <= (0)24 ## M[Addr]7..0
LH  (half word, signed):     [Ra] <= (M[Addr]15)16 ## M[Addr]15..0
LHU (half word, unsigned):   [Ra] <= (0)16 ## M[Addr]15..0
LW  (word):                  [Ra] <= M[Addr]
SW  (store word):            M[Addr] <= [A]
ALU instruction examples (R and I formats)

ADD Ra,Rb,Rc
ADDI Ra,Rb,value
ADDU Ra,Rb,Rc
ADDUI Ra,Rb,value
…

Second operand (T is a temporary hidden register unknown to the programmer):
Register (format R):   [T] <= [Rc]
Immediate (format I):  [T] <= (Instr15)16 ## Instr15..0

ADD:  [Ra] <= [B] + [T]
SUB:  [Ra] <= [B] - [T]
AND:  [Ra] <= [B] and [T]
OR:   [Ra] <= [B] or [T]
XOR:  [Ra] <= [B] xor [T]

Register content is treated as signed in arithmetic operations.
The same scheme applies to the shifts etc. A and B are generic registers (RS1, RS2).
SET instructions (see branch)

Ex. SLT Ra,Rb,Rc: set Ra = 1 if Rb is less than Rc, otherwise Ra = 0

Second operand:
Register (format R):   [T] <= [Rc]
Immediate (format I):  [T] <= (Instr15)16 ## Instr15..0

SEQ:  [Ra] = 1 if [Rb] = [T]
SNE:  [Ra] = 1 if [Rb] != [T]
SLT:  [Ra] = 1 if [Rb] < [T]
SGE:  [Ra] = 1 if [Rb] >= [T]
SGT:  [Ra] = 1 if [Rb] > [T]
SLE:  [Ra] = 1 if [Rb] <= [T]

Register content is treated as signed.
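Since the comparison treats register contents as signed, the bit pattern FFFFFFFFH counts as -1. A small Python sketch of the six conditions follows; the names `CONDITIONS`, `to_signed` and `set_condition` are chosen for the example only.

```python
import operator

# Postfix x of SETx mapped to the corresponding comparison.
CONDITIONS = {
    "EQ": operator.eq, "NE": operator.ne,
    "LT": operator.lt, "GE": operator.ge,
    "GT": operator.gt, "LE": operator.le,
}


def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def set_condition(postfix, rb, t):
    """Return the value written to Ra: 1 if the condition holds, otherwise 0."""
    return 1 if CONDITIONS[postfix](to_signed(rb), to_signed(t)) else 0


print(set_condition("LT", 0xFFFFFFFF, 1))  # -1 < 1, so Ra = 1
```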
JUMP instructions

J offset     (jump address)
JR Ra        (jump register)
JL offset    (jump and link address), also written JAL
JLR Ra       (jump and link register), also written JALR

Link step, JAL and JALR only (for saving [PC] in R31):
[T] <= [PC]
[R31] <= [T]

Jump step:
J, JAL (format J):    [PC] <= [PC] + (Instr25)6 ## Instr25..0
JR, JALR (format I):  [PC] <= [A]
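BRANCH instructions (BEQZ, BNEZ)

BEQZ tests whether [A] is zero, BNEZ (BNEQZ in the instruction list) whether [A] is not zero. After a Set Condition instruction, [A] holds 0 or 1.
YES (branch taken):    [PC] <= [PC] + (Instr15)16 ## Instr15..0
NO (branch not taken): [PC] keeps the value [PC] + 4 set during decode

Ex. BNEQZ R5, 100: jump to PC+100 if R5 is not equal to 0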
The Pipelining Principle
Pipelining is the main technique used for “speeding up” a CPU.
The key idea of pipelining is general, and is currently applied in several industrial fields (production lines, oil pipelines, …).
A system S must operate N times on a task Ai, producing result Ri:
A1, A2, A3 … AN  ->  S  ->  R1, R2, R3 … RN
Latency: time elapsed between the beginning and the end of task A (TA).
Throughput: frequency of task completion.
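The Pipelining Principle
1) Sequential System
[Figure: tasks A1, A2, A3 … AN executed one after the other along the time axis t, each lasting TA.]
Latency (execution time of a single instruction) = TA
2) Pipelined System: instructions are subdivided into stages, each stage lasting one n-th (one fourth in this example) of the entire instruction. Instructions overlap.
[Figure: system S split into stages S1 S2 S3 S4; task A split into phases P1 P2 P3 P4 along the time axis t.]
Si: pipeline stage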
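The Pipelining Principle
[Figure: timing diagram of S = S1 S2 S3 S4 along the time axis t. Each task starts one pipeline cycle TP after the previous one.]
A1:  P1 P2 P3 P4
A2:     P1 P2 P3 P4
A3:        P1 P2 P3 P4
A4:           P1 P2 P3 P4
…
An
TP: pipeline cycle. Each cycle one instruction terminates.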
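The effect on latency and throughput can be put into numbers with a short Python sketch. It assumes an ideal pipeline (no stalls, no register overhead, stages of equal length TP = TA / n); the function names and the sample figures are illustrative only.

```python
def sequential_time(n_tasks, t_task):
    """Total time when each task starts only after the previous one has finished."""
    return n_tasks * t_task


def pipelined_time(n_tasks, n_stages, t_stage):
    """Ideal pipeline: the first task needs n_stages cycles, then one task ends per cycle."""
    return (n_stages + n_tasks - 1) * t_stage


# Hypothetical numbers: 100 tasks, TA = 4 time units, 4 stages of 1 time unit each.
t_seq = sequential_time(100, 4.0)
t_pipe = pipelined_time(100, 4, 1.0)
print(t_seq, t_pipe, t_seq / t_pipe)
```

The latency of a single task is unchanged (4 stages of one time unit each); what improves is the throughput, which approaches one task per pipeline cycle as N grows.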
Instruction stages
IF   Instruction fetch (from memory)
ID   Instruction decode
EX   Instruction execution (ALU)
MEM  Data memory access (if needed)
WB   Write-back (if needed)
Pipelining of a CPU (DLX)
Instruction sequence: I1, I2, I3 … IN
Instruction j: IF ID EX MEM WB (along the time axis t)
Clocks Per Instruction (CPI) = 1 (ideally!)
[Figure: CPU datapath made of the combinatorial circuits IF, ID, EX, MEM, WB, separated by the pipeline registers (D flip-flops) IF/ID, ID/EX, EX/MEM, MEM/WB.]
Pipeline cycle = clock cycle, set by the delay of the slowest stage
DLX Pipeline
Instr i:    IF ID EX MEM WB
Instr i+1:     IF ID EX MEM WB
Instr i+2:        IF ID EX MEM WB
Instr i+3:           IF ID EX MEM WB
Instr i+4:              IF ID EX MEM WB
CPI (ideally) = 1

Clock cycle: Tclk = Td + TP + Tsu
TP:  delay of the slowest combinatorial stage
Overhead introduced by the pipeline registers:
Td:  switch delay of the input stage register
Tsu: set-up time of the output stage register
[Figure: a combinatorial circuit between two D registers, with Td, TP and Tsu marked on the clock cycle.]
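The clock-cycle relation Tclk = Td + TP + Tsu can be evaluated directly. The sketch below uses made-up stage delays in picoseconds; `clock_period` is a hypothetical helper name.

```python
def clock_period(stage_delays, t_d, t_su):
    """Tclk = Td + TP + Tsu, where TP is the delay of the slowest combinatorial stage."""
    return t_d + max(stage_delays) + t_su


# Hypothetical delays (picoseconds) for IF, ID, EX, MEM, WB.
stages_ps = [2000, 1500, 2500, 3000, 1000]
print(clock_period(stages_ps, t_d=200, t_su=300))  # the MEM stage sets the clock
```

In this made-up case the slowest stage (3000 ps) fixes the cycle for every stage, and the registers add 500 ps of overhead to each cycle.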
Pipeline implementation requirements
· Each stage is active at each clock cycle.
· The PC is incremented in the IF stage.
· An ADDER should be introduced in the IF stage (PC <= PC+4, since one instruction is 4 bytes). But instructions are aligned (each one starts at an address that is a multiple of the instruction length in bytes), so bits 1 and 0 of the PC are always 0. Therefore only a 30-bit register (bits 31..2, a programmable counter for jumps) is used, incremented by 1 each clock cycle.
· Two Memory Data Registers are required (referred to as LMDR and SMDR). In fact, when a LOAD is immediately followed by a STORE, the WB and MEM stages overlap, so two data items are waiting to be written (one to memory, the other to a register of the RF).
· Each clock cycle, up to 2 memory accesses must be possible (IF, MEM): Instruction Memory (IM) and Data Memory (DM), i.e. a “Harvard” architecture.
· The CPU clock is determined by the slowest stage.
· Pipeline registers store both data and control information (“distributed” control unit).
DLX Pipelined Datapath
[Figure: pipelined datapath with stages IF ID EX MEM WB and pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. Blocks: PC, ADD (+4), INSTR MEM, DEC, RF (read ports Ra and Rb feeding the RS1/RS2 scratchpad; write port RD and D), SE, ALU with input MUXes, =0? tests, DATA MEM, output MUX.]
Annotations:
· PC: actually a programmable counter; a MUX loads the new value if jump.
· SE: sign extension, for operations with immediates and for computing the new PC value on a branch.
· RD: destination register number (1-31), in case of LOAD and ALU instructions.
· D: data (from a register, from memory, or from the PC for link).
· =0? acting on the output: for Set Condition (also <0 and >0).
· =0?: for Branch.
· JL and JLR: PC saved in R31.
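ID stage (N.B. stage layout different from previous slide!)
[Figure: ID stage between the IF/ID and ID/EX pipeline registers, containing RF, DEC and the sign extension (SE) network. All data paths towards ID/EX are 32 bit.]
Signals shown in the figure:
· IR25-21 and IR20-16: register numbers for the RF read ports Ra and Rb, whose outputs go to A and B.
· RF write port: number of the destination register (from WB stage) and data (from WB stage).
· IR31-26 (Opcode) and IR10-00 (R instructions): inputs of DEC.
· IR15-0: Offset/Immediate (bits 11-15 as destination register in R instructions), 16 bit; sign extension copies IR15 onto bits 31-16 (Immediate/Branch).
· IR25-16: used together with IR15-0 for Jump and Jump and Link, 26 bit (J and JL); sign extension copies IR25 onto bits 31-26 (Jump).
· PC31-0: for JL and JLR.
· PC, A, B and the decoded information travel with the instruction into ID/EX.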
DLX Pipelined Datapath
[Figure: the same datapath (IF ID EX MEM WB) with named pipeline registers. IF/ID: IR1, PC1. ID/EX: A, B, IR2, PC2. EX/MEM: X, SMDR, COND, IR3, PC3. MEM/WB: Y, LMDR, IR4, PC4. DM receives Address and Data; the RF write port receives the destination register number (RD) and the data (D).]
X: computed data, memory address or branch address
Y: computed data from the previous stage
SMDR => Store Memory Data Register
LMDR => Load Memory Data Register
IRi => Instruction Register i
=0? acting on the output: for Set Condition (also <0 and >0)
=0?: for Branch
JL, JLR: PC saved in R31
Pipelined execution of an “ALU” instruction
X: “ALUOUTPUT” (in EX/MEM), Y: “ALUOUTPUT1”

IF   IR <= M[PC] ; PC <= PC + 4 ; PC1 <= PC + 4
ID   A <= Ra ; B <= Rb ; PC2 <= PC1 ; IR2 <= IR1 ; ID/EX <= Instruction decode
EX   X <= A op B   or   X <= A op [(IR2_15)16 ## IR2_15..0] ; [PC3 <= PC2] ; [IR3 <= IR2]
MEM  Y <= X (temp. storage for WB) ; [PC4 <= PC3] ; [IR4 <= IR3]
WB   RD <= Y

(IR2_15 is bit 15 of IR2.)
The decoded opcode travels through all stages.
NOTE: the IRi bits are dropped stage by stage when they are no longer needed; the transfers in square brackets are performed for all instructions. Why?
Pipelined execution of a “MEM” instruction

IF   IR <= M[PC] ; PC <= PC + 4 ; PC1 <= PC + 4
ID   A <= Ra ; B <= Rb ; PC2 <= PC1 ; IR2 <= IR1 ; ID/EX <= Instruction decode
EX   MAR <= A op (IR2_15)16 ## IR2_15..0 ; SMDR <= B ; [PC3 <= PC2] ; [IR3 <= IR2]
MEM  LMDR <= M[MAR] (if LOAD)   or   M[MAR] <= SMDR (if STORE) ; [PC4 <= PC3] ; [IR4 <= IR3]
WB   RD <= LMDR (if LOAD) [sign ext.]

The decoded opcode travels through all stages.
Pipelined execution of a “BRANCH” instruction
(normally after an SCn instruction, see later)
X: “BTA (BRANCH TARGET ADDRESS)”

IF   IR <= M[PC] ; PC <= PC + 4 ; PC1 <= PC + 4
ID   A <= Ra ; B <= Rb ; PC2 <= PC1 ; IR2 <= IR1 ; ID/EX <= Instruction decode
EX   X <= PC2 op (IR15)16 ## IR15..0 ; Cond <= A op 0 ; [PC3 <= PC2] ; [IR3 <= IR2]
MEM  if (Cond) PC <= X ; [PC4 <= PC3] ; [IR4 <= IR3]
WB   (NOP)

The decoded opcode travels through all stages.
Branch on the value of register A (0/1). X is the computed new PC address; the new value is in the PC after the MEM stage.
When the branch is taken, 3 new unwanted instructions have already been started.
Pipelined execution of a “JR” instruction

IF   IR <= M[PC] ; PC <= PC + 4 ; PC1 <= PC + 4
ID   A <= Ra ; B <= Rb ; PC2 <= PC1 ; IR2 <= IR1 ; ID/EX <= Instruction decode
EX   X <= A ; [PC3 <= PC2] ; [IR3 <= IR2]
MEM  PC <= X ; [PC4 <= PC3] ; [IR4 <= IR3]
WB   (NOP)

The decoded opcode travels through all stages.
X is the new PC address; the new value is in the PC after the MEM stage.
When the jump is executed, 3 new unwanted instructions have already been started.
Which would be the stage sequence for a J instruction?
Pipelined execution of a “JL or JLR” instruction

IF   IR <= M[PC] ; PC <= PC + 4 ; PC1 <= PC + 4
ID   A <= Ra ; B <= Rb ; PC2 <= PC1 ; IR2 <= IR1 ; ID/EX <= Instruction decode
EX   PC3 <= PC2 ; X <= A (if JLR)   or   X <= PC2 + (IR25)6 ## IR25..0 (if JL) ; [IR3 <= IR2]
MEM  PC <= X ; PC4 <= PC3 ; [IR4 <= IR3]
WB   R31 <= PC4

The decoded opcode travels through all stages. In this case the PCi values are used.
NOTE: the write to R31 CANNOT be performed on the fly, since it could overlap with another register write.
The new value is in the PC after the MEM stage. When the jump is executed, 3 new unwanted instructions have already been started.
Which would be the sequence in the case of SCn (ex. SLT R1,R2,R3)?

IF   IR <= M[PC] ; PC <= PC + 4 ; PC1 <= PC + 4
ID   A <= Ra ; B <= Rb ; PC2 <= PC1 ; IR2 <= IR1 ; ID/EX <= Instruction decode
EX   ?
MEM  ?
WB   ?
Pipeline Hazards
A “hazard” occurs when an instruction currently in a pipeline stage cannot be executed in that clock cycle.
• Structural Hazards: the same resource is used by two different pipeline stages, so the instructions currently in those stages can’t be executed simultaneously.
• Data Hazards: they are due to instruction dependencies. For example, an instruction needs to read an RF register not yet written by a previous instruction (Read After Write).
• Control Hazards: the instructions following a branch depend on the branch result (taken/not taken).
The instruction that cannot be executed must be stalled (“pipeline stall” or “pipeline bubbling”), together with all the following instructions, while the previous instructions must proceed normally (so as to eliminate the hazard).
Hazards and stalls
The consequence of a data hazard: if instruction Ii needs the result of instruction Ii-1 (data are read in the ID stage), it must wait until after the WB of Ii-1.

        Clk1 Clk2 Clk3 Clk4 Clk5 Clk6 Clk7 Clk8 Clk9 Clk10 Clk11 Clk12
Ii-3    IF   ID   EX   MEM  WB
Ii-2         IF   ID   EX   MEM  WB
Ii-1              IF   ID   EX   MEM  WB
Ii                     IF   S    S    S    ID   EX   MEM   WB
Ii+1                        S    S    S    IF   ID   EX    MEM   WB
(S = stall)

T5 = 8 * CLK = (5 + 3) * CLK
T5 = 5 * (1 + 3/5) * CLK

Stall: the clock signal for Ii, Ii+1 … etc. is blocked for three periods.
Normally the three stall cycles are implemented by inserting NOPs, to avoid blocking the clock.
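The stall count caused by RAW dependencies can be reproduced with a small scheduling sketch in Python. It assumes the rules stated in these slides (registers are read in ID, written at the end of WB, no forwarding) and models nothing else; the function name `id_cycles`, the tuple encoding of instructions and the sample sequence are illustrative.

```python
def id_cycles(program):
    """program: list of (dest, sources). Return the clock in which each instruction does ID.

    Without forwarding, a value written by an instruction whose ID is in clock c
    (EX c+1, MEM c+2, WB c+3) can first be read by an ID in clock c+4.
    """
    ready = {}       # register -> first clock in which an ID can read the new value
    cycles = []
    previous_id = 1  # so that the first instruction has IF in clock 1 and ID in clock 2
    for dest, sources in program:
        cycle = previous_id + 1            # in-order issue: one ID per clock at most
        for reg in sources:
            cycle = max(cycle, ready.get(reg, 0))
        cycles.append(cycle)
        if dest is not None and dest != "R0":  # R0 is fixed to 0
            ready[dest] = cycle + 4
        previous_id = cycle
    return cycles


# Example sequence: the SUB needs R3 produced by the ADD just before it.
program = [
    ("R3", ["R1", "R4"]),  # ADD R3, R1, R4
    ("R7", ["R3", "R5"]),  # SUB R7, R3, R5
    ("R1", ["R3", "R5"]),  # OR  R1, R3, R5
]
cycles = id_cycles(program)
stalls = cycles[-1] - (len(program) + 1)
print(cycles, stalls)
```

With this sequence the SUB decodes in clock 6 instead of clock 3, i.e. three stall cycles, as in the timing diagram.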
Forwarding
Forwarding eliminates almost all the RAW hazards of the pipeline without stalling it.
(NOTE: in DLX, registers are modified only in the WB stage; data are read from the registers in the ID stage.)

                   Clk1 Clk2 Clk3 Clk4 Clk5 Clk6 Clk7 Clk8 Clk9
ADD R3, R1, R4     IF   ID   EX   MEM  WB
SUB R7, R3, R5          IF   ID   EX   MEM  WB                    hazard
OR  R1, R3, R5               IF   ID   EX   MEM  WB               hazard
LW  R6, 100 (R3)                  IF   ID   EX   MEM  WB          hazard
AND R9, R5, R3                         IF   ID   EX   MEM  WB     no hazard

For the LW too the requested data is not yet in the RF, since it is written on the positive clock edge at the end of WB (and the register value is read in ID!).
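Forwarding implementation
[Figure: forwarding unit (FU) around the ALU. Bypass MUXes on the ALU inputs select among the ID/EX values (A, B, Offset, PC) and the results held in EX/MEM and MEM/WB (from ALU or Memory).]
· FU inputs: RS1/RS2 and OPCODE of the instruction in ID/EX; RD1 and RD2 (destination registers) with the opcodes of the instructions further down the pipeline (IR3, IR4).
· The FU is combinatorial!! It compares RS1, RS2 with RD1, RD2 and checks the opcodes.
· RF MUX (often performed inside the RF): it allows “the anticipation” of the register value onto ID/EX. MUX control: IF/ID opcode and comparison of RD with RS1 and RS2.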
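The comparison performed by the forwarding unit can be sketched as a pure function. This is a simplified Python model, not the actual DLX control logic: the name `forward_source`, the `(rd, writes_register)` tuples standing in for "destination register plus opcode", and the priority given to the most recent result (EX/MEM before MEM/WB) are assumptions of the example.

```python
def forward_source(rs, ex_mem, mem_wb):
    """Choose where an ALU operand comes from.

    rs      : source register number of the instruction entering EX
    ex_mem  : (rd, writes_register) of the instruction now in EX/MEM
    mem_wb  : (rd, writes_register) of the instruction now in MEM/WB
    """
    if rs == 0:
        return "RF"          # R0 is fixed to 0: nothing to forward
    rd, writes = ex_mem
    if writes and rd == rs:
        return "EX/MEM"      # most recent producer wins (assumption)
    rd, writes = mem_wb
    if writes and rd == rs:
        return "MEM/WB"
    return "RF"              # value already read from the register file in ID


# SUB R7, R3, R5 entering EX right after ADD R3, R1, R4 (now in EX/MEM).
print(forward_source(3, ex_mem=(3, True), mem_wb=(0, False)))
print(forward_source(5, ex_mem=(3, True), mem_wb=(0, False)))
```

The `writes_register` flag stands for the opcode check: a store or a branch in EX/MEM has no destination register, so its register field must not trigger a bypass.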
Data hazard due to LOAD instructions

Without stalling (the hazard):
LW  R1,32(R6)    IF   ID   EX   MEM  WB
ADD R4,R1,R7          IF   ID   EX   MEM
SUB R5,R1,R8               IF   ID   EX
AND R6,R1,R7                    IF   ID

NOTE: the data required by the ADD is available only at the end of the MEM stage of the LW. This hazard cannot be eliminated by forwarding (unless there is an additional input in the MUXes between memory and ALU, which adds delay!).

The pipeline needs to be stalled:
LW  R1,32(R6)    IF   ID   EX   MEM  WB
ADD R4,R1,R7          IF   ID   S    EX   MEM
SUB R5,R1,R8               IF   S    ID   EX
AND R6,R1,R7                    S    IF   ID
From the end of the MEM stage of the LW onwards: standard forwarding.

In practice the stalled ADD is transformed into a NOP and fetched again (PC <= PC - 4): the NOP goes through IF ID EX MEM WB, followed by the refetched ADD R4,R1,R7 (IF ID EX MEM …).
Delayed load
In many RISC CPUs, the hazard associated with the LOAD instruction is not handled in hardware by stalling the pipeline, but in software by the compiler (delayed load):

LOAD instruction
delay slot
Next instruction

The compiler tries to fill the delay slot with a “useful” instruction (worst case: NOP).

Example:
LW R1,32(R6)
LW R3,10 (R4)
ADD R5,R1,R3
LW R6, 20 (R7)
LW R8, 40(R9)

Here the ADD uses R3, which is loaded by the instruction immediately before it, so the delay slot after LW R3,10 (R4) must be filled. An independent instruction, for example LW R6, 20 (R7), is a candidate to be moved into it.
Control Hazards

PC          BEQZ R4, 200
PC+4        SUB R7, R3, R5
PC+8        OR R1, R3, R5
PC+12       LW R6, 100 (R8)
…
PC+4+200    AND R9, R5, R3      (BTA)

Next instruction address:
R4 = 0: Branch Target Address (taken)
R4 ≠ 0: PC+4 (not taken)

                     Clk1 Clk2 Clk3 Clk4 Clk5 Clk6 Clk7 Clk8 Clk9
BEQZ R4, 200         IF   ID   EX   MEM  WB
SUB R7, R3, R5            IF   ID   EX   MEM  WB
OR R1, R3, R5                  IF   ID   EX   MEM  WB
LW R6, 100 (R8)                     IF   ID   EX   MEM  WB
Fetch with the new PC                    IF   ID   EX   MEM  WB

EX of the BEQZ (Clk 3): the new PC value is computed (Aluout).
MEM of the BEQZ (Clk 4): the new value must be clocked onto the PC, one clock after it has been computed.
Clk 5: fetch with the new PC.
DLX Pipelined Datapath (Branch or JMP)
[Figure: datapath (Instruction Fetch, Instruction Decode, Execute, Memory, Write Back) executing BEQZ R4, 200, with the new PC fed back to the PC MUX.]
When the new PC acts on the IM, three instructions have already travelled through the first three stages (EX included).
NOTE: if the feedback signal of the new PC were taken directly from the ALU output instead of from ALUOUT, only two stalls would be required, at the price of a slower clock!
Handling the Control Hazards

• Always Stall (three-clock block being propagated)
                     Clk1 Clk2 Clk3 Clk4 Clk5
BEQZ R4,200          IF   ID   EX   MEM  WB
Next instruction          S    S    S    IF   (fetch at new PC)
Real situation: the next instruction is fetched anyway in Clk 2 (IF here: the previous instruction has not yet been decoded), it is then stalled, and its IF is repeated (PC <= PC - 4).

• Predict Not Taken
                     Clk1 Clk2 Clk3 Clk4 Clk5 Clk6 Clk7 Clk8
BEQZ R4, 200         IF   ID   EX   MEM  WB
SUB R7, R3, R5            IF   ID   EX   MEM  WB
OR R1, R3, R5                  IF   ID   EX   MEM  WB
LW R6, 100 (R8)                     IF   ID   EX   MEM  WB
EX of the BEQZ (Clk 3): here the new value of the PC is computed.
MEM of the BEQZ (Clk 4, branch completion): here the new value is sampled by the PC.
If the branch is taken: flush. SUB, OR and LW become NOPs (NOP NOP NOP). No problem, because no data has been written yet: none of them has reached the WB stage.
Stalls with jumps (1/3)
[Figure: pipelined datapath (IF ID EX MEM WB) with a forced NOP on IF/ID, ID/EX and EX/MEM when a jump is taken.]
When the Branch Target Address is clocked into the PC, three unwanted instructions are already in IF/ID, ID/EX and EX/MEM.
Three NOPs MUST replace the 3 unwanted instructions already started.
Stalls with jumps (2/3)
[Figure: same pipelined datapath (IF ID EX MEM WB), with a forced NOP on two pipeline registers when a jump is taken.]
NOTE: in this case the jump condition detection and the new PC value are input to the MUX in the same clock interval.
Two NOPs MUST replace the 2 unwanted instructions already started.
Stalls with jumps (3/3)
[Figure: same pipelined datapath (IF ID EX MEM WB), with a NOP for jump on one pipeline register.]
NOTE: in this case the jump condition and the new PC act on the MUX in the same period in which the condition is detected.
A NOP MUST replace the unwanted instruction already started.
Very slow solution!
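The three variants differ in how many unwanted instructions must be replaced by NOPs (3, 2 or 1). A rough way to compare them is the average CPI under a simple model in which only taken branches pay the penalty. The formula, the function name `effective_cpi` and the branch statistics below are assumptions for illustration, not figures from the course, and the model ignores the slower clock of the faster-resolving variants.

```python
def effective_cpi(branch_fraction, taken_fraction, penalty_clocks):
    """Ideal CPI of 1 plus the average number of NOP cycles per instruction."""
    return 1.0 + branch_fraction * taken_fraction * penalty_clocks


# Hypothetical workload: 20% branches, half of them taken.
for penalty in (3, 2, 1):
    print(penalty, effective_cpi(0.2, 0.5, penalty))
```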
Delayed branch
Similarly to the LOAD case, in several RISC CPUs the hazard of the BRANCH instructions is handled in software by the compiler (delayed branch):

BRANCH instruction
delay slot
delay slot
delay slot
Next instruction

The compiler tries to fill the delay slots with “useful” instructions (worst case: NOP).
Delayed branch/jump

Original:
Add R5, R4, R3
Sub R6, R5, R2
Or R14, R6, R21
Sne R1, R8, R9 ; branch condition
Br R1, +100

Compiled:
Sne R1, R8, R9 ; branch condition
Br R1, +100
Add R5, R4, R3
Sub R6, R5, R2
Or R14, R6, R21

The three instructions moved after the branch are executed in both cases (branch taken or not taken). Obviously there must be no jumps in this group of instructions!
When no suitable instructions are available to be “postponed”, the compiler inserts NOPs instead.
Independent Adder for BRANCH/JMP
To reduce the number of stalls, the target address is computed in the ID stage by an additional full adder.

IF   IR <= M[PC] ; PC <= PC + 4 ; PC1 <= PC + 4
ID   A <= Ra ; B <= Rb ; PC2 <= PC1 ; ID/EX <= Decode ; ID/EX <= Opc ext.
     BTA <= PC1 + (IR15)16 ## IR15..0 (Branch)   or   BTA <= PC1 + (IR25)6 ## IR25..0 (JMP)
     if Branch: if (RS1 op 0) PC <= BTA
     if JMP: always PC <= BTA
     (new fetch: only one stall)
EX   -------------------------
MEM  -------------------------
WB   -------------------------

NOTE: in this case there is only one “stall”, since the new value is inserted in the PC on the positive clock edge that ends the ID stage, while in the previous case it was inserted after the MEM stage, that is, two clocks later!

BRANCH/JMP: 1 stall
[Figure: IF and ID stages with PC, ADDER (+4), IM, IF/ID (IR1, PC1), RF, DEC, SE, the additional adder, and ID/EX (A, B, PC2). The PC MUX selects between the standard increment and, for a branch, the PC of the branch instruction plus its sign-extended displacement.]
The source of the next PC is selected according to the opcode and, for branches, the value of the branch test register (= 0 ?).
NOTE: for “Unconditional Jump” instructions the situation is similar: we only need to provide further inputs to the MUXes of the PC, taking either the RS1 register (JR and JLR) or the 26 least significant bits of the IR, sign extended (J and JL), to be added to the PC of the instruction (not the current PC).
Handling the Control Hazards
Dynamic Prediction: Branch Target Buffer => no stall (almost)
[Figure: Branch Target Buffer. Each entry holds a TAG, a Predicted PC and a T/NT (taken/not taken) field; the PC is compared (=) with the TAGS.]
HIT: fetch with the predicted PC
MISS: fetch with PC + 4
Correct prediction: no stalls. Wrong prediction: 1-3 stalls (correct fetch in ID or EX, see before).
N.B. Here the branch slot is selected during the IF clock cycle that loads IR1 in IF/ID.
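The lookup performed during IF can be sketched with a dictionary indexed by the branch address. This Python model is deliberately simplified and rests on assumptions: the tag is the full PC, there is no capacity limit or replacement policy, the T/NT field is a single bit, and a hit whose entry says "not taken" falls back to PC + 4. The class and method names are invented for the example.

```python
class BranchTargetBuffer:
    """Toy BTB: tag (branch PC) -> [predicted PC, taken bit]."""

    def __init__(self):
        self.entries = {}

    def next_fetch(self, pc):
        """Address used for the next fetch, decided during the IF of the instruction at pc."""
        entry = self.entries.get(pc)
        if entry is not None and entry[1]:  # HIT with a "taken" prediction
            return entry[0]
        return pc + 4                       # MISS (or predicted not taken)

    def update(self, pc, taken, target):
        """Record the real outcome once the branch has been resolved."""
        self.entries[pc] = [target, taken]


btb = BranchTargetBuffer()
first = btb.next_fetch(0x100)        # MISS: 0x104
btb.update(0x100, True, 0x1CC)       # branch at 0x100 taken to PC+4+200 = 0x1CC
second = btb.next_fetch(0x100)       # HIT: 0x1CC
print(hex(first), hex(second))
```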
Prediction Buffer: the simplest implementation uses a single bit that records what happened the last time the branch was executed.
When one outcome is predominant, each occurrence of the opposite outcome causes two consecutive errors.
[Figure: two nested loops, Loop1 (outer) and Loop2 (inner).]
When the program ends Loop2, the prediction fails (the branch is predicted as taken but is actually untaken); it then fails again when Loop2 is entered once more, because the branch is now predicted as untaken.
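The double error of the single-bit scheme can be counted with a short simulation. The sketch assumes that the branch closing the inner loop is taken on every iteration except the last, and that the prediction bit starts as "taken"; the function names and the loop sizes are made up for the example.

```python
def inner_loop_outcomes(iterations, repetitions):
    """Outcomes of the branch closing the inner loop: taken (True) except on loop exit."""
    return ([True] * (iterations - 1) + [False]) * repetitions


def one_bit_mispredictions(outcomes, last_outcome=True):
    """Predict that the branch repeats its previous outcome; count the wrong predictions."""
    errors = 0
    for taken in outcomes:
        if taken != last_outcome:
            errors += 1
        last_outcome = taken  # the single bit remembers only the last outcome
    return errors


# Inner loop of 4 iterations, executed 3 times by the outer loop.
print(one_bit_mispredictions(inner_loop_outcomes(4, 3)))  # prints 5
```

The first execution of the inner loop costs one error (the exit); every later execution costs two, one when the loop is re-entered and one when it exits.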
Usually two bits are used.
[Figure: state diagram of the two-bit predictor; the transitions between its states are labelled TAKEN and UNTAKEN.]
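One common two-bit scheme is a saturating counter, sketched below in Python. The exact transitions of the state diagram in the slide may differ; the counter encoding (0-1 predict untaken, 2-3 predict taken), the initial state and the function name are assumptions of this example.

```python
def two_bit_mispredictions(outcomes, counter=3):
    """Saturating counter: 0-1 predict untaken, 2-3 predict taken."""
    errors = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            errors += 1
        # Move one step towards the real outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return errors


# Same hypothetical nested loop: inner branch taken 3 times, then untaken, repeated 3 times.
outcomes = ([True] * 3 + [False]) * 3
print(two_bit_mispredictions(outcomes))  # prints 3
```

A single untaken outcome is no longer enough to flip the prediction, so each execution of the inner loop costs one error (the exit) instead of two.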