1 DLX computer Electronic Computers M


DLX computer

Electronic Computers M


RISC vs CISC (Reduced Instruction Set Computer vs Complex Instruction Set Computer)

RISC architectures

• In CISC architectures, 10% of the instructions are used in 90% of cases
• Waste of silicon
• Bottleneck: the bus
• Mid ‘80s: a new architecture, RISC
• Solution: reduction of instruction number and complexity (fewer, simpler machine instructions)
• Fixed instruction format (simpler instruction decoders)
• Simpler control logic network, increasing the number of on-chip registers
• Reduction of bus/memory accesses
• Increase in the number of machine instructions needed for a job, which is (in many cases) more than compensated (in terms of time) by the reduction of bus accesses
• CISC and RISC are each the best solution in different application fields
• Nowadays both architectures coexist in the same processor: analysis at the end of the course
• A simplified RISC architecture: DLX (implemented as a real processor in the ‘80s as R4000)


DLX (fixed) instruction format

Format R
  Bits 31-26 (6 bit):  Op-code
  Bits 25-21 (5 bit):  Ra
  Bits 20-16 (5 bit):  Rb
  Bits 15-11 (5 bit):  Rc
  Bits 10-0 (11 bit):  Op-code extension
  Arithmetic or logic instructions (Rd <= RS1 op RS2) or Set Condition between registers

Format J
  Bits 31-26 (6 bit):  Op-code
  Bits 25-0 (26 bit):  (PC relative) offset
  Direct and unconditional control transfer (J and JAL)

Format I
  Bits 31-26 (6 bit):  Op-code
  Bits 25-21 (5 bit):  Ra
  Bits 20-16 (5 bit):  Rb
  Bits 15-0 (16 bit):  immediate operand
  Data transfer (Load, Store), conditional Branch, JR and JALR (control transfer via register), Set Condition and ALU with immediate operand. In LD and ALU instructions RS2 = destination, in the ST RS2 = source. RS1 is used as base address or as ALU value for the immediate instructions.
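The fixed field positions make decoding a matter of shifting and masking. The sketch below is only an illustration of that idea: the function names `encode_r` and `decode_fields` and the numeric opcode/extension values are hypothetical, since the slides do not give the actual encodings.

```python
def encode_r(opcode: int, ra: int, rb: int, rc: int, ext: int) -> int:
    """Pack the five R-format fields into one 32-bit word."""
    return (opcode << 26) | (ra << 21) | (rb << 16) | (rc << 11) | ext


def decode_fields(word: int) -> dict:
    """Slice a 32-bit word into every field used by the R, I and J formats."""
    if not 0 <= word <= 0xFFFFFFFF:
        raise ValueError("instruction word must fit in 32 bits")
    return {
        "opcode": (word >> 26) & 0x3F,   # bits 31-26, all formats
        "ra": (word >> 21) & 0x1F,       # bits 25-21, formats R and I
        "rb": (word >> 16) & 0x1F,       # bits 20-16, formats R and I
        "rc": (word >> 11) & 0x1F,       # bits 15-11, format R only
        "ext": word & 0x7FF,             # bits 10-0, format R only
        "imm16": word & 0xFFFF,          # bits 15-0, format I only
        "offset26": word & 0x3FFFFFF,    # bits 25-0, format J only
    }


# Hypothetical field values, chosen only to exercise the bit positions.
word = encode_r(opcode=0, ra=3, rb=1, rc=4, ext=0x20)
fields = decode_fields(word)
print(fields["ra"], fields["rb"], fields["rc"], hex(fields["ext"]))
```

Because the format is fixed, all fields can be extracted in parallel before the opcode is known; the decoder then simply ignores the fields that do not belong to the decoded format.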


DLX non floating-point instructions
(31 x 32-bit registers R31…R1; R0 = 0 fixed; Ra and Rb can be any of the 32 registers)

Data Transfer

LW Ra, offset(Rb)
LB Ra, offset(Rb)
LBU Ra, offset(Rb)
LHU Ra, offset(Rb)
LH Ra, offset(Rb)
SW Ra, offset(Rb)
SH Ra, offset(Rb)
SB Ra, offset(Rb)
LHI Ra, value

Arithmetic/Logic

ADD Ra,Rb,Rc      ADDI Ra,Rb,value
ADDU Ra,Rb,Rc     ADDUI Ra,Rb,value
SUB Ra,Rb,Rc      SUBI Ra,Rb,value
SUBU Ra,Rb,Rc     SUBUI Ra,Rb,value
DIV Ra,Rb,Rc      DIVI Ra,Rb,value
MULU Ra,Rb,Rc     MULI Ra,Rb,value
SLL Ra,Rb,Rc      SLLI Ra,Rb,value
SHR Ra,Rb,Rc      SHRI Ra,Rb,value
SLA Ra,Rb,Rc      SLAI Ra,Rb,value
OR Ra,Rb,Rc       ORI Ra,Rb,value
XOR Ra,Rb,Rc      XORI Ra,Rb,value
AND Ra,Rb,Rc      ANDI Ra,Rb,value

Control

SETx Ra,Rb,Rc
SETIx Ra,Rb,value
BEQZ Ra, offset
BNEQZ Ra, offset
J offset
JR Ra
JL offset
JLR Ra

N.B.
• Postfix x (set condition) can be LT, GT, LE, GE, EQ, NE
• JL (via or not via register): jump and link, saving PC in R31
• Offset is a value within the instruction
• Postfix I means «immediate» (value within the instruction)
• Postfix A means «arithmetic» (sign extension)
• Postfix U means «unsigned»
• Value is the immediate within the instruction

No STACK registers


DLX ALU operations
Two data inputs, one data output plus flags

S1, S2: ALU inputs (32 bit)

Operations (value on the output):
• S1 + S2
• S1 - S2
• S1 and S2
• S1 or S2
• S1 exor S2
• Left shift of S1 by S2 positions
• Right shift of S1 by S2 positions
• Arithmetic right shift of S1 by S2 positions
• S1
• S2
• 0
• 1

Output flags:
• Zero
• Negative sign

The ALU is a combinatorial circuit!

[Figure: ALU block with the 32-bit inputs S1 and S2, the 32-bit output OUT and the Flags output]
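As a rough behavioural model of the operations listed for the ALU, the following sketch computes the 32-bit output together with the Zero and Negative flags. The operation names and the choice of using only the low 5 bits of S2 as shift amount are assumptions made for the example, not something stated in the slides.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x: int) -> int:
    """Interpret a 32-bit pattern as a two's complement number."""
    return x - (1 << 32) if x & 0x80000000 else x


def alu(op: str, s1: int, s2: int):
    """Return (out, zero_flag, negative_flag) for one ALU operation."""
    s1 &= MASK32
    s2 &= MASK32
    shift = s2 & 31  # assumption: only the low 5 bits of S2 select the shift
    operations = {
        "add": lambda: s1 + s2,
        "sub": lambda: s1 - s2,
        "and": lambda: s1 & s2,
        "or": lambda: s1 | s2,
        "xor": lambda: s1 ^ s2,
        "sll": lambda: s1 << shift,
        "srl": lambda: s1 >> shift,              # logical: zeros enter from the left
        "sra": lambda: to_signed(s1) >> shift,   # arithmetic: sign bit is replicated
        "s1": lambda: s1,
        "s2": lambda: s2,
        "zero": lambda: 0,
        "one": lambda: 1,
    }
    out = operations[op]() & MASK32  # wrap around to 32 bits
    return out, out == 0, bool(out & 0x80000000)


print(alu("add", 0xFFFFFFFF, 1))
print(alu("sra", 0x80000000, 4))
```

The function has no internal state: the output depends only on the current inputs, which mirrors the combinatorial nature of the circuit.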


Abstract instruction execution

PC is the Program Counter. A and B are two scratchpad internal registers unknown to the programmer.

INSTRUCTION FETCH:      [REGINSTR] <= M[PC]   (repeated until memory is Ready)
INSTRUCTION DECODE:     [PC] <= [PC] + 4; [A] <= [Ra]; [B] <= [Rb]
INSTRUCTION EXECUTION:  one of Data transfer, ALU, Set, Jump, Branch


Example: LB (LOAD BYTE, format I)

LB Ra, offset(Rb)

Instruction layout: Op-code (bits 31-26) | RS1 (bits 25-21) | RS2 (bits 20-16) | 16 bit immediate operand (bits 15-0)
Bit 31 is the MSbit, bit 0 is the LSbit.

Sequence:
  INSTR <= M[PC]
  [PC] <= [PC] + 4; [A] <= [Ra]; [B] <= [Rb]
  LOAD: Byte Addr. <= [B] + (Instr15)^16 ## Instr15..0      ([A] = [RS1])
  [Ra] <= (M[Addr.]7)^24 ## M[Addr.]7..0                    (Dest. Reg. = RS2; byte in register, sign extension!)
  Next instruction

Notation: ## is the JOIN operator; (b)^n is bit b repeated n times.

Byte address computation: Instr15..0 is the instruction offset; instruction bit 15 (sign) is left extended 16 times. The address is always 32 bit.

Sign extension example: M[Addr]7..0 = A7H => (10100111)b; the sign-extended value is FFFFFFA7H.
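The two sign extensions used by LB (16 to 32 bits for the offset, 8 to 32 bits for the loaded byte) can be checked with a few lines of code. This is a minimal sketch: `sign_extend` and `load_byte` are illustrative names, and memory is modelled as a plain dictionary from byte address to byte value, which is an assumption of the example.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value: int, bits: int) -> int:
    """Replicate the top bit of a `bits`-wide value up to bit 31."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):                       # sign bit set
        return (value | (MASK32 << bits)) & MASK32
    return value


def load_byte(memory: dict, base: int, instr: int) -> int:
    """Model LB: address = base + sign-extended Instr15..0, result sign-extended."""
    addr = (base + sign_extend(instr & 0xFFFF, 16)) & MASK32
    return sign_extend(memory[addr], 8)


print(hex(sign_extend(0xA7, 8)))                  # the A7H example from the slide

memory = {0xFFC: 0xA7}                            # hypothetical memory content
print(hex(load_byte(memory, base=0x1000, instr=0xFFFC)))   # offset field = -4
```

Replacing the final `sign_extend(memory[addr], 8)` with `memory[addr] & 0xFF` would give the unsigned behaviour, where the upper 24 bits are filled with zeros instead.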


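Sign extension: (IR15)^16 ## IR15..0

[Figure: sign-extension circuit built with tri-state devices driven by a signal from the Control Unit; IR bits 15-0 supply output bits 15-0 and IR15 is replicated on output bits 31-16]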


Data transfer instructions (I format)

Examples

LW Ra, offset(Rb)
LB Ra, offset(Rb)
LBU Ra, offset(Rb)   unsigned
LHU Ra, offset(Rb)   unsigned
SW Ra, offset(Rb)

Address computation (all cases):
  Addr. <= [B] + (Instr15)^16 ## Instr15..0

Then, depending on the instruction:
  LB  (byte, signed):        [Ra] <= (M[Addr]7)^24 ## M[Addr]7..0
  LBU (byte, unsigned):      [Ra] <= (0)^24 ## M[Addr]7..0
  LH  (half word, signed):   [Ra] <= (M[Addr]15)^16 ## M[Addr]15..0
  LHU (half word, unsigned): [Ra] <= (0)^16 ## M[Addr]15..0
  LW  (word):                [Ra] <= M[Addr]
  SW:                        M[Addr] <= [A]


ALU instruction examples (R and I formats)

ADD Ra,Rb,Rc
ADDI Ra,Rb,value
ADDU Ra,Rb,Rc
ADDUI Ra,Rb,value
…

First step (T is a temporary hidden register unknown to the programmer):
  Register (format R):   [T] <= [Rc]
  Immediate (format I):  [T] <= (Instr15)^16 ## Instr15..0

Second step:
  ADD:  [Ra] <= [B] + [T]
  SUB:  [Ra] <= [B] - [T]
  AND:  [Ra] <= [B] and [T]
  OR:   [Ra] <= [B] or [T]
  XOR:  [Ra] <= [B] xor [T]

The same scheme applies to the shifts etc. A and B are generic registers (RS1, RS2).

Register content is treated as signed in arithmetic operations.


SET instructions (see branch)

Ex. SLT Ra,Rb,Rc: set Ra = 1 if Rb is less than Rc, otherwise Ra = 0

First step:
  Register (format R):   [T] <= [Rc]
  Immediate (format I):  [T] <= (Instr15)^16 ## Instr15..0

Second step:
  SEQ:  [Ra] = 1 if [Rb] = [T]
  SNE:  [Ra] = 1 if [Rb] != [T]
  SLT:  [Ra] = 1 if [Rb] < [T]
  SGT:  [Ra] = 1 if [Rb] > [T]
  SLE:  [Ra] = 1 if [Rb] <= [T]
  SGE:  [Ra] = 1 if [Rb] >= [T]

Register content is treated as signed.
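The set-condition step can be modelled as a signed comparison that yields 1 or 0. In this sketch the table `CONDITIONS` and the function `set_condition` are illustrative names; the only behaviour taken from the slide is that both operands are compared as signed 32-bit values.

```python
import operator

# One comparison per postfix x (EQ, NE, LT, GT, LE, GE).
CONDITIONS = {
    "EQ": operator.eq,
    "NE": operator.ne,
    "LT": operator.lt,
    "GT": operator.gt,
    "LE": operator.le,
    "GE": operator.ge,
}


def to_signed(x: int) -> int:
    """Interpret a 32-bit pattern as a two's complement number."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def set_condition(cond: str, rb: int, t: int) -> int:
    """Return the value written into Ra: 1 if the condition holds, else 0."""
    return 1 if CONDITIONS[cond](to_signed(rb), to_signed(t)) else 0


# FFFFFFFFH is -1 when read as signed, so it is less than 1.
print(set_condition("LT", 0xFFFFFFFF, 1))
```

An unsigned comparison would give the opposite answer for this pair of operands, which is why the signed interpretation matters.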


JUMP instructions

J offset (jump address)
JR Ra (jump register)
JL offset (jump and link address)
JLR Ra (jump and link register)

(In the state diagram J, JL and JLR are labelled JMP, JAL and JALR.)

Link step (JAL and JALR only), for saving [PC] in R31:
  [T] <= [PC]
  [R31] <= [T]

Jump step:
  J, JAL (format J):    [PC] <= [PC] + (Instr25)^6 ## Instr25..0
  JR, JALR (format I):  [PC] <= [A]


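BRANCH instructions

[Figure: flowchart (INIT, BRANCH). For BEQZ and BNEZ a test on register A (decision boxes “[A] = 1” and “[A!] = 1”, with YES/NO exits) decides whether the branch is taken]

If the branch is taken:
  [PC] <= [PC] + (Instr15)^16 ## Instr15..0

Ex. BNEQZ R5, 100: jump to PC+100 if R5 is not equal to 0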
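A small model of the branch step may help to see where the offset is applied. It assumes, as in the abstract execution scheme, that the PC has already been incremented by 4 when the branch executes; `branch_next_pc` and the mnemonic strings are names chosen for this example only.

```python
def branch_next_pc(opcode: str, pc_after_fetch: int, a: int, imm16: int) -> int:
    """Return the next PC for BEQZ/BNEZ given register A and the 16-bit offset."""
    if opcode not in ("BEQZ", "BNEZ"):
        raise ValueError("not a branch opcode")
    # Sign-extend the 16-bit offset field.
    offset = imm16 - 0x10000 if imm16 & 0x8000 else imm16
    taken = (a == 0) if opcode == "BEQZ" else (a != 0)
    if taken:
        return (pc_after_fetch + offset) & 0xFFFFFFFF
    return pc_after_fetch            # fall through to the next instruction


# BNEZ with offset 100: taken because the register is not zero.
print(hex(branch_next_pc("BNEZ", pc_after_fetch=0x104, a=7, imm16=100)))
```

A negative offset (bit 15 set) moves the PC backwards, which is the usual case for the branch that closes a loop.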


The Pipelining Principle

Pipelining is the basic technique used for “speeding up” a CPU.

The key idea of pipelining is general and is currently applied in several industrial fields (production lines, oil pipelines, …).

A system S must operate N times on a task Ai, producing result Ri:

A1, A2, A3 … AN  =>  S  =>  R1, R2, R3 … RN

Latency: time elapsing between the beginning and the end of task A (TA).

Throughput: frequency at which tasks are completed.


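The Pipelining Principle

1) Sequential system

[Figure: time axis t with tasks A1, A2, A3 … AN executed one after the other, each lasting TA]

Latency (execution time of a single instruction) = TA

2) Pipelined system (instructions are subdivided into stages, each stage lasting one nth of the entire instruction, with n = 4 in this example); instructions overlap

[Figure: system S split into the pipeline stages S1, S2, S3, S4; task A split into the phases P1, P2, P3, P4 along the time axis t]

Si: pipeline stage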


The Pipelining Principle

S = S1 S2 S3 S4

A1:  P1 P2 P3 P4
A2:     P1 P2 P3 P4
A3:        P1 P2 P3 P4
A4:           P1 P2 P3 P4
…
An                          (time axis t)

TP: pipeline cycle (duration of one phase Pi). Each cycle one instruction terminates.
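The gain can be quantified with the usual fill-time argument: the first task needs all the stages, and every later task completes one pipeline cycle after the previous one. The sketch below applies that formula; it ignores any overhead added by the registers between stages, and the function names are illustrative.

```python
def sequential_time(n_tasks: int, task_latency: int) -> int:
    """Tasks executed one after the other: N * TA."""
    return n_tasks * task_latency


def pipelined_time(n_tasks: int, n_stages: int, cycle: int) -> int:
    """First task takes n_stages cycles, each following task one more cycle."""
    return (n_stages + n_tasks - 1) * cycle


# Four tasks on a 4-stage pipeline with TA = 4 time units and TP = 1.
print(sequential_time(4, 4), pipelined_time(4, 4, 1))

# With many tasks the ratio approaches the number of stages.
n = 1000
print(sequential_time(n, 4) / pipelined_time(n, 4, 1))
```

The latency of a single task is unchanged (four cycles of one unit each); only the throughput improves.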


Instruction stages

IF:   Instruction fetch (from memory)
ID:   Instruction decode
EX:   Instruction execution (ALU)
MEM:  Data memory access (if needed)
WB:   Write-back (if needed)


Pipelining of a CPU (DLX)

Instruction sequence: I1, I2, I3 … IN

Instruction j:  IF  ID  EX  MEM  WB   (along the time axis t)

Clocks Per Instruction (CPI) = 1 (ideally!)

CPU (datapath):  IF | IF/ID | ID | ID/EX | EX | EX/MEM | MEM | MEM/WB | WB

• IF, ID, EX, MEM, WB: combinatorial circuits
• IF/ID, ID/EX, EX/MEM, MEM/WB: pipeline registers (D flip-flops)

Pipeline cycle = clock cycle = delay of the slowest stage


DLX Pipeline

Instr i      IF  ID  EX  MEM WB
Instr i+1        IF  ID  EX  MEM WB
Instr i+2            IF  ID  EX  MEM WB
Instr i+3                IF  ID  EX  MEM WB
Instr i+4                    IF  ID  EX  MEM WB

CPI (ideally) = 1

Clock cycle: Tclk = Td + TP + Tsu

• Td: switch delay of the input stage register
• TP: delay of the slowest combinatorial stage
• Tsu: set-up time of the output stage register

Td and Tsu are the overhead introduced by the pipeline registers.
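The clock period formula Tclk = Td + TP + Tsu can be evaluated directly once the stage delays are known. All the numbers in this sketch are made up for illustration (picoseconds, not taken from any real DLX implementation), and `clock_period` is a hypothetical helper name.

```python
def clock_period(stage_delays: dict, t_d: int, t_su: int) -> int:
    """Tclk = Td + TP + Tsu, where TP is the delay of the slowest stage."""
    t_p = max(stage_delays.values())
    return t_d + t_p + t_su


# Hypothetical combinatorial delays of the five stages, in picoseconds.
delays_ps = {"IF": 800, "ID": 600, "EX": 900, "MEM": 1000, "WB": 500}

t_clk = clock_period(delays_ps, t_d=50, t_su=30)
print(t_clk, "ps per clock cycle")
```

With these figures the MEM stage sets the clock: speeding up any other stage would not shorten the cycle, while the register overhead (Td + Tsu) is paid on every stage.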


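[Figure: two D registers (input and output stage registers) around a combinatorial circuit, annotated with the switch delay of the input stage register, the delay Tp of the slowest combinatorial stage and the set-up time of the output stage register]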


Pipeline implementation requirements

· Each stage is active at each clock cycle.

· The PC is incremented in the IF stage.

· An ADDER should be introduced in the IF stage (PC <= PC + 4, since one instruction is 4 bytes). But instructions are aligned (each one starts at an address that is a multiple of the instruction length in bytes), so PC bits 1 and 0 are always 0: only a 30-bit register (bits 31..2, a programmable counter for jumps) is used, incremented by 1 each clock cycle.

· Two Memory Data Registers are required (referred to as LMDR and SMDR). In fact, when a LOAD is immediately followed by a STORE, the WB and MEM stages overlap: two data items are therefore waiting to be written (one to memory, the other to a register of the RF).

· Each clock cycle, two memory accesses may have to be performed (IF, MEM): Instruction Memory (IM) and Data Memory (DM), i.e. a “Harvard” architecture.

· The CPU clock is determined by the slowest stage.

· Pipeline registers store both data and control information (“distributed” control unit).


DLX Pipelined Datapath

[Figure: five-stage datapath (IF, ID, EX, MEM, WB) with PC, adder (+4), INSTR MEM, DEC, RF, SE, ALU, “=0?” test, DATA MEM, multiplexers and the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]

Annotations in the figure:
• PC: actually a programmable counter, loaded if jump
• RF: Ra and Rb (RS1, RS2, scratchpad) are read; RD is the destination register number (1-31); D is the data to be written (from register, from memory, or PC for link)
• SE: sign extension, for operations with immediates
• Number of the destination register in case of LOAD and ALU instructions
• Adder path for computing the new PC value on a branch
• “=0?”: for Branch
• For Set Condition (also <0 and >0) [it acts on the output]
• JL and JLR (PC in R31)


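ID stage (N.B. stage layout different from previous slide!)

[Figure: ID stage between the IF/ID and ID/EX pipeline registers, with IR, RF, SE, DEC and the registers PC, A, B (32-bit paths)]

Annotations in the figure:
• RF read ports: Ra (IR25-21) and Rb (IR20-16), feeding A and B
• RF write port: number of the destination register (from WB stage) and data (from WB stage)
• Sign extension of bits 31-16 from IR15 (Immediate/Branch, 16-bit field)
• Sign extension of bits 31-26 from IR25 (Jump, 26-bit field, J and JL)
• IR15-0: offset/immediate (bits 15-11 as destination register in R instructions)
• IR25-16: Jump; Jump and Link
• IR10-0 (R instructions): to DEC
• IR31-26: opcode
• PC31-0 (JL and JLR)
• Information travelling with the instruction is stored in ID/EX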


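DLX Pipelined Datapath

[Figure: five-stage datapath (IF, ID, EX, MEM, WB) with PC, adder (+4), IM, DEC, RF (Ra, Rb, RD, D), SE, ALU, “=0?” tests, DM and multiplexers, showing the contents of the pipeline registers: IF/ID holds IR1 and PC1; ID/EX holds A, B, IR2 and PC2; EX/MEM holds COND, X, SMDR, IR3 and PC3; MEM/WB holds Y, LMDR, IR4 and PC4]

• X: computed data, or memory address, or branch address
• Y: computed data from the previous stage
• SMDR => Store Memory Data Register
• LMDR => Load Memory Data Register
• IRi => Instruction Register i
• Address and Data: connections to the DM
• Destination register number: travels with the instruction
• “=0?”: for Branch
• For Set Condition (also <0 and >0) [it acts on the output]
• JL, JLR (PC saved in R31)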


Pipelined execution of an “ALU” instruction

X: “ALUOUTPUT” (in EX/MEM), Y: “ALUOUTPUT1”

IF:   IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
ID:   A <= Ra; B <= Rb; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode
EX:   X <= A op B   or   X <= A op [(IR2_15)^16 ## IR2_15..0]; [PC3 <= PC2]; [IR3 <= IR2]
MEM:  Y <= X (temporary storage for WB); [PC4 <= PC3]; [IR4 <= IR3]
WB:   RD <= Y

The decoded opcode travels through all stages.

NOTE: the IRi bits are dropped stage by stage, as soon as they are no longer needed by any instruction. Why?


Pipelined execution of a “MEM” instruction

IF:   IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
ID:   A <= Ra; B <= Rb; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode
EX:   MAR <= A op (IR2_15)^16 ## IR2_15..0; SMDR <= B; [PC3 <= PC2]; [IR3 <= IR2]
MEM:  LMDR <= M[MAR] (if LOAD)   or   M[MAR] <= SMDR (if STORE); [PC4 <= PC3]; [IR4 <= IR3]
WB:   RD <= LMDR (if LOAD) [sign ext.]

The decoded opcode travels through all stages.


Pipelined execution of a “BRANCH” instruction (normally after a SCn instruction, see later)

X: “BTA (BRANCH TARGET ADDRESS)”

IF:   IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
ID:   A <= Ra; B <= Rb; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode
EX:   X <= PC2 op (IR2_15)^16 ## IR2_15..0; Cond <= A op 0; [PC3 <= PC2]; [IR3 <= IR2]
MEM:  if (Cond) PC <= X; [PC4 <= PC3]; [IR4 <= IR3]
WB:   (NOP)

The decoded opcode travels through all stages.

Notes:
• EX: the new PC address is computed; the branch depends on the value of register A (0/1)
• MEM: the new value is in the PC in this interval
• When the branch is taken, 3 new unwanted instructions have already been started


Pipelined execution of a “JR” instruction

IF:   IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
ID:   A <= Ra; B <= Rb; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode
EX:   X <= A (new PC address); [PC3 <= PC2]; [IR3 <= IR2]
MEM:  PC <= X; [PC4 <= PC3]; [IR4 <= IR3]
WB:   (NOP)

The decoded opcode travels through all stages.

The new value is in the PC in the MEM interval. When the jump is executed, 3 new unwanted instructions have already been started.

Which would be the stage sequence for a J instruction?


Pipelined execution of a “JL or JLR” instruction

IF:   IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
ID:   A <= Ra; B <= Rb; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode
EX:   PC3 <= PC2; X <= A (if JLR)   or   X <= PC2 + (IR2_25)^6 ## IR2_25..0 (if JL); [IR3 <= IR2]
MEM:  PC <= X; PC4 <= PC3; [IR4 <= IR3]
WB:   R31 <= PC4

The decoded opcode travels through all stages. In this case the PCi values are used.

NOTE: the write to R31 CANNOT be performed on the fly, since it could overlap with another register write.

The new value is in the PC in the MEM interval. When the jump is executed, 3 new unwanted instructions have already been started.


Which would be the sequence in case of SCn (e.g. SLT R1,R2,R3)?

IF:   IR <= M[PC]; PC <= PC + 4; PC1 <= PC + 4
ID:   A <= Ra; B <= Rb; PC2 <= PC1; IR2 <= IR1; ID/EX <= instruction decode
EX:   ?
MEM:  ?
WB:   ?


Pipeline Hazards

A “hazard” occurs when an instruction currently in a pipeline stage cannot be executed in the current clock cycle.

• Structural hazards: the same resource is used by two different pipeline stages, so the instructions currently in those stages cannot be executed simultaneously.

• Data hazards: they are due to instruction dependencies. For example, an instruction needs to read an RF register not yet written by a previous instruction (Read After Write).

• Control hazards: the instructions following a branch depend on the branch result (taken/not taken).

The instruction that cannot be executed must be stalled (“pipeline stall” or “pipeline bubble”), together with all the following instructions, while the previous instructions must proceed normally (so as to eliminate the hazard).


Hazards and stalls

[Figure: pipeline timing diagram over Clk 1 … Clk 12 for the instructions Ii-3, Ii-2, Ii-1, Ii, Ii+1; Ii and Ii+1 are held for three stall cycles (S S S)]

The consequence of a data hazard: if instruction Ii needs the result of instruction Ii-1 (data are read in the ID stage), it must wait until after the WB of Ii-1.

Stall: the clock signal for Ii, Ii+1, etc. is blocked for three periods. Normally the three stalled instructions are transformed into NOPs to avoid clock blocking.

Instruction stalls:

T5 = 8 * CLK = (5 + 3) * CLK

T5 = 5 * (1 + 3/5) * CLK
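The two expressions for T5 are the same quantity written in two ways: total clocks, and instruction count times an effective CPI. A minimal sketch of that bookkeeping follows; it counts one clock per instruction plus the stall clocks, as the slide does, and the function names are illustrative.

```python
def total_cycles(n_instructions: int, stall_cycles: int) -> int:
    """One clock per instruction plus the clocks lost to stalls."""
    return n_instructions + stall_cycles


def effective_cpi(n_instructions: int, stall_cycles: int) -> float:
    """Average clocks per instruction once stalls are included."""
    return total_cycles(n_instructions, stall_cycles) / n_instructions


# The slide's case: 5 instructions and one hazard costing 3 stall clocks.
print(total_cycles(5, 3))
print(effective_cpi(5, 3))
```

The second form makes the cost visible as a CPI penalty: each stall clock adds 1/N to the ideal CPI of 1.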


Forwarding

Forwarding eliminates almost all the RAW hazards of the pipeline without stalling it.

(NOTE: in DLX, registers are modified only in the WB stage; data are read from registers in the ID stage)

Instruction        Note        Clk1 Clk2 Clk3 Clk4 Clk5 Clk6 Clk7 Clk8 Clk9
ADD R3, R1, R4                 IF   ID   EX   MEM  WB
SUB R7, R3, R5     hazard           IF   ID   EX   MEM  WB
OR  R1, R3, R5     hazard                IF   ID   EX   MEM  WB
LW  R6, 100(R3)    hazard                     IF   ID   EX   MEM  WB
AND R9, R5, R3     no hazard                       IF   ID   EX   MEM  WB

LW: here too the requested data is not yet in the RF, since it is written on the positive clock edge at the end of WB (the register value is read in ID!).
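The hazard labels in this example follow from a simple distance rule: without forwarding, a result written at the end of WB can be read in ID only by an instruction issued at least four positions later. The sketch below applies that rule to the five instructions; the tuple representation `(dest, sources)` and the name `raw_hazards` are choices made for the example.

```python
def raw_hazards(program, window=3):
    """Flag each instruction that reads a register written by one of the
    `window` instructions immediately before it (no forwarding assumed)."""
    flags = []
    for j, (_, sources) in enumerate(program):
        hazard = any(
            program[i][0] in sources and program[i][0] != "R0"  # R0 is fixed to 0
            for i in range(max(0, j - window), j)
        )
        flags.append(hazard)
    return flags


# (destination, source registers) for the instruction sequence of the slide.
program = [
    ("R3", ("R1", "R4")),   # ADD R3, R1, R4
    ("R7", ("R3", "R5")),   # SUB R7, R3, R5
    ("R1", ("R3", "R5")),   # OR  R1, R3, R5
    ("R6", ("R3",)),        # LW  R6, 100(R3)
    ("R9", ("R5", "R3")),   # AND R9, R5, R3
]

for (dest, srcs), hazard in zip(program, raw_hazards(program)):
    print(dest, srcs, "hazard" if hazard else "no hazard")
```

A forwarding unit performs essentially the same register-number comparisons in hardware, but uses the match to select a bypass path instead of stalling.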


Forwarding implementation

[Figure: INSTRUCTION DECODE, ALU and Memory stages with the ID/EX, EX/MEM and MEM/WB pipeline registers (IR3, IR4), bypass multiplexers on the ALU inputs (A, B, Offset, PC) and the FU block]

• FU: combinatorial (!) comparison between RS1, RS2 (with the opcode) and RD1, RD2 (destination registers, with the opcodes)
• RF MUX (bypass MUX), often implemented inside the RF: it allows “the anticipation” of the register onto ID/EX. MUX control: IF/ID opcode and comparison of RD with RS1 and RS2


Data hazard due to LOAD instructions

Without stall:
  LW R1,32(R6):   IF ID EX MEM WB
  ADD R4,R1,R7:   IF ID EX MEM
  SUB R5,R1,R8:   IF ID EX
  AND R6,R1,R7:   IF ID

NOTE: the data required by the ADD is available only at the end of the MEM stage. This hazard cannot be eliminated by forwarding (unless there is an additional input in the MUXs between memory and ALU, which means delays!)

The pipeline needs to be stalled:
  LW R1,32(R6):   IF ID EX MEM WB
  ADD R4,R1,R7:   IF ID S EX MEM
  SUB R5,R1,R8:   IF ID EX
  AND R6,R1,R7:   IF ID

The stalled instruction is transformed into a NOP and PC <- PC - 4:
  NOP:            IF ID EX MEM WB
  ADD R4,R1,R7:   IF ID EX MEM

From the end of the MEM stage of the LW onwards: standard forwarding.
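The condition that forces the stall is easy to state: a load immediately followed by an instruction that reads the loaded register. The following sketch counts such cases, assuming full forwarding for every other dependency; the tuple layout `(opcode, dest, sources)` and the names `LOADS` and `load_use_stalls` are assumptions of the example.

```python
LOADS = {"LW", "LB", "LBU", "LH", "LHU"}


def load_use_stalls(program) -> int:
    """Count the stalls caused by a load whose result is used by the very
    next instruction (all other RAW hazards are assumed to be forwarded)."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_sources = curr
        if prev_op in LOADS and prev_dest in curr_sources:
            stalls += 1
    return stalls


# (opcode, destination, source registers) for the sequence of the slide.
program = [
    ("LW", "R1", ("R6",)),          # LW  R1,32(R6)
    ("ADD", "R4", ("R1", "R7")),    # ADD R4,R1,R7  <- uses R1 right after the load
    ("SUB", "R5", ("R1", "R8")),    # SUB R5,R1,R8
    ("AND", "R6", ("R1", "R7")),    # AND R6,R1,R7
]

print(load_use_stalls(program))
```

Only the ADD pays the penalty: SUB and AND also read R1, but by the time they reach EX the loaded value is available through standard forwarding.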


Delayed load

In many RISC CPUs, the hazard associated with the LOAD instruction is not handled in hardware by stalling the pipeline, but in software by the compiler (delayed load):

  LOAD instruction
  delay slot
  Next instruction

The compiler tries to fill the delay slot with a “useful” instruction (worst case: NOP).

LW R1,32(R6)

LW R3,10 (R4)

ADD R5,R1,R3

LW R6, 20 (R7)

LW R8, 40(R9)

LW R1,32(R6)

LW R3,10 (R4)

ADD R5,R1,R3

LW R6, 20 (R7)

LW R8, 40(R9)


Control Hazards

PC          BEQZ R4, 200
PC+4        SUB R7, R3, R5
PC+8        OR R1, R3, R5
PC+12       LW R6, 100(R8)
…
PC+4+200    AND R9, R5, R3   (BTA)

Next instruction address:
• R4 = 0: Branch Target Address (taken)
• R4 ≠ 0: PC+4 (not taken)

[Figure: pipeline timing diagram over Clk 1 … Clk 8 for BEQZ R4, 200 followed by SUB R7, R3, R5, OR R1, R3, R5 and LW R6, 100(R8). Annotations: “New computed PC value (Aluout)”; “New value in PC (one clock after: the new value must be clocked onto the PC)”; “Fetch with the new PC”]


DLX Pipelined Datapath (Branch or JMP)

[Figure: pipelined datapath (Instruction Fetch, Instruction Decode, Execute, Memory, Write Back) with PC, adder (+4), IM, DEC, RF, SE, ALU, “=0?” tests, DM, multiplexers and the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB, shown while executing BEQZ R4, 200]

When the new PC acts on the IM, three instructions have already travelled through the first three stages (EX included).

NOTE: if the feedback signal of the new PC were taken directly from the ALU output instead of from ALUOUT, only two stalls would be required, at the price of a slower clock!


Handling the Control Hazards

• Always Stall (three-clock block being propagated)

[Figure: timing diagram over Clk 1 … Clk 8. BEQZ R4,200 runs IF ID EX MEM WB; the following instruction is stalled (S S S) and then fetched at the new PC]

Annotations in the figure:
  · Fetch at new PC
  · IF here: the previous instruction has not yet been decoded
  · Real situation: repeated IF, PC <= PC - 4

• Predict Not Taken

[Figure: timing diagram over Clk 1 … Clk 8 for BEQZ R4, 200; SUB R7, R3, R5; OR R1, R3, R5; LW R6, 100(R8), fetched in sequence, then the instruction at the new PC]

Annotations in the figure:
  · Here the new value of PC is computed
  · Here the new value is sampled by the PC (branch completion)
  · If the branch is taken: flush. The three instructions become NOPs (NOP NOP NOP). No data has been written yet
  · No problem, because no instruction is in the WB stage


Stalls with jumps (1/3)

[Figure: five-stage datapath (IF, ID, EX, MEM, WB) with PC, adder (+4), INSTR MEM, DEC, RF (RS1, RS2, DR, D), SE, ALU, “=0?” test, DATA MEM, multiplexers and the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; when a jump occurs, a NOP is forced into each of the first three pipeline registers]

When the Branch Target Address is clocked into the PC, three unwanted instructions are already in IF/ID, ID/EX and EX/MEM.

Three NOPs MUST replace the 3 unwanted instructions already started.


Stalls with jumps (2/3)

[Figure: the same five-stage datapath; when a jump occurs, a NOP is forced into two pipeline registers]

NOTE: in this case the jump condition detection and the new PC value are input to the MUX in the same clock interval.

Two NOPs MUST replace the 2 unwanted instructions already started.


Stalls with jumps (3/3)

[Figure: the same five-stage datapath; when a jump occurs, a NOP is forced into one pipeline register]

NOTE: in this case the jump condition and the new PC act on the MUX in the same period in which the condition is detected.

A NOP MUST replace the unwanted instruction already started.

Very slow solution!


Delayed branch

Similarly to the LOAD case, in several RISC CPUs the BRANCH instruction hazard is handled in software by the compiler (delayed branch):

  BRANCH instruction
  delay slot
  delay slot
  delay slot
  Next instruction

The compiler tries to fill the delay slots with “useful” instructions (worst case: NOP).


Delayed branch/jump

Original:
  Add R5, R4, R3
  Sub R6, R5, R2
  Or R14, R6, R21
  Sne R1, R8, R9 ; branch condition
  Br R1, +100

Compiled:
  Sne R1, R8, R9 ; branch condition
  Br R1, +100
  Add R5, R4, R3
  Sub R6, R5, R2
  Or R14, R6, R21

The three instructions after the branch are executed in both cases. Obviously, in this group of instructions there must be no jumps!

When no suitable instructions are available, the compiler inserts NOPs in place of one or more “postponed” instructions.


Independent Adder for BRANCH/JMP

To reduce the number of stalls, the Branch Target Address is computed in the ID stage by an additional full adder instead of the ALU.

IF:   IR <- M[PC]; PC <- PC + 4; PC1 <- PC + 4
ID:   A <- Ra; B <- Rb; PC2 <- PC1; ID/EX <- Decode; ID/EX <- Opc ext.
      BTA <= PC1 + (IR15)^16 ## IR15..0   (Branch)
      BTA <= PC1 + (IR25)^6 ## IR25..0    (JMP)
      if Branch: if (RS1 op 0) PC <= BTA
      if JMP: always PC <= BTA
EX:   -------------------------
MEM:  -------------------------
WB:   -------------------------

(New fetch: only one stall)

NOTE: in this case there is only one “stall”, since the new value is inserted in the PC on the positive clock edge that ends the ID stage while, in the previous case, it was inserted after the MEM stage, that is, two clocks later!


BRANCH/JMP: 1 stall

[Figure: IF and ID stages with PC, adder (+4), IM, DEC, RF, SE, the pipeline registers IF/ID (IR1, PC1) and ID/EX (A, B, PC2), an additional ADDER, the “= 0 ?” test and the PC multiplexers]

Annotations in the figure:
• ADDER inputs: PC of the Branch instruction and displacement of the Branch instruction (offset and sign extension, ##)
• “= 0 ?”: for Branches
• PC MUX inputs: standard increment, or Branch
• The source of the next PC is selected according to the opcode and the value of the branch test register

NOTE: for “Unconditional Jump” instructions the situation is similar: we only need to provide further inputs to the MUXs of the PC, taking either the RS1 register (JR and JLR) or the 26 least-significant bits of the IR with sign extension (J and JL), to be added to the PC of the instruction (not the current PC).


Handling the Control Hazards

Dynamic prediction: Branch Target Buffer => no stall (almost...)

[Figure: Branch Target Buffer. The PC is compared (“=”) with the TAGS; each entry holds a Predicted PC and a T/NT bit]

• HIT: fetch with the predicted PC
• MISS: fetch with PC + 4

Correct prediction: no stalls. Wrong prediction: 1-3 stalls (correct fetch in ID or EX, see before).

N.B. Here the branch slot is selected during the IF clock cycle that loads IR1 in IF/ID.
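The lookup can be modelled as a small associative table indexed by the PC of the fetched instruction. In this sketch the table is a Python dictionary, the addresses are invented, and using the T/NT bit to choose between the predicted PC and PC + 4 on a hit is an assumption about how the bit is used; the slide only states the HIT and MISS cases.

```python
def next_fetch_address(btb: dict, pc: int) -> int:
    """Choose the next fetch address from a Branch Target Buffer lookup."""
    entry = btb.get(pc)                 # tag comparison against the PC
    if entry is None:                   # MISS: not a known branch
        return pc + 4
    predicted_pc, predicted_taken = entry
    # HIT: assumed here to follow the T/NT bit stored with the entry.
    return predicted_pc if predicted_taken else pc + 4


# Hypothetical buffer content: tag (branch address) -> (predicted PC, T/NT).
btb = {
    0x100: (0x1CC, True),    # e.g. a branch at 0x100 with target 0x100 + 4 + 200
    0x140: (0x090, False),
}

for pc in (0x100, 0x140, 0x200):
    print(hex(pc), "->", hex(next_fetch_address(btb, pc)))
```

Because the lookup uses only the PC, it can run in parallel with the instruction fetch, which is what makes a stall-free correct prediction possible.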


Prediction Buffer: the simplest implementation uses a single bit that records what happened the last time the branch was executed.

When one outcome predominates, each occurrence of the opposite outcome causes two consecutive errors.

[Figure: code with two loops, Loop1 and Loop2, with Loop2 nested inside Loop1]

When the program exits loop2, the prediction fails (the branch is predicted as taken but is actually untaken); it then fails again when loop2 is entered once more, because the branch is now predicted as untaken.
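The double error can be reproduced by simulating the single-bit scheme on the outcomes of a loop-closing branch. The sketch assumes an inner loop of four iterations (taken three times, then untaken) executed three times, and an initial prediction of taken; both are arbitrary choices for the example.

```python
def one_bit_mispredictions(outcomes, initial=True) -> int:
    """Predict that each branch behaves as it did the previous time."""
    last = initial          # the single prediction bit
    misses = 0
    for taken in outcomes:
        if taken != last:
            misses += 1
        last = taken        # remember what actually happened
    return misses


# Loop-closing branch of the inner loop: taken, taken, taken, untaken.
inner_loop = [True, True, True, False]
outcomes = inner_loop * 3   # the inner loop is executed three times

print(one_bit_mispredictions(outcomes))
```

Apart from the first pass, every execution of the inner loop costs two mispredictions: one at the exit and one at the next entry.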


Usually two bits are used.

[Figure: four-state diagram of the two-bit predictor, with transitions labelled TAKEN and UNTAKEN]