Computer Architecture II
CS 6143
Versions 0 & 1
MIPS CPU
Haldun Hadimioglu
Computer Science & Engineering

Outline
- Introduction
- Version 0 MIPS CPU : Unpipelined MIPS CPU
  - It executes integer instructions
- Version 1 MIPS CPU : Pipelined MIPS CPU
  - It executes integer instructions
- Handout to use : MIPS CPU

Getting ready for CS6143
- The prerequisite for CS6143
  - CS6133 for graduate students
  - CS2214 for undergraduate students
- CS6143 students who took the prerequisite course but did not use the Hennessy & Patterson book should expect to put in more effort than the other CS6143 students
  - They will have to learn the MIPS assembly language and the MIPS pipeline by themselves !
- If you are not sure you are ready for CS6143, work through the execution timing of the pipelined MIPS CPU on the next slide
  - You learned about it when you took the prerequisite course, CS6133 or CS2214
  - If you do not understand the timing, you need to take CS6133
  - If you understand the timing, study the remaining slides to refresh your memory on CPU design and pipelining

Test Program (Pipelined MIPS CPU Design : Version 1)
- Determine when the execution of the second iteration ends if L1 cache memories take one clock period and there is no cache miss
- Show all forwardings and write-in-the-first-half-read-in-the-second-half cases

Instruction          First iteration              Second iteration
                     IF    ID    EX   MEM  WB     IF     ID     EX   MEM  WB
LD   R1, 500(R8)     1     2     3    4    5      10     11     12   13   14
DADD R2, R3, R1      2     3/4   5    6    7      11     12/13  14   15   16
DSUB R5, R2, R1      3/4   5     6    7    8      12/13  14     15   16   17
XOR  R8, R5, R2      5     6     7    8    9      14     15     16   17   18
SLT  R11, R2, R5     6     7     8    9    10     15     16     17   18   19
OR   R14, R11, R15   7     8     9    10   11     16     17     18   19   20
BNEZ R14, (-7)_10    8     9     -    -    -      17     18     -    -    -
SD   R11, 600(R14)   9     10    11   12   -      18     19     20   21   -

(A dash marks a stage for which the slide lists no clock period; the subscript 10 on the branch offset means decimal)
- All data hazards are RAW
- The second iteration ends in clock period 21

Introduction
- On the microarchitecture layer, a computer is a collection of at least three interconnected digital systems
  - A central processing unit (CPU)
  - A (main) memory
  - An I/O controller to control an I/O device, such as the disk
    - There can be several I/O controllers to control different I/O devices
[Figure: block diagram with Memory, CPU, I/O Controller, Interconnection System and Disk]

Digital Systems
- A digital system performs microoperations
- It consists of a datapath (data unit) and a control unit
  - The datapath actually performs the microoperations
  - The control unit determines which microoperation happens when
[Figure: Datapath (Registers, ALUs, Buses) and Control Unit (Sequencer), with Status signals and Control signals between them]

Digital Systems
- The datapath (data unit) has registers, ALUs and buses to perform the microoperations
  - Registers keep information temporarily
  - ALUs perform arithmetic/logic operations
  - Buses interconnect the registers and ALUs
  - Other components used include multiplexers (MUXes), decoders, encoders, comparators, counters, etc.

Digital Systems
- The control unit has a sequencer circuit that determines the sequence of microoperations
  - The sequencer needs status signals from the data unit to know what is happening there
  - Then, it determines which microoperations are to be performed and indicates them to the datapath by means of control signals

Designing digital systems
- Datapath design is simpler than control unit design since the datapath has highly regular (duplicated) circuits
  - A 64-bit ADDer is composed of 4 identical 16-bit ADDers
  - A 64-bit comparator consists of 8 identical 8-bit comparators, etc.
- Control unit design is more difficult due to
  - Large amounts of random logic
  - A lot of effort is needed to make sure there are no timing problems
    - Microoperations must start at the right time and end at the right time !

Designing digital systems
- We will use the finite-state machine (FSM) technique to design the MIPS CPU, where the FSM state diagram will have states with microoperations
  - The state diagram shows precisely which state follows which state
  - Each state indicates which microoperations to perform
  - The state diagram shows which states are needed, and when, for each machine language instruction

Designing the microarchitecture level of a computer
- There are two tasks in this design
  - Develop the CPU and memory digital systems so that instructions can be run
  - Develop the memory and I/O controller digital systems so that I/O can happen
- We will concentrate on the CPU and memory digital systems

Designing the CPU and memory digital systems
- First we focus on the CPU digital system while we quickly make a few design decisions on the memory hierarchy
  - We will design the CPU as a slow CPU running only integer instructions : No pipelining
    - This is Version 0
  - Then, we will improve the CPU speed by using pipelining, but still running integer instructions
    - This is Version 1
- For both versions the memory will be a black box with a few details

Designing the CPU as a Digital System
- The MIPS CPU digital system
  - We will concentrate on the FSM state diagram of the MIPS CPU
    - The FSM state diagram describes both the datapath and the control unit
  - Datapath of the CPU
    - Datapath hardware for the execution of integer MIPS instructions will be covered
  - We will not concentrate on the MIPS CPU control unit
    - It can be implemented by hardwiring and/or microprogramming

Designing the CPU digital system
- To design the MIPS CPU, we will start with the MIPS architecture
- What is the connection between the architecture and the CPU?
  - A computer processes digital information by running machine language instructions
  - A program is a list of instructions, each of which specifies operations on data (arguments)
  - An instruction specifies architectural operations
  - Each architectural operation is implemented by microoperations

Designing the CPU Digital System
- In order to perform an architectural operation, the CPU performs a series of microoperations in a number of clock periods
  - That is, an architectural operation is broken down into smaller operations called microoperations
- That is, to run a machine language instruction, the CPU performs microoperations
  - The CPU performs some microoperations alone and some in cooperation with the memory and the I/O controllers

Designing the CPU Digital System
- Architectural operations
  - An architectural operation is what we describe as the semantics of the instruction
    - The architectural operation specified by the DADD instruction
      - Rd ← Rs + Rt
    - The architectural operation specified by the DSLLV instruction
      - Rd ← Rs << Rt
    - The architectural operation specified by the MOVN instruction
      - If Rt < 0 then Rd ← Rs
    - The architectural operation specified by the J instruction
      - PC[36-63] ← (4 x Offset)
  - It is the CPU that contributes the most to the execution of an instruction since it performs most of the microoperations needed for an architectural operation

Designing the CPU Digital System
- Typical CPU digital system microoperations
  - Add, subtract, multiply
    - In the past, a 32-bit addition was completed in 1 clock period. Today, a 64-bit addition is completed in several clock periods
  - AND, OR, XOR
  - Shift right, shift left
  - Read data from memory, write data to memory
    - In the past, a memory access was completed in 1 clock period. Today, it is completed in several clock periods
  - Read instructions from memory (fetch)
  - Increment the program counter
  - Transfer a register to another register
  - …

Designing the CPU as a Digital System
- Other machines, especially CISC machines, require other microoperations such as
  - Reading indirect address(es) from the memory
  - Effective address calculation for
    - Indexing
    - Autoincrement
    - Autodecrement
  - Alignment for
    - Instructions
    - Data
    - Addresses

Designing the CPU Digital System
- Architecture’s effect on microoperations
  - The decisions made on the architecture determine the microoperations needed for the execution of the instructions
    - General microoperations found on most CPUs
      - The ones mentioned on previous slides
    - Specific microoperations for certain CPUs
      - Specific microoperations for MMUs, caches, I/O controllers
  - The architecture also determines the characteristics of each microoperation
    - If the autoincrement addressing mode is used, the number to be automatically added to the base register can be 4 or 8 depending on the memory location length and the word length
    - Whether to attach 16 bits or 32 bits during sign extension
  - Thus, each machine language instruction requires a certain number of microoperations taking a certain time : the CPIi

Designing the CPU Digital System
- Microoperations
  - The CPU can perform one or more microoperations per clock period, depending on the complexity of the microoperation and the availability of the hardware resources
    - Most often a microoperation can be completed in one clock period unless it is a complex microoperation
    - If a complex microoperation is to be run in one clock period, the clock period needs to be longer
  - The more numerous and complex the microoperations are, the longer it takes to run the machine language instruction
    - CISC instructions take longer to execute (larger CPIi) for this reason

Designing the CPU Digital System
- Calculating CPIi
  - The time it takes to run an instruction, measured in clock periods as CPIi, is then determined by
    - The number of microoperations needed for it
    - The complexity of the microoperations
  - The number of clock periods for an instruction, CPIi, becomes a matter of figuring out the microoperations and how to distribute them to individual clock periods
    - One can come up with 5-10 simple microoperations to be performed one after another, resulting in a CPIi of 5-10
      - But, since the microoperations are simple, the clock period is short
    - Alternatively, one can come up with 2-4 complex microoperations, resulting in a CPIi of 2-4
      - But, the clock period is longer
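
The trade-off between the two options can be made concrete with a little arithmetic: the time for one instruction is CPIi multiplied by the clock period. The sketch below is only an illustration; the function name and the clock period values are assumptions chosen for the example, not figures from the course.

```python
def instruction_time_ns(cpi_i, clock_period_ns):
    """Time for one instruction = CPIi x clock period."""
    return cpi_i * clock_period_ns

# Hypothetical design points (illustrative numbers, not from the slides).
many_simple = instruction_time_ns(cpi_i=8, clock_period_ns=0.5)  # 8 short clock periods
few_complex = instruction_time_ns(cpi_i=3, clock_period_ns=1.5)  # 3 long clock periods

print("many simple microoperations:", many_simple, "ns")
print("few complex microoperations:", few_complex, "ns")
```

With these assumed numbers, the design with the larger CPIi still finishes the instruction sooner because its clock period is much shorter.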

Designing the CPU Digital System
- Calculating CPIi
  - What can we do ?
    - Few long clock periods vs. many but shorter clock periods ?
    - Since increasing the clock frequency is important for marketing purposes, the second option would weigh in substantially
    - It turns out that if pipelining is implemented, having many shorter clock periods would not matter, as we will see
      - CPIi figures will be large but CPIave will be close to 1 (one) !
      - Today’s microprocessors have instruction CPIi values in the range of 10-30, but CPIave figures for their targeted applications are even less than 1 (one) !

Designing the CPU Digital System
- Determining microoperations for a machine language instruction
  - Some microoperations are performed for all the instructions
    - Usually at the same point in time during the execution of every instruction
    - Fetching the instruction is always the first microoperation to perform for all CPUs
    - Updating PC (PC ← PC + 4) so that it points at the next instruction is also universal
  - The other microoperations depend on the instruction, the addressing mode, where the arguments are, the length of the arguments, etc.

Designing the CPU Digital System
- Determining microoperations for a machine language instruction
  - We would list all the microoperations for each instruction, making sure that we are consistent in terms of
    - Bus usage
      - We often decide on an approximate number of buses we need for our datapath
      - Today’s CPUs have at least three internal buses to complete an integer arithmetic microoperation in one clock period
        - Two buses carry the numbers from two registers and the third bus carries the result to a register
    - ALU usage
      - An ALU is expensive and so we try to limit the number of them

Designing the CPU Digital System
- Determining microoperations for a machine language instruction
  - We would list all the microoperations for each instruction, making sure that we are consistent in terms of
    - Register usage
      - Additional registers not visible to the architecture level are used to keep temporary values : microarchitecture registers
      - Typically, the more registers are used, the more clock periods we spend for an instruction since temporary values will be passed from one clock period to another
      - But, sometimes we have to use microarchitecture registers, such as the instruction register that keeps the current instruction
    - Control unit usage

Designing the CPU Digital System
- Designing the MIPS CPU digital system
  - Determine how each MIPS architectural operation is implemented by microoperations
    - Most microoperations must be simple enough to be completed in less than one clock period
    - A few microoperations may not be completed in a clock period
      - For example, a memory read may take several clock periods
      - These microoperations should be accommodated in the FSM state diagram, the datapath and the control unit

Designing the CPU Digital System
- The MIPS microoperations implied by the MIPS machine language instructions are
  - Instruction fetch, performed always
  - Update PC for next instruction, performed always
  - Effective address calculation for Displacement and relative addressing modes
  - Sign extension or catenation of 0s for data/addresses
  - Reading data from the memory
  - Writing data to the memory
  - Performing an arithmetic/logic operation
  - Register transfer
  - Testing a condition

Unpipelined MIPS CPU : Version 0
- By using the MIPS CPU Handout
  - The most interesting component of a computer is the CPU
    - We know that the CPU has registers, buses, ALUs and a sequencer, among other components
  - Note that whether hardwiring or microprogramming is used, the datapath stays the same, at least theoretically
    - The textbook gives the description of the datapath, not the control unit
    - We will do the same thing
  - The datapath performs microoperations on data
    - It uses registers, buses and the ALU for that purpose
  - The microoperations are in turn controlled by the control unit

Overview
- We are now ready for the organizational design of the MIPS
  - We know the architecture of MIPS
- We will design
  - The MIPS CPU that will have
    - A control unit with a sequencer
    - A datapath containing registers, buses and the ALU
  - The datapath performs the microoperations and the control unit determines the timing and sequence of these microoperations

Overview
- The way the MIPS computer is covered indicates that the authors organized the computer similar to the commercial MIPS systems where
  - There is an integer MIPS CPU
  - A system control coprocessor (CP0) responsible for memory management and cache control
  - An FP coprocessor (CP1)
- The integer MIPS CPU registers are either architectural or microarchitectural (temporary registers)
- There are two other coprocessors, CP2 and CP3, that are reserved for future use

Overview
- Designing the MIPS CPU for all of the instructions is prohibitive
- First, we will design a MIPS CPU to execute only integer instructions that include
  - LD, SD
  - DADD, DSUB
  - DADDI
  - AND, OR, XOR
  - ANDI, ORI, XORI
  - SLT
  - SLTI
  - BEQZ, BNEZ
- All these integer instructions use either the I-format or the R-format
- We will not cover the execution of J-format instructions
  - Their execution hardware can be derived after learning how the hardware for R-format and I-format instructions is constructed

Overview
- The MIPS CPU will have all the architectural registers
  - 32 64-bit GPRs
  - 64-bit PC
- FP registers are to be added later in the semester

New Microarchitectural registers
- These (temporary) registers are not a part of the state (hence architecture)
  - 32-bit instruction register, IR, to keep the current instruction
    - IR contains the instruction until it is completely executed
  - 64-bit A and B registers
    - They keep the content of the Rs and Rt registers of the current instruction
  - 64-bit register Imm
    - It contains the sign extended value of the 16-bit Displacement/Offset/Immediate (DOImm) field of I-type instructions

New Microarchitectural registers
- 64-bit Load Memory Data, LMD, register
  - It keeps the data read from the memory for Load instructions
- 64-bit ALUoutput register
  - It keeps the result of the ALU operation temporarily
- 1-bit Cond register
  - It keeps the result of the compare operation between register A and 0
  - This is needed for the BEQZ and BNEZ instructions that compare register Rs with 0

New Microarchitectural registers
- 64-bit A and B registers

I format
Field   Opcode  Rs  Rt  Displacement/Offset/Immediate
Bits    6       5   5   16

R format
Field   Opcode  Rs  Rt  Rd  Shamt  Function
Bits    6       5   5   5   5      6

In both formats, Rs selects the GPR that goes to register A and Rt selects the GPR that goes to register B
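
The field widths above translate directly into shifts and masks on the 32-bit IR. The sketch below assumes the usual MIPS bit placement, with Opcode in the most significant bits and the fields following in the order listed; the function name and the example word are made up for illustration.

```python
def decode_fields(ir):
    """Split a 32-bit instruction word into its I-format and R-format fields.

    Assumes Opcode occupies the most significant 6 bits and the remaining
    fields follow in the order given on the slide.
    """
    return {
        "opcode": (ir >> 26) & 0x3F,  # 6 bits
        "rs": (ir >> 21) & 0x1F,      # 5 bits
        "rt": (ir >> 16) & 0x1F,      # 5 bits
        "rd": (ir >> 11) & 0x1F,      # 5 bits (R format only)
        "shamt": (ir >> 6) & 0x1F,    # 5 bits (R format only)
        "funct": ir & 0x3F,           # 6 bits (R format only)
        "doimm": ir & 0xFFFF,         # 16 bits (I format only)
    }

# Example word assembled from arbitrary field values (not a real program).
word = (0x37 << 26) | (8 << 21) | (1 << 16) | 500
fields = decode_fields(word)
print(fields["opcode"], fields["rs"], fields["rt"], fields["doimm"])
```

All fields are extracted regardless of the instruction format, which mirrors the idea that the hardware wires the Rs and Rt bit positions to the register file for every instruction.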

New Microarchitectural registers
- Even if an instruction does not have Rs and Rt fields, such as a J-format instruction, the bits in the Rs and Rt field positions are still used to move GPR content to A and B, respectively
  - The values of the A and B registers will not be used !
  - The reason for moving to A and B is to make the common case fast, where we think most instructions are R-format or I-format and require this move !

J format (Jump)
Field   Opcode  Offset26
Bits    6       26

The two 5-bit groups of Offset26 that sit in the Rs and Rt positions go to register A and register B

New Microarchitectural registers
- 64-bit register Imm
  - Even if the current instruction is not an I-format instruction, such as an R-format or J-format instruction, the DOImm field bits are used to move DOImm+ to Imm
    - The value of the Imm register will not be used !
    - The reason for moving to Imm is to make the common case fast, where we think many instructions are I-format and require this move !

I format
Field   Opcode  Rs  Rt  Displacement/Offset/Immediate
Bits    6       5   5   16

The Displacement/Offset/Immediate field goes to register Imm after sign extension
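
Sign extension of the 16-bit field to the 64-bit Imm register can be sketched as follows. This is a minimal software model, not the hardware circuit; the function and constant names are assumptions for the example.

```python
MASK64 = (1 << 64) - 1

def sign_extend_16_to_64(doimm):
    """Widen a 16-bit field to a 64-bit pattern by copying bit 15 into bits 16-63."""
    doimm &= 0xFFFF
    if doimm & 0x8000:                    # bit 15 set: attach 48 ones
        return doimm | (MASK64 ^ 0xFFFF)
    return doimm                          # bit 15 clear: attach 48 zeros

print(hex(sign_extend_16_to_64(0x0005)))
print(hex(sign_extend_16_to_64(0xFFF9)))
```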

New Microarchitectural registers
- The textbook implies in Appendix A that the Displacement used for loads and stores is signed
- Similarly, the textbook sign extends the immediate data elements of the ANDI, ORI and XORI instructions
  - Instead of attaching zeros to the left
- In order not to complicate the coverage of the textbook CPU design, we will accept these and assume the 16-bit value is signed for the integer instructions we will work on
  - We will use DOImm+ to indicate a sign-extended value from now on

The MIPS CPU state diagram
- The design of a CPU is very complex
  - We have to consider the space (hardware) and time (speed)
  - The design, analysis, description, testing, modification, optimization, servicing and maintenance can be more efficient if there are efficient tools around
    - These include HDLs and CAD tools
- The textbook uses a typical register transfer language (RTL) notation in Appendix A to describe the execution of instructions
  - We will use the same RTL notation, which is also used in the handout
- To quickly see the execution steps of the integer machine language instructions, an FSM state diagram and a CPU datapath figure are developed in the handout
  - Additionally, timing diagrams and tables are provided to understand the CPU design

The MIPS CPU state diagram
- An instruction goes through several phases when executed
  - We give a name to each phase of an instruction execution
    - A phase is also called a major cycle
  - Each major cycle will take one or more minor cycles (clock periods)
    - Each minor cycle is a state
    - Each minor cycle typically takes one clock period
  - Each major cycle often has at least one microoperation
    - Often the name of a major cycle is derived from the major microoperation of the cycle

The MIPS CPU state diagram
- The number of major cycles and their complexity are small for RISC systems and larger for CISC systems
  - Often for RISC systems, the CPIi for the most frequently used instructions is between 4 and 6
    - However, this number has to be larger to have deep pipelining and high clock frequencies
- In simple systems like RISC systems, sharing of hardware among different major cycles is not necessary
  - A hardware resource is often needed in one major cycle only
  - The hardware for each major cycle can then be easily identified and is often named a stage
  - So, the execution of an instruction is the movement of the instruction through some or all of the stages of the CPU !

The MIPS CPU state diagram
- The MIPS integer instructions go through at most five major cycles during execution
  - However, even for this RISC machine, it is difficult to come up with 5 cycle names because not all instructions do similar things in a major cycle
  - Some microoperations will be performed in advance in anticipation of a frequent operation
    - The early operations will not alter the state and will not cause longer clock periods, but will slightly increase the hardware

The MIPS CPU state diagram
- The MIPS CPU major cycles for integer instructions (pages A-27 – A-28)
  - Instruction fetch cycle
    - Abbreviated as IF, standing for instruction fetch
    - Same for all MIPS instructions
  - Instruction decode/Register fetch cycle
    - Abbreviated as ID, standing for instruction decode
    - Same for all MIPS instructions
  - Execution/effective address cycle
    - Abbreviated as EX, standing for execution
  - Memory access/branch completion cycle
    - Abbreviated as MEM, standing for memory
  - Write-back cycle
    - Abbreviated as WB, standing for write-back
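
Since each major cycle takes one or more minor cycles, the CPIi of an instruction is the sum of the clock periods spent in the major cycles it goes through. The sketch below only illustrates that bookkeeping; the function name and the 3-clock memory access are assumptions for the example, not values from the handout.

```python
MAJOR_CYCLES = ["IF", "ID", "EX", "MEM", "WB"]

def cpi_i(cycles, minor_cycles=None):
    """Sum the clock periods (minor cycles) spent in each major cycle.

    A major cycle not listed in minor_cycles is assumed to take 1 clock period.
    """
    minor_cycles = minor_cycles or {}
    return sum(minor_cycles.get(name, 1) for name in cycles)

all_single = cpi_i(MAJOR_CYCLES)                # every major cycle takes 1 clock period
slow_memory = cpi_i(MAJOR_CYCLES, {"MEM": 3})   # hypothetical 3-clock memory access

print(all_single, slow_memory)
```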

The MIPS CPU state diagram
- To emphasize again, designing a CPU is determining which microoperation happens when for each architectural operation (the semantics of the instruction)
  - For the MIPS, like many other CPUs, the IF and ID stages are identical for all instructions
    - The same microoperations are performed for all instructions
    - These microoperations implement portions of the architectural operation
  - For the MIPS, the remaining portions of the architectural operation are performed in the EX, MEM and WB stages

The MIPS CPU state diagram
- Architectural operations of I-format instructions among the integer instructions
  - Architectural operations of Load/Store instructions
    - LD Rt, Disp(Rs)    Rt ← M[Rs + Disp+]
    - SD Rt, Disp(Rs)    M[Rs + Disp+] ← Rt
  - Superscript + indicates sign extension

I format
Field   Opcode  Rs  Rt  Displacement/Offset/Immediate
Bits    6       5   5   16
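
The two load/store operations can be modeled in a few lines. This is a sketch under simplifying assumptions: memory is a Python dict holding one 64-bit value per address, alignment is ignored, and all names are made up for the example.

```python
MASK64 = (1 << 64) - 1

def sext16(value):
    """Disp+ as a signed Python integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def ld(gpr, mem, rt, disp, rs):
    """LD Rt, Disp(Rs) : Rt <- M[Rs + Disp+]"""
    gpr[rt] = mem[(gpr[rs] + sext16(disp)) & MASK64]

def sd(gpr, mem, rt, disp, rs):
    """SD Rt, Disp(Rs) : M[Rs + Disp+] <- Rt"""
    mem[(gpr[rs] + sext16(disp)) & MASK64] = gpr[rt]

gpr = [0] * 32
mem = {}
gpr[8] = 1000
gpr[11] = 42
sd(gpr, mem, rt=11, disp=0xFFF8, rs=8)  # Disp+ is -8, so the address is 992
ld(gpr, mem, rt=1, disp=0xFFF8, rs=8)   # reads the same location back into R1
print(mem, gpr[1])
```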

The MIPS CPU state diagram
- Architectural operations of I-format instructions among the integer instructions
  - Arithmetic/Logic instructions
    - DADDI Rt, Rs, Imm+    Rt ← Rs + Imm+
    - ANDI Rt, Rs, Imm+     Rt ← Rs ∧ Imm+
    - ORI Rt, Rs, Imm+      Rt ← Rs ∨ Imm+
    - XORI Rt, Rs, Imm+     Rt ← Rs ⊕ Imm+
    - SLTI Rt, Rs, Imm+     If Rs < Imm+ then Rt ← 1 else Rt ← 0

I format
Field   Opcode  Rs  Rt  Displacement/Offset/Immediate
Bits    6       5   5   16
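
The I-format arithmetic/logic operations can be written as small functions on 64-bit values. The sketch follows the convention adopted in these slides that the immediate is sign extended for all five instructions. It assumes SLTI compares signed values and it ignores overflow detection for DADDI; the function names are made up for the example.

```python
MASK64 = (1 << 64) - 1

def sext16(value):
    """Imm+ as a 64-bit bit pattern."""
    value &= 0xFFFF
    return (value | (MASK64 ^ 0xFFFF)) if value & 0x8000 else value

def signed64(x):
    """Interpret a 64-bit pattern as a signed integer."""
    return x - (1 << 64) if x >> 63 else x

def daddi(rs, imm):
    return (rs + sext16(imm)) & MASK64   # Rt <- Rs + Imm+ (wraps to 64 bits)

def andi(rs, imm):
    return rs & sext16(imm)              # Rt <- Rs AND Imm+

def ori(rs, imm):
    return rs | sext16(imm)              # Rt <- Rs OR Imm+

def xori(rs, imm):
    return rs ^ sext16(imm)              # Rt <- Rs XOR Imm+

def slti(rs, imm):
    return 1 if signed64(rs) < signed64(sext16(imm)) else 0

print(daddi(10, 0xFFFF), andi(0xFF, 0xFF0F), slti(5, 0xFFFF))
```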

The MIPS CPU state diagram
- Architectural operations of I-format instructions among the integer instructions
  - Branch instructions
    - BEQZ Rs, Offset    If Rs = 0, then PC ← PC + (4 x Offset+)
    - BNEZ Rs, Offset    If Rs ≠ 0, then PC ← PC + (4 x Offset+)

I format
Field   Opcode  Rs  Rt  Displacement/Offset/Immediate
Bits    6       5   5   16
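
The branch operations can be sketched in the same way. Here `pc` stands for whatever value PC holds when the branch completes (in this design the IF cycle has already added 4 to it); the function names are assumptions for the example.

```python
MASK64 = (1 << 64) - 1

def sext16(value):
    """Offset+ as a signed Python integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def beqz(pc, rs_value, offset):
    """If Rs = 0 then PC <- PC + (4 x Offset+), otherwise PC is unchanged."""
    return (pc + 4 * sext16(offset)) & MASK64 if rs_value == 0 else pc

def bnez(pc, rs_value, offset):
    """If Rs != 0 then PC <- PC + (4 x Offset+), otherwise PC is unchanged."""
    return (pc + 4 * sext16(offset)) & MASK64 if rs_value != 0 else pc

# An offset of -7 (0xFFF9 as a 16-bit field) moves PC back by 28 bytes when taken.
print(bnez(100, 1, 0xFFF9), beqz(100, 1, 0xFFF9))
```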

The MIPS CPU state diagram
- Architectural operations of R-format instructions among the integer instructions
  - Arithmetic/Logic instructions
    - DADD Rd, Rs, Rt    Rd ← Rs + Rt
    - DSUB Rd, Rs, Rt    Rd ← Rs - Rt
    - AND Rd, Rs, Rt     Rd ← Rs ∧ Rt
    - OR Rd, Rs, Rt      Rd ← Rs ∨ Rt
    - XOR Rd, Rs, Rt     Rd ← Rs ⊕ Rt
    - SLT Rd, Rs, Rt     If Rs < Rt then Rd ← 1 else Rd ← 0

R format
Field   Opcode  Rs  Rt  Rd  Shamt  Function
Bits    6       5   5   5   5      6

The MIPS CPU state diagram
- J-format instructions are not executed by the CPU we are designing
  - However, one can incorporate them into the CPU design after the design for the R-format and I-format instructions is completed

J format
Field   Opcode  Offset26
Bits    6       26

(The 5-bit Rs and Rt field positions fall inside Offset26)

The MIPS CPU state diagram
- The major cycles of the MIPS CPU are shown by the state diagram given in the MIPS CPU handout
  - Registers A and B are used to prepare operands for an ALU operation
  - Each state takes 1 clock period
    - Later, we will change it to one or more clock periods
      - Memory accesses and complex arithmetic operations will take more than one clock period to perform
      - The state that has a memory access or a complex arithmetic operation will take more than one clock period
  - All microoperations mentioned in a state are performed in parallel, so their order does not matter
    - If a state takes more than one clock period, one has to be careful about the parallel operations
- We now obtain the state diagram and the datapath hardware of the MIPS CPU

The MIPS major cycles and states
- The instruction fetch cycle (IF stage)
  - It is performed for all the instructions
  - There are two microoperations performed
    - In general, all CPUs, regardless of their architecture, do these two microoperations
    - Read the machine language instruction pointed to by the program counter (PC) into the instruction register (IR)
    - Update the program counter so that it points at the instruction that follows the instruction being read from the memory
- From now on look at the MIPS CPU handout to follow the design

The MIPS major cycles and states
- The instruction fetch cycle (IF stage)
  - Read the machine language instruction pointed to by the program counter (PC) into the instruction register (IR)
    - IR ← M[PC]
    - Note the RTL notation : we use an equal sign (=) if the destination is a wire or a bus and an arrow sign (←) if the destination is a register, such as IR
  - As mentioned before, we will make a few design decisions on the memory hierarchy as we design the CPU : We will have an instruction cache which will hold only instructions
    - We will have Memory Port 1 to access the instruction cache
    - To access the instruction cache, Memory Port 1 has a 64-bit address bus, ADB1, a 32-bit data bus, MDB1, and at least a read control signal, Read1, to inform the instruction cache that we want to read now

The MIPS major cycles and states
- The instruction fetch cycle (IF stage)
  - Read the machine language instruction pointed to by the program counter (PC) into the instruction register (IR)
    - IR ← M[PC]
    - Then, the read of the instruction in terms of buses is as follows :
      - ADB1 = PC ; Read1 = 1 ; IR ← MDB1
    - Note again that the three microoperations implement the instruction read; they happen at the same time and their order does not matter
    - Note the RTL notation : we use an equal sign (=) if the destination is a wire or a bus, such as ADB1, and an arrow sign (←) if the destination is a register, such as IR

The MIPS major cycles and states
- The instruction fetch cycle (IF stage)
  - Update the program counter so that it points at the next instruction
    - PC ← PC + 4
    - Since an instruction is four bytes long, we need to add 4 to PC
    - We can use the general ALU in the EX stage to do the addition, at the expense of increasing the complexity of the ALU input logic : PC must be connected to MUX2, a 4 must be connected to MUX3 and the output of the ALU must be connected to MUX1
    - The alternative is to have a simple 32-bit integer adder in the IF stage

The MIPS major cycles and states
- The instruction fetch cycle (IF stage)
  - Update the program counter so that it points at the next instruction
    - PC ← PC + 4
    - We choose the second alternative since we will need to have an adder in the IF stage when the CPU is pipelined
    - The MUX1 select input is controlled by the Sel circuit in the EX stage
      - The Sel circuit is in turn controlled by the Cond flip-flop and the control unit
      - In the IF major cycle, the control unit instructs the Sel circuit to generate a SelectMUX1 value so that the output of the adder in IF is transferred to PC

The instruction fetch cycle (IF stage)
- The two microoperations of the IF cycle can be shown in state 0 as follows
  - State 0 (IF) : IR ← M[PC] ; PC ← PC + 4 ;
  - The two microoperations are simply shown without using buses to save space
  - The instruction cache read and PC update microoperations happen simultaneously and complete before the end of the clock period
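
A small software model can make the "simultaneous" point concrete: both microoperations of state 0 read the PC value that was present at the start of the clock period. This is only a sketch; the instruction cache is modeled as a Python dict and the instruction words are made up.

```python
def state0_if(pc, icache):
    """One pass through state 0; both microoperations use the old PC value."""
    ir = icache[pc]       # IR <- M[PC]
    next_pc = pc + 4      # PC <- PC + 4
    return ir, next_pc

icache = {0: 0xDEAD0001, 4: 0xDEAD0002}  # made-up instruction words
ir, pc = state0_if(0, icache)
print(hex(ir), pc)
```

Because neither result depends on the other, the order of the two lines inside the function does not matter, just as the order of the microoperations does not matter in the state.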

The instruction fetch cycle (IF stage)
- If the instruction cache happens to take more than one clock period, then we stay in this state and update PC in the last clock period of the memory access so that the address to the instruction cache, the PC value, is not changed
- During this state the state register in the control unit is 0, indicating we are in state 0
  - The self-directed arrow in state 0 indicates waiting for the slow cache for more than 1 clock period
- Also during this clock period the control unit determines the next state as state 1
  - It is the instruction decode cycle, the ID cycle
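
The waiting behavior of state 0 can be sketched as a loop that holds PC steady until the final clock period of the cache access. The function name, the `access_clocks` parameter and the trace format are assumptions made for this illustration; they are not part of the handout.

```python
def run_state0(pc, icache, access_clocks):
    """Stay in state 0 for access_clocks periods; update PC only in the last one."""
    state, ir, trace = 0, None, []
    for clock in range(1, access_clocks + 1):
        trace.append((clock, state, pc))  # address the cache sees this period
        if clock == access_clocks:        # last period of the memory access
            ir = icache[pc]               # IR <- M[PC] completes
            pc = pc + 4                   # PC <- PC + 4
            state = 1                     # next state is state 1 (ID)
    return state, ir, pc, trace

state, ir, pc, trace = run_state0(8, {8: 0x1234}, access_clocks=3)
print(state, hex(ir), pc, trace)
```

The trace shows the same PC value on the address bus in every period of the access, which is the reason for delaying the PC update.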
Haldun Hadimioglu
MIPS Versions 0 & 1 58CS 6143
The MIPS major cycles and states The instruction decode cycle (ID stage)
The most important goal in this cycle is to decode the instruction
Decoding the instruction means the CPU determines what the current instruction is
It is performed for all the instructions regardless of their architecture
Decoding is done by the control unit that checks the opcode and function bits of IR
They are input as status signals to the control unit
During this time the datapath does not do anything
The MIPS major cycles and states
The instruction decode cycle (ID stage)
Instead of doing nothing in the datapath, we decide to perform three microoperations in order to be prepared
Transfer the GPR register pointed to by the Rs field of I-format and R-format instructions to register A
Transfer the GPR register pointed to by the Rt field of I-format and R-format instructions to register B
Transfer the DOImm field of IR to register Imm after sign extension
By doing these in advance, we save time
But not all instructions need them : J-format instructions do not need them and some of the I-format instructions do not need the transfer to register B
This is fine since the A, B and Imm registers are not architectural registers and so changing them will not result in program errors
These three microoperations are performed for all the instructions
In general, RISC CPUs do these three microoperations
The MIPS major cycles and states
The instruction decode cycle (ID stage)
A ← GPR[Rs] ; B ← GPR[Rt] ; Imm ← DOImm+
The GPR register file is designed so that two GPRs can be read simultaneously, by using the Rs and Rt fields of IR
This means the GPR register file has two read ports controlled by Rs and Rt
Note that the order of these microoperations does not matter as they happen simultaneously
There is also a write port to the GPR register file controlled by the Rt and Rd fields : 10 bits are connected to the GPR file to determine the destination register
A simple Sign Extend circuit attaches 48 zeros or 48 ones to the 16-bit DOImm field of IR, depending on its sign bit, and the result is stored in register Imm
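The Sign Extend step described above can be sketched in a few lines of Python. This is only an illustration, assuming a 16-bit DOImm field and 64-bit registers (16 + 48 bits); the function name `sign_extend_16_to_64` is hypothetical and does not come from the handout.

```python
def sign_extend_16_to_64(field: int) -> int:
    """Return the 64-bit pattern obtained by sign extending a 16-bit field."""
    field &= 0xFFFF                       # keep only the 16 DOImm bits
    if field & 0x8000:                    # sign bit is 1: attach 48 ones
        return field | (0xFFFFFFFFFFFF << 16)
    return field                          # sign bit is 0: attach 48 zeros


# A positive and a negative 16-bit immediate
print(hex(sign_extend_16_to_64(0x0018)))
print(hex(sign_extend_16_to_64(0xFFFC)))
```

The hardware does this with wiring only: the sign bit is simply copied to the upper 48 bit positions.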
The instruction decode cycle (ID stage)
The three microoperations of the ID cycle can be shown in state 1 as follows
The GPR read ports are directly connected to registers A and B and so no buses are used
The Sign Extend circuit is directly connected to register Imm
The three microoperations happen simultaneously and complete before the end of the clock period
State 1 (ID): A ← GPR[Rs] ; B ← GPR[Rt] ; Imm ← DOImm+
The instruction decode cycle (ID stage)
During this clock period the state register in the control unit is 1, indicating we are in state 1
Also during this clock period the control unit determines what the next state will be based on the type of the instruction
If it is a memory reference instruction (LD, SD), the next state is state 2 in the EX cycle
If it is an R-format A/L instruction, the next state is state 6 in the EX cycle
If it is an I-format A/L instruction, the next state is state 9 in the EX cycle
If it is a branch instruction, the next state is state 12 in the EX cycle
Completing the execution of LD and SD
The LD instruction
LD Rt, Disp(Rs)    Rt ← M[Rs + Disp+]
We see that to execute the LD we need to
1) Calculate the effective address, the address of the memory location we want to load from : Rs + Disp+
2) Read the cache memory location pointed to by the effective address
3) Transfer the value to GPR register Rt
The SD instruction
SD Rt, Disp(Rs)    M[Rs + Disp+] ← Rt
We see that to execute the SD we need to
1) Calculate the effective address, the address of the memory location we want to store to : Rs + Disp+
2) Write to the cache memory location pointed to by the effective address
Transfer the value from GPR register Rt to the memory location pointed to by the effective address
Completing the execution of LD and SD
LD and SD both have a microoperation in common : calculating the effective address
Then their microoperations differ
In order to calculate the effective address, we need to sign extend the DOImm field
This has already been done in the ID stage
We save time !
We also realize that GPR register Rt has been transferred to register B
Register B will then be written to the memory for the SD instruction
LD requires one more microoperation than SD, as we will soon see
Completing the execution of LD and SD
We decide to have the effective address calculation of LD and SD in the Execution/Effective address cycle
The effective address is stored in a microarchitectural register called ALUoutput1
Then, we separate LD and SD execution in the Memory Access/Branch completion cycle : Both access the memory
LD reads the memory location pointed by the effective address to a microarchitectural register called LMD
SD writes microarchitectural register B to a memory location pointed by the effective address and completes its execution
LD completes its execution by transferring the data in LMD to GPR register Rt
Completing the execution of LD and SD
The effective address calculation
Rs + Disp+
Rs is now in register A
Sign extended DOImm is in register Imm
As we will see shortly, A/L instructions will have their arithmetic/logic operation performed in this cycle as well
They need the ALU in this cycle, in this stage
Therefore, we decide to use the adder of the ALU to do the addition for the effective address
ALUoutput1 ← A + Imm
Completing the execution of LD and SD
We make another decision on the memory hierarchy : data accesses will be made to another cache, the Data cache, with its own address and data buses and control signals
Reading from the data cache
ADB2 = ALUoutput1 ; Read2 = 1 ; LMD ← MDB3
Note that the microoperations are performed in parallel and the order does not matter
This microoperation can be stated without giving the bus detail
LMD ← M[ALUoutput1]
Completing the execution of LD and SD
Note that the cache access can take more than one clock period and so we may stay in this state more than one clock period
The LD instruction completes by transferring LMD to GPR register Rt
GPR[Rt] ← LMD
The Rt field of IR is used by the GPR register file to select the register that is written with the value from LMD
We then go back to state 0, the IF cycle, to start executing the next instruction
Completing the execution of LD and SD
Storing to the data memory
ADB2 = ALUoutput1 ; Write2 = 1 ; MDB2 = B
Note that the microoperations are performed in parallel and the order does not matter
This microoperation can be stated without giving the bus detail
M[ALUoutput1] ← B
Note that the cache access can take more than one clock period and so we may stay in this state more than one clock period
SD completes its execution !
We then go back to state 0, the IF cycle, to start executing the next instruction
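Completing the execution of LD and SD
The portion of the state diagram for LD and SD
From the ID cycle (LD, SD)
State 2 (EX): ALUoutput1 ← A + Imm
State 3 (MEM, LD): LMD ← M[ALUoutput1]
State 4 (WB, LD): GPR[Rt] ← LMD
State 5 (MEM, SD): M[ALUoutput1] ← B
States 4 and 5 go to connector a (back to state 0)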
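The LD and SD states above can be traced with a small Python sketch. It is only an illustration: the register file and the memory are modeled as plain dictionaries, the function names `run_ld` and `run_sd` are hypothetical, and the value 0x3E placed in R2 is an assumed example value.

```python
def run_ld(gpr, mem, rs, rt, imm):
    """Trace LD through states 1, 2, 3 and 4."""
    a = gpr[rs]                 # state 1 (ID):  A <- GPR[Rs]
    alu_output1 = a + imm       # state 2 (EX):  ALUoutput1 <- A + Imm
    lmd = mem[alu_output1]      # state 3 (MEM): LMD <- M[ALUoutput1]
    gpr[rt] = lmd               # state 4 (WB):  GPR[Rt] <- LMD


def run_sd(gpr, mem, rs, rt, imm):
    """Trace SD through states 1, 2 and 5."""
    a, b = gpr[rs], gpr[rt]     # state 1 (ID):  A <- GPR[Rs] ; B <- GPR[Rt]
    alu_output1 = a + imm       # state 2 (EX):  ALUoutput1 <- A + Imm
    mem[alu_output1] = b        # state 5 (MEM): M[ALUoutput1] <- B


gpr = {0: 0, 1: 0, 2: 0x3E}     # assumed register contents
mem = {0x150: 0xC}              # assumed memory contents
run_ld(gpr, mem, rs=0, rt=1, imm=0x150)   # LD R1, 150(R0)
run_sd(gpr, mem, rs=0, rt=2, imm=0x200)   # SD R2, 200(R0)
print(hex(gpr[1]), hex(mem[0x200]))
```

LD needs the extra write-back state because the loaded value must still travel from LMD to the register file, while SD is finished as soon as the memory write completes.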
Completing the execution of I-format A/L instructions
The I-format A/L instructions
DADDI Rt, Rs, Imm+    Rt ← Rs + Imm+
ANDI  Rt, Rs, Imm+    Rt ← Rs ∧ Imm+
ORI   Rt, Rs, Imm+    Rt ← Rs ∨ Imm+
XORI  Rt, Rs, Imm+    Rt ← Rs ⊕ Imm+
SLTI  Rt, Rs, Imm+    If Rs < Imm+ then Rt ← 1 else Rt ← 0
To execute these instructions we need to perform an operation specified by the Opcode field of IR
Then we transfer the result to GPR register Rt
Completing the execution of I-format A/L instructions
We see that we can perform the required operations for the I-format instructions in one state
Which one to perform would be determined by the Opcode field
The inputs are Rs and sign extended DOImm
Rs is already transferred to register A and sign extended DOImm is already transferred to register Imm
We see we save time by moving them in the ID stage !
Completing the execution of I-format A/L instructions
We see that we can perform the required operations for the I-format instructions in one state
The result would be stored in the microarchitectural register ALUoutput1
We could instead store the result of the operation directly in GPR register Rt
This would require a separate bus from the output of the ALU to the write port of the GPR file
We decide to store to ALUoutput1 and then transfer from ALUoutput to the GPR write port
This decision will help pipelining as we will see later !
Completing the execution of I-format A/L instructions
The microoperation for the current I-format A/L operation
ALUoutput1 ← A op Imm
The meaning of “op” is that the type of the operation is indicated by the Opcode field of IR
What happens is that the control unit uses the Opcode field to generate a set of control signals
These control signals are connected to the ALU, telling which operation to perform
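The way the Opcode selects the ALU operation can be sketched as a lookup table in Python. This is a sketch under stated assumptions: registers are taken to be 64 bits wide, Imm is taken to be already sign extended as the slides describe, and the names `ALU_OPS`, `to_signed` and `execute_i_format` are hypothetical.

```python
MASK64 = (1 << 64) - 1


def to_signed(x: int) -> int:
    """Interpret a 64-bit pattern as a two's complement number."""
    return x - (1 << 64) if x & (1 << 63) else x


# One entry per I-format A/L opcode: the "op" in ALUoutput1 <- A op Imm
ALU_OPS = {
    "DADDI": lambda a, imm: (a + imm) & MASK64,
    "ANDI": lambda a, imm: a & imm,
    "ORI": lambda a, imm: a | imm,
    "XORI": lambda a, imm: a ^ imm,
    "SLTI": lambda a, imm: 1 if to_signed(a) < to_signed(imm) else 0,
}


def execute_i_format(opcode: str, a: int, imm: int) -> int:
    """Return the value that would be latched into ALUoutput1."""
    return ALU_OPS[opcode](a, imm)


print(hex(execute_i_format("DADDI", 0xC, 0x18)))
```

In hardware the table lookup corresponds to the control unit decoding the Opcode into ALU control signals; the ALU itself computes only the selected operation.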
Completing the execution of I-format A/L instructions
The result that is in ALUoutput1 is moved to another microarchitectural register, ALUoutput2
ALUoutput2 ← ALUoutput1
This decision increases the CPIi of the I-format instructions by one clock period !
This decision will also help pipelining as we will see later !
The microoperation for the transfer of the result to GPR register Rt
GPR[Rt] ← ALUoutput2
The Rt field of IR is used by the GPR register file to select the register that is written with the value from ALUoutput2
We then go back to state 0, the IF cycle, to start executing the next instruction
Completing the execution of I-format A/L instructions
The portion of the state diagram for I-format A/L instructions
From the ID cycle (I-format A/L instructions)
State 9 (EX): ALUoutput1 ← A op Imm
State 10 (MEM): ALUoutput2 ← ALUoutput1
State 11 (WB): GPR[Rt] ← ALUoutput2
State 11 goes to connector a (back to state 0)
Completing the execution of R-format A/L instructions
The R-format A/L instructions
DADD Rd, Rs, Rt    Rd ← Rs + Rt
DSUB Rd, Rs, Rt    Rd ← Rs - Rt
AND  Rd, Rs, Rt    Rd ← Rs ∧ Rt
OR   Rd, Rs, Rt    Rd ← Rs ∨ Rt
XOR  Rd, Rs, Rt    Rd ← Rs ⊕ Rt
SLT  Rd, Rs, Rt    If Rs < Rt then Rd ← 1 else Rd ← 0
We see that to execute these instructions we need to perform an operation specified by the Opcode and Function fields
Then, we transfer the result to GPR register Rd
Completing the execution of R-format A/L instructions
We see that we can perform all the required operations for R-format instructions in one state
Which one to perform would be determined by the Opcode and Function fields
The inputs are Rs and Rt
Rs is already transferred to register A and Rt is already transferred to register B
We see we save time by moving them in the ID stage !
Completing the execution of R-format A/L instructions
We see that we can perform all the required operations for R-format instructions in one state
The result would be stored in the microarchitectural register ALUoutput1
We could instead store the result of the operation directly in GPR register Rd
This would require a separate bus from the output of the ALU to the write port of the GPR file
We decide to store to ALUoutput1 and then transfer from ALUoutput to the GPR write port
This decision will help pipelining as we will see later !
Completing the execution of R-format A/L instructions
The microoperation for the current R-format A/L operation
ALUoutput1 ← A func B
The meaning of “func” is that the type of the operation is indicated by the Opcode and Function fields of IR
What happens is that the control unit uses the Opcode and Function fields to generate a set of control signals
These control signals are connected to the ALU, telling which operation to perform
Completing the execution of R-format A/L instructions
The result that is in ALUoutput1 is moved to another microarchitectural register, ALUoutput2
ALUoutput2 ← ALUoutput1
This decision increases the CPIi of the R-format instructions by one clock period !
This decision will also help pipelining as we will see later !
The microoperation for the transfer of the result to GPR register Rd
GPR[Rd] ← ALUoutput2
The Rd field of IR is used by the GPR register file to select the register that is written with the value from ALUoutput2
We then go back to state 0, the IF cycle, to start executing the next instruction
Completing the execution of R-format A/L instructions
The portion of the state diagram for R-format A/L instructions
From the ID cycle (R-format A/L instructions)
State 6 (EX): ALUoutput1 ← A func B
State 7 (MEM): ALUoutput2 ← ALUoutput1
State 8 (WB): GPR[Rd] ← ALUoutput2
State 8 goes to connector a (back to state 0)
Completing the execution of Branch instructions
The BEQZ instruction
BEQZ Rs, Offset    If Rs = 0, then PC ← PC + (4 x Offset+)
The BNEZ instruction
BNEZ Rs, Offset    If Rs ≠ 0, then PC ← PC + (4 x Offset+)
Completing the execution of Branch instructions
We see that to execute these instructions we need to
1) Calculate the effective address, the address to branch to
Add PC to the result of the multiplication of the sign extended Offset by 4
2) Test if Rs is equal to or not equal to zero and store the result of the test in the Cond flip-flop (FF)
Testing Rs and calculating the effective address can be done at the same time
3) If the Cond FF is 1, transfer the effective address to PC
Completing the execution of Branch instructions
In order to calculate the effective address, we need to sign extend the DOImm field
This has already been done in the ID stage
We save time !
We then have to multiply it by 4
We also realize that GPR register Rs has been transferred to register A
Register A will be tested !
Completing the execution of Branch instructions
The effective address calculation
PC + (4 x Offset+)
Sign extended DOImm is in register Imm
We know that shifting a number to the left by two bit positions is multiplying it by four
We decide to use the adder of the ALU to do the addition
Before the addition the Imm value is shifted to the left by two bit positions in the ALU
ALUoutput1 ← PC + Imm * 4
ALUoutput1 ← PC + (Imm << 2)
Completing the execution of Branch instructions
Testing if Rs is equal to or not equal to zero and storing the result of the test in the Cond FF
The Zero circuit in the EX stage compares register A with zero
The result of Zero is stored in the Cond FF
Note that the Cond bit is initially set to 0 until a branch changes it
The opcode of the branch instruction executed is used by the Sel circuit to generate
1 if the condition is satisfied
0 if the condition is not satisfied
Completing the execution of Branch instructions
The control unit sends a control signal to Sel to indicate how to generate the output
For example, if A = 0 and it is a BEQZ instruction, Sel outputs 1 and 1 is stored in Cond
But, if A = 0 and it is a BNEZ instruction, Sel outputs 0 and 0 is stored in Cond
The Branchop is the combined effect of the test and Sel operations
Cond ← A Branchop 0
Note that the Sel circuit is also used in the IF cycle, where it generates the right value for MUX1 so that we transfer PC+4 to PC
Completing the execution of Branch instructions
Changing PC if the Cond FF is 1
This means we branch to a memory location
That is, we take the branch
If (Cond) PC ← ALUoutput1
Reset the Cond FF to 0 so that it can be used for another branch instruction
Cond ← 0
We then go back to state 0, the IF cycle, to start executing the next instruction
Completing the execution of Branch instructions
The portion of the state diagram for Branch instructions
From the ID cycle (Branch)
State 12 (EX): ALUoutput1 ← PC + Imm * 4 ; Cond ← A Branchop 0
State 13 (MEM): If (Cond) PC ← ALUoutput1 ; Cond ← 0
State 13 goes to connector a (back to state 0)
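The two branch states can be sketched in Python as follows. This is a minimal sketch, assuming that PC has already been updated to PC + 4 in the IF cycle before state 12 runs; the function names `branch_ex` and `branch_mem` are hypothetical.

```python
def branch_ex(opcode: str, pc: int, a: int, imm: int):
    """State 12 (EX): return (ALUoutput1, Cond)."""
    alu_output1 = pc + (imm << 2)          # ALUoutput1 <- PC + Imm * 4
    is_zero = (a == 0)                     # the Zero circuit tests register A
    # Sel turns the test into Cond according to the branch opcode
    cond = is_zero if opcode == "BEQZ" else not is_zero
    return alu_output1, int(cond)


def branch_mem(pc: int, alu_output1: int, cond: int):
    """State 13 (MEM): return (new PC, Cond after reset)."""
    new_pc = alu_output1 if cond else pc   # If (Cond) PC <- ALUoutput1
    return new_pc, 0                       # Cond <- 0


# A BEQZ with Offset 5 stored at address 0x110: PC is 0x114 after the IF cycle
target, cond = branch_ex("BEQZ", pc=0x114, a=0x3E, imm=5)
print(hex(target), cond)
print(hex(branch_mem(0x114, target, cond)[0]))
```

With a nonzero register A the BEQZ condition fails, so PC keeps the PC + 4 value and execution continues with the next sequential instruction.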
The complete state diagram
The state diagram for integer instructions and the datapath are given in the MIPS CPU handout
They will be modified to implement a pipelined MIPS CPU
But, the overall CPU structure will be similar
CPIi of Integer Instructions
With this implementation, the CPIi of the instructions can be calculated as
CPI_LD = 5 because we trace states 0, 1, 2, 3, 4
CPI_SD = 4 because we trace states 0, 1, 2, 5
CPI_A/L = 5 because we trace states
0, 1, 6, 7, 8 if R-format
0, 1, 9, 10, 11 if I-format
CPI_Branch = 4 because we trace states 0, 1, 12, 13
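The CPIi values above follow directly from counting the states each instruction type visits. A short Python sketch, assuming every state takes exactly one clock period (no slow cache, no miss); the table name `STATE_TRACES` is hypothetical.

```python
# States traced by each instruction type, taken from the state diagram
STATE_TRACES = {
    "LD": [0, 1, 2, 3, 4],
    "SD": [0, 1, 2, 5],
    "R-format A/L": [0, 1, 6, 7, 8],
    "I-format A/L": [0, 1, 9, 10, 11],
    "Branch": [0, 1, 12, 13],
}


def cpi(kind: str) -> int:
    """One clock period per visited state when every state takes one period."""
    return len(STATE_TRACES[kind])


for kind in STATE_TRACES:
    print(kind, cpi(kind))
```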
Control Signals
The semantics of each state is that a microoperation is implemented by the control unit turning on and off a few MUX select inputs, register clock inputs, ALU control inputs and enable control signals
They are connected to MUXes, registers, ALUs and tri-state buffers (TRBs)
They are shown as angled signals in the handout
Depending on the type of chips used, tri-state chips and/or additional MUXes will be used in the datapath, for example, for the usage of the constants
The Clock Signal
The clock period duration is determined by the slowest important microoperation in the CPU
All the signal delays in the datapath and control unit are added up to calculate the time for this important operation
It is usually the integer add microoperation
Though it could be the cache access time if it was a little longer than the integer addition time
Usually, the cache is slower than the CPU in commercial systems now and so we do not consider it when we calculate the clock period duration
The loop-back lines drawn for states 0, 3 and 5 indicate that the CPU would spend more than one clock period in these states if the cache memory takes more than one clock period for the access
That is, we assume the integer addition takes one clock period !
For high-performance systems this is not the case though !
Clock Signal
The clock period duration is determined by the addition of all the delays in the control unit and the delays in the datapath for the integer add microoperation
The delays in the control unit include the delays to generate the MUX select, register clock input, ALU control and enable control signals
Gate networks generate these select and clock control signals if hardwiring is used
The micromemory and additional circuits generate these select and clock control signals if microprogramming is used
The delays in the datapath include
Delay of data travel from registers to the ALU inputs
Delay of the adder in the ALU
Delay of the data travel from the ALU to the destination register in the datapath : ALUoutput1
Architecture-Microarchitecture Interaction
An example of how architectural decisions can affect the microarchitecture design is the following
The destination register fields of I-format and R-format instructions, Rt and Rd, are not in the same position
Therefore, we need to use two separate states to transfer the result of an A/L operation from ALUoutput to a destination GPR register : states 8 and 11
I-format : Opcode (6 bits) | Rs (5) | Rt (5) | Displacement/Offset/Immediate (16)
R-format : Opcode (6 bits) | Rs (5) | Rt (5) | Rd (5) | Shamt (5) | Function (6)
Updating PC
The execution sequence in the textbook is not clear since it updates PC for all instructions in MEM and in MEM it updates PC again for branch instructions if the condition is true
To eliminate the confusion, we remove the NPC register which is redundant
But, we will use the NPC register when we implement pipelining
Using the state diagram
Consider the following piece of program in the MIPS memory
---
100  LD    R1, 150(R0)   ; R1 <-- M[R0 + 150+] ; M[150] has C
104  DADDI R2, R1, #18   ; R2 <-- R1 + 18+ where 18 is in Hex
108  DADD  R2, R2, R3    ; R2 <-- R2 + R3 ; R3 has 1A
10C  SD    R2, 200(R0)   ; M[R0 + 200+] <-- R2
110  BEQZ  R2, 5         ; If R2 is equal to 0, branch to address 128
---
150  C                   ; The content of this location is C
---
200  ?
When the program is run we execute instructions in 100, 104, 108, 10C and 110
When the program is run we access data in 150 (a read) and 200 (a write)
Using the state diagram
If the cache memory is not slow (takes one clock period per access) and there is no miss, then this piece of program will take 23 clock periods, as the table below shows the execution of the program with respect to time
See the MIPS CPU handout for timing
                          IF   ID   EX   MEM   WB
100  LD    R1, 150(R0)     1    2    3     4    5
104  DADDI R2, R1, #18     6    7    8     9   10
108  DADD  R2, R2, R3     11   12   13    14   15
10C  SD    R2, 200(R0)    16   17   18    19
110  BEQZ  R2, 5          20   21   22    23
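Using the state diagram
If the clock frequency is 1 GHz
$$\mathrm{CPI}_{ave} = \frac{\text{Number of clock cycles for the program}}{\text{Number of instructions run}} = \frac{23}{5} = 4.6$$
$$\text{Clock period} = \frac{1}{\text{Clock frequency}} = \frac{1}{10^{9}}\ \text{second} = 10^{-9}\ \text{second} = 1\ \text{ns}$$
$$\text{CPUtime} = \text{Number of clock periods for program} \times \text{Clock period} = 23 \times 1\ \text{ns} = 23\ \text{ns}$$
$$\mathrm{MIPS}_{ave} = \frac{\text{Number of instructions run}}{\text{CPUtime} \times 10^{6}} = \frac{5}{23 \times 10^{-9} \times 10^{6}} = 217$$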
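The same arithmetic can be checked with a few lines of Python. This is only a sketch; the variable names are hypothetical, and the inputs (23 clock periods, 5 instructions, 1 GHz) are the ones used on the slide.

```python
clock_periods = 23          # clock periods for the five-instruction program
instructions = 5            # number of instructions run
clock_hz = 1e9              # 1 GHz clock frequency

cpi_ave = clock_periods / instructions          # average clocks per instruction
clock_period_s = 1 / clock_hz                   # seconds per clock period
cpu_time_s = clock_periods * clock_period_s     # CPU time in seconds
mips_ave = instructions / (cpu_time_s * 1e6)    # millions of instructions per second

print(f"CPIave   = {cpi_ave:.1f}")
print(f"CPUtime  = {cpu_time_s * 1e9:.0f} ns")
print(f"MIPSave  = {mips_ave:.0f}")
```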
We assumed the instruction and data cache memories take one clock period each
What if they took two clock periods each ?
LD would take 7 clock periods since we trace states 0, 0, 1, 2, 3, 3, 4
States 0 and 3 are repeated twice since the cache memories take two clock periods each
SD would take 6 clock periods since we trace states 0, 0, 1, 2, 5, 5
States 0 and 5 are repeated twice since the cache memories take two clock periods each
DADD would take 6 clock periods since we trace states 0, 0, 1, 6, 7, 8
State 0 is repeated twice since the cache memory takes two clock periods
BEQZ would take 5 clock periods since we trace states 0, 0, 1, 12, 13
State 0 is repeated twice since the cache memory takes two clock periods
Using the state diagram
If the cache memories are slow (they take two clock periods per access) and there is no miss, then this piece of program will take 30 clock periods, as the table below shows the execution of the program with respect to time
See the MIPS CPU handout for timing
                          IF      ID   EX   MEM     WB
100  LD    R1, 150(R0)    1-2      3    4   5-6      7
104  DADDI R2, R1, #18    8-9     10   11   12      13
108  DADD  R2, R2, R3     14-15   16   17   18      19
10C  SD    R2, 200(R0)    20-21   22   23   24-25
110  BEQZ  R2, 5          26-27   28   29   30
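The totals for fast and slow caches can be reproduced by charging extra clock periods only to the states that access a cache. A Python sketch, assuming that states 0, 3 and 5 are the only cache-access states (the ones with loop-back lines); the names `CACHE_STATES`, `PROGRAM` and `clock_periods` are hypothetical.

```python
CACHE_STATES = {0, 3, 5}    # IF, LD data read, SD data write

# State traces of the five instructions of the example program
PROGRAM = [
    ("LD", [0, 1, 2, 3, 4]),
    ("DADDI", [0, 1, 9, 10, 11]),
    ("DADD", [0, 1, 6, 7, 8]),
    ("SD", [0, 1, 2, 5]),
    ("BEQZ", [0, 1, 12, 13]),
]


def clock_periods(trace, cache_access_periods=1):
    """Clock periods for one instruction; cache states repeat while waiting."""
    return sum(cache_access_periods if s in CACHE_STATES else 1 for s in trace)


for access in (1, 2):
    total = sum(clock_periods(trace, access) for _, trace in PROGRAM)
    print(f"cache access = {access} clock period(s): total = {total}")
```

Only the cache-access states stretch, which is why A/L and branch instructions gain one clock period each while LD and SD gain two.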
We have so far assumed that the cache memories do not have misses !
What if both instruction and data cache memories result in cache misses ?
That is, there is a cold start !
What is the new execution time ?
To calculate the new execution time we have to study the structure of the cache memories
The size of the physical (main) memory, the size of the cache memories, the size of cache blocks, the type of mapping (direct, associative, block-set associative), the block replacement strategy, etc.
What if both instruction and data cache memories result in cache misses ?
For this semester
We will concentrate on Level 1 cache memories, i.e. instruction and data cache memories
We will assume that there is no Level 2 cache memory miss !
We will assume that all the addresses shown are physical addresses unless otherwise specified
For this presentation assume that
The physical (main) memory has 256 MBytes
The physical memory has 8 Bytes per location
The bus width between the physical memory and lowest level cache is 8 Bytes
The instruction cache is 8 KBytes
The data cache is 16 KBytes
Both cache block sizes are 32 Bytes
Both cache memories use direct mapping
Both caches use write-back with write-allocate
Both cache memories access the needed item first
The physical memory latency is 4 clock periods and transferring an 8-Byte content is one clock period each
Instruction and data cache misses ?
The physical memory has 256 MBytes or 2^28 Bytes
The physical address is 28 bits long
The physical memory has 2^28/32 = 2^28/2^5 = 2^23 blocks
The instruction cache has 8KB/32 = 2^13/2^5 = 2^8 = 256 blocks
The data cache has 16KB/32 = 2^14/2^5 = 2^9 = 512 blocks
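These block counts can be recomputed from the sizes assumed for this presentation. A short Python sketch; the constant names and the helper `log2_exact` are hypothetical.

```python
def log2_exact(n: int) -> int:
    """Return log2(n) for a power of two, failing loudly otherwise."""
    assert n > 0 and n & (n - 1) == 0, "expected a power of two"
    return n.bit_length() - 1


MEMORY_BYTES = 256 * 2**20      # 256 MBytes of physical memory
BLOCK_BYTES = 32                # cache block size
ICACHE_BYTES = 8 * 2**10        # 8 KByte instruction cache
DCACHE_BYTES = 16 * 2**10       # 16 KByte data cache

address_bits = log2_exact(MEMORY_BYTES)
memory_blocks = MEMORY_BYTES // BLOCK_BYTES
icache_blocks = ICACHE_BYTES // BLOCK_BYTES
dcache_blocks = DCACHE_BYTES // BLOCK_BYTES

print(address_bits, log2_exact(memory_blocks), icache_blocks, dcache_blocks)
```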
Instruction and data cache misses ?
The physical address is used by the physical memory and instruction cache as follows
Address tag (15 bits) | Instruction cache block # (8 bits) | Byte offset (5 bits)
Main memory block number (23 bits) = Address tag and Instruction cache block #
The physical address is used by the physical memory and data cache as follows
Address tag (14 bits) | Data cache block # (9 bits) | Byte offset (5 bits)
Main memory block number (23 bits) = Address tag and Data cache block #
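The address fields for the direct-mapped caches can be extracted with shifts and masks. A Python sketch, assuming the 5-bit byte offset and the 8-bit and 9-bit block number fields of this presentation; the function name `split_address` is hypothetical.

```python
def split_address(addr: int, index_bits: int, offset_bits: int = 5):
    """Split a physical address into (address tag, cache block #, byte offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    block = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, block, offset


# Instruction at 100 (hex) in the instruction cache: 8-bit block number
print(split_address(0x100, index_bits=8))
# Data at 150 and 200 (hex) in the data cache: 9-bit block number
print(split_address(0x150, index_bits=9))
print(split_address(0x200, index_bits=9))
```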
Instruction and data cache misses ?
The instruction cache has 32-Byte blocks
Each block contains 8 instructions since each instruction is 4 Bytes long
Instructions in physical memory locations 100 through 110 are in one instruction cache block
Instructions in 100, 104, 108, 10C, 110, 114, 118 and 11C are in one instruction cache block !
Block layout (each physical memory location is 8 bytes and holds two 4-byte instructions)
00000100    LD R1, 150(R0)     DADDI R2, R1, #18
00000108    DADD R2, R2, R3    SD R2, 200(R0)
00000110    BEQZ R2, 5
00000118
Which instruction cache block is this ?
Instruction and data cache misses ?
The instruction cache has 32-Byte blocks
Instructions in physical memory locations 100 through 110 are in instruction cache memory block number 8
0000100    LD R1, 150(R0)
Hex :    0    0    0    0    1    0    0
Binary : 0000 0000 0000 0000 0001 0000 0000
Address tag : the upper 15 bits
Instruction cache block # : the next 8 bits, 00001000, which is 8 in decimal
Byte offset : the lowest 5 bits. The LD instruction has 0 offset from the beginning of the block, i.e. it is the first instruction of the block
Instructions in 100, 104, 108, 10C, 110 are in instruction cache block 8 !
Instruction and data cache misses ?
How long does it take to access individual instructions ?
Both cache memories access the needed item first
The physical memory latency is 4 clock periods and transferring an 8-Byte content is one clock period each
M[100] is the needed item and accessed first !
Start access, then a latency of 4 clock periods
Transfer M[100] & M[104] : five clock periods !
Transfer M[108] & M[10C] : six clock periods !
Transfer M[110] & M[114] : seven clock periods !
Transfer M[118] & M[11C] : eight clock periods !
Block fill time = 8 clock periods
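The arrival time of each 8-Byte transfer during a block fill can be sketched in Python. This is a sketch under stated assumptions: a 4-clock latency, one clock period per 8-Byte transfer, the needed item sent first, and the remaining transfers wrapping around the block in address order; the function name `arrival_times` is hypothetical.

```python
def arrival_times(needed_addr: int, block_bytes: int = 32,
                  bus_bytes: int = 8, latency: int = 4) -> dict:
    """Map each 8-Byte location of the block to the clock period it arrives in."""
    block_start = needed_addr - needed_addr % block_bytes
    transfers = block_bytes // bus_bytes
    first = (needed_addr - block_start) // bus_bytes   # needed item goes first
    times = {}
    for i in range(transfers):
        location = block_start + ((first + i) % transfers) * bus_bytes
        times[location] = latency + 1 + i              # one transfer per period
    return times


for addr in (0x100, 0x150):
    print({hex(k): v for k, v in arrival_times(addr).items()})
```

The CPU can resume after the first transfer (five clock periods), while the whole block is filled after eight.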
Instruction and data cache misses ?
The data cache has 32-Byte blocks
Each block contains 4 data elements since each data element is 8 Bytes long
The data element in physical memory location 150 is in one data cache block
Data elements in 140, 148, 150, and 158 are in one data cache block !
Block layout (8 bytes per physical memory location)
00000140
00000148
00000150    C
00000158
Which data cache block is this ?
Instruction and data cache misses ?
The data cache has 32-Byte blocks
The data element in physical memory location 150 is in data cache block number 10
0000150    C
Hex :    0    0    0    0    1    5    0
Binary : 0000 0000 0000 0000 0001 0101 0000
Address tag : the upper 14 bits
Data cache block # : the next 9 bits, 000001010, which is 10 in decimal
Byte offset : the lowest 5 bits, 10000. The data element has a 16-Byte offset from the beginning of the block, i.e. it is the third data element of the block
Data element in 150 is in data cache block 10 !
Instruction and data cache misses ?
How long does it take to access an individual data element ?
Both cache memories access the needed item first
The physical memory latency is 4 clock periods and transferring an 8-Byte content is one clock period each
M[150] is the needed item and accessed first !
Start access, then a latency of 4 clock periods
Transfer M[150] : five clock periods !
Transfer M[158] : six clock periods !
Transfer M[140] : seven clock periods !
Transfer M[148] : eight clock periods !
Block fill time = 8 clock periods
Instruction and data cache misses ?
The data cache has 32-Byte blocks
Each block contains 4 data elements since each data element is 8 Bytes long
The data element in physical memory location 200 is in one data cache block
Data elements in 200, 208, 210, and 218 are in one data cache block !
Block layout (8 bytes per physical memory location)
00000200    ?
00000208
00000210
00000218
Which data cache block is this ?
Instruction and data cache misses ? The data cache has 32-Byte blocks
The data element in physical memory location 200 is in data cache block number 16
Address of ? : 0000200
Hex digits : 0 0 0 0 2 0 0
Binary : 0000 0000 0000 0000 0010 0000 0000
5 bits ! The byte offset is 5 bits long (00000). The data element has 0 offset from the beginning of the block, i.e. it is the first data element of the block
Data cache block # 16 since 000010000 is 16 in decimal
[Figure label: the address bits above the byte offset are marked "Address tag"]
Data element in 200 is in data cache block 16 !
Instruction and data cache misses ? How long does it take to access an individual data element ?
Both cache memories access the needed item first
The physical memory latency is 4 clock periods and transferring each 8-Byte content takes one clock period
[Timeline: Start access, Latency (4 clock periods), Transfer M[200], Transfer M[208], Transfer M[210], Transfer M[218]]
M[200] is the needed item and accessed first !
M[200] : Five clock periods !
M[208] : Six clock periods !
M[210] : Seven clock periods !
M[218] : Eight clock periods !
Instruction and data cache misses ? How long does it take to run the program with a cold start ?
This piece of program will take 35 clock periods as the table below shows the execution of the program with respect to time
See the MIPS CPU handout for timing
                          IF    ID   EX   MEM     WB
100  LD    R1, 150(R0)    1/5   6    7    8/12    13
104  DADDI R2, R1, #18    14    15   16   17      18
108  DADD  R2, R2, R3     19    20   21   22      23
10C  SD    R2, 200(R0)    24    25   26   27/31
110  BEQZ  R2, 5          32    33   34   35
When the program is run we execute instructions in 100, 104, 108, 10C and 110
When the program is run we access data in 150 (a read) and 200 (a write)
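The 35-clock-period total can be reproduced with a small sketch. It assumes what the table shows: a miss makes the needed item available after 5 clock periods (4 latency + 1 transfer), the misses are the first instruction fetch, the read of 150 and the write of 200, and every other access hits. The names `program` and `unpipelined_timing` are made up for this illustration.

```python
MISS_TIME = 5   # clock periods until the needed item arrives on a miss (4 latency + 1 transfer)

# (address, instruction, stages used, stages that miss); miss sets are read off the table above
program = [
    (0x100, "LD R1, 150(R0)",    ["IF", "ID", "EX", "MEM", "WB"], {"IF", "MEM"}),
    (0x104, "DADDI R2, R1, #18", ["IF", "ID", "EX", "MEM", "WB"], set()),
    (0x108, "DADD R2, R2, R3",   ["IF", "ID", "EX", "MEM", "WB"], set()),
    (0x10C, "SD R2, 200(R0)",    ["IF", "ID", "EX", "MEM"],       {"MEM"}),
    (0x110, "BEQZ R2, 5",        ["IF", "ID", "EX", "MEM"],       set()),
]

def unpipelined_timing(program):
    """Return the per-instruction timing cells and the total number of clock periods."""
    clock = 0
    rows = []
    for address, text, stages, misses in program:
        cells = []
        for stage in stages:
            start = clock + 1
            clock += MISS_TIME if stage in misses else 1
            # A multi-period stage is written start/end, as in the table.
            cells.append(f"{start}/{clock}" if clock > start else str(start))
        rows.append((address, text, cells))
    return rows, clock

rows, total = unpipelined_timing(program)
for address, text, cells in rows:
    print(f"{address:X} {text:<20}", " ".join(cells))
print("total clock periods:", total)
```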
Pipelining Pipelining increases the speed of a CPU
The CPU executes multiple instructions simultaneously
The unpipelined MIPS CPU has five stages that correspond to the five major cycles
For the unpipelined MIPS CPU, at any time only one stage is busy and all the others are idle
[Figure: instructions enter the datapath, pass through the IF, ID, EX, MEM and WB stages under the Control Unit, and leave at the other end]
What is Pipelining ? The unpipelined CPU works like this :
Only one instruction is in the CPU !
Clock period   Stage   Instruction
1              IF      LD R1, 150(R0)
2              ID      LD R1, 150(R0)
3              EX      LD R1, 150(R0)
4              MEM     LD R1, 150(R0)
5              WB      LD R1, 150(R0)
6              IF      DADDI R2, R1, #18
Continues this way…
What is Pipelining ? Pipelining is the simultaneous execution of
multiple instructions in an assembly line fashion in a single CPU
Clock period   IF                  ID                  EX                  MEM                 WB
1              LD R1, 150(R0)
2              DADDI R2, R1, #18   LD R1, 150(R0)
3              DADD R2, R2, R3     DADDI R2, R1, #18   LD R1, 150(R0)
4              SD R2, 200(R0)      DADD R2, R2, R3     DADDI R2, R1, #18   LD R1, 150(R0)
5              BEQZ R2, 5          SD R2, 200(R0)      DADD R2, R2, R3     DADDI R2, R1, #18   LD R1, 150(R0)
6              …
What is Pipelining ? Pipelining is a microarchitectural
technique where consecutive instructions are executed overlappingly
Each instruction is in a pipeline stage
All stages are busy
What is a Stage ? Each stage is specialized hardware corresponding to a specific major cycle
IF, ID, EX, MEM, WB
Recall how we defined a stage for the unpipelined CPU
The hardware for each major cycle can then be easily identified and is often named a stage
What is Pipelining ? Pipelined execution of instructions is similar to
the assembly line manufacturing of cars
What is Pipelining ? There are two differences
On a car assembly line there is only one type of car assembled
For the CPU, the instructions executed are different
Loads, Stores, A/L, Branch instructions
All the cars on an assembly line have the same requirements : the same pieces are placed on the cars
For the CPU, even if two back-to-back instructions are of the same type (for example two back-to-back Loads), they have different requirements (different effective addresses hence different memory locations are accessed)
What is Pipelining ? Because of these two differences, each
stage has to pass information related to the instruction it just worked on to the next stage
Additional temporary registers (latches, buffers) are placed between each pair of stages to pass the information about the instruction just leaving one stage and entering the next one
[Figure: the IF, ID, EX, MEM and WB stages in a row, with latches placed between each pair of adjacent stages]
What is Pipelining ? Latches are then necessary to pass
information about an instruction from one stage to the next
Latches are also needed so that partial work done by one stage is passed to the next stage so the work continues
What is the Pipe ? We give the name “pipe” to the set of stages since the stages are cascaded to each other in a single dimension forming a pipe where instructions
Enter from one end
Stay in a stage for one clock period
Proceed to the next stage
Finally exit from the other end
By which time the instruction execution is completed
What is Pipelining ? Consider a sequence of instructions and a
5-stage pipeline
Assume that all the instructions use the five stages
That is they all take five clock periods to complete their execution
This is not possible in real life but let’s assume this for the time being to understand pipelining quickly
[Figure: the instruction stream … I9 I8 I7 I6 I5 I4 I3 I2 I1 enters the five-stage pipeline IF ID EX MEM WB at one end and leaves at the other end]
What is Pipelining ? The execution can be shown as follows
Time    1    2    3    4    5    6    7    8
IF      I1   I2   I3   I4   I5   I6   I7   I8
ID           I1   I2   I3   I4   I5   I6   I7
EX                I1   I2   I3   I4   I5   I6
MEM                    I1   I2   I3   I4   I5
WB                          I1   I2   I3   I4
Pipeline is full ≡ all stages are busy ≡ start-up time = 5 clock periods
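The stage-versus-time picture can be generated mechanically. This is only a sketch of the idealized case described on the slide (every instruction uses all five stages, one clock period per stage, no hazards); the function names are made up for this illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def occupancy(num_periods):
    """table[stage][t-1] is the number of the instruction in that stage in clock period t, or None."""
    table = {stage: [] for stage in STAGES}
    for period in range(1, num_periods + 1):
        for depth, stage in enumerate(STAGES):
            instr = period - depth               # I1 enters IF in clock period 1
            table[stage].append(instr if instr >= 1 else None)
    return table

def label(instr):
    return f"I{instr}" if instr is not None else ""

table = occupancy(8)
print(f"{'Time':<5}" + "".join(f"{t:<5}" for t in range(1, 9)))
for stage in STAGES:
    print(f"{stage:<5}" + "".join(f"{label(i):<5}" for i in table[stage]))

# First clock period in which every stage holds an instruction (the start-up time).
start_up = next(t for t in range(1, 9) if all(table[s][t - 1] is not None for s in STAGES))
print("pipeline is full in clock period", start_up)
```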
What is Pipelining ? Compared with unpipelining, the five
stages are more complex to allow overlapped execution
All stages take the same amount of time, one clock period
The length of the clock period is determined by the slowest stage
Though, it is difficult to obtain stages with equal amount of work hence time
What is Pipelining ? If the CPU is unpipelined, the instructions would take 5 clock periods each
CPIi = 5
Since each instruction is taking 5 clock periods
CPIave = 5
Since the number of clock periods divided by the number of instructions run is 5
[Timeline: I1, I2, I3, I4, I5, I6 and I7 complete at times 5, 10, 15, 20, 25, 30 and 35]
$CPI_{ave} = 35 / 7 = 5$ clock periods
What is Pipelining ? If the CPU is pipelined, after the pipeline becomes full (the start-up time), every clock period an instruction is completed as opposed to completing every 5 clock periods
CPIi = 5
Since each instruction is taking 5 clock periods
CPIave ≈ 1
Since after the start-up time, we complete one instruction each clock period
[Timeline: I1, I2, I3, I4, I5, I6 and I7 complete at times 5, 6, 7, 8, 9, 10 and 11]
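A rough check of the two CPIave figures, assuming the idealized case on these slides (five stages, every instruction uses all of them, no stalls). The function names are made up for this illustration.

```python
STAGES = 5

def unpipelined_periods(n):
    """Each instruction runs alone for 5 clock periods."""
    return STAGES * n

def pipelined_periods(n):
    """5 clock periods for the first instruction, then one completion per clock period."""
    return STAGES + (n - 1)

for n in (7, 100, 10_000):
    cpi_unpipelined = unpipelined_periods(n) / n
    cpi_pipelined = pipelined_periods(n) / n
    print(f"{n} instructions: CPIave unpipelined = {cpi_unpipelined}, pipelined = {cpi_pipelined:.4f}")
```

For the seven instructions on the timelines this gives 35 and 11 clock periods; the pipelined CPIave only approaches 1 as the number of instructions grows, because the start-up time is spread over more instructions.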
What is Pipelining ? Once the pipeline is filled, each clock period an instruction exits the pipeline
Each clock period an instruction is completed
It seems each instruction takes one clock period to execute
CPIave ≈ 1 !!!
What is Pipelining ? Assume for next few slides that the
unpipelined MIPS CPU is converted to a pipelined CPU with the stages shown above
CPILoad = 5
CPIStore = 4
CPIA/L = 5
CPIBranch = 4
What is Pipelining ? Consider the following piece of MIPS code
---
200 LD R1, 500(R0) ; R1 <-- M[R0 + 500+]
204 DADD R2, R3, R4 ; R2 <-- R3 + R4
208 DSUB R5, R6, R7 ; R5 <-- R6 - R7
20C XOR R8, R9, R10 ; R8 <-- R9 ⊕ R10
210 SLT R11, R12, R13 ; If R12 < R13, R11 <-- 1, else R11 <-- 0
214 OR R14, R15, R16 ; R14 <-- R15 ∨ R16
218 SD R17, 600(R0) ; M[R0 + 600+] <-- R17
21C BEQZ R18, 5 ; If R18 is equal to 0, branch to address 234
---
This code is not realistic since the instructions are all independent of each other !
But, for the sake of understanding pipelining, we will use this piece of code !
What is Pipelining ? Let’s see its pipelined execution by using the textbook’s notation and assume that the cache memories take one clock period and there is no miss
Clock period            1   2   3   4   5   6   7   8   9   10  11
200 LD R1, 500(R0)      IF  ID  EX  MEM WB
204 DADD R2, R3, R4         IF  ID  EX  MEM WB
208 DSUB R5, R6, R7             IF  ID  EX  MEM WB
20C XOR R8, R9, R10                 IF  ID  EX  MEM WB
210 SLT R11, R12, R13                   IF  ID  EX  MEM WB
214 OR R14, R15, R16                        IF  ID  EX  MEM WB
218 SD R17, 600(R0)                             IF  ID  EX  MEM
21C BEQZ R18, 5                                     IF  ID  EX  MEM
What is Pipelining ? The textbook’s notation is hard to follow if there are more than a few instructions
Also, the notation requires a lot of space even for a few instructions
From now on, we will use our notation
The execution, assuming that the cache memories take one clock period and there is no miss
                        IF  ID  EX  MEM WB
200 LD R1, 500(R0)      1   2   3   4   5
204 DADD R2, R3, R4     2   3   4   5   6
208 DSUB R5, R6, R7     3   4   5   6   7
20C XOR R8, R9, R10     4   5   6   7   8
210 SLT R11, R12, R13   5   6   7   8   9
214 OR R14, R15, R16    6   7   8   9   10
218 SD R17, 600(R0)     7   8   9   10
21C BEQZ R18, 5         8   9   10  11
What is Pipelining ? What if the MIPS CPU was not pipelined ?
The execution timing would be as follows by assuming that the cache memories take one clock period and there is no miss
                        IF  ID  EX  MEM WB
200 LD R1, 500(R0)      1   2   3   4   5
204 DADD R2, R3, R4     6   7   8   9   10
208 DSUB R5, R6, R7     11  12  13  14  15
20C XOR R8, R9, R10     16  17  18  19  20
210 SLT R11, R12, R13   21  22  23  24  25
214 OR R14, R15, R16    26  27  28  29  30
218 SD R17, 600(R0)     31  32  33  34
21C BEQZ R18, 5         35  36  37  38
The execution completes in 38 clock periods !
Pipelined execution takes 11 clock periods !
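Both timing tables for this eight-instruction program can be derived from the number of stages each instruction uses (5 for Loads and A/L instructions, 4 for Stores and Branches). A sketch under the slide's assumptions (independent instructions, one-clock-period caches, no misses); the names are made up for this illustration.

```python
# (instruction, number of stages it uses)
program = [
    ("200 LD R1, 500(R0)", 5),
    ("204 DADD R2, R3, R4", 5),
    ("208 DSUB R5, R6, R7", 5),
    ("20C XOR R8, R9, R10", 5),
    ("210 SLT R11, R12, R13", 5),
    ("214 OR R14, R15, R16", 5),
    ("218 SD R17, 600(R0)", 4),
    ("21C BEQZ R18, 5", 4),
]

def pipelined(program):
    """Instruction k is fetched in clock period k and moves one stage per clock period."""
    return [list(range(k, k + stages)) for k, (_, stages) in enumerate(program, start=1)]

def unpipelined(program):
    """Each instruction starts only after the previous one has finished."""
    rows, clock = [], 0
    for _, stages in program:
        rows.append(list(range(clock + 1, clock + stages + 1)))
        clock += stages
    return rows

for (text, _), fast, slow in zip(program, pipelined(program), unpipelined(program)):
    print(f"{text:<24} pipelined {fast}   unpipelined {slow}")
print("pipelined total:", pipelined(program)[-1][-1])
print("unpipelined total:", unpipelined(program)[-1][-1])
```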
What is Pipelining ? Pipelining decreases the execution time of the program, CPUtime
The number of instructions run, NI, stays the same
We execute the same number of instructions for a program
Instructions go through the same stages as the unpipelined case
But, we execute several instructions at the same time
All the stages are busy now
The CPU does more per clock period
CPIave decreases
What is Pipelining ? We execute more instructions per unit time (a second)
The throughput is increased
The MIPSave figure is increased
The number of instructions executed per second is increased
That is why companies like to mention the MIPSave figure for their generation of microprocessors since they improve the pipeline which improves MIPSave
Hardware-related issues to solve The stages must be precisely timed, synchronized
Each stage must take the same amount of time
Each stage must have about the same amount of work
This is hard to achieve unless it is a RISC architecture
Suppose that we managed to have the same amount of work per stage so that each stage takes the same time
What is the clock period ?
Theoretically the clock period can stay the same as the unpipelined CPU
But the simultaneous execution increases the overhead per clock period
The clock period duration is increased slightly !
Hardware-related issues to solve A solution to these two problems today is to break up stages that are taking too long into several simpler stages so that the stages are finer
Then, the pipeline is longer ≡ there are many stages
Since each stage is doing simpler work, the clock period is shorter ≡ the clock frequency is higher
Today, a technique to increase the microprocessor frequency is exactly this ≡ make stages simpler and simpler ≡ make pipelines longer and longer
Today’s microprocessor pipelines are typically 15 to 25 stages long
Clock skew problems can cause timing problems
A signal may arrive too late to play a role in generating another signal since the pipeline is very long !
What is Pipelining ? Pipelining does not decrease the CPIi of
each individual instruction but increases the clock period slightly
The execution time of each instruction in terms of seconds is increased slightly !
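A numeric illustration of this trade-off, using the usual relation CPUtime = NI × CPIave × clock period. The clock periods below are hypothetical values chosen only for the example (a 10% pipelining overhead); they do not come from the slides.

```python
def cpu_time_ns(ni, cpi_ave, clock_period_ns):
    """CPUtime = NI x CPIave x clock period."""
    return ni * cpi_ave * clock_period_ns

NI = 1_000_000                 # instructions run (hypothetical)
UNPIPELINED_PERIOD_NS = 1.0    # hypothetical clock period
PIPELINED_PERIOD_NS = 1.1      # hypothetical: slightly longer due to pipelining overhead

# Time for one instruction with CPIi = 5: slightly longer when pipelined.
one_unpipelined = 5 * UNPIPELINED_PERIOD_NS
one_pipelined = 5 * PIPELINED_PERIOD_NS

# Time for the whole program: CPIave drops from 5 to about 1 when pipelined.
program_unpipelined = cpu_time_ns(NI, 5, UNPIPELINED_PERIOD_NS)
program_pipelined = cpu_time_ns(NI, 1, PIPELINED_PERIOD_NS)

print("one instruction (ns):", one_unpipelined, "vs", one_pipelined)
print("whole program (ns):", program_unpipelined, "vs", program_pipelined)
```

Each instruction takes a little longer, yet the program finishes several times sooner because CPIave falls much more than the clock period grows.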
Pipelined MIPS CPU Design In CS6143, we design the MIPS CPU by going through eight versions : 0 through 7
Version 0 is the unpipelined CPU executing only integer instructions
Version 1 is the pipelined CPU executing only integer instructions
Initially, the Version 1 design will not be an acceptable design
New hardware to handle pipelining is not identified
For example, the latches between stages are not identified
It will not handle well certain situations called hazards
There are three types of hazards : structural, data and control
All programs have hazards, so we will quickly change the design
Branch instructions take a long time, causing pipeline startups
It will have imprecise interrupts
It will assume ideal memory
All memory accesses take one clock period
So, somehow, the initial design of this version of MIPS CPU executes the code in a pipelined fashion
Pipelined MIPS CPU Design Versions We will design the pipelined MIPS CPU Version 1 in several steps
The final design of Version 1 will improve the pipeline by introducing additional hardware to better handle integer instructions
New hardware to handle pipelining is identified (latches, etc.)
It will better handle the three hazards
Branch instructions will take 2 clock periods
But, we will have delayed branches, which is not practical
It will assume that the cache memories take more than one clock period and there are cache misses
It will still have an unacceptable feature
It will still have imprecise interrupts
Pipelining MIPS CPU Consider the mnemonic machine language discussed before
---
200 LD R1, 500(R0) ; R1 <-- M[R0 + 500+]
204 DADD R2, R3, R4 ; R2 <-- R3 + R4
208 DSUB R5, R6, R7 ; R5 <-- R6 - R7
20C XOR R8, R9, R10 ; R8 <-- R9 ⊕ R10
210 SLT R11, R12, R13 ; If R12 < R13, R11 <-- 1, else R11 <-- 0
214 OR R14, R15, R16 ; R14 <-- R15 ∨ R16
218 SD R17, 600(R0) ; M[R0 + 600+] <-- R17
21C BEQZ R18, 5 ; If R18 is equal to 0, branch to address 234
---
Pipelining MIPS CPU Here is the execution of the code discussed earlier
This MIPS CPU pipeline version has problems as mentioned previously and on the next slide
                        IF  ID  EX  MEM WB
200 LD R1, 500(R0)      1   2   3   4   5
204 DADD R2, R3, R4     2   3   4   5   6
208 DSUB R5, R6, R7     3   4   5   6   7
20C XOR R8, R9, R10     4   5   6   7   8
210 SLT R11, R12, R13   5   6   7   8   9
214 OR R14, R15, R16    6   7   8   9   10
218 SD R17, 600(R0)     7   8   9   10
21C BEQZ R18, 5         8   9   10  11
Issues with the Current Design This program will be executed without difficulty since all instructions are independent of each other
There is no meaningful application where all instructions are independent of each other
An instruction, I1, generates a result that is used by another instruction, I2, so that I2 depends on I1
This code assumes we will always execute in sequence : even if we execute branch instructions
Latching hardware is not identified
All memory accesses take one clock period
Some instructions could take shorter times
Such as the BEQZ instruction
The interrupts are imprecise
Improving Initial Version 1 Design The pipelined MIPS CPU state diagram and pipeline stages
We will obtain the final state diagram and final datapath after several iterations
The initial design of Version 1 will be improved by going through several designs
First, we will add new hardware, including latches
Second, we will handle hazards better
Third, we will execute Branch instructions faster
Fourth, we will have longer L1 cache hit times and cache misses
Improving Initial Version 1 Design Version 1 will be improved by going through several designs
First we will add the hardware overhead, including latches
When we have pipelined execution, it is important not to lose the info about the execution of each instruction
With pipelining, each stage transforms the instruction and, by doing so, affects the architectural registers and the memory (the state)
Some piece of this state is needed to execute an instruction in a specific later stage
So, when we move an instruction from one stage to another, it is necessary to transfer the information to the next stage (to make the state of the instruction available to the next stage) so that correct execution happens
Latching hardware Each stage starts with the “sum” of work that has been done on the instruction in previous stages
Each stage works on the instruction resulting in new work that will be needed in later stages to complete the instruction
For that purpose stages are provided with their own latches
In other words, a stage works on an instruction that has left the previous stage and produces something related to the instruction and passes it to the next stage to be used in the next clock period
Thus, we need to save the product of a stage in temporary registers (buffers, latches) for the next stage
Latching hardware So we need the latches (buffers)
The amount of storage between two stages is not constant :
We will not discuss the control unit, but we will know that it is there
[Figure: instructions flowing through the IF, ID, EX, MEM and WB stages in two consecutive clock periods : first I7 I6 I5 I4 I3, then I8 I7 I6 I5 I4]
Latching hardware The new hardware
Three additional IRs
Though not all the bits of the extra IRs are needed
Two NPC registers
Two ALUoutput registers
One A register
One B register
Latching hardware Here is the new look of the MIPS CPU datapath with buffers
The leftmost buffer set (with NPC and IR) will be called buffer set 2 since these buffers are used by the second stage from left (ID)
The next buffer set to the right (NPC, A, B, Imm and IR) is buffer set 3, and so on
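[Figure: pipelined MIPS datapath with the stages IF, ID, EX, MEM and WB. PC feeds IF and GPR is accessed in ID. Buffer set 2 (NPC, IR) sits before ID, buffer set 3 (NPC, A, B, Imm, IR) before EX, buffer set 4 (Cond, ALUoutput, B, IR) before MEM and buffer set 5 (ALUoutput, LMD, IR) before WB]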
Latching hardware We will identify the registers by using the buffer set number (or the stage number using the registers)
Buffer set 2 registers (Stage 2 uses them)
2.NPC and 2.IR
Used by the second stage from left : ID
Buffer set 3 registers (Stage 3 uses them)
3.NPC, 3.A, 3.B, 3.Imm and 3.IR
Used by the third stage from left : EX
Buffer set 4 registers (Stage 4 uses them)
4.Cond, 4.ALUoutput, 4.B and 4.IR
Used by the fourth stage from left : MEM
Buffer set 5 registers (Stage 5 uses them)
5.ALUoutput, 5.LMD and 5.IR
Latching hardware What did we do ?
We identified buffers for the pipelined execution of instructions
The initial implementation of Version 1 does not identify the buffers
The initial implementation of Version 1 does not specify that there are four IR registers, two NPC registers, two ALUoutput registers, etc.
Timing of Microoperations We need to know about the timing of microoperations
When exactly does the instruction fetch occur for the LD instruction ?
That is, we know the instruction fetch will happen in clock period 1 (one), but exactly when ?
Similarly, when exactly does PC get its value updated to 204 when we execute the LD ?
Note : on the unpipelined CPU, this code takes 38 clock periods !
---
200 LD R1, 500(R0)
204 DADD R2, R3, R4
208 DSUB R5, R6, R7
20C XOR R8, R9, R10
210 SLT R11, R12, R13
214 OR R14, R15, R16
218 SD R17, 600(R0)
21C BEQZ R18, 5
---
Timing of Microoperations We clock (store on) our registers at the end of a clock period and therefore, registers change their values at the beginning of the next clock period
Therefore, IR gets its new value (the LD instruction) at the beginning of the ID cycle (in clock period 2)
PC gets its new value (204) at the beginning of the ID cycle (in clock period 2)
[Timing diagram, register values against the clock]
         (before)   Clock period 1   Clock period 2     (next)
PC       1FC        200              204                208
IR       ?          ?                LD R1, 500(R0)     DADD R2, R3, R4
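The clocking rule can be mimicked in software by computing all next values from the current ones and committing them together at the end of the clock period. This is only a sketch; the dictionary `memory` and the function `run` are made up for this illustration.

```python
# Instruction memory for the first three instructions of the example program.
memory = {
    0x200: "LD R1, 500(R0)",
    0x204: "DADD R2, R3, R4",
    0x208: "DSUB R5, R6, R7",
}

def run(periods, pc=0x200):
    """Return (clock period, PC, IR) as seen during each clock period."""
    state = {"PC": pc, "IR": "?"}
    history = []
    for period in range(1, periods + 1):
        history.append((period, state["PC"], state["IR"]))
        # Work done during the period reads only the current register values ...
        next_state = {"IR": memory[state["PC"]], "PC": state["PC"] + 4}
        # ... and the registers are clocked (stored on) at the end of the period.
        state = next_state
    return history

for period, pc, ir in run(3):
    print(f"clock period {period}: PC = {pc:X}, IR = {ir}")
```

During clock period 1 the fetch is in progress, so IR still holds its old content; the LD and the value 204 become visible only in clock period 2.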
Instruction fetch (IF) Cycle Fetch the instruction pointed to by PC into 2.IR
2.IR <-- M[PC]
Update PC by adding 4
PC <-- PC + 4
How about 2.NPC ?
Soon, we will see that !
Instruction decode/register fetch (ID) Cycle Prepare temporary registers A, B and Imm in case we need GPR registers, an effective address or an immediate operand
3.A <-- GPR[2.IR.Rs]
3.B <-- GPR[2.IR.Rt]
3.Imm <-- 2.IR.DOImm+
How about 3.NPC & 3.IR ?
Soon, we will see them !
Execute (EX) Cycle for Load/Store Instructions How do we know we have a Load/Store instruction ?
The IR register for this stage (3.IR) was not given the value of the IR register of the previous stage (2.IR)
We need to update the ID stage :
3.IR <-- 2.IR
Instruction decode/register fetch (ID) cycle Prepare temporary registers A, B and Imm and move IR to the next stage
3.A <-- GPR[2.IR.Rs]
3.B <-- GPR[2.IR.Rt]
3.Imm <-- 2.IR.DOImm+
3.IR <-- 2.IR
How about 3.NPC ?
Soon, we will see that !
Execute (EX) Cycle for Load/Store Instructions Calculate the effective address
4.ALUoutput <-- 3.A + 3.Imm
We should not forget to move 3.IR to the next stage
4.IR <-- 3.IR
How about 4.Cond and 4.B ?
Soon, we will see them !
Memory access/branch completion (MEM) Cycle for Load Instructions Read the data from memory
5.LMD <-- M[4.ALUoutput]
We should not forget to move 4.IR to the next stage
5.IR <-- 4.IR
How about 5.ALUoutput ?
Soon, we will see that !
Write-back (WB) Cycle for Load instructions Transfer LMD to a GPR register
GPR[5.IR.Rt] <-- 5.LMD
The Load takes 5 clock periods to execute : CPILoad = 5
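The five cycles of a Load can be traced in software with one dictionary per buffer set. This is a sketch only: the decoded-instruction dictionary, the separate `imem`/`dmem` memories and the function `run_load` are made up for this illustration, and sign extension of the displacement is ignored.

```python
def run_load(gpr, imem, dmem, pc):
    """Trace one LD through IF, ID, EX, MEM and WB using buffer sets 2 to 5."""
    b2, b3, b4, b5 = {}, {}, {}, {}

    # IF : 2.IR <-- M[PC] ; PC <-- PC + 4
    b2["IR"] = imem[pc]
    pc = pc + 4

    # ID : 3.A <-- GPR[2.IR.Rs] ; 3.B <-- GPR[2.IR.Rt] ; 3.Imm <-- 2.IR.DOImm ; 3.IR <-- 2.IR
    b3["A"] = gpr[b2["IR"]["Rs"]]
    b3["B"] = gpr[b2["IR"]["Rt"]]
    b3["Imm"] = b2["IR"]["Imm"]
    b3["IR"] = b2["IR"]

    # EX : 4.ALUoutput <-- 3.A + 3.Imm ; 4.IR <-- 3.IR
    b4["ALUoutput"] = b3["A"] + b3["Imm"]
    b4["IR"] = b3["IR"]

    # MEM : 5.LMD <-- M[4.ALUoutput] ; 5.IR <-- 4.IR
    b5["LMD"] = dmem[b4["ALUoutput"]]
    b5["IR"] = b4["IR"]

    # WB : GPR[5.IR.Rt] <-- 5.LMD
    gpr[b5["IR"]["Rt"]] = b5["LMD"]
    return pc

gpr = [0] * 32                                             # R0 is 0
imem = {0x200: {"op": "LD", "Rs": 0, "Rt": 1, "Imm": 0x500}}   # LD R1, 500(R0)
dmem = {0x500: 42}                                         # hypothetical data value
new_pc = run_load(gpr, imem, dmem, 0x200)
print(f"R1 = {gpr[1]}, PC = {new_pc:X}")
```

Passing the IR along from buffer set to buffer set is what lets WB still know the destination register Rt four clock periods after the fetch.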
Memory access/branch completion (MEM) Cycle for Store instructions The effective address is in 4.ALUoutput
Where is the data to store ?
It is in 3.B
We did not transfer 3.B to 4.B ?
Execute (EX) Cycle for Load/Store Instructions Calculate the effective address
4.ALUoutput <-- 3.A + 3.Imm
We should not forget to move 3.IR to the next stage
4.IR <-- 3.IR
Transfer 3.B to 4.B
4.B <-- 3.B
How about 4.Cond ?
Soon, we will see that !
Memory access/branch completion (MEM) Cycle for Store Instructions Write 4.B to the memory pointed to by 4.ALUoutput
M[4.ALUoutput] <-- 4.B
The Store takes 4 clock periods to execute : CPIStore = 4
Execute (EX) Cycle for A/L R-format instructions Perform the operation specified by the Function field of 3.IR
4.ALUoutput <-- 3.A func 3.B
We should not forget to move 3.IR to the next stage
4.IR <-- 3.IR
How about 4.Cond ?
Soon, we will see that !
Memory access/branch completion (MEM) Cycle for A/L R-format Instructions We could complete the execution of these instructions in this cycle by transferring 4.ALUoutput to a GPR register
But, we decide to complete the execution in the WB cycle to help us handle data hazards better as we will see later
Memory access/branch completion (MEM) Cycle for A/L R-format Instructions Transfer 4.ALUoutput and 4.IR to the next stage
5.ALUoutput <-- 4.ALUoutput
5.IR <-- 4.IR
Write-back (WB) Cycle for A/L R-format instructions We transfer the result from 5.ALUoutput to a GPR register
GPR[5.IR.Rd] <-- 5.ALUoutput
A/L R-format instructions take 5 clock periods to execute
CPIA/L R-format = 5
Execute (EX) Cycle for A/L I-format instructions Perform the operation specified by the Opcode field of 3.IR
4.ALUoutput <-- 3.A op 3.Imm
We should not forget to move 3.IR to the next stage
4.IR <-- 3.IR
How about 4.Cond ?
Soon, we will see that !
Memory access/branch completion (MEM) Cycle for A/L I-format Instructions We could complete the execution of these instructions in this cycle by transferring 4.ALUoutput to a GPR register
But, we decide to complete the execution in the WB cycle to help us handle data hazards better as we will see later
Memory access/branch completion (MEM) Cycle for A/L I-format Instructions Transfer 4.ALUoutput and 4.IR to the next stage
5.ALUoutput <-- 4.ALUoutput
5.IR <-- 4.IR
Write-back (WB) Cycle for A/L I-format Instructions We transfer the result from 5.ALUoutput to a GPR register
GPR[5.IR.Rt] <-- 5.ALUoutput
A/L I-format instructions take 5 clock periods to execute
CPIA/L I-format = 5
Execute (EX) Cycle for Branch Instructions We need to store the result of comparing 3.A with 0 in Cond
We need to calculate the effective address by adding PC and 4 times the Offset
But, is PC changed by the instructions behind the Branch ? Yes !
We should have saved the PC value for the Branch in a new register : NPC in the IF cycle !
Execute (EX) Cycle for Branch Instructions We need to study the execution of Branch instructions more carefully
When the Branch is in its EX stage, PC is 608
600 BEQZ R8, 4 ; Branch to 614 if R8 = 0
604 DADD R9, R19, R11
608 DSUB R12, R13, R14
60C XOR R15, R16, R17
610 SLT R18, R19, R20
614 AND R21, R22, R23
There is a Problem !
We detect that there is a Branch at the beginning of its ID cycle (clock period 2)
We then immediately stop the IF stage from fetching any instruction and from adding 4 to PC
Clock period, PC    1, 600   2, 604   3, 604
WB                  ?        ?        ?
MEM                 ?        ?        ?
EX                  ?        ?        BEQZ
ID                  ?        BEQZ
IF                  BEQZ
Execute (EX) Cycle for Branch Instructions We know we have a branch in the ID stage when we decode it
PC is 604 in ID
When the Branch reaches EX, it expects to have PC = 604
What shall we do ?
We decide to have a new register to keep the PC value for the Branch : NPC (New PC)
We save the PC value for the Branch in NPC in the IF stage
So 604 moves with the Branch into the EX stage
When the ID stage detects a Branch
It stops the IF stage fetching the next instruction
It stops the IF stage adding 4 to PC
We also have to stop incrementing PC so that if the condition is not satisfied, we execute the instruction following the BEQZ
This is the instruction in location 604
We should not execute the instruction in 608 right after we execute the BEQZ
Execute (EX) Cycle for Branch Instructions
We change the IF and ID stages to include transfers to 2.NPC and 3.NPC
The EX stage for the Branch is like this
4.IR ← 3.IR
4.Cond ← 3.A op 0
4.ALUoutput ← 3.NPC + (3.Imm * 4)
Now, we have the correct PC value in 3.NPC in the EX stage (3.NPC has 604)
But, when do we write to PC so we branch ?
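As a quick numeric check of the EX-stage microoperations for the example BEQZ, the following Python sketch evaluates the condition and the branch target. The function name `branch_ex` and its parameters are illustrative assumptions, not part of the Version 1 design; addresses are hexadecimal, as in the listing.

```python
def branch_ex(npc, imm, a):
    """EX stage of a BEQZ: returns (4.Cond, 4.ALUoutput)."""
    cond = (a == 0)           # 4.Cond <- 3.A op 0, where op is "equal to" for BEQZ
    target = npc + imm * 4    # 4.ALUoutput <- 3.NPC + (3.Imm * 4)
    return cond, target

# BEQZ R8, 4 fetched from 600: 3.NPC holds 604, the offset is 4 and R8 is assumed to be 0.
cond, target = branch_ex(npc=0x604, imm=4, a=0)
print(cond, format(target, "X"))  # True 614
```

The offset counts instructions, so it is multiplied by 4 before it is added to the saved NPC value.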
MIPS Versions 0 & 1 180CS 6143
Execute (EX) Cycle for Branch Instructions
We write to PC the clock period after the Branch is in EX
We write to PC in the IF stage when it is clock period 4
The IF stage then changes PC and NPC if 4.Cond is 1
PC ← If (4.Cond) then 4.ALUoutput else if (2.IR.opcode ≠ Branch) then PC + 4
2.NPC ← If (4.Cond) then 4.ALUoutput else if (2.IR.opcode ≠ Branch) then PC + 4
We also need to clear 4.Cond so that a new Branch can be executed
4.Cond ← If (4.Cond) then 0
[Figure: pipeline occupancy per clock period. PC is 600, 604, 604, 604 and 614 in clock periods 1 to 5; BEQZ is in IF, ID and EX in clock periods 1 to 3; AND is fetched in clock period 5]
MIPS Versions 0 & 1 181CS 6143
Execute (EX) Cycle for Branch Instructions
What shall we do with DADD, DSUB and XOR ?
They should not be fetched until we know the Branch result !
If the ID stage has a Branch, we stop the instruction fetch from the memory
But, we also have to clear 2.IR if it has a Branch so that we fetch an instruction in the next clock period (clock period 5) : 4.IR has the Branch in the 4th clock period
2.IR ← If (4.IR.opcode = Branch) then NOP else if (2.IR.opcode ≠ Branch) then M[PC]
[Figure: pipeline occupancy per clock period. PC is 600, 604, 604, 604 and 614 in clock periods 1 to 5; in clock period 5 the ID stage holds a NOP while AND is fetched]
MIPS Versions 0 & 1 182CS 6143
Execute (EX) Cycle for Branch Instructions
What if we continued with the DADD, DSUB and XOR ?
Would they change any architectural register or memory ? NO !
Since we arranged the pipeline such that all register writes and memory writes happen at the end of the pipeline
By the time we know we have a Branch, we stop them and flush them out
RISC architectures allow late writes that help the hardware designer
CISC architectures require early writes in the pipeline
The hardware designer has to undo these early writes when a branch is finally recognized
Unnecessary pressure on the hardware designer
MIPS Versions 0 & 1 183CS 6143
Execute (EX) Cycle for Branch Instructions
Stopping the fetches, how does the execution look ?
[Figure: pipeline occupancy per clock period. PC is 600, 604, 604, 604 and 614 in clock periods 1 to 5; BEQZ moves through IF, ID and EX in clock periods 1 to 3; in clock period 5 the ID stage holds a NOP and AND is fetched]
The pipeline is almost empty with only one instruction in the WB stage !
There is only one instruction in the pipeline
This is why Control instructions are important to deal with for pipelines
MIPS Versions 0 & 1 184CS 6143
Execute (EX) Cycle for Branch Instructions
Stopping the fetches shown in a different way
600 BEQZ R8, 4
604 DADD R9, R19, R11
608 DSUB R12, R13, R14
60C XOR R15, R16, R17
610 SLT R18, R19, R20
614 AND R21, R22, R23
[Figure: textbook-style timing diagram over clock periods 1 to 9. BEQZ occupies IF, ID and EX in clock periods 1 to 3; the IF stage completes the Branch in clock period 4; AND goes through IF, ID, EX, MEM and WB in clock periods 5 to 9]
A pipeline bubble is generated
The Branch causes a pipeline start-up !
MIPS Versions 0 & 1 185CS 6143
Execute (EX) Cycle for Branch Instructions In the 4th clock period we complete the
execution of the Branch by writing the effective address to PC in IF
The control unit knows we are completing the Branch instruction and so does not allow an instruction fetch
MIPS Versions 0 & 1 186CS 6143
Let’s rewrite microoperations for the Branch
IF stage (for all instructions)
2.IR ← If (4.IR.opcode = Branch) then NOP else if (2.IR.opcode ≠ Branch) then M[PC]
PC ← If (4.Cond) then 4.ALUoutput else if (2.IR.opcode ≠ Branch) then PC + 4
2.NPC ← If (4.Cond) then 4.ALUoutput else if (2.IR.opcode ≠ Branch) then PC + 4
4.Cond ← If (4.Cond) then 0
ID stage (for all instructions)
3.A ← GPR[2.IR.Rs]
3.B ← GPR[2.IR.Rt]
3.Imm ← 2.IR.DOImm+
3.IR ← 2.IR
3.NPC ← 2.NPC
EX stage
4.IR ← 3.IR
4.Cond ← 3.A op 0
4.ALUoutput ← 3.NPC + (3.Imm * 4)
The Branch execution completes in the IF stage in the next clock period
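The Branch microoperations can be traced with a small Python sketch, given here as an illustration rather than as the course's implementation. It models only PC, the NPC registers, the IR registers, 4.Cond and 4.ALUoutput for the example program at 600. One assumption goes beyond the slides: the ID stage hands a Branch to the EX stage only once, so the copy that stays in 2.IR while the fetch is stopped is not executed again. The names `run`, `MEMORY` and `NOP` are hypothetical.

```python
NOP = ("NOP", 0)
BRANCH = "BEQZ"

# Instruction memory of the example (hexadecimal addresses); each entry is (opcode, offset).
MEMORY = {
    0x600: ("BEQZ", 4), 0x604: ("DADD", 0), 0x608: ("DSUB", 0),
    0x60C: ("XOR", 0), 0x610: ("SLT", 0), 0x614: ("AND", 0),
}

def run(r8, periods=5):
    """Return a list of (clock period, PC at the start of the period, opcode fetched or None)."""
    pc, npc2, npc3 = 0x600, 0, 0
    ir2 = ir3 = ir4 = NOP
    cond4, alu4 = False, 0
    trace = []
    for clock in range(1, periods + 1):
        branch_in_2 = ir2[0] == BRANCH
        branch_in_4 = ir4[0] == BRANCH
        fetched = None

        # IF stage: 2.IR, PC and 2.NPC
        if branch_in_4:
            new_ir2 = NOP
        elif not branch_in_2:
            new_ir2 = fetched = MEMORY.get(pc, NOP)
        else:
            new_ir2 = ir2                      # fetch stopped, 2.IR keeps the Branch
        if cond4:
            new_pc = alu4
        elif not branch_in_2:
            new_pc = pc + 4
        else:
            new_pc = pc                        # increment stopped
        new_npc2 = new_pc                      # 2.NPC receives the same value as PC

        # ID stage (assumption: a Branch already sent down the pipeline is not sent again)
        already_sent = branch_in_2 and BRANCH in (ir3[0], ir4[0])
        new_ir3 = NOP if already_sent else ir2
        new_npc3 = npc2

        # EX stage
        new_ir4 = ir3
        if ir3[0] == BRANCH:
            new_cond4 = (r8 == 0)
            new_alu4 = npc3 + ir3[1] * 4
        else:
            new_cond4, new_alu4 = False, alu4  # 4.Cond is cleared once it has been used

        trace.append((clock, pc, fetched[0] if fetched else None))
        # All registers are clocked together at the end of the clock period.
        pc, npc2, npc3 = new_pc, new_npc2, new_npc3
        ir2, ir3, ir4 = new_ir2, new_ir3, new_ir4
        cond4, alu4 = new_cond4, new_alu4
    return trace

for clock, pc, fetched in run(r8=0):
    print(clock, format(pc, "X"), fetched)
```

With R8 = 0 the PC values for clock periods 1 to 5 are 600, 604, 604, 604 and 614, and AND is fetched in clock period 5. With a nonzero R8 the fifth fetch comes from 604 instead, so DADD follows the Branch.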
MIPS Versions 0 & 1 187CS 6143
Branch instructions take 4 clock periods to execute
CPIBranch = 4
Since the Branch execution is completed in the IF stage
Overall, executing a control instruction first creates a pipeline bubble and then causes a pipeline start-up where only one stage, IF, is busy
It is therefore critical that the number of control instructions be reduced by having
Better programming styles
Better compilers
MIPS Versions 0 & 1 188CS 6143
Evaluation of Pipelined MIPS CPU
With pipelining and memory hierarchies, hardware has become more sensitive to
The number of instructions, NI (due to increased memory hierarchy delays)
The number of control instructions (due to pipeline and memory hierarchy delays that can occur)
Now we see why the pipeline is sensitive to control instructions
The order of instructions (due to pipeline delays that can occur)
Class notes on the remaining versions will show examples of why the pipeline is sensitive to the order of instructions
MIPS Versions 0 & 1 189CS 6143
What is Pipelining ?
Before we continue with the evaluation of our design, a comment :
Pipelining is often invisible to the programmer, though current architectures allow some visibility to help improve the pipeline
For example, knowing the pipeline length and how many clock periods each stage takes helps the compiler to come up with more efficient code
This is because a better order of instructions can be obtained
This is a point made in earlier
MIPS Versions 0 & 1 190CS 6143
The execution of the code on the Version 1 MIPS pipeline is shown again below by assuming that the cache memories take one clock period and there is no miss
Assume that our integer-instruction pipeline can execute the XOR, SLT, etc.
It looks as if it takes 10 clock periods to run the code
Though the Branch completes in clock period 11 !
                        IF    ID    EX    MEM   WB
200 LD   R1, 500(R0)    1     2     3     4     5
204 DADD R2, R3, R4     2     3     4     5     6
208 DSUB R5, R6, R7     3     4     5     6     7
20C XOR  R8, R9, R10    4     5     6     7     8
210 SLT  R11, R12, R13  5     6     7     8     9
214 OR   R14, R15, R16  6     7     8     9     10
218 SD   R17, 600(R0)   7     8     9     10
21C BEQZ R18, 5         8     9     10
MIPS Versions 0 & 1 191CS 6143
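The Speed Comparison
The piece of program takes 11 clock periods on the pipelined computer as opposed to 33 clock periods on the unpipelined
$$\text{Speedup}_{overall} = \frac{\text{CPUtime}_{old}}{\text{CPUtime}_{new}} = \frac{33}{11} = 3$$
$$\text{CPI}_{ave}^{\text{w/o pipe}} = \frac{\text{\# of clock periods for program}}{NI} = \frac{33}{8} = 4.125$$
$$\text{CPI}_{ave}^{\text{w/ pipe}} = \frac{\text{\# of clock periods for program}}{NI} = \frac{11}{8} = 1.375$$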
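The three ratios of the speed comparison can be recomputed with a few lines of Python. This is only an arithmetic check using the clock-period counts stated on the slide (33 unpipelined, 11 pipelined, 8 instructions); the variable names are illustrative.

```python
ni = 8                      # number of instructions in the piece of program
cycles_unpipelined = 33     # clock periods without pipelining
cycles_pipelined = 11       # clock periods with pipelining (the Branch completes in period 11)

cpi_unpipelined = cycles_unpipelined / ni
cpi_pipelined = cycles_pipelined / ni
# Both runs use the same clock period, so the CPUtime ratio equals the clock-period ratio.
speedup = cycles_unpipelined / cycles_pipelined

print(cpi_unpipelined, cpi_pipelined, speedup)  # 4.125 1.375 3.0
```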
MIPS Versions 0 & 1 192CS 6143
In general any pipeline will work fine if
Every instruction is independent of every other instruction in the pipeline at any moment
Otherwise, we have what we call hazards as we will see soon
The number of control instructions is very small
The order of instructions is good
Otherwise, we have what we call hazards as we will see soon
There is a lot of hardware available
In the ideal case, CPIave ≡ 1
In the ideal case, NI ≡ # of clock periods for the program
Speedupoverall, ideal = pipeline depth (the number of pipeline stages)
MIPS Versions 0 & 1 193CS 6143
Ideal MIPS
$$\text{MIPS}_{ideal} = \frac{\text{\# of instructions completed per clock period} \times \text{clock frequency}}{10^{6}}$$
If the CPU completes one instruction per clock period
$$\text{MIPS}_{ideal} = \frac{\text{clock frequency}}{10^{6}}$$
We now see why microprocessor companies are eager to increase the clock frequency !
MIPS Versions 0 & 1 194CS 6143
Pipeline Timing
Due to start-ups and hazards, CPIave is not 1
The net effect of start-ups and hazards is that more than one clock period is needed to execute an instruction on average
The number of additional clock periods is due to the average delay cycles (stalls, as we will call them soon) per instruction
$$\text{CPI}_{ave} = \text{CPI}_{ave}^{ideal} + \text{pipeline delays (stalls) per instruction}$$
MIPS Versions 0 & 1 195CS 6143
Pipeline Timing
Since the ideal CPIave with pipelining is 1, we obtain the following formula
$$\text{Speedup}_{overall} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}$$
It is clear from the above formula that the speedup is directly proportional to the number of pipeline stages
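The speedup formula can be derived in a few steps. The sketch below rests on two assumptions that the slide leaves implicit: the unpipelined and pipelined CPUs use the same clock period, and the unpipelined CPIave equals the pipeline depth (every instruction takes all the stages).

```latex
\begin{align*}
\text{Speedup}_{overall}
  &= \frac{\text{CPUtime}_{unpipelined}}{\text{CPUtime}_{pipelined}}
   = \frac{NI \times \text{CPI}_{ave}^{unpipelined} \times \text{clock period}}
          {NI \times \text{CPI}_{ave}^{pipelined} \times \text{clock period}} \\
  &= \frac{\text{CPI}_{ave}^{unpipelined}}{\text{CPI}_{ave}^{ideal} + \text{stall cycles per instruction}}
   = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}
\end{align*}
```

With no stalls the denominator is 1 and the speedup equals the pipeline depth, which is the ideal case.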
MIPS Versions 0 & 1 196CS 6143
Pipeline Timing
Example : Assume that a program with no control instructions is run and the following measurements are made on the MIPS
Calculate CPIave and CPUtime for both unpipelined and pipelined cases, and Speedupoverall, the pipeline efficiency and MIPSideal for the pipelined case
Assume that clock frequency is 200MHz
Note that this program is an ideal program since there is no Store instruction !
NI = # of Loads + # of A/L = 10 + 90 = 100
Instruction  CPIi  # of times executed  Unpipelined time
Loads        5     10                   0.25 μsec
A/L          5     90                   2.25 μsec
$$\text{Clock period} = \frac{1}{\text{Clock frequency}} = \frac{1}{200 \times 10^{6}} = 5 \times 10^{-9}\ \text{sec} = 5\ \text{ns}$$
MIPS Versions 0 & 1 197CS 6143
Pipeline Timing Example continued
For the unpipelined case :
CPUtimeunpipelined = TimeLoads + TimeA/L = 0.25 + 2.25 = 2.5 μsec
# of clock periods for Loads = # of times executed x CPIi = 10 x 5 = 50
# of clock periods for A/L = # of times executed x CPIi = 90 x 5 = 450
# of clock periods for program = # of clock periods for Loads + # of clock periods for A/L = 50 + 450 = 500
$$\text{CPI}_{ave}^{\text{w/o pipe}} = \frac{\text{\# of clock periods for program}}{NI} = \frac{500}{100} = 5$$
MIPS Versions 0 & 1 198CS 6143
Pipeline Timing Example continued
For the pipelined case :
# of clock periods for program = Start-up time + (NI - 1) = 5 + (100 - 1) = 104
CPUtimepipelined = # of clock periods for program x clock period = 104 x 5 = 520ns = 0.52 μsec
$$\text{Speedup}_{overall} = \frac{\text{CPUtime}_{old}}{\text{CPUtime}_{new}} = \frac{2.5}{0.52} = 4.81$$
Speedupoverall is not 5 because of the startup time
$$\text{Pipeline efficiency} = \frac{\text{Speedup}_{overall}}{\text{Speedup}_{ideal}} = \frac{4.81}{5} = 0.96$$
$$\text{MIPS}_{ideal} = \frac{\text{clock frequency}}{10^{6}} = \frac{200 \times 10^{6}}{10^{6}} = 200$$
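The whole worked example can be recomputed with a short Python sketch. The numbers come from the example (200 MHz clock, 10 Loads and 90 A/L instructions with CPIi = 5, a 5-stage pipeline); the variable names are illustrative assumptions.

```python
clock_mhz = 200
clock_period_ns = 1e3 / clock_mhz            # 5.0 ns
depth = 5                                    # pipeline stages, also the start-up time
mix = {"Loads": (5, 10), "A/L": (5, 90)}     # name -> (CPIi, times executed)

ni = sum(count for _, count in mix.values())                    # 100
cycles_unpipelined = sum(cpi * count for cpi, count in mix.values())   # 500
time_unpipelined_ns = cycles_unpipelined * clock_period_ns      # 2500.0 ns = 2.5 usec

cycles_pipelined = depth + (ni - 1)                             # 104
time_pipelined_ns = cycles_pipelined * clock_period_ns          # 520.0 ns = 0.52 usec

speedup = time_unpipelined_ns / time_pipelined_ns
efficiency = speedup / depth                 # the ideal speedup is the pipeline depth
mips_ideal = clock_mhz                       # clock frequency / 10**6

print(cycles_unpipelined, cycles_pipelined, round(speedup, 2), round(efficiency, 2), mips_ideal)
```

The start-up time is the only reason the speedup falls short of 5 here; for a longer program with the same mix the term (NI - 1) dominates and the speedup approaches the pipeline depth.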
MIPS Versions 0 & 1 199CS 6143
Improving Initial Version 1 Design
Now, we will make an assessment of pipelining to prepare ourselves for the next set of improvements
Pipelining increases the speed but there are difficulties and problems associated with pipelining :
The hardware is complicated
Additional temporary registers (latches or buffers) are needed between stages so that later stages can correctly work on an instruction
Some latches are simple duplications of earlier registers and some are latches that save the output of a stage.
MIPS Versions 0 & 1 200CS 6143
Improving Initial Version 1 Design
Pipelining increases the speed but there are difficulties and problems associated with pipelining :
The pressure on the memory is doubled : two memory accesses per clock period happen
One for instruction in the IF stage
One for data in the MEM stage
For example, for the program execution on slide 190, the CPU makes two memory accesses in the 4th clock period
The frequency of simultaneous accesses depends on the number of Loads and Stores
The number of Loads and Stores depends on the programmer and compiler
MIPS Versions 0 & 1 201CS 6143
Improving Initial Version 1 Design
Pipelining increases the speed but there are difficulties and problems associated with pipelining :
Not all instructions require all the stages
Some stages are empty, idle, creating a pipeline bubble that cannot be avoided
RISC instructions require fewer stages, therefore the chance of having many unneeded stages is reduced
With CISC, the number of stages is larger
MIPS Versions 0 & 1 202CS 6143
Improving Initial Version 1 Design
Pipelining increases the speed but there are difficulties and problems associated with pipelining :
The startup time slows the system
Its impact is through
The number of times it occurs (due to control instructions)
The time it takes to fill the pipeline (pipeline depth or latency)
RISC systems perform better here since they have shorter pipelines
MIPS Versions 0 & 1 203CS 6143
Improving Initial Version 1 Design
Pipelining increases the speed but there are difficulties and problems associated with pipelining :
Some instructions have complex microoperations that take longer than one clock period to complete
Overall, it is difficult to have balanced stages
MIPS Versions 0 & 1 204CS 6143
Improving Initial Version 1 Design
Pipelining increases the speed but there are difficulties and problems associated with pipelining :
The clock period is determined by
The slowest stage, which is often the stage with the addition
The EX stage
The latches that need set up time and propagation delays
The clock skew problem
In RISC systems it is easy to distribute the work equally to stages but with CISC it is more difficult
So, in order not to increase the clock period length in CISC systems, a stage that has a complex microoperation takes more than one clock period
But, this creates bubbles !
MIPS Versions 0 & 1 205CS 6143
Improving Initial Version 1 Design
Pipelining increases the speed but there are difficulties and problems associated with pipelining :
Because of what we call hazards, an instruction in the stream may not be moved to the next stage but forced to stay in the same stage more than one clock period
The instruction is stalled
The stages to the left of the stalled instruction cannot move their instructions to the right, to keep the strict order of execution
These stages become idle (do not work on new instructions) but keep the old instructions
This creates a pipeline bubble : The speed is decreased.
Note that the startups decrease the speed since there is a larger bubble in the pipeline
Control instructions result in startups
Pipeline “hazards” also create startups if poorly designed
MIPS Versions 0 & 1 206CS 6143
Pipeline Hazards
They are caused by a number of reasons forcing the pipeline to stop the execution of an instruction and the instructions that are behind it
The instructions are stalled
The hazards generate either bubbles or a start-up of the pipeline.
There are three types of hazards
Structural
Data
Control
MIPS Versions 0 & 1 207CS 6143
Structural Hazards
Structural hazards occur from resource conflicts that can be solved with more resources, i.e. more or faster hardware
Examples of structural hazards are
Only one memory port in the CPU, which stops the IF stage if a Load/Store is using this single memory port to access data
If an L1 cache memory takes two or more clock periods !
If the GPR set has only one write port and several simultaneous GPR writes are performed, only one GPR write will happen, the others will write one by one
If a stage performs a complex microoperation taking several clock periods, such as FP arithmetic, and this microoperation is not pipelined, then instructions behind it will stay idle in their stages (these instructions are stalled)
MIPS Versions 0 & 1 208CS 6143
Structural Hazards
Due to a structural hazard, one or more instructions behind the instruction that caused the hazard are delayed, are not allowed to move.
The stages behind the hazard-causing instruction become idle : A bubble is generated
The bubble moves one stage per clock period and eventually leaves the pipeline.
MIPS Versions 0 & 1 209CS 6143
Structural Hazards
What if there was only one memory port ?
If a Load or Store tries to access a data element in the memory in the MEM cycle, then the IF stage is forced to stay idle by the control unit so that the priority is given to the instruction already in the pipeline to complete it as soon as possible
The instruction that was going to be fetched is stalled
A bubble is created in the IF stage
Theoretically, the instructions behind it are also stalled
The bubble moves up the pipeline one stage per clock period
Stalling ends when the bubble leaves the pipeline
Next slide shows this process one more time with that theoretical stalling of the instructions behind the first stalled instruction
MIPS Versions 0 & 1 210CS 6143
Structural Hazards
What if there was only one memory port ?
Clock period            1     2     3     4     5     6     7     8     9     10    11
200 LD   R1, 500(R0)    IF    ID    EX    MEM   WB
204 DADD R2, R3, R4           IF    ID    EX    MEM   WB
208 DSUB R5, R6, R7                 IF    ID    EX    MEM   WB
20C XOR  R8, R9, R10                      Stall IF    ID    EX    MEM   WB
210 SLT  R11, R12, R13                          IF    ID    EX    MEM   WB
214 OR   R14, R15, R16                                IF    ID    EX    MEM   WB
218 SD   R17, 600(R0)                                       IF    ID    EX    MEM
21C BEQZ R18, 5                                                   IF    ID    EX
A bubble is created and moves up the pipeline
MIPS Versions 0 & 1 211CS 6143
Structural Hazards
What if there was only one memory port ?
We will avoid using textbook notation of instruction execution since even for a few instructions, a large space is needed to show the flow of execution
Rather, we will use our own notation shown below
                        IF    ID    EX    MEM   WB
200 LD   R1, 500(R0)    1     2     3     4     5
204 DADD R2, R3, R4     2     3     4     5     6
208 DSUB R5, R6, R7     3     4     5     6     7
20C XOR  R8, R9, R10    5     6     7     8     9
210 SLT  R11, R12, R13  6     7     8     9     10
214 OR   R14, R15, R16  7     8     9     10    11
218 SD   R17, 600(R0)   8     9     10    11
21C BEQZ R18, 5         9     10    11
XOR is delayed, stalled, in clock period 4 by the LD accessing the memory for its data
XOR is fetched in the 5th clock period, not in the 4th clock period
MIPS Versions 0 & 1 212CS 6143
Structural Hazards
What if there was only one memory port ?
The control unit stops the IF stage from accessing the memory to fetch the XOR
The reason is that we want to complete the execution of the LD that is already in the pipeline
Instructions in the pipeline have higher priority for completion
The SD instruction will access the memory in the 11th clock period to write data
There will not be an instruction fetch in the 11th clock period
Once a stall occurs, a bubble is introduced : not all the stages are busy
The execution time of the instruction is increased ≡ its CPIi is increased
CPIave is increased
CPUtime is increased
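The fetch delays caused by a single memory port can be reproduced with a small Python sketch. It is a simplification under stated assumptions: only the IF clock period of each instruction is computed, a data access in MEM always wins over an instruction fetch, and the fetch stop caused by the Branch is not modelled. The function name `fetch_periods` and the program encoding are hypothetical.

```python
def fetch_periods(program):
    """program: list of (name, accesses_data). Returns [(name, clock period of IF)]."""
    port_busy = set()      # clock periods in which a Load/Store owns the single memory port
    clock = 0
    result = []
    for name, accesses_data in program:
        clock += 1
        while clock in port_busy:      # the fetch is stalled while MEM uses the port
            clock += 1
        result.append((name, clock))
        if accesses_data:
            port_busy.add(clock + 3)   # MEM is three stages after IF
    return result

PROGRAM = [
    ("LD", True), ("DADD", False), ("DSUB", False), ("XOR", False),
    ("SLT", False), ("OR", False), ("SD", True), ("BEQZ", False),
]

print(fetch_periods(PROGRAM))
```

The LD reserves the port in clock period 4, so XOR is fetched in clock period 5, and the SD reserves clock period 11, in which no fetch can take place.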
MIPS Versions 0 & 1 213CS 6143
Structural Hazards
What if there was only one memory port ?
We will not have this structural hazard in our system
It is also clear from the Version 1 datapath diagram that we have two separate memory ports
Memory Port 1 for instruction fetches
Memory Port 2 for data accesses
MIPS Versions 0 & 1 214CS 6143
Hazards
Structural Hazards
Often, to solve structural hazards more or faster hardware is needed
However, the solution of the other two hazards, data and control hazards, requires
More hardware and
Better compilation techniques
To better order instructions
To reduce the number of control instructions
The result is that
Pipeline bubbles are eliminated or reduced
The number of pipeline start-ups is also reduced
MIPS Versions 0 & 1 215CS 6143
Hazards
The overall hardware structure that detects a hazard and stops (stalls) an instruction or several instructions until the hazard condition no longer exists is called pipeline interlock
Note that if an instruction is stalled, the instructions behind it are also stalled as we will see shortly
Thus, it is costly to stall a single instruction in the pipeline
MIPS Versions 0 & 1 216CS 6143
Data Hazards
As mentioned before, all previous program examples had instructions independent of each other
The instructions did not have any register or memory location in common
For example, an instruction wrote to R9 and the next instruction did not read R9
The second instruction did not depend on the first instruction
There was no data dependency between them
There are other types of data dependencies as we will see shortly
If two instructions have a data dependency between them and they are in the pipeline, there can be a data hazard
Let’s see the definition on the next slide
MIPS Versions 0 & 1 217CS 6143
Data Hazards
Data hazards occur between two instructions which are executed close enough in time and there is writable data shared by them
That is, there is a data dependency between two instructions and the correct result will occur only if the execution is confined to sequential rather than pipelined execution, to enforce the right order of access to the shared data
The second instruction cannot be executed in a pipelined fashion
It has to wait, stall !
This is sequential execution then
MIPS Versions 0 & 1 218CS 6143
Data Hazards
If we change the instruction sequence of the previous code to include dependency, there will be data hazards
We observe that the DADD writes to R2 and the instructions below DADD read R2
R2 has data that is writable and shared by several instructions
The DADD and the remaining instructions are executed close in time
Can there be data hazards among them ?
200 LD   R1, 500(R0)
204 DADD R2, R3, R4
208 DSUB R5, R6, R2
20C XOR  R8, R9, R2
210 SLT  R11, R12, R2
214 OR   R14, R15, R2
218 SD   R2, 600(R0)
21C BEQZ R2, 5
MIPS Versions 0 & 1 219CS 6143
Data Hazards
Let’s concentrate on the DADD and the instructions that follow it
The data element in R2 is shared by all the instructions below the DADD and they are executed close in time
An instruction, I1, writes to a register and another instruction, I2, reads the same register (the data element)
I1 has to write first and then I2 has to read : There is a Read after Write (RAW) dependency
BUT, if I2 reads before I1 writes then there is a RAW hazard
Can I2 read before I1 writes ? Yes
We have to stop I2 if it tries to read R2 before the DADD writes to R2
200 LD   R1, 500(R0)
204 DADD R2, R3, R4
208 DSUB R5, R6, R2      RAW ?
20C XOR  R8, R9, R2      RAW ?
210 SLT  R11, R12, R2    RAW ?
214 OR   R14, R15, R2    RAW ?
218 SD   R2, 600(R0)     RAW ?
21C BEQZ R2, 5           RAW ?
MIPS Versions 0 & 1 220CS 6143
Data Hazards
Let’s concentrate on the DADD and the instructions that follow it
There are data dependencies, but are they all data hazards ?
Will all the instructions below the DADD try to read R2 before the DADD writes ? NO !
Soon we will see that data hazards will happen between the DADD and DSUB, XOR and SLT
DSUB, XOR and SLT will try to read R2 before the DADD writes to R2
The OR, SD and BEQZ will read R2 after the DADD writes to R2
200 LD   R1, 500(R0)
204 DADD R2, R3, R4
208 DSUB R5, R6, R2      RAW
20C XOR  R8, R9, R2      RAW
210 SLT  R11, R12, R2    RAW
214 OR   R14, R15, R2
218 SD   R2, 600(R0)
21C BEQZ R2, 5
MIPS Versions 0 & 1 221CS 6143
Data Hazards
Let’s concentrate on the DADD and the instructions that follow it
DSUB, XOR and SLT will try to read R2 before the DADD writes to R2
These data dependencies result in data hazards
This data hazard is one of three types of data hazards
An instruction, I1, writes to a register and another instruction, I2, reads the same register (the data element)
I1 has to write first and then I2 has to read : Read after Write (RAW)
If I2 reads before I1 writes there is a RAW hazard
We will stall DSUB, XOR and SLT when they try to read R2
200 LD   R1, 500(R0)
204 DADD R2, R3, R4
208 DSUB R5, R6, R2
20C XOR  R8, R9, R2
210 SLT  R11, R12, R2
214 OR   R14, R15, R2
218 SD   R2, 600(R0)
21C BEQZ R2, 5
All RAW (the instructions that read R2)
MIPS Versions 0 & 1 222CS 6143
Data Hazards
Let’s concentrate on the DADD and the instructions that follow it
Clock period            1     2     3     4     5     6     7     8     9     10
200 LD   R1, 500(R0)    IF    ID    EX    MEM   WB
204 DADD R2, R3, R4           IF    ID    EX    MEM   WB
208 DSUB R5, R6, R2                 IF    ID    Stall Stall Stall EX    MEM   WB
20C XOR  R8, R9, R2                       IF    Stall Stall Stall ID    EX    MEM
210 SLT  R11, R12, R2                           Stall Stall Stall IF    ID    EX
214 OR   R14, R15, R2                                 Stall Stall Stall IF    ID
218 SD   R2, 600(R0)                                        Stall Stall Stall IF
21C BEQZ R2, 5                                                    Stall Stall Stall
All RAW (the instructions that read R2)
Why do we stall the DSUB in the ID stage ?
We stall the DSUB for 3 clock periods and create a 3-clock period bubble that moves up the pipeline
XOR is fetched and idling in the IF stage
MIPS Versions 0 & 1 223CS 6143
Data Hazards
Let’s concentrate on the DADD and the instructions that follow it
We stalled the DSUB in the ID stage since it reads its operands in ID (as we designed in Version 1)
The DSUB reads its operands R2 and R6 in the ID stage
This is clock period 4
When will the DADD write to R2 ? In clock period 6 !
When will R2 actually get the new value ? In the beginning of the 7th clock period !
200 LD   R1, 500(R0)
204 DADD R2, R3, R4
208 DSUB R5, R6, R2
20C XOR  R8, R9, R2
210 SLT  R11, R12, R2
214 OR   R14, R15, R2
218 SD   R2, 600(R0)
21C BEQZ R2, 5
All RAW (the instructions that read R2)
[Figure: the same timing diagram as on the previous slide, with the DSUB stalled in ID for clock periods 5 to 7]
MIPS Versions 0 & 1 224CS 6143
Data Hazards
Let’s concentrate on the DADD and the instructions that follow it
Why does R2 get its new value in the beginning of the 7th clock period ?
According to the state diagram of Version 1, the DADD writes from 5.ALUoutput to its destination register in the WB stage
This is clock period 6
Why does R2 get the value in the beginning of the 7th clock period ?
As we discussed before, we clock (store on) our registers at the end of a clock period and therefore, registers change their values in the beginning of the next clock period
             Clock period 6   Clock period 7
5.ALUoutput  Result of DADD   ?
R2           ?                Result of DADD
MIPS Versions 0 & 1 225CS 6143
Data Hazards
Let’s concentrate on the DADD and the instructions that follow it
In summary, the DSUB is stalled in the ID stage for three clock periods
A 3-clock period long bubble is created and moves up the pipeline
If we show the pipeline in our notation
                        IF    ID    EX    MEM   WB
200 LD   R1, 500(R0)    1     2     3     4     5
204 DADD R2, R3, R4     2     3     4     5     6
208 DSUB R5, R6, R2     3     4/7   8     9     10
20C XOR  R8, R9, R2     4/7   8     9     10    11
210 SLT  R11, R12, R2   8     9     10    11    12
214 OR   R14, R15, R2   9     10    11    12    13
218 SD   R2, 600(R0)    10    11    12    13
21C BEQZ R2, 5          11    12    13
All RAW (the instructions that read R2)
MIPS Versions 0 & 1 226CS 6143
Pipeline Interlocks
What we are doing is that we are checking for hazard situations in the ID stage and when we recognize a hazard, we stall the instruction in the ID stage !
If an instruction does not have a hazard situation, it is allowed to proceed to the EX stage
That is, the instruction is issued to the EX stage
If the instruction has a hazard, it is stalled in the ID stage by the pipeline interlock to preserve the execution pattern
MIPS Versions 0 & 1 227CS 6143
Pipeline Interlocks
If an instruction is stalled in the ID stage, then the instruction in the IF stage is stalled
That is, the instruction behind the stalled instruction is not allowed to pass by and continue with its execution
This is called static issuing
Static issuing reduces hardware since we do not have to keep track of which instruction changed which part of the state
Because, if an instruction is stalled, it has to update the state before all instructions that follow it
MIPS Versions 0 & 1 228CS 6143
Pipeline Interlocks
If dynamic issuing is allowed, then an instruction in the IF stage would pass by the stalled instruction in the ID stage and start its EX cycle
However, dynamic issuing allows other data hazards, WAR and WAW, to happen as we will discuss later
We need to have hardware not to allow an instruction behind a stalled instruction to update the state
Can we somehow allow this instruction to proceed ?
Yes, we can allow it to generate its results
But, we have to buffer the results and write them to the destination after the stalled instruction is finished, for a correct execution pattern
We then need additional hardware to keep temporary results and keep track of instructions’ progress
MIPS Versions 0 & 1 229CS 6143
Data Hazards
Let’s concentrate on the DADD and the instructions that follow it
What if DSUB does not have a RAW hazard but XOR has ?
Clock period            1     2     3     4     5     6     7     8     9     10
200 LD   R1, 500(R0)    IF    ID    EX    MEM   WB
204 DADD R2, R3, R4           IF    ID    EX    MEM   WB
208 DSUB R5, R6, R7                 IF    ID    EX    MEM   WB
20C XOR  R8, R9, R2                       IF    ID    Stall Stall EX    MEM   WB
210 SLT  R11, R12, R2                           IF    Stall Stall ID    EX    MEM
214 OR   R14, R15, R2                                 Stall Stall IF    ID    EX
218 SD   R2, 600(R0)                                        Stall Stall IF    ID
21C BEQZ R2, 5                                                    Stall Stall IF
All RAW (the instructions that read R2)
We stall the XOR for 2 clock periods and create a 2-clock period bubble that moves up the pipeline
MIPS Versions 0 & 1 230CS 6143
Data Hazards
Let’s concentrate on the DADD and the instructions that follow it
What if DSUB does not have a RAW hazard but XOR has ?
The XOR is in ID in the 5th clock period but has to wait until the 7th clock period
If we show the pipeline in our notation
                        IF    ID    EX    MEM   WB
200 LD   R1, 500(R0)    1     2     3     4     5
204 DADD R2, R3, R4     2     3     4     5     6
208 DSUB R5, R6, R7     3     4     5     6     7
20C XOR  R8, R9, R2     4     5/7   8     9     10
210 SLT  R11, R12, R2   5/7   8     9     10    11
214 OR   R14, R15, R2   8     9     10    11    12
218 SD   R2, 600(R0)    9     10    11    12
21C BEQZ R2, 5          10    11    12
All RAW (the instructions that read R2)
MIPS Versions 0 & 1 231CS 6143
Data Hazards
Let’s concentrate on the DADD and the instructions that follow it
What if DSUB and XOR do not have a RAW hazard but SLT has ?
Clock period            1     2     3     4     5     6     7     8     9     10
200 LD   R1, 500(R0)    IF    ID    EX    MEM   WB
204 DADD R2, R3, R4           IF    ID    EX    MEM   WB
208 DSUB R5, R6, R7                 IF    ID    EX    MEM   WB
20C XOR  R8, R9, R10                      IF    ID    EX    MEM   WB
210 SLT  R11, R12, R2                           IF    ID    Stall EX    MEM   WB
214 OR   R14, R15, R2                                 IF    Stall ID    EX    MEM
218 SD   R2, 600(R0)                                        Stall IF    ID    EX
21C BEQZ R2, 5                                                    Stall IF    ID
All RAW (the instructions that read R2)
We stall the SLT for 1 clock period and create a 1-clock period bubble that moves up the pipeline
MIPS Versions 0 & 1 232CS 6143
Data Hazards
Let’s concentrate on the DADD and the instructions that follow it
What if DSUB and XOR do not have a RAW hazard but SLT has ?
If we show the pipeline in our notation
                        IF    ID    EX    MEM   WB
200 LD   R1, 500(R0)    1     2     3     4     5
204 DADD R2, R3, R4     2     3     4     5     6
208 DSUB R5, R6, R7     3     4     5     6     7
20C XOR  R8, R9, R10    4     5     6     7     8
210 SLT  R11, R12, R2   5     6/7   8     9     10
214 OR   R14, R15, R2   6/7   8     9     10    11
218 SD   R2, 600(R0)    8     9     10    11
21C BEQZ R2, 5          9     10    11
RAW (the instructions that read R2)
Pip
elin
ed M
IPS
CP
U D
esig
n :
Ver
sion
1
Haldun Hadimioglu
MIPS Versions 0 & 1 233CS 6143
Eliminating Hazards
We will eliminate delays due to RAW hazards
We will write GPR registers in the WB stage in the first half of the clock period and read GPR registers in the ID stage in the second half of the same clock period
We will add new hardware to eliminate other delays
We will reduce the amount of delay due to control hazards
By assuming a certain compiler functionality we will eliminate the control hazard delays completely
However, this compiler functionality is not acceptable in real life
It does not allow software compatibility, as we will see later in this lecture

Data Hazards
Writing to a GPR in the first half, reading the same GPR in the second half of the same clock period
Consider the timing diagram of writing to R2 in the 6th clock period again
What if we clock (store on) R2 in the middle of the 6th clock period, where there is a negative edge?
That is, what if we write not at the end of the 6th clock period but in the middle?
This is possible by using negative-edge-triggered GPR registers
So, we write from 5.ALUoutput to R2 in the middle of the clock period!

Data Hazards
Writing to a GPR in the first half, reading the same GPR in the second half of the same clock period
OK, we write in the first half. Can we read the same register in the second half?
Yes, reading means getting the value from R2 in the second half and storing it on the destination register at the end of the same clock period, when there is a positive edge
We read from GPR registers and store on temporary registers 3.A and 3.B in the ID stage
In this specific example R2 is stored on 3.B for the DSUB instruction
This will save one clock period for us
From now on the GPR registers are clocked by negative edges and the other registers are clocked by positive edges

Data Hazards
Writing to a GPR in the first half, reading the same GPR in the second half of the same clock period
Let's visualize what happens in clock periods 5, 6 and 7
[Timing diagram: Clock, 5.ALUoutput, R2 and 3.B over clock periods 5, 6 and 7; the result of the DADD appears on 5.ALUoutput, then on R2, then on 3.B]
In the 6th clock period R2 has its new value and it is transferred to 3.B
Therefore, the DSUB can be in EX in the 7th clock period to use 3.B

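The edge timing described on these slides can be checked with a small sketch. It is only an illustration, not the course's own model: time is counted in half clock periods, positive edges fall on even times, negative edges on odd times, and the function name and parameters are made up for this example.

```python
def dsub_ex_period(gpr_negative_edge: bool, wb_period: int = 6) -> int:
    """Clock period in which a dependent instruction can be in EX.

    Assumptions of this sketch: clock period p spans the half-period
    times [2*(p-1), 2*p); the producer is in WB during wb_period;
    3.B is loaded on a positive edge and is usable in the clock
    period that starts at that edge.
    """
    period_start = 2 * (wb_period - 1)        # positive edge that starts the WB period
    if gpr_negative_edge:
        r2_written = period_start + 1         # negative edge, middle of the WB period
    else:
        r2_written = period_start + 2         # positive edge, end of the WB period
    # 3.B captures R2 on the first positive edge strictly after the write.
    if r2_written % 2 == 1:
        latch_edge = r2_written + 1
    else:
        latch_edge = r2_written + 2
    return latch_edge // 2 + 1                # clock period that starts at latch_edge


if __name__ == "__main__":
    print("negative-edge GPRs:", dsub_ex_period(True))
    print("positive-edge GPRs:", dsub_ex_period(False))
```

With the DADD in WB in the 6th clock period, the negative-edge case gives EX in the 7th clock period, as on the slide; the positive-edge case costs one more clock period.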
Data Hazards
Writing to a GPR in the first half, reading the same GPR in the second half of the same clock period
Let's see the new execution flow

Clock periods 1 to 10 (stage sequence of each instruction)
200 LD   R1, 500(R0)    IF ID EX MEM WB
204 DADD R2, R3, R4     IF ID EX MEM WB
208 DSUB R5, R6, R2     IF ID Stall Stall EX MEM WB
20C XOR  R8, R9, R2     IF Stall Stall ID EX MEM
210 SLT  R11, R12, R2   Stall Stall IF ID EX
214 OR   R14, R15, R2   Stall Stall IF ID
218 SD   R2, 600(R0)    Stall Stall IF ID
21C BEQZ R2, 5          Stall Stall IF
All RAW (R2): DSUB through BEQZ

We stall the DSUB for 2 clock periods and create a 2-clock-period bubble that moves up the pipeline
We will draw short lines in the WB and ID stages to indicate that the RAW hazard has been resolved by the write-in-first-half-read-in-the-second-half feature

Data Hazards
Writing to a GPR in the first half, reading the same GPR in the second half of the same clock period
If we show the pipeline in our notation

Instruction             IF     ID     EX     MEM    WB
200 LD   R1, 500(R0)    1      2      3      4      5
204 DADD R2, R3, R4     2      3      4      5      6
208 DSUB R5, R6, R2     3      4/6    7      8      9
20C XOR  R8, R9, R2     4/6    7      8      9      10
210 SLT  R11, R12, R2   7      8      9      10     11
214 OR   R14, R15, R2   8      9      10     11     12
218 SD   R2, 600(R0)    9      10     11     12
21C BEQZ R2, 5          10     11     12
All RAW (R2): DSUB through BEQZ

We will draw short lines in the WB and ID stages to indicate that the RAW hazard has been resolved by the write-in-first-half-read-in-the-second-half feature

Data Hazards
Writing to a GPR in the first half, reading the same GPR in the second half of the same clock period
Will this help if the DSUB does not have a RAW hazard but the XOR does?
Yes!

Instruction             IF     ID     EX     MEM    WB
200 LD   R1, 500(R0)    1      2      3      4      5
204 DADD R2, R3, R4     2      3      4      5      6
208 DSUB R5, R6, R7     3      4      5      6      7
20C XOR  R8, R9, R2     4      5/6    7      8      9
210 SLT  R11, R12, R2   5/6    7      8      9      10
214 OR   R14, R15, R2   7      8      9      10     11
218 SD   R2, 600(R0)    8      9      10     11
21C BEQZ R2, 5          9      10     11
All RAW (R2): XOR through BEQZ

We saved one clock period!
Note that the GPR registers are always written in the middle of the clock period! We show the short lines when this feature helps a RAW hazard!

Data Hazards
Writing to a GPR in the first half, reading the same GPR in the second half of the same clock period
Will this help if the DSUB and the XOR do not have a RAW hazard but the SLT does?
Yes!

Instruction             IF     ID     EX     MEM    WB
200 LD   R1, 500(R0)    1      2      3      4      5
204 DADD R2, R3, R4     2      3      4      5      6
208 DSUB R5, R6, R7     3      4      5      6      7
20C XOR  R8, R9, R10    4      5      6      7      8
210 SLT  R11, R12, R2   5      6      7      8      9
214 OR   R14, R15, R2   6      7      8      9      10
218 SD   R2, 600(R0)    7      8      9      10
21C BEQZ R2, 5          8      9      10
RAW (R2): SLT through BEQZ

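The timing tables above can be reproduced with a short sketch. It is a simplified model written for this handout, not the actual control unit: there is no forwarding, an instruction waits in ID until every source register has been written, and the instruction behind it waits in IF. The names `Instr`, `schedule` and `split_write` are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str                 # mnemonic form, for display only
    dest: Optional[str]       # register written in WB (None for SD and BEQZ)
    srcs: Tuple[str, ...]     # registers read in ID


def schedule(program: List[Instr], split_write: bool = True) -> List[Tuple[int, ...]]:
    """Return (IF enter, IF leave, ID enter, ID leave, EX, MEM, WB) per instruction.

    split_write=True models write-in-the-first-half-read-in-the-second-half:
    a register can be read in ID in the same clock period as the writer's WB.
    MEM and WB numbers for SD and BEQZ are nominal (the slides leave them blank).
    """
    rows = []
    wb_of = {}                                     # register -> WB period of latest writer
    prev_id_enter = prev_id_leave = 0
    for index, ins in enumerate(program):
        if_enter = 1 if index == 0 else prev_id_enter
        if_leave = max(if_enter, prev_id_leave)    # wait while ID is still occupied
        id_enter = if_leave + 1
        ready = 0                                  # earliest period all sources are readable
        for reg in ins.srcs:
            if reg in wb_of:
                ready = max(ready, wb_of[reg] + (0 if split_write else 1))
        id_leave = max(id_enter, ready)
        ex, mem, wb = id_leave + 1, id_leave + 2, id_leave + 3
        if ins.dest is not None:
            wb_of[ins.dest] = wb
        rows.append((if_enter, if_leave, id_enter, id_leave, ex, mem, wb))
        prev_id_enter, prev_id_leave = id_enter, id_leave
    return rows


program = [
    Instr("LD R1, 500(R0)", "R1", ("R0",)),
    Instr("DADD R2, R3, R4", "R2", ("R3", "R4")),
    Instr("DSUB R5, R6, R2", "R5", ("R6", "R2")),
    Instr("XOR R8, R9, R2", "R8", ("R9", "R2")),
    Instr("SLT R11, R12, R2", "R11", ("R12", "R2")),
]

if __name__ == "__main__":
    for ins, row in zip(program, schedule(program)):
        print(f"{ins.text:<18}", row)
```

An entry such as "4/6" in our notation corresponds to an enter value of 4 and a leave value of 6 in the returned tuple.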
Data Hazards
How will we eliminate the remaining two stall cycles?
We will use forwarding, also known as bypassing, to do that
This means we have additional hardware to eliminate the stalls
The additional hardware will be new wires; also, MUX2 and MUX3 of the datapath will be larger
To visualize how we can do this, let's look at the Version 1 state diagram and the datapath for the DADD instruction

Clock periods 1 to 10 (stage sequence of each instruction)
200 LD   R1, 500(R0)    IF ID EX MEM WB
204 DADD R2, R3, R4     IF ID EX MEM WB
208 DSUB R5, R6, R2     IF ID Stall Stall EX MEM WB
20C XOR  R8, R9, R2     IF Stall Stall ID EX MEM
210 SLT  R11, R12, R2   Stall Stall IF ID EX
214 OR   R14, R15, R2   Stall Stall IF ID
218 SD   R2, 600(R0)    Stall Stall IF ID
21C BEQZ R2, 5          Stall Stall IF
All RAW (R2): DSUB through BEQZ

The new value of R2 is calculated in the EX stage in the 4th clock period for the DADD

Data Hazards
Forwarding (Bypassing)
The new value of R2 is stored on 4.ALUoutput at the end of the 4th clock period
The new value of R2 is available for use in the MEM stage at the beginning of the 5th clock period
Why don't we forward the new value on 4.ALUoutput directly from the MEM stage to the EX stage in the 5th clock period?
At the same time, why don't we allow the DSUB to read the old value of R2 into 3.B in the ID stage, so that we do not stall it in the 4th clock period?
Then, when the DSUB enters EX in the 5th clock period, it uses the forwarded value from 4.ALUoutput; it bypasses the value of 3.B

Clock periods 1 to 10 (stage sequence of each instruction)
200 LD   R1, 500(R0)    IF ID EX MEM WB
204 DADD R2, R3, R2     IF ID EX MEM WB
208 DSUB R5, R6, R2     IF ID EX MEM WB
20C XOR  R8, R9, R2     IF ID Stall EX MEM WB
210 SLT  R11, R12, R2   IF Stall ID EX MEM WB
214 OR   R14, R15, R2   Stall IF ID EX MEM
218 SD   R2, 600(R0)    Stall IF ID EX
21C BEQZ R2, 5          Stall IF ID
All RAW (R2): DSUB through BEQZ

The arrow from MEM to EX indicates forwarding

Data Hazards
Forwarding (Bypassing)
Instead of waiting for the new value of R2 to go (i) from the ALU to 4.ALUoutput, then (ii) to 5.ALUoutput and finally (iii) to R2, we forward the new value of R2 directly to the EX stage, to the input of the ALU, bypassing the value in 3.B, which holds the old R2 value
MUX3 is larger now
[Figure: ID, EX and MEM stages of the datapath; labels: MUX2, MUX3, 3.B, 3.Imm, 4.ALUoutput, ADD]

Data Hazards
Forwarding (Bypassing)
If we show the pipeline in our notation

Instruction             IF     ID     EX     MEM    WB
200 LD   R1, 500(R0)    1      2      3      4      5
204 DADD R2, R3, R4     2      3      4      5      6
208 DSUB R5, R6, R2     3      4      5      6      7
20C XOR  R8, R9, R2     4      5/6    7      8      9
210 SLT  R11, R12, R2   5/6    7      8      9      10
214 OR   R14, R15, R2   7      8      9      10     11
218 SD   R2, 600(R0)    8      9      10     11
21C BEQZ R2, 5          9      10     11
All RAW (R2): DSUB through BEQZ

The arrow from MEM to EX indicates forwarding

Data Hazards
Forwarding (Bypassing)
What can we do to eliminate the stall for the XOR?
To eliminate the stall for the XOR we will employ forwarding from the WB stage to the EX stage (as you will see on the next slide)!
If we allow the XOR to read the old value of R2 in clock period 5, it can get the new value of R2 at the beginning of the 6th clock period
In the 6th clock period, the new value of R2 is with the DADD in the WB stage, on register 5.ALUoutput
We then forward the value from 5.ALUoutput to MUX3, bypassing 3.B

Stage sequence of each instruction (before WB-to-EX forwarding is added)
200 LD   R1, 500(R0)    IF ID EX MEM WB
204 DADD R2, R3, R2     IF ID EX MEM WB
208 DSUB R5, R6, R2     IF ID EX MEM WB
20C XOR  R8, R9, R2     IF ID Stall EX MEM WB
210 SLT  R11, R12, R2   IF Stall ID EX MEM WB
214 OR   R14, R15, R2   Stall IF ID EX MEM
218 SD   R2, 600(R0)    Stall IF ID EX
21C BEQZ R2, 5          Stall IF ID
All RAW (R2): DSUB through BEQZ

Data Hazards
Forwarding (Bypassing)
Now, there is no stall!
Note the short lines in clock period 6 that indicate that write-in-first-half-read-in-the-second-half helps eliminate the stall between the DADD and the SLT

Clock periods 1 to 10 (stage sequence of each instruction)
200 LD   R1, 500(R0)    IF ID EX MEM WB
204 DADD R2, R3, R4     IF ID EX MEM WB
208 DSUB R5, R6, R2     IF ID EX MEM WB
20C XOR  R8, R9, R2     IF ID EX MEM WB
210 SLT  R11, R12, R2   IF ID EX MEM WB
214 OR   R14, R15, R2   IF ID EX MEM WB
218 SD   R2, 600(R0)    IF ID EX MEM
21C BEQZ R2, 5          IF ID EX
All RAW (R2): DSUB through BEQZ

Data Hazards
Forwarding (Bypassing)
If we show the pipeline in our notation
There is no stall!
Note the short lines in clock period 6 that indicate that write-in-first-half-read-in-the-second-half helps eliminate the stall between the DADD and the SLT

Instruction             IF     ID     EX     MEM    WB
200 LD   R1, 500(R0)    1      2      3      4      5
204 DADD R2, R3, R4     2      3      4      5      6
208 DSUB R5, R6, R2     3      4      5      6      7
20C XOR  R8, R9, R2     4      5      6      7      8
210 SLT  R11, R12, R2   5      6      7      8      9
214 OR   R14, R15, R2   6      7      8      9      10
218 SD   R2, 600(R0)    7      8      9      10
21C BEQZ R2, 5          8      9      10
All RAW (R2): DSUB through BEQZ

Data Hazards
Forwarding (Bypassing)
Until now we considered this code, where for the DSUB, XOR and SLT, R2 is the second operand register, i.e. register Rt in the R-format
What if R2 is the first operand register, register Rs, in the R-format?

Clock periods 1 to 10 (stage sequence of each instruction)
200 LD   R1, 500(R0)    IF ID EX MEM WB
204 DADD R2, R3, R4     IF ID EX MEM WB
208 DSUB R5, R6, R2     IF ID EX MEM WB
20C XOR  R8, R9, R2     IF ID EX MEM WB
210 SLT  R11, R12, R2   IF ID EX MEM WB
214 OR   R14, R15, R2   IF ID EX MEM WB
218 SD   R2, 600(R0)    IF ID EX MEM
21C BEQZ R2, 5          IF ID EX
All RAW (R2): DSUB through BEQZ

Data Hazards
Forwarding (Bypassing)
What if R2 is Rs for the DSUB, XOR and SLT in the code?
In this case we forward from 4.ALUoutput and 5.ALUoutput to MUX2, bypassing 3.A
Only the DSUB, XOR and SLT instructions will have the RAW hazard and the stall cycles will be eliminated by forwarding to MUX2

Clock periods 1 to 10 (stage sequence of each instruction)
200 LD   R1, 500(R0)    IF ID EX MEM WB
204 DADD R2, R3, R4     IF ID EX MEM WB
208 DSUB R5, R2, R7     IF ID EX MEM WB
20C XOR  R8, R2, R10    IF ID EX MEM WB
210 SLT  R11, R2, R12   IF ID EX MEM WB
214 OR   R14, R2, R15   IF ID EX MEM WB
218 SD   R2, 600(R0)    IF ID EX MEM
21C BEQZ R2, 5          IF ID EX
All RAW (R2): DSUB through BEQZ

Data Hazards
Forwarding (Bypassing)
What if R2 is Rs for the DSUB, XOR and SLT in the code?
If we show the pipeline in our notation

Instruction             IF     ID     EX     MEM    WB
200 LD   R1, 500(R0)    1      2      3      4      5
204 DADD R2, R3, R4     2      3      4      5      6
208 DSUB R5, R2, R7     3      4      5      6      7
20C XOR  R8, R2, R10    4      5      6      7      8
210 SLT  R11, R2, R12   5      6      7      8      9
214 OR   R14, R2, R15   6      7      8      9      10
218 SD   R2, 600(R0)    7      8      9      10
21C BEQZ R2, 5          8      9      10
All RAW (R2): DSUB through BEQZ

Data Hazards
MIPS forwarding (bypassing) for the general case
By using forwarding (bypassing), results that have not yet reached the destination GPR can be forwarded to the inputs of
Functional units in the ALU
Memory port 2
The zero detection unit
They bypass the inputs that are shown in the Version 1 state diagram and datapath
Remember that we forward a value when it is needed
Two exceptions are Store and Branch instructions, since they complete not in 5 clock periods but in 4 and 2 (as we will soon see), respectively!

Data Hazards
What forwarding does is that functional units in the ALU, memory port 2 and the zero detection unit bypass the registers that originally supply values
If they cannot get the new value of a register on time, the new values are forwarded from
4.ALUoutput
5.ALUoutput
5.LMD
to the inputs of
Functional units in the ALU
Memory port 2
The zero detection unit

Data Hazards
Forwarding (Bypassing)
We show the changes to the inputs of the ALU below
[Figure: ALU inputs with forwarding; labels: MUX2, MUX3, 3.A, 3.NPC, 3.B, 3.Imm, ALU (EX), 4.ALUoutput (MEM), 5.ALUoutput and 5.LMD (WB)]

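The choice that the larger MUX2 and MUX3 have to make can be sketched as a small selection function. This is an illustration under assumptions, not the Version 1 control logic: the function name, its parameters, the rule that the MEM-stage result wins over the WB-stage result, and the special treatment of R0 are all assumptions of this sketch.

```python
from typing import Optional


def select_operand_source(src_reg: str,
                          latch: str,
                          mem_dest: Optional[str],
                          wb_dest: Optional[str],
                          wb_is_load: bool) -> str:
    """Name the register that should feed one ALU input in the EX stage.

    src_reg    : register the instruction in EX wants to read
    latch      : the normal source, "3.A" (for MUX2) or "3.B" (for MUX3)
    mem_dest   : destination GPR of the ALU instruction now in MEM, or None
    wb_dest    : destination GPR of the instruction now in WB, or None
    wb_is_load : True if the instruction in WB is a Load (value is on 5.LMD)

    Assumption: a Load in MEM never matches here, because the dependent
    instruction has already been stalled in ID for one clock period.
    """
    if src_reg != "R0":                      # assumed: R0 is never forwarded
        if mem_dest == src_reg:              # most recent result wins
            return "4.ALUoutput"
        if wb_dest == src_reg:
            return "5.LMD" if wb_is_load else "5.ALUoutput"
    return latch                             # no hazard: use the value read in ID


if __name__ == "__main__":
    # DSUB R5, R6, R2 in EX (clock period 5): DADD is in MEM, LD is in WB
    print(select_operand_source("R2", "3.B", "R2", "R1", True))
    # XOR R8, R9, R2 in EX (clock period 6): DSUB is in MEM, DADD is in WB
    print(select_operand_source("R2", "3.B", "R5", "R2", False))
```

The two calls correspond to the MEM-to-EX and WB-to-EX forwarding cases discussed for the DSUB and the XOR.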
Data Hazards
Forwarding (Bypassing)
We show the changes to the inputs of Memory Port 2 below
[Figure: Memory Port 2 inputs with forwarding; labels: 4.ALUoutput, 4.B, MUX5, AB2, DB2, DB3, Memory Port 2 (MEM), 5.ALUoutput and 5.LMD (WB)]

Data Hazards
Forwarding (Bypassing)
We show the exception case for Store instructions, where the value to be written to a memory location has to be passed to a Store in the EX stage even though it is needed not in EX but in MEM
We have to have a new MUX in EX that will move data to 4.B either from 3.B or from 5.ALUoutput or 5.LMD
[Figure: new MUX6 in the EX stage; labels: 3.A, 3.NPC, 3.B, 3.Imm, MUX6 (EX), 4.ALUoutput and 4.B (MEM), 5.ALUoutput and 5.LMD (WB)]

Data Hazards
Forwarding (Bypassing)
We show the changes to the input of the zero detection circuit below
[Figure: zero detection circuit (Zero ?) with MUX7; labels: 3.A, From 4.ALUoutput, From 5.ALUoutput, From 5.LMD, 4.Cond; stages ID, EX, MEM]
Soon, when we cover control hazards, we will see that this circuit is moved to the ID stage

Data Hazards
Forwarding (Bypassing)
In summary, we have the following changes to the MIPS datapath for forwarding purposes
Three new multiplexers: MUX5, MUX6 and MUX7
MUX2 and MUX3 are larger

Data Hazards
As we said before, there are three types of data hazards
Read after write, RAW
Instruction 1 has to write and then Instruction 2 has to read: I1W - I2R
We studied it on previous slides
We need to prevent I2R - I1W
So, we stall I2 unless we can forward the value
We can use forwarding and write-in-the-first-half-read-in-the-second-half to avoid the stall in all cases except one that involves Load instructions, as described below

Data Hazards
There are two other types of data hazards
Write after read, WAR
Instruction 1 has to read and then Instruction 2 has to write: I1R - I2W
We need to prevent I2W - I1R
So, we need to stall I2
This hazard cannot occur on MIPS since all reads are early and all writes are late
This will happen when some instructions write early and some others read late
An example is an instruction that uses the autoincrement addressing mode:
ADD R1, (R2)+
This instruction does the following: R1 ← R1 + M[R2], then R2 ← R2 + 8
Often the CPU writes the new value of R2 in the MEM stage, not in the WB stage, provided that there is a separate integer ADDer
So, we write to R2 early, perhaps before a previous instruction can read it
This instruction is a typical CISC instruction
The example shows how the architecture complexity affects the hardware design, in this case pipelining!

Data Hazards
There are two other types of data hazards
Write after write, WAW
Instruction 1 has to write and then Instruction 2 has to write: I1W - I2W
We need to prevent I2W - I1W
So, we need to stall I2 to prevent a wrong value on the destination
This hazard cannot occur on MIPS since all writes are late, in the WB stage only
This will happen if more than one stage can write
Allowing writes in different stages can result in two writes to a GPR in the same clock period
The previous example can cause a WAW hazard
ADD R1, (R2)+
R1 ← R1 + M[R2], then R2 ← R2 + 8
The CPU writes the new value of R2 in the MEM stage, not in the WB stage
So, we write to R2 early, perhaps when a previous instruction is also writing to R2 at the same time
This instruction is a typical CISC instruction
The example shows how the architecture complexity affects the hardware design, in this case pipelining!

Data Hazards
There are two other types of data hazards
Write after write, WAW
The WAW hazard will also happen when an instruction is allowed to proceed even though the instruction in front of it is stalled
For example, with dynamic issuing, an instruction passes a stalled instruction, so it can write to a register that the stalled instruction may write soon!
This is a topic to deal with in later versions of the MIPS CPU!
The fourth hazard? Read after read, RAR
This is not a hazard since no value is changed by the two reads

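The four cases (RAW, WAR, WAW, RAR) can be told apart from the register fields alone. The sketch below only names the dependence between an earlier instruction I1 and a later instruction I2; whether it becomes a hazard depends on the pipeline timing discussed on the slides. The `Instr` type and the function name are hypothetical, and each instruction is assumed to write at most one register.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    dest: Optional[str]       # register written, or None
    srcs: Tuple[str, ...]     # registers read


def dependences(i1: Instr, i2: Instr) -> Set[str]:
    """Dependences of the later instruction i2 on the earlier instruction i1."""
    found = set()
    if i1.dest is not None and i1.dest in i2.srcs:
        found.add("RAW")      # I1 writes, then I2 reads
    if i2.dest is not None and i2.dest in i1.srcs:
        found.add("WAR")      # I1 reads, then I2 writes
    if i1.dest is not None and i1.dest == i2.dest:
        found.add("WAW")      # I1 writes, then I2 writes
    if set(i1.srcs) & set(i2.srcs):
        found.add("RAR")      # both only read: never a hazard
    return found


if __name__ == "__main__":
    dadd = Instr("R2", ("R3", "R4"))      # DADD R2, R3, R4
    dsub = Instr("R5", ("R6", "R2"))      # DSUB R5, R6, R2
    print(dependences(dadd, dsub))
```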
Data Hazards
Let's consider our piece of mnemonic machine language program again, where there is now a dependency between the LD and the instructions that follow it
We observe that the LD writes to R1 and the instructions below the LD read R1
The LD and the remaining instructions are executed close in time
Can there be data hazards among them?

200 LD   R1, 500(R0)
204 DADD R2, R3, R1
208 DSUB R5, R6, R1
20C XOR  R8, R9, R1
210 SLT  R11, R12, R1
214 OR   R14, R15, R1
218 SD   R1, 600(R0)
21C BEQZ R1, 5

Data Hazards
The data element in R1 is shared by all the instructions below the LD and they are executed close in time
Yes, there are data dependencies, but are they all data hazards?
Will all the instructions below the LD try to read R1 before the LD writes?
Data hazards will happen between the LD and the DADD, DSUB and XOR
The DADD, DSUB and XOR will try to read R1 before the LD writes to R1
This data hazard is the RAW hazard
We might have to stall the DADD, DSUB and XOR when they try to read R1 ????
The SLT, OR, SD and BEQZ will read R1 after the LD writes to R1
They do not have any hazard situation!!!

200 LD   R1, 500(R0)
204 DADD R2, R3, R1      RAW ?
208 DSUB R5, R6, R1      RAW ?
20C XOR  R8, R9, R1      RAW ?
210 SLT  R11, R12, R1    RAW ?
214 OR   R14, R15, R1    RAW ?
218 SD   R1, 600(R0)     RAW ?
21C BEQZ R1, 5           RAW ?

Data Hazards
Do we have to stall the DADD, DSUB and XOR when they try to read R1?
If yes, can we eliminate any possible stall by using forwarding?
Yes, we can eliminate the data hazard stalls between the LD and the DSUB and XOR!
But we cannot eliminate one stall cycle between the LD and the DADD with forwarding and write-in-the-first-half-read-in-the-second-half

200 LD   R1, 500(R0)
204 DADD R2, R3, R1
208 DSUB R5, R6, R1
20C XOR  R8, R9, R1
210 SLT  R11, R12, R1
214 OR   R14, R15, R1
218 SD   R1, 600(R0)
21C BEQZ R1, 5
All RAW (R1): DADD through BEQZ

Data Hazards
Why is it that we cannot eliminate the stall cycle between the LD and the DADD?
According to our state diagram, the LD reads the data from the memory in the MEM stage
This is clock period 4
The data will come from the memory at the end of the 4th clock period, since the memory takes one clock period to access
But the DADD needs that data from the memory at the beginning of the 4th clock period
We need to stall the DADD and forward the data from WB to EX in the 5th clock period

Clock periods 1 to 7 (stage sequence of each instruction)
200 LD   R1, 500(R0)    IF ID EX MEM WB
204 DADD R2, R3, R1     IF ID Stall EX MEM WB
RAW (R1)

Data Hazards
Why is it that we cannot eliminate the stall cycle between the LD and the DADD?

Clock periods 1 to 10 (stage sequence of each instruction)
200 LD   R1, 500(R0)    IF ID EX MEM WB
204 DADD R4, R3, R1     IF ID Stall EX MEM WB
208 DSUB R5, R6, R1     IF Stall ID EX MEM WB
20C XOR  R8, R9, R1     IF ID EX MEM WB
210 SLT  R11, R12, R1   IF ID EX MEM WB
214 OR   R14, R15, R1   IF ID EX MEM
218 SD   R1, 600(R0)    IF ID EX
21C BEQZ R1, 5          IF ID
All RAW (R1): DADD through BEQZ

Data Hazards
Why is it that we cannot eliminate the stall cycle between the LD and the DADD?
We see that the DADD is stalled to wait for the LD to read the memory
Where is the DADD stalled? In the ID stage? YES

Data Hazards
Why is it that we cannot eliminate the stall cycle between the LD and the DADD?
As mentioned before, we check for hazard situations in the ID stage, and when we recognize a hazard we stall the instruction in the ID stage!
We have static issuing
We stall the DADD due to its RAW hazard
We stall the DSUB, XOR and the others behind the DADD to keep the correct execution order

Data Hazards
Why is it that we cannot eliminate the stall cycle between the LD and the DADD?
If we show the pipeline in our notation
Note the short lines in clock period 5 that indicate that write-in-first-half-read-in-the-second-half helps eliminate the stall between the LD and the DSUB

Instruction             IF     ID     EX     MEM    WB
200 LD   R1, 500(R0)    1      2      3      4      5
204 DADD R2, R3, R1     2      3/4    5      6      7
208 DSUB R5, R6, R1     3/4    5      6      7      8
20C XOR  R8, R9, R1     5      6      7      8      9
210 SLT  R11, R12, R1   6      7      8      9      10
214 OR   R14, R15, R1   7      8      9      10     11
218 SD   R1, 600(R0)    8      9      10     11
21C BEQZ R1, 5          9      10     11
All RAW (R1): DADD through BEQZ

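The one stall cycle that remains after a Load follows from simple stage arithmetic. The sketch below is a hedged illustration for an isolated producer/consumer pair with full forwarding: it assumes an ALU result is latched at the end of EX (4.ALUoutput), a loaded value at the end of MEM (5.LMD), and that the consumer needs the value at the start of its EX stage. The function name and parameters are hypothetical.

```python
def stalls_with_forwarding(producer_is_load: bool, distance: int) -> int:
    """Stall cycles for a consumer that is `distance` instructions after its producer.

    Stage offsets are counted from the producer's IF clock period:
    ID = 1, EX = 2, MEM = 3. Stalls caused by other instructions in
    between are not modelled.
    """
    if distance < 1:
        raise ValueError("the consumer must come after the producer")
    result_offset = 3 if producer_is_load else 2   # stage at whose end the value is latched
    consumer_ex = distance + 2                     # consumer's EX offset if never stalled
    earliest_ex = result_offset + 1                # first period the value can be forwarded to EX
    return max(0, earliest_ex - consumer_ex)


if __name__ == "__main__":
    print("LD then dependent DADD :", stalls_with_forwarding(True, 1))
    print("DADD then dependent DSUB:", stalls_with_forwarding(False, 1))
```

For the LD followed directly by the DADD this gives one stall, matching the 3/4 entry in the table; for an ALU producer it gives none.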
Data Hazards
Why is it that we cannot eliminate the stall cycle between the LD and the DADD?
The stall can be avoided (the interlock for the LD situation can be eliminated) if an independent instruction, one that does not need R1, is placed between the LD and the DADD
For the first time we have an example of the importance of ordering instructions carefully
If we had a compiler that was guaranteed to find an instruction that does not depend on the LD, we would never have the Load interlock!
This is what we call the compiler scheduling an independent instruction
The instruction position following the LD is called the load delay slot, and the compiler fills the delay slot with an independent instruction
This is called delayed Load
If the compiler cannot find an independent instruction, it inserts a NOP in the load delay slot

Data Hazards
Why is it that we cannot eliminate the stall cycle between the LD and the DADD?
If the compiler changes the order of instructions to avoid stalls, to fill delay slots, then it is called pipeline scheduling or instruction scheduling
We will have more examples of how the compiler arranges the code for better pipeline efficiency throughout the semester

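Filling the load delay slot can be sketched as follows. This is a minimal illustration, not a real compiler pass: it assumes straight-line code with no control instructions, moves only ALU instructions (never Loads or Stores, so memory ordering is untouched), and checks register dependences only. The `Instr` type, the `kind` field and the function names are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str
    dest: Optional[str]       # register written, or None
    srcs: Tuple[str, ...]     # registers read
    kind: str = "alu"         # "alu", "load" or "store"


NOP = Instr("NOP", None, ())


def conflicts(a: Instr, b: Instr) -> bool:
    """True if a and b have a RAW, WAR or WAW dependence, so they cannot be swapped."""
    return ((a.dest is not None and a.dest in b.srcs) or
            (b.dest is not None and b.dest in a.srcs) or
            (a.dest is not None and a.dest == b.dest))


def fill_load_delay_slots(code: List[Instr]) -> List[Instr]:
    """Put an independent instruction, or a NOP, after each Load whose result is used next."""
    out = list(code)
    i = 0
    while i < len(out) - 1:
        ld, nxt = out[i], out[i + 1]
        if ld.kind == "load" and ld.dest in nxt.srcs:
            for j in range(i + 2, len(out)):
                cand = out[j]
                movable = (cand.kind == "alu"
                           and ld.dest not in cand.srcs
                           and not any(conflicts(mid, cand) for mid in out[i + 1:j]))
                if movable:
                    out.insert(i + 1, out.pop(j))   # hoist it into the delay slot
                    break
            else:
                out.insert(i + 1, NOP)              # nothing independent was found
        i += 1
    return out


if __name__ == "__main__":
    code = [
        Instr("LD R1, 500(R0)", "R1", ("R0",), "load"),
        Instr("DADD R2, R3, R1", "R2", ("R3", "R1")),
        Instr("DSUB R5, R6, R7", "R5", ("R6", "R7")),
    ]
    for ins in fill_load_delay_slots(code):
        print(ins.text)
```

In this example the DSUB does not need R1, so it is moved between the LD and the DADD; if every later instruction needed R1, a NOP would be inserted instead.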
Data Hazards
Delayed Loads are not practical and not used!
If delayed Loads were used, the Load interlock in hardware would be removed, since it is guaranteed that a Load is not followed by a dependent instruction
Removing the interlock is guaranteed to work only if the CPU runs new code compiled for the delayed-Load CPU
But there is a lot of software compiled years ago, and those compilers did not take this delayed-Load feature into account
The old code has a lot of LD instructions followed by dependent instructions
If we ran them on a CPU with delayed Loads (no Load interlock), the dependent instructions would get wrong data and programs would generate wrong results
This is the legacy software situation!
Our MIPS CPU will not have delayed Loads!

Control Hazards
Control hazards occur when a control instruction is executed
Control instructions are jump, jump to a procedure, branch and return from procedure
Except the branch instruction, all control instructions change the order of execution
The branch may or may not change the order of execution depending on the condition test
If the order of execution is changed, the pipeline is emptied
That is, there is a pipeline start-up
This results in a performance loss worse than the data hazard performance loss

Control Hazards
Conditional branches are especially troublesome
The order of execution may or may not be changed
So, we do not know which instruction to fetch next
Which one to fetch depends on the test: equal to zero or not equal to zero?
Note that besides comparing with zero, we also have to compute the possible branch address, the effective address, the address of the target instruction
If these two are not performed early, there is a large control hazard penalty of three clock periods
If the branch instruction does not change the order of execution, i.e. we continue with the instruction following the branch, we say the branch is not taken
If the branch instruction changes the order of execution, i.e. we continue with the instruction pointed to by the effective address, we say the branch is taken

Control Hazards
If we recall what we did earlier
Branch instructions go through stages IF, ID and EX
They actually complete the execution back in stage IF
Therefore, CPI_Branch = 4

Control Hazards
Let's take a look at the code studied earlier
Assuming that we take the branch!

600 BEQZ R8, 4
604 DADD R9, R19, R11
608 DSUB R12, R13, R14
60C XOR  R15, R16, R17
610 SLT  R18, R19, R20
614 AND  R21, R22, R23

Clock periods 1 to 9 (stage sequences as drawn)
BEQZ: IF ID EX
Stall Stall Stall
Next instruction executed: IF ID EX MEM WB

The Branch causes a pipeline start-up!
A pipeline bubble is generated

Control Hazards
If we show the pipeline in our notation
Assuming that we take the branch!
We see that we have three stall cycles if the branch is taken

Instruction               IF     ID     EX     MEM    WB
600 BEQZ R8, 4            1      2      3
604 DADD R9, R19, R11
608 DSUB R12, R13, R14
60C XOR  R15, R16, R17
610 SLT  R18, R19, R20
614 AND  R21, R22, R23    5      6      7      8      9

Control Hazards
Let's take a look at the code studied earlier
Assuming that we do not take the branch!

Clock periods 1 to 9 (stage sequence of each instruction)
600 BEQZ R8, 4            IF ID EX
604 DADD R9, R19, R11     Stall Stall Stall IF ID EX MEM WB
608 DSUB R12, R13, R14    IF ID EX MEM
60C XOR  R15, R16, R17    IF ID EX
610 SLT  R18, R19, R20    IF ID
614 AND  R21, R22, R23    IF

The Branch causes a pipeline start-up!
A pipeline bubble is generated

Control Hazards
If we show the pipeline in our notation
Assuming that we do not take the branch!
Are we fetching the DADD in the 5th clock period? If yes, why?

Instruction               IF     ID     EX     MEM    WB
600 BEQZ R8, 4            1      2      3
604 DADD R9, R19, R11     5      6      7      8      9
608 DSUB R12, R13, R14    6      7      8      9      10
60C XOR  R15, R16, R17    7      8      9      10     11
610 SLT  R18, R19, R20    8      9      10     11     12
614 AND  R21, R22, R23    9      10     11     12     13

Control Hazards
Assuming that we do not take the branch!
Why are we fetching the DADD in the 5th clock period?
Can we fetch the DADD in the 2nd clock period?
The answer is yes, if the control unit allows the completion of the fetch cycle of the DADD in the 2nd clock period
Then the DADD stays on the 2.IR register until the end of the 4th clock period and then moves to the ID stage, as will be shown soon

Control Hazards
Assuming that we do not take the branch!
But if the control unit stops the fetching of the DADD in the 2nd clock period, to save itself from a memory access that might be unnecessary if the branch is taken, then the DADD must be fetched in the 5th clock period
Why would the control unit stop fetching the DADD in the 2nd clock period?
We are asking this question because we know that decoding an instruction is very quick: just checking the Opcode bits is enough for many instructions
Thus, the control unit would know right at the beginning of the 2nd clock period that there is a Branch in the ID stage, and we can get the DADD by the end of the 2nd clock period!

Control Hazards
Assuming that we do not take the branch!
If the CPU designer decides to continue with the fetching of the DADD in the 2nd clock period

Clock periods 1 to 9 (stage sequence of each instruction)
600 BEQZ R8, 4            IF ID EX
604 DADD R9, R19, R11     IF Stall Stall ID EX MEM WB
608 DSUB R12, R13, R14    IF ID EX MEM WB
60C XOR  R15, R16, R17    IF ID EX MEM
610 SLT  R18, R19, R20    IF ID EX
614 AND  R21, R22, R23    IF ID

Control Hazards
Assuming that we do not take the branch!
If the CPU designer decides to continue with the fetching of the DADD in the 2nd clock period
If we show the pipeline in our notation
We save one clock period!

Instruction               IF     ID     EX     MEM    WB
600 BEQZ R8, 4            1      2      3
604 DADD R9, R19, R11     2/4    5      6      7      8
608 DSUB R12, R13, R14    5      6      7      8      9
60C XOR  R15, R16, R17    6      7      8      9      10
610 SLT  R18, R19, R20    7      8      9      10     11
614 AND  R21, R22, R23    8      9      10     11     12

Control Hazards Assuming that we do not take the branch !
The CPU designer might decide to design the control unit so that it aborts the fetch of the DADD in the 2nd clock period
This is a toss up for the CPU designer ! How often the branches are not taken is critical If branches are not taken often, then the designer can
design the control unit to allow fetching the DADD BUT, if we go ahead with continuing with the fetch which
causes a page-fault (the instruction is not in the memory) and we read the page of the instruction from disk and then realize the branch is taken, all this effort will be wasted !
The frequency of untaken branches depends on the application, programmer, the compiler and the instruction set !
We decide not to fetch the next instruction We do not fetch DADD !
Pip
elin
ed M
IPS
CP
U D
esig
n :
Ver
sion
1
MIPS Versions 0 & 1, slide 285, CS 6143
Control Hazards
If we summarize : if we have a control instruction, the time penalty is high
Jump, jump-to-procedure and return-from-procedure instructions require an unconditional change to the order of execution
The sooner we calculate the target instruction address, the more stall cycles we can save
But, with branches we also need to test the condition, so we need to determine two items
The target address
The condition
The sooner we calculate the target instruction address and the condition, the more stall cycles we can save
MIPS Versions 0 & 1, slide 286, CS 6143
Control Hazards
Thus, solving the branch execution problem is more difficult than the others
In fact, one can think of the jump, jump-to-procedure and return-from-procedure instructions as a special case of the branch where the condition is always true, so we have to take the jump/return
Overall, control hazards, especially branch instructions, attract a lot of interest in computer architecture research
Many journal and conference papers have been published in the last 15 years on the topic of branch penalty reduction !
MIPS Versions 0 & 1, slide 287, CS 6143
Control Hazards
Let’s change our earlier code a little
If the Branch is not taken, the target instruction is the DSUB, the instruction that follows the Branch
If the Branch is taken, the target instruction is the SLT, which is two instructions below the instruction that follows the Branch (DSUB)

200 LD R1, 500(R0)
204 DADD R2, R3, R4
208 BEQZ R18, 2
20C DSUB R5, R6, R7
210 XOR R8, R9, R10
214 SLT R11, R12, R13
218 OR R14, R15, R16
21C SD R17, 600(R0)
MIPS Versions 0 & 1, slide 288, CS 6143
Control Hazards
Assuming that we do take the branch and do not fetch the DSUB !

                        1    2    3    4    5    6    7    8    9    10   11
200 LD R1, 500(R0)      IF   ID   EX   MEM  WB
204 DADD R2, R3, R4          IF   ID   EX   MEM  WB
208 BEQZ R18, 2                   IF   ID   EX
20C DSUB R5, R6, R7
210 XOR R8, R9, R10
214 SLT R11, R12, R13                                 IF   ID   EX   MEM  WB
218 OR R14, R15, R16                                       IF   ID   EX   MEM
21C SD R17, 600(R0)                                             IF   ID   EX

A pipeline start-up is created
MIPS Versions 0 & 1, slide 289, CS 6143
Control Hazards
For the case where we take the branch, we have a pipeline start-up created in clock period 7
That is, the pipeline is emptied !
We need to reduce the penalty cycles for our pipeline
We will modify our state diagram so that Branch instructions will take two clock periods
Branch instructions will be in only IF and ID
CPI_Branch = 2
There will be only one clock period of stall after this implementation
MIPS Versions 0 & 1, slide 290, CS 6143
Control Hazards
The changes on the state diagram for the Branch instruction
As we discussed before, we need to determine the target address and the condition as early as possible
We would know we have a branch at the beginning of the ID cycle
In that case, we determine the target address and the condition in the ID stage
The target address calculation requires adding PC and (4 * Offset), for which the ID stage has an ADDer circuit now
The ADDer is accessed by the IF stage if the ID stage has a Branch
We can justify a separate ADDer in the ID stage, besides the ones in IF and EX, since there is a large Branch penalty to pay
The execution of all other non-control instructions is not affected
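The target address calculation can be sketched in a few lines. This is only an illustration, assuming (as in the example code of the slides) hexadecimal instruction addresses, 4-byte instructions and an offset that counts instructions from the instruction that follows the Branch. The function name branch_target is hypothetical.

```python
def branch_target(branch_address: int, offset: int) -> int:
    """Target of a taken branch: NPC + 4 * offset, with NPC = branch address + 4."""
    npc = branch_address + 4      # address of the instruction that follows the Branch
    return npc + 4 * offset       # the offset counts instructions, so scale it by 4 bytes


# 208 BEQZ R18, 2 : the taken target is the SLT at address 214 (hex)
print(hex(branch_target(0x208, 2)))
# 600 BEQZ R8, 4 : the taken target is the AND at address 614 (hex)
print(hex(branch_target(0x600, 4)))
```

A negative offset works the same way, so a backward branch at the bottom of a loop lands on an earlier instruction.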
MIPS Versions 0 & 1, slide 291, CS 6143
Control Hazards
The changes on the state diagram for the Branch instruction

State 0 (IF)
2.IR ← If (2.IR.opcode == Branch) then NOP else M[PC]
PC ← If ((2.IR.opcode == Branch) & (GPR[2.IR.Rs] op 0)) then (2.NPC + (4 * 2.IR.DOImm+)) else if (2.IR.opcode ≠ Branch) then PC + 4
2.NPC ← If ((2.IR.opcode == Branch) & (GPR[2.IR.Rs] op 0)) then (2.NPC + (4 * 2.IR.DOImm+)) else if (2.IR.opcode ≠ Branch) then PC + 4

State 1 (ID)
3.A ← GPR[2.IR.Rs]
3.B ← GPR[2.IR.Rt]
3.Imm ← 2.IR.DOImm+
3.IR ← 2.IR

CPI_Branch = 2
MIPS Versions 0 & 1, slide 292, CS 6143
Control Hazards
The changes to the IF and ID stages
[Datapath figure. IF stage : PC (64 bits), MUX1, an ADDer that adds 4. ID stage : 2.IR, 2.NPC, the GPR with Rs select (5 bits), Sign Extend (16 to 64 bits) producing DOImm+, the *4 unit, a second ADDer, and the Zero ? test on GPR[Rs]]
MIPS Versions 0 & 1, slide 293, CS 6143
Control Hazards
The changes to the IF and ID stages
The ADDer in the ID stage is used by MUX1 in the IF stage
This hardware works correctly when a GPR is written in the first half of the clock period and we check the GPR in the second half of the clock period to determine if it is zero !
The Zero circuit has a forwarding circuit with MUX7 that is moved to the ID stage
We have forwardings to the ID stage so that we bypass the GPR to test
These forwardings are from :
The output of the ALU (this is a new forwarding compared with slide 256)
4.ALUoutput
5.ALUoutput
5.LMD
MIPS Versions 0 & 1, slide 294, CS 6143
Control Hazards
The execution of the Branch now
Assume that the Branch is taken

                        1    2    3    4    5    6    7    8    9    10   11
200 LD R1, 500(R0)      IF   ID   EX   MEM  WB
204 DADD R2, R3, R4          IF   ID   EX   MEM  WB
208 BEQZ R18, 2                   IF   ID
20C DSUB R5, R6, R7
210 XOR R8, R9, R10
214 SLT R11, R12, R13                       IF   ID   EX   MEM  WB
218 OR R14, R15, R16                             IF   ID   EX   MEM  WB
21C SD R17, 600(R0)                                   IF   ID   EX   MEM  WB

A 1-clock-period-long bubble is created. The other stall cycle is because the Branch takes 2 clock periods
MIPS Versions 0 & 1, slide 295, CS 6143
Control Hazards
The execution of the Branch now
Assume that the Branch is taken
If we show the pipeline in our notation
It looks like there is a 2-clock-period-long bubble created on the previous slide
This is because the Branch does not have its EX cycle anymore !
Overall, there is only one stall cycle now !

                        IF   ID   EX   MEM  WB
200 LD R1, 500(R0)      1    2    3    4    5
204 DADD R2, R3, R4     2    3    4    5    6
208 BEQZ R18, 2         3    4
20C DSUB R5, R6, R7
210 XOR R8, R9, R10
214 SLT R11, R12, R13   5    6    7    8    9
218 OR R14, R15, R16    6    7    8    9    10
21C SD R17, 600(R0)     7    8    9    10
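The numbers in our notation for the taken Branch can be reproduced with a small sketch. It is a simplification, not the control unit itself : it assumes there are no data hazard stalls, that the Branch occupies only IF and ID, that a Store is listed with four stages as in our notation, and that the instructions skipped by the taken Branch are simply left out of the list. The function name and the instruction kinds are hypothetical.

```python
def schedule(program, taken_branch_index):
    """Return {instruction text: [clock period of each stage]} around one taken branch."""
    stage_count = {"branch": 2, "store": 4}   # every other kind uses all five stages
    timing = {}
    fetch = 1
    for index, (text, kind) in enumerate(program):
        stages = stage_count.get(kind, 5)
        timing[text] = list(range(fetch, fetch + stages))
        # The target is fetched only after the Branch leaves ID : one stall cycle
        fetch += 2 if index == taken_branch_index else 1
    return timing


program = [
    ("LD R1, 500(R0)", "load"),
    ("DADD R2, R3, R4", "alu"),
    ("BEQZ R18, 2", "branch"),
    ("SLT R11, R12, R13", "alu"),
    ("OR R14, R15, R16", "alu"),
    ("SD R17, 600(R0)", "store"),
]
for text, clocks in schedule(program, taken_branch_index=2).items():
    print(f"{text:20s}", clocks)
```

The SLT enters IF in clock period 5, two periods after the Branch entered IF, which is the single stall cycle.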
MIPS Versions 0 & 1, slide 296, CS 6143
Control Hazards
Can we improve the Branch hardware so that there is no stall cycle ?
We will take a look at three solutions and decide to go ahead with the last solution, Solution 3 !
Solution 1
We can eliminate the one-clock-period stall when branches are not taken by continuing the execution of the already fetched instruction that follows the branch
We discussed this before and said this is a toss-up !
MIPS Versions 0 & 1, slide 297, CS 6143
Control Hazards
Solution 2
Adding to Solution 1, we design the Branch hardware such that it assumes branch-not-taken and continues the execution
If, however, the branch is taken (we guessed wrong), we discard the instruction in the ID stage and fetch from the target address
That is, we back out and continue
Note again that Solution 2 includes Solution 1

Branch not taken
IF  ID
    IF  ID  EX  MEM ...
        IF  ID  EX ...

Branch taken (we guessed wrong)
IF  ID
    IF  (discard it) ...
        IF  ID  EX ...
MIPS Versions 0 & 1, slide 298, CS 6143
Control Hazards
Solution 2
If we guessed wrong, we pay a one-clock-period stall penalty
Otherwise, there is no stall on the pipeline !
We have to make sure that the state of the machine is not changed so that backing out is simple
For CPUs that have long pipelines this would be difficult
For the MIPS, it is not a problem
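The effect of Solution 2 on the average CPI can be estimated with a short sketch. The branch frequency (20%) and the taken fraction (60%) below are made-up illustration values, not measurements, and the model ignores every other kind of stall. Both function names are hypothetical.

```python
def cpi_always_stall(branch_fraction, base_cpi=1.0, stall=1):
    """Every Branch costs one stall cycle, taken or not."""
    return base_cpi + branch_fraction * stall


def cpi_predict_not_taken(branch_fraction, taken_fraction, base_cpi=1.0, stall=1):
    """Solution 2 : only the taken branches (the wrong guesses) cost one stall cycle."""
    return base_cpi + branch_fraction * taken_fraction * stall


# Assumed workload : 20% of the instructions are branches, 60% of them are taken
print(round(cpi_always_stall(0.20), 3))
print(round(cpi_predict_not_taken(0.20, 0.60), 3))
```

The fewer branches are taken, the closer Solution 2 gets to a CPI of 1, which is why the frequency of untaken branches matters to the designer.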
MIPS Versions 0 & 1, slide 299, CS 6143
Control Hazards
Solution 2
What if we design the Branch hardware such that it assumes branch-taken (instead of assuming branch-not-taken) ?
This is not useful for the MIPS since the target address and the test outcome are known at the same time
On CPUs where the target address is known before the test outcome, this technique can be useful
These CPUs are often CISC CPUs !
MIPS Versions 0 & 1, slide 300, CS 6143
Control Hazards
Solution 3
This is the one we will use for Version 1
The final method we will use is delayed branch, which makes use of the compiler and the hardware
In this technique, we continue the execution of the instruction(s) that follow(s) the Branch in the branch delay slot no matter what the Branch outcome is
The branch delay slot is the set of instruction positions following the branch
The length of the branch delay slot is the time penalty paid ≡ the number of stall cycles due to the Branch ≡ the amount of time we are not sure about the target instruction
For the current design it is 1 clock period
Therefore, the branch delay slot has 1 instruction

Branch Rx, Offset
Branch delay slot (one instruction long or more)
MIPS Versions 0 & 1, slide 301, CS 6143
Control Hazards
The changes on the state diagram due to delayed branches
We have to execute the instruction that follows the branch in any case

State 0 (IF)
2.IR ← M[PC]
PC ← If ((2.IR.opcode == Branch) & (GPR[2.IR.Rs] op 0)) then (PC + ((2.IR.DOImm)+ * 4)) else (PC + 4)

State 1 (ID)
3.A ← GPR[2.IR.Rs]
3.B ← GPR[2.IR.Rt]
3.Imm ← 2.IR.DOImm+
3.IR ← 2.IR

CPI_Branch = 2
MIPS Versions 0 & 1, slide 302, CS 6143
Control Hazards
Solution 3
Delayed branch means we execute the instructions in the branch delay slot no matter what the Branch outcome is
These instructions must be independent of the branch so that the program execution is correct !
For our MIPS CPU the branch delay slot is one instruction long
This is because we are not sure which instruction is the target instruction for one clock period
In the following clock period we know which instruction is the target
Then, why don’t we execute the instruction right after the Branch whether we take the branch or not ?
It should be easy to find one instruction that can be executed no matter what the Branch outcome is ????
MIPS Versions 0 & 1, slide 303, CS 6143
Control Hazards
Solution 3
Delayed branch means we execute the instructions in the branch delay slot no matter what the Branch outcome is
It is the compiler that changes the order of instructions so that after the Branch there is an independent instruction
We say the compiler schedules an instruction to the Branch delay slot
This is another example of how ordering instructions is important (needed)
MIPS Versions 0 & 1, slide 304, CS 6143
Control Hazards
Solution 3
Delayed branch means we execute the instructions in the branch delay slot no matter what the Branch outcome is
How can the compiler find an independent instruction for the MIPS CPU to place in the Branch delay slot ?
There are three possible cases
Case 1 : From before branch
It is used if the instruction before the Branch is independent of the Branch
This one always improves the performance

Original code
DADD R1, R2, R3
Bxxxx R6, 5

New code
Bxxxx R6, 5
DADD R1, R2, R3     (Branch delay slot)

The compiler realizes the DADD is independent of the Bxxxx ≡ The DADD can be executed after the Bxxxx. The compiler moves the DADD after the Bxxxx
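The independence test behind Case 1 can be sketched as a register check. This is a minimal illustration that looks only at registers : it assumes the instruction before the Branch may move into the delay slot when the Branch does not read the register that instruction writes. A real compiler checks more than this. The Instr type and the function name are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]        # register written, None if no register is written
    srcs: Tuple[str, ...]      # registers read


def can_fill_from_before(candidate: Instr, branch: Instr) -> bool:
    """Case 1 : the candidate may move after the Branch if the Branch does not read its result."""
    return candidate.dest is None or candidate.dest not in branch.srcs


dadd = Instr("DADD", "R1", ("R2", "R3"))
print(can_fill_from_before(dadd, Instr("Bxxxx", None, ("R6",))))   # the Branch tests R6
print(can_fill_from_before(dadd, Instr("Bxxxx", None, ("R1",))))   # the Branch tests R1
```

The first call corresponds to the code on this slide, where the DADD can be moved. The second call corresponds to Cases 2 and 3, where the Branch tests R1 and the compiler has to look elsewhere.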
MIPS Versions 0 & 1, slide 305, CS 6143
Control Hazards
Solution 3
Delayed branch means we execute the instructions in the branch delay slot no matter what the Branch outcome is
Case 2 : From the branch target
It is used for loops where there is a large probability that the branch will be taken
It improves the performance if the branch is taken

Original code
loop :
        DSUB R7, R8, R9
        DADD R1, R2, R3
        Bxxxx R1, (-9)₁₀

New code
        DSUB R7, R8, R9
loop :
        DADD R1, R2, R3
        Bxxxx R1, (-8)₁₀
        DSUB R7, R8, R9     (Branch delay slot)

The compiler realizes the DADD is not independent of the Bxxxx. But, the DSUB is independent of the Branch ≡ The DSUB can be executed after the Bxxxx. The compiler places a copy of the DSUB in the Branch delay slot. This will save time if we branch back to the beginning of the loop. If we exit the loop, it must be OK to execute the DSUB ! The Branch offset must be adjusted ! The code is longer !
MIPS Versions 0 & 1, slide 306, CS 6143
Control Hazards
Solution 3
Delayed branch means we execute the instructions in the branch delay slot no matter what the Branch outcome is
Case 3 : From fall through
It is used when there is a high probability that the branch will not be taken
It improves the performance if the branch is not taken

Original code
DADD R1, R2, R3
Bxxxx R1, 7
DSUB R12, R13, R14

New code
DADD R1, R2, R3
Bxxxx R1, 7
DSUB R12, R13, R14     (Branch delay slot)

The compiler realizes the DADD is not independent of the Bxxxx. But, the DSUB is independent of the Branch ≡ The DSUB can be executed right after the Bxxxx. The compiler moves the DSUB to the Branch delay slot. This will save time if the branch is not taken. It must be OK to execute the DSUB even if we take the branch !
MIPS Versions 0 & 1, slide 307, CS 6143
Control Hazards
Solution 3
Delayed branch means we execute the instructions in the branch delay slot no matter what the Branch outcome is
You might have realized that delayed branch is not practical since it requires the compiler to know that the CPU is expecting an independent instruction in the Branch delay slot
This means that old code cannot be run on this MIPS CPU, either because the old compiler did not generate the code for a CPU with a Branch delay slot, or because it did generate code with a Branch delay slot but the delay slot was more than one instruction long since it was for an old-generation MIPS CPU
This is the legacy software situation !
Today’s microprocessors do not use delayed branches because of the compatibility issue
However, academically, it is an interesting idea
MIPS Versions 0 & 1, slide 308, CS 6143
Control Hazards
Solution 3
Delayed branch means we execute the instructions in the branch delay slot no matter what the Branch outcome is
Shall we not use Solution 3 for the MIPS CPU ≡ Shall we not use delayed Branches ?
We will use delayed Branches in Version 1 for the sake of simplifying our discussion
We will eventually not use delayed Branches when we cover advanced pipelining in more advanced versions of the MIPS CPU !
When we cover advanced pipelining, we will discuss features of today’s microprocessors !
MIPS Versions 0 & 1, slide 309, CS 6143
Control Hazards
Solution 3
Let’s take a look at the execution of the following code with a taken branch
Notice the DSUB is an independent instruction in the branch delay slot
It must be OK to execute the DSUB even if we take the branch
Notice we changed the BEQZ register to R2 to show forwarding to the ID stage
The forwarding is from the EX stage to the ID stage : the output of the ALU is forwarded to the ID stage to test the result of the addition that is just performed in EX and decide whether to branch

                        IF   ID   EX   MEM  WB
200 LD R1, 500(R0)      1    2    3    4    5
204 DADD R2, R3, R4     2    3    4    5    6
208 BEQZ R2, 2          3    4
20C DSUB R5, R6, R7     4    5    6    7    8
210 XOR R8, R9, R10
214 SLT R11, R12, R13   5    6    7    8    9
218 OR R14, R15, R16    6    7    8    9    10
21C SD R17, 600(R0)     7    8    9    10
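The order in which instructions are fetched with a delayed branch can be traced with a small sketch of the PC update rule. It assumes a one-instruction delay slot, 4-byte instructions, and that the caller states which branches are taken instead of evaluating the condition. The function name and the data layout are hypothetical.

```python
def fetch_sequence(program, start, taken):
    """program: {address: branch offset or None}; taken: addresses of the taken branches."""
    fetched = []
    pc = start
    offset_in_id = None                  # offset of a taken Branch that is now in ID
    while pc in program:
        fetched.append(pc)
        # PC update : a taken Branch in ID redirects the fetch, otherwise PC + 4
        if offset_in_id is not None:
            next_pc = pc + 4 * offset_in_id
        else:
            next_pc = pc + 4
        # The instruction just fetched is in ID during the next clock period
        offset = program[pc]
        offset_in_id = offset if (offset is not None and pc in taken) else None
        pc = next_pc
    return fetched


code = {0x200: None, 0x204: None, 0x208: 2, 0x20C: None,
        0x210: None, 0x214: None, 0x218: None, 0x21C: None}
print([hex(a) for a in fetch_sequence(code, 0x200, taken={0x208})])
```

With the BEQZ at 208 taken, the DSUB at 20C in the delay slot is still fetched, the XOR at 210 is skipped and the fetch continues at the SLT at 214.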
MIPS Versions 0 & 1, slide 310, CS 6143
Summary of Version 1
We added hardware to deal with structural, data and control hazards
Still, it executes integer instructions
It issues instructions statically
Except for the branch, which is not issued and is completed in two clock periods
The branch is not issued to save time !
Because of static issuing, instructions complete in order, except for the branch, which can complete before the instructions that are issued earlier
This results in imprecise interrupts !
Only the L1 cache memories are considered
The L2 cache memories can be slower and there can be L1 cache misses

IF ID EX MEM WB (static instruction issue)
MIPS Versions 0 & 1, slide 311, CS 6143
Summary of Version 1
We realize we need to modify Version 1 so that it
Executes FP instructions
Considers all levels of the memory hierarchy
Handles interrupts better
All three are difficult problems to solve
FP operations, such as add, subtract, multiply and divide, are complex and cannot be completed in one clock period as we can with the integer add operation
The integer add is done in EX and takes one clock period
The FP add, subtract, multiply and divide will be done in EX and take multiple clock cycles !
More instructions can complete out of order
The interrupt hardware becomes even more complex
We solve one problem (executing FP instructions) but make the other problem more complex
All levels of the memory hierarchy must be considered
The cache memories, the slower main memory and the virtual memory (disk)
Interrupts can happen randomly
We also need to save the state, which is not easy for a pipelined CPU
Advanced versions will attempt to solve them
MIPS Versions 0 & 1, slide 312, CS 6143
Test Program
Determine when the execution of the second iteration ends if L1 cache memories take one clock period and there is no cache miss
Show all forwardings and write-in-the-first-half-read-in-the-second-half cases

LD R1, 500(R8)
DADD R2, R3, R1
DSUB R5, R2, R1
XOR R8, R5, R2
SLT R11, R2, R5
OR R14, R11, R15
BNEZ R14, (-7)₁₀
SD R11, 600(R14)

[Blank timing table with IF ID EX MEM WB columns for the first and the second iteration]
The answer is on the next slide
MIPS Versions 0 & 1, slide 313, CS 6143
Test Program
Determine when the execution of the second iteration ends if L1 cache memories take one clock period and there is no cache miss
Show all forwardings and write-in-the-first-half-read-in-the-second-half cases

                    First iteration                    Second iteration
                    IF     ID     EX     MEM    WB     IF     ID     EX     MEM    WB
LD R1, 500(R8)      1      2      3      4      5      10     11     12     13     14
DADD R2, R3, R1     2      3/4    5      6      7      11     12/13  14     15     16
DSUB R5, R2, R1     3/4    5      6      7      8      12/13  14     15     16     17
XOR R8, R5, R2      5      6      7      8      9      14     15     16     17     18
SLT R11, R2, R5     6      7      8      9      10     15     16     17     18     19
OR R14, R11, R15    7      8      9      10     11     16     17     18     19     20
BNEZ R14, (-7)₁₀    8      9                           17     18
SD R11, 600(R14)    9      10     11     12            18     19     20     21

All data hazards are RAW
The second iteration ends in clock period 21
MIPS Versions 0 & 1, slide 314, CS 6143
Test Program
Determine when the execution of the second iteration ends if L1 cache memories take two clock periods and there is no cache miss
Show all forwardings and write-in-the-first-half-read-in-the-second-half cases

                    First iteration                    Second iteration
                    IF     ID     EX     MEM    WB     IF     ID     EX     MEM    WB
LD R1, 500(R8)      1-2    3      4      5-6    7      17-18  19     20     21-22  23
DADD R2, R3, R1     3-4    5/6    7      8      9      19-20  21/22  23     24     25
DSUB R5, R2, R1     5-6    7      8      9      10     21-22  23     24     25     26
XOR R8, R5, R2      7-8    9      10     11     12     23-24  25     26     27     28
SLT R11, R2, R5     9-10   11     12     13     14     25-26  27     28     29     30
OR R14, R11, R15    11-12  13     14     15     16     27-28  29     30     31     32
BNEZ R14, (-7)₁₀    13-14  15                          29-30  31
SD R11, 600(R14)    15-16  17     18     19            31-32  33     34     35

All hazards pointed to by the arrows are data hazards of type RAW
The second iteration ends in clock period 35
There are structural hazards in the IF and MEM stages due to slow cache memories
We assume there is a write buffer that allows Stores to complete in one clock period
MIPS Versions 0 & 1, slide 315, CS 6143
Test Program
Determine when the execution of the second iteration ends if L1 cache memories have misses
Assume that the memory levels are as described in the unpipelined CPU case with the following additions and reminders
The bus width between the physical memory and the lowest-level cache is 8 Bytes
The instruction cache is 8 KBytes and the data cache is 16 KBytes
Both cache block sizes are 32 Bytes
Both cache memories use direct mapping
Both caches use write-back with write-allocate
Both cache memories access the needed item first
The Data Cache has two read and two write ports
The Instruction Cache has two read ports
The latency to access the L2 cache is 4 clock periods and transferring each 8-Byte content takes one clock period
The L2 cache memory can handle one miss per L1 cache memory at a time
This means that if the instruction cache and the data cache have misses at the same time, they will be handled at the same time by the L2 cache
This means the L2 cache can handle two hits at the same time
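The L1 miss timing implied by these parameters can be worked out with a short sketch. It assumes that "access the needed item first" means the stalled stage can continue as soon as the first 8-Byte transfer arrives, and that the remaining transfers follow back to back. That reading is an interpretation, and the function name is hypothetical.

```python
def l1_miss_clocks(block_bytes=32, bus_bytes=8, l2_latency=4, transfer_clocks=1):
    """Return (clocks until the needed item arrives, clocks until the whole block arrives)."""
    transfers = block_bytes // bus_bytes                      # 32 / 8 = 4 transfers per block
    needed_item = l2_latency + transfer_clocks                # latency plus the first transfer
    whole_block = l2_latency + transfers * transfer_clocks    # latency plus all the transfers
    return needed_item, whole_block


print(l1_miss_clocks())   # (5, 8)
```

Under this reading a miss holds the IF or the MEM stage for 5 clock periods, and the block is completely filled after 8 clock periods.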
MIPS Versions 0 & 1, slide 316, CS 6143
Test Program
Determine when the execution of the second iteration ends if L1 cache memories have misses
Assume that the L1 instruction and data cache memories and the physical memory have the following properties
Each Level 1 cache memory can handle only one miss at a time
A Store miss requires that the Store instruction stays in the MEM stage until the miss is handled
It cannot just store to the write buffer and then proceed
Each Level 1 cache memory can handle up to four hits while it handles a miss
An instruction that immediately follows a Load or a Store is forced to stall an extra clock period in the ID stage to make sure the access for the data element is completed
For the given code, assume the following
The first instruction occupies the leftmost 4 bytes of the top position of an instruction block
Each data element accessed is in a separate data block, and these blocks do not map to the same area in the data cache
It means each Load and Store instruction accesses a different block in each iteration
This means there will be four data cache misses in two iterations !
This is very unusual, but it is assumed here just to show an extreme case
MIPS Versions 0 & 1, slide 317, CS 6143
Test Program
We observe all 8 instructions are in one instruction cache block
There are four data accesses, each one in a separate data block, resulting in four data cache misses
Determine when the execution of the second iteration ends
Show all forwardings and write-in-the-first-half-read-in-the-second-half cases

                    First iteration                    Second iteration
                    IF     ID     EX     MEM    WB     IF     ID     EX     MEM    WB
LD R1, 500(R8)      1/5    6      7      8/12   13     18     19/24  25     26/30  31
DADD R2, R3, R1     6      7/12   13     14     15     19/24  25/30  31     32     33
DSUB R5, R2, R1     7/12   13     14     15     16     25/30  31     32     33     34
XOR R8, R5, R2      13     14     15     16     17     31     32     33     34     35
SLT R11, R2, R5     14     15     16     17     18     32     33     34     35     36
OR R14, R11, R15    15     16     17     18     19     33     34     35     36     37
BNEZ R14, (-7)₁₀    16     17                          34     35
SD R11, 600(R14)    17     18     19     20/24         35     36     37     38/42

All hazards pointed to by the arrows are data hazards of type RAW
The second iteration ends in clock period 42
There are structural hazards in the IF and MEM stages due to cache misses
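The three versions of the test program can be compared by the number of clock periods between the end of the first iteration and the end of the second iteration. The end times below are read from the last stage of the SD in each timing table. Dividing by the 8 instructions of the loop gives a rough CPI per iteration. This is only an arithmetic sketch, and the names are hypothetical.

```python
INSTRUCTIONS_PER_ITERATION = 8

# (end of the first iteration, end of the second iteration), from the SD rows
cases = {
    "L1 takes one clock period, no miss": (12, 21),
    "L1 takes two clock periods, no miss": (19, 35),
    "L1 misses": (24, 42),
}


def iteration_cost(first_end, second_end):
    """Return (clock periods per iteration, clock periods per instruction)."""
    clocks = second_end - first_end
    return clocks, clocks / INSTRUCTIONS_PER_ITERATION


for name, (first_end, second_end) in cases.items():
    clocks, cpi = iteration_cost(first_end, second_end)
    print(f"{name}: {clocks} clock periods per iteration, CPI = {cpi}")
```

The slower cache memories raise the cost of one iteration from 9 to 16 clock periods, and the misses assumed here raise it to 18.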