Chapter 5: The Processor
Husam Alzaq, Islamic University of Gaza
2009/2010
Introduction
CPU performance factors
  Instruction count
    Determined by ISA and compiler
  CPI and cycle time
    Determined by CPU hardware
We will examine two MIPS implementations
  A simplified version
  A more realistic pipelined version
Simple subset, shows most aspects
  Memory reference: lw, sw
  Arithmetic/logical: add, sub, and, or, slt
  Control transfer: beq, j
22 Chapter 5 — The Processor
The CPU
Processor (CPU): the active part of the computer, which does all the work (data manipulation and decision-making)
Datapath: portion of the processor which contains the hardware necessary to perform operations required by the processor (the brawn)
Control: portion of the processor (also in hardware) which tells the datapath what needs to be done (the brain)
Instruction Execution
PC → instruction memory, fetch instruction
Register numbers → register file, read registers
Depending on instruction class
  Use ALU to calculate
    Arithmetic result
    Memory address for load/store
    Branch target address
  Access data memory for load/store
  PC ← target address or PC + 4
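The execution flow above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: instructions are pre-decoded dictionaries rather than 32-bit words, values are not wrapped to 32 bits, and the names `step`, `imem`, `dmem` and the dictionary keys are hypothetical.

```python
# Minimal sketch of one single-cycle instruction step.
# Assumption: instructions are pre-decoded dicts, not encoded MIPS words.

def step(pc, regs, dmem, imem):
    """Execute the instruction at pc and return the next PC."""
    instr = imem[pc]                  # PC -> instruction memory: fetch
    a = regs[instr["rs"]]             # register numbers -> register file
    b = regs[instr["rt"]]
    next_pc = pc + 4                  # default: fall through to PC + 4

    op = instr["op"]
    if op in ("add", "sub"):          # ALU computes the arithmetic result
        regs[instr["rd"]] = a + b if op == "add" else a - b
    elif op == "lw":                  # ALU computes the memory address
        regs[instr["rt"]] = dmem[a + instr["imm"]]
    elif op == "sw":
        dmem[a + instr["imm"]] = b
    elif op == "beq":                 # ALU subtracts; target = PC + 4 + offset * 4
        if a - b == 0:
            next_pc = pc + 4 + (instr["imm"] << 2)
    else:
        raise ValueError(f"unsupported op: {op}")
    return next_pc
```

Every instruction class goes through the same fetch and register-read steps; only the use of the ALU, the data memory, and the PC update differ.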
Basic Instruction Cycle
CPU Overview
Multiplexers
Can't just join wires together
  Use multiplexers
Control
Question
Why do we have two separate memories, one for instructions and the other for data, in the previous figure?
Logic Design Basics
Information encoded in binary
  Low voltage = 0, High voltage = 1
  One wire per bit
  Multi-bit data encoded on multi-wire buses
Combinational element
  Operate on data
  Output is a function of input
State (sequential) elements
  Store information
Combinational Elements
AND-gate
  Y = A & B
Adder
  Y = A + B
Multiplexer
  Y = S ? I1 : I0
Arithmetic/Logic Unit
  Y = F(A, B)
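As a rough illustration, the four combinational elements can be modelled as pure Python functions whose output depends only on the current inputs. The 32-bit wrap-around and the string-valued function select `f` are assumptions made for this sketch, not part of the slide.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit buses

def and_gate(a, b):
    """Y = A & B"""
    return a & b

def adder(a, b):
    """Y = A + B, wrapped to 32 bits"""
    return (a + b) & MASK32

def mux(s, i0, i1):
    """Y = S ? I1 : I0"""
    return i1 if s else i0

def alu(f, a, b):
    """Y = F(A, B); f is a hypothetical operation name used as the select."""
    operations = {
        "and": a & b,
        "or": a | b,
        "add": (a + b) & MASK32,
        "sub": (a - b) & MASK32,
    }
    return operations[f]
```

None of these functions keeps state between calls, which is what distinguishes combinational elements from the sequential elements that follow.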
Sequential Elements
Register: stores data in a circuit
  Uses a clock signal to determine when to update the stored value
  Edge-triggered: update when Clk changes from 0 to 1
[Figure: register with inputs D and Clk and output Q, with timing diagram]
Register with write control
  Only updates on clock edge when write control input is 1
  Used when stored value is required later
[Figure: register with inputs D, Clk, and Write and output Q, with timing diagram]
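A small behavioural model may help here. The sketch below assumes a hypothetical `Register` class whose `tick` method is called with the current clock level; it updates the stored value only on a 0 to 1 transition, and only when the write input is 1.

```python
class Register:
    """Edge-triggered register with write control (behavioural sketch)."""

    def __init__(self, value=0):
        self.q = value          # stored value, visible at output Q
        self._prev_clk = 0      # previous clock level, used to detect edges

    def tick(self, clk, d, write=1):
        """Present clock level clk and input d; return the output Q."""
        rising_edge = self._prev_clk == 0 and clk == 1
        if rising_edge and write:
            self.q = d
        self._prev_clk = clk
        return self.q
```

Holding `clk` at 1 while `d` changes does not alter `q`, which mirrors the edge-triggered behaviour in the timing diagram.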
Clocking Methodology
Combinational logic transforms data during clock cycles
  Between clock edges
  Input from state elements, output to state element
  Longest delay determines clock period
Building a Datapath
Datapath
  Elements that process data and addresses in the CPU
    Registers, ALUs, mux's, memories, …
We will build a MIPS datapath incrementally
  Refining the overview design
Fetch elements
Instruction Fetch
[Figure: instruction fetch datapath. The PC is a 32-bit register; an adder increments it by 4 for the next instruction.]
R-Format Instructions
Read two register operands
Perform arithmetic/logical operation
Write register result
Load/Store Instructions
Read register operands
Calculate address using 16-bit offset
  Use ALU, but sign-extend offset
Load: Read memory and update register
Store: Write register value to memory
Branch Instructions
Read register operands
Compare operands
  Use ALU, subtract and check Zero output
Calculate target address
  Sign-extend displacement
  Shift left 2 places (word displacement)
  Add to PC + 4
    Already calculated by instruction fetch
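The branch steps above can be checked with a short sketch. The function names are hypothetical; the arithmetic (sign-extend, shift left 2, add to PC + 4, test the Zero output of a subtraction) follows the slide.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def branch_target(pc, imm16):
    """Target = (PC + 4) + (sign-extended displacement << 2)."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & MASK32

def next_pc_beq(pc, rs_val, rt_val, imm16):
    """Subtract the operands; take the branch only if the Zero output is set."""
    zero = ((rs_val - rt_val) & MASK32) == 0
    return branch_target(pc, imm16) if zero else (pc + 4) & MASK32
```

For example, a displacement of 0xFFFF (that is, -1) from a branch at 0x1000 lands back on 0x1000, because the displacement is counted in words from PC + 4.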
Branch Instructions
[Figure: branch datapath. The shift-left-2 unit just re-routes wires; sign extension replicates the sign-bit wire.]
Composing the Elements
First-cut datapath does an instruction in one clock cycle
  Each datapath element can only do one function at a time
  Hence, we need separate instruction and data memories
Use multiplexers where alternate data sources are used for different instructions
R-Type/Load/Store Datapath
Full Datapath
ALU Control
ALU used for
  Load/Store: F = add
  Branch: F = subtract
  R-type: F depends on funct field

ALU control   Function
0000          AND
0001          OR
0010          add
0110          subtract
0111          set-on-less-than
1100          NOR
Assume 2-bit ALUOp derived from opcode
  Combinational logic derives ALU control

opcode   ALUOp   Operation          funct    ALU function       ALU control
lw       00      load word          XXXXXX   add                0010
sw       00      store word         XXXXXX   add                0010
beq      01      branch equal       XXXXXX   subtract           0110
R-type   10      add                100000   add                0010
                 subtract           100010   subtract           0110
                 AND                100100   AND                0000
                 OR                 100101   OR                 0001
                 set-on-less-than   101010   set-on-less-than   0111
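The ALU control table translates directly into a lookup. The sketch below is one possible encoding; the function name `alu_control` and the dictionary are illustrative, while the bit patterns are taken from the table.

```python
# funct field -> 4-bit ALU control, for the R-type rows of the table
FUNCT_TO_ALU_CONTROL = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # subtract
    0b100100: 0b0000,  # AND
    0b100101: 0b0001,  # OR
    0b101010: 0b0111,  # set-on-less-than
}

def alu_control(alu_op, funct=0):
    """Derive the 4-bit ALU control from the 2-bit ALUOp and the funct field."""
    if alu_op == 0b00:      # lw / sw: funct is a don't-care, ALU adds
        return 0b0010
    if alu_op == 0b01:      # beq: funct is a don't-care, ALU subtracts
        return 0b0110
    if alu_op == 0b10:      # R-type: operation depends on funct
        return FUNCT_TO_ALU_CONTROL[funct]
    raise ValueError(f"unexpected ALUOp: {alu_op:02b}")
```

A funct value that is not in the table raises `KeyError` in this sketch; real hardware would simply produce some output for it.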
The Main Control Unit
Control signals derived from instruction

R-type       0          rs      rt      rd      shamt   funct
             31:26      25:21   20:16   15:11   10:6    5:0
Load/Store   35 or 43   rs      rt      address
             31:26      25:21   20:16   15:0
Branch       4          rs      rt      address
             31:26      25:21   20:16   15:0

  Bits 31:26: opcode
  rs: always read
  rt: read, except for load
  rd (R-type) or rt (load): write for R-type and load
  address: sign-extend and add
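Since every field sits at a fixed bit position, decoding is a matter of shifting and masking. The helper below is a sketch; the name `decode` and the returned dictionary layout are assumptions, while the bit ranges come from the formats above.

```python
def decode(word):
    """Split a 32-bit MIPS instruction word into its fixed-position fields."""
    return {
        "opcode": (word >> 26) & 0x3F,   # bits 31:26
        "rs": (word >> 21) & 0x1F,       # bits 25:21
        "rt": (word >> 16) & 0x1F,       # bits 20:16
        "rd": (word >> 11) & 0x1F,       # bits 15:11 (R-type only)
        "shamt": (word >> 6) & 0x1F,     # bits 10:6  (R-type only)
        "funct": word & 0x3F,            # bits 5:0   (R-type only)
        "address": word & 0xFFFF,        # bits 15:0  (load/store/branch)
    }
```

For instance, `decode(0x012A4020)` (the encoding of `add $t0, $t1, $t2`) gives opcode 0, rs 9, rt 10, rd 8, and funct 32. All fields are extracted unconditionally; the control unit decides which ones are meaningful for a given opcode.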
Datapath With Control
Controller Signal
Instruction   RegDst   ALUSrc   MemtoReg   RegWrite   MemRead   MemWrite   Branch   ALUOp1   ALUOp0
R-format      1        0        0          1          0         0          0        1        0
lw            0        1        1          1          1         0          0        0        0
sw            X        1        X          0          0         1          0        0        0
beq           X        0        X          0          0         0          1        0        1
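The control table can be expressed as a lookup keyed by opcode. This is a sketch only: the names `MAIN_CONTROL` and `main_control` are hypothetical, `None` stands for the don't-care entries (X), and ALUOp1/ALUOp0 are combined into one 2-bit value.

```python
# opcode -> control signals; None marks a don't-care (X) entry
MAIN_CONTROL = {
    0: dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
            MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),        # R-format
    35: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
             MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),       # lw
    43: dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWrite=0,
             MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),       # sw
    4: dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWrite=0,
            MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),        # beq
}

def main_control(opcode):
    """Return the control signals for an opcode in the simple subset."""
    if opcode not in MAIN_CONTROL:
        raise ValueError(f"opcode {opcode} is outside the simple subset")
    return MAIN_CONTROL[opcode]
```

Only lw asserts MemRead and only sw asserts MemWrite, so the data memory is idle for R-format and beq instructions.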
R-Type Instruction
Load Instruction
Branch-on-Equal Instruction
Mapping the Main Control Function to Gates
How do we generate all the signals?
  Simple combinational logic (truth tables)
  Use a structured two-level logic array (PLA): an array of AND gates followed by an array of OR gates. A PLA is one of the most common ways to implement a control function.
    See Appendix C, pages C-7 and C-8
We will revisit this to cover different implementation techniques (ROM, PLA, sequencer, etc.)
Implementing Jumps

Jump   2       address
       31:26   25:0

Jump uses word address
Update PC with concatenation of
  Top 4 bits of old PC
  26-bit jump address
  00
Need an extra control signal decoded from opcode
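The concatenation can be written as one mask-and-shift expression. In this sketch the function name is hypothetical, and the upper bits are taken from the already incremented PC (PC + 4), which is an assumption about what "old PC" denotes in the datapath.

```python
def jump_target(pc_plus_4, word):
    """Concatenate PC[31:28], the 26-bit jump address, and two zero bits."""
    upper = pc_plus_4 & 0xF0000000          # top 4 bits of the PC
    address26 = word & 0x03FFFFFF           # bits 25:0 of the instruction
    return upper | (address26 << 2)         # shift left 2 appends the 00
```

Because only 26 address bits are encoded, a jump can reach any word within the current 256 MB region selected by the top 4 PC bits.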
Datapath With Jumps Added
Executing different types of instructions
Which functional units are used?
An example: executing an R-type instruction
  Step #1: The instruction is fetched from the instruction memory and the PC is incremented
  Step #2: Two operands are read from the register file; the main control lines are set
  Step #3: ALU control generates the ALU codes and the ALU operates on the data read from the register file
  Step #4: The result from the ALU is written back to the register file
Functional units used by instruction class
Our Simple Control Structure
All of the logic is combinational
We wait for everything to settle down, and the right thing to be done
  ALU might not produce the "right answer" right away
  We use write signals along with the clock to determine when to write
Cycle time determined by length of the longest path
Cycle time
Performance Issues
Longest delay determines clock period
  Critical path: load instruction
  Instruction memory → register file → ALU → data memory → register file
Not feasible to vary period for different instructions
Violates design principle
  Making the common case fast
We will improve performance by pipelining
Example: Performance of single cycle Machine
Calculate cycle time assuming negligible delays except:
  Memory (200ps)
  ALU and adders (100ps)
  Register file access (50ps)
25% of the instructions are loads, 10% stores, 45% ALU, 15% branches, and 5% jumps
If you use a fixed clock cycle, determine the clock cycle
If you use a variable clock cycle, determine the average clock cycle
Which is better?
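The example can be worked through numerically. The sketch below assumes the usual functional-unit sequence for each instruction class (for example, a load uses instruction memory, register read, ALU, data memory, register write) and the mix that sums to 100% (15% branches, as on the summary slide); the variable names are illustrative.

```python
MEM, ALU, REG = 200, 100, 50   # delays in ps, from the example

# Assumed sequence of units on each class's path
PATHS = {
    "load":   [MEM, REG, ALU, MEM, REG],
    "store":  [MEM, REG, ALU, MEM],
    "alu":    [MEM, REG, ALU, REG],
    "branch": [MEM, REG, ALU],
    "jump":   [MEM],
}
MIX = {"load": 0.25, "store": 0.10, "alu": 0.45, "branch": 0.15, "jump": 0.05}

latency = {name: sum(units) for name, units in PATHS.items()}

# Fixed clock: every instruction pays for the slowest class.
fixed_cycle = max(latency.values())

# Variable clock: average of the per-class latencies, weighted by the mix.
variable_cycle = sum(MIX[name] * latency[name] for name in MIX)

speedup = fixed_cycle / variable_cycle
print(latency, fixed_cycle, round(variable_cycle, 1), round(speedup, 2))
```

Under these assumptions the fixed clock is set by the load at 600 ps, the weighted average comes to about 447.5 ps, and the variable clock would be roughly 1.34 times faster, although the slides note that a variable period is not feasible in practice.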
Single Cycle Implementation - Problems
Inefficient
  CPI is 1
  Clock cycle determined by the longest path
  Waste of resources (2 ALUs, etc.) = waste of area
Performance: Calculate cycle time assuming:
  Negligible delays except memory (200ps), ALU and adders (100ps), register file access (50ps)
  Instruction mix: 25% loads, 10% stores, 45% ALU, 15% branches, 5% jumps
Compare two implementations:
  each instruction takes 1 fixed clock cycle
  each instruction takes 1 variable-length clock cycle
Penalty seems small, but increases when FP is taken into account
A Multicycle Implementation
An implementation in which an instruction is executed in multiple cycles
Objective: To re-implement the MIPS instruction set using a multi-cycle implementation.
The benefits are
  Shared hardware
  Instructions can take a different number of cycles (reduced computing time).
A High-level view of Multicycle Datapath
A single memory unit is used for both instructions and data.
A single ALU is used rather than an ALU and two adders.
One or more registers are added after every major functional unit.
Multicycle Approach
Break up the instructions into steps, each step takes a cycle
  balance the amount of work to be done
  restrict each cycle to use only one major functional unit
  Functional units: memory, register file, and ALU
At the end of a cycle
  Use internal registers to store results between steps
Continue
Replacing the three ALUs of the single-cycle datapath (the ALU and two adders) with a single ALU means that the single ALU must accommodate all the inputs that used to go to the three different units.
Continue
Control signals:
  Write signals for the programmer-visible state units (PC, memory, register file) and for the IR
  Memory read
  ALU control: same as single cycle
  Multiplexor control: single or two control lines
Continue
Three possible sources for the PC:
  PC+4
  ALUOut: target address of the beq
  Address for jump (j)
PC write control signals:
  PCWrite: PC+4 and jump
  PCWriteCond: beq
Continue
Breaking the Instruction Execution into Clock Cycles
1. Instruction fetch step
  IR <= Memory[PC];
    MemRead
    IRWrite
    IorD = 0
  PC <= PC + 4;
    ALUSrcA = 0
    ALUSrcB = 01
    ALUOp = 00 (for add)
    PCSource = 00
    PCWrite
The increment of the PC and the instruction memory access can occur in parallel. How?
2. Instruction decode and register fetch step
Actions that are either applicable to all instructions or are not harmful
  A <= Reg[IR[25:21]];
  B <= Reg[IR[20:16]];
    A and B are overwritten on every cycle, so no write control is needed
  ALUOut <= PC + (sign-extend(IR[15:0]) << 2);
    This requires:
    ALUSrcA = 0
    ALUSrcB = 11
    ALUOp = 00 (for add)
    The branch target address will be stored in ALUOut.
The register file access and the computation of the branch target occur in parallel.
3. Execution, memory address computation, or branch completion
Memory reference:
  ALUOut <= A + sign-extend(IR[15:0]);
    ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
Arithmetic-logical instruction:
  ALUOut <= A op B;
    ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
Branch:
  if (A == B) PC <= ALUOut;
    ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01 (for subtraction), PCSource = 01, PCWriteCond
Jump:
  PC <= {PC[31:28], IR[25:0], 2'b00};
    PCSource = 10, PCWrite
4. Memory access or R-type instruction completion step
Memory reference:
  MDR <= Memory[ALUOut];
    MemRead, IorD = 1
  or
  Memory[ALUOut] <= B;
    MemWrite, IorD = 1
Arithmetic-logical instruction (R-type):
  Reg[IR[15:11]] <= ALUOut;
    RegDst = 1, RegWrite, MemtoReg = 0
5. Memory read completion step
Load:
  Reg[IR[20:16]] <= MDR;
    MemtoReg = 1, RegWrite, RegDst = 0
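The five steps can be traced with a small register-transfer sketch. It is an illustration under assumptions: one dictionary `mem` plays the single memory for instructions and data, only lw, sw, beq, j, and R-type add/sub are handled, values are not wrapped to 32 bits, and the names `run_multicycle` and `state` are hypothetical. The function returns the number of cycles used.

```python
def sign_extend16(v):
    """Interpret the low 16 bits of v as a signed number."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v

def run_multicycle(state, mem):
    """Execute one instruction as register transfers; return cycles used."""
    # Step 1: IR <= Memory[PC]; PC <= PC + 4
    ir = mem[state["PC"]]
    state["PC"] += 4
    # Step 2: A, B <= registers; ALUOut <= PC + (sign-extend(IR[15:0]) << 2)
    opcode = (ir >> 26) & 0x3F
    rs, rt, rd = (ir >> 21) & 0x1F, (ir >> 16) & 0x1F, (ir >> 11) & 0x1F
    a, b = state["Reg"][rs], state["Reg"][rt]
    alu_out = state["PC"] + (sign_extend16(ir) << 2)
    # Step 3: execution, address computation, or branch/jump completion
    if opcode == 4:                                  # beq
        if a == b:
            state["PC"] = alu_out
        return 3
    if opcode == 2:                                  # j
        state["PC"] = (state["PC"] & 0xF0000000) | ((ir & 0x03FFFFFF) << 2)
        return 3
    if opcode == 0:                                  # R-type (add/sub only)
        funct = ir & 0x3F
        if funct not in (0x20, 0x22):
            raise ValueError("only add and sub are modelled")
        alu_out = a + b if funct == 0x20 else a - b
        state["Reg"][rd] = alu_out                   # Step 4: write back
        return 4
    if opcode not in (35, 43):
        raise ValueError(f"unsupported opcode: {opcode}")
    alu_out = a + sign_extend16(ir)                  # lw/sw address
    # Step 4: memory access
    if opcode == 43:                                 # sw
        mem[alu_out] = b
        return 4
    mdr = mem[alu_out]                               # lw
    state["Reg"][rt] = mdr                           # Step 5: read completion
    return 5
```

The returned counts (5 for a load, 4 for a store or R-type, 3 for a branch or jump) follow from how many of the five steps each class needs.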
Defining the Control
Two different techniques to design the control:
  Finite state machine
  Microprogramming
Example: CPI in a Multicycle CPU
Using the SPECINT2000 instruction mix, which is: 25% loads, 10% stores, 11% branches, 2% jumps, and 52% ALU.
What is the CPI, assuming that each state in the multicycle CPU requires 1 clock cycle?
Answer:
The number of clock cycles for each instruction class is the following:
  Loads: 5
  Stores: 4
  ALU instructions: 4
  Branches: 3
  Jumps: 3
Example Continue
The CPI is given by the following:

$$\text{CPI} = \frac{\text{CPU clock cycles}}{\text{Instruction count}} = \frac{\sum_i \text{Instruction count}_i \times \text{CPI}_i}{\text{Instruction count}} = \sum_i \frac{\text{Instruction count}_i}{\text{Instruction count}} \times \text{CPI}_i$$

The ratio $\text{Instruction count}_i / \text{Instruction count}$ is simply the instruction frequency for the instruction class i. We can therefore substitute to obtain:

CPI = 0.25×5 + 0.10×4 + 0.52×4 + 0.11×3 + 0.02×3 = 4.12

This CPI is better than the worst-case CPI of 5.0 when all instructions take the same number of clock cycles.
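The weighted sum is easy to reproduce. In this sketch the dictionary names and the function `average_cpi` are illustrative; the frequencies and cycle counts are those of the example.

```python
CYCLES = {"load": 5, "store": 4, "alu": 4, "branch": 3, "jump": 3}
MIX = {"load": 0.25, "store": 0.10, "alu": 0.52, "branch": 0.11, "jump": 0.02}

def average_cpi(mix, cycles):
    """CPI = sum over classes of (instruction frequency x cycles for the class)."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("instruction frequencies must sum to 1")
    return sum(mix[name] * cycles[name] for name in mix)

print(round(average_cpi(MIX, CYCLES), 2))
```

Changing the mix shows how sensitive the result is to loads: they are the only class that needs all 5 cycles, so a load-heavy program moves the CPI toward the worst case of 5.0.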
Defining the Control (Cont.)
[Figure: the complete finite state machine control]
Finite state machine controllers are typically implemented using a block of combinational logic and a register to hold the current state.
Exceptions and Interrupts
"Unexpected" events requiring change in flow of control
  Different ISAs use the terms differently
Exception
  Arises within the CPU
    e.g., undefined opcode, overflow, syscall, …
Interrupt
  From an external I/O controller
Dealing with them without sacrificing performance is hard
5.6 Exceptions

Type of event                                   From where?   MIPS terminology
I/O device request                              External      Interrupt
Invoke the operating system from user program   Internal      Exception
Arithmetic overflow                             Internal      Exception
Using an undefined instruction                  Internal      Exception
Hardware malfunction                            Either        Exception or interrupt
How Exceptions Are Handled
To communicate the reason for an exception:
  1. a status register (called the Cause register)
  2. vectored interrupts

Exception type          Exception vector address (in hex)
Undefined instruction   C000 0000
Arithmetic overflow     C000 0020
How Control Checks for Exceptions
Assume two possible exceptions:
  Undefined instruction
  Arithmetic overflow
Continue
[Figure: the multicycle datapath with the additions needed to implement exceptions]
Continue
[Figure: the finite state machine with the additions to handle exception detection]
Handling Exceptions
In MIPS, exceptions are managed by a System Control Coprocessor (CP0)
Save PC of offending (or interrupted) instruction
  In MIPS: Exception Program Counter (EPC)
Save indication of the problem
  In MIPS: Cause register
  We'll assume 1-bit
    0 for undefined opcode, 1 for overflow
Jump to handler at 8000 0180 (hex)
An Alternate Mechanism
Vectored Interrupts
  Handler address determined by the cause
Example:
  Undefined opcode: C000 0000
  Overflow: C000 0020
  …: C000 0040
Instructions either
  Deal with the interrupt, or
  Jump to real handler
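The two mechanisms differ only in how the next PC is chosen. The sketch below is illustrative: the `cp0` dictionary and function names are hypothetical, the single handler address is assumed to be 0x8000 0180, and the vector addresses are the two listed in the example.

```python
CAUSE_UNDEFINED, CAUSE_OVERFLOW = 0, 1       # 1-bit Cause encoding assumed on the slide

SINGLE_HANDLER = 0x80000180                  # assumed single entry point
VECTORS = {                                  # cause -> handler address
    CAUSE_UNDEFINED: 0xC0000000,
    CAUSE_OVERFLOW: 0xC0000020,
}

def take_exception(cp0, offending_pc, cause):
    """Cause-register scheme: save EPC and Cause, go to the one handler."""
    cp0["EPC"] = offending_pc
    cp0["Cause"] = cause
    return SINGLE_HANDLER                    # new PC

def take_vectored_exception(cp0, offending_pc, cause):
    """Vectored scheme: the handler address itself identifies the cause."""
    cp0["EPC"] = offending_pc
    return VECTORS[cause]                    # new PC
```

With the single handler, software must read Cause to find out what happened; with vectoring, the hardware has already dispatched, and each 32-byte slot (8 instructions) either deals with the event or jumps to the real handler.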
Handler Actions
Read cause, and transfer to relevant handler
Determine action required
If restartable
  Take corrective action
  Use EPC to return to program
Otherwise
  Terminate program
  Report error using EPC, cause, …
Concluding Remarks
ISA influences design of datapath and control
Datapath and control influence design of ISA
Pipelining improves instruction throughput using parallelism
  More instructions completed per second
  Latency for each instruction not reduced
Hazards: structural, data, control