
Chapter 5: The Processor

Husam Alzaq, Islamic University of Gaza

2009/2010

Introduction (§4.1)
CPU performance factors
  Instruction count
    Determined by ISA and compiler
  CPI and cycle time
    Determined by CPU hardware
We will examine two MIPS implementations
  A simplified version
  A more realistic pipelined version
Simple subset, shows most aspects
  Memory reference: lw, sw
  Arithmetic/logical: add, sub, and, or, slt
  Control transfer: beq, j


The CPU
Processor (CPU): the active part of the computer, which does all the work (data manipulation and decision-making)
Datapath: the portion of the processor that contains the hardware necessary to perform the operations required by the processor (the brawn)
Control: the portion of the processor (also in hardware) that tells the datapath what needs to be done (the brain)


Instruction Execution
PC → instruction memory, fetch instruction
Register numbers → register file, read registers
Depending on instruction class
  Use ALU to calculate
    Arithmetic result
    Memory address for load/store
    Branch target address
  Access data memory for load/store
  PC ← target address or PC + 4


Basic Instruction Cycle


CPU Overview


Multiplexers
Can't just join wires together
  Use multiplexers


Control


Question: Why do we have two separate memories, one for instructions and the other for data, in the previous figure?


Logic Design Basics (§4.2 Logic Design Conventions)
Information encoded in binary
  Low voltage = 0, high voltage = 1
  One wire per bit
  Multi-bit data encoded on multi-wire buses
Combinational elements
  Operate on data
  Output is a function of input
State (sequential) elements
  Store information


Combinational Elements
AND gate: Y = A & B
Adder: Y = A + B
Multiplexer: Y = S ? I1 : I0
Arithmetic/Logic Unit: Y = F(A, B)
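The four combinational elements above can be modelled as pure functions, since each output depends only on the current inputs. This is a minimal sketch; the 32-bit width and the function names are assumptions made for illustration, not part of the slides.

```python
MASK32 = 0xFFFFFFFF  # assumed bus width: 32 bits


def and_gate(a: int, b: int) -> int:
    """Y = A & B"""
    return a & b


def adder(a: int, b: int) -> int:
    """Y = A + B, wrapping around at 32 bits like a fixed-width adder."""
    return (a + b) & MASK32


def mux(s: int, i0: int, i1: int) -> int:
    """Y = S ? I1 : I0"""
    return i1 if s else i0


def alu(f, a: int, b: int) -> int:
    """Y = F(A, B); F is supplied as a two-argument function."""
    return f(a, b) & MASK32
```

For example, mux(0, 5, 9) selects input I0 and mux(1, 5, 9) selects input I1.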

Sequential Elements
Register: stores data in a circuit
  Uses a clock signal to determine when to update the stored value
  Edge-triggered: update when Clk changes from 0 to 1
[Figure: register with inputs D and Clk and output Q; timing diagram of Clk, D, Q]


Sequential Elements
Register with write control
  Only updates on clock edge when write control input is 1
  Used when stored value is required later
[Figure: register with inputs D, Clk, and Write and output Q; timing diagram of Clk, Write, D, Q]
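The edge-triggered behaviour with write control can be sketched as a small class. The class name and the tick method are hypothetical; the model only assumes what the slide states: Q changes on a 0-to-1 clock transition, and only when Write is 1.

```python
class Register:
    """Edge-triggered register with a write-enable input (illustrative model)."""

    def __init__(self, value: int = 0):
        self.q = value   # stored value, visible on output Q
        self._clk = 0    # last clock level seen, used to detect a rising edge

    def tick(self, clk: int, d: int, write: int = 1) -> int:
        """Present clock level clk and input d; return the value on Q."""
        rising_edge = self._clk == 0 and clk == 1
        if rising_edge and write:
            self.q = d
        self._clk = clk
        return self.q
```

Holding the clock high while D changes leaves Q unchanged, because no new rising edge occurs.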


Clocking Methodology
Combinational logic transforms data during clock cycles
  Between clock edges
  Input from state elements, output to state element
  Longest delay determines clock period


Building a Datapath (§4.3)
Datapath
  Elements that process data and addresses in the CPU
  Registers, ALUs, mux's, memories, …
We will build a MIPS datapath incrementally
  Refining the overview design


Fetch elements


Instruction Fetch
[Figure labels: 32-bit register (PC); increment by 4 for next instruction]


R-Format Instructions
Read two register operands
Perform arithmetic/logical operation
Write register result


Load/Store Instructions
Read register operands
Calculate address using 16-bit offset
  Use ALU, but sign-extend offset
Load: Read memory and update register
Store: Write register value to memory


Branch Instructions
Read register operands
Compare operands
  Use ALU, subtract and check Zero output
Calculate target address
  Sign-extend displacement
  Shift left 2 places (word displacement)
  Add to PC + 4
    Already calculated by instruction fetch
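The target-address steps above (sign-extend, shift left 2, add to PC + 4) can be written out directly. This is a sketch under the assumption of 32-bit addresses; the helper names are made up for illustration.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(imm: int) -> int:
    """Interpret the low 16 bits of imm as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc: int, imm16: int) -> int:
    """Target = (PC + 4) + (sign-extended displacement << 2)."""
    pc_plus_4 = (pc + 4) & MASK32
    return (pc_plus_4 + (sign_extend16(imm16) << 2)) & MASK32


def next_pc(pc: int, imm16: int, a: int, b: int) -> int:
    """beq: subtract the operands and take the branch when the Zero output is set."""
    zero = ((a - b) & MASK32) == 0
    return branch_target(pc, imm16) if zero else (pc + 4) & MASK32
```

A displacement of 0xFFFF is -1 words, so a branch at address 0x1000 with that displacement targets 0x1000 itself.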


Branch Instructions
[Figure labels: shift left 2 just re-routes wires; sign-extend replicates the sign-bit wire]

Composing the Elements
First-cut data path does an instruction in one clock cycle
  Each datapath element can only do one function at a time
  Hence, we need separate instruction and data memories
Use multiplexers where alternate data sources are used for different instructions


R-Type/Load/Store Datapath


Full Datapath


ALU Control (§4.4 A Simple Implementation Scheme)
ALU used for
  Load/Store: F = add
  Branch: F = subtract
  R-type: F depends on funct field

ALU control   Function
0000          AND
0001          OR
0010          add
0110          subtract
0111          set-on-less-than
1100          NOR



ALU Control
Assume 2-bit ALUOp derived from opcode
  Combinational logic derives ALU control

opcode  ALUOp  Operation         funct   ALU function      ALU control
lw      00     load word         XXXXXX  add               0010
sw      00     store word        XXXXXX  add               0010
beq     01     branch equal      XXXXXX  subtract          0110
R-type  10     add               100000  add               0010
               subtract          100010  subtract          0110
               AND               100100  AND               0000
               OR                100101  OR                0001
               set-on-less-than  101010  set-on-less-than  0111
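The table above is a small truth table, so it can be sketched as a lookup. The function and dictionary names are illustrative assumptions; the bit patterns are taken from the table.

```python
# funct field -> 4-bit ALU control, used only when ALUOp = 10 (R-type)
FUNCT_TO_ALU_CONTROL = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # subtract
    0b100100: 0b0000,  # AND
    0b100101: 0b0001,  # OR
    0b101010: 0b0111,  # set-on-less-than
}


def alu_control(alu_op: int, funct: int = 0) -> int:
    """Derive the 4-bit ALU control from the 2-bit ALUOp and the funct field."""
    if alu_op == 0b00:      # lw / sw: funct is a don't-care, ALU adds
        return 0b0010
    if alu_op == 0b01:      # beq: funct is a don't-care, ALU subtracts
        return 0b0110
    if alu_op == 0b10:      # R-type: operation depends on funct
        return FUNCT_TO_ALU_CONTROL[funct]
    raise ValueError("ALUOp value not listed in the table")
```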


The Main Control Unit


The Main Control Unit


The Main Control Unit
Control signals derived from instruction

R-type      0         rs     rt     rd     shamt  funct
            31:26     25:21  20:16  15:11  10:6   5:0
Load/Store  35 or 43  rs     rt     address
            31:26     25:21  20:16  15:0
Branch      4         rs     rt     address
            31:26     25:21  20:16  15:0

Figure annotations:
  opcode (31:26)
  rs (25:21): always read
  rt (20:16): read, except for load
  rd (15:11) for R-type, rt (20:16) for load: write for R-type and load
  address (15:0): sign-extend and add
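The bit ranges in the formats above translate directly into shifts and masks. This is a minimal sketch; the decode function and its dictionary keys are hypothetical names chosen to match the field labels.

```python
def decode(instr: int) -> dict:
    """Split a 32-bit MIPS instruction word into the fields shown above."""
    return {
        "opcode":  (instr >> 26) & 0x3F,   # bits 31:26
        "rs":      (instr >> 21) & 0x1F,   # bits 25:21
        "rt":      (instr >> 16) & 0x1F,   # bits 20:16
        "rd":      (instr >> 11) & 0x1F,   # bits 15:11 (R-type only)
        "shamt":   (instr >> 6) & 0x1F,    # bits 10:6  (R-type only)
        "funct":   instr & 0x3F,           # bits 5:0   (R-type only)
        "address": instr & 0xFFFF,         # bits 15:0  (load/store/branch)
    }
```

All fields are extracted unconditionally, as the hardware does; the control unit decides which ones are meaningful for a given opcode.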

Datapath With Control


Controller Signal


Controller Signal

Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format     1       0       0         1         0        0         0       1       0
lw           0       1       1         1         1        0         0       0       0
sw           X       1       X         0         0        1         0       0       0
beq          X       0       X         0         0        0         1       0       1
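The control table can be sketched as a lookup keyed by opcode. The opcode values (0, 35, 43, 4) are the ones given in the instruction formats earlier; representing a don't-care (X) as None and merging ALUOp1/ALUOp0 into one 2-bit value are assumptions of this sketch.

```python
X = None  # don't-care entry in the table

SIGNAL_NAMES = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite",
                "MemRead", "MemWrite", "Branch", "ALUOp")

CONTROL_TABLE = {
    0:  (1, 0, 0, 1, 0, 0, 0, 0b10),  # R-format
    35: (0, 1, 1, 1, 1, 0, 0, 0b00),  # lw
    43: (X, 1, X, 0, 0, 1, 0, 0b00),  # sw
    4:  (X, 0, X, 0, 0, 0, 1, 0b01),  # beq
}


def main_control(opcode: int) -> dict:
    """Return the control signals for one opcode as a name -> value mapping."""
    return dict(zip(SIGNAL_NAMES, CONTROL_TABLE[opcode]))
```

For example, main_control(35)["MemRead"] is 1, since lw is the only listed instruction that reads data memory.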


R-Type Instruction


Load Instruction


Branch-on-Equal Instruction


Mapping the Main Control Function to Gates
How do we generate all the signals?
  Simple combinational logic (truth tables)
  Use a structured two-level logic array (PLA): an array of AND gates followed by an array of OR gates. A PLA is one of the most common ways to implement a control function.
See Appendix C, pages C-7 and C-8
We will revisit this to cover different implementation techniques (ROM, PLA, sequencer, etc.)




Implementing Jumps

Jump  2      address
      31:26  25:0

Jump uses word address
Update PC with concatenation of
  Top 4 bits of old PC
  26-bit jump address
  00
Need an extra control signal decoded from opcode
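The concatenation above can be sketched with a mask, a shift, and an OR. The function name is hypothetical, and pc stands for whichever PC value supplies the top four bits.

```python
def jump_target(pc: int, instr: int) -> int:
    """New PC = { top 4 bits of PC, 26-bit jump address, 00 }."""
    address26 = instr & 0x03FFFFFF        # bits 25:0 of the jump instruction
    return (pc & 0xF0000000) | (address26 << 2)
```

Shifting left by 2 appends the two zero bits, which converts the word address into a byte address.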

Datapath With Jumps Added


Executing different types of instructions
Which functional units are used?
An example: executing an R-type instruction
  Step #1: The instruction is fetched from the instruction memory and the PC is incremented
  Step #2: Two operands are read from the register file; the main control lines are set
  Step #3: ALU control generates the ALU codes, and the ALU operates on the data read from the register file
  Step #4: The result from the ALU is written back to the register file


Functional units used by instruction class


Our Simple Control Structure
All of the logic is combinational
We wait for everything to settle down, and the right thing to be done
  ALU might not produce "right answer" right away
  We use write signals along with clock to determine when to write
Cycle time determined by length of the longest path


Cycle time


Performance Issues
Longest delay determines clock period
  Critical path: load instruction
  Instruction memory → register file → ALU → data memory → register file
Not feasible to vary period for different instructions
Violates design principle
  Making the common case fast
We will improve performance by pipelining


Example: Performance of Single-Cycle Machine
Calculate cycle time assuming negligible delays except:
  Memory (200ps)
  ALU and adders (100ps)
  Register file access (50ps)
25% of the instructions are loads, 10% stores, 45% ALU, 15% branches, and 5% jumps
If you use a fixed clock cycle, determine the clock cycle
If you use a variable clock cycle, determine the clock cycle
Which is better?
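One way to work the example is sketched below. The list of units each instruction class passes through is an assumption based on the load critical path given under Performance Issues (other classes skip the units they do not need), and the mix uses 15% branches, as on the Problems slide that follows, so that the percentages sum to 100.

```python
MEM, ALU, REG = 200, 100, 50  # delays in ps, from the example

# Assumed sequence of units on each class's path through the datapath
PATHS = {
    "load":   [MEM, REG, ALU, MEM, REG],
    "store":  [MEM, REG, ALU, MEM],
    "alu":    [MEM, REG, ALU, REG],
    "branch": [MEM, REG, ALU],
    "jump":   [MEM],
}
MIX_PERCENT = {"load": 25, "store": 10, "alu": 45, "branch": 15, "jump": 5}

latency = {name: sum(path) for name, path in PATHS.items()}

# Fixed clock: every instruction takes as long as the slowest one
fixed_cycle = max(latency.values())

# Variable clock: average latency weighted by instruction frequency
variable_cycle = sum(MIX_PERCENT[n] * latency[n] for n in MIX_PERCENT) / 100

print(fixed_cycle, variable_cycle)
```

Under these assumptions the fixed clock is set by the load path, and the variable clock comes out faster on average by a factor of about 1.34.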


Single Cycle Implementation - Problems
Inefficient
  CPI is 1
  Clock cycle determined by the longest path
  Waste of resources (2 ALUs, etc.) = waste of area
Performance: Calculate cycle time assuming:
  Negligible delays except memory (200ps), ALU and adders (100ps), register file access (50ps)
  Instruction mix: 25% loads, 10% stores, 45% ALU, 15% branches, 5% jumps
Compare two implementations:
  each instruction takes 1 fixed clock cycle
  each instruction takes 1 variable-length clock cycle
Penalty seems small, but increases when FP taken into account

A Multicycle Implementation (§5.5)
An implementation in which an instruction is executed in multiple cycles
Objective: To re-implement the MIPS instruction set using a multi-cycle implementation.
The benefits are
  Shared hardware
  Instructions can take a different number of cycles (reduced computing time).


A High-level View of the Multicycle Datapath
A single memory unit is used for both instructions and data.
A single ALU is used rather than an ALU and two adders.
One or more registers are added after every major functional unit.

Multicycle Approach
Break up the instructions into steps, each step takes a cycle
  balance the amount of work to be done
  restrict each cycle to use only one major functional unit
  Functional units: memory, register file, and ALU
At the end of a cycle
  Use internal registers to store results between steps

Continue
Replacing the three ALUs of the single-cycle datapath by a single ALU means that the single ALU must accommodate all the inputs that used to go to the three different ALUs.

Continue
Control signals:
  Write signals for the programmer-visible state units (PC, memory, register file) and for the IR
  Memory read
  ALU control: same as single cycle
  Multiplexor control: one or two control lines

Continue
PC write control signals:
  PCWrite: PC + 4 and jump
  PCWriteCond: beq
Three possible sources for the PC:
  PC + 4
  ALUOut: address of the beq target
  Address for jump (j)

Continue

Breaking the Instruction Execution into Clock Cycles
1. Instruction fetch step
  IR <= Memory[PC];
  PC <= PC + 4;

IR <= Memory[PC]; requires:
  MemRead
  IRWrite
  IorD = 0
PC <= PC + 4; requires:
  ALUSrcA = 0
  ALUSrcB = 01
  ALUOp = 00 (for add)
  PCSource = 00
  PCWrite
The increment of the PC and the instruction memory access can occur in parallel. How?

Breaking the Instruction Execution into Clock Cycles
2. Instruction decode and register fetch step
Actions that are either applicable to all instructions or are not harmful
  A <= Reg[IR[25:21]];
  B <= Reg[IR[20:16]];
  ALUOut <= PC + (sign-extend(IR[15:0]) << 2);

A <= Reg[IR[25:21]]; B <= Reg[IR[20:16]];
  Since A and B are overwritten on every cycle, nothing more is needed: done
ALUOut <= PC + (sign-extend(IR[15:0]) << 2); requires:
  ALUSrcA = 0
  ALUSrcB = 11
  ALUOp = 00 (for add)
  The branch target address will be stored in ALUOut.
The register file access and the computation of the branch target occur in parallel.

Breaking the Instruction Execution into Clock Cycles
3. Execution, memory address computation, or branch completion
Memory reference:
  ALUOut <= A + sign-extend(IR[15:0]);
  ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
Arithmetic-logical instruction:
  ALUOut <= A op B;
  ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
Branch:
  if (A == B) PC <= ALUOut;
  ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01 (for subtraction), PCSource = 01, PCWriteCond
Jump:
  PC <= { PC[31:28], (IR[25:0], 2'b00) };
  PCSource = 10, PCWrite

Breaking the Instruction Execution into Clock Cycles
4. Memory access or R-type instruction completion step
Memory reference:
  MDR <= Memory[ALUOut];   (MemRead)
  or
  Memory[ALUOut] <= B;     (MemWrite)
  IorD = 1
Arithmetic-logical instruction (R-type):
  Reg[IR[15:11]] <= ALUOut;   RegDst = 1, RegWrite, MemtoReg = 0
5. Memory read completion step
Load:
  Reg[IR[20:16]] <= MDR;   MemtoReg = 1, RegWrite, RegDst = 0
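The five steps can be traced in software with a small sketch that follows the register-transfer statements one by one and reports how many steps each instruction used. This is an illustration only: run_one is a hypothetical name, memory is assumed to be a dictionary of words keyed by byte address, and only add, sub, and, or, lw, sw, beq, and j are handled.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(x: int) -> int:
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def run_one(pc: int, mem: dict, reg: list) -> tuple:
    """Execute one instruction; return (new PC, number of steps used)."""
    # Step 1: instruction fetch
    ir = mem[pc]
    pc = (pc + 4) & MASK32
    # Step 2: decode and register fetch (done for every instruction)
    op = ir >> 26
    rs, rt, rd = (ir >> 21) & 31, (ir >> 16) & 31, (ir >> 11) & 31
    a, b = reg[rs], reg[rt]
    imm = sign_extend16(ir)
    alu_out = (pc + (imm << 2)) & MASK32       # branch target, harmless if unused
    # Step 3: execution, address computation, or branch/jump completion
    if op == 2:                                 # j
        return (pc & 0xF0000000) | ((ir & 0x03FFFFFF) << 2), 3
    if op == 4:                                 # beq
        return (alu_out if a == b else pc), 3
    if op == 0:                                 # R-type
        results = {0x20: a + b, 0x22: a - b, 0x24: a & b, 0x25: a | b}
        alu_out = results[ir & 0x3F] & MASK32
        reg[rd] = alu_out                       # Step 4: R-type completion
        return pc, 4
    alu_out = (a + imm) & MASK32                # memory address
    if op == 43:                                # sw
        mem[alu_out] = b                        # Step 4: memory write
        return pc, 4
    if op == 35:                                # lw
        mdr = mem[alu_out]                      # Step 4: memory read
        reg[rt] = mdr                           # Step 5: read completion
        return pc, 5
    raise ValueError("opcode not covered by this sketch")
```

The step counts returned here (5 for loads, 4 for stores and R-type, 3 for branches and jumps) follow from which steps each class needs.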

Breaking the Instruction Execution into Clock Cycles

Defining the Control
Two different techniques to design the control:
  Finite state machine
  Microprogramming

Example: CPI in a Multicycle CPU
Using the SPECINT2000 instruction mix, which is: 25% load, 10% store, 11% branches, 2% jumps, and 52% ALU.
What is the CPI, assuming that each state in the multicycle CPU requires 1 clock cycle?
Answer:
The number of clock cycles for each instruction class is the following:
  Load: 5
  Stores: 4
  ALU instruction: 4
  Branches: 3
  Jumps: 3

Example (Continued)
The CPI is given by the following:

$$\mathrm{CPI} = \frac{\text{CPU clock cycles}}{\text{Instruction count}} = \frac{\sum_i \text{Instruction count}_i \times \mathrm{CPI}_i}{\text{Instruction count}} = \sum_i \frac{\text{Instruction count}_i}{\text{Instruction count}} \times \mathrm{CPI}_i$$

The ratio

$$\frac{\text{Instruction count}_i}{\text{Instruction count}}$$

is simply the instruction frequency for the instruction class i. We can therefore substitute to obtain:

CPI = 0.25×5 + 0.10×4 + 0.52×4 + 0.11×3 + 0.02×3 = 4.12

This CPI is better than the worst-case CPI of 5.0 when all instructions take the same number of clock cycles.
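The weighted sum can be checked with a few lines of code. The dictionary names are illustrative; the frequencies are kept as integer percentages so that the arithmetic is exact until the final division.

```python
MIX_PERCENT = {"load": 25, "store": 10, "alu": 52, "branch": 11, "jump": 2}
CYCLES = {"load": 5, "store": 4, "alu": 4, "branch": 3, "jump": 3}


def cpi(mix_percent: dict, cycles: dict) -> float:
    """Sum over classes of (instruction frequency) x (cycles for that class)."""
    weighted = sum(mix_percent[c] * cycles[c] for c in mix_percent)
    return weighted / sum(mix_percent.values())


print(cpi(MIX_PERCENT, CYCLES))
```

Changing the mix shows how sensitive the CPI is to the share of loads, the only class that needs five cycles.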

Defining the Control (Cont.)

Defining the Control (Cont.)
The complete finite state machine control

Defining the Control (Cont.)
Finite state machine controllers are typically implemented using a block of combinational logic and a register to hold the current state.
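That structure, a combinational next-state function plus a state register, can be sketched as follows. The state names are hypothetical labels for the execution steps described earlier, not the state numbers of the original figure, and the control outputs of each state are omitted.

```python
def next_state(state: str, instr_class: str) -> str:
    """Combinational block: next state from current state and instruction class."""
    if state == "fetch":
        return "decode"
    if state == "decode":
        return {"load": "mem_addr", "store": "mem_addr", "alu": "execute",
                "branch": "branch_done", "jump": "jump_done"}[instr_class]
    if state == "mem_addr":
        return "mem_read" if instr_class == "load" else "mem_write"
    if state == "mem_read":
        return "write_back"
    if state == "execute":
        return "alu_done"
    return "fetch"  # every completion state returns to fetch


class Controller:
    """State register: updated once per clock from the combinational block."""

    def __init__(self):
        self.state = "fetch"

    def clock(self, instr_class: str) -> None:
        self.state = next_state(self.state, instr_class)


def cycles_for(instr_class: str) -> int:
    """Count clock cycles until the controller is back in the fetch state."""
    ctrl, count = Controller(), 0
    while True:
        ctrl.clock(instr_class)
        count += 1
        if ctrl.state == "fetch":
            return count
```

Counting clocks this way reproduces the per-class cycle counts used in the CPI example: 5 for loads, 4 for stores and ALU instructions, 3 for branches and jumps.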

Exceptions and Interrupts (§5.6 Exceptions)
"Unexpected" events requiring change in flow of control
  Different ISAs use the terms differently
Exception
  Arises within the CPU
  e.g., undefined opcode, overflow, syscall, …
Interrupt
  From an external I/O controller
Dealing with them without sacrificing performance is hard


5.6 Exceptions
Exceptions
Interrupts

Type of event                                  From where?  MIPS terminology
I/O device request                             External     Interrupt
Invoke the operating system from user program  Internal     Exception
Arithmetic overflow                            Internal     Exception
Using an undefined instruction                 Internal     Exception
Hardware malfunction                           Either       Exception or interrupt

How Exceptions Are Handled
To communicate the reason for an exception:
  1. a status register (called the Cause register)
  2. vectored interrupts

Exception type         Exception vector address (in hex)
Undefined instruction  C000 0000
Arithmetic overflow    C000 0020

How Control Checks for Exceptions
Assume two possible exceptions:
  Undefined instruction
  Arithmetic overflow

Continue
The multicycle datapath with the additions needed to implement exceptions

Continue
The finite state machine with the additions to handle exception detection

Handling Exceptions
In MIPS, exceptions managed by a System Control Coprocessor (CP0)
Save PC of offending (or interrupted) instruction
  In MIPS: Exception Program Counter (EPC)
Save indication of the problem
  In MIPS: Cause register
  We'll assume 1-bit
    0 for undefined opcode, 1 for overflow
Jump to handler at 8000 0180
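The three actions (save the PC in EPC, record the cause, jump to the handler) can be sketched as one state update. The dictionary-based processor state and the function name are assumptions for illustration, and the handler address is taken to be 0x8000 0180.

```python
HANDLER_ADDRESS = 0x80000180   # single entry point for all exceptions

CAUSE_UNDEFINED_OPCODE = 0     # 1-bit Cause encoding assumed on the slide
CAUSE_OVERFLOW = 1


def take_exception(state: dict, offending_pc: int, cause: int) -> dict:
    """Return the new processor state after an exception is detected."""
    new_state = dict(state)
    new_state["EPC"] = offending_pc      # where to resume or what to report
    new_state["Cause"] = cause           # why the exception happened
    new_state["PC"] = HANDLER_ADDRESS    # transfer control to the handler
    return new_state
```

The function returns a fresh dictionary rather than mutating its argument, which keeps the before and after states available for inspection.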


An Alternate Mechanism
Vectored Interrupts
  Handler address determined by the cause
Example:
  Undefined opcode: C000 0000
  Overflow: C000 0020
  …: C000 0040
Instructions either
  Deal with the interrupt, or
  Jump to real handler
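With vectored interrupts the handler address is computed from the cause instead of being fixed. The sketch below assumes the causes are numbered 0, 1, 2, … in the order listed in the example, which matches the 0x20 spacing of the addresses; the names are hypothetical.

```python
VECTOR_BASE = 0xC0000000
VECTOR_SPACING = 0x20   # 32 bytes, room for eight 4-byte instructions per entry


def handler_address(cause_number: int) -> int:
    """Address of the vector entry for the given cause number."""
    return VECTOR_BASE + cause_number * VECTOR_SPACING
```

Eight instructions is little space, which is why an entry either deals with the interrupt directly or jumps to the real handler.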


Handler Actions
Read cause, and transfer to relevant handler
Determine action required
If restartable
  Take corrective action
  Use EPC to return to program
Otherwise
  Terminate program
  Report error using EPC, cause, …


Concluding Remarks (§4.14)
ISA influences design of datapath and control
Datapath and control influence design of ISA
Pipelining improves instruction throughput using parallelism
  More instructions completed per second
  Latency for each instruction not reduced
Hazards: structural, data, control

