Computer Architecture
• We will use a quantitative approach to analyze architectures and potential improvements and see how well they work (if at all)
  – We study RISC instruction sets to promote instruction-level, block-level and thread-level parallelism
  – Pipelining, superscalar, branch speculation, vector processing, multi-core & parallel processing
  – Out-of-order completion architectures
  – Compiler optimizations
  – Improving cache performance (and virtual memory performance if time permits)
  – Early on, we concentrate on the MIPS 5-stage pipeline; later we will also look at other approaches, including the Pentium processor

Performance Measures
• Many different values can be used
  – MIPS (millions of instructions per second), MegaFLOPS – misleading values
  – Clock speed – does not take into account parallelism/pipelining, stalls, cache, etc.
  – Execution time – we use this to compare benchmarks, but we have to make sure the benchmarks were run under equal conditions (loaded system, unloaded system, etc.)
  – Throughput – number of programs per unit of time, possibly useful for servers
  – CPU time, user CPU time, system CPU time
  – CPU performance = 1 / execution time
• What does it really mean for one computer to be faster than another?
  – If we use a benchmark suite and one computer consistently outperforms the other, this is useful; otherwise we have to take into account the types of programs where one computer was better than the other

SPEC2006 Benchmarks

Design Concepts
• Take advantage of parallelism
  – There are many opportunities for parallelism in code
    • Use multiple hardware components (ALU functional units, register ports, caches/memory modules, disk drive access, etc.)
    • Distribute instructions to hardware components in an overlapped (pipelined) or distributed (parallel) fashion
    • Use hardware and software approaches
• Principle of locality of reference
  – Design memory systems to support this aspect of program and data access
• Focus on the common case
  – Amdahl’s Law (next slide) demonstrates that minor improvements to the common case are usually more useful than large improvements to rare cases
  – Find ways to enhance the hardware for common cases over rare cases

Amdahl’s Law
• When comparing two systems, we view the speedup as
  – CPU time of old system / CPU time of new system
    • E.g., old system takes 10.5, new system takes 5.25, speedup = 10.5 / 5.25 = 2
• Speedup of one enhancement
  – 1 / (1 – F + F / S)
    • F = fraction of the time the enhancement can be used
    • S = the speedup of the enhancement itself (that is, how much faster the computer runs when the enhancement is in use)
  – Example: an integer processor performs FP operations in software routines, a benchmark consists of 14% FP operations, and a co-processor could perform all FP operations 4 times faster; if we add the co-processor, our speedup is
    • 1 / (1 - .14 + .14 / 4) = 1.12, or a 12% speedup
• Why does Amdahl’s Law promote the “common case”?
  – Since we have a reciprocal, the smaller the denominator, the greater the speedup
  – The denominator subtracts F from 1 and adds F / S, so –F will have a larger impact than F / S
  – Even an unlimited S cannot push the speedup past 1 / (1 – F), so a larger F matters more than a larger S
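The formula can be checked with a few lines of Python. This is a minimal sketch; the function name `amdahl_speedup` and the variable names are ours, not from the slides.

```python
def amdahl_speedup(fraction: float, enhancement_speedup: float) -> float:
    """Overall speedup 1 / (1 - F + F / S) from Amdahl's Law."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if enhancement_speedup <= 0:
        raise ValueError("enhancement speedup must be positive")
    return 1.0 / (1.0 - fraction + fraction / enhancement_speedup)


# FP co-processor example from the slide: F = 0.14, S = 4
coprocessor = amdahl_speedup(0.14, 4)
print(f"co-processor speedup: {coprocessor:.2f}")  # 1.12

# Limit as S grows without bound: 1 / (1 - F)
bound = 1.0 / (1.0 - 0.14)
print(f"best possible speedup for F = 0.14: {bound:.2f}")  # 1.16
```

Even a co-processor that made FP operations free would give at most about a 16% speedup here, which is why the size of F matters so much.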

Examples
• Web server enhancements:
  – Faster CPU (10 times faster on computation, 30% CPU operations, 70% I/O)
    • speedup = 1 / (1 - .3 + .3 / 10) = 1.37 (37% speedup)
  – More hard disk space that improves I/O performance by a factor of 2.5
    • speedup = 1 / (1 - .7 + .7 / 2.5) = 1.72 (72% speedup)
  – Select the common case
• A benchmark has 20% FP square root operations, 50% total FP operations, 50% other
  – Add an FP sqrt unit with a speedup of 10
    • speedup = 1 / (1 - .2 + .2 / 10) = 1.22
  – Add a new FP ALU with a speedup of 1.6 for all FP ops
    • speedup = 1 / (1 - .5 + .5 / 1.6) = 1.23
  – Again, the common case is the better way to go (slightly)
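The two comparisons above can be reproduced as follows. This is a sketch only; the dictionary labels are ours and simply name the options from the slide.

```python
def amdahl_speedup(fraction, enhancement_speedup):
    """Overall speedup 1 / (1 - F + F / S)."""
    return 1.0 / (1.0 - fraction + fraction / enhancement_speedup)


# Each option is (fraction of time usable, speedup of the enhancement)
web_server = {
    "faster CPU": amdahl_speedup(0.30, 10),
    "faster I/O": amdahl_speedup(0.70, 2.5),
}
fp_benchmark = {
    "FP sqrt unit": amdahl_speedup(0.20, 10),
    "new FP ALU": amdahl_speedup(0.50, 1.6),
}

for label, options in (("web server", web_server), ("FP benchmark", fp_benchmark)):
    best = max(options, key=options.get)
    summary = ", ".join(f"{name} = {value:.2f}" for name, value in options.items())
    print(f"{label}: {summary} -> choose {best}")
```

In both cases the option with the larger F wins even though its S is smaller.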

Another Example
• Architects have suggested a new feature that can be used 20% of the time and offers a speedup of 3
• One architect, though, feels that he can provide a better enhancement that will offer a speedup of 7 for that particular feature
• What percentage of the time would the second feature have to be used to match the first enhancement?
  – Speedup from feature 1 = 1 / (1 - .2 + .2 / 3) = 1.154
  – For speedup from feature 2 = 1.154, we need to solve for x where 1.154 = 1 / (1 – x + x / 7)
• Algebra gives us the following
  – 1 – x + x / 7 = 1 / 1.154 = .867
  – 1 - .867 = x – x / 7
  – .133 = (7x – x) / 7 = 6x / 7
  – 7 * .133 / 6 = x = .156, so the second feature would have to be used 15.6% of the time to offer the same speedup
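The algebra generalizes: solving 1 / (1 – F + F / S) = T for F gives F = (1 – 1/T) / (1 – 1/S). A small sketch, with `required_fraction` being our own hypothetical helper name:

```python
def amdahl_speedup(fraction, enhancement_speedup):
    """Overall speedup 1 / (1 - F + F / S)."""
    return 1.0 / (1.0 - fraction + fraction / enhancement_speedup)


def required_fraction(target_speedup, enhancement_speedup):
    """Fraction F for which 1 / (1 - F + F / S) equals target_speedup."""
    if enhancement_speedup <= 1:
        raise ValueError("the enhancement itself must have a speedup above 1")
    fraction = (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / enhancement_speedup)
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("target speedup is not reachable with this enhancement")
    return fraction


target = amdahl_speedup(0.20, 3)   # feature 1: used 20% of the time, speedup 3
x = required_fraction(target, 7)   # feature 2: speedup 7
print(f"target speedup = {target:.3f}, required fraction = {x:.3f}")
# prints: target speedup = 1.154, required fraction = 0.156
```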

CPU Performance Formulae
• Another way to compute speedup is to compute the CPU’s performance before and after some enhancement(s)
• We will use the following formulae
  – CPU time = CPU clock cycles * clock cycle time
    • CPU clock cycles = number of elapsed clock cycles
    • CPU clock cycles = instruction count (IC) * clock cycles per instruction (CPI) = IC * CPI
      – not all instructions will have the same CPI, so we might have to compute this as Σ (CPI_i * IC_i) over all classes of instructions i
      – For instance, we might have loads, stores, branches, ALU (integer) operations and FP operations with CPIs of 5, 4, 3, 2 and 10 respectively
    • Clock cycle time = 1 / clock rate (we will abbreviate clock cycle time as CCT going forward)
  – Given two enhancements, compute the CPU execution time of each; the speedup of machine 2 over machine 1 =
    • CPU time machine 1 / CPU time machine 2

Example
• Consider that we can either enhance the FP sqrt unit or enhance all FP units
  – IC breakdown: 25% FP operations (FP square root alone is 2% of all instructions), 75% all other instructions
  – CPI: 4.0 for FP operations (on average across all FP operations), 20 for FP sqrt, 1.33 for all other instructions
    • CPI original machine = 25% * 4.0 + 75% * 1.33 = 2.00
  – If we enhance all FP units, the overall CPI for FP operations becomes 2.5; if we enhance just the FP sqrt unit, the FP sqrt CPI drops from 20 to 2.0
• Compute the CPU time of each (note that IC and clock rate (CCT) remain the same)
  – CPI all FP = 75% * 1.33 + 25% * 2.5 = 1.625
  – Speedup enhancing all FP = (IC * 2.00 * CCT) / (IC * 1.625 * CCT) = 1.23
  – CPI FP sqrt = CPI original – 2% * (20 – 2) = 1.64
  – Speedup enhancing FP sqrt = (IC * 2.00 * CCT) / (IC * 1.64 * CCT) = 1.22
  – Enhancing all FP is better by 1.64 / 1.625 = 1.01, or about 1%
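The same comparison as a sketch in Python. Since IC and CCT are unchanged, they cancel and the speedup is just the ratio of CPIs. The helper name `average_cpi` is ours, and the 2% figure is taken as a share of all instructions, which is how the slide's arithmetic uses it.

```python
def average_cpi(mix):
    """Weighted CPI: sum of (fraction of instructions * CPI) over all classes."""
    return sum(fraction * cpi for fraction, cpi in mix)


# (fraction of instruction count, CPI) for FP and for everything else
cpi_original = average_cpi([(0.25, 4.0), (0.75, 1.33)])
cpi_all_fp = average_cpi([(0.25, 2.5), (0.75, 1.33)])

# FP sqrt: 2% of all instructions drop from 20 cycles to 2 cycles
cpi_fp_sqrt = cpi_original - 0.02 * (20 - 2)

speedup_all_fp = cpi_original / cpi_all_fp
speedup_fp_sqrt = cpi_original / cpi_fp_sqrt

print(f"all FP:  CPI {cpi_all_fp:.2f}, speedup {speedup_all_fp:.2f}")
print(f"FP sqrt: CPI {cpi_fp_sqrt:.2f}, speedup {speedup_fp_sqrt:.2f}")
```

The unrounded CPIs differ from the slide's in the third decimal place, but both speedups round to the same 1.23 and 1.22.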

Another Example
• Our current machine has a load-store architecture and we want to know whether we should introduce a register-memory mode for ALU operations
  – Assume a benchmark of 21% loads, 12% stores, 43% ALU operations and 24% branches
  – CPI is 2 for all instructions except ALU operations, which have a CPI of 1
• An ALU operation that uses the new mode will have its CPI lengthened to 2, and, as a side effect, the mode lengthens the branch CPI to 3
  – The IC will be reduced because we need fewer loads; let’s assume this new mode will be used in 25% of all ALU operations
• Use the CPU execution time formula to determine the speedup of the new addressing mode

Solution
• The new mode is used by 25% of the ALU operations, or 43% * 25% = 11% of all instructions
  – Each such operation replaces a load, so we will have 11% fewer instructions and ICnew = 89% * ICold
  – Those dropped instructions will all be loads, so we will have a different breakdown of instruction mix
    • Loads = (21% - 11%) / 89% = 11%
    • Stores = 12% / 89% = 13%
    • ALU = 43% / 89% = 48% (12% using the new mode, 36% unchanged)
    • Branches = 24% / 89% = 27%
  – CPIold = 43% * 1 + 57% * 2 = 1.57
  – CPInew = (11% + 13%) * 2 + 36% * 1 + 12% * 2 + 27% * 3 = 1.89
  – CPU execution time old = IC * 1.57 * CCT
  – CPU execution time new = .89 * IC * 1.89 * CCT
• Speedup = 1.57 / (.89 * 1.89) = .933, which is actually a slowdown!
  – We would not want to use this enhancement
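A sketch of the same calculation without rounding the intermediate percentages. It assumes, as the 1.89 figure implies, that only the ALU operations using the new mode take 2 cycles while the remaining ALU operations stay at 1. All names are ours. Carrying full precision gives a speedup of about 0.92 rather than .933; the conclusion is the same.

```python
# Original instruction mix (fractions of the old instruction count) and CPIs
old_mix = {"load": 0.21, "store": 0.12, "alu": 0.43, "branch": 0.24}
old_cpi = {"load": 2, "store": 2, "alu": 1, "branch": 2}

# Cycles per original instruction on the old machine (equals CPIold)
cycles_old = sum(old_mix[k] * old_cpi[k] for k in old_mix)

# 25% of ALU operations use the register-memory mode; each one removes a load
reg_mem = 0.25 * old_mix["alu"]
new_counts = {
    "load": old_mix["load"] - reg_mem,
    "store": old_mix["store"],
    "alu": old_mix["alu"] - reg_mem,   # ALU operations not using the new mode
    "alu_reg_mem": reg_mem,            # ALU operations using the new mode
    "branch": old_mix["branch"],
}
new_cpi = {"load": 2, "store": 2, "alu": 1, "alu_reg_mem": 2, "branch": 3}

# Counts are still relative to the OLD instruction count, so this sum is
# (ICnew / ICold) * CPInew, i.e. cycles per original instruction
cycles_new = sum(new_counts[k] * new_cpi[k] for k in new_counts)
ic_ratio = sum(new_counts.values())    # ICnew / ICold
cpi_new = cycles_new / ic_ratio

speedup = cycles_old / cycles_new      # CCT is unchanged, so it cancels
print(f"ICnew/ICold = {ic_ratio:.4f}, CPInew = {cpi_new:.2f}, speedup = {speedup:.2f}")
```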

Which Formula?
• In a previous example, we solved the problem of FP sqrt or all FP units using the CPU time formula, a problem we had solved earlier with Amdahl’s Law (slide 6)
• Which approach should we use?
  – It depends on what information we are given; notice that in using Amdahl’s Law, we know the fraction of time an enhancement can be used and how much speedup that enhancement gives us
  – We can compute the same thing with the CPU time formula
• Let’s try another example to demonstrate it
  – Benchmark consists of 35% loads, 15% stores, 40% ALU and 10% branch operations with a CPI breakdown of 5 (loads/stores) and 4 (ALU, branches)
  – Enhancement: since we have separate INT and FP registers and this benchmark does not use the FP registers, can we use a compiler to move values from INT to FP registers and back rather than using the slower loads & stores? Yes. How much speedup will this give us?
  – Assume the compiler can reduce the loads/stores by 20% because of this enhancement

Solution
• CPI goes down, IC remains the same, CPU clock time is unchanged
  – Solution using CPU time formula
    • CPIold = 50% * 5 + 50% * 4 = 4.5
      – 20% of the loads/stores now become register moves, so our new breakdown of instructions is 40% load/store and 60% ALU, branch and register-move operations
    • CPInew = 40% * 5 + 60% * 4 = 4.4
    • Speedup = (4.5 * IC * CPU clock time) / (4.4 * IC * CPU clock time) = 4.5 / 4.4 = 1.023 or 2.3% speedup
  – Amdahl’s Law
    • Speedup of enhancement is 5/4 (5 cycles down to 4) = 1.25
    • Fraction the enhancement can be used
      – the enhancement is used in 20% of the loads/stores, which were 50% of the total instruction mix, and these instructions took up 5 cycles of time each, so it is used .20 * .50 * 5 / 4.5 (the original CPI) = .111 of the time
    • Speedup = 1 / (1 - .111 + .111 / 1.25) = 1.023
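A short sketch confirming that the two routes give the same number. Variable names are ours. The key step is that F in Amdahl's Law is a fraction of execution time, not of instruction count, which is why the 5 / 4.5 weighting appears.

```python
# Route 1: CPU time formula (IC and clock cycle time cancel)
cpi_old = 0.50 * 5 + 0.50 * 4   # 50% loads/stores at 5 cycles, 50% others at 4
cpi_new = 0.40 * 5 + 0.60 * 4   # 20% of the loads/stores became 4-cycle moves
speedup_cpu_time = cpi_old / cpi_new

# Route 2: Amdahl's Law
enhancement_speedup = 5 / 4                    # 5 cycles down to 4
fraction_of_time = 0.20 * 0.50 * 5 / cpi_old   # share of execution TIME affected
speedup_amdahl = 1.0 / (1.0 - fraction_of_time
                        + fraction_of_time / enhancement_speedup)

print(f"CPU time formula: {speedup_cpu_time:.3f}")  # 1.023
print(f"Amdahl's Law:     {speedup_amdahl:.3f}")    # 1.023
```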

Instruction Set Principles
• We studied instruction set design issues in 362
  – Here, we develop a RISC instruction set to be used throughout the course, called MIPS
• We want a fixed-length instruction format and a load-store instruction set, both of which will help support a pipeline
• What other issues should we consider?
  – Number of operands (2-operand or 3-operand)?
  – Number of registers and what type (should we differentiate between data and address registers?)
• Memory issues
  – What addressing modes should we allow?
  – How many bits should we allow for address displacements, for immediate data?

Comparisons

Addressing Modes
• Data will either be
  – Constants (immediate data)
  – Stored in registers
  – Stored in memory
• For data stored in memory, there are numerous ways to specify the address
  – Direct, indirect (pointers), register indirect (pointers in registers), base displacement (sum of displacement and value in register), indexed (sum of values in two registers), etc. – see the next slide
  – Complex modes can impact CPI because of the time it takes to obtain or compute the address
• Design issues
  – How many bits should be allowed for an immediate datum or a displacement? Analysis of SPEC benchmarks indicates no more than 15 bits are needed for a displacement (displacements are < 32K) and 8 bits for most immediate data
  – Which modes? Again, an analysis of SPEC benchmarks indicates that immediate and displacement modes are most common (see figure A.7)

Branch Issues
• Branches typically use PC-relative branching
  – The branch target location is computed as PC = PC + offset rather than PC = offset; this keeps the offset to fewer bits in the instruction
    • also, with PC + offset, we do not need to know absolute memory locations at compile time, allowing code to be moved in memory during execution
• Branches break down into
  – Conditional branches (test a condition first)
  – Unconditional branches
  – Procedure calls and returns (require storing a return address, probably parameter passing as well)
    • register windows are used in some high performance computers for parameter passing (this is explained in the out-of-class notes)
• Conditional branches make up 75-82% of all branches
  – Distance (offsets) for most branches can be limited to about 8 bits (255 locations) – see figure A.15 on page A-18

Continued
• For procedure calls/returns, how is the state saved/restored?
  – Through memory or register windows
• What is the form of condition for conditional branches?
  – Complex conditions can be time consuming
  – Using condition codes is problematic with pipelines
  – A simple zero test (value == 0 or value != 0) is the simplest and fastest approach but requires additional instructions
    • e.g., to test x == y + 1, compute x – (y + 1) first, then compare the result to 0
• When is the comparison performed?
  – With the branch instruction or prior to the branch?

Types of Instructions
• Arithmetic (integer) and logic operations
• FP operations (+, -, *, /) and conversion between int and FP
  – We separate FP and integer operations for several reasons
    • they have different CPIs
    • we will use different register files
    • we will use different execution units in the pipeline
• Data transfer (loads, stores)
• Control (conditional, unconditional branches, procedure calls, returns, traps)
• I/O
• Strings (move, compare, search)
• OS operations (OS system calls, virtual memory, other)
• Graphics (pixel operations, compression/decompression, others)
  – In this course we will only concentrate on the first 4 classes, although we will briefly consider vector operations as well, which are often used to support graphics
  – See figure A.13 on page A-16 for a breakdown of the SPECInt92 benchmark programs as executed on the Intel 80x86 architecture

Embedded Application Instruction Sets
• With RISC, instruction sets were being restricted
  – Fewer instructions in the instruction set
  – Fixed length instructions
  – Fewer and simpler addressing modes
  – Load-store instruction sets
• With the popularity of embedded applications due to handheld devices, new restrictions are being introduced
  – 16-bit and 32-bit instruction sizes to accommodate narrower buses
    • Requires smaller memory addresses, smaller immediate data, fewer registers
    • This also improves cache performance because we can fit more in the caches
    • An alternative is to use compression on instructions: compress an instruction, fetch it, uncompress it in the CPU – IBM follows this approach

Compiler Optimizations
• In order to support the increasingly complex hardware, we need compiler support in the form of machine code optimizations; here are some examples:
  – High-level optimizations on source code
    • example: procedure in-lining, loop transformation
  – Local optimizations on single lines of code
    • example: change the order of references in a block or expression
  – Global optimizations extend local optimizations across branches
    • example: loop unrolling
  – Register allocation to optimize the storage of variables in registers and minimize memory fetches
  – Machine-dependent optimizations
    • take advantage of the specific architecture
  – see Figure A.19, page A-25 and A.20, page A-28

Continued
• Two examples
  – Sub-expression elimination – assume that a particular expression is used in several expressions; the value can be computed one time and stored in a register, and later uses can reference the register and not have to re-compute the same expression
  – Graph coloring – an algorithm used to determine the best (or a good) allocation of local variables to registers
    • this is an NP-complete problem, so compilers use a graph coloring approximation or heuristic algorithm instead
• There is a problem with compiler optimizations: phase-ordering
  – Since compiler optimizations are made in a particular order, one optimization might impact and wipe out the gain made by an earlier optimization
    • consider, for instance, that register allocation is performed near the end of optimization, but sub-expression elimination, performed earlier in the process, needs some registers, so the earlier optimization relies on having access to registers which may be re-allocated later!
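To make the graph coloring idea concrete, here is a minimal sketch of one simple greedy heuristic. It is an illustration only, not the algorithm any particular compiler uses. Variables are nodes, an edge joins two variables that are live at the same time, and each register is a color. The function name, the example graph and the spill handling are all our own assumptions.

```python
def allocate_registers(interference, num_registers):
    """Greedy coloring: visit the most-constrained variables first and give
    each the lowest-numbered register not used by an already-colored neighbor.
    Variables that cannot be colored are reported as spilled (kept in memory)."""
    order = sorted(interference, key=lambda v: len(interference[v]), reverse=True)
    assignment, spilled = {}, []
    for var in order:
        taken = {assignment[n] for n in interference[var] if n in assignment}
        free = [r for r in range(num_registers) if r not in taken]
        if free:
            assignment[var] = free[0]
        else:
            spilled.append(var)
    return assignment, spilled


# Hypothetical interference graph: a, b, c are all live together; d overlaps only c
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

three_regs, three_spills = allocate_registers(graph, 3)
two_regs, two_spills = allocate_registers(graph, 2)
print("3 registers:", three_regs, "spilled:", three_spills)
print("2 registers:", two_regs, "spilled:", two_spills)
```

With three registers every variable gets one; with two, one of the three mutually live variables has to be spilled. A heuristic like this is fast but gives no guarantee of using the minimum number of registers.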

Introduction to MIPS
• Developed in 1985; since then, there have been many versions, and here we examine a subset called MIPS64
• RISC architecture designed for pipeline efficiency
  – optimizing compiler essential to improve efficiency
• General-purpose register set and load-store architecture
  – 32 64-bit general purpose (integer) registers
    • labeled R0, …, R31
    • R0 is always 0
    • 8-, 16-, 32-bit values are sign extended to become 64 bit values
  – 32 64-bit floating point registers
    • labeled F0, …, F31 where only half the register is used for single-precision floats
• No explicit character or string types
  – characters treated as ints, à la C
  – strings as arrays of ints
• Arrays are available, using base-displacement addressing

Continued
• Two addressing modes used: displacement, immediate
  – direct addressing can be accomplished by using R0 as the base register
  – register indirect can be accomplished by using a displacement of 0
  – displacements of 12-16 bits and immediate data of 8-16 bits
  – memory is byte addressable and 64-bit addresses are used
• Approximately 100 operations
  – op code requires 7 bits, however we will reduce this to 6 bits by using one op code for all integer ALU operations
  – Fixed length 32-bit instructions
  – 3 instruction formats used (shown on the next slide)
    • I-type for immediate data, used for loads, stores, conditional branches and ALU operations that have an immediate datum as an operand
    • R-type for register type, used for all other ALU operations and FP operations
    • J-type for jump type, used for jump, jump and link (procedure call), trap, return
  – Immediate data and displacements are limited to 16 bits, except for jump instructions, in which case displacements are limited to 26 bits

Continued
• 3-operand instructions are available as long as all operands are in registers (R-type) or 2 registers and an immediate datum (I-type)
• The immediate datum (which is also used for displacement offsets) is limited to 16 bits (2’s complement) but extended to 32 bits
• funct is the specific type of ALU or FP function

MIPS Instructions
• See Figure A.26, page A-40 for the full list; here we look at the instructions we will be using
• Loads/Stores
  – LD, SD – load/store double word (we could also use LW, SW for word sized data movements)
    • LD R2, 204(R3) – load item from M[204 + R3] into R2
  – L.S, L.D, S.S, S.D – load and store single and double precision floats (S = single, D = double)
    • L.S F3, 0(R5) – note the use of an integer register for the base
• ALU operations (integer only)
  – DADD/DSUB, DADDI/DSUBI, DMUL, DDIV – add/subtract/multiply/divide with 3 registers (or 2 registers and an immediate datum for add/subtract)
    • DADD R1, R2, R3
    • DADDI R1, R2, #1
  – also similar operations for AND, OR, shift, rotate
  – SLT, SLTI – set less than – used for comparison
    • SLT R1, R2, R3 – if R2 < R3, set R1 to 1, else R1 = 0

Continued
• Branch operations
  – BEQZ, BNEZ – branch if register tested is 0/not zero
    • BEQZ R1, foo – PC = PC + foo if R1 = 0
  – J – unconditional jump to given location
    • J foo – sets PC = PC + foo
  – JR – unconditional jump where the target address is in the given register
    • JR R3 – sets PC = R3
• Floating Point operations
  – ADD.D, ADD.S, SUB.D, SUB.S, MUL.D, MUL.S, DIV.D, DIV.S
    • floating point operations, 3 FP registers
    • ADD.S F1, F2, F3
  – C.__.D, C.__.S – FP comparisons, __ is LT, GT, LE, GE, EQ, NE
  – CVT.__.__ – converts from one type to another using two registers
    • CVT.D.L F2, R4 – convert double in F2 to long in R4

MIPS 5-Stage Architecture

See section C.3 pages C-31-C-34

IF & ID Stages
• IF:
  – PC sent to instruction cache
  – PC incremented by 4, stored in NPC temporarily
    • a MUX in the MEM stage determines if the PC should get the value in NPC or the value computed in EX
  – Instruction stored in IR
• ID:
  – Bits 6..10 denote one source register (I-type and R-type instructions)
  – Bits 11..15 denote a second source register (R-type)
  – Bits 16..31 store the immediate datum or displacement, sign extended to 32 bits
• NPC, A, B and IMM are temporary registers used in later stages
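The field extraction done in ID can be sketched in Python. This assumes the bit numbering above counts bit 0 as the most significant bit of the 32-bit instruction and that the 6-bit op code occupies bits 0..5. The function names, the field names `rs`/`rt`, and the op code value 24 are our own choices for illustration, not real MIPS encodings.

```python
WORD_BITS = 32


def field(word, first, last):
    """Extract bits first..last, counting bit 0 as the most significant bit."""
    width = last - first + 1
    shift = WORD_BITS - 1 - last
    return (word >> shift) & ((1 << width) - 1)


def sign_extend(value, bits):
    """Interpret a 'bits'-wide unsigned field as a 2's complement number."""
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def decode_i_type(word):
    """Split an I-type instruction word into its fields."""
    return {
        "opcode": field(word, 0, 5),
        "rs": field(word, 6, 10),      # register read into temporary A
        "rt": field(word, 11, 15),     # second register field
        "imm": sign_extend(field(word, 16, 31), 16),
    }


# Made-up op code 24 with rs = 3, rt = 2 and a displacement of 204
positive = (24 << 26) | (3 << 21) | (2 << 16) | (204 & 0xFFFF)
# Same fields with a displacement of -4 to show the sign extension
negative = (24 << 26) | (3 << 21) | (2 << 16) | (-4 & 0xFFFF)

print(decode_i_type(positive))
print(decode_i_type(negative))
```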

EX Stage
• This stage
  – computes ALU operations
    • using registers A & B or A & IMM, result from ALU placed in ALU output register and passed on to next stage
  – computes effective addresses for loads and stores
    • A + IMM, stored in ALU output and passed on to next stage
  – computes branch target locations and performs the zero test to determine if a branch is taken or not
    • A zero tested
    • ALU adds PC + IMM, value sent to ALU output and passed to next stage

MEM and WB Stages
• MEM:
  – If load, ALU output stores address, sent to data cache, resulting datum stored in LMD
  – If store, ALU output stores address and B register stores datum, both are sent to data cache
  – If branch, based on condition, MUX either selects NPC or branch target location (as computed by the ALU in EX) to send back to PC
  – If ALU, forward result from ALU output directly to LMD
• WB:
  – If a datum is in LMD (load or ALU), store it in the register file

Comments on the MIPS Architecture
• The simplified nature of MIPS means that many tasks will require more than a single assembly/machine operation to complete
  – in CISC instruction sets, some operations can be done in 1 instruction, such as indirect addressing and compare-and-branch operations
  – registers must be pre-loaded with the data before performing an ALU operation
  – two or more instructions are needed to perform scaled or indexed modes
• The CPI of MIPS operations is less than that of operations in other instruction sets, making up for this
  – all operations have a CPI of 4 except loads and ALU operations, which have a CPI of 5 (because they must write their results to registers in the WB stage)
• The static size of all MIPS operations makes it easier to deal with pre-fetching and pipelining

Continued
• The architecture requires the following hardware elements to implement:
  – the ALU should have all integer operations (arithmetic, logic)
    • we address floating point operations later in the semester
  – an additional adder is needed for the IF stage (PC increment)
  – several temporary registers are needed
    • IR, A, B, Imm, NPC, ALUOutput, LMD
  – multiplexors to select the following
    • what to do after a condition is evaluated
    • whether a computed value is to be used later in temporary registers A or B
    • whether to use a register value or the immediate datum
    • multiplexors in the ALU to select the output based on the specific ALU operation (not shown in the figure)
    • multiplexors in the register file to select which register to send on to the A or B temporary registers, and a demultiplexor to pass along the LMD value into one of the registers (not shown in the figure)

MIPS Code Example
• Write a set of code to compute the average of the elements in an int array, assuming the array starts at memory location 50000 and that the number of elements in the array is stored at location 10000
• Store the resulting float value at location 10004

        DADDI R1, R0, #50000    // R1 is our array index
        DADDI R2, R0, #0        // R2 is our sum
        LW R3, 10000(R0)        // R3 is our loop counter
        CVT.W.S F2, R3          // copy number of elements into F2 as a float
Loop:   BEQZ R3, Out            // if R3 == 0, then exit loop
        LW R4, 0(R1)            // R4 is the next array element
        DADD R2, R2, R4         // add the element to the sum
        DADDI R1, R1, #4        // advance to the next 4-byte element
        DSUBI R3, R3, #1        // one fewer element left to process
        J Loop
Out:    CVT.W.S F1, R2          // convert sum to floating point
        DIV.S F3, F1, F2        // average = sum / count
        S.S F3, 10004(R0)       // store resulting average
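For reference, the same loop written in Python, with variables named after the registers they mirror. This is only a sketch for checking the logic: memory is modeled as a dictionary from byte address to 32-bit value, and the function name and sample data are ours.

```python
def average_like_the_loop(memory, count_addr=10000, array_addr=50000, word_size=4):
    """Mirror of the assembly loop; raises ZeroDivisionError for an empty array,
    just as the assembly would divide by zero."""
    r1 = array_addr            # R1: address of the next element
    r2 = 0                     # R2: running sum
    r3 = memory[count_addr]    # R3: elements left to process
    f2 = float(r3)             # F2: element count as a float
    while r3 != 0:             # BEQZ R3, Out
        r4 = memory[r1]        # LW R4, 0(R1)
        r2 += r4               # DADD R2, R2, R4
        r1 += word_size        # DADDI R1, R1, #4
        r3 -= 1                # DSUBI R3, R3, #1
    f1 = float(r2)             # convert the sum to floating point
    return f1 / f2             # DIV.S F3, F1, F2


# Hypothetical memory image: four elements 3, 5, 8, 4 starting at address 50000
memory = {10000: 4, 50000: 3, 50004: 5, 50008: 8, 50012: 4}
result = average_like_the_loop(memory)
print(result)  # 5.0
```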

Another Example
• Write a set of code that will find the largest and smallest items in an array; the array’s starting location is stored in R5 and the array contains 500 elements; store the min in R1 and the max in R2

        LW R1, 0(R5)            // R1 is min
        LW R2, 0(R5)            // R2 is max
        DADDI R3, R0, #499      // R3 is our loop counter (499 elements remain after the first)
Loop:   BEQZ R3, Out
        DADDI R5, R5, #4        // advance array pointer to next element
        LW R4, 0(R5)            // load next array element
        DSUBI R3, R3, #1
        SLT R6, R4, R1          // if R4 < R1, set R6 (new min to take care of)
        BNEZ R6, SetMin
        SLT R6, R2, R4          // if R2 < R4, set R6 (new max to take care of)
        BNEZ R6, SetMax
        J Loop
SetMin: DADDI R1, R4, #0
        J Loop
SetMax: DADDI R2, R4, #0
        J Loop
Out:    …
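The control flow of this example can be checked against a Python mirror. It is a sketch with our own function name and test data. The counter starts at the number of elements minus one because the first element has already been loaded into both the min and the max, so the loop never reads past the end of the array.

```python
def min_max_like_the_loop(array):
    """Mirror of the assembly loop; variables are named after the registers."""
    r1 = array[0]               # R1: min so far
    r2 = array[0]               # R2: max so far
    index = 0                   # stands in for the pointer in R5
    r3 = len(array) - 1         # R3: elements left after the first
    while r3 != 0:              # BEQZ R3, Out
        index += 1              # DADDI R5, R5, #4
        r4 = array[index]       # LW R4, 0(R5)
        r3 -= 1                 # DSUBI R3, R3, #1
        if r4 < r1:             # SLT R6, R4, R1 / BNEZ R6, SetMin
            r1 = r4
            continue            # J Loop
        if r2 < r4:             # SLT R6, R2, R4 / BNEZ R6, SetMax
            r2 = r4
    return r1, r2


smallest, largest = min_max_like_the_loop([7, 3, 9, 3, 12, -1])
print(smallest, largest)  # -1 12
```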
