Chapter 8 Pipelining

A strategy for employing parallelism to achieve better performance: taking the “assembly line” approach to fetching and executing instructions



Page 1

Chapter 8

Pipelining

Page 2

Pipelining

• A strategy for employing parallelism to achieve better performance

• Taking the “assembly line” approach to fetching and executing instructions
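The payoff of the assembly-line idea can be put in rough cycle counts. A minimal sketch (my own illustration, not from the slides; every fetch and every execute is assumed to take one clock cycle):

```python
# Sketch: cycle counts for fetch/execute with and without overlap.
# Assumes each fetch and each execute takes exactly one clock cycle.

def sequential_cycles(n_instructions: int) -> int:
    # No overlap: each instruction is fetched, then executed.
    return 2 * n_instructions

def pipelined_cycles(n_instructions: int) -> int:
    # Two-stage overlap: after the first fetch, one instruction
    # completes every cycle.
    return n_instructions + 1

print(sequential_cycles(100))  # 200
print(pipelined_cycles(100))   # 101
```

For a long instruction stream the two-stage pipeline approaches twice the throughput of strictly sequential operation.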

Page 3

The Cycle

The control unit alternates:

Fetch → Execute → Fetch → Execute → etc.

Page 4
Page 5

The Cycle

How about separate components for fetching the instruction and executing it?

Then

fetch unit: fetch instruction

execute unit: execute instruction

So, how about fetch while execute?

Page 6

(timing diagram: fetch and execute steps laid out over successive clock cycles)

Page 7

Overlapping fetch with execute

Two stage pipeline

Page 8

Both components busy during each clock cycle


Page 9

The Cycle

The cycle can be divided into four parts

fetch instruction

decode instruction

execute instruction

write result back to memory

So, how about four components?
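The four-part cycle can be laid out as a timing chart. A toy sketch (my own helper, assuming an ideal pipeline with one cycle per stage):

```python
# Sketch: which cycle each instruction occupies each stage of an
# ideal four-stage pipeline (F, D, E, W), one cycle per stage.

def pipeline_chart(n_instructions, stages=("F", "D", "E", "W")):
    # Instruction i (0-based) enters stage j (0-based) at cycle i + j + 1.
    return {i: {s: i + j + 1 for j, s in enumerate(stages)}
            for i in range(n_instructions)}

for i, row in pipeline_chart(4).items():
    print(f"I{i + 1}:", row)
# I1 finishes its Write stage in cycle 4; after that, one instruction
# completes per cycle, so I4 finishes in cycle 7.
```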

Page 10
Page 11

The four components operating in parallel

Page 12

buffer for instruction

buffer for operands

buffer for result

Page 13

(Diagram: interstage buffers in action. The instruction buffer holds instruction I3; the operand buffer holds operands for I2, along with operation info and write info for I2; the result buffer holds the result of instruction I1.)

Page 14

One clock cycle for each pipeline stage

Therefore cycle time must be long enough for the longest stage

A unit is idle if it requires less time than another

Best if all stages are about the same length

Cache memory helps

Page 15

Fetching (instructions or data) from main memory may take 10 times as long as an operation such as ADD

Cache memory (especially if on the same chip) allows fetching as quickly as other operations

Page 16

One clock cycle per component, four cycles total to complete an instruction

Page 17

Completes an instruction each clock cycle

Therefore, four times as fast as without pipeline

Page 18

Completes an instruction each clock cycle

Therefore, four times as fast as without pipeline

as long as nothing takes more than one cycle

But sometimes things take longer -- for example, most executes such as ADD take one clock, but suppose DIVIDE takes three
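The cost of a long execute stage can be estimated with simple arithmetic. A sketch under the slides' assumptions (four stages, one cycle each, except one instruction whose execute takes longer):

```python
# Sketch: total cycles for n instructions in a 4-stage pipeline when
# one instruction's execute stage takes long_cycles instead of 1.

def total_cycles(n: int, long_cycles: int = 1) -> int:
    # Ideal pipeline: n + 3 cycles (3 to fill, then 1 per instruction).
    # One long execute stalls everything behind it by the extra cycles.
    return (n + 3) + (long_cycles - 1)

print(total_cycles(5))                 # 8: the ideal case
print(total_cycles(5, long_cycles=3))  # 10: one DIVIDE taking 3 cycles
```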

Page 19
Page 20

and other stages idle

Write has nothing to write; Decode can't use its “out” buffer; Fetch can't use its “out” buffer

Page 21

A data “hazard” has caused the pipeline to “stall”

no data for Write

Page 22

An instruction “hazard” (or “control hazard”) has caused the pipeline to “stall”

Instruction I2 not in the cache, required a main memory access

Page 23
Page 24

Structural Hazards

• Conflict over use of a hardware resource

• Memory
– Can’t fetch an instruction while another instruction is fetching an operand, for example
– Cache: same problem
• Unless the cache has multiple ports
• Or there are separate caches for instructions and data

• Register file
– One access at a time, again unless it has multiple ports

Page 25

Structural Hazards

• Conflict over use of a hardware resource--such as the register file

Example:

LOAD X(R1), R2 (LOAD R2, X(R1) in MIPS order)

The effective address of the memory location is X + [R1], i.e., the address in R1 plus the offset X

Load that word from memory (cache) into R2

Page 26

I2 writing to register file

I3 must wait for register file

I2 takes extra cycle for cache access as part of execution

calculate the address

I5 fetch delayed

Page 27

Data Hazards

• Situations that cause the pipeline to stall because data to be operated on is delayed
– Execute takes an extra cycle, for example

Page 28

Data Hazards

• Or, because of data dependencies

– Pipeline stalls because an instruction depends on data from another instruction

Page 29

Concurrency

A ← 3 + A
B ← 4 × A

A ← 5 × C
B ← 20 + C

The first pair can’t be performed concurrently--the result is incorrect if the new value of A is not used

The second pair can be performed concurrently (or in either order) without affecting the result

Page 30

Concurrency

A ← 3 + A
B ← 4 × A

A ← 5 × C
B ← 20 + C

In the first pair, the second operation depends on completion of the first

In the second pair, the two operations are independent
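The dependence test behind these two examples can be written down directly. A sketch (the tuple representation is my own illustration, not from the slides):

```python
# Sketch: each operation is (destination, set_of_source_names).
def independent(op1, op2) -> bool:
    d1, s1 = op1
    d2, s2 = op2
    # Concurrent execution is safe only with no read-after-write,
    # write-after-read, or write-after-write overlap.
    return d1 not in s2 and d2 not in s1 and d1 != d2

# A <- 3 + A then B <- 4 x A: the second reads the A the first writes
print(independent(("A", {"A"}), ("B", {"A"})))  # False
# A <- 5 x C then B <- 20 + C: both only read C
print(independent(("A", {"C"}), ("B", {"C"})))  # True
```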

Page 31

MUL R2, R3, R4 (will write its result in R4)

ADD R5, R4, R6 (dependent on the result in R4 from the previous instruction; can’t finish decoding until the result is in R4)

Page 32

Data Forwarding

• Pipeline stalls in previous example waiting for I1’s result to be stored in R4

• Delay can be reduced if result is forwarded directly to I2
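The benefit of forwarding can be put in numbers. A rough sketch (assumptions: four-stage pipeline, result available only after Write without forwarding, and right after Execute with it):

```python
# Sketch: stall cycles a dependent instruction sees, as a function of
# how far behind its producer it sits in program order.

def raw_stall(distance: int, forwarding: bool) -> int:
    # distance = 1 means back to back (e.g., MUL then ADD).
    # Without forwarding the consumer waits for the producer's Write
    # stage (3 instructions' worth of lead); with forwarding, only
    # for the producer's Execute stage.
    needed = 1 if forwarding else 3
    return max(0, needed - distance)

print(raw_stall(1, forwarding=False))  # 2 (the two-NOOP case)
print(raw_stall(1, forwarding=True))   # 0
```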

Page 33

pipeline stall

data forwarding

Page 34

(Diagram: data forwarding for MUL R2, R3, R4. The ALU inputs come from R2 and R3; the product R2 × R3 goes to R4 and is also forwarded directly to I2.)

Page 35
Page 36

(Diagram: MUL R2, R3, R4 followed by ADD R5, R4, R6. The product R2 × R3 is held in the ALU’s output buffer and forwarded straight to the ADD, which combines it with R5 without waiting for R4 to be written in the register file.)

Page 37

If solved by software:

MUL R2, R3, R4

NOOP

NOOP

ADD R5, R4, R6

the two NOOPs replace the 2-cycle stall that the hardware would otherwise introduce (if there is no data forwarding)
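The software fix above can be sketched as a tiny compiler pass that pads dependent pairs with NOOPs. Everything here (the tuple format, the min_gap parameter) is a made-up illustration, not the slides' notation:

```python
# Sketch: insert NOOPs so that a consumer starts at least min_gap
# slots after the instruction producing one of its source registers.

NOOP = ("NOOP", set())

def insert_noops(program, min_gap=3):
    # program: list of (dest_register, set_of_source_registers)
    out = []
    for dest, srcs in program:
        # distance back to the nearest producer of any source
        dist = None
        for back, (d, _) in enumerate(reversed(out), start=1):
            if d in srcs:
                dist = back
                break
        if dist is not None and dist < min_gap:
            out.extend([NOOP] * (min_gap - dist))
        out.append((dest, srcs))
    return out

prog = [("R4", {"R2", "R3"}),   # MUL R2, R3, R4  (writes R4)
        ("R6", {"R4", "R5"})]   # ADD R5, R4, R6  (reads R4)
for instr in insert_noops(prog):
    print(instr)  # MUL, NOOP, NOOP, ADD
```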

Page 38

Side Effects

• ADD (R1)+, R2, R3
– Not only changes the destination register, but also changes R1 (autoincrement addressing)

• ADD R1, R3
• ADDWC R2, R4
– Add-with-carry is dependent on the condition code flag set by the previous ADD: an implicit dependency

Page 39

Side Effects

• Data dependency on something other than the result destination

• Multiple dependencies

• Pipelining clearly works better if side effects are avoided in the instruction set

– Simple instructions

Page 40

Instruction Hazards

• Pipeline depends on steady stream of instructions from the instruction fetch unit

pipeline stall from a cache miss

Page 41

Decode, execute, and write units are all idle for the “extra” clock cycles

Page 42

Branch Instructions

• Their purpose is to change the content of the PC and fetch another instruction

• Consequently, the fetch unit may be fetching an “unwanted” instruction

Page 43

SW R1, A
BUN K
LW R5, B

(two-stage pipeline)

While BUN K computes the new PC value, instruction 3 (LW R5, B) is fetched; it is then discarded and instruction K is fetched instead.

Page 44

the lost cycle is the “branch penalty”

Page 45

four stage pipeline

instruction 3 fetched and decoded

instruction 4 fetched

instructions 3 and 4 discarded, instruction K fetched

Page 46

In a four stage pipeline, the penalty is two clock cycles
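The penalty figures quoted here follow one rule: it is the number of wrong-path instructions already in flight when the branch is resolved. A sketch of that arithmetic (the helper names are my own):

```python
# Sketch: a branch resolved in pipeline stage k (counting Fetch as
# stage 1) has already fetched k - 1 wrong-path instructions.

def branch_penalty(resolve_stage: int) -> int:
    return resolve_stage - 1

def effective_cpi(branch_fraction: float, penalty: int) -> float:
    # Average cycles per instruction if every branch pays the penalty.
    return 1 + branch_fraction * penalty

print(branch_penalty(2))  # two-stage pipeline: 1 lost cycle
print(branch_penalty(3))  # branch resolved in stage 3 of 4: 2 cycles
print(effective_cpi(0.2, 2))  # 20% branches with penalty 2
```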

Page 47

Unconditional Branch Instructions

• Reducing the branch penalty requires computing the branch address earlier

• Hardware in the fetch and decode units

– Identify branch instructions

– Compute branch target address

(instead of doing it in the execute stage)

Page 48

fetched and decoded

discarded

penalty reduced to one cycle

Page 49

Instruction Queue and Prefetching

• Fetching instructions into a “queue”

• Dispatch unit (added to decode) to take instructions from queue

• Enlarging the “buffer” zone between fetch and decode

Page 50

buffer for instruction

buffer for operands

buffer for result

Page 51

buffer for multiple instructions

buffer for operands

buffer for result

Page 52

the “oldest” instruction in the queue--next to be dispatched

the “newest” instruction in the queue--most recently fetched

Page 53
Page 54

previous out, F1 in

F1 out, F2 in

F2 out, F3 in

F4 in, F5 in

Page 55

instructions 3 and 4

instructions 3, 4, and 5

keeps fetching despite stall

Page 56

calculates branch target concurrently

“branch folding”

discards F6 and fetches K

Page 57

completes an instruction each clock cycle

no branch penalty

Page 58

Instruction Queue

• Avoiding branch penalty requires that the queue contain other instructions for processing while branch target is calculated

• Queue can mitigate cache misses--if instruction not in cache, execution unit can continue as long as queue has instructions

Page 59

Instruction Queue

• So, it is important to keep queue full

• Helped by increasing rate at which instructions move from cache to queue

• Multiple word moves (in parallel)

(Diagram: n words moved from the cache to the instruction queue in one clock cycle)

Page 60

Conditional Branching

• Added hazard of dependency on previous instruction

• Can’t calculate target until a previous instruction completes execution

SUBW R1, A
BGTZ K

(Diagram: SUBW proceeds through F, D, E, W; BGTZ is fetched and decoded, but until SUBW finishes executing, the fetch unit can’t know whether to fetch instruction K or instruction 3.)

Page 61

Conditional Branching

• Occur frequently, perhaps 20% of all instruction executions

• Would present serious problems for pipelining if not handled

• Several possibilities

Page 62

Delayed Branching

• Location(s) following a branch instruction have been fetched and must be discarded

• These positions called “branch delay slots”

Page 63

the penalty is two clock cycles in a four stage pipeline

two branch delay slots

if branch address calculated here

Page 64

fetched and decoded

discarded

penalty reduced to one cycle

one branch delay slot

Page 65

Delayed Branching

• Instructions in delay slots always fetched and partially executed

• So, place useful instructions in those positions and execute whether branch is taken or not

• If no such instructions, use NOOPs

Page 66

shift R1 N times

R2 contains N

fetched and discarded on every iteration

branch if not zero

Page 67

shift R1 N times R2 contains N

do the shifting while the counter is being decremented and tested

branch delayed for one instruction

Page 68

appears to “branch” here

actually branches here

The branch instruction waits for one instruction cycle before actually branching; hence “delayed branch”

“delay slot”

Page 69

If there is no useful instruction to put in the “delay slot”, put a NOOP

If branches are “delayed branches”, then the next instruction will always be fetched and executed

Page 70

next to last pass through the loop

last pass through the loop

Page 71

will branch, so fetch the decrement instruction

won’t branch, so fetch the add instruction

Page 72

Delayed Branching

• Compiler must recognize and rearrange instructions

• One branch delay slot can usually be filled--filling two is much more difficult

• If adding pipeline stages increases number of branch delay slots, benefit may be lost

Page 73

Branch Prediction

• Hardware attempts to predict which path will be taken

• For example, assumes branch will not take place

• Does “speculative execution” of the instructions on that path

• Must not change any registers or memory until branch decision is known

Page 74

fetch unit predicts branch will not be taken

results of compare known in cycle 4

if branch taken these instructions discarded

Page 75

Static Branch Prediction

• Hardware attempts to predict which path will be taken (shouldn’t always assume the same result)

• Based on the target address of the branch: is it higher or lower than the current address? (backward branches, typical of loops, are usually taken)

• Software (compiler) can make better prediction and for example set a bit in the branch instruction

Page 76

Dynamic Branch Prediction

• Processor keeps track of branch decisions

• Determines likelihood of future branches

• One bit for each branch instruction

– LT branch likely to be taken

– LNT branch likely not to be taken
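The one-bit scheme can be simulated in a few lines. A sketch (feeding in a list of actual outcomes and counting mispredictions is my own framing):

```python
# Sketch: 1-bit dynamic predictor. State True = LT (predict taken),
# False = LNT; after each branch, the bit is set to the actual outcome.

def one_bit_mispredicts(outcomes, state=False):
    misses = 0
    for taken in outcomes:
        if state != taken:
            misses += 1
        state = taken
    return misses

# A 10-iteration loop branch: taken 9 times, then falls through.
# Starting from LNT, it mispredicts the first and last iterations.
print(one_bit_mispredicts([True] * 9 + [False]))  # 2
```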

Page 77
Page 78

Four states

ST: strongly likely to be taken

LT: likely to be taken

LNT: likely not to be taken

SNT: strongly likely not to be taken

Page 79

Dynamic Branch Prediction

• If the prediction (LNT) is wrong at the end of a loop pass, the state is changed (eventually to ST), so the prediction is then correct until the last iteration

• Stays correct for subsequent executions of the loop (only changes if wrong twice in a row)

• Initial prediction can be guessed by the hardware, better if set by the compiler
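The four-state scheme is a 2-bit saturating counter, and the "only wrong twice in a row" behavior falls out of a short simulation. A sketch (state encoding and test sequence are my own):

```python
# Sketch: 2-bit saturating predictor. States 0=SNT, 1=LNT, 2=LT,
# 3=ST; predict taken when state >= 2; move one step per outcome.

def two_bit_mispredicts(outcomes, state=1):
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        state = min(3, state + 1) if taken else max(0, state - 1)
    return misses

# The same 10-iteration loop executed twice, starting at LNT:
runs = ([True] * 9 + [False]) * 2
print(two_bit_mispredicts(runs))  # 3: first and last of run 1,
                                  # then only the last of run 2
```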

Page 80
Page 81

Pipelining’s Effect on Instruction Sets

• Multiple addressing modes

– Facilitate use of data structures

• Indexing, offsets

– Provide flexibility

– One instruction instead of many

• Can cause problems for the pipeline

Page 82

Structural Hazards

• Conflict over use of a hardware resource--such as the register file

Example: (indirect addressing with offset)

LOAD X(R1), R2 (LOAD R2, X(R1) in MIPS order)

The effective address of the memory location is X + [R1], i.e., the address in R1 plus the offset X

Load that word from memory (cache) into R2

Page 83

I2 writing to register file

I3 must wait for register file

I2 takes extra cycle for cache access as part of execution

calculate the address (register access)

I5 fetch delayed

Page 84

LOAD ( X(R1) ), R2

double indirect with offset

two clock pipeline stall

while address is calculated and data fetched

Page 85

ADD #X, R1, R2

LOAD (R2), R2

LOAD (R2), R2

same seven clock cycles

three simpler instructions to do the same thing

Page 86

Condition Codes

• Conditional branch instructions dependent on condition codes set by a previous instruction

• For example, COMPARE R3, R4 sets a bit in the PSW to be tested by BRANCH if ZERO

Page 87

Branch decision must wait for completion of Compare

Can’t take place in decode, must wait for execution stage

Page 88

result of compare is in PSW by the time Branch instruction is decoded

depends on Add instruction not affecting condition codes

Page 89

Condition Codes and Pipelining

• Compiler must be able to reorder instructions

• Condition codes set by few instructions

• Compiler should be able to control which instructions set condition codes

Page 90

Datapath: registers, ALU, interconnecting bus

single internal bus

general registers

Page 91

with a single internal bus, one thing at a time over the bus

single internal bus

Page 92

1) PC out, MAR in, Read, Select 4, Add, Z in

2) Z out, PC in, Y in, wait for memory

3) MDR out, IR in

4) Offset field of IR out, Add, Z in

5) Z out, PC in, end

unconditional branch

Page 93

three internal buses

three port register file

transfers three at a time

PC incrementer

for address incrementing (multiple word transfers)

Page 94

1) PC out, R=B, MAR in, Read, Inc PC

2) Wait for memory

3) MDR out B, R=B, IR in

4) R4 out A, R5 out B, select A, Add, R6 in, end

ADD R4, R5, R6

Page 95
Page 96

three internal bus organization modified for pipelining

two caches--one for instructions, one for data

separate MARs one for each cache

Page 97

PC connected directly to IMAR, can transfer concurrently with ALU operation

data address can come from register file or from ALU

Page 98

separate MDRs for read and write

buffer registers for ALU inputs and output

Page 99

buffering for control signals following Decode and Execute

instruction queue loaded directly from cache

Page 100
Page 101

Can perform simultaneously in any combination:

Reading an instruction from instruction cache

Incrementing the PC

Decoding an instruction

Reading from or writing into data cache

Reading contents of up to two registers from the register file

Writing into one register of the register file

Performing an ALU operation

Page 102

Superscalar Operation

• Pipelining fetches one instruction per cycle, completes one per cycle (if no hazards)

• Adding multiple processing units for each stage would allow more than one instruction to be fetched, and moved through the pipeline, during each cycle

Page 103

Superscalar Operation

• Starting more than one instruction in each clock cycle is called multiple issue

• Such processors are called superscalar
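Multiple issue can be illustrated with a toy in-order scheduler. A sketch (my own model: instructions tagged only by the execution unit they need, dependences ignored):

```python
# Sketch: in-order dual issue. At most one instruction per unit
# ("int" or "float") may start in each cycle.

def issue_cycles(program) -> int:
    cycle, i = 0, 0
    while i < len(program):
        cycle += 1
        used = set()
        # issue consecutive instructions until a unit repeats
        while i < len(program) and program[i] not in used:
            used.add(program[i])
            i += 1
    return cycle

print(issue_cycles(["float", "int", "float", "int"]))  # 2
print(issue_cycles(["int", "int", "int", "int"]))      # 4
```

Alternating unit types keeps both units busy; a run of same-type instructions degrades back to one issue per cycle.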

Page 104

a processor with two execution units

Page 105

the instruction queue and multiple-word moves enable fetching n instructions at a time

Page 106

dispatch unit capable of decoding two instructions from queue

Page 107

so, if the top two instructions in the queue are:

ADDF R1, R2, R3

ADD R4, R5, R6

they can be dispatched in the same cycle, one to each execution unit

Page 108
Page 109

Floating point execution unit is pipelined also

Page 110

instructions complete out of order

OK if no dependencies

problem if error (imprecise interrupt/exception)

Page 111

[Figure: imprecise interrupt: the error occurs here, but later instructions have already completed!]
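The problem can be made concrete with a toy simulation (a sketch, assuming results are written the moment each execution unit finishes): the faulting instruction is slow, so faster later instructions have already updated the registers by the time the fault is detected.

```python
regs = {}
# (program_order, dest_register, value, completion_cycle, faults)
instrs = [(0, "R1", 1, 5, True),   # slow op that eventually faults
          (1, "R2", 2, 2, False),  # fast ops finish first...
          (2, "R3", 3, 3, False)]

# Write results in completion order, not program order.
for order, dest, value, done, faults in sorted(instrs, key=lambda i: i[3]):
    if faults:
        # The fault belongs to the FIRST instruction in program order,
        # yet later instructions have already changed the registers.
        print("fault in instruction", order, "but registers hold:", regs)
        break
    regs[dest] = value
```

With this write policy the handler cannot see a clean "state as of the fault," which is exactly why such an exception is called imprecise.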

Page 112:

results written in program order

Page 113:

[Figure: precise interrupts: the error occurs here, later instructions are discarded, and results are written in program order]
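A minimal sketch of in-order commit (assuming a simple reorder-buffer model, not the slides' exact hardware): completed results wait in program order and commit one at a time; a fault stops commit, and everything after it is discarded.

```python
def commit(rob):
    """rob: list of (dest_register, value, faulted) in program order.
    Commits results in program order and stops at the first fault."""
    committed = []
    for dest, value, faulted in rob:
        if faulted:
            break                        # raise a precise exception here
        committed.append((dest, value))  # architectural write, in order
    return committed

rob = [("R1", 10, False), ("R2", 20, True), ("R3", 30, False)]
print(commit(rob))  # [('R1', 10)]: R3's completed result is discarded
```

Because nothing after the faulting instruction reaches architectural state, the handler sees exactly the state at the point of the error.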

Page 114:

temporary registers allow greater flexibility

Page 115:

register renaming

ADD R4, R5, R6

would write to R6

instead, it writes to another physical register "renamed" as R6, and subsequent instructions using that result read the renamed register
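The mechanism can be sketched as a small mapping table (the names and the free-register pool here are invented for illustration): each architectural write allocates a fresh physical register, and later readers look up the current mapping.

```python
phys = {}                 # architectural name -> current physical register
free = iter(range(100))   # pool of free physical registers

def rename(dest, sources):
    """Rename one instruction: read current mappings for the sources,
    then point the destination at a fresh physical register."""
    srcs = [phys[s] for s in sources]  # readers use the current mapping
    phys[dest] = next(free)            # the writer gets a fresh register
    return phys[dest], srcs

phys["R4"], phys["R5"] = next(free), next(free)  # R4 -> p0, R5 -> p1
new_dest, srcs = rename("R6", ["R4", "R5"])      # ADD R4, R5, R6
print(new_dest, srcs)  # 2 [0, 1]: "R6" now names physical register 2
```

Any earlier instruction still using the old R6 keeps its old physical register, which is what removes the write-after-write and write-after-read conflicts.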

Page 116:

[Figure: register renaming: a changeable mapping from the "architectural" registers R0, R1, R2, ..., Rn onto the physical registers 0, 1, 2, ..., n]

Page 117:

Superscalar

• Statically scheduled
– Variable number of instructions issued each clock cycle
– Issued in-order (as ordered by the compiler)
– Much effort required by the compiler

• Dynamically scheduled
– Issued out-of-order (determined by hardware)

Page 118:

Superscalar

• Two-issue (dual-issue): capable of issuing two instructions at a time (as in the previous example)

• Four-issue: four at a time

• Etc.

• Overhead grows with issue width

Page 119:

Closing the Performance Gap

• At the "microarchitecture" level

• Interpreting the CISC x86 instruction set with hardware that translates it into "RISC-like" operations

• Pipelining and superscalar execution of those micro-ops
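The translation idea can be illustrated with a toy translator (the micro-op names and the tuple encoding are invented for illustration): a CISC instruction with a memory operand splits into a load micro-op plus a simple register-register micro-op, each of which pipelines like a RISC instruction.

```python
def translate(instr):
    """Split an 'ADD reg, [mem]' style instruction into micro-ops."""
    op, dest, src = instr
    if op == "ADD" and src.startswith("["):
        return [("uLOAD", "t0", src),         # fetch the memory operand
                ("uADD", dest, dest, "t0")]   # register-register add
    return [instr]                            # already RISC-like: pass through

print(translate(("ADD", "EAX", "[count]")))
# [('uLOAD', 't0', '[count]'), ('uADD', 'EAX', 'EAX', 't0')]
```

The superscalar core then schedules these uniform micro-ops rather than the irregular x86 instructions themselves.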

Page 120:

[Figure: early x86 processors: C++ program -> compiler -> x86 machine language -> microcode -> hard wired logic]

Page 121:

[Figure: current x86 processors: C++ program -> compiler -> x86 machine language -> micro op translator -> micro-ops -> hard wired logic]

Page 122:

[Figure: current x86 processors: C++ program -> compiler -> x86 machine language -> micro op translator -> micro-ops -> hard wired logic, with some instructions still handled by microcode]

Page 123:

[Figure: x86 vs. MIPS: C++ program -> compiler -> x86 machine language -> micro op translator -> micro-ops -> hard wired logic, compared with C++ program -> compiler -> MIPS machine language -> hard wired logic]

Page 124:

[Figure: the C++ path (C++ program -> compiler -> x86 machine language -> micro op translator -> micro-ops -> hard wired logic) alongside the Java path (Java program -> compiler -> Java byte code -> JVM -> x86 machine language -> micro op translator -> micro-ops -> hard wired logic)]

Page 125:

[Figure: current x86 processors again: C++ program -> compiler -> x86 machine language -> micro op translator -> micro-ops -> hard wired logic]