Computer Architecture Pipelines & Superscalars. Pipelines Data Hazards Code: lw $4, 0($1) add $15, $1, $1 sub$2, $1, $3 and $12, $2, $5 or $13, $6, $2

Computer Architecture

Pipelines & Superscalars

Pipelines

• Data Hazards• Code:

lw $4, 0($1)add $15, $1, $1sub $2, $1, $3and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15,100($2)

The last four instructions all depend on a result

produced by the first!

MIPS instructionshave the format

op dest, srca, srcb

Pipelines - Data hazards

• Examine the pipeline(ignore first 2!)

• r2 onlyupdatedin timefor add!

Pipelines - Data Hazards

• Compilersolution• Insert

NOOPs• Inefficient!


• Second compiler solution• Reorder

lw $4, 0($1)add $15, $1, $1sub $2, $1, $3and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15,100($2)

sub $2, $1, $3lw $4, 0($1)add $15, $1, $1and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15,100($2)

These two must not define$1 or $3!

ReadWritten


• Second compiler solution• Reorder


ReadWritten

First use of $2


• Compiler analyses dependencies• Register

definitions

• Registeruse

• Read After Write(RAW)dependency

• No dependencies

• Instruction can be moved!


Written

Usesof $2


• Hardware solution• Value forwarding

• Hardware detectsdependency

• scoreboard• Forwards result

from WB to EXfor subsequentuse

• Hardware• Transparent to software!

Data Hazards - classification

• Read after Write (RAW)• Instruction 1 must write

before instruction 2 reads

• Write after Write (WAW)• Instructions 1 and 2 both write

Instruction 2 must write after 1

• Write after Read (WAR)• Instruction 1 reads

Instruction 2 writes (overwrites)• Instruction 2 must not write before 1 reads

Reordering algorithms must consider all three!

Lecture 5 - Key Points

• Data Hazards• RAW - most common• WAW• WAR

• Compiler looks for dependencies• then re-orders

• Hardware• Scoreboard

• Monitors dependencies• ensures correct operation

• Value forwarding hardware• Forwards results from EX stage

Pipelines - Exceptions

• Caused by overflow, underflow• Example

add $1, $2, $1• Overflow detected in EX stage• Causes jump to exception handler

• as branch - remainder of pipeline flushed

but• Compiler needs original $1 causing overflow

Register must not be overwritten • EX stage needs to squash WB operation

• Precise Exception problem - more later!

Pipelines - Depth

• Pipeline can’t be too deep• Hazards are frequent

many stalls in deep pipelines

0.5

1.0

1.5

2.0

2.5

1 2 4 8 16

Rel

ativ

eP

erfo

rman

ce

Pipeline Depth

TooDeep!

Pipelines - Depth

• Pipeline can’t be too deep• Hazards are frequent

many stalls in deep pipelines

0.5

1.0

1.5

2.0

2.5

1 2 4 8 16

Rel

ativ

eP

erfo

rman

ce

Pipeline Depth

TooDeep!

Superpipelined

CISC and pipelines

• High Speed CISC processors are pipelined• Overlap IF, EX

• Variable• instruction length• running time (number of microcode cycles)pipeline imbalance“backup” in pipe stagescomplicate hazard detection

• Complex addressing modesauto-increment updates address registermultiple memory accesses required

smooth pipeline flow more difficult!

Instruction Queues

• Vital performance determinant• Rate of instruction fetch

• High Performance processors• Fetch multiple instructions in each cycle

• 2 - 4 common• Use wide datapath to memory

• PowerPC 604 128 bits = 4 instructions• Despatch unit

• Examine dependencies• Determine which instructions can be

despatched

Instruction Queues

• Q “matches” fetch/despatch rates• General Strategy for matching

Producers - Consumers• Use of FIFO-style Queues• Absorb

AsynchronousDelivery / ConsumptionRates

• ProvidesElasticityin pipelines

Producer

FIFO

Consumer

DifferingInstantaneous

Rates

Superscalar Processors

PowerPC organisation

PowerPC 601~1993

Boundary of theSi die

New - Look in the “Example Processors” sectionof the Web notes

3-way SuperScalar• Integer• Branch• Floating Point

A newer machine will have more functional units here!


• Multiple Functional Units• PowerPC 604

6-way superscalar

• Despatch Unit • Sends “ready” instructions to all free units• PowerPC 604:

• potential 4 instructions/cycle (pipeline lengths are different!)

• reality: 2-3 instructions/cycle?(program dependent!)

Branch UnitLoadStore Unit3 Integer UnitsFloating Point Unit


• Mix of functional units• Up to 8-way superscalar common now

• 2 Floating point units• Usually have ~3 cycle latency

• 3 Integer Arithmetic• Branch unit• Load / store unit• + ….?

• Marketing departments can play some games with the ‘n’ of a n-way superscalar!

Superscalar – Maximum throughput

• Instruction Issue Unit is the key!• If IIU only issues 4 instructions per cycle,• An n-way superscalar (n >> 4) can still only

complete 4 instructions / cycle!• IIU has many tasks

• Pre-fetch instructions• At least one cache line!

• Check dependencies• Has data required by this instruction been computed

yet?• Keeps register ‘scoreboard’

• Mark registers which will be written by instructions already issued

• It’s a small dataflow machine (see later!)

• Check availability of functional units

Documents

Computer Architecture Pipelines & Superscalars. Pipelines Data Hazards Code: lw $4, 0($1) add $15, $1, $1 sub$2, $1, $3 and $12, $2, $5 or $13, $6, $2