Computer Architecture Pipelines & Superscalars Sunset over the Pacific Ocean Taken from Iolanthe II about 100nm north of Cape Reanga


Page 1: Computer  Architecture Pipelines &  Superscalars

Computer Architecture

Pipelines & Superscalars

Sunset over the Pacific Ocean
Taken from Iolanthe II about 100nm north of Cape Reanga

Page 2: Computer  Architecture Pipelines &  Superscalars

Pipelines

• Data Hazards
  • Code:

    lw  $4, 0($1)
    add $15, $1, $1
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

The last four instructions all depend on the result ($2) produced by the sub!

MIPS instructions have the format

    op dest, srca, srcb

Page 3: Computer  Architecture Pipelines &  Superscalars

Pipelines - Data hazards

• Examine the pipeline (ignore first 2!)

• r2 only updated in time for add!
  • and, or read r2 before sub has written it back
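The timing argument on this slide can be checked with a small model. This is a minimal sketch under stated assumptions, not the lecture's own material: it assumes a classic 5-stage pipeline (IF, ID, EX, MEM, WB), one instruction entering per cycle, no forwarding, and a register file that is written in the first half of a cycle and read in the second half. The function name `stale_readers` and the tuple encoding of instructions are hypothetical.

```python
# Hypothetical model: 5-stage pipeline (IF, ID, EX, MEM, WB), one instruction
# enters per cycle, no forwarding and no stalls.
# Assumption: registers are written in the first half of a cycle and read in
# the second half, so a read in the writer's WB cycle sees the new value.

def stale_readers(program, reg):
    """Return the instructions that read `reg` before a pending write lands.

    `program` is a list of (text, dest, sources) tuples in issue order.
    Instruction i (0-based) is in ID during cycle i + 2 and in WB during
    cycle i + 5.
    """
    stale = []
    write_cycle = None  # cycle in which the latest write to `reg` completes
    for i, (text, dest, sources) in enumerate(program):
        read_cycle = i + 2
        if write_cycle is not None and reg in sources and read_cycle < write_cycle:
            stale.append(text)
        if dest == reg:
            write_cycle = i + 5
    return stale


# The slide's code with the first two instructions ignored.
program = [
    ("sub $2, $1, $3",  "$2",  ["$1", "$3"]),
    ("and $12, $2, $5", "$12", ["$2", "$5"]),
    ("or $13, $6, $2",  "$13", ["$6", "$2"]),
    ("add $14, $2, $2", "$14", ["$2"]),
    ("sw $15, 100($2)", None,  ["$15", "$2"]),
]

print(stale_readers(program, "$2"))
```

Under these assumptions the and and the or are reported as reading a stale r2, while the add reads it in the same cycle that sub writes it back.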

Page 4: Computer  Architecture Pipelines &  Superscalars

Pipelines - Data Hazards

• Compiler solution
  • Insert NOOPs
  • Inefficient!
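The NOOP approach can be sketched as a simple compiler pass. This is an illustration only, assuming the same 5-stage timing as before (no forwarding, register file written before it is read within a cycle), which means a consumer must issue at least three slots after its producer. `MIN_GAP` and `insert_noops` are hypothetical names.

```python
# Assumption: with no forwarding, a consumer must issue at least MIN_GAP
# slots after the instruction that produces its operand.
MIN_GAP = 3

def insert_noops(program):
    """Pad `program` with noops so every register read is safe.

    `program` is a list of (text, dest, sources) tuples in program order.
    Returns the padded list of instruction strings.
    """
    out = []
    last_write = {}  # register -> index in `out` of its most recent writer
    for text, dest, sources in program:
        needed = 0
        for reg in sources:
            if reg in last_write:
                needed = max(needed, last_write[reg] + MIN_GAP - len(out))
        out.extend(["noop"] * needed)
        out.append(text)
        if dest is not None:
            last_write[dest] = len(out) - 1
    return out


program = [
    ("lw $4, 0($1)",    "$4",  ["$1"]),
    ("add $15, $1, $1", "$15", ["$1"]),
    ("sub $2, $1, $3",  "$2",  ["$1", "$3"]),
    ("and $12, $2, $5", "$12", ["$2", "$5"]),
    ("or $13, $6, $2",  "$13", ["$6", "$2"]),
    ("add $14, $2, $2", "$14", ["$2"]),
    ("sw $15, 100($2)", None,  ["$15", "$2"]),
]

for line in insert_noops(program):
    print(line)
```

For the slide's code this places two noops between the sub and the and: two issue slots that do no useful work, which is the inefficiency the slide points out.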

Page 5: Computer  Architecture Pipelines &  Superscalars

Pipelines - Data Hazards

• Second compiler solution
  • Reorder

Original order:

    lw  $4, 0($1)
    add $15, $1, $1
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

Reordered:

    sub $2, $1, $3      <- reads $1, $3; writes $2
    lw  $4, 0($1)       <- these two must not define $1 or $3!
    add $15, $1, $1     <-
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

Page 6: Computer  Architecture Pipelines &  Superscalars

Pipelines - Data Hazards

• Second compiler solution
  • Reorder

    sub $2, $1, $3      <- reads $1, $3; writes $2
    lw  $4, 0($1)
    add $15, $1, $1
    and $12, $2, $5     <- first use of $2
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

Page 7: Computer  Architecture Pipelines &  Superscalars

Pipelines - Data Hazards

• Compiler analyses dependencies
  • Register definitions
  • Register use
• Read After Write (RAW) dependency
• No dependencies
  • Instruction can be moved!

    sub $2, $1, $3      <- $2 written
    lw  $4, 0($1)
    add $15, $1, $1
    and $12, $2, $5     <- uses of $2
    or  $13, $6, $2     <-
    add $14, $2, $2     <-
    sw  $15, 100($2)    <-

Page 8: Computer  Architecture Pipelines &  Superscalars

Pipelines - Data Hazards

• Hardware solution
  • Value forwarding
    • Hardware detects dependency
      • scoreboard
    • Forwards result from WB to EX for subsequent use
• Hardware
  • Transparent to software!

Page 9: Computer  Architecture Pipelines &  Superscalars

Data Hazards - classification

• Read after Write (RAW)
  • Instruction 1 must write before instruction 2 reads
• Write after Write (WAW)
  • Instructions 1 and 2 both write
  • Instruction 2 must write after 1
• Write after Read (WAR)
  • Instruction 1 reads
  • Instruction 2 writes (overwrites)
  • Instruction 2 must not write before 1 reads

Reordering algorithms must consider all three!
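The three-way check a reordering pass has to make can be sketched as follows. This is an illustrative sketch, not the algorithm of any particular compiler: each instruction is reduced to a hypothetical `(dest, sources)` pair, and the function names `hazards` and `can_swap` are invented for the example.

```python
def hazards(first, second):
    """Classify the data hazards between two instructions in program order.

    Each instruction is a (dest, sources) pair: `dest` is the register it
    writes (or None) and `sources` is the set of registers it reads.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    found = set()
    if dest1 is not None and dest1 in srcs2:
        found.add("RAW")  # second reads what first writes
    if dest1 is not None and dest1 == dest2:
        found.add("WAW")  # both write the same register
    if dest2 is not None and dest2 in srcs1:
        found.add("WAR")  # second overwrites what first reads
    return found


def can_swap(first, second):
    """Two adjacent instructions may be exchanged only if no hazard links them."""
    return not hazards(first, second)


# Instructions from the earlier reordering example.
lw   = ("$4",  {"$1"})         # lw  $4, 0($1)
add  = ("$15", {"$1"})         # add $15, $1, $1
sub  = ("$2",  {"$1", "$3"})   # sub $2, $1, $3
and_ = ("$12", {"$2", "$5"})   # and $12, $2, $5

print(can_swap(add, sub), can_swap(lw, sub))  # sub may move above both
print(hazards(sub, and_))                     # and must stay after sub
```

Moving the sub above the lw and add is legal here because neither pair shows any of the three hazards, whereas sub and and are tied together by a RAW dependency on $2.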

Page 10: Computer  Architecture Pipelines &  Superscalars

Lecture 5 - Key Points

• Data Hazards
  • RAW - most common
  • WAW
  • WAR
• Compiler looks for dependencies
  • then re-orders
• Hardware
  • Scoreboard
    • Monitors dependencies
    • ensures correct operation
  • Value forwarding hardware
    • Forwards results from EX stage

Page 11: Computer  Architecture Pipelines &  Superscalars

Pipelines - Exceptions

• Caused by overflow, underflow
• Example

    add $1, $2, $1

  • Overflow detected in EX stage
  • Causes jump to exception handler
    • as branch - remainder of pipeline flushed
  but
  • Compiler needs original $1 causing overflow
    • Register must not be overwritten
    • EX stage needs to squash WB operation
• Precise Exception problem - more later!

Page 12: Computer  Architecture Pipelines &  Superscalars

Superpipelines

Page 13: Computer  Architecture Pipelines &  Superscalars

Superpipelines

• Time to complete each instruction = t
  • Total: Fetch + decode + fetch operands + operation + write-back
  • Clock frequency: f = 1/t
• An n-stage pipeline allows n instructions ‘in flight’ simultaneously
  • Each pipeline stage does 1/n of the work
  • Each stage requires time t/n
  • Assumes a perfectly balanced pipeline!
    • Balanced = each stage requires the same time
  • Clock frequency: f_pipe = 1/(t/n) = n/t

Increasing n increases processor power?
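A short calculation makes the idealised claim concrete. This is a sketch with assumed numbers (t = 10 ns, n = 5), not figures from the lecture. It uses the standard result that a perfectly balanced n-stage pipeline with no stalls finishes N instructions in (n + N - 1) stage times. The function names are hypothetical.

```python
def clock_frequency(t, n=1):
    """f = 1/t unpipelined, f_pipe = n/t for n perfectly balanced stages.

    With t in nanoseconds the result is in GHz.
    """
    return n / t


def pipelined_time(count, t, n):
    """Time for `count` instructions on an ideal n-stage pipeline (no stalls).

    The first instruction takes n stage times to drain; each later one
    completes one stage time (t / n) after its predecessor.
    """
    return (n + count - 1) * (t / n)


t, n, count = 10.0, 5, 1000          # assumed values: 10 ns, 5 stages
unpipelined = count * t
pipelined = pipelined_time(count, t, n)

print(clock_frequency(t), clock_frequency(t, n))   # GHz
print(unpipelined / pipelined)                     # speedup, approaches n
```

In this ideal model the speedup approaches n for long instruction streams. The following slides explain why real pipelines fall short of it.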

Page 14: Computer  Architecture Pipelines &  Superscalars

Pipelines - Depth

• Pipeline can’t be too deep
  • Hazards are frequent
  • many stalls in deep pipelines

[Figure: Relative Performance (axis 0.5 to 2.5) against Pipeline Depth (1, 2, 4, 8, 16), with a region annotated “Too Deep!”]

Page 15: Computer  Architecture Pipelines &  Superscalars

Pipelines - Depth

• Pipeline can’t be too deep
  • Hazards are frequent
  • many stalls in deep pipelines

[Figure: Relative Performance (axis 0.5 to 2.5) against Pipeline Depth (1, 2, 4, 8, 16), annotated “Too Deep!” and “Superpipelined”]

Page 16: Computer  Architecture Pipelines &  Superscalars

Pipeline depth

• Increasing number of stages
  • Each stage adds overheads
• Problems balancing pipeline
  • Require tpd_1 ≈ tpd_2 ≈ tpd_3
  • Stage time is tpd_j + tpd_reg
  • n stages means n × tpd_reg overhead

[Figure: three pipeline stages, each a Register followed by an Operation (work) block; every register has delay tpd_reg and the operations have delays tpd_1, tpd_2, tpd_3]
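The effect of the register overhead on clock rate can be sketched numerically. This is a minimal model under stated assumptions: the total work t_work is split evenly across n stages, each stage adds one register delay t_reg, and hazards are ignored entirely. The values 10.0 and 0.5 and the function names are made up for illustration.

```python
def stage_time(t_work, n, t_reg):
    """Stage time = tpd_j + tpd_reg, assuming perfectly balanced tpd_j = t_work / n."""
    return t_work / n + t_reg


def relative_performance(t_work, n, t_reg):
    """Clock-rate gain of an n-stage pipeline over a single registered stage.

    Assumption: hazards and stalls are ignored, so this is an upper bound.
    """
    return stage_time(t_work, 1, t_reg) / stage_time(t_work, n, t_reg)


# Assumed delays in arbitrary time units.
T_WORK, T_REG = 10.0, 0.5

for n in (1, 2, 4, 8, 16):
    print(n, round(relative_performance(T_WORK, n, T_REG), 2))
```

Even in this stall-free model the gain falls further behind n as the pipeline deepens, and it can never exceed (t_work + t_reg) / t_reg, because the register delay does not shrink with n.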

Page 17: Computer  Architecture Pipelines &  Superscalars

CISC and pipelines

• High Speed CISC processors are pipelined
  • Overlap IF, EX
• Variable
  • instruction length
  • running time (number of microcode cycles)
    • pipeline imbalance
    • “backup” in pipe stages
    • complicate hazard detection
• Complex addressing modes
  • auto-increment updates address register
  • multiple memory accesses required

smooth pipeline flow more difficult!

Page 18: Computer  Architecture Pipelines &  Superscalars

Instruction Queues

• Vital performance determinant
  • Rate of instruction fetch
• High Performance processors
  • Fetch multiple instructions in each cycle
    • 2 - 4 common
  • Use wide datapath to memory
    • PowerPC 604: 128 bits = 4 instructions
• Despatch unit
  • Examine dependencies
  • Determine which instructions can be despatched

Page 19: Computer  Architecture Pipelines &  Superscalars

Instruction Queues

• Q “matches” fetch/despatch rates
• General Strategy for matching Producers - Consumers
  • Use of FIFO-style Queues
  • Absorb Asynchronous Delivery / Consumption Rates
  • Provides Elasticity in pipelines

[Figure: Producer -> FIFO -> Consumer, with Differing Instantaneous Rates]
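The elasticity argument can be illustrated with a toy cycle-by-cycle simulation. This is a sketch under assumptions, not a model of any real fetch unit: the fetch side offers instructions in bursts, the despatch side accepts a fixed number per cycle, and anything offered while the queue is full is counted as lost fetch bandwidth (a real fetch unit would stall instead). The name `simulate` and the rate patterns are hypothetical.

```python
from collections import deque

def simulate(fetch_pattern, dispatch_pattern, capacity):
    """Run a FIFO between a bursty producer and a steady consumer.

    fetch_pattern[c]    -- instructions the fetch unit offers in cycle c
    dispatch_pattern[c] -- instructions the despatch unit can take in cycle c
    capacity            -- number of queue entries
    Returns (total despatched, peak queue occupancy).
    """
    queue = deque()
    next_id = 0
    dispatched = 0
    peak = 0
    for offered, wanted in zip(fetch_pattern, dispatch_pattern):
        # Producer side: enqueue only what fits (the rest is wasted bandwidth).
        for _ in range(min(offered, capacity - len(queue))):
            queue.append(next_id)
            next_id += 1
        peak = max(peak, len(queue))
        # Consumer side: despatch in FIFO order.
        for _ in range(min(wanted, len(queue))):
            queue.popleft()
            dispatched += 1
    return dispatched, peak


fetch = [4, 0, 4, 0, 4, 0, 4, 0]   # assumed: 4 instructions every other cycle
dispatch = [2] * 8                 # assumed: up to 2 despatched per cycle

print(simulate(fetch, dispatch, capacity=8))
print(simulate(fetch, dispatch, capacity=2))
```

Both sides average two instructions per cycle, but only the deeper queue lets the consumer run at that rate in every cycle. With the small queue half of each burst has nowhere to go and the despatch unit idles on alternate cycles.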

Page 20: Computer  Architecture Pipelines &  Superscalars

Superscalar Processors

Page 21: Computer  Architecture Pipelines &  Superscalars

PowerPC organisation

PowerPC 601 ~1993

Boundary of the Si die

New - Look in the “Example Processors” section of the Web notes

3-way SuperScalar
• Integer
• Branch
• Floating Point

A newer machine will have more functional units here!

Page 22: Computer  Architecture Pipelines &  Superscalars

Superscalar Processors

• Multiple Functional Units
  • PowerPC 604: 6-way superscalar
    • Branch Unit
    • Load/Store Unit
    • 3 Integer Units
    • Floating Point Unit
• Despatch Unit
  • Sends “ready” instructions to all free units
  • PowerPC 604:
    • potential 4 instructions/cycle (pipeline lengths are different!)
    • reality: 2-3 instructions/cycle? (program dependent!)

Page 23: Computer  Architecture Pipelines &  Superscalars

Superscalar Processors

• Mix of functional units
  • Up to 8-way superscalar common now
    • 2 Floating point units
      • Usually have ~3 cycle latency
    • 3 Integer Arithmetic
    • Branch unit
    • Load / store unit
    • + ….?
• Marketing departments can play some games with the ‘n’ of an n-way superscalar!

Page 24: Computer  Architecture Pipelines &  Superscalars

Pentium Quad Core - 2008

• Distinguish between
  • Multiple ‘cores’ (separate processors) – later –
  and
  • Superscalars – multiple functional units per processor
    ☺ “Wide dynamic execution” in Intel-speak
• Quad core
  • 4 cores
  • Complete up to 4 instructions / cycle each
  • IIU can issue four instructions / cycle
  • 3 MB L2 cache / processor (total 12 MB)
  • Master clock 3.2 GHz, front side bus 1.6 GHz
  • 771 pins

Page 25: Computer  Architecture Pipelines &  Superscalars

Superscalar Limitations

• To achieve maximum performance
  • Instruction mix must match Functional Unit mix
    • eg if we have 2 Integer ALUs, 2 FPUs, 1 branch unit, 1 load/store unit
    • Instruction issue unit (IIU) can issue 4 instructions
    • Each four instructions should be able to use 4 of the functional units
  • If instruction stream doesn’t have right mix
    • Some functional units will remain idle
• FPUs require multiple cycles
  • Additional stalls
• Pipeline hazards stall pipeline
• 4-way superscalar gets 1.8-3 instructions completed per cycle
  • Program dependent!
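The instruction-mix limitation can be illustrated with a toy issue model. This is a sketch under simplifying assumptions, not a description of a real processor: it uses the slide's example unit mix, issues in program order, stops issuing in a cycle at the first instruction whose unit type is fully occupied, and treats every unit as single-cycle with no data hazards. `UNITS`, `ISSUE_WIDTH` and `cycles_to_issue` are hypothetical names.

```python
# Functional unit mix from the example: 2 integer ALUs, 2 FPUs,
# 1 branch unit, 1 load/store unit; the IIU issues up to 4 per cycle.
UNITS = {"int": 2, "fp": 2, "branch": 1, "mem": 1}
ISSUE_WIDTH = 4

def cycles_to_issue(stream):
    """Count cycles needed to issue `stream` in program order.

    Assumptions: every unit is free again each cycle (single-cycle latency),
    there are no data hazards, and issue stops for the cycle at the first
    instruction that cannot get a unit of its type.
    """
    cycles = 0
    i = 0
    while i < len(stream):
        free = dict(UNITS)
        issued = 0
        while i < len(stream) and issued < ISSUE_WIDTH and free[stream[i]] > 0:
            free[stream[i]] -= 1
            issued += 1
            i += 1
        cycles += 1
    return cycles


balanced = ["int", "fp", "int", "mem"] * 4   # matches the unit mix
int_only = ["int"] * 16                      # only the 2 integer ALUs are used

for name, stream in (("balanced", balanced), ("int only", int_only)):
    cycles = cycles_to_issue(stream)
    print(name, cycles, len(stream) / cycles)
```

With the balanced stream the model sustains four instructions per cycle. The integer-only stream is limited to two per cycle while the FPUs, branch unit and load/store unit sit idle. Multi-cycle FPU latency and pipeline hazards, which this sketch leaves out, would lower both figures further.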