92
OOO Execution and 21264 1

OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

OOO Execution and 21264

1

Page 2: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Parallelism

• ET = IC * CPI * CT• IC is more or less fixed• We have shrunk cycle time as far as we can• We have achieved a CPI of 1.• Can we get faster?

2

Page 3: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Parallelism

• ET = IC * CPI * CT• IC is more or less fixed• We have shrunk cycle time as far as we can• We have achieved a CPI of 1.• Can we get faster?

2

We can reduce our CPI to less than 1.The processor must do multiple operations at once.

This is called Instruction Level Parallelism (ILP)

Page 4: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

The Basic 5-stage Pipeline

• Like an assembly line -- instructions move through in lock step

• In the best case, it can achieve one instruction per cycle (IPC).

• In practice, it’s much worse -- branches, data hazards, long-latency memory operations cause much lower IPC.

• We want an IPC > 1!!!

3

EXDeco

de

Fetch Mem Write

back

Page 5: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Approach 1: Widen the pipeline

• Process two instructions at once instead of 1• Often 1 “odd” PC instruction and 1 “even” PC

• This keeps the instruction fetch logic simpler.

• 2-wide, in-order, superscalar processor• Potential problems?

4

EXDeco

de

Fetch Mem Write

back

EXDeco

de

Fetch Mem Write

back

Decode

2 inst

Fetch 4

values

Mem

Two

Mem-

ory

ops

Fetch

PC

and

PC+4

Write

back

2

values

Page 6: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Dual issue: Structural Hazards

• Structural hazards• We might not replicate everything• Perhaps only one multiplier, one shifter, and one load/

store unit• What if the instruction is in the wrong place?

5

EX

Deco

de

Fetch Mem Write

back

EXDeco

de

Fetch Write

back

Decode

2 inst

Fetch 4

values

Fetch

PC

and

PC+4

Write

back

2

values

<<

*

If an “upper” instruction needs the “lower” pipeline, squash the “lower” instruction

Page 7: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Dual issue: Structural Hazards

• Structural hazards• We might not replicate everything• Perhaps only one multiplier, one shifter, and one load/

store unit• What if the instruction is in the wrong place?

5

EX

Deco

de

Fetch Mem Write

back

EXDeco

de

Fetch Write

back

Decode

2 inst

Fetch 4

values

Fetch

PC

and

PC+4

Write

back

2

values

<<

*EX

Deco

de

Fetch Mem Write

back

EXDeco

de

Fetch Write

back

Decode

2 inst

Fetch 4

values

Fetch

PC

and

PC+4

Write

back

2

values

<<

*

If an “upper” instruction needs the “lower” pipeline, squash the “lower” instruction

Page 8: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Dual issue: Data Hazards

• The “lower” instruction may need a value produced by the “upper” instruction

• Forwarding cannot help us -- we must stall.

6

EX

Deco

de

Fetch Mem Write

back

EXDeco

de

Fetch Write

back

Decode

2 inst

Fetch 4

values

Fetch

PC

and

PC+4

Write

back

2

values

<<

*EX

Deco

de

Fetch Mem Write

back

EXDeco

de

Fetch Write

back

Decode

2 inst

Fetch 4

values

Fetch

PC

and

PC+4

Write

back

2

values

<<

*

Page 9: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Compiling for Dual Issue

• The compiler should• Pair up non-conflicting instructions• Align branch targets (by potentially inserting noops

above them)

• These are similar to the rules for VLIW, but they are just guidelines, not rules.

7

Page 10: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Beyond Dual Issue

• Wider pipelines are possible.• There is often a separate floating point pipeline.

• Wide issue leads to hardware complexity• Compiling gets harder, too.• In practice, processors use of two options if they

want more ILP• If we can change the ISA: VLIW• If we can’t: Out-of-order

8

Page 11: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Going Out of Order: Data dependence refresher.

9

1: add $t1,$s2,$s3

2: sub $t2,$s3,$s4

3: or $t5,$t1,$t2

4: add $t3,$t1,$t2

1

2

3

4

Page 12: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Going Out of Order: Data dependence refresher.

9

1 2

3 4

There is parallelism!!We can execute1 & 2 at once

and 3 & 4 at once

1: add $t1,$s2,$s3

2: sub $t2,$s3,$s4

3: or $t5,$t1,$t2

4: add $t3,$t1,$t2

1

2

3

4

Page 13: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Going Out of Order: Data dependence refresher.

9

1 2

3 4

There is parallelism!!We can execute1 & 2 at once

and 3 & 4 at once

We can parallelize instructions that do not have a “read-after-

write” dependence (RAW)

1: add $t1,$s2,$s3

2: sub $t2,$s3,$s4

3: or $t5,$t1,$t2

4: add $t3,$t1,$t2

1

2

3

4

Page 14: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

• In general, if there is no dependence between two instructions, we can execute them in either order or simultaneously.

• But beware:• Is there a dependence here?

• Can we reorder the instructions?

• Is the result the same?

Data dependences

10

1: add $t1,$s2,$s3

2: sub $t1,$s3,$s4

1 2

2: sub $t1,$s3,$s4

1: add $t1,$s2,$s3

2: sub $t1,$s3,$s4

1: add $t1,$s2,$s3

Page 15: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

• In general, if there is no dependence between two instructions, we can execute them in either order or simultaneously.

• But beware:• Is there a dependence here?

• Can we reorder the instructions?

• Is the result the same?

Data dependences

10

1: add $t1,$s2,$s3

2: sub $t1,$s3,$s4

1 2

2: sub $t1,$s3,$s4

1: add $t1,$s2,$s3

No! The final value of $t1 is different

2: sub $t1,$s3,$s4

1: add $t1,$s2,$s3

Page 16: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

False Dependence #1

• Also called “Write-after-Write” dependences (WAW) occur when two instructions write to the same value

• The dependence is “false” because no data flows between the instructions -- They just produce an output with the same name.

11

Page 17: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

• Is there a dependence here?

• Can we reorder the instructions?

• Is the result the same?

Beware again!

12

1: add $t1,$s2,$s3

2: sub $s2,$s3,$s4

1 2

2: sub $s2,$s3,$s4

1: add $t1,$s2,$s3

Page 18: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

• Is there a dependence here?

• Can we reorder the instructions?

• Is the result the same?

Beware again!

12

1: add $t1,$s2,$s3

2: sub $s2,$s3,$s4

1 2

2: sub $s2,$s3,$s4

1: add $t1,$s2,$s3

No! The value in $s2 that 1 needs will be destroyed

Page 19: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

False Dependence #2

• This is a Write-after-Read (WAR) dependence• Again, it is “false” because no data flows between

the instructions

13

Page 20: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Out-of-Order Execution

• Any sequence of instructions has set of RAW, WAW, and WAR hazards that constrain its execution.

• Can we design a processor that extracts as much parallelism as possible, while still respecting these dependences?

14

Page 21: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

The Central OOO Idea

1. Fetch a bunch of instructions2. Build the dependence graph3. Find all instructions with no unmet

dependences4. Execute them.5. Repeat

15

Page 22: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Example

16

WAR

WAW

RAW

1: add $t1,$s2,$s3

2: sub $t2,$s3,$s4

3: or $t3,$t1,$t2

4: add $t5,$t1,$t2

Page 23: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Example

16

1

2

3

4

WAR

WAW

RAW

1: add $t1,$s2,$s3

2: sub $t2,$s3,$s4

3: or $t3,$t1,$t2

4: add $t5,$t1,$t2

Page 24: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Example

16

1

2

3

4

3

4

WAR

WAW

RAW

1: add $t1,$s2,$s3

2: sub $t2,$s3,$s4

3: or $t3,$t1,$t2

4: add $t5,$t1,$t2

Page 25: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Example

16

1

2

3

4

3

4

WAR

WAW

RAW

1: add $t1,$s2,$s3

2: sub $t2,$s3,$s4

3: or $t3,$t1,$t2

4: add $t5,$t1,$t2

5: or $t4,$s1,$s3

6: mul $t2,$t3,$s5

7: sl $t3,$t4,$t2

8: add $t3,$t5,$t1

Page 26: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Example

16

1

2

3

4

3

4

3

4

5

6

7

8

WAR

WAW

RAW

1: add $t1,$s2,$s3

2: sub $t2,$s3,$s4

3: or $t3,$t1,$t2

4: add $t5,$t1,$t2

5: or $t4,$s1,$s3

6: mul $t2,$t3,$s5

7: sl $t3,$t4,$t2

8: add $t3,$t5,$t1

Page 27: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Example

16

1

2

3

4

3

4

3

4

5

6

7

8

5

6

7

8

WAR

WAW

RAW

1: add $t1,$s2,$s3

2: sub $t2,$s3,$s4

3: or $t3,$t1,$t2

4: add $t5,$t1,$t2

5: or $t4,$s1,$s3

6: mul $t2,$t3,$s5

7: sl $t3,$t4,$t2

8: add $t3,$t5,$t1

Page 28: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Example

16

1

2

3

4

3

4

3

4

5

6

7

8

5

6

7

8

7

8

WAR

WAW

RAW

1: add $t1,$s2,$s3

2: sub $t2,$s3,$s4

3: or $t3,$t1,$t2

4: add $t5,$t1,$t2

5: or $t4,$s1,$s3

6: mul $t2,$t3,$s5

7: sl $t3,$t4,$t2

8: add $t3,$t5,$t1

Page 29: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Example

16

1

2

3

4

3

4

3

4

5

6

7

8

5

6

7

8

7

8

8 Instructions in 5 cycles

WAR

WAW

RAW

1: add $t1,$s2,$s3

2: sub $t2,$s3,$s4

3: or $t3,$t1,$t2

4: add $t5,$t1,$t2

5: or $t4,$s1,$s3

6: mul $t2,$t3,$s5

7: sl $t3,$t4,$t2

8: add $t3,$t5,$t1

Page 30: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Simplified OOO Pipeline

17

Deco

de

FetchEX

Mem Write

back

Sche

dule

• A new “schedule” stage manages the “Instruction Window”• The window holds the set of instruction the processor

examines• The fetch and decode fill the window• Execute stage drains it

• Typically, OOO pipelines are also “wide” but it is not necessary.

• Impacts• More forwarding, More stalls, longer branch resolution• Fundamentally more work per instruction.

Page 31: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

The Instruction Window

18

• The “Instruction Window” is the set of instruction the processor examines• The fetch and decode fill the window• Execute stage drains it

• The larger the window, the more parallelism the processor can find, but...

• Keeping the window filled is a challenge

Page 32: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Case Study: Alpha 21264

Page 33: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Digital Equipment Corporation

• One of the Big Old Computer companies (along with IBM)–Business-oriented computers–Check out Gordon Bell’s lecture in “History of

Computing” class • They produced a string of famous machines• Sold to Compaq in 1998• Sold to HP (and Intel) in 2002

Page 34: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

The PDPs• Most famous: PDP-11

– Birthplace of UNIX– Elegant ISA– Designed by a small team in short order

• In response to competitor• Formed by defecting engineers

– 16 bits of virtual address• PDP-5 and PDP-8 were 12 bits

– Chronically short of address bits– Sold until 1997

Page 35: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

The VAX• (In)famous and long-lived

–for "Virtual Address Extension (to the PDP-11)”• LOTS of extensions

–Very CISCy -- polynomial evaluate inst. Etc.

Page 36: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

The Alpha

• Four processors–21064, 21164, 21264, 21364, (21464)–21 for “21st century”; 64 - for “64 bit”

• High-end workstations/servers• Fast processors in the world at introduction• Unix, VMS (old VAX OS), WindowsNT, Linux• Alpha died when Intel bought the IP and the

design team.

Page 37: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

AlphaAXP• New ISA from scratch

– No legacy anything (almost)• VAX-style floating point mode

– 64-bit– Very clean RISC ISA

• Register-Register/Load-Store• No condition codes• Conditional moves -- reduced branching, but at what cost?

– 32 GPRs and FPRs• OS support

– PALCode -- “firmware” control of low-level hardware• VAX compatibility provided in software

– VAX ISA -> Alpha via a compiler

Page 38: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21064

• Introduced in 1991• 100-300Mhz (blazingly fast at

the time)• 750nm/0.75micron (vs 45nm

today)• 234mm2

die, 1.6M transistors• 33 Watts• Full custom design

Page 39: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21064 (cont)• Pipeline

– Dual issue– 7 stage integer/10 stage FP– 4 cycle mis-prediction penalty.– 45 bypassing paths– 22 instructions “in flight”

• Caches– On-chip L1I + L1D. 8KB each– Off-chip L2

• Branch prediction– Static: forward taken/Back not taken– Simple dynamic prediction– 80% accuracy

Page 40: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21164

• Introduced in 1995• 500Mhz• 500nm/0.5micron• 299mm2

die, 9.7M transistors• 56W

Page 41: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21164 (cont)• Pipeline

– Quad issue: 2 integer + 2 FP– 7 stage integer/10 stage FP

• Caches– On-chip L1I + L1D. 8KB each. Direct-mapped (fast!)

• Hit under miss/miss under miss (21 outstanding at once)– On-chip 3-way 96KB L2.– Off-chip L3 (1-64MB)

• ISA changes– Native support for byte operations

• Branch prediction– 5 cycle mispredict penalty– History-based dynamic predictor. Bits stored per cache line.

Page 42: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264

• Introduced in 1998• 600Mhz-1.2Ghz• 0.35-0.18micron• 314mm2

die, 15.2M transistors• 73W

Page 43: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264 (cont)

• Pipeline–6-issue: 4 integer + 2 FP–7 stage integer/longer for FP, depending or op.–80 in-flight instructions

• Caches–On-chip L1I + L1D. 64KB each. 2-way–Off-chip L2–Compared to 21164 8x the L1 capacity, but no on-

chip L2

Page 44: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Aggressive Speculation

• The 21264 executes instructions that may or may not be on the correct path.

• When it’s wrong, it has to undo those instructions–It stores backups of renaming tables, register

file, etc.–It also must prevent changes to memory from

occurring until the instructions “commit”

31

Page 45: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

In Order Fetch and Commit

• Fetch is in-order• Execution is out of order

–Extract as much parallelism as possible• Commit is in-order

–Make the changes permanent in program order.

–This is what is “visible” to the programmer.–This enables precise exceptions (mostly)

32

Page 46: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264 (cont)

• Fetch unit– Pre-decodes instructions in the Icache– next line and set predictors -- correct 80-100%– Tournament predictor

• A local history predictor + A global history predictor• A third predictor to track which one is most effective• 2 cycle to make a prediction

33

Page 47: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264: I Cache/fetch

Instructions Next Line Next Way Pre-decodedbits

• 64KB, 2-way, 16byte lines (4 instructions)• Each line also contains extra information:

– Incorporates BTB and parts of instruction decode– BTB data is protected by 2-bits of hysteresis, trained by branch predictor.

• Branch prediction is aggressive to find parallelism and exploit speculative out-of-order execution. – We wants lots of instructions in flight.

• On a miss, it prefetches up to 64 instructions

Page 48: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264

BranchPredictor

Next line/Set prediction

L1I64KB, 2-way

Int regrename

FP regrename

IntIQ

20 entries

FPIQ

15 entries

FPRegFile(72)

Int RegFile(80)

IntRegFile(80)

ALU

ALU

ALU

ALU

L1D64KB2-way

L296KB3-way

Fetch Rename IssueRegRead Execute Memory

FP Mult

FP Add

Slo

t

Page 49: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264

BranchPredictor

Next line/Set prediction

L1I64KB, 2-way

Int regrename

FP regrename

IntIQ

20 entries

FPIQ

15 entries

FPRegFile(72)

Int RegFile(80)

IntRegFile(80)

ALU

ALU

ALU

ALU

L1D64KB2-way

L296KB3-way

Fetch Rename IssueRegRead Execute Memory

FP Mult

FP Add

Slo

t

“enriched”L1 Icache

Page 50: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264

BranchPredictor

Next line/Set prediction

L1I64KB, 2-way

Int regrename

FP regrename

IntIQ

20 entries

FPIQ

15 entries

FPRegFile(72)

Int RegFile(80)

IntRegFile(80)

ALU

ALU

ALU

ALU

L1D64KB2-way

L296KB3-way

Fetch Rename IssueRegRead Execute Memory

FP Mult

FP Add

Slo

t

“enriched”L1 Icache

Out-of-order

Page 51: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264

BranchPredictor

Next line/Set prediction

L1I64KB, 2-way

Int regrename

FP regrename

IntIQ

20 entries

FPIQ

15 entries

FPRegFile(72)

Int RegFile(80)

IntRegFile(80)

ALU

ALU

ALU

ALU

L1D64KB2-way

L296KB3-way

Fetch Rename IssueRegRead Execute Memory

FP Mult

FP Add

Slo

t “Cluster”

“enriched”L1 Icache

Out-of-order

Page 52: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264

BranchPredictor

Next line/Set prediction

L1I64KB, 2-way

Int regrename

FP regrename

IntIQ

20 entries

FPIQ

15 entries

FPRegFile(72)

Int RegFile(80)

IntRegFile(80)

ALU

ALU

ALU

ALU

L1D64KB2-way

L296KB3-way

Fetch Rename IssueRegRead Execute Memory

FP Mult

FP Add

Slo

t “Cluster”

Dual portedL1

“enriched”L1 Icache

Out-of-order

Page 53: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

How Much Parallelism is There?

• Not much, in the presence of WAW and WAR dependences.

• These arise because we must reuse registers, and there are a limited number we can freely reuse.

• How can we get rid of them?

36

Page 54: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Removing False Dependences

• If WAW and WAR dependences arise because we have too few registers• Let’s add more!

• But! We can’t! The Architecture only gives us 32 (why or why did we only use 5 bits?)

• Solution:• Define a set of internal “physical” register that is as large

as the number of instructions that can be “in flight” -- 128 in the latest intel chip.

• Every instruction in the pipeline gets a registers• Maintaining a register mapping table that determines

which physical register currently holds the value for the required “architectural” registers.

• This is called “Register Renaming”37

Page 55: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264: Renaming

• Separate INT and FP• Replaces “architectural registers” with “physical

registers”– 80 integer physical registers– 72 FP physical registers– Eliminates WAW and WAR hazards

• Register map table maintains mapping between architectural and physical registers– One copy for each in-flight instruction (80 copies)

• Special handling for conditional moves.

Page 56: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264: Renaming

• Two parts– Content-addressable lookup to find physical register inputs– Register allocation to rename the output

• Four instructions can be renamed each cycle.– 8 ports on the lookup table– 4 allocations per cycle

• There is no fixed location for architectural register values!– How can we read architectural register r10?

Page 57: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264: Renaming

1: Add r3, r2, r32: Sub r2, r1, r33: Mult r1, r3, r14: Add r2, r3, r15: Add r2, r1, r3

r1 r2 r3p1 p2 p3

1:

2:

3:

4:

5:

12

34

5WARWAWRAW

Register map table

Page 58: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264: Renaming

1: Add r3, r2, r32: Sub r2, r1, r33: Mult r1, r3, r14: Add r2, r3, r15: Add r2, r1, r3

r1 r2 r30: p1 p2 p31: p1 p2 p42:

3:

4:

5:

p4, p2, p3

12

34

5WARWAWRAW

Page 59: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264: Renaming

1: Add r3, r2, r32: Sub r2, r1, r33: Mult r1, r3, r14: Add r2, r3, r15: Add r2, r1, r3

r1 r2 r30: p1 p2 p31: p1 p2 p42: p1 p5 p43:

4:

5:

p4, p2, p3p5, p1, p4

12

34

5WARWAWRAW

Page 60: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264: Renaming

1: Add r3, r2, r32: Sub r2, r1, r33: Mult r1, r3, r14: Add r2, r3, r15: Add r2, r1, r3

r1 r2 r30: p1 p2 p31: p1 p2 p42: p1 p5 p43: p6 p5 p44:

5:

p4, p2, p3p5, p1, p4p6, p4, p1

12

34

5WARWAWRAW

Page 61: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264: Renaming

1: Add r3, r2, r32: Sub r2, r1, r33: Mult r1, r3, r14: Add r2, r3, r15: Add r2, r1, r3

r1 r2 r30: p1 p2 p31: p1 p2 p42: p1 p5 p43: p6 p5 p44: p6 p7 p45:

p4, p2, p3p5, p1, p4p6, p4, p1p7, p4, p6

12

34

5WARWAWRAW

Page 62: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264: Renaming

1: Add r3, r2, r32: Sub r2, r1, r33: Mult r1, r3, r14: Add r2, r3, r15: Add r2, r1, r3

r1 r2 r30: p1 p2 p31: p1 p2 p42: p1 p5 p43: p6 p5 p44: p6 p7 p45: p6 p8 p4

p4, p2, p3p5, p1, p4p6, p4, p1p7, p4, p6p8, p6, p4

12

34

5WARWAWRAW

Page 63: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264: Renaming

1: Add r3, r2, r32: Sub r2, r1, r33: Mult r1, r3, r14: Add r2, r3, r15: Add r2, r1, r3

r1 r2 r30: p1 p2 p31: p1 p2 p42: p1 p5 p43: p6 p5 p44: p6 p7 p45: p6 p8 p4

p4, p2, p3p5, p1, p4p6, p4, p1p7, p4, p6p8, p6, p4

12

34

5WARWAWRAW

1

2 3

4 5

Page 64: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264: Issue Queue

• Separate Int and FP• Decouple front and back ends• Dynamically track dependences

– Instructions can issue once their input registers are written– Track register status in “register scoreboard”– Issue instructions “around” long-latency operations– Exploit cross-loop parallelism

• Issue up to 4 instructions/cycle (2 floating point)– Issue oldest first– Compact the queue (the free slots are always mostly at the top)

Page 65: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264: Issue Queue

1: Add p4, p2, p32: Sub p5, p1, p43: Mult p6, p4, p14: Add p7, p4, p65: Add p8, p6, p4

RegisterFile

ALU

ALU

1

2 3

4 5

p3p4

p5p2p1

p8p7p6

Register scoreboard

Page 66: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264: Issue Queue

1: Add p4, p2, p32: Sub p5, p1, p43: Mult p6, p4, p14: Add p7, p4, p65: Add p8, p6, p4

12

RegisterFile

ALU

ALU

1

2 3

4 5

p3p4

p5p2p1

p8p7p6

Register scoreboard

Page 67: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264: Issue Queue

1: Add p4, p2, p32: Sub p5, p1, p43: Mult p6, p4, p14: Add p7, p4, p65: Add p8, p6, p4

342

RegisterFile

1,-

ALU

ALU

1

2 3

4 5

p3p4

p5p2p1

p8p7p6

Register scoreboard

Page 68: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264: Issue Queue

1: Add p4, p2, p32: Sub p5, p1, p43: Mult p6, p4, p14: Add p7, p4, p65: Add p8, p6, p4

54

RegisterFile

2,3

ALU1

ALU

1

2 3

4 5

p3p4

p5p2p1

p8p7p6

Register scoreboard

Page 69: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264: Issue Queue

1: Add p4, p2, p32: Sub p5, p1, p43: Mult p6, p4, p14: Add p7, p4, p65: Add p8, p6, p4

54

RegisterFile

ALU2

ALU3

1

2 3

4 5

p3p4

p5p2p1

p8p7p6

Register scoreboard

Page 70: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264: Issue Queue

1: Add p4, p2, p32: Sub p5, p1, p43: Mult p6, p4, p14: Add p7, p4, p65: Add p8, p6, p4

54

RegisterFile

ALU

ALU3

1

2 3

4 5

p3p4

p5p2p1

p8p7p6

Register scoreboard

Page 71: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264: Issue Queue

1: Add p4, p2, p32: Sub p5, p1, p43: Mult p6, p4, p14: Add p7, p4, p65: Add p8, p6, p4

RegisterFile

5,4

ALU

ALU3

1

2 3

4 5

p3p4

p5p2p1

p8p7p6

Register scoreboard

Page 72: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264: Issue Queue

1: Add p4, p2, p32: Sub p5, p1, p43: Mult p6, p4, p14: Add p7, p4, p65: Add p8, p6, p4

RegisterFile

ALU5

ALU4

1

2 3

4 5

p3p4

p5p2p1

p8p7p6

Register scoreboard

Page 73: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

The Issue Window

56

opcode etc vrs vrt rs_value valid rt_value valid

=

=

=

=

alu_out_dst_0Decoded

Instruction

data vrtvrs alu_out_dst_1 alu_out_value_0 alu_out_value_1

Ready

rt_value

rs_value

opcode

Page 74: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

The Issue Window

57

Arbitration

ALU0

ALU1

insts

Page 75: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264: Execution

• Integer ALUs are clustered• Two ALUs share a complete replica of the

Int register file• 1 cycle extra latency for cross-cluster

updates– Not a big performance hit– Issue queue can issue any instruction to either

cluster– Critical paths tend to stay in one cluster

• Area savings– Register file size is quadratic in # of ports– Each replica needs 4 read, 4 write ports (2 local

writes, 2 remote)– Unclustered -> 8 read, 4 write ports– O(2*82) vs O(122)

• Simpler too.• This is the beginning of the “slow wires

problem”

Page 76: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264: Memory Interface

• Memory is king!!!– One of Alpha’s niche markets was large, memory-intensive applications– They went 64-bits for the physical address space as much as for the

virtual.• Lots of outstanding requests

– 32 loads, 32 stores (D only)– 8 cache misses (I + D)

• Big caches (64KB, 2-way)– What does Patterson’s thumb say?– 2 loads/stores per cycle– Double-pumped instead of multi-ported. (area vs clock rate)– Virtually-index, physically tagged

• 8-entry victim buffer shared between L1I and L1D

Page 77: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264: Memory interface• Memory ordering

– Renames memory locations• LDQ/STQ

– 32 entries each.– Sorted in fetch order (but arrive out-of-order)– Instruction remain in the queues until

retirement– Load watch for younger stores to the same

address• Squash the load and subsequent instructions if a

match occurs– Stores watch for younger stores – Speculative loads get speculative data from

“speculative store data buffer”

Page 78: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

61

Page 79: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264: Retirement

• Instructions retire in-order• At retirement

–Stores write to memory–Renamed registers are released

• Each instruction carries the physical register number that held the previous value for the instruction’s architectural destination register.

• Since retirement is in-order, that register is dead.

• On exceptions,–All younger instructions are squashed–Register map reverts to state before the exception.

Page 80: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264: Memory interface

Source:ST r0, 0(r10)LD r1, 0(r11)

Execution:LD r1, 0(r11)… ST r0, 0(r10)

R11 == r10 => violation, pipe flush

• Ordering violations

• Mark the Load as “delayed”–In the future, it will wait for all previous stores–Clear the “delayed” flag ever 16,384 cycles

Page 81: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21264: Memory Interface

• Speculative cache hits (integer only)• The instruction queue assumes loads hit the L1• When they don’t hit, do a mini-restart

–Up to 8 instructions are “pulled back into the issue queue to be reissued”

–Results in a 2 cycle bubble• A single 4-bit predictor tracks the miss behavior.

Page 82: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Alpha 21364

• Introduced 2003• 1.3Ghz• 0.18micron, 130M transistors• 400mm2

• 125 Watts• 21264 + 1.75MB on-chip L2• Essentially a 21264 with an on-chip

cache.

Page 83: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

21064 21164 21264 21364 21064 21164 21264 21364

21064 21164 21264 21364

21064 21164 21264 21364

Page 84: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the
Page 85: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

300MHz1.7x improvement

Page 86: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

600MHz1.8x improvement

300MHz1.7x improvement

Page 87: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

600MHz1.8x improvement

300MHz1.7x improvement

27.8x improvement8.3x cycle time improvement3.5x from architecture

Page 88: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Modern OOO Processors• The fastest machines in the world are OOO

superscalars• AMD Barcelona

• 6-wide issue• 106 instructions inflight at once.

• Intel Nehalem• 5-way issue to 12 ALUs• > 128 instructions in flight

• OOO provides the most benefit for memory operations.• Non-dependent instructions can keep executing during cache

misses.• THis is so-called “memory-level parallelism.”• It is enormously important. CPU performance is (almost) all

about memory performance nowadays (remember the memory wall graphs!)

68

Page 89: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

The Problem with OOO

• Even the fastest OOO machines only get about 1-2 IPC, even though they are 4-5 wide.

• Problems• Insufficient ILP within applications. -- 1-2 per thread,

usually• Poor branch prediction performance• Single threads also have little memory parallelism.

• Observation• On many cycles, many ALUs and instruction queue slots

sit empty

69

Page 90: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Simultaneous Multithreading• AKA HyperThreading in Intel machines• Run multiple threads at the same time• Just throw all the instructions into the pipeline• Keep some separate data for each

• Renaming table• TLB entries• PCs

• But the rest of the hardware is shared.• It is surprisingly simple (but still quite complicated)

70

Deco

de

Fetch

T1EX

Mem Write

back

Sche

dule

Rena

me

Deco

deEX

Mem Write

back

Sche

dule

Rena

me

Deco

de

Mem Write

back

Sche

dule

Rena

me

Fetch

T2

Fetch

T3

Fetch

T4

Page 91: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

SMT Advantages

• Exploit the ILP of multiple threads at once• Less dependence or branch prediction (fewer

correct predictions required per thread)• Less idle hardware (increased power efficiency)• Much higher IPC -- up to 4 (in simulation)• Disadvantages: threads can fight over resources

and slow each other down.

• Historical footnote: Invented, in part, by our own Dean Tullsen when he was at UW

71

Page 92: OOO Execution and 21264cseweb.ucsd.edu/classes/wi10/cse240a/Slides/12_OOO.pdfThe Basic 5-stage Pipeline • Like an assembly line -- instructions move through in lock step • In the

Keeping the Window Filled

• Keeping the instruction window filled is key!• Instruction windows are about 32 instructions

• (size is limited by their complexity, which is considerable)

• Branches are every 4-5 instructions.• This means that the processor predict 6-8

consecutive branches correctly to keep the window full.

• On a mispredict, you flush the pipeline, which includes the emptying the window.

72