



Topic 5: Superscalar Processors

Eduard Ayguadé i Josep Llosa

These slides have been prepared using material available at: 1) the companion web site for “Computer Organization & Design. The Hardware/Software Interface. Copyright 1998 Morgan Kaufmann Publishers.” 2) some slides and examples are part of the teaching material of Prof. Guri Sohi (U. of Wisconsin/Madison). 3) Some processor diagrams have been extracted from the “Microprocessor Report journal, Copyright In/Stat&MDR.” 4) Other material available through the internet.

Issuing multiple instructions per cycle

CPI < 1

Two variations:
• Very Long Instruction Word (VLIW): a fixed number of instructions (up to 16) scheduled by the compiler. Example: the joint HP/Intel EPIC/Itanium.
• Superscalar: a varying number of instructions per cycle (1 to 8), scheduled by the compiler (statically scheduled) or by hardware (Tomasulo; dynamically scheduled). Examples: IBM PowerPC, Sun SuperSparc, DEC Alpha, HP PA-8000.


VLIW design

Making our pipeline superscalar, for example: 2 instructions executed per cycle, with some constraints:
• one instruction is arithmetic
• one instruction accesses memory
• the second can only be issued if the first is issued

2 instructions are fetched from memory, paired and aligned to 64-bit boundaries, to feed the two pipelines.

Pipeline diagram: in every cycle one Load/Store instruction and one ALU or branch instruction enter the pipeline together, and each pair advances through the F, D, E, M, W stages one cycle behind the previous pair.
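As a rough illustration of the pairing constraint above, here is a C sketch (the enum and function names are hypothetical, not taken from the slides) of the check the issue logic could apply to two consecutive instructions:

#include <stdio.h>

/* Hypothetical instruction classes for the 2-wide pipeline sketch. */
typedef enum { CLASS_ALU, CLASS_BRANCH, CLASS_LOAD_STORE } iclass;

/* A pair can be issued together only if one slot holds an ALU/branch
   instruction and the other a load/store; the second instruction is
   only considered when the first one issues. */
int can_dual_issue(iclass first, iclass second)
{
    int first_mem  = (first  == CLASS_LOAD_STORE);
    int second_mem = (second == CLASS_LOAD_STORE);
    return first_mem != second_mem;   /* exactly one memory operation */
}

int main(void)
{
    printf("%d\n", can_dual_issue(CLASS_ALU, CLASS_LOAD_STORE));        /* 1: issue both */
    printf("%d\n", can_dual_issue(CLASS_LOAD_STORE, CLASS_LOAD_STORE)); /* 0: issue only the first */
    return 0;
}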


VLIW design

Simple superscalar code scheduling:

loop: ld   $3, 0($1)
      add  $3, $3, $2
      st   $3, 0($1)
      addi $1, $1, -4
      bne  $1, $0, loop

The first three instructions have data dependences, and so do the last two. A possible schedule is:

cycle   Data memory access    ALU or branch
  1     loop: ld $3, 0($1)    nop
  2     nop                   addi $1, $1, -4
  3     nop                   add $3, $3, $2
  4     st $3, 4($1)          bne $1, $0, loop

i.e. 4 cycles to execute 5 instructions (CPI = 0.8)

VLIW design

Loop unrolling can help to decrease the CPI (assuming that the number of iterations is a multiple of 4):

loop: ld   $3, 0($1)
      add  $3, $3, $2
      st   $3, 0($1)
      ld   $3, 4($1)
      add  $3, $3, $2
      st   $3, 4($1)
      ld   $3, 8($1)
      add  $3, $3, $2
      st   $3, 8($1)
      ld   $3, 12($1)
      add  $3, $3, $2
      st   $3, 12($1)
      addi $1, $1, -16
      bne  $1, $0, loop

(notice that with unrolling we reduce the number of instructions that control the execution of the loop)
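At the source level, the transformation corresponds roughly to the following C sketch (array and variable names are illustrative, not taken from the slides); the unrolled version performs one loop-control update and one branch per four element updates:

/* Original loop: add a constant k to every element of a[]. */
void add_k(double *a, double k, int n)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] += k;
}

/* Unrolled by 4 (assuming n is a multiple of 4, as in the slide). */
void add_k_unrolled(double *a, double k, int n)
{
    int i;
    for (i = 0; i < n; i += 4) {
        a[i]     += k;
        a[i + 1] += k;
        a[i + 2] += k;
        a[i + 3] += k;
    }
}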


VLIW design

A possible schedule is (registers renamed to $3, $6, $5, $4 across the four unrolled copies):

cycle   Data memory access    ALU or branch
  1     loop: ld $3, 0($1)    nop
  2     ld $6, 12($1)         nop
  3     ld $5, 8($1)          add $3, $3, $2
  4     ld $4, 4($1)          add $6, $6, $2
  5     st $3, 0($1)          add $5, $5, $2
  6     st $6, 12($1)         add $4, $4, $2
  7     st $5, 8($1)          addi $1, $1, -16
  8     st $4, 4+16($1)       bne $1, $0, loop

i.e. 8 cycles to execute the 4 iterations (instead of 4 cycles per iteration): CPI = 0.57

EPIC: beyond RISC and VLIW

Explicitly Parallel Instruction Computing:
• parallel instruction encoding
• instruction dependence hints allow flexible instruction grouping
• large directly addressable register file (128 or more registers)
• fully predicated instruction set
• a family of binary-compatible processors, avoiding recompilation


EPIC: Instruction Bundles

Grouping information: dependencies among instructions in the bundle (no empty slots as in VLIW) and chaining with the next bundle. There is no direct mapping to hardware as in VLIW.

Bundle layout: Instruction 2 | Instruction 1 | Instruction 0 | Template

Each instruction contains:
• Opcode
• Predicate register (6 bits)
• Source1 (7 bits)
• Source2 (7 bits)
• Destination (7 bits)
• Opcode extension / branch target / misc

The template contains:
• Instruction grouping information
• Prefetch hints

EPIC: Predication

Predicate registers: 64, each just one bit

Increased opportunity for parallel execution

Code with branches:

      instr 1
      instr 2
      …
      cmp (a==b)
      jump equ lb1
      instr 3
      instr 4
      jump lb2
lb1:  instr 5
      instr 6
lb2:  instr 7
      instr 8
      ...

Predicated code:

      instr 1
      instr 2
      ...
      p1, p2 ← cmp (a==b)
      (p1) instr 3
      (p1) instr 4
      (p2) instr 5
      (p2) instr 6
      instr 7
      instr 8
      ...


EPIC: Speculation

• Long-latency loads stall the processor
• Avoid the complexity of the out-of-order logic of dynamic superscalar cores

load.s: speculative, non-faulting load access

Without speculation:

      instr 1
      instr 2
      jump equ lb1
      load $1, …
      instr 3 …, $1
      …

With the load speculatively hoisted above the branch:

      load.s $1, …
      instr 1
      instr 2
      jump equ lb1
      chk.s $1
      instr 3 …, $1
      …

Itanium® architecture

Up to 6 instructions per cycle from two bundles, 3 instructions each.


Itanium® vs. Itanium® 2 architecture

Estimated Itanium® 2 performance: 1.5-2X of Itanium

Itanium® 2 die layout and overview


Superscalar design

Superscalar design: execute multiple instructions every clock cycle

T = N * CPI * (1/W) * tc

where W is the number of instructions that can be initiated per clock cycle.
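As a quick worked example (the numbers are illustrative, not from the slides): for N = 10,000 instructions, CPI = 1, W = 4 and tc = 10 ns, T = 10,000 * 1 * (1/4) * 10 ns = 25 µs, i.e. four times faster than the single-issue version of the same machine.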

The superscalar approach avoids the rigid layout of instructions imposed by the VLIW design.

We are going too fast … think again

#define N 9984
double x[N+8], y[N+8], u[N];

loop () {
    register int i;
    double q;
    for (i = 0; i < N; i++) {
        q = u[i] * y[i];
        y[i] = x[i] + q;
        x[i] = q - u[i] * x[i];
    }
}

Operations per iteration:
• 3 reads
• 2 writes
• 2 multiplications
• 1 addition
• 1 subtraction


We are going too fast … think again

      ld   $10, @y[0]
      ld   $11, @u[0]
      ld   $12, @x[0]
      ld   $13, N
loop: ld   f1, 0($10)
      ld   f2, 0($11)
      ld   f3, 0($12)
      mulf f4, f1, f2
      mulf f5, f2, f3
      add  $11, $11, #8
      addf f6, f4, f3
      subf f7, f4, f5
      st   0($10), f6
      add  $10, $10, #8
      st   0($12), f7
      sub  $13, $13, #1
      bne  loop
      add  $12, $12, #8   ; delay slot

Execution stage:
• 1 cycle integer
• 3 cycles FP

Branches effective after D

We are going too fast … think again

Pipelined processor with 1 integer unit and 1 FP unit, 100 MHz

1 iteration of 14 instructions every 15 cycles. If tc = 10 ns, we get:
• 93 MIPS out of 100 MIPS peak (14 instructions every 15 cycles at 100 MHz)
• 26.6 MFLOPS out of 100 MFLOPS peak (4 FP operations every 15 cycles at 100 MHz)

Pipeline diagram (1 iteration, 15 cycles): the 14 instructions flow through the F, D, E, M, W stages in program order; FP operations spend 3 cycles in E, and the dependences stretch one iteration to 15 cycles.


We are going too fast … think again

VLIW processor with W=4: 1 integer unit, 1 FP unit, 1 memory unit and 1 branch unit

1 iteration of 14 instructions every 5 cycles. If tc = 10 ns, we get:
• 280 MIPS out of 400 MIPS peak (14 instructions every 5 cycles at 100 MHz)
• 80 MFLOPS out of 100 MFLOPS peak (4 FP operations every 5 cycles at 100 MHz)

Kernel of the modulo-scheduled loop (one VLIW instruction per cycle; the label in parentheses is the iteration each operation belongs to):

cycle   memory       FP            integer       branches
  1     ld1 (i+2)                  add1 (i+2)
  2     ld2 (i+2)    addf (i+1)    add2 (i+2)
  3     ld3 (i+2)    subf (i+1)    sub (i+2)
  4     st2 (i)      mulf1 (i+2)                 bne (i+2)
  5     st1 (i+1)    mulf2 (i+2)   add3 (i+2)

Modulo scheduling

We are going too fast … think again

Schedule of iteration 0 on the W=4 VLIW (iterations 1 and 2 repeat the same pattern starting at cycles 6 and 11, one new iteration every 5 cycles):

cycle   memory       FP            integer       branches
  1     ld1 (0)                    add1 (0)
  2     ld2 (0)                    add2 (0)
  3     ld3 (0)                    sub (0)
  4                  mulf1 (0)                   bne (0)
  5                  mulf2 (0)     add3 (0)
  6
  7                  addf (0)
  8                  subf (0)
  9
 10     st1 (0)
 11     st2 (0)

Two competing memory operations end up in the same cycle: st2 is delayed, so one iteration lasts 3 more cycles, but we can overlap iterations and start a new one every 5 cycles (pipelining).


We are going too fast … think again

      ld   f1, 0($10)
      ld   f2, 0($11)
      ld   f3, 0($12)
      mulf f4, f1, f2
      mulf f5, f2, f3
      add  $11, $11, #8
      addf f6, f4, f3
      subf f7, f4, f5
      st   0($10), f6
      add  $10, $10, #8
      st   0($12), f7
      sub  $13, $13, #1
      bne  loop
      add  $12, $12, #8

Pipeline diagram (superscalar, in-order execution): instructions of iterations i and i+1 overlap in the pipeline, giving 11 cycles per iteration.

In-order execution

Note: instructions start execution following the order of F/D. Otherwise, it could be 10 cycles per iteration.

How can a processor that fetches instructions in lexicographical order achieve this parallelism?

We are going too fast … think again

(Same loop body as above.)

Pipeline diagram (superscalar, out-of-order execution): instructions of iterations i and i+1 are allowed to start execution as soon as their operands are available, out of the fetch/decode order.

Out-of-order execution

How can a processor that fetches instructions in lexicographical order achieve this parallelism?


We are going too fast … think again

(Same loop body as above.)

Pipeline diagram (superscalar, out-of-order execution with branch prediction): instructions of iterations i, i+1 and i+2 overlap, so fetch does not stop at the loop branch.

Out-of-order execution + branch prediction: 5 cycles per iteration?

How can a processor that fetches instructions in lexicographical order achieve this parallelism?

We are going too fast … think again

VLIW processor with W=7: 2 integer units, 2 FP units, 2 load/store units and 1 branch unit

1 iteration of 14 instructions every 3 cycles. If tc = 10 ns, we get:
• 467 MIPS out of 700 MIPS peak (14 instructions every 3 cycles at 100 MHz)
• 133 MFLOPS out of 200 MFLOPS peak (4 FP operations every 3 cycles at 100 MHz)

Kernel of the modulo-scheduled loop on the W=7 machine: the 14 operations of one iteration are spread over 3 cycles across the 7 units (2 memory, 2 FP, 2 integer, 1 branch), so operations from iterations i, i+1, i+2 and i+3 are in flight at the same time (ld1/ld2/ld3, the address adds, sub, bne and mulf1 belong to iteration i+3; mulf2 and addf to i+2; subf and st1 to i+1; st2 to iteration i).


Superscalar design

Problems for superscalar design:

Structural hazards:
• Need multiple execution units (multiple pipelines)
• Need multiple simultaneous accesses to the register files
• Need multiple simultaneous accesses to the caches

Data hazards:
• How to deal with Read After Write (RAW) hazards
• How to deal with Write After Read (WAR) and Write After Write (WAW) hazards
• What to do with stalled instructions

Control hazards:
• What to do with conditional branches
• What to do with computed branches

Superscalar design

WAR and WAW dependences caused by out-of-order execution:

      ld  $3, 10($1)
      add $4, $3, $3
      ld  $3, 100($2)
      add $5, $3, $3
      …
      sub $6, $6, $3

Pipeline diagram: the first ld misses in the cache and stays in M for many cycles, so it writes $3 only after the second ld has already written $3; the two "$3 written" events occur out of program order, exposing the WAW (and WAR) hazards on $3.


Superscalar design

Structural hazards:
• Have as many functional units as needed
• Build register files with many read and write ports
• Build multi-port caches

Data hazard solutions:
• Execute instructions in order; use a scoreboard to eliminate data hazards by stalling instructions
• Execute instructions out of order, as soon as operands are available, but graduate them in order. Why?
• Use register renaming to avoid WAR and WAW data hazards

Superscalar design

Control hazard solutions: use branch prediction:
• Make sure that the branch is resolved before registers are modified
• … or use speculative execution, rolling back results if branches were predicted wrong


Branch prediction

What do we need to predict for a jump/branch?

jump:
• the target address, which can be stored in the instruction itself or computed from the current PC plus a displacement

return from subroutine (ret):
• the return address, which is obtained from the stack (increasing the SP and reading from memory)

conditional branch:
• the target address, which is usually computed from the current PC plus a displacement
• whether the branch is going to be taken or continue with the next instruction

Branch prediction

Branch Target Buffer (BTB):
• stores, for each jump/branch, its target address
• it is a cache, so when it is full a replacement algorithm is applied

Diagram: the PC indexes the BTB by instruction address; on a hit the stored target address is selected as the next PC, otherwise PC + 4 is selected.
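A minimal C sketch of such a lookup for a direct-mapped BTB (sizes, field and function names are assumptions for illustration, not taken from the slides):

#include <stdint.h>

#define BTB_ENTRIES 1024   /* assumed size, power of two */

typedef struct {
    int      valid;
    uint32_t tag;      /* instruction address that owns the entry */
    uint32_t target;   /* predicted target address */
} btb_entry;

static btb_entry btb[BTB_ENTRIES];

/* Returns the predicted next PC: the stored target on a hit,
   PC + 4 (next sequential instruction) otherwise. */
uint32_t btb_predict(uint32_t pc)
{
    btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];  /* index by word address */
    if (e->valid && e->tag == pc)
        return e->target;
    return pc + 4;
}

/* On resolving a taken jump/branch, (re)install its target. */
void btb_update(uint32_t pc, uint32_t target)
{
    btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
    e->valid  = 1;
    e->tag    = pc;
    e->target = target;
}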


Branch prediction

The BTB does not work for predicting the return address of a subroutine.

The same instruction address (the one that points to the ret) needs to have different target addresses:

Diagram: two different call sites call the same subroutine, so the single ret must return to a different address each time.

Solution: implement a return stack that mimics the original stack.
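A small C sketch of such a return-address stack (the depth and names are assumptions; a real implementation typically wraps around on overflow):

#include <stdint.h>

#define RAS_DEPTH 16                /* assumed depth */

static uint32_t ras[RAS_DEPTH];
static int ras_top = 0;             /* number of valid entries */

/* On a call: push the address of the instruction after the call. */
void ras_push(uint32_t return_address)
{
    if (ras_top < RAS_DEPTH)
        ras[ras_top++] = return_address;
    /* else: drop the push (or overwrite the oldest entry) */
}

/* On a ret: pop the predicted return address; fall back to 0 if empty. */
uint32_t ras_pop(void)
{
    return (ras_top > 0) ? ras[--ras_top] : 0;
}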

Branch prediction

Branch History Table (BHT):
• says whether or not the branch was taken the last time
• the simplest is a 1-bit table attached to the BTB, set to 1 if the branch jumped (taken) the last time and 0 otherwise (not taken); initially set to 0

Diagram: the PC indexes the BTB and its associated 1-bit BHT; the target address is selected as the next PC only when the BTB hits and the BHT bit predicts taken, otherwise PC + 4 is used.


Branch prediction

Question: how many mispredictions does the 1-bit predictor make per loop execution?

Answer: 2
• the end-of-loop case, when it exits instead of looping as before
• the first time through the loop on the next pass through the code, when it predicts exit instead of looping

Solution: a 2-bit counter BHT that changes its prediction only if it mispredicts twice:
• increment for taken, decrement for not taken
• states 00, 01, 10, 11 (initially set to 00)

Branch prediction

Automaton for the 2-bit counter predictor (states 00, 01, 10, 11; T = taken, NT = not taken): states 11 and 10 predict taken, states 01 and 00 predict not taken; every taken outcome moves the counter towards 11 and every not-taken outcome towards 00, so the prediction only changes after two consecutive mispredictions.

Can it be better? Yes, of course … there has been a lot of research on branch prediction in the last years.
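A C sketch of the 2-bit saturating counter just described (table size and function names are illustrative assumptions):

#define BHT_ENTRIES 4096
static unsigned char bht[BHT_ENTRIES];   /* 2-bit counters, initially 0 (strongly not taken) */

/* Prediction: states 2 and 3 predict taken, 0 and 1 predict not taken. */
int bht_predict_taken(unsigned pc)
{
    return bht[(pc >> 2) % BHT_ENTRIES] >= 2;
}

/* Update: saturate towards 3 on taken, towards 0 on not taken, so the
   prediction only flips after two consecutive mispredictions. */
void bht_update(unsigned pc, int taken)
{
    unsigned char *c = &bht[(pc >> 2) % BHT_ENTRIES];
    if (taken)  { if (*c < 3) (*c)++; }
    else        { if (*c > 0) (*c)--; }
}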


Correlating branches

Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch.

Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table.

In general, an (m,n) predictor records the last m branches to select among 2^m history tables, each with n-bit counters.
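A hedged C sketch of an (m,n) = (2,2) predictor following that description (table sizes and function names are illustrative): the last m outcomes form a global history register that selects one of the 2^m tables of 2-bit counters.

#define M 2                              /* bits of global history */
#define N_STATES 4                       /* 2-bit counters: values 0..3 */
#define PHT_ENTRIES 1024

static unsigned global_history;          /* last M outcomes, 1 = taken */
static unsigned char pht[1 << M][PHT_ENTRIES];   /* one table per history pattern */

int correlating_predict(unsigned pc)
{
    unsigned table = global_history & ((1u << M) - 1);
    return pht[table][(pc >> 2) % PHT_ENTRIES] >= N_STATES / 2;
}

void correlating_update(unsigned pc, int taken)
{
    unsigned table = global_history & ((1u << M) - 1);
    unsigned char *c = &pht[table][(pc >> 2) % PHT_ENTRIES];
    if (taken)  { if (*c < N_STATES - 1) (*c)++; }
    else        { if (*c > 0) (*c)--; }
    /* shift the new outcome into the history register */
    global_history = ((global_history << 1) | (taken ? 1u : 0u)) & ((1u << M) - 1);
}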

Correlating branches

The old 2-bit BHT is then a (0,2) predictor. The PHT could also be indexed with some bits of the PC.

Diagram: the Branch History Table keeps the outcomes of the last m branches (e.g. the pattern 0110); a function f combines this m-bit history (and possibly some PC bits) to index a PHT of n-bit counters, while the BTB still supplies the target address on a hit.


Accuracy of different schemes

Chart: frequency of mispredictions (0%–18%) on the SPEC benchmarks nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott and li, comparing a 4,096-entry BHT with 2 bits per entry, an unlimited-entry BHT with 2 bits per entry, and a 1,024-entry (2,2) correlating predictor; the labelled misprediction rates are 0%, 1%, 5%, 6%, 6%, 11%, 4%, 6%, 5% and 1%.

Selective history predictor

… but the same predictor is not always the best one.

Diagram: a selector table, indexed by the PC and k bits of global history, chooses between two different predictors (predictor1 and predictor2); the output of the chosen predictor gives the final taken/not-taken prediction.
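A rough C sketch of the selector idea (the two component predictors shown here are trivial stand-ins, and the selector update policy is an assumption layered on the diagram, not a detail given in the slides): a 2-bit selector per entry is nudged towards whichever predictor was right, and its upper bit chooses between them.

#define SEL_ENTRIES 1024
static unsigned char selector[SEL_ENTRIES];       /* 2-bit: >= 2 means "use predictor2" */

/* Stand-ins for the two component predictors, e.g. a per-branch
   2-bit BHT and a global-history predictor. */
static int predictor1_predict(unsigned pc) { (void)pc; return 0; }
static int predictor2_predict(unsigned pc) { (void)pc; return 1; }

int selective_predict(unsigned pc)
{
    unsigned i = (pc >> 2) % SEL_ENTRIES;
    return (selector[i] >= 2) ? predictor2_predict(pc)
                              : predictor1_predict(pc);
}

/* After the branch resolves: if exactly one predictor was right,
   move the selector towards it. */
void selective_update(unsigned pc, int taken, int p1_said, int p2_said)
{
    unsigned i = (pc >> 2) % SEL_ENTRIES;
    int p1_ok = (p1_said == taken), p2_ok = (p2_said == taken);
    if (p2_ok && !p1_ok && selector[i] < 3) selector[i]++;
    if (p1_ok && !p2_ok && selector[i] > 0) selector[i]--;
}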


Speculation

Allow an instruction that depends on a branch to execute (without any consequences, including exceptions): boosting.

Separate speculative bypassing of results from real bypassing of results:
• when the instruction is no longer speculative (i.e. the branch has been resolved), its boosted results can update the state or can be discarded
• execute out of order but commit in order, to prevent any irrevocable action (updating state or raising an exception) until the instruction commits

We will elaborate on this later in this chapter

Dependences in a program

RAW, WAR and WAW dependences:
• RAW
• WAR
• WAW


Dependences in a program

RAW dependences are important because they determine the data flow in the program. We will solve them later in this chapter

WAR and WAW dependences appear because we are reusing registers to store temporary values:
• the number of registers visible at the machine-language level is fixed (and usually small)
• they need to be reused

Solution: dynamically rename registers!

Register renaming

Each entry of the (logical) register file either:
• contains the value that is stored in this register, or
• contains a pointer to an element of a list of (physical) registers available for renaming

Diagram: the register file ($0 … $31) next to the rename buffer (ren0 … renj); each register-file entry holds a flag (1: value, 0: renamed) together with either the value itself or a pointer renj into the rename buffer.


Register renaming

At the decode stage, the destination register (e.g. $i) is always renamed with a register from the rename buffer (e.g. renj).

From then on, when an instruction uses register $i, the name of the source register is changed to renj. When the value of renj has been computed, it is transferred (if still needed as register $i) to the register file, and renj becomes free for a new renaming.
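A simplified C sketch of that renaming step at decode (structure and function names are illustrative assumptions; the naive allocator stands in for a real free list, and running out of rename registers is not handled):

#define NUM_LOGICAL 32
#define NUM_RENAME  16

typedef struct { int valid; int value; } rename_entry;

static rename_entry rename_buf[NUM_RENAME];
static int map[NUM_LOGICAL];   /* -1: value is in the register file; otherwise index of renj */
static int next_free = 0;      /* naive allocator standing in for a free list */

void rename_init(void)
{
    for (int i = 0; i < NUM_LOGICAL; i++)
        map[i] = -1;
}

/* Rename one instruction "rd = rs1 op rs2" at decode: sources are redirected
   to the rename buffer if their producer has not written back yet, and the
   destination always receives a fresh rename register. */
void rename_instruction(int rd, int rs1, int rs2, int *src1, int *src2, int *dst)
{
    *src1 = map[rs1];                     /* -1 means: read rs1 from the register file */
    *src2 = map[rs2];
    *dst  = next_free++ % NUM_RENAME;     /* assumes a free entry exists */
    rename_buf[*dst].valid = 0;           /* value not computed yet */
    map[rd] = *dst;                       /* later readers of rd will use this entry */
}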

Register renaming

Example:

Original code:            After renaming:
add $3, $3, 4             add ren1, $3, 4
ld  $8, ($3)              ld  ren2, (ren1)
add $3, $3, 4             add ren3, ren1, 4
ld  $9, ($3)              ld  ren4, (ren3)

If the first instruction finishes after ren3 has already been used, then the result in ren1 does not have to be written back to $3.

Diagram: four snapshots of the register file and rename buffer, showing $3 mapped first to ren1 and then to ren3, $8 to ren2 and $9 to ren4 as the four instructions are renamed.


In-order superscalar processors

Instructions are fetched, executed and committed in compiler-generated order:
• if one instruction stalls, all instructions behind it stall

Instructions are statically scheduled by the hardware:
• this means they are scheduled in their compiler-generated order
• the hardware only decides how many of the next n instructions can be issued, where n is the superscalar issue width

Main advantage of in-order instruction scheduling: simpler implementation
• faster clock cycle
• fewer transistors

Out-of-order superscalar processors

Instructions are fetched in compiler-generated order, but they may be executed out of this order.

Instruction completion may be in order (today) or out of order (older computers).

Dynamic scheduling:
• the hardware decides in what order instructions can be executed
• instructions behind a stalled instruction can pass it

Main advantage of out-of-order execution: higher performance
• better at hiding latencies, less processor stalling
• higher utilization of the functional units


Precise exceptions

An exception is precise if the following two conditions are met:
• all the instructions preceding the instruction that produced the exception have been executed and have modified the process state correctly
• all instructions following the instruction that produced the exception have not yet been executed and have made no modification to the process state

In-order completion is necessary in order to have a microarchitecture with precise exceptions.

Out-of-order superscalar processors


Dynamic execution

Based on Tomasulo's algorithm, proposed back in the 60's.

Why do we study it? Because it led to the Alpha 21264, HP PA-8000, MIPS R10000, Pentium II, PowerPC 604, …

It did not consider in-order completion.

Dynamic execution

Control and buffers are distributed with the functional units.

Diagram: the register file ($0 … $31) and the rename/reorder buffer (ren0 … renj) feed, through multiplexers and bypasses, the reservation stations in front of the functional units (ADD, MULT); results are broadcast on the Common Data Bus (CDB) back to the reservation stations and the rename/reorder buffer.


Reservation stations

An instruction is sent to a reservation station if there is an empty one for the resource that can execute it.

Reservation stations hold instructions, possibly with pending operands, waiting there until all operands are available.

Fields of a reservation station entry:

busy | oper | tag1 | source1 | value1 | tag2 | source2 | value2 | dest

• busy: if 1, the reservation station is occupied by an instruction (with pending operands or in execution)
• tag1, tag2: 0 = source operand not yet available, 1 = source operand available
• source1, source2: if tag = 0, a pointer to the register in the rename buffer being waited on
• value1, value2: if tag = 1, the value of the operand
• dest: pointer to the destination register in the rename buffer

Reservation stations

Each reservation station (RS) monitors whether a result whose destination register is one of its pending operands becomes available on the CDB:

    RStag = 0 and RSsource = CDBdest

The CDB transfers the value (CDBvalue) that will be stored in the destination register (CDBdest) of the rename buffer.
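A C sketch tying the fields above to the CDB snooping just described (names and sizes are assumptions for illustration):

#define NUM_RS 8

typedef struct {
    int busy;          /* 1: entry holds an instruction */
    int oper;          /* opcode */
    int tag1, tag2;    /* 1: operand value present, 0: still waiting */
    int src1, src2;    /* if tag == 0: rename-buffer index being waited on */
    int val1, val2;    /* if tag == 1: the operand value */
    int dest;          /* rename-buffer index of the destination */
} rs_entry;

static rs_entry rs[NUM_RS];

/* Called when a result (value for rename register cdb_dest)
   is broadcast on the Common Data Bus. */
void cdb_broadcast(int cdb_dest, int cdb_value)
{
    for (int i = 0; i < NUM_RS; i++) {
        if (!rs[i].busy) continue;
        if (!rs[i].tag1 && rs[i].src1 == cdb_dest) { rs[i].val1 = cdb_value; rs[i].tag1 = 1; }
        if (!rs[i].tag2 && rs[i].src2 == cdb_dest) { rs[i].val2 = cdb_value; rs[i].tag2 = 1; }
    }
}

/* An entry is ready to request its functional unit when both tags are 1. */
int rs_ready(int i)
{
    return rs[i].busy && rs[i].tag1 && rs[i].tag2;
}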


Reservation stations

Ready to execute? … and then?

When both RStag1 = 1 and RStag2 = 1, the instruction is ready for execution.

For example, assume that we have two adders and four reservation stations (Entry 0..3) that can feed them. The "I want an adder" signal of each entry is generated as RStag1 and RStag2.


Reservation stations

All reservation stations can be unified in a single structure (in some processors called the Instruction Window).

Example

      F1 ← F2 / F4
      F6 ← F0 + F1
      F1 ← F3 - F4
      F7 ← F1 * F5
      F8 ← F2 + F3

1 ADD/SUB unit, 3 cycles, 2 RS
1 MUL unit, 4 cycles, 1 RS
1 DIV unit, 8 cycles, 1 RS

cycle:          1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
F1 ← F2 / F4    F  D  I  E  E  E  E  E  E  E  E  W
F6 ← F0 + F1       F  D  I  -  -  -  -  -  -  -  -  E  E  E  W
F1 ← F3 - F4          F  D  I  E  E  E  W
F7 ← F1 * F5             F  D  I  -  -  -  E  E  E  E  W
F8 ← F2 + F3                F  D  -  -  -  I  E  E  E  -  W

Note: the reservation station is occupied from I to W.


Access to memory

Some superscalar processors only allowed a single memory operation per cycle, but this rapidly became a performance bottleneck

To allow multiple memory requests to be serviced simultaneously, the memory hierarchy has to be multiported.

It is usually sufficient to multiport only the lowest level of the memory hierarchy, namely the primary caches, since many requests do not proceed to the upper levels of the hierarchy.

Multiported cache

Access time increases with the number of ports.

Multiporting can be achieved by making multiple serial requests during the same cycle.

Diagram: a single L1 cache accessed through two ports (port1, port2).


Multiported cache

Multiporting can also be achieved by having multiple memory banks: an interleaved cache.

Bandwidth is reduced if both accesses go to the same bank.

Diagram: two ports (port1, port2) connected to four cache banks (bank1 … bank4).
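A tiny C sketch of the bank-conflict check for the interleaved organization (bank count, line size and names are illustrative assumptions): two accesses in the same cycle can proceed in parallel only if they map to different banks.

#include <stdint.h>

#define NUM_BANKS 4
#define LINE_SIZE 32   /* bytes per cache line, assumed */

/* The bank is chosen from the address bits just above the line offset. */
static unsigned bank_of(uint32_t addr)
{
    return (addr / LINE_SIZE) % NUM_BANKS;
}

/* 1 if both accesses can be serviced this cycle, 0 if they conflict. */
int can_access_in_parallel(uint32_t addr1, uint32_t addr2)
{
    return bank_of(addr1) != bank_of(addr2);
}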

Access to memory

To allow memory operations to be overlapped with other operations (both memory and non-memory), the memory hierarchy must be non-blocking.

That is, if a memory request misses in the data cache, other memory requests should be allowed to proceed: hit-on-miss or miss-on-miss.


Access to memory

Each memory port requires an adder to compute the effective memory address:
• register numbers are readily available from the instructions themselves, while memory addresses have to be computed and become available late in the pipeline

Loads and stores do exhibit RAW, WAW and WAR dependences, both in the computation of the effective address and in the value that needs to be stored in memory.

Reservation stations for ld/st

Are there any order constraints that need to be considered?


Dependency checking

Examples:

      ld  $1, 100($2)
      mul $3, $4, $5
      ld  $6, 4($3)
      st  $7, 10($8)

The st could be ready before the second ld, whose address depends on the mul. What if 4($3) = 10($8)? Letting the st go first would be incorrect!

      mul $1, $2, $3
      st  $2, 4($1)
      ld  $5, 10($6)
      add $7, $7, $5

The ld could be ready before the st, whose address depends on the mul. What if 4($1) = 10($6)? Letting the ld go first would be incorrect!

Dependency checking

Example:

      mul $1, $2, $3
      st  $4, 4($1)
      st  $5, 10($6)

The second store could be ready before the first one, whose address depends on the mul:
• What if 4($1) = 10($6)? Letting the second store go first would be incorrect!
• What if 4($1) ≠ 10($6), but 4($1) causes an exception? For the exception to be precise, the second store cannot have gone to memory.

Stores must go to memory in FIFO order in order to have precise exceptions.


Dependency checking

Whenever there is a load in a reservation station whose address cannot be computed yet, any store that follows it cannot go to memory.

Similarly, whenever there is a store in a reservation station whose address is not yet known, any memory access that follows it cannot go to memory.

Bypassing memory accesses

New load addresses are checked against the addresses of the waiting stores. If there is a match, the load must wait for the store it matches; the store data may be bypassed to the matching load.
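A C sketch of that check against the waiting stores (the store-queue structure and names are assumptions): the load scans the older stores; an address match with known data can be forwarded, while a match without data, or an older store with an unknown address, forces the load to wait.

#include <stdint.h>

#define SQ_ENTRIES 8

typedef struct {
    int      valid;        /* entry holds an older, not yet performed store  */
    int      addr_known;   /* effective address already computed?            */
    uint32_t addr;
    int      data_ready;   /* value to store already available?              */
    uint32_t data;
} sq_entry;

static sq_entry store_queue[SQ_ENTRIES];   /* assumed ordered oldest-first */

typedef enum { LOAD_GO_TO_CACHE, LOAD_FORWARDED, LOAD_MUST_WAIT } load_result;

load_result check_load(uint32_t load_addr, uint32_t *forwarded)
{
    load_result r = LOAD_GO_TO_CACHE;
    for (int i = 0; i < SQ_ENTRIES; i++) {
        if (!store_queue[i].valid) continue;
        if (!store_queue[i].addr_known)
            return LOAD_MUST_WAIT;             /* may alias: be conservative */
        if (store_queue[i].addr == load_addr) {
            if (!store_queue[i].data_ready)
                return LOAD_MUST_WAIT;         /* matching store, data not ready yet */
            *forwarded = store_queue[i].data;  /* bypass the youngest match seen so far */
            r = LOAD_FORWARDED;
        }
    }
    return r;
}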


Instruction retirement

Retiring only one instruction per cycle can be a bottleneck; it is possible to retire multiple instructions in parallel from the rename/reorder buffer.

Reorder buffer

The reorder buffer is an extension of the rename buffer: a circular buffer of entries ren0 … renj with two pointers:
• tail: points to the first renamed register pending to be moved to the register file
• head: points to the last register renamed (i.e. the last instruction decoded)

The tail does not advance until the value for that renamed register has been generated.
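A C sketch of the circular head/tail management just described (sizes, field and function names are assumptions, and head here points to the next free slot, one past the last register renamed):

#define ROB_ENTRIES 16

typedef struct {
    int valid;        /* entry allocated                           */
    int ready;        /* value already produced                    */
    int value;
    int logical_reg;  /* architectural register this entry renames */
} rob_entry;

static rob_entry rob[ROB_ENTRIES];
static int head;      /* next free slot                            */
static int tail;      /* first renamed register pending retirement */

/* Decode: allocate the next entry; return -1 (stall) if the buffer is full. */
int rob_alloc(int logical_reg)
{
    int next = (head + 1) % ROB_ENTRIES;
    int allocated = head;
    if (next == tail)                     /* head would reach tail: structural hazard */
        return -1;
    rob[head].valid = 1;
    rob[head].ready = 0;
    rob[head].logical_reg = logical_reg;
    head = next;
    return allocated;
}

/* Write-back: record the produced value for an entry. */
void rob_writeback(int entry, int value)
{
    rob[entry].value = value;
    rob[entry].ready = 1;
}

/* Retire: the tail only advances once its value has been generated. */
int rob_retire(int *logical_reg, int *value)
{
    if (tail == head || !rob[tail].ready)
        return 0;
    *logical_reg = rob[tail].logical_reg;
    *value       = rob[tail].value;
    rob[tail].valid = 0;
    tail = (tail + 1) % ROB_ENTRIES;
    return 1;
}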


Reorder buffer

If the head reaches the tail, renaming has to be stopped. This stalls the processor until registers for renaming are available (a structural hazard).

A value from a register in the reorder buffer may not need to be transferred to the register file:
• none of the registers in the register file is renamed to it

Support for speculative execution

Each entry in the reorder buffer contains a special "speculative" bit:
• speculative instructions are marked in the reorder buffer
• should a branch be confirmed, the speculative bits of the corresponding speculative instructions are turned to "confirm"
• if it is not confirmed, the status is set to "kill"

When an instruction reaches the tail of the reorder buffer:
• if it is marked "speculative", retirement must stall until it is no longer speculative
• if it is marked "confirm", it commits and retirement continues
• if it is marked "kill", its result is discarded


Example

      F1 ← F2 / F4
      F6 ← F0 + F1
      F1 ← F3 - F4
      F7 ← F1 * F5
      F8 ← F2 + F3

1 ADD/SUB unit, 3 cycles, 2 RS
1 MUL unit, 4 cycles, 1 RS
1 DIV unit, 8 cycles, 1 RS

With in-order commit (C) through the reorder buffer:

cycle:          1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
F1 ← F2 / F4    F  D  I  E  E  E  E  E  E  E  E  W  C
F6 ← F0 + F1       F  D  I  -  -  -  -  -  -  -  -  E  E  E  W  C
F1 ← F3 - F4          F  D  I  E  E  E  W  -  -  -  -  -  -  -  -  C
F7 ← F1 * F5             F  D  I  -  -  -  E  E  E  E  W  -  -  -  -  C
F8 ← F2 + F3                F  D  -  -  -  I  E  E  E  -  W  -  -  -  -  C

Example

      F1 ← F2 / F4
      F6 ← F0 + F1
      F1 ← F3 - F4
      F7 ← F1 * F5
      F8 ← F2 + F3

1 ADD/SUB unit, 3 cycles, 2 RS
1 MUL unit, 4 cycles, 1 RS
1 DIV unit, 8 cycles, 1 RS

cycle:          1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
F1 ← F2 / F4    F  D  I  E  E  E  E  E  E  E  E  W  C
F6 ← F0 + F1       F  D  I  -  -  -  -  -  -  -  -  E  E  E  W  C
F1 ← F3 - F4          F  D  I  E  E  E  W  -  -  -  -  -  -  -  -  C
F7 ← F1 * F5             F  D  I  -  -  -  E  E  E  E  W  -  -  -  -  C
F8 ← F2 + F3                F  D  -  -  -  -  -  -  -  I  E  E  E  W  -  C

rename/reorder buffer with 4 entries: structural hazard (the last instruction cannot be renamed until the first one commits and frees its entry)


Yet another example

The processor has:
• an instruction window with 24 entries
• a load/store unit (LSU) with 8 entries
• an instruction fetch buffer (IB) with 4 entries
• 2 adders
• 1 pipelined multiplier with a 2-cycle latency
• two memory ports, both pipelined with a 2-cycle latency

The physical register file is integrated into the instruction window. A structure (RAT: Register Alias Table) maintains the mappings from the logical register numbers to instruction window entries (if the logical register is renamed).


Yet another example: cycles 1 through 11 — a sequence of diagrams (one slide per cycle) stepping through the execution on this processor.

Current high-performance µP (8/06)



30 years of progress

From the 4004 to the Pentium® 4 processor:
• Transistor count: more than 20,000x increase
• Frequency: more than 20,000x increase
• 39% compound annual growth

Now you can tell what has happened in between …