Advanced Microarchitecture
Lecture 3: Superscalar Fetch
Fetch Rate is an ILP Upper Bound
• To sustain an execution rate of N IPC, you must be able to sustain a fetch rate of N IPC!
• Over the long term, you cannot burn 2000 calories a day while consuming only 1500 calories a day. You will starve!
• This also suggests that you don’t need to fetch N instructions every cycle, just on average
Impediments to “Perfect” Fetch
• A machine with superscalar degree N will ideally fetch N instructions every cycle
• This doesn’t happen due to:
  – Instruction cache organization
  – Branches
  – And the interaction between the two
Instruction Cache Organization
• To fetch N instructions per cycle from the I$, we need:
  – The physical organization of an I$ row to be wide enough to store N instructions
  – To be able to access the entire row at the same time

[Figure: the address drives a decoder that selects one cache line; each line holds a tag plus N instructions]

• Alternative: do multiple fetches per cycle
  – Not good: increases cycle time/latency by too much
Fetch Operation
• Each cycle, the PC of the next instruction to fetch is used to access an I$ line
• The N instructions specified by this PC and the next N-1 sequential addresses form a fetch group
• The fetch group might not be aligned with the row structure of the I$
Fragmentation via Misalignment
• If PC = xxx01001 and N=4:
  – The ideal fetch group is xxx01001 through xxx01100 (inclusive)

[Figure: the fetch group starts at offset 01 of a cache line and spills into the next line; since only one line can be accessed per cycle, only 3 instructions (instead of N=4) are fetched]
Fetch Rate Computation
• Assume N=4
• Assume the fetch group starts at a random location
• Then fetch rate = ¼ × 4 + ¼ × 3 + ¼ × 2 + ¼ × 1 = 2.5 instructions per cycle
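The expected-value computation above can be sketched in a few lines of Python (an illustrative sketch, not from the slides; the function name is mine):

```python
# Expected fetch rate for an N-wide fetch limited to one cache row per
# cycle, with the fetch group starting at a uniformly random offset.

def misaligned_fetch_rate(n):
    # Starting at offset k of an N-wide row leaves n - k useful
    # instructions in the row; each offset has probability 1/n.
    return sum((n - k) / n for k in range(n))

print(misaligned_fetch_rate(4))  # 2.5, matching the slide
```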
Reduces Fetch Bandwidth
• It now takes two cycles to fetch N instructions
  – Halved fetch bandwidth!
  – The reduction may not be as bad as a full halving

[Figure: Cycle 1 fetches the line containing xxx01001 and delivers the 3 instructions at offsets 01–11; Cycle 2 fetches the next line (xxx01100) to deliver the remaining instruction of the fetch group]
Reducing Fetch Fragmentation
• Make |Fetch Group| != |Row Width|
• If the start of the fetch group is N or more instructions from the end of the cache line, then N instructions can be delivered

[Figure: a cache line twice as wide as the fetch group; the N-instruction fetch group is selected from within the wider line]
May Require Extra Hardware

[Figure: the instructions selected from within the wide cache line pass through a rotator to produce an aligned fetch group]
Fetch Rate Computation
• Let N=4 and cache line size = 8 instructions
• Then fetch rate = 5/8 × 4 + 1/8 × 3 + 1/8 × 2 + 1/8 × 1 = 3.25 instructions per cycle
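The same expected-value argument generalizes to any line width. A small sketch (illustrative; the function name is mine) covering both the N=4, line=4 case and this one:

```python
# Expected fetch rate when the cache line holds `line` instructions and
# the fetch group is `n` wide (line >= n), with the start offset
# uniformly distributed over the line.

def fetch_rate(n, line):
    # From offset k we can deliver min(n, line - k) instructions
    # before hitting the end of the line.
    return sum(min(n, line - k) for k in range(line)) / line

print(fetch_rate(4, 4))  # 2.5  (previous slide)
print(fetch_rate(4, 8))  # 3.25 (this slide)
```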
Fragmentation via Branches
• Even if the fetch group is aligned, and/or the cache line size > the fetch group size, taken branches disrupt fetch

[Figure: a taken branch in the middle of the fetch group invalidates the instructions after it, so only the instructions up to and including the branch are useful]
Fetch Rate Computation
• Let N=4
• Assume a branch every 5 instructions on average (20% chance for each instruction to be a branch)
• Assume branches are always taken
• Assume the branch target may start at any offset in a cache row (25% chance of the fetch group starting at each location)
Fetch Rate Computation (2)
• Fetch group starting at the last slot of the row: ¼ × 1 instruction
• Starting one slot earlier: ¼ × (0.2 × 1 + 0.8 × 2)
• Two slots earlier: ¼ × (0.2 × 1 + 0.8 × (0.2 × 2 + 0.8 × 3))
• Three slots earlier (row-aligned): ¼ × (0.2 × 1 + 0.8 × (0.2 × 2 + 0.8 × (0.2 × 3 + 0.8 × 4)))
• Total = 2.048 instructions fetched per cycle
• Simplified analysis: doesn’t account for the higher probability of the fetch group being aligned due to the previous fetch group not containing branches
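The nested expression above is a recursion, which makes it easy to check numerically. A sketch of the slide's computation (the function name is mine):

```python
# Expected instructions fetched per cycle with N=4, a 20% chance that
# each instruction is an always-taken branch, and a uniformly random
# start offset within a 4-wide row.

P_BRANCH = 0.2

def expected_from(slots):
    # Expected useful instructions when `slots` remain in the row.
    # The first instruction always counts; with probability 0.8 it is
    # not a branch and fetch continues into the remaining slots.
    if slots == 1:
        return 1.0
    return P_BRANCH * 1 + (1 - P_BRANCH) * (1 + expected_from(slots - 1))

rate = sum(expected_from(s) for s in (1, 2, 3, 4)) / 4
print(rate)  # ≈ 2.048, matching the slide
```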
Ex. IBM RS/6000

[Figure: the I$ is built from four banks, each one instruction wide; “T logic” at each bank adjusts the row index so that a fetch starting mid-row (PC = B1010) can read B10, B11, B12, and B13 in one access, and the instruction buffer network rotates them back into program order. Tag-check logic validates the access for the whole line.]
Types of Branches
• Direction:
  – Conditional vs. Unconditional
• Target:
  – PC-encoded
    • PC-relative
    • Absolute offset
  – Computed (target derived from a register)
• Must resolve both direction and target to determine the next fetch group
Prediction
• Generally use hardware predictors for both direction and target
  – The direction predictor simply predicts whether a branch is taken or not-taken (exact algorithms covered next lecture)
  – Target prediction needs to predict an actual address
Where Are the Branches?
• Before we can predict a branch, we need to know that we have a branch to predict!
• Where’s the branch in this fetch group?

[Figure: the PC indexes the I$ and returns a fetch group of raw instruction bits; nothing in the bits themselves identifies which instructions, if any, are branches]
Simplistic Fetch Engine

[Figure: the fetch PC accesses the I$; partial-decode (PD) logic scans each fetched instruction to find branches, which then drive the direction predictor and target predictor to produce the next fetch PC (or the branch’s PC + sizeof(inst) if predicted not-taken)]

• Huge latency! Clock frequency plummets
Branch Identification
• Predecode branches on fill from L2: store 1 bit per instruction, set if the instruction is a branch
• The partial-decode logic is removed, but the latency is still long (the I$ access itself sometimes takes > 1 cycle)
• Note: sizeof(inst) may not be known before decode (e.g., x86)

[Figure: predecode bits stored alongside the I$ feed the direction and target predictors, which produce the branch’s PC + sizeof(inst) or the predicted target]
Line Granularity
• Predict the next fetch group independent of the exact location of branches in the current fetch group
• If there’s only one branch in a fetch group, does it really matter where it is?

[Figure: instead of one predictor entry per instruction PC, keep one predictor entry per fetch group]
Predicting by Line
• Index the predictors with the cache line address; the direction predictor tracks the branches in the line (br1, br2) and the target predictor supplies the taken target, with the fall-through being the line address + sizeof($-line)
• Better! Latency is determined by the branch predictor, not by decode
• This is still challenging: we may need to choose between multiple targets for the same cache line

[Figure: a line containing branches br1 and br2; depending on which branch is predicted taken (N/N, N/T, or T/–), the correct next fetch address is the fall-through, target Y, or target X]
Multiple Branch Prediction

[Figure: the line-granularity direction predictor produces one taken/not-taken bit per instruction in the line (e.g., N N N T); logic scans for the first “T” and selects the corresponding target (addr0–addr3) from the target predictor; if no branch is predicted taken, the next PC is the line address (no LSBs of PC) + sizeof($-line)]
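The “scan for the first T” selection logic can be sketched as follows (an illustrative sketch; the function name, arguments, and 16-byte line size are my assumptions, not from the slides):

```python
# Next-PC selection at line granularity: scan the per-slot
# taken/not-taken predictions for the first "T" and use its predicted
# target; otherwise fall through to the next sequential cache line.

def next_fetch_pc(line_pc, dir_bits, targets, line_bytes=16):
    # dir_bits: per-slot predictions, e.g. "NNNT"
    # targets:  one predicted target address per slot
    for slot, d in enumerate(dir_bits):
        if d == "T":
            return targets[slot]  # first predicted-taken branch wins
    # No taken branch: strip the LSBs and add sizeof($-line).
    return (line_pc & ~(line_bytes - 1)) + line_bytes

print(hex(next_fetch_pc(0x1004, "NNNT", [0, 0, 0, 0x2000])))  # 0x2000
print(hex(next_fetch_pc(0x1004, "NNNN", [0, 0, 0, 0])))       # 0x1010
```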
Direction Prediction
• Details next lecture
• Over 90% accurate today for integer applications
• Higher for FP applications
Target Prediction
• PC-relative branches:
  – If not-taken: next address = branch address + sizeof(inst)
  – If taken: next address = branch address + SEXT(offset)
• sizeof(inst) doesn’t change
• The offset doesn’t change (not counting self-modifying code)
Taken Targets Only
• Only need to predict taken-branch targets
• The taken-branch target is the same every time
• The prediction is really just a “cache”

[Figure: the PC looks up the target predictor for the taken target; the not-taken path is simply PC + sizeof(inst)]
Branch Target Buffer (BTB)

[Figure: the branch PC indexes the BTB; each entry holds a valid bit (V), a branch instruction address tag (BIA), and a branch target address (BTA); the tag is compared against the branch PC to generate a hit signal, and on a hit the BTA supplies the next fetch PC]
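A minimal software model of the direct-mapped BTB just described (a sketch; the entry count and method names are my assumptions):

```python
# Direct-mapped BTB sketch: each entry caches the taken target of a
# branch, tagged by the upper bits of the branch's address.

class BTB:
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [None] * entries  # each slot: (tag, target) or None

    def _index_tag(self, pc):
        # Low bits index the table; the rest form the tag (BIA).
        return pc % self.entries, pc // self.entries

    def lookup(self, pc):
        # Returns the predicted taken target (BTA), or None on a miss.
        idx, tag = self._index_tag(pc)
        entry = self.table[idx]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None

    def update(self, pc, target):
        # Install/refresh the entry when a taken branch resolves.
        idx, tag = self._index_tag(pc)
        self.table[idx] = (tag, target)

btb = BTB()
btb.update(0xFC34, 0x1000)        # CALL printf resolves to 0x1000
print(hex(btb.lookup(0xFC34)))    # 0x1000 (hit)
print(btb.lookup(0xFD08))         # None   (miss)
```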
Set-Associative BTB

[Figure: the PC indexes a set of ways, each with its own valid bit, tag, and target; the tags are compared in parallel and the matching way’s target is selected as the next PC]
Cutting Corners
• Branch prediction may be wrong
  – The processor has mechanisms to detect mispredictions
  – Tweaks that make the BTB more or less “wrong” don’t change the correctness of processor operation
    • They may affect performance
Partial Tags
• Store only the low-order bits of the branch address as the BTB tag (e.g., f981 instead of the full 00000000cfff9810)
• Saves a lot of tag storage, but two different branches can now alias to the same entry: 000001111beef9810 also matches the partial tag f981 and gets 00000000cfff9810’s prediction
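The aliasing in the slide's example can be demonstrated directly (a sketch; the exact tag-extraction scheme, 16 bits taken above a 4-bit offset, is my assumption chosen to reproduce the f981 tag shown):

```python
# Partial-tag aliasing: with only 16 tag bits, two different branch
# PCs can look identical to the BTB.

def partial_tag(pc, bits=16):
    # Drop the low 4 offset bits, then keep only `bits` tag bits.
    return (pc >> 4) & ((1 << bits) - 1)

a = 0x00000000CFFF9810
b = 0x000001111BEEF9810
print(hex(partial_tag(a)))               # 0xf981, as on the slide
print(partial_tag(a) == partial_tag(b))  # True: the two branches alias
```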
PC-offset Encoding
• Store only the low-order bits of the target in the BTB and concatenate them with the upper bits of the current fetch PC (e.g., stored target bits ff9900 combined with the fetch PC’s upper bits 00000000cf)
• Saves target storage, but if the target is too far away, or the original PC is close to a “roll-over” point, the target will be mispredicted
BTB Miss?
• The direction predictor says “taken,” but the target predictor (BTB) misses
  – Could default to the fall-through PC (as if the direction predictor had said not-taken)
    • But we know that’s likely to be wrong!
  – Stall fetch until the target is known … when’s that?
    • PC-relative: after decode, we can compute the target
    • Indirect: must wait until register read/execute
Stall on BTB Miss

[Figure: the direction predictor says taken but the BTB misses; fetch stalls while decode computes PC + displacement, which then supplies the next PC and unstalls fetch]
BTB Miss Timing

[Pipeline timing: in cycle i, the BTB lookup misses while the I$ access for the current PC proceeds; the branch reaches decode, computes its target, and only then (cycle i+2 or later) can the next I$ access start; the intervening fetch slots are filled with stalls/nops injected into the pipeline]
Decode-time Correction
• The BTB hits but predicts the wrong target “foo”; fetch continues down the path of “foo”
• Later, decode computes PC + displacement and discovers the real target is “bar”: the predicted target was wrong, so we flush the wrong-path instructions and resteer fetch
• 3 cycles of bubbles is much better than 20+ from a full pipeline flush; the penalty is similar to that of a BTB miss
What about Indirect Jumps?
• The target comes from a register (e.g., “get target from R5”), so decode can’t compute it
• Options when the BTB misses:
  – Stall until R5 is ready and the branch executes
    • This may be a while if “Load R5 = 0[R3]” misses to main memory
  – Fetch down the NT-path
    • Why?
Subroutine Calls
• Calls to the same function from different sites all branch to the same target, so the BTB predicts them well:
  – A: 0xFC34: CALL printf → 0x1000
  – B: 0xFD08: CALL printf → 0x1000
  – C: 0xFFB0: CALL printf → 0x1000 (start of printf)
• Each call site gets its own BTB entry, all with target 0x1000
• No problem!
Subroutine Returns
• printf saves its return address (P: 0x1000: ST $RA [$sp]), reloads it at the end (0x1B98: LD $tmp [$sp]), and returns through it (0x1B9C: RETN $tmp)
• The return goes back to a different address each time: A’: 0xFC38 after the call from A, B’: 0xFD0C after the call from B
• A single BTB entry for the RETN can only remember one target (e.g., 0xFC38), so the return is mispredicted whenever the caller changes
Return Address Stack (RAS)
• Keep track of the call stack
• On a CALL (A: 0xFC34: CALL printf), push the return address (FC38) onto the RAS
• On a RETN, pop the RAS and use that as the predicted target instead of the BTB’s, correctly resteering fetch to A’: 0xFC38
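The push/pop behavior, including the wrap-around overflow policy discussed on the next slide, can be modeled in a few lines (an illustrative sketch; the 4-entry depth and class shape are my assumptions):

```python
# Return address stack (RAS) sketch with wrap-around on overflow:
# when full, a push silently overwrites the oldest entry.

class RAS:
    def __init__(self, depth=4):
        self.depth = depth
        self.stack = [0] * depth
        self.top = 0  # count of pushes; next free slot is top % depth

    def push(self, return_addr):
        # CALL: record the return address; overflow wraps around.
        self.stack[self.top % self.depth] = return_addr
        self.top += 1

    def pop(self):
        # RETN: predicted target is the most recent return address.
        self.top -= 1
        return self.stack[self.top % self.depth]

ras = RAS()
ras.push(0xFC38)       # A: 0xFC34: CALL printf pushes FC38
print(hex(ras.pop()))  # 0xfc38: the RETN is predicted correctly
```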
Overflow
• What if the RAS is full (e.g., a 4-entry stack holding FC90, 421C, 48C8, 7300) when 64AC: CALL printf needs to push 64B0? Two options:
  1. Wrap around and overwrite the oldest entry
     • Leads to an eventual misprediction, but only after four pops
  2. Do not modify the RAS
     • Leads to a misprediction on the very next pop
How Can You Tell It’s a Return?
• Use a pre-decode bit in the BTB (return=1, else=0)
• Or wait until after decode:
  – Initially use the BTB’s target prediction
  – After decode, when you know it’s a return, treat it like a BTB miss or BTB misprediction
  – Costs a few bubbles, but is simpler and still better than a full pipeline flush