Advanced Microarchitecture


Lecture 3: Superscalar Fetch

Fetch Rate is an ILP Upper Bound
• To sustain an execution rate of N IPC, you must be able to sustain a fetch rate of N IPC!
• Over the long term, you cannot burn 2000 calories a day while consuming only 1500 calories a day. You will starve!
• This also suggests that you don't need to fetch N instructions every cycle, just N on average

Impediments to "Perfect" Fetch
• A machine with superscalar degree N will ideally fetch N instructions every cycle
• This doesn't happen, due to:
  – Instruction cache organization
  – Branches
  – The interaction between the two

Instruction Cache Organization
• To fetch N instructions per cycle from the I$, we need:
  – The physical organization of an I$ row to be wide enough to store N instructions
  – To be able to access the entire row at the same time

[Figure: the I$ as rows of (Tag, Inst, Inst, Inst, Inst); the address selects one cache line, which is fed to the decoders.]

• Alternative: do multiple fetches per cycle
  – Not good: increases cycle time/latency by too much

Fetch Operation
• Each cycle, the PC of the next instruction to fetch is used to access an I$ line
• The N instructions specified by this PC and the next N−1 sequential addresses form a fetch group
• The fetch group might not be aligned with the row structure of the I$

Fragmentation via Misalignment
• If PC = xxx01001 and N = 4:
  – The ideal fetch group is xxx01001 through xxx01100 (inclusive)

[Figure: the fetch group starts at offset 01 of the row at xxx01, so its last instruction (xxx01100) lies in the next row.]

• Since we can only access one line per cycle, we fetch only 3 instructions (instead of N = 4)

Fetch Rate Computation
• Assume N = 4, and that the fetch group starts at a random offset within the row
• Then fetch rate = ¼ × 4 + ¼ × 3 + ¼ × 2 + ¼ × 1 = 2.5 instructions per cycle
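As a quick check of this expected value, a few lines of Python reproduce the slide's arithmetic (the uniformly random starting offset is the slide's assumption):

```python
# Expected fetch rate: N = 4, 4-instruction cache rows, and a fetch group
# starting at a uniformly random offset. Starting at offset k leaves only
# row - k instructions in the current row.
N = 4
row = 4
rate = sum(min(N, row - k) for k in range(row)) / row
print(rate)  # 2.5
```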

Reduces Fetch Bandwidth
• It now takes two cycles to fetch N instructions: halved fetch bandwidth!

[Figure: cycle 1 fetches the 3 instructions starting at xxx01001; cycle 2 fetches the remaining instruction at xxx01100 from the next row.]

• The reduction may not be as bad as a full halving

Reducing Fetch Fragmentation
• Make |fetch group| != |row width|

[Figure: each I$ row is twice as wide as the fetch group; the address selects one wide cache line, from which the fetch group is taken.]

• If the start of the fetch group is N or more instructions from the end of the cache line, then N instructions can be delivered

May Require Extra Hardware

[Figure: as before, the wide cache line is read out, but a rotator is inserted before the decoders to turn the selected instructions into an aligned fetch group.]
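A minimal sketch of the rotator's job, with a Python list standing in for the wide cache row (names are illustrative):

```python
# The row comes out of the I$ in storage order; rotating left by the
# fetch group's starting offset leaves the group aligned in slots 0..N-1
# for the decoders.
def rotate_left(row, k):
    return row[k:] + row[:k]

row = ["I0", "I1", "I2", "I3", "I4", "I5", "I6", "I7"]
group = rotate_left(row, 3)[:4]  # fetch group starting at offset 3
print(group)  # ['I3', 'I4', 'I5', 'I6']
```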

Fetch Rate Computation
• Let N = 4 and cache line size = 8
• Then fetch rate = 5/8 × 4 + 1/8 × 3 + 1/8 × 2 + 1/8 × 1 = 3.25 instructions per cycle
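The same expected-value calculation, generalized to any row width, reproduces both this result and the earlier 2.5 IPC figure:

```python
# Expected fetch rate for fetch-group size n and row width w, with a
# uniformly random starting offset within the row.
def fetch_rate(n, w):
    return sum(min(n, w - k) for k in range(w)) / w

print(fetch_rate(4, 4))  # 2.5
print(fetch_rate(4, 8))  # 3.25
```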

Fragmentation via Branches
• Even if the fetch group is aligned, and/or the cache line size is greater than the fetch group size, taken branches disrupt fetch

[Figure: a fetch group containing a taken branch; the slots after the branch (marked X) must be discarded.]

Fetch Rate Computation
• Let N = 4
• A branch occurs every 5 instructions on average, i.e., a 20% chance for each instruction to be a branch
• Assume branches are always taken
• Assume a branch target may start at any offset in a cache row: a 25% chance of the fetch group starting at each of the four offsets

Fetch Rate Computation (2)
• Fetch group starts at the last slot:  ¼ × 1 instruction
• One slot earlier:  ¼ × (0.2 × 1 + 0.8 × 2)
• Two slots earlier:  ¼ × (0.2 × 1 + 0.8 × (0.2 × 2 + 0.8 × 3))
• Three slots earlier (aligned):  ¼ × (0.2 × 1 + 0.8 × (0.2 × 2 + 0.8 × (0.2 × 3 + 0.8 × 4)))
• Sum = 2.048 instructions fetched per cycle
• Simplified analysis: this doesn't account for the higher probability of the fetch group being aligned when the previous fetch group contained no branches
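The nested expression can be sanity-checked recursively; this sketch assumes, as the slide does, that each instruction is a taken branch with probability 0.2:

```python
# Expected instructions fetched when `slots` instructions remain in the
# row and each instruction is a taken branch with probability p.
def expected_fetched(slots, p=0.2):
    if slots == 1:
        return 1.0  # always fetch the last slot in the row
    # Fetch this instruction; stop here if it is a taken branch,
    # otherwise continue with the remaining slots.
    return p * 1 + (1 - p) * (1 + expected_fetched(slots - 1, p))

# Average over the four equally likely starting offsets.
rate = sum(expected_fetched(s) for s in (1, 2, 3, 4)) / 4
print(round(rate, 3))  # 2.048
```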

Ex. IBM RS/6000

[Figure: the RS/6000 fetch hardware. The I$ data array is four-way interleaved, one bank per fetch-group slot, with "T logic" in front of the first three banks. For PC = B1010, the T logic adjusts the row index of the banks before the starting slot so that B10, B11, B12, and B13 can all be read in one access; the instruction buffer network, steered by the tag-check logic, rotates the bank outputs (B12, B13, B10, B11) back into program order (B10, B11, B12, B13).]

Types of Branches
• Direction:
  – Conditional vs. unconditional
• Target:
  – PC-encoded
    • PC-relative
    • Absolute offset
  – Computed (target derived from a register)
• Must resolve both direction and target to determine the next fetch group

Prediction
• Generally, hardware predictors are used for both direction and target
  – The direction predictor simply predicts whether a branch is taken or not-taken (exact algorithms covered next lecture)
  – Target prediction needs to predict an actual address

Where Are the Branches?
• Before we can predict a branch, we need to know that we have a branch to predict!
• Where's the branch in this fetch group?

[Figure: the PC indexes the I$, which returns a raw bit stream; nothing in it marks which bits are a branch instruction.]

Simplistic Fetch Engine

[Figure: the fetch PC accesses the I$; partial-decode (PD) logic examines each fetched instruction to find the branches, then feeds the branch's PC to the direction and target predictors, with +sizeof(inst) supplying the fall-through address.]

• Huge latency! Clock frequency plummets

Branch Identification

[Figure: the same fetch engine, but the I$ stores one extra bit per instruction, set if the instruction is a branch; branches are predecoded on fill from the L2. The branch's PC (+sizeof(inst)) feeds the direction and target predictors.]

• The partial-decode logic is removed, but the latency is still long (the I$ access itself sometimes takes more than 1 cycle)
• Note: sizeof(inst) may not be known before decode (e.g., x86)

Line Granularity
• Predict the next fetch group independent of the exact location of branches within the current fetch group
• If there's only one branch in a fetch group, does it really matter where it is?

[Figure: left, one predictor entry per instruction PC, where only the branch slots (T or N) matter; right, a single predictor entry for the whole fetch group.]

Predicting by Line

[Figure: the cache-line address indexes the direction predictor and target predictor in parallel with the I$; +sizeof($-line) supplies the fall-through line address.]

• Example: a cache line containing two branches, br1 (target X) and br2 (target Y):

  br1  br2 | correct dir pred | correct target pred
   N    N  |        N         |  -- (fall-through)
   N    T  |        T         |  Y
   T    -- |        T         |  X

• Better! Latency is determined by the branch predictor
• This is still challenging: we may need to choose between multiple targets for the same cache line

Multiple Branch Prediction

[Figure: the PC (with its LSBs dropped) indexes the direction predictor, which produces one prediction per slot (e.g., N N N T), and the target predictor, which produces addr0 through addr3. Logic scans for the first "T" and selects the corresponding target; +sizeof($-line), using the PC's LSBs, supplies the sequential address when no branch is predicted taken.]
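In software terms, the selection logic amounts to a scan for the first predicted-taken slot; this sketch shows the idea (function and signal names are illustrative):

```python
# Pick the next fetch address from per-slot direction predictions and
# per-slot predicted targets for one cache line.
def next_fetch(dirs, targets, line_pc, line_bytes=16):
    for d, t in zip(dirs, targets):
        if d == "T":
            return t              # first predicted-taken branch wins
    return line_pc + line_bytes   # no taken branch: sequential line

print(hex(next_fetch("NNNT", [0, 0, 0, 0x2000], 0x1000)))  # 0x2000
```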

Direction Prediction
• Details next lecture
• Over 90% accurate today for integer applications
• Higher still for FP applications

Target Prediction
• PC-relative branches:
  – If not-taken: next address = branch address + sizeof(inst)
  – If taken: next address = branch address + SEXT(offset)
• sizeof(inst) doesn't change
• The offset doesn't change (not counting self-modifying code)
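A sketch of the two cases, assuming an illustrative 16-bit signed offset field (the field width is not specified on the slide):

```python
# Sign-extend a 16-bit offset field (width is an assumption for the example).
def sext16(x):
    return x - (1 << 16) if x & 0x8000 else x

# PC-relative next-address computation: fall-through if not taken,
# branch address + sign-extended offset if taken.
def next_pc(branch_pc, inst_size, offset_field, taken):
    return branch_pc + (sext16(offset_field) if taken else inst_size)

print(hex(next_pc(0x1000, 4, 0xFFF0, taken=True)))   # 0xff0 (backward branch)
print(hex(next_pc(0x1000, 4, 0xFFF0, taken=False)))  # 0x1004 (fall-through)
```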

Taken Targets Only
• We only need to predict taken-branch targets
• A taken branch's target is the same every time
• The prediction is really just a "cache"

[Figure: the PC indexes the target predictor; the predicted taken target and PC + sizeof(inst) are the two candidate next-PC values.]

Branch Target Buffer (BTB)

[Figure: a BTB entry holds a valid bit (V), the branch instruction address as the tag (BIA), and the branch target address (BTA). The branch PC indexes the BTB; the tag comparison, qualified by the valid bit, produces the hit signal, and on a hit the BTA becomes the next fetch PC.]
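The structure above can be sketched as a small direct-mapped table (the size and indexing here are illustrative, not any specific machine's BTB):

```python
# Minimal direct-mapped BTB model: each entry holds a tag (the full
# branch PC, i.e., the BIA) and a target (the BTA); None plays the role
# of a cleared valid bit.
class BTB:
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [None] * entries  # slot: (tag, target) or None

    def _index(self, pc):
        return (pc >> 2) % self.entries  # drop byte offset, then index

    def lookup(self, pc):
        slot = self.table[self._index(pc)]
        if slot and slot[0] == pc:  # valid and tag match: BTB hit
            return slot[1]          # predicted taken target
        return None                 # BTB miss

    def update(self, pc, target):
        self.table[self._index(pc)] = (pc, target)

btb = BTB()
btb.update(0xFC34, 0x1000)
print(hex(btb.lookup(0xFC34)))  # 0x1000 (hit)
print(btb.lookup(0xFD08))       # None (miss)
```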

Set-Associative BTB

[Figure: the PC indexes a set containing several ways, each holding (V, tag, target); all ways' tags are compared in parallel, and the matching way's target is selected as the next PC.]

Cutting Corners
• Branch prediction may be wrong
  – The processor has mechanisms to detect mispredictions
  – So tweaks that make the BTB more or less "wrong" don't change the correctness of processor operation
  – They may affect performance, though

Partial Tags

[Figure: with full tags, each BTB entry stores the complete branch address (e.g., 00000000cfff9810, with target 00000000cfff9704). With partial tags, only a few low-order tag bits are kept (f981, f982, f984), saving storage, but an unrelated branch such as 000001111beef9810 now matches the f981 entry and receives a bogus target prediction.]
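The aliasing case can be reproduced directly; the 16-bit tag width here mirrors the slide's "f981"-style tags but is otherwise an illustrative choice:

```python
# Partial tag: keep only 16 tag bits above the low offset bits, as in
# the slide's "f981" tags. Two unrelated branch PCs can then match the
# same BTB entry and yield a bogus target prediction.
def partial_tag(pc):
    return (pc >> 4) & 0xFFFF

a = 0x00000000CFFF9810   # branch from the slide's example
b = 0x000001111BEEF9810  # unrelated branch
print(hex(partial_tag(a)))                # 0xf981
print(partial_tag(a) == partial_tag(b))   # True: they alias
```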

PC-offset Encoding

[Figure: instead of the full 64-bit target (e.g., 00000000cfff9704), each BTB entry stores only the target's low-order bits (ff9704); on a hit, the predicted target is formed by concatenating the upper bits of the fetch PC (00000000cf) with the stored bits (ff9900, giving 00000000cfff9900).]

• If the target is too far away, or the original PC is close to the "roll-over" point, then the target will be mispredicted
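A sketch of the reconstruction, assuming an illustrative 24-bit stored-offset width (matching the six hex digits shown on the slide):

```python
# Store only the low 24 bits of the target in the BTB and splice them
# onto the upper bits of the current fetch PC.
LOW = 24
MASK = (1 << LOW) - 1

def stored_target(target):
    return target & MASK            # what the BTB keeps

def predicted_target(pc, stored):
    return (pc & ~MASK) | stored    # upper bits of PC + stored bits

pc = 0x00000000CFFF984C
s = stored_target(0x00000000CFFF9900)   # 0xff9900
print(hex(predicted_target(pc, s)))     # 0xcfff9900: correct here
# A target with different upper bits (too far away, or a PC near the
# roll-over point) would be reconstructed incorrectly: misprediction.
```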

BTB Miss?
• The direction predictor says "taken," but the target predictor (BTB) misses
  – We could default to the fall-through PC (as if the direction predictor said not-taken)
  – But we know that's likely to be wrong!
• Stall fetch until the target is known ... when is that?
  – PC-relative: after decode, we can compute the target
  – Indirect: must wait until register read/execute

Stall on BTB Miss

[Figure: the BTB misses while the direction predictor says T; fetch stalls, the branch proceeds to decode, where PC + displacement is computed; that next PC resteers and unstalls fetch.]

BTB Miss Timing

[Figure: pipeline timeline of a BTB miss. In cycle i, the BTB lookup misses while the I$ access for the current PC starts; over cycles i+1 through i+3 the instruction moves through I$ access and decode while fetch stalls; decode computes the next PC, the I$ access for the target starts around cycle i+4, and nops (bubbles) are injected into the pipeline in between.]

Decode-time Correction

[Figure: the BTB predicts target "foo," and fetch continues down foo's path; later, decode computes the real target "bar," discovers the predicted target was wrong, flushes the wrong-path instructions, and resteers fetch (3 cycles of bubbles is better than 20+).]

• Similar penalty as a BTB miss

What about Indirect Jumps?

[Figure: the direction predictor says T, but the BTB misses on an indirect jump whose target comes from R5; decode cannot compute the target, since it must be read from the register.]

• Stall until R5 is ready and the branch executes
  – This may be a while if "Load R5 = 0[R3]" misses to main memory
• Or fetch down the NT-path
  – Why?

Subroutine Calls

[Figure: call sites A (0xFC34), B (0xFD08), and C (0xFFB0) each CALL printf, which starts at P (0x1000); the BTB holds an entry per call site, all with target 0x1000.]

• No problem! Each call site is a direct branch whose target (0x1000) never changes, so the BTB predicts it correctly.

Subroutine Returns

[Figure: printf (P: 0x1000) saves $RA, and its RETN at 0x1B9C jumps through $tmp. When called from A (0xFC34) the return target is A' (0xFC38); when called from B (0xFD08) it is B' (0xFD0C). A BTB entry for the RETN that still holds 0xFC38 mispredicts the second return.]

Return Address Stack (RAS)
• Keep track of the call stack

[Figure: "A: 0xFC34: CALL printf" pushes the return address FC38 onto the RAS (above an older entry D004); the BTB still predicts the call's target. When "0x1B9C: RETN $tmp" is fetched, FC38 is popped and used as the predicted target, steering fetch to A': 0xFC38.]

Overflow
• What if the RAS is full when a call pushes a new return address?
1. Wrap around and overwrite the oldest entry
   • Will lead to an eventual misprediction, after four pops (for a 4-entry RAS)
2. Do not modify the RAS
   • Will lead to a misprediction on the next pop

[Figure: "64AC: CALL printf" tries to push 64B0 onto a full 4-entry RAS holding FC90 (top of stack), 421C, 48C8, and 7300.]
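The wrap-around-and-overwrite policy can be sketched as a small circular buffer; the depth of 4 and the addresses mirror the slide's example:

```python
# Circular RAS with wrap-around-and-overwrite on overflow: pushing onto
# a full stack silently replaces the oldest entry, so that entry's
# return is eventually mispredicted.
class RAS:
    def __init__(self, depth=4):
        self.depth = depth
        self.stack = [0] * depth
        self.top = 0  # count of pushes; index is taken mod depth

    def push(self, ret_addr):
        self.stack[self.top % self.depth] = ret_addr
        self.top += 1

    def pop(self):
        self.top -= 1
        return self.stack[self.top % self.depth]

ras = RAS(depth=4)
for addr in (0x7300, 0x48C8, 0x421C, 0xFC90, 0x64B0):  # 5 pushes, depth 4
    ras.push(addr)
print(hex(ras.pop()))  # 0x64b0: correct
# Three more pops (0xFC90, 0x421C, 0x48C8) are also correct; the fourth
# pop returns the overwritten slot instead of 0x7300: misprediction.
```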

How Can You Tell It's a Return?
• Pre-decode bit in the BTB (return = 1, else = 0)
• Or wait until after decode:
  – Initially use the BTB's target prediction
  – After decode, when you know it's a return, treat it like a BTB miss or BTB misprediction
  – Costs a few bubbles, but it is simpler and still better than a full pipeline flush