Advanced Microarchitecture Lecture 3: Superscalar Fetch


Page 1: Advanced  Microarchitecture

Advanced Microarchitecture
Lecture 3: Superscalar Fetch

Page 2: Advanced  Microarchitecture

Fetch Rate is an ILP Upper Bound

• To sustain an execution rate of N IPC, you must be able to sustain a fetch rate of N IPC!
• Over the long term, you cannot burn 2000 calories a day while only consuming 1500 calories a day. You will starve!
• This also suggests that you don't need to fetch N instructions every cycle, just on average

Page 3: Advanced  Microarchitecture

Impediments to "Perfect" Fetch

• A machine with superscalar degree N will ideally fetch N instructions every cycle
• This doesn't happen due to
  – Instruction cache organization
  – Branches
  – And interaction between the two

Page 4: Advanced  Microarchitecture

Instruction Cache Organization

• To fetch N instructions per cycle from the I$, we need
  – The physical organization of the I$ row to be wide enough to store N instructions
  – To be able to access the entire row at the same time

[Figure: the fetch address selects one wide cache line (Tag plus N instructions per row), and the entire line is delivered to the decoders]

Alternative: do multiple fetches per cycle
Not good: increases cycle time/latency by too much

Page 5: Advanced  Microarchitecture

Fetch Operation

• Each cycle, the PC of the next instruction to fetch is used to access an I$ line
• The N instructions specified by this PC and the next N-1 sequential addresses form a fetch group
• The fetch group might not be aligned with the row structure of the I$

Page 6: Advanced  Microarchitecture

Fragmentation via Misalignment

• If PC = xxx01001, N=4:
  – The ideal fetch group is xxx01001 through xxx01100 (inclusive)

[Figure: the fetch group starts at offset 01 of a cache row, so it spills past the end of the row (row width vs. fetch group)]

Since we can only access one line per cycle, we fetch only 3 instructions (instead of N=4)

Page 7: Advanced  Microarchitecture

Fetch Rate Computation

• Assume N=4
• Assume the fetch group starts at a random location
• Then fetch rate =
    ¼ x 4 + ¼ x 3 + ¼ x 2 + ¼ x 1
  = 2.5 instructions per cycle
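This average is easy to sanity-check in software. The following is a minimal sketch under the slide's assumptions (one line access per cycle, uniformly random start slot, no branches); the function name expected_fetch_rate is mine, not from the lecture.

def expected_fetch_rate(fetch_width, line_size):
    """Average instructions delivered per cycle when the fetch group may
    start at any slot of the cache line with equal probability, only one
    line can be read per cycle, and there are no branches."""
    total = 0
    for offset in range(line_size):
        remaining = line_size - offset        # instructions left in this line
        total += min(fetch_width, remaining)  # cannot fetch past the end of the line
    return total / line_size

print(expected_fetch_rate(4, 4))   # (4 + 3 + 2 + 1) / 4 = 2.5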

Page 8: Advanced  Microarchitecture

Reduces Fetch Bandwidth

• It now takes two cycles to fetch N instructions
  – Halved fetch bandwidth!

[Figure: Cycle 1 fetches the three instructions at the end of the row starting at xxx01001; Cycle 2 fetches the remaining instruction from the next row at xxx01100]

The reduction may not be as bad as a full halving

Page 9: Advanced  Microarchitecture

Reducing Fetch Fragmentation

• Make |Fetch Group| != |Row Width|

[Figure: a wider cache line (two fetch groups' worth of instructions per row) feeding the decoders]

If the start of the fetch group is N or more instructions from the end of the cache line, then N instructions can be delivered

Page 10: Advanced  Microarchitecture

May Require Extra Hardware

[Figure: a misaligned fetch group read from the wide cache line passes through a rotator to produce an aligned fetch group for the decoders]
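As a rough functional picture of what the rotator accomplishes (not a hardware design), the sketch below reads a whole row, rotates it so the requested instruction lands in slot 0, and delivers up to N instructions; the helper name fetch_aligned_group is hypothetical.

def fetch_aligned_group(cache_row, start_offset, fetch_width=4):
    """Rotate a fetched cache row so the fetch group begins at slot 0,
    then deliver up to fetch_width instructions (fewer if the group
    runs past the end of the row)."""
    available = len(cache_row) - start_offset
    rotated = cache_row[start_offset:] + cache_row[:start_offset]
    return rotated[:min(fetch_width, available)]

row = ["i0", "i1", "i2", "i3", "i4", "i5", "i6", "i7"]
print(fetch_aligned_group(row, 3))   # ['i3', 'i4', 'i5', 'i6']
print(fetch_aligned_group(row, 6))   # ['i6', 'i7'] -- only 2 available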

Page 11: Advanced  Microarchitecture

Fetch Rate Computation

• Let N=4, cache line size = 8
• Then fetch rate =
    5/8 x 4 + 1/8 x 3 + 1/8 x 2 + 1/8 x 1
  = 3.25 instructions per cycle
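The hypothetical expected_fetch_rate sketch shown after the earlier fetch-rate slide reproduces this number when given the wider line:

print(expected_fetch_rate(4, 8))   # (5*4 + 3 + 2 + 1) / 8 = 3.25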

Page 12: Advanced  Microarchitecture

Fragmentation via Branches

• Even if the fetch group is aligned, and/or the cache line size is larger than the fetch group, taken branches disrupt fetch

[Figure: a fetch group whose second instruction is a taken branch; the instruction slots after the branch, and the slots before its target in the destination line, are wasted (X'd out)]

Page 13: Advanced  Microarchitecture

Fetch Rate Computation

• Let N=4
• Branch every 5 instructions on average
  (20% chance for each instruction to be a branch)
• Assume branches are always taken
• Assume a branch target may start at any offset in a cache row
  (25% chance of the fetch group starting at each location)

Page 14: Advanced  Microarchitecture

Fetch Rate Computation (2)

Summing over the four possible start positions of the fetch group within the row (last slot first, then progressively earlier starts):

    ¼ x 1
  + ¼ x ( 0.2 x 1 + 0.8 x 2 )
  + ¼ x ( 0.2 x 1 + 0.8 x ( 0.2 x 2 + 0.8 x 3 ) )
  + ¼ x ( 0.2 x 1 + 0.8 x ( 0.2 x 2 + 0.8 x ( 0.2 x 3 + 0.8 x 4 ) ) )
  = 2.048 instructions fetched per cycle

Simplified analysis: doesn't account for the higher probability of the fetch group being aligned due to the previous fetch group not containing branches
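The nested expression can be cross-checked with a small enumeration. The helper below is my own (not from the slides) and encodes the stated assumptions: a 4-slot row, a uniformly random start slot, a 20% chance that each instruction is a taken branch, and fetch stopping at the first taken branch or the end of the row.

def expected_fetch_rate_with_branches(fetch_width, p_branch):
    """Expected instructions fetched per cycle when the fetch group starts at
    a random slot of a fetch_width-wide row, each instruction is a taken
    branch with probability p_branch, and fetch stops after a taken branch."""
    total = 0.0
    for offset in range(fetch_width):
        slots = fetch_width - offset                 # slots left in the row
        for k in range(1, slots + 1):
            if k < slots:
                # exactly k fetched: k-1 non-branches followed by a taken branch
                prob = (1 - p_branch) ** (k - 1) * p_branch
            else:
                # the whole remainder fetched: the first slots-1 are non-branches
                prob = (1 - p_branch) ** (k - 1)
            total += k * prob
    return total / fetch_width

print(expected_fetch_rate_with_branches(4, 0.2))   # ~2.048, matching the slide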

Page 15: Advanced  Microarchitecture

Ex. IBM RS/6000

[Figure: the RS/6000 fetch path for PC = B1010. The I$ data array is split into four banks (bank 0 holds instructions 0, 4, 8, 12 of each cache line; bank 1 holds 1, 5, 9, 13; and so on). "T logic" in front of the first three banks bumps that bank's row index when its slot has already been passed, so one access reads B12, B13, B10, B11, and the instruction buffer network (steered by the tag-check logic) rotates them back into program order: B10 B11 B12 B13.]
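To make the realignment concrete, here is a toy software model of the idea, not the actual RS/6000 implementation: a 16-instruction line striped across 4 banks, a per-bank row adjustment standing in for the T logic, and a final rotate standing in for the instruction buffer network.

def rs6000_style_fetch(line, start):
    """Toy model of 4-banked fetch: 'line' holds 16 instructions and bank b
    stores line[i] for every i with i % 4 == b.  Fetch the 4 sequential
    instructions starting at 'start' (start <= 12, so no line crossing)."""
    base_row, offset = divmod(start, 4)
    bank_out = []
    for bank in range(4):
        # "T logic": banks whose slot comes before the start read the next row
        row = base_row + (1 if bank < offset else 0)
        bank_out.append(line[4 * row + bank])
    # instruction buffer network: rotate bank outputs back into program order
    return bank_out[offset:] + bank_out[:offset]

line_b = ["B%d" % i for i in range(16)]
print(rs6000_style_fetch(line_b, 10))   # ['B10', 'B11', 'B12', 'B13']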

Page 16: Advanced  Microarchitecture

Types of Branches

• Direction:
  – Conditional vs. Unconditional
• Target:
  – PC-encoded
    • PC-relative
    • Absolute offset
  – Computed (target derived from a register)
• Must resolve both direction and target to determine the next fetch group

Page 17: Advanced  Microarchitecture

Prediction

• Generally use hardware predictors for both direction and target
  – The direction predictor simply predicts that a branch is taken or not-taken (exact algorithms covered next lecture)
  – Target prediction needs to predict an actual address

Page 18: Advanced  Microarchitecture

Where Are the Branches?

• Before we can predict a branch, we need to know that we have a branch to predict!

Where's the branch in this fetch group?

[Figure: the PC indexes the I$, which returns a fetch group of raw instruction bits; nothing in those bits directly identifies the branch]

Page 19: Advanced  Microarchitecture

Simplistic Fetch Engine

[Figure: the fetch PC accesses the I$; partial-decode (PD) logic examines each fetched instruction to find the branch, and the branch's PC drives the direction predictor and target predictor (with PC + sizeof(inst) as the fall-through) to produce the next fetch PC, all in the same cycle]

Huge latency! Clock frequency plummets

Page 20: Advanced  Microarchitecture

Branch Identification

• Predecode branches on fill from the L2: store 1 bit per instruction, set if the instruction is a branch
• The partial-decode logic is removed... but this is still a long-latency path (the I$ access itself sometimes takes more than 1 cycle)
• Note: sizeof(inst) may not be known before decode (e.g., x86)

[Figure: the I$ now supplies the branch's PC directly to the direction and target predictors, with PC + sizeof(inst) as the fall-through]

Page 21: Advanced  Microarchitecture

Line Granularity

• Predict the next fetch group independent of the exact location of branches in the current fetch group
• If there's only one branch in a fetch group, does it really matter where it is?

[Figure: one predictor entry per instruction PC (X X T X X N X X) vs. one predictor entry per fetch group (T, N)]

Page 22: Advanced  Microarchitecture

Predicting by Line

[Figure: the cache line address accesses the I$, direction predictor, and target predictor in parallel; the fall-through is PC + sizeof($-line)]

For a line containing branches br1 (target X) and br2 (target Y):

  br1  br2   Correct Dir Pred   Correct Target Pred
  N    N     N                  --  (fall through to the next line)
  N    T     T                  Y
  T    --    T                  X

Better! Latency is determined by the branch predictor.
This is still challenging: we may need to choose between multiple targets for the same cache line.

Page 23: Advanced  Microarchitecture

Multiple Branch Prediction

[Figure: the PC (without its LSBs) indexes the direction and target predictors, which supply one direction (e.g., N N N T) and one target (addr0..addr3) per slot of the fetch group; logic scans for the first "T" and selects the corresponding target, otherwise the next PC is PC + sizeof($-line)]
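A minimal sketch of that selection step, assuming the predictors hand back one direction bit and one target per slot of the fetch group (the function and parameter names are mine):

def next_fetch_pc(line_pc, line_size_bytes, directions, targets):
    """directions[i] is True if slot i is predicted taken, and targets[i] is
    that slot's predicted target; return the next fetch PC."""
    for taken, target in zip(directions, targets):
        if taken:
            return target               # first predicted-taken branch wins
    return line_pc + line_size_bytes    # no taken branch: fall through to next line

# Directions N N N T -> the fourth slot's target is selected
print(hex(next_fetch_pc(0x1000, 16,
                        [False, False, False, True],
                        [0x2000, 0x2100, 0x2200, 0x2300])))   # 0x2300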

Page 24: Advanced  Microarchitecture

Direction Prediction

• Details next lecture
• Over 90% accurate today for integer applications
• Higher for FP applications

Page 25: Advanced  Microarchitecture

Target Prediction

• PC-relative branches
  – If not-taken: next address = branch address + sizeof(inst)
  – If taken: next address = branch address + SEXT(offset)
• sizeof(inst) doesn't change
• The offset doesn't change (not counting self-modifying code)
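For concreteness, the two cases might be computed as below. This is only an illustration: the 4-byte instruction size and 16-bit offset field are assumptions, and real ISAs differ in details such as offset scaling and which PC the offset is added to.

def sext(value, bits):
    """Sign-extend a 'bits'-wide field to a full-width integer."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)

def pc_relative_next_pc(branch_pc, taken, offset_field,
                        inst_size=4, offset_bits=16):
    if taken:
        return branch_pc + sext(offset_field, offset_bits)
    return branch_pc + inst_size

print(hex(pc_relative_next_pc(0x1000, True, 0xFFF0)))   # 0xff0  (branch back by 16)
print(hex(pc_relative_next_pc(0x1000, False, 0xFFF0)))  # 0x1004 (fall through)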

Page 26: Advanced  Microarchitecture

Taken Targets Only

• Only need to predict taken-branch targets
• A taken branch's target is the same every time
• Prediction is really just a "cache"

[Figure: the PC looks up the target predictor; the alternative next PC is PC + sizeof(inst)]

Page 27: Advanced  Microarchitecture

Branch Target Buffer (BTB)

[Figure: each BTB entry holds a valid bit (V), the branch instruction address (BIA), and the branch target address (BTA); the branch PC indexes the BTB, the stored BIA acts as a tag and is compared against the branch PC to produce a hit signal, and on a hit the BTA is used as the next fetch PC]
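Functionally, the structure behaves like a small cache that maps a branch's PC to its taken target. A minimal direct-mapped sketch, assuming 4-byte instructions and an arbitrary 1024-entry capacity (neither size comes from the slides):

class BTB:
    """Direct-mapped branch target buffer: each entry holds a tag (derived
    from the branch instruction address) and the branch target address."""
    def __init__(self, num_entries=1024):
        self.num_entries = num_entries
        self.entries = [None] * num_entries   # None = valid bit clear

    def _index_and_tag(self, pc):
        word = pc // 4                        # assume 4-byte instructions
        return word % self.num_entries, word // self.num_entries

    def predict(self, pc):
        """Return the predicted target on a hit, or None on a miss."""
        index, tag = self._index_and_tag(pc)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None

    def update(self, pc, target):
        """Install or refresh the entry when a taken branch resolves."""
        index, tag = self._index_and_tag(pc)
        self.entries[index] = (tag, target)

btb = BTB()
btb.update(0xFC34, 0x1000)
print(hex(btb.predict(0xFC34)))   # 0x1000 (hit)
print(btb.predict(0xFD08))        # None   (miss)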

Page 28: Advanced  Microarchitecture

Set-Associative BTB

[Figure: the PC indexes several ways in parallel; each way stores (V, tag, target), every stored tag is compared against the PC, and the hitting way's target is selected as the next PC]

Page 29: Advanced  Microarchitecture

Cutting Corners

• Branch prediction may be wrong
  – The processor has ways to detect mispredictions
  – Tweaks that make the BTB more or less "wrong" don't change the correctness of processor operation
    • They may affect performance

Page 30: Advanced  Microarchitecture

Partial Tags

[Figure: a BTB holding entries for branches near 00000000cfff9810 / 00000000cfff9824 / 00000000cfff984c, with targets 00000000cfff9704, 00000000cfff9830, and 00000000cfff9900. Storing full 64-bit tags is expensive, so the tags are trimmed to partial tags (f981, f982, f984). A completely different branch, 000001111beef9810, now matches the f981 entry and falsely hits.]
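The aliasing is easy to reproduce. In the sketch below (a toy model; the 16-bit tag slice is chosen only to mimic the f981-style tags in the figure), two unrelated branch PCs end up with the same partial tag:

def partial_tag(pc, tag_bits=16, low_ignored_bits=4):
    """Keep only a small slice of the branch PC as the BTB tag
    (e.g. 0x00000000cfff9810 -> 0xf981)."""
    return (pc >> low_ignored_bits) & ((1 << tag_bits) - 1)

pc_a = 0x00000000CFFF9810
pc_b = 0x000001111BEEF9810          # a completely different branch

print(hex(partial_tag(pc_a)))       # 0xf981
print(hex(partial_tag(pc_b)))       # 0xf981 as well -> false BTB hit
print(partial_tag(pc_a) == partial_tag(pc_b))   # True: pc_b picks up pc_a's target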

Page 31: Advanced  Microarchitecture

PC-offset Encoding

[Figure: instead of storing the full 64-bit target (e.g., 00000000cfff9900), each BTB entry stores only the low-order target bits (ff9900); the predicted target is formed by concatenating the upper bits of the fetch PC (00000000cf) with the stored low bits]

If the target is too far away, or the original PC is close to a "roll-over" point, then the target will be mispredicted
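A sketch of the reconstruction, with a 24-bit stored-target width chosen to match the figure (an illustration, not a specific design); the second example shows the roll-over failure the slide warns about:

def predict_target(fetch_pc, stored_low_bits, low_bits=24):
    """Concatenate the upper bits of the current fetch PC with the
    low-order target bits stored in the BTB entry."""
    mask = (1 << low_bits) - 1
    return (fetch_pc & ~mask) | (stored_low_bits & mask)

pc = 0x00000000CFFF984C
print(hex(predict_target(pc, 0xFF9900)))        # 0xcfff9900 -> correct target

# Roll-over failure: a branch near the top of a 2**24-byte region
# whose target lies in the next region
pc2 = 0x00000000CFFFFFF0
actual = 0x00000000D0000010
print(hex(predict_target(pc2, actual & 0xFFFFFF)))
# 0xcf000010, not 0xd0000010 -> mispredicted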

Page 32: Advanced  Microarchitecture

BTB Miss?

• Dir-Pred says "taken"
• Target-Pred (BTB) misses
  – Could default to the fall-through PC (as if Dir-Pred said NT)
    • But we know that's likely to be wrong!
• Stall fetch until the target is known ... when's that?
  – PC-relative: after decode, we can compute the target
  – Indirect: must wait until register read/execute

Page 33: Advanced  Microarchitecture

Stall on BTB Miss

[Figure: the direction predictor says taken but the BTB misses (???); fetch stalls until decode extracts the branch's displacement, PC + displacement is computed, and the resulting next PC unstalls fetch]

Page 34: Advanced  Microarchitecture

BTB Miss Timing

[Figure: in cycle i, the BTB lookup misses while the I$ access for the current PC starts; the branch is not decoded until it reaches the decode stage, the target is then computed (decode + add), and only in a later cycle (i+3 in the figure) can the I$ access for the next PC begin. The pipeline diagram shows the front-end stages stalling and nops being injected during the intervening cycles.]

Page 35: Advanced  Microarchitecture

Decode-time Correction

[Figure: the BTB hits but supplies the wrong target "foo", and fetch continues down the path of "foo"; at decode, the displacement is extracted and PC + displacement gives the real target "bar". We then discover the predicted target was wrong, flush the wrong-path instructions, and resteer fetch (3 cycles of bubbles is better than 20+).]

Similar penalty to a BTB miss

Page 36: Advanced  Microarchitecture

What about Indirect Jumps?

[Figure: the direction predictor says taken but the BTB misses (???); decode reveals an indirect branch that gets its target from R5, so the target cannot be computed at decode]

• Stall until R5 is ready and the branch executes
  – which may be a while if "Load R5 = 0[R3]" misses to main memory
• Or fetch down the NT-path
  – why?

Page 37: Advanced  Microarchitecture

Subroutine Calls

[Figure: three call sites, A: 0xFC34: CALL printf, B: 0xFD08: CALL printf, and C: 0xFFB0: CALL printf, all targeting P: 0x1000 (the start of printf); each call site gets its own BTB entry whose target is 0x1000]

No problem! Each call site always jumps to the same target, so the BTB predicts calls correctly.

Page 38: Advanced  Microarchitecture

Subroutine Returns

[Figure: printf (P: 0x1000: ST $RA [$sp] ... 0x1B98: LD $tmp [$sp], 0x1B9C: RETN $tmp) is called from both A: 0xFC34 and B: 0xFD08, so the return must go to either A': 0xFC38 or B': 0xFD0C. The single BTB entry for the return at 0x1B9C holds only one target (0xFC38 here), so the return is mispredicted whenever the other caller is active.]

Page 39: Advanced  Microarchitecture

Return Address Stack (RAS)

• Keep track of the call stack

[Figure: when A: 0xFC34: CALL printf is fetched, its return address FC38 is pushed onto the RAS (on top of the existing entry D004); when 0x1B9C: RETN $tmp is fetched, FC38 is popped and used as the predicted target instead of the BTB's, so fetch correctly continues at A': 0xFC38: CMP $ret, 0]
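A sketch of the structure, assuming an arbitrary 4-entry capacity and a circular top-of-stack pointer (both my own modeling choices); how overflow is handled is the subject of the next slide.

class ReturnAddressStack:
    """Circular return address stack: push the fall-through PC when a call
    is fetched, pop the predicted return target when a return is fetched."""
    def __init__(self, size=4):
        self.entries = [0] * size
        self.top = 0                  # index of the next free slot

    def push(self, return_pc):
        self.entries[self.top % len(self.entries)] = return_pc
        self.top += 1                 # on overflow this wraps and overwrites

    def pop(self):
        self.top -= 1
        return self.entries[self.top % len(self.entries)]

ras = ReturnAddressStack()
ras.push(0xFC38)          # CALL printf at 0xFC34 pushes its return address
print(hex(ras.pop()))     # RETN is predicted to go to 0xfc38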

Page 40: Advanced  Microarchitecture

Overflow

[Figure: a 4-entry RAS already holding FC90 (top of stack), 421C, 48C8, and 7300 when the call at 64AC needs to push its return address 64B0]

1. Wrap around and overwrite the oldest entry
   • Will lead to an eventual misprediction, but only after four pops
2. Do not modify the RAS
   • Will lead to a misprediction on the next pop
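Using the hypothetical ReturnAddressStack sketch from the previous slide (which wraps around and overwrites, i.e. option 1), the four most recent return addresses still pop correctly and only the fifth pop, whose entry was overwritten, mispredicts:

ras = ReturnAddressStack(size=4)
for ret in (0x7300, 0x48C8, 0x421C, 0xFC90, 0x64B0):   # the fifth push overflows
    ras.push(ret)

print([hex(ras.pop()) for _ in range(5)])
# ['0x64b0', '0xfc90', '0x421c', '0x48c8', '0x64b0']
#  the first four pops are correct; the fifth should be 0x7300 but was overwritten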

Page 41: Advanced  Microarchitecture

How Can You Tell It's a Return?

• Pre-decode bit in the BTB (return = 1, else = 0)
• Wait until after decode
  – Initially use the BTB's target prediction
  – After decode, when you know it's a return, treat it like a BTB miss or BTB misprediction
  – Costs a few bubbles, but simpler and still better than a full pipeline flush