Advanced Microarchitecture


Lecture 3: Superscalar Fetch

Fetch Rate is an ILP Upper Bound
• To sustain an execution rate of N IPC, you must be able to sustain a fetch rate of N IPC!
• Over the long term, you cannot burn 2000 calories a day while consuming only 1500 calories a day. You will starve!
• This also suggests that you don't need to fetch N instructions every cycle, just N on average

Impediments to "Perfect" Fetch
• A machine with superscalar degree N will ideally fetch N instructions every cycle
• This doesn't happen, due to:
  – Instruction cache organization
  – Branches
  – The interaction between the two

Instruction Cache Organization
• To fetch N instructions per cycle from the I$, we need:
  – The physical organization of an I$ row to be wide enough to store N instructions
  – To be able to access the entire row at the same time

[Figure: the I$ as rows of (Tag, Inst, Inst, Inst, Inst); the address selects one cache line, which is fed to the decoders.]

• Alternative: do multiple fetches per cycle
  – Not good: increases cycle time/latency by too much

Fetch Operation
• Each cycle, the PC of the next instruction to fetch is used to access an I$ line
• The N instructions specified by this PC and the next N−1 sequential addresses form a fetch group
• The fetch group might not be aligned with the row structure of the I$

Fragmentation via Misalignment
• If PC = xxx01001 and N = 4:
  – The ideal fetch group is xxx01001 through xxx01100 (inclusive)

[Figure: the fetch group starts at offset 01 of the row at xxx01, so its last instruction (xxx01100) lies in the next row.]

• Since we can only access one line per cycle, we fetch only 3 instructions (instead of N = 4)

Fetch Rate Computation
• Assume N = 4, and that the fetch group starts at a random offset within the row
• Then fetch rate = ¼ × 4 + ¼ × 3 + ¼ × 2 + ¼ × 1 = 2.5 instructions per cycle
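As a quick check of this expected value, a few lines of Python reproduce the slide's arithmetic (the uniformly random starting offset is the slide's assumption):

```python
# Expected fetch rate: N = 4, 4-instruction cache rows, and a fetch group
# starting at a uniformly random offset. Starting at offset k leaves only
# row - k instructions in the current row.
N = 4
row = 4
rate = sum(min(N, row - k) for k in range(row)) / row
print(rate)  # 2.5
```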

Reduces Fetch Bandwidth
• It now takes two cycles to fetch N instructions: halved fetch bandwidth!

[Figure: cycle 1 fetches the 3 instructions starting at xxx01001; cycle 2 fetches the remaining instruction at xxx01100 from the next row.]

• The reduction may not be as bad as a full halving

Reducing Fetch Fragmentation
• Make |fetch group| != |row width|

[Figure: each I$ row is twice as wide as the fetch group; the address selects one wide cache line, from which the fetch group is taken.]

• If the start of the fetch group is N or more instructions from the end of the cache line, then N instructions can be delivered

May Require Extra Hardware

[Figure: as before, the wide cache line is read out, but a rotator is inserted before the decoders to turn the selected instructions into an aligned fetch group.]
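A minimal sketch of the rotator's job, with a Python list standing in for the wide cache row (names are illustrative):

```python
# The row comes out of the I$ in storage order; rotating left by the
# fetch group's starting offset leaves the group aligned in slots 0..N-1
# for the decoders.
def rotate_left(row, k):
    return row[k:] + row[:k]

row = ["I0", "I1", "I2", "I3", "I4", "I5", "I6", "I7"]
group = rotate_left(row, 3)[:4]  # fetch group starting at offset 3
print(group)  # ['I3', 'I4', 'I5', 'I6']
```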

Fetch Rate Computation
• Let N = 4 and cache line size = 8
• Then fetch rate = 5/8 × 4 + 1/8 × 3 + 1/8 × 2 + 1/8 × 1 = 3.25 instructions per cycle
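The same expected-value calculation, generalized to any row width, reproduces both this result and the earlier 2.5 IPC figure:

```python
# Expected fetch rate for fetch-group size n and row width w, with a
# uniformly random starting offset within the row.
def fetch_rate(n, w):
    return sum(min(n, w - k) for k in range(w)) / w

print(fetch_rate(4, 4))  # 2.5
print(fetch_rate(4, 8))  # 3.25
```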

Fragmentation via Branches
• Even if the fetch group is aligned, and/or the cache line size is greater than the fetch group size, taken branches disrupt fetch

[Figure: a fetch group containing a taken branch; the slots after the branch (marked X) must be discarded.]

Fetch Rate Computation
• Let N = 4
• A branch occurs every 5 instructions on average, i.e., a 20% chance for each instruction to be a branch
• Assume branches are always taken
• Assume a branch target may start at any offset in a cache row: a 25% chance of the fetch group starting at each of the four offsets

Fetch Rate Computation (2)
• Fetch group starts at the last slot:  ¼ × 1 instruction
• One slot earlier:  ¼ × (0.2 × 1 + 0.8 × 2)
• Two slots earlier:  ¼ × (0.2 × 1 + 0.8 × (0.2 × 2 + 0.8 × 3))
• Three slots earlier (aligned):  ¼ × (0.2 × 1 + 0.8 × (0.2 × 2 + 0.8 × (0.2 × 3 + 0.8 × 4)))
• Sum = 2.048 instructions fetched per cycle
• Simplified analysis: this doesn't account for the higher probability of the fetch group being aligned when the previous fetch group contained no branches
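The nested expression can be sanity-checked recursively; this sketch assumes, as the slide does, that each instruction is a taken branch with probability 0.2:

```python
# Expected instructions fetched when `slots` instructions remain in the
# row and each instruction is a taken branch with probability p.
def expected_fetched(slots, p=0.2):
    if slots == 1:
        return 1.0  # always fetch the last slot in the row
    # Fetch this instruction; stop here if it is a taken branch,
    # otherwise continue with the remaining slots.
    return p * 1 + (1 - p) * (1 + expected_fetched(slots - 1, p))

# Average over the four equally likely starting offsets.
rate = sum(expected_fetched(s) for s in (1, 2, 3, 4)) / 4
print(round(rate, 3))  # 2.048
```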

Ex. IBM RS/6000

[Figure: the RS/6000 fetch hardware. The I$ data array is four-way interleaved, one bank per fetch-group slot, with "T logic" in front of the first three banks. For PC = B1010, the T logic adjusts the row index of the banks before the starting slot so that B10, B11, B12, and B13 can all be read in one access; the instruction buffer network, steered by the tag-check logic, rotates the bank outputs (B12, B13, B10, B11) back into program order (B10, B11, B12, B13).]

Types of Branches
• Direction:
  – Conditional vs. unconditional
• Target:
  – PC-encoded
    • PC-relative
    • Absolute offset
  – Computed (target derived from a register)
• Must resolve both direction and target to determine the next fetch group

Prediction
• Generally, hardware predictors are used for both direction and target
  – The direction predictor simply predicts whether a branch is taken or not-taken (exact algorithms covered next lecture)
  – Target prediction needs to predict an actual address

Where Are the Branches?
• Before we can predict a branch, we need to know that we have a branch to predict!
• Where's the branch in this fetch group?

[Figure: the PC indexes the I$, which returns a raw bit stream; nothing in it marks which bits are a branch instruction.]

Simplistic Fetch Engine

[Figure: the fetch PC accesses the I$; partial-decode (PD) logic examines each fetched instruction to find the branches, then feeds the branch's PC to the direction and target predictors, with +sizeof(inst) supplying the fall-through address.]

• Huge latency! Clock frequency plummets

Branch Identification

[Figure: the same fetch engine, but the I$ stores one extra bit per instruction, set if the instruction is a branch; branches are predecoded on fill from the L2. The branch's PC (+sizeof(inst)) feeds the direction and target predictors.]

• The partial-decode logic is removed, but the latency is still long (the I$ access itself sometimes takes more than 1 cycle)
• Note: sizeof(inst) may not be known before decode (e.g., x86)

Line Granularity
• Predict the next fetch group independent of the exact location of branches within the current fetch group
• If there's only one branch in a fetch group, does it really matter where it is?

[Figure: left, one predictor entry per instruction PC, where only the branch slots (T or N) matter; right, a single predictor entry for the whole fetch group.]

Predicting by Line

[Figure: the cache-line address indexes the direction predictor and target predictor in parallel with the I$; +sizeof($-line) supplies the fall-through line address.]

• Example: a cache line containing two branches, br1 (target X) and br2 (target Y):

  br1  br2 | correct dir pred | correct target pred
   N    N  |        N         |  -- (fall-through)
   N    T  |        T         |  Y
   T    -- |        T         |  X

• Better! Latency is determined by the branch predictor
• This is still challenging: we may need to choose between multiple targets for the same cache line

Multiple Branch Prediction

[Figure: the PC (with its LSBs dropped) indexes the direction predictor, which produces one prediction per slot (e.g., N N N T), and the target predictor, which produces addr0 through addr3. Logic scans for the first "T" and selects the corresponding target; +sizeof($-line), using the PC's LSBs, supplies the sequential address when no branch is predicted taken.]
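In software terms, the selection logic amounts to a scan for the first predicted-taken slot; this sketch shows the idea (function and signal names are illustrative):

```python
# Pick the next fetch address from per-slot direction predictions and
# per-slot predicted targets for one cache line.
def next_fetch(dirs, targets, line_pc, line_bytes=16):
    for d, t in zip(dirs, targets):
        if d == "T":
            return t              # first predicted-taken branch wins
    return line_pc + line_bytes   # no taken branch: sequential line

print(hex(next_fetch("NNNT", [0, 0, 0, 0x2000], 0x1000)))  # 0x2000
```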

Direction Prediction
• Details next lecture
• Over 90% accurate today for integer applications
• Higher still for FP applications

Target Prediction
• PC-relative branches:
  – If not-taken: next address = branch address + sizeof(inst)
  – If taken: next address = branch address + SEXT(offset)
• sizeof(inst) doesn't change
• The offset doesn't change (not counting self-modifying code)
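A sketch of the two cases, assuming an illustrative 16-bit signed offset field (the field width is not specified on the slide):

```python
# Sign-extend a 16-bit offset field (width is an assumption for the example).
def sext16(x):
    return x - (1 << 16) if x & 0x8000 else x

# PC-relative next-address computation: fall-through if not taken,
# branch address + sign-extended offset if taken.
def next_pc(branch_pc, inst_size, offset_field, taken):
    return branch_pc + (sext16(offset_field) if taken else inst_size)

print(hex(next_pc(0x1000, 4, 0xFFF0, taken=True)))   # 0xff0 (backward branch)
print(hex(next_pc(0x1000, 4, 0xFFF0, taken=False)))  # 0x1004 (fall-through)
```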

Taken Targets Only
• We only need to predict taken-branch targets
• A taken branch's target is the same every time
• The prediction is really just a "cache"

[Figure: the PC indexes the target predictor; the predicted taken target and PC + sizeof(inst) are the two candidate next-PC values.]

Branch Target Buffer (BTB)

[Figure: a BTB entry holds a valid bit (V), the branch instruction address as the tag (BIA), and the branch target address (BTA). The branch PC indexes the BTB; the tag comparison, qualified by the valid bit, produces the hit signal, and on a hit the BTA becomes the next fetch PC.]
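The structure above can be sketched as a small direct-mapped table (the size and indexing here are illustrative, not any specific machine's BTB):

```python
# Minimal direct-mapped BTB model: each entry holds a tag (the full
# branch PC, i.e., the BIA) and a target (the BTA); None plays the role
# of a cleared valid bit.
class BTB:
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [None] * entries  # slot: (tag, target) or None

    def _index(self, pc):
        return (pc >> 2) % self.entries  # drop byte offset, then index

    def lookup(self, pc):
        slot = self.table[self._index(pc)]
        if slot and slot[0] == pc:  # valid and tag match: BTB hit
            return slot[1]          # predicted taken target
        return None                 # BTB miss

    def update(self, pc, target):
        self.table[self._index(pc)] = (pc, target)

btb = BTB()
btb.update(0xFC34, 0x1000)
print(hex(btb.lookup(0xFC34)))  # 0x1000 (hit)
print(btb.lookup(0xFD08))       # None (miss)
```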

Set-Associative BTB

[Figure: the PC indexes a set containing several ways, each holding (V, tag, target); all ways' tags are compared in parallel, and the matching way's target is selected as the next PC.]

Cutting Corners
• Branch prediction may be wrong
  – The processor has mechanisms to detect mispredictions
  – So tweaks that make the BTB more or less "wrong" don't change the correctness of processor operation
  – They may affect performance, though

Partial Tags

[Figure: with full tags, each BTB entry stores the complete branch address (e.g., 00000000cfff9810, with target 00000000cfff9704). With partial tags, only a few low-order tag bits are kept (f981, f982, f984), saving storage, but an unrelated branch such as 000001111beef9810 now matches the f981 entry and receives a bogus target prediction.]
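The aliasing case can be reproduced directly; the 16-bit tag width here mirrors the slide's "f981"-style tags but is otherwise an illustrative choice:

```python
# Partial tag: keep only 16 tag bits above the low offset bits, as in
# the slide's "f981" tags. Two unrelated branch PCs can then match the
# same BTB entry and yield a bogus target prediction.
def partial_tag(pc):
    return (pc >> 4) & 0xFFFF

a = 0x00000000CFFF9810   # branch from the slide's example
b = 0x000001111BEEF9810  # unrelated branch
print(hex(partial_tag(a)))                # 0xf981
print(partial_tag(a) == partial_tag(b))   # True: they alias
```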

PC-offset Encoding

[Figure: instead of the full 64-bit target (e.g., 00000000cfff9704), each BTB entry stores only the target's low-order bits (ff9704); on a hit, the predicted target is formed by concatenating the upper bits of the fetch PC (00000000cf) with the stored bits (ff9900, giving 00000000cfff9900).]

• If the target is too far away, or the original PC is close to the "roll-over" point, then the target will be mispredicted
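A sketch of the reconstruction, assuming an illustrative 24-bit stored-offset width (matching the six hex digits shown on the slide):

```python
# Store only the low 24 bits of the target in the BTB and splice them
# onto the upper bits of the current fetch PC.
LOW = 24
MASK = (1 << LOW) - 1

def stored_target(target):
    return target & MASK            # what the BTB keeps

def predicted_target(pc, stored):
    return (pc & ~MASK) | stored    # upper bits of PC + stored bits

pc = 0x00000000CFFF984C
s = stored_target(0x00000000CFFF9900)   # 0xff9900
print(hex(predicted_target(pc, s)))     # 0xcfff9900: correct here
# A target with different upper bits (too far away, or a PC near the
# roll-over point) would be reconstructed incorrectly: misprediction.
```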

BTB Miss?
• The direction predictor says "taken," but the target predictor (BTB) misses
  – We could default to the fall-through PC (as if the direction predictor said not-taken)
  – But we know that's likely to be wrong!
• Stall fetch until the target is known ... when is that?
  – PC-relative: after decode, we can compute the target
  – Indirect: must wait until register read/execute

Stall on BTB Miss

[Figure: the BTB misses while the direction predictor says T; fetch stalls, the branch proceeds to decode, where PC + displacement is computed; that next PC resteers and unstalls fetch.]

BTB Miss Timing

[Figure: pipeline timeline of a BTB miss. In cycle i, the BTB lookup misses while the I$ access for the current PC starts; over cycles i+1 through i+3 the instruction moves through I$ access and decode while fetch stalls; decode computes the next PC, the I$ access for the target starts around cycle i+4, and nops (bubbles) are injected into the pipeline in between.]

Decode-time Correction

[Figure: the BTB predicts target "foo," and fetch continues down foo's path; later, decode computes the real target "bar," discovers the predicted target was wrong, flushes the wrong-path instructions, and resteers fetch (3 cycles of bubbles is better than 20+).]

• Similar penalty as a BTB miss

What about Indirect Jumps?

[Figure: the direction predictor says T, but the BTB misses on an indirect jump whose target comes from R5; decode cannot compute the target, since it must be read from the register.]

• Stall until R5 is ready and the branch executes
  – This may be a while if "Load R5 = 0[R3]" misses to main memory
• Or fetch down the NT-path
  – Why?

Subroutine Calls

[Figure: call sites A (0xFC34), B (0xFD08), and C (0xFFB0) each CALL printf, which starts at P (0x1000); the BTB holds an entry per call site, all with target 0x1000.]

• No problem! Each call site is a direct branch whose target (0x1000) never changes, so the BTB predicts it correctly.

Subroutine Returns

[Figure: printf (P: 0x1000) saves $RA, and its RETN at 0x1B9C jumps through $tmp. When called from A (0xFC34) the return target is A' (0xFC38); when called from B (0xFD08) it is B' (0xFD0C). A BTB entry for the RETN that still holds 0xFC38 mispredicts the second return.]

Return Address Stack (RAS)
• Keep track of the call stack

[Figure: "A: 0xFC34: CALL printf" pushes the return address FC38 onto the RAS (above an older entry D004); the BTB still predicts the call's target. When "0x1B9C: RETN $tmp" is fetched, FC38 is popped and used as the predicted target, steering fetch to A': 0xFC38.]

Overflow
• What if the RAS is full when a call pushes a new return address?
1. Wrap around and overwrite the oldest entry
   • Will lead to an eventual misprediction, after four pops (for a 4-entry RAS)
2. Do not modify the RAS
   • Will lead to a misprediction on the next pop

[Figure: "64AC: CALL printf" tries to push 64B0 onto a full 4-entry RAS holding FC90 (top of stack), 421C, 48C8, and 7300.]
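The wrap-around-and-overwrite policy can be sketched as a small circular buffer; the depth of 4 and the addresses mirror the slide's example:

```python
# Circular RAS with wrap-around-and-overwrite on overflow: pushing onto
# a full stack silently replaces the oldest entry, so that entry's
# return is eventually mispredicted.
class RAS:
    def __init__(self, depth=4):
        self.depth = depth
        self.stack = [0] * depth
        self.top = 0  # count of pushes; index is taken mod depth

    def push(self, ret_addr):
        self.stack[self.top % self.depth] = ret_addr
        self.top += 1

    def pop(self):
        self.top -= 1
        return self.stack[self.top % self.depth]

ras = RAS(depth=4)
for addr in (0x7300, 0x48C8, 0x421C, 0xFC90, 0x64B0):  # 5 pushes, depth 4
    ras.push(addr)
print(hex(ras.pop()))  # 0x64b0: correct
# Three more pops (0xFC90, 0x421C, 0x48C8) are also correct; the fourth
# pop returns the overwritten slot instead of 0x7300: misprediction.
```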

How Can You Tell It's a Return?
• Pre-decode bit in the BTB (return = 1, else = 0)
• Or wait until after decode:
  – Initially use the BTB's target prediction
  – After decode, when you know it's a return, treat it like a BTB miss or BTB misprediction
  – Costs a few bubbles, but it is simpler and still better than a full pipeline flush