Advanced Microarchitecture
Lecture 3: Superscalar Fetch
Fetch Rate is an ILP Upper Bound
• To sustain an execution rate of N IPC, you must be able to sustain a fetch rate of N IPC!
• Over the long term, you cannot burn 2000 calories a day while consuming only 1500 calories a day. You will starve!
• This also suggests that you don’t need to fetch N instructions every cycle, just on average
Impediments to “Perfect” Fetch
• A machine with superscalar degree N will ideally fetch N instructions every cycle
• This doesn’t happen due to:
  – Instruction cache organization
  – Branches
  – And the interaction between the two
Instruction Cache Organization
• To fetch N instructions per cycle from the I$, we need:
  – The physical organization of an I$ row to be wide enough to store N instructions
  – To be able to access the entire row at the same time

[Figure: the address drives a decoder that selects one cache line; each line holds a tag plus N instructions]

• Alternative: do multiple fetches per cycle
  – Not good: increases cycle time/latency by too much
Fetch Operation
• Each cycle, the PC of the next instruction to fetch is used to access an I$ line
• The N instructions specified by this PC and the next N-1 sequential addresses form a fetch group
• The fetch group might not be aligned with the row structure of the I$
Fragmentation via Misalignment
• If PC = xxx01001 and N=4:
  – The ideal fetch group is xxx01001 through xxx01100 (inclusive)

[Figure: the fetch group starts at offset 01 of a cache line and spills into the next line; since only one line can be accessed per cycle, only 3 instructions (instead of N=4) are fetched]
Fetch Rate Computation
• Assume N=4
• Assume the fetch group starts at a random location
• Then fetch rate = ¼ × 4 + ¼ × 3 + ¼ × 2 + ¼ × 1 = 2.5 instructions per cycle
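The expected-value computation above can be sketched in a few lines of Python (an illustrative sketch, not from the slides; the function name is mine):

```python
# Expected fetch rate for an N-wide fetch limited to one cache row per
# cycle, with the fetch group starting at a uniformly random offset.

def misaligned_fetch_rate(n):
    # Starting at offset k of an N-wide row leaves n - k useful
    # instructions in the row; each offset has probability 1/n.
    return sum((n - k) / n for k in range(n))

print(misaligned_fetch_rate(4))  # 2.5, matching the slide
```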
Reduces Fetch Bandwidth
• It now takes two cycles to fetch N instructions
  – Halved fetch bandwidth!
  – The reduction may not be as bad as a full halving

[Figure: Cycle 1 fetches the line containing xxx01001 and delivers the 3 instructions at offsets 01–11; Cycle 2 fetches the next line (xxx01100) to deliver the remaining instruction of the fetch group]
Reducing Fetch Fragmentation
• Make |Fetch Group| != |Row Width|
• If the start of the fetch group is N or more instructions from the end of the cache line, then N instructions can be delivered

[Figure: a cache line twice as wide as the fetch group; the N-instruction fetch group is selected from within the wider line]
May Require Extra Hardware

[Figure: the instructions selected from within the wide cache line pass through a rotator to produce an aligned fetch group]
Fetch Rate Computation
• Let N=4 and cache line size = 8 instructions
• Then fetch rate = 5/8 × 4 + 1/8 × 3 + 1/8 × 2 + 1/8 × 1 = 3.25 instructions per cycle
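The same expected-value argument generalizes to any line width. A small sketch (illustrative; the function name is mine) covering both the N=4, line=4 case and this one:

```python
# Expected fetch rate when the cache line holds `line` instructions and
# the fetch group is `n` wide (line >= n), with the start offset
# uniformly distributed over the line.

def fetch_rate(n, line):
    # From offset k we can deliver min(n, line - k) instructions
    # before hitting the end of the line.
    return sum(min(n, line - k) for k in range(line)) / line

print(fetch_rate(4, 4))  # 2.5  (previous slide)
print(fetch_rate(4, 8))  # 3.25 (this slide)
```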
Fragmentation via Branches
• Even if the fetch group is aligned, and/or the cache line size > the fetch group size, taken branches disrupt fetch

[Figure: a taken branch in the middle of the fetch group invalidates the instructions after it, so only the instructions up to and including the branch are useful]
Fetch Rate Computation
• Let N=4
• Assume a branch every 5 instructions on average (20% chance for each instruction to be a branch)
• Assume branches are always taken
• Assume the branch target may start at any offset in a cache row (25% chance of the fetch group starting at each location)
Fetch Rate Computation (2)
• Fetch group starting at the last slot of the row: ¼ × 1 instruction
• Starting one slot earlier: ¼ × (0.2 × 1 + 0.8 × 2)
• Two slots earlier: ¼ × (0.2 × 1 + 0.8 × (0.2 × 2 + 0.8 × 3))
• Three slots earlier (row-aligned): ¼ × (0.2 × 1 + 0.8 × (0.2 × 2 + 0.8 × (0.2 × 3 + 0.8 × 4)))
• Total = 2.048 instructions fetched per cycle
• Simplified analysis: doesn’t account for the higher probability of the fetch group being aligned due to the previous fetch group not containing branches
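The nested expression above is a recursion, which makes it easy to check numerically. A sketch of the slide's computation (the function name is mine):

```python
# Expected instructions fetched per cycle with N=4, a 20% chance that
# each instruction is an always-taken branch, and a uniformly random
# start offset within a 4-wide row.

P_BRANCH = 0.2

def expected_from(slots):
    # Expected useful instructions when `slots` remain in the row.
    # The first instruction always counts; with probability 0.8 it is
    # not a branch and fetch continues into the remaining slots.
    if slots == 1:
        return 1.0
    return P_BRANCH * 1 + (1 - P_BRANCH) * (1 + expected_from(slots - 1))

rate = sum(expected_from(s) for s in (1, 2, 3, 4)) / 4
print(rate)  # ≈ 2.048, matching the slide
```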
Ex. IBM RS/6000

[Figure: the I$ is built from four banks, each one instruction wide; “T logic” at each bank adjusts the row index so that a fetch starting mid-row (PC = B1010) can read B10, B11, B12, and B13 in one access, and the instruction buffer network rotates them back into program order. Tag-check logic validates the access for the whole line.]
Types of Branches
• Direction:
  – Conditional vs. Unconditional
• Target:
  – PC-encoded
    • PC-relative
    • Absolute offset
  – Computed (target derived from a register)
• Must resolve both direction and target to determine the next fetch group
Prediction
• Generally use hardware predictors for both direction and target
  – The direction predictor simply predicts whether a branch is taken or not-taken (exact algorithms covered next lecture)
  – Target prediction needs to predict an actual address
Where Are the Branches?
• Before we can predict a branch, we need to know that we have a branch to predict!
• Where’s the branch in this fetch group?

[Figure: the PC indexes the I$ and returns a fetch group of raw instruction bits; nothing in the bits themselves identifies which instructions, if any, are branches]
Simplistic Fetch Engine

[Figure: the fetch PC accesses the I$; partial-decode (PD) logic scans each fetched instruction to find branches, which then drive the direction predictor and target predictor to produce the next fetch PC (or the branch’s PC + sizeof(inst) if predicted not-taken)]

• Huge latency! Clock frequency plummets
Branch Identification
• Predecode branches on fill from L2: store 1 bit per instruction, set if the instruction is a branch
• The partial-decode logic is removed, but the latency is still long (the I$ access itself sometimes takes > 1 cycle)
• Note: sizeof(inst) may not be known before decode (e.g., x86)

[Figure: predecode bits stored alongside the I$ feed the direction and target predictors, which produce the branch’s PC + sizeof(inst) or the predicted target]
Line Granularity
• Predict the next fetch group independent of the exact location of branches in the current fetch group
• If there’s only one branch in a fetch group, does it really matter where it is?

[Figure: instead of one predictor entry per instruction PC, keep one predictor entry per fetch group]
Predicting by Line
• Index the predictors with the cache line address; the direction predictor tracks the branches in the line (br1, br2) and the target predictor supplies the taken target, with the fall-through being the line address + sizeof($-line)
• Better! Latency is determined by the branch predictor, not by decode
• This is still challenging: we may need to choose between multiple targets for the same cache line

[Figure: a line containing branches br1 and br2; depending on which branch is predicted taken (N/N, N/T, or T/–), the correct next fetch address is the fall-through, target Y, or target X]
Multiple Branch Prediction

[Figure: the line-granularity direction predictor produces one taken/not-taken bit per instruction in the line (e.g., N N N T); logic scans for the first “T” and selects the corresponding target (addr0–addr3) from the target predictor; if no branch is predicted taken, the next PC is the line address (no LSBs of PC) + sizeof($-line)]
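The “scan for the first T” selection logic can be sketched as follows (an illustrative sketch; the function name, arguments, and 16-byte line size are my assumptions, not from the slides):

```python
# Next-PC selection at line granularity: scan the per-slot
# taken/not-taken predictions for the first "T" and use its predicted
# target; otherwise fall through to the next sequential cache line.

def next_fetch_pc(line_pc, dir_bits, targets, line_bytes=16):
    # dir_bits: per-slot predictions, e.g. "NNNT"
    # targets:  one predicted target address per slot
    for slot, d in enumerate(dir_bits):
        if d == "T":
            return targets[slot]  # first predicted-taken branch wins
    # No taken branch: strip the LSBs and add sizeof($-line).
    return (line_pc & ~(line_bytes - 1)) + line_bytes

print(hex(next_fetch_pc(0x1004, "NNNT", [0, 0, 0, 0x2000])))  # 0x2000
print(hex(next_fetch_pc(0x1004, "NNNN", [0, 0, 0, 0])))       # 0x1010
```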
Direction Prediction
• Details next lecture
• Over 90% accurate today for integer applications
• Higher for FP applications
Target Prediction
• PC-relative branches:
  – If not-taken: next address = branch address + sizeof(inst)
  – If taken: next address = branch address + SEXT(offset)
• sizeof(inst) doesn’t change
• The offset doesn’t change (not counting self-modifying code)
Taken Targets Only
• Only need to predict taken-branch targets
• The taken-branch target is the same every time
• The prediction is really just a “cache”

[Figure: the PC looks up the target predictor for the taken target; the not-taken path is simply PC + sizeof(inst)]
Branch Target Buffer (BTB)

[Figure: the branch PC indexes the BTB; each entry holds a valid bit (V), a branch instruction address tag (BIA), and a branch target address (BTA); the tag is compared against the branch PC to generate a hit signal, and on a hit the BTA supplies the next fetch PC]
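A minimal software model of the direct-mapped BTB just described (a sketch; the entry count and method names are my assumptions):

```python
# Direct-mapped BTB sketch: each entry caches the taken target of a
# branch, tagged by the upper bits of the branch's address.

class BTB:
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [None] * entries  # each slot: (tag, target) or None

    def _index_tag(self, pc):
        # Low bits index the table; the rest form the tag (BIA).
        return pc % self.entries, pc // self.entries

    def lookup(self, pc):
        # Returns the predicted taken target (BTA), or None on a miss.
        idx, tag = self._index_tag(pc)
        entry = self.table[idx]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None

    def update(self, pc, target):
        # Install/refresh the entry when a taken branch resolves.
        idx, tag = self._index_tag(pc)
        self.table[idx] = (tag, target)

btb = BTB()
btb.update(0xFC34, 0x1000)        # CALL printf resolves to 0x1000
print(hex(btb.lookup(0xFC34)))    # 0x1000 (hit)
print(btb.lookup(0xFD08))         # None   (miss)
```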
Set-Associative BTB

[Figure: the PC indexes a set of ways, each with its own valid bit, tag, and target; the tags are compared in parallel and the matching way’s target is selected as the next PC]
Cutting Corners
• Branch prediction may be wrong
  – The processor has mechanisms to detect mispredictions
  – Tweaks that make the BTB more or less “wrong” don’t change the correctness of processor operation
    • They may affect performance
Partial Tags
• Store only the low-order bits of the branch address as the BTB tag (e.g., f981 instead of the full 00000000cfff9810)
• Saves a lot of tag storage, but two different branches can now alias to the same entry: 000001111beef9810 also matches the partial tag f981 and gets 00000000cfff9810’s prediction
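The aliasing in the slide's example can be demonstrated directly (a sketch; the exact tag-extraction scheme, 16 bits taken above a 4-bit offset, is my assumption chosen to reproduce the f981 tag shown):

```python
# Partial-tag aliasing: with only 16 tag bits, two different branch
# PCs can look identical to the BTB.

def partial_tag(pc, bits=16):
    # Drop the low 4 offset bits, then keep only `bits` tag bits.
    return (pc >> 4) & ((1 << bits) - 1)

a = 0x00000000CFFF9810
b = 0x000001111BEEF9810
print(hex(partial_tag(a)))               # 0xf981, as on the slide
print(partial_tag(a) == partial_tag(b))  # True: the two branches alias
```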
PC-offset Encoding
• Store only the low-order bits of the target in the BTB and concatenate them with the upper bits of the current fetch PC (e.g., stored target bits ff9900 combined with the fetch PC’s upper bits 00000000cf)
• Saves target storage, but if the target is too far away, or the original PC is close to a “roll-over” point, the target will be mispredicted
BTB Miss?
• The direction predictor says “taken,” but the target predictor (BTB) misses
  – Could default to the fall-through PC (as if the direction predictor had said not-taken)
    • But we know that’s likely to be wrong!
  – Stall fetch until the target is known … when’s that?
    • PC-relative: after decode, we can compute the target
    • Indirect: must wait until register read/execute
Stall on BTB Miss

[Figure: the direction predictor says taken but the BTB misses; fetch stalls while decode computes PC + displacement, which then supplies the next PC and unstalls fetch]
BTB Miss Timing

[Pipeline timing: in cycle i, the BTB lookup misses while the I$ access for the current PC proceeds; the branch reaches decode, computes its target, and only then (cycle i+2 or later) can the next I$ access start; the intervening fetch slots are filled with stalls/nops injected into the pipeline]
Decode-time Correction
• The BTB hits but predicts the wrong target “foo”; fetch continues down the path of “foo”
• Later, decode computes PC + displacement and discovers the real target is “bar”: the predicted target was wrong, so we flush the wrong-path instructions and resteer fetch
• 3 cycles of bubbles is much better than 20+ from a full pipeline flush; the penalty is similar to that of a BTB miss
What about Indirect Jumps?
• The target comes from a register (e.g., “get target from R5”), so decode can’t compute it
• Options when the BTB misses:
  – Stall until R5 is ready and the branch executes
    • This may be a while if “Load R5 = 0[R3]” misses to main memory
  – Fetch down the NT-path
    • Why?
Subroutine Calls
• Calls to the same function from different sites all branch to the same target, so the BTB predicts them well:
  – A: 0xFC34: CALL printf → 0x1000
  – B: 0xFD08: CALL printf → 0x1000
  – C: 0xFFB0: CALL printf → 0x1000 (start of printf)
• Each call site gets its own BTB entry, all with target 0x1000
• No problem!
Subroutine Returns
• printf saves its return address (P: 0x1000: ST $RA [$sp]), reloads it at the end (0x1B98: LD $tmp [$sp]), and returns through it (0x1B9C: RETN $tmp)
• The return goes back to a different address each time: A’: 0xFC38 after the call from A, B’: 0xFD0C after the call from B
• A single BTB entry for the RETN can only remember one target (e.g., 0xFC38), so the return is mispredicted whenever the caller changes
Return Address Stack (RAS)
• Keep track of the call stack
• On a CALL (A: 0xFC34: CALL printf), push the return address (FC38) onto the RAS
• On a RETN, pop the RAS and use that as the predicted target instead of the BTB’s, correctly resteering fetch to A’: 0xFC38
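The push/pop behavior, including the wrap-around overflow policy discussed on the next slide, can be modeled in a few lines (an illustrative sketch; the 4-entry depth and class shape are my assumptions):

```python
# Return address stack (RAS) sketch with wrap-around on overflow:
# when full, a push silently overwrites the oldest entry.

class RAS:
    def __init__(self, depth=4):
        self.depth = depth
        self.stack = [0] * depth
        self.top = 0  # count of pushes; next free slot is top % depth

    def push(self, return_addr):
        # CALL: record the return address; overflow wraps around.
        self.stack[self.top % self.depth] = return_addr
        self.top += 1

    def pop(self):
        # RETN: predicted target is the most recent return address.
        self.top -= 1
        return self.stack[self.top % self.depth]

ras = RAS()
ras.push(0xFC38)       # A: 0xFC34: CALL printf pushes FC38
print(hex(ras.pop()))  # 0xfc38: the RETN is predicted correctly
```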
Overflow
• What if the RAS is full (e.g., a 4-entry stack holding FC90, 421C, 48C8, 7300) when 64AC: CALL printf needs to push 64B0? Two options:
  1. Wrap around and overwrite the oldest entry
     • Leads to an eventual misprediction, but only after four pops
  2. Do not modify the RAS
     • Leads to a misprediction on the very next pop
How Can You Tell It’s a Return?
• Use a pre-decode bit in the BTB (return=1, else=0)
• Or wait until after decode:
  – Initially use the BTB’s target prediction
  – After decode, when you know it’s a return, treat it like a BTB miss or BTB misprediction
  – Costs a few bubbles, but is simpler and still better than a full pipeline flush