Microbenchmarks and Mechanisms for Reverse Engineering
of Branch Predictor Structures
Vladimir Uzelac and Aleksandar Milenković
LaCASA Laboratory
Electrical and Computer Engineering Department
The University of Alabama in Huntsville
{uzelacv | milenka}@ece.uah.edu

Outline
- Motivation and Goals
- Reverse Engineering Flow
- Predictors Details Deconstruction
  - Target Predictors
    - Branch Target Buffer
    - Indirect Branch Target Buffer
  - Outcome Predictors
    - Loop Predictor
    - Global/Bimodal Predictors
- Conclusion

Motivation
- If we know the branch predictor organization, we could:
  - Implement predictor-aware compiler optimizations
    - Code alignment to avoid BTB conflicts in critical code sections
    - Code splitting to replace long correlations with shorter ones
    - Camino environment [PLDI '05]
  - Have a "gold standard" for academic research
  - Design tools for rapid BP design space exploration and verification
- But details are rarely publicly disclosed
  - In spite of hints in software optimization manuals
- Develop microbenchmarks and mechanisms for reverse engineering of modern branch predictor units

Goals
- Microbenchmarks and mechanisms developed to reverse engineer the Pentium M's branch predictor, including:
  - Target predictor
    - BTB and IBTB
  - Outcome predictor
    - Loop predictor
    - Global outcome predictor
    - Bimodal predictor
- Branch predictor parameters
  - Organization and size of all branch predictor structures
  - Indexing, allocation, update, and replacement policies
  - Interdependencies between these structures
- Validation of our effort through a functional PIN model

Reverse Engineering Flow
- Goal: determine a specific branch predictor parameter (e.g., BTB size)
- Design benchmark(s) to stress the parameter
  - Influenced by the type of observable events
- Build expectations for the relevant event(s) based on back-of-the-envelope analysis
- Execute benchmarks and collect events (VTune)
- Compare expectations with actual results
  - Accept the findings if they match, or modify the benchmark
- Verify findings using the functional PIN model
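[Figure: reverse engineering flow. Goal (branch parameter) -> microbenchmark to stress the parameter -> observable events -> build expectations and collect events (run in VTune) -> compare the number of expected events with the number of collected events. If they are equal, the parameter is extracted and passed to the BP functional model for PIN verification; if not, revisit the microbenchmark.]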
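
The comparison step of the flow can be sketched in a few lines. This is only an illustration: the function name `consistent_hypotheses`, the 5% tolerance, and the numbers are hypothetical and do not come from the measurements behind these slides.

```python
def consistent_hypotheses(expected_by_hypothesis, collected, tolerance=0.05):
    """Return the hypotheses whose expected MPRs match every collected MPR.

    expected_by_hypothesis maps a hypothesis label to a list of expected
    misprediction rates, one per benchmark run. collected holds the measured
    rates in the same order.
    """
    matches = []
    for label, expected in expected_by_hypothesis.items():
        if len(expected) != len(collected):
            raise ValueError(f"{label}: expected and collected lengths differ")
        if all(abs(e - c) <= tolerance for e, c in zip(expected, collected)):
            matches.append(label)
    return matches


# Made-up rates for three benchmark runs (not data from the deck).
expected = {
    "4-way": [0.0, 0.0, 1.0],
    "8-way": [0.0, 0.0, 0.0],
}
collected = [0.01, 0.00, 0.98]
print(consistent_hypotheses(expected, collected))
```

A hypothesis that survives every benchmark is kept; an empty result corresponds to the "revisit the microbenchmark" branch of the flow.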

Branch Target Buffer (BTB)
Background:
- BTB is a cache structure
- Instructions are fetched in 16-byte blocks (Intel)
  - Can have multiple branches per line
- BTB can have multiple hits (same tags)
  => Offset field in each entry
  => Offset algorithm selects the target among several offered
Try to find:
- Number of BTB entries (NBTB)
- Number of sets (NSETS)
- Number of ways (NWAYS)
- Index, Tag bits
- Offset bits and presence of offset algorithm
- Bogus branches handling
- Replacement policy
[Figure: generic BTB organization. The IP selects one of the sets (0 to NSETS); each set has ways 1 to NWAY with Tag, Target, and Offset fields, plus replacement bits.]

Core BTB Test
- Use B taken branches at the distance D from each other
- Code is executed many times to amplify effects on performance counters
- Control how these branches are presented to the BTB
  - To cope with different allocation policies
  - Here, we execute each branch twice consecutively
- Misprediction rate (MPR) as a function of B and D is sufficient to conclude on BTB parameters
[Figure: code layout with Branch 1, Branch 2, ..., Branch B placed D apart, and a plot of MPR (0% to 100%) as a function of B and D (axis ticks 128, 256, 512, 1024 and 2, 4, ..., 128).]

BTB Capacity Tests
- Try to fill the whole BTB using very small distances between branches
- Example: 4-way BTB with 512 entries, BTB index = IP[10:4]
  - NBTB branches can fit for three distances
    - Branches fill sets consecutively
  - For larger D, MPR = f(B, D)
    - Branches jump over sets
  - For very small D, there are more branches in the line than ways
- Mispredictions exist for any D if B > NBTB
- MPR = f(B, D, BTB parameters) can be mathematically formalized
[Figure: two examples of branches filling the ways of a BTB set (sets 0 to NSET); when Branch 5 maps to a set that is already full, one of the resident branches is evicted.]
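
The slide's example (4-way, 512 entries, index = IP[10:4]) can be reproduced with a small software model. The sketch below makes several assumptions that the slides do not state: true LRU replacement, a full-address tag (so no false hits), one lookup per branch per pass, and no offset algorithm. The name `btb_mpr` is hypothetical.

```python
from collections import OrderedDict


def btb_mpr(num_branches, distance, num_sets=128, num_ways=4,
            index_lsb=4, passes=10):
    """Miss rate of B taken branches placed `distance` bytes apart.

    Simplified model: LRU replacement and the full IP used as the tag.
    The first pass only warms up the structure and is not counted.
    """
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = lookups = 0
    for current_pass in range(passes):
        for i in range(num_branches):
            ip = i * distance
            ways = sets[(ip >> index_lsb) % num_sets]
            hit = ip in ways
            if hit:
                ways.move_to_end(ip)          # mark as most recently used
            else:
                if len(ways) == num_ways:
                    ways.popitem(last=False)  # evict least recently used
                ways[ip] = True
            if current_pass > 0:
                lookups += 1
                misses += 0 if hit else 1
    return misses / lookups


# B = NBTB = 512 branches: they fit only for D = 4, 8, and 16.
for d in (2, 4, 8, 16, 32):
    print(d, btb_mpr(512, d))
```

In this model the miss rate is 0 for D = 4, 8, and 16 and 1 for D = 2 and D = 32, which matches the "three distances" remark; real hardware adds effects (allocation policy, offset selection) that the model ignores.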

BTB Set Tests
- Try to fill one BTB set, varying the distance D
- When D > NSET, all branches collide in one set
  - MPR is a function of B only (only 4 branches can fit)
  - Helps find NWAYS and Index MSB
- When D > NSET, change the distance D' between the last two branches to find Index LSB
  - The D' for which mispredictions disappear determines Index LSB
- When D exceeds the Tag MSB distance, false hits occur
  - Only two branches produce mispredictions
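[Figure: IP split into Not Used, Tag, Index, and Offset fields. Branch 1 and Branch 2, placed at distance D = 2^(TAG.MSB + 1), map to the same set (sets 0 to NSET, ways 1 to N) with the same tag, so the second branch produces a false hit.]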

BTB Findings
- Number of BTB entries: 2048
- Number of sets: 512
- Number of ways: 4
- Index = IP[12:4], Tag = IP[21:13], Offset = IP[3:0]
- Offset algorithm: on multiple hits, selects the target with the lowest offset that is not smaller than the offset of the current IP
- Bogus branches handling: evict the whole set
- Replacement policy: tree-based pseudo-LRU
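
As a quick illustration of these fields, the sketch below splits an IP into offset, index, and tag and applies the offset rule to a list of candidate offsets. The helper names (`bits`, `btb_fields`, `select_offset`) and the sample address are hypothetical; only the bit ranges come from the findings.

```python
def bits(value, msb, lsb):
    """Extract value[msb:lsb] (inclusive) as an integer."""
    return (value >> lsb) & ((1 << (msb - lsb + 1)) - 1)


def btb_fields(ip):
    """Split an IP into the BTB fields reported in the findings."""
    return {
        "offset": bits(ip, 3, 0),
        "index": bits(ip, 12, 4),
        "tag": bits(ip, 21, 13),
    }


def select_offset(ip, hit_offsets):
    """Pick the lowest hitting offset that is >= the current IP's offset."""
    current = bits(ip, 3, 0)
    candidates = [offset for offset in hit_offsets if offset >= current]
    return min(candidates) if candidates else None


a = 0x401234           # arbitrary example address
b = a + (1 << 22)      # differs only above the tag MSB (bit 21)
print(btb_fields(a))
print(btb_fields(a) == btb_fields(b))     # same fields: a false hit is possible
print(select_offset(a, [0x2, 0x6, 0xC]))  # offset of a is 4, so 0x6 is chosen
```

Two addresses that differ only in bits above 21 are indistinguishable to such a BTB, which is the false-hit effect the set tests exploit.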
[Figure: Branch target buffer (BTB). 512 sets (0 to 511), ways 0 to 3; accessed with IP[31:4]; Index = IP[12:4], Tag = IP[21:13]; each entry holds Tag (9 bits), Offset (4 bits), Type (2-3 bits), and Target (32 bits); PLRU (3 bits); outputs: BTB hit, BTB target, BTB type.]
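
A tree-based pseudo-LRU for four ways needs exactly three bits, which agrees with the PLRU field above. The sketch below is the generic textbook scheme; the bit encoding actually used by the Pentium M is not given in the slides, so the class `TreePLRU4` and its bit meanings are assumptions.

```python
class TreePLRU4:
    """Generic 3-bit tree pseudo-LRU for one 4-way set (encoding assumed)."""

    def __init__(self):
        self.root = 0   # 0: next victim is in ways 0-1, 1: in ways 2-3
        self.left = 0   # 0: victim is way 0, 1: victim is way 1
        self.right = 0  # 0: victim is way 2, 1: victim is way 3

    def touch(self, way):
        """Record an access: make the tree bits point away from `way`."""
        if way < 2:
            self.root = 1
            self.left = 1 - way
        else:
            self.root = 0
            self.right = 1 - (way - 2)

    def victim(self):
        """Follow the tree bits to the way that would be replaced next."""
        if self.root == 0:
            return self.left
        return 2 + self.right


plru = TreePLRU4()
for way in (0, 1, 2, 3, 0):
    plru.touch(way)
print(plru.victim())  # 2, whereas true LRU would evict way 1
```

The last line shows why the policy is only "pseudo" LRU: after the access sequence 0, 1, 2, 3, 0 the least recently used way is 1, but the tree selects way 2.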

Indirect Branch Target Buffer (IBTB)
Background:
- Target predictor indexed by program-path information
Try to find:
1. Which branch parts affect the PIR during update?
2. How is the PIR updated?
3. Which branch IP bits affect the hash access function?
4. What is the hash access function?
5. What are the Index and Tag fields?
6. What is the IBTB organization?
[Figure: IBTB access path. The PIR and the IP are hashed to form the INDEX and TAG used to look up the IBTB (Tag, Target); the PIR is updated by a function F of the retired branch's IP bits [q:p], target address (TA) bits [s:r], type, and outcome. Callouts 1 to 6 correspond to the questions above.]

Path Information Register: Background
- PIR is a (shift) register, updated with program branches
- Different ways to fold a newly occurring branch into the PIR:
  - Shift and Add (add to lowest PIR bits)
  - Shift and Add with interleave (better indexing)
  - Shift and XOR
[Figure: PIR update schemes for Branch 1, Branch 2, Branch 3, and the newest branch. Shift and Add; Shift and Add with Interleave (shift count = 2); Shift and XOR, where the branch bits are XORed with the shifted PIR.]

PIR Organization Test
- PIR is the same prior to both Target1 and Target2
  - Branches are at a large distance from each other (> 2^q)
- P1.SB1 and P2.SB1 differ in one bit, k = log2(D)
  - If bit k affects the PIR, there are no collisions, and vice versa
- H block: H branches that affect the PIR
  - For large H, P1.SB1 and P2.SB1 are shifted out of the PIR
- Analysis of MPR = f(H, D) gives the following answers:
  - PIR history depth
  - Which branch address/target bits affect the PIR
  - PIR update mechanism details (XOR or Add, ...)
- P1.SB1 and P2.SB1 are replaced with different types of branches
  - Both address and target bits are tested in this way
[Figure: two paths, P1 and P2, of setup branches SB1 ... SBN produce PIR 1 and PIR 2 before the indirect branch iSpy, which jumps to Target1 or Target2. P1.SB1 and P2.SB1 are placed at distance D = 2^k and followed by an H block; Dq = 2^(q+1). A variant uses two spies, iSpy1 and iSpy2, at distance DIP = 2^l. The PIR update function F takes the retired branch's IP bits [q:p], target address (TA) bits [s:r], type, and outcome.]
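
The effect of the H block can be illustrated with a toy PIR model. The sketch assumes the update this deck reports later (shift left by two bits, then XOR) and a 15-bit register; the filler values for the H-block branches are arbitrary, and the names `pir_update` and `paths_distinguishable` are hypothetical.

```python
PIR_BITS = 15                      # assumed register width
PIR_MASK = (1 << PIR_BITS) - 1


def pir_update(pir, branch_bits, shift=2):
    """Shift the PIR left, then XOR in the bits of the new branch."""
    return ((pir << shift) ^ branch_bits) & PIR_MASK


def paths_distinguishable(first_a, first_b, h):
    """Do two paths still yield different PIRs after h common branches?

    first_a and first_b stand for the PIR contributions of P1.SB1 and
    P2.SB1; the h branches of the H block are identical on both paths.
    """
    pir_a = pir_update(0, first_a)
    pir_b = pir_update(0, first_b)
    for i in range(h):
        filler = (0x155 + i) & PIR_MASK   # arbitrary, same on both paths
        pir_a = pir_update(pir_a, filler)
        pir_b = pir_update(pir_b, filler)
    return pir_a != pir_b


# A difference in bit 0 survives 7 common branches but not 8.
print(paths_distinguishable(0, 1, 7), paths_distinguishable(0, 1, 8))
```

Under these assumptions, a one-bit difference at position k is shifted out once k + 2H exceeds 14, so sweeping H and D and watching where mispredictions appear reveals the history depth and the bits that enter the register.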

IBTB Access Hash Function Test
- Find which PIR and branch IP bits are XORed in the iBTB access hash function
  - Previously we found XOR
- Reuse the previous test
  - A difference between P1.SB1 and P2.SB1 in bit k keeps the targets from colliding
- Use two spies at distance DIP = 2^l
  - If bits l and k are XORed in the hash function, the difference in PIR values is cancelled
[Figure: IBTB access path. The PIR and the IP are hashed to form the INDEX and TAG used to look up the IBTB (Tag, Target); callouts 3 to 6 mark the parameters under test.]

IBTB Organization Test
- Employ N indirect branch targets to fill the iBTB in different ways
  - By using N different PIR values
- SB1 ... SBN create N different PIRs, one for each iSpy target
  - SB1 ... SBN are at distance D = 2^k from each other
- MPR = f(D, N) is sufficient to find the IBTB organization
  - Similarly to the BTB
[Figure: setup branches P0.SB8 ... P0.SB1 precede a Dispatch that selects one of N paths P1, P2, ..., PN; each path executes its branch SB1, SB2, ..., SBN (placed D = 2^k apart) before reaching iSpy, which jumps to Target1, Target2, ..., TargetN.]

IBTB Predictor Findings
1. Which branch parts affect the PIR during update?
   - 15 IP bits from a conditional branch IP
   - Combined 15 bits from an indirect branch target and IP
2. How is the PIR updated?
   - Shifted two bits to the left prior to the update (XOR)
3. Which branch IP bits affect the hash access function?
   - 15 bits, IP[18:4]
4. What is the hash access function?
   - XOR
5. What are the Index and Tag fields?
   - Index = HASH[13:6], Tag = HASH[14,5:0]
6. What is the IBTB organization?
   - A direct-mapped cache with 256 entries
[Figure: IBTB organization. HASH (bits 14:0) = PIR (bits 14:0) XOR IP[18:4]; Index = HASH[13:6], Tag = HASH[14,5:0]; 256 entries (0 to 255), each with Tag (7 bits) and Target (32 bits). Signals shown: BTB hit, BTB target, indirect predictor hit, predicted target.]
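
The access function described by these findings can be written down directly. In the sketch below the function names are hypothetical and the PIR is taken to be 15 bits wide, as the figure labels suggest; the bit ranges are the ones listed above.

```python
HASH_MASK = (1 << 15) - 1   # HASH[14:0]


def ibtb_hash(pir, ip):
    """XOR the 15-bit PIR with IP[18:4]."""
    return (pir ^ (ip >> 4)) & HASH_MASK


def ibtb_index_tag(hash_value):
    """Index = HASH[13:6]; Tag = HASH[14] concatenated with HASH[5:0]."""
    index = (hash_value >> 6) & 0xFF
    tag = (((hash_value >> 14) & 1) << 6) | (hash_value & 0x3F)
    return index, tag


# A PIR difference in bit k = 3 is cancelled by an IP difference in
# bit l = k + 4 = 7, which is what the two-spy hash test looks for.
print(ibtb_hash(1 << 3, 1 << 7) == ibtb_hash(0, 0))
print(ibtb_index_tag(ibtb_hash(0x1234, 0x4000)))
```

An 8-bit index addresses the 256 direct-mapped entries, and the remaining 7 hash bits form the tag, so all 15 hash bits are used exactly once.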

Loop Predictor
What do we know?
- Each entry has two counters
  - Counter MAX_VAL stores the loop branch maximum count value
  - Counter CURR_VAL stores the loop branch current iteration
Assumptions:
- Loop BP is an IP-indexed cache
Try to find:
- Counters' length
- Size and organization of the loop branch predictor buffer (Loop BPB)
- Allocation policy (when a branch becomes a candidate for a loop branch)
- Training policy: how a new loop branch's MAX_VAL is set
[Figure: loop predictor entry. CURR_VAL (incremented by 1, reset to 0) is compared with MAX_VAL to produce the prediction.]

Loop Counters Size Test
Test:
- "Spy" loop (LSpy) has loop modulo L
- Mispredictions exist if L exceeds the range of the MAX_VAL counter
Results:
- Maximum predictable L is 64 (6-bit counters)
[Figure: LSpy loop, entered once, iterated L times, then exited.]

Loop BPB Capacity and Set Tests
- Similar to the BTB Capacity/Set tests
- Employ B loops at the distance D from each other
- MPR is a function of B, D, and the Loop BPB parameters, as for the BTB
[Figure: the BTB layout (Branch 1, Branch 2, ..., Branch B spaced D apart) next to its loop counterpart (Loop 1, Loop 2, ..., Loop B spaced D apart), where each loop increases a counter until COUNTER = COUNTER MAX.]

Loop BPB Capacity and Set Tests: Findings
- Counters' length: 6 bits
- Size and organization of the loop branch predictor buffer
  - Two-way cache with 128 entries
  - Index = IP[9:4], Tag = IP[15:10]
- Allocation policy: branch allocated on first opposite outcome
- Training policy: MAX_VAL set during the 2nd loop iteration
[Figure: Loop BPB organization. 64 sets, ways 0 and 1; Index = IP[9:4], Tag = IP[15:10]; each entry holds Tag (6 bits), MAX_VAL (6 bits), CUR_VAL (6 bits), and Pred. (1 bit); outputs: Hit and Prediction.]
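
The counter-size result can be reproduced with a simplified model of one loop predictor entry. The sketch follows the allocation and training policy listed in the findings, but the overflow handling, the fall-back prediction ("taken" when the entry is untrained), and all names are assumptions rather than details confirmed by the slides.

```python
COUNTER_MAX = (1 << 6) - 1   # 6-bit counters hold values 0..63


class LoopEntry:
    """One loop predictor entry (simplified, assumed behavior)."""

    def __init__(self):
        self.allocated = False
        self.trained = False
        self.overflow = False
        self.curr = 0
        self.max_val = 0

    def predict(self):
        """Predict the exit (not taken) once CURR_VAL reaches MAX_VAL."""
        return not (self.trained and self.curr == self.max_val)

    def update(self, taken):
        if not self.allocated:
            # Allocate on the first opposite (not-taken) outcome.
            self.allocated = not taken
            return
        if taken:
            if self.curr == COUNTER_MAX:
                self.overflow = True      # iteration count does not fit
            else:
                self.curr += 1
        else:
            if not self.trained and not self.overflow:
                self.max_val = self.curr  # train on the second loop exit
                self.trained = True
            self.curr = 0
            self.overflow = False


def exit_mispredictions(loop_modulo, executions=10, warmup=3):
    """Count mispredictions of a loop branch after the warm-up executions."""
    entry = LoopEntry()
    wrong = 0
    for run in range(executions):
        for i in range(loop_modulo):
            taken = i < loop_modulo - 1   # last iteration exits the loop
            if run >= warmup and entry.predict() != taken:
                wrong += 1
            entry.update(taken)
    return wrong


print(exit_mispredictions(64), exit_mispredictions(65))
```

In this model a loop with modulo 64 is predicted without errors after training, while modulo 65 mispredicts every exit, consistent with 6-bit counters.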

Global and Bimodal Predictor
What do we know?
- All branches are predicted dynamically
  - At least one predictor is not tagged
Assumptions:
- Cascade organization
  - Bimodal predictor is not tagged
  - Global predictor can correct Bimodal
- Global is path indexed (BHR register)
Try to find:
- Organization of the Global predictor
- Indexing of the Global predictor (BHR and hashing function details)
- Bimodal predictor details
  - Size only (not tagged)
  - Indexing bits (IP indexed)

BHR Organization Test
- Similar to the PIR Organization test
  - iSpy with two targets is replaced with a conditional branch (cSpy) with two outcomes
- MPR = f(D, H) is sufficient to find the BHR organization
Results:
- BHR is affected in the same way as the PIR
  - BHR and PIR are the same register
[Figure: paths P1 and P2 (setup branches SB1 ... SBN, with P1.SB1 and P2.SB1 at distance D = 2^k, followed by an H block) lead to the conditional branch cSpy with outcomes Target1 (nT) and Target2 (T).]

Global Predictor Organization Test
- Similar to the IBTB Organization test
  - N different paths to cSpyN (always not taken)
  - PIR values depend on distance D
- cSpyN is allocated to up to N different entries
  - As for the IBTB, MPR = f(D, N) is sufficient to determine the predictor organization
- Eliminate correct predictions from the Bimodal predictor:
  - cSpyT is at a large distance from cSpyN, so both target the same Bimodal entry
  - Path occurrence pattern: T*PT, PN1, T*PT, PN2, ..., T*PT, PNN, ...
- Eliminate correct predictions from the Loop predictor if needed
[Figure: setup branches P0.SB7 ... P0.SB1 precede a Dispatch that selects path PT or one of PN1, PN2, ..., PNN; branches SBT, SB1, SB2, ..., SBN are spaced D apart and followed by an H block before the spy branches cSpyT and cSpyN.]

Bimodal Predictor Organization Test
- Reuse the previous test
  - Create contention in the Global predictor
- Change the distance DG = 2^k between cSpyT and cSpyN so that the branches are predicted by the Bimodal predictor
  - No contention in the Bimodal predictor if bit k is used for the Bimodal index
[Figure: same setup as in the Global Predictor Organization Test, with cSpyT and cSpyN placed at distance DG.]

Global and Bimodal Predictor Findings
Global:
- 4-way cache structure with 2048 entries
- Accessed with the hash function: PIR XORed with conditional branch IP
  - 9 bits used as the index, 6 bits as the tag
Bimodal:
- A table with 4096 bimodal counters
- Indexed with IP[11:0]
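[Figure: Global predictor organization. HASH (bits 14:0) = PIR (bits 14:0) XOR IP[18:4]; Index = HASH[14:6], Tag = HASH[5:0]; 512 sets (0 to 511), ways 0 to 3; each entry holds Tag (6 bits) and a 2-bit bimodal counter; outputs: Hit and Prediction.]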
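
The indexing of the two outcome predictors and the assumed cascade can be sketched as follows. The function names are hypothetical, the IP bits entering the hash (IP[18:4]) are read from the figure labels, and the cascade rule (use the global counter on a tag hit, otherwise the bimodal counter) follows the stated assumption that the global predictor can correct the bimodal one.

```python
HASH_MASK = (1 << 15) - 1


def global_index_tag(pir, ip):
    """HASH = PIR XOR IP[18:4]; Index = HASH[14:6], Tag = HASH[5:0]."""
    hash_value = (pir ^ (ip >> 4)) & HASH_MASK
    return (hash_value >> 6) & 0x1FF, hash_value & 0x3F


def bimodal_index(ip):
    """The 4096-entry bimodal table is indexed with IP[11:0]."""
    return ip & 0xFFF


def update_counter(counter, taken):
    """Saturating 2-bit counter update (values 0..3)."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)


def cascade_predict(global_counter, bimodal_counter):
    """Use the global counter on a hit (not None), else the bimodal one."""
    counter = bimodal_counter if global_counter is None else global_counter
    return counter >= 2   # True means predict taken


print(global_index_tag(0x7FFF, 0), bimodal_index(0x12345))
print(cascade_predict(None, 3), cascade_predict(0, 3))
```

With a 9-bit index the global predictor has 512 sets of 4 ways (2048 entries), and a 12-bit index gives the 4096 untagged bimodal counters reported in the findings.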

Limitations and Verification
- Generalization of the reverse engineering flow is difficult
  - Different branch prediction organizations
- Implementation of microbenchmarks is a challenging task
  - Balance between observability of certain parameters and isolation of different parameters that share the same event
- Some prior knowledge of the targeted predictor is needed
  - E.g., prediction in cache lines (AMD K8)
- Tests must cover a large design space
- Verification
  - Using the PIN model, achieved more than 95% accuracy

Conclusion
- Microbenchmarks and mechanisms for reverse engineering of path- or IP-indexed predictor structures
- Demonstrated on the Pentium M
  - BTB, IBTB, Loop, Global/Bimodal

[Figure: complete reverse-engineered branch predictor unit, accessed with the current instruction's IP address.
- Branch target buffer (BTB): sets 0 to 511, ways 0 to 3; Offset = IP[3:0], Index = IP[12:4], Tag = IP[21:13]; entry fields: Tag (9 bits), Offset (4 bits), Type (2-3 bits), Target (32 bits); PLRU (3 bits); outputs: BTB hit, BTB target, BTB type.
- Loop branch predictor buffer (LPB): sets 0 to 63, ways 0 and 1; Index = IP[9:4], Tag = IP[15:10]; entry fields: Tag (6 bits), Limit (6 bits), Count (6 bits), Prediction (1 bit); outputs: LPB hit, loop outcome prediction.
- Indirect target cache (iBTB): entries 0 to 255; Index = HASH[13:6], Tag = HASH[14,5:0]; entry fields: Tag (7 bits), Target (32 bit); outputs: iBTB hit, iBTB target.
- Global predictor: sets 0 to 511, ways 0 to 3; Index = HASH[14:6], Tag = HASH[5:0]; entry fields: Tag (6 bits), 2-bit counter (2bC); outputs: global predictor hit, global outcome prediction.
- Bimodal table: entries 0 to 4095; Index = IP[11:0]; 2-bit counters (2bC); output: bimodal outcome prediction.
- XOR hash access function (HASH, 15 bits, 14:0): Path Information Register (PIR, bits 14:0) XORed with the IP address.
- The loop, global, and bimodal outcome predictions are combined into the final outcome prediction.]