Microbenchmarks and Mechanisms for Reverse Engineering of Branch Predictor Structures

Vladimir Uzelac and Aleksandar Milenković

LaCASA Laboratory

Electrical and Computer Engineering Department

The University of Alabama in Huntsville

{uzelacv | milenka}@ece.uah.edu

Outline

Motivation and Goals
Reverse Engineering Flow
Predictors Details Deconstruction
  Target Predictors
    Branch Target Buffer
    Indirect Branch Target Buffer
  Outcome Predictors
    Loop Predictor
    Global/Bimodal Predictors

Conclusion

Motivation

If we knew the branch predictor organization, we could:
  Implement predictor-aware compiler optimizations
    Align code to avoid BTB conflicts in critical code sections
    Split code to replace long correlations with shorter ones
    Camino environment [PLDI '05]
  Have a "gold standard" for academic research
  Design tools for rapid branch predictor (BP) design space exploration and verification
But details are rarely publicly disclosed, in spite of hints in software optimization manuals
Goal: develop microbenchmarks and mechanisms for reverse engineering of modern branch predictor units

Goals

Microbenchmarks and mechanisms developed to reverse engineer the Pentium M's branch predictor, including:
  Target predictor
    BTB and IBTB
  Outcome predictor
    Loop predictor
    Global outcome predictor
    Bimodal predictor
Branch predictor parameters
  Organization and size of all branch predictor structures
  Indexing, allocation, update, and replacement policies
  Interdependencies between these structures
Validation of our effort through a functional PIN model


Reverse Engineering Flow

Goal: determine a specific branch predictor parameter (e.g., BTB size)
  Design benchmark(s) to stress the parameter
    Influenced by the type of observable events
  Build expectations for the relevant event(s) based on back-of-the-envelope analysis
  Execute benchmarks and collect events (VTune)
  Compare expectations with actual results
    Retire findings or modify the benchmark
  Verify findings using the functional PIN model

[Figure: flowchart. Goal (branch predictor parameter) -> microbenchmark to stress the parameter -> observable events -> build expectations and collect events (run in VTune) -> is the number of expected events equal to the number of collected events? Yes: parameter extracted, continue to PIN verification against the BP functional model. No: revisit the microbenchmark.]


Branch Target Buffer (BTB)

Background:
  The BTB is a cache structure
  Instructions are fetched in 16-byte blocks (Intel)
    A line can hold multiple branches
    The BTB can have multiple hits (same tags)
    => Offset field in each entry
    => Offset algorithm selects the target among the several offered
Try to find:
  Number of BTB entries (NBTB)
  Number of sets (NSETS)
  Number of ways (NWAYS)
  Index and tag bits
  Offset bits and presence of an offset algorithm
  Handling of bogus branches
  Replacement policy

[Figure: BTB organization. The IP selects one of NSETS sets (0 to NSETS); each set has ways 1 to NWAY, each entry holding TAG, Target, and Offset fields, plus replacement bits.]

Core BTB Test

Use B taken branches at distance D from each other
  The code is executed many times to amplify effects on the performance counters
Control how these branches are presented to the BTB
  To cope with different allocation policies
  Here, we execute each branch twice consecutively
Misprediction rate (MPR) as a function of B and D is sufficient to draw conclusions about the BTB parameters

[Figure: code layout with Branch 1, Branch 2, ..., Branch B placed at distance D from each other, and a plot of MPR (0% to 100%) as a function of B and D.]

BTB Capacity Tests

Try to fill the whole BTB using very small distances between branches
Example: 4-way BTB with 512 entries, BTB index = IP[10:4]
  NBTB branches can fit for three distances
    Branches fill sets consecutively
  For larger D, MPR = f(B, D)
    Branches jump over sets
  For very small D, there are more branches in the line than sets
  Mispredictions occur for any D if B > NBTB
MPR = f(B, D, BTB parameters) can be mathematically formalized

[Figure: a BTB set (ways 1 to 3 shown) already holding Branches 1 to 4; allocating Branch 5 evicts one of them.]
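The relation MPR = f(B, D, BTB parameters) can be explored with a small simulation. The sketch below is an illustration only, not the authors' model: it assumes full tags, true LRU replacement, allocation on the first miss, and that every BTB miss counts as one misprediction. The function name `btb_mpr` and its parameters are hypothetical; the defaults follow the slide's example (4 ways, 512 entries, index = IP[10:4]).

```python
from collections import OrderedDict


def btb_mpr(num_branches, distance, num_sets=128, num_ways=4,
            index_lsb=4, passes=20):
    """Simulated BTB miss rate for B taken branches placed D bytes apart.

    Assumptions of this sketch: full tags, true LRU replacement,
    allocation on the first miss, one misprediction per BTB miss.
    """
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = accesses = 0
    for p in range(passes):
        for b in range(num_branches):
            ip = b * distance
            ways = sets[(ip >> index_lsb) % num_sets]
            for _ in range(2):  # each branch runs twice consecutively
                hit = ip in ways
                if hit:
                    ways.move_to_end(ip)          # mark most recently used
                else:
                    if len(ways) == num_ways:
                        ways.popitem(last=False)  # evict least recently used
                    ways[ip] = True
                if p > 0:                         # skip the cold first pass
                    accesses += 1
                    misses += not hit
    return misses / accesses
```

Under these assumptions, 512 branches fit without misses for D = 4, 8, and 16 bytes, while D = 2 (more branches per line than ways) and D = 32 (branches skip sets) both produce misses, which is the kind of signature the capacity test looks for.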

BTB Set Tests

Try to fill one BTB set by varying the distance D
When D > NSET, all branches collide in one set
  MPR is a function of B only (only 4 branches can fit)
  Helps find NWAYS and the index MSB
When D > NSET, change the distance D' between the last two branches to find the index LSB
  The D' for which mispredictions disappear determines the index LSB
When D exceeds the tag MSB distance, false hits occur
  Only two branches produce mispredictions

[Figure: IP bit fields (Not Used, Tag, Index, Offset) and a BTB set with ways 1 to N in which Branch 1 and Branch 2, placed at distance D = 2^(TAG.MSB + 1), produce a false hit.]
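The false-hit condition can be checked with a few lines of bit arithmetic. This is a minimal sketch that borrows the index and tag fields reported on the deck's BTB Findings slide (Index = IP[12:4], Tag = IP[21:13]); the helper names `bits` and `btb_fields` and the example addresses are hypothetical.

```python
def bits(value, msb, lsb):
    """Return value[msb:lsb] as an integer (inclusive bit range)."""
    return (value >> lsb) & ((1 << (msb - lsb + 1)) - 1)


def btb_fields(ip, index=(12, 4), tag=(21, 13)):
    """Split an IP into (index, tag) for a partially tagged BTB."""
    return bits(ip, *index), bits(ip, *tag)


TAG_MSB = 21
branch1 = 0x0040_1230                      # arbitrary example address
branch2 = branch1 + (1 << (TAG_MSB + 1))   # D = 2^(TAG.MSB + 1)

# Same index and same tag: the BTB cannot tell the two branches apart,
# so a lookup for branch2 falsely hits on branch1's entry.
false_hit = btb_fields(branch1) == btb_fields(branch2)

# Half that distance still flips the top tag bit, so no false hit.
near = branch1 + (1 << TAG_MSB)
distinct = btb_fields(branch1) != btb_fields(near)
```

Sweeping the distance over powers of two and watching where mispredictions first appear for just two branches is what pins down the tag MSB.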

BTB Findings

Number of BTB entries: 2048
Number of sets: 512
Number of ways: 4
Index = IP[12:4], Tag = IP[21:13], Offset = IP[3:0]
Offset algorithm: when there are multiple hits, selects the target with the lowest offset that is no smaller than the current IP
Bogus branch handling: evict the whole set
Replacement policy: tree-based pseudo-LRU

[Figure: BTB with 512 sets (0 to 511) and 4 ways (Way 0 to Way 3), accessed with IP[31:4]; Index = IP[12:4], Tag = IP[21:13]. Fields: Tag (9 bits), Offset (4 bits), Type (2-3 bits), Target (32 bits), PLRU (3 bits). Outputs: BTB hit, BTB target, BTB type.]
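The offset algorithm described above can be sketched as a small selection function. This is an illustration of the stated rule only; the function name `select_target` and the representation of hits as (offset, target) pairs are assumptions, not details from the deck.

```python
def select_target(fetch_ip, hits):
    """Pick a BTB target among several tag-matching entries in one line.

    `hits` is a list of (offset, target) pairs, where offset is IP[3:0]
    of the branch that owns the entry. Returns the target of the entry
    with the lowest offset that is not below the fetch IP's own offset,
    or None if every matching branch lies before the fetch IP.
    """
    start = fetch_ip & 0xF  # Offset = IP[3:0]
    candidates = [(off, tgt) for off, tgt in hits if off >= start]
    if not candidates:
        return None
    return min(candidates, key=lambda entry: entry[0])[1]
```

For example, with branches at offsets 0x2, 0x9, and 0xC in the same 16-byte line, a fetch that enters the line at offset 0x4 skips the first branch and takes the target stored for offset 0x9.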


Indirect Branch Target Buffer (IBTB)

Background: target predictor indexed by program-path information
Try to find:
  1. Which branch parts affect the PIR during update?
  2. How is the PIR updated?
  3. Which branch IP bits affect the hash access function?
  4. What is the hash access function?
  5. What are the Index and Tag fields?
  6. What is the IBTB organization?

[Figure: IBTB access path. A retired branch's IP bits, target address (TA) bits, type, and outcome update the PIR through function F; the PIR and IP are hashed into INDEX and TAG fields that access the IBTB, whose entries hold Tag and Target. Numbered callouts 1 to 6 mark the parts addressed by the questions above.]

Path Information Register: Background

The PIR is a (shift) register, updated with program branches
Different ways to fold a newly occurring branch into the PIR:
  Shift and Add (add to the lowest PIR bits)
  Shift and Add with interleave (better indexing)
  Shift and XOR

[Figure: three PIR update schemes. Shift and Add: the PIR is shifted and the newest branch joins Branches 1 to 3 in the lowest bits. Shift and Add with Interleave. Shift and XOR: the PIR is shifted (shift count = 2) and XORed with the branch bits.]
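Two of the update schemes named above can be sketched as follows. This is one plausible reading, not the hardware's definition: "add" is interpreted here as placing the new bits into the lowest PIR bits, and the 15-bit register width is an assumption chosen for illustration. The interleaved variant is left out because the slide does not specify its bit mapping.

```python
PIR_BITS = 15   # assumed register width, for illustration only
SHIFT = 2       # shift count shown in the slide's Shift and XOR figure
MASK = (1 << PIR_BITS) - 1


def shift_and_add(pir, branch_bits, new_bits=SHIFT):
    """Shift the PIR and place the new branch bits in its lowest bits."""
    low = branch_bits & ((1 << new_bits) - 1)
    return ((pir << new_bits) | low) & MASK


def shift_and_xor(pir, branch_bits):
    """Shift the PIR, then XOR the full-width branch bits into it."""
    return ((pir << SHIFT) ^ (branch_bits & MASK)) & MASK
```

The practical difference is that shift-and-add keeps each branch in its own bit slot, whereas shift-and-XOR lets one branch influence many PIR bits at once, so older and newer branches overlap.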

PIR Organization Test

The PIR is the same prior to both Target1 and Target2
  Branches are at a large distance from each other (> 2^q)
P1.SB1 and P2.SB1 differ in one bit, k = log2(D)
  If bit k affects the PIR, there are no collisions, and vice versa
H block: H branches that affect the PIR
  For large H, P1.SB1 and P2.SB1 are shifted out of the PIR
Analysis of MPR = f(H, D) gives the following answers:
  PIR history depth
  Which branch address/target bits affect the PIR
  PIR update mechanism details (XOR or Add ...)
    P1.SB1 and P2.SB1 are replaced with different types of branches
    Both address and target bits are tested in this way

[Figure: two paths, P1 (P1.SB1, P1.SB2, ..., P1.SBN) and P2 (P2.SB1, P2.SB2, ..., P2.SBN), lead through an H block to the indirect branch iSpy with targets Target1 and Target2, producing PIR 1 and PIR 2. P1.SB1 and P2.SB1 are at distance D = 2^k; the other branches are at distance Dq = 2^(q+1). The PIR update path (IP bits, TA bits, branch type, outcome, function F) is repeated from the IBTB slide.]
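The role of the H block can be illustrated by simulating the register. This sketch assumes the parameters reported on the deck's IBTB Predictor Findings slide (a 15-bit value XORed in after a two-bit left shift); the function names and the filler value are hypothetical.

```python
import math

PIR_BITS, SHIFT = 15, 2
MASK = (1 << PIR_BITS) - 1


def pir_after(path, pir=0):
    """Fold a sequence of 15-bit branch values into the PIR."""
    for branch_bits in path:
        pir = ((pir << SHIFT) ^ branch_bits) & MASK
    return pir


def collide(k, h, filler=0x2A5):
    """True if two paths whose first branch differs only in bit k
    produce the same PIR after h identical H-block branches."""
    p1 = [0x0000] + [filler] * h
    p2 = [1 << k] + [filler] * h
    return pir_after(p1) == pir_after(p2)


def depth_needed(k):
    """Smallest H that shifts a difference in bit k out of the PIR."""
    return math.ceil((PIR_BITS - k) / SHIFT)
```

With these assumptions a difference in bit 0 survives seven H-block branches and disappears after the eighth, so the H at which the two paths start to collide exposes the history depth.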

IBTB Access Hash Function Test

Find which PIR and branch IP bits are XORed in the IBTB access hash function
  Previously we found XOR
Reuse the previous test
  A difference between P1.SB1 and P2.SB1 in bit k keeps the targets from colliding
Use two spies at distance DIP = 2^l
  If bits l and k are XORed in the hash function, the difference in PIR values is cancelled

[Figure: the previous test with two indirect branches, iSpy1 and iSpy2, at distance DIP = 2^l, and the IBTB access path in which the PIR and IP are hashed into INDEX and TAG.]
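The cancellation argument can be verified numerically. The sketch below assumes the hash reported on the deck's IBTB Predictor Findings slide, HASH = PIR XOR IP[18:4]; the function names and the example PIR and IP values are hypothetical.

```python
MASK15 = (1 << 15) - 1


def ibtb_hash(pir, ip):
    """HASH[14:0] = PIR[14:0] XOR IP[18:4] (per the deck's findings)."""
    return (pir ^ (ip >> 4)) & MASK15


def difference_cancelled(k, l, pir=0x1234, spy_ip=0x0040_0000):
    """The two PIR values differ in bit k; the two spies differ in IP bit l.

    spy_ip has bits 4..18 clear, so flipping bit l equals adding 2^l.
    """
    h1 = ibtb_hash(pir, spy_ip)
    h2 = ibtb_hash(pir ^ (1 << k), spy_ip ^ (1 << l))
    return h1 == h2
```

Under this hash, the two accesses collide again exactly when l = k + 4, so scanning l for each k reveals which IP bit is paired with which PIR bit.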

IBTB Organization Test

Employ N indirect branch targets to fill the IBTB in different ways
  By using N different PIR values
SB1 ... SBN create N different PIRs, one for each iSpy target
  SB1 ... SBN are at distance D = 2^k from each other
MPR = f(D, N) is sufficient to find the IBTB organization
  Similarly to the BTB

[Figure: a dispatch block selects one of N paths P1 ... PN through branches SB1 ... SBN (at distance D = 2^k from each other), together with a block of branches P0.SB1 ... P0.SB8; the indirect branch iSpy then jumps to Target1 ... TargetN.]

IBTB Predictor Findings

1. Which branch parts affect the PIR during update?
  15 bits from the conditional branch IP
  Combined 15 bits from the indirect branch target and IP
2. How is the PIR updated?
  Shifted left by two bits prior to the update (XOR)
3. Which branch IP bits affect the hash access function?
  15 bits, IP[18:4]
4. What is the hash access function?
  XOR
5. What are the Index and Tag fields?
  Index = HASH[13:6], Tag = HASH[14,5:0]
6. What is the IBTB organization?
  A direct-mapped cache with 256 entries

[Figure: IBTB with 256 entries (0 to 255), each holding Tag (7 bits) and Target (32 bits). HASH[14:0] = PIR[14:0] XOR IP[18:4]; Index = HASH[13:6], Tag = HASH[14,5:0]. The indirect predictor hit selects between the IBTB target and the BTB target to form the predicted target.]
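These findings are enough to write a small functional model of the lookup. The sketch below follows the reported fields (HASH = PIR XOR IP[18:4], Index = HASH[13:6], Tag = HASH[14,5:0], 256 direct-mapped entries); the class name `IBTB`, its methods, and the always-overwrite update are assumptions made for illustration.

```python
class IBTB:
    """Direct-mapped, 256-entry indirect target cache (sketch)."""

    def __init__(self):
        self.entries = [None] * 256  # each slot: (tag, target) or None

    @staticmethod
    def _fields(pir, ip):
        h = (pir ^ (ip >> 4)) & 0x7FFF            # HASH = PIR XOR IP[18:4]
        index = (h >> 6) & 0xFF                    # HASH[13:6]
        tag = (((h >> 14) & 1) << 6) | (h & 0x3F)  # HASH[14,5:0], 7 bits
        return index, tag

    def predict(self, pir, ip):
        """Return the predicted target, or None on an IBTB miss."""
        index, tag = self._fields(pir, ip)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None

    def update(self, pir, ip, target):
        """Assumed policy: always overwrite the indexed entry."""
        index, tag = self._fields(pir, ip)
        self.entries[index] = (tag, target)
```

Because the index comes from the hash rather than from the IP alone, one indirect branch can occupy several entries, one per distinct path leading to it.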


Loop Predictor

What do we know?
  Each entry has two counters
    Counter MAX_VAL stores the loop branch maximum count value
    Counter CURR_VAL stores the loop branch current iteration
Assumptions:
  The loop BP is an IP-indexed cache
Try to find:
  Counters' length
  Size and organization of the loop branch predictor buffer (Loop BPB)
  Allocation policy (when a branch becomes a candidate for a loop branch)
  Training policy (how a new loop branch's MAX_VAL is set)

[Figure: loop predictor entry. CURR_VAL (with +1 and 0 inputs) is compared (=) with MAX_VAL to produce the prediction.]

Loop Counters Size Test

Test:
  The "spy" loop (LSpy) has loop modulo L
  Mispredictions exist if L exceeds the range of the MAX_VAL counter
Results:
  Maximum predictable L is 64 (6-bit counters)

[Figure: the LSpy loop is entered, iterated L times, then exited.]

Loop BPB Capacity and Set Tests

Similar to the BTB capacity and set tests
Employ B loops at distance D from each other
MPR is a function of B, D, and the Loop BPB parameters, similarly to the BTB

[Figure: Loop 1, Loop 2, ..., Loop B placed at distance D from each other; each loop increases its counter until COUNTER = COUNTER MAX.]

Loop Predictor Findings

Counters' length: 6 bits
Size and organization of the loop branch predictor buffer:
  Two-way cache with 128 entries
  Index = IP[9:4], Tag = IP[15:10]
Allocation policy: a branch is allocated on its first opposite outcome
Training policy: MAX_VAL is set during the 2nd loop iteration

[Figure: Loop BPB with two ways (Way 0, Way 1), Index = IP[9:4], Tag = IP[15:10]. Each entry holds Tag (6 bits), MAX_VAL (6 bits), CUR_VAL (6 bits), and Pred. (1 bit). Outputs: Hit, Prediction.]
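The effect of the 6-bit counters can be reproduced with a single-entry model. The sketch below is deliberately simplified and is not the Pentium M's training logic: it retrains MAX_VAL on every loop exit instead of following the allocation and training policies listed above, and it assumes a loop with modulo L produces L-1 taken outcomes followed by one not-taken outcome. The names `LoopEntry` and `mispredictions` are hypothetical.

```python
COUNTER_MAX = (1 << 6) - 1  # 6-bit counters hold 0..63


class LoopEntry:
    def __init__(self):
        self.max_val = None  # learned number of taken outcomes per pass
        self.curr_val = 0

    def predict(self):
        """True means taken. An untrained entry defaults to taken."""
        if self.max_val is None:
            return True
        return self.curr_val < self.max_val

    def update(self, taken):
        if taken:
            self.curr_val = min(self.curr_val + 1, COUNTER_MAX)
        else:
            self.max_val = self.curr_val  # simplified: retrain on each exit
            self.curr_val = 0


def mispredictions(loop_modulo, passes=50, warmup=2):
    """Steady-state mispredictions for a loop with L outcomes per pass:
    L-1 taken followed by one not-taken (assumed convention)."""
    entry, wrong = LoopEntry(), 0
    for p in range(passes):
        for i in range(loop_modulo):
            taken = i < loop_modulo - 1
            if p >= warmup and entry.predict() != taken:
                wrong += 1
            entry.update(taken)
    return wrong
```

In this model every L up to 64 is predicted without error once trained, and L = 65 starts to mispredict because the counter saturates, which matches the shape of the reported result.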


Global and Bimodal Predictor

What do we know?
  All branches are predicted dynamically
  At least one predictor is not tagged
Assumptions:
  Cascade organization
    The bimodal predictor is not tagged
    The global predictor can correct the bimodal predictor
  The global predictor is path indexed (BHR register)
Try to find:
  Organization of the global predictor
  Indexing of the global predictor (BHR and hashing function details)
  Bimodal predictor details
    Size only (not tagged)
    Indexing bits (IP indexed)

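BHR Organization Test

Similar to the PIR Organization test
  The iSpy with two targets is replaced with a conditional branch (cSpy) with two outcomes
  MPR = f(D, H) is sufficient to find the BHR organization
Results:
  The BHR is affected in the same way as the PIR
  The BHR and the PIR are the same register

[Figure: two paths, P1 (P1.SB1, P1.SB2, ..., P1.SBN) and P2 (P2.SB1, P2.SB2, ..., P2.SBN), with P1.SB1 and P2.SB1 at distance D = 2^k, lead through an H block to the conditional branch cSpy with outcomes Target1 (not taken) and Target2 (taken).]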

Global Predictor Organization Test

Similar to the IBTB Organization test
  N different paths to cSpyN (always not taken)
    PIR values depend on the distance D
  cSpyN is allocated to up to N different entries
  Similarly to the IBTB, MPR = f(D, N) is sufficient to determine the predictor organization
Eliminate correct predictions from the bimodal predictor:
  cSpyT is at a large distance from cSpyN, so both target the same bimodal entry
  Path occurrence pattern: T*PT, PN1, T*PT, PN2, ..., T*PT, PNN, ...
Eliminate correct predictions from the loop predictor if needed

[Figure: a dispatch block selects path PT (through SBT to cSpyT) or one of the paths PN1 ... PNN (through SB1 ... SBN, at distance D from each other, to cSpyN), together with a block of branches P0.SB1 ... P0.SB7 and an H block.]

Bimodal Predictor Organization Test

Reuse the previous test
  Create contention in the global predictor
  Change the distance between cSpyT and cSpyN to try predicting the branches with the bimodal predictor
    DG = 2^k
    No contention in the bimodal predictor if bit k is used for the bimodal index

[Figure: the layout of the previous test, with cSpyT and cSpyN at distance DG.]

Global and Bimodal Predictor Findings

Global:
  4-way cache structure with 2048 entries
  Accessed with the hash function: PIR XORed with the conditional branch IP
    9 bits used as the index, 6 bits as the tag
Bimodal:
  A table with 4096 bimodal counters
  Indexed with IP[11:0]

[Figure: global predictor with 512 sets (0 to 511) and 4 ways (Way 0 to Way 3); each entry holds Tag (6 bits) and a 2-bit bimodal counter. HASH[14:0] = PIR[14:0] XOR IP[18:4]; Index = HASH[14:6], Tag = HASH[5:0]. Outputs: Hit, Prediction.]


Limitations and Verification

Generalization of the reverse engineering flow is difficult
  Different branch predictor organizations
Implementation of microbenchmarks is a challenging task
  Balance between observability of certain parameters and isolation of different parameters that share the same event
  Certain knowledge of the targeted predictor is needed
    E.g., prediction in cache lines (AMD K8)
  Tests must cover a large design space
Verification
  Using the PIN model, achieved more than 95% accuracy

Conclusion

Microbenchmarks and mechanisms for reverse engineering of path- or IP-indexed predictor structures
Demonstrated on the Pentium M: BTB, IBTB, Loop, Global/Bimodal

[Figure: summary block diagram of the reverse-engineered Pentium M branch predictor unit.
  Hash access function (HASH, 15 bits) = Path Information Register (PIR) XOR current instruction IP address
  Branch target buffer (BTB): sets 0 to 511, Way 0 to Way 3; Offset = IP[3:0], Index = IP[12:4], Tag = IP[21:13]; fields Tag (9 bits), Offset (4 bits), Type (2-3 bits), Target (32 bits), PLRU (3 bits); outputs BTB hit, BTB target, BTB type
  Loop branch predictor buffer (LPB): sets 0 to 63, Way 0 and Way 1; Index = IP[9:4], Tag = IP[15:10]; fields Tag (6 bits), Limit (6 bits), Count (6 bits), Prediction (1 bit); outputs LPB hit, loop outcome prediction
  Indirect target cache (iBTB): entries 0 to 255; Index = HASH[13:6], Tag = HASH[14,5:0]; fields Tag (7 bits), Target (32 bit); outputs iBTB hit, iBTB target
  Global predictor: sets 0 to 511, Way 0 to Way 3; Index = HASH[14:6], Tag = HASH[5:0]; fields Tag (6 bits), 2-bit counter (2bC); outputs global predictor hit, global outcome prediction
  Bimodal table: entries 0 to 4095 of 2-bit counters (2bC); Index = IP[11:0]; output bimodal outcome prediction
  The loop, global, and bimodal outcome predictions feed the final outcome prediction]
