Control Dependency
Problem
– Dependency tracking in ARVI ignores control dependencies
– It cannot obtain the practical available register set
  • It generates the same pattern for different branch directions
– So it CANNOT predict that branch correctly
Example of control dependency
Practical Available Register Set of branch 2
– r1, r3
Available Register Set in ARVI of branch 2
– r3
When r3 == 1
– r1 == 0 -> not taken
– r1 != 0 -> taken
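The example above can be illustrated with a minimal sketch. It assumes a pattern is simply the tuple of values of the registers in the ARS; the helper name `make_pattern` and the value r1 = 7 are hypothetical and chosen only for illustration.

```python
def make_pattern(register_values, ars):
    """Collect the values of the registers in the ARS, in a fixed order."""
    return tuple(register_values[name] for name in sorted(ars))


# Two executions that reach branch 2 with r3 == 1 but different r1 values.
not_taken_case = {"r1": 0, "r3": 1}  # r1 == 0 -> not taken
taken_case = {"r1": 7, "r3": 1}      # r1 != 0 -> taken (7 is a made-up value)

arvi_ars = {"r3"}             # ARS seen by ARVI tracking
practical_ars = {"r1", "r3"}  # practical ARS

# With only r3 in the ARS, both directions produce the same pattern.
same_in_arvi = (make_pattern(not_taken_case, arvi_ars)
                == make_pattern(taken_case, arvi_ars))
# With r1 included, the two directions become distinguishable.
same_in_practical = (make_pattern(not_taken_case, practical_ars)
                     == make_pattern(taken_case, practical_ars))
```

When the two directions share one pattern, no predictor indexed by that pattern can separate them, which is the problem the slide describes.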
Dependence Tracking
[Figure: dependence tracking for two cases, Branch 1 not taken and Branch 1 taken]
Improved Data Dependence Tracking
Resolve control dependency
– Add control-flow information to the tracking
  • Add a logical register to the architecture
    – Called the TA register (target address)
    – It holds the target address of the last branch
  • TA is a hidden source register of instructions
Behavior of branch instruction
Example
– beq in the MIPS instruction set architecture
if (rs == rt) then
    TA = PC + 4 + {14{imm[15]}, imm, 2'b0}
else
    TA = PC + 4
endif
PC = TA
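The pseudocode above can be read as the following Python sketch. The function names are hypothetical, and a 32-bit address space is assumed; the expression `{14{imm[15]}, imm, 2'b0}` is modeled as sign-extending the 16-bit immediate and shifting it left by two bits.

```python
def sign_extend_16(imm: int) -> int:
    """Interpret the low 16 bits of imm as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def beq_next_ta(pc: int, rs: int, rt: int, imm: int) -> int:
    """Return the value that beq writes to TA; the PC is then set to TA."""
    if rs == rt:
        # Taken: target is PC + 4 plus the sign-extended immediate times 4.
        ta = pc + 4 + (sign_extend_16(imm) << 2)
    else:
        # Not taken: fall through to the next instruction.
        ta = pc + 4
    return ta & 0xFFFFFFFF  # assumed 32-bit address width
```

Because TA differs between the taken and not-taken cases, an instruction that reads TA as a hidden source carries the direction of the last branch in its register-value pattern.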
Improved Dependence Tracking
[Figure: improved dependence tracking for two cases, Branch 1 not taken and Branch 1 taken]
ARS of Branch 13
Improved tracking
– Tracks control dependency well
– When completed up to INST1
  • The ARS differs from the practical ARS
  • But the branch can still be predicted well
    – because TA carries control-flow information
  • When r3 == 1
    – TA == 2 -> not taken
    – TA == 4 -> taken
Common code problem
Performance loss in code that is not control dependent
– In common code
ARS of Branch 15
– When completed up to INST2
– Practical ARS
  • r2 (INST1)
– Previously proposed tracking
  • r2 (INST1)
– Improved tracking
  • r2 (INST1), TA
Distinguishing control flow in improved tracking
TA is wasted information
– This does not mean that the prediction is incorrect
– It means that the predictor needs more training
Information to train
– Previous tracking
  • r2 = 0 -> Taken
– Improved tracking
  • r2 = 0, TA = 5 -> Taken
  • r2 = 0, TA = 6 -> Taken
“SetTA” Instruction
Add a “SetTA” instruction
– It saves the address of the next instruction to TA
ARS of branch 15 is still r2 and TA
– But TA is always 6
Disadvantages
– Wasted instructions (INST6)
– Programs must be recompiled
– The start of common code has to be found in order to add “SetTA” at compile time
  • This is hard because assembly language is not a structured programming language (it has “goto”)
Encoding
The amount of information changes with the number of registers in the ARS
– Amount of information
  » Assume each value is 10 bits long
  • 1 register in ARS => 10 bits
  • 2 registers in ARS => 20 bits
  • 3 registers in ARS => 30 bits
A fixed-length pattern must be generated from variable-length information
– -> HASH
– Various encodings are possible
Encoding of ARVI
XOR of the physical register values
– Simple XOR hash with an XOR tree
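A minimal software sketch of a simple XOR hash follows. The function name `xor_hash` is hypothetical, and the 10-bit default width is an assumption for illustration; hardware would use an XOR tree, but a sequential fold gives the same result because XOR is associative and commutative.

```python
from functools import reduce


def xor_hash(values, width=10):
    """Fold any number of register values into one fixed-width pattern."""
    mask = (1 << width) - 1
    # XOR all values together, keeping only the low `width` bits of each.
    return reduce(lambda acc, value: acc ^ (value & mask), values, 0)
```

For example, `xor_hash([0b1010, 0b0110])` gives `0b1100`. Note that `xor_hash([5, 5])` and `xor_hash([])` both give 0, a small instance of the hash conflicts that a plain XOR can produce.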
Reducing Hash conflict
Programs use the lower bits of registers more than the higher bits
– Most of the information is concentrated in the lower bits
– Hash conflicts occur because of the lower bits
To spread out the information distribution
– Circularly shift each value by a different amount per logical register number
  • Because the physical register number changes at run time
Percentage of use of each bit
There are bits that programs use most often
Hash conflicts occur in those bits
[Figure: two charts covering 164.gzip, 175.vpr, 176.gcc, 177.mesa, 179.art, 181.mcf, 183.equake, 197.parser, 255.vortex, 256.bzip2 and the average; axis ticks from 0 to 0.12]
Degree of centralization
A high value means that only a small number of register bits are used
– Information is concentrated in a small number of bits
– It is spread out well by the circular shift
[Figure: degree of centralization for 164.gzip, 175.vpr, 176.gcc, 177.mesa, 179.art, 181.mcf, 183.equake, 197.parser, 255.vortex, 256.bzip2 and the average; series: non32, circular32]
Proposed Encoding
XOR of the logical register values
– Each value is circularly shifted by a different amount according to its logical register number
– Serialize the physical-to-logical mapping
  • The value information is shorter than before (disadvantage)
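The idea of shifting each value before the XOR can be sketched as follows. The names `rotl` and `rotated_xor_hash` are hypothetical, and two details are assumptions rather than taken from the slides: the rotation amount is the logical register number itself, and values are 32 bits wide.

```python
def rotl(value, amount, width):
    """Rotate the low `width` bits of value left by `amount` positions."""
    mask = (1 << width) - 1
    value &= mask
    amount %= width
    return ((value << amount) | (value >> (width - amount))) & mask


def rotated_xor_hash(ars, width=32):
    """XOR the ARS values after rotating each by its logical register number.

    `ars` maps a logical register number to that register's value.
    """
    pattern = 0
    for logical_number, value in ars.items():
        pattern ^= rotl(value, logical_number, width)
    return pattern
```

With a plain XOR, two registers that both hold 1 cancel to 0. With the rotation, `rotated_xor_hash({1: 1, 2: 1})` gives 6, so small values in different registers no longer collide in the low bits.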
Select Logical Register X
– Select the physical register value that is mapped to logical register X
Delay
  » nPR = number of physical registers
  » nLR = number of logical registers
  » L = Log2(nLR)
Simple XOR hash
– Log2(nPR) * XOR2 + AND2
Proposed hash
– Log2(nLR) * XOR2 + Select + AND2
  • Select = XOR2 + ANDL + Gate + OR(nPR)
nPR > nLR * 2
– Log2(nPR) > Log2(nLR) + 1
– Approximately the same or slightly slower
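As a rough illustration of the XOR-tree terms in the delay comparison, the sketch below counts two-input XOR levels. The register counts (128 physical, 32 logical) are hypothetical values chosen only to satisfy nPR > nLR * 2, and the function name is an assumption.

```python
def xor_tree_levels(n_inputs: int) -> int:
    """Number of 2-input XOR levels needed to reduce n_inputs values to one."""
    return (n_inputs - 1).bit_length()


n_pr, n_lr = 128, 32  # hypothetical register counts, not from the slides

simple_levels = xor_tree_levels(n_pr)    # XOR levels in the simple hash
proposed_levels = xor_tree_levels(n_lr)  # XOR levels in the proposed hash
levels_saved = simple_levels - proposed_levels
```

Under these assumed sizes the proposed hash needs two fewer XOR2 levels, which is the slack available to offset the added Select stage.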
HW Resource
Simple XOR hash
– nPR * N * AND2 + (nPR - 1) * N * XOR2
– (nPR - 1) * 3-bit ADD for the logical number tag
Proposed hash
– nPR * N * AND2 + nLR * Select + (nLR - 1) * N * XOR2
  • Select = nPR * (L * (XOR2 + 2Gate) + ANDL) + N * OR(nPR)
– No logical number tag
  • The pattern already contains that information
Suitable predictor for register-value-pattern
Characteristics of the register-value pattern
– Needs a long pattern length for reliable prediction
  • PHT is not suitable
– Must save tags for comparing states
  • Perceptron is not suitable [17][18]
– Not linearly separable [17][18]
  • Each bit of a value has an AND relation with the others
    – Perceptron is not suitable
– Many different patterns per branch
  • If there is a loop in which r1 changes from 0 to 999
    – There are 999 not-taken patterns and 1 taken pattern
– Long delay for pattern generation
  • Perceptron is not suitable [17][18]
  • Must be combined with a fast predictor in a hybrid [19][20]
Proposed predictor
Modified YAGS [21]
– 1 bimodal table
  • Saves the bias of each branch
– 2 caches
  • Save only the patterns whose outcome differs from the bias
  • Taken Cache
    – Saves not-taken patterns for taken-biased branches
  • Not Taken Cache
    – Saves taken patterns for not-taken-biased branches
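The structure described above can be sketched in a few lines. This is an illustrative model, not the hardware design: the class name is hypothetical, unbounded Python sets stand in for fixed-size tagged caches, and the 2-bit bias counter, its initial value, and the bias update rule are all assumptions.

```python
class ModifiedYagsSketch:
    """Bias table plus two exception caches, keyed by register-value pattern."""

    def __init__(self):
        self.bias = {}                # pc -> 2-bit saturating counter (assumed)
        self.taken_cache = set()      # not-taken patterns of taken-biased branches
        self.not_taken_cache = set()  # taken patterns of not-taken-biased branches

    def _biased_taken(self, pc):
        # Assumption: unseen branches start weakly not taken (counter value 1).
        return self.bias.get(pc, 1) >= 2

    def _exception_cache(self, biased_taken):
        return self.taken_cache if biased_taken else self.not_taken_cache

    def predict(self, pc, pattern):
        biased_taken = self._biased_taken(pc)
        if (pc, pattern) in self._exception_cache(biased_taken):
            return not biased_taken   # stored pattern disagrees with the bias
        return biased_taken

    def update(self, pc, pattern, taken):
        biased_taken = self._biased_taken(pc)
        cache = self._exception_cache(biased_taken)
        key = (pc, pattern)
        hit = key in cache
        if taken != biased_taken:
            cache.add(key)            # remember the exception to the bias
        elif hit:
            cache.discard(key)        # pattern now agrees with the bias
        # Assumption: leave the bias alone when a cached exception was already right.
        if not (hit and taken != biased_taken):
            counter = self.bias.get(pc, 1)
            self.bias[pc] = min(3, counter + 1) if taken else max(0, counter - 1)


# A loop branch at a made-up address: not taken for patterns 0..4, taken for 5.
predictor = ModifiedYagsSketch()
for r1 in range(5):
    predictor.update(0x40, r1, taken=False)
predictor.update(0x40, 5, taken=True)
```

Only the single taken pattern is stored; the not-taken patterns are all covered by the bias, which is how the scheme saves pattern capacity.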
Block diagram
The fast predictor predicts the direction in an early cycle
When the Modified YAGS hits and the depth tag matches the current state
– The fetch direction is updated in a late cycle
When the Modified YAGS misses, its predicted direction is the bias, and we cannot tell whether the pattern is untrained or trained but not saved
– The selector selects either the biased direction or the fast predictor's direction
Outlines
Why do we need branch prediction?
Related works
Improved register-value-pattern generation
Experiment and evaluation
Contribution
Reference
Experimental environment
SimpleScalar 3.0
– PISA instruction set architecture
– Little endian
sim-outorder
– Performance-based
– Execution-driven
– Cycle timer
Benchmarks
– 10 programs from SPEC 2k
– Instruction coverage
  • 150M ~ 250M instructions
Processor Architecture Configuration
Memory Architecture Configuration
Predictor Configuration
Register-Value-Pattern predictor
The register-value-pattern predictor predicts the way a human would
– If we know “this branch was taken before when a=3 and b=4”
  • We predict the branch without calculation when a=3 and b=4 arrive again
– Commonsense design
  • Why is 100% accuracy not possible?
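The intuition above amounts to remembering outcomes by register values. A minimal sketch, assuming an unbounded table keyed by branch address and value tuple (the names `record`, `predict`, and the address 0x10 are hypothetical):

```python
outcomes = {}  # (pc, register values) -> last observed direction


def record(pc, values, taken):
    """Remember which way the branch at pc went for these register values."""
    outcomes[(pc, tuple(values))] = taken


def predict(pc, values, default=False):
    """Reuse the remembered direction; fall back to a default if unseen."""
    return outcomes.get((pc, tuple(values)), default)


record(0x10, (3, 4), taken=True)          # "taken before when a=3 and b=4"
seen_again = predict(0x10, (3, 4))        # same values arrive again
never_seen = predict(0x10, (3, 5))        # unseen values use the default
```

A real predictor cannot keep an unbounded, exactly keyed table like this one, which is where the factors of performance loss come from.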
Factors of performance loss
1. Limitation of dependence tracking
– 1.1 Load branch
– 1.2 Control dependency
2. Hash conflict in encoding
3. Prediction delay
4. Various patterns for the same direction
– 4.1 Pattern capacity of predictor
– 4.2 Lack of training
Contribution
We improve some factors of performance loss
– 1.2 Control dependency
– 2 Hash conflict in encoding
– 4.1 Pattern capacity of predictor
But some work still remains
Applications of Register-Value-Pattern
The register-value pattern reaches its limits on different kinds of branches than the branch history pattern does
– Higher performance in a hybrid predictor with the branch history pattern
Register-value pattern combined with branch-register-value-based prediction
– When the depth of the dependence chain is 0
  • This means the branch register is already updated
– Branch-register-value-based prediction works well in that case
Register-value pattern for value prediction
– We can use the register-value pattern as information for value prediction
Reference
[1] T. Yeh and Y. Patt. “Two-level Adaptive Branch Prediction.” In Proc. 24th ACM/IEEE Int. Symp. on Microarchitecture, 1991.
[2] T. Yeh and Y. Patt. “A Comparison of Dynamic Branch Predictors that use Two Levels of Branch History.” In Proc. 20th Ann. Int. Symp. on Computer Architecture, 1993.
[3] S. Pan, K. So and J. Rahmeh. “Improving the Accuracy of Dynamic Branch Prediction Using Branch Correlation.” In Proc. 5th Annual Intl. Conf. on Architectural Support for Prog. Lang. and Operating Systems, 1992.
[4] R. Nair. “Dynamic Path-Based Branch Correlation.” In Proc. 28th Ann. Int. Symp. on Microarchitecture, 1995.
[5] D. Jiménez. “Fast Path-Based Neural Branch Prediction.” In Proc. 36th Ann. IEEE/ACM Int. Symp. on Microarchitecture, 2003.
[6] F. Gabbay and A. Mendelson. “Speculative Execution Based on Value Prediction.” Technical Report, Technion, 1997.
[7] J. Gonzalez and A. Gonzalez. “Control-Flow Speculation through Value Prediction for Superscalar Processors.” In Proc. Int. Conf. on Parallel Architectures and Compilation Techniques, 1999.
[8] T. Heil, Z. Smith and J. E. Smith. “Improving Branch Predictors by Correlating on Data Values.” In Proc. 32nd Int. Symp. on Microarchitecture, 1999.
[9] K. Wang. “Highly Accurate Data Value Prediction using Hybrid Predictors.” In Proc. 30th Int. Symp. on Microarchitecture, 1997.
[10] M. Lipasti and J. Shen. “Exceeding the Dataflow Limit via Value Prediction.” In Proc. 29th Int. Symp. on Microarchitecture, 1996.
[11] W. Mohan and M. Franklin. “Improving Data Value Prediction Accuracy using Path Correlation.” In Proc. 6th Int. Conf. on High Performance Computing, 1999.
[12] Y. Sazeides and J. Smith. “Implementations of Context Based Value Predictors.” Technical Report #ECE-TR-97-8, University of Wisconsin-Madison, 1997.
[13] L. N. Vintan, M. Sbera, I. Z. Mihu and A. Florea. “An alternative to branch prediction: pre-computed branches.” In ACM SIGARCH Computer Architecture News, Vol. 31, 2003.
[14] L. He and Z. Liu. “A New Value Based Branch Predictor For SMT Processors.” In Proc. 16th IASTED Int. Conf. on Parallel and Distributed Computing and Systems, 2004.
[15] Y. Pan, X. Fan, L. He and D. Wang. “A bypass Mechanism to Enhance Branch Predictor for SMT.” In Proc. 12th Asia-Pacific Conf. on Computer Systems Architecture (ACSAC 2007), vol. 4697, 2007.
[16] L. Chen, S. Dropsho and D. H. Albonesi. “Dynamic Data Dependence Tracking and its Application to Branch Prediction.” In Proc. 9th Int. Symp. on High-Performance Computer Architecture, 2003.
[17] D. A. Jiménez and C. Lin. “Dynamic Branch Prediction with Perceptrons.” In Proc. 7th Int. Symp. on High-Performance Computer Architecture, 2001.
[18] D. A. Jiménez and C. Lin. “Neural Methods for Dynamic Branch Prediction.” In ACM Transactions on Computer Systems, 2002.
[19] P. Chang, E. Hao and Y. Patt. “Alternative Implementations of Hybrid Branch Predictors.” In Proc. 28th Ann. Int. Symp. on Microarchitecture, 1995.
[20] M. Evers, P. Chang and Y. Patt. “Using Hybrid Branch Predictors to Improve Branch Prediction Accuracy in the Presence of Context Switches.” In Proc. 23rd Ann. Int. Symp. on Computer Architecture, 1996.
[21] A. Eden and T. Mudge. “The YAGS branch prediction scheme.” In Proc. 31st Ann. ACM/IEEE Int. Symp. on Microarchitecture, 1998.
[22] P. N. Glaskowsky. “Pentium 4 (partially) previewed.” In Microprocessor Report, 2000.