
Advanced Microarchitecture





CS8803: Advanced Microarchitecture

Advanced Microarchitecture
Lecture 4: Branch Predictors

Direction vs. Target

  • Direction: 0 or 1
  • Target: 32- or 64-bit value
  • Turns out targets are generally easier to predict
      • Don't need to predict NT target
      • T target doesn't usually change, or has a "nice" pattern like subroutine returns

Lecture 4: Correlated Branch Predictors

Note: If you're predicting one instruction at a time, then the NT target may actually not be as easy for x86, since instructions have variable lengths, so NT-target prediction is equivalent to instruction-length prediction. However, most fetch architectures fetch an entire cache line at a time, so predicting the next sequential cache-line address is easy.

Branches Have Locality

  • If a branch was previously taken, there's a good chance it'll be taken again in the future

    for (i = 0; i < 100000; i++)
    {
        /* do stuff */
    }

This branch will be taken 99,999 times in a row.

Simple Predictor

  • Always predict NT
      • no fetch bubbles (always just fetch the next line)
      • does horribly on the previous for-loop example
  • Always predict T
      • does pretty well on the previous example
      • but what if you have other control besides loops?

    p = calloc(num, sizeof(*p));
    if (p == NULL)
        error_handler();

This branch is practically never taken.

Note: This assumes the check is implemented in assembly as a conditional jump with error_handler as the taken target.

Last Outcome Predictor

  • Do what you did last time
  • I.e., a 1-bit counter

    0xDC08: for (i = 0; i < 100000; i++)
            {
    0xDC44:     if ((i % 100) == 0)
                    tick();

    0xDC50:     if ((i & 1) == 1)
                    odd();
            }

[Figure: two-state FSM with states T and N.]

Misprediction Rates?

How often is branch outcome != previous outcome?

  Branch  Outcome pattern (100,000 iterations)    Mispredictions  Prediction rate
  DC08    TTTTTTTTTTT ... TTTTTTTTTTNTTTTTTTTT    2 / 100,000     99.998%
  DC44    TTTTT ... TNTTTTT ... TNTTTTT ...       2 / 100         98.0%
  DC50    TNTNTNTNTNTNTNTNTNTNTNTNTNTNT           2 / 2           0.0%

Note: DC08, DC44 and DC50 refer to the hexadecimal PCs used on the previous slide.

Saturating Two-Bit Counter

[Figure: FSM for last-outcome prediction (states 0 and 1) next to the FSM for a 2bC (2-bit counter, states 0 to 3). Legend: predict NT, predict T, transition on T outcome, transition on NT outcome.]

Example

[Animated figure: 1bC and 2bC state sequences on the same outcome stream, after initial training/warm-up.]

  • Only 1 mispredict per N branches now!
      • DC08: 99.999%
      • DC44: 99.0%

Importance of Branches

  • 98% → 99%: Whoop-Dee-Do!
      • Actually, it's a 2% misprediction rate → 1%
      • That's a halving of the number of mispredictions
  • So what?
      • If the misprediction rate equals 50%, and 1 in 5 instructions is a branch, then the number of useful instructions that we can fetch is:
        $5 \cdot \left(1 + \tfrac{1}{2} + (\tfrac{1}{2})^2 + (\tfrac{1}{2})^3 + \dots\right) = 10$
      • If we halve the miss rate down to 25%:
        $5 \cdot \left(1 + \tfrac{3}{4} + (\tfrac{3}{4})^2 + (\tfrac{3}{4})^3 + \dots\right) = 20$
      • Each sum is a geometric series in the probability $p$ of a correct prediction, with value $1/(1-p)$
  • Halving the miss rate doubles the number of useful instructions that we can try to extract ILP from

Typical Organization of 2bC Predictor

[Figure: the PC (32 or 64 bits) is hashed down to log2 n bits to index a table of n entries/counters, which supplies the prediction. The actual outcome is fed back to the predictor, where the FSM update logic performs the table update.]

Typical Hash

  • Just take the $\log_2 n$ least significant bits of the PC
  • May need to ignore a few bits
      • In a 32-bit RISC ISA, all instructions are 4 bytes wide and all instruction addresses are 4-byte aligned, so the two least significant bits of the PC are always zero and are not included
          • equivalent to right-shifting the PC by two positions before hashing
      • In a variable-length CISC ISA (e.g., x86), instructions may start on arbitrary byte boundaries
          • probably don't want to shift

How about the Branch at 0xDC50?

  • 1bC and 2bC don't do too well (50% at best)
  • But it's still obviously predictable
  • Why?
      • It has a repeating pattern: (NT)*
      • How about other patterns? (TTNTN)*
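The misprediction behavior of the three loop branches (0xDC08, 0xDC44, 0xDC50) can be checked with a small simulation. The Python sketch below is an illustration and not part of the original slides. The function names, the cold-start states (1bC starts at "not taken", 2bC starts at state 0), and the convention that "taken" means the branch skips the call to tick() or odd() are all assumptions.

```python
def loop_branch_outcomes(n_iters=100_000):
    """Outcome streams (True = taken) for the three branches in the loop.

    Assumption: DC44 and DC50 are "taken" when they skip their call.
    """
    dc08 = [True] * (n_iters - 1) + [False]        # loop branch: T ... T N
    dc44 = [i % 100 != 0 for i in range(n_iters)]  # NT once every 100 iterations
    dc50 = [i % 2 == 0 for i in range(n_iters)]    # T N T N ...
    return {"DC08": dc08, "DC44": dc44, "DC50": dc50}


def mispredicts_1bc(outcomes, last=False):
    """Last-outcome (1-bit) predictor: predict whatever happened last time."""
    misses = 0
    for taken in outcomes:
        if last != taken:
            misses += 1
        last = taken
    return misses


def mispredicts_2bc(outcomes, counter=0):
    """Saturating 2-bit counter: states 0-1 predict NT, states 2-3 predict T."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        # Move toward 3 on a taken outcome, toward 0 otherwise, saturating.
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses


if __name__ == "__main__":
    for name, outcomes in loop_branch_outcomes().items():
        print(name, mispredicts_1bc(outcomes), mispredicts_2bc(outcomes))
```

Because the predictors start cold, the counts include a few warm-up mispredictions and so differ slightly from the steady-state rates quoted on the slides. Under these assumptions the alternating branch at DC50 is mispredicted on every iteration by the last-outcome predictor and on half of the iterations by the 2bC.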

Use branch correlation: the outcome of a branch is often related to previous outcome(s).

Idea: Track the History of a Branch

[Animated figure: each PC-indexed entry holds the previous outcome of the branch and two 2-bit counters, one used if prev = 0 and one used if prev = 1.]

Note: In the animated example, the left circle corresponds to the 2bC used when the previous outcome was 0, and the right corresponds to 1. The not-used counter is shaded darker.

Deeper History Covers More Patterns

  • What pattern has this branch predictor entry learned?

[Figure: a PC-indexed entry holding the last 3 outcomes and one counter per history value: counter if prev = 000, counter if prev = 001, counter if prev = 010, ..., counter if prev = 111.]

  • 001 → 1; 011 → 0; 110 → 0; 100 → 1
  • 00110011001... = (0011)*

Predictor Organizations

[Figure: three organizations, each indexed through a PC hash.]

  • Different pattern for each branch PC
  • Shared set of patterns
  • Mix of both

Note: Each trades off aliasing in different places. The first suffers from different static branches mapping into the same local history and counters. The second allows different static branches that exhibit the same local history to map into the same counters. The figures do not imply that the total number of branch history registers is necessarily the same in all three.

Example (1)

  • 1024 counters ($2^{10}$)
  • 32 sets
      • 5-bit PC hash chooses a set
  • Each set has 32 counters
      • 32 x 32 = 1024
      • History length of 5 ($\log_2 32 = 5$)
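The organization in Example (1) can be sketched in Python as follows. This is a hedged illustration rather than the lecture's implementation: the class and function names are hypothetical, the PC hash is assumed to be the low bits of the PC with no shift, and one history register per set is assumed.

```python
class HistoryPredictor:
    """PC-hashed sets of 2-bit counters, indexed within a set by local history.

    Assumptions: the hash is the low `set_bits` bits of the PC, and each set
    keeps a single `history_bits`-wide history register.
    """

    def __init__(self, set_bits=5, history_bits=5):
        self.set_bits = set_bits
        self.history_bits = history_bits
        n_sets = 1 << set_bits
        self.history = [0] * n_sets
        self.counters = [[0] * (1 << history_bits) for _ in range(n_sets)]

    def set_index(self, pc):
        # Keep only the low set_bits bits of the PC.
        return pc & ((1 << self.set_bits) - 1)

    def predict(self, pc):
        s = self.set_index(pc)
        return self.counters[s][self.history[s]] >= 2

    def update(self, pc, taken):
        s = self.set_index(pc)
        h = self.history[s]
        c = self.counters[s][h]
        self.counters[s][h] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the newest outcome into the least significant history bit.
        mask = (1 << self.history_bits) - 1
        self.history[s] = ((h << 1) | int(taken)) & mask


def count_mispredicts(predictor, pc, pattern, repeats):
    """Feed a repeating T/N pattern to one branch and count mispredictions."""
    misses = 0
    for _ in range(repeats):
        for ch in pattern:
            taken = ch == "T"
            misses += predictor.predict(pc) != taken
            predictor.update(pc, taken)
    return misses
```

After a short warm-up, a predictor built this way no longer mispredicts the (NT)* pattern that defeats the 1bC and 2bC. It also shows the cost of the small hash: with a 5-bit hash, the hypothetical PCs 0xDC50 and 0xAB10 select the same set and therefore share history and counters.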

  • Branch collisions
      • 1000s of branches collapsed into only 32 sets

[Figure: a 5-bit PC hash selects the set and 5 bits of history select the counter within it.]

Example (2)

  • 1024 counters ($2^{10}$)
  • 128 sets
      • 7-bit PC hash chooses a set
  • Each set has 8 counters
      • 128 x 8 = 1024
      • History length of 3 ($\log_2 8 = 3$)
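The arithmetic behind Examples (1) and (2) is the same split of a fixed counter budget between PC-hash bits and history bits. A minimal Python sketch, using a hypothetical helper name that does not come from the slides:

```python
def split_counters(total_counters, pc_hash_bits):
    """Return (sets, counters_per_set, history_length) for a fixed budget.

    Assumes both the number of sets and the counters per set are powers of two.
    """
    sets = 1 << pc_hash_bits
    if total_counters % sets != 0:
        raise ValueError("counter budget is not a multiple of the set count")
    per_set = total_counters // sets
    history_length = per_set.bit_length() - 1  # log2 for powers of two
    if (1 << history_length) != per_set:
        raise ValueError("counters per set is not a power of two")
    return sets, per_set, history_length


if __name__ == "__main__":
    for bits in (5, 7):
        print(bits, split_counters(1024, bits))
```

With 1024 counters, a 5-bit hash leaves 32 counters per set (history length 5) and a 7-bit hash leaves 8 counters per set (history length 3), matching the two examples.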

  • Limited patterns/correlation
      • Can now only handle a history length of three

[Figure: a 7-bit PC hash selects the set and 3 bits of history select the counter within it.]

Two-Level Predictor Organization

[Figure: a PC hash supplies a bits to index the BHT and b bits to select a PHT set. The h-bit history read from the BHT selects the counter within the set.]

  • Branch History Table (BHT)
      • $2^a$ entries
      • h-bit history per entry
  • Pattern History Table (PHT)
      • $2^b$ sets
      • $2^h$ counters per set
      • Each entry is a 2-bit counter
  • Total size in bits: $h \cdot 2^a + 2^{(b+h)} \cdot 2$

Classes of Two-Level Predictors

  • h = 0 or a = 0 (degenerate case)
      • Regular table of 2bCs ($b = \log_2$ counters)
  • h > 0, a > 1
      • Local-history 2-level predictor
  • h > 0, a = 1
      • Global-history 2-level predictor

Global vs. Local Branch History

  • Local behavior
      • What is the predicted direction of Branch A given the outcomes of previous instances of Branch A?
  • Global behavior
      • What is the predicted direction of Branch Z given the outcomes of all* previous branches A, B, ..., X and Y?
      • *The number of previous branches tracked is limited by the history length

Why Global Correlations Exist

  • Example: related branch conditions

        p = findNode(foo);
    A:  if ( p is parent )
            do something;

        do other stuff;    /* may contain more branches */

    B:  if ( p is a child )
            do something else;

The outcome of the second branch (B) is always the opposite of the first branch (A).

Other Global Correlations

  • Testing same/similar conditions
      • code might test for NULL before a function call, and the function might test for NULL again
      • in some cases it may be faster to recompute a condition rather than save a previous computation in memory and re-load it
  • Partial correlations: one branch could test for cond1, and another branch could test for cond1 && cond2 (if cond1 is false, then the second branch can be predicted as false)
  • Multiple correlations: one branch tests cond1, a second tests cond2, and a third tests a condition combining cond1 and cond2 (which can always be predicted if the outcomes of the first two branches are known)

A Global-History Predictor

[Figure: a single global branch history register (BHR) supplies h bits and a PC hash supplies b bits. Together they form a (b+h)-bit index into the counters.]

Similar Tradeoff Between B and H

  • For a fixed number of counters
      • Larger h → smaller b
          • Larger h → longer history
              • able to capture more patterns
              • longer warm-up/training time
          • Smaller b → more branches map to the same set of counters
              • more interference
      • Larger b → smaller h
          • just the opposite

Motivation for Combined Indexing

  • Not all $2^h$ states are used
      • (TTNN)* only uses half of the states for a history length of 3, and only 1/4 of the states for a history length of 4
      • (TN)* only uses two states no matter how long the history length is
  • Not all bits of the PC are uniformly distributed
  • Not all bits of the history are uniformly likely to be correlated
      • more recent history is more likely to be strongly correlated

Combined Index Example: gshare

  • S. McFarling (DEC-WRL TR, 1993)

[Figure: k bits of PC hash are XORed with k bits of global history to index the counters, where $k = \log_2$ counters.]

Gshare example

  Branch Address   Global History   Gselect 4/4   Gshare 8/8
  00000000         00000001         00000001      00000001
  00000000         00000000         00000000      00000000
  11111111         00000000         11110000      11111111
  11111111         10000000         11110000      01111111

Insufficient history leads to a conflict (the last two Gselect rows).

Note: For the history, the left-most bit is the oldest and the right-most is the most recent. The Gselect example takes the 4 least significant bits from the branch address and the four most recent outcomes from the branch history and concatenates them. Gshare takes the XOR of all eight bits from both sources. The example is meant to show that for these four address-history pairs, Gselect creates aliasing while Gshare continues to generate four distinct indexes. This example is taken from the original gshare paper (DEC-WRL TN36).

Some Interference May Be Tolerable

  • Branch A: always not-taken
  • Branch B: always taken
  • Branch C: TNTNTN
  • Branch D: TTNNTTNN

[Figure: counter states for each history pattern with branches A to D sharing the same counters.]

And Then It Might Not

  • Branch X: TTTNTTTN
  • Branch Y: TNTNTN
  • Branch Z: TTTT

[Figure: counter states for each history pattern with branches X to Z sharing the same counters. Some entries are marked "?".]

Interference Reducing Predictors

  • There are patterns and asymmetries in branches
  • Not all patterns occur with the same frequency
  • Branches have biases
  • This lecture:
      • Bi-Mode (Lee et al., MICRO 97)
      • gskew
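Returning to the gshare example above, the two index functions can be written out in a few lines of Python. This is a sketch based on the description in the slide note (Gselect concatenates 4 address bits with 4 history bits, Gshare XORs all 8 bits of each). The function names and the PAIRS constant are hypothetical.

```python
def gselect_index(pc, history, pc_bits=4, hist_bits=4):
    """Concatenate the low PC bits (upper half) with the most recent history bits."""
    pc_part = pc & ((1 << pc_bits) - 1)
    hist_part = history & ((1 << hist_bits) - 1)
    return (pc_part << hist_bits) | hist_part


def gshare_index(pc, history, bits=8):
    """XOR the low `bits` bits of the PC with the same number of history bits."""
    return (pc ^ history) & ((1 << bits) - 1)


# The four (branch address, global history) pairs from the gshare example.
PAIRS = [
    (0b00000000, 0b00000001),
    (0b00000000, 0b00000000),
    (0b11111111, 0b00000000),
    (0b11111111, 0b10000000),
]

if __name__ == "__main__":
    for pc, hist in PAIRS:
        print(f"{pc:08b} {hist:08b} "
              f"{gselect_index(pc, hist):08b} {gshare_index(pc, hist):08b}")
```

For these four pairs, the Gselect indexes for the last two rows collide because the distinguishing history bit is older than the four bits it keeps, while the Gshare indexes remain distinct.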
