CS152ComputerArchitectureandEngineeringCS252GraduateComputerArchitecture
Lecture12–BranchPredic<onandAdvancedOut-of-OrderSuperscalars
KrsteAsanovicElectricalEngineeringandComputerSciences
UniversityofCaliforniaatBerkeley
http://www.eecs.berkeley.edu/~krstehttp://inst.eecs.berkeley.edu/~cs152
LastTimeinLecture11
§ Phasesofinstruc>onexecu>on:– Fetch/decode/rename/dispatch/issue/execute/complete/commit
§ Data-in-ROBdesignversusunifiedphysicalregisterdesign§ Superscalarregisterrenaming
2
Control-FlowPenalty
3
I-cache
Fetch Buffer
Issue Buffer
Func. Units
Arch. State
Execute
Decode
Result Buffer Commit
PC
Fetch
Branch executed
Next fetch started
Modernprocessorsmayhave>10pipelinestagesbetweennextPCcalcula=onandbranchresolu=on!
Howmuchworkislostifpipelinedoesn’tfollowcorrectinstruc=onflow?
~Looplengthxpipelinewidth+buffers
ReducingControl-FlowPenalty
§ SoPwaresolu>ons– Eliminatebranches-loopunrolling
• Increasestherunlength– Reduceresolu>on>me-instruc>onscheduling
• Computethebranchcondi>onasearlyaspossible(oflimitedvaluebecausebranchesoPenincri>calpaththroughcode)
§ Hardwaresolu>ons– Findsomethingelsetodo(delayslots)
• Replacespipelinebubbleswithusefulwork(requiressoPwarecoopera>on)–quicklyseediminishingreturns
– Speculate,i.e.,branchpredic>on• Specula>veexecu>onofinstruc>onsbeyondthebranch• Manyadvancesinaccuracy,widelyused
4
BranchPredic<on
5
Mo=va=on:Branchpenal>eslimitperformanceofdeeplypipelinedprocessorsModernbranchpredictorshavehighaccuracy(>95%)andcanreducebranchpenal>essignificantly
Requiredhardwaresupport:
Predic=onstructures:• Branchhistorytables,branchtargetbuffers,etc.
Mispredictrecoverymechanisms:
• Keepresultcomputa=onseparatefromcommit • Killinstruc>onsfollowingbranchinpipeline• Restorestatetothatfollowingbranch
ImportanceofBranchPredic<on
§ Consider4-waysuperscalarwith8pipelinestagesfromfetchtodispatch,and80-entryROB,and3cyclesfromissuetobranchresolu>on
§ Onamispredict,couldthrowaway8*4+(80-1)=111instruc>ons
§ Improvingfrom90%to95%predic>onaccuracy,removes50%ofbranchmispredicts– If1/6instruc>onsarebranches,thenmovefrom60instruc>onsbetweenmispredicts,to120instruc>onsbetweenmispredicts
6
Sta<cBranchPredic<on
7
Overallprobabilityabranchistakenis~60-70%but:
ISAcanahachpreferreddirec>onseman>cstobranches,e.g.,MotorolaMC88110
bne0(preferredtaken) beq0(nottaken)ISAcanallowarbitrarychoiceofsta>callypredicteddirec>on,e.g.,HPPA-RISC,IntelIA-64typicallyreportedas~80%accurate
backward90%
forward50%
DynamicBranchPredic<onlearningbasedonpastbehavior
§ Temporalcorrela>on– Thewayabranchresolvesmaybeagoodpredictorofthewayitwillresolveatthenextexecu>on
§ Spa>alcorrela>on– Severalbranchesmayresolveinahighlycorrelatedmanner(apreferredpathofexecu>on)
8
One-BitBranchHistoryPredictor
§ Foreachbranch,rememberlastwaybranchwent§ Hasproblemwithloop-closingbackwardbranches,astwomispredictsoccuroneveryloopexecu>on1. firstitera>onpredictsloopbackwardsbranchnot-taken(loop
wasexitedlast>me)2. lastitera>onpredictsloopbackwardsbranchtaken(loop
con>nuedlast>me)
9
BranchPredic<onBits
10
• Assume2BPbitsperinstruc>on• Changethepredic>onaPertwoconsecu>vemistakes!
¬takewrong
taken¬taken
taken
taken
taken¬takeright
takeright
takewrong
¬taken
¬taken¬taken
BPstate: (predicttake/¬take)x(lastpredic=onright/wrong)
BranchHistoryTable(BHT)
11
4K-entryBHT,2bits/entry,~80-90%correctpredic>ons
00FetchPC
Branch? TargetPC
+
I-Cache
Opcode offsetInstruc=on
kBHTIndex
2k-entryBHT,2bits/entry
Taken/¬Taken?
Exploi<ngSpa<alCorrela<onYehandPa),1992
12
Historyregister,H,recordsthedirec>onofthelastNbranchesexecutedbytheprocessor
if (x[i] < 7) theny += 1;
if (x[i] < 5) thenc -= 4;
Iffirstcondi>onfalse,secondcondi>onalsofalse
Two-LevelBranchPredictor
13
Pen=umProusestheresultfromthelasttwobranchestoselectoneofthefoursetsofBHTbits(~95%correct)
0 0
kFetchPC
ShiPinTaken/¬Takenresultsofeachbranch
2-bitglobalbranchhistoryshiPregister
Taken/¬Taken?
Specula<ngBothDirec<ons?
§ Analterna>vetobranchpredic>onistoexecutebothdirec>onsofabranchspecula>vely– resourcerequirementispropor>onaltothenumberofconcurrentspecula>veexecu>ons
– onlyhalftheresourcesengageinusefulworkwhenbothdirec>onsofabranchareexecutedspecula>vely
– branchpredic>ontakeslessresourcesthanspecula>veexecu>onofbothpaths
§ Withaccuratebranchpredic>on,itismorecosteffec>vetodedicateallresourcestothepredicteddirec>on!
14
Limita<onsofBHTs
15
Onlypredictsbranchdirec>on.Therefore,cannotredirectfetchstreamun>laPerbranchtargetisdetermined.
UltraSPARC-IIIfetchpipeline
Correctlypredictedtakenbranch
penalty
JumpRegisterpenalty
A PCGenera>on/MuxP Instruc>onFetchStage1F Instruc>onFetchStage2B BranchAddressCalc/BeginDecodeI CompleteDecodeJ SteerInstruc>onstoFunc>onalunitsR RegisterFileReadE IntegerExecute
Remainderofexecutepipeline(+another6stages)
BranchTargetBuffer(BTB)
16
• KeepboththebranchPCandtargetPCintheBTB• PC+4isfetchedifmatchfails• OnlytakenbranchesandjumpsheldinBTB• NextPCdeterminedbeforebranchfetchedanddecoded
2k-entry direct-mapped BTB (can also be associative) I-Cache PC
k
Valid
valid
EntryPC
=
match
predicted
target
targetPC
CombiningBTBandBHT§ BTBentriesareconsiderablymoreexpensivethanBHT,butcanredirectfetchesatearlierstageinpipelineandcanaccelerateindirectbranches(JR)
§ BHTcanholdmanymoreentriesandismoreaccurate
17
A PCGenera>on/MuxP Instruc>onFetchStage1F Instruc>onFetchStage2B BranchAddressCalc/BeginDecodeI CompleteDecodeJ SteerInstruc>onstoFunc>onalunitsR RegisterFileReadE IntegerExecute
BTB
BHTBHTinlaterpipelinestagecorrectswhenBTBmissesapredictedtakenbranch
BTB/BHTonlyupdateda[erbranchresolvesinEstage
UsesofJumpRegister(JR)
§ Switchstatements(jumptoaddressofmatchingcase)
§ Dynamicfunc>oncall(jumptorun->mefunc>onaddress)
§ Subrou>nereturns(jumptoreturnaddress)
18HowwelldoesBTBworkforeachofthesecases?
BTBworkswellifsamecaseusedrepeatedly
BTBworkswellifsamefunc>onusuallycalled,(e.g.,inC++programming,whenobjectshavesametypeinvirtualfunc>oncall)
BTBworkswellifusuallyreturntothesameplace⇒O[enonefunc=oncalledfrommanydis=nctcallsites!
Subrou<neReturnStack
19
SmallstructuretoaccelerateJRforsubrou>nereturns,typicallymuchmoreaccuratethanBTBs.
&fb()
&fc()
Pushcalladdresswhenfunc=oncallexecuted
Popreturnaddresswhensubrou=nereturndecoded
fa() { fb(); } fb() { fc(); } fc() { fd(); }
&fd() kentries(typicallyk=8-16)
CS252
ReturnStackinPipeline
§ Howtousereturnstack(RS)indeepfetchpipeline?§ Onlyknowifsubrou>necall/returnatdecode
20
A PCGenera>on/MuxP Instruc>onFetchStage1F Instruc>onFetchStage2B BranchAddressCalc/BeginDecodeI CompleteDecodeJ SteerInstruc>onstoFunc>onalunitsR RegisterFileReadE IntegerExecute
RSRSPush/Popa[erdecodegiveslargebubbleinfetchstream.
ReturnStackpredic=onchecked
CS252
ReturnStackinPipeline
§ CanrememberwhetherPCissubrou>necall/returnusingBTB-likestructure
§ Insteadoftarget-PC,juststorepush/popbit
21
A PCGenera>on/MuxP Instruc>onFetchStage1F Instruc>onFetchStage2B BranchAddressCalc/BeginDecodeI CompleteDecodeJ SteerInstruc>onstoFunc>onalunitsR RegisterFileReadE IntegerExecute
RS
Push/Popbeforeinstruc=onsdecoded!
ReturnStackpredic=onchecked
In-Ordervs.Out-of-OrderBranchPredic<on
22
§ Specula>vefetchbutnotspecula>veexecu>on-branchresolvesbeforelaterinstruc>onscomplete
§ Completedvaluesheldinbypassnetworkun>lcommit
§ Specula>veexecu>on,withbranchesresolvedaPerlaterinstruc>onscomplete
§ CompletedvaluesheldinrenameregistersinROBorunifiedphysicalregisterfileun>lcommit
Fetch
Decode
Execute
Commit
In-OrderIssue Out-of-OrderIssue
Fetch
Decode
Execute
Commit
ROB
Br.Pred.
Resolve
Br.Pred.
Resolve
• Bothstylesofmachinecanusesamebranchpredictorsinfront-endfetchpipeline,andbothcanexecutemul>pleinstruc>onspercycle
• Commontohave10-30pipelinestagesineitherstyleofdesign
In-Order
In-Order
In-Order
Out-of-Order
InOvs.OoOMispredictRecovery
§ In-orderexecu>on?– Designsonoinstruc>onissuedaPerbranchcanwrite-backbeforebranchresolves
– Killallinstruc>onsinpipelinebehindmispredictedbranch
§ Out-of-orderexecu>on?– Mul>pleinstruc>onsfollowingbranchinprogramordercancompletebeforebranchresolves
– Asimplesolu>onwouldbetohandlelikeprecisetraps• Problem?
23
BranchMispredic<oninPipeline
24
§ Canhavemul>pleunresolvedbranchesinROB§ Canresolvebranchesout-of-orderbykillingalltheinstruc>onsinROBthatfollowamispredictedbranch
§ MIPSR10Kusesfourmaskbitstotaginstruc>onsthataredependentonuptofourspecula>vebranches
§ Maskbitsclearedasbranchresolves,andreusedfornextbranch
Fetch Decode
Execute
Commit Reorder Buffer
Kill
Kill Kill
PC
Inject correct PC
Branch Prediction
Branch Resolution
Complete
RenameTableRecovery
§ Havetoquicklyrecoverrenametableonbranchmispredicts
§ MIPSR10Konlyhasfoursnapshotsforeachoffouroutstandingspecula>vebranches
§ Alpha21264has80snapshots,oneperROBinstruc>on
25
CS152Administrivia
§ Lab3outonFriday,dueMondayApril6§ PS3dueMondayMarch16
26
CS252
CS252Administrivia
§ ReadingsnextweekonOoOsuperscalarmicroprocessors§ Discussionmee>nginSDH240,Monday3:30-4:30
– Newregularmee>ng>me
27
ImprovingInstruc<onFetch
§ Performanceofspecula>veout-of-ordermachinesoPenlimitedbyinstruc>onfetchbandwidth– specula>veexecu>oncanfetch2-3xmoreinstruc>onsthanarecommihed
– mispredictpenal>esdominatedby>metorefillinstruc>onwindow– takenbranchesarepar>cularlytroublesome
28
CS252
IncreasingTakenBranchBandwidth(Alpha21264I-Cache)
§ Fold2-waytagsandBTBintopredictednextblock§ Taketagchecks,inst.decode,branchpredictoutofloop§ RawRAMspeedoncri>calloop(1cycleat~1GHz)§ 2-bithysteresiscounterperblockpreventsovertraining
CachedInstruc>ons
LinePredict
WayPredict
TagWay0
TagWay1
=? =?
fastfetchpath
PCGenera>on
PC
BranchPredic>onInstruc>onDecodeValidityChecks
4insts
Hit/Miss/Way
29
CS252
TournamentBranchPredictor(Alpha21264)
§ Choicepredictorlearnswhetherbesttouselocalorglobalbranchhistoryinpredic>ngnextbranch
§ Globalhistoryisspecula>velyupdatedbutrestoredonmispredict
§ Claim90-100%successonrangeofapplica>ons
Local history table (1,024x10b)
PC
Local prediction (1,024x3b)
Global Prediction (4,096x2b)
Choice Prediction (4,096x2b)
Global History (12b) Prediction
30
TakenBranchLimit
§ Integercodeshaveatakenbranchevery6-9instruc>ons§ Toavoidfetchbohleneck,mustexecutemul>pletakenbranchespercyclewhenincreasingperformance
§ Thisimplies:– predic>ngmul>plebranchespercycle– fetchingmul>plenon-con>guousblockspercycle
31
CS252
BranchAddressCache(Yeh,Marr,Pa_)
PCk
EntryPC
=
match
Valid
valid
predicted
target#1
target#1len
len#1
predicted
target#2
target#2
ExtendBTBtoreturnmul>plebranchpredic>onspercycle
32
CS252
FetchingMul<pleBasicBlocks
§ Requireseither– mul>portedcache:expensive– interleaving:bankconflictswilloccur
§ Mergingmul>pleblockstofeedtodecodersaddslatency,increasingmispredictpenaltyandreducingbranchthroughput
33
CS252
TraceCache
§ KeyIdea:Packmul>plenon-con>guousbasicblocksintoonecon>guoustracecacheline
BR BR BR
• Singlefetchbringsinmul>plebasicblocks
• Tracecacheindexedbystartaddressandnextnbranchpredic>ons
• UsedinIntelPen>um-4processortoholddecodeduops
BR BR BR
34
Load-StoreQueueDesign
§ APercontrolhazards,datahazardsthroughmemoryareprobablynextmostimportantbohlenecktosuperscalarperformance
§ Modernsuperscalarsuseverysophis>catedload-storereorderingtechniquestoreduceeffec>vememorylatencybyallowingloadstobespecula>velyissued
35
Specula<veStoreBuffer
§ Justlikeregisterupdates,storesshouldnotmodifythememoryun>laPertheinstruc>oniscommihed.Aspecula>vestorebufferisastructureintroducedtoholdspecula>vestoredata.
§ Duringdecode,storebufferslotallocatedinprogramorder
§ Storessplitinto“storeaddress”and“storedata”micro-opera>ons
§ “Storeaddress”execu>onwritestag§ “Storedata”execu>onwritesdata§ Storecommitswhenoldestinstruc>onandbothaddressanddataavailable:– clearspecula>vebitandeventuallymovedatatocache
§ Onstoreabort:– clearvalidbit
36
DataTags
StoreCommitPath
Specula=veStoreBuffer
L1DataCache
Tag DataSVTag DataSVTag DataSVTag DataSVTag DataSVTag DataSV
StoreAddress
StoreData
Loadbypassfromspecula<vestorebuffer
§ Ifdatainbothstorebufferandcache,whichshouldweuse?
Specula>vestorebuffer
§ Ifsameaddressinstorebuffertwice,whichshouldweuse?
Youngeststoreolderthanload37
Data
LoadAddress
Tags
Specula=veStoreBuffer L1DataCache
LoadData
Tag DataSVTag DataSVTag DataSVTag DataSVTag DataSVTag DataSV
MemoryDependencies
sd x1, (x2) ld x3, (x4)
§ Whencanweexecutetheload?
38
In-OrderMemoryQueue
§ Executeallloadsandstoresinprogramorder
=>LoadandstorecannotleaveROBforexecu>onun>lallpreviousloadsandstoreshavecompletedexecu>on
§ Cans>llexecuteloadsandstoresspecula>vely,andout-of-orderwithrespecttootherinstruc>ons
§ Needastructuretohandlememoryordering…
39
Conserva<veO-o-OLoadExecu<on
sd x1, (x2) ld x3, (x4)
§ Canexecuteloadbeforestore,ifaddressesknownandx4!=x2
§ Eachloadaddresscomparedwithaddressesofallpreviousuncommihedstores– canusepar>alconserva>vechecki.e.,bohom12bitsofaddress,tosavehardware
§ Don’texecuteloadifanypreviousstoreaddressnotknown
§ (MIPSR10K,16-entryaddressqueue)
40
AddressSpecula<on
sd x1, (x2) ld x3, (x4)
§ Guessthatx4!=x2 § Executeloadbeforestoreaddressknown§ Needtoholdallcompletedbutuncommihedload/storeaddressesinprogramorder
§ Ifsubsequentlyfindx4==x2,squashloadandallfollowinginstruc>ons
§ =>Largepenaltyforinaccurateaddressspecula>on
41
CS252
MemoryDependencePredic<on(Alpha21264)
sd x1, (x2)
ld x3, (x4)
§ Guessthatx4!=x2andexecuteloadbeforestore
§ Iflaterfindx4==x2,squashloadandallfollowinginstruc>ons,butmarkloadinstruc>onasstore-wait
§ Subsequentexecu>onsofthesameloadinstruc>onwillwaitforallpreviousstorestocomplete
§ Periodicallyclearstore-waitbits
42
Acknowledgements
§ ThiscourseispartlyinspiredbypreviousMIT6.823andBerkeleyCS252computerarchitecturecoursescreatedbymycollaboratorsandcolleagues:– Arvind(MIT)– JoelEmer(Intel/MIT)– JamesHoe(CMU)– JohnKubiatowicz(UCB)– DavidPaherson(UCB)
43