Download pdf - CS 152 Computer Architecture and Engineering CS252 ...inst.eecs.berkeley.edu/~cs152/sp20/lectures/L12-Superscalars.pdf · CS 152 Computer Architecture and Engineering CS252 Graduate

CS152ComputerArchitectureandEngineeringCS252GraduateComputerArchitecture

Lecture12–BranchPredic<onandAdvancedOut-of-OrderSuperscalars

KrsteAsanovicElectricalEngineeringandComputerSciences

UniversityofCaliforniaatBerkeley

http://www.eecs.berkeley.edu/~krstehttp://inst.eecs.berkeley.edu/~cs152

LastTimeinLecture11

§  Phasesofinstruc>onexecu>on:–  Fetch/decode/rename/dispatch/issue/execute/complete/commit

§ Data-in-ROBdesignversusunifiedphysicalregisterdesign§  Superscalarregisterrenaming

2

Control-FlowPenalty

3

I-cache

Fetch Buffer

Issue Buffer

Func. Units

Arch. State

Execute

Decode

Result Buffer Commit

PC

Fetch

Branch executed

Next fetch started

Modernprocessorsmayhave>10pipelinestagesbetweennextPCcalcula=onandbranchresolu=on!

Howmuchworkislostifpipelinedoesn’tfollowcorrectinstruc=onflow?

~Looplengthxpipelinewidth+buffers

ReducingControl-FlowPenalty

§ SoPwaresolu>ons–  Eliminatebranches-loopunrolling

•  Increasestherunlength–  Reduceresolu>on>me-instruc>onscheduling

•  Computethebranchcondi>onasearlyaspossible(oflimitedvaluebecausebranchesoPenincri>calpaththroughcode)

§ Hardwaresolu>ons–  Findsomethingelsetodo(delayslots)

•  Replacespipelinebubbleswithusefulwork(requiressoPwarecoopera>on)–quicklyseediminishingreturns

–  Speculate,i.e.,branchpredic>on•  Specula>veexecu>onofinstruc>onsbeyondthebranch•  Manyadvancesinaccuracy,widelyused

4

BranchPredic<on

5

Mo=va=on:Branchpenal>eslimitperformanceofdeeplypipelinedprocessorsModernbranchpredictorshavehighaccuracy(>95%)andcanreducebranchpenal>essignificantly

Requiredhardwaresupport:

Predic=onstructures:• Branchhistorytables,branchtargetbuffers,etc.

Mispredictrecoverymechanisms:

• Keepresultcomputa=onseparatefromcommit • Killinstruc>onsfollowingbranchinpipeline• Restorestatetothatfollowingbranch

ImportanceofBranchPredic<on

§ Consider4-waysuperscalarwith8pipelinestagesfromfetchtodispatch,and80-entryROB,and3cyclesfromissuetobranchresolu>on

§ Onamispredict,couldthrowaway8*4+(80-1)=111instruc>ons

§  Improvingfrom90%to95%predic>onaccuracy,removes50%ofbranchmispredicts–  If1/6instruc>onsarebranches,thenmovefrom60instruc>onsbetweenmispredicts,to120instruc>onsbetweenmispredicts

6

Sta<cBranchPredic<on

7

Overallprobabilityabranchistakenis~60-70%but:

ISAcanahachpreferreddirec>onseman>cstobranches,e.g.,MotorolaMC88110

bne0(preferredtaken) beq0(nottaken)ISAcanallowarbitrarychoiceofsta>callypredicteddirec>on,e.g.,HPPA-RISC,IntelIA-64typicallyreportedas~80%accurate

backward90%

forward50%

DynamicBranchPredic<onlearningbasedonpastbehavior

§ Temporalcorrela>on– Thewayabranchresolvesmaybeagoodpredictorofthewayitwillresolveatthenextexecu>on

§ Spa>alcorrela>on– Severalbranchesmayresolveinahighlycorrelatedmanner(apreferredpathofexecu>on)

8

One-BitBranchHistoryPredictor

§ Foreachbranch,rememberlastwaybranchwent§ Hasproblemwithloop-closingbackwardbranches,astwomispredictsoccuroneveryloopexecu>on1.  firstitera>onpredictsloopbackwardsbranchnot-taken(loop

wasexitedlast>me)2.  lastitera>onpredictsloopbackwardsbranchtaken(loop

con>nuedlast>me)

9

BranchPredic<onBits

10

• Assume2BPbitsperinstruc>on• Changethepredic>onaPertwoconsecu>vemistakes!

¬takewrong

taken¬taken

taken

taken

taken¬takeright

takeright

takewrong

¬taken

¬taken¬taken

BPstate: (predicttake/¬take)x(lastpredic=onright/wrong)

BranchHistoryTable(BHT)

11

4K-entryBHT,2bits/entry,~80-90%correctpredic>ons

00FetchPC

Branch? TargetPC

+

I-Cache

Opcode offsetInstruc=on

kBHTIndex

2k-entryBHT,2bits/entry

Taken/¬Taken?

Exploi<ngSpa<alCorrela<onYehandPa),1992

12

Historyregister,H,recordsthedirec>onofthelastNbranchesexecutedbytheprocessor

if (x[i] < 7) theny += 1;

if (x[i] < 5) thenc -= 4;

Iffirstcondi>onfalse,secondcondi>onalsofalse

Two-LevelBranchPredictor

13

Pen=umProusestheresultfromthelasttwobranchestoselectoneofthefoursetsofBHTbits(~95%correct)

0 0

kFetchPC

ShiPinTaken/¬Takenresultsofeachbranch

2-bitglobalbranchhistoryshiPregister

Taken/¬Taken?

Specula<ngBothDirec<ons?

§ Analterna>vetobranchpredic>onistoexecutebothdirec>onsofabranchspecula>vely–  resourcerequirementispropor>onaltothenumberofconcurrentspecula>veexecu>ons

–  onlyhalftheresourcesengageinusefulworkwhenbothdirec>onsofabranchareexecutedspecula>vely

–  branchpredic>ontakeslessresourcesthanspecula>veexecu>onofbothpaths

§ Withaccuratebranchpredic>on,itismorecosteffec>vetodedicateallresourcestothepredicteddirec>on!

14

Limita<onsofBHTs

15

Onlypredictsbranchdirec>on.Therefore,cannotredirectfetchstreamun>laPerbranchtargetisdetermined.

UltraSPARC-IIIfetchpipeline

Correctlypredictedtakenbranch

penalty

JumpRegisterpenalty

A PCGenera>on/MuxP Instruc>onFetchStage1F Instruc>onFetchStage2B BranchAddressCalc/BeginDecodeI CompleteDecodeJ SteerInstruc>onstoFunc>onalunitsR RegisterFileReadE IntegerExecute

Remainderofexecutepipeline(+another6stages)

BranchTargetBuffer(BTB)

16

• KeepboththebranchPCandtargetPCintheBTB• PC+4isfetchedifmatchfails• OnlytakenbranchesandjumpsheldinBTB• NextPCdeterminedbeforebranchfetchedanddecoded

2k-entry direct-mapped BTB (can also be associative) I-Cache PC

k

Valid

valid

EntryPC

=

match

predicted

target

targetPC

CombiningBTBandBHT§ BTBentriesareconsiderablymoreexpensivethanBHT,butcanredirectfetchesatearlierstageinpipelineandcanaccelerateindirectbranches(JR)

§ BHTcanholdmanymoreentriesandismoreaccurate

17


BTB

BHTBHTinlaterpipelinestagecorrectswhenBTBmissesapredictedtakenbranch

BTB/BHTonlyupdateda[erbranchresolvesinEstage

UsesofJumpRegister(JR)

§ Switchstatements(jumptoaddressofmatchingcase)

§ Dynamicfunc>oncall(jumptorun->mefunc>onaddress)

§ Subrou>nereturns(jumptoreturnaddress)

18HowwelldoesBTBworkforeachofthesecases?

BTBworkswellifsamecaseusedrepeatedly

BTBworkswellifsamefunc>onusuallycalled,(e.g.,inC++programming,whenobjectshavesametypeinvirtualfunc>oncall)

BTBworkswellifusuallyreturntothesameplace⇒O[enonefunc=oncalledfrommanydis=nctcallsites!

Subrou<neReturnStack

19

SmallstructuretoaccelerateJRforsubrou>nereturns,typicallymuchmoreaccuratethanBTBs.

&fb()

&fc()

Pushcalladdresswhenfunc=oncallexecuted

Popreturnaddresswhensubrou=nereturndecoded

fa() { fb(); } fb() { fc(); } fc() { fd(); }

&fd() kentries(typicallyk=8-16)

CS252

ReturnStackinPipeline

§ Howtousereturnstack(RS)indeepfetchpipeline?§ Onlyknowifsubrou>necall/returnatdecode

20


RSRSPush/Popa[erdecodegiveslargebubbleinfetchstream.

ReturnStackpredic=onchecked

CS252

ReturnStackinPipeline

§ CanrememberwhetherPCissubrou>necall/returnusingBTB-likestructure

§ Insteadoftarget-PC,juststorepush/popbit

21


RS

Push/Popbeforeinstruc=onsdecoded!

ReturnStackpredic=onchecked

In-Ordervs.Out-of-OrderBranchPredic<on

22

§  Specula>vefetchbutnotspecula>veexecu>on-branchresolvesbeforelaterinstruc>onscomplete

§  Completedvaluesheldinbypassnetworkun>lcommit

§  Specula>veexecu>on,withbranchesresolvedaPerlaterinstruc>onscomplete

§  CompletedvaluesheldinrenameregistersinROBorunifiedphysicalregisterfileun>lcommit

Fetch

Decode

Execute

Commit

In-OrderIssue Out-of-OrderIssue

Fetch

Decode

Execute

Commit

ROB

Br.Pred.

Resolve

Br.Pred.

Resolve

• Bothstylesofmachinecanusesamebranchpredictorsinfront-endfetchpipeline,andbothcanexecutemul>pleinstruc>onspercycle

• Commontohave10-30pipelinestagesineitherstyleofdesign

In-Order

In-Order

In-Order

Out-of-Order

InOvs.OoOMispredictRecovery

§ In-orderexecu>on?–  Designsonoinstruc>onissuedaPerbranchcanwrite-backbeforebranchresolves

–  Killallinstruc>onsinpipelinebehindmispredictedbranch

§ Out-of-orderexecu>on?–  Mul>pleinstruc>onsfollowingbranchinprogramordercancompletebeforebranchresolves

–  Asimplesolu>onwouldbetohandlelikeprecisetraps•  Problem?

23

BranchMispredic<oninPipeline

24

§ Canhavemul>pleunresolvedbranchesinROB§ Canresolvebranchesout-of-orderbykillingalltheinstruc>onsinROBthatfollowamispredictedbranch

§ MIPSR10Kusesfourmaskbitstotaginstruc>onsthataredependentonuptofourspecula>vebranches

§ Maskbitsclearedasbranchresolves,andreusedfornextbranch

Fetch Decode

Execute

Commit Reorder Buffer

Kill

Kill Kill

PC

Inject correct PC

Branch Prediction

Branch Resolution

Complete

RenameTableRecovery

§ Havetoquicklyrecoverrenametableonbranchmispredicts

§ MIPSR10Konlyhasfoursnapshotsforeachoffouroutstandingspecula>vebranches

§ Alpha21264has80snapshots,oneperROBinstruc>on

25

CS152Administrivia

§  Lab3outonFriday,dueMondayApril6§  PS3dueMondayMarch16

26

CS252

CS252Administrivia

§ ReadingsnextweekonOoOsuperscalarmicroprocessors§ Discussionmee>nginSDH240,Monday3:30-4:30

–  Newregularmee>ng>me

27

ImprovingInstruc<onFetch

§  Performanceofspecula>veout-of-ordermachinesoPenlimitedbyinstruc>onfetchbandwidth–  specula>veexecu>oncanfetch2-3xmoreinstruc>onsthanarecommihed

–  mispredictpenal>esdominatedby>metorefillinstruc>onwindow–  takenbranchesarepar>cularlytroublesome

28

CS252

IncreasingTakenBranchBandwidth(Alpha21264I-Cache)

§  Fold2-waytagsandBTBintopredictednextblock§  Taketagchecks,inst.decode,branchpredictoutofloop§  RawRAMspeedoncri>calloop(1cycleat~1GHz)§  2-bithysteresiscounterperblockpreventsovertraining

CachedInstruc>ons

LinePredict

WayPredict

TagWay0

TagWay1

=? =?

fastfetchpath

PCGenera>on

PC

BranchPredic>onInstruc>onDecodeValidityChecks

4insts

Hit/Miss/Way

29

CS252

TournamentBranchPredictor(Alpha21264)

§ Choicepredictorlearnswhetherbesttouselocalorglobalbranchhistoryinpredic>ngnextbranch

§ Globalhistoryisspecula>velyupdatedbutrestoredonmispredict

§ Claim90-100%successonrangeofapplica>ons

Local history table (1,024x10b)

PC

Local prediction (1,024x3b)

Global Prediction (4,096x2b)

Choice Prediction (4,096x2b)

Global History (12b) Prediction

30

TakenBranchLimit

§  Integercodeshaveatakenbranchevery6-9instruc>ons§  Toavoidfetchbohleneck,mustexecutemul>pletakenbranchespercyclewhenincreasingperformance

§  Thisimplies:–  predic>ngmul>plebranchespercycle–  fetchingmul>plenon-con>guousblockspercycle

31

CS252

BranchAddressCache(Yeh,Marr,Pa_)

PCk

EntryPC

=

match

Valid

valid

predicted

target#1

target#1len

len#1

predicted

target#2

target#2

ExtendBTBtoreturnmul>plebranchpredic>onspercycle

32

CS252

FetchingMul<pleBasicBlocks

§ Requireseither– mul>portedcache:expensive–  interleaving:bankconflictswilloccur

§ Mergingmul>pleblockstofeedtodecodersaddslatency,increasingmispredictpenaltyandreducingbranchthroughput

33

CS252

TraceCache

§  KeyIdea:Packmul>plenon-con>guousbasicblocksintoonecon>guoustracecacheline

BR BR BR

•  Singlefetchbringsinmul>plebasicblocks

•  Tracecacheindexedbystartaddressandnextnbranchpredic>ons

•  UsedinIntelPen>um-4processortoholddecodeduops

BR BR BR

34

Load-StoreQueueDesign

§ APercontrolhazards,datahazardsthroughmemoryareprobablynextmostimportantbohlenecktosuperscalarperformance

§ Modernsuperscalarsuseverysophis>catedload-storereorderingtechniquestoreduceeffec>vememorylatencybyallowingloadstobespecula>velyissued

35

Specula<veStoreBuffer

§  Justlikeregisterupdates,storesshouldnotmodifythememoryun>laPertheinstruc>oniscommihed.Aspecula>vestorebufferisastructureintroducedtoholdspecula>vestoredata.

§  Duringdecode,storebufferslotallocatedinprogramorder

§  Storessplitinto“storeaddress”and“storedata”micro-opera>ons

§  “Storeaddress”execu>onwritestag§  “Storedata”execu>onwritesdata§  Storecommitswhenoldestinstruc>onandbothaddressanddataavailable:–  clearspecula>vebitandeventuallymovedatatocache

§  Onstoreabort:–  clearvalidbit

36

DataTags

StoreCommitPath

Specula=veStoreBuffer

L1DataCache

Tag DataSVTag DataSVTag DataSVTag DataSVTag DataSVTag DataSV

StoreAddress

StoreData

Loadbypassfromspecula<vestorebuffer

§  Ifdatainbothstorebufferandcache,whichshouldweuse?

Specula>vestorebuffer

§  Ifsameaddressinstorebuffertwice,whichshouldweuse?

Youngeststoreolderthanload37

Data

LoadAddress

Tags

Specula=veStoreBuffer L1DataCache

LoadData

Tag DataSVTag DataSVTag DataSVTag DataSVTag DataSVTag DataSV

MemoryDependencies

sd x1, (x2) ld x3, (x4)

§ Whencanweexecutetheload?

38

In-OrderMemoryQueue

§  Executeallloadsandstoresinprogramorder

=>LoadandstorecannotleaveROBforexecu>onun>lallpreviousloadsandstoreshavecompletedexecu>on

§ Cans>llexecuteloadsandstoresspecula>vely,andout-of-orderwithrespecttootherinstruc>ons

§ Needastructuretohandlememoryordering…

39

Conserva<veO-o-OLoadExecu<on

sd x1, (x2) ld x3, (x4)

§ Canexecuteloadbeforestore,ifaddressesknownandx4!=x2

§  Eachloadaddresscomparedwithaddressesofallpreviousuncommihedstores–  canusepar>alconserva>vechecki.e.,bohom12bitsofaddress,tosavehardware

§ Don’texecuteloadifanypreviousstoreaddressnotknown

§  (MIPSR10K,16-entryaddressqueue)

40

AddressSpecula<on

sd x1, (x2) ld x3, (x4)

§ Guessthatx4!=x2 §  Executeloadbeforestoreaddressknown§ Needtoholdallcompletedbutuncommihedload/storeaddressesinprogramorder

§  Ifsubsequentlyfindx4==x2,squashloadandallfollowinginstruc>ons

§  =>Largepenaltyforinaccurateaddressspecula>on

41

CS252

MemoryDependencePredic<on(Alpha21264)

sd x1, (x2)

ld x3, (x4)

§ Guessthatx4!=x2andexecuteloadbeforestore

§  Iflaterfindx4==x2,squashloadandallfollowinginstruc>ons,butmarkloadinstruc>onasstore-wait

§  Subsequentexecu>onsofthesameloadinstruc>onwillwaitforallpreviousstorestocomplete

§  Periodicallyclearstore-waitbits

42

Acknowledgements

§  ThiscourseispartlyinspiredbypreviousMIT6.823andBerkeleyCS252computerarchitecturecoursescreatedbymycollaboratorsandcolleagues:–  Arvind(MIT)–  JoelEmer(Intel/MIT)–  JamesHoe(CMU)–  JohnKubiatowicz(UCB)–  DavidPaherson(UCB)

43