Advanced Microarchitecture
Lecture 11: Memory Scheduling
CS8803: Advanced Microarchitecture
Executing Memory Instructions
(slide footer: Lecture 13: Memory Scheduling)

    Load  R3 = 0[R6]      issues, cache miss
    Add   R7 = R3 + R9    waits for the miss, then issues
    Store R4 -> 0[R7]     waits for R7, then issues
    Sub   R1 = R1 - R2    issues out of order
    Load  R8 = 0[R1]      issues out of order, cache hit

While the first load's miss is being serviced, the Sub and the second Load issue out of order, and the Load hits in the cache. Only after the miss is serviced do the Add and the Store issue: but there was a later load that already executed.

If R1 != R7, then Load R8 gets the correct value from the cache.
If R1 == R7, then Load R8 should have gotten its value from the Store, but it didn't!

This is the basic example of address-based dependency ambiguities.

Memory Disambiguation Problem

The ordering problem is a data-dependence violation.
Why can't this happen with non-memory instructions?
- Operand specifiers in non-memory instructions are absolute: R1 refers to one specific location
- Operand specifiers in memory instructions are ambiguous: 0[R1] refers to the memory location specified by the current value of R1, so as the pointer changes, so does the location
Determining whether it is safe to issue a load out of order requires disambiguating the operand specifiers.

Two Problems

1. Memory disambiguation: are there any earlier unexecuted stores to the same address as myself? (I'm a load.)
   A binary question: the answer is yes or no.
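As a toy illustration of the binary question (this is not any real scheduler's logic; the entry fields and the rule that an unknown store address counts as a possible match are assumptions for the sketch):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MemOp:
    seq: int              # program-order sequence number
    is_store: bool
    addr: Optional[int]   # None until the effective address is computed
    executed: bool = False

def may_issue_load(load, lsq):
    """Binary disambiguation question for a load: are there any earlier,
    unexecuted stores that might be to the same address?"""
    for op in lsq:
        if op.is_store and op.seq < load.seq and not op.executed:
            # An unknown store address must be treated as a possible match.
            if op.addr is None or op.addr == load.addr:
                return False   # not safe to issue the load yet
    return True
```

Under this model, a load is blocked by any earlier store whose address is still unresolved, which corresponds to the conservative end of the scheduling policies discussed below.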
2. Store-to-load forwarding: which earlier store do I get my value from? (I'm a load.) Which later load(s) do I forward my value to? (I'm a store.)
   A non-binary question: the answer is one or more instruction identifiers.

Load Store Queue (LSQ)

  L/S  PC      Seq    Addr    Value
  L    0xF048  41773  0x3290  42      <- oldest
  S    0xF04C  41774  0x3410  25
  S    0xF054  41775  0x3290  -17
  L    0xF060  41776  0x3418  1234
  L    0xF840  41777  0x3290  -17
  L    0xF858  41778  0x3300  1
  S    0xF85C  41779  0x3290  0
  L    0xF870  41780  0x3410  25
  L    0xF628  41781  0x3290  0
  L    0xF63C  41782  0x3300  1       <- youngest

  Data cache: 0x3290 = 42, 0x3410 = 38, 0x3418 = 1234, 0x3300 = 1
  (Loads 41777, 41780, and 41781 get their values, -17, 25, and 0, forwarded from earlier stores in the LSQ rather than from the cache.)

Most Conservative Policy: No Memory Reordering
- The LSQ is still needed for forwarded data (previous slide)
- Easy to schedule: each memory instruction becomes ready, bids, and is granted strictly in program order
- Least IPC: all memory instructions execute sequentially

Loads OOO Between Stores
- Let loads execute out of order with respect to each other, but allow no reordering past earlier unexecuted stores
- A load becomes ready only when all earlier stores have executed

Loads Wait for Only STAs
- Stores normally don't execute until both inputs are ready: the address and the data
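The non-binary forwarding question can be sketched the same way, using a few entries from the lecture's LSQ example (a simplified model that assumes full-width, aligned accesses; the tuple layout is invented for illustration):

```python
# LSQ entries from the slide's example, oldest first: (kind, seq, addr, value).
LSQ = [
    ("L", 41773, 0x3290, 42),
    ("S", 41774, 0x3410, 25),
    ("S", 41775, 0x3290, -17),
    ("L", 41776, 0x3418, 1234),
    ("S", 41779, 0x3290, 0),
]

def forward_source(load_seq, load_addr, lsq):
    """Answer the load's side of the forwarding question: the sequence
    number of the youngest earlier store to the same address, or None
    if the load should read the data cache instead."""
    best = None
    for kind, seq, addr, _value in lsq:
        if kind == "S" and seq < load_seq and addr == load_addr:
            if best is None or seq > best:
                best = seq
    return best
```

For example, a load with sequence number 41781 to 0x3290 forwards from store 41779, not from the older store 41775; a load to 0x3418 finds no match and reads the cache.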
- Only the address is needed to disambiguate, so let a load issue once all earlier store addresses are known, even if the store data is not yet ready

Loads Execute When Ready
- The most aggressive approach
- Relies on the fact that store-to-load forwarding is not the common case
- Greatest potential IPC: loads never stall
- Potential for incorrect execution

(Speaker note: perhaps good to have a discussion here about when forwarding is unlikely vs. likely. ISA dependence? Program structures?)

Detecting Ordering Violations
- Case 1: an older store executes before a younger load
  No problem; if they share an address, store-to-load forwarding happens
- Case 2: an older store executes after a younger load
  The store scans all younger loads; an address match means an ordering violation
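Case 2 can be sketched as the scan a late-executing store performs over younger loads (a toy model; the record layout and field names are assumptions):

```python
def store_scans_younger_loads(store_seq, store_addr, lsq):
    """Case 2: an older store executes late and scans younger loads.
    Any already-executed younger load to the same address is an
    ordering violation; its sequence number is returned."""
    return [e["seq"] for e in lsq
            if e["kind"] == "L" and e["seq"] > store_seq
            and e["addr"] == store_addr and e["executed"]]
```

Older loads, loads to other addresses, and younger loads that have not yet executed (which can simply grab the forwarded value) are all ignored; only executed younger loads to the same address trigger recovery.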
Detecting Ordering Violations (2)

Continuing the LSQ example: when store 41775 executes, it broadcasts its value, address, and sequence number: (-17, 0x3290, 41775).
- Loads CAM-match on the address, and only care if the store's sequence number is lower than their own (load 41773 ignores the broadcast because it has a lower sequence number)
- If a matching younger load hasn't executed yet, it grabs the broadcast value
- If a matching younger load has already executed, that is an ordering violation: grab the value and flush the pipeline after the load
- A second broadcast, (0, 0x3290, 41779), shows that an instruction may be involved in more than one ordering violation

(Speaker note: detecting address matches is actually not as trivial as you might think. Consider x86, where loads/stores may be multiple bytes wide and unaligned. A store may write one or more bytes, of which only a subset overlap with the bytes read by a load. The overlap must be detected to ensure correct execution, although forwarding may be restricted to certain easy cases, e.g., perfect alignment; for the unsupported forwarding cases, you can always stall the load until the matching store or stores write back to the cache, and the load can then read its value directly from the cache. It seems that each generation of Intel Core processors gradually supports more forwarding cases.)

Dealing with Misspeculations
- Instructions using the load's stale/wrong value will propagate more wrong values; these must somehow be re-executed
- Easiest: flush all instructions after (and including?)
the misspeculated load, and just refetch
- The load uses the forwarded value
- The correct value propagates when the instructions re-execute

(Speaker note: for x86, you probably want to flush *including* the load, because the load may be part of a longer uop flow, and refetching just part of a flow makes no sense; i.e., how do you re-steer the front end in that case? Consider the CALL-with-memory-argument instruction, which decomposes, among other uops, into a load (which reads in the new PC) and a jump (to the new PC). The jump portion may already have jumped to the wrong place, and there is no easy way to redo only the uops in the flow that follow the load.)

Recovery Complications
- When flushing only part of the pipeline (everything after the load), the RAT must be repaired to its state just after the load was renamed
- Solutions?
  - Checkpoint at every load: not so good; between loads and branches, a very large number of checkpoints would be needed
  - Roll back to the previous branch (which has its own checkpoint) and make sure the load doesn't misspeculate the second time around; you have to redo the work between the branch and the load, all of which was correct the first time. Works with an undo-list style of recovery.

Flushing is Expensive
- Not all later instructions depend on the bogus load value
- The pipeline latency due to refetch is exposed
- Hunting down the RS entries to squash is tricky
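The undo-list style of recovery mentioned above can be sketched as a log of old register mappings that is walked backward to the misspeculated load (the class shape and field names are illustrative, not any particular design):

```python
class RAT:
    """Register alias table with an undo log for partial flushes."""
    def __init__(self, nregs):
        self.map = {r: r for r in range(nregs)}   # arch reg -> phys reg
        self.undo = []                            # (seq, arch, old_phys)

    def rename(self, seq, arch, new_phys):
        # Log the mapping being destroyed so it can be restored on a flush.
        self.undo.append((seq, arch, self.map[arch]))
        self.map[arch] = new_phys

    def rollback_after(self, load_seq):
        """Repair the RAT to its state just after instruction `load_seq`
        was renamed, undoing every younger rename in reverse order."""
        while self.undo and self.undo[-1][0] > load_seq:
            _seq, arch, old_phys = self.undo.pop()
            self.map[arch] = old_phys
```

The cost is visible in the loop: every rename younger than the load must be undone one at a time, which is why checkpoint-based schemes trade storage for recovery latency.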
Selective Re-Execution
- The ideal case w.r.t. maintaining high IPC, but very complicated:
  - you need to hunt down only the data-dependent instructions
  - it's messier because some instructions may have already executed (now in the ROB) while others may not have executed yet (still in the RS)
  - iteratively walk the dependence graph? use some sort of load/store coloring scheme?
- The P4 uses replay for load-latency misspeculation, but replay wouldn't work in this case (why?)

(Speaker note: well, you could perhaps force replay to work, but you'd basically have to have a giant replay queue to hold all instructions until you can guarantee that they will not be downstream from a load-store ordering violation.)

Load/Store Execution: SimpleScalar Style
- A store is cracked at dispatch time into an ea-comp part and a st-data part; the two schedule and execute independently, each writing its result (effective address or data) into the store's LSQ entry
- When both parts have executed, the store is complete and can forward its value to later loads
- A load is similar, but its ld-data portion is data-dependent on the load's ea-comp

Complications
- The LSQ needs data-capture support: the store-data part needs to capture its value
- EA-comps can write to LSQ entries directly using the LSQ index (no associative search)
- A store normally doesn't have a destination, so that field is overloaded to hold the LSQ index (e.g., an RS entry for St-ea with destination Lsq-5)
- A load's ea-comp is handled the same way; the load's LSQ entry handles the real destination-tag broadcast

(Speaker note: the SimpleScalar model (really sim-outorder) is not desirable in a real implementation because you don't want to add even more CAM logic, required for the data capture, to the LSQ.)

Complications (2)
- A load must bid/select twice: once for the ea-comp portion, and once for the cache access (which includes the LSQ check)
- The data cache and the LSQ are searched in parallel

Load/Store Execution: Pentium Style
- STA (store address) and STD (store data) still execute independently
- The LSQ does not need data capture: it uses the RS's data capture (for a data-capture scheduler), or the RS -> PRF -> LSQ path
- Potentially adds a little delay from STD-ready to store-to-load forwarding

Load Execution
- Only one select/bid; the ea-comp executes, and the data cache and LSQ are searched in parallel
- The load-queue part doesn't execute; it just holds the address for detecting
ordering violations

(Speaker note: we're not sure what the actual organization is in current processors. This approach can lead to unnecessary data cache and LSQ lookups in the case where the load is predicted to wait on earlier unresolved store addresses.)

Store Execution
- STA and STD independently issue from the RS: STA does the ea-comp; STD just reads its operand and moves it to the LSQ
- When both have executed and reached the LSQ, search the LSQ for younger loads that have already executed (i.e., ordering violations)

LSQ Hardware in More Detail
- The CAM logic is harder than a regular scheduler's because we need address plus age information
- Age information is not needed for physical registers, since register renaming guarantees one writer per register
- There is no easy way to prevent more than one store to the same address

Loads Checking for Earlier Matching Stores

Example: earlier stores to 0x4000, 0x4000, and 0x4120 sit ahead of a load from 0x4000. The address bank CAM-compares each valid earlier store's address against the load's; priority logic then selects the youngest matching store ("use this store") or signals "no earlier matches," in which case the load reads the data cache.
- The logic needs to be adjusted so that the load need not be at the bottom of the LSQ, and so that the LSQ can wrap around
- If |LSQ| is large, the logic can be adapted to have logarithmic delay

(Speaker note: as mentioned in the earlier notes, the real circuitry gets quite a bit messier when you have to deal with different memory widths and/or unaligned accesses.)

Data Forwarding

Similar logic to the previous slide: for a load from 0x4000 behind stores to 0x4000, 0x4120, and 0x4000, matching stores that were overwritten by a younger store to the same address must be ignored, and the load captures the value only from the surviving match in the data bank.
- This logic is ugly, complicated, slow, and power hungry!
- It must handle the situation where more than one store writes to the same address
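The search above, including the wrap-around adjustment the slide calls for, can be sketched by aging entries relative to a head pointer (a toy model; the slot layout and field names are invented for illustration):

```python
def earlier_matching_store(lsq, head, load_pos, load_addr):
    """Find the youngest earlier store to load_addr in a circular LSQ.
    Slots hold None or (kind, addr, valid). Ages are measured as the
    distance from `head` (the oldest entry), so the search works even
    when the queue wraps around and the load is not in the bottom slot."""
    capacity = len(lsq)

    def age(pos):
        return (pos - head) % capacity

    best = None
    for pos, entry in enumerate(lsq):
        if entry is None:
            continue
        kind, addr, valid = entry
        if kind == "S" and valid and addr == load_addr and age(pos) < age(load_pos):
            if best is None or age(pos) > age(best):
                best = pos
    return best   # slot index of the forwarding store, or None: read the cache
```

The youngest-of-the-older selection also handles the "overwritten" cases from the Data Forwarding slide: an older store to the same address simply loses the age comparison to the younger one. A real implementation does this with priority logic rather than a linear scan, which is where the logarithmic-delay adaptation comes in.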