
CS8803: Advanced Microarchitecture

Lecture 12: Caches and Memory

SRAM {Over|Re}view
  • Chained inverters maintain a stable state
  • Access gates provide access to the cell
  • Writing to a cell involves overpowering the two small storage inverters
  • "6T SRAM" cell: 2 access gates, 2T per inverter
[Figure: 6T SRAM cell with bitlines b and b̄, storing complementary 1/0 values]
Note: Should definitely be review for the ECE crowd.

64×1-bit SRAM Array Organization
[Figure: two 1-of-8 decoders, wordlines, bitlines, column mux]
  • Why are we reading both b and b̄?

SRAM Density vs. Speed
  • The 6T cell must be as small as possible to have dense storage
      • Dense storage → bigger caches
      • Smaller transistors → slower transistors
  • Bitlines are *long* metal lines with a lot of parasitic loading
  • So dinky inverters cannot drive their outputs very quickly

Sense Amplifiers
  • Type of differential amplifier
  • Two inputs, amplifies the difference: output = a × (X − Y) + V_bias
  • Bitlines are precharged to Vdd
  • Once the wordline is enabled, the small cell discharges the bitline very slowly
  • The sense amp sees the difference quickly and outputs b's value
  • Sometimes the bitlines are precharged to Vdd/2, which makes a bigger delta for faster sensing

Multi-Porting
[Figure: two-ported cell with bitlines b1, b̄1, b2, b̄2 and Wordline1, Wordline2]
  • With 2 ports as shown: 2 wordlines, 4 bitlines
  • Area = O(ports^2)
Note: Lots of other techniques are *not* discussed here, such as sub-banking, hierarchical wordlines/bitlines, etc.

Port Requirements
  • ARF, PRF, RAT all need many read and write ports to support superscalar execution
      • Luckily, these have a limited number of entries/bytes

  • Caches also need multiple ports
      • Not as many ports
      • But the overall size is much larger

Delay Of Regular Caches
  • I$
      • low port requirement (one fetch group/$-line per cycle)
      • latency only exposed on branch mispredict
  • D$
      • higher port requirement (multiple LD/ST per cycle)
      • latency often on critical path of execution
  • L2
      • lower port requirement (most accesses hit in L1)
      • latency less important (only observed on L1 miss)
      • optimizing for hit rate usually more important than latency
      • difference between L2 latency and DRAM latency is large
Note: Also consider ports for snooping, although these only need to access the tag arrays.

Banking
[Figure: one big 4-ported L1 data cache (decoder, SRAM array, sense amps, column muxing) next to 4 banks with 1 port each, every bank having its own decoder, SRAM array, and sense amps]
  • Big 4-ported L1 data cache: slow due to quadratic area growth
  • 4 banks, 1 port each: each bank is much faster

Bank Conflicts
  • Banking provides high bandwidth
  • But only if all accesses are to different banks
  • Banks are typically address-interleaved
      • For N banks: Addr → bank[Addr % N]
      • Addr is at cache-line granularity
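A minimal C sketch of the address-interleaved mapping above. The 64-byte line size and all function names are assumptions made for illustration; they do not come from the lecture. The last function counts colliding bank pairs under the simplifying assumption that two accesses pick banks independently and uniformly at random.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed cache line size in bytes (illustrative, not from the lecture). */
#define LINE_BYTES 64u

/* Map a byte address to a bank: drop the line offset so that interleaving
 * happens at cache-line granularity, then take the line number modulo N. */
unsigned bank_index(uint64_t addr, unsigned n_banks)
{
    assert(n_banks > 0);
    uint64_t line = addr / LINE_BYTES;
    return (unsigned)(line % n_banks);
}

/* Two same-cycle accesses conflict when they map to the same bank. */
int bank_conflict(uint64_t a, uint64_t b, unsigned n_banks)
{
    return bank_index(a, n_banks) == bank_index(b, n_banks);
}

/* Fraction of (bank, bank) pairs that collide when two accesses choose
 * banks independently and uniformly at random. */
double conflict_probability_two_accesses(unsigned n_banks)
{
    assert(n_banks > 0);
    unsigned collisions = 0;
    for (unsigned i = 0; i < n_banks; i++)
        for (unsigned j = 0; j < n_banks; j++)
            if (i == j)
                collisions++;
    return (double)collisions / ((double)n_banks * (double)n_banks);
}
```

Under this uniform model the collision fraction works out to 1/N. Real access streams are not uniform, so the measured conflict rate can be noticeably better or worse.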

  • For 4 banks and 2 accesses, the chance of a conflict is 25%
  • Need to match the number of banks to the access patterns/bandwidth

Associativity
  • You should know this already
[Figure: looking up "foo" to find foo's value in a direct-mapped cache (RAM), a fully associative cache (CAM), and a set-associative cache (CAM/RAM hybrid?)]

Set-Associative Caches
  • Set-associativity is good for reducing conflict misses
  • Cost: slower cache access
      • often dominated by the tag array comparisons
      • basically mini-CAM logic (40-50 bit comparison!)
  • Must trade off:
      • Smaller cache size
      • Longer latency
      • Lower associativity
  • Every option hurts performance

Way-Prediction
  • If figuring out the way takes too long, then just guess!
[Figure: the load PC indexes a way predictor that selects the payload; tag comparators run alongside]
  • Tag check still occurs to validate the way prediction
  • May be hard to predict the way if the same load accesses different addresses

Way-Prediction (2)
  • Organize the data array such that the leftmost way is the MRU
[Figure: ways ordered from MRU to LRU, with a sequence of accesses]
  • Way-predict the MRU way
  • Way-prediction keeps hitting
  • On a way-miss (still a cache hit), move the block to the MRU position
  • Way-prediction continues to hit
  • Complication: the data array needs a datapath for swapping blocks (maybe 100s of bits)
  • Normally just update a few LRU bits in the tag array (< 10 bits?)
Note: Physically swapping cache lines is likely to be too expensive, both in terms of the wiring needed (consider 32- or 64-byte cache lines) and the power cost.

Partial Tagging
  • Like BTBs, just use part of the tag
  • Tag array lookup is now much faster!
  • Partial tags lead to false hits: tag 0x45120001 looks like a hit for address 0x3B120001
  • Similar to way-prediction, a full tag comparison is still needed to verify a real hit, but it is not on the critical path

Partial Tagging in the LSQ
  • Partial tagging can be used in the LSQ as well
  • Do the address check on partial addresses only
  • On a partial hit, forward the data
  • A slower complete tag check verifies the match/no match
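The fast partial check followed by a slow full verification can be sketched in C as below. Comparing only the low-order bits is an assumption chosen for illustration (the slides do not say which bits form the partial tag), and the function and type names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { NO_MATCH, FALSE_HIT, TRUE_HIT } tag_result_t;

/* Mask selecting the low `bits` bits of a 32-bit value. */
uint32_t low_bits_mask(unsigned bits)
{
    assert(bits >= 1 && bits <= 32);
    return (bits == 32) ? UINT32_MAX : ((UINT32_C(1) << bits) - 1u);
}

/* Fast path: compare only the partial tag (assumed to be the low bits). */
bool partial_match(uint32_t stored, uint32_t lookup, unsigned bits)
{
    return ((stored ^ lookup) & low_bits_mask(bits)) == 0;
}

/* Fast partial check first; the full comparison then classifies a
 * partial match as a real hit or a false hit. */
tag_result_t check_tag(uint32_t stored, uint32_t lookup, unsigned bits)
{
    if (!partial_match(stored, lookup, bits))
        return NO_MATCH;                 /* rejected by the fast check */
    return (stored == lookup) ? TRUE_HIT : FALSE_HIT;
}
```

With a 24-bit partial tag, `check_tag(0x45120001, 0x3B120001, 24)` classifies the pair from the slide as a false hit, because the two values differ only in their upper 8 bits.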

Replay or flush as needed.
  • If a store finds a later partially-matched load, don't flush the pipeline right away
  • The penalty is too severe; wait for the slow check before flushing the pipe

Interaction With Scheduling
  • Bank conflicts, way-mispredictions, partial-tag false hits
      • All change the latency of the load instruction
  • Increases frequency of replays
      • more replay conditions exist/encountered
  • Need careful tradeoff between
      • performance (reducing effective cache latency)
      • performance (frequency of replaying instructions)
      • power (frequency of replaying instructions)

Alternatives to Adding Associativity
  • More set-associativity is needed when the number of items mapping to the same cache set > number of ways
  • Not all sets suffer from high conflict rates

  • Idea: provide a little extra associativity, but not for each and every set

Victim Cache
[Figure: 4-way set-associative cache in which blocks A B C D E compete for one set and J K L M N for another, while other sets hold X Y Z and P Q R; a small victim cache sits beside it and catches the evictees]
  • Without a victim cache, every access is a miss!
      • A B C D E and J K L M N do not fit in a 4-way set-associative cache
  • The victim cache provides a "fifth way" so long as only four sets overflow into it at the same time
  • Can even provide 6th or 7th ways
Note: Typically need some sort of cast-out buffer for evictees anyway: a dirty line in a writeback cache needs to be written back to the next-level cache, and this might not be able to happen right away (e.g., bus busy, next-level cache busy, etc.). Victim cache and writeback buffer can be combined into one structure.

Skewed Associativity
[Figure: blocks A B C D and W X Y Z placed in a regular set-associative cache (lots of misses) versus a skewed-associative cache (fewer misses)]

Required Associativity Varies
  • Program stack needs very little associativity
      • spatial locality
      • stack frame is laid out sequentially
      • function usually only refers to own stack frame
[Figure: call stack f(), g(), h(), j(), k() with addresses laid out in linear organization, and its layout in a 4-way cache from MRU to LRU]
  • Associativity is not being used effectively

Stack Cache
[Figure: call stack f(), g(), h(), j(), k() feeding a stack cache with nice stack accesses, next to a regular cache with disorganized heap accesses and lots of conflicts]
Note: Stack cache is direct-mapped, regular cache can be set-associative.

Stack Cache (2)
  • Stack cache portion can be a lot simpler due to direct-mapped structure
      • relatively easily prefetched for by monitoring call/retns
  • Regular cache portion can have lower associativity
      • doesn't have conflicts due to stack/heap interaction

Stack Cache (3)
  • Which cache does a load access?
      • Many ISAs have a "default" stack-pointer register
  • Example loads steered by base register:
      LDQ 0[$sp]     stack cache
      LDQ 12[$sp]    stack cache
      LDQ 24[$sp]    stack cache
      LDQ 0[$t1]     regular cache
      LDQ 8[$t3]     steered to the regular cache, but wrong (X) after MOV $t3 = $sp
  • Need stack base and offset information, and then need to check each cache access against these bounds
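A rough C sketch of the steering and bounds check described above. The early guess from the base register, the half-open bounds, and every name here are assumptions for illustration rather than the lecture's design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { STACK_CACHE, REGULAR_CACHE } cache_sel_t;

/* Hypothetical bounds of the region held by the stack cache:
 * addresses in [stack_limit, stack_base) count as stack accesses. */
typedef struct {
    uint64_t stack_base;   /* one past the highest stack address */
    uint64_t stack_limit;  /* lowest stack address */
} stack_bounds_t;

/* Early guess, available at decode: loads whose base register is $sp
 * are steered to the stack cache, everything else to the regular cache. */
cache_sel_t predict_cache(bool base_reg_is_sp)
{
    return base_reg_is_sp ? STACK_CACHE : REGULAR_CACHE;
}

/* Ground truth, available once the effective address is known. */
cache_sel_t actual_cache(const stack_bounds_t *b, uint64_t addr)
{
    assert(b->stack_limit <= b->stack_base);
    bool in_stack = (addr >= b->stack_limit) && (addr < b->stack_base);
    return in_stack ? STACK_CACHE : REGULAR_CACHE;
}

/* A load sent to the wrong cache has to be replayed. */
bool needs_replay(const stack_bounds_t *b, bool base_reg_is_sp, uint64_t addr)
{
    return predict_cache(base_reg_is_sp) != actual_cache(b, addr);
}
```

In this sketch, a load through `$t3` right after `MOV $t3 = $sp` is predicted to use the regular cache, fails the bounds check, and is flagged for replay.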

Wrong cache → replay.

Multi-Lateral Caches
  • A normal cache is "uni-lateral" in that everything goes into the same place
  • Stack cache is an example of "multi-lateral" caches
      • multiple cache structures with disjoint contents
      • I$ vs. D$ could be considered multi-lateral

Access Patterns
  • Stack cache showed how different loads exhibit different access patterns
      • Stack (multiple push/pops of frames)
      • Heap (heavily data-dependent access patterns)
      • Streaming (linear accesses with low/no reuse)

Low-Reuse Accesses
  • Streaming
      • once you're done decoding an MPEG frame, no need to revisit
  • Other
      • fields map to different cache lines
      • parent->valid is accessed once, and then not used again

    struct tree_t {
        int valid;
        int other_fields[24];      /* pushes later fields onto other cache lines */
        int num_children;
        struct tree_t *children;
    };

    while (some_condition) {       /* placeholder loop condition */
        struct tree_t *parent = getNextRoot();
        if (parent->valid) {       /* read once per tree, never reused */
            doTreeTraversalStuff(parent);
            doMoreStuffToTree(parent);
            pickFruitFromTree(parent);
        }
    }

Filter Caches
  • Several proposed variations
      • annex cache, pollution control cache, etc.
[Figure: small filter cache beside the main cache; fill on miss goes into the filter cache]
  • First-time misses are placed in the filter cache
  • If accessed again, promote to the main cache
  • If not accessed again, the line is eventually LRU'd out
  • Main cache only contains lines with proven reuse

  • One-time-use lines have been filtered out
  • Can be thought of as the "dual" of the victim cache

Trouble w/ Multi-Lateral Caches
  • More complexity
      • load may need to be routed to different places
          • may require some form of prediction to pick the right one
          • guessing wrong can cause replays
          • or accessing multiple in parallel increases power
          • no bandwidth benefit
      • more sources to bypass from
          • costs both latency and power in bypass network
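As one concrete instance of a multi-lateral design, the filter-cache policy described earlier (first-time misses fill the filter cache, a second access promotes the line to the main cache) can be sketched in C. The sizes, the direct-mapped main cache, and all names are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FILTER_LINES 4    /* assumed filter-cache capacity (lines) */
#define MAIN_LINES   16   /* assumed main-cache capacity (lines) */

typedef struct {
    uint64_t filter[FILTER_LINES];   /* index 0 is MRU, last used slot is LRU */
    int      filter_n;               /* number of valid filter entries */
    uint64_t main_line[MAIN_LINES];  /* direct-mapped for simplicity */
    bool     main_valid[MAIN_LINES];
} caches_t;

typedef enum { MISS_FILL_FILTER, HIT_FILTER_PROMOTED, HIT_MAIN } outcome_t;

void caches_init(caches_t *c)
{
    memset(c, 0, sizeof *c);
}

/* Access one cache line (identified by its line number). */
outcome_t access_line(caches_t *c, uint64_t line)
{
    assert(c != NULL);
    unsigned idx = (unsigned)(line % MAIN_LINES);
    if (c->main_valid[idx] && c->main_line[idx] == line)
        return HIT_MAIN;

    for (int i = 0; i < c->filter_n; i++) {
        if (c->filter[i] == line) {
            /* Second touch proves reuse: remove the line from the
             * filter and install it in the main cache. */
            memmove(&c->filter[i], &c->filter[i + 1],
                    (size_t)(c->filter_n - i - 1) * sizeof c->filter[0]);
            c->filter_n--;
            c->main_line[idx] = line;
            c->main_valid[idx] = true;
            return HIT_FILTER_PROMOTED;
        }
    }

    /* First-time miss: insert at the MRU slot; when the filter is
     * full, the LRU entry falls off the end. */
    int keep = (c->filter_n < FILTER_LINES) ? c->filter_n : FILTER_LINES - 1;
    memmove(&c->filter[1], &c->filter[0], (size_t)keep * sizeof c->filter[0]);
    c->filter[0] = line;
    c->filter_n = keep + 1;
    return MISS_FILL_FILTER;
}
```

In this sketch a streaming run of one-time-use lines only ever churns the small filter array, so a line already promoted to the main cache is not displaced by it.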

Memory-Level Parallelism (MLP)
  • What if memory latency is 10000 cycles?
      • Not enough traditional ILP to cover this latency
      • Runtime dominated by waiting for memory
      • What matters is overlapping memory accesses
  • MLP: "number of outstanding cache misses [to main memory] that can be generated and executed in an overlapped manner."
  • ILP is a property of a DFG; MLP is a metric
      • ILP is independent of the underlying execution engine
      • MLP is dependent on the microarchitecture assumptions
  • You can measure MLP for uniprocessor, CMP, etc.

uArchs for MLP
  • WIB: Waiting Instruction Buffer
[Figure: scheduler holding a load miss; no instructions in the forward slice can execute]
Eventual
