Advanced Microarchitecture


Advanced Microarchitecture. Prof. Mikko H. Lipasti, University of Wisconsin-Madison. Lecture notes based on notes by Ilhyun Kim, updated by Mikko Lipasti.

  • Advanced Microarchitecture
    Prof. Mikko H. Lipasti
    University of Wisconsin-Madison

    Lecture notes based on notes by Ilhyun Kim
    Updated by Mikko Lipasti

  • Outline
    Instruction scheduling overview
    Scheduling atomicity
    Speculative scheduling
    Scheduling recovery
    Complexity-effective instruction scheduling techniques
    Building large instruction windows
    Runahead, CFP, iCFP
    Scalable load/store handling
    Control Independence

  • Readings
    Read on your own:
      Shen & Lipasti, Chapter 10 on Advanced Register Data Flow (skim)
      I. Kim and M. Lipasti, "Understanding Scheduling Replay Schemes," in Proceedings of the 10th International Symposium on High-Performance Computer Architecture (HPCA-10), February 2004.
      S. Srinivasan, R. Rajwar, H. Akkary, A. Gandhi, and M. Upton, "Continual Flow Pipelines," in Proceedings of ASPLOS 2004, October 2004.
      A. S. Al-Zawawi, V. K. Reddy, E. Rotenberg, and H. H. Akkary, "Transparent Control Independence," in Proceedings of ISCA-34, 2007.
    To be discussed in class:
      T. Sha, M. Martin, and A. Roth, "NoSQ: Store-Load Communication without a Store Queue," in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006.
      P. Salverda and C. B. Zilles, "Fundamental Performance Constraints in Horizontal Fusion of In-Order Cores," HPCA 2008, pp. 252-263.
      A. Hilton, S. Nagarakatte, and A. Roth, "iCFP: Tolerating All-Level Cache Misses in In-Order Processors," in Proceedings of HPCA 2009.
      G. H. Loh, Y. Xie, and B. Black, "Processor Design in 3D Die-Stacking Technologies," IEEE Micro 27(3), May 2007, pp. 31-48.

  • Register Dataflow

  • Instruction scheduling
    A process of mapping a series of instructions into execution resources
    Decides when and where an instruction is executed
    [Figure: data dependence graph mapped to two FUs]

  • Instruction scheduling
    A set of wakeup and select operations

    Wakeup
      Broadcasts the tags of the selected parent instructions
      Dependent instructions match the tags to determine whether their source operands are ready
      Resolves true data dependences

    Select
      Picks instructions to issue among a pool of ready instructions
      Resolves resource conflicts
        Issue bandwidth
        Limited number of functional units / memory ports
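
The wakeup and select steps above can be sketched as a small cycle-by-cycle model. This is an illustrative sketch only, not any particular machine's scheduler: the `Entry` class, the oldest-first select policy, and the single-cycle execution latency are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Entry:
    name: str
    dest: str                                   # tag this instruction produces
    waiting: set = field(default_factory=set)   # source tags not yet broadcast


def schedule(entries, issue_width):
    """Return the instructions issued in each cycle (oldest-first select)."""
    pending = list(entries)
    cycles = []
    while pending:
        # Select: grant up to issue_width ready entries, oldest first.
        ready = [e for e in pending if not e.waiting]
        if not ready:
            raise ValueError("no ready instruction: a source tag is never produced")
        selected = ready[:issue_width]
        cycles.append([e.name for e in selected])
        pending = [e for e in pending if e not in selected]
        # Wakeup: broadcast the selected tags; waiting entries clear matches.
        tags = {e.dest for e in selected}
        for e in pending:
            e.waiting -= tags
    return cycles


window = [
    Entry("i1", "p1"),
    Entry("i2", "p2", {"p1"}),
    Entry("i3", "p3"),
    Entry("i4", "p4", {"p2", "p3"}),
    Entry("i5", "p5"),
]
print(schedule(window, issue_width=2))
# [['i1', 'i3'], ['i2', 'i5'], ['i4']]
```

With an issue width of 2, `i5` is ready in the first cycle but loses the select to the older `i1` and `i3`, which is the resource-conflict case that select resolves.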

  • Scheduling loop
    Basic wakeup and select operations
    [Figure: scheduling loop. Wakeup logic turns ready entries into requests (request 0 ... request n); select logic returns grants (grant 0 ... grant n); the selected instruction issues to an FU and its tag is broadcast back to the wakeup logic.]

  • Wakeup and Select

  • Scheduling Atomicity
    Operations in the scheduling loop must occur within a single clock cycle
    Required for back-to-back execution of dependent instructions

  • Implication of scheduling atomicity
    Pipelining is a standard way to improve clock frequency

    Hard to pipeline instruction scheduling logic without losing ILP
      ~10% IPC loss with 2-cycle scheduling
      ~19% IPC loss with 3-cycle scheduling

    A major obstacle to building high-frequency microprocessors
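
To see why a pipelined scheduling loop costs ILP, the following sketch computes the earliest issue cycle of each instruction when a dependent can only wake up `loop_cycles` after its parent is selected. It assumes single-cycle execution and unlimited issue width; the function name and the tiny dependence graph are hypothetical, and the sketch does not attempt to reproduce the IPC figures quoted above.

```python
def issue_cycles(parents, loop_cycles):
    """Earliest issue cycle per instruction with unlimited issue width.

    parents maps each instruction (in program order) to its producers.
    """
    issue = {}
    for inst, producers in parents.items():
        issue[inst] = max((issue[p] + loop_cycles for p in producers), default=0)
    return issue


chain = {"a": [], "b": ["a"], "c": ["b"], "d": []}
print(issue_cycles(chain, loop_cycles=1))  # {'a': 0, 'b': 1, 'c': 2, 'd': 0}
print(issue_cycles(chain, loop_cycles=2))  # {'a': 0, 'b': 2, 'c': 4, 'd': 0}
```

The dependent chain `a -> b -> c` stretches from 3 to 5 cycles with a 2-cycle loop, while the independent `d` is unaffected; real workloads hide part of this bubble with independent work, so the measured loss is smaller than the chain alone suggests.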

  • Scheduler Designs
    Data-Capture Scheduler
      Keeps the most recent register value in reservation stations
      Data forwarding and wakeup are combined

  • Scheduler Designs
    Non-Data-Capture Scheduler
      Keeps the most recent register value in the RF (physical registers)
      Data forwarding and wakeup are decoupled

    Complexity benefits
      Simpler scheduler / data / wakeup path

  • Mapping to pipeline stages
    [Figure: pipeline diagrams for the AMD K7 (data-capture) and the Pentium 4 (non-data-capture); path labels: data, data / wakeup, wakeup]

  • Scheduling atomicity & non-data-capture scheduler
    Multi-cycle scheduling loop

    Scheduling atomicity is not maintained
      Separated by extra pipeline stages (Disp, RF)
      Unable to issue dependent instructions consecutively

    Solution: speculative scheduling

  • Speculative Scheduling
    Speculatively wake up dependent instructions even before the parent instruction starts execution
    Keeps the scheduling loop within a single clock cycle

    But nobody knows what will happen in the future

    Source of uncertainty in instruction scheduling: loads
      Cache hit / miss
      Store-to-load aliasing
      Eventually affects timing decisions

    Scheduler assumes that all types of instructions have pre-determined fixed latencies
      Load instructions are assumed to have the common-case (over 90% in general) $DL1 hit latency
      If the assumption is incorrect, subsequent (dependent) instructions are replayed

  • Speculative Scheduling
    Overview
      Unlike the original Tomasulo's algorithm
      Instructions are scheduled BEFORE actual execution occurs
      Assumes instructions have pre-determined fixed latencies
        ALU operations: fixed latency
        Load operations: assumes $DL1 latency (common case)
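
A minimal sketch of the idea above: the scheduler picks issue cycles from assumed latencies, and a later check finds the instructions that issued before their operands were really available. The latency values, function names, and the four-instruction program are assumptions for illustration only.

```python
ASSUMED_LATENCY = {"alu": 1, "load": 3}  # hypothetical; load = assumed DL1 hit


def speculative_issue(program):
    """Issue cycle per instruction, scheduled from assumed latencies only.

    program is a list of (name, kind, parents) tuples in program order.
    """
    issue, kind_of = {}, {}
    for name, kind, parents in program:
        kind_of[name] = kind
        issue[name] = max(
            (issue[p] + ASSUMED_LATENCY[kind_of[p]] for p in parents), default=0
        )
    return issue


def misscheduled(program, issue, actual_latency):
    """Instructions that issued before a source value was actually valid."""
    valid_at, bad = {}, []
    for name, kind, parents in program:
        if all(issue[name] >= valid_at[p] for p in parents):
            latency = actual_latency.get(name, ASSUMED_LATENCY[kind])
            valid_at[name] = issue[name] + latency
        else:
            bad.append(name)
            valid_at[name] = float("inf")  # result is garbage until replayed
    return bad


program = [
    ("ld", "load", []),
    ("add", "alu", ["ld"]),
    ("sub", "alu", ["add"]),
    ("or", "alu", []),
]
issue = speculative_issue(program)
print(issue)                                      # {'ld': 0, 'add': 3, 'sub': 4, 'or': 0}
print(misscheduled(program, issue, {}))           # []
print(misscheduled(program, issue, {"ld": 10}))   # ['add', 'sub']
```

When the load hits, the speculative schedule is correct as issued. When it takes 10 cycles instead of the assumed 3, both `add` and its dependent `sub` were issued too early and must be replayed, while the independent `or` is fine.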

  • Scheduling replay
    Speculation needs verification / recovery
    There's no free lunch

    If the actual load latency is longer than what was speculated (i.e., a cache miss)
      Best solution (disregarding complexity): replay data-dependent instructions issued under the load shadow
    [Figure label: cache miss detected]

  • Wavefront propagation
    Speculative execution wavefront
      Speculative image of execution (from the scheduler's perspective)

    Both wavefronts propagate along dependence edges at the same rate (1 level / cycle)
      The real wavefront runs behind the speculative wavefront

    The load resolution loop delay complicates the recovery process
      A scheduling miss is reported a couple of clock cycles after issue

  • Load resolution feedback delay in instruction scheduling
    Scheduling runs multiple clock cycles ahead of execution
    But instructions can keep track of only one level of dependence at a time (using source operand identifiers)
    [Figure: pipeline with stage labels Broadcast / wakeup, Select, Dispatch / Payload, RF, Misc., and Execution, occupied by instructions N through N-4; annotations: time delay between scheduling and feedback; recover instructions in this path]

  • Issues in scheduling replay
    Cannot stop speculative wavefront propagation
      Both wavefronts propagate at the same rate
      Dependent instructions are unnecessarily issued under load misses
    [Figure: timeline from cycle n to cycle n+3; labels: Sched / Issue, Exe, checker, cache miss signal]

  • Requirements of scheduling replay
    Conditions for ideal scheduling replay
      All mis-scheduled dependent instructions are invalidated instantly
      Independent instructions are unaffected

    Multiple levels of dependence tracking are needed
      e.g., "Am I dependent on the current cache miss?"
      A longer load resolution loop delay means tracking more levels
    Propagation of recovery status should be faster than speculative wavefront propagation
    Recovery should be performed on the transitive closure of dependent instructions
    [Figure label: load miss]

  • Scheduling replay schemes
    Alpha 21264: Non-selective replay
      Replays all dependent and independent instructions issued under the load shadow
      Analogous to squashing recovery in branch misprediction
      Simple but high performance penalty
        Independent instructions are unnecessarily replayed

  • Position-based selective replay
    Ideal selective recovery
      Replay dependent instructions only
    Dependence tracking is managed in a matrix form
      Column: load issue slot, row: pipeline stages
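
The difference between the two replay policies can be illustrated by computing which instructions each one replays after a load miss. This sketch is not the position-based matrix hardware; it simply takes the set of cycles in the load shadow as an input (`shadow`, a hypothetical parameter) and uses a software transitive closure where the hardware would use its tracking matrix.

```python
def dependents(parents, root):
    """Transitive closure of instructions that depend on root.

    parents maps each instruction (in program order) to its producers.
    """
    closure = set()
    for name, producers in parents.items():
        if any(p == root or p in closure for p in producers):
            closure.add(name)
    return closure


def replay_sets(parents, issue, load, shadow):
    """Return (non_selective, selective) replay sets for a missed load."""
    in_shadow = {n for n, cycle in issue.items() if n != load and cycle in shadow}
    non_selective = in_shadow                          # squash everything in the shadow
    selective = in_shadow & dependents(parents, load)  # only the load's dependents
    return non_selective, selective


parents = {"ld": [], "add": ["ld"], "sub": ["add"], "or": [], "xor": ["or"]}
issue = {"ld": 0, "add": 3, "sub": 4, "or": 3, "xor": 4}
non_selective, selective = replay_sets(parents, issue, "ld", shadow=range(3, 5))
print(sorted(non_selective))  # ['add', 'or', 'sub', 'xor']
print(sorted(selective))      # ['add', 'sub']
```

Non-selective replay re-issues `or` and `xor` even though neither depends on the load, which is the performance penalty that selective replay removes at the cost of dependence tracking.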

  • Low-complexity scheduling techniques
    FIFO (Palacharla, Jouppi, Smith, 1996)

    Replaces conventional scheduling logic with multiple FIFOs
      Steering logic puts instructions into different FIFOs considering dependences
      A FIFO contains a chain of dependent instructions
      Only the head instructions are considered for issue
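
One plausible steering heuristic for the FIFO scheme above is sketched below: append an instruction behind its producer when that producer is at a FIFO tail, otherwise start a new chain in an empty FIFO, otherwise stall dispatch. The exact heuristic, the `steer` function, and the fact that FIFOs are never drained during steering are simplifying assumptions for illustration.

```python
from collections import deque


def steer(program, num_fifos):
    """Steer (name, parents) pairs, in program order, into dependence-chain FIFOs.

    Returns (fifos, stalled): the FIFO contents and the instructions left
    undispatched once no suitable FIFO exists.
    """
    fifos = [deque() for _ in range(num_fifos)]
    for pos, (name, parents) in enumerate(program):
        # Prefer a FIFO whose tail produces one of this instruction's sources.
        target = next((f for f in fifos if f and f[-1] in parents), None)
        if target is None:
            # Otherwise start a new chain in an empty FIFO.
            target = next((f for f in fifos if not f), None)
        if target is None:
            # No suitable FIFO: in-order dispatch stalls here.
            return [list(f) for f in fifos], [n for n, _ in program[pos:]]
        target.append(name)
    return [list(f) for f in fifos], []


program = [("a", []), ("b", ["a"]), ("c", []), ("d", ["c"]), ("e", ["b", "d"])]
print(steer(program, num_fifos=2))
# ([['a', 'b', 'e'], ['c', 'd']], [])
```

Only `a` and `c`, the FIFO heads, would be examined by the issue logic, so the wakeup and select cost scales with the number of FIFOs rather than the number of buffered instructions.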

  • FIFO (cont'd)
    Scheduling example

  • FIFO (cont'd)
    Performance

    Comparable performance to conventional scheduling
    Reduced scheduling logic complexity
    Many related papers on clustered microarchitecture
    Can in-order clusters provide high performance? [Zilles reading]

  • Memory Dataflow

  • Key Challenge: MLP
    Tolerate/overlap memory latency
      Once the first miss is encountered, find another one
    Naive solution
      Implement a very large ROB, IQ, LSQ
      Power/area/delay make this infeasible
    Build a virtual instruction window

  • Runahead
    Use poison bits to eliminate the miss-dependent load program slice
      Forward load slice processing is a very old idea
        Massive Memory Machine [Garcia-Molina et al. 84]
        Datascalar [Burger, Kaxiras, Goodman 97]
    Runahead proposed by [Dundas, Mudge 97]
    Checkpoint state, keep running
    When the miss completes, return to checkpoint
    May need a runahead cache for store/load communication
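
Poison-bit propagation can be sketched as a single pass over the instruction stream: a missing load poisons its destination register, any instruction reading a poisoned register poisons its own destination, and a write from clean sources clears the poison. The tuple format and function name are assumptions for this example, and the missing load itself is counted as part of the slice here.

```python
def poisoned_slice(program, missing_loads):
    """Return the miss-dependent slice of a program.

    program is a list of (name, dest_reg, src_regs) in program order;
    missing_loads is the set of load names that miss in the cache.
    """
    poison = set()   # registers currently carrying a poison bit
    in_slice = []
    for name, dest, srcs in program:
        if name in missing_loads or any(s in poison for s in srcs):
            poison.add(dest)
            in_slice.append(name)
        else:
            poison.discard(dest)   # overwritten with a valid value
    return in_slice


program = [
    ("ld1", "r1", ["r9"]),
    ("add", "r2", ["r1"]),
    ("mov", "r1", ["r8"]),   # overwrites r1 with a valid value
    ("sub", "r3", ["r1"]),
    ("ld2", "r4", ["r2"]),   # address depends on the miss
]
print(poisoned_slice(program, {"ld1"}))
# ['ld1', 'add', 'ld2']
```

Instructions outside the slice (`mov`, `sub`) can execute normally during runahead, and independent loads among them are the ones that expose additional misses.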

  • Waiting Instruction Buffer
    [Lebeck et al. ISCA 2002]
    Capture forward load slice in a separate buffer
      Propagate poison bits to identify the slice
    Relieve pressure on the issue queue
    Reinsert instructions when the load completes
    Very similar to the Intel Pentium 4 replay mechanism
      But not publicly known at the time

  • Continual Flow Pipelines
    [Srinivasan et al. 2004]
    Slice buffer extension of WIB
      Store operands in the slice buffer as well to free up buffer entries in the OOO window
      Relieve pressure on rename/physical registers
    Applicable to data-capture machines (Intel P6) or physical register file machines (Pentium 4)
    Recently extended to in-order machines (iCFP)
    Challenge: how to buffer loads/stores

  • Scalable Load/Store Queues
    Load queue/store queue
      Large instruction window: many loads and stores have to be buffered (25%/15% of mix)
      Expensive searches
        Positional-associative searches in SQ
        Associative lookups in LQ (coherence, speculative load scheduling)
      Power/area/delay are prohibitive

  • Store Queue/Load Queue Scaling
    Multilevel queues
    Bloom filters (quick check for independence)
    Eliminate the associative load queue via replay [Cain 2004]
      Issue loads again at commit, in order
      Check to see if the same value is returned
      Filter load checks for efficiency:
        Most loads don't issue out of order (no speculation)
        Most loads don't coincide with coherence traffic
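
The Bloom-filter check mentioned above can be sketched as follows: store addresses are hashed into a small bit vector, and a load whose bits are not all set cannot match any tracked store, so the associative search can be skipped. The class name, the two hash functions, and the 64-bit vector are arbitrary assumptions for the example; removing stores when they leave the queue would need a counting variant and is not modeled.

```python
class StoreAddressFilter:
    """Tiny Bloom filter over in-flight store addresses (sketch only)."""

    def __init__(self, bits=64):
        self.bits = bits
        self.mask = 0

    def _hashes(self, addr):
        word = addr >> 3                      # assume 8-byte aligned accesses
        return (word % self.bits, (word * 31 + 7) % self.bits)

    def add_store(self, addr):
        for h in self._hashes(addr):
            self.mask |= 1 << h

    def may_alias(self, addr):
        """False means no tracked store can match, so the SQ search is skipped."""
        return all((self.mask >> h) & 1 for h in self._hashes(addr))


f = StoreAddressFilter()
f.add_store(0x1000)
f.add_store(0x2008)
print(f.may_alias(0x1000))  # True
print(f.may_alias(0x3010))  # False
```

A `True` answer may be a false positive and still requires the full search; a `False` answer is always safe, which is what makes the filter usable as a quick independence check.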

  • SVW and NoSQ
    Store Vulnerability Window (SVW)
      Assign sequence numbers to stores
      Track writes to cache with sequence numbers
      Efficiently filter out safe loads/stores by only checking against writes in the vulnerability window
    NoSQ
      Rely on load/store alias prediction to satisfy dependent pairs
      Use the SVW technique to check
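
A rough sketch of the SVW filtering idea, under stated assumptions: each committed store gets the next sequence number, a small table indexed by an address hash remembers the sequence number of the last store that wrote there, and a load at commit needs a re-check only if that entry is newer than the youngest store the load is already known to be safe against. The class, method names, table size, and hash are all hypothetical and are not taken from the SVW or NoSQ papers.

```python
class SVWFilter:
    """Store-sequence-number filter for load re-checks (illustrative only)."""

    def __init__(self, table_size=64):
        self.table = [0] * table_size   # last store sequence number per hash
        self.next_ssn = 1

    def _index(self, addr):
        return (addr >> 3) % len(self.table)

    def commit_store(self, addr):
        """Record a committed store and return its sequence number."""
        ssn = self.next_ssn
        self.next_ssn += 1
        self.table[self._index(addr)] = ssn
        return ssn

    def last_committed_ssn(self):
        return self.next_ssn - 1

    def load_needs_check(self, addr, safe_ssn):
        """safe_ssn: youngest store the load is known not to be vulnerable to."""
        return self.table[self._index(addr)] > safe_ssn


svw = SVWFilter()
svw.commit_store(0x100)
safe = svw.last_committed_ssn()   # a load executes here and records this number
svw.commit_store(0x200)           # an older store commits after the load executed
print(svw.load_needs_check(0x100, safe))  # False
print(svw.load_needs_check(0x200, safe))  # True
```

Because the table is indexed by a hash, two addresses can share an entry and cause an unnecessary check, but a write inside the vulnerability window is never missed.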

  • Store/Load Optimi
