Advanced Microarchitecture. Lecture 8: Data-Capture Instruction Schedulers
CS8803: Advanced Microarchitecture
Out-of-Order Execution

- The goal is to execute instructions in dataflow order, as opposed to the sequential order specified by the ISA
- The renamer plays a critical role by removing all of the false register dependencies
- The scheduler is responsible for:
  - for each instruction, detecting when all dependencies have been satisfied (so that the instruction is ready to execute)
  - propagating readiness information between instructions

Out-of-Order Execution (2)

[Figure: Static Program -> Fetch -> Dynamic Instruction Stream -> Rename -> Renamed Instruction Stream -> Schedule -> Dynamically Scheduled Instructions. Out-of-order = out of the original sequential order.]

Just a pictorial view of the different steps.

Superscalar != Out-of-Order

A: R1 = Load 16[R2]
B: R3 = R1 + R4
C: R6 = Load 8[R9]
D: R5 = R2 - 4
E: R7 = Load 20[R5]
F: R4 = R4 - 1
G: BEQ R4, #0

[Figure: execution timelines for A through G, with A missing in the cache, on four machines: 1-wide in-order, 2-wide in-order, 1-wide out-of-order and 2-wide out-of-order. Cycle counts shown in the figure: 10, 7, 5 and 8 cycles.]

Superscalar/in-order is not uncommon; out-of-order/single-issue is not common (but possible).

Data-Capture Scheduler

- At dispatch, instructions read all available operands from the register files and store a copy in the scheduler
- Unavailable operands will be captured from the functional unit outputs
- When ready, instructions can issue directly from the scheduler without reading additional operands from any other register files

[Figure: block diagram with Fetch & Dispatch, ARF, PRF/ROB, Data-Capture Scheduler and Functional Units, plus the bypass and physical register update paths.]

P-Pro family processors use data-capture-style schedulers.
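The data-capture idea above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the P-Pro design: the names `Operand`, `dispatch`, `capture` and `pending` are hypothetical, the register file is a plain dict, and the tag "T1" and the values are made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Operand:
    tag: Optional[str]    # producer tag to watch for, or None once captured
    value: Optional[int]  # captured copy of the operand value

    @property
    def ready(self) -> bool:
        return self.tag is None


def dispatch(srcs: List[str], regfile: Dict[str, int],
             pending: Dict[str, str]) -> List[Operand]:
    """Read available operands now; record a tag for those still in flight.

    `pending` maps a register to the tag of the in-flight instruction
    that will produce it (a stand-in for the rename state).
    """
    ops = []
    for reg in srcs:
        if reg in pending:
            ops.append(Operand(tag=pending[reg], value=None))
        else:
            ops.append(Operand(tag=None, value=regfile[reg]))
    return ops


def capture(ops: List[Operand], tag: str, value: int) -> None:
    """Capture a broadcast result into every operand waiting on `tag`."""
    for op in ops:
        if op.tag == tag:
            op.value = value
            op.tag = None


# B: R3 = R1 + R4, dispatched while A (tag "T1") is still loading R1.
regfile = {"R1": 0, "R4": 7}
entry_b = dispatch(["R1", "R4"], regfile, pending={"R1": "T1"})
capture(entry_b, "T1", 35)   # A's result arrives from the functional unit
result = entry_b[0].value + entry_b[1].value
```

Once both operands are captured, B can issue using only what is stored in its scheduler entry, which is the point of the third bullet above.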
Dispatch is usually the same as the Allocate (or just alloc) stage(s).

Non-Data-Capture Scheduler

[Figure: block diagram with Fetch & Dispatch, ARF, PRF, Scheduler and Functional Units, plus the physical register update path. More on this next lecture!]

E.g., MIPS R10000, Alpha 21264 (and others).

Components of a Scheduler

- Buffer for unexecuted instructions: the Scheduler Entries, or Issue Queue (IQ), or Reservation Stations (RS)
- Method for tracking the state of dependencies (resolved or not)
- Arbiter: method for choosing between multiple ready instructions competing for the same resource
- Method for notification of dependency resolution

The scheduler is sometimes also called the Instruction Window, but that can be a little confusing, as sometimes the instruction window refers to all instructions in flight in the processor, which can include those that have issued/executed and left the IQ/RS.

Lather, Rinse, Repeat

Scheduling Loop or Wakeup-Select Loop

- Wake-Up Part:
  - Instructions selected for execution notify dependents (children) that the dependency has been resolved
  - For each instruction, check whether all input dependencies have been resolved; if so, the instruction is woken up
- Select Part:
  - Choose which instructions get to execute
  - If 3 add instructions are ready at the same time, but there are only two adders, someone must wait to try again in the next cycle (and again, and again until selected)

Scalar Scheduler (Issue Width = 1)

[Figure: scheduler entries holding source tags (T14, T16, T39, T6, T17, ...), one comparator (=) per source operand matching against the Tag Broadcast Bus, and Select Logic feeding the execute logic.]

The boxes that turn green indicate the readiness of each of the operands.
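The wakeup-select loop for the scalar case can be modeled roughly as follows. This is a sketch under assumptions: `Entry`, `wakeup`, `select` and `run` are hypothetical names, the tags are made up, and "grant the first ready entry in list order" is an assumed selection policy, not something the slides specify.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Entry:
    name: str
    dst: str                                        # tag broadcast when selected
    waiting: Set[str] = field(default_factory=set)  # unresolved source tags
    issued: bool = False

    @property
    def ready(self) -> bool:
        return not self.waiting and not self.issued


def wakeup(entries: List[Entry], tag: str) -> None:
    """Tag broadcast: every entry compares the tag against its sources."""
    for e in entries:
        e.waiting.discard(tag)


def select(entries: List[Entry]) -> Optional[Entry]:
    """Issue width 1: grant the first ready entry (assumed policy)."""
    for e in entries:
        if e.ready:
            e.issued = True
            return e
    return None


def run(entries: List[Entry]) -> List[str]:
    """Repeat select then wakeup until nothing is ready; return issue order."""
    order = []
    while (winner := select(entries)) is not None:
        order.append(winner.name)
        wakeup(entries, winner.dst)
    return order


# B waits on A's result (T39); C waits on both A (T39) and B (T40).
entries = [
    Entry("B", dst="T40", waiting={"T39"}),
    Entry("A", dst="T39"),
    Entry("C", dst="T41", waiting={"T39", "T40"}),
]
issue_order = run(entries)
```

Even though B sits first in the buffer, A issues first because it is the only ready entry; its broadcast of T39 then wakes B, and B's broadcast of T40 wakes C.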
Superscalar Scheduler (detail of one entry)

[Figure: one scheduler entry with fields SrcL, RdyL, ValL, Issued and Dst; eight tag comparators (=) on the Tag Broadcast Buses feed the Tags/Ready Logic, which sends a bid to the Select Logic and receives grants.]

Same logic as the previous slide, shown for a single entry and for four-way issue.

Interaction with Execution

[Figure: entry A with tags D, SL and SR on the scheduling side; the Select Logic indexes the Payload RAM, whose rows hold opcode, ValL and ValR.]

D = Destination Tag, SL = Left Source Tag, SR = Right Source Tag, ValL = Left Operand Value, ValR = Right Operand Value.

The scheduler is typically broken up into a CAM-based scheduling part and a RAM-based payload part that holds the actual values (and the instruction opcode and any other information required for execution) that get sent to the actual functional units/ALUs.

Again, But Superscalar

[Figure: entries A and B selected together, each with D, SL and SR tags and Payload RAM rows holding opcode, ValL and ValR. The scheduler captures the data, hence Data-Capture.]

The animation shows the wakeup process on the left, and then illustrates how the same tag matches used in the wakeup can also be used to enable the capturing of output values on the data-path side.

Issue Width

- The maximum number of instructions selected for execution each cycle is the issue width
  - Previous slide showed an issue width of two
  - The slide before that showed the details of a scheduler entry for an issue width of four
- Hardware requirements:
  - Typically, an issue width of N requires N tag broadcast buses
  - Not always true: can specialize such that, for example, one issue slot can only handle branches

Pipeline Timing

[Figure: timing diagram for A, B and C over cycles i and i+1. Stages shown: Select, Payload, Execute; the tag broadcast drives Wakeup, and the result broadcast enables Capture on a tag match.]

Simple case with minimal pipelining; dependent instructions can execute in back-to-back cycles, but the achievable clock speed will be slow because each cycle contains too much work (i.e., select, payload read, execute, bypass and capture).

Pipelined Timing

[Figure: timing diagram for A, B and C over cycles i to i+3, with Select, Payload, Execute, Wakeup and Capture stages.]

- Can't read and write the payload RAM at the same time; may need to bypass the results
- Faster clock speed

Pipelined Timing (2)

- Previous slide placed the pipeline boundary at the writing of the ready bits
- This slide shows a pipeline where latches are placed right before the tag broadcast

[Figure: timing diagram for A and B over cycles i to i+2.]

There are a variety of factors that impact the decision of where the cycle boundary should be placed.
Some of this has to do with inserting the instructions into the RS, and whether the newly inserted instructions can be considered for scheduling immediately or have to wait until the next cycle.

More Pipelined Timing

[Figure: timing diagram for A, B and C over cycles i to i+3. C gets a tag match on its first operand, then a tag match on its second operand (now C is ready). Result broadcast and bypass. No simultaneous read/write! Need a second level of bypassing.]

More-er Pipelined Timing

[Figure: timing diagram for A, B, C and D over cycles i to i+5. A and B are both ready, only A is selected, and B bids again. A->C and C->D must be bypassed, but no bypass for B->D.]

- Dependent instructions cannot execute in back-to-back cycles!

Very aggressive pipelining, but now with a greater IPC penalty due to not being able to issue dependent instructions in back-to-back cycles. Good segue to the many research papers on aggressive and/or speculative pipelining of the scheduler (quite a few of these in ISCA/MICRO/HPCA in the early 2000s).

Critical Loops

- The Wakeup-Select Loop cannot be trivially pipelined while maintaining back-to-back execution of dependent instructions

[Figure: A, B and C under regular scheduling versus no back-to-back scheduling.]

- Worst-case IPC reduction by 1/2
  - Shouldn't be that bad (previous slide had IPC of 4/3)
  - Studies indicate 10-15% IPC penalty
- "Loose Loops Sink Chips", Borch et al.
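The cost of losing back-to-back issue can be illustrated with a toy model. This is a sketch under simplifying assumptions (unlimited issue width, no resource conflicts), and `issue_cycles` and `loop_latency` are hypothetical names: `loop_latency` is the number of cycles from a producer's select to the earliest select of its dependent.

```python
from typing import Dict, List


def issue_cycles(deps: Dict[str, List[str]], loop_latency: int) -> Dict[str, int]:
    """Earliest select cycle per instruction, ignoring resource limits.

    `deps` maps each instruction to its producers and must list
    producers before consumers. A loop_latency of 1 allows back-to-back
    issue of dependents; 2 inserts one bubble between them.
    """
    cycle: Dict[str, int] = {}
    for insn, producers in deps.items():
        cycle[insn] = max((cycle[p] + loop_latency for p in producers), default=0)
    return cycle


# A pure dependence chain A -> B -> C, the worst case for a slow loop.
chain = {"A": [], "B": ["A"], "C": ["B"]}
back_to_back = issue_cycles(chain, loop_latency=1)
pipelined = issue_cycles(chain, loop_latency=2)
```

On this chain the selects land in cycles 0, 1, 2 with a single-cycle loop and in cycles 0, 2, 4 with a two-cycle loop. Real code has independent instructions to fill the bubbles, which is why the measured penalty is much smaller than the worst case.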
IPC vs. Frequency

- A 10-15% IPC drop doesn't seem bad if we can double the clock frequency

[Figure: a 1000ps stage split into two 500ps stages. 2.0 IPC at 1GHz gives 2 BIPS; 1.7 IPC at 2GHz gives 3.4 BIPS.]

- Frequency doesn't double
  - latch/pipeline overhead
  - unbalanced stages
- Other sources of IPC penalties
  - branches: pipe depth, predictor size, predict-to-update latency
  - caches/memory: same time in seconds, so a higher frequency means more cycles
- Power limitations: more logic, higher frequency; P = C V^2 f

[Figure: stage delays labeled 900ps, 450ps, 450ps, 900ps, 350 and 550, with a resulting frequency of 1.5GHz.]

Just pointing out that the ideal performance (double the clock speed combined with a 10-15% IPC hit) is not likely achievable due to many other issues.

Select Logic

- Goal: minimize DFG height (execution time)
- NP-Hard
  - Precedence Constrained Scheduling Problem
  - Even harder because the entire DFG is not known at scheduling time
    - Scheduling decisions made now may affect the scheduling of instructions not even fetched yet
- Heuristics
  - For performance
  - For ease of implementation

The Select Logic is also sometimes called a picker. The NP-Hard part is that even if you were given the entire DFG, you still couldn't efficiently find the optimal schedule. The assumption that you can see the entire DFG is obviously false in a processor. This is also harder because instruction latencies are not constant (i.e., changing the schedule can change whether certain loads hit or miss).

Simple Select Logic

[Figure: select logic attached to the Scheduler Entries.]
- S entries yields O(S) gate delay

Grant_0 = 1
Grant_1 = !Bid_0
Grant_2 = !Bid_0 & !Bid_1
Grant_3 = !Bid_0 & !Bid_1 & !Bid_2
Grant_{n-1} = !Bid_0 & ... & !Bid_{n-2}
[Figure: serial priority chain starting from a constant 1, producing grant0, grant1, ..., grant5, ... and signals x0 through x8, where x_i = Bid_i & grant_i.]
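The grant equations for the simple select logic translate directly into a loop. This is a minimal sketch with hypothetical function names (`grants`, `select_one`); the final AND with the bid assumes the figure's x_i label means Bid_i AND grant_i.

```python
from typing import List


def grants(bids: List[bool]) -> List[bool]:
    """Grant_i is true when no lower-numbered entry is bidding.

    The running `none_above` term mirrors the serial chain of gates,
    which is why the delay grows linearly with the number of entries.
    """
    out = []
    none_above = True          # Grant_0 = 1
    for bid in bids:
        out.append(none_above)
        none_above = none_above and not bid
    return out


def select_one(bids: List[bool]) -> List[bool]:
    """Combine each grant with its bid so at most one entry is selected."""
    return [b and g for b, g in zip(bids, grants(bids))]


# Entries 1 and 3 are bidding; the lower-numbered one wins.
bids = [False, True, False, True]
g = grants(bids)
x = select_one(bids)
```

Here entry 1 is selected and entry 3 has to bid again in the next cycle, which matches the fixed-priority behavior implied by the equations.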