CS8803: Advanced Microarchitecture
Lecture 10: ALUs and Bypass

This Lecture: Execution Datapath
- ALUs
- Scheduler to Execution Unit interface
- Execution unit organization
- Bypass networks
- Clustering

ALUs
- ALU: Arithmetic Logic Unit
- FU: Functional Unit
- EU: Execution Unit
[Figure: a lone Adder beside an "ALU" block that contains Adder, Logic, Shift, Mult, and Div units. The ALU takes Opcode, Operand1, and Operand2 as inputs and produces Result. What's the difference?]
Implementation details, algorithms, etc. of adders, multipliers, and dividers are not covered in this course.
Notes: Just like everything else, there are lots of names for the same thing. In Intel-speak, Function Unit Block (FUB) is a generic term for any circuit block, not just those for executing instructions.

Interfacing ALUs to the Scheduler
- Issue N instructions
- Read N sets of operands, immediates, opcodes, destination tags
- Route to correct functional units
[Figure: pipeline blocks: Fetch & Dispatch, ARF, PRF/ROB, Data-Capture Scheduler, Functional Units, Physical register update, Bypass.]

Data-Capture Payload RAM
[Figure: Select Logic sends select decisions, port bindings, etc. to the Payload RAM. Each entry holds opcode, ValL, and ValR. The issue ports (Issue Port 0, 2, and 3 are labeled) connect to Issue Lanes 0 to 3 through what is effectively one nasty crossbar.]
Notes: Illustrating the ugly wiring conditions to get from the Payload RAM output ports to the Issue Lanes (Execution Ports).

Register File Organization
[Figure: read port inputs R1, R7, R3, R4 and the matching read port outputs val(R1), val(R7), val(R3), val(R4).]
Each RF read port input has a 1-to-1 correspondence with one and only one RF read port output.
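The fixed port-to-lane correspondence described here can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the names `Payload`, `read_payload`, and `alu`, and the three-opcode set, are invented for illustration and do not come from the slides. Select port i names at most one RS entry, and whatever is read on port i goes straight to issue lane i.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Payload:
    """One payload RAM entry (hypothetical layout): opcode plus two source values."""
    opcode: str
    val_l: int
    val_r: int


def read_payload(ram: List[Payload],
                 granted: List[Optional[int]]) -> List[Optional[Payload]]:
    """Read port i is driven only by select port i, so output i feeds lane i.

    granted[i] is the RS entry index picked by select port i, or None if
    that port issued nothing this cycle.
    """
    return [ram[idx] if idx is not None else None for idx in granted]


def alu(opcode: str, a: int, b: int) -> int:
    """Toy ALU: the opcode picks which internal unit produces the result."""
    if opcode == "add":
        return a + b
    if opcode == "sub":
        return a - b
    if opcode == "and":
        return a & b
    raise ValueError(f"unsupported opcode: {opcode}")


# Four RS entries; select ports 0, 2, and 3 each grant one entry this cycle.
ram = [Payload("add", 1, 2), Payload("sub", 9, 3),
       Payload("and", 6, 3), Payload("add", 7, 8)]
granted = [2, None, 0, 3]

lanes = read_payload(ram, granted)
results = [alu(p.opcode, p.val_l, p.val_r) if p is not None else None
           for p in lanes]
print(results)  # [2, None, 3, 15]
```

Because the result list is indexed by lane, there is no step that chooses among outputs: position i in `granted` fully determines position i in `lanes`.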
No MUXing of outputs is required.
[Figure: select 0 to select 3 drive the Payload RAM read ports, whose outputs go to Issue 0 to Issue 3; the structure is drawn as a Register File.]
Notes: Each select logic has a dedicated read input port on the Payload RAM, and that read comes out of a fixed output port that is hard-wired to the corresponding issue lane.

Register File Is An Overkill
[Figure: four Select blocks, SRAM Row Decoders, RS entries, Payload RAM.]
But how do you assign which set of data gets routed to which set of read port outputs?
Notes: Animation shows that the row decoder is not necessary. The grant signals can also be used for the wordline enables.

Execution Lane Select Binding
[Figure: four Select blocks above the RS entries and the Payload RAM.]
Payload RAM read port outputs are in the same order as the Select Blocks.

Single Entry Close-Up
[Figure: a single RS entry (Opcode, Src L, Src R) with bid 0 to bid 3 and grant 0 to grant 3 lines to Select Ports 0 to 3. Tri-state drivers connect the entry to output buses that run to all payload RAM entries.]
- One RS entry can only bid on one select port, so its payload is never driven to more than one port.
- Each select port only gives the grant to a single RS entry, so more than one payload entry can never drive the same payload output port.

Need to Swizzle at the End
[Figure: Opcode Silo, Src L Silo, Src R Silo.]
Nasty tangle of wires (Srcs are 64-128 bits each!)
Notes: At least there's no logic/muxing involved in the tangle of wires.

Non-Data-Capture Scheduler
[Figure: four Select blocks, RS entries, and a Payload RAM holding Src L tags and Src R tags, which feed the Register File Row Decoders and the Register File SRAM Array.]
Notes: Extra step of reading from the physical register file.

Immediate Values
- Data-capture can store immediate values in the payload bay
- Non-DC needs separate storage
  - Could add an extra field to the payload
  - Could allocate a physical register and store the immediate there
  - Could store in a separate immediate file
Notes: My best guess is that it's just stored in the payload RAM.

Distributed Scheduler
[Figure: Select 0 to Select 3 over a Payload RAM, with functional units FAdd, FM/D, ALU1, ALU2, M/D, Store, Shift, Load, FP-Ld, FP-St.]
- Grant/Payload read lines may have to travel further horizontally (multiple RS widths)
- Schedule-to-Execute latency is less critical than Schedule-to-Schedule (wakeup-select) loop latency

Naive ALU Organization
[Figure: add, shift, mult, div, load, store, Fadd, FMul, FDiv, all fed from the Payload/RF read ports.]
Besides making scheduling hard to scale, arbitrary any-issue-to-any-ALU makes operand routing a horrible mess (needs a full crossbar).

Execution-Port-Based Layout
[Figure: Lanes 0 to 3, with FUs add, add, shift, mult, div, store, load, FP ld, FPCvt grouped by lane.]
- Just need to fan out data to FUs within the same execution lane; no crossbar needed
- Each FU needs a valid input to know that the incoming data is meant for it and not another FU in the same lane
- Or just let them all compute in parallel and use only the output that you want (wasted power)

Bypass Network Organization
[Figure: add, shift, mult, div fed from the Payload RAM/Register File; bypass buses labeled f x 64 bits; N x 2 sets of inputs.]
N = Issue Width, f = Num FUs
O(f^2 N) area just for the bypass wiring!
which is cubic, since f = Ω(N).
The previous slide had f = 9 FUs, and that didn't even include all of the FP units.
Notes: f = Ω(N) since if you have an issue width of N, you're going to need at least N FUs. Like other asymptotic analysis of area (e.g., for the scheduler), this is layout dependent.

ALU Stacks
[Figure: FUs add, add, shift, mult, div, store, load, FP ld, FPCvt, FP st, Fadd, Fmul, Fdiv arranged in stacks fed from Payload/RF, with an Integer Bypass, a Floating Point Bypass, and Bypass FU Fan-Out.]
Bypass MUXes reduced to one pair per ALU stack (as opposed to one per FU).
Notes: Last slide had 4 FUs with 4 sets of input muxes. This slide has 13 FUs but only 6 sets of input muxes.

Bypass Sharing
[Figure: the same ALU stacks, now with a Local FU Output per stack feeding the Integer Bypass and Floating Point Bypass.]
Bypass wiring reduced to one output per execution lane/ALU stack.
Notes: Last slide had one bypass path per FU (7 on the integer side, 5 on the FP side, with FP-convert bypassed to both sides). This slide only has four on the integer side and three on the FP side (basically one per execution port/ALU stack, with the special case again for FP-convert).

Bypass Sharing (2)
- If all FUs in a stack have the same latency, writeback conflicts are impossible, because only one instruction can issue to each lane per cycle
- But not all FUs have the same latency:
[Figure: timing diagram for a lane with add, shift, and load units. A 2-cycle shift to Lane 1 runs S X X E1 E2; a 1-cycle add to Lane 1 runs S X X E. Two instructions want to write back using the same bypass path!]

Bypass Sharing (3)
- How to resolve this structural hazard?
- Obvious solution: stall
  - Creates scheduling headaches
- Treat bypass/WB as another structural resource
  - Separate select logic* for bypass allocation
[Figure: timing diagram over cycles 0 to 6 for the 1-cycle add and the 2-cycle shift to Lane 1, with a Writeback Scoreboard row marking which cycle each result goes to the bypass.]
*Not the same as regular select logic, just a table read/write.
Notes: "Scheduling headaches" means replays or other latency-misprediction issues. Forcing the scheduling logic to pre-allocate WB bandwidth was alluded to earlier.

Bypass Sharing (4)
[Figure: timing diagram over cycles 0 to 8 for A: 2-cycle shift, to Lane 1; B: 1-cycle add, to Lane 1; C: 3-cycle load, to Lane 1, with the Select decisions and the writeback scoreboard.]
Wasted issue opportunity:
- B picked by select, but cannot issue due to WB conflict
- C could have issued, but is stalled by one cycle

Bypass Critical Path
[Figure: the full row of ALU stacks, with the bypass wire running around them.]
Total wire length is about twice the total width plus twice the total height.
Notes: Main point is just that the wires are *long*.

Bypass Critical Path (2)
[Figure: ALU stacks with add, add, shift, mult, div, store, load, FP ld, FPCvt.]
- Each execution lane/ALU stack is self-contained
- Longest path only crosses the total width once
Notes: Showing that you don't have to bypass all the way to the left side of the picture first before wrapping around (although it may be easier to draw or think about it that way).

Bypass Control Problem
- We now have the datapaths to forward values between ALUs/FUs
- How do we orchestrate what goes where and when?
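Looking back at the Bypass Sharing slides, the idea of treating the shared bypass/WB path as a structural resource can be sketched in Python. This is a rough model, not the lecture's design: the class name `WritebackScoreboard`, the method `try_issue`, and the simplification that every instruction has the same fixed delay between select and execute (so only the FU latency matters) are all assumptions made for illustration.

```python
from collections import defaultdict


class WritebackScoreboard:
    """Tracks which future cycles already have a writeback booked, per lane."""

    def __init__(self) -> None:
        self._booked = defaultdict(set)  # lane -> set of reserved WB cycles

    def try_issue(self, lane: int, select_cycle: int, latency: int) -> bool:
        """Reserve the lane's bypass/WB slot; return False on a conflict.

        Assumption: the select-to-execute delay is the same for every
        instruction, so it is left out and the writeback cycle is modeled
        as select_cycle + latency.
        """
        wb_cycle = select_cycle + latency
        if wb_cycle in self._booked[lane]:
            return False  # another instruction already owns this slot
        self._booked[lane].add(wb_cycle)
        return True


sb = WritebackScoreboard()
shift_ok = sb.try_issue(lane=1, select_cycle=0, latency=2)  # WB in cycle 2
add_ok = sb.try_issue(lane=1, select_cycle=1, latency=1)    # also cycle 2
retry_ok = sb.try_issue(lane=1, select_cycle=2, latency=1)  # WB in cycle 3
print(shift_ok, add_ok, retry_ok)  # True False True
```

The rejected add mirrors the wasted issue opportunity in Bypass Sharing (4): the instruction is picked, finds its writeback slot taken, and has to try again a cycle later. As the slide footnote says, this is just a table read/write rather than a second copy of the select logic.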
- In particular, how do we set the controls of each of the bypass MUXes on a cycle-by-cycle basis?

Scoreboarding
- For each value produced, make note (in the scoreboard) of where it will be available
- For each source, consult the scoreboard to find out how to rendezvous
[Figure: pipeline timing (S X X E stages, cycles 0 to 7) for three instructions: Port 1: ADD P21 = (sources not legible); Port 0: ADD P17 = P21 + P4; Port 2: MUL P30 = P21 * P17. Beside it, a scoreboard whose entries pair a physical register with a location.]
Legend: R = use value delivered from payload or register file; # = bypass path number.
Notes: The entry 17:0 means that for physical register P17, you should read it off of bypass path #0; 21:R means that for P21, you can just read it from the PRF.

Scoreboarding (2)
- Setting bypass controls is easy
  - Read where the value will come from and feed it to the bypass MUXes in the operand read stage
[Figure: Payload (src tags) P21 and P4 index the WB Scoreboard, which returns R and 1 to steer the bypass MUXes in front of the add unit.]
- May add schedule-to-execute stages for a data-capture scheduler
  - Why not for non-data-capture?
Notes: Answer: the WB scoreboard can be read in parallel with the PRF access.

Scoreboarding (3)
- Updating can be more complicated
  - Depends on when the SB read occurs w.r.t. operand reading
  - Earlier reads cause more disconnect
[Figure: timing diagram for A (S X X E1 E2 E3), B (S X X E), and C (S X X E). One consumer gets the value bypassed as it is written back to the RF; the other reads the value from the RF.]
- Assume SB read in the 1st cycle after schedule
- A needs to update the SB this cycle for C to correctly source its operand

Scoreboarding (4)
- Scoreboard can become a critical timing bottleneck
  - All sources must read from the scoreboard
  - All destinations must update the scoreboard
    - Once at schedule to indicate bypass location
    - Once later to indicate the value has been written back to the RF
  - ~4N ports for the scoreboard!
- If the scoreboard becomes multi-cycle, things can get really crazy
  - Need to bypass scoreboard reads/writes, like inter-group rename bypassing
Notes: So this argues against using a scoreboarding mechanism to orchestrate the bypass controls.

CAM-based Bypass
Extend the data-capture concept to the bypass network.
[Figure: a source operand holds a Register Tag and a Register Value from Payload/RF. Four "=" comparators match the tag against the Result Tag of Lane 0 to Lane 3, producing Use Lane 0 to Use Lane 3, or Use PL/RF when nothing matches, to select the Result Value.]

CAM-based Bypass (2)
Must