
CS8803: Advanced Microarchitecture

Advanced Microarchitecture
Lecture 2: Pipelining and Superscalar Review

Speaker note: This set of notes *really* should be review for anyone taking an advanced computer architecture course!

Several slides in this section were adapted from Shen and Lipasti's book.

Pipelined Design
- Motivation: increase throughput with little increase in cost (hardware, power, complexity, etc.)
- Bandwidth or throughput = performance; BW = number of tasks per unit time
- For a system that operates on one task at a time: BW = 1 / latency
- Pipelining can increase BW if there are many repetitions of the same operation/task
- Latency per task remains the same or increases

Pipelining Illustrated
[Figure: a combinational block of N gate delays has BW ~ 1/N; splitting it into two stages of N/2 gate delays gives BW ~ 2/N; three stages of N/3 gate delays give BW ~ 3/N.]

Performance Model
- Starting from an unpipelined version with propagation delay T and BW = 1/T:

  Perf_pipe = BW_pipe = 1 / (T/k + S)

  where S = latch delay (overhead) and k = number of pipeline stages
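A minimal sketch of this throughput model (Python), using the T and S values that appear later in the k_opt plot; it evaluates BW for increasing k and shows BW approaching the 1/S limit noted below.

    # Pipelined throughput model: BW(k) = 1 / (T/k + S)
    # T = unpipelined propagation delay, S = latch overhead, k = number of stages.
    def pipelined_bw(T, S, k):
        return 1.0 / (T / k + S)

    T, S = 400.0, 22.0   # values taken from the C/P plot parameters later in the notes
    for k in (1, 2, 4, 8, 16, 64):
        print(f"k={k:3d}  BW={pipelined_bw(T, S, k):.5f}")
    print(f"limit as k grows: 1/S = {1.0 / S:.5f}")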

- The limit on BW as k grows is 1/S
[Figure: the unpipelined circuit has delay T; each stage of the k-stage pipeline has delay T/k plus latch delay S.]

Hardware Cost Model
- Starting from an unpipelined version with hardware cost G (each pipeline stage holds G/k of the logic plus a latch):

  Cost_pipe = G + kL

  where L = latch cost (including control) and k = number of pipeline stages

[Figure: the unpipelined circuit has cost G; the k-stage pipeline adds one latch of cost L per stage.]

Cost/Performance Tradeoff
- Cost/Performance: C/P = (Lk + G) / [1 / (T/k + S)] = (Lk + G)(T/k + S) = LT + GS + LSk + GT/k

- Optimal cost/performance: find the minimum of C/P with respect to the choice of k:

  d(C/P)/dk = d/dk [ (Lk + G)(T/k + S) ] = 0 + 0 + LS - GT/k^2

  Setting the derivative to zero gives k_opt = sqrt(GT / (LS))

Speaker note: Yeay! Calculus!

Optimal Pipeline Depth: k_opt

[Figure: cost/performance ratio (C/P) vs. pipeline depth k for two parameter sets: G=175, L=41, T=400, S=22 and G=175, L=21, T=400, S=117, where G = total HW cost, L = latch cost, T = unpipelined circuit latency, S = latch latency. A quick numeric sketch of k_opt for these two parameter sets appears below, after the Cost? slide.]

Cost?
- Hardware cost
  - Transistor/gate count
  - Should include the additional logic to control the pipeline
  - Area (related to gate count)
- Power!
  - More gates → more switching
  - More gates → more leakage
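A small sketch (Python) of the k_opt formula above, evaluated at the two parameter sets shown in the C/P plot; the printed values are the depths implied by those parameters, not numbers quoted from the slides.

    import math

    # k_opt = sqrt(G*T / (L*S)), the pipeline depth that minimizes C/P.
    def k_opt(G, L, T, S):
        return math.sqrt(G * T / (L * S))

    for G, L, T, S in [(175, 41, 400, 22), (175, 21, 400, 117)]:
        print(f"G={G}, L={L}, T={T}, S={S}  ->  k_opt ~= {k_opt(G, L, T, S):.1f}")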

- Many metrics to optimize
- Very difficult to determine what really is optimal

Speaker note: Depending on your course organization, it may be worthwhile to discuss the various optimal pipeline depth papers.

Pipelining Idealism
- Uniform suboperations: the operation to be pipelined can be evenly partitioned into uniform-latency suboperations
- Repetition of identical operations: the same operations are to be performed repeatedly on a large number of different inputs
- Repetition of independent operations: all the repetitions of the same operation are mutually independent, i.e., no data dependences and no resource conflicts

Speaker note: Good examples: automobile assembly line, floating-point multiplier, instruction pipeline (?)

Instruction Pipeline Design
- Uniform suboperations... NOT!
  - Balance pipeline stages: stage quantization to yield balanced stages; minimize internal fragmentation (some waiting stages)
- Identical operations... NOT!
  - Unify instruction types: coalescing instruction types into one multi-function pipe; minimize external fragmentation (some idling stages)
- Independent operations... NOT!
  - Resolve data and resource hazards: inter-instruction dependency detection and resolution; minimize performance loss

The Generic Instruction Cycle
The computation to be pipelined:
- Instruction Fetch (IF)
- Instruction Decode (ID)
- Operand(s) Fetch (OF)
- Instruction Execution (EX)
- Operand Store (OS), a.k.a. writeback (WB)
- Update Program Counter (PC)
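A toy sketch (my own illustration, not from the slides) that prints an ideal pipeline diagram for the five generic stages above, assuming one instruction enters per cycle and there are no hazards.

    # Ideal pipelining of the generic instruction cycle: with no hazards,
    # instruction i occupies stage s during cycle i + s.
    STAGES = ["IF", "ID", "OF", "EX", "OS"]

    def pipeline_diagram(num_insts):
        total_cycles = num_insts + len(STAGES) - 1
        for i in range(num_insts):
            row = ["  . " for _ in range(total_cycles)]
            for s, stage in enumerate(STAGES):
                row[i + s] = f"{stage:>3} "
            print(f"I{i+1}: " + "".join(row))

    pipeline_diagram(4)   # 4 instructions finish in 4 + 5 - 1 = 8 cycles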

The Generic Instruction Pipeline
Based on the obvious subcomputations:
- Instruction Fetch (IF)
- Instruction Decode (ID)
- Operand Fetch (OF/RF)
- Instruction Execute (EX)
- Operand Store (OS/WB)

Balancing Pipeline Stages
- Stage latencies: T_IF = 6 units, T_ID = 2 units, T_OF = 9 units, T_EX = 5 units, T_OS = 9 units
- Without pipelining: T_cyc = T_IF + T_ID + T_OF + T_EX + T_OS = 31

- Pipelined: T_cyc = max{T_IF, T_ID, T_OF, T_EX, T_OS} = 9

- Speedup = 31 / 9 ≈ 3.4
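A quick sketch of the cycle-time and speedup arithmetic above, using the stage latencies from the slide.

    # Stage latencies in arbitrary time units (from the slide).
    stage_latency = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}

    t_unpipelined = sum(stage_latency.values())   # 31: one long combinational path
    t_pipelined = max(stage_latency.values())     # 9: clock limited by the slowest stage
    print("unpipelined Tcyc:", t_unpipelined)
    print("pipelined   Tcyc:", t_pipelined)
    print("speedup: %.2f" % (t_unpipelined / t_pipelined))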

- Can we do better in terms of either performance or efficiency?

Balancing Pipeline Stages
- Two methods for stage quantization:
  - Merging multiple subcomputations into one
  - Subdividing a subcomputation into multiple smaller ones

- Recent/current trends:
  - Deeper pipelines (more and more stages), to a certain point: then the cost function takes over
  - Multiple different pipelines/subpipelines
  - Pipelining of memory accesses (tricky)

Granularity of Pipeline Stages
- Coarser-grained machine cycle: 4 machine cycles per instruction
  - Stages IF&ID, OF, EX, OS with T_IF&ID = 8 units, T_OF = 9 units, T_EX = 5 units, T_OS = 9 units (machine cycle = max = 9 units)
- Finer-grained machine cycle: 11 machine cycles per instruction
  - T_cyc = 3 units; T_IF/T_ID/T_OF/T_EX/T_OS = 6/2/9/5/9 units, so the stage sequence becomes IF IF ID OF OF OF EX EX OS OS OS

Hardware Requirements
- Logic needed for each pipeline stage
- Register file ports needed to support all (relevant) stages
- Memory accessing ports needed to support all (relevant) stages

Pipeline Examples
- MIPS R2000/R3000 (5 stages): IF, RD, ALU, MEM, WB
- AMDAHL 470V/7 (12 stages): PC GEN, Cache Read, Cache Read, Decode, Read REG, Add GEN, Cache Read, Cache Read, EX 1, EX 2, Check Result, Write Result
- Both map onto the generic IF / ID / OF / EX / OS stages

Instruction Dependencies
- Data dependence:
  - True dependence (RAW): an instruction must wait for all of its required input operands
  - Anti-dependence (WAR): a later write must not clobber a still-pending earlier read
  - Output dependence (WAW): an earlier write must not clobber an already-finished later write

- Control dependence (a.k.a. procedural dependence):
  - Conditional branches cause uncertainty in instruction sequencing
  - Instructions following a conditional branch depend on the execution of the branch instruction
  - Instructions following a computed branch depend on the execution of the branch instruction

Example: Quick Sort on MIPS

    bge   $10, $9, $36
    mul   $15, $10, 4
    addu  $24, $6, $15
    lw    $25, 0($24)
    mul   $13, $8, 4
    addu  $14, $6, $13
    lw    $15, 0($14)
    bge   $25, $15, $36
    $35:  addu  $10, $10, 1
          . . .
    $36:  addu  $11, $11, -1
          . . .
    #for (;(j
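Relating back to the dependence types above, a small sketch (my own, not from the slides) of how RAW/WAR/WAW dependences between an earlier and a later instruction can be detected from their register sets; the register names are illustrative.

    # Classify data dependences from earlier instruction A to later instruction B.
    # Each instruction is described by the register it writes and the registers it reads.
    def classify(write_a, reads_a, write_b, reads_b):
        deps = []
        if write_a is not None and write_a in reads_b:
            deps.append("RAW")   # B reads what A writes (true dependence)
        if write_b is not None and write_b in reads_a:
            deps.append("WAR")   # B writes what A still needs to read (anti-dependence)
        if write_a is not None and write_a == write_b:
            deps.append("WAW")   # both write the same register (output dependence)
        return deps

    # A: R1 = R2 + R3, then C: R1 = R1 * R4 (from the DFG example later in the lecture)
    print(classify("R1", {"R2", "R3"}, "R1", {"R1", "R4"}))   # ['RAW', 'WAW']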

Example dataflow computation:

    x = a + b;  y = b * 2;  z = (x - y) * (x + y)

ILP: Instruction-Level Parallelism
- ILP is a measure of the amount of inter-dependence between instructions
- Average ILP = number of instructions / longest path

    code1:              code2:
      r1 ← r2 + 1         r1 ← r2 + 1
      r3 ← r1 / 17        r3 ← r9 / 17
      r4 ← r0 - r3        r4 ← r0 - r10

- code1: ILP = 1 (must execute serially); T1 = 3, T∞ = 3
- code2: ILP = 3 (can execute at the same time); T1 = 3, T∞ = 1

Speaker note: The longest path is measured by the number of instructions in the path (not the number of edges).
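A small sketch computing ILP = (number of instructions) / (longest dependence chain, counted in instructions) for code1 and code2 above; the dependence edges are read off the register usage.

    # ILP = num instructions / longest path through the dependence DAG,
    # where path length is counted in instructions (nodes), not edges.
    # Edges must go from an earlier instruction index to a later one.
    def ilp(num_insts, edges):
        chain = [1] * num_insts            # longest chain ending at each instruction
        for src, dst in sorted(edges):     # sorted order respects program order here
            chain[dst] = max(chain[dst], chain[src] + 1)
        return num_insts / max(chain)

    # code1: r1<-r2+1 ; r3<-r1/17 ; r4<-r0-r3   (serial chain 0 -> 1 -> 2)
    print(ilp(3, [(0, 1), (1, 2)]))   # 1.0
    # code2: r1<-r2+1 ; r3<-r9/17 ; r4<-r0-r10  (no dependences)
    print(ilp(3, []))                 # 3.0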

ILP != IPC
- ILP usually assumes infinite resources, perfect fetch, and unit latency for all instructions; ILP is more a property of the program's dataflow
- IPC is the real observed metric of exactly how many instructions are executed per machine cycle, which includes all of the limitations of a real machine
- The ILP of a program is an upper bound on the attainable IPC

Scope of ILP Analysis

    r1  ← r2  + 1
    r3  ← r1  / 17
    r4  ← r0  - r3
    r11 ← r12 + 1
    r13 ← r19 / 17
    r14 ← r0  - r20

- Considering only the first three instructions: ILP = 1; only the last three: ILP = 3; the whole six-instruction sequence: ILP = 2

Speaker note: This is just to point out that when you talk about ILP, you need to be very clear about what part(s) of the program you're considering.

DFG Analysis

    A: R1 = R2 + R3
    B: R4 = R5 + R6
    C: R1 = R1 * R4
    D: R7 = LD 0[R1]
    E: BEQZ R7, +32
    F: R4 = R7 - 3
    G: R1 = R1 + 1
    H: R4 → ST 0[R1]
    J: R1 = R1 - 1
    K: R3 → ST 0[R1]

Speaker note: In-class example: draw out all of the dataflow graph nodes from A to K, find the longest path, compute the ILP.

In-Order Issue, Out-of-Order Completion
- Issue = send an instruction to execution
- The issue stage needs to check: 1. structural dependence, 2. RAW hazard, 3. WAW hazard, 4. WAR hazard
[Figure: an in-order instruction stream issues to multiple functional units (INT, Fadd1, Fadd2, Fmul1, Fmul2, Fmul3, Ld/St); execution begins in order, but completion is out of order.]

Example
- The A-K code above, issued in order with out-of-order completion
[Figure: cycle-by-cycle schedule; the 10 instructions complete in 8 cycles.]
- IPC = 10/8 = 1.25

Speaker note: This example is about IPC, not ILP.

Example (2)
- The same code with some dependences removed:

    A: R1 = R2 + R3
    B: R4 = R5 + R6
    C: R1 = R1 * R4
    D: R9 = LD 0[R1]
    E: BEQZ R7, +32
    F: R4 = R7 - 3
    G: R1 = R1 + 1
    H: R4 → ST 0[R9]
    J: R1 = R9 - 1
    K: R3 → ST 0[R1]

[Figure: cycle-by-cycle schedule; the 10 instructions now complete in 7 cycles.]
- IPC = 10/7 = 1.43

Track with Simple Scoreboarding
- Scoreboard: a bit array, one bit per GPR
  - If the bit is not set: the register has valid data
  - If the bit is set: the register has stale data, i.e., some outstanding instruction is going to change it
- Issue in order, for RD ← Fn(RS, RT):
  - If SB[RS] or SB[RT] is set → RAW, stall
  - If SB[RD] is set → WAW, stall
  - Else, dispatch to FU (Fn) and set SB[RD]
- Complete out of order:
  - Update GPR[RD], clear SB[RD]

Speaker note: H&P-style notation.

Out-of-Order Issue
- Need an extra stage/buffers for dependency resolution
[Figure: the in-order instruction stream dispatches into per-FU buffers (INT, Fadd1, Fadd2, Fmul1, Fmul2, Fmul3, Ld/St); execution proceeds out of program order and completion is out of order.]

OOO Scoreboarding
- Similar to in-order scoreboarding
- Needs new tables to track the status of individual instructions and functional units
- Still enforces dependencies:
  - Stall dispatch on WAW
  - Stall issue on RAW
  - Stall completion on WAR
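A minimal sketch (Python, my own illustration of the rules above, not the lecture's code) of the simple in-order-issue scoreboard: one busy bit per register, RAW/WAW stalls checked at issue, and the bit cleared at (possibly out-of-order) completion.

    # Simple scoreboard: one "pending write" bit per general-purpose register.
    class Scoreboard:
        def __init__(self, num_regs=32):
            self.pending = [False] * num_regs

        def can_issue(self, rd, rs, rt):
            if self.pending[rs] or self.pending[rt]:
                return False            # RAW hazard: a source is still being produced
            if self.pending[rd]:
                return False            # WAW hazard: an older write to rd is outstanding
            return True

        def issue(self, rd, rs, rt):
            assert self.can_issue(rd, rs, rt)
            self.pending[rd] = True     # rd holds stale data until the FU writes back

        def complete(self, rd):
            self.pending[rd] = False    # writeback done (may happen out of order)

    sb = Scoreboard()
    sb.issue(1, 2, 3)                   # R1 <- Fn(R2, R3)
    print(sb.can_issue(4, 1, 5))        # False: RAW on R1
    sb.complete(1)
    print(sb.can_issue(4, 1, 5))        # True once R1 is valid again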
