Eliminating Stalls Using Compiler Support

  • Eliminating Stalls Using Compiler Support

  • Instruction Level Parallelism
      • In gcc, about 17% of instructions are control transfers: roughly 5 instructions + 1 branch per basic block.
      • Reordering among 5 instructions may not uncover enough instruction level parallelism to eliminate all stalls.
      • To eliminate the remaining stalls we must look beyond a single block and find more instruction level parallelism.
      • Loop level parallelism is one opportunity.
      • The slides that follow illustrate this using DLX with floating point as an example.

  • FP Loop: Where are the Hazards?

        Loop: LD    F0,0(R1)    ;F0=vector element
              ADDD  F4,F0,F2    ;add scalar in F2
              SD    0(R1),F4    ;store result
              SUBI  R1,R1,8     ;decrement pointer 8B (DW)
              BNEZ  R1,Loop     ;branch R1!=zero
              NOP               ;delayed branch slot

        Instruction producing result   Instruction using result   Latency in clock cycles
        FP ALU op                      Another FP ALU op          3
        FP ALU op                      Store double               2
        Load double                    FP ALU op                  1
        Load double                    Store double               0
        Integer op                     Integer op                 0

  • FP Loop Hazards: Where are the stalls?

        Instruction producing result   Instruction using result   Latency in clock cycles
        FP ALU op                      Another FP ALU op          3
        FP ALU op                      Store double               2
        Load double                    FP ALU op                  1
        Load double                    Store double               0
        Integer op                     Integer op                 0

        Loop: LD    F0,0(R1)    ;F0=vector element
              ADDD  F4,F0,F2    ;add scalar in F2
              SD    0(R1),F4    ;store result
              SUBI  R1,R1,8     ;decrement pointer 8B (DW)
              BNEZ  R1,Loop     ;branch R1!=zero
              NOP               ;delayed branch slot

  • FP Loop Showing Stalls

        Instruction producing result   Instruction using result   Latency in clock cycles
        FP ALU op                      Another FP ALU op          3
        FP ALU op                      Store double               2
        Load double                    FP ALU op                  1

        1 Loop: LD    F0,0(R1)    ;F0=vector element
        2       stall
        3       ADDD  F4,F0,F2    ;add scalar in F2
        4       stall
        5       stall
        6       SD    0(R1),F4    ;store result
        7       SUBI  R1,R1,8     ;decrement pointer 8B (DW)
        8       BNEZ  R1,Loop     ;branch R1!=zero
        9       stall             ;delayed branch slot

    Rewrite code to minimize stalls?

  • Revised FP Loop Minimizing Stalls

        Instruction producing result   Instruction using result   Latency in clock cycles
        FP ALU op                      Another FP ALU op          3
        FP ALU op                      Store double               2
        Load double                    FP ALU op                  1

        1 Loop: LD    F0,0(R1)
        2       stall
        3       ADDD  F4,F0,F2
        4       SUBI  R1,R1,8
        5       BNEZ  R1,Loop     ;delayed branch
        6       SD    8(R1),F4    ;offset altered when moved past SUBI

    Unroll the loop 4 times to make the code faster?
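The cycle counts on the two preceding slides can be reproduced with a small issue-time model. The sketch below is an illustration, not part of the original deck: it assumes a single-issue, in-order pipeline, treats any producer/consumer pair missing from the latency table as latency 0, and looks at one loop iteration in isolation. The names `LATENCY`, `schedule`, `ORIGINAL` and `RESCHEDULED` are hypothetical.

```python
# Extra stall cycles between a producer and a dependent consumer, taken from
# the slide's latency table. Pairs not listed are ASSUMED to have latency 0.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
}


def schedule(program):
    """Return the issue cycle of each instruction (single issue, in order).

    Each instruction is a tuple (text, kind, dest_register, source_registers).
    """
    produced = {}  # register -> (issue cycle, kind) of its most recent writer
    cycles = []
    cycle = 0
    for _text, kind, dest, sources in program:
        earliest = cycle + 1  # at most one instruction issues per cycle
        for reg in sources:
            if reg in produced:
                p_cycle, p_kind = produced[reg]
                wait = LATENCY.get((p_kind, kind), 0)
                earliest = max(earliest, p_cycle + 1 + wait)
        cycle = earliest
        cycles.append(cycle)
        if dest is not None:
            produced[dest] = (cycle, kind)
    return cycles


# Original loop body; the NOP fills the branch delay slot.
ORIGINAL = [
    ("LD   F0,0(R1)", "load", "F0", ["R1"]),
    ("ADDD F4,F0,F2", "fp_alu", "F4", ["F0", "F2"]),
    ("SD   0(R1),F4", "store", None, ["R1", "F4"]),
    ("SUBI R1,R1,8", "int", "R1", ["R1"]),
    ("BNEZ R1,Loop", "branch", None, ["R1"]),
    ("NOP", "nop", None, []),
]

# Revised loop body; the store now fills the branch delay slot.
RESCHEDULED = [
    ("LD   F0,0(R1)", "load", "F0", ["R1"]),
    ("ADDD F4,F0,F2", "fp_alu", "F4", ["F0", "F2"]),
    ("SUBI R1,R1,8", "int", "R1", ["R1"]),
    ("BNEZ R1,Loop", "branch", None, ["R1"]),
    ("SD   8(R1),F4", "store", None, ["R1", "F4"]),
]
```

Under these assumptions `schedule(ORIGINAL)` yields issue cycles 1, 3, 6, 7, 8, 9 and `schedule(RESCHEDULED)` yields 1, 3, 4, 5, 6, matching the 9-cycle and 6-cycle listings on the slides.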

  • Unroll Loop Four Times

         1 Loop: LD    F0,0(R1)
         2       ADDD  F4,F0,F2
         3       SD    0(R1),F4      ;drop SUBI & BNEZ
         4       LD    F6,-8(R1)
         5       ADDD  F8,F6,F2
         6       SD    -8(R1),F8     ;drop SUBI & BNEZ
         7       LD    F10,-16(R1)
         8       ADDD  F12,F10,F2
         9       SD    -16(R1),F12   ;drop SUBI & BNEZ
        10       LD    F14,-24(R1)
        11       ADDD  F16,F14,F2
        12       SD    -24(R1),F16
        13       SUBI  R1,R1,#32     ;alter to 4*8
        14       BNEZ  R1,Loop
        15       NOP

    15 instructions + 4 x (1 load stall + 2 FP ALU stalls) = 27 clock cycles, or 6.8 per iteration.
    Assumes the number of iterations is a multiple of 4, i.e. R1 is initially a multiple of 32.

    Rewrite loop to minimize stalls?
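The slide assumes the trip count is a multiple of 4. When a compiler cannot prove that, a common approach is to follow the unrolled loop with a short cleanup loop for the leftover iterations. The sketch below shows the idea at source level for the same computation (add a scalar to every vector element, walking downward as the DLX loop does). The function name `add_scalar_unrolled` is hypothetical, and the cleanup loop is an addition that is not on the slide.

```python
def add_scalar_unrolled(x, s):
    """Return a copy of x with s added to every element, unrolled by 4."""
    out = list(x)
    i = len(out) - 1
    # Main loop: four copies of the body, one index update and one test
    # per four elements (the other three SUBI/BNEZ pairs are dropped).
    while i >= 3:
        out[i] = out[i] + s
        out[i - 1] = out[i - 1] + s
        out[i - 2] = out[i - 2] + s
        out[i - 3] = out[i - 3] + s
        i -= 4
    # Cleanup loop (not on the slide): handles the 0 to 3 elements left
    # over when the length is not a multiple of 4.
    while i >= 0:
        out[i] = out[i] + s
        i -= 1
    return out
```

For lengths that are a multiple of 4 the cleanup loop never executes, which corresponds to the case the slide considers.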

  • Unrolled Loop That Minimizes Stalls

         1 Loop: LD    F0,0(R1)
         2       LD    F6,-8(R1)
         3       LD    F10,-16(R1)
         4       LD    F14,-24(R1)
         5       ADDD  F4,F0,F2
         6       ADDD  F8,F6,F2
         7       ADDD  F12,F10,F2
         8       ADDD  F16,F14,F2
         9       SD    0(R1),F4
        10       SD    -8(R1),F8
        11       SD    -16(R1),F12
        12       SUBI  R1,R1,#32
        13       BNEZ  R1,Loop
        14       SD    8(R1),F16     ;8-32 = -24

    14 clock cycles, or 3.5 per iteration.

    What assumptions were made when the code was moved?
      • OK to move the store past SUBI even though SUBI changes the register.
      • OK to move loads before stores: do they still get the right data?
      • When is it safe for the compiler to make such changes?
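Putting the cycle counts of the four versions side by side makes the gain per vector element explicit. The arithmetic below uses only the counts given on the slides. The speedup figure is derived from them and ignores any cleanup code needed when the trip count is not a multiple of 4.

```latex
\begin{align*}
\text{original loop:} &\quad 9 \text{ cycles per element} \\
\text{rescheduled loop:} &\quad 6 \text{ cycles per element} \\
\text{unrolled 4 times, unscheduled:} &\quad \tfrac{15 + 4 \times (1 + 2)}{4} = \tfrac{27}{4} = 6.75 \text{ cycles per element} \\
\text{unrolled 4 times, scheduled:} &\quad \tfrac{14}{4} = 3.5 \text{ cycles per element} \\
\text{speedup over the original loop:} &\quad \tfrac{9}{3.5} \approx 2.57
\end{align*}
```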

  • Loop Unrolling in VLIW

  • Software Pipelining
      • Observation: if the iterations of a loop are independent, ILP can be obtained by taking instructions from different iterations.
      • Software pipelining reorganizes a loop so that each iteration of the new loop is made from instructions chosen from different iterations of the original loop (roughly, Tomasulo's algorithm in SW).
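A source-level sketch can make the reorganization concrete. It assumes the same loop as the earlier slides (add a scalar to each vector element) split into three stages: load, add, store. Each pass through the kernel stores the result for element i, adds for element i+1 and loads element i+2, with a prologue and an epilogue to fill and drain the pipeline. The function name `add_scalar_pipelined` and the ascending index order are choices made for this illustration only.

```python
def add_scalar_pipelined(x, s):
    """Return a copy of x with s added to every element, software pipelined."""
    out = list(x)
    n = len(out)
    if n < 2:
        # Too short to fill the pipeline: fall back to the plain loop.
        for i in range(n):
            out[i] = out[i] + s
        return out

    # Prologue: start iterations 0 and 1.
    loaded = out[0]        # load for iteration 0
    summed = loaded + s    # add  for iteration 0
    loaded = out[1]        # load for iteration 1

    # Kernel: each pass mixes work from three original iterations.
    for i in range(n - 2):
        out[i] = summed        # store for iteration i
        summed = loaded + s    # add   for iteration i + 1
        loaded = out[i + 2]    # load  for iteration i + 2

    # Epilogue: finish the last two iterations.
    out[n - 2] = summed
    summed = loaded + s
    out[n - 1] = summed
    return out
```

Within one kernel pass the store, add and load belong to different original iterations, so none of them waits on a value produced in the same pass. That independence is what lets a scheduler overlap them.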

  • SW Pipelining Example

  • Compile-time Analysis
      • Compiler analysis is performed to detect data dependences.
      • Further analysis is performed to identify stalls (must have knowledge of the HW).
      • Unroll loop and reorder code to eliminate stalls.

  • Compiler Perspective on Data Dependences
      • Flow dependence (RAW hazard for HW): instruction j writes a register or memory location that instruction i reads, and instruction j is executed first.
      • Anti-dependence (WAR hazard for HW): instruction j writes a register or memory location that instruction i reads, and instruction i is executed first.
      • Output dependence (WAW hazard for HW): instruction i and instruction j write the same register or memory location; the ordering between the instructions must be preserved.
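The three definitions above reduce to set intersections over what each instruction reads and writes. The sketch below is a minimal illustration for register operands only, with memory locations left out. The function name `classify_dependences` and the dictionary layout are assumptions made for this example.

```python
def classify_dependences(first, second):
    """Classify dependences of `second` on `first`, where `first` executes first.

    Each instruction is a dict with "reads" and "writes" sets of register names.
    """
    deps = set()
    if first["writes"] & second["reads"]:
        deps.add("flow (RAW)")    # second reads what first wrote
    if first["reads"] & second["writes"]:
        deps.add("anti (WAR)")    # second overwrites what first read
    if first["writes"] & second["writes"]:
        deps.add("output (WAW)")  # both write the same location
    return deps


# Instructions from the FP loop, registers only.
LD = {"reads": {"R1"}, "writes": {"F0"}}          # LD   F0,0(R1)
ADDD = {"reads": {"F0", "F2"}, "writes": {"F4"}}  # ADDD F4,F0,F2
SD = {"reads": {"R1", "F4"}, "writes": set()}     # SD   0(R1),F4
SUBI = {"reads": {"R1"}, "writes": {"R1"}}        # SUBI R1,R1,8
```

For example, `classify_dependences(LD, ADDD)` reports a flow dependence through F0, and `classify_dependences(SD, SUBI)` reports an anti-dependence through R1. The second is why the store's offset had to change when it was moved past SUBI.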

  • Dependency Analysis
      • Easy to determine for registers: by looking at the fixed register names, dependences can be easily found.
      • For memory, in some cases it is easy, but in general it can be hard.
      • From the same iteration: 0(R1) != -8(R1) != -16(R1) != -24(R1).
      • From different loop iterations: 20(R6) != 20(R6) if R6 has changed.
      • Is 100(R4) = 20(R6)? If the references are to two different arrays there is no dependence, but in general this is hard to determine.
      • Unroll the loop if instructions from different iterations are not dependent upon each other.
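The easy and hard cases above can be separated with a very conservative check. The sketch below compares two memory references written as (offset, base register) pairs. It assumes the base register is not modified between the two references and that every access is 8 bytes wide (a double). The function name `compare_refs` and its three string results are hypothetical.

```python
def compare_refs(ref_a, ref_b, size=8):
    """Compare two (offset, base_register) memory references.

    Returns "different" or "overlap" when the base register is the same
    (ASSUMED unchanged between the references), and "unknown" otherwise.
    """
    off_a, base_a = ref_a
    off_b, base_b = ref_b
    if base_a != base_b:
        # e.g. 100(R4) vs 20(R6): cannot be decided from the text alone.
        return "unknown"
    if abs(off_a - off_b) < size:
        return "overlap"
    return "different"
```

With this check, `compare_refs((0, "R1"), (-8, "R1"))` returns `"different"`, which is what allows the loads in the unrolled loop to move above the stores. `compare_refs((100, "R4"), (20, "R6"))` returns `"unknown"`, the case that needs deeper analysis, such as knowing that the two registers point into different arrays.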

  • Dependence Analysis
      • The final kind of dependence is called control dependence.
      • Example:
            if p1 { S1; }
            if p2 { S2; }
        S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
      • Strict enforcement of control dependences limits parallelism; loop unrolling eliminated the intermediate conditional branches, which is one way to overcome this limitation.

  • Summary
      • Instruction level parallelism can be uncovered by the compiler.
      • Loops are an important source of instruction level parallelism.
      • Dependency analysis is key to uncovering instruction level parallelism.