Exploiting Vector Parallelism in Software Pipelined Loops

Multimedia Extensions
  Short vector extensions in ILP processors
    AltiVec, 3DNow!, SSE, etc.
  Accelerate loops in multimedia & DSP codes
  New designs have floating point support

Multimedia Extensions
  Vector resources do not overwhelm the scalar resources
    Scalar: 2 FP ops / cycle
    Vector: 4 FP ops / cycle
  Full vectorization may underutilize scalar resources
  ILP techniques do not target vector resources
  Need both
Courtesy of International Business Machines Corporation. Unauthorized use not permitted.

Modulo Scheduling
for (i=0; i

Traditional Vectorization
for (i=0; i

Vectorization without Distribution
for (i=0; i

Selective Vectorization
for (i=0; i
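
The loop on the slide is cut off in this copy, so the following is not the presentation's example. As a minimal sketch of the idea, assume a hypothetical loop c[i] = (a[i] + b[i]) - a[i] * b[i] and a vector length of 2. The additions go to the vector unit, the multiplications stay on the scalar FPUs, and the subtraction reads the vector lanes from scalar code. The `vec2` type and its helpers are stand-ins for real SIMD registers and operations, not any particular instruction set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 2-wide vector type standing in for a SIMD register. */
typedef struct { double lane[2]; } vec2;

static vec2 vload(const double *p) {
    vec2 v = {{p[0], p[1]}};
    return v;
}

static vec2 vadd(vec2 x, vec2 y) {
    vec2 r = {{x.lane[0] + y.lane[0], x.lane[1] + y.lane[1]}};
    return r;
}

/* Reference version: every operation is scalar. */
void scalar_loop(const double *a, const double *b, double *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = (a[i] + b[i]) - a[i] * b[i];
}

/* Selectively vectorized version: one kernel covers two iterations. */
void selective_loop(const double *a, const double *b, double *c, size_t n) {
    assert(n % 2 == 0);  /* sketch only: no cleanup loop for odd n */
    for (size_t i = 0; i < n; i += 2) {
        /* Assigned to the vector unit: one add for both iterations. */
        vec2 sum = vadd(vload(a + i), vload(b + i));
        /* Left on the scalar FPUs: one multiply per iteration. */
        double p0 = a[i] * b[i];
        double p1 = a[i + 1] * b[i + 1];
        /* Vector-to-scalar communication: scalar ops read the lanes. */
        c[i]     = sum.lane[0] - p0;
        c[i + 1] = sum.lane[1] - p1;
    }
}
```

Reading `sum.lane[k]` is free in this emulation. On a real machine that step is the explicit transfer between vector and scalar registers.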

Complications
  Complex scheduling requirements
    Particularly in statically scheduled machines
  Memory alignment
  Example assumes no communication cost
    In reality, explicit operations required
    Often through memory
    Reserve critical resources
    Potential long latency
  Performance improvement still possible

Tomcatv main loop (50%)

Tomcatv (SpecFP 95)
1.7x Speedup over Modulo Scheduling

Issue Width: 6
Memory Units: 2
ALUs: 4
FPUs: 2
Vector Units: 1
Vector Length: 2*

Technique                ALU  MEM  FPU  VEC
Modulo Scheduling          6   22   46    0
Full Vectorization         7   13    0   46
Selective Vectorization    7   27   19   27
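
The resource pressure behind these counts can be checked with a small sketch. It assumes the table reads as ALU/MEM/FPU/VEC operation counts of 6/22/46/0 for modulo scheduling, 7/13/0/46 for full vectorization, and 7/27/19/27 for selective vectorization, and that the machine has 4 ALUs, 2 memory units, 2 FPUs and 1 vector unit. The function name `res_mii` is illustrative, and the issue-width limit is left out.

```c
#include <assert.h>

/* Ceiling division; the unit count must be positive. */
static int ceil_div(int ops, int units) {
    assert(units > 0);
    return (ops + units - 1) / units;
}

/* Resource-constrained lower bound on the initiation interval (II):
   the most heavily loaded resource class decides.
   Index order (assumed): 0=ALU, 1=MEM, 2=FPU, 3=VEC. */
int res_mii(const int ops[4], const int units[4]) {
    int ii = 0;
    for (int k = 0; k < 4; k++) {
        int need = ceil_div(ops[k], units[k]);
        if (need > ii)
            ii = need;
    }
    return ii;
}
```

Under those assumptions the bounds are 23, 46 and 27 cycles. If each vectorized kernel covers two original iterations (vector length 2), that is 23, 23 and 13.5 cycles per iteration, and 23 / 13.5 is about 1.7, in line with the speedup quoted on the slide.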

Tomcatv (SpecFP 95)

Selective Vectorization
  Balance computation among resources
    Minimize II when loop is modulo scheduled
  Carefully manage communication
    Incorporate alignment information
    Software pipelining hides latency
  Adapt a 2-cluster partitioning heuristic
    [Fiduccia & Mattheyses 82]
    [Kernighan & Lin 70]
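
A move-and-lock partitioner in the spirit of the cited heuristics can be sketched as follows. This is not the authors' algorithm: it models only floating-point operations, ignores communication and alignment costs, and assumes 2 FPUs, 1 vector unit and a vector length of 2. All names (`partition`, `cost`, `MAX_OPS`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum { MAX_OPS = 64, FPUS = 2, VEC_UNITS = 1, VL = 2 };

/* Projected II of a kernel that covers VL original iterations. */
static int cost(const bool in_vec[], int n) {
    int scalar_slots = 0, vector_slots = 0;
    for (int i = 0; i < n; i++) {
        if (in_vec[i]) vector_slots += 1;  /* one vector op covers VL iterations */
        else scalar_slots += VL;           /* scalar op repeated per iteration */
    }
    int s = (scalar_slots + FPUS - 1) / FPUS;
    int v = (vector_slots + VEC_UNITS - 1) / VEC_UNITS;
    return s > v ? s : v;
}

/* Repeats move-and-lock passes until a pass brings no improvement.
   Returns the best cost found and leaves in_vec at that configuration. */
int partition(const bool vectorizable[], bool in_vec[], int n) {
    assert(n > 0 && n <= MAX_OPS);
    for (int i = 0; i < n; i++) in_vec[i] = false;  /* start all scalar */
    int best = cost(in_vec, n);
    bool improved = true;
    while (improved) {
        improved = false;
        bool locked[MAX_OPS] = {false};
        bool best_cfg[MAX_OPS];
        for (int i = 0; i < n; i++) best_cfg[i] = in_vec[i];
        for (int step = 0; step < n; step++) {
            int pick = -1, pick_cost = 0;
            /* Try moving each unlocked op to the other side. */
            for (int i = 0; i < n; i++) {
                if (locked[i] || !vectorizable[i]) continue;
                in_vec[i] = !in_vec[i];
                int c = cost(in_vec, n);
                in_vec[i] = !in_vec[i];
                if (pick < 0 || c < pick_cost) { pick = i; pick_cost = c; }
            }
            if (pick < 0) break;
            /* Apply the best move even if it is uphill, then lock it. */
            in_vec[pick] = !in_vec[pick];
            locked[pick] = true;
            if (pick_cost < best) {
                best = pick_cost;
                improved = true;
                for (int i = 0; i < n; i++) best_cfg[i] = in_vec[i];
            }
        }
        /* Roll back to the best configuration seen in this pass. */
        for (int i = 0; i < n; i++) in_vec[i] = best_cfg[i];
    }
    return best;
}
```

For six vectorizable operations this model costs 6 cycles when everything is scalar and 6 when everything is vector. The partitioner settles on three operations per side for a cost of 3.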

Selective Vectorization
[Figure labels: scalar, vector, cost]

Cost Function
  Projected II due to resources (ResMII)
    Bin-packing approach [Rau MICRO 94]
    With some modifications
  Can ignore operation latency
    Software pipelining hides latency
    Vectorizable ops not on dependence cycles

for (i=0; i
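
A greedy bin-packing estimate of ResMII can be sketched as below. It follows the general idea of packing the least flexible operations first and placing each on its least loaded alternative. The slide's "modifications" are not described in this copy and are not modeled. The `Op` struct, the class numbering and the bitmask encoding are assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

enum { NUM_CLASSES = 4 };  /* assumed classes: 0=ALU, 1=MEM, 2=FPU, 3=VEC */

typedef struct {
    unsigned alternatives;  /* bitmask of classes that can execute the op */
} Op;

static int count_bits(unsigned m) {
    int c = 0;
    while (m) { c += (int)(m & 1u); m >>= 1; }
    return c;
}

/* qsort comparator: ops with fewer alternatives are packed first. */
static int by_flexibility(const void *x, const void *y) {
    return count_bits(((const Op *)x)->alternatives)
         - count_bits(((const Op *)y)->alternatives);
}

/* Greedy bin-packing estimate of ResMII. Sorts ops in place. */
int res_mii_binpack(Op *ops, size_t n, const int units[NUM_CLASSES]) {
    int usage[NUM_CLASSES] = {0};
    qsort(ops, n, sizeof *ops, by_flexibility);
    for (size_t i = 0; i < n; i++) {
        int best = -1;
        for (int k = 0; k < NUM_CLASSES; k++) {
            if (!(ops[i].alternatives & (1u << k)) || units[k] <= 0)
                continue;
            /* Prefer the class with the smaller load after adding the op:
               (usage + 1) / units, compared by cross-multiplication. */
            if (best < 0 ||
                (usage[k] + 1) * units[best] < (usage[best] + 1) * units[k])
                best = k;
        }
        assert(best >= 0);  /* every op needs at least one usable class */
        usage[best]++;
    }
    int ii = 0;
    for (int k = 0; k < NUM_CLASSES; k++) {
        if (units[k] <= 0) continue;
        int need = (usage[k] + units[k] - 1) / units[k];
        if (need > ii) ii = need;
    }
    return ii;
}
```

With 2 FPUs and 1 vector unit, four FPU-only operations plus three that may run on either class pack to an estimate of 3 cycles. Forcing all seven onto the FPUs would give 4.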

Evaluation
  SUIF front-end
    Dependence analysis
    Dataflow optimization
  Trimaran back-end
    Modulo scheduler
    Register allocator
    VLIW Simulator
    Added vector ops
  Simulation Binary
  C or Fortran

Evaluation
  Operands communicated through memory
  Software responsible for realignment

Issue Width: 6
Memory Units: 2
ALUs: 4
FPUs: 2
Vector Units: 1
Vector Length: 2*

Evaluation
  SpecFP 92, 95, 2000
    Easier to extract dependence information
    Detectable data parallelism
    64-bit data means vector length of 2
    Considered amenable to vectorization & SWP
  Apply selective vectorization to DO loops
    No control flow, no function calls
  Fully simulate with training sets

Traditional Vectorization

Vectorization without Distribution

Vectorization + Free Communication

Vectorization without Distribution

Selective Vectorization

Selective Vectorization
[Chart: tomcatv, su2cor, swim, mgrid]

Communication Support
  Transfer through memory
  Register to register copy
    Uses fewer issue slots
    Frees memory resources
  Shared register file
    Vector elements addressable in scalar ops
    Requires no extra issue slots

Through Memory
[Chart: tomcatv, su2cor, swim, mgrid]

Reg to Reg Transfer Support
[Chart: tomcatv, su2cor, swim, mgrid]

Shared Register File
[Chart: tomcatv, su2cor, swim, mgrid]

Related Work
  Traditional vectorization
    Allen & Kennedy, Wolfe
  Software Pipelining
    Rau's iterative modulo scheduling
  Clustered VLIW
    [Aleta MICRO34], [Codina PACT01], [Nystrom MICRO31], [Sanchez MICRO33], [Zalamea MICRO34]
    Partitioning among clusters similar
    Ours is also an instruction selection problem
    No dedicated communication resources

Conclusion
  Targeting all FUs improves performance
    Selective vectorization
  Vectorization better in the backend
    Cost analysis more accurate
  Software pipeline vectorized loops
    Good idea anyway
    Facilitates selective vectorization
    Hides communication and alignment latency

ILP techniques are instruction scheduling techniques; vectorization is a type of instruction selection. That's why we need both.

Mention what the notation in the code means
Mention loop distribution
Communication between vector and scalar loops
This is our contribution
Example was very simple, but in reality there are complications
Planning to make publicly available
PACT
Traditional never beats modulo scheduling for this architecture
Free communication is unrealistic
Mention theoretical maximum for this architecture
Say what percentages mean
Consider two other design points
