Exploiting Vector Parallelism in Software Pipelined Loops
Sam Larsen, Rodric Rabbah, Saman Amarasinghe
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology
Multimedia Extensions
- Short vector extensions in ILP processors
  - AltiVec, 3DNow!, SSE, etc.
- Accelerate loops in multimedia & DSP codes
- New designs have floating point support
Multimedia Extensions
- Vector resources do not overwhelm the scalar resources
  - Scalar: 2 FP ops / cycle
  - Vector: 4 FP ops / cycle
- Full vectorization may underutilize scalar resources
- ILP techniques do not target vector resources
- Need both
(Figure courtesy of International Business Machines Corporation. Unauthorized use not permitted.)
Complications
- Complex scheduling requirements
  - Particularly in statically scheduled machines
- Memory alignment
- Example assumes no communication cost
  - In reality, explicit operations required
  - Often through memory
  - Reserve critical resources
  - Potential long latency
- Performance improvement still possible
Tomcatv main loop (50%)
Tomcatv (SpecFP 95)
1.7x Speedup over Modulo Scheduling

Issue Width     6
Memory Units    2
ALUs            4
FPUs            2
Vector Units    1
Vector Length   2*

Technique                 ALU  MEM  FPU  VEC
Modulo Scheduling           6   22   46    0
Full Vectorization          7   13    0   46
Selective Vectorization     7   27   19   27
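A quick check of the 1.7x figure. It assumes the flattened digit runs in the table split as ALU/MEM/FPU/VEC = 6/22/46/0 (modulo scheduling), 7/13/0/46 (full vectorization) and 7/27/19/27 (selective vectorization), and that each entry is the load placed on that resource class, so the busiest class bounds the initiation interval. Both are readings of the slide, not statements from it, and the names `LOAD`, `bottleneck` and `speedup` are illustrative.

```python
# Per-resource load for the tomcatv loop, as read from the slide's table.
# ASSUMPTION: the digit grouping and the "busiest resource bounds the II"
# interpretation are reconstructions of the flattened slide text.
LOAD = {
    "modulo scheduling":       {"ALU": 6, "MEM": 22, "FPU": 46, "VEC": 0},
    "full vectorization":      {"ALU": 7, "MEM": 13, "FPU": 0,  "VEC": 46},
    "selective vectorization": {"ALU": 7, "MEM": 27, "FPU": 19, "VEC": 27},
}


def bottleneck(load):
    """Return (resource, load) for the most heavily used resource class."""
    resource = max(load, key=load.get)
    return resource, load[resource]


def speedup(baseline, technique):
    """Ratio of the two bottleneck loads."""
    return bottleneck(LOAD[baseline])[1] / bottleneck(LOAD[technique])[1]


for name, load in LOAD.items():
    print(name, bottleneck(load))
print(round(speedup("modulo scheduling", "selective vectorization"), 2))
```

Under this reading, full vectorization only moves the bottleneck from the FPUs to the vector unit, while selective vectorization spreads the work so that no resource class carries more than 27.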
Tomcatv (SpecFP 95)
Selective Vectorization
- Balance computation among resources
  - Minimize II when loop is modulo scheduled
- Carefully manage communication
  - Incorporate alignment information
  - Software pipelining hides latency
- Adapt a 2-cluster partitioning heuristic
  - [Fiduccia & Mattheyses 82]
  - [Kernighan & Lin 70]
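A minimal sketch of a move-based two-way partitioner in the Fiduccia-Mattheyses style, applied to a scalar/vector split. It is not the authors' implementation: the resource model (`UNITS`), the rule that a scalar op is repeated `VL` times, and the cost of a crossing dependence (`1 + VL` memory accesses) are all assumptions made for illustration.

```python
from math import ceil

VL = 2                                   # vector length, as on the slides
UNITS = {"MEM": 2, "FPU": 2, "VEC": 1}   # subset of the slide's machine


def cost(ops, edges, in_vector):
    """Resource-bound II estimate for VL iterations of the original loop."""
    load = {r: 0 for r in UNITS}
    for op in ops:
        if op in in_vector:
            load["VEC"] += 1             # one vector op covers VL iterations
        else:
            load["FPU"] += VL            # scalar op is repeated VL times
    for src, dst in edges:
        if (src in in_vector) != (dst in in_vector):
            load["MEM"] += 1 + VL        # ASSUMED cost of a crossing operand
    return max(ceil(load[r] / UNITS[r]) for r in UNITS)


def one_pass(ops, edges, in_vector):
    """Move every op once (best move first) and keep the best prefix."""
    current = set(in_vector)
    best, best_cost = set(current), cost(ops, edges, current)
    unlocked = list(ops)
    while unlocked:
        # Take the move with the lowest resulting cost, even if it is uphill.
        op = min(unlocked, key=lambda o: cost(ops, edges, current ^ {o}))
        current ^= {op}                  # flip the op to the other partition
        unlocked.remove(op)              # lock it for the rest of the pass
        c = cost(ops, edges, current)
        if c < best_cost:
            best, best_cost = set(current), c
    return best, best_cost


def partition(ops, edges):
    """Repeat passes from an all-scalar start until a pass stops improving."""
    in_vector, best_cost = set(), cost(ops, edges, set())
    while True:
        candidate, c = one_pass(ops, edges, in_vector)
        if c >= best_cost:
            return in_vector, best_cost
        in_vector, best_cost = candidate, c
```

For six independent floating-point ops, `partition(list("abcdef"), [])` places three ops in the vector partition and reaches a cost of 3, against 6 for either the all-scalar or the all-vector assignment under this toy model.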
Selective Vectorization
[Figure; labels: scalar, vector, cost]
Cost Function
- Projected II due to resources (ResMII)
  - Bin-packing approach [Rau MICRO 94]
  - With some modifications
- Can ignore operation latency
  - Software pipelining hides latency
  - Vectorizable ops not on dependence cycles
for (i=0; i
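The resource-bound cost above can be illustrated with a simplified greedy bin-packing estimate of ResMII: operations with the fewest alternatives are placed first, and each goes to the alternative that keeps the peak resource usage lowest. This is a sketch in the spirit of the cited approach; the slides do not spell out the procedure or the authors' modifications, and the unit names (`FPU0`, `MEM0`, `VEC0`) and opcode table are hypothetical.

```python
def res_mii(ops, alternatives):
    """Estimate ResMII by greedy bin-packing.

    ops          -- list of opcode names, one entry per operation
    alternatives -- opcode -> list of alternatives, each a dict
                    {resource: cycles the operation occupies it}
    Returns (estimate, per-resource usage).
    """
    usage = {}
    # Most constrained operations (fewest alternatives) are placed first.
    for op in sorted(ops, key=lambda o: len(alternatives[o])):

        def score(alt):
            trial = dict(usage)
            for res, cycles in alt.items():
                trial[res] = trial.get(res, 0) + cycles
            # Primary: overall peak. Tie-break: peak on the resources used.
            return max(trial.values()), max(trial[r] for r in alt)

        best = min(alternatives[op], key=score)
        for res, cycles in best.items():
            usage[res] = usage.get(res, 0) + cycles
    return max(usage.values(), default=0), usage


# Hypothetical opcode table loosely following the slide's machine
# (2 FPUs, 2 memory units, 1 vector unit).
ALTERNATIVES = {
    "fadd": [{"FPU0": 1}, {"FPU1": 1}],
    "load": [{"MEM0": 1}, {"MEM1": 1}],
    "vadd": [{"VEC0": 1}],
}

estimate, usage = res_mii(["fadd"] * 4 + ["load"] * 3 + ["vadd"] * 5, ALTERNATIVES)
print(estimate, usage)
```

Latency does not appear anywhere in the estimate, matching the slide's point that it can be ignored when the loop is software pipelined and the vectorizable operations are not on dependence cycles.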
Evaluation
- SUIF front-end
  - Dependence analysis
  - Dataflow optimization
- Trimaran back-end
  - Modulo scheduler
  - Register allocator
  - VLIW Simulator
  - Added vector ops
- C or Fortran source in, simulation binary out
Evaluation
- Operands communicated through memory
- Software responsible for realignment

Issue Width     6
Memory Units    2
ALUs            4
FPUs            2
Vector Units    1
Vector Length   2*
Evaluation
- SpecFP 92, 95, 2000
  - Easier to extract dependence information
  - Detectable data parallelism
  - 64-bit data means vector length of 2
  - Considered amenable to vectorization & SWP
- Apply selective vectorization to DO loops
  - No control flow, no function calls
- Fully simulate with training sets
Traditional Vectorization
Vectorization without Distribution
Vectorization + Free Communication
Vectorization without Distribution
Selective Vectorization
Selective Vectorization
[Chart; benchmarks: tomcatv, su2cor, swim, mgrid]
Communication Support
- Transfer through memory
- Register to register copy
  - Uses fewer issue slots
  - Frees memory resources
- Shared register file
  - Vector elements addressable in scalar ops
  - Requires no extra issue slots
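The three options above can be compared with a toy count of the extra operations needed to move one vector operand between the scalar and vector sides. The per-mechanism counts are assumptions chosen only to agree with the slide's qualitative claims (a register copy uses fewer issue slots than memory and no memory units; a shared register file needs nothing extra); the slides give no actual figures, and `transfer_ops` is a hypothetical helper.

```python
VL = 2  # vector length, as on the slides


def transfer_ops(mechanism, vl=VL):
    """Extra ops to move one vl-element operand across the partition.

    ASSUMED counts, for illustration only.
    """
    if mechanism == "memory":
        # e.g. one vector store followed by vl scalar loads (or the reverse)
        return {"mem_ops": 1 + vl, "other_issue_slots": 0}
    if mechanism == "reg_to_reg":
        # one explicit copy per element; no memory unit is involved
        return {"mem_ops": 0, "other_issue_slots": vl}
    if mechanism == "shared_register_file":
        # scalar ops name vector elements directly: nothing extra to issue
        return {"mem_ops": 0, "other_issue_slots": 0}
    raise ValueError(f"unknown mechanism: {mechanism}")


def issue_slots(mechanism, vl=VL):
    """Total issue slots consumed by one transfer."""
    return sum(transfer_ops(mechanism, vl).values())


for m in ("memory", "reg_to_reg", "shared_register_file"):
    print(m, transfer_ops(m), issue_slots(m))
```

Counts like these would feed the communication term of a partitioning cost function: the cheaper the transfer, the more freely operations can be split between scalar and vector units.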
Through Memory
[Chart; benchmarks: tomcatv, su2cor, swim, mgrid]
Reg to Reg Transfer Support
[Chart; benchmarks: tomcatv, su2cor, swim, mgrid]
Shared Register File
[Chart; benchmarks: tomcatv, su2cor, swim, mgrid]
Related Work
- Traditional vectorization
  - Allen & Kennedy, Wolfe
- Software Pipelining
  - Rau's iterative modulo scheduling
- Clustered VLIW
  - [Aleta MICRO34], [Codina PACT01], [Nystrom MICRO31], [Sanchez MICRO33], [Zalamea MICRO34]
  - Partitioning among clusters similar
  - Ours is also an instruction selection problem
  - No dedicated communication resources
Conclusion
- Targeting all FUs improves performance
  - Selective vectorization
- Vectorization better in the backend
  - Cost analysis more accurate
- Software pipeline vectorized loops
  - Good idea anyway
  - Facilitates selective vectorization
  - Hides communication and alignment latency
ILP techniques are instruction scheduling techniques; vectorization is a type of instruction selection. That's why we need both.
- Mention what the notation in the code means
- Mention loop distribution
- Communication between vector and scalar loops
- This is our contribution
- Example was very simple, but in reality there are complications
- Planning to make publicly available
- PACT
- Traditional never beats modulo scheduling for this architecture
- Free communication is unrealistic
- Mention theoretical maximum for this architecture
- Say what percentages mean
- Consider two other design points