Chapter 4 - Data-Level Parallelism in Vector, SIMD, and GPU Architectures
CSCI/EENG 641 - W01 Computer Architecture 1
Dr. Babak Beheshti


Slides based on the PowerPoint presentations created by David Patterson as part of the instructor resources for the textbook by Hennessy & Patterson.

Flynn's Taxonomy
Classify by instruction stream and data stream:
- SISD - Single Instruction, Single Data: the conventional processor.
- SIMD - Single Instruction, Multiple Data: one instruction stream, multiple data items; several examples produced.
- MISD - Multiple Instruction, Single Data: systolic arrays.
- MIMD - Multiple Instruction, Multiple Data: multiple threads of execution; general parallel processors.

SIMD - Single Instruction Multiple Data
Originally thought to be the ultimate massively parallel machine! Some machines were built:
- Illiac IV
- Thinking Machines CM-2
- MasPar
- Vector processors (a special category!)

SIMD - Single Instruction Multiple Data
- Each PE is a simple ALU (1 bit wide in the CM-1, a small processor in some machines).
- The control processor issues the same instruction to each PE in each cycle.
- Each PE has different data.

SIMD
SIMD performance depends on mapping the problem onto the processor architecture.
- Image processing maps naturally onto a 2D processor array: calculations on individual pixels are trivial; combining the data is the problem!
- Some matrix operations map well too.

SIMD - Matrix multiplication
Each PE multiplies, then adds: PE_ij accumulates c_ij.

Note that the B matrix is transposed!

Parallel Processing - Communication patterns
- If the system provides the correct data paths, then good performance is obtained even with slow PEs. Without effective communication bandwidth, even fast PEs are starved of data!
- In a multiple-PE system we have: raw communication bandwidth, equivalent processor-memory bandwidth, communication patterns, and network topology.
- Imagine the matrix multiplication problem if the matrices are not already transposed!

Vector Processors - The Supercomputers
Optimized for vector and matrix operations.

(The conventional scalar processor section is not shown.)

Example: dot product

    y = a · b

or, in terms of the elements,

    y = Σk ak * bk

Fetch each element of each vector in turn.

Stride
The distance between successive elements of a vector; it is 1 in the dot-product case.
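To make the stride idea concrete, here is an illustrative C sketch (not from the slides; the function name and signature are assumptions): a dot product over elements spaced stride apart in memory. With stride 1 this is the ordinary dot product; walking a column of a row-major n×n matrix would use stride n.

    #include <stddef.h>

    /* Dot product of two length-n sequences whose consecutive
     * elements sit 'stride' positions apart in memory.
     * stride == 1 gives the ordinary dot product. */
    double strided_dot(const double *a, const double *b,
                       size_t n, size_t stride)
    {
        double y = 0.0;
        for (size_t k = 0; k < n; k++)
            y += a[k * stride] * b[k * stride];
        return y;
    }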

Vector Processors - Vector operations

    y = a · b
    y = Σk ak * bk

Vector Processors - Vector operations
Example: matrix multiply

    C = A · B

or, in terms of the elements,

    cij = Σk aik * bkj
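As a concrete C rendering of that formula (illustrative, not from the slides): the inner loop is a dot product of row i of A with column j of B. In row-major storage the walk down B's column has stride n, which is exactly why the PE-array version earlier keeps B transposed, so that both operands are read with stride 1.

    /* C = A * B for n x n row-major matrices:
     * c[i][j] = sum over k of a[i][k] * b[k][j].
     * b[k*n + j] walks a column (stride n); with B transposed
     * it would be bt[j*n + k], a stride-1 access like a[i*n + k]. */
    void matmul(const double *a, const double *b, double *c, int n)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += a[i*n + k] * b[k*n + j];
                c[i*n + j] = sum;
            }
    }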

Vector Operations
- Fetch data into a vector register.
- The Address Generation Unit (AGU) manages the stride.

- Very high effective bandwidth to memory: long burst accesses, with the AGU managing the addresses.

SIMD
SIMD architectures can exploit significant data-level parallelism for:
- matrix-oriented scientific computing
- media-oriented image and sound processors

SIMD is more energy-efficient than MIMD:
- It needs to fetch only one instruction per data operation.
- This makes SIMD attractive for personal mobile devices.

SIMD allows the programmer to continue to think sequentially.

SIMD Parallelism
- Vector architectures
- SIMD extensions
- Graphics Processing Units (GPUs)

For x86 processors:
- Expect two additional cores per chip per year.
- SIMD width to double every four years.
- Potential speedup from SIMD to be twice that from MIMD!

Vector Architectures
Basic idea:
- Read sets of data elements into vector registers.
- Operate on those registers.
- Disperse the results back into memory.

Registers are controlled by the compiler; they are used to hide memory latency and to leverage memory bandwidth.

VMIPS
Example architecture: VMIPS, loosely based on the Cray-1.
- Vector registers: each register holds a 64-element vector, 64 bits per element; the register file has 16 read ports and 8 write ports.
- Vector functional units: fully pipelined; data and control hazards are detected.
- Vector load-store unit: fully pipelined; one word per clock cycle after the initial latency.
- Scalar registers: 32 general-purpose registers and 32 floating-point registers.

VMIPS Instructions
- ADDVV.D: add two vectors
- ADDVS.D: add vector to a scalar
- LV/SV: vector load and vector store from an address
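A minimal C sketch of what these instructions compute, assuming VMIPS's 64-element double-precision registers (the function names are mine, for illustration only; this is the semantics, not VMIPS code):

    #define MVL 64                 /* VMIPS maximum vector length */

    typedef double vreg[MVL];      /* one 64-element, 64-bit vector register */

    /* ADDVV.D Vd,Va,Vb : element-wise vector + vector */
    void addvv_d(vreg vd, const vreg va, const vreg vb) {
        for (int i = 0; i < MVL; i++) vd[i] = va[i] + vb[i];
    }

    /* ADDVS.D Vd,Va,Fs : vector + scalar */
    void addvs_d(vreg vd, const vreg va, double fs) {
        for (int i = 0; i < MVL; i++) vd[i] = va[i] + fs;
    }

    /* LV Vd,Rx : load 64 consecutive doubles starting at address Rx */
    void lv(vreg vd, const double *rx) {
        for (int i = 0; i < MVL; i++) vd[i] = rx[i];
    }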

Example: DAXPY (Y = a × X + Y)

    L.D     F0,a        ; load scalar a
    LV      V1,Rx       ; load vector X
    MULVS.D V2,V1,F0    ; vector-scalar multiply
    LV      V3,Ry       ; load vector Y
    ADDVV.D V4,V2,V3    ; add
    SV      Ry,V4       ; store the result

Requires 6 instructions vs. almost 600 for MIPS.

Vector Execution Time
Execution time depends on three factors:
- Length of the operand vectors
- Structural hazards
- Data dependences
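For reference, the scalar loop that those six instructions replace (a sketch; the 64-element length and the names x, y, a are assumptions, following convention): the MIPS version executes this body plus an index update and a branch once per element, which is where the nearly 600 instructions come from.

    #define N 64

    /* DAXPY: Y = a*X + Y, one element at a time */
    void daxpy(double a, const double *x, double *y)
    {
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];
    }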

- VMIPS functional units consume one element per clock cycle, so execution time is approximately the vector length.

Convoy
A set of vector instructions that could potentially execute together.

Chimes
Sequences with read-after-write dependency hazards can be in the same convoy via chaining.

Chaining
Allows a vector operation to start as soon as the individual elements of its vector source operand become available.
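A rough illustration with assumed numbers (ignoring the start-up latencies discussed below): for 64-element vectors,

    without chaining: 64 cycles (MULVS.D) + 64 cycles (ADDVV.D) ≈ 128 cycles for the pair
    with chaining:    the add trails the multiply by one element ≈ 64 cycles + unit latency

This is why a chained read-after-write pair can be counted as a single convoy.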

Chime
The unit of time taken to execute one convoy; m convoys execute in m chimes. For a vector length of n, this requires m × n clock cycles.

Example

    LV      V1,Rx       ; load vector X
    MULVS.D V2,V1,F0    ; vector-scalar multiply
    LV      V3,Ry       ; load vector Y
    ADDVV.D V4,V2,V3    ; add two vectors
    SV      Ry,V4       ; store the sum

Convoys:
1. LV, MULVS.D
2. LV, ADDVV.D
3. SV

3 chimes, 2 FP ops per result, so cycles per FLOP = 1.5. For 64-element vectors this requires 64 × 3 = 192 clock cycles.

Challenges
Start-up time: the latency of the vector functional unit. Assume the same as the Cray-1:
- Floating-point add => 6 clock cycles
- Floating-point multiply => 7 clock cycles
- Floating-point divide => 20 clock cycles
- Vector load => 12 clock cycles
(A worked estimate using these latencies appears after this slide group.)

Improvements to consider:
- More than 1 element per clock cycle
- Non-64-wide vectors
- IF statements in vector code
- Memory system optimizations to support vector processors
- Multidimensional matrices
- Sparse matrices
- Programming a vector computer

Multiple Lanes
Element n of vector register A is hardwired to element n of vector register B; this allows for multiple hardware lanes.
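Picking up the start-up numbers from the Challenges slide: a worked estimate (my arithmetic, not from the slides) of how start-up latency erodes the 3-chime figure for the earlier DAXPY example. Each convoy's results are delayed by the pipeline depth of its units, so the sequence pays roughly

    start-up ≈ 12 (LV) + 7 (MULVS.D) + 12 (LV) + 6 (ADDVV.D) + 12 (SV) = 49 cycles
    total    ≈ 192 + 49 = 241 cycles for 64 elements, vs. the idealized 192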

Vector Length Register
What if the vector length is not known at compile time? Use the Vector Length Register (VLR), and use strip mining for vectors longer than the maximum vector length (MVL):

    low = 0;
    VL = (n % MVL);                       /* find odd-size piece using modulo op % */
    for (j = 0; j <= (n/MVL); j = j+1) {  /* outer loop */
        for (i = low; i < (low+VL); i=i+1)/* runs for length VL */
            Y[i] = a * X[i] + Y[i];       /* main operation */
        low = low + VL;                   /* start of next vector */
        VL = MVL;                         /* reset length to maximum vector length */
    }
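A quick worked instance of the strip-mining arithmetic (the numbers are mine, for illustration): with n = 1000 and MVL = 64,

    VL (first pass) = 1000 % 64 = 40
    full passes     = 1000 / 64 = 15 (of 64 elements each)
    total           = 40 + 15 × 64 = 1000 elements, covered in 16 passes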

Detecting and Enhancing Loop-Level Parallelism
Example 1:

    for (i=999; i>=0; i=i-1)
        x[i] = x[i] + s;

No loop-carried dependence.

Loop-Level Parallelism
Example 2:

    for (i=0; i<100; i=i+1) {
        A[i+1] = A[i] + C[i];      /* S1 */
        B[i+1] = B[i] + A[i+1];    /* S2 */
    }

S1 uses a value computed by S1 in an earlier iteration (a loop-carried dependence), and S2 uses the value A[i+1] computed by S1 in the same iteration.