CS61C: Great Ideas in Computer Architecture
Amdahl’s Law, Data-level Parallelism

Instructors: John Wawrzynek & Vladimir Stojanovic
http://inst.eecs.berkeley.edu/~cs61c/
New-School Machine Structures (It’s a bit more complicated!)
• Parallel Requests: assigned to computer, e.g., search “Katz”
• Parallel Threads: assigned to core, e.g., lookup, ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words (today’s lecture)
• Hardware descriptions: all gates @ one time
• Programming Languages
[Figure: harnessing parallelism to achieve high performance across the software/hardware stack, from warehouse-scale computer and smart phone down through computer (memory/cache, input/output), cores with instruction and functional units, and logic gates]
Using Parallelism for Performance
• Two basic ways:
– Multiprogramming
• run multiple independent programs in parallel
• “Easy”
– Parallel computing
• run one program faster
• “Hard”
• We’ll focus on parallel computing for next few lectures
Single-Instruction/Single-Data Stream (SISD)
• Sequential computer that exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines
[Figure: one processing unit fed by a single instruction stream and a single data stream]
Single-Instruction/Multiple-Data Stream (SIMD or “sim-dee”)
• A SIMD computer applies a single instruction stream to multiple data streams, for operations that may be naturally parallelized, e.g., Intel SIMD instruction extensions or NVIDIA Graphics Processing Unit (GPU)
Multiple-Instruction/Multiple-Data Streams (MIMD or “mim-dee”)
• Multiple autonomous processors simultaneously executing different instructions on different data.
– MIMD architectures include multicore and Warehouse-Scale Computers
[Figure: multiple processing units (PUs) drawing from an instruction pool and a data pool]
Multiple-Instruction/Single-Data Stream (MISD)
• Multiple-Instruction, Single-Data stream computer that exploits multiple instruction streams against a single data stream.
– Rare, mainly of historical interest only
Flynn* Taxonomy, 1966
• In 2013, SIMD and MIMD most common parallelism in architectures – usually both in same system!
• Most common parallel processing programming style: Single Program Multiple Data (“SPMD”)
– Single program that runs on all processors of a MIMD
– Cross-processor execution coordination using synchronization primitives
• SIMD (aka hw-level data parallelism): specialized function units, for handling lock-step calculations involving arrays
– Scientific computing, signal processing, multimedia (audio/video processing)

*Prof. Michael Flynn, Stanford
Big Idea: Amdahl’s (Heartbreaking) Law
• Speedup due to enhancement E is

Speedup w/E = (Exec time w/o E) / (Exec time w/E)

• Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected

Execution Time w/E = Execution Time w/o E × [(1-F) + F/S]

Speedup w/E = 1 / [(1-F) + F/S]
Big Idea: Amdahl’s Law

Speedup = 1 / [(1 - F) + F/S]

where (1 - F) is the non-speed-up part and F/S is the speed-up part

Example: the execution time of half of the program can be accelerated by a factor of 2. What is the program speed-up overall?

Speedup = 1 / (0.5 + 0.5/2) = 1 / (0.5 + 0.25) = 1.33
Example #1: Amdahl’s Law

Speedup w/E = 1 / [(1-F) + F/S]

• Consider an enhancement which runs 20 times faster but which is only usable 25% of the time
Speedup w/E = 1/(.75 + .25/20) = 1.31
• What if it’s usable only 15% of the time?
Speedup w/E = 1/(.85 + .15/20) = 1.17
• Amdahl’s Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!
• To get a speedup of 90 from 100 processors, the percentage of the original program that could be scalar would have to be 0.1% or less
Speedup w/E = 1/(.001 + .999/100) = 90.99
Speedup w/E = 1 / [(1-F) + F/S]

If the portion of the program that can be parallelized is small, then the speedup is limited.
The non-parallel portion limits the performance.
Strong and Weak Scaling
• To get good speedup on a parallel processor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
– Strong scaling: when speedup can be achieved on a parallel processor without increasing the size of the problem
– Weak scaling: when speedup is achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors
• Load balancing is another important factor: every processor doing same amount of work
– Just one unit with twice the load of others cuts speedup almost in half
Clickers/Peer Instruction

Suppose a program spends 80% of its time in a square root routine. How much must you speed up square root to make the program run 5 times faster?

Speedup w/E = 1 / [(1-F) + F/S]

A: 5   B: 16   C: 20   D: 100   E: None of the above
Administrativia
• MT2isTue,November10th:– Coverslecturematerialuptilltoday’slecture– Conflict:EmailFredorWilliambymidnighttoday
15
SIMD Architectures
• Data parallelism: executing same operation on multiple data streams
• Example to provide context:
– Multiplying a coefficient vector by a data vector (e.g., in filtering)
y[i] := c[i] × x[i], 0 ≤ i < n
• Sources of performance improvement:
– One instruction is fetched & decoded for entire operation
– Multiplications are known to be independent
– Pipelining/concurrency in memory access as well
Intel “Advanced Digital Media Boost”
• To improve performance, Intel’s SIMD instructions
– Fetch one instruction, do the work of multiple instructions

First SIMD Extensions: MIT Lincoln Labs TX-2, 1957
Intel SIMD Extensions
• MMX 64-bit registers, reusing floating-point registers [1992]
• SSE2/3/4, new 128-bit registers [1999]
• AVX, new 256-bit registers [2011]
– Space for expansion to 1024-bit registers
XMM Registers
• Architecture extended with eight 128-bit data registers: XMM registers
– x86 64-bit address architecture adds 8 additional registers (XMM8 – XMM15)
Intel Architecture SSE2+ 128-Bit SIMD Data Types
[Figure: a 128-bit register partitioned as 16 byte elements (8 bits each), 8 word elements (16 bits), 4 doubleword elements (32 bits), or 2 quadword elements (64 bits)]
• Note: in Intel Architecture (unlike MIPS) a word is 16 bits
– Single-precision FP: Doubleword (32 bits)
– Double-precision FP: Quadword (64 bits)
SSE/SSE2 Floating Point Instructions

xmm: one operand is a 128-bit SSE2 register
mem/xmm: other operand is in memory or an SSE2 register
{SS} Scalar Single precision FP: one 32-bit operand in a 128-bit register
{PS} Packed Single precision FP: four 32-bit operands in a 128-bit register
{SD} Scalar Double precision FP: one 64-bit operand in a 128-bit register
{PD} Packed Double precision FP, or two 64-bit operands in a 128-bit register
{A} 128-bit operand is aligned in memory
{U} means the 128-bit operand is unaligned in memory
{H} means move the high half of the 128-bit operand
{L} means move the low half of the 128-bit operand

Move does both load and store
Packed and Scalar Double-Precision Floating-Point Operations
[Figure: packed operations act on all elements of the 128-bit register at once; scalar operations act only on the low element]
Example: SIMD Array Processing

Pseudocode:
for each f in array
    f = sqrt(f)

Scalar style:
for each f in array {
    load f to the floating-point register
    calculate the square root
    write the result from the register to memory
}

SIMD style:
for each 4 members in array {
    load 4 members to the SSE register
    calculate 4 square roots in one operation
    store the 4 results from the register to memory
}
Data-Level Parallelism and SIMD
• SIMD wants adjacent values in memory that can be operated on in parallel
• Usually specified in programs as loops
for(i=1000; i>0; i=i-1)
    x[i] = x[i] + s;
• How can we reveal more data-level parallelism than available in a single iteration of a loop?
• Unroll loop and adjust iteration rate
Looping in MIPS
Assumptions:
- $t1 is initially the address of the element in the array with the highest address
- $f0 contains the scalar value s
- 8($t2) is the address of the last element to operate on
CODE:
Loop: 1. l.d    $f2,0($t1)    ; $f2 = array element
      2. add.d  $f10,$f2,$f0  ; add s to $f2
      3. s.d    $f10,0($t1)   ; store result
      4. addiu  $t1,$t1,#-8   ; decrement pointer 8 bytes
      5. bne    $t1,$t2,Loop  ; repeat loop if $t1 != $t2
Loop Unrolled
Loop: l.d    $f2,0($t1)
      add.d  $f10,$f2,$f0
      s.d    $f10,0($t1)
      l.d    $f4,-8($t1)
      add.d  $f12,$f4,$f0
      s.d    $f12,-8($t1)
      l.d    $f6,-16($t1)
      add.d  $f14,$f6,$f0
      s.d    $f14,-16($t1)
      l.d    $f8,-24($t1)
      add.d  $f16,$f8,$f0
      s.d    $f16,-24($t1)
      addiu  $t1,$t1,#-32
      bne    $t1,$t2,Loop
NOTE:
1. Only 1 loop overhead every 4 iterations
2. This unrolling works if loop_limit (mod 4) = 0
3. Using different registers for each iteration eliminates data hazards in pipeline
Loop Unrolled Scheduled
Loop: l.d    $f2,0($t1)
      l.d    $f4,-8($t1)
      l.d    $f6,-16($t1)
      l.d    $f8,-24($t1)
      add.d  $f10,$f2,$f0
      add.d  $f12,$f4,$f0
      add.d  $f14,$f6,$f0
      add.d  $f16,$f8,$f0
      s.d    $f10,0($t1)
      s.d    $f12,-8($t1)
      s.d    $f14,-16($t1)
      s.d    $f16,-24($t1)
      addiu  $t1,$t1,#-32
      bne    $t1,$t2,Loop

4 Loads side-by-side: could replace with 4-wide SIMD Load
4 Adds side-by-side: could replace with 4-wide SIMD Add
4 Stores side-by-side: could replace with 4-wide SIMD Store
Loop Unrolling in C
• Instead of compiler doing loop unrolling, could do it yourself in C
for(i=1000; i>0; i=i-1)
    x[i] = x[i] + s;
• Could be rewritten
for(i=1000; i>0; i=i-4) {
    x[i]   = x[i]   + s;
    x[i-1] = x[i-1] + s;
    x[i-2] = x[i-2] + s;
    x[i-3] = x[i-3] + s;
}

What is the downside of doing it in C?
Generalizing Loop Unrolling
• A loop of n iterations
• k copies of the body of the loop
• Assuming (n mod k) ≠ 0
Then we will run the loop with 1 copy of the body (n mod k) times and with k copies of the body floor(n/k) times
Example: Add Two Single-Precision Floating-Point Vectors

Computation to be performed:

vec_res.x = v1.x + v2.x;
vec_res.y = v1.y + v2.y;
vec_res.z = v1.z + v2.z;
vec_res.w = v1.w + v2.w;

SSE Instruction Sequence:
(Note: Destination on the right in x86 assembly)

movaps address-of-v1, %xmm0
// v1.w | v1.z | v1.y | v1.x -> xmm0
addps address-of-v2, %xmm0
// v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x -> xmm0
movaps %xmm0, address-of-vec_res

movaps: move from mem to XMM register, memory aligned, packed single precision
addps: add from mem to XMM register, packed single precision
movaps: move from XMM register to mem, memory aligned, packed single precision
In The News: Intel acquired Altera
• Altera is 2nd biggest FPGA maker after Xilinx
– FPGA (Field-Programmable Gate Array)
• Altera already has fabrication deal to use Intel’s 14nm technology
• Intel experimenting with FPGA next to server processor
• Microsoft to use programmable logic chips to accelerate Bing search engine
Break
• Come join CSUA Hackathon!! Form teams of 4, hack, and win awesome prizes! Prizes include monitors, headsets, and more! Delicious food will be served throughout, including some secret midnight goodies!
• Register your team day-of and write a good description of your project (1-2 sentences). Hackathon runs from 6 PM on October 30th to 2 PM on October 31st. See you there!
• Also, feel free to join CSUA's General Meeting #2 to listen and network with Jeff Atwood, our speaker, and socialize with other people from the CS community.
Intel SSE Intrinsics
• Intrinsics are C functions and procedures for inserting assembly language into C code, including SSE instructions
– With intrinsics, can program using these instructions indirectly
– One-to-one correspondence between SSE instructions and intrinsics
Example SSE Intrinsics

Intrinsics:                     Corresponding SSE instructions:
• Vector data type:
__m128d
• Load and store operations:
_mm_load_pd     MOVAPD / aligned, packed double
_mm_store_pd    MOVAPD / aligned, packed double
_mm_loadu_pd    MOVUPD / unaligned, packed double
_mm_storeu_pd   MOVUPD / unaligned, packed double
• Load and broadcast across vector:
_mm_load1_pd    MOVSD + shuffling/duplicating
• Arithmetic:
_mm_add_pd      ADDPD / add, packed double
_mm_mul_pd      MULPD / multiply, packed double
Example: 2x2 Matrix Multiply

Definition of Matrix Multiply:

Ci,j = (A×B)i,j = Σ(k=1..2) Ai,k × Bk,j

[A1,1 A1,2] × [B1,1 B1,2] = [C1,1 = A1,1B1,1 + A1,2B2,1   C1,2 = A1,1B1,2 + A1,2B2,2]
[A2,1 A2,2]   [B2,1 B2,2]   [C2,1 = A2,1B1,1 + A2,2B2,1   C2,2 = A2,1B1,2 + A2,2B2,2]

[1 0] × [1 3] = [C1,1 = 1*1 + 0*2 = 1   C1,2 = 1*3 + 0*4 = 3]
[0 1]   [2 4]   [C2,1 = 0*1 + 1*2 = 2   C2,2 = 0*3 + 1*4 = 4]
Example: 2x2 Matrix Multiply
• Using the XMM registers
– 64-bit / double precision / two doubles per XMM reg

C1 = [C1,1 | C2,1], C2 = [C1,2 | C2,2]  (stored in memory in column order)
B1 = [Bi,1 | Bi,1], B2 = [Bi,2 | Bi,2]
A  = [A1,i | A2,i]
Example: 2x2 Matrix Multiply
• Initialization
C1 = [0 | 0], C2 = [0 | 0]
• i = 1
A  = [A1,1 | A2,1]
_mm_load_pd: load 2 doubles into XMM reg; stored in memory in column order
B1 = [B1,1 | B1,1], B2 = [B1,2 | B1,2]
_mm_load1_pd: SSE instruction that loads a double word and stores it in the high and low double words of the XMM register (duplicates value in both halves of XMM)
Example: 2x2 Matrix Multiply
• First iteration intermediate result
• i = 1
C1 = [0 + A1,1B1,1 | 0 + A2,1B1,1], C2 = [0 + A1,1B1,2 | 0 + A2,1B1,2]

c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
SSE instructions first do parallel multiplies and then parallel adds in XMM registers

A  = [A1,1 | A2,1]
_mm_load_pd: stored in memory in column order
B1 = [B1,1 | B1,1], B2 = [B1,2 | B1,2]
_mm_load1_pd: loads a double word and stores it in the high and low double words of the XMM register (duplicates value in both halves of XMM)
Example: 2x2 Matrix Multiply
• First iteration intermediate result
• i = 2
C1 = [0 + A1,1B1,1 | 0 + A2,1B1,1], C2 = [0 + A1,1B1,2 | 0 + A2,1B1,2]

c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
SSE instructions first do parallel multiplies and then parallel adds in XMM registers

A  = [A1,2 | A2,2]
_mm_load_pd: stored in memory in column order
B1 = [B2,1 | B2,1], B2 = [B2,2 | B2,2]
_mm_load1_pd: loads a double word and stores it in the high and low double words of the XMM register (duplicates value in both halves of XMM)
Example: 2x2 Matrix Multiply
• Second iteration intermediate result
• i = 2
C1 = [A1,1B1,1 + A1,2B2,1 | A2,1B1,1 + A2,2B2,1] = [C1,1 | C2,1]
C2 = [A1,1B1,2 + A1,2B2,2 | A2,1B1,2 + A2,2B2,2] = [C1,2 | C2,2]

c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
SSE instructions first do parallel multiplies and then parallel adds in XMM registers

A  = [A1,2 | A2,2]
_mm_load_pd: stored in memory in column order
B1 = [B2,1 | B2,1], B2 = [B2,2 | B2,2]
_mm_load1_pd: loads a double word and stores it in the high and low double words of the XMM register (duplicates value in both halves of XMM)
Example: 2x2 Matrix Multiply (Part 1 of 2)

#include <stdio.h>
// header file for SSE compiler intrinsics
#include <emmintrin.h>

// NOTE: vector registers will be represented in comments as v1 = [a | b]
// where v1 is a variable of type __m128d and a, b are doubles

int main(void) {
  // allocate A, B, C aligned on 16-byte boundaries
  double A[4] __attribute__((aligned(16)));
  double B[4] __attribute__((aligned(16)));
  double C[4] __attribute__((aligned(16)));
  int lda = 2;
  int i = 0;
  // declare several 128-bit vector variables
  __m128d c1, c2, a, b1, b2;

  // Initialize A, B, C for example
  /* A = (note column order!)
     1 0
     0 1 */
  A[0] = 1.0; A[1] = 0.0; A[2] = 0.0; A[3] = 1.0;

  /* B = (note column order!)
     1 3
     2 4 */
  B[0] = 1.0; B[1] = 2.0; B[2] = 3.0; B[3] = 4.0;

  /* C = (note column order!)
     0 0
     0 0 */
  C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;
Example: 2x2 Matrix Multiply (Part 2 of 2)

  // use aligned loads to set
  // c1 = [c_11 | c_21]
  c1 = _mm_load_pd(C + 0*lda);
  // c2 = [c_12 | c_22]
  c2 = _mm_load_pd(C + 1*lda);

  for (i = 0; i < 2; i++) {
    /* a =
       i = 0: [a_11 | a_21]
       i = 1: [a_12 | a_22] */
    a = _mm_load_pd(A + i*lda);
    /* b1 =
       i = 0: [b_11 | b_11]
       i = 1: [b_21 | b_21] */
    b1 = _mm_load1_pd(B + i + 0*lda);
    /* b2 =
       i = 0: [b_12 | b_12]
       i = 1: [b_22 | b_22] */
    b2 = _mm_load1_pd(B + i + 1*lda);

    /* c1 =
       i = 0: [c_11 + a_11*b_11 | c_21 + a_21*b_11]
       i = 1: [c_11 + a_12*b_21 | c_21 + a_22*b_21] */
    c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
    /* c2 =
       i = 0: [c_12 + a_11*b_12 | c_22 + a_21*b_12]
       i = 1: [c_12 + a_12*b_22 | c_22 + a_22*b_22] */
    c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
  }

  // store c1, c2 back into C for completion
  _mm_store_pd(C + 0*lda, c1);
  _mm_store_pd(C + 1*lda, c2);

  // print C
  printf("%g,%g\n%g,%g\n", C[0], C[2], C[1], C[3]);
  return 0;
}
Inner loop from gcc -O -S

L2: movapd  (%rax,%rsi), %xmm1  // Load aligned A[i,i+1] -> m1
    movddup (%rdx), %xmm0       // Load B[j], duplicate -> m0
    mulpd   %xmm1, %xmm0        // Multiply m0*m1 -> m0
    addpd   %xmm0, %xmm3        // Add m0+m3 -> m3
    movddup 16(%rdx), %xmm0     // Load B[j+1], duplicate -> m0
    mulpd   %xmm0, %xmm1        // Multiply m0*m1 -> m1
    addpd   %xmm1, %xmm2        // Add m1+m2 -> m2
    addq    $16, %rax           // rax+16 -> rax (i+=2)
    addq    $8, %rdx            // rdx+8 -> rdx (j+=1)
    cmpq    $32, %rax           // rax == 32?
    jne     L2                  // jump to L2 if not equal
    movapd  %xmm3, (%rcx)       // store aligned m3 into C[k,k+1]
    movapd  %xmm2, (%rdi)       // store aligned m2 into C[l,l+1]
And in Conclusion, …
• Amdahl’s Law: Serial sections limit speedup
• Flynn Taxonomy
• Intel SSE SIMD Instructions
– Exploit data-level parallelism in loops
– One instruction fetch that operates on multiple operands simultaneously
– 128-bit XMM registers
• SSE Instructions in C
– Embed the SSE machine instructions directly into C programs through use of intrinsics
– Achieve efficiency beyond that of optimizing compiler