CS61C: Great Ideas in Computer Architecture
Amdahl’s Law, Data-level Parallelism

Instructors: John Wawrzynek & Vladimir Stojanovic
http://inst.eecs.berkeley.edu/~cs61c/
New-School Machine Structures (It’s a bit more complicated!)
• Parallel Requests: assigned to computer, e.g., search “Katz”
• Parallel Threads: assigned to core, e.g., lookup, ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words (today’s lecture)
• Hardware descriptions: all gates @ one time
• Programming Languages
[Figure: harnessing parallelism to achieve high performance across the software/hardware stack, from warehouse-scale computer and smart phone down through computer (memory/cache, input/output), cores with instruction and functional units, and logic gates]
Using Parallelism for Performance
• Two basic ways:
– Multiprogramming
• run multiple independent programs in parallel
• “Easy”
– Parallel computing
• run one program faster
• “Hard”
• We’ll focus on parallel computing for next few lectures
Single-Instruction/Single-Data Stream (SISD)
• Sequential computer that exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines
[Figure: one processing unit fed by a single instruction stream and a single data stream]
Single-Instruction/Multiple-Data Stream (SIMD or “sim-dee”)
• A SIMD computer applies a single instruction stream to multiple data streams, for operations that may be naturally parallelized, e.g., Intel SIMD instruction extensions or NVIDIA Graphics Processing Unit (GPU)
Multiple-Instruction/Multiple-Data Streams (MIMD or “mim-dee”)
• Multiple autonomous processors simultaneously executing different instructions on different data.
– MIMD architectures include multicore and Warehouse-Scale Computers
[Figure: multiple processing units (PUs) drawing from an instruction pool and a data pool]
Multiple-Instruction/Single-Data Stream (MISD)
• Multiple-Instruction, Single-Data stream computer that exploits multiple instruction streams against a single data stream.
– Rare, mainly of historical interest only
Flynn* Taxonomy, 1966
• In 2013, SIMD and MIMD most common parallelism in architectures – usually both in same system!
• Most common parallel processing programming style: Single Program Multiple Data (“SPMD”)
– Single program that runs on all processors of a MIMD
– Cross-processor execution coordination using synchronization primitives
• SIMD (aka hw-level data parallelism): specialized function units, for handling lock-step calculations involving arrays
– Scientific computing, signal processing, multimedia (audio/video processing)

*Prof. Michael Flynn, Stanford
Big Idea: Amdahl’s (Heartbreaking) Law
• Speedup due to enhancement E is

Speedup w/E = (Exec time w/o E) / (Exec time w/E)

• Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected

Execution Time w/E = Execution Time w/o E × [(1-F) + F/S]

Speedup w/E = 1 / [(1-F) + F/S]
Big Idea: Amdahl’s Law

Speedup = 1 / [(1 - F) + F/S]

where (1 - F) is the non-speed-up part and F/S is the speed-up part

Example: the execution time of half of the program can be accelerated by a factor of 2. What is the program speed-up overall?

Speedup = 1 / (0.5 + 0.5/2) = 1 / (0.5 + 0.25) = 1.33
Example #1: Amdahl’s Law

Speedup w/E = 1 / [(1-F) + F/S]

• Consider an enhancement which runs 20 times faster but which is only usable 25% of the time
Speedup w/E = 1/(.75 + .25/20) = 1.31
• What if it’s usable only 15% of the time?
Speedup w/E = 1/(.85 + .15/20) = 1.17
• Amdahl’s Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!
• To get a speedup of 90 from 100 processors, the percentage of the original program that could be scalar would have to be 0.1% or less
Speedup w/E = 1/(.001 + .999/100) = 90.99
Speedup w/E = 1 / [(1-F) + F/S]

If the portion of the program that can be parallelized is small, then the speedup is limited.
The non-parallel portion limits the performance.
Strong and Weak Scaling
• To get good speedup on a parallel processor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
– Strong scaling: when speedup can be achieved on a parallel processor without increasing the size of the problem
– Weak scaling: when speedup is achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors
• Load balancing is another important factor: every processor doing same amount of work
– Just one unit with twice the load of others cuts speedup almost in half
Clickers/Peer Instruction

Suppose a program spends 80% of its time in a square root routine. How much must you speed up square root to make the program run 5 times faster?

Speedup w/E = 1 / [(1-F) + F/S]

A: 5   B: 16   C: 20   D: 100   E: None of the above
Administrativia
• MT2isTue,November10th:– Coverslecturematerialuptilltoday’slecture– Conflict:EmailFredorWilliambymidnighttoday
15
SIMD Architectures
• Data parallelism: executing same operation on multiple data streams
• Example to provide context:
– Multiplying a coefficient vector by a data vector (e.g., in filtering)
y[i] := c[i] × x[i], 0 ≤ i < n
• Sources of performance improvement:
– One instruction is fetched & decoded for entire operation
– Multiplications are known to be independent
– Pipelining/concurrency in memory access as well
Intel “Advanced Digital Media Boost”
• To improve performance, Intel’s SIMD instructions
– Fetch one instruction, do the work of multiple instructions

First SIMD Extensions: MIT Lincoln Labs TX-2, 1957
Intel SIMD Extensions
• MMX 64-bit registers, reusing floating-point registers [1992]
• SSE2/3/4, new 128-bit registers [1999]
• AVX, new 256-bit registers [2011]
– Space for expansion to 1024-bit registers
XMM Registers
• Architecture extended with eight 128-bit data registers: XMM registers
– x86 64-bit address architecture adds 8 additional registers (XMM8 – XMM15)
Intel Architecture SSE2+ 128-Bit SIMD Data Types
[Figure: a 128-bit register partitioned as 16 byte elements (8 bits each), 8 word elements (16 bits), 4 doubleword elements (32 bits), or 2 quadword elements (64 bits)]
• Note: in Intel Architecture (unlike MIPS) a word is 16 bits
– Single-precision FP: Doubleword (32 bits)
– Double-precision FP: Quadword (64 bits)
SSE/SSE2 Floating Point Instructions

xmm: one operand is a 128-bit SSE2 register
mem/xmm: other operand is in memory or an SSE2 register
{SS} Scalar Single precision FP: one 32-bit operand in a 128-bit register
{PS} Packed Single precision FP: four 32-bit operands in a 128-bit register
{SD} Scalar Double precision FP: one 64-bit operand in a 128-bit register
{PD} Packed Double precision FP, or two 64-bit operands in a 128-bit register
{A} 128-bit operand is aligned in memory
{U} means the 128-bit operand is unaligned in memory
{H} means move the high half of the 128-bit operand
{L} means move the low half of the 128-bit operand

Move does both load and store
Packed and Scalar Double-Precision Floating-Point Operations
[Figure: packed operations act on all elements of the 128-bit register at once; scalar operations act only on the low element]
Example: SIMD Array Processing

Pseudocode:
for each f in array
    f = sqrt(f)

Scalar style:
for each f in array {
    load f to the floating-point register
    calculate the square root
    write the result from the register to memory
}

SIMD style:
for each 4 members in array {
    load 4 members to the SSE register
    calculate 4 square roots in one operation
    store the 4 results from the register to memory
}
Data-Level Parallelism and SIMD
• SIMD wants adjacent values in memory that can be operated on in parallel
• Usually specified in programs as loops
for(i=1000; i>0; i=i-1)
    x[i] = x[i] + s;
• How can we reveal more data-level parallelism than available in a single iteration of a loop?
• Unroll loop and adjust iteration rate
Looping in MIPS
Assumptions:
- $t1 is initially the address of the element in the array with the highest address
- $f0 contains the scalar value s
- 8($t2) is the address of the last element to operate on
CODE:
Loop: 1. l.d    $f2,0($t1)    ; $f2 = array element
      2. add.d  $f10,$f2,$f0  ; add s to $f2
      3. s.d    $f10,0($t1)   ; store result
      4. addiu  $t1,$t1,#-8   ; decrement pointer 8 bytes
      5. bne    $t1,$t2,Loop  ; repeat loop if $t1 != $t2
Loop Unrolled
Loop: l.d    $f2,0($t1)
      add.d  $f10,$f2,$f0
      s.d    $f10,0($t1)
      l.d    $f4,-8($t1)
      add.d  $f12,$f4,$f0
      s.d    $f12,-8($t1)
      l.d    $f6,-16($t1)
      add.d  $f14,$f6,$f0
      s.d    $f14,-16($t1)
      l.d    $f8,-24($t1)
      add.d  $f16,$f8,$f0
      s.d    $f16,-24($t1)
      addiu  $t1,$t1,#-32
      bne    $t1,$t2,Loop
NOTE:
1. Only 1 loop overhead every 4 iterations
2. This unrolling works if loop_limit (mod 4) = 0
3. Using different registers for each iteration eliminates data hazards in pipeline
Loop Unrolled Scheduled
Loop: l.d    $f2,0($t1)
      l.d    $f4,-8($t1)
      l.d    $f6,-16($t1)
      l.d    $f8,-24($t1)
      add.d  $f10,$f2,$f0
      add.d  $f12,$f4,$f0
      add.d  $f14,$f6,$f0
      add.d  $f16,$f8,$f0
      s.d    $f10,0($t1)
      s.d    $f12,-8($t1)
      s.d    $f14,-16($t1)
      s.d    $f16,-24($t1)
      addiu  $t1,$t1,#-32
      bne    $t1,$t2,Loop

4 Loads side-by-side: could replace with 4-wide SIMD Load
4 Adds side-by-side: could replace with 4-wide SIMD Add
4 Stores side-by-side: could replace with 4-wide SIMD Store
Loop Unrolling in C
• Instead of compiler doing loop unrolling, could do it yourself in C
for(i=1000; i>0; i=i-1)
    x[i] = x[i] + s;
• Could be rewritten
for(i=1000; i>0; i=i-4) {
    x[i]   = x[i]   + s;
    x[i-1] = x[i-1] + s;
    x[i-2] = x[i-2] + s;
    x[i-3] = x[i-3] + s;
}

What is the downside of doing it in C?
Generalizing Loop Unrolling
• A loop of n iterations
• k copies of the body of the loop
• Assuming (n mod k) ≠ 0
Then we will run the loop with 1 copy of the body (n mod k) times and with k copies of the body floor(n/k) times
Example: Add Two Single-Precision Floating-Point Vectors

Computation to be performed:

vec_res.x = v1.x + v2.x;
vec_res.y = v1.y + v2.y;
vec_res.z = v1.z + v2.z;
vec_res.w = v1.w + v2.w;

SSE Instruction Sequence:
(Note: Destination on the right in x86 assembly)

movaps address-of-v1, %xmm0
// v1.w | v1.z | v1.y | v1.x -> xmm0
addps address-of-v2, %xmm0
// v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x -> xmm0
movaps %xmm0, address-of-vec_res

movaps: move from mem to XMM register, memory aligned, packed single precision
addps: add from mem to XMM register, packed single precision
movaps: move from XMM register to mem, memory aligned, packed single precision
In The News: Intel acquired Altera
• Altera is 2nd biggest FPGA maker after Xilinx
– FPGA (Field-Programmable Gate Array)
• Altera already has fabrication deal to use Intel’s 14nm technology
• Intel experimenting with FPGA next to server processor
• Microsoft to use programmable logic chips to accelerate Bing search engine
Break
• Come join CSUA Hackathon!! Form teams of 4, hack, and win awesome prizes! Prizes include monitors, headsets, and more! Delicious food will be served throughout, including some secret midnight goodies!
• Register your team day-of and write a good description of your project (1-2 sentences). Hackathon runs from 6 PM on October 30th to 2 PM on October 31st. See you there!
• Also, feel free to join CSUA's General Meeting #2 to listen and network with Jeff Atwood, our speaker, and socialize with other people from the CS community.
Intel SSE Intrinsics
• Intrinsics are C functions and procedures for inserting assembly language into C code, including SSE instructions
– With intrinsics, can program using these instructions indirectly
– One-to-one correspondence between SSE instructions and intrinsics
Example SSE Intrinsics

Intrinsics:                     Corresponding SSE instructions:
• Vector data type:
__m128d
• Load and store operations:
_mm_load_pd     MOVAPD / aligned, packed double
_mm_store_pd    MOVAPD / aligned, packed double
_mm_loadu_pd    MOVUPD / unaligned, packed double
_mm_storeu_pd   MOVUPD / unaligned, packed double
• Load and broadcast across vector:
_mm_load1_pd    MOVSD + shuffling/duplicating
• Arithmetic:
_mm_add_pd      ADDPD / add, packed double
_mm_mul_pd      MULPD / multiply, packed double
Example: 2x2 Matrix Multiply

Definition of Matrix Multiply:

Ci,j = (A×B)i,j = Σ(k=1..2) Ai,k × Bk,j

[A1,1 A1,2] × [B1,1 B1,2] = [C1,1 = A1,1B1,1 + A1,2B2,1   C1,2 = A1,1B1,2 + A1,2B2,2]
[A2,1 A2,2]   [B2,1 B2,2]   [C2,1 = A2,1B1,1 + A2,2B2,1   C2,2 = A2,1B1,2 + A2,2B2,2]

[1 0] × [1 3] = [C1,1 = 1*1 + 0*2 = 1   C1,2 = 1*3 + 0*4 = 3]
[0 1]   [2 4]   [C2,1 = 0*1 + 1*2 = 2   C2,2 = 0*3 + 1*4 = 4]
Example: 2x2 Matrix Multiply
• Using the XMM registers
– 64-bit / double precision / two doubles per XMM reg

C1 = [C1,1 | C2,1], C2 = [C1,2 | C2,2]  (stored in memory in column order)
B1 = [Bi,1 | Bi,1], B2 = [Bi,2 | Bi,2]
A  = [A1,i | A2,i]
Example: 2x2 Matrix Multiply
• Initialization
C1 = [0 | 0], C2 = [0 | 0]
• i = 1
A  = [A1,1 | A2,1]
_mm_load_pd: load 2 doubles into XMM reg; stored in memory in column order
B1 = [B1,1 | B1,1], B2 = [B1,2 | B1,2]
_mm_load1_pd: SSE instruction that loads a double word and stores it in the high and low double words of the XMM register (duplicates value in both halves of XMM)
Example: 2x2 Matrix Multiply
• First iteration intermediate result
• i = 1
C1 = [0 + A1,1B1,1 | 0 + A2,1B1,1], C2 = [0 + A1,1B1,2 | 0 + A2,1B1,2]

c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
SSE instructions first do parallel multiplies and then parallel adds in XMM registers

A  = [A1,1 | A2,1]
_mm_load_pd: stored in memory in column order
B1 = [B1,1 | B1,1], B2 = [B1,2 | B1,2]
_mm_load1_pd: loads a double word and stores it in the high and low double words of the XMM register (duplicates value in both halves of XMM)
Example: 2x2 Matrix Multiply
• First iteration intermediate result
• i = 2
C1 = [0 + A1,1B1,1 | 0 + A2,1B1,1], C2 = [0 + A1,1B1,2 | 0 + A2,1B1,2]

c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
SSE instructions first do parallel multiplies and then parallel adds in XMM registers

A  = [A1,2 | A2,2]
_mm_load_pd: stored in memory in column order
B1 = [B2,1 | B2,1], B2 = [B2,2 | B2,2]
_mm_load1_pd: loads a double word and stores it in the high and low double words of the XMM register (duplicates value in both halves of XMM)
Example: 2x2 Matrix Multiply
• Second iteration intermediate result
• i = 2
C1 = [A1,1B1,1 + A1,2B2,1 | A2,1B1,1 + A2,2B2,1] = [C1,1 | C2,1]
C2 = [A1,1B1,2 + A1,2B2,2 | A2,1B1,2 + A2,2B2,2] = [C1,2 | C2,2]

c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
SSE instructions first do parallel multiplies and then parallel adds in XMM registers

A  = [A1,2 | A2,2]
_mm_load_pd: stored in memory in column order
B1 = [B2,1 | B2,1], B2 = [B2,2 | B2,2]
_mm_load1_pd: loads a double word and stores it in the high and low double words of the XMM register (duplicates value in both halves of XMM)
Example: 2x2 Matrix Multiply (Part 1 of 2)

#include <stdio.h>
// header file for SSE compiler intrinsics
#include <emmintrin.h>

// NOTE: vector registers will be represented in comments as v1 = [a | b]
// where v1 is a variable of type __m128d and a, b are doubles

int main(void) {
  // allocate A, B, C aligned on 16-byte boundaries
  double A[4] __attribute__((aligned(16)));
  double B[4] __attribute__((aligned(16)));
  double C[4] __attribute__((aligned(16)));
  int lda = 2;
  int i = 0;
  // declare several 128-bit vector variables
  __m128d c1, c2, a, b1, b2;

  // Initialize A, B, C for example
  /* A = (note column order!)
     1 0
     0 1 */
  A[0] = 1.0; A[1] = 0.0; A[2] = 0.0; A[3] = 1.0;

  /* B = (note column order!)
     1 3
     2 4 */
  B[0] = 1.0; B[1] = 2.0; B[2] = 3.0; B[3] = 4.0;

  /* C = (note column order!)
     0 0
     0 0 */
  C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;
Example: 2x2 Matrix Multiply (Part 2 of 2)

  // use aligned loads to set
  // c1 = [c_11 | c_21]
  c1 = _mm_load_pd(C + 0*lda);
  // c2 = [c_12 | c_22]
  c2 = _mm_load_pd(C + 1*lda);

  for (i = 0; i < 2; i++) {
    /* a =
       i = 0: [a_11 | a_21]
       i = 1: [a_12 | a_22] */
    a = _mm_load_pd(A + i*lda);
    /* b1 =
       i = 0: [b_11 | b_11]
       i = 1: [b_21 | b_21] */
    b1 = _mm_load1_pd(B + i + 0*lda);
    /* b2 =
       i = 0: [b_12 | b_12]
       i = 1: [b_22 | b_22] */
    b2 = _mm_load1_pd(B + i + 1*lda);

    /* c1 =
       i = 0: [c_11 + a_11*b_11 | c_21 + a_21*b_11]
       i = 1: [c_11 + a_12*b_21 | c_21 + a_22*b_21] */
    c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
    /* c2 =
       i = 0: [c_12 + a_11*b_12 | c_22 + a_21*b_12]
       i = 1: [c_12 + a_12*b_22 | c_22 + a_22*b_22] */
    c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
  }

  // store c1, c2 back into C for completion
  _mm_store_pd(C + 0*lda, c1);
  _mm_store_pd(C + 1*lda, c2);

  // print C
  printf("%g,%g\n%g,%g\n", C[0], C[2], C[1], C[3]);
  return 0;
}
Inner loop from gcc -O -S

L2: movapd  (%rax,%rsi), %xmm1  // Load aligned A[i,i+1] -> m1
    movddup (%rdx), %xmm0       // Load B[j], duplicate -> m0
    mulpd   %xmm1, %xmm0        // Multiply m0*m1 -> m0
    addpd   %xmm0, %xmm3        // Add m0+m3 -> m3
    movddup 16(%rdx), %xmm0     // Load B[j+1], duplicate -> m0
    mulpd   %xmm0, %xmm1        // Multiply m0*m1 -> m1
    addpd   %xmm1, %xmm2        // Add m1+m2 -> m2
    addq    $16, %rax           // rax+16 -> rax (i+=2)
    addq    $8, %rdx            // rdx+8 -> rdx (j+=1)
    cmpq    $32, %rax           // rax == 32?
    jne     L2                  // jump to L2 if not equal
    movapd  %xmm3, (%rcx)       // store aligned m3 into C[k,k+1]
    movapd  %xmm2, (%rdi)       // store aligned m2 into C[l,l+1]
And in Conclusion, …
• Amdahl’s Law: Serial sections limit speedup
• Flynn Taxonomy
• Intel SSE SIMD Instructions
– Exploit data-level parallelism in loops
– One instruction fetch that operates on multiple operands simultaneously
– 128-bit XMM registers
• SSE Instructions in C
– Embed the SSE machine instructions directly into C programs through use of intrinsics
– Achieve efficiency beyond that of optimizing compiler