
Page 1: Introduction to the Message Passing Interface (MPI)

Introduction to the Message Passing Interface (MPI)

Jan Fostier

May 3rd, 2017

Page 2: Introduction to the Message Passing Interface (MPI)

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
    § Point-to-point communication
        o Blocking communication
        o Point-to-point network performance
        o Non-blocking communication
    § Collective communication
        o Collective communication algorithms
        o Global network performance
• Parallel program performance evaluation
    § Amdahl's law
    § Gustafson's law
• Parallel program development: case studies

Page 3: Introduction to the Message Passing Interface (MPI)

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
    § Point-to-point communication
        o Blocking communication
        o Point-to-point network performance
        o Non-blocking communication
    § Collective communication
        o Collective communication algorithms
        o Global network performance
• Parallel program performance evaluation
    § Amdahl's law
    § Gustafson's law
• Parallel program development: case studies

Page 4: Introduction to the Message Passing Interface (MPI)

Moore's Law

"Transistor count doubles every two years"

Illustration from Wikipedia

Page 5: Introduction to the Message Passing Interface (MPI)

Moore's Law (2000 versus 2011)

Illustration from Wikipedia

Page 6: Introduction to the Message Passing Interface (MPI)

Evolution of top 500 supercomputers over time

Image taken from www.top500.org

17.6 PFlops/s (#1 ranked)
76 TFlops/s (#500 ranked)

Exponential increase of supercomputer peak performance over time

Page 7: Introduction to the Message Passing Interface (MPI)

Application area - performance share

Image taken from www.top500.org (labelled segments include "research" and "unknown")

Page 8: Introduction to the Message Passing Interface (MPI)

Operating system family

Figure: operating system family share ("Linux", "Unix", "Mixed")

Page 9: Introduction to the Message Passing Interface (MPI)

Motivation for parallel computing

• Want to run the same program faster
    § It depends on the application what is considered an acceptable runtime
        o SETI@Home, Folding@Home, GIMPS: years may be acceptable
        o For R&D applications: days or even weeks are acceptable
            – CFD, CEM, bioinformatics, cheminformatics, and many more
        o Prediction of tomorrow's weather should take less than a day of computation time
        o Some applications require real-time behavior
            – Computer games, algorithmic trading
• Want to run bigger datasets
• Want to reduce financial cost and/or power consumption

Page 10: Introduction to the Message Passing Interface (MPI)

Distributed-memory architecture

Figure: several machines, each with its own CPU, memory and network interface (NI), connected by an interconnection network (e.g. Gigabit Ethernet, Infiniband, Myrinet, ...)

Page 11: Introduction to the Message Passing Interface (MPI)

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
    § Point-to-point communication
        o Blocking communication
        o Point-to-point network performance
        o Non-blocking communication
    § Collective communication
        o Collective communication algorithms
        o Global network performance
• Parallel program performance evaluation
    § Amdahl's law
    § Gustafson's law
• Parallel program development: case studies

Page 12: Introduction to the Message Passing Interface (MPI)

Message Passing Interface (MPI)

• MPI = library specification, not an implementation
• Most important implementations
    § Open MPI (http://www.open-mpi.org, MPI-3 standard)
    § Intel MPI (proprietary, MPI-3 standard)
• Specifies routines for (among others)
    § Point-to-point communication (between 2 processes)
    § Collective communication (> 2 processes)
    § Topology setup
    § Parallel I/O
• Bindings for C/C++ and Fortran

Page 13: Introduction to the Message Passing Interface (MPI)

MPI reference works

• MPI standards: http://www.mpi-forum.org/docs/

• MPI: The Complete Reference (M. Snir, S. Otto, S. Huss-Lederman, D. Walker, J. Dongarra). Available from http://switzernet.com/people/emin-gabrielyan/060708-thesis-ref/papers/Snir96.pdf

• Using MPI: Portable Parallel Programming with the Message Passing Interface, 2nd ed. (W. Gropp, E. Lusk, A. Skjellum).

Page 14: Introduction to the Message Passing Interface (MPI)

MPI standard

• Started in 1992 (Workshop on Standards for Message-Passing in a Distributed Memory Environment) with support from vendors, library writers and academia.
• MPI version 1.0 (May 1994)
    • Final pre-draft in 1993 (Supercomputing '93 conference)
    • Final version June 1994
• MPI version 2.0 (July 1997)
    • Support for one-sided communication
    • Support for process management
    • Support for parallel I/O
• MPI version 3.0 (September 2012)
    • Support for non-blocking collective communication
    • Fortran 2008 bindings
    • New one-sided communication routines

Page 15: Introduction to the Message Passing Interface (MPI)

Hello world example in MPI

#include <mpi.h>
#include <iostream>
#include <cstdlib>

using namespace std;

int main(int argc, char* argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        cout << "Hello World from process " << rank << "/" << size << endl;

        MPI_Finalize();
        return EXIT_SUCCESS;
}

[john@doe ~]$ mpirun -np 4 ./helloWorld
Hello World from process 2/4
Hello World from process 3/4
Hello World from process 0/4
Hello World from process 1/4

Output order is random

Page 16: Introduction to the Message Passing Interface (MPI)

Basic MPI routines

• int MPI_Init(int *argc, char ***argv)
    § Initialization: all processes must call this prior to any other MPI routine.
    § Strips off (possible) arguments provided by "mpirun".
• int MPI_Finalize(void)
    § Cleanup: all processes must call this routine at the end of the program.
    § All pending communication should have finished before calling this.
• int MPI_Comm_size(MPI_Comm comm, int *size)
    § Returns the size of the "communicator" associated with "comm".
    § Communicator = user-defined subset of processes.
    § MPI_COMM_WORLD = communicator that involves all processes.
• int MPI_Comm_rank(MPI_Comm comm, int *rank)
    § Returns the rank of the process in the communicator.
    § Range: [0 ... size - 1]

Page 17: Introduction to the Message Passing Interface (MPI)

Message Passing Interface mechanisms

Figure: P processes (or instances) of the "Hello World" program run on different machines connected by an interconnection network (e.g. Gigabit Ethernet, Infiniband, Myrinet, ...).

mpirun launches P independent processes across the different machines:
• Each process is an instance of the same program.
• Each process has its own stack variables: rank = 0, 1, 2, ..., P-1 and size = P.
• Each process prints "Hello World".

Page 18: Introduction to the Message Passing Interface (MPI)

Terminology

• Computer program = passive collection of instructions.
• Process = instance of a computer program that is being executed.
• Multitasking = running multiple processes on a CPU.
• Thread = smallest stream of instructions that can be managed by an OS scheduler (= light-weight process).
• Distributed-memory system = multi-processor system where each processor has direct (fast) access to its own private memory and relies on inter-processor communication to access another processor's memory (typically slower).

Page 19: Introduction to the Message Passing Interface (MPI)

Multithreading versus multiprocessing

Multithreading:
• Single process
• Shared memory address space
• Protect data against simultaneous writing
• Limited to a single machine
• E.g. Pthreads, CILK, OpenMP, etc.

Message passing:
• Multiple processes
• Separate memory address spaces
• Explicitly communicate everything
• Multiple machines possible
• E.g. MPI, Unified Parallel C, PVM

Figure: in the multithreaded model, a single process executes Seq I, spawns threads A, B, C, D, joins them and continues with Seq II; in the message-passing model, every process executes its own copy of Seq I, one of the tasks A, B, C or D, and Seq II.

Page 20: Introduction to the Message Passing Interface (MPI)

Message Passing Interface (MPI)

• MPI mechanisms (depend on the implementation)
    • Compiling an MPI program from source (C / C++)
        • mpicc -O3 main.cpp -o main
          (equivalent to: gcc -O3 main.cpp -o main -L<includeDir> -l<mpiLibs>)
        • Also mpic++ (or mpicxx), mpif77, mpif90, etc.
        • mpicc / mpic++ default to gcc
    • Running MPI applications (manually)
        • mpirun -np <number of program instances> <your program>
        • List of worker nodes specified in some config file.
• Using MPI on the UGent HPC cluster
    • Load the appropriate module first, e.g.
        • module load intel/2017a
    • Compiling an MPI program from source
        • mpigcc (uses the GNU "gcc" C compiler)
        • mpiicc (uses the Intel "icc" C compiler)
        • mpigxx (uses the GNU "g++" C++ compiler)
        • mpiicpc (uses the Intel "icpc" C++ compiler)
    • Submit the job using a job script (see further)

Page 21: Introduction to the Message Passing Interface (MPI)

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
    § Point-to-point communication
        o Blocking communication
        o Point-to-point network performance
        o Non-blocking communication
    § Collective communication
        o Collective communication algorithms
        o Global network performance
• Parallel program performance evaluation
    § Amdahl's law
    § Gustafson's law
• Parallel program development: case studies

Page 22: Introduction to the Message Passing Interface (MPI)

Basic MPI point-to-point communication

...
int rank, size, count;
char b[40];
MPI_Status status;

... // init MPI and rank and size variables

if (rank != 0) {
        char *str = "Hello World";
        MPI_Send(str, 12, MPI_CHAR, 0, 123, MPI_COMM_WORLD);
} else {        // branching on rank
        for (int i = 1; i < size; i++) {
                MPI_Recv(b, 40, MPI_CHAR, i, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
                MPI_Get_count(&status, MPI_CHAR, &count);
                printf("I received %s from process %d with size %d and tag %d\n",
                       b, status.MPI_SOURCE, count, status.MPI_TAG);
        }
}
...

[john@doe ~]$ mpirun -np 4 ./ptpcomm
I received Hello World from process 1 with size 12 and tag 123
I received Hello World from process 2 with size 12 and tag 123
I received Hello World from process 3 with size 12 and tag 123

Page 23: Introduction to the Message Passing Interface (MPI)

Message Passing Interface mechanisms

Figure: P processes of the program run on different machines connected by an interconnection network.

mpirun launches P independent processes across the different machines; each process is an instance of the same program.
• The process with rank 0 receives "Hello World" from every other process (Recv 3x for P = 4).
• The processes with ranks 1 ... P-1 each send "Hello World" to rank 0.

Page 24: Introduction to the Message Passing Interface (MPI)

Blocking send and receive

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)

• buf: pointer to the message to send
• count: number of items to send
• datatype: datatype of each item
    – number of bytes sent: count * sizeof(datatype)
• dest: rank of destination process
• tag: value to identify the message [0 ... at least 32767]
• comm: communicator specification (e.g. MPI_COMM_WORLD)

int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status)

• buf: pointer to the buffer to store received data
• count: upper bound (!) on the number of items to receive
• datatype: datatype of each item
• source: rank of source process (or MPI_ANY_SOURCE)
• tag: value to identify the message (or MPI_ANY_TAG)
• comm: communicator specification (e.g. MPI_COMM_WORLD)
• status: structure that contains { MPI_SOURCE, MPI_TAG, MPI_ERROR }

Page 25: Introduction to the Message Passing Interface (MPI)

Sending and receiving

Two-sided communication:
• Both the sender and receiver are involved in the data transfer, as opposed to one-sided communication.
• A posted send must match a receive.

When do MPI_Send and MPI_Recv match?
• 1. Rank of the receiving process
• 2. Rank of the sending process
• 3. Tag
    – custom value to distinguish messages from the same sender
• 4. Communicator

Rationale for communicators:
• Used to create subsets of processes
• Transparent use of tags
    – modules can be written in isolation
    – communication within a module through its own communicator
    – communication between modules through a shared communicator

Page 26: Introduction to the Message Passing Interface (MPI)

MPI datatypes

MPI_Datatype             C datatype
MPI_CHAR                 signed char
MPI_SHORT                signed short int
MPI_INT                  signed int
MPI_LONG                 signed long int
MPI_UNSIGNED_CHAR        unsigned char
MPI_UNSIGNED_SHORT       unsigned short int
MPI_UNSIGNED             unsigned int
MPI_UNSIGNED_LONG        unsigned long int
MPI_FLOAT                float
MPI_DOUBLE               double
MPI_LONG_DOUBLE          long double
MPI_BYTE                 no conversion, bit pattern transferred as-is
MPI_PACKED               grouped messages

Page 27: Introduction to the Message Passing Interface (MPI)

Querying for information

• MPI_Status
    • Stores information about the MPI_Recv operation:

      typedef struct MPI_Status {
              int MPI_SOURCE;
              int MPI_TAG;
              int MPI_ERROR;
      };

    • Does not contain the size of the received message

• int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)
    • Returns the number of data items received in the count variable
    • Not directly accessible from the status variable

Page 28: Introduction to the Message Passing Interface (MPI)

Blocking send and receive

Figure: timelines of the request-to-send / ok-to-send / data handshake for three scenarios:
a) the sender comes first: idling at the sender (no buffering of the message)
b) sending/receiving at about the same time: idling minimized
c) the receiver comes first: idling at the receiver

Slide reproduced from slides by John Mellor-Crummey

Page 29: Introduction to the Message Passing Interface (MPI)

Deadlocks

int a[10], b[10], myRank;
MPI_Status s1, s2;
MPI_Comm_rank( MPI_COMM_WORLD, &myRank );

if ( myRank == 0 ) {
        MPI_Send( a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD );
        MPI_Send( b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD );
} else if ( myRank == 1 ) {
        MPI_Recv( b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &s1 );
        MPI_Recv( a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &s2 );
}

Slide credit: John Mellor-Crummey

If MPI_Send is blocking (handshake protocol), this program will deadlock.
• If the 'eager' protocol is used, it may run
• Depends on the message size
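One way to remove the deadlock, sketched below under the assumption that the variables are those of the listing above, is to post the receives on rank 1 in the same order as the matching sends (alternatively, MPI_Sendrecv, introduced later, can pair up the transfers):

if ( myRank == 0 ) {
        MPI_Send( a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD );
        MPI_Send( b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD );
} else if ( myRank == 1 ) {
        // receive tag 1 first, matching the order in which rank 0 sends
        MPI_Recv( a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &s2 );
        MPI_Recv( b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &s1 );
}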

Page 30: Introduction to the Message Passing Interface (MPI)

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
    § Point-to-point communication
        o Blocking communication
        o Point-to-point network performance
        o Non-blocking communication
    § Collective communication
        o Collective communication algorithms
        o Global network performance
• Parallel program performance evaluation
    § Amdahl's law
    § Gustafson's law
• Parallel program development: case studies

Page 31: Introduction to the Message Passing Interface (MPI)

Network cost modeling

• Various choices of interconnection network
    § Gigabit Ethernet: cheap, but far too slow for HPC applications
    § Infiniband / Myrinet: high-speed interconnect
• Simple performance model for point-to-point communication:

  Tcomm = α + β·n

    § α = latency
    § B = 1/β = saturation (asymptotic) bandwidth (bytes/s)
    § n = number of bytes to transmit
    § Effective bandwidth Beff:

      Beff = n / (α + β·n) = n / (α + n/B)
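As a worked example (illustrative numbers, not measurements from the slides): with α = 2 μs and B = 1/β = 1.25 GByte/s, a 1 KByte message achieves Beff = 1000 / (2·10⁻⁶ + 0.8·10⁻⁶) ≈ 0.36 GByte/s, less than a third of the saturation bandwidth, whereas a 10 MByte message achieves Beff ≈ 1.25 GByte/s ≈ B. Short messages are latency-dominated; only long messages approach the asymptotic bandwidth.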

Page 32: Introduction to the Message Passing Interface (MPI)

The case of Gengar (UGent cluster): bandwidth and latency

Figure: a node with two quad-core CPUs; each core has its own L1 cache, pairs of cores share 6 MByte of L2 cache, and the node has 16 GByte of RAM. The nodes are connected through a non-blocking Infiniband switched fabric.

Bandwidth / latency figures shown: ~100 Gbit/s and 2 ns (cache), ~20 Gbit/s and 50 ns (RAM), ~1 μs to the 193 other machines (Infiniband).

Page 33: Introduction to the Message Passing Interface (MPI)

Measure effective bandwidth: ring test

Idea: send a single message of size N around a circle of processes 0 → 1 → 2 → ... → P-1 → 0

• Increase the message size N exponentially
    • 1 byte, 2 bytes, 4 bytes, ..., 1024 bytes, 2048 bytes, 4096 bytes, ...
• Benchmark the results (measure wall clock time T)
• Bandwidth = N * P / T

Page 34: Introduction to the Message Passing Interface (MPI)

Hands-on: ring test in MPI

void sendRing( char *buffer, int length ) {
        /* send message in a ring here */
}

int main( int argc, char * argv[] )
{
        ...
        char *buffer = (char*) calloc( 1048576, sizeof(char) );
        int msgLen = 8;
        for (int i = 0; i < 18; i++, msgLen *= 2) {
                double startTime = MPI_Wtime();
                sendRing( buffer, msgLen );
                double stopTime = MPI_Wtime();
                double elapsedSec = stopTime - startTime;
                if (rank == 0)
                        printf( "Bandwidth for size %d is: %f\n", ... );
        }
        ...
}

Page 35: Introduction to the Message Passing Interface (MPI)

Job script example

mpijob.sh job script example:

#!/bin/sh
#
#PBS -o output.file
#PBS -e error.file
#PBS -l nodes=2:ppn=all
#PBS -l walltime=00:02:00
#PBS -m n

cd $VSC_SCRATCH/<your directory>
module load intel/2017a
module load scripts
mympirun ./<program name> <program arguments>

qsub mpijob.sh
qstat / qdel / etc. remain the same

Page 36: Introduction to the Message Passing Interface (MPI)

Hands-on: ring test in MPI (solution)

void sendRing( char *buffer, int msgLen )
{
        int myRank, numProc;
        MPI_Comm_rank( MPI_COMM_WORLD, &myRank );
        MPI_Comm_size( MPI_COMM_WORLD, &numProc );
        MPI_Status status;

        int prevR = (myRank - 1 + numProc) % numProc;
        int nextR = (myRank + 1) % numProc;

        if (myRank == 0) { // send first, then receive
                MPI_Send( buffer, msgLen, MPI_CHAR, nextR, 0, MPI_COMM_WORLD );
                MPI_Recv( buffer, msgLen, MPI_CHAR, prevR, 0, MPI_COMM_WORLD,
                          &status );
        } else { // receive first, then send
                MPI_Recv( buffer, msgLen, MPI_CHAR, prevR, 0, MPI_COMM_WORLD,
                          &status );
                MPI_Send( buffer, msgLen, MPI_CHAR, nextR, 0, MPI_COMM_WORLD );
        }
}

Page 37: Introduction to the Message Passing Interface (MPI)

Basic MPI routines

Timing routines in MPI

• double MPI_Wtime( void )
    • Returns the time in seconds relative to "some time" in the past
    • "Some time" in the past is fixed during the process
• double MPI_Wtick( void )
    • Returns the resolution of MPI_Wtime() in seconds
    • e.g. 10^-3 = millisecond resolution

Page 38: Introduction to the Message Passing Interface (MPI)

Bandwidth on Gengar (UGent cluster)

Figure: effective bandwidth Beff (Mbyte/s) versus message size (byte) for Gigabit Ethernet and Infiniband. The effective bandwidth increases for larger messages.

Page 39: Introduction to the Message Passing Interface (MPI)

Benchmark results: comparison of CPU load

Figure: CPU load during communication for Infiniband versus Gigabit Ethernet.

Page 40: Introduction to the Message Passing Interface (MPI)

Exchanging messages in MPI

int MPI_Sendrecv( void *sendbuf, int sendcount, MPI_Datatype sendtype,
                  int dest, int sendtag, void *recvbuf, int recvcount,
                  MPI_Datatype recvtype, int source, int recvtag,
                  MPI_Comm comm, MPI_Status *status )

• sendbuf: pointer to the message to send
• sendcount: number of elements to transmit
• sendtype: datatype of the items to send
• dest: rank of destination process
• sendtag: identifier for the message
• recvbuf: pointer to the buffer to store the message (disjoint from sendbuf)
• recvcount: upper bound (!) on the number of elements to receive
• recvtype: datatype of the items to receive
• source: rank of the source process (or MPI_ANY_SOURCE)
• recvtag: value to identify the message (or MPI_ANY_TAG)
• comm: communicator specification (e.g. MPI_COMM_WORLD)
• status: structure that contains { MPI_SOURCE, MPI_TAG, MPI_ERROR }

int MPI_Sendrecv_replace( ... )
• The send buffer is replaced by the received data

Page 41: Introduction to the Message Passing Interface (MPI)

Basic MPI routines: Sendrecv example

const int len = 10000;
int a[len], b[len];
if ( myRank == 0 ) {
        MPI_Send( a, len, MPI_INT, 1, 0, MPI_COMM_WORLD );
        MPI_Recv( b, len, MPI_INT, 2, 2, MPI_COMM_WORLD, &status );
} else if ( myRank == 1 ) {
        MPI_Sendrecv( a, len, MPI_INT, 2, 1, b, len, MPI_INT, 0,
                      0, MPI_COMM_WORLD, &status );
} else if ( myRank == 2 ) {
        MPI_Sendrecv( a, len, MPI_INT, 0, 2, b, len, MPI_INT, 1,
                      1, MPI_COMM_WORLD, &status );
}

• Compatibility between Sendrecv and 'normal' send and recv
• Sendrecv can help to prevent deadlocks

Figure: processes 0, 1 and 2 exchange messages in a ring; it is safe to exchange.

Page 42: Introduction to the Message Passing Interface (MPI)

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
    § Point-to-point communication
        o Blocking communication
        o Point-to-point network performance
        o Non-blocking communication
    § Collective communication
        o Collective communication algorithms
        o Global network performance
• Parallel program performance evaluation
    § Amdahl's law
    § Gustafson's law
• Parallel program development: case studies

Page 43: Introduction to the Message Passing Interface (MPI)

Non-blocking communication

Idea:
• Do something useful while waiting for communications to finish
• Try to overlap communications and computations

How?
• Replace blocking communication by the non-blocking variants:

  MPI_Send(...)                    →  MPI_Isend(..., MPI_Request *request)
  MPI_Recv(..., MPI_Status *status)  →  MPI_Irecv(..., MPI_Request *request)

• I = immediate: the MPI_Isend and MPI_Irecv routines return immediately
• Polling routines are needed to verify progress
    • The request handle is used to identify the communication
    • The status field moves to the polling routines (see further)
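A minimal sketch of this idea (not from the original slides; the ring neighbors, the buffer size and the placeholder compute steps are illustrative assumptions):

#include <mpi.h>

int main(int argc, char *argv[])
{
        int rank, size;
        double sendBuf[1024], recvBuf[1024];
        MPI_Request reqs[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (int i = 0; i < 1024; i++)
                sendBuf[i] = rank;                /* some local data to exchange */

        int right = (rank + 1) % size;            /* neighbors in a ring */
        int left  = (rank - 1 + size) % size;

        /* start the communication ... */
        MPI_Irecv(recvBuf, 1024, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendBuf, 1024, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

        /* ... do useful work that does not need recvBuf ... */
        /* compute_interior(); */

        /* ... and only block when the received data is really needed */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        /* compute_boundary(recvBuf); */

        MPI_Finalize();
        return 0;
}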

Page 44: Introduction to the Message Passing Interface (MPI)

Non-blocking communications

Asynchronous progress
• = the ability to progress communications while performing calculations
• Depends on the hardware
    • Gigabit Ethernet = very limited
    • Infiniband = many more possibilities
• Depends on the MPI implementation
    • Multithreaded implementations of MPI (e.g. Open MPI)
    • Daemon for asynchronous progress (e.g. LAM/MPI)
• Depends on the protocol
    • Eager protocol
    • Handshake protocol
• Still the subject of ongoing research

Page 45: Introduction to the Message Passing Interface (MPI)

Non-blocking sending and receiving

Figure: Isend/Irecv timelines with the request-to-send / ok-to-send / data handshake for two cases:
a) the network interface supports overlapping computations and communications
b) the network interface has no such support

Slide reproduced from slides by John Mellor-Crummey

Page 46: Introduction to the Message Passing Interface (MPI)

Non-blocking communications: polling/waiting routines

int MPI_Wait( MPI_Request *request, MPI_Status *status )
    request: handle to identify the communication
    status: status information (cf. 'normal' MPI_Recv)

int MPI_Test( MPI_Request *request, int *flag, MPI_Status *status )
    Returns immediately. Sets flag = true if the communication has completed.

int MPI_Waitany( int count, MPI_Request *array_of_requests,
                 int *index, MPI_Status *status )
    Waits for exactly one communication to complete.
    If more than one communication has completed, it picks a random one.
    index returns the index of the completed communication.

int MPI_Testany( int count, MPI_Request *array_of_requests,
                 int *index, int *flag, MPI_Status *status )
    Returns immediately. Sets flag = true if at least one communication completed.
    If more than one communication has completed, it picks a random one.
    index returns the index of the completed communication.
    If flag = false, index returns MPI_UNDEFINED.

Page 47: Introduction to the Message Passing Interface (MPI)

Example: client-server code

if ( rank != 0 ) { // client code
        while ( true ) { // generate requests and send to the server
                generate_request( data, &size );
                MPI_Send( data, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD );
        }
} else { // server code (rank == 0)
        MPI_Request *reqList = new MPI_Request[nProc];
        for ( int i = 0; i < nProc - 1; i++ )
                MPI_Irecv( buffer[i].data, MAX_LEN, MPI_CHAR, i+1, tag,
                           MPI_COMM_WORLD, &reqList[i] );
        while ( true ) { // main consumer loop
                MPI_Status status;
                int reqIndex, recvSize;
                MPI_Waitany( nProc-1, reqList, &reqIndex, &status );
                MPI_Get_count( &status, MPI_CHAR, &recvSize );
                do_service( buffer[reqIndex].data, recvSize );
                MPI_Irecv( buffer[reqIndex].data, MAX_LEN, MPI_CHAR,
                           status.MPI_SOURCE, tag, MPI_COMM_WORLD,
                           &reqList[reqIndex] );
        }
}

Page 48: Introduction to the Message Passing Interface (MPI)

Non-blocking communications: polling/waiting routines (cont'd)

int MPI_Waitall( int count, MPI_Request *array_of_requests,
                 MPI_Status *array_of_statuses )
    Waits for all communications to complete.

int MPI_Testall( int count, MPI_Request *array_of_requests,
                 int *flag, MPI_Status *array_of_statuses )
    Returns immediately. Sets flag = true if all communications have completed.

int MPI_Waitsome( int incount, MPI_Request *array_of_requests,
                  int *outcount, int *array_of_indices,
                  MPI_Status *array_of_statuses )
    Waits for at least one communication to complete.
    outcount contains the number of communications that have completed.
    Completed requests are set to MPI_REQUEST_NULL.

int MPI_Testsome( int incount, MPI_Request *array_of_requests,
                  int *outcount, int *array_of_indices,
                  MPI_Status *array_of_statuses )
    Same as Waitsome, but returns immediately. The flag field is no longer
    needed; outcount = 0 is returned if no communications have completed.

Page 49: Introduction to the Message Passing Interface (MPI)

Example: improved client-server code

if ( rank != 0 ) { // same client code
        ...
} else { // server code (rank == 0)
        MPI_Request *reqList = new MPI_Request[nProc-1];
        MPI_Status *status = new MPI_Status[nProc-1];
        int *reqIndex = new int[nProc-1];

        for ( int i = 0; i < nProc - 1; i++ )
                MPI_Irecv( buffer[i].data, MAX_LEN, MPI_CHAR, i+1, tag,
                           MPI_COMM_WORLD, &reqList[i] );
        while ( true ) { // main consumer loop
                int numMsg, recvSize;
                MPI_Waitsome( nProc-1, reqList, &numMsg, reqIndex, status );
                for ( int i = 0; i < numMsg; i++ ) {
                        MPI_Get_count( &status[i], MPI_CHAR, &recvSize );
                        do_service( buffer[reqIndex[i]].data, recvSize );
                        MPI_Irecv( buffer[reqIndex[i]].data, MAX_LEN, MPI_CHAR,
                                   status[i].MPI_SOURCE, tag, MPI_COMM_WORLD,
                                   &reqList[reqIndex[i]] );
                }
        }
}

Page 50: Introduction to the Message Passing Interface (MPI)

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
    § Point-to-point communication
        o Blocking communication
        o Point-to-point network performance
        o Non-blocking communication
    § Collective communication
        o Collective communication algorithms
        o Global network performance
• Parallel program performance evaluation
    § Amdahl's law
    § Gustafson's law
• Parallel program development: case studies

Page 51: Introduction to the Message Passing Interface (MPI)

Barrier synchronization

MPI_Barrier(MPI_Comm comm)
This function does not return until all processes in comm have called it.

Figure: processes (by rank) perform computations, call MPI_Barrier at different times, and idle until the last process reaches the barrier.
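A common use, sketched below (not from the slides; it reuses the rank variable of the earlier examples and a placeholder do_work step), is to bracket a timed region with barriers so that MPI_Wtime measures the slowest process:

MPI_Barrier(MPI_COMM_WORLD);            /* everyone starts together          */
double t0 = MPI_Wtime();

/* do_work(); */                        /* the region being measured         */

MPI_Barrier(MPI_COMM_WORLD);            /* wait for the slowest process      */
double t1 = MPI_Wtime();
if (rank == 0)
        printf("elapsed time: %f s\n", t1 - t0);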

Page 52: Introduction to the Message Passing Interface (MPI)

Broadcast

MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
          int root, MPI_Comm comm)

MPI_Bcast broadcasts count elements of type datatype stored in buffer at the root process to all other processes in comm, where this data is stored in buffer.

Figure: before the call, only the buffer at the root process (count elements) contains useful data; the buffer contents at the non-root processes p0 ... p4 will be overwritten. After MPI_Bcast, the buffer at all processes contains the same data.

Page 53: Introduction to the Message Passing Interface (MPI)

Broadcast example

...
int rank, size;

... // init MPI and rank and size variables

int root = 0;
char buffer[12];

// fill the buffer at the root process only
if (rank == root)
        sprintf(buffer, "Hello World");

// all processes must call MPI_Bcast
MPI_Bcast(buffer, 12, MPI_CHAR, root, MPI_COMM_WORLD);

printf("Process %d has %s stored in the buffer.\n", rank, buffer);
...

[john@doe ~]$ mpirun -np 4 ./broadcast
Process 1 has Hello World stored in the buffer.
Process 0 has Hello World stored in the buffer.
Process 3 has Hello World stored in the buffer.
Process 2 has Hello World stored in the buffer.

Page 54: Introduction to the Message Passing Interface (MPI)

Broadcast algorithm

Figure: broadcasting data (n bytes) from root p0 to processes p1 ... p7.

• Linear algorithm: subsequently sending n bytes from the root process to the P-1 other processes takes (α + βn)(P-1) time.
• A binary tree algorithm takes only (α + βn)⌈log2 P⌉ time.
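As an illustration of the tree idea, the following sketch (not from the slides) implements a binomial-tree broadcast from rank 0 with plain point-to-point calls; in real code MPI_Bcast should of course be used, since implementations already apply such algorithms internally:

/* after ceil(log2 P) rounds, every rank 0 ... P-1 holds the n bytes in buf */
void tree_bcast(char *buf, int n, int rank, int P)
{
        MPI_Status status;
        for (int mask = 1; mask < P; mask <<= 1) {
                if (rank < mask) {
                        /* this process already has the data: pass it on */
                        if (rank + mask < P)
                                MPI_Send(buf, n, MPI_CHAR, rank + mask, 0, MPI_COMM_WORLD);
                } else if (rank < 2 * mask) {
                        /* this process receives the data in this round */
                        MPI_Recv(buf, n, MPI_CHAR, rank - mask, 0, MPI_COMM_WORLD, &status);
                }
        }
}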

Page 55: Introduction to the Message Passing Interface (MPI)

Scatter

MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendType,
            void *recvbuf, int recvcount, MPI_Datatype recvType,
            int root, MPI_Comm comm)

MPI_Scatter partitions a sendbuf at the root process into P equal parts of size sendcount and sends each process in comm (including root) a portion in rank order.

Figure: the send buffer (which only matters at the root process, here root = p1) holds blocks d0 d1 d2 d3 d4 of sendcount elements each; after MPI_Scatter, process pi holds block di (recvcount elements) in its receive buffer.

Page 56: Introduction to the Message Passing Interface (MPI)

Scatter example

int root = 0;
char recvBuf[7];

// fill the send buffer at the root process only
if (rank == root) {
        char sendBuf[25];
        sprintf(sendBuf, "This is the source data.");
        MPI_Scatter(sendBuf, 6, MPI_CHAR, recvBuf, 6, MPI_CHAR,
                    root, MPI_COMM_WORLD);
} else {
        // first three parameters are ignored on non-root processes
        MPI_Scatter(NULL, 0, MPI_CHAR, recvBuf, 6, MPI_CHAR,
                    root, MPI_COMM_WORLD);
}

recvBuf[6] = '\0';
printf("Process %d has %s stored in the buffer.\n", rank, recvBuf);
...

[john@doe ~]$ mpirun -np 4 ./scatter
Process 1 has s the  stored in the buffer.
Process 0 has This i stored in the buffer.
Process 3 has  data. stored in the buffer.
Process 2 has source stored in the buffer.

Page 57: Introduction to the Message Passing Interface (MPI)

Scatter algorithm

Figure: scattering blocks from root p0 to processes p1 ... p7 (one block = n bytes).

• Linear algorithm: subsequently sending n bytes from the root process to the P-1 other processes takes (α + βn)(P-1) time.
• A binary algorithm takes only α⌈log2 P⌉ + βn(P-1) time (reduced number of communication rounds!).

Page 58: Introduction to the Message Passing Interface (MPI)

Scatter (vector variant)

MPI_Scatterv(void *sendbuf, int *sendcnts, int *displs,
             MPI_Datatype sendType, void *recvbuf, int recvcnt,
             MPI_Datatype recvType, int root, MPI_Comm comm)

The partitions of sendbuf don't need to be of equal size and are specified per receiving process: their first index by the displs array, their size by the sendcnts array.

Figure: the send buffer (which only matters at the root process, here root = p1) is split into blocks of sendcnts[i] elements starting at offsets displs[i]; process pi receives its block (recvcount elements) in its receive buffer.
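A minimal usage sketch (not from the slides; rank, size and root are as in the earlier examples, N and the variable names are illustrative) that distributes N doubles as evenly as possible, giving the first N % size processes one extra element:

int N = 1000;
int *sendcnts = (int*) malloc(size * sizeof(int));
int *displs   = (int*) malloc(size * sizeof(int));
for (int i = 0, offset = 0; i < size; i++) {
        sendcnts[i] = N / size + (i < N % size ? 1 : 0);
        displs[i]   = offset;
        offset     += sendcnts[i];
}

double *sendbuf = (rank == root) ? (double*) malloc(N * sizeof(double)) : NULL;
double *recvbuf = (double*) malloc(sendcnts[rank] * sizeof(double));
/* ... fill sendbuf at the root process ... */

MPI_Scatterv(sendbuf, sendcnts, displs, MPI_DOUBLE,
             recvbuf, sendcnts[rank], MPI_DOUBLE, root, MPI_COMM_WORLD);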

Page 59: Introduction to the Message Passing Interface (MPI)

Gather

MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendType,
           void *recvbuf, int recvcount, MPI_Datatype recvType,
           int root, MPI_Comm comm)

MPI_Gather gathers equal partitions of size recvcount from each of the P processes in comm (including root) and stores them in recvbuf at the root process in rank order.

Gather is the opposite of Scatter.

A vector variant, MPI_Gatherv, exists, a similar generalization as MPI_Scatterv.

Figure: each process pi contributes a block di (sendcount elements) from its send buffer; after MPI_Gather, the receive buffer at the root process (here root = p1) holds d0 d1 d2 d3 d4 (recvcount elements per block).

Page 60: Introduction to the Message Passing Interface (MPI)

Gather example

int root = 0;
int sendBuf = rank;

if (rank == root) {
        // receive buffer exists at the root process only
        int *recvBuf = new int[size];
        MPI_Gather(&sendBuf, 1, MPI_INT, recvBuf, 1, MPI_INT,
                   root, MPI_COMM_WORLD);
        cout << "Receive buffer at root process: " << endl;
        for (int i = 0; i < size; i++)
                cout << recvBuf[i] << " ";
        cout << endl;
        delete [] recvBuf;
} else {
        // receive parameters are ignored on non-root processes
        MPI_Gather(&sendBuf, 1, MPI_INT, NULL, 1, MPI_INT,
                   root, MPI_COMM_WORLD);
}

[john@doe ~]$ mpirun -np 4 ./gather
Receive buffer at root process:
0 1 2 3

Page 61: Introduction to the Message Passing Interface (MPI)

Gather algorithm

Figure: gathering blocks from processes p1 ... p7 at root p0 (one block = n bytes).

• Linear algorithm: subsequently sending n bytes from the P-1 processes to the root takes (α + βn)(P-1) time.
• A binary algorithm takes only α⌈log2 P⌉ + βn(P-1) time (reduced number of communication rounds!).

Page 62: Introduction to the Message Passing Interface (MPI)

Allgather

MPI_Allgather(void *sendbuf, int sendcnt, MPI_Datatype sendType,
              void *recvbuf, int recvcnt, MPI_Datatype recvType,
              MPI_Comm comm)

MPI_Allgather is a generalization of MPI_Gather, in the sense that the data is gathered by all processes, instead of just the root process.

Figure: each process pi contributes sendcount elements from its send buffer; after MPI_Allgather, the receive buffer of every process p0 ... p4 holds all contributions in rank order (recvcount elements per block).
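A minimal usage sketch (not from the slides; rank and size are as in the earlier examples): every process contributes one integer and afterwards every process holds the complete, rank-ordered array:

int myValue = rank;
int *all = (int*) malloc(size * sizeof(int));
MPI_Allgather(&myValue, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);
// on every process: all[i] == i for i = 0 ... size-1
free(all);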

Page 63: Introduction to the Message Passing Interface (MPI)

Allgather algorithm

Figure: butterfly exchange among processes p0 ... p7; in successive rounds, blocks of n, 2n and 4n bytes are exchanged (one block = n bytes).

• P calls to gather take P[α⌈log2 P⌉ + βn(P-1)] time (using the best gather algorithm).
• A gather followed by a broadcast takes 2α⌈log2 P⌉ + βn(P⌈log2 P⌉ + P - 1) time.
• The "butterfly" algorithm takes only α·log2 P + βn(P-1) time (in case P is a power of two).

Page 64: Introduction to the Message Passing Interface (MPI)

All-to-all communication

MPI_Alltoall(void *sendbuf, int sendcnt, MPI_Datatype sendType,
             void *recvbuf, int recvcnt, MPI_Datatype recvType,
             MPI_Comm comm)

Using MPI_Alltoall, every process sends a distinct message to every other process.

A vector variant, MPI_Alltoallv, exists, allowing for different sizes for each process.

Figure: process p0 holds blocks a0 ... a4 in its send buffer, p1 holds b0 ... b4, and so on; after MPI_Alltoall, process pi holds block i of every process, e.g. p0 receives a0 b0 c0 d0 e0 (recvcount elements per block).
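A minimal usage sketch (not from the slides; rank and size as before, the values are illustrative): entry j of the send vector goes to process j, and entry i of the receive vector ends up holding what process i sent to this process:

int *sendvec = (int*) malloc(size * sizeof(int));
int *recvvec = (int*) malloc(size * sizeof(int));
for (int j = 0; j < size; j++)
        sendvec[j] = 100 * rank + j;      // value destined for process j

MPI_Alltoall(sendvec, 1, MPI_INT, recvvec, 1, MPI_INT, MPI_COMM_WORLD);
// now recvvec[i] == 100 * i + rank on every process
free(sendvec); free(recvvec);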

Page 65: Introduction to the Message Passing Interface (MPI)

Reduce

MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype dataType,
           MPI_Op op, int root, MPI_Comm comm)

The reduce operation aggregates ("reduces") scattered data at the root process.

Figure: each process pi contributes di (count elements) from its send buffer; after MPI_Reduce, the receive buffer at the root process (here root = p0) holds d0 ◊ d1 ◊ d2 ◊ d3 ◊ d4, where ◊ is an operation like sum, product, maximum, etc.

Page 66: Introduction to the Message Passing Interface (MPI)

Reduce operations

Available reduce operations (associative and commutative):

MPI_MAX        maximum
MPI_MIN        minimum
MPI_SUM        sum
MPI_PROD       product
MPI_LAND       logical AND
MPI_BAND       bitwise AND
MPI_LOR        logical OR
MPI_BOR        bitwise OR
MPI_LXOR       logical exclusive OR
MPI_BXOR       bitwise exclusive OR
MPI_MAXLOC     maximum and its location
MPI_MINLOC     minimum and its location

User-defined operations are also possible.

Page 67: Introduction to the Message Passing Interface (MPI)

Allreduce operation

MPI_Allreduce(void *sendbuf, void *recvbuf, int count,
              MPI_Datatype dataType, MPI_Op op, MPI_Comm comm)

Similar to the reduce operation, but the result is available on every process.

Figure: each process pi contributes di (count elements); after MPI_Allreduce, the receive buffer of every process p0 ... p4 holds d0 ◊ d1 ◊ d2 ◊ d3 ◊ d4.
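A typical use, sketched here (not from the slides; x, y and localN denote an illustrative local slice of two distributed vectors), is a parallel dot product in which every process needs the global sum:

double localSum = 0.0, globalSum;
for (int i = 0; i < localN; i++)          // partial sum over the local slice
        localSum += x[i] * y[i];

MPI_Allreduce(&localSum, &globalSum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
// globalSum now holds the full dot product on every process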

Page 68: Introduction to the Message Passing Interface (MPI)

Scan operation

MPI_Scan(void *sendbuf, void *recvbuf, int count,
         MPI_Datatype dataType, MPI_Op op, MPI_Comm comm)

A scan performs a partial reduction of the data; every process has a distinct result.

Figure: each process pi contributes di (count elements); after MPI_Scan, p0 holds d0, p1 holds d0 ◊ d1, p2 holds d0 ◊ d1 ◊ d2, ..., p4 holds d0 ◊ d1 ◊ d2 ◊ d3 ◊ d4.
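A typical use, sketched here (not from the slides; localN is an illustrative per-process item count), is computing the offset at which each process should place its items in a global, rank-ordered result; subtracting the local count turns the inclusive scan into an exclusive prefix sum:

int localN = 100;                 // number of items owned by this process
int inclusive, offset;
MPI_Scan(&localN, &inclusive, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
offset = inclusive - localN;      // total item count of all lower-ranked processes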

Page 69: Introduction to the Message Passing Interface (MPI)

Hands-on: matrix-vector multiplication

Master: coordinates the work of the others. Slave: does a bit of work.

Task: compute A.b
A: double-precision (m x n) matrix
b: double-precision (n x 1) column matrix

Master algorithm:
1. Broadcast b to each slave
2. Send 1 row of A to each slave
3. while ( not all m results received ) {
       Receive result from any slave s
       if ( not all m rows sent )
           Send new row to slave s
       else
           Send termination message to s
   }
4. continue

Slave algorithm:
1. Broadcast b (in fact receive b)
2. do {
       Receive message m
       if ( m != termination )
           compute result
           send result to master
   } while ( m != termination )
3. slave terminates

Page 70: Introduction to the Message Passing Interface (MPI)

Matrix-vector multiplication

int main( int argc, char** argv ) {
        int rows = 100, cols = 100;  // dimensions of a
        double **a;
        double *b, *c;
        int master = 0;              // rank of master
        int myid;                    // rank of this process
        int numprocs;                // number of processes

        // allocate memory for a, b and c
        a = (double**)malloc(rows * sizeof(double*));
        for( int i = 0; i < rows; i++ )
                a[i] = (double*)malloc(cols * sizeof(double));
        b = (double*)malloc(cols * sizeof(double));
        c = (double*)malloc(rows * sizeof(double));

        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &myid );
        MPI_Comm_size( MPI_COMM_WORLD, &numprocs );

        if( myid == master )
                // execute master code
        else
                // execute slave code

        MPI_Finalize();
}

Page 71: Introduction to the Message Passing Interface (MPI)

Matrix-vector multiplication (master code)

// initialize a and b
for( int j = 0; j < cols; j++ ) {
        b[j] = 1.0;
        for( int i = 0; i < rows; i++ )
                a[i][j] = i;
}

// broadcast b to each slave
MPI_Bcast( b, cols, MPI_DOUBLE, master, MPI_COMM_WORLD );

// send a row of a to each slave, tag = row number
int numsent = 0;
for( int i = 0; (i < numprocs-1) && (i < rows); i++ ) {
        MPI_Send( a[i], cols, MPI_DOUBLE, i+1, i, MPI_COMM_WORLD );
        numsent++;
}

for( int i = 0; i < rows; i++ ) {
        MPI_Status status;
        double ans;
        int sender;
        MPI_Recv( &ans, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
                  MPI_ANY_TAG, MPI_COMM_WORLD, &status );
        c[status.MPI_TAG] = ans;
        sender = status.MPI_SOURCE;
        if ( numsent < rows ) {  // send more work if any
                MPI_Send( a[numsent], cols, MPI_DOUBLE,
                          sender, numsent, MPI_COMM_WORLD );
                numsent++;
        } else  // send termination message
                MPI_Send( MPI_BOTTOM, 0, MPI_DOUBLE, sender,
                          rows, MPI_COMM_WORLD );
}

Page 72: Introduction to the Message Passing Interface (MPI)

Matrix-vector multiplication (slave code)

// broadcast b to each slave (receive here)
MPI_Bcast( b, cols, MPI_DOUBLE, master, MPI_COMM_WORLD );

// receive a row of a, tag = row number
if( myid <= rows ) {
        double* buffer = (double*)malloc(cols * sizeof(double));
        while (true) {
                MPI_Status status;
                MPI_Recv( buffer, cols, MPI_DOUBLE, master,
                          MPI_ANY_TAG, MPI_COMM_WORLD, &status );
                if( status.MPI_TAG != rows ) {  // not a termination message
                        double ans = 0.0;
                        for( int i = 0; i < cols; i++ )
                                ans += buffer[i] * b[i];
                        MPI_Send( &ans, 1, MPI_DOUBLE, master,
                                  status.MPI_TAG, MPI_COMM_WORLD );
                } else
                        break;
        }
} // more processes than rows => no work for some nodes

Page 73: Introduction to the Message Passing Interface (MPI)

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
    § Point-to-point communication
        o Blocking communication
        o Point-to-point network performance
        o Non-blocking communication
    § Collective communication
        o Collective communication algorithms
        o Global network performance
• Parallel program performance evaluation
    § Amdahl's law
    § Gustafson's law
• Parallel program development: case studies

Page 74: Introduction to the Message Passing Interface (MPI)

Network cost modeling

• Simple performance model for multiple communications:
    – Bisection bandwidth: sum of the bandwidths of the (average number of) links to cut to partition the network in two halves.
    – Diameter: maximum number of hops to connect any two devices.

Figure: a network split into two halves by a bisection.

Page 75: Introduction to the Message Passing Interface (MPI)

Bus topology

A bus is a communication channel that is shared by all connected devices.
• No more than two devices can communicate at any given time
• Hardware controls which devices have access
• High risk of contention when multiple devices try to access the bus simultaneously: a bus is a "blocking" interconnect

Bus network properties:
• Bisection bandwidth = point-to-point bandwidth (independent of the number of devices)
• Diameter = 1 (single hop)

Page 76: Introduction to the Message Passing Interface (MPI)

Crossbar switch

Figure: input devices i1 ... i4 and output devices o1 ... o4 connected by a grid of switches.

A crossbar switch can connect any combination of input/output pairs at full bandwidth = fully non-blocking.

Page 77: Introduction to the Message Passing Interface (MPI)

Static routing in a crossbar switch

Figure: input devices i1 ... i4 and output devices o1 ... o4; each switch is either in state A or in state B.

Switches at the crossing of an input/output pair should be in state B; other switches should be in state A. The path selection is fixed = static routing.

Input-output pairings:
Input   Output
i1      o2
i2      o4
i3      o1
i4      o3

Page 78: Introduction to the Message Passing Interface (MPI)

Crossbar switch for distributed-memory systems

• Crossbar implementation of a switch with P = 4 machines (M1 ... M4).
• Fully non-blocking
• Expensive: P² switches and O(P²) wires
• Usually implemented in a switch (4-port non-blocking switch)

Page 79: Introduction to the Message Passing Interface (MPI)

Clos switching networks

Figure: n inputs and n outputs connected through an input stage, a middle stage and an output stage, each built from 4x4 crossbar (CB) switches.

• Built in three stages, using crossbar switches as components.
• Every output of an input-stage switch is connected to a different middle-stage switch.
• Every input can be connected to every output, using any one of the middle-stage switches.
• All input/output pairs can communicate simultaneously (non-blocking).
• Requires far less wiring than conventional crossbar switches.


Page 82: Introduction to the Message Passing Interface (MPI)

Clos switching networks

Figure: the same Clos network with connections A, B, C and D already routed through the middle stage.

• Adaptive routing may be necessary: no connection can be found for E!

Page 83: Introduction to the Message Passing Interface (MPI)

Clos switching networks

Figure: the same Clos network; one of the existing paths is rerouted through another middle-stage switch.

• Adaptive routing may be necessary: reroute an existing path.

Page 84: Introduction to the Message Passing Interface (MPI)

Clos switching networks

Figure: the same Clos network after rerouting.

• Adaptive routing may be necessary: E can now be connected.

Page 85: Introduction to the Message Passing Interface (MPI)

Switch examples

Non-blocking switches: an Infiniband switch (24 ports) and a Gigabit Ethernet switch (24 ports).

Page 86: Introduction to the Message Passing Interface (MPI)

Mesh networks

Figure: a mesh network with its bisection indicated.

Page 87: Introduction to the Message Passing Interface (MPI)

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
    § Point-to-point communication
        o Blocking communication
        o Point-to-point network performance
        o Non-blocking communication
    § Collective communication
        o Collective communication algorithms
        o Global network performance
• Parallel program performance evaluation
    § Amdahl's law
    § Gustafson's law
• Parallel program development: case studies

Page 88: Introduction to the Message Passing Interface (MPI)

Basic performance terminology

• Runtime ("How long does it take to run my program?")
    § In practice, only wall clock time matters
    § Depends on the number of parallel processes P
    § TP = runtime using P processes
• Speedup ("How much faster does my program run in parallel?")
    § SP = T1 / TP
    § In the ideal case, SP = P
    § Superlinear speedups are usually due to cache effects
• Parallel efficiency ("How well is the parallel infrastructure used?")
    § ηP = SP / P   (0 ≤ ηP ≤ 1)
    § In the ideal case, ηP = 1 = 100%
    § It depends on the application what is considered acceptable efficiency
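Worked example (illustrative numbers): if a program takes T1 = 120 s on one process and T8 = 20 s on P = 8 processes, then S8 = 120/20 = 6 and η8 = 6/8 = 75%.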

Page 89: Introduction to the Message Passing Interface (MPI)

Strong scaling: Amdahl's law

• Strong scaling = increasing the number of parallel processes for a fixed-size problem
• Simple model: partition the sequential runtime T1 into a parallelizable fraction (1-s) and an inherently sequential fraction s.
    § T1 = s·T1 + (1-s)·T1   (0 ≤ s ≤ 1)
    § Therefore, TP = s·T1 + (1-s)·T1 / P

Figure: the sequential part s·T1 stays constant, while the parallelizable part (1-s)·T1 shrinks proportionally to 1/P when going from T1 to T2, T3, T4, ..., T∞.

Page 90: Introduction to the Message Passing Interface (MPI)

Strong scaling: Amdahl's law

• Consequently,

  SP = T1 / TP = T1 / (s·T1 + (1-s)·T1/P) = 1 / (s + (1-s)/P)   (Amdahl's law)

• The speedup is bounded: S∞ = 1/s
    § e.g. s = 1%, S∞ = 100
    § e.g. s = 5%, S∞ = 20
• Sources of the sequential fraction s·T1:
    § Process startup overhead
    § Inherently sequential portions of the code
    § Dependencies between subtasks
    § Communication overhead
    § Function calling overhead
    § Load imbalance

Page 91: Introduction to the Message Passing Interface (MPI)

Amdahl's law

Figure: parallel speedup versus the number of parallel processes P (0 to 30) for s = 0%, 1%, 2% and 5%, together with the linear-speedup line. For small P, the speedup is close to linear.

Page 92: Introduction to the Message Passing Interface (MPI)

Amdahl's law

Figure: same graph as on the previous slide, but with a logarithmic x-axis (P = 1 ... 65536). The speedup saturates at S∞ = 1/s.

Page 93: Introduction to the Message Passing Interface (MPI)

Amdahl's law

Figure: same graph as on the previous slide, but expressed as parallel efficiency (P = 1 ... 65536) for s = 0%, 1%, 2% and 5%.

Page 94: Introduction to the Message Passing Interface (MPI)

Weak scaling: Gustafson's law

• Weak scaling: increasing both the number of parallel processes P and the problem size N.
• Simple model: assume that the sequential part is constant, and that the parallelizable part is proportional to N.
    § T1(N) = T1(1)·[s + (1-s)·N]   (0 ≤ s ≤ 1)

Figure: T1(1) consists of a sequential part s·T1(1) and a parallelizable part (1-s)·T1(1); for T1(2), T1(3), T1(4), ... the parallelizable part grows proportionally to N while the sequential part stays the same.

Page 95: Introduction to the Message Passing Interface (MPI)

Gustafson's law

Therefore, TP(N) = T1(1)·[s + (1-s)·N/P], and hence TP(P) = T1(1).
Speedup: SP(P) = s + (1-s)·P = P - s·(P-1)   (Gustafson's law)

• Solve increasingly larger problems using a proportionally higher number of processes (N = P).

Figure: T1(1) = sequential part s·T1(1) plus parallelizable part (1-s)·T1(1); T2(2), T3(3), T4(4), ... keep the same total runtime because the growing parallelizable part is spread over more processes.
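Worked example (illustrative numbers): with s = 5% and P = N = 100, Gustafson's law gives S100(100) = 100 - 0.05·99 ≈ 95, i.e. roughly 95% efficiency under weak scaling, whereas Amdahl's law bounds the strong-scaling speedup for the same s at S∞ = 20.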

Page 96: Introduction to the Message Passing Interface (MPI)

Gustafson's law

Figure: parallel speedup versus the number of parallel processes P (= problem size N), from 0 to 250, for s = 0%, 1%, 2% and 5%.

Page 97: Introduction to the Message Passing Interface (MPI)

Gustafson's law

Figure: same graph as on the previous slide, but expressed as parallel efficiency and with a logarithmic scale (P = N = 1 ... 65536). The efficiency approaches η∞ = 1 - s.

Page 98: Introduction to the Message Passing Interface (MPI)

Outline

• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
    § Point-to-point communication
        o Blocking communication
        o Point-to-point network performance
        o Non-blocking communication
    § Collective communication
        o Collective communication algorithms
        o Global network performance
• Parallel program performance evaluation
    § Amdahl's law
    § Gustafson's law
• Parallel program development: case studies

Page 99: Introduction to the Message Passing Interface (MPI)

Case Study 1: Parallel matrix-matrix product

Page 100: Introduction to the Message Passing Interface (MPI)

Case study: parallel matrix multiplication

• Matrix-matrix multiplication: C = α*A*B + β*C (BLAS xgemm)
    § Assume C = m x n; A = m x k; B = k x n matrix.
    § α, β = scalars; assume α = β = 1 in what follows, i.e. C = C + A*B
• Initially, the matrix elements are distributed among P processes
    § Assume the same scheme for each matrix A, B and C
• Each process computes the values of C that are local to that process
    § Required data from A and B that is not local needs to be communicated
    § Performance modeling assuming the α, β, γ model
        o α = latency
        o β = per-element transfer time
        o γ = time for a single floating-point computation

Case study reproduced from J. Demmel

Page 101: Introduction to the Message Passing Interface (MPI)

Case study: parallel matrix-matrix product

• Two-dimensional partitioning
    § Partition the matrices in 2D on an r x c mesh (P = r*c)
    § X(I,J) refers to block (I,J) of matrix X (X = {A, B, C})
    § Process pn is also denoted by pi,j (n = i*c + j) and holds X(I,J)

Figure: the r x c process mesh p0,0 ... pr-1,c-1 laid over a matrix (e.g. the k x n matrix B), with block B(I,J) held by process pi,j (r processes along one dimension, c along the other).

Case study reproduced from J. Demmel

Page 102: Introduction to the Message Passing Interface (MPI)

Case study: parallel matrix-matrix product

• Second approach: two-dimensional partitioning (SUMMA)
    § Process pi,j needs to compute C(I,J):

      C(I,J) = C(I,J) + Σ_{i=0}^{k-1} A(I,i) * B(i,J)     for all I, J (in parallel)

      where index i refers to a single column of A or a single row of B (k elements).

Figure: C(I,J) += A * B; the block C(I,J) and the data local to pi,j are highlighted, together with the column A(I,i) and the row B(i,J) that need to be communicated to pi,j; the remaining data is not needed by pi,j.

Case study reproduced from J. Demmel

Page 103: Introduction to the Message Passing Interface (MPI)

Case study: parallel matrix-matrix product

• Second approach: two-dimensional partitioning
    § SUMMA algorithm (all I and J in parallel):

      for i = 0 to k-1
          broadcast A(I, i) within the process row
          broadcast B(i, J) within the process column
          C(I,J) += A(I, i) * B(i, J)
      endfor

    § Cost of the inner loop (executed k times):
      log2(c)·(α + β·m/r) + log2(r)·(α + β·n/c) + 2mnγ/P
    § Total: TP = 2kmnγ/P + kα·(log2(c) + log2(r)) + kβ·((m/r)·log2(c) + (n/c)·log2(r))
    § For n = m = k and r = c = √P we find:
      parallel efficiency ηP = 1 / (1 + (α/γ)·(P·log2(P))/(2n²) + (β/γ)·(√P·log2(P))/n)
      Isoefficiency when n grows as √P (constant memory per node!)

Case study reproduced from J. Demmel
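The sketch below (not from the slides) shows what the communicator setup and the broadcast loop can look like in MPI for a square √P x √P process mesh, with the panel width chosen equal to the local block size nb (i.e. a blocked variant); the names q, nb and localA/localB/localC are illustrative assumptions, and the local update is written as a plain triple loop instead of a Level-3 BLAS call:

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* localC, localA, localB: local nb x nb blocks (row-major) of C, A and B
   on a q x q process mesh (q = sqrt(P)); C is updated as C += A * B      */
void summa(double *localC, const double *localA, const double *localB,
           int nb, int rank, int q)
{
        int myRow = rank / q, myCol = rank % q;
        MPI_Comm rowComm, colComm;
        MPI_Comm_split(MPI_COMM_WORLD, myRow, myCol, &rowComm); /* my process row    */
        MPI_Comm_split(MPI_COMM_WORLD, myCol, myRow, &colComm); /* my process column */

        double *Abuf = (double*) malloc(nb * nb * sizeof(double));
        double *Bbuf = (double*) malloc(nb * nb * sizeof(double));

        for (int t = 0; t < q; t++) {
                /* process column t broadcasts its A block within each process row */
                if (myCol == t) memcpy(Abuf, localA, nb * nb * sizeof(double));
                MPI_Bcast(Abuf, nb * nb, MPI_DOUBLE, t, rowComm);

                /* process row t broadcasts its B block within each process column */
                if (myRow == t) memcpy(Bbuf, localB, nb * nb * sizeof(double));
                MPI_Bcast(Bbuf, nb * nb, MPI_DOUBLE, t, colComm);

                /* local update: C(I,J) += A(I,t) * B(t,J) */
                for (int i = 0; i < nb; i++)
                        for (int l = 0; l < nb; l++)
                                for (int j = 0; j < nb; j++)
                                        localC[i*nb + j] += Abuf[i*nb + l] * Bbuf[l*nb + j];
        }
        free(Abuf); free(Bbuf);
        MPI_Comm_free(&rowComm);
        MPI_Comm_free(&colComm);
}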

Page 104: Introduction to the Message Passing Interface (MPI)

Case study: parallel matrix-matrix product

• Even more efficient: use a "blocking" algorithm

  for i = 0 to k-1 step b
      end = min(i+b-1, k-1)
      broadcast A(I, i:end) within the process row
      broadcast B(i:end, J) within the process column
      C(I,J) += A(I, i:end) * B(i:end, J)      // perform this product using Level-3 BLAS
  endfor

• The SUMMA algorithm is implemented in PBLAS = Parallel BLAS
• The algorithm can be extended to a block-cyclic layout (see further)

Page 105: Introduction to the Message Passing Interface (MPI)

Case study: parallel Gaussian elimination

Image taken from J. Demmel

Page 106: Introduction to the Message Passing Interface (MPI)

Case Study 2: Parallel Sorting

Page 107: Introduction to the Message Passing Interface (MPI)

Case study: parallel sorting algorithm

• Sequential sorting of n keys (1st Bachelor)
    § Bubble sort: O(n²)
    § Merge sort: O(n log n), even in the worst case
    § Quicksort: O(n log n) expected, O(n²) worst case, fast in practice
• Parallel sorting of n keys, using P processes
    § Initially, each process holds n/P keys (unsorted)
    § Eventually, each process holds n/P keys (sorted)
        o The keys per process are sorted
        o If q < r, each key assigned to process q is less than or equal to every key assigned to process r (keys sorted in rank order)

Page 108: Introduction to the Message Passing Interface (MPI)

Case study: parallel sorting algorithm

Bubble sort algorithm: O(n²)

void Bubble_sort(int *a, int n) {
        for (int listLen = n; listLen >= 2; listLen--)
                for (int i = 0; i < listLen-1; i++)
                        if (a[i] > a[i+1]) {       // "compare-swap" operation
                                int temp = a[i];
                                a[i] = a[i+1];
                                a[i+1] = temp;
                        }
}

Example: 5 2 4 8 1 → 2 4 5 1 8 (after iteration 1)
                   → 2 4 1 5 8 (after iteration 2)
                   → 2 1 4 5 8 (after iteration 3)
                   → 1 2 4 5 8 (after iteration 4)

Algorithm reproduced from P. Pacheco

Page 109: Introduction to the Message Passing Interface (MPI)

Case study: parallel sorting algorithm

• Bubble sort
    § The result of the current step (a[i] > a[i+1]) depends on the previous step
        o The value of a[i] is determined by the previous step
        o The algorithm is "inherently serial"
        o Not much point in trying to parallelize this algorithm
• Odd-even transposition sort
    § Decouple the algorithm into two phases: even and odd
        o Even phase: compare-swap on the following elements:
          (a[0], a[1]), (a[2], a[3]), (a[4], a[5]), ...
        o Odd phase: compare-swap on the following elements:
          (a[1], a[2]), (a[3], a[4]), (a[5], a[6]), ...

Page 110: Introduction to the Message Passing Interface (MPI)

Case study: parallel sorting algorithm

Even-odd transposition sort algorithm: O(n²)

void Even_odd_sort(int *a, int n) {
        for (int phase = 0; phase < n; phase++)
                if (phase % 2 == 0) { // even phase
                        for (int i = 0; i < n-1; i += 2)
                                if (a[i] > a[i+1]) {       // "compare-swap" operation
                                        int temp = a[i];
                                        a[i] = a[i+1];
                                        a[i+1] = temp;
                                }
                } else { // odd phase
                        for (int i = 1; i < n-1; i += 2)
                                if (a[i] > a[i+1]) {       // "compare-swap" operation
                                        int temp = a[i];
                                        a[i] = a[i+1];
                                        a[i+1] = temp;
                                }
                }
}

Algorithm reproduced from P. Pacheco

Page 111: Introduction to the Message Passing Interface (MPI)

Case study: parallel sorting algorithm

Even-odd transposition sort algorithm: O(n²)

Example: 5 2 4 8 1 → 2 5 4 8 1 (even phase)
                   → 2 4 5 1 8 (odd phase)
                   → 2 4 1 5 8 (even phase)
                   → 2 1 4 5 8 (odd phase)
                   → 1 2 4 5 8 (even phase)

The parallelism within each even or odd phase is now obvious:
the compare-swap between (a[i], a[i+1]) is independent from (a[i+2], a[i+3]).

Page 112: Introduction to the Message Passing Interface (MPI)

Case study: parallel sorting algorithm

Parallel algorithm
• First, assume P == n (one element per process)

Figure: in phase j, neighboring elements (..., a[i-2], a[i-1]), (a[i], a[i+1]), ... are compared in parallel; in phase j+1, the pairing shifts by one position.

Communicate the value with the neighbor:
• The right process (highest rank) keeps the largest value
• The left process (lowest rank) keeps the smallest value

Image reproduced from P. Pacheco

Page 113: Introduction to the Message Passing Interface (MPI)

Case study: parallel sorting algorithm

Parallel algorithm
• Now, assume n/P >> 1 (as is typically the case)
• Example: P = 4, n = 16

                 process 0      process 1      process 2      process 3
initial values   15,11,9,16     3,14,8,7       4,6,12,10      5,2,13,1
local sorting    9,11,15,16     3,7,8,14       4,6,10,12      1,2,5,13
phase 0 (even)   3,7,8,9        11,14,15,16    1,2,4,5        6,10,12,13
phase 1 (odd)    3,7,8,9        1,2,4,5        11,14,15,16    6,10,12,13
phase 2 (even)   1,2,3,4        5,7,8,9        6,10,11,12     13,14,15,16
phase 3 (odd)    1,2,3,4        5,6,7,8        9,10,11,12     13,14,15,16

Algorithm reproduced from P. Pacheco

Page 114: Introduction to the Message Passing Interface (MPI)

Case study: parallel sorting algorithm

Parallel even-odd transposition sort pseudocode:

sort local keys
for (int phase = 0; phase < P; phase++) {
        neighbor = computeNeighbor(phase, myRank);
        if (I'm not idle) { // first and/or last process may be idle
                send all my keys to neighbor
                receive all keys from neighbor
                if (myRank < neighbor)
                        keep smaller keys
                else
                        keep larger keys
        }
}

Theorem: the parallel odd-even transposition sort algorithm will sort the input list after P (= number of processes) phases.

Algorithm reproduced from P. Pacheco
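A possible realization of this pseudocode is sketched below (not the slides' reference code; it assumes localN int keys per process, uses qsort for the local sort, and relies on the computeNeighbor routine shown on the next slide):

#include <mpi.h>
#include <stdlib.h>

int computeNeighbor(int phase, int myRank);   /* see next slide */

static int cmpInt(const void *a, const void *b) {
        return (*(const int *)a > *(const int *)b) - (*(const int *)a < *(const int *)b);
}

void oddEvenSort(int *keys, int localN, int myRank, int P)
{
        int *recvKeys = (int*) malloc(localN * sizeof(int));
        int *merged   = (int*) malloc(2 * localN * sizeof(int));

        qsort(keys, localN, sizeof(int), cmpInt);            /* sort local keys */

        for (int phase = 0; phase < P; phase++) {
                int neighbor = computeNeighbor(phase, myRank);   /* may be MPI_PROC_NULL */

                MPI_Sendrecv(keys,     localN, MPI_INT, neighbor, 0,
                             recvKeys, localN, MPI_INT, neighbor, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                if (neighbor == MPI_PROC_NULL)
                        continue;                             /* idle in this phase */

                /* merge the two sorted lists ... */
                for (int i = 0, j = 0, k = 0; k < 2 * localN; k++)
                        merged[k] = (j >= localN || (i < localN && keys[i] <= recvKeys[j]))
                                    ? keys[i++] : recvKeys[j++];

                /* ... and keep the smaller or the larger half */
                if (myRank < neighbor)
                        for (int k = 0; k < localN; k++) keys[k] = merged[k];
                else
                        for (int k = 0; k < localN; k++) keys[k] = merged[localN + k];
        }
        free(recvKeys);
        free(merged);
}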

Page 115: Introduction to the Message Passing Interface (MPI)

Case study: parallel sorting algorithm

Implementation of computeNeighbor (MPI):

int computeNeighbor(int phase, int myRank) {
        int neighbor;
        if (phase % 2 == 0) {
                if (myRank % 2 == 0)
                        neighbor = myRank + 1;
                else
                        neighbor = myRank - 1;
        } else {
                if (myRank % 2 == 0)
                        neighbor = myRank - 1;
                else
                        neighbor = myRank + 1;
        }
        if (neighbor == -1 || neighbor == P)
                neighbor = MPI_PROC_NULL;
        return neighbor;
}

When MPI_PROC_NULL is used as the destination or source rank in MPI_Send or MPI_Recv, no communication takes place.

Algorithm reproduced from P. Pacheco

Page 116: Introduction to the Message Passing Interface (MPI)

Case study: parallel sorting algorithm

Implementation of the data exchange in MPI:
• Be careful of deadlocks
• In both even and odd phases, communication always takes place between a process with an even rank and a process with an odd rank
• Exchange the send/receive order to prevent deadlocks:

if (myRank % 2 == 0) {
        MPI_Send(...);
        MPI_Recv(...);
} else {
        MPI_Recv(...);
        MPI_Send(...);
}

OR use MPI_Sendrecv(...)

Algorithm reproduced from P. Pacheco

Page 117: Introduction to the Message Passing Interface (MPI)

Case study: parallel sorting algorithm

Parallel odd-even transposition sort algorithm analysis
• Initial sorting: O((n/P)·log(n/P)) time
    • Use an efficient sequential sorting algorithm, e.g. quicksort or merge sort
• Per phase: 2(α + (n/P)·β) + γ·n/P
• Total runtime: TP(n) = O((n/P)·log(n/P)) + 2(αP + nβ) + γn
                       = (1/P)·O(n log n) + O(n)
• Linear speedup when P is small and n is large
• However, bad asymptotic behaviour
    • When n and P increase proportionally, the runtime per process is O(n)
    • What we really want: O(log n)
    • Difficult! (but possible!)

Algorithm reproduced from P. Pacheco

Page 118: Introduction to the Message Passing Interface (MPI)

Case study: parallel sorting algorithm

• Sorting networks (= graphical depiction of sorting algorithms)
    § A number of horizontal "wires" (= the n elements to sort)
    § Connected by vertical "comparators" (= compare and swap)

Figure: example network with 4 elements to sort (wires run left to right over time); each comparator takes inputs x and y and outputs min(x,y) on one wire and max(x,y) on the other.

Page 119: Introduction to the Message Passing Interface (MPI)

Case study: parallel sorting algorithm

Figure: the bubble sort algorithm drawn as a sorting network, sequentially and with independent comparators aligned in parallel.

• Sequential runtime = number of comparators = "size" of the sorting network = n(n-1)/2
• Parallel runtime (assume P == n processes) = "depth" of the sorting network = 2n - 3

Page 120: Introduction to the Message Passing Interface (MPI)

Case study: parallel sorting algorithm

Definition: depth of a sorting network (= parallel runtime)
• Zero at the inputs of each wire
• For a comparator with inputs of depth d1 and d2, the depth of its outputs is 1 + max(d1, d2)
• Depth of the sorting network = maximum depth over the outputs

Figure: the bubble sort network with each comparator output annotated with its depth; the inputs have depth 0 and the maximum output depth is 2n - 3 (here 9).

Page 121: Introduction to the Message Passing Interface (MPI)

Case study: parallel sorting algorithm

• Sorting network of the odd-even transposition sort
• Parallel runtime = "depth" of the sorting network = n
• ... or P (we assume n == P)

Page 122: Introduction to the Message Passing Interface (MPI)

Case study: parallel sorting algorithm

• Can we do better?
    • Sequential sorting algorithms are O(n log n)
    • Ideally, P and n can scale proportionally: P = O(n)
    • That means that we want to sort n numbers in O(log n) time
• This is possible (!), but with a big constant pre-factor
• We will describe an algorithm that can sort n numbers in O(log² n) parallel time (using P = O(n) processes)
    • This algorithm has the best performance in practice
    • ... unless n becomes huge (n > 2^2000)
    • Nobody wants to sort that many numbers

Page 123: Introduction to the Message Passing Interface (MPI)

Case study: parallel sorting algorithm

• Theorem (0-1 principle): if a sorting network with n inputs sorts all 2^n binary strings of length n correctly, then it sorts all sequences correctly (proof: see references).

We will design an algorithm that can sort binary sequences in O(log² n) time.

Figure: example of a sorting network applied to a binary (0/1) input sequence.

Page 124: Introduction to the Message Passing Interface (MPI)

Case study: parallel sorting algorithm

• Step 1: create a sorting network that sorts bitonic sequences
• Definition: a bitonic sequence is a sequence which is first increasing and then decreasing, or can be circularly shifted to become so.
    § (1, 2, 3, 3.14, 5, 4, 3, 2, 1) is bitonic
    § (4, 5, 4, 3, 2, 1, 1, 2, 3) is bitonic
    § (1, 2, 1, 2) is not bitonic
• Over zeros and ones, a bitonic sequence is of the form
    § 0^i 1^j 0^k or 1^i 0^j 1^k (with e.g. 0^i = 000...0 = i consecutive zeros)
    § i, j or k can be zero

Page 125: Introduction to the Message Passing Interface (MPI)

Case study: parallel sorting algorithm

• Now, let's create a sorting network that sorts a bitonic sequence
• A half-cleaner network connects line i with line i + n/2 (half-cleaner example for n = 8 in the figure)
• If the input is a binary bitonic sequence, then for the output:
    § Elements in the top half are smaller than the corresponding elements in the bottom half, i.e. the halves are relatively sorted.
    § One of the halves of the output consists of only zeros or only ones (i.e. is "clean"); the other half is bitonic.

Page 126: Introduction to the Message Passing Interface (MPI)

Case study: parallel sorting algorithm

• Example of a half-cleaner network

Figure: a bitonic 0/1 input sequence fed into a half-cleaner; in the output, the upper half is smaller than the lower half, the upper half is "clean" (all zeros) and the lower half is bitonic.

Page 127: Introduction to the Message Passing Interface (MPI)

Case study: parallel sorting algorithm

• Therefore, a bitonic sorter[n] (i.e. a network that sorts a bitonic sequence of length n) is obtained as a half-cleaner[n] followed by two bitonic sorters[n/2], one on each half.

A bitonic sorter sorts a bitonic sequence of length n = 2^k using:
• size = nk/2 = (n/2)·log2(n) comparators (= sequential time)
• depth = k = log2(n) (= parallel time)

Page 128: Introduction to the Message Passing Interface (MPI)

Case study: parallel sorting algorithm

• Step 2: build a network merger[n] that merges two sorted sequences of length n/2 so that the output is sorted
    § Flip the second sequence and concatenate the first and the flipped second sequence
    § The concatenated sequence is bitonic, so sort it using step 1

Figure: sorted sequence 1 and the flipped sorted sequence 2 are compared pairwise; each of the two output halves is fed to a bitonic-sorter[n/2].

Page 129: Introduction to the Message Passing Interface (MPI)

Case study: parallel sorting algorithm

• Step 3: build a sorter[n] network that sorts arbitrary sequences
    § Do this recursively from merger[n] building blocks: a sorter[n] consists of two sorter[n/2] networks followed by a merger[n] (from step 2).
    § Depth: D(1) = 0 and D(n) = D(n/2) + log2(n) = O(log²(n))

Page 130: Introduction to the Message Passing Interface (MPI)

Case study: parallel sorting algorithm

Example: sorter[16]

Figure: a sorter[16] built from merger[2], merger[4], merger[8] and merger[16] stages; the merger[16] consists of a flip-cleaner[16] followed by half-cleaner[8] stages.

Page 131: Introduction to the Message Passing Interface (MPI)

Case study: parallel sorting algorithm

• Further reading on bitonic networks:
    § http://valis.cs.uiuc.edu/~sariel/teach/2004/b/webpage/lec/14_sortnet_notes.pdf
• In case n is not a power of two:
    § http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/bitonic/oddn.htm

Page 132: Introduction to the Message Passing Interface (MPI)