Non-Blocking Collectives for MPI-2: overlap at the highest level

Torsten Höfler

Department of Computer Science
Indiana University / Technical University of Chemnitz

Commissariat à l'Énergie Atomique
Direction des applications militaires (CEA-DAM)

Bruyères-le-Châtel, France
18th January 2007


Outline

1 Some Considerations about Interconnects

2 Why Non-Blocking Collectives?

3 LibNBC

4 And Applications?

5 Ongoing Efforts



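The LogGP Model

[Figure: LogGP timing diagram with a CPU level and a network level over time. Marked quantities: sender overhead o_s, receiver overhead o_r, network latency L, and the term g + m*G on both the sender and the receiver side.]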


Interconnect Trends

Technology Change

- modern interconnects have co-processors (Quadrics, InfiniBand, Myrinet)
- TCP/IP is optimized for lower host overhead (see our work)
- Ethernet protocol offload
- L + g + m · G >> o

⇒ we check these expectations with benchmarks of the user CPU overhead


LogGP Model Examples - TCP

[Figure: time in microseconds (0 to 600) versus data size in bytes (s, 0 to 60000). Curves: MPICH2 - G*s+g, MPICH2 - o, TCP - G*s+g, TCP - o.]


LogGP Model Examples - Myrinet/GM

[Figure: time in microseconds (0 to 350) versus data size (0 to 60000). Curves: Open MPI - G*s+g, Open MPI - o, Myrinet/GM - G*s+g, Myrinet/GM - o.]


LogGP Model Examples - InfiniBand/OpenIB

[Figure: time in microseconds (0 to 90) versus data size (0 to 60000). Curves: Open MPI - G*s+g, Open MPI - o, OpenIB - G*s+g, OpenIB - o.]


Literature

[1] T. Hoefler, A. Lichei, and W. Rehm: Low-Overhead LogGP Parameter Assessment for Modern Interconnection Networks. Accepted at PMEO-PDS 2007 in conjunction with IPDPS 2007

[2] T. Hoefler, J. Squyres, G. Fagg, G. Bosilca, W. Rehm, and A. Lumsdaine: A New Approach to MPI Collective Communication Implementations. In Proceedings of the 6th Austrian-Hungarian Workshop on Distributed and Parallel Systems



Modelling the Benefits

LogGP Models - general

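$t_{barr} = (2o + L) \cdot \lceil \log_2 P \rceil$

$t_{allred} = 2 \cdot (2o + L + m \cdot G) \cdot \lceil \log_2 P \rceil + m \cdot \gamma \cdot \lceil \log_2 P \rceil$

$t_{bcast} = (2o + L + m \cdot G) \cdot \lceil \log_2 P \rceil$

CPU and Network LogGP parts

$t^{CPU}_{barr} = 2o \cdot \lceil \log_2 P \rceil$, $t^{NET}_{barr} = L \cdot \lceil \log_2 P \rceil$

$t^{CPU}_{allred} = (4o + m \cdot \gamma) \cdot \lceil \log_2 P \rceil$, $t^{NET}_{allred} = 2 \cdot (L + m \cdot G) \cdot \lceil \log_2 P \rceil$

$t^{CPU}_{bcast} = 2o \cdot \lceil \log_2 P \rceil$, $t^{NET}_{bcast} = (L + m \cdot G) \cdot \lceil \log_2 P \rceil$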
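The formulas above can be evaluated directly. The following is a minimal sketch in C, not code from the talk or from LibNBC. The struct and function names are made up for illustration, and reading γ as a per-byte reduction cost is an assumption based on where it appears in the allreduce formula.

```c
#include <assert.h>

/* LogGP parameters as used in the formulas above. The field names are
 * illustrative; gamma is assumed to be the per-byte reduction cost. */
typedef struct {
    double L;     /* network latency */
    double o;     /* CPU overhead per message */
    double G;     /* gap per byte */
    double gamma; /* assumed: reduction cost per byte */
} loggp_t;

/* ceil(log2(P)) for P >= 1, computed with integers to avoid rounding. */
int ceil_log2(int P) {
    assert(P >= 1);
    int rounds = 0;
    for (long reach = 1; reach < P; reach *= 2) {
        rounds++;
    }
    return rounds;
}

/* Total time of a barrier on P processes. */
double t_barr(const loggp_t *p, int P) {
    return (2.0 * p->o + p->L) * ceil_log2(P);
}

/* Total time of a broadcast of m bytes on P processes. */
double t_bcast(const loggp_t *p, int P, double m) {
    return (2.0 * p->o + p->L + m * p->G) * ceil_log2(P);
}

/* Total time of an allreduce of m bytes on P processes. */
double t_allred(const loggp_t *p, int P, double m) {
    return 2.0 * (2.0 * p->o + p->L + m * p->G) * ceil_log2(P)
         + m * p->gamma * ceil_log2(P);
}

/* CPU part of the allreduce: the share that cannot be hidden by overlap. */
double t_cpu_allred(const loggp_t *p, int P, double m) {
    return (4.0 * p->o + m * p->gamma) * ceil_log2(P);
}

/* Network part of the allreduce: the share that overlap could hide. */
double t_net_allred(const loggp_t *p, int P, double m) {
    return 2.0 * (p->L + m * p->G) * ceil_log2(P);
}
```

The CPU and network parts add up to the total, so the ratio of `t_net_allred` to `t_allred` gives a rough upper bound on the fraction of the operation that overlap could hide under this model.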


User Overhead Benchmarks

LAM/MPI 7.1.2

[Figure: 3D plot of CPU usage (share, 0 to 0.03) over communicator size (10 to 60) and data size (1 to 100000).]


User Overhead Benchmarks

MPICH2 1.0.3

[Figure: 3D plot of CPU usage (share, 0 to 0.08) over communicator size (10 to 60) and data size (1 to 100000).]


Send/Recv is there - Why Collectives?

Gorlatch, '04: "Send-Receive Considered Harmful" ⇔ Dijkstra, '68: "Go To Statement Considered Harmful"

point to point:

    if (rank == 0) then
      do i = 1, p
        call MPI_SEND(...)
      enddo
    else
      call MPI_RECV(...)
    endif

vs. collective:

    call MPI_SCATTER(...)

compare: math libraries vs. hand-written loops


Putting Everything Together

- non-blocking collectives?
- JoD (the MPI Journal of Development) mentions "split collectives"
- example:

    MPI_Bcast_begin(...)
    MPI_Bcast_end(...)

- no nesting with other collectives
- very limited
- not in the MPI-2 standard
- votes: 11 yes, 12 no, 2 abstain


Performance Benefits

overlap
- leverage hardware parallelism (e.g., InfiniBand™)
- overlap similar to non-blocking point-to-point

pseudo synchronization
- avoidance of explicit pseudo synchronization
- limit the influence of OS noise


Process Skew

- caused by OS interference or unbalanced application
- especially if processors are overloaded
- worse for big systems
- can cause dramatic performance decrease
- all nodes wait for the last

Example
Petrini et al. (2003): "The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q"



Process Skew - MPI Example - Jumpshot

[Figure: Jumpshot trace of processes P0 to P3 over time.]


Process Skew - NBC Example - Jumpshot

[Figure: Jumpshot trace of processes P0 to P3 over time.]


Literature

[3] T. Hoefler, J. Squyres, W. Rehm, and A. Lumsdaine: A Case for Non-Blocking Collective Operations. In Frontiers of High Performance Computing and Networking, pages 155-164, Springer Berlin / Heidelberg, ISBN: 978-3-540-49860-5, Dec. 2006

[4] T. Hoefler, J. Squyres, G. Bosilca, G. Fagg, A. Lumsdaine, and W. Rehm: Non-Blocking Collective Operations for MPI-2. Open Systems Lab, Indiana University. Presented in Bloomington, IN, USA, School of Informatics, Aug. 2006



Non-Blocking Collectives - Interface

- extension to MPI-2
- "mixture" between non-blocking point-to-point and collectives
- uses MPI_Requests and MPI_Test/MPI_Wait

    MPI_Ibcast(buf1, p, MPI_INT, 0, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

Proposal
Hoefler et al. (2006): "Non-Blocking Collective Operations for MPI-2"



Non-Blocking Collectives - Implementation

- implementation available with LibNBC
- written in ANSI C and uses only MPI-1
- central element: collective schedule
- a collective algorithm can be represented as a schedule
- trivial addition of new algorithms

Example: dissemination barrier, 4 nodes, node 0:

    send to 1 | recv from 3 | end | send to 2 | recv from 2 | end

LibNBC download: http://www.unixer.de/NBC
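The dissemination barrier example above follows the usual dissemination pattern: in round k a node sends to the node 2^k positions ahead and receives from the node 2^k positions behind, modulo the number of nodes. The following C sketch generates that list of rounds for any node. It illustrates the pattern only and does not reproduce LibNBC's internal schedule format; `round_t` and `dissemination_schedule` are names chosen for this example.

```c
#include <assert.h>

/* One round of the barrier: a send and a receive that must both finish
 * before the next round starts (the "end" marker in the example above). */
typedef struct {
    int send_to;
    int recv_from;
} round_t;

/* Fill rounds[] with the dissemination pattern for `rank` out of `nprocs`
 * nodes and return the number of rounds, which is ceil(log2(nprocs)).
 * max_rounds is the capacity of rounds[]. */
int dissemination_schedule(int rank, int nprocs, round_t *rounds, int max_rounds) {
    assert(nprocs >= 1 && rank >= 0 && rank < nprocs);
    int n = 0;
    for (int dist = 1; dist < nprocs; dist *= 2) {
        assert(n < max_rounds);
        rounds[n].send_to = (rank + dist) % nprocs;
        /* dist < nprocs, so adding nprocs keeps the operand non-negative */
        rounds[n].recv_from = (rank - dist + nprocs) % nprocs;
        n++;
    }
    return n;
}
```

For node 0 of 4 this yields two rounds, (send to 1, recv from 3) and (send to 2, recv from 2), which matches the schedule shown above.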


LibNBC Benchmarks - Gather with InfiniBand/MVAPICH on 64 nodes

[Figure: Runtime (s), 0 to 30000, versus Datasize (bytes), 0 to 300000. Curves: MPI_Gather, NBC_Igather.]


LibNBC Benchmarks - Scatter with InfiniBand/MVAPICH on 64 nodes

[Figure: Runtime (s), 0 to 30000, versus Datasize (bytes), 0 to 300000. Curves: MPI_Scatter, NBC_Iscatter.]


LibNBC Benchmarks - Alltoall with InfiniBand/MVAPICH on 64 nodes

[Figure: Runtime (s), 0 to 50000, versus Datasize (bytes), 0 to 300000. Curves: MPI_Alltoall, NBC_Ialltoall.]


LibNBC Benchmarks - Allreduce with InfiniBand/MVAPICH on 64 nodes

[Figure: Runtime (s), 0 to 60000, versus Datasize (bytes), 0 to 300000. Curves: MPI_Allreduce, NBC_Iallreduce.]


Literature

[5] T. Hoefler and A. Lumsdaine: Design, Implementation, and Usage of LibNBC. Open Systems Lab, Indiana University. Presented in Bloomington, IN, USA, School of Informatics, Aug. 2006

[6] T. Hoefler, A. Lumsdaine, and W. Rehm: Implementation and Performance Analysis of Non-Blocking Collective Operations for MPI. Under submission (ask me for a copy)



Pseudocode Overlap with Double Buffering

    do ....
      ! send communication buffer
      MPI_Ialltoall(buffer_comm, ..., request)

      ! do useful work
      do_work(buffer_work)

      ! finish communication
      MPI_Wait(request, ...)

      ! swap the buffers
      buffer_tmp = buffer_comm
      buffer_comm = buffer_work
      buffer_work = buffer_tmp
    enddo
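The buffer handling in the pseudocode above can be tried out without an MPI installation. The following C sketch keeps the same loop structure but replaces the non-blocking all-to-all and the wait with hypothetical stand-ins (`start_exchange`, `finish_exchange`) that only track which buffer is in flight. None of these names come from MPI or LibNBC.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a communication request: it only remembers
 * which buffer is currently being communicated. */
typedef struct {
    const double *in_flight;
    size_t started;
    size_t completed;
} comm_t;

/* Stand-in for starting the non-blocking all-to-all on buf. */
static void start_exchange(comm_t *c, const double *buf) {
    assert(c->in_flight == NULL); /* one outstanding operation at a time */
    c->in_flight = buf;
    c->started++;
}

/* Stand-in for waiting on the request. */
static void finish_exchange(comm_t *c) {
    c->in_flight = NULL;
    c->completed++;
}

/* Stand-in for the useful work; it must not touch the buffer in flight. */
static void do_work(const comm_t *c, double *buf, size_t n, int step) {
    assert(buf != c->in_flight);
    for (size_t i = 0; i < n; i++) {
        buf[i] = (double)step;
    }
}

/* Same structure as the pseudocode: start communication on one buffer,
 * work on the other, wait, then swap the two pointers. */
void double_buffered_loop(double *a, double *b, size_t n, int steps, comm_t *c) {
    double *buffer_comm = a;
    double *buffer_work = b;
    for (int step = 0; step < steps; step++) {
        start_exchange(c, buffer_comm);
        do_work(c, buffer_work, n, step);
        finish_exchange(c);
        double *buffer_tmp = buffer_comm;
        buffer_comm = buffer_work;
        buffer_work = buffer_tmp;
    }
}
```

Swapping pointers rather than copying data keeps the cost of the swap independent of the buffer size. The assertion in `do_work` documents the invariant that makes the overlap safe.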


Linear Solvers - Domain Decomposition

First Example
Naturally Independent Computation - Linear Solvers

- iterative linear solvers are used in many scientific kernels
- a frequently used operation is the vector-matrix multiply
- matrix is domain-decomposed (e.g., 3D)
- only outer (border) elements need to be communicated
- can be overlapped


Domain Decomposition

- nearest neighbor communication
- can be implemented with MPI_Alltoallv

[Figure: 2D domain decomposed across processes P0 to P11, distinguishing process-local data from halo data.]


Parallel Speedup (Best Case)

[Figure, two panels: speedup (0 to 100) versus number of CPUs (8 to 96). One panel shows the curves Eth blocking and Eth non-blocking, the other IB blocking and IB non-blocking.]

Cluster: 128 2 GHz Opteron 246 nodes
Interconnect: Gigabit Ethernet, InfiniBand™
System size 800x800x800 (1 node ≈ 5300s)


Parallel Gain with Non-Blocking Communication

gain = t_nbl / t_bl

[Figure: relative speedup (-0.2 to 0.4) versus number of CPUs (0 to 96). Curves: IB, Eth.]


Linear Solvers - Domain Decomposition

Second Example
Data Parallel Loops - Parallel Compression

automatic transformations (C++ templates), typical loop structure:

    for (i = 0; i < N/P; i++) {
      compute(i);
    }
    comm(N/P);
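One way to expose overlap in this loop structure is tiling: compute a block of elements, start communicating that block, and continue with the next block. The following C sketch contrasts that with the original structure. It is an assumption about what such a transformation can look like, not the talk's C++ template implementation. `compute`, `start_comm`, and `wait_all` are hypothetical stand-ins that only count calls.

```c
#include <assert.h>
#include <stddef.h>

/* Counters standing in for real work and real communication. */
typedef struct {
    size_t computed; /* elements computed so far */
    size_t started;  /* communication operations started */
    size_t sent;     /* elements handed to communication */
} trace_t;

/* Stand-in for compute(i) from the loop above. */
static void compute(size_t i, trace_t *t) {
    (void)i;
    t->computed++;
}

/* Hypothetical non-blocking start; a real one would return immediately. */
static void start_comm(size_t first, size_t count, trace_t *t) {
    assert(first + count <= t->computed); /* only finished elements are sent */
    t->started++;
    t->sent += count;
}

/* Hypothetical wait for all outstanding communication. */
static void wait_all(const trace_t *t) {
    assert(t->sent == t->computed);
}

/* Original structure: compute everything, then communicate in one step. */
void run_blocking(size_t n, trace_t *t) {
    for (size_t i = 0; i < n; i++) {
        compute(i, t);
    }
    start_comm(0, n, t);
    wait_all(t);
}

/* Tiled structure: communication of a finished tile is started before
 * the next tile is computed, so the two can proceed concurrently. */
void run_tiled(size_t n, size_t tile, trace_t *t) {
    assert(tile > 0);
    for (size_t first = 0; first < n; first += tile) {
        size_t count = (n - first < tile) ? (n - first) : tile;
        for (size_t i = first; i < first + count; i++) {
            compute(i, t);
        }
        start_comm(first, count, t);
    }
    wait_all(t);
}
```

The tile size trades overlap against per-operation overhead: smaller tiles start communication earlier but issue more operations.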


Parallel Speedup (Best Case)

[Figure, two panels: speedup (0 to 90) versus number of processors (0 to 100). Legend entries: MPI/blocking, NBC/pipe, NBC/tile, NBC/wintile.]

Cluster: 64 2 GHz Opteron 246 nodes
Interconnect: Gigabit Ethernet, InfiniBand™
System size 64*50 MB


Communication Overhead

MVAPICH 0.9.4

[Figure: communication overhead (0 to 35) versus number of processors (0 to 100). Recovered legend entries: NBC/tile, NBC/wintile.]


Literature

[7] T. Hoefler, P. Gottschling, W. Rehm, and A. Lumsdaine: Optimizing a Conjugate Gradient Solver with Non-Blocking Collective Operations. In 13th European PVM/MPI User's Group Meeting, Proceedings, LNCS 4192, presented in Bonn, Germany, pages 374-382, Springer, ISSN: 0302-9743, ISBN: 3-540-39110-X, Sep. 2006

[8] T. Hoefler, P. Gottschling, and A. Lumsdaine: Transformations for enabling non-blocking collective communication in high-performance applications. Under submission (ask me for a copy)



Ongoing Work

3D-FFT
- optimized version of 3D-FFT with full overlap or pipelined
- still in development (ABINIT version)
- very promising

LOBPCG Method in ABINIT
- developed by G. Zérah (CEA)
- could use NBC for parallel matrix-matrix multiplication

LibNBC
- Fortran bindings for LibNBC
- optimized collectives


Ongoing Work (continued)

Collective Communication
- optimized collectives for InfiniBand™
- using special hardware support

Network Modelling
- refined LogGP model parametrization
- modelling of collective algorithms

Collaborate with Application Engineers
- I want to help to apply NBC to your problem!


Discussion

THE END
Questions?

Thank you for your attention!
