
Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters


Page 1: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Nikolaos Drosinos and Nectarios Koziris
National Technical University of Athens
Computing Systems Laboratory
{ndros,nkoziris}@cslab.ece.ntua.gr
www.cslab.ece.ntua.gr

IPDPS 2004, April 27, 2004

Page 2: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Overview

• Introduction
• Pure Message-passing Model
• Hybrid Models
  • Hyperplane Scheduling
  • Fine-grain Model
  • Coarse-grain Model
• Experimental Results
• Conclusions – Future Work

Page 3: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Motivation

Active research interest in:
• SMP clusters
• Hybrid programming models

However:
• Mostly fine-grain hybrid paradigms (masteronly model)
• Mostly DOALL multi-threaded parallelization

Page 4: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Contribution

Comparison of 3 programming models for the parallelization of tiled loop algorithms:
• pure message-passing
• fine-grain hybrid
• coarse-grain hybrid

Advanced hyperplane scheduling:
• minimizes synchronization need
• overlaps computation with communication
• preserves data dependencies

Page 5: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Algorithmic Model

Tiled nested loops with constant flow data dependencies:

FORACROSS tile_0 DO
  ...
  FORACROSS tile_{n-2} DO
    FOR tile_{n-1} DO
      Receive(tile);
      Compute(tile);
      Send(tile);
    END FOR
  END FORACROSS
END FORACROSS
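For the 3D case (n = 3) this structure corresponds to blocking the iteration space into tiles and sweeping them in lexicographic order. The sketch below is a minimal sequential illustration, not the paper's kernel; the array size N0 x N1 x N2, the tile sizes T0, T1, T2, and the update formula are all hypothetical. The FORACROSS loops above additionally allow the outer tile dimensions to be distributed across processes and threads, which the following slides implement.

    #include <stdio.h>

    #define N0 64
    #define N1 64
    #define N2 64
    #define T0 16   /* hypothetical tile size along dimension 0 */
    #define T1 16   /* hypothetical tile size along dimension 1 */
    #define T2 16   /* hypothetical tile size along dimension 2 */

    static double A[N0][N1][N2];

    int main(void)
    {
        /* Sweep the iteration space tile by tile; the unit flow
           dependencies (i-1, j-1, k-1) are respected by the
           lexicographic order of tiles and of points inside a tile. */
        for (int ti = 0; ti < N0; ti += T0)
            for (int tj = 0; tj < N1; tj += T1)
                for (int tk = 0; tk < N2; tk += T2)
                    for (int i = ti; i < ti + T0; i++)
                        for (int j = tj; j < tj + T1; j++)
                            for (int k = tk; k < tk + T2; k++)
                                A[i][j][k] += (i ? A[i-1][j][k] : 0.0)
                                            + (j ? A[i][j-1][k] : 0.0)
                                            + (k ? A[i][j][k-1] : 0.0);
        printf("%f\n", A[N0-1][N1-1][N2-1]);
        return 0;
    }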

Page 6: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Target Architecture

SMP clusters

Page 7: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Overview

• Introduction
• Pure Message-passing Model
• Hybrid Models
  • Hyperplane Scheduling
  • Fine-grain Model
  • Coarse-grain Model
• Experimental Results
• Conclusions – Future Work

Page 8: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Pure Message-passing Model

tile_0 = pr_0;
...
tile_{n-2} = pr_{n-2};
FOR tile_{n-1} = 0 TO ⌈(max_{n-1} - min_{n-1} + 1) / x⌉ DO
  Pack(snd_buf, tile_{n-1} - 1, pr);
  MPI_Isend(snd_buf, dest(pr));
  MPI_Irecv(recv_buf, src(pr));
  Compute(tile);
  MPI_Waitall;
  Unpack(recv_buf, tile_{n-1} + 1, pr);
END FOR
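A minimal compilable C sketch of this per-process pipeline is given below, assuming a 1D process chain and hypothetical pack_tile/compute_tile/unpack_tile helpers and buffer sizes (none of these come from the paper). It only illustrates the ordering on the slide: post MPI_Isend/MPI_Irecv for the neighboring tiles' boundary data, compute the current tile in the meantime, then MPI_Waitall before unpacking.

    #include <mpi.h>
    #include <stdlib.h>

    #define NTILES   64     /* hypothetical number of tiles along the last dimension */
    #define BOUNDARY 1024   /* hypothetical boundary-face size in doubles */

    /* Hypothetical helpers standing in for Pack/Compute/Unpack of the slide. */
    static void pack_tile(double *buf, int tile)   { (void)buf; (void)tile; }
    static void unpack_tile(double *buf, int tile) { (void)buf; (void)tile; }
    static void compute_tile(int tile)             { (void)tile; }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* 1D pipeline: predecessor and successor in the process grid. */
        int src  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int dest = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        double *snd_buf = malloc(BOUNDARY * sizeof(double));
        double *rcv_buf = malloc(BOUNDARY * sizeof(double));

        for (int tile = 0; tile < NTILES; tile++) {
            MPI_Request req[2];

            /* Send the boundary produced by the previous tile, receive the
               boundary needed by the next one, and compute meanwhile. */
            pack_tile(snd_buf, tile - 1);
            MPI_Isend(snd_buf, BOUNDARY, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req[0]);
            MPI_Irecv(rcv_buf, BOUNDARY, MPI_DOUBLE, src,  0, MPI_COMM_WORLD, &req[1]);

            compute_tile(tile);

            MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
            unpack_tile(rcv_buf, tile + 1);
        }

        free(snd_buf);
        free(rcv_buf);
        MPI_Finalize();
        return 0;
    }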

Page 9: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Pure Message-passing Model

Page 10: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Overview

• Introduction
• Pure Message-passing Model
• Hybrid Models
  • Hyperplane Scheduling
  • Fine-grain Model
  • Coarse-grain Model
• Experimental Results
• Conclusions – Future Work

Page 11: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Hyperplane Scheduling

• Implements coarse-grain parallelism assuming inter-tile data dependencies
• Tiles are organized into data-independent subsets (groups)
• Tiles of the same group can be concurrently executed by multiple threads
• Barrier synchronization between threads
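As a small illustration of the grouping idea (not the authors' scheduler), the sketch below enumerates the tiles of a hypothetical 2D tile space by hyperplane: with unit forward flow dependencies, tiles whose coordinates sum to the same group index do not depend on each other and can run concurrently.

    #include <stdio.h>

    #define TILES0 3   /* hypothetical tile counts per dimension */
    #define TILES1 4

    int main(void)
    {
        /* Tiles on the same hyperplane (i + j == group) are mutually
           independent under unit forward flow dependencies. */
        for (int group = 0; group <= (TILES0 - 1) + (TILES1 - 1); group++) {
            printf("group %d:", group);
            for (int i = 0; i < TILES0; i++) {
                int j = group - i;
                if (j >= 0 && j < TILES1)
                    printf(" (%d,%d)", i, j);
            }
            printf("\n");
        }
        return 0;
    }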

Page 12: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Hyperplane Scheduling

[Figure: hyperplane scheduling example on a 3D tile space (axes X, Y, Z) with 6 MPI processes x 6 OpenMP threads (process grid M_0 = 3, M_1 = 2; thread grid m_0 = 3, m_1 = 2). The highlighted tile is identified by mpi_rank = (1,1), omp_tid = (1,1), tile = 3; each tile (mpi_rank, omp_tid, tile) is mapped to its group.]

Page 13: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Hyperplane Scheduling

#pragma omp parallel
{
  group_0 = pr_0;
  ...
  group_{n-2} = pr_{n-2};
  tile_0 = pr_0 * m_0 + th_0;
  ...
  tile_{n-2} = pr_{n-2} * m_{n-2} + th_{n-2};
  FOR(group_{n-1}) {
    tile_{n-1} = group_{n-1} - Σ_{i=0}^{n-2} tile_i;
    if (0 <= tile_{n-1} <= ⌈(max_{n-1} - min_{n-1} + 1) / t⌉)
      compute(tile);
    #pragma omp barrier
  }
}
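A minimal C + OpenMP sketch of the same execution pattern for a 2D tile space inside one process is shown below; the tile counts, the one-tile-row-per-thread assignment, and compute_tile() are hypothetical simplifications (the real scheduler also folds in the process-grid offsets pr_i).

    #include <stdio.h>
    #include <omp.h>

    #define TILES0 4    /* hypothetical tiles along dimension 0 (one per thread) */
    #define TILES1 8    /* hypothetical tiles along dimension 1 */

    /* Hypothetical stand-in for Compute(tile). */
    static void compute_tile(int t0, int t1)
    {
        printf("thread %d computes tile (%d,%d)\n", omp_get_thread_num(), t0, t1);
    }

    int main(void)
    {
        #pragma omp parallel num_threads(TILES0)
        {
            int tile0 = omp_get_thread_num();   /* each thread owns one tile row */

            /* Hyperplane groups: group = tile0 + tile1. */
            for (int group = 0; group <= (TILES0 - 1) + (TILES1 - 1); group++) {
                int tile1 = group - tile0;
                if (tile1 >= 0 && tile1 < TILES1)
                    compute_tile(tile0, tile1);
                #pragma omp barrier   /* all tiles of a group finish before the next group */
            }
        }
        return 0;
    }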

Page 14: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Overview

• Introduction
• Pure Message-passing Model
• Hybrid Models
  • Hyperplane Scheduling
  • Fine-grain Model
  • Coarse-grain Model
• Experimental Results
• Conclusions – Future Work

Page 15: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Fine-grain Model

• Incremental parallelization of computationally intensive parts
• Pure MPI + hyperplane scheduling
• Inter-node communication outside of the multi-threaded part (MPI_THREAD_MASTERONLY)
• Thread synchronization through the implicit barrier of the omp parallel directive

Page 16: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Fine-grain Model

FOR(group_{n-1}) {
  Pack(snd_buf, tile_{n-1} - 1, pr);
  MPI_Isend(snd_buf, dest(pr));
  MPI_Irecv(recv_buf, src(pr));
  #pragma omp parallel
  {
    thread_id = omp_get_thread_num();
    if (valid(tile, thread_id, group_{n-1}))
      Compute(tile);
  }
  MPI_Waitall;
  Unpack(recv_buf, tile_{n-1} + 1, pr);
}
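The following compilable C sketch mirrors the fine-grain structure above with hypothetical neighbors, buffer sizes, and validity test. MPI calls stay outside the parallel region, so threads are spawned anew for every group and the region's implicit barrier synchronizes them; MPI_Init_thread with MPI_THREAD_FUNNELED is shown for an MPI-2 capable library, whereas an MPI-1 library such as the MPICH 1.2.5 used in the experiments would simply call MPI_Init.

    #include <mpi.h>
    #include <omp.h>
    #include <stdlib.h>

    #define NGROUPS  64     /* hypothetical number of hyperplane groups */
    #define BOUNDARY 1024   /* hypothetical boundary size in doubles */

    /* Hypothetical helpers mirroring the slide's Pack/Compute/Unpack/valid. */
    static void pack_buf(double *b, int g)    { (void)b; (void)g; }
    static void unpack_buf(double *b, int g)  { (void)b; (void)g; }
    static int  tile_is_valid(int tid, int g) { (void)tid; (void)g; return 1; }
    static void compute_tile(int tid, int g)  { (void)tid; (void)g; }

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        (void)provided;   /* level check omitted in this sketch */

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int src  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int dest = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        double *snd = malloc(BOUNDARY * sizeof(double));
        double *rcv = malloc(BOUNDARY * sizeof(double));

        for (int group = 0; group < NGROUPS; group++) {
            MPI_Request req[2];

            /* Communication outside the multi-threaded part (masteronly style). */
            pack_buf(snd, group - 1);
            MPI_Isend(snd, BOUNDARY, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req[0]);
            MPI_Irecv(rcv, BOUNDARY, MPI_DOUBLE, src,  0, MPI_COMM_WORLD, &req[1]);

            /* Threads are spawned per group; the implicit barrier at the end
               of the parallel region synchronizes them. */
            #pragma omp parallel
            {
                int tid = omp_get_thread_num();
                if (tile_is_valid(tid, group))
                    compute_tile(tid, group);
            }

            MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
            unpack_buf(rcv, group + 1);
        }

        free(snd);
        free(rcv);
        MPI_Finalize();
        return 0;
    }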

Page 17: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Overview

• Introduction
• Pure Message-passing Model
• Hybrid Models
  • Hyperplane Scheduling
  • Fine-grain Model
  • Coarse-grain Model
• Experimental Results
• Conclusions – Future Work

Page 18: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Coarse-grain Model

• Threads are only initialized once
• SPMD paradigm (requires more programming effort)
• Inter-node communication inside the multi-threaded part (requires MPI_THREAD_FUNNELED)
• Thread synchronization through an explicit barrier (omp barrier directive)

Page 19: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Coarse-grain Model

#pragma omp parallel
{
  thread_id = omp_get_thread_num();
  FOR(group_{n-1}) {
    #pragma omp master
    {
      Pack(snd_buf, tile_{n-1} - 1, pr);
      MPI_Isend(snd_buf, dest(pr));
      MPI_Irecv(recv_buf, src(pr));
    }
    if (valid(tile, thread_id, group_{n-1}))
      Compute(tile);
    #pragma omp master
    {
      MPI_Waitall;
      Unpack(recv_buf, tile_{n-1} + 1, pr);
    }
    #pragma omp barrier
  }
}
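For contrast, a compilable C sketch of the coarse-grain SPMD structure is given below (same hypothetical helpers and neighbors as before). The parallel region is entered once, only the master thread issues MPI calls, which is why MPI_THREAD_FUNNELED is requested, and an explicit omp barrier separates consecutive groups.

    #include <mpi.h>
    #include <omp.h>
    #include <stdlib.h>

    #define NGROUPS  64     /* hypothetical number of hyperplane groups */
    #define BOUNDARY 1024   /* hypothetical boundary size in doubles */

    static void pack_buf(double *b, int g)    { (void)b; (void)g; }
    static void unpack_buf(double *b, int g)  { (void)b; (void)g; }
    static int  tile_is_valid(int tid, int g) { (void)tid; (void)g; return 1; }
    static void compute_tile(int tid, int g)  { (void)tid; (void)g; }

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        (void)provided;

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int src  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int dest = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        double *snd = malloc(BOUNDARY * sizeof(double));
        double *rcv = malloc(BOUNDARY * sizeof(double));
        MPI_Request req[2];

        /* Threads are created once and persist across all groups (SPMD). */
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();

            for (int group = 0; group < NGROUPS; group++) {
                /* Only the master thread issues MPI calls (FUNNELED). */
                #pragma omp master
                {
                    pack_buf(snd, group - 1);
                    MPI_Isend(snd, BOUNDARY, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req[0]);
                    MPI_Irecv(rcv, BOUNDARY, MPI_DOUBLE, src,  0, MPI_COMM_WORLD, &req[1]);
                }

                if (tile_is_valid(tid, group))
                    compute_tile(tid, group);

                #pragma omp master
                {
                    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
                    unpack_buf(rcv, group + 1);
                }

                /* Explicit barrier before the next group starts. */
                #pragma omp barrier
            }
        }

        free(snd);
        free(rcv);
        MPI_Finalize();
        return 0;
    }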

Page 20: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Overview

• Introduction
• Pure Message-passing Model
• Hybrid Models
  • Hyperplane Scheduling
  • Fine-grain Model
  • Coarse-grain Model
• Experimental Results
• Conclusions – Future Work

Page 21: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Experimental Results

• 8-node SMP Linux cluster (800 MHz PIII, 128 MB RAM, kernel 2.4.20)
• MPICH v.1.2.5 (--with-device=ch_p4, --with-comm=shared)
• Intel C++ compiler 7.0 (-O3 -mcpu=pentiumpro -static)
• FastEthernet interconnection
• ADI micro-kernel benchmark (3D)

Page 22: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Alternating Direction Implicit (ADI)

• Stencil computation used for solving partial differential equations
• Unitary data dependencies
• 3D iteration space (X x Y x Z)

[Figure: 3D iteration space (X x Y x Z) showing the sequential execution order, the processor mapping, and the data dependencies.]
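The slides do not reproduce the ADI kernel itself; as a stand-in, the sketch below shows a 3D sweep with unit-distance flow dependencies, which is the dependence pattern that the tiling and hyperplane scheduling above target. Array sizes and the update formula are illustrative only (far smaller than the X x Y x 8192 spaces measured).

    #include <stdio.h>

    #define NX 64
    #define NY 64
    #define NZ 16   /* illustrative sizes only */

    static double u[NZ][NX][NY];

    int main(void)
    {
        /* Simple boundary initialization so the sweep has non-trivial input. */
        for (int x = 0; x < NX; x++)
            for (int y = 0; y < NY; y++)
                u[0][x][y] = 1.0;

        /* Illustrative sweep with unit-distance flow dependencies:
           u[z][x][y] depends on u[z-1][x][y], u[z][x-1][y] and u[z][x][y-1]. */
        for (int z = 1; z < NZ; z++)
            for (int x = 1; x < NX; x++)
                for (int y = 1; y < NY; y++)
                    u[z][x][y] = (u[z-1][x][y] + u[z][x-1][y] + u[z][x][y-1]) / 3.0;

        printf("%f\n", u[NZ-1][NX-1][NY-1]);
        return 0;
    }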

Page 23: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

ADI – 2 dual SMP nodes

[Figure: processor mapping for pure MPI vs hybrid on 2 dual SMP nodes, shown for the X < Y and X > Y cases over the X, Y, Z iteration space. Legend: MPI processes, OpenMP threads, MPI communication, OpenMP synchronization; node 0 CPU 0, node 0 CPU 1, node 1 CPU 0, node 1 CPU 1.]

Page 24: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

ADI X=128 Y=512 Z=8192 – 2 nodes

Page 25: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

ADI X=256 Y=512 Z=8192 – 2 nodes

Page 26: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

ADI X=512 Y=512 Z=8192 – 2 nodes

Page 27: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

ADI X=512 Y=256 Z=8192 – 2 nodes

Page 28: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

ADI X=512 Y=128 Z=8192 – 2 nodes

Page 29: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

ADI X=128 Y=512 Z=8192 – 2 nodes

[Charts: computation time and communication time breakdown]

Page 30: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

ADI X=512 Y=128 Z=8192 – 2 nodes

[Charts: computation time and communication time breakdown]

Page 31: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Overview

• Introduction
• Pure Message-passing Model
• Hybrid Models
  • Hyperplane Scheduling
  • Fine-grain Model
  • Coarse-grain Model
• Experimental Results
• Conclusions – Future Work

Page 32: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Conclusions

• Tiled loop algorithms with arbitrary data dependencies can be adapted to the hybrid parallel programming paradigm
• Hybrid models can be competitive with the pure message-passing paradigm
• The coarse-grain hybrid model can be more efficient than the fine-grain one, but is also more complicated
• Programming efficiently in OpenMP is not easier than programming efficiently in MPI

Page 33: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Future Work

• Application of the methodology to real applications and standard benchmarks
• Work balancing for the coarse-grain model
• Investigation of alternative topologies and irregular communication patterns
• Performance evaluation on advanced interconnection networks (SCI, Myrinet)

Page 34: Performance Comparison of Pure MPI vs Hybrid MPI-OpenMP Parallelization Models on SMP Clusters

Thank You!

Questions?