Load Balancing Hybrid Programming Models for SMP Clusters and Fully Permutable Loops


Nikolaos Drosinos and Nectarios Koziris

National Technical University of Athens

Computing Systems Laboratory

{ndros,nkoziris}@cslab.ece.ntua.gr

www.cslab.ece.ntua.gr

Oslo, June 15, 2005, ICPP-HPSEC 2005

Motivation

• fully permutable loops always a computational challenge for HPC
• hybrid parallelization attractive for DSM architectures
• currently, popular free message passing libraries provide limited multi-threading support
• SPMD hybrid parallelization suffers from intrinsic load imbalance


Contribution

• two static thread load balancing schemes (constant, variable) for coarse-grain funneled hybrid parallelization of fully permutable loops
  • generic
  • simple to implement
• experimental evaluation against micro-kernel benchmarks of different programming models
  • message passing
  • fine-grain hybrid
  • coarse-grain hybrid (unbalanced, balanced)


Algorithmic model

foracross tile_1 do
  …
    foracross tile_{N-1} do
      for tile_N do
        Receive(tile);
        Compute(A, tile);
        Send(tile);

Restrictions:
• fully permutable loops
• unitary inter-process dependencies
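
The tiled loop nest above can be illustrated with a small single-process sketch. It assumes a hypothetical 2D kernel, A[i][j] = A[i-1][j] + A[i][j-1], whose unit dependencies make the loops fully permutable. Receive and Send are replaced by reads and writes on one shared array, and the names `make_grid`, `compute_tile`, `run_untiled`, and `run_tiled` are illustrative only, not part of the original presentation.

```python
def make_grid(n, m):
    """Grid whose first row and column hold ones; the interior is computed later."""
    return [[1 if i == 0 or j == 0 else 0 for j in range(m)] for i in range(n)]


def compute_tile(a, i0, i1, j0, j1):
    """Apply the hypothetical kernel to the index range [i0, i1) x [j0, j1)."""
    for i in range(max(i0, 1), i1):
        for j in range(max(j0, 1), j1):
            a[i][j] = a[i - 1][j] + a[i][j - 1]


def run_untiled(n, m):
    """Reference result: the whole iteration space treated as one tile."""
    a = make_grid(n, m)
    compute_tile(a, 0, n, 0, m)
    return a


def run_tiled(n, m, ti, tj):
    """Visit ti x tj tiles in lexicographic order, which respects the unit dependencies."""
    a = make_grid(n, m)
    for i0 in range(0, n, ti):
        for j0 in range(0, m, tj):
            # A message passing version would Receive(tile) boundary data here.
            compute_tile(a, i0, min(i0 + ti, n), j0, min(j0 + tj, m))
            # A message passing version would Send(tile) boundary data here.
    return a
```

Because every dependence points in a non-negative direction, executing whole tiles one after another yields the same values as the untiled loop nest, which is the property the tiling transformation relies on.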


Message passing parallelization

• tiling transformation
• (overlapped?) computation and communication phases
• pipelined execution

• portable
• scalable
• highly optimized


Hybrid parallelization

So… why bother?


Hybrid parallelization: why bother I

shared memory programming model vs message passing programming model for shared memory architecture


Hybrid parallelization: why bother II

DSM architectures are popular!


Fine-grain hybrid parallelization

• incremental parallelization of loops
• relatively easy to implement
• popular

• Amdahl's law restricts parallel efficiency
• overhead of thread structures re-initialization
• restrictive programming model for many applications


Coarse-grain hybrid parallelization

• generic SPMD programming style
• good parallelization efficiency
• no thread re-initialization overhead

• more difficult to implement
• intrinsic load imbalance, assuming the common funneled thread support level
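
The trade-offs listed for the two hybrid models can be made concrete with a rough per-tile cost model. This is a sketch under simplifying assumptions that are not taken from the slides: computation divides evenly among the threads, only the master thread communicates (funneled support), and fine-grain parallelization pays a fixed, hypothetical `t_fork_join` overhead each time a parallel region is re-entered.

```python
def fine_grain_tile_time(t_comp, t_comm, threads, t_fork_join):
    """Communication runs serially outside the parallel region; computation is shared."""
    return t_comm + t_fork_join + t_comp / threads


def coarse_grain_unbalanced(t_comp, t_comm, threads):
    """Every thread gets an equal share of the computation; the master also communicates.

    Returns (tile_time, master_time, other_time, idle_time_of_other_threads).
    """
    share = t_comp / threads
    master = share + t_comm
    other = share
    return max(master, other), master, other, master - other
```

For example, with `t_comp=8.0`, `t_comm=1.0`, two threads, and `t_fork_join=0.5`, the fine-grain model takes 5.5 time units per tile, while the unbalanced coarse-grain model takes 5.0, with the non-master thread idle for 1.0 unit. That idle time is the intrinsic load imbalance named above.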


MPI thread support levels

• single
• masteronly
• funneled
• serialized
• multiple

Figure: timelines of communication (comm) and computation (comp) phases for fine-grain hybrid and coarse-grain hybrid parallelization.


Load balancing

Idea

$$t_{mastercomp} + t_{mastercomm} = t_{othercomp}$$

Consequence

master thread assumes a smaller fraction of the process tile computational load compared to other threads


Load balancing (2)

T: total number of threads
p: current process id

Assuming

$$t_{comp}\{x\} = x \cdot t_{comp}$$

$$t_{comm}\{x\} = t_{startup} + x \cdot t_{data}$$

It follows

$$bal = 1 - \frac{T-1}{t_{comp}\{tile\}} \sum_{i=1,\; C_{dir_i}(p)=1}^{N} t_{comm}\{dir_i\}$$
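
A minimal numerical sketch of the balancing factor follows. It assumes, beyond what the slide text states, that the master thread computes a fraction bal/T of the process tile and each of the other T-1 threads computes (T - bal)/(T(T-1)), so that the fractions sum to one. Under linear computation cost, the balance condition then gives bal = 1 - (T-1) * t_comm / t_comp, where t_comm is the total communication time of the process for one tile. The function names are illustrative.

```python
def balancing_factor(t_comp_tile, t_comm_total, threads):
    """bal = 1 - (T - 1) * t_comm / t_comp, under the assumptions stated above."""
    if threads < 2:
        raise ValueError("balancing needs at least one non-master thread")
    return 1.0 - (threads - 1) * t_comm_total / t_comp_tile


def thread_times(bal, t_comp_tile, t_comm_total, threads):
    """Per-tile busy time of the master thread and of each other thread."""
    master = bal / threads * t_comp_tile + t_comm_total
    other = (threads - bal) / (threads * (threads - 1)) * t_comp_tile
    return master, other
```

With `t_comp_tile=100.0`, `t_comm_total=10.0`, and four threads, the factor is about 0.7 and both thread times come out near 27.5, that is (t_comp + t_comm)/T. A negative result would mean that communication is too expensive for the master thread to take on any computation in this model.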


Load balancing (3)

Figure: iteration space (axes X1, X2, Z) mapped onto processes (0,0) through (3,1), each with two threads (thread 0, thread 1), annotated with per-process percentages:

87%   87%   87%   92%
95%   95%   95%   100%
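
Variable balancing can be sketched by evaluating the balancing factor separately for each process. The example below rests on assumptions that are not stated on the slide: a process communicates in a direction only if it has a next neighbor in that direction, and the ratios t_comm{dir}/t_comp{tile} are the hypothetical values 0.05 and 0.08, picked so that the output resembles the percentages shown above. They are not measurements.

```python
def variable_balancing(grid, comm_ratio, threads):
    """Per-process balancing factor on a 2D process grid.

    grid: (P1, P2) process counts along the two mapped dimensions.
    comm_ratio: hypothetical t_comm{dir} / t_comp{tile} for each direction.
    """
    p1, p2 = grid
    factors = {}
    for y in range(p2):
        for x in range(p1):
            # Assumed communication criterion: a neighbor exists in that direction.
            communicates = (x + 1 < p1, y + 1 < p2)
            total = sum(r for r, c in zip(comm_ratio, communicates) if c)
            factors[(x, y)] = 1.0 - (threads - 1) * total
    return factors
```

Calling `variable_balancing((4, 2), (0.05, 0.08), 2)` gives roughly 0.87 for process (0,0), 0.92 for (3,0), 0.95 for (0,1), and 1.0 for (3,1). Processes with fewer communication directions leave a larger share of the computation to the master thread.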


Experimental Results

• 8-node dual SMP Linux cluster (800 MHz PIII, 256 MB RAM, kernel 2.4.26)
• MPICH v.1.2.6 (--with-device=ch_p4, --with-comm=shared, P4_SOCKBUFSIZE=104KB)
• Intel C++ compiler 8.1 (-O3 -static -mcpu=pentiumpro)
• FastEthernet interconnection network


Alternating Direction Implicit (ADI)

• Stencil computation used for solving partial differential equations
• Unitary data dependencies
• 3D iteration space (X × Y × Z)

Figure: 3D iteration space with axes X, Y, Z, showing sequential execution, processor mapping, and data dependencies.


ADI


Synthetic benchmark


Conclusions

• fine-grain hybrid parallelization is inefficient
• unbalanced coarse-grain hybrid parallelization is also inefficient
• balancing improves hybrid model performance
• the variable balanced coarse-grain hybrid model is the most efficient approach overall
• the relative performance improvement grows as communication needs increase relative to computation


Thank You!

Questions?


ADI


Synthetic benchmark
