What is required for "standard" distributed parallel programming model?
Mitsuhisa Sato, Taisuke Boku and Jinpil Lee
University of Tsukuba
My Background and Position

OpenMP
- A standard parallel programming model and API for shared memory multiprocessors
- Extends the base language (Fortran/C/C++) with directives or pragmas
- Incremental parallel programming: the sequential semantics are kept when the directives are ignored, which allows a range of programming styles
- For scientific applications: support for loop-based parallelism
- Target: small-scale (~16 processors) to medium-scale (~64 processors)
- The first draft was published in 1997, and the standard is now gaining acceptance in the multi-core era

Omni OpenMP compiler project (… now inactive)
- Done in the Real World Computing Partnership (RWCP, ~2002)
- Research objectives:
  - A portable implementation of OpenMP for SMPs
  - Design and implementation of a cluster-enabled OpenMP for PC/WS/SMP clusters, supporting seamless programming from SMPs to clusters, using a page-based software distributed shared memory system
- Free and open-source, released since 1998
Agenda
- OpenMPD: a directive-based programming model for distributed memory
- What is required for a "standard" distributed parallel programming model?
OpenMPD: a directive-based programming model for distributed memory

Objectives
- Provide a simple and "easy-to-understand" programming model for distributed memory; OpenMP is only for shared memory, not for distributed memory
- Support data parallelization and typical parallelization patterns by adding directives similar to OpenMP (inspired by OpenMP)
Features of OpenMPD
- A directive-based programming model for distributed memory systems
- C programming language (and Fortran) + directives
- Explicit communication and synchronization: every action is taken by a directive, which keeps performance tuning "easy to understand"
- Supports typical communication patterns: scatter/gather, reduction, neighbor communication, ...
- "Directives" describe typical data parallelization: array distribution, data synchronization, ...
- Highly portable implementation by translation to MPI: the compiler translates the directives into parallel code using MPI functions
Code Example

Only directives are added to the serial code: incremental parallelization.

int array[YMAX][XMAX];
#pragma ompd distvar(var=array; dim=2)            /* data distribution */

main(){
    int i, j, res;
    res = 0;
#pragma ompd for affinity(array) reduction(+:res) /* work sharing and data synchronization */
    for(i = 0; i < 10; i++)
        for(j = 0; j < 10; j++){
            array[i][j] = func(i, j);
            res += array[i][j];
        }
}
The same code written in MPI

int array[YMAX][XMAX];

main(int argc, char **argv){
    int i, j, res, temp_res, dx, llimit, ulimit, size, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    dx = YMAX / size;
    llimit = rank * dx;
    if(rank != (size - 1)) ulimit = llimit + dx;
    else ulimit = YMAX;
    temp_res = 0;
    for(i = llimit; i < ulimit; i++)
        for(j = 0; j < 10; j++){
            array[i][j] = func(i, j);
            temp_res += array[i][j];
        }
    MPI_Allreduce(&temp_res, &res, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    MPI_Finalize();
}
Array data distribution

Each processor computes on a different region of the array:
    #pragma ompd distvar(var=list; dim=num; sleeve=size)

[Figure: array[] partitioned across CPU0-CPU3.]

A reference to an element assigned to another node requires synchronization on the data. Two kinds are provided: sync on the sleeve area only, or sync on the whole array; the programmer chooses which sync is required. In the current implementation, the whole array is replicated on each node.
Data synchronization of array (Gather)

A gather operation distributes the data to every node:
    #pragma ompd gather(var=list)
It executes the communication needed to obtain the data assigned to other nodes, and is the easiest way to synchronize.

[Figure: after the gather, array[] is fully replicated on CPU0-CPU3.]

Now every node can access correct data by local access alone. But the communication is expensive!
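To make the pattern concrete, here is a minimal sketch of a compute-then-gather step, using only the directives shown in these slides (func, YMAX, and XMAX are placeholders carried over from the earlier example):

int array[YMAX][XMAX];
#pragma ompd distvar(var=array; dim=2)

void compute_then_gather(){
    int i, j;
#pragma ompd for affinity(array)          /* each node fills only its own region */
    for(i = 0; i < YMAX; i++)
        for(j = 0; j < XMAX; j++)
            array[i][j] = func(i, j);
#pragma ompd gather(var=array)            /* replicate the whole array on every node */
    /* from here on, any element can be read by a plain local access */
}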
Data synchronization of array (Sleeve)

Exchanges data only on the "sleeve" region. If only neighboring data is required, communicating the sleeve area alone is sufficient, for example:
    b[i] = array[i-1] + array[i+1]

[Figure: array[] distributed across CPU0-CPU3, with sleeve regions exchanged between neighboring nodes.]

The programmer specifies the sleeve region, with its size, explicitly:
    #pragma ompd distvar(var=array; dim=1; sleeve=1)
and triggers the exchange with:
    #pragma ompd sync_sleeve(var=array)
Unlike the gather operation, communication on the sleeve is cheap.
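Putting these directives together, a minimal sketch of one step of a 1-D stencil (N and the loop bounds are illustrative assumptions; b is assumed to be distributed in the same way as array):

int array[N], b[N];
#pragma ompd distvar(var=array; dim=1; sleeve=1)
#pragma ompd distvar(var=b; dim=1; sleeve=1)

void stencil_step(){
    int i;
#pragma ompd sync_sleeve(var=array)   /* exchange one boundary element with each neighbor */
#pragma ompd for affinity(array)      /* each node computes only on its own region */
    for(i = 1; i < N-1; i++)
        b[i] = array[i-1] + array[i+1];
}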
Parallel Execution of "for" loop

A for loop is executed in parallel with affinity to the array distribution:
    #pragma ompd for affinity(array)
    for(i = 2; i <= 10; i++) ...

[Figure: the data region computed by the for loop is divided among CPU0-CPU3 according to the distribution of array[].]
Experimental Results

[Figure: speedup vs. number of nodes (1 to 8) for the EP, Himeno, and CG benchmarks, each in an OpenMPD and an MPI version. Annotations: "constant speed-up with moderate scalability"; "performance degraded by lack of multi-dim. array distribution".]
Related Work
- OpenMP: only for shared memory
- Unified Parallel C: a PGAS (Partitioned Global Address Space) language
- Co-Array Fortran: also PGAS
- The latter two provide alternative programming models to MPI for distributed memory
- OpenWP?
Future Work and Plan for OpenMPD
- Multi-dimensional array distribution and nested parallel loop execution
- Integration of PGAS features for more flexible communication patterns and data distribution; the current OpenMPD supports only typical cases
  - Remote memory access (one-sided communication)
  - Only the assigned part of the data should be allocated on each node, which requires address translation
- Supporting hybrid programming with OpenMP within the node on SMP/multicore clusters, even combined with MPI (see the sketch below)
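As a reference point for that last item, a minimal hybrid MPI + OpenMP sketch of the usual kind (standard MPI and OpenMP calls only; the work loop is a placeholder, not OpenMPD output):

#include <mpi.h>

int main(int argc, char **argv){
    int provided, rank, i;
    double local = 0.0, global;
    /* request threaded MPI; only the master thread will make MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* OpenMP parallelizes the loop across the cores within one node ... */
#pragma omp parallel for reduction(+:local)
    for(i = 0; i < 1000; i++)
        local += (double)(rank * 1000 + i);    /* placeholder work */
    /* ... while MPI combines the per-node partial results across nodes */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}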
Agenda
- OpenMPD: a directive-based programming model for distributed memory
- What is required for a "standard" distributed parallel programming model?
Message Passing Model (MPI)
- Message passing was the dominant programming model in the past. ... Yes.
- Message passing is the dominant programming model today. ... Unfortunately, yes...
- Will OpenMP be a programming model for future systems? ... I hope so, but it is not perfect: OpenMP is only a shared memory model, and (I think) some features needed for performance tuning are missing: data mapping, scalability, I/O, ...
For application programmers
- Are programmers satisfied with MPI? Yes...? Many programmers write MPI.
- Is MPI enough for parallelizing scientific programs?
- The application programmer's concern is to get their answers faster!!
- An automatic parallelizing compiler would be best, but ... many problems remain.
"Life is too short for MPI" (from the WOMPAT 2001 T-shirt message)

Simple N-body problem:

for(i = 0; i < n_particles; i++){
    p = &particles[i];
    ax = 0.0; ay = 0.0; az = 0.0;
    for(j = 0; j < n_particles; j++){
        if(i == j) continue;
        q = &particles[j];
        dx = p->x - q->x;
        dy = p->y - q->y;
        dz = p->z - q->z;
        X = dx * dx + dy * dy + dz * dz;
        if(X < b2){
            f = q->m * (X - a2) * (X - b2);
            ax += f * dx;
            ay += f * dy;
            az += f * dz;
        }
    }
    p->ax = ax; p->ay = ay; p->az = az;
}
for(i = 0; i < n_particles; i++){
    p = &particles[i];
    p->x  += p->vx * DT;
    p->y  += p->vy * DT;
    p->z  += p->vz * DT;
    p->vx += p->ax * DT;
    p->vy += p->ay * DT;
    p->vz += p->az * DT;
}

With MPI, parallelizing this means writing the data partitioning, scheduling, and communication (broadcast, reduction) by hand: it takes several hours. With OpenMP, you just put "#pragma omp parallel for" at the loops: it takes just a few tens of minutes!!! (see the sketch below)
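For the force loop above, that OpenMP change looks roughly like the sketch below; the particle struct and parameter list are assumptions reconstructed from the slide's code, and declaring the temporaries inside the loop makes them private to each thread:

typedef struct { double x, y, z, vx, vy, vz, ax, ay, az, m; } particle_t;

/* hypothetical wrapper around the slide's force loop, parallelized with one directive */
void compute_forces(particle_t *particles, int n_particles, double a2, double b2){
    int i;
#pragma omp parallel for schedule(static)
    for(i = 0; i < n_particles; i++){
        particle_t *p = &particles[i], *q;
        double ax = 0.0, ay = 0.0, az = 0.0, dx, dy, dz, X, f;
        int j;
        for(j = 0; j < n_particles; j++){
            if(i == j) continue;
            q = &particles[j];
            dx = p->x - q->x; dy = p->y - q->y; dz = p->z - q->z;
            X = dx * dx + dy * dy + dz * dz;
            if(X < b2){
                f = q->m * (X - a2) * (X - b2);
                ax += f * dx; ay += f * dy; az += f * dz;
            }
        }
        p->ax = ax; p->ay = ay; p->az = az;
    }
}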
Parallel programming languages

Programming language design reflects its model. So far, many parallel programming languages have been proposed by the computer science community: Jade, HPC++, MPC++, HPF, Linda, Mentat, Fortran M, Occam, APL, SAL, pC++, SISAL, NESL, Cilk, pHaskell, Prolog, Orca, mpC, C*, data-parallel C, Split-C, Fortran D, V, Charm++, CODE, ZPL, Fortran X3H5, ...

Are they actually used by application users? Where have they gone? What is missing in them?
Think about MPI, ...

Why was MPI accepted and so successful?
- Portability: most parallel computing platforms can run MPI programs (even SMPs), and much free, portable software such as MPICH exists.
- Education: the MPI standard has allowed many programmers to learn MPI parallel programming, in university and from books.
Discussion

The demand for parallel programming is increasing!!
- Low-cost PC clusters, SMPs in PC boxes, on-chip multiprocessors, ... multiprocessors even in PDAs, now!

Of course, a clear and excellent concept for the model, good performance, ... many factors are important! But standardization and education are important for widespread use:
- Standardization enables good education.
- The model must be available on many platforms.
Discussion (cont.)

The cost of parallelization is also important for acceptance by application programmers:
- It must be easy to move from an original sequential program.
- What application programmers need to learn must be small.

We have a plan to organize a group for a "standard" parallel programming language for petaflops systems:
- It will be supported by RIKEN; we will try to find funding for development; it should be international.

For a standard, the "agreement" process is more important than "advanced" ideas.

Standardization and Education