Heterogeneous and Grid Computing

Programming systems

– For parallel computing
  » Traditional systems (MPI, HPF) do not address the extra challenges of heterogeneous parallel computing
  » mpC, HeteroMPI
– For high-performance distributed computing
  » NetSolve/GridSolve
mpC

– An extension of ANSI C for programming parallel computations on networks of heterogeneous computers
– Supports efficient, portable and modular heterogeneous parallel programming
– Addresses the heterogeneity of both the processors and the communication network
mpC (ctd)

A parallel mpC program is a set of parallel processes interacting (that is, synchronizing their work and transferring data) by means of message passing
The mpC programmer cannot specify in the source code how many processes make up the program or which computers execute which processes
– This is specified by means external to the mpC language
– The source mpC code only determines which process of the program performs which computations
mpC (ctd)

The programmer describes the algorithm:
– The number of processes executing the algorithm
– The total volume of computation to be performed by each process
  » A formula including the parameters of the algorithm
  » The volume is measured in computation units provided by the application programmer
    (the very code that has been used to measure the speed of the processors)
mpC (ctd)

The programmer describes the algorithm (ctd):
– The total volume of data transferred between each pair of processes
– How the processes perform the computations and communications and interact
  » In terms of traditional algorithmic patterns (for, while, parallel for, etc)
  » Expressions in the statements specify not the computations and communications themselves but rather their amount
    (parameters of the algorithm and locally declared variables can be used)
mpC (ctd)

The abstract processes of the algorithm are mapped to the real parallel processes of the program
– The mapping of the abstract processes should minimize the execution time of the program
mpC (ctd)

Example (see handouts for full code):

algorithm HeteroAlgorithm(int n, double v[n])
{
  coord I=n;              /* n abstract processes, indexed by I */
  node { I>=0: v[I]; };   /* relative volume of computation of process I is v[I] */
};
…
int [*]main(int [host]argc, char **[host]argv)
{
  …
  {
    net HeteroAlgorithm(N, volumes) g;   /* create network g of this type */
    …
  }
}
mpC (ctd)

The program calculates the mass of a metallic construction welded from N heterogeneous rails
– It defines group g consisting of N abstract processes, each calculating the mass of one of the rails
– The calculation is performed by numerical 3D integration of the density function Density with a constant integration step
  » The volume of computation to calculate the mass of each rail is proportional to the volume of this rail
    (the i-th element of array volumes contains the volume of the i-th rail)
  » The program specifies that the volume of computation performed by each abstract process of g is proportional to the volume of its rail
mpC (ctd)

The library nodal function MPC_Wtime is used to measure the wall time elapsed to execute the calculations (see the sketch below)
Mapping of abstract processes to real processes
– Based on information about the speed at which the real processes run on the physical processors of the executing network
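A minimal sketch of this timing pattern, assuming MPC_Wtime is used like MPI_Wtime (the measured computation is elided; only MPC_Wtime itself is taken from these slides):

double start, finish;

start = MPC_Wtime();     /* wall-clock time before the computation */
/* ... the per-process computation being measured ... */
finish = MPC_Wtime();
printf("Elapsed time: %f s\n", finish - start);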
mpC (ctd)

By default, the speed estimation obtained on initialization of the mpC system on the network is used
– The estimation is obtained by running a special test program
mpC allows the programmer to change the default estimation of processor speed at runtime, tuning it to the computations that will really be executed
– The recon statement (a sketch follows below)
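A hedged sketch of the recon statement; the benchmark routine SerialCore and its arguments are hypothetical, chosen to resemble the Update_group call shown later in these slides:

/* SerialCore is a hypothetical serial routine representative of the
   computations the application will actually perform */
void SerialCore(int size, double *data);
…
/* rerun the benchmark on every physical processor and refresh the
   speed estimates used when mapping abstract processes */
recon SerialCore(TestSize, TestData);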
mpC (ctd)

An irregular problem
– Characterized by an inherent coarse/large-grained structure
– This structure determines a natural decomposition of the problem into a small number of subtasks
  » Of different sizes
  » Which can be solved in parallel
mpC (ctd)

The whole program solving the irregular problem
– A set of parallel processes
– Each process solves its subtask
  » As the sizes of the subtasks differ, the processes perform different volumes of computation
– The processes interact via message passing
Calculation of the mass of a metallic «hedgehog» is an example of an irregular problem
mpC (ctd)

A regular problem
– The most natural decomposition is a large number of small identical subtasks that can be solved in parallel
– As the subtasks are identical, they are of the same size
Multiplication of two n×n dense matrices is an example of a regular problem
– Naturally decomposed into n² identical subtasks
  » Computation of one element of the resulting matrix
How can a regular problem be solved efficiently on a network of heterogeneous computers?
mpC (ctd)

Main idea
– Transform the problem into an irregular problem
  » Whose structure is determined by the structure of the executing network
The whole problem
– Decomposed into a set of relatively large subproblems
– Each subproblem is made of a number of small identical subtasks stuck together
– The size of each subproblem depends on the speed of the processor solving this subproblem
  » For example, with three processors of relative speeds 2:1:1, a problem of 1000 identical subtasks would be split into subproblems of 500, 250 and 250 subtasks
mpC (ctd)

The parallel program
– A set of parallel processes
– Each process solves one subproblem on a separate physical processor
  » The volume of computation performed by each of these processes should be proportional to its speed
– The processes interact via message passing
mpC (ctd)

Example. Parallel multiplication, on a heterogeneous network, of matrix A and the transpose of matrix B, where A and B are dense square n×n matrices.

[Figure: matrices A and B and the resulting matrix C = A×Bᵀ]
mpC (ctd)

One step of the parallel multiplication of matrices A and Bᵀ: the pivot row of blocks of matrix B (shown slashed in the figure) is first broadcast to all processors. Then each processor, in parallel with the others, computes its part of the corresponding column of blocks of the resulting matrix C.

[Figure: matrices A, B and C at one step of the algorithm]
mpC (ctd)

See handouts for the mpC program implementing this algorithm
– The program first detects the number of physical processors
– It then updates the estimation of the processor speeds with the very code that is executed at each step of the main loop
mpC: inter-process communication

The basic subset of mpC is based on a performance model of the parallel algorithm that ignores communication operations
– It presumes that
  » the contribution of the communications to the total execution time of the algorithm is negligibly small compared to that of the computations
– This is acceptable for
  » Computing on heterogeneous clusters
  » Message-passing algorithms that do not frequently send short messages
– It is not acceptable for “normal” algorithms running on common heterogeneous networks of computers
mpC: inter-process communication (ctd)

The compiler can optimally map parallel algorithms in which communication operations contribute substantially to the execution time only if the programmer can specify
– The absolute volumes of computation performed by the processes
– The volumes of data transferred between the processes
mpC: inter-process communication (ctd)

Volume of communication
– Can be naturally measured in bytes
Volume of computation
– What is the natural unit of measurement?
  » It must allow the compiler to accurately estimate the execution time
– In mpC, the unit is the very code that has been most recently used to estimate the speed of the physical processors
  » Normally specified as part of the recon statement, as sketched below
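A hedged sketch of this pairing, based on the matrix-multiplication model shown later in these slides; the serial kernel SerialAxBT is hypothetical:

/* Hypothetical serial kernel: multiplies one r x r block;
   running it under recon refreshes the speed estimates */
recon SerialAxBT(a, b, c, n, r);

/* In the performance model, bench now denotes exactly this kernel,
   so process I performs (d[I]*n)/(r*r) such units of computation */
node { I>=0: bench*((d[I]*n)/(r*r)); };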
mpC: N-body problem

The system of bodies consists of large groups of bodies, with different groups at a good distance from each other. The bodies move under the influence of Newtonian gravitational attraction.
mpC: N-body problem (ctd)

Parallel N-body algorithm
– There is a one-to-one mapping between groups of bodies and parallel processes of the algorithm
– Each process
  » Holds in its memory all data characterising the bodies of its group
    (masses, positions and velocities of the bodies)
  » Is responsible for updating this data
mpC: N-body problem (ctd)

Parallel N-body algorithm (ctd)
– The effect of each remote group is approximated by a single equivalent body
  » To update its group, each process requires the total mass and the center of mass of all remote groups
    The total mass of each group of bodies is constant, so it is calculated once. Each process receives the calculated total mass from each of the other processes and stores all the masses.
    The center of mass of each group is a function of time. At each step of the simulation, each process computes its center of mass and sends it to the other processes (see the sketch below).
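A minimal sketch of this per-group computation; the layout of the Body structure is an assumption (these slides only use sizeof(Body)):

typedef struct { double m, x, y, z, vx, vy, vz; } Body;  /* assumed layout */

/* Total mass and center of mass of one group of n bodies */
void CenterOfMass(const Body *g, int n, double *M, double c[3])
{
    *M = 0.0;
    c[0] = c[1] = c[2] = 0.0;
    for (int i = 0; i < n; i++) {
        *M   += g[i].m;
        c[0] += g[i].m * g[i].x;
        c[1] += g[i].m * g[i].y;
        c[2] += g[i].m * g[i].z;
    }
    c[0] /= *M; c[1] /= *M; c[2] /= *M;
}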
mpC: N-body problem (ctd)

Parallel N-body algorithm (ctd)
– At each step of the simulation the updated system of bodies is visualised
  » To do this, all groups of bodies are gathered to the process responsible for the visualisation, which is the host-process
– In general, different groups have different sizes
  » Different processes perform different volumes of computation
  » Different volumes of data are transferred between different pairs of processes
mpC: N-body problem (ctd)

Parallel N-body algorithm (ctd)
– From the point of view of each individual process, the system includes all the bodies of its own group, with each remote group approximated by a single equivalent body.
mpC: N-body problem (ctd)

Pseudocode of the N-body algorithm:

Initialise groups of bodies on the host-process
Visualize the groups of bodies
Scatter the groups across processes
Compute masses of the groups in parallel
Communicate to share the masses among processes
while(1) {
  Compute centers of mass in parallel
  Communicate the centers among processes
  Update the state of the groups in parallel
  Gather the groups to the host-process
  Visualize the groups of bodies
}
mpC N-body application

The core is the specification of the performance model of the algorithm:

algorithm Nbody(int m, int k, int n[m])
{
  coord I=m;                                 /* one abstract process per group */
  node { I>=0: bench*((n[I]/k)*(n[I]/k)); };
       /* volume of computation of process I, in units of the benchmark
          code, which updates a test group of k bodies */
  link { I>0: length*(n[I]*sizeof(Body)) [I]->[0]; };
       /* each process sends its n[I] bodies to the host-process */
  parent [0];
};
mpC N-body application (ctd)

The most important fragments of the rest of the code:

void [*] main(int [host]argc, char **[host]argv)
{
  ...
  // Make the test group consist of the first TGsize
  // bodies of the very first group of the system
  OldTestGroup[] = (*(pTestGroup)Groups[0])[];
  // Estimate the processor speeds on the code that updates the test group
  recon Update_group(TGsize, &OldTestGroup, &TestGroup,
                     1, NULL, NULL, 0);
  {
    net Nbody(NofGroups, TGsize, NofBodies) g;
    …
  }
}
mpC: algorithmic patterns

One more important feature of a parallel algorithm is still not reflected in the performance model
– The order of execution of computations and communications
As the model says nothing about how the parallel processes interact during execution of the algorithm, the compiler assumes that
– First, all processes execute all their computations in parallel
– Then, the processes execute all their communications in parallel
– There is a synchronisation barrier between the execution of the computations and the communications
mpC: algorithmic patterns (ctd)

These assumptions are unsatisfactory in the case of
– Data dependencies between computations performed by different processes
  » One process may need data computed by other processes in order to start its computations
  » This serialises some computations performed by different parallel processes => the real execution time of the algorithm will be longer
– Overlapping of computations and communications
  » The real execution time of the algorithm will be shorter
mpC: algorithmic patterns (ctd)

Thus, if the estimation is not based on the actual scenario of interaction of the parallel processes
– It may be inaccurate, which leads to a non-optimal mapping of the algorithm to the executing network
Example. An algorithm with fully serialised computations.
– Optimal mapping:
  » All the processes are assigned to the fastest physical processor
– Mapping based on the above assumptions:
  » Involves all available physical processors
mpC: algorithmic patterns (ctd)

mpC addresses the problem
– The programmer can specify the scenario of interaction of the parallel processes during execution of the parallel algorithm
– That specification is part of the network type definition
  » The scheme declaration
mpC: algorithmic patterns (ctd)

Example 1. N-body algorithm

algorithm Nbody(int m, int k, int n[m])
{
  coord I=m;
  node { I>=0: bench*((n[I]/k)*(n[I]/k)); };
  link { I>0: length*(n[I]*sizeof(Body)) [I]->[0]; };
  parent [0];
  scheme
  {
    int i;
    /* first, every process performs 100% of its computations in parallel */
    par (i=0; i<m; i++) 100%%[i];
    /* then, every process sends 100% of its data to the host in parallel */
    par (i=1; i<m; i++) 100%%[i]->[0];
  };
};
mpC: algorithmic patterns (ctd)

Example 2. Matrix multiplication.

algorithm ParallelAxBT(int p, int n, int r, int d[p])
{
  coord I=p;
  node { I>=0: bench*((d[I]*n)/(r*r)); };
  link (J=p) { I!=J: length*(d[I]*n*sizeof(double)) [J]->[I]; };
  parent [0];
mpC: algorithmic patterns (ctd)

Example 2. Matrix multiplication (ctd)

  scheme
  {
    int i, j, PivotProc=0, PivotRow=0;
    for(i=0; i<n/r; i++, PivotRow+=r)
    {
      /* move to the next processor when its rows are exhausted */
      if(PivotRow>=d[PivotProc]) { PivotProc++; PivotRow=0; }
      /* broadcast the pivot row of blocks from its owner */
      for(j=0; j<p; j++)
        if(j!=PivotProc)
          (100.*r/d[PivotProc])%%[PivotProc]->[j];
      /* each processor computes its part of the column of blocks */
      par(j=0; j<p; j++)
        (100.*r/n)%%[j];
    }
  };
};
mpC: the timeof operator

Further modification of the matrix multiplication program:

[host]:
{
  int m;
  struct {int p; double t;} min;
  double t;
  min.p = 0;
  min.t = DBL_MAX;
  /* try every number of involved processors from 1 to p and keep
     the one with the smallest predicted execution time */
  for(m=1; m<=p; m++)
  {
    Partition(m, speeds, d, n, r);
    t = timeof(net ParallelAxBT(m, n, r, d) w);
    if(t<min.t) { min.p = m; min.t = t; }
  }
  p = min.p;
}
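The Partition routine above is application code whose full text is not in these slides. A plausible sketch, assuming it distributes the n rows among m processors in multiples of the block size r, proportionally to the measured speeds:

/* A hedged sketch: allocate to each of the m processors a number of
   rows d[i], a multiple of r, roughly proportional to speeds[i] */
void Partition(int m, const double *speeds, int *d, int n, int r)
{
    double total = 0.0;
    int i, blocks = n / r, given = 0;

    for (i = 0; i < m; i++) total += speeds[i];
    for (i = 0; i < m; i++) {
        int b = (int)(blocks * speeds[i] / total);  /* blocks for processor i */
        d[i] = b * r;
        given += b;
    }
    d[0] += (blocks - given) * r;   /* leftover blocks go to one processor */
}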
mpC: the timeof operator (ctd)

The timeof operator estimates the execution time of the parallel algorithm without actually executing it
– The only operand specifies a fully specified network type
  » The values of all parameters of the network type must be specified
– The operator does not create an mpC network of this type
– Instead, it calculates the time of execution of the corresponding parallel algorithm on the executing network, based on
  » the provided performance model of the algorithm
  » the most recent performance characteristics of the physical processors and communication links
mpC: mapping

The dispatcher maps abstract processes of the mpC network to the processes of the parallel program
– At runtime
– Trying to minimize the execution time
The mapping is based on
» The model of the executing network of computers
» A map of the processes of the parallel program
  The total number of processes running on each computer
  The number of free processes
mpC: mapping (ctd)

The mapping is based on (ctd)
» The performance model of the parallel algorithm represented by this mpC network
  The number of parallel processes executing the algorithm
  The absolute volume of computations performed by each of the processes
  The absolute volume of data transferred between each pair of processes
  The scenario of interaction between the parallel processes during the algorithm execution
mpC: mapping (ctd)

Two main features:
– Estimation of each particular mapping
  » Based on
    Formulas for
      – each computation unit in the scheme declaration
      – each communication unit in the scheme declaration
    Rules for each sequential and parallel algorithmic pattern
      – for, if, par, etc.
HeteroMPI

– An extension of MPI
– The programmer can describe the performance model of the implemented algorithm
  » In a small model definition language shared with mpC
– Given this description
  » HeteroMPI tries to create a group of processes that executes the algorithm faster than any other group
HeteroMPI (ctd)

The standard MPI approach to group creation
– Acceptable in homogeneous environments
  » If there is one process per processor
  » Any group will execute the algorithm at the same speed
– Not acceptable
  » In heterogeneous environments
  » If there is more than one process per processor
In HeteroMPI
– The programmer can describe the algorithm
– The description is translated into a set of functions
  » Making up an algorithm-specific part of the HeteroMPI run-time system
HeteroMPI (ctd)

A new operation to create a group of processes:

HMPI_Group_create(
    HMPI_Group* gid,
    const HMPI_Model* perf_model,
    const void* model_parameters)

Collective operation
– In the simplest case, called by all processes of HMPI_COMM_WORLD
HeteroMPI (ctd)

Dynamic update of the estimation of the processor speeds can be performed by

HMPI_Recon(
    HMPI_Benchmark_function func,
    const void* input_p,
    int num_of_parameters,
    const void* output_p)

Collective operation
– Called by all processes of HMPI_COMM_WORLD (a usage sketch follows)
HeteroMPI (ctd)

Prediction of the execution time of the algorithm:

HMPI_Timeof(
    HMPI_Model *perf_model,
    const void* model_parameters)

Local operation
– Can be called by any process
HeteroMPI (ctd)

Another collective operation to create a group of processes:

HMPI_Group_auto_create(
    HMPI_Group* gid,
    const HMPI_Model* perf_model,
    const void* model_parameters)

Used if the programmer wants HeteroMPI to find the optimal number of processes
HeteroMPI (ctd)

Other HMPI operations:

HMPI_Init()
HMPI_Finalize()
HMPI_Group_free()
HMPI_Group_rank()
HMPI_Group_size()
MPI_Comm *HMPI_Get_comm(HMPI_Group *gid)

HMPI_Get_comm
– Creates an MPI communicator with the group defined by gid (see the skeleton below)
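Putting these operations together, a minimal sketch of a HeteroMPI program; the header name, the arguments passed to HMPI_Init, and the performance-model symbols Nbody_model and model_params are assumptions:

#include <mpi.h>
#include <hmpi.h>                    /* header name assumed */

extern HMPI_Model Nbody_model;       /* generated from the model definition; assumed */

int main(int argc, char **argv)
{
    HMPI_Group gid;
    MPI_Comm  *comm;
    void      *model_params = NULL;  /* placeholder for the model's parameters */

    HMPI_Init(&argc, &argv);         /* argument passing assumed */

    /* create a group sized and placed according to the performance model */
    HMPI_Group_create(&gid, &Nbody_model, model_params);

    /* obtain an MPI communicator for the group and run ordinary MPI code on it */
    comm = HMPI_Get_comm(&gid);
    if (comm != NULL) {
        /* ... MPI collectives and point-to-point calls on *comm ... */
    }

    HMPI_Group_free(&gid);
    HMPI_Finalize();
    return 0;
}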
Grid Computing vs Distributed Computing

Definitions of Grid computing are various and vague
– A new computing model for better use of many separate computers connected by a network
– => Grid computing targets heterogeneous networks
What is the difference between Grid-based heterogeneous platforms and traditional distributed heterogeneous platforms?
– A single login to a group of resources is the core
– The Grid operating environment consists of services built on top of this
  » Different models of the GOE are supported by different Grid middleware (Globus, Unicore)
GridRPC

High-performance Grid programming systems are based on GridRPC
– RPC: Remote Procedure Call
  » The caller specifies the task, input data, output data and the remote computer
– GridRPC
  » The caller specifies the task, input data and output data
  » The remote computer is picked by the system
NetSolve

– A programming system for HPDC on global networks
  » Based on the GridRPC mechanism
– Some components of the application are only available on remote computers
A NetSolve application
– The user writes a client program
  » Any program (in C, Fortran, etc) with calls to the NetSolve client interface
  » Each call specifies
    The remote task
    The location of the input data on the user’s computer
    The location of the output data (on the user’s computer)
NetSolve (ctd)

Execution of the NetSolve application
– A NetSolve call results in a task being executed on a remote computer
– The NetSolve programming system
  » Selects the remote computer
  » Transfers the input data to the remote computer
  » Delivers the output data to the user’s computer
– The mapping of the remote tasks to computers
  » The core operation with an impact on the performance of the application
NetSolve (ctd)

[Diagram: NetSolve architecture. The client program calls netsl("task", in, out) through a proxy. (1) The agent, queried via netslInfo(), assigns the task to a server (Server A or Server B); (2) the input data in is uploaded to the chosen server; (3) the output data out is downloaded back to the client.]
NetSolve (ctd)

Mapping algorithm
– Each task is scheduled separately and independently of other tasks
  » A NetSolve application is seen as a sequence of independent tasks
– Based on two performance models (PMs)
  » The PM of the heterogeneous network of computers
  » The PM of a task
NetSolve (ctd)

Client interface
– User’s command-line interface
  » NS_problems, NS_probdesc
– C program interface (sketched below)
  » Blocking call
    int netsl(char *problem_name, …<argument_list>…)
  » Non-blocking call
    request = netslnb(…);
    info = netslpr(request);
    info = netslwt(request);
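A hedged sketch of both calling styles; the problem name "matmul" and its argument list are hypothetical, and only netsl, netslnb, netslpr and netslwt are taken from these slides:

#include "netsolve.h"   /* client header; name assumed */

int main(void)
{
    /* "matmul" is a hypothetical problem installed on a NetSolve server;
       its argument list is illustrative */
    enum { n = 100 };
    static double A[n*n], B[n*n], C[n*n];
    int info, request;

    /* blocking call: returns when C has been downloaded to the client */
    info = netsl("matmul()", n, A, B, C);

    /* non-blocking pattern from the slide */
    request = netslnb("matmul()", n, A, B, C);   /* submit the task */
    info = netslpr(request);                     /* probe for completion */
    info = netslwt(request);                     /* wait and fetch results */
    return info;
}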
NetSolve (ctd)

Network of computers
– A set of interconnected heterogeneous processors
  » Each processor is characterized by the execution time of the same serial code
    Matrix multiplication of two 200×200 matrices
    Obtained once at the installation of NetSolve and does not change
  » Communication links
    Characterized the same way as in NWS (latency + bandwidth)
    Dynamic (periodically updated)
NetSolve (ctd)

The performance model of a task
– Provided by the person installing the task on a remote computer
– A formula to calculate the execution time of the task by the solver
  » Uses the parameters of the task and the execution time of the standard computation unit (matrix multiplication)
– The sizes of the input and output data
– The PM = a distributed set of performance models
NetSolve (ctd)

The mapping algorithm
– Performed by the agent
– Minimizes the total execution time
  » T_total = T_computation + T_communication
  » T_computation
    Uses the formulas of the PM of the task
  » T_communication = T_input_delivery + T_output_receive
    Uses the characteristics of the communication link and the sizes of the input and output data
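A hedged sketch of the per-server estimate the agent might compute under this model; all names are illustrative:

/* Illustrative estimate for one candidate server: t_task is the value
   of the installed task's execution-time formula for this server, and
   the communication link is modelled by latency + bandwidth. */
double estimate_total_time(double t_task,
                           double in_bytes, double out_bytes,
                           double latency, double bandwidth)
{
    double t_comp = t_task;
    double t_comm = (latency + in_bytes  / bandwidth)   /* deliver input  */
                  + (latency + out_bytes / bandwidth);  /* receive output */
    return t_comp + t_comm;
}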
NetSolve (ctd)

Link to NetSolve software and documentation
– http://icl.cs.utk.edu/netsolve/