Heterogeneous and Grid Computing

Programming systems

– For parallel computing
  » Traditional systems (MPI, HPF) do not address the extra challenges of heterogeneous parallel computing
  » mpC, HeteroMPI
– For high-performance distributed computing
  » NetSolve/GridSolve
mpC

– An extension of ANSI C for programming parallel computations on networks of heterogeneous computers
– Supports efficient, portable and modular heterogeneous parallel programming
– Addresses the heterogeneity of both the processors and the communication network
mpC (ctd)

A parallel mpC program is a set of parallel processes interacting (that is, synchronizing their work and transferring data) by means of message passing
The mpC programmer cannot specify in the source code how many processes make up the program or which computers execute which processes
– This is specified by means external to the mpC language
– The source mpC code only determines which process of the program performs which computations
mpC (ctd)

The programmer describes the algorithm:
– The number of processes executing the algorithm
– The total volume of computation to be performed by each process
  » A formula including the parameters of the algorithm
  » The volume is measured in computation units provided by the application programmer
    (the very code that has been used to measure the speed of the processors)
mpC (ctd)

The programmer describes the algorithm (ctd):
– The total volume of data transferred between each pair of processes
– How the processes perform the computations and communications and interact
  » In terms of traditional algorithmic patterns (for, while, parallel for, etc)
  » Expressions in the statements specify not the computations and communications themselves but rather their amount
    (parameters of the algorithm and locally declared variables can be used)
mpC (ctd)

The abstract processes of the algorithm are mapped to the real parallel processes of the program
– The mapping of the abstract processes should minimize the execution time of the program
mpC (ctd)

Example (see handouts for full code):

algorithm HeteroAlgorithm(int n, double v[n])
{
  coord I=n;              /* n abstract processes, indexed by I */
  node { I>=0: v[I]; };   /* relative volume of computation of process I is v[I] */
};
…
int [*]main(int [host]argc, char **[host]argv)
{
  …
  {
    net HeteroAlgorithm(N, volumes) g;   /* create network g of this type */
    …
  }
}
mpC (ctd)

The program calculates the mass of a metallic construction welded from N heterogeneous rails
– It defines group g consisting of N abstract processes, each calculating the mass of one of the rails
– The calculation is performed by numerical 3D integration of the density function Density with a constant integration step
  » The volume of computation to calculate the mass of each rail is proportional to the volume of this rail
    (the i-th element of array volumes contains the volume of the i-th rail)
  » The program specifies that the volume of computation performed by each abstract process of g is proportional to the volume of its rail
mpC (ctd)

The library nodal function MPC_Wtime is used to measure the wall time elapsed to execute the calculations (see the sketch below)
Mapping of abstract processes to real processes
– Based on information about the speed at which the real processes run on the physical processors of the executing network
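A minimal sketch of this timing pattern, assuming MPC_Wtime is used like MPI_Wtime (the measured computation is elided; only MPC_Wtime itself is taken from these slides):

double start, finish;

start = MPC_Wtime();     /* wall-clock time before the computation */
/* ... the per-process computation being measured ... */
finish = MPC_Wtime();
printf("Elapsed time: %f s\n", finish - start);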
mpC (ctd)

By default, the speed estimation obtained on initialization of the mpC system on the network is used
– The estimation is obtained by running a special test program
mpC allows the programmer to change the default estimation of processor speed at runtime, tuning it to the computations that will really be executed
– The recon statement (a sketch follows below)
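A hedged sketch of the recon statement; the benchmark routine SerialCore and its arguments are hypothetical, chosen to resemble the Update_group call shown later in these slides:

/* SerialCore is a hypothetical serial routine representative of the
   computations the application will actually perform */
void SerialCore(int size, double *data);
…
/* rerun the benchmark on every physical processor and refresh the
   speed estimates used when mapping abstract processes */
recon SerialCore(TestSize, TestData);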
mpC (ctd)

An irregular problem
– Characterized by an inherent coarse/large-grained structure
– This structure determines a natural decomposition of the problem into a small number of subtasks
  » Of different sizes
  » Which can be solved in parallel
mpC (ctd)

The whole program solving the irregular problem
– A set of parallel processes
– Each process solves its subtask
  » As the sizes of the subtasks differ, the processes perform different volumes of computation
– The processes interact via message passing
Calculation of the mass of a metallic «hedgehog» is an example of an irregular problem
mpC (ctd)

A regular problem
– The most natural decomposition is a large number of small identical subtasks that can be solved in parallel
– As the subtasks are identical, they are of the same size
Multiplication of two n×n dense matrices is an example of a regular problem
– Naturally decomposed into n² identical subtasks
  » Computation of one element of the resulting matrix
How can a regular problem be solved efficiently on a network of heterogeneous computers?
mpC (ctd)

Main idea
– Transform the problem into an irregular problem
  » Whose structure is determined by the structure of the executing network
The whole problem
– Decomposed into a set of relatively large subproblems
– Each subproblem is made of a number of small identical subtasks stuck together
– The size of each subproblem depends on the speed of the processor solving this subproblem
  » For example, with three processors of relative speeds 2:1:1, a problem of 1000 identical subtasks would be split into subproblems of 500, 250 and 250 subtasks
mpC (ctd)

The parallel program
– A set of parallel processes
– Each process solves one subproblem on a separate physical processor
  » The volume of computation performed by each of these processes should be proportional to its speed
– The processes interact via message passing
mpC (ctd)

Example. Parallel multiplication, on a heterogeneous network, of matrix A and the transpose of matrix B, where A and B are dense square n×n matrices.

[Figure: matrices A and B and the resulting matrix C = A×Bᵀ]
mpC (ctd)

One step of the parallel multiplication of matrices A and Bᵀ: the pivot row of blocks of matrix B (shown slashed in the figure) is first broadcast to all processors. Then each processor, in parallel with the others, computes its part of the corresponding column of blocks of the resulting matrix C.

[Figure: matrices A, B and C at one step of the algorithm]
mpC (ctd)

See handouts for the mpC program implementing this algorithm
– The program first detects the number of physical processors
– It then updates the estimation of the processor speeds with the very code that is executed at each step of the main loop
mpC: inter-process communication

The basic subset of mpC is based on a performance model of the parallel algorithm that ignores communication operations
– It presumes that
  » the contribution of the communications to the total execution time of the algorithm is negligibly small compared to that of the computations
– This is acceptable for
  » Computing on heterogeneous clusters
  » Message-passing algorithms that do not frequently send short messages
– It is not acceptable for “normal” algorithms running on common heterogeneous networks of computers
mpC: inter-process communication (ctd)

The compiler can optimally map parallel algorithms in which communication operations contribute substantially to the execution time only if the programmer can specify
– The absolute volumes of computation performed by the processes
– The volumes of data transferred between the processes
mpC: inter-process communication (ctd)

Volume of communication
– Can be naturally measured in bytes
Volume of computation
– What is the natural unit of measurement?
  » It must allow the compiler to accurately estimate the execution time
– In mpC, the unit is the very code that has been most recently used to estimate the speed of the physical processors
  » Normally specified as part of the recon statement, as sketched below
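A hedged sketch of this pairing, based on the matrix-multiplication model shown later in these slides; the serial kernel SerialAxBT is hypothetical:

/* Hypothetical serial kernel: multiplies one r x r block;
   running it under recon refreshes the speed estimates */
recon SerialAxBT(a, b, c, n, r);

/* In the performance model, bench now denotes exactly this kernel,
   so process I performs (d[I]*n)/(r*r) such units of computation */
node { I>=0: bench*((d[I]*n)/(r*r)); };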
mpC: N-body problem

The system of bodies consists of large groups of bodies, with different groups at a good distance from each other. The bodies move under the influence of Newtonian gravitational attraction.
mpC: N-body problem (ctd)

Parallel N-body algorithm
– There is a one-to-one mapping between groups of bodies and parallel processes of the algorithm
– Each process
  » Holds in its memory all data characterising the bodies of its group
    (masses, positions and velocities of the bodies)
  » Is responsible for updating this data
mpC: N-body problem (ctd)

Parallel N-body algorithm (ctd)
– The effect of each remote group is approximated by a single equivalent body
  » To update its group, each process requires the total mass and the center of mass of all remote groups
    The total mass of each group of bodies is constant, so it is calculated once. Each process receives the calculated total mass from each of the other processes and stores all the masses.
    The center of mass of each group is a function of time. At each step of the simulation, each process computes its center of mass and sends it to the other processes (see the sketch below).
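A minimal sketch of this per-group computation; the layout of the Body structure is an assumption (these slides only use sizeof(Body)):

typedef struct { double m, x, y, z, vx, vy, vz; } Body;  /* assumed layout */

/* Total mass and center of mass of one group of n bodies */
void CenterOfMass(const Body *g, int n, double *M, double c[3])
{
    *M = 0.0;
    c[0] = c[1] = c[2] = 0.0;
    for (int i = 0; i < n; i++) {
        *M   += g[i].m;
        c[0] += g[i].m * g[i].x;
        c[1] += g[i].m * g[i].y;
        c[2] += g[i].m * g[i].z;
    }
    c[0] /= *M; c[1] /= *M; c[2] /= *M;
}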
mpC: N-body problem (ctd)

Parallel N-body algorithm (ctd)
– At each step of the simulation the updated system of bodies is visualised
  » To do this, all groups of bodies are gathered to the process responsible for the visualisation, which is the host-process
– In general, different groups have different sizes
  » Different processes perform different volumes of computation
  » Different volumes of data are transferred between different pairs of processes
mpC: N-body problem (ctd)

Parallel N-body algorithm (ctd)
– From the point of view of each individual process, the system includes all the bodies of its own group, with each remote group approximated by a single equivalent body.
mpC: N-body problem (ctd)

Pseudocode of the N-body algorithm:

Initialise groups of bodies on the host-process
Visualize the groups of bodies
Scatter the groups across processes
Compute masses of the groups in parallel
Communicate to share the masses among processes
while(1) {
  Compute centers of mass in parallel
  Communicate the centers among processes
  Update the state of the groups in parallel
  Gather the groups to the host-process
  Visualize the groups of bodies
}
mpC N-body application

The core is the specification of the performance model of the algorithm:

algorithm Nbody(int m, int k, int n[m])
{
  coord I=m;                                 /* one abstract process per group */
  node { I>=0: bench*((n[I]/k)*(n[I]/k)); };
       /* volume of computation of process I, in units of the benchmark
          code, which updates a test group of k bodies */
  link { I>0: length*(n[I]*sizeof(Body)) [I]->[0]; };
       /* each process sends its n[I] bodies to the host-process */
  parent [0];
};
mpC N-body application (ctd)

The most important fragments of the rest of the code:

void [*] main(int [host]argc, char **[host]argv)
{
  ...
  // Make the test group consist of the first TGsize
  // bodies of the very first group of the system
  OldTestGroup[] = (*(pTestGroup)Groups[0])[];
  // Estimate the processor speeds on the code that updates the test group
  recon Update_group(TGsize, &OldTestGroup, &TestGroup,
                     1, NULL, NULL, 0);
  {
    net Nbody(NofGroups, TGsize, NofBodies) g;
    …
  }
}
mpC: algorithmic patterns

One more important feature of a parallel algorithm is still not reflected in the performance model
– The order of execution of computations and communications
As the model says nothing about how the parallel processes interact during execution of the algorithm, the compiler assumes that
– First, all processes execute all their computations in parallel
– Then, the processes execute all their communications in parallel
– There is a synchronisation barrier between the execution of the computations and the communications
mpC: algorithmic patterns (ctd)

These assumptions are unsatisfactory in the case of
– Data dependencies between computations performed by different processes
  » One process may need data computed by other processes in order to start its computations
  » This serialises some computations performed by different parallel processes => the real execution time of the algorithm will be longer
– Overlapping of computations and communications
  » The real execution time of the algorithm will be shorter
mpC: algorithmic patterns (ctd)

Thus, if the estimation is not based on the actual scenario of interaction of the parallel processes
– It may be inaccurate, which leads to a non-optimal mapping of the algorithm to the executing network
Example. An algorithm with fully serialised computations.
– Optimal mapping:
  » All the processes are assigned to the fastest physical processor
– Mapping based on the above assumptions:
  » Involves all available physical processors
mpC: algorithmic patterns (ctd)

mpC addresses the problem
– The programmer can specify the scenario of interaction of the parallel processes during execution of the parallel algorithm
– That specification is part of the network type definition
  » The scheme declaration
mpC: algorithmic patterns (ctd)

Example 1. N-body algorithm

algorithm Nbody(int m, int k, int n[m])
{
  coord I=m;
  node { I>=0: bench*((n[I]/k)*(n[I]/k)); };
  link { I>0: length*(n[I]*sizeof(Body)) [I]->[0]; };
  parent [0];
  scheme
  {
    int i;
    /* first, every process performs 100% of its computations in parallel */
    par (i=0; i<m; i++) 100%%[i];
    /* then, every process sends 100% of its data to the host in parallel */
    par (i=1; i<m; i++) 100%%[i]->[0];
  };
};
mpC: algorithmic patterns (ctd)

Example 2. Matrix multiplication.

algorithm ParallelAxBT(int p, int n, int r, int d[p])
{
  coord I=p;
  node { I>=0: bench*((d[I]*n)/(r*r)); };
  link (J=p) { I!=J: length*(d[I]*n*sizeof(double)) [J]->[I]; };
  parent [0];
mpC: algorithmic patterns (ctd)

Example 2. Matrix multiplication (ctd)

  scheme
  {
    int i, j, PivotProc=0, PivotRow=0;
    for(i=0; i<n/r; i++, PivotRow+=r)
    {
      /* move to the next processor when its rows are exhausted */
      if(PivotRow>=d[PivotProc]) { PivotProc++; PivotRow=0; }
      /* broadcast the pivot row of blocks from its owner */
      for(j=0; j<p; j++)
        if(j!=PivotProc)
          (100.*r/d[PivotProc])%%[PivotProc]->[j];
      /* each processor computes its part of the column of blocks */
      par(j=0; j<p; j++)
        (100.*r/n)%%[j];
    }
  };
};
mpC: the timeof operator

Further modification of the matrix multiplication program:

[host]:
{
  int m;
  struct {int p; double t;} min;
  double t;
  min.p = 0;
  min.t = DBL_MAX;
  /* try every number of involved processors from 1 to p and keep
     the one with the smallest predicted execution time */
  for(m=1; m<=p; m++)
  {
    Partition(m, speeds, d, n, r);
    t = timeof(net ParallelAxBT(m, n, r, d) w);
    if(t<min.t) { min.p = m; min.t = t; }
  }
  p = min.p;
}
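The Partition routine above is application code whose full text is not in these slides. A plausible sketch, assuming it distributes the n rows among m processors in multiples of the block size r, proportionally to the measured speeds:

/* A hedged sketch: allocate to each of the m processors a number of
   rows d[i], a multiple of r, roughly proportional to speeds[i] */
void Partition(int m, const double *speeds, int *d, int n, int r)
{
    double total = 0.0;
    int i, blocks = n / r, given = 0;

    for (i = 0; i < m; i++) total += speeds[i];
    for (i = 0; i < m; i++) {
        int b = (int)(blocks * speeds[i] / total);  /* blocks for processor i */
        d[i] = b * r;
        given += b;
    }
    d[0] += (blocks - given) * r;   /* leftover blocks go to one processor */
}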
mpC: the timeof operator (ctd)

The timeof operator estimates the execution time of the parallel algorithm without actually executing it
– The only operand specifies a fully specified network type
  » The values of all parameters of the network type must be specified
– The operator does not create an mpC network of this type
– Instead, it calculates the time of execution of the corresponding parallel algorithm on the executing network, based on
  » the provided performance model of the algorithm
  » the most recent performance characteristics of the physical processors and communication links
mpC: mapping

The dispatcher maps abstract processes of the mpC network to the processes of the parallel program
– At runtime
– Trying to minimize the execution time
The mapping is based on
» The model of the executing network of computers
» A map of the processes of the parallel program
  The total number of processes running on each computer
  The number of free processes
mpC: mapping (ctd)

The mapping is based on (ctd)
» The performance model of the parallel algorithm represented by this mpC network
  The number of parallel processes executing the algorithm
  The absolute volume of computations performed by each of the processes
  The absolute volume of data transferred between each pair of processes
  The scenario of interaction between the parallel processes during the algorithm execution
mpC: mapping (ctd)

Two main features:
– Estimation of each particular mapping
  » Based on
    Formulas for
      – each computation unit in the scheme declaration
      – each communication unit in the scheme declaration
    Rules for each sequential and parallel algorithmic pattern
      – for, if, par, etc.
HeteroMPI

– An extension of MPI
– The programmer can describe the performance model of the implemented algorithm
  » In a small model definition language shared with mpC
– Given this description
  » HeteroMPI tries to create a group of processes that executes the algorithm faster than any other group
HeteroMPI (ctd)

The standard MPI approach to group creation
– Acceptable in homogeneous environments
  » If there is one process per processor
  » Any group will execute the algorithm at the same speed
– Not acceptable
  » In heterogeneous environments
  » If there is more than one process per processor
In HeteroMPI
– The programmer can describe the algorithm
– The description is translated into a set of functions
  » Making up an algorithm-specific part of the HeteroMPI run-time system
HeteroMPI (ctd)

A new operation to create a group of processes:

HMPI_Group_create(
    HMPI_Group* gid,
    const HMPI_Model* perf_model,
    const void* model_parameters)

Collective operation
– In the simplest case, called by all processes of HMPI_COMM_WORLD
HeteroMPI (ctd)

Dynamic update of the estimation of the processor speeds can be performed by

HMPI_Recon(
    HMPI_Benchmark_function func,
    const void* input_p,
    int num_of_parameters,
    const void* output_p)

Collective operation
– Called by all processes of HMPI_COMM_WORLD (a usage sketch follows)
HeteroMPI (ctd)

Prediction of the execution time of the algorithm:

HMPI_Timeof(
    HMPI_Model *perf_model,
    const void* model_parameters)

Local operation
– Can be called by any process
HeteroMPI (ctd)

Another collective operation to create a group of processes:

HMPI_Group_auto_create(
    HMPI_Group* gid,
    const HMPI_Model* perf_model,
    const void* model_parameters)

Used if the programmer wants HeteroMPI to find the optimal number of processes
HeteroMPI (ctd)

Other HMPI operations:

HMPI_Init()
HMPI_Finalize()
HMPI_Group_free()
HMPI_Group_rank()
HMPI_Group_size()
MPI_Comm *HMPI_Get_comm(HMPI_Group *gid)

HMPI_Get_comm
– Creates an MPI communicator with the group defined by gid (see the skeleton below)
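Putting these operations together, a minimal sketch of a HeteroMPI program; the header name, the arguments passed to HMPI_Init, and the performance-model symbols Nbody_model and model_params are assumptions:

#include <mpi.h>
#include <hmpi.h>                    /* header name assumed */

extern HMPI_Model Nbody_model;       /* generated from the model definition; assumed */

int main(int argc, char **argv)
{
    HMPI_Group gid;
    MPI_Comm  *comm;
    void      *model_params = NULL;  /* placeholder for the model's parameters */

    HMPI_Init(&argc, &argv);         /* argument passing assumed */

    /* create a group sized and placed according to the performance model */
    HMPI_Group_create(&gid, &Nbody_model, model_params);

    /* obtain an MPI communicator for the group and run ordinary MPI code on it */
    comm = HMPI_Get_comm(&gid);
    if (comm != NULL) {
        /* ... MPI collectives and point-to-point calls on *comm ... */
    }

    HMPI_Group_free(&gid);
    HMPI_Finalize();
    return 0;
}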
Grid Computing vs Distributed Computing

Definitions of Grid computing are various and vague
– A new computing model for better use of many separate computers connected by a network
– => Grid computing targets heterogeneous networks
What is the difference between Grid-based heterogeneous platforms and traditional distributed heterogeneous platforms?
– A single login to a group of resources is the core
– The Grid operating environment consists of services built on top of this
  » Different models of the GOE are supported by different Grid middleware (Globus, Unicore)
GridRPC

High-performance Grid programming systems are based on GridRPC
– RPC: Remote Procedure Call
  » The caller specifies the task, input data, output data and the remote computer
– GridRPC
  » The caller specifies the task, input data and output data
  » The remote computer is picked by the system
NetSolve

– A programming system for HPDC on global networks
  » Based on the GridRPC mechanism
– Some components of the application are only available on remote computers
A NetSolve application
– The user writes a client program
  » Any program (in C, Fortran, etc) with calls to the NetSolve client interface
  » Each call specifies
    The remote task
    The location of the input data on the user’s computer
    The location of the output data (on the user’s computer)
NetSolve (ctd)

Execution of the NetSolve application
– A NetSolve call results in a task being executed on a remote computer
– The NetSolve programming system
  » Selects the remote computer
  » Transfers the input data to the remote computer
  » Delivers the output data to the user’s computer
– The mapping of the remote tasks to computers
  » The core operation with an impact on the performance of the application
NetSolve (ctd)

[Diagram: NetSolve architecture. The client program calls netsl("task", in, out) through a proxy. (1) The agent, queried via netslInfo(), assigns the task to a server (Server A or Server B); (2) the input data in is uploaded to the chosen server; (3) the output data out is downloaded back to the client.]
NetSolve (ctd)

Mapping algorithm
– Each task is scheduled separately and independently of other tasks
  » A NetSolve application is seen as a sequence of independent tasks
– Based on two performance models (PMs)
  » The PM of the heterogeneous network of computers
  » The PM of a task
NetSolve (ctd)

Client interface
– User’s command-line interface
  » NS_problems, NS_probdesc
– C program interface (sketched below)
  » Blocking call
    int netsl(char *problem_name, …<argument_list>…)
  » Non-blocking call
    request = netslnb(…);
    info = netslpr(request);
    info = netslwt(request);
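A hedged sketch of both calling styles; the problem name "matmul" and its argument list are hypothetical, and only netsl, netslnb, netslpr and netslwt are taken from these slides:

#include "netsolve.h"   /* client header; name assumed */

int main(void)
{
    /* "matmul" is a hypothetical problem installed on a NetSolve server;
       its argument list is illustrative */
    enum { n = 100 };
    static double A[n*n], B[n*n], C[n*n];
    int info, request;

    /* blocking call: returns when C has been downloaded to the client */
    info = netsl("matmul()", n, A, B, C);

    /* non-blocking pattern from the slide */
    request = netslnb("matmul()", n, A, B, C);   /* submit the task */
    info = netslpr(request);                     /* probe for completion */
    info = netslwt(request);                     /* wait and fetch results */
    return info;
}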
NetSolve (ctd)

Network of computers
– A set of interconnected heterogeneous processors
  » Each processor is characterized by the execution time of the same serial code
    Matrix multiplication of two 200×200 matrices
    Obtained once at the installation of NetSolve and does not change
  » Communication links
    Characterized the same way as in NWS (latency + bandwidth)
    Dynamic (periodically updated)
NetSolve (ctd)

The performance model of a task
– Provided by the person installing the task on a remote computer
– A formula to calculate the execution time of the task by the solver
  » Uses the parameters of the task and the execution time of the standard computation unit (matrix multiplication)
– The sizes of the input and output data
– The PM = a distributed set of performance models
NetSolve (ctd)

The mapping algorithm
– Performed by the agent
– Minimizes the total execution time
  » T_total = T_computation + T_communication
  » T_computation
    Uses the formulas of the PM of the task
  » T_communication = T_input_delivery + T_output_receive
    Uses the characteristics of the communication link and the sizes of the input and output data
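A hedged sketch of the per-server estimate the agent might compute under this model; all names are illustrative:

/* Illustrative estimate for one candidate server: t_task is the value
   of the installed task's execution-time formula for this server, and
   the communication link is modelled by latency + bandwidth. */
double estimate_total_time(double t_task,
                           double in_bytes, double out_bytes,
                           double latency, double bandwidth)
{
    double t_comp = t_task;
    double t_comm = (latency + in_bytes  / bandwidth)   /* deliver input  */
                  + (latency + out_bytes / bandwidth);  /* receive output */
    return t_comp + t_comm;
}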
NetSolve (ctd)

Link to NetSolve software and documentation
– http://icl.cs.utk.edu/netsolve/