
Parallel Computing 17 (1991) 285-296, North-Holland

Practical aspects and experiences

Parallelism in a multi-user environment

R.J. van der Pas and J.M. van Kats

CONVEX Computer B.V., Europalaan 514, 3526 KS Utrecht, The Netherlands

Received April 1990; revised December 1990

Abstract

Van der Pas, R.J. and J.M. van Kats, Parallelism in a multi-user environment, Parallel Computing 17 (1991) 285-296

Vectorization reduces CPU time, whereas parallelization leads to a reduction of wall-clock time at the cost of more CPU cycles. Application programs will in general not be perfectly parallel. Therefore a static scheduling scheme for processor usage will lead to a waste of cycles. In contrast with this, dynamic allocation of processors will prevent CPUs from being idle as long as there is work to do.

Keywords. Dynamic and static scheduling; efficiency; parallel processing; throughput performance.

1. Introduction

Parallel processing is posing a challenge to an increasing number of scientists. At the moment it is believed that the physical limits of CPU design and manufacturing have almost been reached, so that no drastic single-processor performance increase can be expected in the near future. However, we think that such a belief has probably always existed and has so far always been overtaken by new, unprecedented developments. For example, gallium arsenide and superconductivity at room temperature might push the technology frontier.

Parallel processing is here to stay, and in this paper we will discuss some practical and economic aspects of parallel processing that are often not mentioned in the literature. To keep the discussion concrete, we will illustrate our points with some real results obtained with scalar, vector and parallel programs.

2. The problem must be solved as soon as possible

Supercomputers would probably not exist if the conventional Von Neumann architecture could provide us with the processing power needed to solve today's problems. Optimization techniques applied in (super)computers, like vectorization and parallelization, are a derivative of the architecture of the system and not goals in themselves. They are necessary to squeeze the most out of machines based on these principles.



In general it can be said that any user of a computer system wants her or his computation to be finished as soon as possible. In a multi-user environment the system has to be shared with fellow users, leading to an increase in elapsed time. This is a normal and commonly accepted situation. The only objective of parallelization is to achieve a reduction in elapsed time over the single-processor version. Depending on the application, system and workload, this reduction can range from optimal to none at all. Now, if the potential decrease in elapsed time cannot be realized, where does the benefit of parallelization lie? In this paper we will try to make clear that even in such a situation parallelization can pay off, but expectations should not be set too high.

3. Parallel processing and vectorization (a discussion)

Parallel processing has been heralded as a technique to increase computer performance. Indeed, for some applications quite spectacular speed-ups over uni-processor performance have been reported.

With a scheduling mechanism based on static allocation of processors it is the programmer's responsibility to make sure that, during the complete time the program is running, all CPUs are fully utilized. Otherwise, one or more of the processors will be idle during a part of the computation.

The majority of the existing production programs are not written in such a way that they can make efficient use of more than one processor during the total time of the computation. This is because in general the 'level of parallelism' within a program is not constant, but varies during the computations. So, in the case of a static scheduling mechanism, some of the processors will indeed be idle during a certain period of time, if there are more processors than processes at that time.

This means that in a production-oriented environment, like a computer centre serving a large user community, static scheduling of processors causes valuable cycles to be lost, which increases the overall costs of computing. Dynamic scheduling of processors ([6]) does not suffer from this, because as soon as a processor becomes free, a new task will be assigned to it. This task can be another program, another portion ('thread') of the same program, or even an operating system task.
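To make the notion of a varying level of parallelism concrete, consider the following minimal Fortran sketch (our own illustration, not one of the benchmark programs used later): a serial phase, a parallelizable kernel and another serial phase. Under static scheduling, all but one of the allocated processors would sit idle during the first and last phase.

      PROGRAM PHASES
C     Sketch: a serial phase, a parallelizable kernel, a serial phase
      INTEGER N, I
      PARAMETER (N = 1000)
      REAL A(N), C(N), S
C     Phase 1: the recurrence makes this loop inherently serial
      A(1) = 1.0
      DO 10 I = 2, N
         A(I) = 0.5*A(I-1) + 1.0
 10   CONTINUE
C     Phase 2: independent iterations; a parallelizing compiler can
C     spread this loop over all available CPUs
      DO 20 I = 1, N
         C(I) = A(I)*A(I)
 20   CONTINUE
C     Phase 3: a sum reduction, executed serially here
      S = 0.0
      DO 30 I = 1, N
         S = S + C(I)
 30   CONTINUE
      PRINT *, 'Sum =', S
      END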

This raises the question "What do we actually want with parallel processing?" To answer this question two distinct computer environments must be distinguished:
- Running one or more specific applications, within a given timeframe, on a dedicated machine. A weather forecast model is a typical example of this.
- Running a mix of applications, varying in code size, CPU time and I/O usage. For example, the computer centre of a university or a large company can have a mixed workload of this kind.

Note that in practice both environments can co-exist, but not in an active state at the same moment.

It is obvious that parallel processing may reduce the wall-clock time (or time-to-solution) of a program running in the first (dedicated) environment. For the second environment this is less obvious and even questionable. A program using more than one processor in fact claims more of the available CPU resources; to prevent the elapsed time of other jobs with a similar priority from increasing above an acceptable level, the system should lower its priority. So if the workload is high, the program will probably have only one CPU at its disposal if dynamic allocation of processors is possible.

Another very important aspect of parallel processing is that, no matter how efficiently it is implemented, at least a few CPU cycles will always be needed for interprocessor communication or processor-to-system communication. This has for instance been observed in [5] and may in general lead to an increase in total CPU time. Depending on the workload, the wall-clock time can be reduced, but the CPU time may increase. In many papers about parallel processing, CPU time is mentioned where wall-clock time or Time-To-Solution (TTS) is meant.

With more traditional optimization techniques, like vectorization, the situation is in fact completely different. Here the main objective is to decrease CPU time (i.e. the bill) and this will in general also lead to a reduced elapsed time. If a compiler or programmer does a good job with, for instance, vectorization, the existing hardware is used more efficiently and hence the bill will in general be lower.

We believe that researchers designing (new) parallel algorithms should take the following considerations into account:
- First, try to vectorize the algorithm as much as possible. Experience shows that, in general, changes made to a program to increase vectorization also increase scalar performance. This will become more pronounced with the advent of so-called 'superscalar' machines, which need to pipeline references to memory to achieve high computational rates.
- Next, parallelize the algorithm, preferably such that the execution time on one processor doesn't lead to an unacceptable loss in efficiency. Then reconsider the following:
  - Minimize interprocessor communication (this will at least reduce the total CPU time) and the number of spawn and join instructions, i.e. aim at coarse-grained parallelism.
  - Avoid the introduction of new bottlenecks, like memory bank conflicts, as a function of the number of processors used.
  - Maintain data efficiency. The (new) data structure for the parallel version of an algorithm should not introduce an I/O bottleneck.

Parallel processing on shared memory CONVEX/CRAY-like architectures in a multi-user environment raises some new questions. In the next sections we will discuss some of these questions and try to give answers as well.

4. The efficiency trap

With parallel processing the efficiency of a computer is often measured by running a program on various numbers of processors. The standard approach is to run on, say, i = 1, 2, 4, 8, ... processors, calculating the speed-up S(i) = T(1)/T(i), with T(i) the wall-clock time on i (vector) processors (i = 1, ..., P). The efficiency E(i) for i processors is then defined as E(i) = S(i)/i. For example, in case of a perfect speed-up S(P) = P the corresponding efficiency is E(P) = 100%.
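As a small worked instance of these definitions, using the numbers that will appear for computer A in Fig. 1 below (T(1) = 6 and T(4) = 3 time units):

    S(4) = T(1)/T(4) = 6/3 = 2.0,   E(4) = S(4)/4 = 50%.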

Results giving only the efficiency or the speed-up are useless if not at least one of the times T(j) (j = 1, ..., P) is given as well. We will illustrate this with a simple example.

The assumptions we make are:
- We are dealing with 2 computers, A and B.
- Both computers A and B have 4 vector processors and 4 scalar processors.
- The vector and parallel overhead on both machines is zero and their scalar processors are equally fast.
- A vector processor of A is 10 times faster than the scalar processor of A.
- A vector processor of B is only 2 times faster than the scalar processor of B.
- The program we are considering consists of 3 parts:
  Part 1 - An initialization phase, costing 1 unit of time in scalar mode
  Part 2 - A computational kernel, costing 40 units of time in scalar mode
  Part 3 - A post-processing part, costing 1 unit of time in scalar mode.
- Both parts 1 and 3 are completely scalar and cannot be vectorized and/or parallelized. Part 2 can be vectorized and parallelized.


                        Time part 1   Time part 2   Time part 3   S(4)   E(4)   Wall-clock

Scalar mode
  A                          1             40             1        -      -        42
  B                          1             40             1        -      -        42

Vector mode
  A                          1              4             1        -      -         6
  B                          1             20             1        -      -        22

Parallel & vector mode
  A                          1              1             1       2.0    50%        3
  B                          1              5             1       3.1    79%        7

Fig. 1. Efficiency and wall-clock times for computers A and B.

Note that the third assumption is not realistic, but it has been included to make the example easier; real-life results will only be worse. In Fig. 1 the efficiency and wall-clock times for computers A and B are given for the various modes of execution, i.e. scalar, vector and vector/parallel mode.

In Fig. 1 we see that the efficiency E(4) of computer B is much higher than that of A: 79% versus only 50%. On the other hand, computer A finishes the program 2.3 times sooner than computer B.

The reason for this difference between efficiency and wall-clock time is that computer A does a better job at vectorization, so the scalar parts dominate more than on the slower computer B. For computer B there is more left for parallelization to reduce than for computer A. The reduction caused by vectorization is reflected in the smaller elapsed time of A, whereas the reduction caused by parallelization is reflected in the efficiency. The same effect can be observed for a realistic code in [5].

Note that the conclusions are not specifically bound to vectorization and parallelization, but hold for any method of optimization.

5. Some results

In this section we present results meant to illustrate some of the remarks made in Sections 3 and 4. Related experiments have been performed on an Alliant by Dimpsey and Iyer ([3,4]) and on a CRAY by Bieterman ([2]).

As all our examples have been obtained on a CONVEX C220/C240 system, we will briefly describe the architecture and software of the C2 series in the next section. For a more detailed discussion we refer to [1].

5.1. Overview of the CONVEX C2xx architecture and software

The C2xx (xx = 10, 20, 30 or 40) system is a tightly coupled, shared-memory, parallel/multiprocessor supercomputer, supporting up to 4 CRAY-like 40-nanosecond ECL/CMOS vector processors, each with a 4 Gigabyte virtual address space.

A schematic representation of the architecture of the four-headed CONVEX C240 is given in Fig. 2.

The shared memory has a maximum capacity of 2 Gigabytes. There are 5 physical 200 Mbyte/s ports to memory, with each port interfacing to memory via a non-blocking crossbar mechanism.


[Figure: block diagram of the C240 showing the CPUs, the communications registers, the non-blocking crossbar with its 5 actual ports to the 2 Gigabytes of physical memory, the scan bus and the 80 MByte/s PBUS.]

Fig. 2. The CONVEX C240 architecture.

There is no functional unit called a crossbar; instead, the crossbar is distributed across the memory modules through VLSI gate arrays implementing the crossbar function.

Each vector processor contains, next to the scalar processor, a vector unit with vector functional units for load/store, add/subtract/logical and multiply/divide/square root. These functional units can be fully chained. There are 8 vector registers, each 64 bits wide and 128 elements deep. In addition there is a 128-bit vector mask register to support automatic vectorization of loops with IF statements and indirect addressing. The scalar floating-point units can execute concurrently with the vector floating-point units.

Directly supporting multiprocessing and parallel processing is a set of 1024 shared, multiported, 64-bit wide communication registers. These registers generally contain a Program Status Word, iteration values, address offsets and other parameters used to support the automatic parallelization provided by the Fortran, C and Ada compilers. There are special instructions, like spawn, join, get, put and increment, that access values kept in the communication registers. The communication registers are logically separated into 8 blocks. This is more than the maximum number of CPUs and allows the system to have processes available before they need to be made active ('mounted').

The Asynchronous Self Allocating Processors (ASAP) mechanism is based on the work described in [6]. ASAP is a system combining both multiprocessing and parallel processing capabilities. To support the symmetric multiprocessing capabilities of the C2, a substantial rewrite of the Operating System (BSD-based UNIX, POSIX compliant) kernel was necessary. The major areas of modification were the scheduler and the semaphoring kernel tables that deal with resource sharing and file management.

The ASAP mechanism uses a combination of dynamic scheduling of processors and a priority scheme such that when the workload is high the system behaves as a multiprocessing system and whenever the workload is low, jobs asking for more than one processor will get them.

ASAP is based on automatic, hardware-implemented allocation of processors. There is no control processor determining what to do next; instead, whenever a processor is finished with its current piece of work, it examines the communication registers for the next portion of work. The processor does not need to know whether this is for the same program (parallel processing) or for another program (multiprocessing). Even when there are I/O operations to do within a code section, (automatic) parallelization is still possible.

When a parallel region is entered, a so-called thread is created by the hardware. A thread is a series of instructions that represents the smallest schedulable entity. All idle processors (the number is not fixed) respond to the spawn and begin executing threads. A thread initiates execution with its own values of the program counter and stack management registers. When a parallel region is exited, all threads, on all processors, asynchronously and independently execute join instructions. The last processor executing the join continues to execute the subsequent sequential part of the program. When a join is finished on a processor, this processor becomes free to execute other threads or processes of different programs. If the processor is not the last one to finish, it schedules itself through dispatch queues maintained in the communication registers.

Each CPU has a Communication Index Register (CIR), which is an index to one of the 8 partitions of the communication register sets. The CIR can be seen as a virtual process identifier managed by the CPU, either under the direction of the Operating System or automatically via the ASAP mechanism. One or more CIRs can point to different programs (multiprocessing) or to the same program (parallel processing). Any CPU can execute any mounted process by switching its CIR register to the partition describing the process.

In this way CPUs will never be idle as long as there is work to be done. This will minimize the time-to-solution of the complete workload in a dynamic way, determined by the use of the system. In case there are many programs executing, ASAP will behave like a multiprocessing system, but when there are only a few jobs executing it will grant each program multiple CPUs, i.e. parallel processing.

The CONVEX compiler technology is based on a common global optimizer supporting multiple front ends. Each front end produces a common intermediate language that is a global data-flow representation of a program. All compilers (Fortran, C and Ada) support automatic vectorization and parallelization; in the case of Ada not only at the loop level, but also at the task level. The compiler generates the proper instructions to initiate a parallel region (like spawn) and to manage the communication registers (like put).

5.2. An example of ASAP on a CONVEX C240

To illustrate the ASAP dynamic scheduling mechanism we have executed a simple Fortran program containing the following code fragment:

      DO 1010 I = 1, 10
C        Record the thread and CPU id at the start of the iteration
         CALL GETINFO (TID(I), CPUID(I))
         C(I) = 0.0
         DO 1000 J = 1, N
            C(I) = C(I) + A(I,J)*B(J)
 1000    CONTINUE
C        Record the thread and CPU id again after the inner loop
         CALL GETINFO (TID(I), CPUID(I))
         PRINT *, I, TID(I), CPUID(I)
 1010 CONTINUE

Subroutine GETINFO returns the id of the thread and the CPU currently running. Two versions have been executed on a four-processor C240 with several users logged in. First we ran in vector mode only, and next with the outer loop executed in parallel. The results in vector mode and parallel mode are given in Fig. 3.

In Fig. 3 it is seen that the vector version of the program runs on one CPU only and there is only one thread executing. The iterations are executed in sequential order. The parallel version shows a very dynamic pattern: the various threads more or less 'walk' over the CPUs, and the order of the iterations is non-deterministic.


Vector mode                       Parallel mode

Value of I  Thread id  CPU id     Value of I  Thread id  CPU id

 1          0          3           4          3          2
 2          0          3           3          2          1
 3          0          3           5          2          3
 4          0          3           1          0          0
 5          0          3           2          1          3
 6          0          3           8          0          2
 7          0          3           9          0          0
 8          0          3          10          1          3
 9          0          3           7          2          3
10          0          3           6          3          0

Fig. 3. Example of the ASAP mechanism on a C240 in vector and parallel mode.

All four processors are utilized, but not in a symmetric way, as this is fully determined by the workload. This example also shows that I/O can be performed within a thread.

A remark must be made on the way the outer loop is spread over the CPUs. In our example the loop length is very short, so a different algorithm to split the loop into parallel parts is used. In general the Fortran compiler will split the loop into blocks of contiguous loop indices, where the number of blocks is determined at execution time and depends on the number of processors in the system.
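As an illustration, this blockwise splitting can be sketched as follows (a hypothetical rendering: the actual block sizes are determined by the compiler and the run-time system, and each block would be executed by a different thread rather than one after the other as here):

      PROGRAM BLOCKS
C     Sketch of splitting a loop of length N into P contiguous blocks
      INTEGER N, P, NBLK, IP, ILO, IHI, I
      PARAMETER (N = 10, P = 2)
      REAL C(N)
C     Block size: the ceiling of N/P
      NBLK = (N + P - 1)/P
      DO 20 IP = 1, P
         ILO = (IP - 1)*NBLK + 1
         IHI = MIN(IP*NBLK, N)
         PRINT *, 'Thread', IP, 'gets iterations', ILO, 'to', IHI
C        In the real system the iterations ILO,...,IHI of this block
C        would be executed by one thread
         DO 10 I = ILO, IHI
            C(I) = REAL(I)
 10      CONTINUE
 20   CONTINUE
      END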

5.3. Example of a mixed workload on a parallel system

We have selected 7 Fortran programs with not only a varying degree of vectorization and parallelization, but also a varying CPU time on one processor. The programs are based on the Synthetic Standard Benchmark of the Academic Computer Center of the University of Utrecht (ACCU) [7,8], slightly modified for our purpose. They can be characterized as follows:

PROG1 - Program to generate memory bank conflicts.
PROG2 - Calculation of matrix times vector; several matrix sizes.
PROG3 - Solution of a full linear system; several matrix sizes.
PROG4 - Generation of random numbers.
PROG5 - Solution of a linear least squares problem, using two methods.
PROG6 - Solution of a diffusion problem, using ODEPACK.
PROG7 - Solution of a Poisson problem using a highly parallel solver.

The programs vary in degree of parallelism. PROG4 is a purely scalar program and PROG7 parallelizes quite well; the other programs are somewhere in between. In the selection of these programs no attempt has been made to simulate a real-life workload. They have only been selected because of their varying degree of parallelism and the fact that some of them could be seen as representative of modest application codes. As a consequence, the timing results in this section are only valid for the set PROG1, ..., PROG7. We believe however that the observations and conclusions hold in general for any set of programs.

Times in vector mode will refer to single-processor CPU times (unless explicitly stated otherwise). Times in parallel mode refer to wall-clock times or, equivalently, time-to-solution (TTS). Parallel mode will be shorthand for vector and parallel mode. All timing results presented in this section have been obtained with the Unix command 'time'. This command returns, among other statistics, the total CPU time spent in user mode and the Time-To-Solution (TTS) of a job. All results presented in this section have been obtained on a two-headed CONVEX C220 system with 128 MByte of main memory (16-way interleaved for 64 bits).


Program    Vector mode       Parallel mode
           CPU     TTS       CPU     TTS

PROG7      222     226       245     127

Fig. 4. A parallel program on 1 and 2 CPUs.

Program    Number of copies   Parallel mode     TTS(1)/TTS(n)
                              CPU     TTS

PROG7      1                  245     127        1.000
           2                  246     254        0.500
           3                  245     383        0.332
           4                  245     512        0.248

Fig. 5. Several copies of a parallel program on 2 CPUs.

Program    Vector mode       Parallel mode     Idle time (est.)
           CPU     TTS       CPU     TTS

PROG1       10      10        17      11          5
PROG2       30      30        31      24         17
PROG3       39      40        39      31         23
PROG4       74      76        75      77         79
PROG5        8       8        16      15         14
PROG6      207     210       225     137         49
PROG7      228     231       247     139         31
Sum        596     605       650     434        218
Total elapsed time 608               435

Fig. 6. Simulation of static scheduling on two CPUs.

PROG7 is a highly parallel program. We have run it on an empty C220 system in both vector and parallel mode. The results in seconds are given in Fig. 4.

In comparison with vector mode it is indeed seen that the CPU time increases with parallel processing. As mentioned earlier, this will happen on any parallel system, because there is always some communication overhead involved, consuming CPU cycles. In this example the extra cost is about 10%.

In our next experiment we further investigated the increase in CPU time caused by parallel processing and the reduction in TTS. Several copies of PROG7 have been executed simultaneously on the C220. The results in seconds are given in Fig. 5. The times given for 'n' copies are the average of the 'n' values obtained; the ratio TTS(1)/TTS(n) is given in the last column.

The experimental ratios of the elapsed times hardly differ from the theoretical value of 1/n, demonstrating the efficiency of the system.

Note that this experiment represents an unfavourable case for ASAP. We know that the program is parallel and will issue a great many spawn instructions. When running more than one copy, the request for additional processors will not be granted, as the other processor is busy with at least one other copy of the same program. As the copies are submitted simultaneously, n requests will be issued at the same time, but never honored. In fact this could be called 'overhead associated with parallelism in a multi-user environment'.

In our third experiment we ran all 7 programs in two modes, vector and parallel, but in sequential order. So the machine was in single-user mode and there was only one program running at a time. The run in parallel mode is in fact a simulation of static scheduling and will show the amount of parallelism in the programs. The parallel times can also be used to estimate the idle time (per program) caused by static scheduling, with P the number of processors. The following expression is used for this:

    idle time = TTS * (P - CPU time/TTS) = P * TTS - CPU time.
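For PROG6 in Fig. 6, for example, with P = 2, a parallel CPU time of 225 seconds and a TTS of 137 seconds, this gives:

    idle time = 2 * 137 - 225 = 49 seconds.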

The results in seconds, including the estimated idle time, are given in Fig. 6. The figures under 'Total elapsed time' are the times for the complete run. It is indeed seen that the amount of parallelism varies. Parallelization reduces the total elapsed time by (605 - 434)/605 = 28%, at the cost of a (650 - 596)/596 = 9% increase in CPU time. We can also estimate the idle time from these figures and observe that during these 434 seconds the system is idle for 218 seconds; this implies that (218/434)/2 = 25% of the available CPU cycles is lost.

Two additional remarks regarding the performance differences between vector and parallel mode are in place here.
• In parallel mode, the CPU time of PROG1 increases by 70%.

This is an outline of the computational kernel of the program:

      DO INC = 1, 32
         S = 0.0
         DO I = 1, 1000*INC, INC
            S = S + A(I)*B(I)
         END DO
      END DO

The intention of this code is to assess the performance degradation caused by memory bank conflicts. The DO-loop over I is parallelized at the stripmine level. This means that the loop is split into (vectorized) chunks of 128. The chunks are distributed over the processors (2 in our case). The system we have used has 16 memory banks. In vector mode there are no memory bank conflicts if INC = 2. However, for INC = 2 in parallel mode the situation is different. This is the load pattern for the two CPUs:

CPU 1: load A(1), A(3), ..., A(257), ...
CPU 2: load A(259), A(261), ..., A(513), ...

Because of the 16 banks, elements A(3) and A(259) are stored in the same bank and hence in parallel mode a bank conflict occurs. The same holds for the other elements of A (and B). As a result of these additional memory bank conflicts, the CPU time will increase. For other (odd) values of INC memory bank conflicts do not occur and the performance will be higher in parallel mode. Clearly the net result is that the total elapsed times for the 32 invocations of the loop over I are almost equal in both modes. This is coincidence.
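To see why, assume the elements of A are interleaved over the banks one 64-bit word at a time, so that A(i) resides in bank (i-1) mod 16. Then (3-1) mod 16 = 2 and (259-1) mod 16 = 2: both loads address bank 2, and when the two CPUs issue them at about the same time a conflict results.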

• In parallel mode, the CPU and elapsed time of PROG5 double. This is caused by the structure of the program, which in fact is not even vectorizable. The length of the loops does not make parallelization profitable, but this cannot be detected by the compiler.

Program        T(0)  T(1)  T(2)  T(3)  T(4)  T(5)  T(6)  T(7)
Elapsed time      0    56    68   168   208   316   584   605

PROG1            10     2     0     0     0     0     0     0
PROG2            30    22    20     0     0     0     0     0
PROG3            40    32    30    10     0     0     0     0
PROG4            76    68    66    46    36     0     0     0
PROG5             8     0     0     0     0     0     0     0
PROG6           210   202   200   180   170   134     0     0
PROG7           231   223   221   201   191   155    21     0

Fig. 7. Analysis of a throughput test on a single-headed system.


The elapsed times obtained in vector mode can be used to predict a throughput situation where all jobs are started simultaneously.

The estimated figures can be obtained as follows. N is the number of jobs initially started in the throughput test. Let T(i) denote the wall-clock time passed since the start of the throughput test and define delta(i) as delta(i) = T(i+1) - T(i) for i = 0, ..., N-1.

Let tau(i,j) be the elapsed time program j still has to go at time T(i) and define V(i) as the index set {j} for which tau(i,j) /= 0, i.e. V(i) consists of the indices of those jobs that are not yet finished. N(i) denotes the number of elements in V(i).

With these definitions we get:

    V(0) = {1, ..., N}
    N(0) = N
    for i = 0, ..., N-1
        m(i) = min{ tau(i,j) | j in V(i) }
        delta(i) = m(i)*N(i)
        tau(i+1,j) = tau(i,j) - m(i), for all j in V(i)
        T(i+1) = T(i) + delta(i)
        V(i+1) = { j | tau(i+1,j) /= 0 }
    end for

Here delta(i) = m(i)*N(i) expresses that the N(i) active jobs share the single CPU equally, so the job with the smallest remaining time m(i) finishes after m(i)*N(i) seconds of wall-clock time.
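As a concrete rendering of this model (a sketch only; the DATA statement holds the single-processor elapsed times of PROG1, ..., PROG7 from Fig. 6), the following small Fortran program reproduces the successive times T(1), ..., T(7) = 56, 68, 168, 208, 316, 584 and 605 of Fig. 7:

      PROGRAM THRPUT
C     Model of the throughput test: N jobs started simultaneously on
C     a single CPU that is shared equally among the active jobs
      INTEGER N
      PARAMETER (N = 7)
      REAL TAU(N), T, DM
      INTEGER J, NLEFT
C     Single-processor (vector mode) elapsed times of PROG1,...,PROG7
      DATA TAU /10.0, 30.0, 40.0, 76.0, 8.0, 210.0, 231.0/
      T = 0.0
 10   CONTINUE
C     Count the active jobs and find the smallest remaining time
      NLEFT = 0
      DM = 1.0E30
      DO 20 J = 1, N
         IF (TAU(J) .GT. 0.0) THEN
            NLEFT = NLEFT + 1
            IF (TAU(J) .LT. DM) DM = TAU(J)
         END IF
 20   CONTINUE
      IF (NLEFT .EQ. 0) GO TO 40
C     NLEFT jobs share the CPU equally, so the job with the smallest
C     remaining time finishes after DM*NLEFT wall-clock seconds
      T = T + DM*REAL(NLEFT)
      DO 30 J = 1, N
         IF (TAU(J) .GT. 0.0) TAU(J) = TAU(J) - DM
 30   CONTINUE
      PRINT *, 'Job(s) finished at T =', T
      GO TO 10
 40   CONTINUE
      PRINT *, 'Estimated total elapsed time:', T
      END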

In Fig. 7 we have used this algorithm to estimate the usage of a single-processor system (i.e. a C210) during such a throughput test with the seven programs. Below each T(i) we have given the elapsed time in seconds since the start of the throughput test.

It is seen that the estimated time of 605 seconds is in good agreement with the actually observed value of 608. After 316 seconds there are only two jobs left in the machine, and during the last 21 seconds only PROG7 is running.

We can use this to predict the elapsed time of the throughput test on a two-headed C220. This would be 584/2 + 21 = 313 seconds if PROG7 did not parallelize at all, but we know it is parallel. From Fig. 4 we derive a speed-up of 226/127 = 1.78 on two CPUs, i.e. a parallel efficiency of 0.89, and therefore the estimated elapsed time is 584/2 + 21/1.78 = 304 seconds.

To verify our estimate and demonstrate ASAP we have submitted the 7 jobs simultaneously. This will show how the ASAP mechanism deals with a mixed workload. Of course the wall-clock times of the individual jobs will increase in this situation in almost all cases. To compare this with vectorization we have also run the same mix in vector mode only, so the programs again run simultaneously but each program will always run on only one CPU. The CPU and elapsed times of the programs in both experiments are given in Fig. 8.

If we add the CPU times in the first column of Fig. 8, we get a total CPU time for the mix in vector mode of 650 seconds. Comparing this with the total CPU time of 596 seconds in Fig. 6, it is seen that when running the jobs simultaneously the total CPU time increases by 9%.

Program    Vector mode       Parallel mode
           CPU     TTS       CPU     TTS

PROG1       11      41        15      53
PROG2       32      94        31      98
PROG3       40     109        39     115
PROG4       74     166        75     174
PROG5        8      30        15      55
PROG6      233     320       208     312
PROG7      252     350       248     331
Sum        650               631
Total elapsed time 351               331

Fig. 8. A mixed workload in vector and parallel mode.


Of course the wall-clock times for the individual programs in vector mode increase, because now they have to share the system.

As we are running a set of programs simultaneously, it is not possible to simply calculate the time-to-solution by summing all individual wall-clock times. The time-to-solution has been measured in the experiments and turned out to be 351 seconds in vector mode and 331 seconds in parallel mode. Of course these are the same as the wall-clock times of PROG7, because this program is the most time consuming (see also Fig. 7).

With the ASAP mechanism the total time-to-solution of the mix in parallel mode is 331 seconds. If we compare this with the simulated static scheduling (Fig. 6) we see a reduction of (435 - 331)/435 = 24% in favour of dynamic scheduling. This can be fully ascribed to the ASAP mechanism, which keeps CPUs busy all the time as long as there is work available.

We can also use the results to estimate the overhead caused by ASAP. Our analysis predicted an elapsed time of 304 seconds for the throughput test on the C220. The measured value is 331 seconds. If we assume that our simple model is valid for this test, we derive that the overhead caused by ASAP is (331 - 304)/304 = 9%.

If we compare the times in Fig. 8 we see that for PROG1, ..., PROG5 the CPU time increases or remains constant in parallel mode. This is what one might expect. However, with PROG6 and PROG7 the CPU time drops in parallel mode! The same effect has been observed in [9] and can probably be explained by the fact that the threads can share data via the communication registers, which leads to more efficient execution.

One might consider running in vector mode only on a loaded system, because dynamic scheduling will not allow the use of more than one processor anyway. The results in Fig. 8 demonstrate, however, that the total elapsed time is 6% lower when run in parallel mode. The system is able to do a better job of task scheduling in parallel mode, so this should be preferred over vector mode.

Again, we emphasize that the timings given above refer to the specific set of 7 programs we have used in our experiments.

5.4. Conclusions

1. The ASAP dynamic scheduling scheme enables programmers to always work with one parallel version of their program. When the workload is high, it is unlikely that more than 1 CPU is available for their program, but the penalty for this is relatively small. On the other hand, as soon as there are some 'holes' in CPU usage, these can be exploited, resulting in a decrease in time-to-solution.
2. The overhead associated with the dynamic scheduling of ASAP is very low.
3. A dynamic scheduling scheme is particularly important in a multi-user environment. A CPU will never be idle as long as there is work to do.
4. ASAP will minimize the time-to-solution of a complete workload.

Acknowledgements

The authors would like to thank Steve Wallach at CONVEX Corporate Headquarters for his suggestions. We also want to thank the referees for their remarks and suggestions, which have definitely improved this paper.

References

[1] M. Chastain, G. Gostin, J. Mankovich and S. Wallach, The CONVEX C240 architecture, in: A. Emmen, J. Hollenberg, R. LLurba and A. van der Steen, eds., Proc. Supercomputing Europe '89, Vol. 1 (ASFRA B.V., ISBN 90-72945-018).

[2] M. Bieterman, Microtasking general purpose partial differential equation software on the CRAY X-MP, J. Supercomput. 2 (4) (1988) 381-414.

[3] R.T. Dimpsey and R.K. Iyer, Multiprogramming performance degradation: case study on a shared memory multiprocessor, in: Proc. 1989 Internat. Conf. on Parallel Processing, Vol. II (St. Charles, IL, 1989) 205-208.

[4] R.T. Dimpsey and R.K. Iyer, Performance degradation due to multiprogramming and system overheads in real workloads: case study on a shared memory multiprocessor, in: Proc. 1990 Internat. Conf. on Supercomputing (1990) 227-238.

[5] W. Gentzsch, Parallel processing on shared memory systems: algorithms, granularity, load balancing, communication, synchronization, memory usage and debugging, in: Proc. 4th Internat. Conf. on Supercomputing, Santa Clara (1989).

[6] C.D. Polychronopoulos and D.J. Kuck, Guided self-scheduling: a practical scheduling scheme for parallel supercomputers, IEEE Trans. Comput. C-36 (12) (1987).

[7] A.J. van der Steen, Is it really possible to benchmark a supercomputer? (A graded approach to performance measurement), in: A.J. van der Steen, ed., Proc. Seminar Evaluating Supercomputers (Kogan Page Ltd., London, 1989).

[8] A.J. van der Steen and R.J. van der Pas, A family portrait (benchmark tests on a Cray Y-MP and a Cray-2S), Report TR-30, ACCU, Utrecht, January 1989.

[9] H.A. van der Vorst, Experiences with parallel vector computers for sparse linear systems, Supercomput. 37 (1990).