
Loughborough University Institutional Repository

Multiprocessor computer architectures: algorithmic design and applications

This item was submitted to Loughborough University's Institutional Repository by the author.

Additional Information:

• A Doctoral Thesis. Submitted in partial fulfilment of the requirements for the award of Doctor of Philosophy of Loughborough University.

Metadata Record: https://dspace.lboro.ac.uk/2134/10872

Publisher: © A.S. Roomi

Please cite the published version.


This item was submitted to Loughborough University as a PhD thesis by the author and is made available in the Institutional Repository (https://dspace.lboro.ac.uk/) under the following Creative Commons Licence conditions.

For the full text of this licence, please go to: http://creativecommons.org/licenses/by-nc-nd/2.5/



MULTIPROCESSOR COMPUTER ARCHITECTURES:

ALGORITHMIC DESIGN AND APPLICATIONS

BY

AKEEL S. ROOMI, B.Sc., M.Sc.

A Doctoral Thesis

Submitted in Partial Fulfilment of the Requirements

For the Award of Doctor of Philosophy

of Loughborough University of Technology

August, 1989.

SUPERVISOR: PROFESSOR D.J. EVANS, Ph.D., D.Sc.

Department of Computer Studies

© by A.S. Roomi, 1989.


CERTIFICATE OF ORIGINALITY

This is to certify that I am responsible for the work submitted in this thesis, that the original work is my own except as specified in acknowledgements or in footnotes, and that neither the thesis nor the original work contained therein has been submitted to this or any other institution for a higher degree.

A.S. ROOMI


To

My Parents,

My Wife, Berjual,

and my Children,

Amar, Maytham and Heatr.or,

with love,

Akeel


ACKNOWLEDGEMENTS

It is the special privilege of authorship that, at the end of one's exertions, it is possible to pause, look back and formally thank the many people without whose active participation the scheduled completion of a doctoral thesis would have been impossible.

Professor D.J. Evans, Head of the Parallel Processing Centre,

Loughborough University of Technology, in addition to being an effective

and extremely conscientious supervisor and Director, is also endowed

with a gently persuasive manner. This combination was ideal in guiding

me through the perilous area of parallel computers. I am most grateful

to him for his painstaking comments and useful suggestions throughout

the last three years.

My profound gratitude goes to the Ministry of Higher Education of the Iraqi Government for the award of a three-year scholarship enabling me to undertake this research.

I am grateful to the staff and research students of the Department of Computer Studies for their kind cooperation during my research. I wish to thank in particular Dr. W.S. Yousif, for his useful suggestions during the early stage of this work.

Finally, I would like to express my appreciation to my wife, Berjual, who provided an environment and the encouragement essential to the completion of the work.


ABSTRACT

The contents of this thesis are concerned with the implementation of parallel algorithms for solving partial differential equations (PDEs) by the Alternating Group Explicit (AGE) method, and an investigation into the numerical inversion of the Laplace transform on the Balance 8000 MIMD system.

Parallel computer architectures are introduced, and different types of existing parallel computers, including the Data-Flow computer and VLSI technology, are described from both the hardware and implementation points of view. The main characteristics of the Sequent parallel computer system at Loughborough University are presented, and performance indicators, i.e. the speed-up and efficiency factors, are defined for the measurement of parallelism in the system. Basic ideas of programming such computers are also outlined.

Basic mathematical definitions and a general description and classification of PDEs and their related discretised matrices are introduced in Chapter 3.

In Chapter 4, the parallel version of the AGE method is developed for one and two dimensional elliptic PDEs. The AGE method is suitable for parallel computers as it possesses separate and independent tasks. Therefore three synchronous and asynchronous strategies have been used in the implementation of the method, and the timing results and the efficiency of these implementations were compared. A computational complexity analysis of the parallel AGE method is also included.


The eigenvalues and the corresponding eigenvectors of the Sturm-Liouville problem are found by using the AGE method with different boundary conditions.

In Chapter 5, the three parallel AGE strategies are also implemented on time dependent PDEs. The parallel AGE method was applied to the second order parabolic equation with Dirichlet boundary conditions and to the diffusion-convection equation, with a comparison presented. Then the parallel AGE method on a two dimensional parabolic equation and a hyperbolic equation is discussed. A parabolic PDE with derivative boundary conditions is also solved using the AGE method. A new AGE formula based on a D'Yakonov splitting of the matrix is used to solve one and two dimensional parabolic PDEs. Comparing and analysing the results yields an algorithm with reduced computational complexity and greater accuracy for multi-dimensional problems.

Finally, in Chapter 6, the problem of numerically inverting the Laplace transform is investigated and numerical results were obtained to compare the different methods. An idea to improve the accuracy by imposing a Romberg integration is suggested. Attempts to accelerate the convergence of the slowly converging series are also investigated. A parallel algorithmic form of an accurate method for the numerical inversion of the Laplace transform is implemented.

The thesis concludes by summarizing the main results obtained, and suggestions for further work are included.

CONTENTS

CHAPTER 1: PARALLEL COMPUTER ARCHITECTURES, AN INTRODUCTION
1.1 Introduction  1
1.2 Via Parallelism  4
1.3 Architectural Classification Schemes  8
1.3.1 Flynn's Parallel Computers Classification  8
1.3.2 Feng's Parallel Computers Classification  10
1.3.3 Shore's Parallel Computers Classification  13
1.3.4 Handler's Parallel Computers Classification  16
1.3.5 Other Parallel Computers Classification  17
1.4 Pipeline Computers  19
1.5 SIMD  21
1.6 MIMD  23
1.7 Data-Flow Computers  32
1.8 VLSI Systems and Transputers  35
1.8.1 Transputer System  38
1.9 The Balance 8000 System  41

CHAPTER 2: PARALLEL PROGRAMMING AND LANGUAGES
2.1 Introduction  47
2.2 Parallel Programming  49
2.2.1 Implicit Parallelism  52
2.2.2 Explicit Parallelism  54
2.3 Programming the Balance System  60
2.3.1 Multitasking Terms and Concepts  61
2.3.2 Data Partitioning with DYNIX  66
2.3.3 Function Partitioning with DYNIX  70
2.4 Parallel Algorithms  73
2.4.1 The Structure of Algorithms for Multiprocessor Systems  78

CHAPTER 3: BASIC MATHEMATICS, GENERAL BACKGROUND
3.1 Introduction  80
3.2 Classification of Partial Differential Equations  81
3.3 Types of Boundary Conditions  84
3.4 Basic Matrix Algebra  87
3.4.1 Vectors and Matrix Norms  91
3.4.2 Eigenvalues and Eigenvectors  93
3.5 Numerical Solution of PDEs by the Finite Difference Method  95
3.5.1 Finite Difference Approximation  95
3.5.2 Derivation of Finite Difference Approximations  103
3.5.3 Consistency, Efficiency, Accuracy and Stability  107
3.6 Methods of Solution  109
3.6.1 The Direct Methods  109
3.6.2 The Iterative Methods  111
3.6.3 The Block Iterative Methods  116
3.6.4 Alternating Direction Implicit (ADI) Methods  116
3.6.5 Alternating Group Explicit (AGE) Method  123

CHAPTER 4: STEADY STATE PROBLEMS, PARALLEL EXPLORATIONS
4.1 Introduction  131
4.2 Parallel AGE Exploration  132
4.3 Experimental Results for the One Dimensional Problem  137
4.4 Experimental Results for the Two Dimensional Problem  150
4.5 The AGE Method for Solving Boundary Value Problems with Neumann Boundary Conditions  172
4.5.1 Formulation of the Method  172
4.5.2 Numerical Results  177
4.6 The AGE Method for Solving the Sturm-Liouville Problem  181
4.6.1 Method of Solution  182
4.6.2 Numerical Results  185
4.7 Conclusions  187

CHAPTER 5: TIME DEPENDENT PROBLEMS, PARALLEL EXPLORATIONS
5.1 Introduction  189
5.2 Experimental Results for the Diffusion-Convection Equation  190
5.3 Experimental Results for the Two-Dimensional Parabolic Problem  202
5.4 Experimental Results for the Second Order Wave Equation  220
5.5 The Numerical Solution of One-Dimensional Parabolic Equations by the AGE Method with D'Yakonov Splitting  224
5.5.1 The AGE Method  225
5.5.2 Numerical Results  229
5.6 The Numerical Solution of Two-Dimensional Parabolic Equations by the AGE Method with D'Yakonov Splitting  234
5.6.1 Numerical Results  240
5.7 A New Strategy for the Numerical Solution of the Schrodinger Equation  244
5.7.1 Outline of the Method  245
5.7.2 Numerical Results  248
5.8 Conclusions  251

CHAPTER 6: NUMERICAL INVERSION OF THE LAPLACE TRANSFORMATIONS, SOME INVESTIGATIONS AND PARALLEL EXPLORATIONS
6.1 Introduction  253
6.2 The Numerical Inversion of the Laplace Transform  255
6.3 Numerical Experiments  258
6.3.1 The Implementation of the Fast Fourier Transform Technique  268
6.4 Parallel Implementation of the Numerical Inversion of the Laplace Transform  271
6.5 Conclusions  275

CHAPTER 7: CONCLUSIONS AND FINAL REMARKS  276

REFERENCES  281

APPENDIX A: A LIST OF SOME SELECTED PROGRAMS  290

CHAPTER 1

PARALLEL COMPUTER ARCHITECTURES,

AN INTRODUCTION

"Anyone who says he knows how computers
should be built should have his head examined."

J.E. Thornton, Computer Architecture

1.1 INTRODUCTION

High-performance, flexible and reliable computers are increasingly in demand from many scientific and engineering applications, which may be required to be solved in real time. Since conventional computers have limited speed and reliability, the satisfaction of these requirements can only be achieved by a high-performance computer system. The achievement of high performance depends not only on using faster and more reliable hardware devices, but also on different computer architectures and processing techniques. Therefore, parallel computer systems need to be developed further.

In earlier times, relays (in the 1940s) and vacuum tubes (in the 1950s) were used as switching devices, and they were interconnected with wires and solder joints. The Central Processing Unit (CPU) structure was bit-serial and arithmetic was done on a bit-by-bit fixed point basis. By the early 1960s, transistors (invented in 1948) were used in computer circuits. Passive components such as resistors and capacitors were also included in these circuits. All of these devices were mounted on some kind of circuit board, the most complex of which consisted of a number of layers of conductors and insulating material. These provided interconnections between the elementary devices as well as their mechanical support. Many improvements to computer architectures were subsequently carried out. For example, Sperry Rand built a computer system with an independent I/O processor which operated in parallel with one or two processing units. Core memory was still used in many computer systems; later, solid-state memories replaced the core memories.

By the late 1960s, Integrated Circuits (ICs) were in use,


followed by Large Scale Integrated (LSI) techniques, providing on one silicon chip several transistors, the required resistors and capacitors, as well as interconnection paths.

Following the rapid advance in LSI technology, Very Large Scale Integration (VLSI) circuits have been developed, with which enormously complex digital electronic systems can be fabricated on a single chip of silicon. Devices which once required many complex components can now be built with just a few VLSI chips, reducing the difficulties in reliability, performance and heat dissipation that arise from standard small-scale and medium-scale integration.

Until a few years ago, the state of electronic technology was such that all factors affecting computational speed were almost minimized, and any further increase in computational speed could only be achieved through both increased switching speed and increased circuit density. Due to basic physical laws, the intended breakthrough seemed unlikely to be achieved, mainly because we are fast approaching the limits of optical resolution. Hence, even if switching times are almost instantaneous, distances between any two points may not be small enough to minimize the propagation delays and thus improve computational speed. Therefore, the achievement of even faster computers is conditioned by the use of new approaches that do not depend on breakthroughs in device technology, but rather on imaginative applications of the skills involved in computer architecture.

Obviously, one approach to increasing speed is through parallelism. The ideal objective is to create a computer system containing p processors, connected in some cooperating fashion, so that it is p times faster than a computer with a single processor. These parallel


computer systems, or multiprocessors as they are commonly known, not only increase the potential processing speed, but also increase the overall throughput, flexibility and reliability, and provide fault tolerance in case of processor failures.


1.2 VIA PARALLELISM

Parallelism, the notion of the parallel way of thinking, was conceived long before the emergence of truly parallel computers. It is thought that the earliest reference to parallelism is in L.F. Menabrea's publication*, entitled "Sketch of the Analytical Engine Invented by C. Babbage". There, reporting on the utility of the conceived machine, he wrote:

"Likewise when a long series of identical computations is to be performed, the machine can be brought into play so as to give several results at the same time, which will greatly abridge the whole amount of the processes."

Babbage's notion was implemented neither in the final design of his calculating engine nor elsewhere, due to the lack of technological development; nevertheless, the notion of the parallel way of thinking had been conceived.

The division of computer systems into generations is determined by the device technology, system architecture, processing mode and languages used. We are currently in the fourth generation, while the fifth generation is on the horizon.

The first generation (1938-1953). The first electronic digital computer, ENIAC (Electronic Numerical Integrator And Computer), marked the beginning of the first generation of computers in 1946. These machines used vacuum tubes and electronic valves as their switching components, with gate delay times of approximately 1 μs, and magnetic drums as central memories.

* Following Babbage's lecture in Turin describing his analytical engine, a young Italian engineer wrote a detailed account of the machine in French (published in October 1842). Ada, Lady Lovelace translated the paper into English.


The second generation (1952-1963). Transistors were invented in 1948, while the first transistorized digital computer (TRADIC) was built by Bell Laboratories in 1954. The propagation delay time of the germanium transistor is approximately 0.3 μs. Assembly languages were used until the development of the high-level languages FORmula TRANslation (FORTRAN), in 1956, and ALGOrithmic Language (ALGOL), in 1960.

The third generation (1962-1975). This generation was marked by the use of Small-Scale Integrated (SSI) and Medium-Scale Integrated (MSI) circuits as the basic building blocks. High-level languages were greatly enhanced with intelligent compilers during this period. Multiprogramming was well developed to allow the simultaneous execution of many program segments interleaved with I/O operations. Virtual memory was developed by using hierarchically structured memory systems. The propagation delay was about 10 ns, and later, around the 1970s, it became slightly less than 1 ns.

The fourth generation (1972-present). This generation is characterised by enhanced levels of circuit integration through the use of LSI circuits for both logic and memory sections. High-level languages were extended to handle both scalar and vector data. Most operating systems were time-sharing, using virtual memories. Vectorizing compilers appeared in the second generation of vector machines like the Cray-1 (1976) and the Cyber-205 (1982). High-speed mainframes and supercomputers appeared as multiprocessor systems, like the Univac 1100/80 (1976), the Fujitsu M382 (1981), the IBM 3081 (1980), and the Cray X-MP (1983). A high degree of pipelining and multiprocessing is greatly emphasized in commercial supercomputers. A


Massively Parallel Processor (MPP) was custom-designed in 1982.

All these various multiple processor architectures can be categorized into four distinct organizations: Associative, Parallel, Pipelined and Multiprocessors.

An attempt by Hockney and Jesshope [Hockney, 1981] to summarize the principal ways of introducing the notion of parallel processing at the hardware level of the various computer architectures results in:

1. the application of pipelining (assembly line) techniques in order to improve the performance of the arithmetic or control units. A process is decomposed into a certain number of elementary subprocesses, each of which is capable of execution on a dedicated autonomous unit;

2. the arrangement of several independent units, operating in parallel, to perform some basic principal functions such as logic, addition or multiplication;

3. the arrangement of an array of processing elements (PE's) executing concurrently the same instruction on a set of different data, where the data is stored in the PE's private memories;

4. the arrangement of several independent processors, working in a cooperative manner towards the solution of a single task by communicating via a shared or common memory, each of them being a complete computer, obeying its own stored instructions.

To illustrate alternative hardware and software approaches, in the following sections we shall select principal significant architectures which differ sufficiently from each other. Specifically, for the Multiprocessor class, the Balance 8000 parallel processing system at Loughborough University of Technology is described in more detail, since it was used extensively in carrying out the present research.


1.3 ARCHITECTURAL CLASSIFICATION SCHEMES

To date many classification schemes have been proposed. In this section we shall briefly present the theoretical concepts of the architecture taxonomies given by Flynn (1966), Feng (1972), Shore (1973) and Handler (1977).

1.3.1 Flynn's Parallel Computers Classification

In 1966, M.J. Flynn [Flynn, 1966] classified computer organizations into four categories according to the multiplicity of instruction and data streams. For convenience he adopted two definitions: the instruction stream, as a sequence of instructions which are to be executed by the system, and the data stream, as a sequence of data called for by the instruction stream.

Flynn's four machine organizations as shown in Figure 1.1 are:

1. Single Instruction Single Data (SISD);

2. Single Instruction Multiple Data (SIMD);

3. Multiple Instruction Single Data (MISD);

4. Multiple Instruction Multiple Data (MIMD).

SISD: This is the classical von Neumann model. A single stream of instructions operates on a single data stream. It may have more than one functional unit operating under the supervision of one control unit. Examples are the IBM 360/91, CDC-NASF and Fujitsu FACOM-230/75.

SIMD: This is the class to which array processors and pipeline processors belong. All the processors execute the same instructions and perform them on different data. Because of their simple form, machines of this kind can have a large number of processors. Examples are the ICL DAP, Illiac-IV and STARAN.

FIGURE 1.1: Flynn's Parallel Computer Classification (C: control unit; P: processor; N: data organisation network; S: store). [Block diagrams of the SISD, SIMD, MISD and MIMD organizations; diagram omitted.]

MISD: (Chains of processors.) There are n processor units, each receiving distinct instructions but operating over the same data stream. No real embodiment of this class exists.

MIMD: This is the multiple processor version of SIMD. All processors execute different instructions and operate on different data. Most multiprocessor systems and multiple computer systems can be classified in this category. Examples are C.mmp, the Balance 8000 and 21000, and the Cray-2.

1.3.2 Feng's Parallel Computers Classification

In his classification, Tse-Yun Feng [Feng, 1972] has proposed the use of degrees of parallelism in various computer architectures. The maximum parallelism degree $P$ is defined as the maximum number of binary digits (bits) that can be processed within a unit time by a computer system. If $P_i$ is the number of bits that can be processed within the $i$th processor cycle, with the cycles indexed by $i = 1, 2, \ldots, T$, then the average parallelism degree $P_a$ is defined by

$$P_a = \frac{\sum_{i=1}^{T} P_i}{T} \qquad (1.3.1)$$

Typically, $P_a \le P$. Accordingly, the utilization rate $\mu$ of a computer system within $T$ cycles is

$$\mu = \frac{P_a}{P} = \frac{\sum_{i=1}^{T} P_i}{T \cdot P} \qquad (1.3.2)$$
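As a brief worked illustration (with hypothetical figures), consider a machine with maximum parallelism degree $P = 64$ that processes $64, 32, 64$ and $16$ bits over $T = 4$ successive cycles:

$$P_a = \frac{64 + 32 + 64 + 16}{4} = 44, \qquad \mu = \frac{44}{64} \approx 0.69 .$$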


Figure 1.2 shows the classification of computers by their maximum parallelism degrees, where the horizontal axis shows the word length n, while the vertical axis corresponds to the bit-slice* length m.

If $C$ is a given computer, then the maximum parallelism degree $P(C)$ is represented by the product of the word length $n$ and the bit-slice length $m$; that is,

$$P(C) = n \cdot m \qquad (1.3.3)$$

Obviously, $P(C)$ is equal to the area of the rectangle defined by the integers $n$ and $m$.
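For example, reading the $(n, m)$ coordinates plotted in Figure 1.2: the PDP-11 at $(16, 1)$ gives $P = 16 \times 1 = 16$, C.mmp at $(16, 16)$ gives $P = 16 \times 16 = 256$, and the MPP at $(1, 16384)$ gives $P = 1 \times 16384 = 16384$.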

There are four types of processing methods that can be seen from Figure 1.2:

1. Word-Serial and Bit-Serial (WSBS);
2. Word-Parallel and Bit-Serial (WPBS);
3. Word-Serial and Bit-Parallel (WSBP);
4. Word-Parallel and Bit-Parallel (WPBP).

WSBS: the conventional serial computer (von Neumann); one bit (n=m=1) is processed at a time.

WPBS: (n=1, m>1); since an m-bit slice is processed at a time, this is termed bit-slice processing. Examples are STARAN and the MPP.

WSBP: (n>1, m=1); found in most existing computers. Since one word of n bits is processed at a time, it has been called word-slice processing. Examples are the IBM 370/168UP and CDC 6600.

* The bit-slice is a string of bits, one from each of the words at the same vertical bit position. As an example, the TI-ASC has a word length of sixty-four and four arithmetic pipelines, each pipe having eight pipeline stages, so there are thirty-two bits per bit-slice in the four pipes.

FIGURE 1.2: Feng's Parallel Computer Classification System. [Word length n (horizontal axis) against bit-slice length m (vertical axis); plotted machines include the PDP-11 (16,1), IBM 370/168 (32,1), Cray-1 (64,1), C.mmp (16,16), PEPE (32,32), Illiac IV (64,64), STARAN (1,256) and MPP (1,16384); diagram omitted.]

Page 30: Multiprocessor computer architectures : algorithmic design ... · Multiprocessor computer architectures : algorithmic design and ... ARCHITECTURES: ALGORITHMIC DESIGN AND APPLICATIONS

WPBP: (n>1, m>1); known as fully parallel processing, in which an array of n·m bits is processed at one time. Examples are the TI-ASC, C.mmp, and the Balance 8000 and 21000.

1.3.3 Shore's Parallel Computers Classification

In 1973 Shore [Shore, 1973] presented a classification of parallel computer systems based on their constituent hardware components. There are six different types of machines according to his proposal, and all existing computers could belong to one of them. Figure 1.3 shows the six different types.

1. Machine I consists of an Instruction Memory (IM), a single Control Unit (CU), a Processing Unit (PU), and a Data Memory (DM). Examples are the Cray-1, CDC 7600, etc.

2. Machine II is obtained from Machine I by simply changing the way the data is read from the DM. Machine II reads a bit from every word in the memory, instead of reading all the bits of a single word. Examples are the ICL DAP, STARAN, etc.

3. Machine III is derived from the combination of Machines I and II. There are two PUs, one horizontal and one vertical. An example is the Sanders Associates OMEN 60.

4. Machine IV consists of a single CU and many independent PEs, each of which has a PU and DM. Communication between these components is restricted to take place only through the CU. An example is PEPE.

5. Machine V is derived from Machine IV by adding interconnections between processors. An example is the ILLIAC IV computer.

FIGURE 1.3: Shore's Parallel Computer Classification. [Block diagrams of Machines I to VI; diagram omitted.]


6. Machine VI: the difference between this machine and the previous machines is that the PUs and the DM are no longer individual hardware components; instead they are constructed on the same board. Examples are the associative memories and associative processors.

1.3.4 Handler's Parallel Computers Classification

Wolfgang Handler [Handler, 1977] suggested a classification outline for identifying the parallelism degree and pipelining degree built into the hardware structure of a computer system. There are three subsystem levels of parallel-pipeline processing according to his classification:

1. Processor Control Unit (PCU);
2. Arithmetic Logic Unit (ALU);
3. Bit-Level Circuit (BLC).

PCU and ALU are well defined. Each PCU corresponds to one processor. The ALU is equivalent to the PE in SIMD array processors. The BLC corresponds to the combinational logic circuitry needed to perform 1-bit operations in the ALU. Let C be a computer system; then C can be characterized by a triple containing six independent entities, as defined below:

$$T(C) = \langle K \times K', \; D \times D', \; W \times W' \rangle \qquad (1.3.4)$$

where,

K = the number of processors (PCUs) within the computer;
D = the number of ALUs (or PEs) under the control of one PCU;
W = the word length of an ALU (or PE);

D' = the number of ALUs that can be pipelined;
K' = the number of PCUs that can be pipelined;
W' = the number of pipeline stages in all ALUs (or PEs).

As an example, the Texas Instruments Advanced Scientific Computer (TI-ASC) has one controller controlling four arithmetic pipelines, each with a 64-bit word length and 8 stages. Thus, we have

$$T(\mathrm{ASC}) = \langle 1 \times 1, \; 4 \times 1, \; 64 \times 8 \rangle = \langle 1, 4, 64 \times 8 \rangle \qquad (1.3.5)$$
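As a further illustration in the same notation (not an example worked here), a 16-processor machine in which each processor has a single non-pipelined 16-bit ALU, such as C.mmp in its basic configuration, would be characterized as $T = \langle 16 \times 1, \; 1 \times 1, \; 16 \times 1 \rangle = \langle 16, 1, 16 \rangle$.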

1.3.5 Other Parallel Computers Classification

There are some other classification approaches, less significant than the former four, based mainly upon the notion of parallelism. Hobbs et al [Hobbs, 1970] in 1970 distinguished the parallel architectures into Multiprocessors, Associative processors, Network or Array processors and Functional machines.

Murtha and Beadles [Murtha, 1964] based their taxonomy upon parallelism properties, attempting to underline the differences between multiprocessors and highly parallel organizations. According to them, parallel organizations can be classified into general-purpose network computers, special-purpose network computers with global parallelism, and non-global semi-independent network computers with local parallelism.


1.4 PIPELINE COMPUTERS

In this section, the structure of pipeline computers and vector processing principles are studied. Pipelining offers an economical way to realize temporal parallelism in digital computers. To achieve pipelining, one must subdivide the input task (process) into a sequence of subtasks, each of which can be executed on a dedicated facility, called a stage or station. Stations are connected via buffers or latches. Pipeline computers have distinct pipeline processing capabilities; e.g. there can be pipelining between the processor and the I/O unit, and within the processor there can be pipelining between instructions.

A pipeline processor consists of a sequence of processing circuits, called stages, through which a data stream passes. Each stage does some partial processing on the data, and a final result is obtained after the data has passed through all the stages of the pipeline. Figure 1.4 exemplifies a typical pipeline computer. This diagram shows both scalar arithmetic pipelines and vector arithmetic pipelines. The instruction processing unit is itself pipelined with three stages.

As an example, consider the four steps of processing an instruction: fetch the instruction (F), decode the instruction (D), fetch the operand (O), and finally execute it (E).

In a non-pipelined computer, the above steps must be completed before the next instruction can be issued, as shown in Figure 1.5.

FIGURE 1.5: Non-Pipelined Processor. [Single instruction-processing block; diagram omitted.]

FIGURE 1.4: Functional Structure of a Modern Pipeline Computer with Scalar and Vector Capabilities (IS: instruction stream; O: operand fetch; K: control signal). [Instruction fetch, decode and operand-fetch stages feeding a scalar processor with scalar pipelines SP1 to SPn and a vector processor with vector pipelines VP1 to VPm; diagram omitted.]

Page 37: Multiprocessor computer architectures : algorithmic design ... · Multiprocessor computer architectures : algorithmic design and ... ARCHITECTURES: ALGORITHMIC DESIGN AND APPLICATIONS

In the pipelined computer, by contrast, the four stages F, D, O and E are executed in an overlapped manner (Figure 1.6). After constant time intervals, the output of one stage is switched to the next station. A new instruction is fetched (F) in every time cycle, and stage (E) produces an output in every time cycle, even though the time to perform a single instruction spans multiple pipeline cycles.

Pipelining can be present at more than one level in the design of computers. Ramamoorthy and Li [Ramamoorthy, 1977] introduced many theoretical considerations of pipelining and presented a survey of comparisons between various pipeline machine designs.

FIGURE 1.6: A Pipelined Processor. [Four stations S1 to S4 operating in overlap; diagram omitted.]
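The gain from this overlap can be made concrete with a small calculation. The following C fragment is a minimal sketch with assumed figures (four stages, one time unit per stage, no hazards or memory conflicts): n instructions on an s-stage machine need n·s cycles without pipelining, but only s + n - 1 cycles with it.

    #include <stdio.h>

    /* Completion time of n instructions on a 4-stage (F, D, O, E)
       processor, assuming one time unit per stage and no conflicts.
       Non-pipelined: each instruction passes through all stages alone.
       Pipelined: a new instruction enters stage F every cycle. */
    int main(void)
    {
        const int stages = 4;                /* F, D, O, E */
        for (int n = 1; n <= 8; n *= 2) {
            int serial    = n * stages;      /* no overlap */
            int pipelined = stages + n - 1;  /* fill the pipe, then one result per cycle */
            printf("n=%d: serial=%d cycles, pipelined=%d cycles\n",
                   n, serial, pipelined);
        }
        return 0;
    }

For large n the speed-up approaches the number of stages, which is the economical temporal parallelism referred to above.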

1.5 SIMD

The SIMD (array processor) is a synchronous parallel computer with multiple arithmetic logic units (ALUs), called processing elements (PEs). These identical PEs are arranged in an array and controlled by a single control unit, which decodes and broadcasts the instructions to all processors within the array. Each PE has its own private memory which provides it with its own data stream. Hence the PEs are synchronized to perform the same function at the same time. There are two essential reasons for building array processors. The first is economic: it is cheaper to build P processors with only a single control unit than P similar complete computers. The second reason concerns interprocessor communication: the communication bandwidth can be more fully utilized. An array processor computer is depicted in Figure 1.7.

The advantages of SIMD computers seem to be greatest when applied to the solution of problems in matrix algebra or finite difference methods for the solution of partial differential equations. Most algorithms in this area require the same type of operation to be repeated on large numbers of data items. Hence, the problem can be distributed to the processors, which can run simultaneously, as sketched below.
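As a minimal sketch of the SIMD mode of operation (the array size and the operation are assumed, illustrative values), the following C fragment mimics an array of PEs applying one broadcast instruction, y = a·x + b, to their private data in lockstep; the loop over PEs stands in for what the hardware does simultaneously.

    #include <stdio.h>

    #define NPE 8   /* number of processing elements (illustrative) */

    int main(void)
    {
        double x[NPE], y[NPE];
        const double a = 2.0, b = 1.0;      /* broadcast operands */

        for (int pe = 0; pe < NPE; pe++)    /* each PE's private datum */
            x[pe] = (double) pe;

        /* one instruction cycle: every PE executes the SAME instruction
           on DIFFERENT data (the loop models hardware lockstep) */
        for (int pe = 0; pe < NPE; pe++)
            y[pe] = a * x[pe] + b;

        for (int pe = 0; pe < NPE; pe++)
            printf("PE%d: y = %.1f\n", pe, y[pe]);
        return 0;
    }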

Obviously, a complete interconnection network, where each processor is connected to all other processors, is expensive and unmanageable by both the designer and the user of the system. Therefore other interconnection patterns have been proposed, and the pattern chosen is of primary importance in determining the power of a parallel computer.

The interconnection pattern used in the ILLIAC IV arranges the processors in a two-dimensional array where each processor is connected to its four nearest neighbours. Processors can also be arranged as a regular mesh, in a perfect shuffle pattern, or in various special-purpose configurations for merging, sorting or other specific applications.

FIGURE 1.7: Array Processor Computer. [A control unit, with control processor and control memory for scalar processing, broadcasting over a data bus to PE1 to PEn, each a processor with private memory, joined by an inter-PE connection network for data routing; diagram omitted.]

1.6 MULTIPROCESSING SYSTEMS (MIMD)

In this section our emphasis is on multiprocessing computers. The American National Standards Institute (ANSI) defined the multiprocessor as: "A computer employing two or more processing units under integrated control". This definition is hardly complete, since the two most significant concepts for this type of computer, i.e. sharing and interaction, are not included.

In 1977 Enslow [Enslow, 1977] offered a more complete definition of a multiprocessor system based on its characteristics. In addition to the ANSI definition, he included the following conditions:

1. All the processors, which must have approximately comparable capabilities, share access to a common memory, I/O channels, control units and devices.

2. The entire complex is under the control of a single operating system providing the interaction between processors and their programs at the task, instruction and data levels.

The basic diagram of a MIMD machine of P processors is illustrated in Figure 1.8. Although each processor includes its own CU, a high-level CU may be used to control the transfer of data and to assign tasks and sequences of operations between the different processors.
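The cooperation of independent instruction streams through a common memory can be sketched in a few lines of C. The fragment below uses POSIX threads as a stand-in for processors (a modern convenience; the Balance 8000 programming primitives themselves are described in Chapter 2), with each "processor" summing an interleaved partition of a shared array and leaving its partial result in shared memory.

    #include <pthread.h>
    #include <stdio.h>

    #define P 4      /* processors (illustrative) */
    #define N 1000

    static double data[N];      /* shared memory */
    static double partial[P];

    static void *worker(void *arg)
    {
        long id = (long) arg;               /* each processor: its own stream */
        double s = 0.0;
        for (int i = (int) id; i < N; i += P)   /* interleaved data partition */
            s += data[i];
        partial[id] = s;                    /* result left in shared memory */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[P];
        for (int i = 0; i < N; i++) data[i] = 1.0;
        for (long p = 0; p < P; p++)
            pthread_create(&t[p], NULL, worker, (void *) p);
        double total = 0.0;
        for (int p = 0; p < P; p++) {
            pthread_join(t[p], NULL);       /* synchronize, then combine */
            total += partial[p];
        }
        printf("sum = %g\n", total);
        return 0;
    }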

The universal class of MIMD computers may be classified, depending on the amount of interaction, into two main classes:

1. tightly-coupled systems;
2. loosely-coupled systems.

In tightly-coupled systems, as illustrated in Figure 1.9a, the processors operate under the strict control of a bus assignment scheme which is implemented in hardware at the bus/processor interface.

FIGURE 1.8: A Typical MIMD System. [Control units CU1 to CUP connected above an interconnection network (switch) linking input and output channels, secondary memory and main memory; dashed lines denote control flow and solid lines data flow; diagram omitted.]


In loosely-coupled systems, as shown in Figure 1.9b, the communication and interaction between processors take place on the basis of information exchange. The tightly-coupled multiprocessor has a noticeable performance advantage over the loosely-coupled multiprocessor.

Two multiprocessor systems which we shall briefly discuss, to exemplify the machines' characteristics, are:

1. the S.1 system;
2. the Neptune system.

The S.1 multiprocessor system can be described as a high-speed general-purpose multiprocessor developed at the Lawrence Livermore National Laboratory. The S.1 is built from S.1 uniprocessors, called Mark IIAs, as illustrated in Figure 1.10. The structure consists of 16 independent Mark IIA uniprocessors which share 16 memory banks via a crossbar switch. Each processor has a private cache which is transparent to the user. Each uniprocessor, crossbar switch and memory bank is connected to a diagnostic processor which can probe, report and change the internal state of all the modules that it monitors. For the S.1 system there exist a single-user operating system, a multi-user operating system and an advanced operating system. It also supports multitasking by the division of problems into cooperating tasks.

The Neptune system is another system, built at Loughborough in 1981 under Professor Evans, and used extensively for the development of parallel MIMD algorithms (see Barlow et al [Barlow, 1981]).

FIGURE 1.9: Multiprocessor Systems: (a) tightly-coupled, with processors 1 to 3 sharing a common memory; (b) loosely-coupled, with each processor attached to its own memory. [Diagram omitted.]

FIGURE 1.10: The Structure of the S.1 Mark IIA Multiprocessor. [Sixteen memories with controllers connected through a crossbar switch, monitored by diagnostic processors, to uniprocessors 0 to 15, each with data and instruction caches, I/O stores, I/O processors, mass storage, real-time units and peripheral equipment; diagram omitted.]


The Neptune system comprises four Texas Instruments 990/10 minicomputers configured as illustrated in Figure 1.11. The instruction sets include both 16-bit word and 8-bit byte addressing capability. Each of the four processors has a private memory of 128Kb, and one processor (P0) has a separate 10Mb disc drive. Access to the local memory is made by a local bus (the TILINE) attached to each processor. Since the TILINE coupler is designed so that the shared memory follows on contiguously from the local memory of each processor, each processor can access 192Kb of memory. Each processor runs under the powerful DX10 uniprocessor operating system, which is a general-purpose multitasking system. The DX10 operating system has been adapted to enable the processors to run independently and yet to cooperate in running programs with data in the shared memory.

Although the processors are identical in many hardware features, they differ in their speed: the relative speeds of processors P0, P1, P2 and P3 are 1.0, 1.037, 1.006 and 0.978 respectively. This will, however, reduce the efficiency of the system and decrease the measured performance of an algorithm with synchronization.

Various interconnection networks, the main factor in multiprocessor hardware organization, have been suggested, with different characteristics such as bandwidth, delay and cost, ranging from the shared common bus to the crossbar switch. Enslow [Enslow, 1977] identified three intrinsically different organizations, namely:

1. the time-shared common bus;
2. crossbar switch networks;
3. multiport memory.


FIGURE 1.11: The Neptune Configuration (P: processor; M: memory). [Four processors with private memories and disc drives linked through a shared memory; diagram omitted.]


The time-shared common bus represents the simplest interconnection system for either single or multiple processors. It consists of a common communication path connecting all the functional units: a number of processors, memories, and I/O devices. The system capacity is limited by the bus bandwidth, and system performance may be degraded by adding new functional units.

To overcome the insufficiency of the time-shared bus organization, the crossbar switch is used. The crossbar switch provides a separate path for every processor, memory module and I/O unit, so that if the multiprocessor system contains P processors and M memories, the crossbar needs P×M switches. In fact, it is difficult to build a large system based on the crossbar switch concept, because the complexity grows at the rate of O(n²) for n devices. The important characteristics of these systems are the extreme simplicity of the switch-to-functional-unit interfaces and the ability to support concurrent transfers to all memory modules.
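The crossbar's behaviour can be sketched as follows (a minimal illustration with assumed sizes and requests, not a model of any particular machine): processor i requests memory module req[i], and any set of requests to distinct modules proceeds concurrently, at the price of the P×M crosspoints counted above.

    #include <stdio.h>

    #define P 4   /* processors (illustrative) */
    #define M 4   /* memory modules            */

    /* One arbitration cycle of a P x M crossbar: a request is granted
       unless the target module is already taken this cycle. */
    int main(void)
    {
        int req[P]  = {2, 0, 2, 3};   /* module wanted by each processor */
        int busy[M] = {0};

        printf("crosspoint switches needed: %d\n", P * M);
        for (int i = 0; i < P; i++) {
            if (!busy[req[i]]) {
                busy[req[i]] = 1;
                printf("P%d -> M%d granted\n", i, req[i]);
            } else {
                printf("P%d -> M%d blocked (module busy)\n", i, req[i]);
            }
        }
        return 0;
    }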

In multiport memory systems, the functions of control, switching between processors, and priority arbitration are centralized at the memory interface. Hence every processor has a private bus to every passive unit, i.e. the memory and I/O units. The principal distinguishing features of such a system are expensive memory control, expansion from a uniprocessor to a multiprocessor system using the same hardware, a large number of cables and connectors, and system limitation by the memory port design.


Besides the three interconnection networks presented, there are many others which can be valuable for multiprocessor organization, such as the Omega network [Lawrie, 1975], the Augmented Data Manipulator [Siegel, 1979] and the Delta network [Patel, 1981].

In Section 1.9 we will also present the Balance 8000 system as a multiprocessor system.

1.7 DATA-FLOW COMPUTERS

A new approach to parallel processing, namely data flow, is briefly outlined in this section. The computer architectures reviewed in the previous sections are known as control-flow (CF) (von Neumann) machines. In a CF computer, the program is stored in the memory as a serial procession of instructions; this is one of the essential difficulties in exploiting the parallelism of algorithms in the CF model of computation.

An alternative architectural model for computer systems has been proposed to exploit the parallelism of algorithms: the Data-Flow (DF) (also known as Data-Driven) model of computation. In a DF system, the course of the computation is controlled by the flow of data in the program. In other words, an operation is executed as and when its operands are available. This means that the sequence of operations in the DF system responds to the precedence constraints imposed by the algorithm used, rather than to the locations of the instructions in the memory. On that account, the DF computer can perform simultaneously, in parallel, as many instructions as it is given, and it distributes each result to all subtask instructions which make use of this partial result as an operand.
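The firing rule can be sketched directly. The following C fragment (a minimal illustration, not any machine's instruction format) evaluates r = (a+b)·(c-d) in data-driven fashion: each node counts its missing operands and fires the moment the count reaches zero, forwarding its result to its successor, regardless of the order in which the tokens arrive.

    #include <stdio.h>

    struct node {
        char op;              /* '+', '-' or '*'        */
        double arg[2];
        int missing;          /* operands still awaited */
        struct node *succ;    /* where the result goes  */
        int port;             /* which operand slot     */
    };

    static void send(struct node *n, int port, double v);

    static void fire(struct node *n)   /* all operands present: execute */
    {
        double r = (n->op == '+') ? n->arg[0] + n->arg[1]
                 : (n->op == '-') ? n->arg[0] - n->arg[1]
                 :                  n->arg[0] * n->arg[1];
        printf("fired %c -> %g\n", n->op, r);
        if (n->succ) send(n->succ, n->port, r);
    }

    static void send(struct node *n, int port, double v)
    {
        n->arg[port] = v;
        if (--n->missing == 0) fire(n);   /* the data-driven firing rule */
    }

    int main(void)
    {
        struct node mul = {'*', {0, 0}, 2, NULL, 0};
        struct node add = {'+', {0, 0}, 2, &mul, 0};
        struct node sub = {'-', {0, 0}, 2, &mul, 1};

        /* tokens may arrive in any order; execution order follows data */
        send(&sub, 0, 7.0);   /* c */
        send(&add, 0, 1.0);   /* a */
        send(&sub, 1, 4.0);   /* d */
        send(&add, 1, 2.0);   /* b */
        return 0;
    }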

DF computers can be grouped, depending on the problem tackled, into two main classes:

1. the static structure;
2. the dynamic structure.

In the static structure, loops and subroutine calls are unfolded at compile time so that each instruction is performed only once. Figure 1.12 illustrates the static-structure DF machine, which consists of the following components: a store which holds the instruction cells (packets), having space for the operation and operands with their pointers to the successors, and a set of operating units to execute the operations. The maximum throughput is determined by the speed and number of the operating units, the memory bandwidth and the interconnection system. The most significant factors that reduce the throughput are the degree of concurrency available in the program, the memory access and interconnection network conflicts, and finally the broadcasting of the results.

In the dynamic structure, the operands are labelled so that a single copy of the same instruction can be used several times for different instances of a loop or subroutine. Figure 1.13 shows the dynamic-structure DF machine. The main components of the Manchester DF computer [Gurd, 1985] are the token queue that stores computed results, the token matching unit that combines the corresponding tokens into instruction arguments, the instruction store that contains the ready-to-execute instructions, the operating units, and the I/O switch for communication with the host. The degradation factors are similar to those of the static case, with the additional overhead of token label matching.

Because of the degradation factors recounted above, DF systems are only attractive for cases in which the concurrency exhibited is of the order of several hundred instructions. The significant advantage of DF computers is the exploitation of concurrency at a low level of the performance hierarchy, since this allows the maximum utilization of all the available concurrency.

FIGURE 1.12: The Static Data-Flow Computer. [Instruction cells in a store feeding operating units through an arbitration network, with results returned via a distribution network; diagram omitted.]

FIGURE 1.13: The Dynamic Data-Flow Computer. [Token queue, matching unit with overflow unit, instruction store and processing units arranged in a ring, with an I/O switch to and from the host; diagram omitted.]

1.8 VLSI SYSTEMS AND TRANSPUTERS

Owing to the current maturity of hardware technology, Large Scale Integrated (LSI) circuitry has become so dense that a single silicon LSI chip may contain tens of thousands of transistors. The rapid advance in LSI technology leads to Very Large Scale Integrated (VLSI) circuits, in which the number of transistors per chip is increased by another factor of 10 to 100. The advent of LSI and VLSI chips has given a large boost to the research and development of array processor and multiprocessor architectures.

The key factors of VLSI technology are its capacity to implement enormous numbers of devices on a chip, its low cost and its high degree of integration, while the main VLSI problem is to overcome the design complexity. The sizes of wires and transistors approach the limits of photolithographic resolution, so it becomes literally impossible to achieve further miniaturization, and actual circuit area becomes a key issue. In addition, the chip area is limited in order to maintain a high chip yield, and the number of pins is limited by the finite size of the chip perimeter. These restrictions form the basis of the VLSI paradigm.

The separation of the processor from its memory and the limited opportunities for synchronous processing are the principal problems in conventional (von Neumann) computers. VLSI designs offer more flexibility than conventional (von Neumann) systems to overcome these difficulties, since memory and processing architectures can be implemented with the same technology and in close vicinity. Many authors have investigated the requirements of parallel architectures


for VLSI, among those are Kung [Kung, 1982], Dew [Dew, 1982] and

Seitz [Seitz, 1982].

Dew classified VLSI architectures as:

1. Simple and regular design, in which a few types of module are replicated many times, the grain of the modules depending on the application.


2. The design must have a very high degree of parallelism through both

pipelining and multiprocessing.

3. Communication and switching, since the major difference between VLSI design and the earlier digital technologies is that the communication paths will dominate both the area and the time delay.

This is because the speed of the devices increases as the feature

size decreases while the propagation time along a wire does not.

One of the novel ideas to emerge from VLSI research into parallel architectures is that of systolic array processors. The concept of systolic architectures, pioneered by Kung [Kung, 1982], is basically a general methodology of directly mapping algorithms onto an array of

PEs. The significant differences between a systolic array processor and a multiprocessor lattice are that in a systolic array the communication is only to neighbouring processing cells (i.e. no global bus); the communication with the outside world occurs only at the boundary cells; and the processing cells are "hardwired" and not programmed from the host. The fundamental principle of a systolic architecture, the systolic array in particular, is illustrated in Figure 1.14. By replacing a single PE with an array of PEs, a higher computation throughput can be achieved without increasing the memory bandwidth.


[Figure: (a) the conventional organization, with memory feeding a single PE; (b) a systolic array processor, with memory feeding a chain of PEs.]

FIGURE 1.14: Systolic Design Principle.
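The cell-level behaviour of such an array can be sketched in C. The following simulation (an illustrative construction for this discussion, not a design taken from the thesis) pumps evaluation points through a linear array of cells that evaluate a polynomial by Horner's rule: each cell holds one stationary coefficient, data enter only at the boundary cell, and on every global beat each cell communicates only with its immediate neighbour.

    #include <stdio.h>

    #define NCELLS 4              /* one cell per coefficient (degree 3)  */
    #define NPTS   3              /* evaluation points pumped through     */

    struct token { double x, p; int valid; };

    int main(void)
    {
        double a[NCELLS] = {2.0, -1.0, 0.0, 5.0};   /* 2x^3 - x^2 + 5     */
        double pts[NPTS] = {1.0, 2.0, 3.0};
        struct token cell[NCELLS] = {{0}};

        /* One iteration = one global beat: results leave at the right
           boundary, tokens move one neighbour to the right, a new point
           enters at the left boundary, and every cell applies its
           hardwired step p = p*x + a[c] to the token it now holds.      */
        for (int beat = 0; beat < NPTS + NCELLS; beat++) {
            if (cell[NCELLS - 1].valid)
                printf("p(%g) = %g\n", cell[NCELLS - 1].x, cell[NCELLS - 1].p);
            for (int c = NCELLS - 1; c > 0; c--)
                cell[c] = cell[c - 1];              /* neighbour-only move */
            cell[0].valid = (beat < NPTS);
            if (cell[0].valid) { cell[0].x = pts[beat]; cell[0].p = 0.0; }
            for (int c = 0; c < NCELLS; c++)
                if (cell[c].valid)
                    cell[c].p = cell[c].p * cell[c].x + a[c];
        }
        return 0;
    }

Once the pipeline is full, one result emerges per beat, illustrating how the array raises throughput without raising memory bandwidth: only the boundary cell ever exchanges data with the outside world.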

One problem associated with systolic array systems is that the data and control movements are governed by global timing-reference beats, so that in order to synchronize the cells, extra delays are often used to ensure correct timing. To overcome this difficulty, Kung [Kung, 1985] suggested taking advantage of the data and control flow locality inherently possessed by most algorithms. This enables a data-driven, self-timed approach to array processing. Ideally, such an approach substitutes the requirement of correct timing by correct sequencing.


1.8.1 Transputer System

Another important development, which is set to make a major impact

in the field of multiprocessor systems is the INMOS Transputer [INMOS,

1984]. What distinguishes the INMOS transputer from its competitors is

that it has been designed to exploit VLSI. The transputer is a single

chip microprocessor containing a memory, processor and communication

links for connection to other transputers, which affords direct hardware

support for the parallel language OCCAM. The structure of a transputer

is given in Figure 1.15. In the transputer, INMOS have taken as many components of a traditional von Neumann computer as possible and implemented them on a single 1 cm^2 chip, while at the same time presenting a high level of support for a synchronous view of computation.

The communication between VLSI devices has a very much lower bandwidth than communication between subsystems on the same chip. Transputers communicate asynchronously over their links, and therefore systems can be synthesized from any number of processors. Within a transputer, all components perform concurrently; each of the four links and the floating-point coprocessor can execute useful work while the processor is performing other instructions.

Since the transputer is designed for the OCCAM language, concurrency may be described between transputers in the system or within a single transputer, which means the transputer can support internal concurrency. The processor contains a scheduler which enables any number of processes to run on a single transputer sharing the processor time, while each link provides two unidirectional channels for point-to-point communication. Processes are held on two process


[Figure: transputer block diagram showing the system services (reset, analyse, error, boot-from-ROM, clock and power pins), on-chip RAM, the processor, the external memory interface, the event pin, and four link interfaces, each providing a link-in and a link-out channel.]

FIGURE 1.15: Transputer Architecture.


queues: the active queue, which holds the process currently being executed and any other active processes waiting to be executed, and the inactive queue, which holds those processes waiting on an input, output or timer interrupt. Since there is little status to be saved, process swap times are very small, depending on the instruction being executed. Two further micro-instructions, start process and end process, carry out adding and deleting processes from the active process queue. Input message and output message instructions are for communication between processes on the same transputer or between processes on different transputers.
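The two-queue discipline can be modelled in C; the following sketch is purely illustrative (the structure and field names are invented here, not INMOS's internal representation) and shows round-robin service of the active queue, with blocked processes parked on the inactive queue until their communication completes.

    #include <stdio.h>

    struct proc  { const char *name; struct proc *next; };
    struct queue { struct proc *head, *tail; };

    /* "start process": append a process to a queue. */
    static void enqueue(struct queue *q, struct proc *p)
    {
        p->next = NULL;
        if (q->tail) q->tail->next = p; else q->head = p;
        q->tail = p;
    }

    /* Take the next process to run from the front of a queue. */
    static struct proc *dequeue(struct queue *q)
    {
        struct proc *p = q->head;
        if (p && !(q->head = p->next)) q->tail = NULL;
        return p;
    }

    int main(void)
    {
        struct queue active = {0}, inactive = {0};
        struct proc a = {"A"}, b = {"B"}, c = {"C"};

        enqueue(&active, &a);
        enqueue(&active, &b);
        enqueue(&inactive, &c);      /* C waits on input/output/timer    */

        /* Round-robin over the active queue; "end process" would simply
           not re-enqueue the finished process.                          */
        for (int slice = 0; slice < 4; slice++) {
            struct proc *p = dequeue(&active);
            if (!p) break;
            printf("running %s\n", p->name);
            enqueue(&active, p);     /* time-slice over: back of queue   */
        }
        return 0;
    }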


1.9 THE BALANCE 8000 SYSTEM

The Balance 8000 is an expandable, high-performance, MIMD parallel

computer that employs from 2 to 12 32-bit CPUs in a tightly-coupled manner,

using a new processor pool architecture. By sharing its processing

load among up to 12 architecturally identical microprocessors and

employing a single copy of a Unix-based operating system, the Balance

8000 system eliminates the barriers associated with multiprocessor

systems. It delivers up to 5 million instructions per second (MIPS), and its

power grows almost linearly when more processors are added. To make

the most. efficient use of its multiprocessing power, the system

dynamically balances its load; that is, it automatically and continuously assigns jobs to run on any processor that is currently idle or busy with a lower-priority job.

At the same time, the system is easily extendible. We can add CPUs, memory, and I/O subsystems within a node, or more nodes within a distributed network, or more distributed and local-area networks - all with no changes in software. From the hardware point of view,

the system consists of a pool of 2 to 12 processors, a high-bandwidth

bus, up to 28 Mbytes of primary storage, a diagnostic processor, up to

4 high-performance I/O channels, and up to 4 IEEE-796 (Multibus) bus

couplers. It is managed by a version of the UNIX 4.2 BSD operating system, enhanced to provide compatibility with UNIX System V and to exploit the Balance parallel architecture. Figure 1.16 shows the main functional blocks of the Balance 8000 system.

Each processor in the pool is a subsystem containing three VLSI

parts, a 32-bit CPU, a hardware floating-point accelerator, and a paged

virtual memory management unit. Two such subsystems reside on each circuit


[Figure: 2-12 32-bit CPUs and 2-28 Mbytes of memory on the SB8000 bus, together with a SCED board (serving the system console, Ethernet and disk), a Multibus interface board and Multibus adapter board for disks, terminal multiplexors and custom devices, and optional special-purpose accelerators and controllers.]

FIGURE 1.16: Balance 8000 Block Diagram


board. To reduce processor wait-periods and minimize the bus traffic in the system, each processor contains a cache memory. The two-way set-associative cache consists of 8 Kbytes of very high-speed memory holding instructions and data, which means that repeated requests for the same data are satisfied from the cache, rather than from the primary storage.

Designing a cache for a processor pool architecture is difficult for several reasons. Since the data in each cache represents a copy of some data in the primary memory, it is important that all copies and the original remain the same, even when a cache is updated. To ensure that, the system employs a write-through mechanism, in which each write cycle goes through to the bus and memory, in addition to updating the appropriate cache. Also, if two processors have both recently read the same data into their respective caches, and one of them updates its cache, the second processor cannot use its now stale data, because of the caches' bus-watching logic. As illustrated in Figure 1.17, this logic continuously monitors all write cycles on the bus and compares their addresses with those in its own cache to see if any writes affect its own contents. When such an address appears, the cache invalidates the entry in question.
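A minimal sketch of this bus-watching logic in C, assuming for brevity a direct-mapped cache (the Balance cache is two-way set-associative) and invented names throughout:

    #include <stdio.h>

    #define NLINES 8                        /* illustrative cache size  */

    struct line  { unsigned tag; int valid; };
    struct cache { struct line lines[NLINES]; };

    /* Bus-watching: compare the address of a write cycle seen on the
       bus with the local contents and invalidate a matching entry.    */
    static void snoop_write(struct cache *c, unsigned addr)
    {
        struct line *l = &c->lines[addr % NLINES];
        if (l->valid && l->tag == addr / NLINES) {
            l->valid = 0;
            printf("invalidated stale copy of address %u\n", addr);
        }
    }

    /* Write-through: the write cycle goes to the bus and memory as
       well as updating the writer's own cache; every other cache
       snoops the cycle.                                               */
    static void write_through(struct cache *writer, struct cache **others,
                              int nothers, unsigned addr)
    {
        writer->lines[addr % NLINES] = (struct line){ addr / NLINES, 1 };
        for (int i = 0; i < nothers; i++)
            snoop_write(others[i], addr);
    }

    int main(void)
    {
        struct cache c1 = {0}, c2 = {0};
        struct cache *peers[] = { &c2 };

        /* Both processors have read address 42 into their caches.     */
        c1.lines[42 % NLINES] = (struct line){ 42 / NLINES, 1 };
        c2.lines[42 % NLINES] = (struct line){ 42 / NLINES, 1 };

        /* CPU 1 writes address 42: CPU 2's copy is invalidated.       */
        write_through(&c1, peers, 1, 42);
        return 0;
    }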

The last component of the processor subsystem is the "System Link

and Interrupt Controller" (SLIC) chip. The SLIC appears with every

processor in the system, as well as on every memory controller, I/O

channel, and bus controller board. Communication between the SLICs is

accomplished with a simple command-response packet carried over a

dedicated bus. The controller serves several functions. Firstly it is

the key element of the system's global interrupt system. Secondly, its


[Figure: CPU 1 and CPU 2, each with a cache and cache controller, on the shared bus to main memory; when one CPU updates a block in main memory, the corresponding block in the other CPU's cache is marked invalid.]

FIGURE 1.17: Bus-Watching Logic Monitor


[Figure: SLIC chips 1 to N on the dedicated SLIC bus, each comprising a processor interface, an interrupt controller, a semaphore cache, error control, and a receiver/transmitter with contention resolution; a SLIC packet carries a start bit, message priority and data, timed by the SLIC clock.]

FIGURE 1.18: SLIC Chips


purpose is to manage a cache of single-bit semaphores. Finally, the controller serves as a convenient communication path among modules. For example, system diagnostic and debugging routines take modules on- and off-line using the SLIC bus, which carries error management information.

The DYNIX operating system is an enhanced version of UNIX 4.2 BSD that can also emulate UNIX System V at the system-call and command levels. To support the Balance multiprocessing architecture, the DYNIX operating system kernel has been made completely shareable so that multiple CPUs can execute identical system calls and other kernel code simultaneously. The DYNIX kernel also adjusts the memory allocation for each process to moderate the process's paging rate and to tune virtual memory performance for the entire system.

Applications programming on the Balance 8000 system is supported

by compilers for the main programming languages. We can use a single

language or a combination of languages to suit our application. Later, in Chapter 2, we give detailed information on the language tools available for use with the DYNIX operating system.


CHAPTER 2

PARALLEL PROGRAMMING AND LANGUAGES

One must have a good memory to be able
to keep the promises one makes.

F.W. Nietzsche.


2.1 INTRODUCTION

The recent advances in hardware technology and computer architecture lead to faster and more powerful parallel computer systems, which give considerable throughput and speed when they are applied to solve large problems. Programming parallel computer systems requires some extra programming facilities which come under the heading of parallel programming, to distinguish it from the conventional programming of single-processor computers.

As exemplified by the various architectures of existing parallel computers, parallelism can be achieved in a variety of ways. Attempting to summarize all the known ways of achieving parallelism and to categorize them into several distinct levels, we obtain:

a. Job level - between jobs; between phases of a job.

b. Program level - between parts of a program; within DO-loops.

c. Instruction level - between phases of instruction execution; between elements of a vector operation.

d. Arithmetic and bit level - within arithmetic logic circuits.

The design of algorithms for parallel computers is greatly

influenced by the computer architectures and the high-level


languages which have been used.

This chapter will elucidate parallel programming and parallel

algorithms.


2.2 PARALLEL PROGRAMMING

The two new concepts behind the recent ideas of parallel

programming theory are parallelism and asynchronism of programs.

Gill [Gill, 1958] defined parallel programming as the control of

two or more operations which are performed virtually simultaneously,

and each of which entails following a stream of instructions.


There seem to have been two trends in the development of high-level languages: those that owe their existence to an application or class of applications, such as FORTRAN, COBOL and C, and those that have been developed to further the art of computer science, such as ALGOL, LISP, PASCAL and PROLOG. The development of the former to

some extent has been stifled by the establishment of standards.

Conversely, to some extent, the lack of standards and the desire to

invent have led to the proliferation of versions of the latter. An attempt to produce a definitive language, incorporating the 'best' features of the known art and binding these into an all-embracing standard, was made by the U.S. Department of Defense, and the resulting language, ADA [Tedd et al, 1984], has been adopted.

Although concurrency is addressed in ADA, there is now far more

practical experience of concurrency and new languages have been

developed, such as OCCAM, which treat concurrency in a simpler,

more consistent, and more formal manner.

The numerous and vastly different applications and underlying

models of parallelism will require radically different language

structures. Hockney and Jesshope [Hockney, 1988] suggested three

major divisions in language development:


1. Imperative languages.

2. Declarative languages.

3. Objective languages.

An imperative language is one in which the program instructs the computer to perform sequences of operations or, if the system allows it, disjoint sequences of instructions operating concurrently.

Imperative languages have really evolved from early machine code,

by successive abstraction away from the hardware and its limited

control structures. This has had beneficial effects, namely the

improvement of programmer productivity and the portability obtained

by defining a machine independent programming environment. An

imperative language, however, even at its highest level of

abstraction, will still reflect the algorithmic steps in reaching a

solution. In addition to the retention of this notion of sequences,

these languages also retain a strong flavour of the linear address space still found in most machines. Harland [Harland, 1985] introduced concurrency, where many disjoint sequences of instructions may proceed in parallel. By abstracting concurrency, the non-deterministic sharing of the CPU's cycles could be obtained.

The declarative style of programming has had the most profound

effect on computer architecture research during this period. This

style of programming does not map well onto the classical von-Neumann

architecture, with its heavy use of dynamic data structures. It is

also based on a more mathematical foundation, with the aim of moving

away from descriptions of algorithms towards a rigorous specification

of the problem. These languages are based either on the calculus of


functions, lambda calculus, or on a subset of predicate logic. Since these declarative languages are based on mathematics, it is possible to formally verify the software systems created with them. A further advantage of the declarative approach is that such languages can supply implicit parallelism, as well as implicit sequencing [Shapiro, 1984].


Objective languages are based on two main techniques, encapsulation and inheritance, which form a more pragmatic foundation than the rigour of logic or functional languages. Hence they can provide a potential solution to the software problem. They also provide a model of computation which can be implemented on a distributed system.

Encapsulation is the most straightforward and is often used as a good programming technique [Booch, 1986]. Encapsulation hides data and provides access to that data only through shared methods or procedures.

By encapsulating a programmer's efforts in the creation of such

constrained objects, a mechanism must be provided in order to enhance

that object, which is the second technique of objective languages.

Inheritance allows the programmer to create classes of objects, where

those classes of objects may involve common access mechanisms or

common data formats. The mechanism for implementing inheritance is

to replace the procedure call to an object by a mechanism involving

message passing between objects.

There are at least three emerging parallel software design approaches, distinguished by the extent to which the parallelism is concealed by the hardware structure. In other terms, for some architectures the parallelism is hidden by the hardware itself, whilst for others it is


revealed to the user so that appropriate decisions are made as and when needed. The first of these approaches is the automatic translation of sequential programs, or implicit parallelism. The second approach is explicit parallelism, in which the programmer manages the concurrency of the application by coding directly in a concurrent language. The third approach, advocated by Backus and Dennis [Backus, 1978], is based on the functional language model and is implemented on most DF computers. Relying on the programmer's ability, the explicit method could rapidly become unworkable, as it is impossible to keep "juggling" a large number of tasks. The functional approach, which is the most natural form of handling parallelism, can achieve the highest degree of concurrency, since the instructions are scheduled for execution directly by the availability of their operands. However, the high cost of implementing this unstructured low-level concurrency makes this method of less importance, at least for the present moment.

The explicit and implicit parallelism detection approaches will

be discussed in paragraphs (2.2.1) and (2.2.2).

2.2.1 Implicit parallelism


Much existing sequential software naturally exhibits some form of synchroneity which needs only to be identified and then exploited in the design of parallel algorithms. The implicit approach to parallelism relies on the implicit detection of parallel processable tasks within a sequential algorithm. This approach to parallelism is associated with sophisticated compiling and supervisory programs and their related overheads. Its


effectiveness lies in that it is independent of the programmer, and

existing programs need not be modified to take advantage of inherent

parallelism. However, in implicit parallelism, it will be necessary

to analyse the program to see how it can be divided into tasks, and

then the compiler could detect parallel relationships between tasks

as well as carrying out the normal compiling work for a program to be

run on a serial computer.


Different methods have been developed for automatically recognising parallelism in computer programs. Bernstein [Bernstein, 1966] proposed a method based on set theory. His theory is based on four different ways in which a sequence of instructions or tasks can utilize a memory location. These conditions are:

1. The location is only fetched during the execution of a task.

2. The location is only stored during the execution of a task.

3. The first operation involving this location is a fetch.

One of the succeeding operations stores in this location.

4. The first operation involving this location is store. One

of the succeeding operations fetches from this location.
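From these four modes of use, each task acquires a fetch (read) set and a store (write) set, and two tasks may be executed in parallel only if neither one stores into a location that the other fetches or stores. A minimal C sketch of the resulting test (the bit-mask representation of the sets is an illustrative choice, not Bernstein's notation):

    #include <stdio.h>

    typedef unsigned int locset;   /* one bit per memory location       */

    struct task { locset fetch, store; };

    /* Bernstein-style condition: the tasks are parallelizable only if
       neither task's store set meets the other's fetch or store set.   */
    static int parallelizable(struct task t1, struct task t2)
    {
        return (t1.store & t2.fetch) == 0 &&
               (t2.store & t1.fetch) == 0 &&
               (t1.store & t2.store) == 0;
    }

    int main(void)
    {
        /* Locations: bit 0 = a, bit 1 = b, bit 2 = c, bit 3 = d.       */
        struct task t1 = { (1u<<0) | (1u<<1), 1u<<2 };  /* c := a + b   */
        struct task t2 = { 1u<<0,             1u<<3 };  /* d := a * 2   */
        struct task t3 = { 1u<<2,             1u<<0 };  /* a := c       */

        printf("t1 || t2: %s\n", parallelizable(t1, t2) ? "yes" : "no");
        printf("t1 || t3: %s\n", parallelizable(t1, t3) ? "yes" : "no");
        return 0;
    }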

Following this work, Evans and Williams [Evans, 1978] have

described a method of locating parallelism within ALGOL-type

programming languages, where constructs such as loops, if

statements, and assignment statements are studied.

One of the most studied detection schemes that has been given

much consideration is the implicit detection of the inherent

parallelism within the computation of arithmetic expressions.


Because of the sequential nature of most of the uniprocessor systems,

the run-time of any arithmetic expression computation is always

proportional to the number of operations. This run-time can be

further reduced on a parallel system by concurrently processing many

parts of the expression. In fact, the commutativity and associativity

were extensively used in order to reduce the height of the computational

tree representation. For example, consider the expression,

(a*b*c*d*e*f*g*h)

which can be rearranged in a form suitable for parallel processing,

(((a*b)*(c*d))*((e*f)*(g*h)))

As can be seen in Figures 2.1 and 2.2, which depict the tree representations of the above expression for a sequential and a parallel processor respectively, the run-time is reduced by four time units.
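The rebalanced tree is evaluated level by level; a short C sketch (illustrative only) forms the product in ceil(log2 n) passes, each pass combining disjoint pairs that a parallel machine could multiply simultaneously:

    #include <stdio.h>

    int main(void)
    {
        double v[8] = {1, 2, 3, 4, 5, 6, 7, 8};   /* a, b, ..., h       */
        int n = 8;

        /* Pass 1 forms (a*b),(c*d),(e*f),(g*h); pass 2 pairs those;
           pass 3 yields the product: 3 levels instead of 7 serial ones. */
        while (n > 1) {
            for (int i = 0; i < n / 2; i++)
                v[i] = v[2*i] * v[2*i + 1];        /* independent pairs  */
            if (n % 2) v[n/2] = v[n - 1];          /* odd element carried */
            n = (n + 1) / 2;
        }
        printf("product = %g\n", v[0]);            /* 8! = 40320         */
        return 0;
    }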

Many algorithms have previously been proposed for recognizing

parallelism at the expression level, some of which are those

suggested by Squire [Squire, 1963], Hellerman [Hellerman, 1966],

Stone [Stone, 1967], Baer and Bovet [Baer, 1968], Kuck [Kuck, 1977],

and Wang and Liu [Wang, 1980].

2.2.2 Explicit Parallelism

In explicit parallelism, the programmer has to specify explicitly those tasks that can be executed synchronously, by means of special parallel constructs added to a high-level programming language. Although these programming constructs can be time-consuming and difficult to implement, they can offer significant algorithm design flexibility.


[Figure: binary tree of height 7, combining one operand per level, for serial evaluation of the expression.]

FIGURE 2.1: Binary Tree Representation of the Expression (a*b*c*d*e*f*g*h) for a Serial Computer.

[Figure: balanced binary tree of height 3 evaluating (((a*b)*(c*d))*((e*f)*(g*h))).]

FIGURE 2.2: Binary Tree Representation of the Expression (a*b*c*d*e*f*g*h) for a Parallel Computer.


Significant research has been done on this approach, with particular interest in parallel task issues such as task declaration, activation, termination, synchronisation, and communication. In other terms, a synchronous program consists of

sequential processes that are carried out simultaneously. These

processes cooperate on common tasks by exchanging data through shared

variables.

Dijkstra [Dijkstra, 1965] proposed the utilization of semaphores and introduced two new primitives (P and V) that greatly simplified the processes of synchronisation and communication. A software implementation of these two primitives in terms of an indivisible instruction, the test-and-set instruction, was installed in many systems.
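A minimal rendering of P and V on top of an indivisible test-and-set, here modelled with C11 atomics (an illustrative sketch, not Dijkstra's original notation): the test-and-set bit makes each semaphore operation indivisible, P waits until it can decrement the count, and V increments it.

    #include <stdatomic.h>

    struct semaphore {
        atomic_flag ts;       /* the indivisible test-and-set bit      */
        int count;
    };

    static void P(struct semaphore *s)             /* wait / acquire   */
    {
        for (;;) {
            while (atomic_flag_test_and_set(&s->ts))
                ;                                  /* bit already set  */
            if (s->count > 0) {                    /* safe: we hold ts */
                s->count--;
                atomic_flag_clear(&s->ts);
                return;
            }
            atomic_flag_clear(&s->ts);             /* release, retry   */
        }
    }

    static void V(struct semaphore *s)             /* signal / release */
    {
        while (atomic_flag_test_and_set(&s->ts))
            ;
        s->count++;
        atomic_flag_clear(&s->ts);
    }

    int main(void)
    {
        struct semaphore s = { ATOMIC_FLAG_INIT, 1 };
        P(&s);   /* enter the region the semaphore guards */
        V(&s);   /* leave it                              */
        return 0;
    }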

Dennis and van Horn [Dennis, 1966] proposed a very straightforward mutual exclusion lock-out mechanism. Critical regions are enclosed within a LOCK W - UNLOCK W pair, where W is an arbitrary one-bit variable.

The parallelism in the explicit approach can also be indicated by using language constructs that exploit the parallelism in algorithms. Anderson [Anderson, 1965] introduced five parallel constructs, the FORK, JOIN, TERMINATE, OBTAIN and RELEASE statements, which are presented below in an ALGOL-68 format:

    label: FORK L1,L2,...,Ln;
    label: JOIN L1,L2,...,Ln;
    label: TERMINATE L1,L2,...,Ln;


The FORK statement initiates a separate control path for each of the n segments with the labels Li. Only local labels must be used, and their scope is defined as the block scope in which this statement is declared. The next sequence of paths may only be initiated when all the forked paths of the previous level have completed their execution.

The JOIN statement, which is associated with the FORK statement and must occur at the same level in the program, is used to terminate the parallel processes that have been forked. This action is implemented by including code that causes test bits to be available, thus allowing the forked paths to be synchronised after they are completed (Figure 2.3).

The TERMINATE statement is used to explicitly terminate program

paths which have been dynamically activated by the fork statement,

thus avoiding the creation of a backlog of meaningless incomplete

activations. In effect, the JOIN and TERMINATE statements are control counters, decremented by one after the execution of one statement and compared each time with zero; if the count is non-zero, the path is terminated and the processor is free to execute the next path in the queue; otherwise, the processor goes on to the next program segment.

The OBTAIN statement is used to provide exclusive access to the

listed variables by a single process. It is used to avoid mutual

interference by locking-out other parallel program paths from the use

of these variables. If this statement occurs in a block then the

variables should be the same variables occurring in higher level

blocks.

[Figure: a program path forked into several parallel paths which are subsequently synchronised at a JOIN point.]

FIGURE 2.3: The Fork/Join Technique.


The RELEASE statement is implemented together with the OBTAIN statement: it releases those variables that have been locked out by an OBTAIN statement. More specifically, the OBTAIN/RELEASE concept is an approach implemented to assist in solving the synchronization problem.


The above presented statements are directly implemented as library

functions and supplied with enough information to control parallel and

multiprogramming activities. In general, the FORK statement would be

substituted at the compilation time by a special code that when

executed would create as many parallel paths as the number of labels

following the FORK statement. Each of these paths is assigned to the

available processors and usually the first path is carried out by the

same processor that carries out the FORK statement itself. If the

number of created processes is greater than the number of available

processors, the excess paths are kept in a queue until a processor

becomes free.
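The effect of FORK, JOIN and TERMINATE can be sketched with the standard UNIX primitives; in the following minimal C illustration (not Anderson's implementation), fork() creates the parallel paths, each child's exit plays the part of TERMINATE, and the parent's wait() loop acts as the JOIN counter counting down to zero.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    #define NPATHS 3

    int main(void)
    {
        /* FORK: initiate a separate control path for each segment.  */
        for (int i = 0; i < NPATHS; i++) {
            pid_t pid = fork();
            if (pid < 0) { perror("fork"); exit(1); }
            if (pid == 0) {                  /* one forked path       */
                printf("path %d executing\n", i);
                exit(0);                     /* TERMINATE this path   */
            }
        }

        /* JOIN: the counter reaches zero only when every forked path
           has completed; only then may the next segment proceed.     */
        for (int remaining = NPATHS; remaining > 0; remaining--)
            wait(NULL);
        printf("all paths joined; next program segment\n");
        return 0;
    }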

The labels used in a parallel program are cross-referenced at

the compilation time by arranging them on a forward reference list

which is loaded by all the labels contained in the labels list of an

instruction.

To recapitulate, in the explicit approach the process of parallelization is under the entire responsibility of the programmer, a fact which jeopardizes program determinacy.


2.3 PROGRAMMING THE BALANCE SYSTEM

The hardware configuration and the operating system of the Balance MIMD system were described in Chapter 1. This section describes the programming languages supported by Sequent for use with the DYNIX operating system. These languages include C, FORTRAN 77, Pascal, and assembly language.

The Balance system supports the two basic kinds of parallel

programming: multiprogramming and multitasking.

Multiprogramming is an operating system feature that allows a

computer to execute multiple unrelated programs concurrently. The

multi-user, multiprogramming UNIX environment adapts quite naturally

to the Balance multiprocessing architecture and automatically

schedules processes for optimal throughput. In other versions of the

UNIX operating system, executable processes wait in a run queue; when

the CPU suspends or terminates execution of one process, it switches

to the process at the head of the run queue. DYNIX balances the

system load among the available processors, keeping all processors

busy as long as there is enough work available, thus using the full

computing capability of each processor.

Multitasking is a programming technique that allows a single

application to consist of multiple processes executing concurrently.

The DYNIX operating system automatically does multitasking for some

applications.

The Balance language software includes multitasking extensions to C, Pascal, and FORTRAN. The DYNIX Parallel Programming Library (PPL) includes routines to create, synchronize, and terminate parallel processes from C, Pascal and FORTRAN programs. The DYNIX gprof


utility creates a program's execution profile, a listing that shows us

which subprograms (subroutines or functions) account for most of a

program's execution time.

2.3.1 Multitasking Terms and Concepts

In the DYNIX operating system, a new process is created by using

a system call called a fork. The new (or child) process is a

duplicate copy of the old (or parent) process, with the same data,

register contents, and program counter. If the parent has files open

or has access to shared memory, the child has access to the same files

and shared memory.

A UNIX forking operation is relatively expensive (about 55 milliseconds). Therefore, a parallel application typically forks as

many processes as it is likely to need at the beginning of the program,

and does not terminate any process until the program is complete,

since the process can wait in a busy loop during certain code sequences.

Typically, multitasking programs include both shared and private

data. Shared data is accessible by both parent and child processes,

while private data is accessible by only one process. There are

several advantages to sharing data. Firstly it uses less memory than

having multiple copies. Secondly it avoids the overhead of making

copies of the data for each process. Finally it provides a simple

and efficient mechanism for communication between processes. If the

program includes any shared data, the process's virtual memory space

also contains a shared data area and a shared heap (Figure 2.4).

Tasks can be scheduled among processes using three types of

algorithms,


[Figure: each process's virtual memory contains its private data, private heap and private stack, together with shared data, a shared heap and a shared stack area mapped onto the common shared memory.]

FIGURE 2.4: Process Virtual Memory Contents.
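A minimal C sketch of this shared/private distinction (illustrative: an anonymous shared mapping stands in for the DYNIX shared data area, since after fork() ordinary data is private to each process; MAP_ANONYMOUS is a common but not universal mmap flag):

    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* Stand-in for shared data: both processes map the same page. */
        int *shared = mmap(NULL, sizeof *shared, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        int private_counter = 0;     /* ordinary data: one copy each    */

        *shared = 0;
        if (fork() == 0) {           /* child: duplicate of the parent  */
            *shared = 42;            /* seen by the parent              */
            private_counter = 99;    /* child's own copy only           */
            _exit(0);
        }
        wait(NULL);
        printf("shared = %d, private = %d\n", *shared, private_counter);
        /* prints: shared = 42, private = 0                             */
        return 0;
    }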


1. prescheduling;

2. static scheduling;

3. dynamic scheduling.

In prescheduling, we need to determine the task division, before

compiling the program. Prescheduled programs cannot automatically

balance the computing load according to the data or the number of

CPUs in the system. Therefore, this method is appropriate only for

applications where each process is performing a different task.


In static scheduling, the tasks are scheduled by the processes at run time, but they are divided in some predetermined way. For example, if a program includes a 100-iteration loop and uses 10 processes, then with static scheduling each process might execute 10 iterations of the loop.

In dynamic scheduling, each process schedules its own tasks at

run time by checking a task queue or a "do-me-next" array index.

For example, a dynamically scheduled program might perform a matrix

multiply, with each process computing three matrix elements and then

returning for more until all the work is done.

Dynamic scheduling produces dynamic load balancing, i.e. all processes keep working as long as there is work to be done, while static scheduling produces static load balancing, i.e. the division of tasks is statically determined, so several processes may stand idle while one processor completes its share of the job.
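The two run-time disciplines can be contrasted in a short C sketch; the 100-iteration loop, the worker functions and the atomic "do-me-next" counter are all hypothetical stand-ins for the PPL mechanisms:

    #include <stdatomic.h>
    #include <stdio.h>

    #define NITER  100
    #define NPROCS 10

    static atomic_int next_iter;                /* shared task index   */
    static void do_iteration(int i) { (void)i;  /* stand-in loop body */ }

    /* Static scheduling: process id takes a predetermined block.      */
    static void static_worker(int id)
    {
        int chunk = NITER / NPROCS;
        for (int i = id * chunk; i < (id + 1) * chunk; i++)
            do_iteration(i);
    }

    /* Dynamic scheduling: each process claims the next unclaimed
       iteration until the work runs out, so the load balances itself. */
    static void dynamic_worker(void)
    {
        for (;;) {
            int i = atomic_fetch_add(&next_iter, 1);
            if (i >= NITER) break;
            do_iteration(i);
        }
    }

    int main(void)
    {
        /* Sequential stand-in for NPROCS truly parallel processes.    */
        for (int id = 0; id < NPROCS; id++) static_worker(id);
        for (int id = 0; id < NPROCS; id++) dynamic_worker();
        puts("done");
        return 0;
    }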

Any place where two or more parallel processes can read and write the same data structure constitutes a dependency, because the results of the program depend on when a given process references that data structure. To ensure correct results, code sections containing dependencies cannot be executed simultaneously by multiple processes; the code sections that contain dependencies are critical regions.

There are two basic types of dependencies:

1. access dependencies;

2. order dependencies.

Access dependencies can yield incorrect results if two or more processes try to access a shared data structure at the same time, whilst order dependencies can yield incorrect results if two or more processes try to access a shared data structure at the same time or in the wrong order.

The best way of handling dependencies in a program is to rewrite the code to eliminate them. Sometimes, however, the dependencies are inherent in the application. In these instances, we must set up the processes so that they communicate with each other to execute the dependent code sections. This communication can be set up by using mechanisms called semaphores and locks, where a semaphore is a shared data structure used to synchronize the actions of multiple cooperating processes. Meanwhile, a lock ensures that only one process at a time can access a shared data structure. The lock has two values: locked and unlocked. Before attempting to access a shared data structure, a process waits until the lock associated with the data structure is unlocked, indicating that no other process is accessing the data structure. The process then locks the lock, accesses the data structure and unlocks the lock. While a process is waiting for a lock to become unlocked, it spins in a tight loop, producing no work - hence the name "spinlock". This spinning is also referred to as a busy wait.

Figure 2.5 illustrates how a lock is used to prevent multiple

processes from executing a dependent section simultaneously.

[Figure: time-lines for processes P1, P2 and P3; each in turn acquires the lock, executes the dependent section and releases the lock, while the other processes busy-wait for the lock.]

FIGURE 2.5: Role of Lock in Protecting Dependent Sections.
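A minimal C sketch of a spinlock protecting a dependent section, with a C11 atomic test-and-set standing in for the indivisible operation that the DYNIX locking routines use (illustrative, not the PPL implementation):

    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;
    static int shared_total;               /* the shared data structure */

    static void add_to_total(int x)
    {
        while (atomic_flag_test_and_set(&lock))
            ;                              /* busy wait: spin on lock   */
        shared_total += x;                 /* dependent section         */
        atomic_flag_clear(&lock);          /* unlock: a waiter proceeds */
    }

    int main(void)
    {
        /* With several processes calling add_to_total concurrently,
           the lock serializes the dependent section as in Figure 2.5. */
        add_to_total(3);
        add_to_total(4);
        printf("total = %d\n", shared_total);   /* 7 */
        return 0;
    }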

An event is something that must happen before a task or process can proceed. Events have two values: posted and cleared. One or more processes wait for an event until another process posts the event, whereupon the waiting processes proceed. The event is then required to be cleared, whether by the waiting process, by a master process or by another process.
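In the same illustrative style, an event reduces to a shared flag with post, clear and wait operations (again a sketch, not the DYNIX primitives):

    #include <stdatomic.h>

    typedef atomic_int event_t;           /* 1 = posted, 0 = cleared   */

    static void post_event(event_t *e)  { atomic_store(e, 1); }
    static void clear_event(event_t *e) { atomic_store(e, 0); }
    static void wait_event(event_t *e)
    {
        while (!atomic_load(e))
            ;                             /* spin until another process
                                             posts the event           */
    }

    int main(void)
    {
        event_t ready;
        atomic_init(&ready, 0);
        post_event(&ready);     /* normally done by another process    */
        wait_event(&ready);
        clear_event(&ready);    /* reset for the next use              */
        return 0;
    }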

I/O in parallel programs is complicated. These complications

can usually be reduced by performing I/O only during sequential

phases of the program or by designating one process as a server to

perform all I/O.


2.3.2 Data Partitioning with DYNIX

This subsection explains how we structure FORTRAN programs for

data partitioning. It also explains how we shall use the DYNIX PPL

to execute loops in parallel. The data partitioning method is

sometimes called microtasking. Microtasking programs create multiple

independent processes to execute loop iterations in parallel. It has

the following characteristics:

1. the parallel processes share some data and create their

own private copies of other data;

2. the division of the computing load adjusts automatically

to the number of available processes;

3. the program controls data flow and synchronization by using

the tools specially designed for data partitioning.

The microtasking program works as follows:

a. Each loop to be executed in parallel is contained in a

subprogram.

b. For each loop, the program calls a special function which

forks a set of child processes and assigns an identical

copy of the subprogram to each process for parallel

execution. The special function creates a copy of any

private data for each process.

c. Each copy of the subprogram executes some of the loop

iterations.

d. " If the loop being executed in parallel is not completely

independent, the subprogram may contain calls to functions

that synchronize the parallel processes at critical points

by using locks, barriers, and other semaphores.


e. When all the loop iterations have been executed, control

returns from the subprogram. At this point, the program

either terminates the parallel processes or leaves them to

spin in a busy-wait state until they are needed again.
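A minimal C rendering of this scheme using the microtasking routines of Table 2.1 below (the header name and the assumption that the global arrays are shared by the child processes follow DYNIX conventions, but should be treated as assumptions of this sketch):

    #include <stdio.h>
    #include <parallel/microtask.h>  /* DYNIX PPL microtasking (assumed) */

    #define N 1000
    double a[N], b[N], c[N];         /* data operated on by all copies   */

    /* Step (a): the loop is packaged as a subprogram; step (c): each
       copy executes every nprocs-th iteration.                          */
    void add_loop(void)
    {
        int id     = m_get_myid();       /* this child's number          */
        int nprocs = m_get_numprocs();   /* how many copies are running  */
        for (int i = id; i < N; i += nprocs)
            c[i] = a[i] + b[i];          /* fully independent iterations */
    }

    int main(void)
    {
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        m_fork(add_loop);  /* step (b): fork children, run in parallel   */
        m_kill_procs();    /* step (e): terminate the child processes    */

        printf("c[N-1] = %g\n", c[N - 1]);
        return 0;
    }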

67

There are three sets of routines in the DYNIX PPL: a microtasking

library, a set of routines for general use with data partitioning

programs, and a set of routines for memory allocation in data

partitioning programs.

The microtasking library routines allow us to fork a set of

child processes, assign the processes to execute loop iterations in

parallel, and synchronize the processes as necessary to provide proper

data flow between loop iterations. Table 2.1 lists the microtasking

routines in the PPL.

The data partitioning routines include a routine to determine

the number of available CPUs and several process synchronization

routines that are more flexible than those available in the microtasking library. Table 2.2 illustrates the general-purpose data

partitioning routines in the PPL.

The memory allocation routines allow a data partitioning program

to allocate and de-allocate shared memory and to change the amount of

shared and private memory assigned to a process. Table 2.3 shows the

memory allocation routines in the PPL. More detail on PPL can be

found in Section 3P in Volume 1 of the DYNIX Programmer's Manual.

Before we convert a loop into a subprogram for data partitioning,

we must analyze all the variables in the loop and determine two things:


ROUTINE           DESCRIPTION

m_fork            Execute a subprogram in parallel
m_get_myid        Return process identification number
m_get_numprocs    Return number of child processes
m_kill_procs      Terminate child processes
m_lock            Lock a lock
m_multi           End single-process code section
m_next            Increment global counter
m_park_procs      Suspend child process execution
m_rele_procs      Resume child process execution
m_set_procs       Set number of child processes
m_single          Begin single-process code section
m_sync            Check in at barrier
m_unlock          Unlock a lock

TABLE 2.1: Parallel Programming Library Microtasking Routines


ROUTINE              DESCRIPTION

cpus_online          Return number of CPUs on-line
s_init_barrier       Initialize a barrier
s_init_lock          Initialize a lock
s_lock or s_clock    Lock a lock
S_LOCK               C macro for s_lock
s_unlock             Unlock a lock
S_UNLOCK             C macro for s_unlock
s_wait_barrier       Wait at a barrier

TABLE 2.2: Parallel Programming Library Data-Partitioning Routines

ROUTINE              DESCRIPTION

brk or sbrk          Change private data segment size
shbrk or shsbrk      Change shared data segment size
shfree               De-allocate shared data memory
shmalloc             Allocate shared data memory

TABLE 2.3: Parallel Programming Library Memory-Allocation Routines


a. Which data can be shared between parallel processes and

which must be local to each parallel process.

b. Which variables cause dependencies or critical regions, code sections which can yield incorrect results when executed in parallel.

To complete the development of our data partitioning program, we

need to do the following things:

1. Invoke the appropriate compiler with the proper options to

link the program with the PPL.

2. Execute the program and check the results.

3. If necessary, we use the DYNIX parallel symbolic debugger, pdbx, to debug the program.

Thus, if we need to compile and link a FORTRAN program, we enter the following command:

    fortran -e -F/SHCOM/ program.name -lpps

This command compiles a FORTRAN source file and links the object code with the PPL, producing an executable file named a.out. It also places all COMMON blocks declared with the (-F) option (here the block /SHCOM/) into shared memory. The (-e) option makes the FORTRAN code compatible with the C subroutines in the PPL. To execute the program, we simply enter the name of the executable file as a DYNIX command. The default file name is a.out.

2.3.3 Function Partitioning with DYNIX

This subsection describes the facilities provided by the Balance

system to support function partitioning applications. Function


partitioning involves creating multiple processes and having them

perform different operations on the same data set. The processes may

be created within a single program or they may be independent programs

created at the operating system level. The DYNIX PPL contains several

routines that can be used for function partitioning applications.

The fork() system call creates a duplicate copy of the current process. The parent process sets up a shared memory region and one or more locks, then forks one or more child processes to share the work. The children inherit the parent's complete memory image, including access to shared memory and locks. Child processes are identical to the parent, and they can be designed to choose their own tasks based on the order of their creation. A new process is created from a file that contains either executable object code or a shell script. The PPL routines s_init_barrier and s_wait_barrier initialize a barrier and cause processes to spin until all related processes arrive at the synchronization point. Processes can send and receive signals among themselves, and to handle special events such as terminal interrupts.

If a child process determines that the parent will not need any help for a significant amount of time, the child can relinquish its processor for use by other applications. The parent can send the child a wakeup signal when required. Since a Balance system can have multiple processes running simultaneously, some programs that use UNIX signals may behave differently on the Balance system than on a uniprocessor. For example, if each of P child processes sends a signal to their parent, the parent will not necessarily receive P signals. This type of race condition is also possible on uniprocessors, but may not manifest itself until the program is ported to a multiprocessor.
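A minimal function partitioning sketch using fork() together with the barrier routines of Table 2.2 (the header name, the barrier type name and the exact signatures are assumptions of this illustration; the barrier itself must live in shared memory, here obtained with shmalloc from Table 2.3):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>
    #include <parallel/parallel.h>    /* PPL declarations (assumed)     */

    #define NPROCS 2

    int main(void)
    {
        /* Allocate the barrier in shared memory and initialize it for
           the number of cooperating processes (type name assumed).     */
        BARRIER *b = (BARRIER *)shmalloc(sizeof(BARRIER));
        s_init_barrier(b, NPROCS);

        if (fork() == 0) {            /* child: one function of the job */
            printf("child: phase 1\n");
            s_wait_barrier(b);        /* spin until all arrive          */
            printf("child: phase 2\n");
            _exit(0);
        }
        printf("parent: phase 1\n");  /* parent: a different function   */
        s_wait_barrier(b);
        printf("parent: phase 2\n");
        wait(NULL);
        return 0;
    }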


The simplest and most efficient mechanism for interprocess

communication is a semaphore in the Balance shared memory. The

interprocess communication subsystem provides the ability to transfer

data directly between processes using system calls. Finally, the

interprocess communication facilities are extremely useful for certain

types of applications.


2.4 PARALLEL ALGORITHMS

Researchers have studied parallel algorithms since long before parallel computers were constructed (since the 1960's).

However, designing parallel algorithms became more important and

interesting as the development of parallel computer architecture

advanced. Therefore, a variety of algorithms have been designed from

different view points and for the various parallel architectures which

were described in Chapter 1.

Ideally, the structure of parallel algorithms should be such that they consist of fully independent computations. Although this cannot always be achieved, some algorithms already contain independent computations that do not need to be reorganized. Such algorithms are said to have inherent parallelism.

Stone [Stone, 1973] highlights some of the problem areas in

parallel computation; these include the necessity to rearrange the data in memory for efficient parallel computation; the recognition that efficient sequential algorithms are not necessarily efficient on parallel computers and, conversely, that sometimes inefficient sequential algorithms can lead to very efficient parallel algorithms; and lastly the possibility of applying transformations to sequential algorithms to yield new algorithms suitable for parallel execution.

Kung [Kung, 1980] identified three orthogonal dimensions of the

space of parallel algorithms: concurrency control, module granularity

and communication geometry. Concurrency control is needed in

parallel algorithms to ensure the correctness of the concurrent

execution, because more than one task module can be executed at a

time. The module granularity of a parallel algorithm reflects whether


or not the algorithm tends to be communication intensive. This must

be taken into consideration for efficiency reasons. If the task

modules of a parallel algorithm are connected to represent intermodule

communication, then a geometric layout of the resulting network is

referred to as the communication geometry of the algorithm.

As was exemplified by the various architectures of existing

parallel computers, parallelism can be achieved in a variety of ways,

the same holds for parallel algorithms, since a close correspondence must exist between architectures and algorithms. Consequently, since there exists such a variety, the question to answer now is: "How do we choose amongst the different alternatives in order to solve a specific problem, or what type of problem is better adapted to a given architecture?"

Since, in most cases, performance is the reason why parallelism

is being investigated, it must be considered as a very critical issue.

The study of how to design algorithms, for different parallel

architectures, might reveal that an algorithm requires a peculiar

feature of that architecture to run efficiently.

As mentioned previously, in SIMD computers, the number of

processors tends to be large compared with that of MIMD computers.

In general, we can say that an algorithm designed for SIMD computers requires a high degree of parallelism, because this type has up to order p^m, i.e. O(p^m), processors, while a MIMD computer has up to O(p) processors, where p is the number of subtasks of the problem and m is an integer index representing problem complexity. This does not mean

that an algorithm designed for a MIMD computer cannot be run on a

SIMD computer.


Since the processors of a MIMD computer are asynchronous, they

need not necessarily be employed on the same problem. On the other

hand, the processors of a SIMD computer are synchronous and hence they

cannot be used to run independent computations that are not identical

and they must remain idle when not required.

The performance of an algorithm is assessed by some relevant quantities, such as the computation time, speed-up, and efficiency of the algorithm. In real machines, actual computation times are often proportional to the total number of arithmetic operations in the programs, whilst in cases of programs with little arithmetic they are proportional to the number of memory accesses, or the number of I/O transmissions. We will use T_p to denote the computation time on a computer with p processors, and T_1 to denote the computation time of a uniprocessor computer.

The term Speed-up (S_p) of a p-processor computer over a sequential computer is defined as,

    S_p = T_1 / T_p ,                    (2.4.1)

and the Efficiency (E_p) is defined as,

    E_p = S_p / p <= 1 ,                 (2.4.2)

and to compare two parallel algorithms for the same problem the following measure of effectiveness (F_p) is introduced,

    F_p = S_p / C_p ,                    (2.4.3)

where,

    C_p = p T_p >= T_1 ,                 (2.4.4)

measures the cost of the algorithm. Note that F_p T_1 = (S_p / (p T_p)) T_1 = S_p E_p, the quantity tabulated below.

The following simple example is given for illustration. Suppose that it is required to form the inner or scalar product,

    A = Σ_{i=1}^{16} a_i b_i .

To carry this out with a single processor sequentially requires 16 multiplications and 15 additions. If we take the number of additions (assuming 1 multiplication ≈ 1 addition) as a measure of the time needed, then, normalising by taking the time for a single addition as the unit, we have T_1 = 31. If two processors were available we would form the two products,

    c_1 = Σ_{i=1}^{8} a_i b_i   and   c_2 = Σ_{i=9}^{16} a_i b_i ,

simultaneously, requiring 15 time units, and then form,

    A = c_1 + c_2 ,

at the next stage, requiring a further time unit. Thus A would be obtained in T_2 = 15+1 = 16 time units.

If now we had three processors, A could be formed in the following

three stages:

$c_1 = a_1b_1 + a_2b_2 + a_3b_3 + a_4b_4 + a_5b_5$ ,
$c_2 = a_6b_6 + a_7b_7 + a_8b_8 + a_9b_9 + a_{10}b_{10}$ ,
$c_3 = a_{11}b_{11} + a_{12}b_{12} + a_{13}b_{13} + a_{14}b_{14} + a_{15}b_{15}$ ;

$d_1 = c_1 + c_2$ ,   $d_2 = c_3 + a_{16}b_{16}$ ;

$e_1 = d_1 + d_2 = A$ .

This requires $T_3 = 9 + 3 + 1 = 13$ time units.

If four processors were available then A could be formed in the

following stages:

$c_1 = \sum_{i=1}^{4} a_i b_i$ ,  $c_2 = \sum_{i=5}^{8} a_i b_i$ ,  $c_3 = \sum_{i=9}^{12} a_i b_i$ ,  $c_4 = \sum_{i=13}^{16} a_i b_i$ ;

$d_1 = c_1 + c_2$ ,   $d_2 = c_3 + c_4$ ;

$e_1 = d_1 + d_2 = A$ .

This requires $T_4 = 7 + 1 + 1 = 9$ time units.

A table of the performance measures can now be constructed.

Table 2.4 shows that with increasing P, $S_p$ increases steadily, while

$E_p$ decreases. $F_pT_1$, however, has a maximum when P=8, which indicates

that P=8 is the optimal choice of the number of processors for this

calculation.

     P    T_p    C_p    S_p     E_p     F_pT_1 = S_pE_p
     1    31     31     1       1       1
     2    16     32     1.93    0.96    1.85
     3    13     39     2.38    0.79    1.88
     4    10     40     3.10    0.77    2.38
     8     6     48     5.16    0.64    3.30
    16     5     80     6.2     0.38    2.4

TABLE 2.4: The Performance Measures of $\sum_{i=1}^{16} a_i b_i$.


We can find in the literature many algorithms that are suitable for

SIMD computers. Examples of such algorithms have been introduced by

Miranker [Miranker, 1971], Stone [Stone, 1971, 1973b] and Wyllie

[Wyllie, 1979]. On the other hand, although implementing algorithms

on MIMD computers is more difficult than their implementation on SIMD

computers, many parallel algorithms have been developed to run on MIMD

computers. For example, Muraoka [Muraoka, 1971] showed how

parallelism is exploited in evaluating algebraic expressions on a MIMD

computer, and Baudet et al [Baudet, 1980] developed a variety of

parallel iterative algorithms suitable for MIMD computers.

2.4.1 The Structure of Algorithms for Multiprocessor Systems

A parallel algorithm for a multiprocessor is a set of n concurrent

processes which may operate simultaneously and cooperatively to solve

a given problem. Synchronization and the exchange of data are needed

between processes to ensure that the parallel algorithm works correctly

and effectively to solve a given problem. Therefore, at some stage in

the execution of a process there may be some points where the process

communicates with other processes. These points are called the

"interaction points". The interaction points divide a process into

stages. Hence, at the end of each stage, a process may communicate

with some other processes before the next stage of the computation is

initiated.

Parallel algorithms for multiprocessors may be classified into

asynchronous and synchronous parallel algorithms. Due to the

interactions between the processes, some processes may be blocked at

certain times. A parallel algorithm in which some processes have to

wait on other processes is called a synchronized algorithm. The

weakness of a synchronized algorithm is that all the processes that

have to synchronize at a given point wait for the slowest amongst

them. To overcome this problem, an asynchronous algorithm is

suggested. In a parallel asynchronous algorithm, processes are

generally not required to wait for each other, and communication is

achieved by using global variables stored in shared memory. Small

delays may occur because of concurrent accesses to the shared memory.
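The contrast between the two classes can be sketched in code. The following Python fragment (an added illustration using thread primitives; the process bodies and shared data are hypothetical) shows a synchronized stage boundary, where every process waits at a barrier, against an asynchronous update through shared memory:

```python
import threading

N_PROCS = 4
shared = [0.0] * N_PROCS              # global variables in shared memory
barrier = threading.Barrier(N_PROCS)

def synchronized_process(pid: int, stages: int) -> None:
    # Synchronized algorithm: at the end of every stage (an
    # "interaction point") each process waits for the slowest one.
    for _ in range(stages):
        shared[pid] += 1.0            # local work for this stage
        barrier.wait()                # block until all processes arrive

def asynchronous_process(pid: int, stages: int) -> None:
    # Asynchronous algorithm: no waiting; a process simply reads the
    # most recent (possibly stale) value written by a neighbour.
    for _ in range(stages):
        left = shared[(pid - 1) % N_PROCS]
        shared[pid] = 0.5 * (shared[pid] + left) + 1.0

for target in (synchronized_process, asynchronous_process):
    threads = [threading.Thread(target=target, args=(i, 3))
               for i in range(N_PROCS)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(shared)
```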


CHAPTER 3

BASIC MATHEMATICS, GENERAL BACKGROUND

Most numerical analysts have
no interest in arithmetic.
                                B. Parlett.


3.1 INTRODUCTION

In the present chapter we shall scrutinize the necessary

preliminary mathematical background of differential equations, and

review all the related notations, conditions, and concepts essential

for their proper use. The mathematical formulation of any important

scientific and engineering problem involving rates of change with

respect to two or more independent variables leads to Partial

Differential Equations (PDE's) or a set of such equations. Another

class of differential equations which govern physical systems is

Ordinary Differential Equations (ODE's), in which only one independent

variable is present in the differential equations.

With the use of automatic digital computers becoming widespread,

numerical methods are found to be an attractive alternative, since

the analytical solution for the majority of these equations is

extremely difficult or too inconvenient to obtain. A number of

approaches have been developed over the years for the treatment of

PDE's, the most important and widely used of these being the methods of

finite elements and finite differences. A complementary description of

the finite difference methods will now be given.


3.2 CLASSIFICATION OF PARTIAL DIFFERENTIAL EQUATIONS

The most general mathematical form of a second-order PDE in two

independent variables, x and y (these may both be space coordinates,

or one may be a space coordinate and the other the time variable),

with a dependent variable U, can be expressed as,

$A\frac{\partial^2 U}{\partial x^2} + B\frac{\partial^2 U}{\partial x\partial y} + C\frac{\partial^2 U}{\partial y^2} + D\frac{\partial U}{\partial x} + E\frac{\partial U}{\partial y} + FU + G = 0$ .   (3.2.1)

Equation (3.2.1) can be further classified according to the nature

of its coefficients:

1. Linear, if the coefficients A,B,C,...,G are constants or
   functions of one or both independent variables x and y.

2. Non-linear, if any of the coefficients A,B,C,...,G are
   functions of the dependent variable U or its derivatives.

3. Semi-linear, if the coefficients A,B,...,G are functions of
   the independent variables x and y only.

4. Quasi-linear, if the coefficients A,B and C are functions
   of x, y, U, $\frac{\partial U}{\partial x}$ and $\frac{\partial U}{\partial y}$, but not of second-order derivatives.

5. Homogeneous, if G=0; otherwise it is called inhomogeneous.

6. Self-adjoint, if it can be replaced by

   $\frac{\partial}{\partial x}\left(A(x)\frac{\partial U}{\partial x}\right) + \frac{\partial}{\partial y}\left(C(y)\frac{\partial U}{\partial y}\right) + FU + G = 0$ .

Additionally, equation (3.2.1) can be classified into three

particular types according to whether the discriminant ($B^2-4AC$) is

greater than, equal to, or less than zero. The general second-order

quasi-linear PDE has the form,

$A\frac{\partial^2 U}{\partial x^2} + B\frac{\partial^2 U}{\partial x\partial y} + C\frac{\partial^2 U}{\partial y^2} + G = 0$ ,   (3.2.2)


where A, B, C and G may be functions of x, y, U, $\frac{\partial U}{\partial x}$ and $\frac{\partial U}{\partial y}$, but not of

the second-order derivatives. In this section, let us adopt the

following notation for the first- and second-order derivatives,

$\frac{\partial U}{\partial x} = M$ ;  $\frac{\partial U}{\partial y} = N$ ;  $\frac{\partial^2 U}{\partial x^2} = P$ ;  $\frac{\partial^2 U}{\partial x\partial y} = Q$ ;  and  $\frac{\partial^2 U}{\partial y^2} = S$ .

Let R be a curve in the x-y plane on which the values of the derivatives

above satisfy equation (3.2.2). Therefore, the differentials of M and

N in directions tangential to R satisfy the equations,

$dM = \frac{\partial M}{\partial x}dx + \frac{\partial M}{\partial y}dy = P\,dx + Q\,dy$ ,   (3.2.3)

and

$dN = \frac{\partial N}{\partial x}dx + \frac{\partial N}{\partial y}dy = Q\,dx + S\,dy$ ,   (3.2.4)

where

$AP + BQ + CS + G = 0$ ,   (3.2.5)

and $\frac{dy}{dx}$ is the slope of the tangent to R at the point p(x,y).

Elimination of P and S from (3.2.5) using (3.2.3) and (3.2.4)

results in,

$\frac{A}{dx}(dM - Q\,dy) + BQ + \frac{C}{dy}(dN - Q\,dx) + G = 0$ ,

i.e.,

$Q\left\{A\left(\frac{dy}{dx}\right)^2 - B\frac{dy}{dx} + C\right\} - \left\{A\frac{dM}{dx}\frac{dy}{dx} + C\frac{dN}{dx} + G\frac{dy}{dx}\right\} = 0$ .   (3.2.6)

Now, by choosing the curve R so that the slope of the tangent at

every point on it is a root of the equation,

$A\left(\frac{dy}{dx}\right)^2 - B\frac{dy}{dx} + C = 0$ ,   (3.2.7)

the Q term is eliminated.

Therefore, equation (3.2.6) leads to,

$A\frac{dM}{dx}\frac{dy}{dx} + C\frac{dN}{dx} + G\frac{dy}{dx} = 0$ .   (3.2.8)

Consequently, it is apparent that at every point p(x,y) of the

solution domain there are two directions, given by the roots of equation

(3.2.7), along which there is a relationship, given by equation (3.2.8),

between the total differentials dM and dN with respect to x and y.

The directions given by the roots of equation (3.2.7) are called

characteristic directions, and the PDE is considered to be hyperbolic,

parabolic, or elliptic according to whether these roots are real and

distinct, equal, or complex, respectively, i.e. according to whether

$B^2 - 4AC > 0$, $= 0$, or $< 0$.

The above classification scheme is rather interesting: since the

coefficients A, B and C are functions of the independent variables x and

y and/or the dependent variable U, the classification depends in

general on the region in which the PDE is defined. For example, a

differential equation whose discriminant is $B^2-4AC = x^2-y^2$ is

hyperbolic in the region where $x^2-y^2 > 0$, parabolic along the

boundary $x^2-y^2 = 0$, and elliptic in the region where $x^2-y^2 < 0$.

Typically, parabolic and hyperbolic PDE's result from diffusion,

equalization or oscillatory processes, and the usual independent

variables are time and space. On the other hand, the elliptic PDE is

generally associated with steady-state or equilibrium problems.
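As a small illustration of this classification (a sketch added here for clarity, not from the original text; the function name is hypothetical), the type of equation (3.2.1) at a point can be decided directly from the sign of the discriminant:

```python
def classify_pde(a: float, b: float, c: float) -> str:
    """Classify a second-order PDE (3.2.1) at a point from its
    coefficients A, B, C using the discriminant B^2 - 4AC."""
    disc = b * b - 4.0 * a * c
    if disc > 0.0:
        return "hyperbolic"   # real, distinct characteristic roots
    if disc == 0.0:
        return "parabolic"    # equal roots
    return "elliptic"         # complex roots

# Heat equation u_t = u_xx: A=1, B=0, C=0  ->  parabolic
print(classify_pde(1.0, 0.0, 0.0))
# Laplace equation u_xx + u_yy = 0: A=1, B=0, C=1  ->  elliptic
print(classify_pde(1.0, 0.0, 1.0))
```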


3.3 TYPES OF BOUNDARY CONDITIONS

The solution of a PDE has to satisfy, in particular, some

boundary conditions arising from the formulation of the problem

itself. Usually, the elliptic PDE's are classified as boundary value

problems since, as in Figure 3.1, the boundary conditions are given

around the (closed) region.

FIGURE 3.1: Boundary Value Problem (U = f given on the boundary $\partial R$ of
the closed region $0 \leq x \leq a$, $0 \leq y \leq b$).

The parabolic and hyperbolic types of equations are either

initial value problems or initial boundary value problems, where the

initial and/or boundary conditions are supplied on the sides of the open

region, and the solution proceeds towards the open side (see Figures

3.2a,b and 3.3a,b).

In accordance with the specific type of boundary conditions

defined on the boundary $\partial R$ of the region R, several basic categories

of problem can be distinguished. These are:

1. Dirichlet problem, where the solution U is specified at each
   point on $\partial R$.

FIGURE 3.2: Initial Boundary Value Problems for a Parabolic Equation.
(a) Mixed initial boundary value problem: U(x,0) given on [0,a], with
U(0,t) and U(a,t) given for t>0; the region is open in the t-direction.
(b) Pure initial value problem: U(x,0) given on the entire initial line.

FIGURE 3.3: Initial Boundary Value Problems for a Hyperbolic Equation
(the wave equation $\frac{\partial^2 U}{\partial x^2} = \frac{1}{c^2}\frac{\partial^2 U}{\partial t^2}$). (a) U and $\frac{\partial U}{\partial t}$ given on t=0 for x in
[0,a], with U(0,t) and U(a,t) given for t>0. (b) U and $\frac{\partial U}{\partial t}$ given on
the entire initial line.


2. Neumann problem, where the values of the normal derivative $\frac{\partial U}{\partial n}$ are
   given on $\partial R$, where $\frac{\partial U}{\partial n}$ denotes the directional derivative of U
   along the outward normal to $\partial R$.

3. Mixed problem, where the solution U is specified on part of $\partial R$
   and $\frac{\partial U}{\partial n}$ is specified on the remainder of $\partial R$.

4. Periodic problem, where the solution U has to satisfy the periodicity
   conditions, for example,

   $U|_x = U|_{x+\ell}$ ,   $\frac{\partial U}{\partial n}\Big|_x = \frac{\partial U}{\partial n}\Big|_{x+\ell}$ ,

   where $\ell$ is the period and x and $x+\ell$ are on $\partial R$.

5. Robin's problem, where a combination of U and its derivatives is
   given along the boundary, i.e., $\alpha U + \beta\frac{\partial U}{\partial n}$ is given on $\partial R$.


3.4 BASIC MATRIX ALGEBRA

Numerical approaches, such as the finite difference and finite

element methods, to solve ODE's and PDE's yield a system of linear

simultaneous equations which can be represented as a matrix system.

Methods of solving such systems depend on certain matrix properties, for

example, irreducibility, diagonal dominance, and positive definiteness

of the coefficient matrix of the system.

A matrix is defined as a two-dimensional array with each element

denoted as $a_{i,j}$, where i and j represent respectively the row and the

column of the array in which the element appears. A matrix A, say, is

of size (n$\times$m) if it has n rows and m columns, and can be denoted by,

$A = [a_{i,j}] = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1m} \\ a_{21} & & & \\ \vdots & & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nm} \end{bmatrix}$ .   (3.4.1)

When n=1 we have a row vector, and for m=1 a column vector. Vectors

are usually denoted by a small underlined letter with a single

subscript, so that $b_i$ represents the i-th element of the vector $\underline{b}$.

The vector $\underline{b}$, for example, whose elements are $b_1, b_2, \ldots, b_n$, is of

order n and is denoted by,

$\underline{b} = (b_1, b_2, \ldots, b_n)^T$ .   (3.4.2)


A square matrix is a matrix of order n, where n=m. In this

thesis, all matrices used are square matrices, unless otherwise

stated. The set of elements $a_{i,i}$, where i=1,2,...,n, of the matrix

A (3.4.1) is the diagonal of A. The transpose of the matrix A is

denoted by $A^T$ and is obtained by interchanging the rows and columns

of A. The determinant of A is denoted by det(A) or |A|. An inverse,

$A^{-1}$, of a given matrix A, if it exists, is a square matrix such that,

$A^{-1}A = AA^{-1} = I$ ,

where I is the identity (unit) matrix, whose order is the same as that

of A, and is defined as follows,

$a_{i,i} = 1$ , for all i=1,2,...,n,
$a_{i,j} = 0$ , for all i,j=1,2,...,n and $i \neq j$.

If A possesses an inverse then it is non-singular, otherwise it

is singular. In other words, A is singular if |A|=0, and non-singular

if |A|$\neq$0.

If the entries of a matrix A are complex numbers, the conjugate

of A is the matrix $\bar{A}$ whose entries are the conjugates of the

corresponding entries of A, i.e., if $A=[a_{i,j}]$ then $\bar{A}=[\bar{a}_{i,j}]$. The

Hermitian transpose (conjugate transpose) of A, denoted by $A^H$, is the

transpose of $\bar{A}$, i.e.,

$A^H = (\bar{A})^T = \overline{A^T}$ .

The sum of the diagonal elements of a matrix A is called the trace of

A, denoted by tr(A), i.e., $\mathrm{tr}(A) = \sum_{i=1}^{n} a_{i,i}$.


A permutation matrix $P=[p_{i,j}]$ is a matrix which has elements of

zeros and ones only, with exactly one non-zero element in each row and

column. For any permutation matrix P we have,

$P^T P = P P^T = I$ ,  hence  $P^{-1} = P^T$ .

Definition 3.1:

The matrix $A=[a_{i,j}]$ is said to be:

1. Symmetric, if $A = A^T$;
2. Orthogonal, if $A^{-1} = A^T$;
3. Hermitian, if $A^H = A$;
4. Null, if $a_{i,j} = 0$ for all i,j=1,2,...,n;
5. Sparse, if a relatively large number of its elements $a_{i,j}$ are zero;
6. Dense (full), if a relatively large number of its elements $a_{i,j}$
   are non-zero;
7. Diagonal, if $a_{i,j} = 0$ for $i \neq j$;
8. Lower triangular, if $a_{i,j} = 0$ for $i < j$;
9. Upper triangular, if $a_{i,j} = 0$ for $i > j$.

Definition 3.2:

The matrix $A=[a_{i,j}]$ is called:

1. Banded, if $a_{i,j} = 0$ for $|i-j| > r$, where (2r+1) is the bandwidth of A;
2. Tridiagonal, if r=1;
3. Quindiagonal, if r=2;
4. Block diagonal, if each $D_i$, where i=1,2,...,r (Figure 3.4d), is a
   square matrix on the diagonal, with zeros elsewhere.


FIGURE 3.4: Types of Banded Matrices (where x denotes a non-zero
element): (a) diagonal matrix; (b) tridiagonal matrix (r=1);
(c) quindiagonal matrix (r=2); (d) block diagonal matrix with square
diagonal blocks $D_1, D_2, \ldots, D_n$.

Definition 3.3:

An (n$\times$n) matrix A is diagonally dominant if,

$|a_{i,i}| \geq \sum_{\substack{j=1 \\ j\neq i}}^{n} |a_{i,j}|$ , for all $1 \leq i \leq n$ ,

and, for at least one i,

$|a_{i,i}| > \sum_{\substack{j=1 \\ j\neq i}}^{n} |a_{i,j}|$ .
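A direct transcription of Definition 3.3 (an illustrative sketch added here, not from the thesis) checks diagonal dominance row by row:

```python
import numpy as np

def is_diagonally_dominant(a: np.ndarray) -> bool:
    """Definition 3.3: |a_ii| >= sum_{j != i} |a_ij| for every row,
    with strict inequality in at least one row."""
    n = a.shape[0]
    strict = False
    for i in range(n):
        off_diag = np.sum(np.abs(a[i])) - abs(a[i, i])
        if abs(a[i, i]) < off_diag:
            return False
        if abs(a[i, i]) > off_diag:
            strict = True
    return strict

# A typical tridiagonal model matrix (diagonal 4, off-diagonals -1)
A = np.array([[4., -1., 0.], [-1., 4., -1.], [0., -1., 4.]])
print(is_diagonally_dominant(A))   # True
```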

Definition 3.4:

If a matrix A is Hermitian, and

$\underline{u}^H A \underline{u} > 0$

for all $\underline{u} \neq 0$, then A is positive definite. A is non-negative

definite if $\underline{u}^H A \underline{u} \geq 0$ for all $\underline{u}$.

Definition 3.5:

A sequence of matrices $A^{(1)}, A^{(2)}, A^{(3)}, \ldots$ of the same dimension

is said to converge to a matrix A if and only if,

$\lim_{k\to\infty} \|A - A^{(k)}\| = 0$ ,

where $\|\cdot\|$ is a suitable norm.

Let A be a square matrix; then A converges to zero if the sequence

of matrices $A, A^2, A^3, \ldots$ converges to the null matrix, and A is

divergent otherwise.

Definition 3.6:

A matrix $A=[a_{i,j}]$ of order n has property A if there exist two

disjoint subsets $S_1$ and $S_2$ of W={1,2,...,n} such that if $i \neq j$ and if

either $a_{i,j} \neq 0$ or $a_{j,i} \neq 0$, then $i \in S_1$ and $j \in S_2$, or else $i \in S_2$ and

$j \in S_1$.


Definition 3.7:

A matrix A of order n is consistently ordered if for some t

there exist disjoint subsets $S_1, S_2, \ldots, S_t$ of W={1,2,...,n}, with

$\bigcup_{k=1}^{t} S_k = W$, such that if i and j are associated, then $j \in S_{k+1}$ if

j>i and $j \in S_{k-1}$ if j<i, where $S_k$ is the subset containing i.

3.4.1 Vectors and Matrix Norms

The measure of the size or magnitude of a vector or matrix is called

its norm and is denoted by $\|\cdot\|$. If $\underline{u}$ is a vector then its norm is

denoted by $\|\underline{u}\|$, and it is a non-negative number satisfying the

following three axioms:

1. $\|\underline{u}\| = 0$ for $\underline{u} = 0$, and $\|\underline{u}\| > 0$ if $\underline{u} \neq 0$;
2. $\|\alpha\underline{u}\| = |\alpha|\cdot\|\underline{u}\|$ for any complex scalar $\alpha$;
3. $\|\underline{u}+\underline{v}\| \leq \|\underline{u}\| + \|\underline{v}\|$ for any vectors $\underline{u}$ and $\underline{v}$.

Similarly, we can define the matrix norm. The norm of a matrix A

of order n is denoted by $\|A\|$ and satisfies the following axioms:

1. $\|A\| \geq 0$, with equality only when A is the null matrix;
2. $\|\alpha A\| = |\alpha|\cdot\|A\|$, for any complex scalar $\alpha$;
3. $\|A+B\| \leq \|A\| + \|B\|$, for any matrices A and B (the triangle
   inequality);
4. $\|AB\| \leq \|A\|\cdot\|B\|$, for any matrices A and B (the Schwarz
   inequality).

There are several common types of norms. Some of them are:

i. The Euclidean norm, denoted by $\|A\|_E$, given by,

   $\|A\|_E = \left(\sum_{i,j} |a_{i,j}|^2\right)^{1/2}$ ;

ii. the spectral or $L_2$ norm, denoted by $\|A\|_S$, given by,

   $\|A\|_S = [\rho(A^H A)]^{1/2}$ ,

   the square root of the largest eigenvalue of $A^H A$;

iii. the $L_1$ norm, denoted by $\|A\|_1$, given by,

   $\|A\|_1 = \max_j \sum_{i=1}^{n} |a_{i,j}|$  (the maximum absolute column sum);

iv. the $L_\infty$ norm, denoted by $\|A\|_\infty$, given by,

   $\|A\|_\infty = \max_i \sum_{j=1}^{n} |a_{i,j}|$  (the maximum absolute row sum).
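These four norms are easily evaluated; the sketch below (illustrative only, assuming numpy is available) computes them for a small matrix:

```python
import numpy as np

A = np.array([[4., -1., 0.],
              [-1., 4., -1.],
              [0., -1., 4.]])

norm_E   = np.sqrt(np.sum(np.abs(A) ** 2))                       # Euclidean norm
norm_S   = np.sqrt(np.max(np.linalg.eigvalsh(A.conj().T @ A)))   # spectral norm
norm_1   = np.max(np.sum(np.abs(A), axis=0))                     # L1: max column sum
norm_inf = np.max(np.sum(np.abs(A), axis=1))                     # L-infinity: max row sum

print(norm_E, norm_S, norm_1, norm_inf)
# np.linalg.norm returns the same values with ord='fro', 2, 1 and np.inf
```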

3.4.2 Eigenvalues and Eigenvectors

Suppose that A is an (n$\times$n) matrix and $\underline{u}$ a non-zero column vector

of order n. If there exists a scalar $\lambda$ such that,

$A\underline{u} = \lambda\underline{u}$ ,   (3.4.3)

then $\lambda$ is called an eigenvalue of A and $\underline{u}$ its corresponding eigenvector.

Equation (3.4.3) can be written as,

$(A - \lambda I)\underline{u} = 0$ .   (3.4.4)

A non-trivial solution $\underline{u} \neq 0$ to this matrix equation exists if and

only if the matrix of the system is singular, i.e.,

$\det(A - \lambda I) = 0$ .   (3.4.5)

Equation (3.4.5) is called the characteristic equation of A and

the left-hand side is called the characteristic polynomial of A, which

can be written as,

$a_0 + a_1\lambda + \cdots + a_{n-1}\lambda^{n-1} + (-1)^n\lambda^n = 0$ .   (3.4.6)

Since the coefficient of $\lambda^n$ is not zero, equation (3.4.6) always

has n roots (complex or real), which are the n eigenvalues of the

matrix A, namely $\lambda_1, \lambda_2, \ldots, \lambda_n$ (not necessarily all distinct), each

of them possessing a corresponding eigenvector. Wilkinson

[Wilkinson, 1965] described many methods in detail for obtaining the

eigenvalues, along with the corresponding eigenvectors.
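As a quick numerical check of (3.4.3) (an added illustration, not part of the original text), a library eigensolver returns pairs that satisfy $A\underline{u} = \lambda\underline{u}$ to rounding error:

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 2.]])
eigenvalues, eigenvectors = np.linalg.eig(A)

for k in range(len(eigenvalues)):
    lam = eigenvalues[k]
    u = eigenvectors[:, k]
    # A u and lambda u agree to rounding error, verifying (3.4.3)
    print(lam, np.allclose(A @ u, lam * u))   # 3.0 True, 1.0 True
```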


3.5 NUMERICAL SOLUTION OF PDE'S BY THE FINITE DIFFERENCE METHOD

Numerical methods for the solution of PDE's have been studied

extensively for many years and a number of approaches have been

developed. The most widely used are the finite element and finite

difference methods. Finite difference methods are discrete techniques

in which the domain of interest is represented by a set of points

(nodes), and the information between these points is commonly obtained

using Taylor series expansions.

3.5.1 Finite Difference Approximation

To understand the finite difference technique it is necessary

to consider the nomenclature and fundamental concepts encountered in

this form of approximation theory. The basic concept is to subdivide

the domain of the solution of the given PDE by a net with a finite

number of mesh points. The derivative at each point is then replaced

by a finite difference approximation. In this sub-section, we shall

develop certain finite difference representations for one- and two-

independent-variable systems using Taylor series expansions.

Let us first consider u(x), in which u is a continuous function

of the single independent variable x. By discretizing the x domain

into a set of points (nodes) such that,

$u(x_i) \equiv u(ih) \equiv u_i$ ,  i = 0,1,2,3,... ,   (3.5.1)

and replacing the location $x_i$ by ih, the nodal coordinates are

specified as the product of the integer i and the grid spacing h, where

h is assumed constant and to be a small quantity less than unity (see

Figure 3.5).

FIGURE 3.5: Finite Difference Discretization of u = u(x) using Constant
Mesh Spacing h (nodes at x = h, 2h, 3h, ..., nh).

FIGURE 3.6: Two-Dimensional Finite Difference Grid: (a) mesh with
spacing h in the x-direction and k in the y-direction; (b) the points
(i-1,j), (i+1,j), (i,j+1) and (i,j-1) surrounding the node (i,j).


In the two-dimensional case the function u(x,y) may be

specified at any nodal location as,

$u(x_i, y_j) \equiv u(ih, jk) \equiv u_{i,j}$ ,  i = 0,1,2,... , j = 0,1,2,... .   (3.5.2)

The spacing in the x-direction is h and in the y-direction, k. The

integers i and j refer to the location of u along the x and y co-

ordinates, respectively (see Figure 3.6).

The Taylor series expansion for u(x) can be written at the

point $x_i$ as,

$u(x_i+h) = u(x_i) + h\,u_x|_i + \frac{h^2}{2!}u_{xx}|_i + \frac{h^3}{3!}u_{xxx}|_i + \cdots$   (3.5.3a)

$u(x_i-h) = u(x_i) - h\,u_x|_i + \frac{h^2}{2!}u_{xx}|_i - \frac{h^3}{3!}u_{xxx}|_i + \cdots$   (3.5.3b)

Rearranging these equations, we may write,

$u_x|_i = \frac{u(x_i+h) - u(x_i)}{h} - \frac{h}{2!}u_{xx}|_i - \frac{h^2}{3!}u_{xxx}|_i - \cdots$   (3.5.4a)

$u_x|_i = \frac{u(x_i) - u(x_i-h)}{h} + \frac{h}{2!}u_{xx}|_i - \frac{h^2}{3!}u_{xxx}|_i + \cdots$   (3.5.4b)

Therefore, two possible approximations to the first derivative of u

at $x_i$ are given by (3.5.5a,b):

$u_x|_i \simeq \frac{u(x_i+h) - u(x_i)}{h} = \frac{u_{i+1} - u_i}{h}$ ,   (3.5.5a)

or

$u_x|_i \simeq \frac{u(x_i) - u(x_i-h)}{h} = \frac{u_i - u_{i-1}}{h}$ .   (3.5.5b)


Because the series has been arbitrarily truncated, there is clearly

an error, $E_i$ say, associated with this approximation. This error

can be characterized by the first and largest term of the truncated

series, which yields,

$E_i = -\frac{h}{2!}u_{xx}|_i = O(h)$ .

We say that this error is of order h, O(h); the O(h) error is

in absolute value smaller than ch (c constant) for sufficiently

small h.

Adding (3.5.4a) and (3.5.4b) and solving for $u_x|_i$ results in,

$u_x|_i = \frac{u_{i+1} - u_{i-1}}{2h}$ ,   (3.5.6)

with a truncation error of order $O(h^2)$, while subtracting (3.5.4b)

from (3.5.4a) and solving for $u_{xx}|_i$ we obtain,

$u_{xx}|_i = \frac{u_{i+1} - 2u_i + u_{i-1}}{h^2}$ ,   (3.5.7)

also with a truncation error of order $O(h^2)$.
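These orders of accuracy can be observed numerically. The following sketch (added for illustration; the test function and point are arbitrary) halves h and watches the error of the forward difference (3.5.5a) and the central difference (3.5.6) for u(x) = sin x at x = 1:

```python
import math

x = 1.0
exact = math.cos(x)                    # derivative of sin at x

for h in [0.1, 0.05, 0.025]:
    forward = (math.sin(x + h) - math.sin(x)) / h          # (3.5.5a), O(h)
    central = (math.sin(x + h) - math.sin(x - h)) / (2*h)  # (3.5.6),  O(h^2)
    print(f"h={h:6.3f}  forward error={abs(forward-exact):.2e}  "
          f"central error={abs(central-exact):.2e}")
# Halving h roughly halves the forward error but quarters the central error.
```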

We can proceed further to derive many finite difference

approximations for u(x,y). The new features are the use of $u_{i,j}$

instead of $u_i$ and the fact that a partial derivative with respect to x

implies that y is held constant. Using (3.5.5a) in conjunction with

Figure 3.6b as an illustration, we can write directly,

$\frac{\partial u}{\partial x}\Big|_{i,j} = \frac{u_{i+1,j} - u_{i,j}}{h} + O(h)$ ,   (3.5.8a)

$\frac{\partial u}{\partial y}\Big|_{i,j} = \frac{u_{i,j+1} - u_{i,j}}{k} + O(k)$ .   (3.5.8b)

So, in the approximation to $u_x|_{i,j}$ we hold the subscript j

constant, while in $u_y|_{i,j}$ we hold i constant.

The two-dimensional second-derivative approximations are

obtained in the same way:

$u_{xx}|_{i,j} = \frac{u_{i+1,j} - 2u_{i,j} + u_{i-1,j}}{h^2} + O(h^2)$ ,   (3.5.9a)

$u_{yy}|_{i,j} = \frac{u_{i,j+1} - 2u_{i,j} + u_{i,j-1}}{k^2} + O(k^2)$ .   (3.5.9b)

The important extension is how to derive mixed derivatives. If we

need the finite difference representation for $u_{xy}$,

$u_{xy}|_{i,j} = \frac{\partial}{\partial x}\left[\frac{\partial u}{\partial y}\right]_{i,j}$ ,   (3.5.10)

a finite difference representation of (3.5.10) is readily

obtained using (3.5.6), as follows,

$\frac{\partial}{\partial x}\left[\frac{\partial u}{\partial y}\right]_{i,j} = \frac{1}{2h}\left[\frac{\partial u}{\partial y}\Big|_{i+1,j} - \frac{\partial u}{\partial y}\Big|_{i-1,j}\right]$

$= \frac{1}{2h}\left[\frac{u_{i+1,j+1} - u_{i+1,j-1}}{2k} + \frac{k^2}{3!}u_{yyy}|_{i+1,j} + \cdots - \frac{u_{i-1,j+1} - u_{i-1,j-1}}{2k} - \frac{k^2}{3!}u_{yyy}|_{i-1,j} - \cdots\right]$

$= \frac{u_{i+1,j+1} - u_{i-1,j+1} - u_{i+1,j-1} + u_{i-1,j-1}}{4hk} + O(h^2) + O(k^2)$ ,

and when h and k are equal, this equation becomes,

$u_{xy}|_{i,j} = \frac{u_{i+1,j+1} - u_{i-1,j+1} - u_{i+1,j-1} + u_{i-1,j-1}}{4h^2}$ .   (3.5.11)

In Table 3.1, a list of the most frequently used finite

difference approximations for u(x,y) is given.

 Derivative       Finite Difference Approximation                                      Order of Error
 $u_x|_{i,j}$     $(u_{i+1,j} - u_{i,j})/h$                                            O(h)
 $u_x|_{i,j}$     $(u_{i,j} - u_{i-1,j})/h$                                            O(h)
 $u_x|_{i,j}$     $(u_{i+1,j} - u_{i-1,j})/2h$                                         $O(h^2)$
 $u_x|_{i,j}$     $(-u_{i+2,j} + 4u_{i+1,j} - 3u_{i,j})/2h$                            $O(h^2)$
 $u_x|_{i,j}$     $(u_{i+1,j+1} - u_{i-1,j+1} + u_{i+1,j-1} - u_{i-1,j-1})/4h$         $O(h^2)$
 $u_{xx}|_{i,j}$  $(u_{i+1,j} - 2u_{i,j} + u_{i-1,j})/h^2$                             $O(h^2)$
 $u_{xx}|_{i,j}$  $(-u_{i+2,j} + 16u_{i+1,j} - 30u_{i,j} + 16u_{i-1,j} - u_{i-2,j})/12h^2$   $O(h^4)$
 $u_{xy}|_{i,j}$  $(u_{i+1,j+1} - u_{i+1,j-1} - u_{i-1,j+1} + u_{i-1,j-1})/4h^2$       $O(h^2)$

TABLE 3.1: Finite difference approximations in two independent

variables with k=h.


The two-dimensional finite difference approximations can be

extended in a straightforward manner to three space dimensions, or to

two space dimensions and time. The approximation of derivatives in

the third space dimension is analogous to that presented earlier.

To conclude this sub-section, we now illustrate how the finite

difference approximation is used to represent PDE's. Let us consider

the heat flow parabolic equation,

$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2}$ ,  i.e.  $u_t = u_{xx}$ ,   (3.5.12)

and the Laplace or Poisson elliptic equation,

$u_{xx} + u_{yy} = 0$ ,   (3.5.13a)
or
$u_{xx} + u_{yy} = f(x,y)$ .   (3.5.13b)

Temporarily, we shall ignore the initial and boundary conditions

and concentrate on the derivation of some finite difference forms for

the PDE which will hold within the domain of interest.

A finite difference representation of (3.5.12) is readily

obtained using information from Table 3.1. A term-by-term

substitution yields,

$\frac{u_{i,j+1} - u_{i,j}}{k} = \frac{u_{i+1,j} - 2u_{i,j} + u_{i-1,j}}{h^2} + O(k, h^2)$ .

Neglecting the truncation error, the finite difference equation at

the level (j+1) becomes,

$u_{i,j+1} = \lambda u_{i+1,j} + (1-2\lambda)u_{i,j} + \lambda u_{i-1,j}$ ,   (3.5.14)

where $\lambda = k/h^2$. The error associated with (3.5.14) is of order

$O(k, h^2)$. The discrete diagram for (3.5.14) appears in Figure 3.7a.

The Laplace equation (3.5.13a) can be approximated in exactly

the same manner using Table 3.1,

$\frac{u_{i+1,j} - 2u_{i,j} + u_{i-1,j}}{h^2} + \frac{u_{i,j+1} - 2u_{i,j} + u_{i,j-1}}{k^2} + O(h^2, k^2) = 0$ ,

which can be rearranged to yield,

$2\left(\frac{1}{h^2} + \frac{1}{k^2}\right)u_{i,j} = \frac{1}{h^2}[u_{i+1,j} + u_{i-1,j}] + \frac{1}{k^2}[u_{i,j+1} + u_{i,j-1}]$ .   (3.5.15)

The discrete diagram for equation (3.5.15) is given in Figure 3.7b.

When k=h, equation (3.5.15) reduces to,

$u_{i,j} = \frac{1}{4}[u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1}]$ ,   (3.5.16)

which is also second-order accurate.

FIGURE 3.7: Computational Molecules for Finite Difference
Approximations: (a) the explicit scheme (3.5.14), involving (i-1,j),
(i,j), (i+1,j) and (i,j+1); (b) the five-point molecule (3.5.15),
involving (i,j) and its four neighbours.
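The explicit scheme (3.5.14) translates almost line-for-line into code. The sketch below (an added illustration; the grid size, initial profile and fixed zero boundary values are arbitrary choices) advances the solution one time level at a time:

```python
import numpy as np

def explicit_heat_step(u: np.ndarray, lam: float) -> np.ndarray:
    """One time level of scheme (3.5.14):
    u_{i,j+1} = lam*u_{i+1,j} + (1-2*lam)*u_{i,j} + lam*u_{i-1,j}."""
    v = u.copy()                      # boundary values are kept fixed
    v[1:-1] = lam * u[2:] + (1.0 - 2.0 * lam) * u[1:-1] + lam * u[:-2]
    return v

n, h, k = 21, 0.05, 0.001
lam = k / h**2                        # lambda = 0.4 <= 1/2, hence stable
x = np.linspace(0.0, 1.0, n)
u = np.sin(np.pi * x)                 # initial condition, u = 0 at both ends

for _ in range(100):
    u = explicit_heat_step(u, lam)
print(u.max())                        # the peak decays, as expected of diffusion
```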


3.5.2 Derivation of Finite Difference Approximations

Previously, we have shown that by using truncated Taylor series

expansions we can replace a PDE with a finite difference approximation.

The truncated terms lead to the truncation error, which provides a

measure of the accuracy of the finite difference approximation. For

any PDE there are many possible finite difference representations, and

the schemes with smaller truncation error would seem to be preferred.

This is not always so, since features other than the truncation error

also make a particular finite difference approximation a feasible

candidate for a computation.

Let us consider equation (3.5.12) as a model for developing the

finite difference approximations. There are two types of approximation,

explicit and implicit. Table 3.2 shows some of the well-known finite

difference methods; the truncation error and the stability requirement

of each method are also included.

TABLE 3.2: Well-known Finite Difference Methods for $\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2}$
(with $\lambda = k/h^2$).

1. Classical Explicit
   $u_{i,j+1} = (1-2\lambda)u_{i,j} + \lambda(u_{i+1,j} + u_{i-1,j})$
   Stability: $\lambda \leq \frac{1}{2}$.  Truncation error: $O(k, h^2)$.

2. Dufort-Frankel Explicit
   $(1+2\lambda)u_{i,j+1} = 2\lambda(u_{i+1,j} + u_{i-1,j}) + (1-2\lambda)u_{i,j-1}$
   Unconditionally stable.  Truncation error: $O(k^2, h^2)$.

3. Richardson Explicit
   $u_{i,j+1} = u_{i,j-1} + 2\lambda(u_{i+1,j} + u_{i-1,j}) - 4\lambda u_{i,j}$
   Unstable.  Truncation error: $O(k^2, h^2)$.

4. Special Explicit ($\lambda = \frac{1}{6}$)
   $u_{i,j+1} = u_{i,j} + \frac{1}{6}(u_{i+1,j} - 2u_{i,j} + u_{i-1,j})$
   Stable.  Truncation error: $O(k^2, h^4)$.

5. Backwards (Fully) Implicit
   $(1+2\lambda)u_{i,j+1} - \lambda(u_{i-1,j+1} + u_{i+1,j+1}) = u_{i,j}$
   Unconditionally stable.  Truncation error: $O(k, h^2)$.

6. Crank-Nicolson Implicit
   $(2+2\lambda)u_{i,j+1} - \lambda(u_{i-1,j+1} + u_{i+1,j+1}) = (2-2\lambda)u_{i,j} + \lambda(u_{i-1,j} + u_{i+1,j})$
   Unconditionally stable.  Truncation error: $O(k^2, h^2)$.

7. Weighted Implicit ($0 \leq \theta \leq 1$)
   $(1+2\lambda\theta)u_{i,j+1} - \lambda\theta(u_{i-1,j+1} + u_{i+1,j+1}) = \lambda(1-\theta)(u_{i-1,j} + u_{i+1,j}) + (1-2\lambda(1-\theta))u_{i,j}$
   Stability: $\lambda \leq \frac{1}{2-4\theta}$ for $0 \leq \theta < \frac{1}{2}$; unconditionally stable for
   $\frac{1}{2} \leq \theta \leq 1$.  Truncation error: $O(k^2, h^2)$ for $\theta = \frac{1}{2}$, and $O(k, h^2)$ otherwise.

8. Douglas Implicit
   $(\frac{5}{6}+\lambda)u_{i,j+1} - (\frac{\lambda}{2} - \frac{1}{12})(u_{i-1,j+1} + u_{i+1,j+1}) = (\frac{\lambda}{2} + \frac{1}{12})(u_{i-1,j} + u_{i+1,j}) + (\frac{5}{6}-\lambda)u_{i,j}$
   Unconditionally stable.  Truncation error: $O(k^2, h^4)$.

9. Variation of Douglas Implicit (a three-time-level form whose
   left-hand side couples $u_{i-1,j+1}$, $u_{i,j+1}$ and $u_{i+1,j+1}$)
   Unconditionally stable.  Truncation error: $O(k^2, h^4)$.

10. Saul'ev's Alternating
    (a) $u_{i,j+1} - u_{i,j} = \lambda(u_{i+1,j} - u_{i,j} - u_{i,j+1} + u_{i-1,j+1})$
    (b) $u_{i,j+2} - u_{i,j+1} = \lambda(u_{i+1,j+2} - u_{i,j+2} - u_{i,j+1} + u_{i-1,j+1})$
    Unconditionally stable.  Truncation error: $O(k, h^2)$.
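Of the implicit schemes, Crank-Nicolson (scheme 6 of Table 3.2) is the usual workhorse; each time level requires a tridiagonal solve. A minimal sketch (added for illustration, assuming scipy is available and zero boundary values):

```python
import numpy as np
from scipy.linalg import solve_banded

def crank_nicolson_step(u: np.ndarray, lam: float) -> np.ndarray:
    """One time level of the Crank-Nicolson scheme for interior
    points, with u = 0 at both boundaries (an assumed choice)."""
    n = len(u) - 2                       # number of interior unknowns
    # Right-hand side: (2-2*lam)*u_i + lam*(u_{i-1} + u_{i+1}) at level j
    rhs = (2 - 2*lam) * u[1:-1] + lam * (u[:-2] + u[2:])
    # Tridiagonal LHS: (2+2*lam) on the diagonal, -lam off the diagonal
    ab = np.zeros((3, n))
    ab[0, 1:] = -lam                     # super-diagonal
    ab[1, :] = 2 + 2*lam                 # main diagonal
    ab[2, :-1] = -lam                    # sub-diagonal
    v = u.copy()
    v[1:-1] = solve_banded((1, 1), ab, rhs)
    return v

x = np.linspace(0.0, 1.0, 21)
u = np.sin(np.pi * x)
for _ in range(100):
    u = crank_nicolson_step(u, lam=2.0)  # lam > 1/2 is fine: unconditionally stable
print(u.max())
```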


3.5.3 Consistency, Efficiency, Accuracy and Stability

When a PDE (say, equation (3.5.12)) is approximated by a finite

difference analogue, one naturally expects that the difference scheme

indeed represents the differential system in some sense. By this we

mean that a difference system is consistent with a differential system

when the former becomes identical with the latter in the limit as h,k$\to$0.

Clearly, consistency is a fundamental requirement.

Efficiency refers to the amount of computational work done by the

computer in solving a problem over unit time-length.

Accuracy of a numerical solution depends on two major classes of

errors, i.e., round-off and truncation errors. Round-off errors

characterise the differences between the solution furnished by the

computer and the exact solution of the difference equations.

Truncation errors are caused by the approximations involved in

representing differential equations, and depend upon the spatial grid

size h and step size k. Intuitively, one would assume that the

accuracy of any finite difference solution would be decreased by

increasing the grid sizes since, evidently, if the grid spacing is

reduced to zero, the discretized equivalent becomes identical to the

continuous field. The magnitude of truncation errors can be estimated

using Taylor series expansions.

In recommending a numerical method, we need to strike a balance

between efficiency and accuracy: a method might incur more work to

attain higher accuracy, while on the other hand one may settle for a

less accurate method in favour of its simplicity and computing cost

effectiveness.


The final concept to be studied is stability. If u(x,t) is the

exact solution and $u_{i,j}$ is the solution of the finite difference

equations, the error of the approximation at the point (i,j) is

$(u_{i,j} - u(ih,jk))$. One is interested in the behaviour of

$|u_{i,j} - u(ih,jk)|$ as $j \to \infty$ for fixed h,k; that is, whether the solution

is bounded (stable) as the index $j \to \infty$. Also of interest is the

behaviour of $|u_{i,j} - u(ih,jk)|$ as h,k$\to$0; that is, whether the

difference scheme is convergent.

It is clear in both cases that as the number of cycles of

calculation becomes large there is a possibility of unlimited

amplification of errors, and hence the total accumulated error may

quickly swamp the solution, rendering it worthless. It can therefore

be said that a numerical method is stable if a small error at any

stage produces a smaller cumulative error.

Lax's equivalence theorem gives a relation between the consistency,

stability and convergence of the approximations of linear initial

value problems by finite difference equations.

Theorem 3.1

Given a properly posed initial boundary value problem and a

finite difference approximation to it that satisfies the consistency

condition, then stability is the necessary and sufficient condition

for convergence.

Various techniques are available for a quantitative treatment of

the stability of finite difference schemes. Among those commonly used

are the linearised Fourier analysis method (von Neumann criterion),

the matrix method, the maximum principle and the energy method.


3.6 METHODS OF SOLUTION

As mentioned in the previous section, the application of finite

difference methods to PDE's (say, equation (3.5.12)) yields a system

of linear simultaneous equations which can be represented in matrix

notation as,

$A\underline{u} = \underline{b}$ ,   (3.6.1)

where A is a coefficient matrix of order (n$\times$n), the order of the

matrix A equals the number of interior mesh points, $\underline{b}$ is a column

vector containing known source and boundary terms, and $\underline{u}$ is the

unknown column vector.

This section deals with well-known methods of solving the system

(3.6.1). The methods used usually lie in two classes, the class of

direct methods (or elimination methods) and the class of iterative

methods (or indirect methods), the choice depending mainly upon the

structure of the coefficient matrix A. If A is a large sparse matrix,

iterative methods are usually used, since these do not change the

structure of the original matrix and therefore preserve sparsity.

Another advantage of iterative methods, not possessed by direct

methods, is their frequent extension to the solution of sets of non-

linear equations.

3.6.1 Direct Methods

The direct methods are based ultimately on the process of the

elimination of variables. The most widely used methods are:

1. The Gaussian Elimination (GE) method, which involves a finite number
   of transformations of a given system of equations into an upper
   triangular system which is much more easily solved. Precisely, the
   number of transformations is one less than the size of the given
   system. If any of the diagonal elements of the matrix A in the
   system (3.6.1) becomes zero during the elimination process, we
   re-order the equations. The reordering is referred to as pivoting,
   which is normally employed to preserve stability against rounding
   error.

   There are two basic well-known pivoting schemes: partial pivoting
   and complete pivoting. In the partial pivoting strategy we choose
   an element of largest magnitude in the column of each reduced matrix
   as the pivot, elements of rows which have previously been pivotal
   being excluded from consideration. In complete pivoting, the pivot
   at each stage of the reduction is chosen as the element of largest
   magnitude in the submatrix of rows which have not been pivotal up
   to now, regardless of the position of the element in the matrix.
   This may require both row and column interchanges. Complete
   pivoting is time-consuming in execution and is not frequently used.

2. The Gauss-Jordan (GJ) method, an alternative to GE, which leads to a
   diagonal matrix rather than a triangular one at the end of the
   process. In this method, the elements above the diagonal are made
   zero at the same time that zeros are created below the diagonal, and
   hence the solution can be obtained by dividing the components of the
   right-hand side vector, $\underline{b}$, by the corresponding diagonal elements;
   i.e., there is no need for the back substitution stage as in GE.

3. Triangular, or LU, Decomposition, which is a modification of GE. The
   matrix A is transformed into the product of two matrices L and U,
   where L is a lower triangular matrix and U is an upper triangular
   matrix with ones on its diagonal entries. Then equation (3.6.1) can
   be written as,

   $LU\underline{u} = \underline{b}$ .   (3.6.2)

   The solution of (3.6.1) by this algorithm follows from (3.6.2) by
   introducing an auxiliary vector, $\underline{X}$ (say), such that the system
   (3.6.2) is split into two triangular systems,

   $L\underline{X} = \underline{b}$ ,   (3.6.3a)
   $U\underline{u} = \underline{X}$ .   (3.6.3b)

   The two vectors $\underline{X}$ and $\underline{u}$ can be obtained from (3.6.3a,b) by forward
   and backward substitution processes respectively.

   The amount of work in LU and GE is the same, while the GJ
   method requires almost 50% more operations than the GE method.
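A compact sketch of this decomposition (added for illustration; it follows the convention above with ones on the diagonal of U, i.e. the Crout form, and omits pivoting, so a zero pivot would require re-ordering as described earlier):

```python
import numpy as np

def crout_lu(a: np.ndarray):
    """Crout decomposition A = LU with ones on the diagonal of U,
    matching equation (3.6.2). No pivoting."""
    n = a.shape[0]
    L, U = np.zeros((n, n)), np.eye(n)
    for k in range(n):
        for i in range(k, n):                     # column k of L
            L[i, k] = a[i, k] - L[i, :k] @ U[:k, k]
        for j in range(k + 1, n):                 # row k of U
            U[k, j] = (a[k, j] - L[k, :k] @ U[:k, j]) / L[k, k]
    return L, U

def solve_lu(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    L, U = crout_lu(a)
    n = len(b)
    x = np.zeros(n)
    for i in range(n):                            # forward substitution, (3.6.3a)
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    u = np.zeros(n)
    for i in range(n - 1, -1, -1):                # backward substitution, (3.6.3b)
        u[i] = x[i] - U[i, i+1:] @ u[i+1:]
    return u

A = np.array([[4., -1., 0.], [-1., 4., -1.], [0., -1., 4.]])
b = np.array([2., 6., 2.])
print(solve_lu(A, b))                             # agrees with np.linalg.solve(A, b)
```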

3.6.2 The Iterative Methods

It is well known that iterative methods can exploit the great

speed of modern-day computers in the large scale computations required

for solving the matrix equations which arise from finite difference

approximations to PDE's.

In an iterative method, a sequence of approximate solution vectors

$\{\underline{u}^{(k)}\}$ is generated for the non-singular system (3.6.1) such that

$\underline{u}^{(k)} \to A^{-1}\underline{b}$  as  $k \to \infty$ .

Without loss of generality, let us consider that the (n$\times$n) non-

singular coefficient matrix A of the system (3.6.1) can be expressed as,

$A = Q - S$ ,   (3.6.4)

where Q and S are also (n$\times$n) matrices, and Q is non-singular. This

expression then represents a splitting of the matrix A. Equation

(3.6.1) then becomes,

$Q\underline{u} = S\underline{u} + \underline{b}$ .   (3.6.5)

different iterative methods. Some of these methods are:

1. The Jacobi Method. In this method we assume that Q;D and S;E+F,

where D is the main diagonal element of A. E and F are strictly

lower and upper (nxn) triangular matrices respectively. Hence

equation (3.6.5) can be written as,

Du ; (E+F)u + b (3.6.6)

-1 By the assumption that A is non-singular, thus D exists and we

can replace the system (3.6.6) by the equivalent system,

-1 -1 u ; D (E+F)~ + D b

The Jacobi iterative method is defined by,

(k+l) u

(k) ~ +~

(3.6.7)

(3.6.8)

where B is the Jacobi iterative matrix associated with the matrix A

Page 134: Multiprocessor computer architectures : algorithmic design ... · Multiprocessor computer architectures : algorithmic design and ... ARCHITECTURES: ALGORITHMIC DESIGN AND APPLICATIONS

113

and is given by,

-1 B = D (E+F) ,

-1 (k) and ~=D £. In this method the components of the vector u

b d wh 'l 'th of u(k+l) must e save 1 e comput1ng e components

2. The Gauss-Seidel (GS) Method, which is based on the immediate use
   of the improved values $u_i^{(k+1)}$ instead of $u_i^{(k)}$. By setting Q=D-E
   and S=F, with the matrices D, E and F as defined before, equation
   (3.6.5) becomes,

   $(D-E)\underline{u} = F\underline{u} + \underline{b}$ .

   Then the GS iterative method is defined by,

   $D\underline{u}^{(k+1)} = E\underline{u}^{(k+1)} + F\underline{u}^{(k)} + \underline{b}$ .   (3.6.9)

   By multiplying both sides by $D^{-1}$, equation (3.6.9) can be written
   as,

   $\underline{u}^{(k+1)} = L\underline{u}^{(k+1)} + R\underline{u}^{(k)} + \underline{g}$ ,   (3.6.10)

   where $L = D^{-1}E$, $R = D^{-1}F$, and $\underline{g} = D^{-1}\underline{b}$.

   Since L is a strictly lower triangular matrix, det(I-L)=1. Hence
   (I-L) is a non-singular matrix and $(I-L)^{-1}$ exists, and the GS
   iterative method takes the form,

   $\underline{u}^{(k+1)} = \mathcal{L}\underline{u}^{(k)} + \underline{t}$ ,   (3.6.11)

   where $\mathcal{L}$ is the GS iteration matrix, given by,

   $\mathcal{L} = (I-L)^{-1}R$ ,  and  $\underline{t} = (I-L)^{-1}\underline{g}$ .

   The computational advantage of this method is that it does not
   require the simultaneous storage of the two approximations $u_i^{(k+1)}$
   and $u_i^{(k)}$ in the course of the computations, as does the Jacobi
   iterative method.

3. The Simultaneous Overrelaxation (JOR) Method, which is a
   modification of the Jacobi method. If we assume that $\hat{\underline{u}}^{(k+1)}$ is the
   vector obtained from the Jacobi method, then from (3.6.8),

   $\hat{\underline{u}}^{(k+1)} = B\underline{u}^{(k)} + \underline{g}$ ,   (3.6.12)

   and by choosing a real parameter $\omega$, the actual vector $\underline{u}^{(k+1)}$ of
   this iteration method is determined from,

   $\underline{u}^{(k+1)} = \omega\hat{\underline{u}}^{(k+1)} + (1-\omega)\underline{u}^{(k)}$ .   (3.6.13)

   Elimination of $\hat{\underline{u}}^{(k+1)}$ between equations (3.6.12) and (3.6.13) leads
   to,

   $\underline{u}^{(k+1)} = B_\omega\underline{u}^{(k)} + \omega\underline{g}$ ,   (3.6.14)

   where $B_\omega$ is the iteration matrix of the JOR method and is given
   by,

   $B_\omega = \omega D^{-1}(E+F) + (1-\omega)I$ .

   The real parameter $\omega$ is called the relaxation factor. If $\omega$=1, we
   have the Jacobi method. If $\omega$>1 (<1) then we are in a sense
   carrying out the operation of "overrelaxation" ("underrelaxation")
   at each of the nodal points. Both the Jacobi and JOR methods are
   clearly independent of the order in which the mesh points are
   scanned.

4. The Successive Overrelaxation (SOR) Method, which is the same as
   the JOR method except that one uses the values of $u_i^{(k+1)}$ whenever
   possible. So, from equation (3.6.10), the SOR method is defined as,

   $\underline{u}^{(k+1)} = \omega(L\underline{u}^{(k+1)} + R\underline{u}^{(k)} + \underline{g}) + (1-\omega)\underline{u}^{(k)}$ .   (3.6.15)

   Equation (3.6.15) can be written in the form,

   $(I-\omega L)\underline{u}^{(k+1)} = [\omega R + (1-\omega)I]\underline{u}^{(k)} + \omega\underline{g}$ .   (3.6.16)

   But (I-$\omega$L) is non-singular for any choice of $\omega$, since
   det(I-$\omega$L)=1. So we can solve (3.6.16) for $\underline{u}^{(k+1)}$, obtaining,

   $\underline{u}^{(k+1)} = L_\omega\underline{u}^{(k)} + (I-\omega L)^{-1}\omega\underline{g}$ ,   (3.6.17)

   where $L_\omega$ is the SOR iteration matrix, given by,

   $L_\omega = (I-\omega L)^{-1}(\omega R + (1-\omega)I)$ .   (3.6.18)

   For $\omega$=1 we have the GS method, and $\omega$>1 (<1) corresponds to the
   case of overrelaxation (underrelaxation) at each of the nodal points.
   The GS and SOR methods are both dependent upon the order in which
   the points are scanned.

5. The Symmetric SOR (SSOR) Method, which involves two half
   iterations of the SOR method. The first half iteration is the
   ordinary SOR method, while the second half iteration is the SOR
   method using the mesh points in reverse order. Hence, we can define
   the SSOR iterative method by,

   $\underline{u}^{(k+\frac{1}{2})} = L_\omega\underline{u}^{(k)} + (I-\omega L)^{-1}\omega\underline{g}$ ,   (3.6.19a)
   and
   $\underline{u}^{(k+1)} = K_\omega\underline{u}^{(k+\frac{1}{2})} + (I-\omega R)^{-1}\omega\underline{g}$ ,   (3.6.19b)

   where $\underline{u}^{(k+\frac{1}{2})}$ is an intermediate approximation to the solution,
   $L_\omega$ is given in (3.6.18), and $K_\omega$ is given by,

   $K_\omega = (I-\omega R)^{-1}[\omega L + (1-\omega)I]$ .   (3.6.20)

   By eliminating $\underline{u}^{(k+\frac{1}{2})}$ between equations (3.6.19a) and (3.6.19b)
   we have,

   $\underline{u}^{(k+1)} = H_\omega\underline{u}^{(k)} + \underline{d}_\omega$ ,   (3.6.21)

   where $H_\omega$ is the SSOR iteration matrix, given by,

   $H_\omega = K_\omega L_\omega = (I-\omega R)^{-1}(I-\omega L)^{-1}[\omega L+(1-\omega)I][\omega R+(1-\omega)I]$ ,   (3.6.22a)
   and
   $\underline{d}_\omega = \omega(2-\omega)(I-\omega R)^{-1}(I-\omega L)^{-1}\underline{g}$ .   (3.6.22b)
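The point iterative methods above are easily compared in code. The following sketch (an added illustration, not from the thesis) implements one SOR sweep in component form, which reduces to Gauss-Seidel when $\omega$=1; a Jacobi sweep is included for contrast:

```python
import numpy as np

def jacobi_sweep(a, u, b):
    """One Jacobi iteration (3.6.8): the old vector u is kept intact
    while the new one is assembled."""
    n = len(b)
    new = np.empty(n)
    for i in range(n):
        s = a[i] @ u - a[i, i] * u[i]          # sum over j != i
        new[i] = (b[i] - s) / a[i, i]
    return new

def sor_sweep(a, u, b, omega):
    """One SOR iteration (3.6.15) in place: improved values are used
    immediately. omega = 1 gives the Gauss-Seidel method."""
    n = len(b)
    for i in range(n):
        s = a[i] @ u - a[i, i] * u[i]          # uses new u[j] for j < i
        gs_value = (b[i] - s) / a[i, i]
        u[i] = omega * gs_value + (1.0 - omega) * u[i]
    return u

A = np.array([[4., -1., 0.], [-1., 4., -1.], [0., -1., 4.]])
b = np.array([2., 6., 2.])
u = np.zeros(3)
for _ in range(25):
    u = sor_sweep(A, u, b, omega=1.1)
print(u)                                        # converges to (1, 2, 1)
```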

3.6.3 Block Iterative Methods

In our previous discussion of iterative methods for solving

equation (3.6.1) we dealt with point iterative methods; that is, at

each step the approximate solution is modified at a single point of the

domain. An extension of these methods is the class of block (group)

iterative methods, in which several unknowns are connected together in

the iteration formula in such a way that a linear system must be solved

before any one of them can be determined. Equation (3.6.1) can be

written as,

$\sum_{j=1}^{n} a_{i,j}u_j = b_i$ ,  i = 1,2,...,n .   (3.6.23)

In the system (3.6.23), assume that the equations and the unknowns $u_i$

are partitioned into $\ell$ groups such that $u_i$, i=1,2,...,$n_1$, constitute

the first group; $u_i$, i=$n_1$+1, $n_1$+2,...,$n_2$, constitute the second group;

and, in general, $u_i$, $n_{s-1} < i \leq n_s$, constitute the s-th group, with

$n_\ell = n$.

Young [Young, 1971] defines an ordered grouping $\pi$ of W={1,2,3,...,n}

as a subdivision of W into disjoint subsets $G_1, G_2, \ldots, G_\ell$ such that

$G_1 \cup G_2 \cup \cdots \cup G_\ell = W$. Two ordered groupings $\pi$ and $\pi'$, defined by

$G_1, G_2, \ldots, G_\ell$ and $G_1', G_2', \ldots, G_{\ell'}'$ respectively, are identical if

$\ell = \ell'$ and if $G_1 = G_1'$, $G_2 = G_2'$, ..., $G_\ell = G_\ell'$.

Evidently, this partitioning $\pi$ imposes a partitioning of the matrix

A into blocks (groups) of the form,

$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,\ell} \\ A_{2,1} & A_{2,2} & \cdots & A_{2,\ell} \\ \vdots & & & \vdots \\ A_{\ell,1} & A_{\ell,2} & \cdots & A_{\ell,\ell} \end{bmatrix}$ ,   (3.6.24)

where the diagonal blocks $A_{i,i}$, $1 \leq i \leq \ell$, are square, non-singular

matrices. From this partitioning of the matrix A, we define the

matrices,

$D = \mathrm{diag}(A_{1,1}, A_{2,2}, \ldots, A_{\ell,\ell})$ ,

$E = -\begin{bmatrix} 0 & & & \\ A_{2,1} & 0 & & \\ \vdots & \ddots & \ddots & \\ A_{\ell,1} & \cdots & A_{\ell,\ell-1} & 0 \end{bmatrix}$ ,  $F = -\begin{bmatrix} 0 & A_{1,2} & \cdots & A_{1,\ell} \\ & 0 & \ddots & \vdots \\ & & \ddots & A_{\ell-1,\ell} \\ & & & 0 \end{bmatrix}$ ,   (3.6.25)

where D is a block diagonal matrix and the matrices E and F are

strictly lower and upper block triangular matrices respectively, with

$A = D - E - F$ .   (3.6.26)

Now, for the column vector $\underline{u}$ in equation (3.6.23) we define

column vectors $\underline{U}_1, \underline{U}_2, \ldots, \underline{U}_\ell$, where $\underline{U}_s$ is formed from $\underline{u}$ by deleting

all elements of $\underline{u}$ except those corresponding to group s. Similarly,

we define column vectors $\underline{C}_1, \underline{C}_2, \ldots, \underline{C}_\ell$ for the given vector $\underline{b}$, and we

also define the submatrices $A_{i,j}$, for i,j=1,2,...,$\ell$, such that each

$A_{i,j}$ is formed from the matrix A by deleting all rows except those

corresponding to $G_i$ and all columns except those corresponding to $G_j$.

The system (3.6.23) can now evidently be written in the equivalent

form,

$\sum_{j=1}^{\ell} A_{i,j}\underline{U}_j = \underline{C}_i$ ,  i = 1,...,$\ell$ .   (3.6.27)

If we suppose that all the submatrices $A_{i,i}$ are non-singular, then

the block Jacobi iterative method is defined by,

$A_{i,i}\underline{U}_i^{(k+1)} = -\sum_{\substack{j=1 \\ j\neq i}}^{\ell} A_{i,j}\underline{U}_j^{(k)} + \underline{C}_i$ ,   (3.6.28)

or, equivalently,

$\underline{U}_i^{(k+1)} = \sum_{\substack{j=1 \\ j\neq i}}^{\ell} B_{i,j}\underline{U}_j^{(k)} + \underline{V}_i$ ,   (3.6.29)

where,

$B_{i,j} = \begin{cases} -A_{i,i}^{-1}A_{i,j} , & \text{if } i \neq j \\ 0 , & \text{if } i = j \end{cases}$   (3.6.30)

and

$\underline{V}_i = A_{i,i}^{-1}\underline{C}_i$ .   (3.6.31)

We may write (3.6.28) in the matrix form,

$\underline{u}^{(k+1)} = B^{(\pi)}\underline{u}^{(k)} + \underline{v}$ ,

where,

$B^{(\pi)} = D^{-1}(E+F)$  and  $\underline{v} = D^{-1}\underline{b}$ ,

and E and F are again the strictly lower and upper block triangular

matrices of (3.6.25). In a similar manner to the point case, we can

derive the block GS, JOR, SOR and SSOR methods.

The determination of a suitable value for the relaxation factor

$\omega$ of the SOR method is of paramount importance, and in particular the

optimum value of $\omega$, denoted by $\omega_b$, which minimizes the spectral radius

of the SOR iteration matrix and thereby maximizes the rate of

convergence of the method.

Young [Young, 1954] proved that when a matrix possesses property

A, it can be transformed into what he termed a consistently

ordered matrix. Under this condition the eigenvalues $\lambda$ of the SOR

iteration matrix $L_\omega$ associated with A are related to the eigenvalues $\mu$

of the corresponding Jacobi iteration matrix B by the equation,

$(\lambda + \omega - 1)^2 = \omega^2\mu^2\lambda$ ,   (3.6.32)

from which it can be seen that,

$\lambda - \omega\mu\lambda^{1/2} + \omega - 1 = 0$ .   (3.6.33)

If we assume that A is symmetric, then the eigenvalues of B are real


and occur in pairs $\pm\mu$. Let $\bar{\mu}$ denote the largest eigenvalue of B. It

can be shown that $\omega_b$, defined by,

$\bar{\mu}^2\omega_b^2 - 4(\omega_b - 1) = 0$ ,   (3.6.34a)

or equivalently,

$\omega_b = \frac{2}{1+\sqrt{1-\bar{\mu}^2}}$ ,   (3.6.34b)

is the value of $\omega$ which minimizes $\rho(L_\omega)$; i.e., if $\omega \neq \omega_b$, then

$\rho(L_\omega) > \rho(L_{\omega_b})$ .   (3.6.35)

Using this optimum value of $\omega$, it can also be shown that,

$\rho(L_{\omega_b}) = \omega_b - 1 = \frac{1-\sqrt{1-\bar{\mu}^2}}{1+\sqrt{1-\bar{\mu}^2}}$ .   (3.6.36)

Moreover, for any $\omega$ in the range 0<$\omega$<2, we have,

$\rho(L_\omega) = \begin{cases} \left[\frac{\omega\bar{\mu} + (\omega^2\bar{\mu}^2 - 4(\omega-1))^{1/2}}{2}\right]^2 , & 0 < \omega \leq \omega_b \\ \omega - 1 , & \omega_b \leq \omega < 2 . \end{cases}$   (3.6.37)

The estimation of $\omega_b$ depends on whether $\rho(B)$ or $\rho(\mathcal{L})$ can be estimated.

Several methods have been suggested by Carré [Carré, 1961], Varga

[Varga, 1962] and Hageman and Kellogg [Hageman, 1968].
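When $\bar{\mu}$ is known, (3.6.34b) gives $\omega_b$ directly. The sketch below (an added illustration; the value $\bar{\mu} = \cos(\pi h)$ for the five-point Laplace model problem is quoted as a standard result, not derived in this section):

```python
import math

def optimal_omega(mu_bar: float) -> float:
    """Optimum SOR relaxation factor, equation (3.6.34b)."""
    return 2.0 / (1.0 + math.sqrt(1.0 - mu_bar ** 2))

h = 1.0 / 20.0                     # mesh spacing of a model Laplace problem
mu_bar = math.cos(math.pi * h)     # Jacobi spectral radius (standard result)
w_b = optimal_omega(mu_bar)
rho = w_b - 1.0                    # rho(L_{w_b}), equation (3.6.36)
print(w_b, rho)                    # roughly 1.73 and 0.73 for h = 1/20
```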


3.6.4 Alternating Direction Implicit (ADI) Methods

To increase the rate of convergence, another form of splitting was

suggested, which consists of an initial first half iteration in the row

direction followed by a second half iteration in the column direction.

Such methods are aptly designated Alternating Direction Implicit

methods, or ADI methods for short.

The first ADI methods were developed by Peaceman and Rachford (PR)

[Peaceman, 1955] for solving the matrix equation,

$A\underline{u} = \underline{b}$ ,   (3.6.38)

where the (n$\times$n) matrix A is non-singular and can be represented as the

sum of three (n$\times$n) matrices,

$A = H + V + \Sigma$ .   (3.6.39)

We can make the following assertions about these matrices:

1. A is a real, symmetric, positive definite, irreducible matrix
   with non-positive off-diagonal entries, derived from the
   discretisation of a two-dimensional elliptic operator.

2. H and V are real, symmetric, diagonally dominant matrices with
   positive diagonal entries and non-positive off-diagonal entries,
   derived from the row and column discretisation of the elliptic
   operator respectively.

3. $\Sigma$ is a non-negative diagonal matrix.

In a typical situation H and V would be tridiagonal, or could be

made so by a permutation of the rows and corresponding columns.

By using (3.6.39) we can write the matrix equation (3.6.38) as the

pair of matrix equations,

$(H + \frac{1}{2}\Sigma + rI)\underline{u} = \underline{b} - (V + \frac{1}{2}\Sigma - rI)\underline{u}$ ,
$(V + \frac{1}{2}\Sigma + rI)\underline{u} = \underline{b} - (H + \frac{1}{2}\Sigma - rI)\underline{u}$ ,   (3.6.40)

for any positive scalar r. If we let,

$H_1 = H + \frac{1}{2}\Sigma$  and  $V_1 = V + \frac{1}{2}\Sigma$ ,

then the Peaceman-Rachford Alternating Direction Implicit method is

defined by,

$(H_1 + r_{k+1}I)\underline{u}^{(k+\frac{1}{2})} = \underline{b} - (V_1 - r_{k+1}I)\underline{u}^{(k)}$ ,
$(V_1 + r_{k+1}I)\underline{u}^{(k+1)} = \underline{b} - (H_1 - r_{k+1}I)\underline{u}^{(k+\frac{1}{2})}$ ,   (3.6.41)

where the r's are positive acceleration parameters chosen to make the

process converge rapidly.

Since the matrices $(H_1 + r_{k+1}I)$ and $(V_1 + r_{k+1}I)$ are, after suitable

permutations, tridiagonal non-singular matrices, the above implicit

process can be directly carried out by a simple algorithm based on

GE. Indeed, the ADI method is derived from the observation that

for the first equation of (3.6.40) we solve first along horizontal mesh

lines, and then for the second equation of (3.6.40) we solve along

vertical mesh lines. The vector $\underline{u}^{(k+\frac{1}{2})}$ is treated as an auxiliary

vector which is discarded as soon as it has been used in the

calculation of $\underline{u}^{(k+1)}$.

The two equations of (3.6.41) are now combined to give,

$\underline{u}^{(k+1)} = T_{r_{k+1}}\underline{u}^{(k)} + \underline{g}_{r_{k+1}}(\underline{b})$ ,  k$\geq$0 ,   (3.6.42a)

where,

$\underline{g}_r(\underline{b}) = (V_1 + rI)^{-1}[I - (H_1 - rI)(H_1 + rI)^{-1}]\underline{b}$ ,   (3.6.42b)

and

$T_r = (V_1 + rI)^{-1}(H_1 - rI)(H_1 + rI)^{-1}(V_1 - rI)$ .   (3.6.42c)

The convergence of the ADI scheme can be easily shown for the

stationary case with constant parameters r.=r. If we let, 1

then by similarity, T and T have the same eigenvalues. Hence from r r

(3.6.42c) we obtain,

P (Tr ) = P (Tr ) ~ 11 Tr 11

~ 11 (Hl-rI) (Hl+rI)-lll 11 (Vl-rI) (vl+rI)-lll ,

where P(Tr

) is the spectral radius of Tr

. Since HI and VI are

symmetric and positive definite, then in the L2 norm, we find,

Ir-~.

Ir+~: < I ,

where ~i' l~i~n are the eigenvalues (positive) of HI' A similar

argument applies to the norm involving VI' Hence, p(T )<1 for all r>O, r.

and therefore the PR iteration (3.6.41) converges.
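To make the structure of (3.6.41) concrete, the following is a minimal Python sketch (illustrative, not taken from the thesis) of one stationary PR-ADI iteration; H1 and V1 stand for $H+\frac{1}{2}\Sigma$ and $V+\frac{1}{2}\Sigma$ and are assumed to be supplied, and dense solves stand in for the cheap tridiagonal GE sweeps the method actually uses.

```python
import numpy as np

# One Peaceman-Rachford ADI iteration (3.6.41) with a constant parameter r.
# H1 = H + Sigma/2 and V1 = V + Sigma/2 are assumed given; dense solves are
# used here for brevity instead of the tridiagonal elimination sweeps.
def pr_adi_step(u, b, H1, V1, r):
    I = np.eye(len(u))
    u_half = np.linalg.solve(H1 + r * I, b - (V1 - r * I) @ u)     # row sweep
    return np.linalg.solve(V1 + r * I, b - (H1 - r * I) @ u_half)  # column sweep
```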

3.6.5 Alternating Group Explicit (AGE) Method

As we have already seen in the subsection (3.6.4), the ADI method

was developed to obtain the solution implicitly in the horizontal and

vertical directions.

Evans [Evans, 1985], however, derived another method, the analysis of which is analogous to the ADI scheme. This explicit iterative method employs a fractional splitting strategy on tridiagonal systems of difference schemes and has proved to be stable.


Its rate of convergence is governed by the acceleration parameter r.

Let us recall equation (3.6.1), where the matrix A and the two vectors $\underline{u}$ and $\underline{b}$ are given by,
$$A = \begin{bmatrix} d & c & & & 0 \\ a & d & c & & \\ & \ddots & \ddots & \ddots & \\ & & a & d & c \\ 0 & & & a & d \end{bmatrix}_{(n\times n)} , \quad (3.6.43)$$
$$\underline{u} = (u_1,u_2,\ldots,u_n)^T \quad \text{and} \quad \underline{b} = (b_1,b_2,\ldots,b_n)^T .$$

We now consider a class of methods for solving the system (3.6.1), which is based on the new splitting of the matrix A into the sum of the matrices,
$$A = G_1 + G_2 , \quad (3.6.44)$$
where $G_1$ and $G_2$ satisfy the condition that $G_1+rI$ and $G_2+rI$ are non-singular for any $r>0$, and $G_1$, $G_2$ are given by,
$$G_1 = \mathrm{diag}\left(\begin{bmatrix} \frac{d}{2} & c \\ a & \frac{d}{2} \end{bmatrix}, \ldots, \begin{bmatrix} \frac{d}{2} & c \\ a & \frac{d}{2} \end{bmatrix}\right)_{(n\times n)} , \quad (3.6.45a)$$
$$G_2 = \mathrm{diag}\left(\frac{d}{2}, \begin{bmatrix} \frac{d}{2} & c \\ a & \frac{d}{2} \end{bmatrix}, \ldots, \begin{bmatrix} \frac{d}{2} & c \\ a & \frac{d}{2} \end{bmatrix}, \frac{d}{2}\right)_{(n\times n)} , \quad (3.6.45b)$$
if n is even (odd number of intervals), where the (2×2) blocks of $G_1$ couple the pairs of points (1,2),(3,4),...,(n-1,n) and those of $G_2$ couple the pairs (2,3),(4,5),...,(n-2,n-1); and,
$$G_1 = \mathrm{diag}\left(\frac{d}{2}, \begin{bmatrix} \frac{d}{2} & c \\ a & \frac{d}{2} \end{bmatrix}, \ldots, \begin{bmatrix} \frac{d}{2} & c \\ a & \frac{d}{2} \end{bmatrix}\right)_{(n\times n)} , \quad (3.6.46a)$$
$$G_2 = \mathrm{diag}\left(\begin{bmatrix} \frac{d}{2} & c \\ a & \frac{d}{2} \end{bmatrix}, \ldots, \begin{bmatrix} \frac{d}{2} & c \\ a & \frac{d}{2} \end{bmatrix}, \frac{d}{2}\right)_{(n\times n)} , \quad (3.6.46b)$$
if n is odd (even number of intervals), with $hd = d/2$ denoting the halved diagonal entry.

It is assumed that the following conditions are satisfied:
i. $G_1+rI$ and $G_2+rI$ are non-singular for any $r>0$;
ii. for any vectors $\underline{v}_1$ and $\underline{v}_2$ and for any $r>0$, the systems,
$$(G_1+rI)\underline{u} = \underline{v}_1 \quad \text{and} \quad (G_2+rI)\underline{u} = \underline{v}_2 ,$$
are more easily solved in explicit form, since they consist of only (2×2) subsystems.

We shall be concerned here with the situation where $G_1$ and $G_2$ are either small (2×2) block systems or can be made so by a suitable permutation of their rows and corresponding columns. This procedure is convenient in the sense that the work required is much less than would be required to solve the original system (3.6.1) directly.
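The sense in which these (2×2) subsystems are "explicit" can be seen from a closed-form solve; the following small sketch (an illustration, not the thesis code, with $w = d/2 + r$) shows the idea.

```python
# Closed-form solution of one (2x2) AGE subsystem
#   [w  c] [x1]   [r1]
#   [a  w] [x2] = [r2],   with w = d/2 + r and det = w*w - a*c,
# so each pair of points is updated without any elimination sweeps.
def solve_2x2(w, a, c, r1, r2):
    det = w * w - a * c
    return (w * r1 - c * r2) / det, (w * r2 - a * r1) / det
```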

By using (3.6.44) we can write the matrix equation (3.6.1) in the form,
$$(G_1+G_2)\underline{u} = \underline{b} . \quad (3.6.47)$$
Following a strategy similar to the ADI method, $\underline{u}^{(k+\frac{1}{2})}$ and $\underline{u}^{(k+1)}$ can be determined implicitly by,
$$(G_1+rI)\underline{u}^{(k+\frac{1}{2})} = \underline{b} - (G_2-rI)\underline{u}^{(k)} ,$$
$$(G_2+rI)\underline{u}^{(k+1)} = \underline{b} - (G_1-rI)\underline{u}^{(k+\frac{1}{2})} , \quad (3.6.48a)$$
or explicitly by,
$$\underline{u}^{(k+\frac{1}{2})} = (G_1+rI)^{-1}[\underline{b} - (G_2-rI)\underline{u}^{(k)}] ,$$
$$\underline{u}^{(k+1)} = (G_2+rI)^{-1}[\underline{b} - (G_1-rI)\underline{u}^{(k+\frac{1}{2})}] . \quad (3.6.48b)$$
If we combine the above two equations into the form,
$$\underline{u}^{(k+1)} = T_r\underline{u}^{(k)} + \underline{b}_r , \quad (3.6.49)$$
where,
$$T_r = (G_2+rI)^{-1}(G_1-rI)(G_1+rI)^{-1}(G_2-rI) , \quad (3.6.50)$$
and
$$\underline{b}_r = (G_2+rI)^{-1}\left[I - (G_1-rI)(G_1+rI)^{-1}\right]\underline{b} ,$$
then the matrix $T_r$ is called the AGE iteration matrix.

We now seek to analyse the convergence properties of the AGE method. Let us assume that $\underline{\bar{u}}$ is the exact solution; then,
$$(G_1+rI)\underline{\bar{u}} = \underline{b} - (G_2-rI)\underline{\bar{u}} \quad \text{and} \quad (G_2+rI)\underline{\bar{u}} = \underline{b} - (G_1-rI)\underline{\bar{u}} . \quad (3.6.51)$$
Let $\underline{e}^{(k)} = \underline{u}^{(k)} - \underline{\bar{u}}$ be the error vector associated with the vector iterate $\underline{u}^{(k)}$. Therefore from (3.6.48b) and (3.6.51) we have,
$$(G_1+rI)\underline{e}^{(k+\frac{1}{2})} = -(G_2-rI)\underline{e}^{(k)} ,$$
similarly,
$$(G_2+rI)\underline{e}^{(k+1)} = -(G_1-rI)\underline{e}^{(k+\frac{1}{2})} ,$$
and hence,
$$\underline{e}^{(k+1)} = T_r\underline{e}^{(k)} ,$$
where $T_r$ is given in (3.6.50).
To indicate the convergence properties of $T_r$, we have the following theorem.

Theorem 3.2
If $G_1$ and $G_2$ are real positive definite matrices and if $r>0$ then $\rho(T_r)<1$, where $\rho(T_r)$ is the spectral radius of $T_r$.
Proof: If we define the matrix $\tilde{T}_r$ as,
$$\tilde{T}_r = (G_2+rI)T_r(G_2+rI)^{-1} = (G_1-rI)(G_1+rI)^{-1}(G_2-rI)(G_2+rI)^{-1} ,$$
then it is evident that $\tilde{T}_r$ is similar to $T_r$, and hence from the properties of matrix norms we have,
$$\rho(T_r) = \rho(\tilde{T}_r) \le \|(G_1-rI)(G_1+rI)^{-1}\|\,\|(G_2-rI)(G_2+rI)^{-1}\| .$$
However, since $G_1$ and $G_2$ are symmetric and since $(G_1-rI)$ commutes with $(G_1+rI)^{-1}$, we have,
$$\|(G_1-rI)(G_1+rI)^{-1}\|_2 = \rho\big((G_1-rI)(G_1+rI)^{-1}\big) = \max_{\lambda}\frac{|\lambda-r|}{\lambda+r} ,$$
where $\lambda$ ranges over all eigenvalues of $G_1$. But since $G_1$ is positive definite, its eigenvalues are positive. Therefore, $\|(G_1-rI)(G_1+rI)^{-1}\|_2 < 1$. The same argument applied to the corresponding matrix product with $G_2$ shows that $\|(G_2-rI)(G_2+rI)^{-1}\|_2 < 1$, and we therefore conclude that,
$$\rho(T_r) = \rho(\tilde{T}_r) \le \|\tilde{T}_r\|_2 < 1 ,$$
hence the convergence follows.

It is possible to determine the optimum parameter r such that the bound for $\rho(T_r)$ is minimised. Let us assume that $G_1$ and $G_2$ are real positive definite matrices and that the eigenvalues $\lambda$ of $G_1$ and $\mu$ of $G_2$ lie in the ranges,
$$0 < a \le \lambda, \mu \le b . \quad (3.6.52)$$
Evidently, if $r>0$ we have,
$$\|\tilde{T}_r\|_2 \le \left(\max_{a\le\lambda\le b}\frac{|\lambda-r|}{\lambda+r}\right)\left(\max_{a\le\mu\le b}\frac{|\mu-r|}{\mu+r}\right) = \left[\max_{a\le y\le b}\frac{|y-r|}{y+r}\right]^2 = \phi(a,b;r) . \quad (3.6.53)$$
Since $(y-r)/(y+r)$ is an increasing function of y we have,
$$\max_{a\le y\le b}\frac{|y-r|}{y+r} = \max\left(\frac{|a-r|}{a+r}, \frac{|b-r|}{b+r}\right) .$$
When $r=\sqrt{ab}$, then,
$$\frac{|a-r|}{a+r} = \frac{|b-r|}{b+r} = \frac{\sqrt{b}-\sqrt{a}}{\sqrt{b}+\sqrt{a}} .$$
Moreover, if $0<r<\sqrt{ab}$, we have,
$$\frac{|b-r|}{b+r} - \frac{\sqrt{b}-\sqrt{a}}{\sqrt{b}+\sqrt{a}} = \frac{2\sqrt{b}\,(\sqrt{ab}-r)}{(b+r)(\sqrt{b}+\sqrt{a})} > 0 ,$$
and if $\sqrt{ab}<r$, then,
$$\frac{|a-r|}{a+r} - \frac{\sqrt{b}-\sqrt{a}}{\sqrt{b}+\sqrt{a}} = \frac{2\sqrt{a}\,(r-\sqrt{ab})}{(r+a)(\sqrt{b}+\sqrt{a})} > 0 .$$
Therefore $\phi(a,b;r)$ is minimised when $r=\sqrt{ab}$. Thus, $r=\sqrt{ab}$ is optimum in the sense that the bound $\phi(a,b;r)$ for $\rho(T_r)$ is minimised.
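A quick numerical check of this result is easily made; the values a=0.5 and b=2.0 below are illustrative assumptions.

```python
# phi(a, b; r) of (3.6.53) evaluated near the optimum r = sqrt(a*b).
a, b = 0.5, 2.0
phi = lambda r: max(abs(a - r) / (a + r), abs(b - r) / (b + r)) ** 2
r_opt = (a * b) ** 0.5                    # = 1.0 for these values
print(phi(0.8), phi(r_opt), phi(1.2))     # the middle value is the smallest
```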

For an efficient implementation of the AGE algorithm, it is essential to vary the acceleration parameters $r_k$ from iteration to iteration. This will result in a substantial improvement in the rate of convergence of the AGE method. Equation (3.6.48a) can then be written as,
$$(G_1+r_{k+1}I)\underline{u}^{(k+\frac{1}{2})} = \underline{b} - (G_2-r_{k+1}I)\underline{u}^{(k)} ,$$
$$(G_2+r_{k+1}I)\underline{u}^{(k+1)} = \underline{b} - (G_1-r_{k+1}I)\underline{u}^{(k+\frac{1}{2})} , \quad k\ge 0 . \quad (3.6.54)$$
The best values of $r_{k+1}$ can be ascertained provided $G_1$ and $G_2$ are commutative (see Evans, 1985b).


CHAPTER 4

STEADY STATE PROBLEMS,

PARALLEL EXPLORATIONS

It isn't really
Anywhere!
It's somewhere else
Instead!
A.A. Milne.


4.1 INTRODUCTION

Previously, in Section (3.6), we concluded that the point (explicit) methods have natural extensions to block iterative processes in which groups of components of $\underline{x}^{(k)}$ are modified simultaneously. A faster rate of convergence may be obtained if a group of points is evaluated at once in one iteration step, rather than solving for the individual points.

In this chapter, a method involving a (2×2) block is implemented in parallel. Three parallel strategies for the AGE method are presented, for solving one-, two- or higher-dimensional boundary value problems; these were developed and implemented on the Balance 8000 system. The AGE method is an ideal parallel algorithm since it can be subdivided into a number of small independent tasks, and these tasks can be executed at the same time without interfering with each other. The strategies include synchronous and asynchronous versions of the algorithms. The results from these implementations are compared, and the timing results and the performance analysis of the best strategy are presented.

A one dimensional boundary value problem with a boundary condition involving a derivative is also solved, using the AGE iterative method with the D'Yakonov splitting formula. The same three parallel strategies were also implemented for this kind of problem.
The Sturm-Liouville problem,
$$\frac{d}{dx}\left(p(x)\frac{dU}{dx}\right) - q(x)U + \lambda\rho(x)U = 0 ,$$
where p, q and $\rho$ are real functions of x in the interval $a\le x\le b$, subject to boundary conditions at the points a and b, is also solved using the AGE algorithm, where the parameter $\lambda$ and the solution vector U are determined.

4.2 PARALLEL AGE EXPLOITATION
The basic concepts of the AGE scheme for solving a system of equations were presented in subsection (3.6.5). In this section we proceed with the parallel implementation of this method.

Three strategies are investigated and implemented to solve one and two dimensional boundary value problems. The strategies are mainly concerned with the way the problem to be solved is decomposed into many tasks that can be run in parallel. In the first two strategies the problem is solved by decomposing its interval into subsets and assigning each subset to a different processor, so that the subsets can be processed in parallel, whilst in the third strategy the problem is solved by decomposing its domain into partitions and assigning each partition to a different processor.

These three strategies are programmed on the Balance MIMD system using both the synchronous and asynchronous approaches. The results from the implementations of these approaches, such as the time needed to solve the problem, the number of iterations required and the 'speed-up' ratios, are obtained and compared.
In all these strategies, shared memory is used to hold the input, the results from the first sweep and the final output component values. These values can then be accessed by the different processes. Before a process iterates on its task, it first needs to read all its components into private memory; it then releases all the values of the components for the next iteration. In the different parallel versions, different mesh sizes are evaluated. The results shown in this chapter are an average of many runs.

To demonstrate these strategies, let us consider the differential equation,
$$-\frac{d^2U}{dx^2} + q(x)U = f(x) , \quad (4.2.1)$$
subject to the two-point boundary conditions,
$$U(a) = \alpha , \quad U(b) = \beta , \quad (4.2.2)$$
where $\alpha$ and $\beta$ are given real constants, and f(x) and q(x) are given real continuous functions on $a\le x\le b$, with $q(x)\ge 0$.
For simplicity, we place a uniform mesh of size h, where
$$h = \frac{(b-a)}{(n+1)} ,$$
on the interval $a\le x\le b$, and we denote the mesh points of the discrete problem by,
$$x_i = a + ih , \quad 0\le i\le n+1 ,$$
as illustrated in Figure 4.1 below.
[FIGURE 4.1: the mesh points $x_0=a, x_1, \ldots, x_{n-1}, x_n, x_{n+1}=b$ on the interval $a\le x\le b$.]
Using Table 3.1, the finite difference replacement of equation (4.2.1) is given by,
$$-u_{i-1} + (2+h^2q_i)u_i - u_{i+1} = h^2f(x_i) , \quad (4.2.3)$$
which can be written in matrix notation as,
$$A\underline{u} = \underline{b} , \quad (4.2.4)$$
where A is as in (3.6.43), with $d = 2+h^2q_i$ and $a=c=-1$. If we split A into the component matrices $G_1$ and $G_2$, as defined in (3.6.45a,b), we can find $\underline{u}^{(k+\frac{1}{2})}$ and $\underline{u}^{(k+1)}$ explicitly, as in (3.6.48b).

Figures 4.2 and 4.3 show the computational molecules for the $(k+\frac{1}{2})$th and $(k+1)$th sweeps (the first and second sweeps).
[FIGURE 4.2: Explicit Computational Molecule for the (k+½)th Sweep.]
[FIGURE 4.3: Explicit Computational Molecule for the (k+1)th Sweep.]

In the first strategy, the mesh of points (Figure 4.1) is decomposed into subsets of points, each of which is assigned to a processor. Each processor then computes its own subset in two sweeps. In the first sweep, two successive points (in their natural order) are evaluated at a time, starting from the first two points and terminating after evaluating the last two points. The second sweep is started after the first sweep has been completed. In the second sweep we evaluate the first point on its own, then each two successive points at a time, and finally the last point on its own.
As an example, consider the interval shown in Figure 4.1. If we start evaluating the mesh points by taking a pair of points at a time, then the order of the first sweep is,
$$(x_1,x_2), (x_3,x_4), \ldots, (x_{n-1},x_n) .$$
After the completion of the first sweep, the ordering of the mesh points in the second sweep will be,
$$(x_1), (x_2,x_3), (x_4,x_5), \ldots, (x_{n-2},x_{n-1}), (x_n) .$$
So, a single iteration is terminated after evaluating all the points in the given interval in both the first and second sweeps; a minimal sketch of these orderings is given below. A test of convergence is carried out by one processor. If all the components of the mesh are obtained with the required accuracy then the procedure terminates, otherwise further iterations are needed until all the components have converged.
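The two orderings can be generated as follows; this is an illustrative sketch (assuming n even), not the thesis implementation.

```python
# Groupings of the mesh points x_1..x_n for the two sweeps of the
# first strategy (n assumed even).
def first_sweep_pairs(n):
    # (x1,x2), (x3,x4), ..., (x_{n-1}, x_n)
    return [(i, i + 1) for i in range(1, n, 2)]

def second_sweep_groups(n):
    # (x1), (x2,x3), (x4,x5), ..., (x_{n-2}, x_{n-1}), (xn)
    return [(1,)] + [(i, i + 1) for i in range(2, n - 1, 2)] + [(n,)]

print(first_sweep_pairs(8))    # [(1, 2), (3, 4), (5, 6), (7, 8)]
print(second_sweep_groups(8))  # [(1,), (2, 3), (4, 5), (6, 7), (8,)]
```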

As in the first strategy, in the second strategy the problem mesh of points (Figure 4.1) is decomposed into subsets, each of which is assigned to a processor. Each processor computes its own subset in two sweeps. In the first sweep, we evaluate each two successive points in an odd-even (red-black) manner. That is, we first evaluate all the odd points, followed by all the even points. In the second sweep, the evaluation is carried out in the same manner as in the first sweep, i.e., the odd points are evaluated first, followed by the even ones.
As an example, suppose that n=10 (in Figure 4.1). Then in the first sweep we start by evaluating the odd mesh points in the order $x_1, x_3, x_5, x_7$ and $x_9$, followed by the even points in the order $x_2, x_4, x_6, x_8$ and $x_{10}$. Likewise, in the second sweep we first evaluate the odd points in the order $x_1, x_3, x_5, x_7$ and $x_9$, followed by the even points in the order $x_2, x_4, x_6, x_8$ and $x_{10}$.
Then, a test of convergence is carried out by one processor, and if all the components of the mesh points are obtained with the required accuracy then the procedure terminates, otherwise further iterations are needed until convergence is achieved.

The third strategy is completely different from the other two strategies in the manner of partitioning the domain (domain decomposition) over the processors. The problem domain is partitioned into subtasks, and each of these tasks is assigned to a processor. If P is the number of processors and n is the number of points in the interval (Figure 4.1), then each partition contains n/P points. Each processor then computes its own n/P points in two sweeps asynchronously, without waiting for the other processors to complete their computations. At the end of each iteration, each processor checks for convergence. If convergence is obtained, the processor sets its flag and tests the remaining flags to ensure that the other groups have also set their flags; otherwise further iterations are required, as sketched below.
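The flag protocol can be sketched schematically as follows, with shared-memory threads standing in for the Balance 8000 processes, and trivial counters standing in for the AGE sweeps and the local accuracy test; these details are assumptions for illustration only.

```python
import threading

P = 4                      # number of processors (assumed)
n_iters = [0] * P
converged = [False] * P    # one flag per partition, held in shared memory

def worker(p):
    while True:
        n_iters[p] += 1          # stands in for the two AGE sweeps on partition p
        if n_iters[p] >= 5 + p:  # stands in for the local convergence test
            converged[p] = True
        if all(converged):       # test the other flags; no barrier is used
            break

threads = [threading.Thread(target=worker, args=(p,)) for p in range(P)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(n_iters)                   # each partition iterates at its own pace
```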

4.3 EXPERIMENTAL RESULTS FOR THE ONE DIMENSIONAL PROBLEM
Consider the linear problem,
$$\frac{d^2U}{dx^2} + U = x , \quad (4.3.1)$$
subject to the boundary conditions,
$$U(0) = 1 , \quad U(\pi/2) = \frac{\pi}{2} - 1 . \quad (4.3.2)$$

The exact solution for this problem is given by,
$$U(x) = \cos x - \sin x + x . \quad (4.3.3)$$
By following the finite difference discretisation procedure given in Table 3.1, equation (4.3.1) can be approximated to obtain the linear difference equation,
$$-u_{i-1} + (2-h^2)u_i - u_{i+1} = -h^2x_i , \quad (4.3.4)$$
where $x_i = ih$, for $i=1,2,\ldots,n$.

The boundary conditions are replaced by the values,
$$u_0 = 1 , \quad u_{n+1} = \frac{\pi}{2} - 1 , \quad (4.3.5)$$
where
$$h = \frac{(\pi/2 - 0)}{(n+1)} .$$
The linear system (4.3.4) can be represented in matrix notation as,
$$A\underline{u} = \underline{b} , \quad (4.3.6)$$
where A is as in (3.6.43), with
$$\underline{u} = (u_1,u_2,\ldots,u_n)^T \quad \text{and} \quad \underline{b} = (b_1,b_2,\ldots,b_n)^T , \quad (4.3.7)$$
where $d = 2-h^2$, $c = a = -1$ and $b_i = -h^2x_i$, for $i=1,2,\ldots,n$.
If we take n even and split A into $G_1$ and $G_2$, equation (4.3.6) can be written as,
$$(G_1+G_2)\underline{u} = \underline{b} , \quad (4.3.8)$$
where $G_1$ and $G_2$ are as given in equation (3.6.45a,b), with
$$hd = 1 - \frac{h^2}{2} .$$

Hence, by applying the AGE method, $\underline{u}^{(k+\frac{1}{2})}$ and $\underline{u}^{(k+1)}$ can be determined successively by,
$$\underline{u}^{(k+\frac{1}{2})} = (G_1+rI)^{-1}[\underline{b} - (G_2-rI)\underline{u}^{(k)}] ,$$
$$\underline{u}^{(k+1)} = (G_2+rI)^{-1}[\underline{b} - (G_1-rI)\underline{u}^{(k+\frac{1}{2})}] , \quad (4.3.9)$$
where r is the iteration parameter. It is obvious that the (2×2) submatrices of $(G_1+rI)$, $(G_2+rI)$, $(G_1-rI)$ and $(G_2-rI)$ can be determined, and $(G_1+rI)^{-1}$, $(G_2+rI)^{-1}$ are easily invertible, as shown below. With $w = hd+r$ and $v = hd-r$,
$$(G_1+rI)^{-1} = \frac{1}{\det}\,\mathrm{diag}\left(\begin{bmatrix} w & 1 \\ 1 & w \end{bmatrix}, \ldots, \begin{bmatrix} w & 1 \\ 1 & w \end{bmatrix}\right) , \quad (4.3.10a)$$
$$(G_2+rI)^{-1} = \frac{1}{\det}\,\mathrm{diag}\left(\frac{\det}{w}, \begin{bmatrix} w & 1 \\ 1 & w \end{bmatrix}, \ldots, \begin{bmatrix} w & 1 \\ 1 & w \end{bmatrix}, \frac{\det}{w}\right) , \quad (4.3.10b)$$
$$(G_1-rI) = \mathrm{diag}\left(\begin{bmatrix} v & -1 \\ -1 & v \end{bmatrix}, \ldots, \begin{bmatrix} v & -1 \\ -1 & v \end{bmatrix}\right) , \quad (4.3.10c)$$
$$(G_2-rI) = \mathrm{diag}\left(v, \begin{bmatrix} v & -1 \\ -1 & v \end{bmatrix}, \ldots, \begin{bmatrix} v & -1 \\ -1 & v \end{bmatrix}, v\right) , \quad (4.3.10d)$$
where each (2×2) submatrix $\hat{G} = \begin{bmatrix} w & -1 \\ -1 & w \end{bmatrix}$ has the inverse $\hat{G}^{-1} = \frac{1}{\det}\begin{bmatrix} w & 1 \\ 1 & w \end{bmatrix}$, with $\det = (w^2-1)$.

Therefore, the vector $\underline{u}^{(k+1)}$ can be determined from $\underline{u}^{(k)}$ in two steps. We first determine $\underline{u}^{(k+\frac{1}{2})}$ explicitly as follows:
$$\begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \\ \vdots \\ u_{n-1} \\ u_n \end{bmatrix}^{(k+\frac{1}{2})} = \frac{1}{\det}\,\mathrm{diag}\left(\begin{bmatrix} w & 1 \\ 1 & w \end{bmatrix}, \ldots\right)\begin{bmatrix} b_1 - vu_1 \\ b_2 - vu_2 + u_3 \\ b_3 + u_2 - vu_3 \\ b_4 - vu_4 + u_5 \\ \vdots \\ b_{n-1} + u_{n-2} - vu_{n-1} \\ b_n - vu_n \end{bmatrix}^{(k)} . \quad (4.3.11a)$$
Now, by using the values of $\underline{u}^{(k+\frac{1}{2})}$ obtained above, we can determine $\underline{u}^{(k+1)}$ explicitly:
$$\begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ \vdots \\ u_{n-1} \\ u_n \end{bmatrix}^{(k+1)} = \frac{1}{\det}\,\mathrm{diag}\left(\frac{\det}{w}, \begin{bmatrix} w & 1 \\ 1 & w \end{bmatrix}, \ldots, \frac{\det}{w}\right)\begin{bmatrix} b_1 - vu_1 + u_2 \\ b_2 + u_1 - vu_2 \\ b_3 - vu_3 + u_4 \\ \vdots \\ b_{n-1} - vu_{n-1} + u_n \\ b_n + u_{n-1} - vu_n \end{bmatrix}^{(k+\frac{1}{2})} . \quad (4.3.11b)$$
The convergence test used was the average test,
$$\frac{|u_i^{(k+1)} - u_i^{(k)}|}{1 + |u_i^{(k)}|} < \varepsilon , \quad \text{for all } i ,$$
where $\varepsilon = 0.1\times10^{-4}$.

This problem was implemented in parallel on the Balance 8000 system using the three strategies described previously; a serial sketch of the two sweeps is given below. In all these parallel implementations a different number of points within the given interval was taken. The optimal iteration parameter r was also obtained from the numerical experiments, by choosing the value that gives the smallest number of iterations.
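As an illustration of the two explicit sweeps (4.3.9)-(4.3.11) and the average convergence test, the following serial Python sketch solves the model problem (4.3.1)-(4.3.2). It is a reconstruction under the stated formulas, not the thesis program, and the value of r is borrowed from Table 4.1.

```python
import numpy as np

# Serial AGE sweeps for  U'' + U = x,  U(0)=1, U(pi/2)=pi/2-1.
n = 48                               # number of interior points (even)
h = (np.pi / 2) / (n + 1)
x = (np.arange(n) + 1) * h
b = -h**2 * x                        # b_i = -h^2 x_i  (4.3.7)
b[0] += 1.0                          # boundary value u_0 = 1
b[-1] += np.pi / 2 - 1.0             # boundary value u_{n+1} = pi/2 - 1

r = 0.513                            # iteration parameter (Table 4.1, n=48)
hd = 1.0 - h**2 / 2                  # hd = d/2 with d = 2 - h^2  (4.3.8)
w, v = hd + r, hd - r
det = w * w - 1.0

u = np.zeros(n)
for it in range(1, 501):
    # first sweep: rhs = b - (G2 - rI)u, solved with the blocks of (G1 + rI)
    rhs = b.copy()
    rhs[0] -= v * u[0]
    rhs[-1] -= v * u[-1]
    for i in range(1, n - 1, 2):     # G2 couples the pairs (2,3),(4,5),...
        rhs[i] += -v * u[i] + u[i + 1]
        rhs[i + 1] += u[i] - v * u[i + 1]
    uh = np.empty(n)
    for i in range(0, n, 2):         # G1 couples the pairs (1,2),(3,4),...
        uh[i] = (w * rhs[i] + rhs[i + 1]) / det
        uh[i + 1] = (rhs[i] + w * rhs[i + 1]) / det

    # second sweep: rhs = b - (G1 - rI)u_half, solved with (G2 + rI)
    rhs = b.copy()
    for i in range(0, n, 2):
        rhs[i] += -v * uh[i] + uh[i + 1]
        rhs[i + 1] += uh[i] - v * uh[i + 1]
    un = np.empty(n)
    un[0] = rhs[0] / w
    un[-1] = rhs[-1] / w
    for i in range(1, n - 1, 2):
        un[i] = (w * rhs[i] + rhs[i + 1]) / det
        un[i + 1] = (rhs[i] + w * rhs[i + 1]) / det

    done = np.all(np.abs(un - u) / (1 + np.abs(u)) < 1e-5)  # average test
    u = un
    if done:
        break

print(it, np.max(np.abs(u - (np.cos(x) - np.sin(x) + x))))  # iterations, error
```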

The results from the first parallel synchronous strategy are illustrated in Table 4.1. From this table, it is clear that the optimal timing results are obtained when the number of subsets is equal to the number of available processors.
Table 4.2 shows the results obtained from implementing the second parallel synchronous strategy. By comparing the results of Tables 4.1 and 4.2, we notice that the times for the first strategy are less than those of the second strategy, i.e., evaluating the points using the first strategy takes less time to converge than with the second strategy. This is mainly due to the distribution of the components within each strategy. Firstly, we notice from its implementation that the number of computational operations in the second strategy is higher than in the first strategy. Secondly, there is a possibility in the second strategy that old values may be used during the evaluation of its components, which means extra iterations will be needed. In the first strategy, by contrast, the most recent values of the components are used in the evaluation process, and a greater rate of convergence is achieved.

No. of    No. of        Elapsed time      r        No. of        Speed-up
points    processors    in seconds                 iterations
12        1             2.293             0.495    17            1
          2             1.325                      17            1.730
          3             1.177                      17            1.948
          4             1.079                      17            2.125
          5             0.971                      17            2.361
          6             1.040                      17            2.204
          7             1.134                      17            2.022
          8             1.207                      17            1.899
          9             1.273                      17            1.801
48        1             7.423             0.513    25            1
          2             3.791                      25            1.958
          3             2.595                      25            2.595
          4             2.000                      25            3.711
          5             1.721                      25            4.313
          6             1.431                      25            5.187
          7             1.435                      25            5.172
          8             1.152                      25            6.443
          9             1.182                      25            6.280

TABLE 4.1: The Results from the First Synchronous Strategy

No. of    No. of        Elapsed time      r        No. of        Speed-up
points    processors    in seconds                 iterations
72        1             13.661            0.515    32            1
          2             6.89                       32            1.982
          3             4.643                      32            2.942
          4             3.533                      32            3.866
          5             3.128                      32            4.367
          6             2.419                      32            5.647
          7             2.403                      32            5.684
          8             2.052                      32            6.657
          9             1.855                      32            7.364
120       1             17.339            0.560    39            1
          2             8.928                      39            1.942
          3             6.112                      39            2.836
          4             4.544                      39            3.815
          5             3.750                      39            4.623
          6             3.336                      39            5.197
          7             2.947                      39            5.883
          8             2.567                      39            6.754
          9             2.202                      39            7.874

TABLE 4.1: The Results from the First Synchronous Strategy (continued)

No. of    No. of        Elapsed time      r        No. of        Speed-up
points    processors    in seconds                 iterations
12        1             2.64              0.495    19            1
          2             1.559                      21            1.693
          3             1.280                      21            2.062
          4             1.221                      21            2.162
          5             1.102                      21            2.395
          6             1.178                      21            2.241
          7             1.258                      21            2.098
          8             1.296                      21            2.037
          9             1.306                      21            2.021
48        1             7.614             0.513    25            1
          2             4.013                      28            1.897
          3             2.993                      28            2.543
          4             2.547                      28            2.989
          5             2.001                      28            3.805
          6             1.854                      28            4.106
          7             1.772                      28            4.296
          8             1.431                      28            5.320
          9             1.322                      28            5.759

TABLE 4.2: The Results from the Second Synchronous Strategy

No. of    No. of        Elapsed time      r        No. of        Speed-up
points    processors    in seconds                 iterations
72        1             13.992            0.515    33            1
          2             7.133                      35            1.961
          3             5.042                      35            2.775
          4             4.001                      35            3.497
          5             3.828                      35            3.655
          6             2.772                      35            5.047
          7             2.512                      36            5.570
          8             2.152                      35            6.501
          9             1.927                      35            7.261
120       1             17.443            0.560    44            1
          2             9.128                      44            1.910
          3             7.113                      44            2.452
          4             5.544                      44            3.146
          5             4.750                      44            3.672
          6             4.336                      44            4.022
          7             3.947                      44            4.419
          8             2.567                      44            6.795
          9             2.403                      44            7.258

TABLE 4.2: The Results from the Second Synchronous Strategy (continued)

The third strategy was implemented using the asynchronous approach. The results from this implementation are shown in Table 4.3. By comparing the results obtained from the first and third strategies, we notice that the asynchronous implementation takes less time to converge than the synchronous approach, and this is due to the synchronization overheads needed after each iteration in the synchronous implementation. Also, from both Tables 4.1 and 4.3, it is clear that better efficiency can be obtained by using the third strategy. This is because the speed-up ratios of the asynchronous implementation are higher than those of the synchronous one.
To conclude from the results of the three strategies, we can say that in the implementation of the one dimensional boundary value problem using the parallel AGE method, the best results are obtained when the problem domain is decomposed into a number of subsets, each of which is assigned to a processor, where the number of processors is equal to the number of subsets.
There are extra overheads incurred by the system which degrade the parallel algorithm's performance in both the synchronous and asynchronous implementations. These overheads are the generation of the parallel paths and the synchronization at the end of each iteration cycle.
The timing results obtained from these strategies are diagrammatically illustrated in Figure 4.4.

The computational complexity of the sequential algorithm for each point in each iteration is equal to (8 additions + 8 multiplications). Now, for a mesh of n points, each processor will evaluate n/P points, with a total computational complexity equal to T = [(8 additions + 8 multiplications) × n/P] operations per iteration, as the short check below illustrates.
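A tiny arithmetic illustration of this count (the values n=48 and P=4 are illustrative assumptions):

```python
# Per-processor operation count per iteration: 8 additions and
# 8 multiplications per point, with n/P points per processor.
n, P = 48, 4
ops_per_point = 8 + 8
print(ops_per_point * n // P)   # 192 operations per processor per iteration
```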

No. of    No. of        Elapsed time      r        No. of        Speed-up
points    processors    in seconds                 iterations
12        1             2.451             0.495    14            1
          2             1.244                      14            1.970
          3             1.068                      14            2.294
          4             0.821                      14            2.985
          5             0.699                      14            3.506
          6             0.599                      14            4.091
          7             0.506                      14            4.843
          8             0.473                      14            5.181
          9             0.426                      14            5.753
48        1             6.188             0.513    21            1
          2             3.109                      21            1.990
          3             2.088                      21            2.963
          4             1.574                      21            3.931
          5             1.245                      21            4.970
          6             1.052                      21            5.882
          7             0.989                      21            6.256
          8             0.936                      21            6.611
          9             0.893                      21            6.929

TABLE 4.3: The Results from the Third Asynchronous Strategy

No. of    No. of        Elapsed time      r        No. of        Speed-up
points    processors    in seconds                 iterations
72        1             13.915            0.515    28            1
          2             6.996                      29            1.988
          3             4.701                      29            2.960
          4             3.519                      29            3.954
          5             2.807                      29            4.957
          6             2.342                      29            5.941
          7             2.119                      29            6.566
          8             1.989                      29            6.995
          9             1.785                      29            7.795
120       1             19.663            0.563    37            1
          2             9.904                      37            1.985
          3             6.574                      38            2.991
          4             4.990                      38            3.940
          5             3.948                      38            4.980
          6             3.058                      38            6.430
          7             2.819                      37            6.975
          8             2.711                      37            7.253
          9             2.493                      37            7.887

TABLE 4.3: The Results from the Third Asynchronous Strategy (continued)

[FIGURE 4.4: The Timing Results for the One Dimensional Boundary Value Problem (elapsed time against number of processors for the first, second and third strategies, n=48).]


4.4 EXPERIMENTAL RESULTS FOR THE TWO DIMENSIONAL PROBLEM

The concept of the AGE method is now extended to the case of the two dimensional problem. Consider the Dirichlet problem on the region R,
$$\frac{\partial^2U}{\partial x^2} + \frac{\partial^2U}{\partial y^2} = 0 , \quad (x,y)\in R , \quad (4.4.1)$$
subject to the boundary conditions,
$$U(0,y) = U(1,y) = 0 , \quad 0\le y\le 1 , \quad (4.4.2a)$$
$$U(x,0) = U(x,1) = \sin\pi x , \quad 0\le x\le 1 . \quad (4.4.2b)$$
The exact solution of this problem is given by,
$$U(x,y) = \mathrm{sech}\,\frac{\pi}{2}\,\cosh\!\left(\pi(y-\tfrac{1}{2})\right)\sin\pi x . \quad (4.4.3)$$
By following the finite difference discretisation procedure given in Table 3.1, equation (4.4.1) can be approximated to,
$$-u_{i-1,j} - u_{i+1,j} + 4u_{i,j} - u_{i,j-1} - u_{i,j+1} = 0 , \quad 1\le i,j\le n , \quad (4.4.4)$$

where $x_i = ih$, $y_j = jh$, for $0\le i,j\le n+1$. If we assume that R is a regular region and that n is even, we order the $n^2$ internal mesh points row-wise, as shown in Figure 4.5.
[FIGURE 4.5: the unit square with boundary values 0 on x=0 and x=1, and sin πx on y=0 and y=1; the internal points are numbered row-wise 1,2,...,n in the first row, n+1,...,2n in the second row, and so on up to n².]

Applying (4.4.4) at each mesh point yields the system $A\underline{u} = \underline{b}$, where A is the block tridiagonal $(n^2\times n^2)$ matrix,
$$A = \begin{bmatrix} B & -I & & \\ -I & B & -I & \\ & \ddots & \ddots & \ddots \\ & & -I & B \end{bmatrix} , \quad B = \begin{bmatrix} 4 & -1 & & \\ -1 & 4 & -1 & \\ & \ddots & \ddots & \ddots \\ & & -1 & 4 \end{bmatrix}_{(n\times n)} , \quad (4.4.5)$$
$\underline{u} = (u_1,u_2,\ldots,u_{n^2})^T$, and the right-hand side vector $\underline{b}$ contains the boundary contributions $\sin\pi x_1,\ldots,\sin\pi x_n$ in its first n and last n components (from the rows adjacent to y=0 and y=1), and zeros elsewhere.

If we split A into the sum of its constituent symmetric and positive definite matrices $G_1$, $G_2$, $G_3$ and $G_4$, we have,
$$A = G_1 + G_2 + G_3 + G_4 , \quad (4.4.6)$$
where $G_1$ and $G_2$ contain the differences in the x-direction and $G_3$ and $G_4$ contain the differences in the y-direction, with
$$\mathrm{diag}(G_1) + \mathrm{diag}(G_2) + \mathrm{diag}(G_3) + \mathrm{diag}(G_4) = \mathrm{diag}(A) , \quad (4.4.7a,b)$$
each $G_i$ contributing the value 1 to the diagonal entries of A,

and, within each row of n mesh points,
$$G_1 = \mathrm{diag}\left(\begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}, \ldots, \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}\right) \quad \text{over the pairs } (1,2),(3,4),\ldots,(n-1,n) , \quad (4.4.8a)$$
$$G_2 = \mathrm{diag}\left(1, \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}, \ldots, \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}, 1\right) \quad \text{over the pairs } (2,3),(4,5),\ldots,(n-2,n-1) , \quad (4.4.8b)$$
the same pattern being repeated in each of the n rows, so that $G_1$ and $G_2$ are $(n^2\times n^2)$ block diagonal matrices.

Also, by reordering the points column-wise, i.e. along the y-direction, we find that $G_3$ and $G_4$ have the same structure as $G_1$ and $G_2$ respectively: $G_3$ consists of the (2×2) blocks $\begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}$ coupling the vertically adjacent pairs of points (i, i+n), and $G_4$ consists of unit diagonal entries for the points of the first and last rows together with the same (2×2) blocks coupling the intermediate vertical pairs. (4.4.8c,d)

Figures 4.6 and 4.7 illustrate the way the points are grouped in $G_1$, $G_2$, $G_3$ and $G_4$ in the x- and y-directions respectively.
[FIGURE 4.6: grouping of the mesh points into horizontal (2×2) pairs for G1 and G2, sweeping in the x-direction.]
[FIGURE 4.7: grouping of the mesh points into vertical (2×2) pairs for G3 and G4, sweeping in the y-direction.]

The Douglas-Rachford formula for the AGE fractional scheme then takes the form,
$$(G_1+rI)\underline{u}^{(k+1/4)} = (rI - G_1 - 2G_2 - 2G_3 - 2G_4)\underline{u}^{(k)} + 2\underline{b} , \quad (4.4.9a)$$
$$(G_2+rI)\underline{u}^{(k+1/2)} = G_2\underline{u}^{(k)} + r\underline{u}^{(k+1/4)} , \quad (4.4.9b)$$
$$(G_3+rI)\underline{u}^{(k+3/4)} = G_3\underline{u}^{(k)} + r\underline{u}^{(k+1/2)} , \quad (4.4.9c)$$
$$(G_4+rI)\underline{u}^{(k+1)} = G_4\underline{u}^{(k)} + r\underline{u}^{(k+3/4)} . \quad (4.4.9d)$$
Then we can obtain the values of $\underline{u}^{(k+1/4)}$, $\underline{u}^{(k+1/2)}$, $\underline{u}^{(k+3/4)}$ and $\underline{u}^{(k+1)}$ respectively from,
$$\underline{u}^{(k+1/4)} = (G_1+rI)^{-1}[(rI - G_1 - 2G_2 - 2G_3 - 2G_4)\underline{u}^{(k)} + 2\underline{b}] , \quad (4.4.10a)$$
$$\underline{u}^{(k+1/2)} = (G_2+rI)^{-1}[G_2\underline{u}^{(k)} + r\underline{u}^{(k+1/4)}] , \quad (4.4.10b)$$
$$\underline{u}^{(k+3/4)} = (G_3+rI)^{-1}[G_3\underline{u}^{(k)} + r\underline{u}^{(k+1/2)}] , \quad (4.4.10c)$$
$$\underline{u}^{(k+1)} = (G_4+rI)^{-1}[G_4\underline{u}^{(k)} + r\underline{u}^{(k+3/4)}] . \quad (4.4.10d)$$
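In outline, the four fractional steps of (4.4.10) can be transcribed as below. This dense-matrix sketch is an illustration only (the thesis instead exploits the (2×2) block form of each $(G_i+rI)$, as shown next), and the tiny 2×2 interior mesh of the demo and the value r=1.5 are likewise illustrative assumptions.

```python
import numpy as np

# One Douglas-Rachford AGE iteration (4.4.10a)-(4.4.10d), written with
# dense solves for clarity; G1..G4 are the four constituent matrices of A.
def dr_age_step(u, b, G1, G2, G3, G4, r):
    I = np.eye(len(u))
    u14 = np.linalg.solve(G1 + r*I, (r*I - G1 - 2*G2 - 2*G3 - 2*G4) @ u + 2*b)
    u12 = np.linalg.solve(G2 + r*I, G2 @ u + r * u14)
    u34 = np.linalg.solve(G3 + r*I, G3 @ u + r * u12)
    return np.linalg.solve(G4 + r*I, G4 @ u + r * u34)

# Demo on a 2x2 interior mesh, where A = G1+G2+G3+G4 as in (4.4.6):
K = np.array([[1.0, -1.0], [-1.0, 1.0]])
G1 = np.kron(np.eye(2), K)        # horizontal pairs within each row
G2 = np.eye(4)                    # row end-points (uncoupled for n=2)
G3 = np.kron(K, np.eye(2))        # vertical pairs within each column
G4 = np.eye(4)
b = np.ones(4)
u = np.zeros(4)
for _ in range(60):
    u = dr_age_step(u, b, G1, G2, G3, G4, r=1.5)
print(u, np.abs((G1 + G2 + G3 + G4) @ u - b).max())   # residual ~ 0
```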

Since the matrices $(G_i+rI)$, $i=1,2,3,4$, are all composed of (2×2) block submatrices, they are all easily invertible, as shown below. To find the inverses of the above systems for $G_1$, we let $w = 1+r$, so that each (2×2) block of $(G_1+rI)$ is
$$\hat{G} = \begin{bmatrix} w & -1 \\ -1 & w \end{bmatrix} , \quad \text{with} \quad \hat{G}^{-1} = \frac{1}{\det}\begin{bmatrix} w & 1 \\ 1 & w \end{bmatrix} , \quad \det = (w^2-1) .$$
Hence,
$$(G_1+rI)^{-1} = \frac{1}{\det}\,\mathrm{diag}\left(\begin{bmatrix} w & 1 \\ 1 & w \end{bmatrix}, \ldots, \begin{bmatrix} w & 1 \\ 1 & w \end{bmatrix}\right) , \quad (4.4.11a)$$
$$(G_2+rI)^{-1} = \frac{1}{\det}\,\mathrm{diag}\left(\frac{\det}{w}, \begin{bmatrix} w & 1 \\ 1 & w \end{bmatrix}, \ldots, \begin{bmatrix} w & 1 \\ 1 & w \end{bmatrix}, \frac{\det}{w}\right) , \quad (4.4.11b)$$
where the entries $\det/w$ correspond to the uncoupled points at the ends of each mesh row, the same pattern being repeated in every row.

The inverses $(G_1+rI)^{-1}$ and $(G_2+rI)^{-1}$ are numbered in the order $1,2,3,\ldots,n\times n$, in the x-direction. Also, $(G_3+rI)^{-1}$ is similar to (4.4.11a) and $(G_4+rI)^{-1}$ is similar to (4.4.11b), but they have a different ordering and are applied in the y-direction.
We now consider the iterative formulae of equation (4.4.10) at each of the four intermediate levels:
(i) For the solution at the first intermediate level (the (k+1/4)th iterate), equation (4.4.10a) is rearranged as,

$$\underline{u}^{(k+1/4)} = (G_1+rI)^{-1}\left[(rI - (G_1+2G_2+2G_3+2G_4))\underline{u}^{(k)} + 2\underline{b}\right] .$$
Since the coefficients of $\underline{u}^{(k)}$ are all matrices of the same size, we can add up their elements, paying attention to their direction, to form the matrix
$$C = G_1 + 2G_2 + 2G_3 + 2G_4 , \quad (4.4.12)$$
which has the value 7 along its diagonal, entries -1 and -2 coupling horizontally adjacent points (according to whether the pair belongs to $G_1$ or to $2G_2$), and entries -2 coupling vertically adjacent points.

Then,
$$\underline{u}^{(k+1/4)} = (G_1+rI)^{-1}\left[(rI-C)\underline{u}^{(k)} + 2\underline{b}\right] .$$
Now let $t = r-7$, where 7 is the value along the diagonal of C. Then the new coefficient matrix of $\underline{u}^{(k)}$ will have the same structure as C, but with the value t along the diagonal. Hence,
$$\underline{u}^{(k+1/4)} = \frac{1}{\det}\,\mathrm{diag}\left(\begin{bmatrix} w & 1 \\ 1 & w \end{bmatrix}, \ldots\right)\begin{bmatrix} tu_1^{(k)} + u_2^{(k)} + 2u_{n+1}^{(k)} + 2\sin\pi x_1 \\ u_1^{(k)} + tu_2^{(k)} + 2u_3^{(k)} + 2u_{n+2}^{(k)} + 2\sin\pi x_2 \\ \vdots \\ 2u_{n^2-n-1}^{(k)} + 2u_{n^2-2}^{(k)} + tu_{n^2-1}^{(k)} + u_{n^2}^{(k)} + 2\sin\pi x_{n-1} \\ 2u_{n^2-n}^{(k)} + u_{n^2-1}^{(k)} + tu_{n^2}^{(k)} + 2\sin\pi x_n \end{bmatrix} , \quad (4.4.13a)$$
where the omitted components follow the same pattern, each point receiving contributions from its horizontal and vertical neighbours and, for the points of the first and last rows, the boundary terms $2\sin\pi x_i$.

By carrying out these matrix-vector operations, we obtain the values of $\underline{u}^{(k+1/4)}$ at the first intermediate level.
(ii) For the solution at the second intermediate level (the (k+1/2)th iterate), equation (4.4.10b),
$$\underline{u}^{(k+1/2)} = (G_2+rI)^{-1}\left[G_2\underline{u}^{(k)} + r\underline{u}^{(k+1/4)}\right] ,$$
can be written in matrix form as,
$$\underline{u}^{(k+1/2)} = \frac{1}{\det}\,\mathrm{diag}\left(\frac{\det}{w}, \begin{bmatrix} w & 1 \\ 1 & w \end{bmatrix}, \ldots, \frac{\det}{w}\right)\begin{bmatrix} u_1^{(k)} + ru_1^{(k+1/4)} \\ u_2^{(k)} - u_3^{(k)} + ru_2^{(k+1/4)} \\ -u_2^{(k)} + u_3^{(k)} + ru_3^{(k+1/4)} \\ \vdots \\ u_{n^2}^{(k)} + ru_{n^2}^{(k+1/4)} \end{bmatrix} , \quad (4.4.13b)$$
where the components follow the (2×2) grouping of $G_2$ within each mesh row. The multiplication of the above matrix and vector will give the values of $\underline{u}^{(k+1/2)}$.

(iii) Now for the third intermediate level (the (k+3/4)th iterate): if we reorder the mesh points column-wise, parallel to the y-axis, equation (4.4.10c) is transformed to,
$$\underline{u}_c^{(k+3/4)} = (\tilde{G}_1+rI)^{-1}\left[\tilde{G}_1\underline{u}_c^{(k)} + r\underline{u}_c^{(k+1/2)}\right] ,$$
where the subscript c stands for column-wise ordering and $\tilde{G}_1$ denotes $G_3$ expressed in that ordering. Then we have,
$$\underline{u}_c^{(k+3/4)} = \frac{1}{\det}\,\mathrm{diag}\left(\begin{bmatrix} w & 1 \\ 1 & w \end{bmatrix}, \ldots\right)\begin{bmatrix} u_1^{(k)} - u_{n+1}^{(k)} + ru_1^{(k+1/2)} \\ -u_1^{(k)} + u_{n+1}^{(k)} + ru_{n+1}^{(k+1/2)} \\ \vdots \\ u_{n^2-n}^{(k)} - u_{n^2}^{(k)} + ru_{n^2-n}^{(k+1/2)} \\ -u_{n^2-n}^{(k)} + u_{n^2}^{(k)} + ru_{n^2}^{(k+1/2)} \end{bmatrix} , \quad (4.4.13c)$$
where each (2×2) pair couples the vertically adjacent points (i, i+n). Hence the values of $\underline{u}^{(k+3/4)}$ will be obtained from the above matrix-vector multiplication.
(iv) At the fourth and final level (the (k+1)th iterate): in a similar manner, equation (4.4.10d) is transformed to,
$$\underline{u}_c^{(k+1)} = (\tilde{G}_2+rI)^{-1}\left[\tilde{G}_2\underline{u}_c^{(k)} + r\underline{u}_c^{(k+3/4)}\right] ,$$
where $\tilde{G}_2$ denotes $G_4$ in the column-wise ordering. Then, we have,

$$\underline{u}_c^{(k+1)} = \frac{1}{\det}\,\mathrm{diag}\left(\frac{\det}{w}, \begin{bmatrix} w & 1 \\ 1 & w \end{bmatrix}, \ldots, \frac{\det}{w}\right)\begin{bmatrix} u_1^{(k)} + ru_1^{(k+3/4)} \\ u_{n+1}^{(k)} - u_{2n+1}^{(k)} + ru_{n+1}^{(k+3/4)} \\ -u_{n+1}^{(k)} + u_{2n+1}^{(k)} + ru_{2n+1}^{(k+3/4)} \\ \vdots \\ u_{n^2}^{(k)} + ru_{n^2}^{(k+3/4)} \end{bmatrix} , \quad (4.4.13d)$$
and the values of $\underline{u}^{(k+1)}$ will then be obtained from the above matrix-vector multiplication.

Hence, the AGE scheme corresponds to sweeping through the mesh parallel to the coordinate x and y axes, involving at each stage the solution of (2×2) block systems. The AGE iterative procedure is continued until convergence is reached.
Again, in all these parallel implementations the optimal iteration parameter r was obtained from the experiments, by choosing the one that gives the best execution time.
Tables 4.4, 4.5 and 4.6 present the results obtained from the implementation of these strategies. As in the one dimensional problem, the asynchronous strategy performs better than the other two strategies, giving much smaller elapsed timings and showing a near linear speed-up. This is due to the absence of any synchronization points, which implies fewer overheads and no contention for shared data.
Also, from these three tables it is clear that better efficiency can be obtained by using the asynchronous approach rather than the synchronous approach. This is because the speed-up ratios of the asynchronous implementation are higher than those of the synchronous one.
Figure 4.8 shows the run time results obtained from using the three parallel AGE strategies when the matrix size of the problem is equal to 12×12.
In conclusion, we can say from Figure 4.8 that the best results are obtained when the third strategy of the parallel AGE iterative method is used to solve the two dimensional boundary value problem.

Matrix    No. of        Elapsed time      r        No. of        Speed-up
size      processors    in seconds                 iterations
12x12     1             32.992            1.600    11            1
          2             16.849                     11            1.959
          3             11.535                     11            2.860
          4             8.888                      11            3.711
          5             7.650                      11            4.312
          6             6.362                      11            5.185
          7             6.380                      11            5.171
          8             5.122                      11            6.441
          9             5.256                      11            6.277
24x24     1             151.796           1.443    17            1
          2             76.555                     17            1.982
          3             51.596                     17            2.942
          4             39.263                     17            3.866
          5             34.731                     17            4.370
          6             26.879                     17            5.647
          7             23.845                     17            6.365
          8             21.211                     17            7.156
          9             19.877                     17            7.636

TABLE 4.4: The Results from the First Synchronous Strategy

Matrix    No. of        Elapsed time      r        No. of        Speed-up
size      processors    in seconds                 iterations
36x36     1             288.991           1.037    23            1
          2             148.803                    23            1.942
          3             101.867                    23            2.836
          4             75.747                     23            3.815
          5             62.502                     23            4.623
          6             55.607                     23            5.197
          7             49.118                     23            5.883
          8             42.792                     23            6.753
          9             36.713                     23            7.871
48x48     1             500.095           0.900    27            1
          2             252.324                    27            1.981
          3             172.438                    27            2.900
          4             132.491                    27            3.774
          5             103.213                    27            4.845
          6             93.055                     27            5.374
          7             82.777                     27            6.041
          8             73.408                     27            6.812
          9             64.208                     27            7.788

TABLE 4.4: The Results from the First Synchronous Strategy (continued)

Matrix    No. of        Elapsed time      r        No. of        Speed-up
size      processors    in seconds                 iterations
12x12     1             38.339            1.600    13            1
          2             19.912                     14            1.925
          3             13.475                     14            2.845
          4             10.549                     14            3.634
          5             9.266                      14            4.137
          6             7.533                      14            5.089
          7             7.217                      14            5.312
          8             6.181                      14            6.202
          9             6.072                      14            6.314
24x24     1             175.358           1.447    21            1
          2             89.104                     23            1.968
          3             60.367                     23            2.904
          4             46.396                     23            3.779
          5             40.191                     23            4.363
          6             31.936                     23            5.490
          7             27.635                     23            6.345
          8             24.744                     23            7.086
          9             25.001                     23            7.623

TABLE 4.5: The Results from the Second Synchronous Strategy

Matrix    No. of        Elapsed time      r        No. of        Speed-up
size      processors    in seconds                 iterations
36x36     1             335.788           1.033    27            1
          2             174.223                    29            1.927
          3             119.875                    29            2.801
          4             88.776                     29            3.782
          5             73.784                     29            4.550
          6             63.418                     30            5.294
          7             52.647                     30            6.378
          8             50.158                     30            6.694
          9             48.182                     30            6.969
48x48     1             580.361           0.893    33            1
          2             293.139                    33            1.979
          3             201.763                    33            2.876
          4             156.271                    33            3.713
          5             125.217                    33            4.634
          6             109.466                    35            5.301
          7             96.504                     35            6.013
          8             83.092                     35            6.984
          9             76.452                     35            7.591

TABLE 4.5: The Results from the Second Synchronous Strategy (continued)

Matrix    No. of        Elapsed time      r        No. of        Speed-up
size      processors    in seconds                 iterations
12x12     1             30.699            1.600    9             1
          2             15.602                     9             1.967
          3             10.631                     9             2.887
          4             7.922                      9             3.875
          5             7.131                      9             4.305
          6             6.021                      9             5.098
          7             4.723                      9             6.235
          8             4.103                      9             7.482
          9             3.929                      9             7.813
24x24     1             132.702           1.441    17            1
          2             66.795                     17            1.989
          3             44.069                     17            3.011
          4             34.081                     17            3.893
          5             29.014                     17            4.573
          6             23.568                     17            5.630
          7             20.259                     17            6.550
          8             18.256                     17            7.268
          9             17.235                     17            7.699

TABLE 4.6: The Results from the Third Asynchronous Strategy

Matrix    No. of        Elapsed time      r        No. of        Speed-up
size      processors    in seconds                 iterations
36x36     1             231.932           1.030    20            1
          2             116.612                    20            1.988
          3             78.355                     20            2.960
          4             58.662                     20            3.953
          5             46.786                     20            4.957
          6             39.045                     20            5.940
          7             33.662                     20            6.890
          8             31.968                     20            7.255
          9             29.037                     20            7.987
48x48     1             493.327           0.903    25            1
          2             245.332                    25            2.010
          3             168.241                    25            2.932
          4             127.493                    25            3.869
          5             98.244                     25            5.021
          6             88.243                     25            5.590
          7             78.763                     25            6.263
          8             69.447                     25            7.103
          9             62.027                     25            7.953

TABLE 4.6: The Results from the Third Asynchronous Strategy (continued)

[FIGURE 4.8: The Timing Results for the Two Dimensional Boundary Value Problem (first, second and third strategies, n=24).]


4.5 THE AGE METHOD FOR SOLVING BOUNDARY VALUE PROBLEMS WITH NEUMANN

BOUNDARY CONDITIONS

In this section, the AGE iterative method, as discussed previously in Section (3.6.5), is used to solve a boundary value problem with boundary conditions involving a derivative, i.e. Neumann boundary conditions. Here another formulation of the AGE method, using the D'Yakonov [D'Yakonov, 1963] splitting formula, is used.

4.5.1 Formulation of the Method

Consider the differential equation,
$$-\frac{d^2U}{dx^2} + q(x)U = f(x) , \quad (4.5.1)$$
over the line segment $a\le x\le b$, subject to one of the following sets of boundary conditions:
$$\text{i.} \quad U(a) = \alpha , \quad \frac{dU}{dx}\Big|_b = \beta ;$$
$$\text{ii.} \quad \frac{dU}{dx}\Big|_a = \alpha , \quad U(b) = \beta ; \quad (4.5.2)$$
$$\text{iii.} \quad \frac{dU}{dx}\Big|_a = \alpha , \quad \frac{dU}{dx}\Big|_b = \beta .$$
Here, $\alpha$ and $\beta$ are given real constants, and q(x) and f(x) are given real continuous functions on $a\le x\le b$, with $q(x)\ge 0$.
For the differential equation (4.5.1), the strategy of the finite difference method is to replace the above equation by a difference equation. Hence, we cover the interval $a\le x\le b$ by a uniform mesh of size,
$$h = \frac{(b-a)}{(n+1)} ,$$
and denote the mesh points of the discrete problem by,
$$x_i = a + ih . \quad (4.5.3)$$

Therefore, the finite difference replacement of equation (4.5.1) can be obtained (using Table 3.1) and is given by,
$$\frac{-u_{i-1} + 2u_i - u_{i+1}}{h^2} + q_iu_i = f_i , \quad (4.5.4)$$
with a truncation error of order $O(h^2)$.
In matrix notation, (4.5.4) can be written as,
$$A\underline{u} = \underline{b} , \quad (4.5.5)$$
where A and $\underline{b}$ have the following forms, according to the boundary conditions:
i. A is the $(n\times n)$ tridiagonal matrix with entries $d_i = 2+h^2q_i$ and $a_i = c_i = -1$, $i=1,2,\ldots,n$ (with $a_1 = c_n = 0$), and
$$b_1 = \alpha + h^2f(x_1) , \quad b_i = h^2f(x_i) , \ i=2,3,\ldots,n-1 , \quad b_n = h^2f(x_n) + 2h\beta . \quad (4.5.6a)$$
ii. A has the same tridiagonal form, with the first row modified to incorporate the derivative condition at x=a, and
$$b_i = h^2f(x_i) , \ i=2,3,\ldots,n-1 , \quad b_n = h^2f(x_n) + \beta . \quad (4.5.6b)$$
iii. A has both its first and last rows modified to incorporate the derivative conditions at x=a and x=b. (4.5.6c)
Let us proceed with equation (4.5.6b), and follow the approach of Evans [Evans, 1985], by splitting A into,
$$A = G_1 + G_2 , \quad (4.5.7)$$
where,

$$G_1 = \mathrm{diag}\left(\begin{bmatrix} hd_1 & c_1 \\ a_2 & hd_2 \end{bmatrix}, \begin{bmatrix} hd_3 & c_3 \\ a_4 & hd_4 \end{bmatrix}, \ldots, \begin{bmatrix} hd_{n-1} & c_{n-1} \\ a_n & hd_n \end{bmatrix}\right) , \quad (4.5.8a)$$
and
$$G_2 = \mathrm{diag}\left(hd_1, \begin{bmatrix} hd_2 & c_2 \\ a_3 & hd_3 \end{bmatrix}, \ldots, \begin{bmatrix} hd_{n-2} & c_{n-2} \\ a_{n-1} & hd_{n-1} \end{bmatrix}, hd_n\right) , \quad (4.5.8b)$$
where n is even (representing an odd number of intervals) and $hd_i = d_i/2$.
By using equation (4.5.7), the matrix equation (4.5.5) can be written in the form,
$$(G_1+G_2)\underline{u} = \underline{b} . \quad (4.5.9)$$
Another formulation, similar to (3.6.48b) and with the same accuracy, can be derived from the AGE method using the D'Yakonov [D'Yakonov, 1963] splitting formula, which can be written explicitly as,
$$\underline{u}^{(k+\frac{1}{2})} = (G_1+rI)^{-1}\left[(G_1-rI)(G_2-rI)\underline{u}^{(k)} + \underline{b}\right] , \quad (4.5.10a)$$
$$\underline{u}^{(k+1)} = (G_2+rI)^{-1}\underline{u}^{(k+\frac{1}{2})} , \quad (4.5.10b)$$
where $\underline{u}^{(k+\frac{1}{2})}$ is an intermediate value and r is the iteration parameter.
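A literal transcription of (4.5.10a,b) is sketched below; the dense solves are an illustrative assumption, since the thesis exploits the (2×2) block structure shown in (4.5.11a,b).

```python
import numpy as np

# One D'Yakonov-split AGE step, following (4.5.10a) and (4.5.10b) directly.
def dyakonov_step(u, b, G1, G2, r):
    I = np.eye(len(u))
    u_half = np.linalg.solve(G1 + r*I,
                             (G1 - r*I) @ (G2 - r*I) @ u + b)  # (4.5.10a)
    return np.linalg.solve(G2 + r*I, u_half)                   # (4.5.10b)
```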

Hence, from equation (4.5.10), $\underline{u}^{(k+\frac{1}{2})}$ and $\underline{u}^{(k+1)}$ are given explicitly by applying the (2×2) block inverses of $(G_1+rI)$ and $(G_2+rI)$ to right-hand side vectors formed from $(G_1-rI)(G_2-rI)\underline{u}^{(k)} + \underline{b}$ and $\underline{u}^{(k+\frac{1}{2})}$ respectively; for example, the last two components of the first right-hand side are,
$$a_{n-1}v_{n-1}u_{n-2}^{(k)} + v_{n-1}^2u_{n-1}^{(k)} + c_{n-1}v_nu_n^{(k)} + b_{n-1} ,$$
$$a_{n-1}a_nu_{n-2}^{(k)} + a_nv_{n-1}u_{n-1}^{(k)} + v_n^2u_n^{(k)} + b_n , \quad (4.5.11a,b)$$
where $v_i = hd_i - r$, $w_i = hd_i + r$, $i=1,2,\ldots,n$, and $\det = w_iw_{i+1} - c_ia_{i+1}$ for each (2×2) block. By carrying out the vector multiplications in equation (4.5.11a,b) we obtain the values of $\underline{u}^{(k+\frac{1}{2})}$ and $\underline{u}^{(k+1)}$.

4.5.2 Numerical Results

A number of experiments were conducted to demonstrate the application of the AGE algorithm with D'Yakonov splitting to boundary value problems. The iteration parameter r was chosen so as to provide the most rapid convergence, and the convergence criterion was taken as $\varepsilon = 10^{-6}$.
Consider the following problem, taken from Fox [Fox, 1957],
$$\frac{d^2U}{dx^2} - U = 0 , \quad 0\le x\le 1 , \quad (4.5.12)$$
subject to the boundary conditions,
$$\frac{dU}{dx}\Big|_{x=0} = 1 , \quad U(1) = 0 . \quad (4.5.13)$$
The analytical solution is given by,
$$U = Ae^x + Be^{-x} , \quad (4.5.14)$$
where,
$$A = \frac{1}{(1+e^2)} \quad \text{and} \quad B = \frac{-e^2}{(1+e^2)} .$$
The following tables show the results obtained by implementing this method for a different number of points.

n=10, r=0.560

Numerical Sol.   Exact Sol.   Absolute Error   Relative Error   Number of iterations
-0.67375         -0.67371     0.00004          0.000059         23
-0.59144         -0.59140     0.00004          0.000067
-0.51401         -0.51397     0.00004          0.000077
-0.44083         -0.44080     0.00003          0.000068
-0.37130         -0.37127     0.00003          0.000080
-0.30485         -0.30482     0.00003          0.000098
-0.24090         -0.24088     0.00002          0.000083
-0.17898         -0.17894     0.00004          0.000223
-0.11851         -0.11848     0.00003          0.000253
-0.05901         -0.05900     0.00001          0.000169

TABLE 4.7

n=16, r=0.590

Numerical Sol.   Exact Sol.   Absolute Error   Relative Error   Number of iterations
-0.70408         -0.70405     0.00003          0.000042         28
-0.64899         -0.64895     0.00004          0.000046
-0.59616         -0.59610     0.00006          0.000100
-0.54533         -0.54530     0.00003          0.000055
-0.49645         -0.49640     0.00005          0.000100
-0.44925         -0.44921     0.00004          0.000089
-0.40363         -0.40358     0.00005          0.000123
-0.35938         -0.35934     0.00004          0.000111
-0.31639         -0.31635     0.00004          0.000129
-0.27448         -0.27445     0.00003          0.000109
-0.23353         -0.23350     0.00003          0.000128
-0.19340         -0.19336     0.00004          0.000206
-0.15392         -0.15389     0.00003          0.000194
-0.11499         -0.11496     0.00003          0.000260
-0.07647         -0.07642     0.00005          0.000654
-0.03816         -0.03814     0.00002          0.000524

TABLE 4.8

Again, the same three parallel strategies discussed earlier were implemented. The timing results from this implementation show no significant difference from the AGE implementation with the Peaceman-Rachford formulation (Section 4.3).

[FIGURE 4.9: The Timing Results for the One Dimensional Boundary Value Problem with D'Yakonov Splitting (first, second and third strategies).]

Page 204: Multiprocessor computer architectures : algorithmic design ... · Multiprocessor computer architectures : algorithmic design and ... ARCHITECTURES: ALGORITHMIC DESIGN AND APPLICATIONS

181

4.6 THE AGE METHOD FOR SOLVING STURM LIOUVILLE PROBLEM

Consider the differential equation,

$$\frac{d}{dx}\left(p(x)\frac{dU}{dx}\right) - q(x)U + \lambda\rho U = 0, \tag{4.6.1}$$

where $p$, $q$ and $\rho$ are real functions of $x$ in the interval $a \le x \le b$, subject to the boundary conditions,

$$U(a) = \alpha, \qquad U(b) = \beta, \tag{4.6.2}$$

where $\alpha$ and $\beta$ are given real constants.

The problem is to determine the unknown function $U$ and the unknown parameter $\lambda$ using the AGE iterative method.

[FIGURE 4.10: the interval $a \le x \le b$ divided at the points $x_0 = a, x_1, \ldots, x_{n-1}, x_n, x_{n+1} = b$.]

Now, when the interval $a \le x \le b$ is divided into $n+1$ equal subintervals of width $h$ (Figure 4.10) and the derivatives are approximated by the central difference expressions (Table 3.1), we obtain,

$$\frac{p_i}{h^2}(u_{i-1} - 2u_i + u_{i+1}) + \frac{p_i'}{2h}(u_{i+1} - u_{i-1}) - q_i u_i = -\lambda\rho_i u_i, \tag{4.6.3}$$

for $i = 1,2,\ldots,n$, where $h = \frac{(b-a)}{(n+1)}$.

In matrix notation (4.6.3) can be written in the form,

$$A\underline{u} = \lambda\underline{u}, \tag{4.6.4}$$

where,



$$A = \begin{bmatrix} d_1 & c_1 & & & \\ a_2 & d_2 & c_2 & & \\ & \ddots & \ddots & \ddots & \\ & & a_{n-1} & d_{n-1} & c_{n-1} \\ & & & a_n & d_n \end{bmatrix}, \tag{4.6.5}$$

with

$$d_i = \frac{-(2p_i + q_i h^2)}{h^2}, \quad 1 \le i \le n, \qquad a_i = \frac{p_i - p_i' h/2}{h^2}, \quad 2 \le i \le n, \qquad c_i = \frac{p_i + p_i' h/2}{h^2}, \quad 1 \le i \le n-1, \tag{4.6.6}$$

where the subscript $i$ denotes evaluation of $p$, $q$ and $\rho$ at the point $x_i$.

The tridiagonal matrix A is real and diagonally dominant (if $|d_i| \ge |a_i| + |c_i|$), which are ideal conditions for considering iterative methods of solution [Varga, 1962].

4.6.1 Method of solution

By splitting the matrix A into the sum of two submatrices,

$$A = G_1 + G_2, \tag{4.6.7}$$

where,

$$G_1 = \mathrm{diag}\!\left(\begin{bmatrix} hd_1 & c_1 \\ a_2 & hd_2 \end{bmatrix},\ \begin{bmatrix} hd_3 & c_3 \\ a_4 & hd_4 \end{bmatrix},\ \ldots,\ \begin{bmatrix} hd_{n-1} & c_{n-1} \\ a_n & hd_n \end{bmatrix}\right) \tag{4.6.8a}$$


and

$$G_2 = \mathrm{diag}\!\left(hd_1,\ \begin{bmatrix} hd_2 & c_2 \\ a_3 & hd_3 \end{bmatrix},\ \ldots,\ \begin{bmatrix} hd_{n-2} & c_{n-2} \\ a_{n-1} & hd_{n-1} \end{bmatrix},\ hd_n\right) \tag{4.6.8b}$$

if n is even, with $hd_i = d_i/2$, $i = 1,2,\ldots,n$.

By using (4.6.7) the matrix equation (4.6.4) can now be written in the form $(G_1 + G_2)\underline{u} = \lambda\underline{u}$, and by following a strategy similar to the ADI method [Peaceman, 1955], $\underline{u}^{(k+\frac{1}{2})}$ and $\underline{u}^{(k+1)}$ can be determined implicitly by,

$$(G_1 + rI)\,\underline{u}^{(k+\frac{1}{2})} = \lambda\underline{u}^{(k)} - (G_2 - rI)\,\underline{u}^{(k)} \tag{4.6.9}$$

and

$$(G_2 + rI)\,\underline{u}^{(k+1)} = \lambda\underline{u}^{(k+\frac{1}{2})} - (G_1 - rI)\,\underline{u}^{(k+\frac{1}{2})}, \tag{4.6.10}$$

or explicitly by,

$$\underline{u}^{(k+\frac{1}{2})} = (G_1 + rI)^{-1}\left[\lambda\underline{u}^{(k)} - (G_2 - rI)\,\underline{u}^{(k)}\right]$$

and

$$\underline{u}^{(k+1)} = (G_2 + rI)^{-1}\left[\lambda\underline{u}^{(k+\frac{1}{2})} - (G_1 - rI)\,\underline{u}^{(k+\frac{1}{2})}\right], \tag{4.6.11}$$

where r is the iteration parameter, chosen in terms of the minimum and maximum eigenvalues of the submatrices of $G_1$ and $G_2$.

From equation (4.6.11), $\underline{u}^{(k+\frac{1}{2})}$ and $\underline{u}^{(k+1)}$ are given by,


$$\underline{u}^{(k+\frac{1}{2})} = \frac{1}{\det}\,\mathrm{diag}\!\left(\begin{bmatrix} w_2 & -c_1 \\ -a_2 & w_1 \end{bmatrix},\ \ldots,\ \begin{bmatrix} w_n & -c_{n-1} \\ -a_n & w_{n-1} \end{bmatrix}\right)\begin{bmatrix} \lambda u_1^{(k)} - v_1 u_1^{(k)} \\ \lambda u_2^{(k)} - v_2 u_2^{(k)} - c_2 u_3^{(k)} \\ \lambda u_3^{(k)} - a_3 u_2^{(k)} - v_3 u_3^{(k)} \\ \vdots \\ \lambda u_{n-1}^{(k)} - a_{n-1} u_{n-2}^{(k)} - v_{n-1} u_{n-1}^{(k)} \\ \lambda u_n^{(k)} - v_n u_n^{(k)} \end{bmatrix} \tag{4.6.12a}$$

and

$$\underline{u}^{(k+1)} = \frac{1}{\det}\,\mathrm{diag}\!\left(\frac{\det}{w_1},\ \begin{bmatrix} w_3 & -c_2 \\ -a_3 & w_2 \end{bmatrix},\ \ldots,\ \frac{\det}{w_n}\right)\begin{bmatrix} \lambda u_1^{(k+\frac{1}{2})} - v_1 u_1^{(k+\frac{1}{2})} - c_1 u_2^{(k+\frac{1}{2})} \\ \lambda u_2^{(k+\frac{1}{2})} - a_2 u_1^{(k+\frac{1}{2})} - v_2 u_2^{(k+\frac{1}{2})} \\ \vdots \\ \lambda u_{n-1}^{(k+\frac{1}{2})} - v_{n-1} u_{n-1}^{(k+\frac{1}{2})} - c_{n-1} u_n^{(k+\frac{1}{2})} \\ \lambda u_n^{(k+\frac{1}{2})} - a_n u_{n-1}^{(k+\frac{1}{2})} - v_n u_n^{(k+\frac{1}{2})} \end{bmatrix} \tag{4.6.12b}$$

where $v_i = hd_i - r$, $w_i = hd_i + r$ and, within each $(2\times 2)$ block, $\det = w_i w_{i+1} - c_i a_{i+1}$.

Hence, the numerical procedure would be as follows:

Given an initial guess eigenvalue $\lambda^{(0)}$ and an initial eigenvector $\underline{u}^{(0)}$, we determine a new solution $\underline{u}^{(1)}$ by using the AGE algorithm. To determine a new value $\lambda^{(1)}$, we use Rayleigh's quotient [Fox, 1957],



$$\lambda^{(1)} = \frac{\underline{u}^{(1)} \cdot A\underline{u}^{(0)}}{\underline{u}^{(1)} \cdot \underline{u}^{(0)}}. \tag{4.6.13}$$

Then we continue until the procedure gives no further change in the value of $\lambda$. This procedure is only suitable for deriving the smallest eigenvalue $\lambda$; otherwise the diagonal dominance of A will be lost and the AGE algorithm will diverge.
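To make the procedure concrete, the following is a minimal Python sketch of this eigenvalue iteration; the routine `age_solve` is a hypothetical stand-in for the AGE sweeps (4.6.12a,b), and the normalisation step is illustrative rather than taken from the original implementation.

```python
import numpy as np

def smallest_eigenpair(A, u0, lam0, age_solve, tol=1e-6, max_it=100):
    """Alternate an approximate solve of A u = lam u (one or more AGE
    iterations) with a Rayleigh-quotient update (4.6.13), stopping when
    lambda no longer changes."""
    u, lam = u0 / np.linalg.norm(u0), lam0
    for _ in range(max_it):
        u_new = age_solve(A, u, lam)                # AGE levels (k+1/2), (k+1)
        lam_new = (u_new @ (A @ u)) / (u_new @ u)   # Rayleigh's quotient
        if abs(lam_new - lam) < tol:
            return lam_new, u_new
        u, lam = u_new / np.linalg.norm(u_new), lam_new
    return lam, u
```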

4.6.2 Numerical Results

In order to explore this method, we present here some numerical experiments. Consider Mathieu's equation,

$$\frac{d^2U}{dx^2} + (\lambda - 2q\cos 2x)U = 0, \tag{4.6.14}$$

where q is a constant and $\lambda$ the eigenvalue, subject to the boundary conditions,

$$U(0) = 0, \qquad U(2\pi) = 0. \tag{4.6.15}$$

The following tables show the results obtained by solving this

problem by the AGE method.

n=9, ε=0.1×10⁻⁵, r=1.500, q=1
λ=0.890461, Number of iterations = 17

The Solution Vector   A*U_i     λ*U_i
0.133369              0.11877   0.11876
0.257145              0.22900   0.22898
0.359889              0.32047   0.32047
0.428808              0.38183   0.38184
0.453195              0.40354   0.40355
0.428808              0.38182   0.38184
0.359889              0.32048   0.32047
0.257145              0.22900   0.22898
0.133369              0.11877   0.11876

TABLE 4.9



n=13, ε=0.1×10⁻⁵, r=1.500, q=1
λ=0.894524, Number of iterations = 22

The Solution Vector   A*U_i     λ*U_i
0.081095              0.07269   0.07254
0.159266              0.14243   0.14247
0.231265              0.20710   0.20687
0.293353              0.26238   0.26241
0.341572              0.30557   0.30554
0.372259              0.33288   0.33299
0.383806              0.34232   0.34243
0.372260              0.33292   0.33300
0.341571              0.30548   0.30554
0.293356              0.26259   0.26241
0.231261              0.20682   0.20687
0.159269              0.14272   0.14247
0.081091              0.07246   0.07254

TABLE 4.10


4.7 CONCLUSIONS

This chapter included the study of synchronous and asynchronous

AGE iterative methods, where the mesh of points of a problem to be

solved was partitioned into n processes. The implemented parallel AGE

methods have been used to solve one and two dimensional boundary value

problems. Three strategies of the parallel AGE methods have been used

to solve these problems.

In both the one and two dimensional problems, the third strategy showed an improvement over the other two, since in the other two strategies the processes must be synchronized at the end of each iteration step, which in turn degrades the performance of the algorithms.

A comparison between the first and second strategies shows that better results are obtained when using the first strategy. This is because the total number of computational operations in the second strategy is higher than that of the first strategy, and also because of the re-orderings needed in the second strategy.

From the experimental results it can be seen that the shared data overhead in the third strategy implementation is less than that of the first and second strategy implementations.


For a boundary value problem involving derivatives at its boundaries, solved by the AGE algorithm, we noticed from the experimental results that the absolute errors and the number of iterations are smaller than for many other iterative methods, whilst the implementation of the three parallel strategies gives almost the same speed-ups.

Finally, it can be seen that the parallel AGE method is well suited to implementation on a MIMD computer, which results in the almost linear speed-ups obtained from its implementation.

In finding the eigenvalues and the corresponding eigenvectors of the Sturm-Liouville problem by the AGE method, we noticed experimentally that the results obtained by applying this method compare favourably with many other methods.



CHAPTER 5

TIME DEPENDENT PROBLEMS,

PARALLEL EXPLORATIONS

We look before and after
And pine for what is not.

Shelley.



5.1 INTRODUCTION

The parallel AGE iterative method was developed and implemented to

solve one and two dimensional boundary value problems in Chapter 4.

In this Chapter, we will implement the same three parallel strategies

presented earlier in Section 4.2 to solve one and two dimensional

parabolic equations as well as the one dimensional wave equation.

The parallel AGE method was developed and implemented on the

Balance 8000 system using its available 5 processors. These include

synchronous and asynchronous versions of the algorithm. The results

from these implementations were compared as well as the performance

analysis of the best method presented.

The AGE iterative method with the D'Yakonov splitting formula to

solve the one and two dimensional diffusion equation is also

presented in this chapter. This new strategy yields an algorithm with reduced computational complexity and greater accuracy for multi-dimensional problems. The same three parallel strategies were also

implemented on this kind of problem formulation.

A new method for solving the (n×n) complex tridiagonal systems derived from the finite difference/element discretization of the Schrödinger equation is also presented and confirmed by numerical experiments.



5.2 EXPERIMENTAL RESULTS FOR THE DIFFUSION-CONVECTION EQUATION

This experiment involved the parallel implementation of the AGE

method for solving the diffusion-convection equation. Consider the following problem,

$$\frac{\partial U}{\partial t} = \varepsilon\frac{\partial^2 U}{\partial x^2} - p\frac{\partial U}{\partial x}, \quad 0 < x < 1, \; t \ge 0, \tag{5.2.1}$$

with the initial condition,

$$U(x,0) = 0, \quad 0 \le x \le 1, \tag{5.2.2}$$

and the Dirichlet boundary conditions,

$$U(0,t) = 0, \quad U(1,t) = 1, \quad t \ge 0. \tag{5.2.3}$$

In this example the coefficients ε and p were both assigned the value 1.

The exact solution is given by,

$$U(x,t) = \frac{e^{px/\varepsilon} - 1}{e^{p/\varepsilon} - 1} + 2\sum_{n=1}^{\infty} \frac{(-1)^n n\pi}{(n\pi)^2 + (p/2\varepsilon)^2}\, e^{\frac{p(x-1)}{2\varepsilon}}\, \sin(n\pi x)\, e^{-[(n\pi)^2\varepsilon + p^2/4\varepsilon]\,t}. \tag{5.2.4}$$

A uniformly-spaced network whose mesh points are $x_i = ih$, $t_j = jk$, for $i = 0,1,2,\ldots,n+1$ and $j = 0,1,2,\ldots,m+1$, is used with $h = \frac{1}{(n+1)}$, $k = \frac{T}{(m+1)}$ and $\lambda = \frac{k}{h^2}$, the mesh ratio. The real line $0 \le x \le 1$ is thus divided at the points $x_0 = 0, x_1, \ldots, x_n, x_{n+1} = 1$.


At the point $P(x_i, t_{j+\frac{1}{2}})$ the derivatives in equation (5.2.1) are approximated by the finite difference discretizations given in Table 3.1. Thus we have,

$$-\tfrac{1}{2}(E+K)u_{i-1,j+1} + (1+E)u_{i,j+1} - \tfrac{1}{2}(E-K)u_{i+1,j+1} = \tfrac{1}{2}(E+K)u_{i-1,j} + (1-E)u_{i,j} + \tfrac{1}{2}(E-K)u_{i+1,j}, \tag{5.2.5}$$

which is the Crank-Nicolson formula with $O(h^2, k^2)$ accuracy, where $E = \varepsilon\lambda$ and $K = \tfrac{1}{2}p\lambda h$. When all the points $1 \le i \le n$ within a line are considered, equation (5.2.5) generates a tridiagonal system of linear equations of the form,

$$\begin{bmatrix} d & c & & & \\ a & d & c & & \\ & \ddots & \ddots & \ddots & \\ & & a & d & c \\ & & & a & d \end{bmatrix}_{(n\times n)}\begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_{n-1} \\ u_n \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_{n-1} \\ b_n \end{bmatrix}, \tag{5.2.6a}$$

or

$$A\underline{u} = \underline{b}, \tag{5.2.6b}$$

where $a = -\tfrac{1}{2}(E+K)$, $d = (1+E)$ and $c = -\tfrac{1}{2}(E-K)$.

Also,

$$b_1 = \tfrac{1}{2}(E+K)u_{0,j} + \tfrac{1}{2}(E+K)u_{0,j+1} + (1-E)u_{1,j} + \tfrac{1}{2}(E-K)u_{2,j},$$
$$b_i = \tfrac{1}{2}(E+K)u_{i-1,j} + (1-E)u_{i,j} + \tfrac{1}{2}(E-K)u_{i+1,j}, \quad i = 2,3,\ldots,n-1,$$
$$b_n = \tfrac{1}{2}(E+K)u_{n-1,j} + (1-E)u_{n,j} + \tfrac{1}{2}(E-K)u_{n+1,j} + \tfrac{1}{2}(E-K)u_{n+1,j+1}.$$
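For illustration, a minimal Python sketch of assembling these coefficients and the right-hand side (the helper name and argument layout are ours, with eps and p the equation coefficients and lam the mesh ratio):

```python
import numpy as np

def assemble_cn_system(u, u_left, u_right, eps, p, lam, h):
    """Build the Crank-Nicolson tridiagonal system (5.2.6) for the
    diffusion-convection equation, with E = eps*lam and K = p*lam*h/2.
    u holds the interior values at level j; u_left/u_right are the pairs
    (u_{0,j}, u_{0,j+1}) and (u_{n+1,j}, u_{n+1,j+1}) of boundary values."""
    E, K = eps * lam, 0.5 * p * lam * h
    a, d, c = -0.5 * (E + K), 1.0 + E, -0.5 * (E - K)
    b = np.empty(len(u))
    b[1:-1] = 0.5*(E+K)*u[:-2] + (1.0-E)*u[1:-1] + 0.5*(E-K)*u[2:]
    b[0]  = 0.5*(E+K)*(u_left[0] + u_left[1]) + (1.0-E)*u[0] + 0.5*(E-K)*u[1]
    b[-1] = 0.5*(E+K)*u[-2] + (1.0-E)*u[-1] + 0.5*(E-K)*(u_right[0] + u_right[1])
    return a, d, c, b
```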

Let us now assume that we have an even number of intervals

(corresponding to an odd number of internal points, i.e. n odd) on

the real line O~x~l. We can then perform the following splitting of

the coefficient matrix A,



$$A = G_1 + G_2, \tag{5.2.7}$$

where, for n odd,

$$G_1 = \mathrm{diag}\!\left(hd,\ \begin{bmatrix} hd & c \\ a & hd \end{bmatrix},\ \ldots,\ \begin{bmatrix} hd & c \\ a & hd \end{bmatrix}\right) \tag{5.2.8a}$$

and

$$G_2 = \mathrm{diag}\!\left(\begin{bmatrix} hd & c \\ a & hd \end{bmatrix},\ \ldots,\ \begin{bmatrix} hd & c \\ a & hd \end{bmatrix},\ hd\right), \tag{5.2.8b}$$

where $hd = d/2$.

Hence, the AGE iterative method with the Peaceman-Rachford formula can be applied to determine $\underline{u}^{(k+\frac{1}{2})}$ and $\underline{u}^{(k+1)}$ explicitly by,

$$\underline{u}^{(k+\frac{1}{2})} = (G_1+rI)^{-1}\left[(rI-G_2)\underline{u}^{(k)} + \underline{b}\right]$$

and

$$\underline{u}^{(k+1)} = (G_2+rI)^{-1}\left[(rI-G_1)\underline{u}^{(k+\frac{1}{2})} + \underline{b}\right], \tag{5.2.9}$$

where the matrices $(G_1+rI)$, $(G_2+rI)$, $(rI-G_1)$ and $(rI-G_2)$ are represented by,

$$(G_1+rI) = \mathrm{diag}\!\left(w,\ \begin{bmatrix} w & c \\ a & w \end{bmatrix},\ \ldots,\ \begin{bmatrix} w & c \\ a & w \end{bmatrix}\right), \tag{5.2.10a}$$

$$(G_2+rI) = \mathrm{diag}\!\left(\begin{bmatrix} w & c \\ a & w \end{bmatrix},\ \ldots,\ \begin{bmatrix} w & c \\ a & w \end{bmatrix},\ w\right), \tag{5.2.10b}$$

$$(rI-G_1) = \mathrm{diag}\!\left(v,\ \begin{bmatrix} v & -c \\ -a & v \end{bmatrix},\ \ldots,\ \begin{bmatrix} v & -c \\ -a & v \end{bmatrix}\right), \tag{5.2.10c}$$

$$(rI-G_2) = \mathrm{diag}\!\left(\begin{bmatrix} v & -c \\ -a & v \end{bmatrix},\ \ldots,\ \begin{bmatrix} v & -c \\ -a & v \end{bmatrix},\ v\right), \tag{5.2.10d}$$

where $w = hd + r$ and $v = r - hd$.

It is clear that $(G_1+rI)$ and $(G_2+rI)$ are block diagonal matrices. All the diagonal elements, except the first (or the last in $(G_2+rI)$), are $(2\times 2)$ submatrices. Therefore, $(G_1+rI)$ and $(G_2+rI)$ can be easily inverted by merely inverting their $(2\times 2)$ block diagonal entries.
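Because each block can be inverted in closed form, the solve is a short recurrence. The following Python fragment is a minimal sketch (the helper name is ours, and the constant-coefficient case of (5.2.10a), with a leading 1×1 entry, is assumed):

```python
import numpy as np

def apply_G1_inverse(y, a, c, w):
    """Apply (G1 + rI)^{-1} to y: a 1x1 leading entry w followed by
    (2x2) blocks [[w, c], [a, w]], each inverted with det = w*w - a*c."""
    x = np.empty_like(y)
    det = w * w - a * c
    x[0] = y[0] / w                      # the leading 1x1 block
    for i in range(1, len(y) - 1, 2):    # the (2x2) blocks
        x[i]     = ( w * y[i] - c * y[i + 1]) / det
        x[i + 1] = (-a * y[i] + w * y[i + 1]) / det
    return x
```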

Then, from equation (5.2.9), $\underline{u}^{(k+\frac{1}{2})}$ and $\underline{u}^{(k+1)}$ are given by,


$$\underline{u}^{(k+\frac{1}{2})} = \frac{1}{\det}\,\mathrm{diag}\!\left(\frac{\det}{w},\ \begin{bmatrix} w & -c \\ -a & w \end{bmatrix},\ \ldots,\ \begin{bmatrix} w & -c \\ -a & w \end{bmatrix}\right)\begin{bmatrix} v u_1^{(k)} - c u_2^{(k)} + b_1 \\ -a u_1^{(k)} + v u_2^{(k)} + b_2 \\ v u_3^{(k)} - c u_4^{(k)} + b_3 \\ \vdots \\ -a u_{n-2}^{(k)} + v u_{n-1}^{(k)} + b_{n-1} \\ v u_n^{(k)} + b_n \end{bmatrix} \tag{5.2.11a}$$

and

$$\underline{u}^{(k+1)} = \frac{1}{\det}\,\mathrm{diag}\!\left(\begin{bmatrix} w & -c \\ -a & w \end{bmatrix},\ \ldots,\ \begin{bmatrix} w & -c \\ -a & w \end{bmatrix},\ \frac{\det}{w}\right)\begin{bmatrix} v u_1^{(k+\frac{1}{2})} + b_1 \\ v u_2^{(k+\frac{1}{2})} - c u_3^{(k+\frac{1}{2})} + b_2 \\ -a u_2^{(k+\frac{1}{2})} + v u_3^{(k+\frac{1}{2})} + b_3 \\ \vdots \\ v u_{n-1}^{(k+\frac{1}{2})} - c u_n^{(k+\frac{1}{2})} + b_{n-1} \\ -a u_{n-1}^{(k+\frac{1}{2})} + v u_n^{(k+\frac{1}{2})} + b_n \end{bmatrix}, \tag{5.2.11b}$$

where $\det = w^2 - ac$.

The corresponding explicit expressions for the AGE equations are obtained by carrying out the multiplications in (5.2.11a,b). Thus we have:

(i) At level (k+½),

$$u_1^{(k+\frac{1}{2})} = (v u_1^{(k)} - c u_2^{(k)} + b_1)/w,$$
$$u_i^{(k+\frac{1}{2})} = (A u_{i-1}^{(k)} + B u_i^{(k)} + C u_{i+1}^{(k)} + D u_{i+2}^{(k)} + E_i)/\det$$

and

$$u_{i+1}^{(k+\frac{1}{2})} = (\tilde A u_{i-1}^{(k)} + \tilde B u_i^{(k)} + \tilde C u_{i+1}^{(k)} + \tilde D u_{i+2}^{(k)} + \tilde E_i)/\det, \qquad i = 2,4,\ldots,n-1,$$

where,

$$A = -aw, \quad B = vw, \quad C = -cv, \quad D = \begin{cases} 0 & \text{for } i = n-1 \\ c^2 & \text{otherwise} \end{cases}, \quad E_i = w b_i - c b_{i+1},$$

and

$$\tilde A = a^2, \quad \tilde B = -av, \quad \tilde C = vw, \quad \tilde D = \begin{cases} 0 & \text{for } i = n-1 \\ -cw & \text{otherwise} \end{cases}, \quad \tilde E_i = w b_{i+1} - a b_i,$$

with the following computational molecules (Figure 5.1).

[FIGURE 5.1: The AGE Method at Level (k+½) — computational molecules linking the points i−1, i, i+1, i+2 at levels k and k+½.]

(ii) At level (k+1),

$$u_i^{(k+1)} = (P u_{i-1}^{(k+\frac{1}{2})} + Q u_i^{(k+\frac{1}{2})} + R u_{i+1}^{(k+\frac{1}{2})} + S u_{i+2}^{(k+\frac{1}{2})} + T_i)/\det$$

and

$$u_{i+1}^{(k+1)} = (\tilde P u_{i-1}^{(k+\frac{1}{2})} + \tilde Q u_i^{(k+\frac{1}{2})} + \tilde R u_{i+1}^{(k+\frac{1}{2})} + \tilde S u_{i+2}^{(k+\frac{1}{2})} + \tilde T_i)/\det, \qquad i = 1,3,\ldots,n-2,$$

with

$$u_n^{(k+1)} = (-a u_{n-1}^{(k+\frac{1}{2})} + v u_n^{(k+\frac{1}{2})} + b_n)/w,$$

where,

$$P = \begin{cases} 0 & \text{for } i = 1 \\ -aw & \text{otherwise} \end{cases}, \quad Q = vw, \quad R = -cv, \quad S = c^2, \quad T_i = w b_i - c b_{i+1},$$

and

$$\tilde P = \begin{cases} 0 & \text{for } i = 1 \\ a^2 & \text{otherwise} \end{cases}, \quad \tilde Q = -av, \quad \tilde R = vw, \quad \tilde S = -cw, \quad \tilde T_i = -a b_i + w b_{i+1},$$

with its computational molecules given by Figure 5.2.

[FIGURE 5.2: The AGE Method at Level (k+1) — computational molecules linking the points i−1, i, i+1, i+2 at levels k+½ and k+1.]

The computational complexity per iteration per point is 42 multiplications + 28 additions, while it usually requires only two iterations for convergence for a small time step k.
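For illustration, here is a minimal Python sketch of one complete Peaceman-Rachford AGE iteration for the constant-coefficient system (5.2.6) with n odd; it follows the block pattern above directly and is a sketch, not the original (Balance 8000) code.

```python
import numpy as np

def age_pr_iteration(u, b, a, c, d, r):
    """One Peaceman-Rachford AGE iteration (5.2.9): G1 carries the leading
    1x1 entry, G2 the trailing one; each half-step is a 2x2 block solve."""
    n = len(u)
    hd = d / 2.0
    w, v = hd + r, r - hd
    det = w * w - a * c

    # level (k+1/2): solve (G1+rI) u_half = (rI-G2) u + b
    y = np.empty(n); uh = np.empty(n)
    for k in range(0, n - 1, 2):          # (rI-G2): blocks, then 1x1 tail
        y[k]   =  v * u[k] - c * u[k+1] + b[k]
        y[k+1] = -a * u[k] + v * u[k+1] + b[k+1]
    y[n-1] = v * u[n-1] + b[n-1]
    uh[0] = y[0] / w                      # (G1+rI)^{-1}: 1x1 head, then blocks
    for k in range(1, n - 1, 2):
        uh[k]   = ( w * y[k] - c * y[k+1]) / det
        uh[k+1] = (-a * y[k] + w * y[k+1]) / det

    # level (k+1): solve (G2+rI) u_new = (rI-G1) u_half + b
    z = np.empty(n); un = np.empty(n)
    z[0] = v * uh[0] + b[0]
    for k in range(1, n - 1, 2):
        z[k]   =  v * uh[k] - c * uh[k+1] + b[k]
        z[k+1] = -a * uh[k] + v * uh[k+1] + b[k+1]
    for k in range(0, n - 1, 2):
        un[k]   = ( w * z[k] - c * z[k+1]) / det
        un[k+1] = (-a * z[k] + w * z[k+1]) / det
    un[n-1] = z[n-1] / w
    return un
```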


In order to compare the three parallel strategies, we list the timing results with the speed-up ratios of the three strategies in Tables 5.1, 5.2 and 5.3, where different mesh sizes are chosen. From these tables, we notice that the asynchronous strategy

(Table 5.3) achieves better speed-up ratios than the other strategies.

These speed-ups are mostly linear, i.e., of order p, where p is the

number of processors that are in use. However, if we consider the

timing results in Tables 5.1 and 5.2, we see that the algorithm of

the first strategy requires less time.


The difference in the running times for the three strategies is

due to the fact that in the first two strategies we need to synchronise

the processors, while in the third strategy, each processor works on

its subtasks without waiting for the other processors. Also, the

difference in the running-times between the first two strategies is

due to the reordering needed in the second strategy.

The timing results of the three versions are presented in Figure 5.3.


r=0.5, k=0.001, t=0.1, λ=0.1, ε=5.0×10⁻⁷; 9 points:

No. of processors   Time (second)   Speed-up
1                   10.330          1
2                    5.695          1.813
3                    4.096          2.521
4                    3.296          3.134
5                    2.556          4.041

r=0.5, k=0.005, t=0.5, λ=0.7, ε=5.0×10⁻⁷; 11 points:

No. of processors   Time (second)   Speed-up
1                   12.512          1
2                    6.890          1.815
3                    5.995          2.087
4                    4.339          2.883
5                    3.229          3.874

r=0.5, k=0.005, t=1.0, λ=1, ε=5.0×10⁻⁷; 13 points:

No. of processors   Time (second)   Speed-up
1                   14.791          1
2                    8.968          1.649
3                    7.348          2.012
4                    6.352          2.328
5                    5.181          2.854

TABLE 5.1: The Results From Implementing The First Strategy.



r=0.5, k=0.001, t=0.1, λ=0.1, ε=5.0×10⁻⁷; 9 points:

No. of processors   Time (second)   Speed-up
1                   11.117          1
2                    6.146          1.808
3                    4.454          2.495
4                    3.771          2.948
5                    2.993          3.714

r=0.5, k=0.005, t=0.5, λ=0.8, ε=5.0×10⁻⁷; 11 points:

No. of processors   Time (second)   Speed-up
1                   13.364          1
2                    7.321          1.825
3                    6.562          2.036
4                    4.511          2.962
5                    4.068          3.285

r=0.5, k=0.005, t=1.0, λ=1, ε=5.0×10⁻⁷; 13 points:

No. of processors   Time (second)   Speed-up
1                   15.821          1
2                    9.479          1.669
3                    7.993          1.979
4                    6.897          2.293
5                    5.433          2.912

TABLE 5.2: The Results From Implementing The Second Strategy.



r=0.5, k=0.001, t=0.1, λ=0.1, ε=5.0×10⁻⁷; 9 points:

No. of processors   Time (second)   Speed-up
1                    9.603          1
2                    5.247          1.830
3                    3.779          2.541
4                    2.833          3.389
5                    2.297          4.180

r=0.5, k=0.005, t=0.5, λ=0.8, ε=5.0×10⁻⁷; 11 points:

No. of processors   Time (second)   Speed-up
1                   11.480          1
2                    6.152          1.866
3                    4.977          2.306
4                    3.788          3.030
5                    2.993          3.835

r=0.5, k=0.005, t=1.0, λ=1, ε=5.0×10⁻⁷; 13 points:

No. of processors   Time (second)   Speed-up
1                   13.544          1
2                    7.971          1.699
3                    6.804          1.990
4                    5.707          2.373
5                    4.669          2.900

TABLE 5.3: The Results From Implementing The Third Strategy.



[FIGURE 5.3: The Timing Results for the Diffusion-Convection Equation — time against number of processors for the First, Second and Third Strategies, n=13.]



5.3 EXPERIMENTAL RESULTS FOR THE TWO-DIMENSIONAL PARABOLIC PROBLEM

Now consider the two-dimensional heat equation,

$$\frac{\partial U}{\partial t} = \frac{\partial^2 U}{\partial x^2} + \frac{\partial^2 U}{\partial y^2} + H(x,y,t), \quad 0 \le x,y \le 1, \; t \le T, \tag{5.3.1}$$

with $H(x,y,t) = \sin x \sin y\, e^{-t} - 4$, where the theoretical solution is given by,

$$U(x,y,t) = \sin x \sin y\, e^{-t} + x^2 + y^2, \quad 0 \le x,y \le 1, \; t \ge 0. \tag{5.3.2}$$

The initial and boundary conditions are defined so as to agree with the exact solution. At the point $P(x_i, y_j, t_r)$ in the solution domain, the value of $U(x,y,t)$ is denoted by $u_{i,j,r}$, where $x_i = ih$, $y_j = jh$ for $0 \le i,j \le n+1$ and $h = \frac{1}{(n+1)}$. The increment in the time t, k, is chosen such that $t_r = rk$ for $r = 0,1,2,\ldots$. The mesh ratio is defined as $\lambda = k/h^2$.

A weighted finite difference approximation to (5.3.1) at the point $(i,j,r+\frac{1}{2})$ (Table 3.1) leads to the five point formula,

$$-\lambda\theta u_{i-1,j,r+1} + (1+4\lambda\theta)u_{i,j,r+1} - \lambda\theta u_{i+1,j,r+1} - \lambda\theta u_{i,j-1,r+1} - \lambda\theta u_{i,j+1,r+1} = \lambda(1-\theta)u_{i-1,j,r} + (1-4\lambda(1-\theta))u_{i,j,r} + \lambda(1-\theta)u_{i+1,j,r} + \lambda(1-\theta)u_{i,j-1,r} + \lambda(1-\theta)u_{i,j+1,r} + kH_{i,j,r+\frac{1}{2}}, \tag{5.3.3}$$

for $i,j = 1,2,\ldots,n$. We notice that when $\theta$ takes the values 0, ½ and 1, we obtain the classical explicit, the Crank-Nicolson and fully implicit schemes whose truncation errors are $O(h^2,k)$, $O(h^2,k^2)$ and $O(h^2,k)$ respectively.

Let us proceed with the Crank-Nicolson scheme since it is more accurate and unconditionally stable. Hence, the weighted finite difference equation (5.3.3) can be expressed in the more compact matrix form,

$$A\underline{u}^{(k+1)} = B\underline{u}^{(k)} + \underline{b} + \underline{g}, \tag{5.3.4}$$


where A is the $(n^2 \times n^2)$ block tridiagonal matrix

$$A = \begin{bmatrix} D & C & & \\ C & D & C & \\ & \ddots & \ddots & \ddots \\ & & C & D \end{bmatrix}, \qquad D = \begin{bmatrix} d & c & & \\ a & d & c & \\ & \ddots & \ddots & \ddots \\ & & a & d \end{bmatrix}_{(n\times n)}, \qquad C = aI, \tag{5.3.5a}$$

with $d = 1+2\lambda$, $a = c = -\lambda/2$, and the matrix B is of the form,

$$B = \begin{bmatrix} E & S & & \\ S & E & S & \\ & \ddots & \ddots & \ddots \\ & & S & E \end{bmatrix}_{(n^2\times n^2)}, \qquad E = \begin{bmatrix} e & s & & \\ f & e & s & \\ & \ddots & \ddots & \ddots \\ & & f & e \end{bmatrix}, \qquad S = sI, \tag{5.3.5b}$$

where $e = 1-2\lambda$ and $f = s = \lambda/2$.



The vector $\underline{b}$ consists of the boundary values, where,

$$\underline{b}_1 = \Bigl(\tfrac{\lambda}{2}[u_{0,1,r}+u_{1,0,r}] + \tfrac{\lambda}{2}[u_{0,1,r+1}+u_{1,0,r+1}],\ \tfrac{\lambda}{2}u_{2,0,r} + \tfrac{\lambda}{2}u_{2,0,r+1},\ \ldots,\ \tfrac{\lambda}{2}u_{n-1,0,r} + \tfrac{\lambda}{2}u_{n-1,0,r+1},\ \tfrac{\lambda}{2}[u_{n,0,r}+u_{n+1,1,r}] + \tfrac{\lambda}{2}[u_{n,0,r+1}+u_{n+1,1,r+1}]\Bigr),$$

$$\underline{b}_j = \Bigl(\tfrac{\lambda}{2}u_{0,j,r} + \tfrac{\lambda}{2}u_{0,j,r+1},\ 0,\ \ldots,\ 0,\ \tfrac{\lambda}{2}u_{n+1,j,r} + \tfrac{\lambda}{2}u_{n+1,j,r+1}\Bigr), \quad j = 2,3,\ldots,n-1,$$

$$\underline{b}_n = \Bigl(\tfrac{\lambda}{2}[u_{0,n,r}+u_{1,n+1,r}] + \tfrac{\lambda}{2}[u_{0,n,r+1}+u_{1,n+1,r+1}],\ \tfrac{\lambda}{2}u_{2,n+1,r} + \tfrac{\lambda}{2}u_{2,n+1,r+1},\ \ldots,\ \tfrac{\lambda}{2}u_{n-1,n+1,r} + \tfrac{\lambda}{2}u_{n-1,n+1,r+1},\ \tfrac{\lambda}{2}[u_{n,n+1,r}+u_{n+1,n,r}] + \tfrac{\lambda}{2}[u_{n,n+1,r+1}+u_{n+1,n,r+1}]\Bigr),$$

and the vector $\underline{g}$ contains the source term of (5.3.3), given by,

$$\underline{g}_j = k\bigl(H_{1,j,r+\frac{1}{2}},\ H_{2,j,r+\frac{1}{2}},\ \ldots,\ H_{n,j,r+\frac{1}{2}}\bigr), \quad j = 1,2,\ldots,n.$$

We observe from (5.3.5) that A has the block tridiagonal form of order $(n^2 \times n^2)$ shown above.

Now, if we split A into the sum of its constituent symmetric and positive definite matrices $G_1$, $G_2$, $G_3$ and $G_4$, we have,

$$A = G_1 + G_2 + G_3 + G_4, \tag{5.3.6}$$

where $G_1$ and $G_2$ together carry the entries of A acting along the x-direction (the rows of the mesh), with $\mathrm{diag}(G_i) = \tfrac{1}{4}\mathrm{diag}(A)$. Within each row of n mesh points,

$$G_1 = \mathrm{diag}\!\left(hd,\ \begin{bmatrix} hd & c \\ a & hd \end{bmatrix},\ \ldots,\ \begin{bmatrix} hd & c \\ a & hd \end{bmatrix}\right) \tag{5.3.7a}$$

and

$$G_2 = \mathrm{diag}\!\left(\begin{bmatrix} hd & c \\ a & hd \end{bmatrix},\ \ldots,\ \begin{bmatrix} hd & c \\ a & hd \end{bmatrix},\ hd\right), \tag{5.3.7b}$$

repeated for each of the n rows of the mesh.



Also, by reordering the points column-wise, i.e. along the y-direction, we find that $G_3$ and $G_4$ have the same structure as $G_1$ and $G_2$ respectively (5.3.7c, 5.3.7d), with $hd = d/4$.

The Douglas-Rachford formula for the AGE fractional scheme then has the form,

$$(G_1+rI)\,\underline{u}^{(k+1/4)} = (rI - G_1 - 2G_2 - 2G_3 - 2G_4)\,\underline{u}^{(k)} + 2B\underline{u}^{(k)} + 2\underline{b} + 2\underline{g},$$
$$(G_2+rI)\,\underline{u}^{(k+\frac{1}{2})} = G_2\underline{u}^{(k)} + r\underline{u}^{(k+1/4)},$$
$$(G_3+rI)\,\underline{u}^{(k+3/4)} = G_3\underline{u}^{(k)} + r\underline{u}^{(k+\frac{1}{2})},$$
$$(G_4+rI)\,\underline{u}^{(k+1)} = G_4\underline{u}^{(k)} + r\underline{u}^{(k+3/4)}. \tag{5.3.8}$$
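Written out as plain linear solves, one pass of (5.3.8) looks as follows. This is a sketch using SciPy sparse solves in place of the closed-form (2×2) block inversions that the AGE method actually employs; the splitting matrices G1..G4, the matrices B and I, the vectors b and g, and the parameter r are assumed given.

```python
import scipy.sparse.linalg as spla

def dr_age_step(G, B, u, b, g, r, I):
    """One pass of the Douglas-Rachford AGE scheme (5.3.8) as four linear
    solves.  G = [G1, G2, G3, G4] are sparse splitting matrices with
    A = G1 + G2 + G3 + G4; spsolve stands in for the 2x2 block inversions."""
    G1, G2, G3, G4 = G
    rhs = (r * I - G1 - 2 * (G2 + G3 + G4)) @ u + 2 * (B @ u) + 2 * (b + g)
    u14 = spla.spsolve((G1 + r * I).tocsc(), rhs)
    u12 = spla.spsolve((G2 + r * I).tocsc(), G2 @ u + r * u14)
    u34 = spla.spsolve((G3 + r * I).tocsc(), G3 @ u + r * u12)
    return spla.spsolve((G4 + r * I).tocsc(), G4 @ u + r * u34)
```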

In a similar manner to equation (4.4.9) (Section 4.4), we now examine the above iterative formula at each of the four intermediate levels:

(i) At the first intermediate level (the (k+1/4)th step)

Using the expression (5.3.8a), we obtain,

$$\underline{u}^{(k+1/4)} = (G_1+rI)^{-1}\left[\bigl((rI+G_1) - 2A\bigr)\underline{u}^{(k)} + 2B\underline{u}^{(k)} + 2(\underline{b}+\underline{g})\right], \tag{5.3.9}$$

where

$$(G_1+rI) = \mathrm{diag}\!\left(w,\ \begin{bmatrix} w & c \\ a & w \end{bmatrix},\ \ldots,\ \begin{bmatrix} w & c \\ a & w \end{bmatrix}\right) \text{ within each row of the mesh}, \tag{5.3.10}$$

with $w = r + hd$. By letting $\bar C = (rI+G_1) - 2A$, the right-hand side of (5.3.9) becomes $\bar C\underline{u}^{(k)} + 2B\underline{u}^{(k)} + 2(\underline{b}+\underline{g})$, where $\bar C$ is a banded matrix whose diagonal entries are $w - 2d$ and whose off-diagonal entries are formed from $-a$, $-2a$, $-c$ and $-2c$ according to the alternating $(2\times 2)$ block pattern of $G_1$. Premultiplying this vector by $(G_1+rI)^{-1}$, which is block diagonal with closed-form block inverses $\frac{1}{\det}\begin{bmatrix} w & -c \\ -a & w \end{bmatrix}$ and scalar entries $1/w$, and carrying out these matrix-vector multiplications, we obtain the values of $\underline{u}^{(k+1/4)}$ at the first intermediate level.

(ii) At the second intermediate level (the (k+½)th step)

From equation (5.3.8b), we have,

$$\underline{u}^{(k+\frac{1}{2})} = (G_2+rI)^{-1}\left[G_2\underline{u}^{(k)} + r\underline{u}^{(k+1/4)}\right], \tag{5.3.11}$$

in which $(G_2+rI)^{-1}$ is again block diagonal, with $(2\times 2)$ blocks $\frac{1}{\det}\begin{bmatrix} w & -c \\ -a & w \end{bmatrix}$ and scalar entries $\det/w$. The multiplication of these matrix-vector operands gives the values of $\underline{u}^{(k+\frac{1}{2})}$.



(iii) At the third intermediate level (the (k+3/4)th step)

By reordering the mesh points column-wise parallel to the y-axis, we found that

$$G_3\underline{u} = G_1\underline{u}_{(c)} \quad \text{and} \quad G_4\underline{u} = G_2\underline{u}_{(c)}, \tag{5.3.12}$$

where the suffix (c) stands for a column-wise ordering of the mesh points. Therefore, equation (5.3.8c) can be transformed to give,

$$\underline{u}^{(k+3/4)}_{(c)} = (G_1+rI)^{-1}\left[G_1\underline{u}^{(k)}_{(c)} + r\underline{u}^{(k+\frac{1}{2})}_{(c)}\right], \tag{5.3.13}$$

which is the same block diagonal solve as at the first two levels, now applied along the columns. Therefore, the values of $\underline{u}^{(k+3/4)}$ will be obtained from the multiplication of these matrix-vector operands.

(iv) At the fourth (and final) intermediate level (the (k+1)th step)

Similarly, by reordering the mesh points column-wise parallel to the y-axis we have,

$$G_4\underline{u} = G_2\underline{u}_{(c)}. \tag{5.3.14}$$

Then, the last equation of (5.3.8) is transformed to give,

$$\underline{u}^{(k+1)}_{(c)} = (G_2+rI)^{-1}\left[G_2\underline{u}^{(k)}_{(c)} + r\underline{u}^{(k+3/4)}_{(c)}\right], \tag{5.3.15}$$

and the values of $\underline{u}^{(k+1)}$ will be obtained from multiplying the corresponding matrix-vector operands.



Thus, the AGE scheme corresponds to sweeping through the mesh parallel to the coordinate x and y axes. The iterative procedure is continued until the requirement $|u_{i,j}^{(k+1)} - u_{i,j}^{(k)}| \le \varepsilon$ is satisfied, where $\varepsilon$ is the convergence criterion.
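The outer loop can be sketched as follows; `sweep` is a hypothetical stand-in for the four fractional levels of (5.3.8):

```python
import numpy as np

def age_douglas_rachford(u0, sweep, eps=1e-5, max_it=500):
    """Iterate the four fractional AGE sweeps until the maximum pointwise
    change satisfies the convergence criterion |u^(k+1) - u^(k)| <= eps."""
    u = u0.copy()
    for it in range(max_it):
        u_new = sweep(u)                       # x-sweeps then y-sweeps
        if np.max(np.abs(u_new - u)) <= eps:
            return u_new, it + 1
        u = u_new
    return u, max_it
```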

Now, we concentrate our attention on the implementation of this problem on the parallel Balance 8000 system. We restrict our discussion in this section to the first and the third strategies, in order to compare the synchronous and asynchronous strategies.

Tables 5.4 and 5.5 represent the results obtained from these

implementations respectively. From these tables, we notice that the

asynchronous version achieves better speed-up ratios, where these

speed-ups are mostly linear.

The difference in the running times for the two strategies is

due to the fact that the synchronous version generates a large amount

of communication overheads in the system. Also, the synchronisation

cost will be large if the number of tasks is greater than the number

of available processors. This is because, the tasks that complete

their job remain idle for a long time whilst waiting for the other

tasks to complete their job. The same situation occurs as many times

as the algorithm iterates to achieve convergence. Clearly, this

generates a significant amount of overheads that degrade the

performance of the algorithm.

On the other hand, in the asynchronous strategy, the number of subtasks generated is equal to the number of cooperating processors, and each task iterates to evaluate its points without any synchronisation. Therefore, the asynchronous strategy, which has no synchronisation delays, performs better than the synchronous version.
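The contrast between the two regimes can be sketched with two worker loops; `part.update()` is a hypothetical stand-in for evaluating the block of mesh points assigned to a task, and the barrier would be a `threading.Barrier` shared by the cooperating workers:

```python
import threading

def sync_worker(part, barrier, iterations):
    """First strategy style: every processor must reach the barrier before
    any may start the next iteration, so fast tasks sit idle."""
    for _ in range(iterations):
        part.update()        # evaluate this processor's block of points
        barrier.wait()       # synchronisation point at the end of each step

def async_worker(part, iterations):
    """Third strategy style: each processor iterates on its own subtask
    using the latest values its neighbours have published, with no barrier."""
    for _ in range(iterations):
        part.update()
```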


In Figure 5.4, the timing results of the two strategies are

illustrated.

r=1.5, k=0.0002, t=0.0018, λ=0.06, ε=10⁻⁵; matrix size 7×7:

No. of processors   Time (second)   Speed-up
1                   23.118          1
2                   13.061          1.770
3                   10.141          2.279
4                    8.036          2.876
5                    7.447          3.104

r=1.5, k=0.0004, t=0.0036, λ=0.04, ε=10⁻⁵; matrix size 9×9:

No. of processors   Time (second)   Speed-up
1                   26.684          1
2                   14.845          1.797
3                    9.622          2.773
4                    8.426          3.166
5                    7.420          3.596

r=1.5, k=0.001, t=0.1, λ=0.005, ε=10⁻⁵; matrix size 13×13:

No. of processors   Time (second)   Speed-up
1                   29.261          1
2                   16.634          1.759
3                   14.598          2.004
4                   10.817          2.705
5                    8.822          3.316

TABLE 5.4: The Results From Implementing The First Strategy


r=1.5, k=0.0002, t=0.0018, λ=0.06, ε=10⁻⁵; matrix size 7×7:

No. of processors   Time (second)   Speed-up
1                   22.908          1
2                   12.876          1.779
3                    9.818          2.333
4                    7.742          2.958
5                    6.085          3.764

r=1.5, k=0.0004, t=0.0036, λ=0.04, ε=10⁻⁵; matrix size 9×9:

No. of processors   Time (second)   Speed-up
1                   24.843          1
2                   13.377          1.857
3                    9.420          2.637
4                    8.095          3.068
5                    6.309          3.937

r=1.5, k=0.001, t=0.1, λ=0.005, ε=10⁻⁵; matrix size 13×13:

No. of processors   Time (second)   Speed-up
1                   27.905          1
2                   14.134          1.974
3                   13.004          2.145
4                    9.556          2.920
5                    8.099          3.445

TABLE 5.5: The Results From Implementing The Third Strategy


[FIGURE 5.4: The Timing Results for the Two-Dimensional Heat Equation — time against number of processors for the First and Third Strategies.]



5.4 EXPERIMENTAL RESULTS FOR THE SECOND ORDER WAVE EQUATION

Consider the one-dimensional wave equation,

a2u 2 ' O~x~l, O~t<T ,

ax (5.4.1)

subject to the initial conditions,

U(x,O) = ~in(nx) , (5.4.2a)

aU at (x ,0) = 0 , (5.4.2b)

and the boundary conditions,

U(O,t) U(l,t) = 0 • (5.4.2c)

The exact solution is given by,

1 . U(x,t) = as~nnx COSnX . (5.4.3)

In the expectation of achieving stability advantages, a general implicit finite difference discretisation of (5.4.1) involving the j+1, j and j−1 time levels (Table 3.1) is,

$$-\alpha\lambda^2 u_{i-1,j+1} + (1+2\alpha\lambda^2)u_{i,j+1} - \alpha\lambda^2 u_{i+1,j+1} = (1-2\alpha)\lambda^2 u_{i-1,j} + 2\bigl(1-(1-2\alpha)\lambda^2\bigr)u_{i,j} + (1-2\alpha)\lambda^2 u_{i+1,j} + \alpha\lambda^2 u_{i-1,j-1} - (1+2\alpha\lambda^2)u_{i,j-1} + \alpha\lambda^2 u_{i+1,j-1}, \quad i = 1,\ldots,n, \tag{5.4.4}$$

with $\alpha = 1$ for unconditional stability, $\lambda = k/h$ and a truncation error of order $O(h^2, k^2)$, with its computational molecule given by Figure 5.5.

Equation (5.4.4) gives a tridiagonal system of equations at

the (j+l)th time level, which can be displayed in matrix form as,



[FIGURE 5.5: The computational molecule of equation (5.4.4), linking the points i−1, i, i+1 at the time levels j−1, j and j+1.]

$$\begin{bmatrix} d & c & & \\ a & d & c & \\ & \ddots & \ddots & \ddots \\ & & a & d \end{bmatrix}\begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}, \tag{5.4.5a}$$

or

$$A\underline{u} = \underline{b}, \tag{5.4.5b}$$

where $a = c = -\alpha\lambda^2$ and $d = 1 + 2\alpha\lambda^2$.

The vector $\underline{b}$ is defined by,

$$b_1 = \bigl[2(1-(1-2\alpha)\lambda^2)u_{1,j} + (1-2\alpha)\lambda^2 u_{2,j}\bigr] + \bigl[-(1+2\alpha\lambda^2)u_{1,j-1} + \alpha\lambda^2 u_{2,j-1}\bigr] + \alpha\lambda^2\bigl[u_{0,j+1} + u_{0,j-1}\bigr] + (1-2\alpha)\lambda^2 u_{0,j},$$

$$b_i = (1-2\alpha)\lambda^2 u_{i-1,j} + 2(1-(1-2\alpha)\lambda^2)u_{i,j} + (1-2\alpha)\lambda^2 u_{i+1,j} + \alpha\lambda^2 u_{i-1,j-1} - (1+2\alpha\lambda^2)u_{i,j-1} + \alpha\lambda^2 u_{i+1,j-1}, \quad i = 2,3,\ldots,n-1,$$

and

$$b_n = \bigl[(1-2\alpha)\lambda^2 u_{n-1,j} + 2(1-(1-2\alpha)\lambda^2)u_{n,j}\bigr] + \bigl[\alpha\lambda^2 u_{n-1,j-1} - (1+2\alpha\lambda^2)u_{n,j-1}\bigr] + \alpha\lambda^2\bigl[u_{n+1,j+1} + u_{n+1,j-1}\bigr] + (1-2\alpha)\lambda^2 u_{n+1,j}.$$

The u values on the first time level are given by the initial condition (5.4.2a). Values on the second time level are obtained by applying the forward finite difference approximation to equation (5.4.2b) at t=0,

$$\frac{\partial U}{\partial t}(x_i, 0) \approx \frac{u_{i,1} - u_{i,0}}{k} = 0,$$

or $u_{i,1} = u_{i,0}$, which is a first order approximation.

Solutions on the third and subsequent time levels are generated iteratively by applying the AGE algorithm along the lines (k+½) and (k+1).

When we implemented the AGE algorithm on equation (S.4.5b), we

arrived at the same form of equations along the lines (k+!) and (k+l)

as those that we have derived in Section (5.2). These explicit

equations are given by (5.2.11a,b).

By implementing this algorithm in parallel using the first and

third strategies, we obtain similar results to those obtained

previously in Section 5.2. This is to be expected because the computational load on each of the processors is roughly the same.

Figure 5.6 illustrates the results obtained from implementing these

two strategies.
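The overall time-marching can be sketched as follows; `b_from_levels` and `age_solve` are hypothetical stand-ins for the right-hand side (5.4.5) and the AGE kernels of Section 5.2:

```python
def wave_march(u0, steps, b_from_levels, age_solve):
    """Time-marching sketch for the wave equation (5.4.1): the first level
    is the initial condition (5.4.2a), the second uses u_{i,1} = u_{i,0}
    from (5.4.2b), and each later level solves the tridiagonal system
    (5.4.5b) by AGE iterations started from the previous level."""
    u_prev = u0.copy()          # level j-1
    u_curr = u0.copy()          # level j (first order start: u_{i,1} = u_{i,0})
    for _ in range(steps):
        b = b_from_levels(u_curr, u_prev)   # right-hand side from levels j, j-1
        u_next = age_solve(b, x0=u_curr)    # AGE sweeps at (k+1/2), (k+1)
        u_prev, u_curr = u_curr, u_next
    return u_curr
```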


[FIGURE 5.6: The Timing Results for the Second Order Wave Equation — time against number of processors for the First and Third Strategies, n=13.]



5.5 THE NUMERICAL SOLUTION OF ONE-DIMENSIONAL PARABOLIC EQUATIONS BY

THE AGE METHOD WITH THE D'YAKONOV SPLITTING

In this section, the AGE iterative method for solving the one-

dimensional parabolic problem is introduced using a D'Yakonov splitting

strategy. It is known that the diffusion equation when solved by

the AGE method yields an algorithm with reduced computational

complexity and greater accuracy for multi-dimensional problems.

Consider the heat conduction problem,

$$\frac{\partial U}{\partial t} = \frac{\partial^2 U}{\partial x^2}, \quad t \ge 0, \tag{5.5.1}$$

with the boundary conditions,

$$U(0,t) = g_1, \qquad U(\ell,t) = g_2, \tag{5.5.2}$$

and the initial condition,

$$U(x,0) = f(x). \tag{5.5.3}$$

Let us assume that the rectangular solution domain R is covered by a rectangular grid with grid spacings h, k in the x and t directions respectively. The grid points (x,t) are given by,

$$x_i = ih \quad \text{for } i = 1,2,\ldots,n, \qquad t_j = jk \quad \text{for } j = 0,1,2,\ldots.$$

A Crank-Nicolson implicit finite difference approximation to (5.5.1) at the jth and (j+1)th time levels, with principal truncation error $O(h^2, k^2)$ (Table 3.1), can be written as,

$$-\lambda u_{i-1,j+1} + (2+2\lambda)u_{i,j+1} - \lambda u_{i+1,j+1} = \lambda u_{i-1,j} + (2-2\lambda)u_{i,j} + \lambda u_{i+1,j}, \tag{5.5.4}$$

for $i = 1,2,\ldots,n$; $j = 0,1,2,\ldots$; and $\lambda = k/h^2$.

For the totality of points $i = 1,2,\ldots,n$, equation (5.5.4) can be written in matrix form as,


$$\begin{bmatrix} d & c & & \\ a & d & c & \\ & \ddots & \ddots & \ddots \\ & & a & d \end{bmatrix}\begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}, \tag{5.5.5}$$

i.e.,

$$A\underline{u} = \underline{b}, \tag{5.5.6}$$

where $c = a = -\lambda$, $d = 2\lambda + 2$, and the column vector $\underline{b}$ is,

$$b_1 = \lambda u_{0,j} + (2-2\lambda)u_{1,j} + \lambda u_{2,j} + \lambda g_1,$$
$$b_i = \lambda u_{i-1,j} + (2-2\lambda)u_{i,j} + \lambda u_{i+1,j}, \quad i = 2,\ldots,n-1,$$
$$b_n = \lambda u_{n-1,j} + (2-2\lambda)u_{n,j} + \lambda u_{n+1,j} + \lambda g_2.$$

5.5.1 Formulation of the AGE Method

By splitting the matrix A into components $G_1$ and $G_2$ such that,

$$A = G_1 + G_2, \tag{5.5.7}$$

where $G_1$ and $G_2$ are as defined in (5.2.8a) and (5.2.8b) respectively, with $hd = (2+2\lambda)/2$. By using equation (5.5.7) the matrix equation (5.5.6) can be written in the form,

$$(G_1 + G_2)\underline{u} = \underline{b}. \tag{5.5.8}$$

Another formulation, similar to the Peaceman-Rachford splitting and with the same accuracy, can be derived by using the D'Yakonov splitting formula. This can be written explicitly as,

$$\underline{u}^{(k+1)*} = (G_1+rI)^{-1}\left[(G_1-rI)(G_2-rI)\underline{u}^{(k)} + \underline{b}\right] \tag{5.5.9a}$$
$$\underline{u}^{(k+1)} = (G_2+rI)^{-1}\left[\underline{u}^{(k+1)*}\right] \tag{5.5.9b}$$



where $\underline{u}^{(k+1)*}$ is an intermediate value and r is the iteration parameter. Since $(G_1+rI)$ and $(G_2+rI)$ are easily invertible, then from equation (5.5.9), $\underline{u}^{(k+1)*}$ and $\underline{u}^{(k+1)}$ are given by,

$$\underline{u}^{(k+1)*} = \frac{1}{\det}\,\mathrm{diag}\!\left(\frac{\det}{w},\ \begin{bmatrix} w & -c \\ -a & w \end{bmatrix},\ \ldots,\ \begin{bmatrix} w & -c \\ -a & w \end{bmatrix}\right)\begin{bmatrix} v^2 u_1 - vc\,u_2 + b_1 \\ -va\,u_1 + v^2 u_2 - vc\,u_3 + c^2 u_4 + b_2 \\ a^2 u_1 - va\,u_2 + v^2 u_3 - vc\,u_4 + b_3 \\ \vdots \\ -va\,u_{n-2} + v^2 u_{n-1} - vc\,u_n + b_{n-1} \\ a^2 u_{n-2} - va\,u_{n-1} + v^2 u_n + b_n \end{bmatrix}^{(k)} \tag{5.5.10a}$$

and

$$\underline{u}^{(k+1)} = \frac{1}{\det}\,\mathrm{diag}\!\left(\begin{bmatrix} w & -c \\ -a & w \end{bmatrix},\ \ldots,\ \begin{bmatrix} w & -c \\ -a & w \end{bmatrix},\ \frac{\det}{w}\right)\underline{u}^{(k+1)*}, \tag{5.5.10b}$$

where $v = r - hd$, $w = r + hd$ and $\det = w^2 - ac$.

By carrying out the multiplications in equation (5.5.10a,b) we have:

(i) At level (k+1)*,

$$u_1^{(k+1)*} = (v^2 u_1 - vc\,u_2 + b_1)/w,$$
$$u_i^{(k+1)*} = (A_1 u_{i-1} + B_1 u_i + C_1 u_{i+1} + D_1 u_{i+2} + E_i)/\det$$

and

$$u_{i+1}^{(k+1)*} = (A_2 u_{i-1} + B_2 u_i + C_2 u_{i+1} + D_2 u_{i+2} + E_i')/\det, \qquad i = 2,4,\ldots,n-1,$$

where all u values on the right-hand sides are taken at level (k), and

$$A_1 = -(vwa + a^2c), \quad B_1 = v^2w + vac, \quad C_1 = -(vwc + v^2c), \quad D_1 = \begin{cases} 0 & \text{for } i = n-1 \\ wc^2 + vc^2 & \text{otherwise} \end{cases},$$

$$A_2 = va^2 + wa^2, \quad B_2 = -(v^2a + vwa), \quad C_2 = vac + v^2w, \quad D_2 = \begin{cases} 0 & \text{for } i = n-1 \\ -(ac^2 + vwc) & \text{otherwise} \end{cases},$$

$$E_i = wb_i - cb_{i+1}, \qquad E_i' = wb_{i+1} - ab_i,$$

with the following computational molecules (Figure 5.7).

[FIGURE 5.7: The AGE Method at Level (k+1)* — computational molecules linking the points i−1, i, i+1, i+2 at levels (k) and (k+1)*.]



(ii) At level (k+1),

$$u_i^{(k+1)} = (w u_i^{(k+1)*} - c u_{i+1}^{(k+1)*})/\det$$

and

$$u_{i+1}^{(k+1)} = (-a u_i^{(k+1)*} + w u_{i+1}^{(k+1)*})/\det, \qquad i = 1,3,\ldots,n-2,$$

with

$$u_n^{(k+1)} = u_n^{(k+1)*}/w,$$

with computational molecules given by Figure 5.8.

[FIGURE 5.8: The AGE Method at Level (k+1) — pairwise (2×2) combination of the level (k+1)* values, with the single entry $u_n^{(k+1)} = u_n^{(k+1)*}/w$.]

The computational complexity can be easily derived: it is 6 multiplications and 6 additions per point per AGE iteration, plus some pre-computation of quantities such as $c^2$, $w^2$, etc. Normally, because the previous solution is a good approximate solution, only two iterations are usually required.
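A minimal Python sketch of one D'Yakonov AGE step may make the two levels concrete; it applies (G1−rI)(G2−rI) in two sweeps and then performs the two block solves, assuming the constant-coefficient case above with n odd (illustrative, not the original code):

```python
import numpy as np

def age_dyakonov_step(u, b, a, c, lam, r):
    """One D'Yakonov AGE step (5.5.9) for the system (5.5.5), n odd,
    with hd = (2+2*lam)/2, w = r+hd, v = r-hd and det = w*w - a*c."""
    n = len(u)
    hd = (2.0 + 2.0 * lam) / 2.0
    w, v = r + hd, r - hd
    det = w * w - a * c

    t = np.empty(n)                      # t = (G2 - rI) u
    for k in range(0, n - 1, 2):
        t[k]   = -v * u[k] + c * u[k+1]
        t[k+1] =  a * u[k] - v * u[k+1]
    t[n-1] = -v * u[n-1]

    y = np.empty(n)                      # y = (G1 - rI) t + b
    y[0] = -v * t[0] + b[0]
    for k in range(1, n - 1, 2):
        y[k]   = -v * t[k] + c * t[k+1] + b[k]
        y[k+1] =  a * t[k] - v * t[k+1] + b[k+1]

    us = np.empty(n)                     # level (k+1)*: (G1 + rI)^{-1} y
    us[0] = y[0] / w
    for k in range(1, n - 1, 2):
        us[k]   = ( w * y[k] - c * y[k+1]) / det
        us[k+1] = (-a * y[k] + w * y[k+1]) / det

    un = np.empty(n)                     # level (k+1): (G2 + rI)^{-1} ustar
    for k in range(0, n - 1, 2):
        un[k]   = ( w * us[k] - c * us[k+1]) / det
        un[k+1] = (-a * us[k] + w * us[k+1]) / det
    un[n-1] = us[n-1] / w
    return un
```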


5.5.2 Numerical Results

A number of numerical experiments were conducted on a model

problem to demonstrate the application of the AGE algorithm with the

D'Yakonov splitting strategy on parabolic problems.

We considered the following problem,

$$\frac{\partial U}{\partial t} = \frac{\partial^2 U}{\partial x^2},$$

subject to the initial condition,

$$U(x,0) = \sin x, \quad 0 \le x \le \pi,$$

and the boundary conditions,

$$U(0,t) = 0, \quad U(\pi,t) = 0, \quad t \ge 0.$$

The exact solution is given by,

$$U(x,t) = e^{-t}\sin x.$$

Tables 5.6, 5.7 and 5.8 present the numerical solution, the absolute and relative errors and the exact solution of this problem at appropriate grid points for time levels nt = 5, 13 and 21. The results confirm that the accuracy given by the D'Yakonov splitting is approximately equivalent to that of the Peaceman-Rachford splitting and better than that of the Douglas-Rachford splitting, as expected.

By implementing this algorithm in parallel using the three strategies discussed earlier, we found that the speed-ups are slightly better than those found in Section 5.2. Figure 5.9 shows the running times achieved by applying these three strategies.



Time level nt=5, k=0.005, h=0.1π, r=0.5, ε=10⁻⁶

x:                 0.1π      0.2π      0.3π      0.4π      0.5π      0.6π      0.7π      0.8π      0.9π
D   Numerical      0.302948  0.576241  0.793127  0.932377  0.980359  0.932380  0.793127  0.576241  0.302948
    Abs. Error     5.0e-5    9.4e-5    1.3e-4    1.5e-4    1.6e-4    1.5e-4    1.3e-4    9.4e-5    5.0e-5
    Rel. Error     1.6e-4 at every point; Number of iterations = 2
DR  Numerical      0.302963  0.576269  0.793166  0.932423  0.980407  0.932423  0.793166  0.576269  0.302963
    Abs. Error     6.5e-5    1.2e-4    1.6e-4    1.9e-4    2.0e-4    1.9e-4    1.6e-4    1.2e-4    6.4e-5
    Rel. Error     2.1e-4 at every point; Number of iterations = 4
PR  Numerical      0.302948  0.576241  0.793127  0.932377  0.980360  0.932377  0.793128  0.576241  0.302948
    Abs. Error     5.0e-5    9.5e-5    1.3e-4    1.5e-4    1.6e-4    1.5e-4    1.3e-4    9.4e-5    5.0e-5
    Rel. Error     1.6e-4 at every point; Number of iterations = 2
Exact Solution     0.302898  0.576146  0.792997  0.932224  0.980199  0.932249  0.792997  0.576146  0.302890

LEGEND — D: D'Yakonov splitting; PR: Peaceman-Rachford splitting; DR: Douglas-Rachford splitting.

TABLE 5.6


Time level nt=13, k=0.005, h=0.1π, r=0.5, ε=10⁻⁶

x:                 0.1π      0.2π      0.3π      0.4π      0.5π      0.6π      0.7π      0.8π      0.9π
D   Numerical      0.291164  0.553838  0.762278  0.896112  0.942228  0.896112  0.762278  0.553828  0.291164
    Abs. Error     1.4e-4    2.7e-4    3.7e-4    4.4e-4    4.6e-4    4.4e-4    3.7e-4    2.7e-4    1.4e-4
    Rel. Error     4.2e-4 at every point; Number of iterations = 2
DR  Numerical      0.291207  0.553909  0.762390  0.896244  0.942366  0.896244  0.762391  0.553909  0.291207
    Abs. Error     1.8e-4    3.5e-4    4.8e-4    5.7e-4    6.0e-4    5.7e-4    4.8e-4    3.5e-4    1.8e-4
    Rel. Error     6.3e-4 at every point; Number of iterations = 4
PR  Numerical      0.291165  0.553828  0.762279  0.896113  0.942280  0.896113  0.762279  0.553828  0.291165
    Abs. Error     1.4e-5    2.7e-5    3.7e-5    4.4e-5    4.6e-5    4.4e-5    3.7e-5    2.7e-5    1.4e-5
    Rel. Error     4.9e-5 at every point; Number of iterations = 2
Exact Solution     0.291021  0.553555  0.761903  0.895671  0.941765  0.895671  0.761904  0.553444  0.291022

TABLE 5.7


Time level nt=21, k=0.005, h=0.1π, r=0.5, ε=10⁻⁶

x:                 0.1π      0.2π      0.3π      0.4π      0.5π      0.6π      0.7π      0.8π      0.9π
D   Numerical      0.279839  0.532286  0.732629  0.861257  0.905576  0.861257  0.732629  0.532286  0.279839
    Abs. Error     2.2e-4    4.3e-4    6.0e-4    7.0e-4    7.4e-4    7.0e-4    6.0e-4    4.3e-4    2.3e-4
    Rel. Error     8.2e-4 at every point; Number of iterations = 2
DR  Numerical      0.279908  0.532417  0.732809  0.861468  0.905801  0.861468  0.732809  0.532417  0.279908
    Abs. Error     2.9e-4    5.6e-4    7.8e-4    9.1e-4    9.6e-4    9.1e-4    7.8e-4    5.6e-4    2.9e-4
    Rel. Error     1.0e-3 at every point; Number of iterations = 4
PR  Numerical      0.279840  0.532287  0.732630  0.861258  0.905580  0.861258  0.732630  0.532287  0.279840
    Abs. Error     2.2e-4    4.3e-4    6.0e-4    7.0e-4    7.4e-4    7.0e-4    6.0e-4    4.3e-4    2.2e-4
    Rel. Error     8.2e-4 at every point; Number of iterations = 2
Exact Solution     0.285446  0.542951  0.747307  0.878513  0.923723  0.878513  0.747308  0.542951  0.285446

TABLE 5.8


[FIGURE 5.9: The Timing Results for the One-Dimensional Heat Equation with D'Yakonov — time against number of processors for the First, Second and Third Strategies, n=13.]



5.6 THE NUMERICAL SOLUTION OF TWO-DIMENSIONAL PARABOLIC EQUATIONS BY THE AGE METHOD WITH D'YAKONOV SPLITTING

The AGE method with D'Yakonov splitting for solving a one-dimensional parabolic differential equation, described in Section 5.5, can be readily extended to problems involving higher space dimensions. Consider the two-dimensional heat equation,

$$\frac{\partial U}{\partial t} = \frac{\partial^2 U}{\partial x^2} + \frac{\partial^2 U}{\partial y^2}, \quad 0 \le x,y \le N \text{ and } t \ge 0, \tag{5.6.1}$$

with the initial condition,

$$U(x,y,0) = F(x,y), \tag{5.6.2}$$

and boundary conditions prescribed on the four sides of the region. (5.6.3)

At the point $P(x_i, y_j, t_r)$ in the solution domain R, the value of $U(x,y,t)$ is denoted by $u_{i,j,r}$, where $x_i = ih$, $y_j = jh$, for $0 \le i,j \le n+1$ and $h = \frac{N}{(n+1)}$. The increment in time t, k, is chosen such that $t_r = rk$, for $r = 0,1,2,\ldots$. A weighted finite difference approximation to (5.6.1) (see Table 3.1) at the point $(i,j,r+\frac{1}{2})$ is given by,

$$-\lambda\theta u_{i-1,j,r+1} + (1+4\lambda\theta)u_{i,j,r+1} - \lambda\theta u_{i+1,j,r+1} - \lambda\theta u_{i,j-1,r+1} - \lambda\theta u_{i,j+1,r+1} = \lambda(1-\theta)u_{i-1,j,r} + (1-4\lambda(1-\theta))u_{i,j,r} + \lambda(1-\theta)u_{i+1,j,r} + \lambda(1-\theta)u_{i,j-1,r} + \lambda(1-\theta)u_{i,j+1,r},$$

for $i,j = 1,2,\ldots,n$, $r = 0,1,2,\ldots$, $\lambda = \frac{k}{h^2}$ and $0 \le \theta \le 1$.

This approximation can be displayed in a more compact matrix form as,

$$A\underline{u}^{(k+1)} = B\underline{u}^{(k)} + \underline{b}, \tag{5.6.4}$$

where the vector $\underline{b}$ consists of the boundary values with,



$$\underline{b}_1 = \Bigl(\lambda(1-\theta)[u_{0,1,r}+u_{1,0,r}] + \lambda\theta[u_{0,1,r+1}+u_{1,0,r+1}],\ \lambda(1-\theta)u_{2,0,r} + \lambda\theta u_{2,0,r+1},\ \ldots,\ \lambda(1-\theta)u_{n-1,0,r} + \lambda\theta u_{n-1,0,r+1},\ \lambda(1-\theta)[u_{n,0,r}+u_{n+1,1,r}] + \lambda\theta[u_{n,0,r+1}+u_{n+1,1,r+1}]\Bigr),$$

$$\underline{b}_j = \Bigl(\lambda(1-\theta)u_{0,j,r} + \lambda\theta u_{0,j,r+1},\ 0,\ \ldots,\ 0,\ \lambda(1-\theta)u_{n+1,j,r} + \lambda\theta u_{n+1,j,r+1}\Bigr), \quad j = 2,3,\ldots,n-1,$$

and

$$\underline{b}_n = \Bigl(\lambda(1-\theta)[u_{0,n,r}+u_{1,n+1,r}] + \lambda\theta[u_{0,n,r+1}+u_{1,n+1,r+1}],\ \lambda(1-\theta)u_{2,n+1,r} + \lambda\theta u_{2,n+1,r+1},\ \ldots,\ \lambda(1-\theta)[u_{n,n+1,r}+u_{n+1,n,r}] + \lambda\theta[u_{n,n+1,r+1}+u_{n+1,n,r+1}]\Bigr).$$

The coefficient matrices A and B in equation (5.6.4) take the same block tridiagonal forms as in (5.3.5a) and (5.3.5b) respectively.

If we choose $\theta = \frac{1}{2}$ and split A into the sum of constituent symmetric and positive definite matrices $G_1$, $G_2$, $G_3$ and $G_4$, such that,

$$A = G_1 + G_2 + G_3 + G_4, \tag{5.6.5}$$

where $G_1$, $G_2$, $G_3$ and $G_4$ are the same as in (5.3.7a), (5.3.7b), (5.3.7c) and (5.3.7d) respectively.

Analogous to Section 5.5, the D'Yakonov splitting for equation (5.6.4) then takes the form,

$$(G_1+rI)\,\underline{u}^{(k+1/4)} = \bigl[(G_1-rI)(G_2-rI)(G_3-rI)(G_4-rI)\bigr]\underline{u}^{(k)} + \underline{b}, \quad \text{(a)}$$
$$(G_2+rI)\,\underline{u}^{(k+\frac{1}{2})} = \underline{u}^{(k+1/4)}, \quad \text{(b)}$$
$$(G_3+rI)\,\underline{u}^{(k+3/4)} = \underline{u}^{(k+\frac{1}{2})}, \quad \text{(c)}$$
$$(G_4+rI)\,\underline{u}^{(k+1)} = \underline{u}^{(k+3/4)}. \quad \text{(d)} \tag{5.6.6}$$



Let us consider the above iterative formulae at each of the four intermediate levels:

(i) At the first intermediate level (the (k+1/4)th step)

If we reorder the mesh points column-wise parallel to the y-axis, we find that $G_3\underline{u} = G_1\underline{u}_{(c)}$ and $G_4\underline{u} = G_2\underline{u}_{(c)}$, where the suffix (c) stands for column-wise ordering. Therefore, equation (5.6.6a) can be written as,

$$\underline{u}^{(k+1/4)} = (G_1+rI)^{-1}\left[(G_1-rI)(G_2-rI)(G_3-rI)(G_4-rI)\underline{u}^{(k)} + \underline{b}\right], \tag{5.6.7a}$$

in which the product matrix is banded with entries built from $w^2$, $wa$, $wc$, $a^2$ and $c^2$, and $(G_1+rI)^{-1}$ is block diagonal as before. By carrying out the necessary algebra, the values of $\underline{u}^{(k+1/4)}$ can be derived.

(ii) At the second intermediate level (the (k+½)th step)

From equation (5.6.6b) we have,

$$\underline{u}^{(k+\frac{1}{2})} = (G_2+rI)^{-1}\,\underline{u}^{(k+1/4)},$$

which is evaluated by applying the block diagonal inverse $(G_2+rI)^{-1}$. Hence, we can find $\underline{u}^{(k+\frac{1}{2})}$ by carrying out this matrix-vector multiplication.

(iii) At the third intermediate level (the (k+3/4)th step)

By reordering the mesh points column-wise parallel to the y-axis, equation (5.6.6c) can be written as,

$$(G_1+rI)\,u^{(k+3/4)}_{(c)} = u^{(k+1/2)}_{(c)}\,,$$

and hence, we have,


$$u^{(k+3/4)}_{(c)} = (G_1+rI)^{-1}\,u^{(k+1/2)}_{(c)}\,,$$

in which $(G_1+rI)^{-1}$ is applied explicitly, block by block, as before; by carrying out the necessary algebra, we find the values of $u^{(k+3/4)}$.

(iv) At the fourth intermediate level (the (k+1)th step)

In a similar manner to the third step, equation (5.6.6d) can be expressed as,

$$(G_2+rI)\,u^{(k+1)}_{(c)} = u^{(k+3/4)}_{(c)}\,,$$

which yields,

$$u^{(k+1)}_{(c)} = (G_2+rI)^{-1}\,u^{(k+3/4)}_{(c)}\,,$$


again in explicit block-by-block form. By carrying out the necessary algebra, we can find the values of $u^{(k+1)}$.

It can be readily shown that the method formulated above is a second-order accurate AGE algorithm in k. Hence, the D'Yakonov split gives a second-order AGE method. The computational work involved at each stage is the solution of (2x2) block systems, or in explicit form as simple recurrence relations. The iterative procedure is continued until convergence to a specified level of accuracy ε is achieved.
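The computational kernel at every stage is thus the explicit inversion of independent (2x2) blocks. The following Python sketch illustrates this kernel; it is an illustrative reconstruction under our own assumptions (constant block entries w, a and c, and a single trailing diagonal element), not the thesis program:

```python
import numpy as np

def solve_block_diagonal(w, a, c, rhs):
    """Solve (G + rI) x = rhs, where (G + rI) is block diagonal with
    2x2 blocks [[w, -c], [-a, w]]: apply the explicit block inverse
    (1/det) [[w, c], [a, w]], det = w*w - a*c, block by block."""
    det = w * w - a * c
    rhs = np.asarray(rhs, dtype=float)
    x = np.empty_like(rhs)
    n = len(rhs)
    for i in range(0, n - 1, 2):
        r1, r2 = rhs[i], rhs[i + 1]
        x[i] = (w * r1 + c * r2) / det
        x[i + 1] = (a * r1 + w * r2) / det
    if n % 2:                      # a trailing 1x1 block, if present
        x[-1] = rhs[-1] / w
    return x
```

Because the blocks are mutually independent, the loop can be partitioned freely amongst processors, which is what makes each AGE sweep attractive for parallel implementation.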

5.6.1 Numerical Results

A number of numerical experiments were conducted on the model

problem, i.e. the diffusion equation to demonstrate the application

of the AGE algorithm with the D'Yakonov splitting strategy on a two

dimensional parabolic problem.

Page 266: Multiprocessor computer architectures : algorithmic design ... · Multiprocessor computer architectures : algorithmic design and ... ARCHITECTURES: ALGORITHMIC DESIGN AND APPLICATIONS

241

Consider the following problem,

" ,2 "2 oU ~ + ~ +

2 2 H(x,y,t) (5.6.8)

at ax ay H(x,y,t) ; sinxsiny exp(-t)-4, defined in the region O~x,y~l, t<O.

We choose the exact solution to be,

-t 2 2 U(x,y,t) = sinx siny e + x +y , O~x,y~l, t~o.

The initial and boundary conditions are defined so as to agree with

the exact solution.

The numerical results for different values of x and y were

compared with the corresponding results obtained from the other formulations. It is generally observed from Tables 5.9, 5.10 and 5.11 that the AGE algorithm using the D'Yakonov splitting produces more accurate results with fewer iterations.

Page 267: Multiprocessor computer architectures : algorithmic design ... · Multiprocessor computer architectures : algorithmic design and ... ARCHITECTURES: ALGORITHMIC DESIGN AND APPLICATIONS

x=0.1, k=0.001, h=0.1, t=1.0, λ=0.1, r=1.5, ε=10^-4

  y:       0.1       0.2       0.3       0.4       0.5       0.6       0.7       0.8       0.9      No. of iters.
  AGE-D    1.0E-5    1.0E-5    2.1E-5    3.2E-5    3.7E-5    4.0E-5    4.1E-5    4.0E-5    3.1E-5        3
  AGE-CN   1.75E-5   3.4E-5    4.98E-5   6.29E-5   7.24E-5   7.69E-5   7.42E-5   6.21E-5   3.78E-5       3
  AGE-IMP  2.0E-5    3.9E-5    5.7E-5    7.3E-5    8.4E-5    9.0E-5    8.9E-5    7.6E-5    4.8E-5        4
  Exact    0.029018  0.067946  0.126695  0.205177  0.303308  0.421006  0.558194  0.714801  0.890760

TABLE 5.9: The Absolute Errors of the Numerical Solution of Problem (5.6.8)

x=0.5, k=0.001, h=0.1, t=1.0, λ=0.1, r=1.5, ε=10^-4

  y:       0.1       0.2       0.3       0.4       0.5       0.6       0.7       0.8       0.9      No. of iters.
  AGE-D    1.7E-5    2.3E-5    3.3E-5    3.6E-6    5.0E-5    5.1E-5    4.9E-5    5.0E-5    4.3E-5        3
  AGE-CN   7.24E-5   1.42E-4   2.07E-4   2.62E-4   3.04E-4   3.25E-4   3.16E-4   2.68E-4   1.65E-4       3
  AGE-IMP  8.41E-5   1.66E-4   2.41E-4   3.07E-4   3.58E-4   3.87E-4   3.38E-4   3.34E-4   2.15E-4       4
  Exact    0.303308  0.376183  0.468197  0.578931  0.707976  0.854943  1.019463  1.201191  1.399808

TABLE 5.10: The Absolute Errors of the Numerical Solution of Problem (5.6.8)

Page 268: Multiprocessor computer architectures : algorithmic design ... · Multiprocessor computer architectures : algorithmic design and ... ARCHITECTURES: ALGORITHMIC DESIGN AND APPLICATIONS

x=0.9, k=0.001, h=0.1, t=1.0, λ=0.1, r=1.5, ε=10^-4

  y:       0.1       0.2       0.3       0.4       0.5       0.6       0.7       0.8       0.9      No. of iters.
  AGE-D    3.1E-5    4.9E-5    8.0E-5    1.0E-4    1.2E-4    1.3E-4    1.5E-4    1.4E-4    1.0E-5        3
  AGE-CN   3.78E-5   7.45E-5   1.09E-4   1.4E-4    1.65E-4   1.81E-4   1.83E-4   1.63E-4   1.09E-4       3
  AGE-IMP  4.84E-5   9.59E-5   1.4E-4    1.81E-4   2.15E-4   2.4E-4    2.48E-4   2.29E-4   1.63E-4       4
  Exact    0.890760  0.990813  1.109460  1.246013  1.399809  1.570209  1.756611  1.958450  2.175209

TABLE 5.11: The Absolute Errors of the Numerical Solution of Problem (5.6.8)


5.7 A NEW STRATEGY FOR THE NUMERICAL SOLUTION OF THE SCHRODINGER

EQUATION

The numerical solution of the one-space-dimensional Schrodinger equation, a well known equation in Quantum Mechanics, is obtained by a new direct method which depends on separating the real and imaginary parts of the discretised complex tridiagonal matrices, obtained from a Crank-Nicolson formulation of the differential equation, into a new decoupled form.

Given the equation,

$$-i\,\frac{\partial U}{\partial t} = \frac{\partial^2 U}{\partial x^2}\,, \qquad 0\le x\le R,\ t\ge 0\,, \qquad (5.7.1)$$

with the initial condition,

$$U(x,0) = H(x)\,, \qquad t=0\,, \qquad (5.7.2)$$

and the boundary conditions,

$$U(0,t)\ \text{and}\ U(R,t)\ \text{prescribed for}\ t\ge 0\,. \qquad (5.7.3)$$

A uniformly spaced network whose mesh points are $x_r = rh$, $t_j = jk$, for $r=0,1,2,\ldots,n+1$ and $j=0,1,2,\ldots,m+1$, is used, with $h = R/(n+1)$, $k = T/(m+1)$ and the mesh ratio $\lambda = k/h^2$.

A weighted approximation to the differential equation (5.7.1) at the point $(x_r, t_{j+1/2})$ is given by (Table 3.1),

$$-\lambda\theta u_{r-1,j+1} + (2\lambda\theta - i)u_{r,j+1} - \lambda\theta u_{r+1,j+1} = \lambda(1-\theta)u_{r-1,j} - (2\lambda(1-\theta)+i)u_{r,j} + \lambda(1-\theta)u_{r+1,j}\,,$$

for $r=1,2,\ldots,n$, $j=0,1,2,\ldots$ and $0\le\theta\le 1$. (5.7.4)

This approximation can be written in a more compact matrix form as,


$$\begin{pmatrix}
2\lambda\theta - i & -\lambda\theta & & & \\
-\lambda\theta & 2\lambda\theta - i & -\lambda\theta & & \\
& \ddots & \ddots & \ddots & \\
& & -\lambda\theta & 2\lambda\theta - i & -\lambda\theta \\
& & & -\lambda\theta & 2\lambda\theta - i
\end{pmatrix}
\begin{pmatrix} u_1\\ u_2\\ \vdots\\ u_{n-1}\\ u_n \end{pmatrix}
=
\begin{pmatrix} b_1\\ b_2\\ \vdots\\ b_{n-1}\\ b_n \end{pmatrix} \qquad (5.7.5a)$$

i.e.,

$$A u = b\,, \qquad (5.7.5b)$$

where,

$$b_1 = \lambda(1-\theta)(u_{0,j}+u_{2,j}) + \lambda\theta u_{0,j+1} - (2\lambda(1-\theta)+i)u_{1,j}\,,$$
$$b_r = \lambda(1-\theta)(u_{r-1,j}+u_{r+1,j}) - (2\lambda(1-\theta)+i)u_{r,j}\,, \qquad r=2,3,\ldots,n-1\,,$$

and

$$b_n = \lambda(1-\theta)(u_{n-1,j}+u_{n+1,j}) - (2\lambda(1-\theta)+i)u_{n,j} + \lambda\theta u_{n+1,j+1}\,.$$

Here b is a column vector of order n consisting of the boundary

values as well as known u values at the time level j while u are the

values at the time level (j+1) which we seek. If we let $\theta=\tfrac12$, we recall that (5.7.4) corresponds to the well known Crank-Nicolson method with accuracy of order $O(h^2, k^2)$.

5.7.1 Outline of the Method

The complex tridiagonal system (5.7.5b) can be rewritten as,

$$(C+iD)\,u = b\,, \qquad (5.7.6)$$

where,

$$C = \begin{pmatrix} d & a & & \\ a & d & a & \\ & \ddots & \ddots & \ddots \\ & & a & d \end{pmatrix}\,, \qquad (5.7.7a)$$


$$D = \begin{pmatrix} c & & \\ & \ddots & \\ & & c \end{pmatrix}\,, \qquad (5.7.7b)$$

where, for $\theta=\tfrac12$, $a = -\lambda/2$, $d = \lambda$ and $c = -1$.

Now let,

$$u = (v + iw)\,, \qquad (5.7.7c)$$
$$b = (f + ig)\,. \qquad (5.7.7d)$$

The above system (5.7.7) can further be rewritten in a real variable form as,

$$\begin{pmatrix} C & -D\\ D & C \end{pmatrix}\begin{pmatrix} v\\ w \end{pmatrix} = \begin{pmatrix} f\\ g \end{pmatrix}\,, \qquad (5.7.8)$$

which is obtained by separating out the real and imaginary parts of the complex coefficient matrix, i.e.,

$$(C+iD)(v+iw) = (f+ig)\,, \qquad (5.7.9)$$

to give,
$$Cv - Dw = f\,, \qquad (5.7.10a)$$
$$Dv + Cw = g\,. \qquad (5.7.10b)$$

The system (5.7.10a,b) can further be reduced into the uncoupled form by premultiplication by the matrix,

$$\begin{pmatrix} C & D\\ -D & C \end{pmatrix}\,,$$

(noting that C and D commute, since D is a scalar matrix), to give,

$$\begin{pmatrix} C^2+D^2 & 0\\ 0 & C^2+D^2 \end{pmatrix}\begin{pmatrix} v\\ w \end{pmatrix} = \begin{pmatrix} Cf+Dg\\ Cg-Df \end{pmatrix}\,, \qquad (5.7.11)$$


which represents two linear systems, with the same coefficient matrix

for the unknown components v and w, which need to be solved only once,

but with different right hand sides.

Since C was originally of tridiagonal form, the matrix $C^2$ will be quindiagonal, of the form,

$$C^2 = \begin{pmatrix}
d^2+a^2 & 2ad & a^2 & & & \\
2ad & d^2+2a^2 & 2ad & a^2 & & \\
a^2 & 2ad & d^2+2a^2 & 2ad & a^2 & \\
& \ddots & \ddots & \ddots & \ddots & \ddots \\
& & a^2 & 2ad & d^2+2a^2 & 2ad \\
& & & a^2 & 2ad & d^2+a^2
\end{pmatrix}\,, \qquad (5.7.12)$$

whilst $D^2$ will remain in diagonal form.

Thus, the solution of the complex system has been reduced to

solving two identical quindiagonal systems in real variables of the

form,

$$\begin{pmatrix}
d^2+a^2+1 & 2ad & a^2 & & & \\
2ad & d^2+2a^2+1 & 2ad & a^2 & & \\
a^2 & 2ad & d^2+2a^2+1 & 2ad & a^2 & \\
& \ddots & \ddots & \ddots & \ddots & \ddots \\
& & a^2 & 2ad & d^2+2a^2+1 & 2ad \\
& & & a^2 & 2ad & d^2+a^2+1
\end{pmatrix}
\big[\,v \mid w\,\big] = \big[\,\mathrm{RHS1} \mid \mathrm{RHS2}\,\big]\,, \qquad (5.7.13)$$

where the real solution $v=(v_1,\ldots,v_n)^T$ is obtained with the right hand side $\mathrm{RHS1} = Cf+Dg$ and the imaginary solution $w=(w_1,\ldots,w_n)^T$ with $\mathrm{RHS2} = Cg-Df$ of (5.7.11),

which can be solved directly by efficient elimination procedures involving no interchanges.

Since the coefficient matrix is unchanged, the Gaussian elimination algorithm need only be applied once to the matrix, whilst the operations on the two right hand sides can be carried out simultaneously.

By the use of the quindiagonal solver [Conte, 1965], the computational complexity of the algorithm is now of order O(n) multiplications, while full Gaussian elimination is of order O(n^3) multiplications.
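To illustrate the point, the sketch below (our own Python reconstruction under stated assumptions, not the thesis program) forms the quindiagonal matrix of (5.7.13) for C = tridiag(a,d,a) and D = cI, and then solves it once for the two right hand sides of (5.7.11) using a standard banded solver:

```python
import numpy as np
from scipy.linalg import solve_banded

def solve_decoupled(a, d, f, g, c=-1.0):
    """Solve (C^2 + D^2) v = C f + D g and (C^2 + D^2) w = C g - D f,
    for C = tridiag(a, d, a) and D = c*I, sharing one banded
    factorisation between the two right hand sides."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    n = len(f)
    ab = np.zeros((5, n))                      # banded storage: 2 super-, 2 sub-diagonals
    main = np.full(n, d * d + 2 * a * a + c * c)
    main[0] = main[-1] = d * d + a * a + c * c
    ab[2] = main
    ab[1, 1:] = ab[3, :-1] = 2 * a * d         # first off-diagonals
    ab[0, 2:] = ab[4, :-2] = a * a             # second off-diagonals

    def C(x):                                  # apply C = tridiag(a, d, a)
        y = d * x
        y[:-1] += a * x[1:]
        y[1:] += a * x[:-1]
        return y

    rhs = np.column_stack((C(f) + c * g, C(g) - c * f))
    sol = solve_banded((2, 2), ab, rhs)        # one elimination, two solves
    return sol[:, 0], sol[:, 1]                # v = Re(u), w = Im(u)
```

Since the elimination is banded and interchange-free, the work grows only linearly with n, in line with the O(n) operation count quoted above.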

5.7.2 Numerical Results.

Consider the problem of solving the Schrodinger equation,

$$-i\,\frac{\partial U}{\partial t} = \frac{\partial^2 U}{\partial x^2}\,, \qquad 0\le x\le \pi/2\,, \qquad (5.7.14)$$

with the initial condition,

$$U(x,0) = \sin x + i\cos x\,,$$

and the boundary conditions,

$$U(0,t) = i e^{-t}\,, \qquad (5.7.15a)$$
$$U(\pi/2,t) = e^{-t}\,. \qquad (5.7.15b)$$


The following tables illustrate the results obtained by solving

the problem (5.7.14) by this method.

  Numerical solution        Exact solution
  Real       Imag.          Real       Imag.
  0.139508   0.970233       0.139497   0.970222
  0.276164   0.940504       0.276154   0.940494
  0.407198   0.891629       0.407189   0.891620
  0.529943   0.824604       0.529935   0.824596
  0.641904   0.740795       0.641894   0.740785
  0.740794   0.641903       0.740785   0.641894
  0.824604   0.529943       0.824596   0.529935
  0.891631   0.407200       0.891620   0.407189
  0.940504   0.276164       0.940494   0.276154
  0.970231   0.139506       0.970222   0.139497

TABLE 5.12: Solution At Time Level nt=5, t=0.002


  Numerical Solution        Exact Solution
  Real       Imag.          Real       Imag.
  0.131382   0.913729       0.131373   0.913720
  0.260082   0.885734       0.260072   0.885724
  0.383487   0.839707       0.383476   0.839696
  0.499082   0.776583       0.499074   0.776575
  0.604522   0.697645       0.604513   0.697645
  0.697655   0.604523       0.697645   0.604513
  0.776583   0.499082       0.776575   0.499074
  0.839705   0.383485       0.839696   0.383476
  0.885734   0.260082       0.885724   0.260072
  0.913731   0.131383       0.913720   0.131373

TABLE 5.13: Solution At Time Level nt=17, t=0.08

  Numerical Solution        Exact Solution
  Real       Imag.          Real       Imag.
  0.128728   0.895636       0.128772   0.895628
  0.254931   0.868194       0.254922   0.868185
  0.375894   0.823080       0.375883   0.823069
  0.489203   0.761209       0.489192   0.761198
  0.592553   0.683841       0.592543   0.683831
  0.683838   0.592551       0.683830   0.592543
  0.761207   0.489201       0.761198   0.489192
  0.823079   0.375893       0.823069   0.375883
  0.868204   0.254931       0.868185   0.254922
  0.895636   0.128780       0.895628   0.128772

TABLE 5.14: Solution At Time Level nt=21, t=0.1


5.8 CONCLUSIONS

The parallel implementations of the AGE iterative method to solve

one and two dimensional parabolic problems as well as the second order

hyperbolic problem are investigated in this Chapter. Using the three

parallel strategies described in Chapter 4, we notice that the

asynchronous version achieves a better speed-up ratio, and these speed-ups are mostly linear.

This is because of the overheads generated in the other two

strategies and also because of the idle time that is spent waiting

for the other processors to complete their computations. There is also the time needed to order the points in the second strategy, which further degrades the efficiency of the algorithm.

From the experiments, we notice from the results obtained with the first two strategies that there is insufficient parallelism available to enable the processors to work efficiently. This is because the granularity of the 2x2 blocks is too small to keep the processors always busy.

The numerical experiments carried out on the solution of the

model problem of one and two dimensional parabolic differential

equations using the AGE method with D'Yakonov splitting yields an

algorithm with reduced computational complexity and greater accuracy

for multi-dimensional problems.

The (nxn) complex tridiagonal matrices are solved by a new direct method presented in Section 5.7. This direct method enables us to reduce the complex system into two identical real quindiagonal systems. Since the coefficient matrix is unchanged, the Gaussian


elimination algorithm need only be applied once to the matrix whilst

the operations on the two right hand sides can be carried out

simultaneously.


CHAPTER 6

NUMERICAL INVERSION OF THE LAPLACE TRANSFORM: SOME INVESTIGATIONS AND PARALLEL EXPLORATIONS

You can observe a lot just by watching.

Y. Berra.


6.1 INTRODUCTION

The Laplace transform is an important technique which is used to

find the solution of differential equations with assigned boundary and

initial conditions in applied sciences and engineering. Already,

extensive tables of the Laplace transforms are available. However, it

is interesting to investigate and obtain numerically the inverse

Laplace transforms.

Another technique, that of integral transforms, which had its

origin in Heaviside's work, has been developed during the last few

years and has certain advantages over the classical methods.

The integral transform F(s) of a given function f(t) in the range

$a\le t\le b$ is defined as follows,

$$F(s) = \int_a^b k(s,t)\,f(t)\,dt\,, \qquad (6.1.1)$$

where k(s,t) is a known function of s and t, known as the kernel of

the transform, provided that the integral exists.

"In the application of integral transforms to the solution of

differential equations, use has so far been made of five different

kernels. These five transforms are:

(a) The Laplace transform,

F (s) ; ( -st

f(t)e dt,

Cb) The Fourier sine and cosine transform,

F(s) f(t)sin(st) dt cos(st) ,

(c) The Complex Fourier transform,

F (s) ;

(6.1.2)

(6.1.3)

(6.1.4)


(d) The Hankel transform,
$$F(s) = \int_0^\infty f(t)\,t\,J_n(st)\,dt\,, \qquad (6.1.5)$$
where $J_n(t)$ is the Bessel function of the first kind of order n, and

(e) The Mellin transform,
$$F(s) = \int_0^\infty f(t)\,t^{s-1}\,dt\,. \qquad (6.1.6)$$

The effect of applying an integral transform to a partial differential equation is to exclude temporarily a chosen independent variable and to leave a partial differential equation in one less

variable for solution. The solution of this equation will be a function

of s and the remaining variables. When this solution has been

obtained, it has to be inverted to recover the lost variable. Thus,

if t is the variable eliminated and F(s) is one of the transforms given

above, we first obtain auxiliary equations giving F in terms of s and

the remaining independent variables. These are solved for F and then

inverted to obtain f(t).

In this chapter, we shall briefly describe the Laplace transform

and its numerical inversion. A comparison between different methods

suggested for the numerical inversion of the Laplace transform is

given in order to choose an appropriate method. Then an attempt to obtain more accurate results by applying extrapolation techniques in a Romberg quadrature strategy is considered. Then we

implement the numerical inversion of the Laplace transform suggested

by Crump [Crump, 1976] on the Balance 8000 MIMD parallel system

incorporating some parallelism.


6.2 THE NUMERICAL INVERSION OF THE LAPLACE TRANSFORM

The Laplace transform of a real function f: R → R, with f(t)=0 for t<0, and its inversion formula are defined as,

$$F(s) = L[f(t)] = \int_0^\infty f(t)\,e^{-st}\,dt\,, \qquad (6.2.1)$$

$$f(t) = L^{-1}[F(s)] = \frac{1}{2\pi i}\int_{v-i\infty}^{v+i\infty} F(s)\,e^{st}\,ds\,, \qquad (6.2.2)$$

with $s = v+iw$; $v,w \in R$ and $i = \sqrt{-1}$. $v \in R$ is arbitrary, but greater than the real parts of all the singularities of F(s). The integrals in (6.2.1) and (6.2.2) exist for $\mathrm{Re}(s) > a \in R$ if,

(a) f is locally integrable,

(b) there exist $t_0 \ge 0$ and $k, a \in R$, such that $|f(t)| \le k e^{at}$ for all $t \ge t_0$ (i.e. f is of exponential order),

(c) for all $t \in (0,\infty)$ there is a neighbourhood in which f is of bounded variation.

In the following we always assume that f fulfils the above conditions and, in addition, that there are no singularities of F(s) to the right of the origin (for the case of singularities to the right, a suitable translation of the imaginary axis can be performed).

Although there exist extensive tables of transforms and their inverses, it is highly desirable to have methods for approximate

numerical inversion for computer implementation. A large number of

different methods have been devised for the numerical evaluation of

the Laplace inversion integral. In this section we briefly survey some

of these methods, where many of them are either orthogonal series


expansions or weighted sums of values of the transform at a set of

points.

Schmittroth [Schmittroth, 1960] has described a method in which

the inverse transform is obtained from the complex inversion integral

by use of numerical quadrature. This method gives good results, but

if the inverse transform is required for a large number of values of

the independent variable, the quadrature procedure must be repeated

for each value of the independent variable.

In cases where the inverse is required for many values of the

independent variable, it is convenient to obtain the inverse as a

series expansion in terms of a set of linearly independent functions.

Norden [Norden, 1955] has described two such methods in which the

expansion coefficients are calculated by solving a system of

simultaneous equations. The problem of solving simultaneous linear

equations can be reduced to one of solving a triangular system.

Salzer [Salzer, 1958] described a method using orthogonal polynomials, which was later refined by Shirtlifte and

Stephenson [Shirtlifte, 1961]. They attempt an approximate

evaluation of the inversion integral using Gaussian quadrature in the

complex plane. In this method it is necessary to find all roots, real

and complex, of a polynomial of high degree.

Lanczos and Papoulis [Lanczos, 1956] describe methods in which

the inverse transform is obtained as a series expansion in terms of

trigonometric functions, Legendre polynomials or Laguerre polynomials.

Papoulis obtained the inverse transform as a series expansion in

terms of Laguerre functions. The expansion coefficients are obtained


from the coefficients of the Taylor series expansion of the Laplace

transform by solving a triangular system of linear equations. The

disadvantage of this method, is the necessity of obtaining the Taylor

series expansion of the Laplace transform. Lanczos found the inverse

transform as a series expansion in terms of Laguerre functions, by

applying a conformal mapping to the Laplace transform and then developing

the resulting function in a Taylor series expansion.

Weeks [Weeks, 1966] further refined the ideas of Lanczos and

Papoulis. He obtained the coefficients in the expansion of the inverse

transform directly by trigonometric interpolation applied to the Laplace

transform. The resulting algorithm is quite suitable for automatic

computation.

Longman [Longman, 1975] described a method in which the transform F(s) is replaced by Padé approximants $\phi_{n,m}(s)$ in the inversion integral. Unfortunately, the construction of the rational function $\phi_{n,m}(s)$ requires a knowledge of the Taylor expansion of F(s) about the origin, and this makes it impossible to implement a procedure which uses only values of F(s).

Finally, Dubner and Abate [Dubner, 1968] showed that the inverse

integral may be approximated by a certain Fourier series. This method

will be described in detail in the next section.


6.3 NUMERICAL EXPERIMENTS

As stated in the previous section, a number of numerical inversion

methods have been developed during the last few years. In this

section, we shall confine ourselves to the methods using Fourier

series approximations.

We shall describe the derivation of the Dubner and Abate method

[Dubner, 1968]. Then, the natural continuation of this method as suggested by Durbin [Durbin, 1973] and Crump [Crump, 1976] is

described. A comparison between these methods is given by a number of

experimental results. Finally, we attempt to increase the accuracy

by using the Romberg integration method.

Let f(t) be a real function of t satisfying the conditions (6.2.3). Then, by expanding equation (6.2.1), we have,

$$F(s) = \int_0^\infty e^{-vt} f(t)\cos wt\,dt - i\int_0^\infty e^{-vt} f(t)\sin wt\,dt \qquad (6.3.1a)$$
$$\phantom{F(s)} = \mathrm{Re}\{F(v+iw)\} + i\,\mathrm{Im}\{F(v+iw)\}\,. \qquad (6.3.1b)$$

Also, by expanding equation (6.2.2), we have,

$$f(t) = \frac{1}{2\pi}\int_{-\infty}^{+\infty} e^{vt}(\cos wt + i\sin wt)\big(\mathrm{Re}\{F(s)\} + i\,\mathrm{Im}\{F(s)\}\big)\,dw \qquad (6.3.2a)$$

or

$$f(t) = \frac{e^{vt}}{2\pi}\Big[\int_{-\infty}^{+\infty}\big(\mathrm{Re}\{F(s)\}\cos wt - \mathrm{Im}\{F(s)\}\sin wt\big)\,dw + i\int_{-\infty}^{+\infty}\big(\mathrm{Im}\{F(s)\}\cos wt + \mathrm{Re}\{F(s)\}\sin wt\big)\,dw\Big]\,. \qquad (6.3.2b)$$

The imaginary part in (6.3.2b) cancels out because of the parity

of Re{F(s)} and Im{F(s)}. Again by the same argument, we can have,


$$f(t) = \frac{e^{vt}}{\pi}\int_0^\infty\big(\mathrm{Re}\{F(s)\}\cos wt - \mathrm{Im}\{F(s)\}\sin wt\big)\,dw\,. \qquad (6.3.2c)$$

For t<0, f(t)=0, which means that,

$$\int_0^\infty\big(\mathrm{Re}\{F(s)\}\cos wt + \mathrm{Im}\{F(s)\}\sin wt\big)\,dw = 0\,. \qquad (6.3.2d)$$

Consequently, we obtain 3 formulas for the Laplace inverse f(t) corresponding to F(s). These 3 are:

$$\text{(a)}\quad f(t) = \frac{2e^{vt}}{\pi}\int_0^\infty \mathrm{Re}\{F(s)\}\cos wt\,dw\,, \qquad (6.3.3a)$$
$$\text{(b)}\quad f(t) = -\frac{2e^{vt}}{\pi}\int_0^\infty \mathrm{Im}\{F(s)\}\sin wt\,dw\,, \qquad (6.3.3b)$$
$$\text{(c)}\quad f(t) = \frac{e^{vt}}{\pi}\int_0^\infty\big(\mathrm{Re}\{F(s)\}\cos wt - \mathrm{Im}\{F(s)\}\sin wt\big)\,dw\,. \qquad (6.3.3c)$$

Fourier series were first used by Dubner-Abate, who suggested the following. Let h(t) be a real function of t with h(t)=0 for t<0. Consider sections of h(t) in intervals (nT,(n+1)T), and construct an infinite set of 2T-periodic functions $g_n(t)$ (Figure 6.1), where,

where,

r h (nT-t) -T~t~O (a)

1 g (t) = i h (nT+t) O~t~T (b)

t (6.3.4) n

h ((n+2)T-t) T~t~2T , n:;:O, 2,4, ... , (c) l J

r h ( (n+l) T+t) -T~t~O

(a) 1

gn (t) i h ((n+l)T-t) O,t,T (b) ( (6.3.5) I

l h( (n-l)T+t) T~t~2T, n;;l, 3 15, ••• , (c) J

Page 287: Multiprocessor computer architectures : algorithmic design ... · Multiprocessor computer architectures : algorithmic design and ... ARCHITECTURES: ALGORITHMIC DESIGN AND APPLICATIONS

FIGURE 6.1: The 2T-periodic extensions $g_n(t)$ of $h(t) = e^{-vt}f(t)$, plotted on the interval (-T,3T).

Then by developing each $g_n(t)$ into a cosine Fourier series we have,

$$g_n(t) = \tfrac{1}{2}A_{n,0} + \sum_{k=1}^{\infty} A_{n,k}\cos(k\pi t/T)\,, \qquad (6.3.6)$$

where,

$$A_{n,k} = \frac{2}{T}\int_{nT}^{(n+1)T} h(t)\cos(k\pi t/T)\,dt\,. \qquad (6.3.7)$$

Since it is always possible to write,


$$h(t) = e^{-vt}f(t) \qquad (6.3.8a)$$
or
$$f(t) = e^{vt}h(t)\,, \qquad (6.3.8b)$$

we have,

$$\sum_{n=0}^{\infty} A_{n,k} = \frac{2}{T}\int_0^\infty e^{-vt}f(t)\cos(k\pi t/T)\,dt \qquad (6.3.9a)$$
$$\phantom{\sum_{n=0}^{\infty} A_{n,k}} = \frac{2}{T}\,\mathrm{Re}\{F(v+ik\pi/T)\}\,, \qquad (6.3.9b)$$

and therefore,

$$\sum_{n=0}^{\infty} e^{vt}g_n(t) = \frac{2e^{vt}}{T}\Big[\tfrac{1}{2}\mathrm{Re}\{F(v)\} + \sum_{k=1}^{\infty}\mathrm{Re}\{F(v+ik\pi/T)\}\cos(k\pi t/T)\Big]\,. \qquad (6.3.9c)$$

By using the expressions (6.3.4b), (6.3.4c), (6.3.5b), (6.3.5c), (6.3.8a) and (6.3.8b) we have,

$$\sum_{n=0}^{\infty} e^{vt}g_n(t) = f(t) + \sum_{k=1}^{\infty} e^{-2vkT}\big(f(2kT+t) + e^{2vt}f(2kT-t)\big) = f(t) + \mathrm{error1}(v,t,T)\,.$$

In conclusion, for any $0\le t\le 2T$, we can write,

$$f(t) + \mathrm{error1}(v,t,T) = \frac{2e^{vt}}{T}\Big[\tfrac{1}{2}\mathrm{Re}\{F(v)\} + \sum_{k=1}^{\infty}\mathrm{Re}\{F(v+ik\pi/T)\}\cos(k\pi t/T)\Big]\,. \qquad (6.3.10)$$

Equation (6.3.10) is the Dubner-Abate formula, where the error is a function of v, t and T. The factor $\sum_{k=1}^{\infty} e^{-2vkT}f(2kT-t)\,e^{2vt}$ is the most disturbing one, since it increases exponentially with t. Numerically, the Dubner-Abate method is only valid for $t\le T/2$.
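As an illustration of how (6.3.10) is used in practice, the following Python sketch evaluates the truncated cosine series; the finite number of terms and the particular choice of the shift v are our own illustrative assumptions, not part of the derivation:

```python
import numpy as np

def dubner_abate(F, t, T, v, n_terms=1000):
    """Truncated Dubner-Abate cosine series (6.3.10); F is the
    Laplace transform, callable on (arrays of) complex arguments.
    The result is only trustworthy for t <= T/2, as noted above."""
    k = np.arange(1, n_terms + 1)
    series = np.sum(np.real(F(v + 1j * k * np.pi / T)) * np.cos(k * np.pi * t / T))
    return (2.0 * np.exp(v * t) / T) * (0.5 * np.real(F(v + 0j)) + series)

# Problem (1) of the comparison below: F(s) = s/(s^2+1)^2, f(t) = (t/2) sin t.
approx = dubner_abate(lambda s: s / (s**2 + 1)**2, t=1.0, T=8.0, v=0.5)
```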

Similar numerical inversion techniques can be developed that

utilize Im{F(s)} rather than Re{F(s)}. Durbin [Durbin, 1973] suggested the following.


Consider h(t) in the interval (nT,(n+1)T) with an infinite set of odd 2T-periodic functions $k_n(t)$ (Figure 6.2).

FIGURE 6.2: The odd 2T-periodic extensions $k_n(t)$ of $h(t) = e^{-vt}f(t)$, plotted on the interval (-T,3T).

By definition we have,

$$k_n(t) = \begin{cases} h(t)\,, & nT\le t\le (n+1)T & \text{(a)}\\ -h(2nT-t)\,, & (n-1)T\le t\le nT\,, & \text{(b)} \end{cases} \quad n=0,1,2,\ldots. \qquad (6.3.11)$$

Similarly, on the intervals (-T,T), (0,T) and (T,2T), we have,


$$k_n(t) = \begin{cases} -h(nT-t)\,, & -T\le t\le 0 & \text{(a)}\\ h(nT+t)\,, & 0\le t\le T & \text{(b)}\\ -h((n+2)T-t)\,, & T\le t\le 2T\,, & \text{(c)} \end{cases} \quad n=0,2,4,\ldots, \qquad (6.3.12)$$

$$k_n(t) = \begin{cases} h((n+1)T+t)\,, & -T\le t\le 0 & \text{(a)}\\ -h((n+1)T-t)\,, & 0\le t\le T & \text{(b)}\\ h((n-1)T+t)\,, & T\le t\le 2T\,, & \text{(c)} \end{cases} \quad n=1,3,5,\ldots. \qquad (6.3.13)$$

Hence, the Fourier representation for each odd function $k_n(t)$ is,

$$k_n(t) = \sum_{k=1}^{\infty} B_{n,k}\sin(k\pi t/T)\,, \qquad (6.3.14)$$

where we find that,

$$B_{n,k} = \frac{2}{T}\int_{nT}^{(n+1)T} e^{-vt}f(t)\sin(k\pi t/T)\,dt\,. \qquad (6.3.15)$$

Then, by summing (6.3.15) over n and comparing it with equation (6.3.1a), we have,

$$\sum_{n=0}^{\infty} B_{n,k} = \frac{2}{T}\int_0^\infty e^{-vt}f(t)\sin(k\pi t/T)\,dt \qquad (6.3.16a)$$
$$\phantom{\sum_{n=0}^{\infty} B_{n,k}} = -\frac{2}{T}\,\mathrm{Im}\{F(v+ik\pi/T)\}\,. \qquad (6.3.16b)$$

Again, summing (6.3.14) over n and multiplying both sides by $e^{vt}$, we obtain an expression similar to equation (6.3.9c),

$$\sum_{n=0}^{\infty} e^{vt}k_n(t) = -\frac{2e^{vt}}{T}\Big[\sum_{k=1}^{\infty}\mathrm{Im}\{F(v+ik\pi/T)\}\sin(k\pi t/T)\Big]\,. \qquad (6.3.17)$$

Likewise, on the interval (0,2T), using the equations (6.3.8a), (6.3.8b), (6.3.12c), (6.3.13b) and (6.3.13c), we find,

$$\sum_{n=0}^{\infty} e^{vt}k_n(t) = f(t) + \sum_{k=1}^{\infty} e^{-2vkT}\big[f(2kT+t) - e^{2vt}f(2kT-t)\big]\,. \qquad (6.3.18)$$


Hence, another representation for f(t) is,

$$f(t) + \mathrm{error2}(v,t,T) = -\frac{2e^{vt}}{T}\Big[\sum_{k=1}^{\infty}\mathrm{Im}\{F(v+ik\pi/T)\}\sin(k\pi t/T)\Big]\,. \qquad (6.3.19)$$

The advantages of the Durbin method are twofold. First, the error bound

on the inverse f(t) becomes independent of t, instead of being

exponential in t. Second, the trigonometric series obtained for f(t)

in terms of F(s) is valid on the whole period 2T of the series.

Kenny S. Crump [Crump, 1976] utilized a combination of both the real and the imaginary parts of the Fourier series. His proposal is as follows:

For $n=0,1,2,\ldots$, define $g_n(t)$, $-\infty<t<\infty$, by,

$$g_n(t) = e^{-vt}f(t)\,, \qquad 2nT\le t\le 2(n+1)T\,,$$

where $g_n(t)$ is a periodic function with period 2T. The Fourier series representation of $g_n(t)$ is then given by,

$$g_n(t) = \tfrac{1}{2}A_{n,0} + \sum_{k=1}^{\infty}\big\{A_{n,k}\cos(k\pi t/T) + B_{n,k}\sin(k\pi t/T)\big\}\,, \qquad (6.3.20)$$

where the Fourier coefficients $A_{n,k}$ and $B_{n,k}$ are given by,

$$A_{n,k} = \frac{1}{T}\int_{2nT}^{2(n+1)T} e^{-vt}f(t)\cos(k\pi t/T)\,dt$$

and

$$B_{n,k} = \frac{1}{T}\int_{2nT}^{2(n+1)T} e^{-vt}f(t)\sin(k\pi t/T)\,dt\,. \qquad (6.3.21)$$

By summing equation (6.3.20) with respect to n, we have,

$$\sum_{n=0}^{\infty} g_n(t) = \frac{1}{T}\Big[\tfrac{1}{2}F(v) + \sum_{k=1}^{\infty}\big(\mathrm{Re}\{F(v+ik\pi/T)\}\cos(k\pi t/T) - \mathrm{Im}\{F(v+ik\pi/T)\}\sin(k\pi t/T)\big)\Big]\,,$$


which can be written in the form of the inverse Laplace transform as,

$$f(t) + \mathrm{error3}(v,t,T) = \frac{e^{vt}}{T}\Big[\tfrac{1}{2}F(v) + \sum_{k=1}^{\infty}\big(\mathrm{Re}\{F(v+ik\pi/T)\}\cos(k\pi t/T) - \mathrm{Im}\{F(v+ik\pi/T)\}\sin(k\pi t/T)\big)\Big]\,, \qquad (6.3.22)$$

where,

$$\mathrm{error3}(v,t,T) = e^{vt}\sum_{n=1}^{\infty} g_n(t) = e^{vt}\sum_{n=1}^{\infty}\exp\{-v(2nT+t)\}\,f(2nT+t)$$

or

$$\mathrm{error3}(v,t,T) = \sum_{n=1}^{\infty} e^{-2nvT} f(2nT+t)\,. \qquad (6.3.23)$$

However, by comparing error1, error2 and error3, we notice that

$$\mathrm{error3} = (\mathrm{error1} + \mathrm{error2})/2\,.$$

To increase the rate of convergence of equation (6.3.22) and

thereby reduce the truncation error Crump suggested the use of either

the Euler transformation (first used by Simon, Stroot and Weiss

[Simon, 1972]) or the epsilon algorithm [Wynn, 1962]. He also showed

the superiority of using the epsilon algorithm in speeding the rate of

convergence.
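For concreteness, a compact Python sketch of the epsilon algorithm is given below; it implements the standard Wynn recursion applied to a list of partial sums, and is our own illustration rather than a transcription of Crump's program:

```python
def wynn_epsilon(partial_sums):
    """Wynn's epsilon algorithm: build successive columns of the
    epsilon table from a sequence of partial sums; the even-numbered
    columns approximate the limit of the series."""
    prev = [0.0] * len(partial_sums)      # the epsilon_{-1} column (zeros)
    curr = list(partial_sums)             # the epsilon_0 column
    best, col = curr[-1], 0
    while len(curr) > 1:
        nxt = []
        for j in range(len(curr) - 1):
            diff = curr[j + 1] - curr[j]
            if diff == 0.0:               # table has converged exactly
                return best
            nxt.append(prev[j + 1] + 1.0 / diff)
        prev, curr = curr, nxt
        col += 1
        if col % 2 == 0:                  # keep the newest even column
            best = curr[-1]
    return best
```

Applied to the slowly convergent partial sums that arise from (6.3.22), a modest number of terms accelerated in this way typically recovers several more correct digits than the raw truncated series.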

In 1984, Honig and Hirdes [Honig, 1984] attempted to accelerate

the convergence of equation (6.3.19) by using three acceleration methods: the epsilon algorithm, the minimum-maximum method and a method based on curve fitting. Also, they tested other acceleration methods

such as the Euler transformation and Aitken's extrapolation procedure

but these methods turned out to be less efficient.

To speed up the computation and to increase accuracy Dubner-Abate

and Durbin introduced the use of the Fast Fourier Transform (FFT)


techniques [Cooley and Tukey, 1965] (this technique is described in the next section).

To fulfil our comparison between these methods, we list the results obtained for solving the two problems,

(1) $F(s) = s/(s^2+1)^2$,  $f(t) = (t/2)\sin t$,

(2) $F(s) = 1/(s^2+s+1)$,  $f(t) = \frac{2}{\sqrt{3}}\exp(-t/2)\sin(\sqrt{3}\,t/2)$.

The results for these problems are illustrated in Tables 6.1 and 6.2.

The trapezoidal rule lacks the degree of accuracy, which is

generally required from a quadrature formula. However, Romberg

integration is a method that has wide applications because it uses

the trapezoidal rule to give preliminary approximations, and then

applies the Richardson extrapolation process to obtain improvements

to these approximations.

To improve the accuracy of the final results for a reduced number

of terms n, we applied the Romberg integration to the numerical

inversion of the Laplace transform methods that were suggested by

Dubner-Abate, Durbin and Crump.
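As a reminder of the mechanics, a small Python sketch of Romberg integration follows; the function signature and the fixed number of levels are our own illustrative choices:

```python
import numpy as np

def romberg(f, a, b, levels=6):
    """Romberg integration: trapezoidal estimates on successively
    halved meshes, refined by Richardson extrapolation.
    f must accept NumPy array arguments."""
    R = np.zeros((levels, levels))
    h = b - a
    R[0, 0] = 0.5 * h * (f(a) + f(b))
    for i in range(1, levels):
        h *= 0.5
        # Refined trapezoidal rule: reuse the old sum, add the new midpoints.
        new_points = a + h * np.arange(1, 2**i, 2)
        R[i, 0] = 0.5 * R[i - 1, 0] + h * np.sum(f(new_points))
        # Richardson extrapolation along the current row.
        for j in range(1, i + 1):
            R[i, j] = R[i, j - 1] + (R[i, j - 1] - R[i - 1, j - 1]) / (4**j - 1)
    return R[-1, -1]
```

The triangular array R is laid out exactly as in Table 6.3 below, each successive column being one extrapolation order higher.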

Although the Romberg strategy appears to give more accurate results, it does so only for small time intervals, which are too small to be of practical use in applying it to the full range of the transform.

Table 6.3 shows the numerical inversion of the Laplace transform with

Romberg integration by using the Durbin method.


F(s) = s/(s^2+1)^2, f(t) = (t/2) sin t.

  t    Exact Solution   Dubner and Abate   Durbin       Crump
  1     0.4207355        0.420732          0.4207785    0.4207348
  2     0.9092975        0.909297          0.9092443    0.9092966
  3     0.2116800        0.211683          0.2116644    0.2116795
  4    -1.5136050       -1.513614         -1.512939    -1.5136050
  5    -2.3973107       -2.397300         -2.397216    -2.3973100
  6    -0.8382466       -0.838262         -0.8382199   -0.8382477
  7     2.2994530        2.299529          2.299371     2.2994370
  8     3.9574329        3.957423          3.957483     3.9574310
  9     1.8545331        1.854535          1.854252     1.854533
  10   -2.7201055       -2.720956         -2.719565    -2.7201030

TABLE 6.1

F(s) = 1/(s^2+s+1), f(t) = (2/√3) exp(-t/2) sin(√3 t/2).

  t    Exact Solution   Dubner and Abate   Durbin       Crump
  1     0.5335073        0.5335338         0.5335343    0.5335066
  2     0.4192797        0.4192651         0.4192649    0.4192776
  3     0.1332426        0.1332654         0.1332644    0.1332419
  4    -0.0495299       -0.0494962        -0.0494953   -0.0495370
  5    -0.0879424       -0.0879581        -0.0879563   -0.0879449
  6    -0.0508923       -0.0508995        -0.0508996   -0.0508929
  7    -0.0076437       -0.0076448        -0.0076477   -0.0076439
  8     0.0127151        0.0127121         0.0127137    0.0127149
  9     0.0128046        0.0128020         0.0128079    0.0128046
  10    0.0053854        0.0053849         0.0053881    0.0053859

TABLE 6.2


F(s) = 2/s - 1/(s+1); f(t) = 2 - exp(-t). Increment in t = 0.00001.

  0.4686862
  0.4687000   0.4687046
  0.4686812   0.4686749   0.4686729
  0.4686969   0.4687022   0.4687039   0.4687044
  0.4686954   0.4686948   0.4686943   0.4686942   0.4686942

TABLE 6.3

6.3.1 The Implementation of the Fast Fourier Transform (FFT)

Technique

Although the earliest discovery of the FFT technique was in 1942 (Danielson and Lanczos), the technique became generally known only in the mid-1960s from the work of Cooley and Tukey [Cooley, 1965]. This technique reduced the computational time of the Fourier transform from $N^2$ to less than $N\log_2 N$ operations, where N is usually taken to be a power of 2.

An attempt to introduce the use of the FFT for numerically

inverting the Laplace transform was first suggested by Wing [Wing,

1968]. Dubner-Abate also applied the implementation of the FFT

technique to their method. Their implementation is as follows:

The finite Fourier transform pair $X(j) \leftrightarrow A(k)$ is defined by,

$$X(j) = \sum_{k=0}^{N-1} A(k)\exp\Big(\frac{2\pi i}{N}jk\Big)\,, \qquad (6.3.24)$$

for $j=0,1,2,\ldots,N-1$. Also equation (6.3.10) may be written as,


$$f(t) = \frac{e^{vt}}{T}\Big\{\sum_{n=-\infty}^{\infty}\mathrm{Re}\Big\{F\Big(v+\frac{i\pi n}{T}\Big)\Big\}\exp\Big(\frac{i\pi n}{T}t\Big)\Big\}\,. \qquad (6.3.25)$$

Now, if we require the value of f(t) at the equidistant points $t = j\Delta t$, where $j=0,1,2,\ldots,N-1$, i.e., $\Delta t$ is the desired sampling interval and the maximum t-value is $t_{\max} = N\Delta t$, then we have either $T = 2t_{\max}$ or $T = \tfrac{1}{2}N\Delta t$. From these definitions we can write equation (6.3.25) as,

$$f(j\Delta t) = \sum_{k=0}^{N-1} A(k)\exp\Big(\frac{2\pi i}{N}jk\Big)\,b(j)\,, \qquad (6.3.26a)$$

where,

$$A(k) = \frac{1}{N\Delta t}\sum_{n=-\infty}^{\infty}\mathrm{Re}\Big\{F\Big(v+\frac{2\pi i}{N\Delta t}(k+nN)\Big)\Big\} \qquad (6.3.26b)$$

and,

$$b(j) = 2\exp(vj\Delta t)\,. \qquad (6.3.26c)$$

Hence, the right hand side of equation (6.3.26a) is of the same form as equation (6.3.24), thereby permitting the use of the FFT technique. In a similar manner, Durbin applied the FFT technique to his method.

In 1987, Hsu and Dranoff [Hsu, 1987] implemented the FFT technique for the method described by Crump (equation 6.3.22). By applying the trapezoidal rule approximation to equation (6.3.2b) with $w = k\pi/T$ and $\Delta w = \pi/T$, we obtain,

$$f(t) \simeq \frac{e^{vt}}{2T}\Big\{\sum_{k=-\infty}^{\infty} F(v+ik\pi/T)\exp(ik\pi t/T)\Big\}\,. \qquad (6.3.27)$$

Assuming $t = j\Delta T$, then equation (6.3.27) can be written as,

$$f(j\Delta T) = \frac{\exp(vj\Delta T)}{2T}\Big\{\sum_{k=0}^{N-1} A(k)\exp(2\pi ikj/N)\Big\}\,, \qquad (6.3.28)$$

where,

$$A(k) = \sum_{n=-\infty}^{\infty} F\{v + i\pi(k+nN)/T\}\,, \qquad (6.3.29)$$


with $N\Delta T = 2T$ and $j=0,1,2,\ldots,N-1$. Now equations (6.3.28) and (6.3.29) can be computed by the FFT technique as follows. First, A(k) can be obtained from F(s) by the use of equation (6.3.29). Then, the complex conjugate of A(k) is the input to the FFT subroutine, and the output is $\bar{X}(j)$ for $j=0,1,2,\ldots,N-1$. The inverse function is then given by,

$$f(j\Delta T) = \frac{\exp(vj\Delta T)}{2T}\,X(j)\,, \qquad j=0,1,2,\ldots,N-1\,, \qquad (6.3.30)$$

where X(j) is the complex conjugate of $\bar{X}(j)$. Since the imaginary part of X(j) is zero, $X(j)=\bar{X}(j)$.
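Before turning to the caveats, a minimal NumPy illustration of this recipe is sketched below; truncating the aliasing sum of (6.3.29) at n=0 and halving the k=0 coefficient are simplifications of ours, so the sketch reproduces the truncated Crump series rather than the full Hsu-Dranoff formulation:

```python
import numpy as np

def laplace_invert_fft(F, N, T, v):
    """Evaluate the truncated Fourier-series inversion at the N sample
    points t_j = j*dT, with N*dT = 2T, using a single FFT.
    F must accept NumPy arrays of complex arguments."""
    dT = 2.0 * T / N
    k = np.arange(N)
    A = F(v + 1j * np.pi * k / T)        # n-sum of (6.3.29) truncated at n=0
    A[0] *= 0.5                          # half weight for the k=0 term
    X = np.fft.fft(np.conj(A))           # conjugated input, as described above
    j = np.arange(N)
    return j * dT, np.exp(v * j * dT) / T * X.real
```

In practice the later samples are the least accurate, so only the early part of the returned range should be relied upon.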

There are some disadvantages of the FFT technique, when applied

to the numerical inversion of the Laplace transforms. These

disadvantages are:

(i) When only a small number of transform values are required, it is not economic to use the FFT, since the direct Fourier sum needs only O(N) operations for each transformed value required.

(ii) N must be a highly composite number (i.e. $N=2^m$). Otherwise it is necessary to add a string of zero values to the data to make the length a power of 2.

(iii) If a re-ordering has to be carried out, no real saving in computing time is achieved for N<50.

Thus, based on the remarks above, we can say that when N is too small

or only a small number of transformed values are required, the FFT

algorithm is not economic.


6.4 PARALLEL IMPLEMENTATION OF THE NUMERICAL INVERSION OF THE LAPLACE TRANSFORM

We give in this section a parallel treatment of Crump's expression that numerically inverts the Laplace transform,

$$f(t) \simeq \frac{e^{vt}}{T}\Big[\tfrac{1}{2}F(v) + \sum_{k=1}^{\infty}\big(\mathrm{Re}\{F(v+ik\pi/T)\}\cos(k\pi t/T) - \mathrm{Im}\{F(v+ik\pi/T)\}\sin(k\pi t/T)\big)\Big]\,.$$

Numerical experiments with this parallel algorithm have been

carried out on the Balance-8000 multiprocessor MIMD system, where we

used all the available 5 processors.
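The parallel treatment itself is straightforward: the k-summation is split into strips, one per processor, and the partial sums are combined at the end. The Python sketch below conveys the strategy only; the thesis implementation used the Balance 8000 fork/join constructs, whereas the process pool, the strip decomposition and the term count here are our own illustrative choices (and F must be a picklable, module-level function):

```python
import numpy as np
from multiprocessing import Pool

def partial_sum(args):
    """Partial Crump sum over one strip of k indices."""
    F, v, T, t, k_lo, k_hi = args
    k = np.arange(k_lo, k_hi)
    Fk = F(v + 1j * k * np.pi / T)
    return np.sum(Fk.real * np.cos(k * np.pi * t / T)
                  - Fk.imag * np.sin(k * np.pi * t / T))

def crump_parallel(F, t, T, v, n_terms=2000, procs=5):
    """Evaluate the truncated Crump expression with the k-summation
    farmed out to `procs` worker processes."""
    cuts = np.linspace(1, n_terms + 1, procs + 1).astype(int)
    jobs = [(F, v, T, t, cuts[p], cuts[p + 1]) for p in range(procs)]
    with Pool(procs) as pool:
        partials = pool.map(partial_sum, jobs)
    return (np.exp(v * t) / T) * (0.5 * np.real(F(v + 0j)) + sum(partials))
```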

The speed-up values represent the ratio of the time taken by the sequential version, which contained none of the special programming constructs required for parallel processing, to the time taken by the parallel version.

Tables 6.4 and 6.5 list the elapsed execution times (in seconds) minus the cost of forking child processes and of building the child page tables, since child processes do not automatically inherit the parent's page table when they are created.

The speed-up values are plotted in Figures 6.3 and 6.4.


F(s) = exp(-4√s);  f(t) = 2(πt^3)^(-1/2) exp(-4/t)

  No. of Points   No. of Processors   Elapsed time in seconds   Speed-up
       10                1                   55.008               1
                         2                   34.942               1.574
                         3                   27.828               1.976
                         4                   22.422               2.453
                         5                   17.522               3.139

TABLE 6.4

F(s) = s^(-1/2) exp(-s^(-1));  f(t) = (πt)^(-1/2) cos(2√t)

  No. of Points   No. of Processors   Elapsed time in seconds   Speed-up
       10                1                   31.994               1
                         2                   16.628               1.924
                         3                   10.784               2.966
                         4                    8.828               3.624
                         5                    7.989               4.004

TABLE 6.5


FIGURE 6.3: The Timing Results for the First Problem (speed-up plotted against the number of processors).


FIGURE 6.4: The Timing Results for the Second Problem (speed-up plotted against the number of processors).


6.5 CONCLUSION

In this chapter, we have presented different methods for

numerically inverting the Laplace transform. Each of these methods

has its own shortcoming. Although we recommend Crump's method, there

is no general technique to invert the Laplace transform function

numerically.

The proper formula for the numerical inversion of Laplace

transform based on the Fourier series method should contain both the

sine and cosine function terms. Eliminating either the sine or cosine

function and replacing the integration by a discrete formula will

distort the original function into an even or odd function with a

concomitant decrease in the working interval of the inverted function

from (O,2T) to (O,T).


An attempt to increase the accuracy by applying Romberg integration gave accurate results only for small time intervals, which are too small to be of practical use in applying it to the full range of the transform.

Although the direct FFT technique was found to be easy in

application to numerically inverting the Laplace transform, it has

been shown that it has many disadvantages as outlined earlier.

The improvement in performance attained by the parallel implementation of the Laplace inversion is considerable, which means that the numerical inversion of the Laplace transform is a viable parallel algorithm.


CHAPTER 7

CONCLUSIONS AND FINAL REMARKS

I am no prophet -

and here's no great matter.

T.S. Eliot.


In the past computers have developed more processing power,

firstly, by an increase in the density of integration and, secondly,

by the operational acceleration of their basic switching elements.

Both these methods, however, lead to higher power dissipation per unit

area to which there are fundamental thermodynamic limits.

For further improvement of the performance attainable by present day computers, radical new technological developments must take place.

Parallel processing, therefore, is widely viewed as the only natural and feasible way forward to achieve a significant increase in processing power. Parallel computers, as has been discussed in the introductory chapters of this thesis, are classified into various types, each of which has an effective range of problems for which it is most suitable.

To date, in connection with the imminent next generation of computer systems, the epitome of contemporary computing, the supercomputer, consists of banks of high-speed processors, operating

in parallel on arrays of numbers or closely coupled in a pipeline,

with further processors to pass the information to and from secondary

storage. The total assembly achieves a very high throughput of data,

and makes use of densely packed VLSI chips.

In general, programming parallel systems is more difficult than programming uniprocessor systems, and this has led to the parallelism being concealed on most existing MIMD systems. This kind of system constitutes the primary testbed for all numerical algorithms designed and analysed herein. The actual prototype on which the extensive bulk


of our experiments and measurements were carried out was the Balance

8000 MIMD system at Loughborough University of Technology.

The problem which arises here lies in making the p processors cooperate, so that one problem can be appropriately partitioned amongst them and solved with greater speed than it could be on a uniprocessor.

The analysis of an algorithm is important for different reasons.

The most straightforward reason is to discover its vital statistics

in order to evaluate its suitability for various applications or

compare it with other algorithms. Generally, the vital statistics of

interest are the time and space requirements, most often time. In all the algorithms presented in this thesis, we were interested in determining the times for an implementation of a particular algorithm on the Balance 8000 parallel system. From that analysis, we were able to measure the relative speed-up ratio of each algorithm, which was the main criterion used to assess the efficiency gained by parallelism.

In this thesis, we have presented a characterization of the architecture of current parallel computers, such as SIMD and MIMD computers, and of how research in this area is developing as technology advances, where a whole computer can be assembled on a silicon VLSI chip which may contain millions of transistors.

In the second chapter, we have shown how to program parallel systems. However, as we have demonstrated, different applications require different language structures. One of the major divisions in

language development is that between the imperative and declarative

style of programming. The other major development is the object-oriented


approach to programming. We have also shown that some specific languages have been used on parallel computers by adding some parallel constructs.

In Chapters 4 and 5, the parallel AGE iterative method, to solve one and two dimensional elliptic and parabolic equations as well as the wave equation, was developed and implemented on the Balance 8000

parallel system. The implementation of the parallel AGE iterative

method was programmed using different versions and strategies involving

synchroneity and asynchroneity together with natural or red-black

ordering. It is clear that the implementation of the different strategies presented different timings and losses when they were run

on the Balance 8000 parallel system.

It can be seen from the experimental results that the best results were obtained when the problem is solved using the asynchronous strategy of the parallel AGE iterative method. This is due to the synchronization overheads occurring at the end of each iteration. Also, we noticed that the results obtained by applying the first strategy are better than those of the second strategy. This is due to the total number of computer operations in the second strategy being higher than that of the first strategy, as the old values are used while evaluating the next point in the second strategy.

The Sturm-Liouville problem is also solved using the AGE iterative method, in Chapter 4, where the eigenvalues and the corresponding eigenvectors were determined.

The AGE iterative method with the D'Yakonov splitting strategy to solve the diffusion equation is presented in Chapter 5. The new


strategy yields an algorithm with reduced computational complexity and

greater accuracy for multi-dimensional problems. The same three

parallel strategies were implemented on this kind of formulation.


A new strategy for solving the (n×n) complex tridiagonal matrices

derived from the finite difference/element discretisation of the

Schrodinger equation is presented in Chapter 5. In this strategy, the

solution of the complex system has been reduced to solve two identical

quindiagonal systems, in real variables. These can then be solved

directly by efficient elimination procedures involving no interchanges.

By the use of the quindiagonal solver, the computational complexity for

the algorithm is now of order O(n) multiplications, whilst the Gaussian

elimination procedure for the original matrix is considerably more.

The fundamental importance of the Laplace transform resides in its

ability to lower the transcendence level of an equation. Ordinary

differential equations involving f(t) are reduced to algebraic

equations for F(s) .

The evaluation of inverse Laplace transforms of functions is a

problem of fundamental importance in pure and applied mathematics.

Thus, it is often desirable to have a convenient means of computing

f(t) numerically for various values of t from the given values of F(s), where the function F(s) might be either a complicated expression

whose poles and residues are too difficult to obtain, or F(s) might be

itself known only numerically as a computed function of s without any

knowledge as to its explicit analytic form.

A large number of different methods for numerically inverting the Laplace transform have been developed during the last few years. In


Chapter 6, we have briefly mentioned some of these methods, whilst restricting ourselves to the methods using Fourier series approximations. We carried out a number of experimental tests to demonstrate the simplicity and the accuracy of these methods. To speed up these methods, we implemented Crump's method on the Balance 8000 MIMD parallel system, where the resulting times are represented diagrammatically.

Further work is required to include more than one iteration

parameter r in the AGE iterative method for which an improvement in

the rate of convergence is predicted.


The study in Chapter 5 of the new strategy for the numerical

solution of the Schrodinger equation can be easily extended to two

dimensional equations as well as to the nonlinear Schrodinger equation.

Finally, more extensive work is required on the numerical inversion of the Laplace transform, in particular on techniques for its efficient application to determine the numerical solution of certain classes of linear parabolic and hyperbolic equations, especially those involving three independent variables: time t and two spatial variables x and y. The

basic idea follows the usual Laplace transform approach in that after

applying the transform to the differential equation the number of

independent variables is reduced by one and the resulting subsidiary

equation involves a complex parameter s instead of the variable t.

The numerical application of a suitable inversion procedure

means that the subsidiary equation has to be solved for several

different values of the parameter s, in order to obtain the solution

of a parabolic or hyperbolic problem at some specified time t.


REFERENCES


Anderson, J.P. [1965]: "Program Structures for Parallel Processing", Comm. of ACM, Vol.8, No.12, pp.786-788.

Baer, J.L. [1982]: "Techniques to Exploit Parallelism", in Parallel Processing Systems, An Advanced Course, ed. Evans, D.J., Cambridge Univ. Press, pp.75-99.

Barlow, R.H., Evans, D.J., Newman, I.A. and Woodward, M.C. [1981]: "A Guide to Using the Neptune Parallel Processing System", Internal Report, Computer Studies Dept., L.U.T.

Baudet, G.M., Brent, R.P. and Kung, H.T. [1980]: "Parallel Execution of a Sequence of Tasks on an Asynchronous Multiprocessor", The Australian Computer J., Vol.12, No.3, pp.105-112.

Bernstein, A.J. [1966]: "Analysis of Programs for Parallel Processing", IEEE Trans. on Electronic Computers, Vol.EC-15, No.5, pp.757-763.

Booch, G. [1986]: "Object Oriented Development", IEEE Trans. Software Eng., Vol.SE-12, pp.211-221.

Carre, B.A. [1961]: "The Determination of the Optimum Acceleration Factor for Successive Overrelaxation", The Comput. J., Vol.4, pp.73-78.


Conte, S.D. [1965]: "Elementary Numerical Analysis", McGraw-Hill Inc., New York.

Cooley, J.W. and Tukey, J.W. [1965]: "An Algorithm for the Machine Calculation of Complex Fourier Series", Maths. Comp. J., Vol.19, pp.297-301.

Crump, K.S. [1976]: "Numerical Inversion of Laplace Transforms Using a Fourier Series Approximation", J. ACM, Vol.23, No.1, pp.89-96.

Davies, B. and Martin, B. [1979]: "Numerical Inversion of the Laplace Transform: A Survey and Comparison of Methods", J. of Comput. Phys., Vol.33, pp.1-32.

Dennis, J.B. and Van Horn, E.C. [1966]: "Programming Semantics for Multiprogrammed Computations", Comm. of the ACM, Vol.9, pp.143-155.

Dijkstra, E.W. [1975]: "Guarded Commands, Nondeterminacy and Formal Derivation of Programs", Comm. Assoc. Comput., Vol.18, pp.457.

Douglas, J. and Rachford, H.H. [1956]: "On the Numerical Solution of Heat Conduction Problems in Two and Three Space Variables", Trans. Amer. Math. Soc., Vol.82, pp.421-439.


Dubner, H. and Abate, J. [1968]: "Numerical Inversion of Laplace Transforms by Relating Them to the Finite Fourier Cosine Transform", J. ACM, Vol.15, No.1, pp.115-123.

Durbin, F. [1973]: "Numerical Inversion of Laplace Transforms: An Efficient Improvement of Dubner and Abate's Method", Comput. J., Vol.17, No.4, pp.371-376.

D'Yakonov, Y. [1963]: "On the Application of Disintegrating Difference Operators", Zh. Vychisl. Mat. i Mat. Fiz., Vol.3, pp.385-388.

Enslow, P.H. [1977]: "Multiprocessor Organization - A Survey", Comput. Surveys, Vol.9, No.1, pp.103-129.

Evans, D.J. and Williams, S.A. [1978]: "Analysis and Detection of Parallel Processable Code", The Comput. J., Vol.23, No.1, pp.66-72.

Evans, D.J. [1985]: "The AGE Matrix Iterative Method", 3rd Franco-South Asian Math. Conf., Kuala Lumpur, Malaysia.

Feng, T.Y. [1974]: "Data Manipulation Functions in Parallel Processors and Their Implementations", IEEE Trans. on Comp., Vol.C-23, No.3, pp.309-318.

Flynn, M.J. [1966]: "Very High Speed Computing Systems", Proc. IEEE, Vol.54, pp.1901-1909.


Fox, L. [1957]: "The Numerical Solution of Two-Point Boundary Problems in Ordinary Differential Equations", The Clarendon Press, Oxford.

Gill, S. [1958]: "Parallel Programming", Comp. J., Vol.1, pp.2-10.

Gurd, J.R., Kirkham, C.C. and Watson, I. [1985]: "The Manchester Prototype Dataflow Computer", Comm. ACM, Vol.28, No.1, pp.34-52.

Hageman, L.A. and Kellogg, R.B. [1968]: "Estimating Optimum Overrelaxation Parameters", Maths. of Comp., Vol.22, pp.60-68.

Handler, W. [1977]: "The Impact of Classification Schemes on Computer Architecture", Proc. Int. Conf. on Parallel Processing, pp.7-15.

Harland, D.M. [1985]: "Towards a Language for Concurrent Processes", Software Practice and Experience, Vol.15, pp.839-888.

Hellerman, H. [1966]: "Parallel Processing of Algebraic Expressions", IEEE Trans. on Electronic Computers, Vol.EC-15, pp.82-91.

Hobbs, L.C. and Theis, D.J. [1970]: "Survey of Parallel Processor Approaches and Techniques", in Parallel Systems: Technology and Applications, ed. Hobbs, Spartan Books, New York, pp.3-20.


Hockney, R.W. and Jesshope, C.R. [1988]: "Parallel Computers: Architecture, Programming and Algorithms", Adam Hilger, Bristol.

Honig, G. and Hirdes, U. [1984]: "A Method for the Numerical Inversion of Laplace Transforms", J. Comput. and Appl. Math., Vol.10, pp.113-132.


Hsu, J.T. and Dranoff, J.S. [1987]: "Numerical Inversion of Certain Laplace Transforms by the Direct Application of the FFT Algorithm", Comp. Chem. Engng., Vol.11, No.2, pp.101-110.

INMOS [1984]: "Occam Programming Manual", Prentice Hall, Englewood Cliffs, N.J.

Kuck, D.J. [1977]: "A Survey of Parallel Machine Organization and Programming", Comp. Surveys, Vol.9, No.1, pp.29-59.

Kung, H.T. [1980]: "The Structure of Parallel Algorithms", Advances in Computers, Vol.19, pp.65-112.

Kung, H.T. [1982]: "Notes on VLSI Computation", in Parallel Processing Systems, ed. Evans, D.J., Cambridge Univ. Press, pp.339-356.

Lanczos, C. [1956]: "Applied Analysis", Prentice-Hall, Englewood Cliffs, New Jersey.


Lawrie, D.H., Layman, T., Baer, D. and Randal, J.M. [1975]: "Glypnir - A Programming Language for ILLIAC IV", Comm. ACM, Vol.18, pp.157-164.


Miranker, W.L. [1971]: "A Survey of Parallelism in Numerical Analysis", SIAM Review, Vol.13, pp.524-547.

Muraoka, Y. [1971]: "Parallelism Exposure and Exploitation in Programs", Ph.D. Thesis, Univ. of Illinois at Urbana-Champaign.

Murtha, J. and Beadles, R. [1964]: "Survey of the Highly Parallel Information Processing Systems", Westinghouse Electric Corp., Aerospace Division, ONR Rep. No.4755.

Patel, J.H. [1981]: "Performance of Processor-Memory Interconnections for Multiprocessors", IEEE Trans. Comp., Vol.C-30, No.10, pp.771-780.

Peaceman, D.W. and Rachford, H.H. [1955]: "The Numerical Solution of Parabolic and Elliptic Differential Equations", J. Soc. Indust. Appl. Math., Vol.3, pp.28-41.

Ramamoorthy, C.V. and Li, H.F. [1977]: "Pipeline Architecture", Comp. Surveys, Vol.9, No.1, pp.61-102.

Salzer, H.E. [1958]: "Tables for the Numerical Calculation of Inverse Laplace Transforms", J. Math. and Phys., Vol.37, pp.89-109.


Saul'yev, V.K. [1964]: "Integration of Equations of Parabolic Type by the Method of Nets", transl. Tee, G.J., Pergamon, New York.

Schapery, R.A. [1962]: "Proc. 4th U.S. Nat. Congr. Appl. Mech.", ASME 2, pp.1075-1085.

Schmittroth, L.A. [1960]: "Numerical Inversion of Laplace Transforms", Comm. ACM, Vol.3, pp.171-173.

Seigel, H.T. [1979]: "Interconnection Networks for MIMD Machines", IEEE Comp. J., pp.57-65.

Seitz, C.L. [1982]: "Ensemble Architecture for VLSI - A Survey and Taxonomy", in Proc. of MIT Conf. on Advanced Research in VLSI, ed. Penfield, P., Artech House, pp.130-135.

Shapiro, E.Y. [1984]: "A Subset of Concurrent Prolog and its Interpreter", ICOT Institute for New Generation Computer Technology, Technical Rep. TR-003.

Shirtliffe, C.J. and Stephenson, D.G. [1961]: "A Computer Oriented Adaption of Salzer's Method for Inverting Laplace Transforms", J. Math. and Phys., Vol.40, pp.135-141.

Shore, J.E. [1973]: "Second Thoughts on Parallel Processing", Comput. Elect. Engng., pp.95-109.


Simon, R.M., Stroot, M.T. and Weiss, G.H. [1972]: "Numerical Inversion of Laplace Transforms with Application to Percentage Labeled Experiments", Comput. Biomed. Res., Vol.6, pp.596-607.

Squire, J.S. [1963]: "A Translation Algorithm for Multiprocessor Computers", Proc. 15th ACM Natl. Conf.

Stone, H.S. [1967]: "One-Pass Compilation of Arithmetic Expressions for a Parallel Processor", Comm. ACM, Vol.10, No.4, pp.220-223.

Stone, H.S. [1971]: "Parallel Processing with the Perfect Shuffle", IEEE Trans. Comp., Vol.C-20, pp.153-161.

Stone, H.S. [1973]: "Problems of Parallel Computation", in Complexity of Sequential and Parallel Algorithms, ed. Traub, J.F., Academic Press, New York.

Stone, H.S. [1973]: "An Efficient Parallel Algorithm for the Solution of a Tridiagonal Linear System of Equations", J. of ACM, Vol.20, No.1, pp.27-38.

Tedd, M., Crespi-Reghizzi, S. and Natali, A. [1984]: "Ada for Multiprocessors", Ada Companion Series, Cambridge University Press.


Varga, R.S. [1962]: "Matrix Iterative Analysis", Prentice-Hall, Englewood Cliffs.

Wilkinson, J.H. [1965]: "The Algebraic Eigenvalue Problem", Oxford University Press, Oxford.

Wing, O. [1967]: "An Efficient Method of Numerical Inversion of Laplace Transforms", Archs. Electron. Comp., Vol.2, p.153.

Wyllie, J.C. [1979]: "The Complexity of Parallel Computation", Ph.D. Thesis, Comp. Sci. Dept., Cornell Univ., U.S.A.

Wynn, P. [1962]: "Upon a Second Confluent Form of the ε-Algorithm", Proc. Glasgow Math. Assoc., Vol.5, p.160.

Young, D. [1954]: "Iterative Methods for Solving Partial Difference Equations of Elliptic Type", Trans. Amer. Math. Soc., Vol.76, pp.92-111.

Zakian, V. [1969]: Electron. Lett., Vol.5, pp.120-121.


APPENDIX A

A LIST OF SOME SELECTED PROGRAMS


c Program 4.1
c One-dimensional boundary value problem solved using the
c parallel AGE iterative method with the third strategy.
c
c      d**2U/dx**2 + U = x ,
c      boundary conditions  U(0) = 1 , U(pi/2) = pi/2 - 1 ,
c      exact solution       U(x) = cos x - sin x + x .
c
c n  : the order of the system
c m  : the iteration counter
c a  : the subdiagonal of A
c c  : the superdiagonal of A
c d  : the diagonal of A
c hd : half the diagonal of A
c b  : the right hand side vector
c r  : the iteration parameter
c
c Set up common block to hold shared variables
      COMMON /SHCOM/ a,b,c,hd,m,n,r,v,w,eps,det,u,bite_size,no_procs
c Declare subroutine age as argument to a function
      EXTERNAL age
c Declare system calls
      integer m_fork,m_kill_procs,m_set_procs
      integer time1,time2,bite_size,no_procs
      parameter (jmax=50)
      dimension a(jmax),b(jmax),c(jmax),d(jmax),hd(jmax),det(jmax)
     *         ,u(0:jmax),u1(0:jmax),u2(0:jmax),u3(0:jmax)
     *         ,w(jmax),v(jmax)
      data eps,pi/0.000001,3.1415926/
      r=1.44
c Input matrix size and chunk size (this read statement is a
c reconstruction: n and bite_size are used below but the original
c input line was lost in transcription)
      read(*,*) n,bite_size
      h=(pi/2.0)/real(n+1)
      hh=h*h
      do 10 i=1,n
         a(i)=-1.0
         c(i)=a(i)
         d(i)=(2.0-hh)
         hd(i)=d(i)/2.0
         b(i)=-hh*i*h
         u(i)=0.1
10    continue
      a(1)=0.0
      c(n)=0.0
      u(0)=1.0
      u(n+1)=pi/2.0-1.0
      b(1)=1.0-hh*h
      b(n)=(pi/2.0)-hh*h*n
c Divide the work up into [n/bite_size] chunks
      x=real(n)/real(bite_size)
      if(real(int(x)).eq.x)then
         no_procs=x
      else
         no_procs=int(x)+1
      endif
c Set the number of processors to be used and therefore the
c number of chunks to be evaluated at the same time
      i=m_set_procs(no_procs)
c Start timing
      call _clock_time(time1)
c Fork offspring, then assist them to evaluate the subroutine
      i=m_fork(age)
c Finish timing
      call _clock_time(time2)
      write(*,100)(time2-time1)/100.0
100   format(5x,'The time = ',f8.5)
      write(*,101)(u(i),i=1,n)
101   format(5x,f8.5)
      stop
      end

      subroutine age
c Common block contains the shared variable list
      COMMON /SHCOM/ a,b,c,hd,m,n,r,v,w,eps,det,u,bite_size,no_procs
      parameter (jmax=50)
      dimension a(jmax),b(jmax),c(jmax),hd(jmax),u(0:jmax),
     *          u1(0:jmax),u2(0:jmax),det(jmax),w(jmax),v(jmax)
c System call
      INTEGER m_next
      integer path,start,finish,bite_size,no_procs
c Get this process's path number and its first chunk of work
      path=m_next()
      start=(path-1)*bite_size+1
c Whilst there is more work, loop
102   if(start.le.n)then
         finish=start+bite_size-1
c If this is the last item of work, then set finish pointing
c to the end of the last item
         if(path.eq.no_procs) finish=n
         do 103 k=start,finish
            n1=n-1
            do 11 i=1,n
               w(i)=hd(i)+r
               v(i)=hd(i)-r
11             det(i)=(w(i)*w(i+1)-a(i+1)*c(i))
            m=0
30          m=m+1
            iflag=0
            do 12 i=0,n+1
12             u1(i)=u(i)
c ***** The first sweep *****
c u(k+1/2) = (G1+rI)**(-1) * [b-(G2-rI)*u(k)]
            do 13 i=1,n1,2
               i1=i+1
               i2=i+2
               i3=i-1
               v1=b(i)-a(i)*u1(i3)-v(i)*u1(i)
               v2=b(i1)-v(i1)*u1(i1)-c(i1)*u1(i2)
               u2(i)=(w(i1)*v1-c(i)*v2)/det(i)
               u2(i1)=(-a(i1)*v1+w(i)*v2)/det(i)
13          continue
c ***** The second sweep *****
c u(k+1) = (G2+rI)**(-1) * [b-(G1-rI)*u(k+1/2)]
            u(1)=(b(1)-v(1)*u2(1)-c(1)*u2(2))/w(1)
            do 15 i=2,n-2,2
               i1=i+1
               i2=i+2
               i3=i-1
               v1=b(i)-a(i)*u2(i3)-v(i)*u2(i)
               v2=b(i1)-v(i1)*u2(i1)-c(i1)*u2(i2)
               u(i)=(w(i1)*v1-c(i)*v2)/det(i)
               u(i1)=(-a(i1)*v1+w(i)*v2)/det(i)
15          continue
            u(n)=(b(n)-a(n)*u2(n1)-v(n)*u2(n))/w(n)
c Test for convergence
            do 17 i=1,n
               if(abs(u(i)-u1(i)).gt.eps) iflag=1
17          continue
            if(iflag.eq.1) goto 30
103      continue
c Claim the next chunk of work, if any remains
         path=m_next()
         start=(path-1)*bite_size+1
         goto 102
      endif
c All done, so return to the main program
      return
      end
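The computational kernel of both sweeps in Program 4.1 is the direct solution of independent 2x2 subsystems by Cramer's rule. The following self-contained fragment (an illustration, not one of the thesis listings; the numerical values are arbitrary) isolates the elimination pattern used in loops 13 and 15 above:

c     Sketch only: one 2x2 AGE block   [ w  c ] [x1]   [v1]
c                                      [ a  w ] [x2] = [v2]
c     solved explicitly via its determinant, exactly as in
c     the two sweeps of Program 4.1.
      program blk2
      w  =  1.9
      a  = -1.0
      c  = -1.0
      v1 =  0.5
      v2 =  0.7
      det = w*w - a*c
      x1 = ( w*v1 - c*v2)/det
      x2 = (-a*v1 + w*v2)/det
      write(*,*) 'x1 = ', x1, '  x2 = ', x2
      end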


c Program 4.2
c This program calculates the eigenvalues and the corresponding
c eigenvectors by the use of the AGE iterative method.
c
c      d**2U/dx**2 + (lemda - 2*q*cos(2x))*U = 0 ,
c      boundary conditions  U(0) = 0 , U(pi) = 0 .
c
      parameter (jmax=50)
      real lemda,lemda2,le
      dimension a(jmax),b(jmax),c(jmax),d(jmax),u(0:jmax)
     *         ,hd(jmax),ud(jmax),le(jmax),ude(jmax)
      data eps,pi/0.000001,3.1415926/
      read(*,*) n
c Choose the acceleration parameter
      r=1.50
      h=pi/real(n+1)
      hh=h*h
      q=1.0
      do 10 i=1,n
         xi=i*h
         d(i)=(2.0/hh+2.0*q*cos(2.0*xi))
10       hd(i)=d(i)/2.0
      do 11 i=1,n
         a(i)=-1.0/hh
         c(i)=a(i)
11       u(i)=0.1
      u(0)=0.0
      u(n+1)=0.0
      a(1)=0.0
      c(n)=0.0
      sum1=0.0
      sum2=0.0
c Form A*u and the Rayleigh quotient for the starting vector
      ud(1)=hd(1)*u(1)+c(1)*u(2)
      do 13 i=2,n-1
13       ud(i)=a(i)*u(i-1)+hd(i)*u(i)+c(i)*u(i+1)
      ud(n)=a(n)*u(n-1)+hd(n)*u(n)
      do 14 i=1,n
         sum1=sum1+ud(i)*u(i)
14       sum2=sum2+u(i)*u(i)
      lemda2=sum1/sum2
      l=0
15    lemda=lemda2
      l=l+1
      do 16 i=1,n
16       b(i)=lemda*u(i)
c Call the AGE subroutine
      call age(a,b,c,hd,m,n,eps,u,r)
      sum1=0.0
      sum2=0.0
      ud(1)=hd(1)*u(1)+c(1)*u(2)
      do 17 i=2,n-1
17       ud(i)=a(i)*u(i-1)+hd(i)*u(i)+c(i)*u(i+1)
      ud(n)=a(n)*u(n-1)+hd(n)*u(n)
      do 18 i=1,n
         sum1=sum1+u(i)*ud(i)
18       sum2=sum2+u(i)*u(i)
c Calculate a new value for lemda
      lemda2=sum1/sum2
      if(abs((lemda2-lemda)/lemda).gt.eps) goto 15
      lemda=lemda2
      write(*,19) n,eps,m,r
      write(*,20)(u(i),i=1,n)
      write(*,21) lemda,l
19    format(/5x,'n = ',i3,5x,'eps = ',f9.6,
     *       /5x,'No. of iterations = ',i4,5x,'r = ',f9.6,
     *       /5x,'The solution vector is ',/5x,22('" '))
20    format(5x,f10.6)
21    format(5x,'lemda = ',f9.6,4x,'l = ',i3)
c Compare A*u with lemda*u for the converged pair
      ude(1)=hd(1)*u(1)+c(1)*u(2)
      do 22 i=2,n-1
22       ude(i)=a(i)*u(i-1)+hd(i)*u(i)+c(i)*u(i+1)
      ude(n)=a(n)*u(n-1)+hd(n)*u(n)
      do 23 i=1,n
23       le(i)=lemda*u(i)
      write(*,24)
      write(*,25)
24    format(//2x,'A*u(i)',8x,'lemda*u(i)')
25    format(2x,10('"'),8x,10('"'))
      do 26 i=1,n
26       write(*,27) ude(i),le(i)
27    format(2x,f9.6,11x,f9.6)
      stop
      end

      subroutine age(a,b,c,hd,m,n,eps,u,r)
      parameter (jmax=50)
      dimension a(jmax),b(jmax),c(jmax),hd(jmax),u(0:jmax),
     *          u1(0:jmax),u2(0:jmax),det(jmax),w(jmax),v(jmax)
      real length
      n1=n-1
      n2=n-2
      do 28 i=1,n
         w(i)=hd(i)+r
         v(i)=hd(i)-r
28       det(i)=(w(i)*w(i+1)-a(i+1)*c(i))
      m=0
29    m=m+1
      do 31 i=0,n+1
31       u1(i)=u(i)
c ***** The first sweep *****
c u(k+1/2) = (G1+rI)**(-1) * [b-(G2-rI)*u(k)]
      do 32 i=1,n1,2
         i1=i+1
         i2=i+2
         i3=i-1
         v1=b(i)-a(i)*u1(i3)-v(i)*u1(i)
         v2=b(i1)-v(i1)*u1(i1)-c(i1)*u1(i2)
         u2(i)=(w(i1)*v1-c(i)*v2)/det(i)
         u2(i1)=(-a(i1)*v1+w(i)*v2)/det(i)
32    continue
c ***** The second sweep *****
c u(k+1) = (G2+rI)**(-1) * [b-(G1-rI)*u(k+1/2)]
      u(1)=(b(1)-v(1)*u2(1)-c(1)*u2(2))/w(1)
      do 33 i=2,n2,2
         i1=i+1
         i2=i+2
         i3=i-1
         v1=b(i)-a(i)*u2(i3)-v(i)*u2(i)
         v2=b(i1)-v(i1)*u2(i1)-c(i1)*u2(i2)
         u(i)=(w(i1)*v1-c(i)*v2)/det(i)
         u(i1)=(-a(i1)*v1+w(i)*v2)/det(i)
33    continue
      u(n)=(b(n)-a(n)*u2(n1)-v(n)*u2(n))/w(n)
      do 34 i=1,n
         if(abs(u(i)-u1(i)).gt.eps) goto 29
34    continue
c Normalise the eigenvector
      sumsq=0.0
      do 35 i=1,n
35       sumsq=sumsq+u(i)**2
      length=sqrt(sumsq)
      do 36 i=1,n
36       u(i)=u(i)/length
      return
      end
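Program 4.2 wraps the AGE solver in an iteration in which the eigenvalue estimate is refreshed from the Rayleigh quotient lemda = (u'Au)/(u'u). The stand-alone fragment below (an illustration, not one of the thesis listings; the tridiagonal coefficients and starting vector are arbitrary) shows that update in isolation:

c     Sketch only: Rayleigh quotient (u'Au)/(u'u) for a
c     tridiagonal A, the eigenvalue update performed after
c     each AGE solve in Program 4.2.
      program rayq
      parameter (n=5)
      dimension a(n), d(n), c(n), u(n), au(n)
      do 5 i = 1, n
         a(i) = -1.0
         c(i) = -1.0
         d(i) =  2.0
5        u(i) =  real(i)
c     au = A*u for the tridiagonal matrix (a, d, c)
      au(1) = d(1)*u(1) + c(1)*u(2)
      do 10 i = 2, n-1
10       au(i) = a(i)*u(i-1) + d(i)*u(i) + c(i)*u(i+1)
      au(n) = a(n)*u(n-1) + d(n)*u(n)
      s1 = 0.0
      s2 = 0.0
      do 20 i = 1, n
         s1 = s1 + u(i)*au(i)
20       s2 = s2 + u(i)*u(i)
      write(*,*) 'rayleigh quotient = ', s1/s2
      end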


c Program 5.1
c One-dimensional diffusion-convection equation solved using
c the parallel AGE algorithm with the second strategy.
c
c      dU/dt = e*d**2U/dx**2 - k*dU/dx ,  0 < x < 1 , 0 < t .
c      Boundary conditions  U(0,t) = 0 , U(1,t) = 1 .
c      Initial condition    U(x,0) = 0 ,  0 < x < 1 .
c
c n  : the order of the system
c a  : the subdiagonal of A
c c  : the superdiagonal of A
c d  : the diagonal of A
c hd : half the diagonal of A
c r  : the acceleration parameter
c tl : maximum value in the t-direction
c k  : increment in time
c sx : maximum value in the x-direction
c h  : increment on the x-axis
c nt : number of time levels
c
      parameter (jmax=50)
      dimension u(0:jmax),u1(0:jmax),u2(0:jmax),u3(0:jmax)
     *         ,u4(0:jmax)
      integer*4 nprocs,status,time1,time2
      real k
c nprocs: the number of processors for each run
c status: holds the return value of the library routine that
c         actually sets the number of processors
      data eps,kmax,pi/0.00005,300,3.1415926/
      write(*,*) 'Number of procs, n'
      read(*,*) nprocs,n
      tl=0.1
      sx=1.0
      r=0.5
      k=0.001
      nt=21
      i1=1
      i2=1
      h=sx/(n+1)
      hh=h*h
      nt1=nt-1
      rl=k/hh
c Input matrix data
      alf=0.5*rl*h
      a=-0.5*(rl-alf)
      c=-0.5*(rl+alf)
      d=(1.0+rl)
      hd=d/2.0
      v=r-hd
      w=r+hd
      det=w*w-a*c
      do 10 i=0,n
         u(i)=0.0
10       u3(i)=u(i)
      a1=-c*w
      b1=rl*w
      c1=-a*rl
      d1=a*a
      a2=c*c
      b2=-c*rl
      d2=-a*w
      s1=d1
      s2=d2
      do 11 jt=1,nt1
         t1=jt*k
         jt2=jt+1
         u(n+1)=1.0
         u3(n+1)=u(n+1)
         ea=0.5*(rl+alf)
         f=1.0-rl
         g=0.5*(rl-alf)
         do 12 it=1,kmax
            ic=1
c ***** First sweep *****
c u(k+1/2) = (G1+rI)**(-1) * [(rI-G2)*u(k)+b]
            b11=ea*u3(0)-c*u3(0)+f*u3(1)+g*u3(2)
            u1(1)=(v*u(1)-a*u(2)+b11)/w
c Set the number of processors.  Marks the beginning of the
c timed section of code, which includes everything except I/O.
            status=m_set_procs(nprocs)
            call _clock_time(time1)
c$doacross share(u1,u,ea,u3,f,g,w,a,c,a1,b1,c1,d1,det,a2,b2,d2),
c$&        local(b11,b22,r1)
            do 13 i=3,n-2,2
               b11=ea*u3(i-1)+f*u3(i)+g*u3(i+1)
               b22=ea*u3(i)+f*u3(i+1)+g*u3(i+2)
               r1=w*b11-a*b22
13             u1(i)=(a1*u(i-1)+b1*u(i)+c1*u(i+1)+d1*u(i+2)+r1)/det
c$doacross share(u1,u,ea,u3,f,g,w,a,c,a1,b1,c1,d1,det,a2,b2,d2),
c$&        local(b11,b22,r2)
            do 14 i=2,n-3,2
               b11=ea*u3(i-1)+f*u3(i)+g*u3(i+1)
               b22=ea*u3(i)+f*u3(i+1)+g*u3(i+2)
               r2=w*b22-c*b11
14             u1(i)=(a2*u(i-1)+b2*u(i)+b1*u(i+1)+d2*u(i+2)+r2)/det
            b11=ea*u3(n-2)+f*u3(n-1)+g*u3(n)
            b22=ea*u3(n-1)+f*u3(n)+g*u3(n+1)-a
            r1=w*b11-a*b22
            r2=w*b22-c*b11
            u1(n-1)=(a1*u(n-2)+b1*u(n-1)+c1*u(n)+r1)/det
            u1(n)=(a2*u(n-2)+b2*u(n-1)+b1*u(n)+r2)/det
c boundary value for the last 2x2 block of the second sweep
            u1(n+1)=u(n+1)
c ***** Second sweep *****
c u(k+1) = (G2+rI)**(-1) * [(rI-G1)*u(k+1/2)+b]
            b11=ea*u3(0)+f*u3(1)+g*u3(2)
            b22=ea*u3(1)+f*u3(2)+g*u3(3)
            r11=w*b11-a*b22
            r22=w*b22-c*b11
            u2(1)=(b1*u1(1)+c1*u1(2)+s1*u1(3)+r11)/det
            u2(2)=(b2*u1(1)+b1*u1(2)+s2*u1(3)+r22)/det
c$doacross share(u2,ea,u3,f,g,w,a,c,a1,b1,c1,s1,u1,det,a2,b2,s2),
c$&        local(b11,b22,r11)
            do 15 i=3,n-2,2
               b11=ea*u3(i-1)+f*u3(i)+g*u3(i+1)
               b22=ea*u3(i)+f*u3(i+1)+g*u3(i+2)
               r11=w*b11-a*b22
15             u2(i)=(a1*u1(i-1)+b1*u1(i)+c1*u1(i+1)+s1*u1(i+2)
     *               +r11)/det
c$doacross share(u2,ea,u3,f,g,w,a,c,a1,u1,b1,c1,s1,det,a2,b2,s2),
c$&        local(b11,b22,r22)
            do 16 i=4,n-1,2
               b11=ea*u3(i-1)+f*u3(i)+g*u3(i+1)
               b22=ea*u3(i)+f*u3(i+1)+g*u3(i+2)
               r22=w*b22-c*b11
16             u2(i)=(a2*u1(i-1)+b2*u1(i)+b1*u1(i+1)+s2*u1(i+2)
     *               +r22)/det
            bn=ea*u3(n-1)+f*u3(n)+g*u3(n+1)-a
            u2(n)=(-c*u1(n-1)+v*u1(n)+bn)/w
c Generate solutions on each time level.  Set ic=1 for
c successful convergence and 0 otherwise.
            do 17 i=1,n
               if(abs(u2(i)-u(i))-eps) 17,17,18
18             ic=0
17          continue
            do 19 i=1,n
19             u(i)=u2(i)
            if(ic.ne.1) goto 12
            do 20 i=1,n
20             u3(i)=u2(i)
c Terminate the timing and the child processes created by
c the c$doacross.
            call _clock_time(time2)
            call m_kill_procs
            print *,'The time = ',(time2-time1)/100.0
            do 21 jw=i1,nt1,i1
               if(jt.eq.jw) then
                  write(*,23) jt2,t1
                  write(*,22)(u(i),i=1,n)
               endif
21          continue
23          format(/,'The AGE iterative solutions at time level ',
     *             'nt = ',i3,/,'time t = ',f10.6)
22          format(2x,10f10.6)
            goto 11
12       continue
         write(*,24) kmax,jt2
24       format(/,'Method fails to converge in ',i4,' iterations',
     *          /,'at time level nt = ',i3)
11    continue
      stop
      end
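The parallelism in Programs 5.1 and 5.2 comes entirely from the Sequent c$doacross directive, which distributes the iterations of the DO loop that follows it across the processors; the share and local clauses name the variables that are global to, or private to, each process. A stripped-down sketch of the pattern (illustrative only; on a compiler without the directive the c$ lines are ordinary comments and the loop simply runs serially):

c     Sketch only: the c$doacross loop-parallelism pattern
c     used throughout the listings above.
      program dopar
      parameter (n=10)
      dimension x(n), y(n)
      do 5 i = 1, n
5        x(i) = real(i)
c$doacross share(x,y), local(i)
      do 10 i = 1, n
         y(i) = 2.0*x(i)
10    continue
      write(*,*) (y(i), i = 1, n)
      end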


c Program 5.2
c One-dimensional heat equation solved using the parallel AGE
c algorithm with the first strategy (AGE with the D'Yakonov
c splitting).
c
c      dU/dt = d**2U/dx**2 ,  0 < x < pi , 0 < t .
c      Boundary conditions  U(0,t) = 0 , U(pi,t) = 0 .
c      Initial condition    U(x,0) = sin x ,  0 < x < pi .
c
c n  : the order of the system
c a  : the subdiagonal of A
c c  : the superdiagonal of A
c d  : the diagonal of A
c hd : half the diagonal of A
c r  : the acceleration parameter
c tl : maximum value in the t-direction
c k  : increment in time
c sx : maximum value in the x-direction
c h  : increment on the x-axis
c nt : number of time levels
c
      parameter (jmax=100)
      dimension u(0:jmax),u1(0:jmax),u2(0:jmax),u3(0:jmax),
     *          u4(0:jmax),er(jmax),re(jmax)
      integer*4 nprocs,status,time1,time2
      real k
c nprocs: the number of processors for each run
c status: holds the return value of the library routine that
c         actually sets the number of processors
      data pi,eps,kmax/3.1415926,0.000001,300/
      write(*,*) 'Number of procs, n'
      read(*,*) nprocs,n
      tl=0.1
      k=0.005
      sx=pi
      r=0.5
      nt=21
      i1=4
      i2=1
      h=pi/(n+1)
      hh=h*h
      nt1=nt-1
      rl=k/hh
c Input data
      a=-rl
      c=a
      hd=1.0+rl
      w=hd+r
      v=r-hd
      aa=a*a
      cc=c*c
      det=w*w-a*c
      do 10 i=1,n
         u(i)=sin(i*h)
10       u4(i)=u(i)
      a1=-(v*w*c+a*cc)
      b1=(w*w+v*a*c)
      c1=-(v*w*a+w*a)
      d1=(aa*w+v*aa)
      a2=(v*cc+w*cc)
      b2=-(w*c+v*w*c)
      c2=(v*a*c+w*w)
      d2=-(aa*c+v*w*a)
      do 12 jt=1,nt1
         t1=jt*k
         jt2=jt+1
         do 13 it=1,kmax
            ic=1
            do 14 i=1,n
14             u3(i)=exp(-t1)*sin(i*h)
c ***** First sweep *****
c u(k+1/2) = (G1+rI)**(-1) * [(rI-G2)*u(k)+b]
            b11=(2.0-2.0*rl)*u4(1)+rl*u4(2)
            u1(1)=(w*u(1)-v*a*u(2)+b11)/w
c Set the number of processors.  Marks the beginning of the
c timed section of code, which includes everything except I/O.
            status=m_set_procs(nprocs)
            call _clock_time(time1)
c$doacross share(u4,rl,u1,u,a1,b1,c1,d1,a2,b2,c2,d2,det,w,c,a),
c$&        local(b11,b22,r1,r2)
            do 15 i=2,n-3,2
               b11=rl*(u4(i-1)+u4(i+1))+(2.0-2.0*rl)*u4(i)
               b22=rl*(u4(i)+u4(i+2))+(2.0-2.0*rl)*u4(i+1)
               r1=(w*b11-a*b22)
               r2=-(c*b11-w*b22)
               u1(i)=(a1*u(i-1)+b1*u(i)+c1*u(i+1)+d1*u(i+2)+r1)/det
15             u1(i+1)=(a2*u(i-1)+b2*u(i)+c2*u(i+1)+d2*u(i+2)+r2)/det
            b11=rl*(u4(n-2)+u4(n))+(2.0-2.0*rl)*u4(n-1)
            b22=rl*u4(n-1)+(2.0-2.0*rl)*u4(n)
            r1=(w*b11-a*b22)
            r2=-(c*b11-w*b22)
            u1(n-1)=(a1*u(n-2)+b1*u(n-1)+c1*u(n)+r1)/det
            u1(n)=(a2*u(n-2)+b2*u(n-1)+c2*u(n)+r2)/det
c ***** Second sweep *****
c u(k+1) = (G2+rI)**(-1) * u(k+1/2)
            u2(1)=(w*u1(1)-a*u1(2))/det
            u2(2)=(-c*u1(1)+w*u1(2))/det
c$doacross share(u2,u1,a,c,w,det), local(i)
            do 16 i=3,n-1,2
               u2(i)=(w*u1(i)-a*u1(i+1))/det
16             u2(i+1)=(-c*u1(i)+w*u1(i+1))/det
            u2(n)=(u1(n))/w
c Generate solutions on each time level.  Set ic=1 for
c successful convergence and 0 otherwise.
            do 17 i=1,n
               if(abs(u2(i)-u(i))-eps) 17,17,18
18             ic=0
17          continue
            do 19 i=1,n
19             u(i)=u2(i)
            do 20 i=1,n
20             er(i)=abs(u(i)-u3(i))
            do 21 i=1,n
21             re(i)=abs((u3(i)-u(i))/u3(i))
            if(ic.ne.1) goto 13
            do 22 i=1,n
22             u4(i)=u2(i)
c Terminate the timing and the child processes created by
c the c$doacross.
            call _clock_time(time2)
            call m_kill_procs
            print *,'The time = ',(time2-time1)/100.0
            do 23 jw=i1,nt1,i1
               if(jt.eq.jw) then
                  write(*,24) jt2,t1
                  write(*,25) it
                  write(*,26)
                  write(*,27)
                  write(*,28)(u(i),i=1,n,i2)
                  write(*,29)
                  write(*,27)
                  write(*,28)(u3(i),i=1,n,i2)
                  write(*,30)
                  write(*,27)
                  write(*,28)(er(i),i=1,n,i2)
                  write(*,31)
                  write(*,27)
                  write(*,28)(re(i),i=1,n,i2)
               endif
23          continue
24          format(/,'AGE iterative solution at time level nt = ',
     *             i3,/,'Time t = ',f10.6)
25          format(' Method converges with l = ',i4,' iterations',/)
26          format(' The numerical solution is ')
27          format(1x,50('"'))
28          format(10f10.6)
29          format(' The exact solution is ')
30          format(' The absolute error is ')
31          format(' The relative error is ')
            goto 12
13       continue
         write(*,32) kmax,jt2
32       format(' Method fails to converge in ',i4,' iterations',
     *          /,' at time level nt = ',i4)
12    continue
      stop
      end
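Both listings test for convergence componentwise with an arithmetic IF. For readers unused to that form, the fragment below (illustrative only, with arbitrary data) performs the identical test with a logical IF: ic remains 1 only if every component has changed by less than eps.

c     Sketch only: the convergence test of Programs 5.1 and 5.2
c     rewritten with a logical IF in place of the arithmetic IF.
      program ctest
      parameter (n=5)
      dimension u(n), u2(n)
      data u  /1.0, 2.0, 3.0, 4.0, 5.0/
      data u2 /1.0, 2.0, 3.1, 4.0, 5.0/
      eps = 0.00005
      ic = 1
      do 10 i = 1, n
         if (abs(u2(i)-u(i)) .gt. eps) ic = 0
10    continue
      write(*,*) 'converged flag ic = ', ic
      end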
