

Parallel Algorithms for Distributed Control

A Petri Net Based Approach

Zdeněk Hanzálek
Department of Control Engineering
Czech Technical University in Prague
Karlovo nám. 13
121 35 Prague 2, Czech Republic
[email protected]

January 6, 2003


Life can only be understood going backwards, but it must be lived going forwards.

Kierkegaard

To my parents


Acknowledgments

This research has been conducted at LAAS-CNRS Toulouse and at DCE CTU Prague as part of the research projects: New control system structures for production machines (supported by CTU grant GACR 102/95/0926), INCO COPERNICUS - TRAFICC (supported by the Commission of the European Communities), Trnka Laboratory for Automatic Control (supported by the Ministry of Education of the Czech Republic under VS97/034) and a French research project (supported by Ambassade de France à Prague).

I would like to express my gratitude and appreciation to all the people whose efforts contributed to this thesis:
Gérard Authié for his correctness and patience
Jiří Bayer for his encouragement and confidence
Robert Valette for his ability to simplify complicated problems
Jan Bílek for his valuable suggestions
Dominique de Geest and Vincent Garric for their friendship
Patrick Danès for the address 'mon meilleur ami'
Christophe Calmettes for his touchy humour
Frédéric Viader for his rough humour
Jean Philippe Marin for his good nature
and many others.

I also express my gratitude to the chairman, the reviewers and the members of my dissertation committee: Vladimír Kučera, Hassane Alla, Milan Češka, Jiří Kadlec, Guy Juanole and Branislav Hrúz.


Preface

In order to better understand the principal problems coming from the nature of parallel processing, we introduce basic concepts including modelling by data dependence graphs, program transformations, partitioning and scheduling. Static scheduling of nonpreemptive tasks on identical processors is surveyed as a mathematical background reflecting the problem complexity.

Time complexity measures and global communication primitives are given to introduce the principal terminology originating from computer science. The importance of global communication primitives is well illustrated by one example of a parallel algorithm - gradient training of feedforward neural networks.

A message-passing architecture is presented to simulate multilayer neural networks, adjusting the weights for each pair consisting of an input vector and a desired output vector. Such an algorithm comprises many data dependencies, so it is very difficult to parallelize. Then the implementation of a neuron, split into the synapse and the body, is proposed by arranging virtual processors in a cascaded torus topology. Mapping virtual processors onto node processors is done with the intention of minimizing external communication. Then internal communication is reduced and an implementation on a physical message-passing architecture is given. A time complexity analysis arises from the algorithm specification and some simplifying assumptions. Theoretical results are compared with experimental ones measured on a transputer based machine. Finally the algorithm based on the splitting operation is compared with a classical one.

The example shows well one possibility of making an efficient implementation of a parallel algorithm. This approach demands a deep knowledge of the given application and considerable experience in parallel processing. This difficulty is probably one of the major reasons why parallel computers are more popular among theoreticians than in applications. Another possibility is to transform sequential programs into equivalent parallel programs. Even if the complete application cannot be translated automatically, the aim is to facilitate the task of the programmer by translating some sections of the code and performing operations exploiting parallelism and detecting global data movements. Such an approach is a very complex task, and that is why we do not focus only on the problem solution but mainly on the problem formalisation and analysis in order to better understand the problem nature. Petri Nets are the formalism adopted in this thesis, because they make it possible to model and visualize behaviours comprising concurrency, synchronization and resource sharing. As such they are a convenient tool to model parallel algorithms.

The first objective is to study structural properties of Petri Nets, which offer a profound mathematical background originating mainly from linear algebra and graph theory. It is argued that positive invariants in particular are of interest when analyzing structural net properties, and the notion of a generator is introduced. The importance of generators lies in their usefulness for analyzing net properties. Then three existing algorithms finding a set of generators are implemented. It is argued that the number of generators is nonpolynomial, and an original algorithm comprising structural information of fork-join components is proposed. The usefulness of generators consists in their ability to express structural properties of Petri Nets. That is why the set of generators in combination with the initial marking is often used to prove properties like liveness, or it serves as input data for scheduling algorithms.

The second objective is to model algorithms via Petri Nets. It is stated that the model can be based either on the problem analysis or on the sequential algorithm. The problems are modelled as noniterative or iterative ones, and reduction rules preserving model properties are studied. An attempt is made to put DDGs and Petri Nets on the same platform when removing antidependencies and output dependencies.

The third objective is to schedule nonpreemptive tasks with precedence constraints given by event graphs on an unlimited, but possibly minimal, number of identical processors. Particular attention is paid to loop scheduling, which is essential when designing efficient compilers for parallel architectures.

The fourth objective studied in this thesis is the detection of global communication primitives. In the presence of communication, the above scheduling problem has been found to be much more difficult than the classical scheduling problem, where communication is ignored. Global data movements such as broadcasting or gathering occur quite frequently in certain classes of algorithms. If such data movement patterns can be identified from the PN model representation of the algorithm, then the calls to communication routines can be issued without detailed knowledge of the target machine, while the communication routines are optimized for the specific target machine.

This thesis has a strong interdisciplinary character. An attempt is made to put the knowledge of various scientific branches on the same theoretical platform. We try to adopt a method developed in one of them and to elaborate it in another scientific branch. Such scientific branches include computer science, Petri Nets, linear algebra and graph theory.


Contents

1 Introduction

2 Basic concepts of parallel processing
  2.1 Architecture
  2.2 Parallelization
    2.2.1 Modelling and parallelism detection
    2.2.2 Partitioning
    2.2.3 Scheduling
    2.2.4 Survey on static scheduling of nonpreemptive tasks on identical processors
  2.3 Time complexity measures
  2.4 Communication
    2.4.1 Communication model
    2.4.2 Network topologies
    2.4.3 Global communication primitives
  2.5 Conclusions

3 An example of parallel algorithm
  3.1 Algorithms for neural networks
  3.2 Neural network algorithm specification
  3.3 Simple mapping
  3.4 Cascaded torus topology of virtual processors
  3.5 Mapping virtual processors onto processors
  3.6 Data distribution
  3.7 Time complexity analysis
  3.8 Some experimental results
  3.9 Comparison with a classical algorithm
  3.10 Conclusion

4 Structural analysis of Petri nets
  4.1 Basic notion
  4.2 Linear invariants
  4.3 Finding positive P-invariants
    4.3.1 An algorithm based on Gauss Elimination Method
    4.3.2 An algorithm based on combinations of all input and all output places
    4.3.3 An algorithm finding a set of generators from a suitable Z basis
    4.3.4 Time complexity measures and problem reduction
    4.3.5 Conclusion

5 Petri net based algorithm modelling
  5.1 Additional basic notions
    5.1.1 Implicit place
    5.1.2 FIFO event graph
    5.1.3 Uniform graph
  5.2 Model based on the problem analysis
    5.2.1 Noniterative problems
    5.2.2 Iterative problems
    5.2.3 Model reduction
  5.3 Model based on the sequential algorithm
    5.3.1 Acyclic algorithms
    5.3.2 Cyclic algorithms
    5.3.3 Detection and removal of antidependencies and output dependencies in a PN model
  5.4 Conclusion

6 Parallelization
  6.1 Basic principles
    6.1.1 Data parallelism
    6.1.2 Structural parallelism
    6.1.3 Noniterative versus iterative scheduling problems
  6.2 Cyclic scheduling
    6.2.1 Additional terminology
    6.2.2 Structural approach to scheduling
    6.2.3 Quasi-dynamic scheduling
  6.3 Communications
    6.3.1 The problem complexity
    6.3.2 Finding global communications
    6.3.3 Relation to automatic parallelization
  6.4 Conclusion

7 Conclusion


List of Tables

2.1 Earliest, latest, and feasible task execution time
3.1 Numerical values for neural network with 30-150-150-30 neurons
4.1 Generator computational levels


List of Figures

2.1 Data dependence graph
2.2 Data dependence graph
2.3 Partitioning a computational graph: (a) fine grain; (b) coarse grain
2.4 Directed acyclic graph
2.5 An instance of a scheduling problem
2.6 An instance of a scheduling problem with the critical subgraph
2.7 Earliest schedule for the instance from Figure 2.5
2.8 Some specific topologies
2.9 Hypercube interconnection
2.10 Hierarchy and duality of the basic communication problems
2.11 Hierarchy example
2.12 Duality example

3.1 Artificial neuron j in layer l
3.2 Example of a multilayer neural network (NN 2-4-4-2)
3.3 The activation in the second hidden layer (4 neurons in both hidden layers mapped on 4 NPs) and its Petri net representation
3.4 Cascaded torus topology of VPs (for NN 2-4-4-2)
3.5 VPs simulating NN with 16-16-16-16 neurons mapped on 4 x 4 NPs
3.6 Realization on an array of 17 transputers
3.7 Separate parts of the execution time for NN with 64-64-64-64 neurons
3.8 Theoretical execution time of NN algorithm
3.9 Comparison of theoretical and experimental results
3.10 Experimental results achieved on a T-node machine

4.1 An example of Petri Net
4.2 A Petri Net (billiard balls in [22])
4.3 A Petri Net with two positive P-invariants and one T-invariant
4.4 A Petri net with one negative P-invariant
4.5 Event graph with two generators
4.6 Subspace of positive linear invariants
4.7 A Petri Net with four generators
4.8 Subnet of the PN representation of a Neural network algorithm
4.9 Example of Petri Net with exponential number of generators


5.1 Implicit place
5.2 Expressive power of different modelling methods
5.3 Representation of data flow by means of PNs and DAG
5.4 Matrix[3,3]-vector[3] multiplication
5.5 Discrete time linear system
5.6 PN model of the discrete time linear system of second order with PD controller shown in Figure 5.5
5.7 NN learning algorithm represented by Petri Net
5.8 Simplified PN model
5.9 PN model after reduction
5.10 Representation of the algorithm from Example 5.1
5.11 PN representation of Example 5.2
5.12 Detection (a) and removal (b) of antidependence
5.13 Antidependence in a cyclic algorithm
5.14 Output dependence
5.15 Two antidependencies and one output dependence
5.16 The six possible instances of an output dependence
5.17 Problem with two competitors and two destinations
5.18 Rough comparison of the two modelling approaches

6.1 Two PN models of the same vector operation
6.2 PN model of cyclic algorithm from Figure 5.11 after removal of IP-dependencies
6.3 Graph corresponding to the parallel matrix Π of the algorithm given in Figure 6.2
6.4 Cyclic version of matrix[3,3]-vector[3] multiplication
6.5 A simple instance for structural scheduling
6.6 Underlying directed graph for Figure 6.5 and its reduction
6.7 Schedule of the cyclic version of the matrix[3,3]-vector[3] multiplication (from Figure 6.4 by elimination of implicit places)
6.8 Global communication primitives of matrix-vector multiplication
6.9 Automatic parallelization


Chapter 1

Introduction

This thesis presents an original method for algorithm parallelization using Petri Nets.

General view

Two basic approaches are used for designing parallel algorithms:
a) direct design
b) use of already existing sequential algorithms which are parallelized automatically.

The approach a), shown in Chapter 3, requires deep understanding of the problem for which we write a parallel algorithm. The first step that one may take is to understand the nature of the computations. The second step is to design a parallel algorithm. The third is to map the parallel algorithm onto a suitable computer architecture. This design is often linked to a given architecture. Its implementation on other machines can lead to serious problems.

In the approach b), studied in Chapters 6 and 5, we investigate a possibility to transform automatically sequential programs into equivalent parallel programs. This task is very complex and it is not possible to find a solution for a general case. That is why we look for basic principles in order to do as much work as possible automatically. Even if the complete application cannot be translated automatically, the aim is to facilitate the task of the programmer by translating some sections of the code and performing operations exploiting parallelism and detecting global data movements.

There are several surveys on automatic parallelization [1], [53] and several descriptions of experimental systems [54], [80]. Banerjee et al. [5] presented an overview of techniques for a class of translators whose objective is to transform sequential programs into equivalent parallel programs. These studies are usually based on data dependence graphs.

The methodology adopted in this thesis is based on Petri Nets and their structural properties, described in Chapter 4. We try to clarify the contribution of Petri Nets in the domain of automatic parallelization. The contribution of this thesis does not lie in the theory of Petri Nets; rather, it studies why and how they can be used in a general methodology for program parallelization.

This thesis has a strong interdisciplinary character. An attempt is made to put the knowledge of various scientific branches on the same theoretical platform. We try to adopt a method developed in one of them and to elaborate it in another scientific branch. Motivated by problems arising from parallel processing, we look for solutions from domains like Petri Nets and graph theory. In general we can say that the methods adopted in this thesis fall into applied mathematics.

We do not concentrate only on the problem solution but focus mainly on the problem analysis in order to better understand the problem nature.

Organization

The thesis is divided into five principal chapters.

Chapter 2 is an introduction to parallel processing. Parallel processing is a fast growing technology that covers many areas of computer science and control engineering. It is natural that the concurrency inherent in physical systems starts to be reflected in computer systems. Parallelism brings higher speed and both hardware and software distribution, but raises a new set of complex problems to solve. Most of these problems are surveyed in Chapter 2. The chapter contains basic computer models, modelling by Data dependence graphs, partitioning, scheduling, performance measures and global communication primitives. It is a basic chapter introducing the concept of dependencies and techniques for parallelism analysis in algorithms. Simple examples introducing the notions of antidependencies and output dependencies are given.

Chapter 3 is an example of a parallel algorithm implementation. This chapter presents a usual approach to parallel processing where parallelization is not done automatically. The example chosen is a message-passing architecture simulating multilayer neural networks, adjusting their weights for each pair consisting of an input vector and a desired output vector. It is shown why such an algorithm is difficult to parallelize. A solution based on fine-grain partitioning is studied, implemented and compared with a classical one demanding more communications. A time complexity analysis is given and theoretical results are compared with experimental ones measured on a physical machine.

Petri Nets make it possible to model and visualize behaviours comprising concurrency, synchronization and resource sharing. As such they are a convenient tool to model parallel algorithms. The objective of Chapter 4 is to study structural properties of Petri Nets, which offer a profound mathematical background originating mainly from linear algebra and graph theory.

Chapter 5 uses Petri Nets to formalize algorithm data dependencies. It is stated that a model can be based either on the problem analysis or on a sequential algorithm. Noniterative as well as iterative problems are under consideration. An attempt is made to put Data dependence graphs and Petri Nets on the same platform when removing antidependencies and output dependencies. A Petri Net based data flow model is defined and the difficulties arising from the algorithm representation are clarified.

After modelling issues and structural analysis we focus our interest on the scheduling problem without communication delays. A scheduling problem of cyclic algorithms is studied on an unlimited, but possibly minimal, number of processor resources in Chapter 6. Then data movement patterns are identified from the Petri Net model representation of the algorithm, and the calls to communication routines are issued without any detailed knowledge of the target machine.

The field of parallel processing is expanding rapidly and new, improved results become available every year. That is why current parallel supercomputers and detailed implementation issues of parallel algorithms are described only in brief.


Chapter 2

Basic concepts of parallel processing

The objective of this chapter is to introduce the basic terminology and some elementary methods from the field of parallel processing. It has evolved from class notes for a graduate level course in parallel processing originated by the author at the Department of Control Engineering, CTU Prague.

Parallel processing comprises algorithms, computer architectures, programming, scheduling and performance analysis. There is a strong interaction among these aspects and it becomes even more important when we want to implement systems. A global understanding of these aspects allows programmers to make trade-offs in order to increase the overall efficiency. This chapter is organized in four paragraphs. First a systematic architectural classification is given. Then the principal problems of program parallelization, such as modelling, partitioning and scheduling, are classified. The third paragraph gives the basic terminology of time complexity measures and the fourth presents communication aspects of parallel processing.

The chapter emphasizes the crucial problems of parallel processing - modelling, scheduling and communications. Difficult concepts are introduced via simple motivating examples in order to facilitate understanding.

2.1 Architecture

A multitude of architectures has been developed for multiprocessor systems. The first systematic architectural classification by Flynn [31] is not unique in every respect, but it is still widely used. Flynn classifies computing systems according to the hardware provided to service the instruction and data streams.

A classical, purely serial monoprocessor executes a single stream of instructions; the execution is sequential. A system that includes only one processor of this type is classified as a SISD - Single Instruction, Single Data stream machine. There exists a certain level of parallelism in SISD machines, but it is limited to overlapping of instruction cycles (on RISC processors) and to the internal, register-level parallelism in hardware.

A classical form of parallelism exists in vector processors that support massive pipelining and in array processors that operate on multiple data streams. These computers are commonly denoted as SIMD - Single Instruction, Multiple Data stream machines and are generally used for numerical computation.

Finally there is a large group of computing systems that include more than one processor able to process Multiple Instruction over Multiple Data streams - MIMD. These computers are suitable for a much larger class of computations than SIMD computers because they are inherently more flexible. This flexibility is achieved at the cost of a considerably more difficult mode of operation. Only MIMD computers will be considered further in this thesis.

Two distinct classes of MIMD machines exist regarding the way in which processors communicate:
(1) shared-memory communication, in which processors communicate via a common memory
(2) message passing communication, in which processors communicate via communication links.

Running parallel algorithms on these two distinct classes of parallel computers leads to quite distinct kinds of problems (e.g. memory access management in shared-memory computers or message routing in message passing computers). Only message passing communication will be considered further in this thesis. In such systems each processor uses its own local memory for storing some problem data and intermediate algorithmic results, and exchanges information with other processors in groups of bits, usually called packets, using the communication links of the network.

2.2 Parallelization

First of all we introduce the term SIMD parallelization, where the algorithm is of such a kind that no code decomposition is needed because we run the same code on different processors working on different data. First the data are distributed to the processors, then the computation is performed in each processing element without communication with the others, and finally the results are collected in one processor. This type of parallelization is very close to the SIMD computers, where one instruction operates on several operands simultaneously. Algorithm partitioning (see Section 2.2.2) and scheduling (see Section 2.2.3) are easier problems in data parallel algorithms.

In this thesis we will talk about so-called functional parallelization, where processors run different codes and communicate with each other during the processing. In such a case the term parallelization covers the following problems that have to be solved when mapping a program onto a MIMD computer:
1) algorithm modelling and parallelism detection
2) partitioning the program into sequential tasks
3) scheduling the tasks onto processors

The parallelism in a program depends on the nature of the problem and the algorithm used by the programmer. The parallelism analysis is usually independent of the target machine. On the other hand, partitioning and scheduling are designed to minimize the parallel execution time of the program on a target machine, and depend on parameters such as the number of processors, processor performance, communication time, scheduling overhead etc.

2.2.1 Modelling and parallelism detection

There are several ways to model the dynamic behaviour of algorithms; the most common modelling techniques are data dependence graphs (DDG), directed acyclic graphs (DAG), and Petri nets. The following paragraph will explain in brief a modelling technique based on DDGs. DAGs and Petri nets will be analyzed later.

Parallelism detection involves finding sets of computations that can be performed simultaneously. The approach to parallelism is based on the study of data dependencies. The presence of a data dependence between two computations implies that they cannot be performed in parallel; the fewer the dependencies, the greater the parallelism. An important problem is determining the way to express dependencies.

Data dependence graph

The data dependence graph (DDG) is a directed graph G(V, E) with vertices V corresponding to statements in a program, and edges E representing data dependencies of three kinds:
1) Data-flow dependencies (indicated by the symbol →) express that the variable produced in a statement will be used in a subsequent statement
2) Data antidependencies (indicated by a crossed arrow) express that the value produced in one statement has been previously used in another statement. If this dependency is violated (by parallel execution of the two statements), it is possible to overwrite some variables before they are used
3) Data-output dependencies (indicated by an arrow marked with a circle) express that both statements overwrite the same memory location. When executed in parallel it is not determined which statement writes first.


Example 2.1: consider a simple sequence of statements:

S1: A = B + C
S2: B = A + E
S3: A = B

An analysis of this example reveals many dependencies. The data dependence graph is shown in Figure 2.1(a). Statement S1 produces the variable A that is used in statement S2 (flow dependence d1), statement S2 produces the variable B that is used in statement S3 (flow dependence d2), and the previous value of B was used in statement S1 (antidependence d3); both statements S1 and S3 produce variable A (output dependence d4); statement S3 produces variable A, previously used in statement S2 (antidependence d5).

The antidependencies and output dependencies can be eliminated at the cost of introducing new redundant variables. Some techniques for this elimination, proposed in [70], are variable renaming, scalar expansion and node splitting.


Figure 2.1: Data dependence graph

The following demonstrates variable renaming, where the re-occurrences of old variables are replaced with new variables. The program from the previous example does not change if statements S2 and S3 are replaced by S2' and S3' respectively:

S1:  A = B + C
S2': BB = A + E
S3': AA = BB

If this change is made, the antidependence and output dependence arcs are removed (see Figure 2.1(b)).

Example 2.2: consider the following loop program:


FOR I=1,20
S1: A(I) = X(I) - 3
S2: B(I+1) = A(I) * C(I+1)
S3: C(I+4) = B(I) + A(I+1)

The dependence graph is shown in Figure 2.2(a). There is a data dependence cycle S1,S2,S3, which indicates dependencies between loop iterations. However, one of the arcs in the cycle corresponds to an antidependence; if this arc is removed, the cycle will be broken. The antidependence relation can be removed from the cycle by node splitting:

FOR I=1,20
S0: AA(I) = A(I+1)
S1: A(I) = X(I) - 3
S2: B(I+1) = A(I) * C(I+1)
S3': C(I+4) = B(I) + AA(I)

The modified loop has the data dependence graph shown in Figure 2.2(b). The data dependence cycle S1,S2,S3 has been eliminated by splitting the node S3 in the DDG into S0 and S3'.


Figure 2.2: Data dependence graph

Statements S2 and S3' are connected in a data dependence cycle which cannot be removed. In principle a data dependence cycle (a cycle consisting exclusively of data flow dependencies) cannot be removed once all cycles containing at least one antidependence or output dependence have already been removed. It is possible to vectorize (to perform a single instruction on multiple data, in other words to make use of the data parallelism) the statements that are outside the cycle. Thus the previous loop may be partially vectorized, as follows:

S0: AA(1:20) = A(2:21)
S1: A(1:20) = X(1:20) - 3
FOR I=1,20
S2: B(I+1) = A(I) * C(I+1)
S3': C(I+4) = B(I) + AA(I)


That means: we have used 20 processors to run the instructions given by S0 and S1 on 20 different data. These processors do not perform any communication among them. Simultaneous execution of S0 and S1 is impossible due to the antidependence relation from S0 to S1.

Data dependence analysis done in a DDG, and removal of antidependencies and output dependencies, in fact lead to a simple directed graph plus information about index shrinking given in the dependence vector (see [65]). The dependence vectors indicate the number of iterations between the generated variables and the used variables. In Example 2.1 there are no indices, and the value of the dependence vectors is zero. In Example 2.2 there is one iteration index I and four dependencies, so the four entries of the dependence vector are given as d1 = I - I = 0, d2 = (I+1) - I = 1, d3 = (I+4) - (I+1) = 3 and d4 = (I+1) - I = 1.
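The small Python sketch below recomputes these dependence distances for Example 2.2. The tuple encoding of the index offsets is a purely illustrative assumption and does not correspond to any notation used later in this thesis.

# Dependence distances of Example 2.2: each entry stores the index offset of
# the reference that generates (or overwrites) a value and of the reference
# that previously uses it; the distance is their difference.
deps = {
    "d1": (0, 0),  # S1 writes A(I),   S2 reads A(I)
    "d2": (1, 0),  # S2 writes B(I+1), S3 reads B(I)
    "d3": (4, 1),  # S3 writes C(I+4), S2 reads C(I+1)
    "d4": (1, 0),  # S3 reads A(I+1) before S1 rewrites A(I) (antidependence)
}
for name, (gen, use) in deps.items():
    print(name, "=", gen - use)          # prints 0, 1, 3, 1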

Directed acyclic graph

A directed acyclic graph (DAG) is a directed graph that has no positive cycles, that is, no cycles consisting exclusively of directed paths (for additional terminology see 6.2.1).

Let G = (V, E) be a DAG where V is a set of vertices (corresponding to statements in a program) and the edges E represent data-flow dependencies. In particular, an arc (i, j) ∈ E indicates the fact that the operation corresponding to vertex j uses the results of the operation corresponding to vertex i. This implies that operation j must be performed after operation i. That is why these graphs are acyclic.

A comparison of DDGs, DAGs and PNs will be given further in Chapter 4.

2.2.2 Partitioning

When talking about the operations (vertices) in DAGs and DDGs it was assumed that an operation could be elementary (e.g., an arithmetic or a binary operation), or it could be a high-level operation like the execution of a subroutine.

Let us now introduce the term granularity - a measure of the quantity of processing performed by an individual process. In general we distinguish between fine-grain parallelism (army of ants approach) and coarse-grain parallelism (elephants approach).

The partitioning of a program specifies the nonparallelized units of the algorithm. Let us refer to these units as processes. There are some important process properties:
1) the process sequential execution time, which is a measure of the process size
2) the process inputs and outputs, which produce communication overhead
3) the synchronization requirements given by the precedence constraints (a wrong process schedule can lead to processor busy waiting)


Figure 2.3: Partitioning a computational graph: (a) fine grain; (b) coarse grain

Execution time is influenced by the process granularity determining the three factors mentioned above. The ideal execution time (computation without communication overhead) increases with the process size due to the loss of parallelism, while the communication overhead decreases with the process size. So, working with small granularity increases the parallelism but also increases the amount of communication, in addition to increasing software complexity. Partitioning should be designed in such a way that it provides a process size for which the effective execution time is minimized.

This minimization is not simple because two different partitionings lead to two different schedules with different precedence constraints. In addition, a continuous variation of process size is an oversimplification of the partitioning process. Real programs are discrete structures, and it may not be possible to partition a program into processes of equal size. Thus finding automatically the optimum partition for a real program is rather difficult.

A partitioning technique proposed by Sarkar [83] is to start with an initial fine-granularity partition and then iteratively merge some processes selected by heuristics until the coarsest partition is reached, as illustrated in Figure 2.3. For each iteration, compute a cost function and then select the partitioning that minimizes the cost function, which is a combination of two terms: the critical path and the communication overhead.
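A minimal Python sketch of this bottom-up idea follows (it is not Sarkar's actual algorithm): starting from the one-task-per-process partition, it repeatedly merges the pair of processes that most reduces a makespan estimate in which an edge pays its communication time only when it crosses process boundaries. The task graph, the execution and communication times, and the simple greedy rule are illustrative assumptions.

from itertools import combinations

# Illustrative task graph (assumed data): a -> b -> c is a chain with
# expensive communication, x is an independent piece of work.
exec_time = {"a": 2, "b": 2, "c": 2, "x": 6}
comm = {("a", "b"): 4, ("b", "c"): 4}
topo_order = ["a", "b", "c", "x"]        # any topological order of the DAG

def makespan(cluster_of):
    """Greedy estimate: every cluster gets its own processor; an edge costs
    communication time only if it crosses cluster boundaries."""
    avail, finish = {}, {}               # processor availability, task finish times
    for v in topo_order:
        ready = max((finish[u] + (0 if cluster_of[u] == cluster_of[v] else c)
                     for (u, w), c in comm.items() if w == v), default=0)
        start = max(avail.get(cluster_of[v], 0), ready)
        finish[v] = start + exec_time[v]
        avail[cluster_of[v]] = finish[v]
    return max(finish.values())

# Start from the finest partition and merge cluster pairs while it helps.
cluster_of = {v: i for i, v in enumerate(topo_order)}
while True:
    current = makespan(cluster_of)
    trials = []
    for p, q in combinations(set(cluster_of.values()), 2):
        merged = {v: (p if g == q else g) for v, g in cluster_of.items()}
        trials.append((makespan(merged), merged))
    if not trials:
        break
    best, merged = min(trials, key=lambda t: t[0])
    if best >= current:
        break
    cluster_of = merged

print(cluster_of, makespan(cluster_of))  # the chain is merged, x stays separate

On this toy instance the chain a, b, c ends up in one process while x keeps its own, which mirrors the trade-off of Figure 2.3: merging removes communication until the loss of parallelism starts to dominate.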

Another general conclusion is that fine-grain parallelism is found in tightly coupled systems (fast communication), and as hardware becomes increasingly loosely coupled (slow communication, big startup), the granularity of data and program increases.

2.2.3 Scheduling

There are many scheduling methods for quite different purposes, ranging from operating systems and parallel programming to manufacturing.

Scheduling is defined as a function that assigns processes to processors. The goals of the scheduling, or task allocation, function are to spread the load over all processors as evenly as possible (so-called load balancing) in order to obtain processor efficiency (decreasing busy waiting time) and to minimize data communication, which will lead to a shorter overall processing time. Allocation policies can be classified as static or dynamic.

Under the static allocation policy, tasks are assigned to processors before run time either by the programmer or by the compiler. In some parallel languages, the programmer can specify the processor on which a task is to be performed, the communication channel used, and so on. There is no run-time overhead, and allocation overhead is incurred only once even when the programs are run many times with different data.

Under the dynamic allocation policy, tasks are assigned to processors at run time. This scheme offers better utilization of processors, but at the price of additional allocation time.

In addition, the scheduling policies can be divided into preemptive and nonpreemptive. In a preemptive environment, tasks may be halted before completion by another task that requires service. This method requires a task to be interruptible, which is not always possible. In general, preemptive techniques can generate more efficient schedules than those that are nonpreemptive. However, a penalty is also paid in the preemptive case. This penalty lies in the overhead of task switching, which includes discontinuous processing and the additional memory required to save the processor state.

For example, one scheduling method coming from real-time control applications is the use of deadlines, or scheduled completion times established for individual processes. If there is some time associated with the completion of individual tasks, and this time is bounded, it is called a hard deadline or a hard real-time schedule.

When programming MIMD machines, the scheduling is usually based on the DAG model of the problem and the execution time of each process. Before presenting a short survey on static scheduling of nonpreemptive tasks on identical processors in Section 2.2.4, we first give a simple introductory example.

Example 2.3: consider the program having the dependence graph shown in Figure 2.4 and adjacency matrix A (the adjacency matrix A of a directed graph G(V, E) is the | V | × | V | matrix having element A(i, j) = 1 when there exists a directed edge from vertex i to vertex j).

A =
        1  2  3  4  5  6  7  8
    1   0  1  1  0  0  0  0  0
    2   0  0  0  1  1  0  0  0
    3   0  0  0  0  0  1  0  1
    4   0  0  0  0  0  0  1  0
    5   0  0  0  0  0  1  0  0
    6   0  0  0  0  0  0  1  0
    7   0  0  0  0  0  0  0  1
    8   0  0  0  0  0  0  0  0


Figure 2.4: Directed acyclic graph

For simplicity it is assumed that all processes, represented by the vertices of the graph, have the same execution time.

We wish to find the feasible execution time intervals for each process. First we determine the earliest execution time for each process by the following procedure:

repeat the following actions until the adjacency matrix disappears:
1) identify the processes whose columns contain only zeros
2) put these processes into separate sets
3) eliminate the rows and columns corresponding to these processes

The following sets of nodes are obtained for the earliest execution time: {1}, {2,3}, {4,5}, {6}, {7}, {8}

Then we find the sets of the latest processing time using the same procedure but with the transposed matrix A: {8}, {7}, {6,4}, {5,3}, {2}, {1}

The time interval in which each task may be scheduled without delaying the overall execution starts at the earliest execution time and ends at the latest processing time, as shown in Table 2.1. Concrete schedules would be found with respect to the feasible execution times and other constraints, like network topology, communication requirements etc.

Process   Earliest time   Latest time   Permissible time
1         1               1             1
2         2               2             2
3         2               3             2,3
4         3               4             3,4
5         3               3             3
6         4               4             4
7         5               5             5
8         6               6             6

Table 2.1: Earliest, latest, and feasible task execution time
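A short Python sketch of this procedure is given below; it assumes the adjacency matrix A above and unit execution times, and uses 1-based vertex names only for readability.

def levels(adj):
    """Sets of vertices level by level: a vertex whose column contains only
    zeros (restricted to the not yet eliminated rows) has no unfinished
    predecessor; its row and column are then eliminated.
    (Assumes the graph is acyclic.)"""
    remaining = set(range(len(adj)))
    result = []
    while remaining:
        level = {j for j in remaining
                 if all(adj[i][j] == 0 for i in remaining)}
        result.append({j + 1 for j in level})        # 1-based vertex names
        remaining -= level
    return result

A = [[0, 1, 1, 0, 0, 0, 0, 0],
     [0, 0, 0, 1, 1, 0, 0, 0],
     [0, 0, 0, 0, 0, 1, 0, 1],
     [0, 0, 0, 0, 0, 0, 1, 0],
     [0, 0, 0, 0, 0, 1, 0, 0],
     [0, 0, 0, 0, 0, 0, 1, 0],
     [0, 0, 0, 0, 0, 0, 0, 1],
     [0, 0, 0, 0, 0, 0, 0, 0]]

earliest = levels(A)                        # [{1}, {2, 3}, {4, 5}, {6}, {7}, {8}]
transposed = [list(row) for row in zip(*A)]
latest = levels(transposed)                 # latest-time levels: {8}, {7}, {6,4}, {5,3}, {2}, {1}
print(earliest, latest)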

2.2.4 Survey on static scheduling of nonpreemptive tasks on identical processors

Computational complexity theory

The theory of NP-completeness proves that a wide class of decision problems are equivalently hard, in the sense that if there is a polynomial-time algorithm for one NP-complete problem, then there is a polynomial-time algorithm for any NP-complete problem. When talking about an optimisation problem we will use the term 'NP-hard', and when talking about a decision problem (a problem solved by a simple answer YES or NO) we will use the term 'NP-complete'.

For some problems, for any arbitrarily small error tolerance, there exists a polynomial-time approximation algorithm that delivers a solution within that tolerance. We shall say that algorithm A is a ρ-approximation algorithm if, for each instance I, A(I) gives a solution within a factor of ρ of the optimal value, where ρ > 1.

The main technique for proving that certain approximation algorithms are unlikely to exist is based on proving an NP-completeness result for the related decision problem. The following theorem shows that there is a minimum bound on ρ for a certain class of optimisation problems.

Theorem 2.1 Consider a combinatorial optimization problem for which all feasible solutions have a non-negative integer objective function value. Let c be a fixed positive integer. Suppose that the problem of deciding if there exists a feasible solution of value at most c is NP-complete. Then for any ρ < (c + 1)/c, there does not exist a polynomial-time ρ-approximation algorithm A unless P=NP.


Proof: see page 4 in [16].

As stated by Gerasoulis and Yang [35], the objective of scheduling is to allocate tasks onto the processors and then order their execution so that every task dependence is satisfied and the length of the schedule, known as the parallel time, is minimized.

The following text introduces the main results obtained so far in the field of scheduling nonpreemptive tasks on identical processors.

Independent tasks, no communication, limited number of processors, no duplication

Consider the problem of scheduling n independent tasks T1, ..., Tn on p identical processors (machines) M1, ..., Mp. Each task Tj, where j = 1, ..., n, is to be processed by exactly one processor, and requires processing time tj. We wish to find a schedule of minimum length. Even if we restrict attention to the special case of this problem in which there are two processors (p = 2), the problem of computing a minimum length schedule is NP-hard.

Graham [36] analyzed a simple approximation algorithm, called "list scheduling", that finds a good schedule for this multiprocessor scheduling problem. He showed that if we list the tasks in any order, and whenever a processor becomes idle the next task from the list is assigned to it, then the length of the schedule produced is at most twice the optimum; in other words, it is a 2-approximation algorithm. Graham later refined this result, and showed that if the tasks are first sorted in order of nonincreasing processing times, then the length of the schedule produced is at most 4/3 times the optimum. Subsequently, a number of polynomial-time algorithms with improved performance guarantees were proposed.
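The longest-processing-time variant of list scheduling is short enough to sketch in Python; the code below and its seven-task instance are illustrative assumptions, not material from this thesis.

import heapq

def lpt_schedule(times, p):
    """Longest-processing-time rule: sort the tasks by nonincreasing
    processing time and always give the next task to the processor that
    becomes idle first."""
    machines = [(0, m) for m in range(p)]      # (current load, machine id)
    heapq.heapify(machines)
    assignment = {}
    for task, t in sorted(times.items(), key=lambda kv: -kv[1]):
        load, m = heapq.heappop(machines)
        assignment[task] = m
        heapq.heappush(machines, (load + t, m))
    makespan = max(load for load, _ in machines)
    return assignment, makespan

times = {"T1": 5, "T2": 4, "T3": 3, "T4": 3, "T5": 2, "T6": 2, "T7": 1}
print(lpt_schedule(times, 3))                  # makespan 7 on three processors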

Tasks with precedence constraints, no communication, limited number of processors, no duplication

Even if we introduce precedence relations among the tasks T1, ..., Tn (given e.g. by a DAG), it is possible to find a good approximation algorithm. Graham [36] showed that the following algorithm is a 2-approximation one: the tasks are listed in any order that is consistent with the precedence constraints, and whenever a processor becomes idle, the next task on the list with all of its predecessors completed is assigned to that processor; if no such task exists, then the processor is left idle until the next processor completes a task.
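The sketch below illustrates the same greedy idea with precedence constraints. It is a simplification of Graham's procedure (the chosen processor may wait until the predecessors of its task have finished) and the task data are assumed, so it should be read as an illustration rather than as the exact 2-approximation algorithm.

def list_schedule(times, preds, p):
    """Repeatedly pick a task whose predecessors are already scheduled and
    place it on the processor that becomes free first."""
    finish, free = {}, [0] * p
    schedule = []
    remaining = set(times)
    while remaining:
        ready = [t for t in remaining
                 if all(u in finish for u in preds.get(t, []))]
        task = min(ready)                           # any consistent order will do
        m = min(range(p), key=lambda i: free[i])
        start = max(free[m],
                    max((finish[u] for u in preds.get(task, [])), default=0))
        finish[task] = start + times[task]
        free[m] = finish[task]
        schedule.append((task, m, start))
        remaining.remove(task)
    return schedule, max(finish.values())

times = {"T1": 2, "T2": 3, "T3": 1, "T4": 2, "T5": 2}
preds = {"T3": ["T1"], "T4": ["T1", "T2"], "T5": ["T3", "T4"]}
print(list_schedule(times, preds, 2))               # makespan 7 on two processors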

Lenstra and Rinnooy Kan [59] showed that, even if each task Tj requires processing time tj = 1, deciding if there is a schedule of length 3 is NP-complete (they showed that the NP-complete clique problem [21] can be reduced to this scheduling problem). Thus, with respect to Theorem 2.1, for any ρ < 4/3, there does not exist a polynomial-time ρ-approximation algorithm, unless P=NP.


Tasks with precedence constraints, communication delays, limited number of processors, no duplication

Now consider the same scheduling problem with the following stronger precedence constraint: let Tj be a predecessor of Tk; if Tj and Tk are processed on different processors, then not only must Tk be processed after Tj completes, but it must be processed at least cjk time units afterwards. The special case in which each tj = 1 and each cjk = 1 was shown to be NP-complete by Hoogeveen et al. [47]; more precisely, they showed that deciding if there is a schedule of length at most 4 is NP-complete. Consequently, for any ρ < 5/4, no polynomial-time ρ-approximation algorithm exists for this problem unless P=NP.

Sarkar [83] has proposed a good approximation algorithm for scheduling with communication, consisting of two steps:
1) Schedule the tasks on an unbounded number of processors of a completely connected architecture. The result of this step will be clusters of tasks, with the limitation that all tasks in a cluster must be executed on the same processor.
2) If the number of clusters is larger than the number of processors, then merge the clusters down to the number of processors, and also incorporate the network topology in the merging step. The processor assignment part is also known as clustering.

Using the same basis, a more efficient algorithm, called DSC (dominant sequence clustering), has been developed and analyzed by Gerasoulis and Yang [34] [35]. They introduce two types of clustering, nonlinear and linear. A clustering is nonlinear if two parallel tasks are mapped in the same cluster, otherwise it is linear. Linear clustering fully exploits the natural parallelism of a given DAG, while nonlinear clustering sequentialises independent tasks to reduce parallelism.

Tasks with precedence constraints, communication delays, unlimited number of processors, no duplication

Hoogeveen et al. [47] also considered the model, including precedence constraints with communication delays as above, but there is no limit on the number of identical processors that may be used. For the special case in which each tj = 1 and each cjk = 1 they gave a polynomial-time algorithm to decide if there is a schedule of length 5, and yet deciding if there is a schedule of length 6 is NP-complete. Hence, for any ρ < 7/6, no polynomial-time ρ-approximation algorithm exists for this problem unless P=NP.

Despite these complexity results it is possible to find polynomial algorithms when a restriction on the precedence constraints is imposed (e.g. a fork-join structure) or a limitation of the communication time is assumed. The SCT (small communication time) assumption means that the largest communication time is less than or equal to the smallest processing time. Such a situation occurs in application programs involving tasks with a large granularity. The SCT assumption is sometimes called the coarse-grain assumption (see the granularity theory - chapter 6 in [16] or paragraph 2.2.2 in this thesis).


Chretienne [15] has developed an O(n) algorithm solving the special case when the SCT assumption is satisfied and the precedence constraints are given as an in-tree (a tree consisting of directed paths going from the leaves to the root) or an out-tree (a tree consisting of directed paths going from the root to the leaves). For additional terminology see paragraph 6.2.1. Valdes et al. [90] give a polynomial algorithm for fork-join graphs when the SCT assumption is satisfied.

For a general case (when no assumption on the structure of the precedence constraints and no SCT assumption are satisfied) a good approximation algorithm using task clustering was proposed by Sarkar [83] and Gerasoulis and Yang [34] [35].

Tasks with precedence constraints, communication delays, unlimited number of processors, duplication

When duplication is not allowed, each task must be processed only once, so a schedule is entirely defined by assigning to each task Tj a starting time and a processor. When an unlimited number of processors is assumed and duplication is allowed, it can be faster to process the same task on several processors and eliminate communication delays.


Figure 2.5: An instance of a scheduling problem

A polynomial algorithm developed by Colin and Chretienne [18] provides an optimal schedule when the SCT assumption is satisfied. An instance of this problem is shown in Figure 2.5 and will serve to illustrate how the algorithm works. The set of the immediate predecessors (respectively successors) of Tj is denoted by IN(j) (respectively OUT(j)). First a topological order of the tasks is used to compute bi, the release time of each task Ti:

if Ti has no predecessor (IN(i) = ∅), then bi = 0;

if Ti has exactly one predecessor Ts (IN(i) = {s}), then bi = bs + ts;

if Ti has more predecessors (the index s is used for a special predecessor),


then

    b_i = \max\{\, b_s + t_s,\ \max_{T_k \in IN(i) \setminus \{s\}} ( b_k + t_k + c_{ki} ) \,\}

where s is the index of a special predecessor task Ts, i.e. the predecessor which satisfies b_s + t_s + c_{si} = \max_{T_k \in IN(i)} ( b_k + t_k + c_{ki} ). The release time bi is found in topological order for all tasks (indicated by the numbers above the vertices in Figure 2.6).

For example, when calculating the release time b7 of task T7, the release times of its predecessors are already known (b3 = 4 and b5 = 3) because of the topological order. Then T3 becomes the special predecessor (s = 3) because b3 + t3 + c37 > b5 + t5 + c57. Finally we calculate b7 = max{b3 + t3, b5 + t5 + c57}.

An arc (Ti, Tj) of the precedence graph is said to be critical if bi + ti + cij > bj. By removing all noncritical arcs from the precedence graph we get the so-called critical subgraph (indicated by thick lines in Figure 2.6), in which each task has either no input arc (a case where a special predecessor can be chosen among several possible ones) or exactly one input arc. So the critical subgraph is a spanning outforest.

Figure 2.6: An instance of a scheduling problem with the critical subgraph

The optimal schedule is finally built by assigning one processor to each critical path (i.e. a path from a root to a leaf in the critical subgraph) and by processing all the corresponding copies at their release times. The optimal schedule is shown in Figure 2.7.

The algorithm provides an earliest schedule, but, as shown by Figure 2.7, it does not necessarily minimize the number of processors. The example shows that the tasks T6 and T9 could be assigned to processor M2.

However, it has been shown by Picouleau [78] that minimizing the number of processors is an NP-hard problem when the minimum makespan must be guaranteed. The above algorithm also works if for any Tj the largest communication time of an ingoing arc of Tj is at most the smallest processing time in IN(j) - a weaker assumption than SCT.

Figure 2.7: Earliest schedule for the instance from Figure 2.5 (M1: T1, T3, T7; M2: T2, T5; M3: T1, T4, T8; M4: T6, T9)
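A compact Python sketch of the release-time computation and the critical-arc test described above follows; the three-task instance is an assumed toy example, not the instance of Figure 2.5.

def release_times(tasks, t, c, preds):
    """b[i]: earliest start of Ti when its special predecessor Ts runs on the
    same processor and every other predecessor pays its communication delay."""
    b, special = {}, {}
    for i in tasks:                                   # tasks in topological order
        if not preds[i]:
            b[i], special[i] = 0, None
            continue
        s = max(preds[i], key=lambda k: b[k] + t[k] + c[(k, i)])
        others = [b[k] + t[k] + c[(k, i)] for k in preds[i] if k != s]
        b[i] = max([b[s] + t[s]] + others)
        special[i] = s
    return b, special

def critical_arcs(tasks, t, c, preds, b):
    """An arc (Ti, Tj) is critical if b_i + t_i + c_ij > b_j."""
    return [(i, j) for j in tasks for i in preds[j]
            if b[i] + t[i] + c[(i, j)] > b[j]]

tasks = ["T1", "T2", "T3"]                            # already in topological order
t = {"T1": 1, "T2": 1, "T3": 1}
preds = {"T1": [], "T2": [], "T3": ["T1", "T2"]}
c = {("T1", "T3"): 2, ("T2", "T3"): 1}
b, special = release_times(tasks, t, c, preds)
print(b, special, critical_arcs(tasks, t, c, preds, b))
# T1 is the special predecessor of T3 and the arc (T1, T3) is critical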

Let us consider the case when the communication times may be larger than the processing times. Papadimitriou and Yannakakis [72] have shown that the special case tj = 1, cij > 1 is NP-hard and have proposed a sophisticated polynomial 2-approximation algorithm.

2.3 Time complexity measures

It is assumed now that a particular model of parallel computation has been chosen and a schedule of the algorithm has been produced. Let us consider a computational problem parameterized by a variable n representing the problem size. Time complexity is generally dependent on n. A few concepts are described that are sometimes useful when comparing sequential and parallel algorithms. Suppose a parallel algorithm using p processors, terminating in time Tpar. Let Tseq be the optimal processing time of the sequential algorithm running on one processor.

The ratio

    S(n, p) = \frac{T_{seq}(n)}{T_{par}(n, p)}        (2.1)

is called the speedup S of the algorithm, and describes the speed advantage of the parallel algorithm, compared to the best possible sequential algorithm.

The ratio

    E(n, p) = \frac{S(n, p)}{p} = \frac{T_{seq}(n)}{p \, T_{par}(n, p)}        (2.2)

is called the efficiency E of the algorithm, and measures the fraction of time during which a typical processor is usefully running.

Ideally, S(n, p) = p and E(n, p) = 1, in which case the availability of p processors allows the computation to be sped up by a factor of p. For this to occur, the parallel algorithm should be such that no processor ever remains idle or communicates. This ideal situation is practically unattainable. A more realistic objective is to aim at an efficiency that stays bounded away from zero as n and p increase.


It is obvious that S(n, p) ≤ p. The proof can be done easily by contradiction: if S(n, p) > p then it is profitable to run the parallel algorithm on a network of p virtual processors mapped to one node processor, where the p virtual processors share the time of the node processor. Such a static schedule of the parallel algorithm could be used as a new sequential algorithm with a processing time shorter than Tseq. This is a contradiction because Tseq is the processing time of the best possible sequential algorithm. Q.E.D. In practice the situation is even worse (communication overhead, scheduler overhead, etc.).

Another fundamental issue is whether the maximum attainable speedup, given by T_seq(n)/T_par(n, ∞), can be made arbitrarily large as n is increased. In certain applications, the required computations are quite unstructured, and there has been considerable debate on the range of achievable speedups in real world situations. The main difficulty is that some programs have sections that are easily parallelizable, but also sections that are inherently sequential. When a large number of processors is available, the parallelizable sections are quickly executed, but the sequential sections lead to bottlenecks, because just one processor is working and the other processors perform so-called busy waiting. This observation is known as Amdahl's law and can be quantified as follows: if a program consists of two sections, one that is inherently sequential and one that is fully parallelizable, and if the inherently sequential section consumes a fraction f of the total computation, then the speedup is limited by:

S(n, p) \le \lim_{p \to \infty} \frac{1}{f + (1 - f)/p} = \frac{1}{f}    (2.3)

On the other hand, there are numerous computational problems for which f decreases to zero as the problem size n increases, and Amdahl's law is then no longer a concern.
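The bound (2.3) is easy to evaluate numerically. The following minimal Python sketch (the fraction f = 0.05 is an arbitrary illustrative value, not taken from the text) shows how quickly the bound saturates at 1/f:

def amdahl_speedup(f, p):
    # upper bound on speedup when a fraction f of the work is inherently sequential (Eq. 2.3)
    return 1.0 / (f + (1.0 - f) / p)

# With f = 5 % sequential work, no number of processors can give more than a 20x speedup.
for p in (4, 16, 256, 10**9):
    print(p, round(amdahl_speedup(0.05, p), 2))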

2.4 Communication

In many parallel and distributed algorithms and systems the time spent for interprocessor communication is a sizable fraction of the total time needed to solve a problem. In this case the term communication/computation ratio is often applied in the literature and it can be given as:

\frac{T_{par} - T_{inst}}{T_{inst}}    (2.4)

where T_par is the time required by the parallel algorithm to solve the given problem, and T_inst is the corresponding time that can be attributed just to computation and busy waiting, that is, the time that would be required if all communication were instantaneous. This section is devoted to a discussion of a number of factors affecting the communication penalty.


2.4.1 Communication model

To solve different communication problems we need to specify the basic terminology used to model real communication.

When the source and destination processors are not directly connected, a packet must travel over a route involving several processors. There are several possibilities for routing the packet:

1) a store-and-forward (sometimes called message-switched) packet switching data communication model, in which a packet may have to wait at any node for some communication resource to become available before it gets transmitted to the next node. In some systems, it is possible that packets are divided and recombined at intermediate nodes on their routes, but this possibility will not be considered in this thesis.

2) circuit switching, where the communication resources needed for packet transfers are reserved via some mechanism before the packet's transfer begins. As a result the packet does not have to wait at any point along its route. Circuit switching is almost universally employed in telephone networks, but is seldom used in data networks or in parallel and distributed computing systems. It will not be considered further in this thesis.

In the case of two processors p1 and p2 directly connected by one bidirectional communication link, packets can be transmitted in both directions, but the two following cases can appear:

1) a packet can be communicated between the two processors in just one direction at a given time, either from p1 to p2, or from p2 to p1. This kind of communication, called half-duplex, occurs for example in radio communication on one frequency.

2) two packets can be communicated in opposite directions simultaneously, from p1 to p2 and from p2 to p1. This mode, called full-duplex, is used for example in a normal telephone conversation (except when you are talking to your wife).

In addition it is necessary to specify the memory/communication link interface for each processor:

1) when each processor can use just one link at a time, the communication is called 1-port (sometimes processor-bound or whispering)

2) on the other side, when the processor can use all its links simultaneously, the communication bound is ∆-port (sometimes link-bound or shouting)

3) between the two frequent cases there is the communication bound where k links (where k is less than the total number of processor links) can be used simultaneously. This bound is called k-port and will not be considered in this thesis.

The following notation will be used: F1 and H1 indicate full-duplex and half-duplex 1-port models; F∗ and H∗ indicate full-duplex and half-duplex ∆-port models.

Finally, it is necessary to specify the communication between two connected processors.


This communication is influenced by L, the length of the message. That is why the most general formulation of the communication time between two neighbours is the sum of:

1) the start-up β, corresponding to a register/memory initialisation and sending/receiving procedures

2) the propagation time Lτ, proportional to the message length L and to the propagation time τ of a unit-length message (the link bandwidth is 1/τ)

Such a model is called linear time and it is given by the following equation:

T_{neighbour-to-neighbour} = β + Lτ    (2.5)

In practical applications there is very often a case where β ≫ τ. In such cases it is sufficient for the theoretical analysis to use a model called constant time, where every communication between two neighbours costs one time unit. In this model the original messages can be recombined or split into separate parts without any influence on the unit communication time of the new messages.

T_{neighbour-to-neighbour} = 1    (2.6)

2.4.2 Network topologies

In systems whose principal function is numerical computation, the network typically exhibits some regularity, and is sometimes chosen with a particular application in mind. Some example network topologies are presented in this section, and we focus on their communication properties.

Topologies are usually evaluated in terms of their suitability for some standard communication tasks (see Section 2.4.3). The following are two typical criteria:

(a) The diameter r of the network is the maximum distance between any pair of nodes. Here the distance of a pair of nodes is the minimum number of links that have to be traversed to go from one node to the other. For a network of diameter r, the time for a packet to travel between two nodes is O(r), assuming no queueing delays at the links (a small sketch computing the diameter is given after item (b)).

(b) The connectivity of the network provides a measure of the number of independent paths connecting a pair of nodes. We can talk here about the node or the arc connectivity of the network, which is the minimum number of nodes (or arcs, respectively) that must be deleted before the network becomes disconnected. In some networks, a high connectivity is desirable for reliability purposes, so that communication can be maintained even if several link and node failures occur.
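As an illustration of criterion (a), the following minimal Python sketch computes the diameter of a network given as an adjacency list by running one breadth-first search per node; the ring used in the example is a hypothetical instance chosen only for illustration.

from collections import deque

def diameter(adj):
    # adj maps each node to the list of its neighbours
    def eccentricity(src):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return max(dist.values())
    return max(eccentricity(v) for v in adj)

# Ring of p = 6 processors: the diameter is floor(p/2) = 3.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(diameter(ring))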

Some specific topologies are considered:

1) Linear processor array

Here there are p processors numbered 1, 2, . . . , p, and there is a link (i, i+1) for every pair of successive processors (see Figure 2.8). The diameter and connectivity properties of this topology are the worst possible.


[Diagrams of the linear array, ring, mesh and mesh with wraparound (torus) topologies.]

Figure 2.8: Some specific topologies

2) Ring

This is a topology having the property that there is a path between any pair of processors even after any one communication link has failed. The number of links separating a pair of processors can be as large as ⌈(p − 1)/2⌉.

3) Tree

A tree network with p processors provides communication among all processors with a minimal number of links (p − 1). One disadvantage of a tree is its low connectivity; the failure of any one of its links creates two subsets of processors that cannot communicate with each other. Furthermore, depending on the particular type of tree used, its diameter can be as large as p − 1 (note that the linear array is a special case of a tree). The star network has minimal diameter among tree topologies; however, the central node of the star handles all the network traffic and can become a bottleneck.

4) Mesh

In a d-dimensional mesh the processors are arranged along the points of a d-dimensional space that have integer coordinates, and there is a direct communication link between nearest neighbors. In fact this is a d-dimensional version of the linear array.

The mesh with wraparound (torus) has, in addition to the links of the ordinary mesh, links between the first and the last processor in each dimension. The mesh with wraparound is in fact a d-dimensional version of the ring.

5) Hypercube

A hypercube consists of 2^d processors, consecutively numbered with binary integers using a string of d bits. Each processor is connected to every other processor whose binary pid (processor identity number) differs from its own by exactly one bit.



Figure 2.9: Hypercube interconnection

This connection scheme places the processors at the vertices of a d-dimensional cube. Formally, the d-dimensional cube is the d-dimensional mesh that has two processors in each dimension. Hypercube interconnection networks for d varying from 0 to 4 are shown in Figure 2.9. The hypercube has the property that it can be defined recursively. A hypercube of order 0 has a single node, and the hypercube of order d + 1 is constructed by taking two hypercubes of order d and connecting their respective nodes. This interconnection has several properties of great importance for parallel processing, such as these:

a. As the number of processors increases, the number of connection wires and related hardware (such as ports) increases only logarithmically, so that systems with a very large number of processors become feasible. It follows that the diameter of a d-cube is d, or log p, where p = 2^d is the number of processors.

b. A hypercube is a superset of other interconnection networks such as arrays, rings, trees, and others, because these can be embedded into a hypercube by ignoring some hypercube connections.

c. Hypercubes are scalable, a property that results directly from the fact that hypercube interconnections can be defined recursively.

d. Hypercubes have simple routing schemes. A message-routing policy may be to send a message to a neighbor whose binary pid agrees with the pid of the final destination in more bits. The path length for sending a message between any two nodes is exactly the number of bits in which their pids differ. The maximum is of course d, and the average is d/2.
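A minimal Python sketch of such a routing scheme follows; it always flips the lowest differing bit, which is only one of the possible choices allowed by the policy above (this particular tie-breaking rule is an assumption made for the example, not something prescribed by the text).

def next_hop(src, dst):
    # flip the lowest bit in which the two pids differ
    diff = src ^ dst
    if diff == 0:
        return src                    # already at the destination
    return src ^ (diff & -diff)

def route(src, dst):
    # the path length equals the Hamming distance of the two pids
    path = [src]
    while path[-1] != dst:
        path.append(next_hop(path[-1], dst))
    return path

print(route(0b000, 0b101))            # [0, 1, 5]: two hops in a 3-cube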


[Diagram of the hierarchy: total exchange (pATA); gossiping (ATA) and multinode accumulation (ATA); scattering (pOTA) and gathering (pATO); broadcasting (OTA) and single node accumulation (ATO); point to point (OTO).]

Figure 2.10: Hierarchy and duality of the basic communication problems

2.4.3 Global communication primitives

Communication delays required by some standard tasks are typical in many algorithms performing regular numeric computations (see Chapter 3). The tasks described in the rest of this section will be called global communications or global communication primitives in the rest of this thesis.

1) Broadcasting (single node broadcast, One to All, diffusion)

The same packet is sent from a given processor to every other processor. To solve the single node broadcast problem, it is sufficient to transmit the packet along a spanning tree rooted at the given node, that is, a spanning tree of the network together with a direction on each link of the tree such that there is a unique positive path from the given node (called the root) to every other node. With an optimal choice of such a spanning tree, a single node broadcast takes O(r) time, where r is the diameter of the network, as shown in Figure 2.12(a). Radio broadcasting is an example of this global communication primitive.

2) Gossiping (multinode broadcast, All to All)

Gossiping is a generalized version of the single node broadcast, where a single node broadcast is performed simultaneously from all nodes. To solve the multinode broadcast problem, one spanning tree per node must be specified. The difficulty here is that some links may belong to several spanning trees; this complicates the timing analysis, because several packets can arrive simultaneously at a node and require transmission on the same link, which results in a queueing delay.


Let us imagine an example arising from everyday life: there are x women, and each of them knows a part of a story which is not known by the other women. They communicate by phone and exchange all details of the story. How many calls are necessary so that all of them know the whole story?

3) Single node accumulation (All to One)

A packet is sent to a given node from every other node. It is assumed that packets can be "combined" for transmission on any communication link, with a "combined" transmission time equal to the transmission time of a single packet. This problem arises, for example, when it is desired to form at a given node a sum consisting of one term from each node, as in an inner product calculation (see Fig. 2.12(b)). Addition of scalars at a node can be viewed as "combining" the corresponding packets into a single packet of the same length.

4) Multinode accumulation (All to All)

This involves a separate single node accumulation at each node. For example, a certain method for carrying out parallel matrix-vector multiplication involves a multinode accumulation (see the example given in [7]).

5) Scattering (personalized One to All, distribution)

This problem involves sending a separate packet from a single node to every other node. It usually appears in the initialisation phase of a parallel algorithm, when data are distributed over the processor network.

6) Total exchange (personalized All to All, complete exchange, multi-scattering)

A packet is sent from every node to every other node (here a node sends different packets to different nodes, in contrast with the multinode broadcast problem where each node sends the same packet to every other node). This problem arises frequently in connection with matrix computations.

7) Gathering (personalized All to One)

Gathering involves collecting a packet at a given node from every other node. This global communication occurs for example when data are collected from distributed sensors. (A sketch mapping these primitives onto standard MPI collectives follows.)

Hierarchy

Note that the total exchange problem may be viewed as a multinode version of both a single node scatter and a single node gather problem, and also as a generalization of gossiping, where the packets sent by each node to different nodes are different.

The communication problems form a hierarchy in terms of difficulty, as illustrated in Figure 2.10. A directed arc from problem A to problem B indicates that an algorithm that solves A can also solve B simply by omitting certain algorithm parts. Figure 2.10 is an improved version of the similar one given by [7], where the author omitted the directed arcs from gossiping to broadcasting and from multinode accumulation to single node accumulation.


[Panels: gossiping; gathering to node 1 on a ring of four nodes. A label of the form t|m denotes transmission of message m at time t (e.g. the message from node 2 is transmitted in time 1).]

Figure 2.11: Hierarchy example

The point-to-point communication is added here to complete the logical picture, but in fact it is a global communication.

In particular, a total exchange algorithm can also solve the multinode broadcast (accumulation) problem; a multinode broadcast (accumulation) algorithm can also solve the single node gather (scatter) problem; and a single node scatter (gather) algorithm can also solve the single node broadcast (accumulation) problem.

Figure 2.11 illustrates the hierarchy relation between gossiping and gathering on a given instance of four nodes arranged in an oriented ring.

Duality

It can be shown that a single node accumulation problem can be solved in the same time as a broadcast problem. Moreover, any single node accumulation algorithm can be viewed as a broadcast algorithm running in reverse time; the converse is also true. As shown in Figure 2.12, the broadcast uses a tree that is rooted at a given node (which is node 1 in the figure). The time next to each link is the time at which transmission of the packet on the link begins. The single node accumulation problem involving summation of n scalars a_1, . . . , a_n (one per processor) at the given node (which is node 1 in the figure) takes 3 time units. So the single node accumulation and the broadcasting take the same amount of time if a single packet in the latter problem corresponds to packets combined in the former problem.

This relation, called duality (indicated in Figure 2.10 by horizontal bidirectional arcs), is observed between some communication problems in the sense that the spanning tree(s) used to solve one problem can also be used to solve the dual in the same amount of communication time.

It is important to notice that the above mentioned abstractions (definition of the communication primitives, hierarchy and duality) are made regardless of the network being used.


[Panels: (a) broadcasting from node 1 along a spanning tree of 8 nodes; (b) single node accumulation towards node 1 along the same tree. A label of the form t|m denotes transmission of the (possibly combined) message m at time t.]

Figure 2.12: Duality example

Example 2.4:

Let us now calculate the communication time for the scattering (gathering) and the gossiping (multinode accumulation) implemented on different topologies (1-port and unidirectional).

I. Ring (see Figure 2.8)

I.i. The scattering algorithm could be designed in OCCAM as follows (code for the i-th processor):

PROC scattering (VAL INT i, CHAN OF [L]REAL64 in, out)
  SEQ
    IF
      i = 1
        -- the first processor sends p-1 blocks of L values, farthest block first
        [p*L]REAL64 data:
        SEQ k = 0 FOR (p-1)
          out ! [data FROM (((p-k)-2)*L)+1 FOR L]
      TRUE
        -- every other processor forwards p-i blocks and keeps the last one received
        [L]REAL64 data.for.me, data.for.others:
        SEQ
          SEQ k = 0 FOR (p-i)
            SEQ
              in ? data.for.others
              out ! data.for.others
          in ? data.for.me
:

Assuming L to be the message length, the solution time for scattering is given by:

s^{ring}_{H1} = (2(p − 2) + 1) × (β + Lτ) = (2p − 3) × (β + Lτ)    (2.7)

I.ii. Supposing p to be an even value, a gossiping algorithm could be designed in OCCAM as follows (code for the i-th processor).


PROC gossiping (VAL INT i, CHAN OF [L]REAL64 in, out)
  [p][L]REAL64 data:
  SEQ
    IF
      (i REM 2) = 0
        -- even processors send first, then receive
        SEQ k = 0 FOR (p-1)
          SEQ
            out ! data[((p-k)+i) REM p]
            in ? data[((p-k)+(i-1)) REM p]
      (i REM 2) = 1
        -- odd processors receive first, then send
        SEQ k = 0 FOR (p-1)
          SEQ
            in ? data[((p-k)+(i-1)) REM p]
            out ! data[((p-k)+i) REM p]
:

Then the solution time for gossiping is given by:

g^{ring}_{H1} = 2(p − 1) × (β + Lτ) = (2p − 2) × (β + Lτ)    (2.8)

II. Torus (see Figure 2.8)

II.i. The algorithm scattering messages from node processor 1 can be designed, for example, in the following two phases:
1) scattering in the upper horizontal ring
2) scattering in all vertical rings

s^{torus}_{H1} = (2\sqrt{p} − 3) × (β + \sqrt{p} Lτ) + (2\sqrt{p} − 3) × (β + Lτ)    (2.9)

II.ii. The algorithm performing the gossiping in the torus topology can work, for example, in these two phases (\sqrt{p} is assumed to be an even value):
1) gossiping in all horizontal rings
2) gossiping in all vertical rings

g^{torus}_{H1} = (2\sqrt{p} − 2) × (β + Lτ) + (2\sqrt{p} − 2) × (β + \sqrt{p} Lτ)    (2.10)

2.5 Conclusions

This chapter is a basic one, introducing the terminology and some elementary methods from the field of parallel processing. Being a summary of contemporary literature, it contains just a few original ideas.


Some of the chapter's most distinctive features are:

• It covers the majority of the important topics in parallel processing.

• It systematically classifies the large and redundant terminology of parallel processing needed for comprehensive reading of the rest of this thesis.

• It introduces the concept of dependencies and simple techniques for parallelism analysis in sequential algorithms.

• It summarizes the results and the references of static scheduling of nonpreemptive tasks on identical processors.

• It presents global communication primitives, which are important in many algorithms performing regular numeric computations.

There are several books covering various aspects of parallel processing. Among the basic textbooks in parallel processing are [7] by Bertsekas & Tsitsiklis and [65] by Moldovan. Many aspects of scheduling theory are presented in [16] by Chretienne et al. and [27] by El-Rewini et al.


Chapter 3

An example of parallel algorithm - gradient training of feedforward neural networks

This chapter presents a usual approach to parallel processing where parallelisation is not done automatically. It is a slightly modified version of the journal article [45] to appear in Parallel Computing.

A message-passing architecture is presented to simulate multilayer neural networks, adjusting their weights for each pair consisting of an input vector and a desired output vector. First, the multilayer neural network is defined, and the difficulties arising from its parallel implementation are clarified. Then the implementation of a neuron, split into the synapse and body, is proposed by arranging virtual processors in a cascaded torus topology. Mapping virtual processors onto node processors is done with the intention of minimizing external communication. Then, internal communication is reduced and an implementation on a physical message-passing architecture is given. A time complexity analysis arises from the algorithm specification and some simplifying assumptions. Theoretical results are compared with experimental ones measured on a transputer based machine. Finally the algorithm based on the splitting operation is compared with a classical one.

This chapter does not require a deep understanding of neural networks. Excellent survey articles dedicated to this branch of artificial intelligence are [61] by Lippmann and [48] by Hush & Horne.

3.1 Algorithms for neural networks

The neural approach to computation deals with problems for which conventional computational approaches have been proven ineffective. To a large extent, such problems arise when a computer interfaces with the real world.


This is difficult because the real world cannot be modelled with concise mathematical expressions. Some problems of this type are image processing, character and speech recognition, robot control and language processing.

Programs simulating neural networks (NN) are notorious for being computationally intensive. Many researchers have therefore programmed simulators of different neural networks on different parallel machines (e.g. [6, 8]). Some implementations of algorithms such as self-organizing networks [69, 24] or heterogeneous neural networks [57] have been realized on transputer-based machines [26, 74, 79, 86]. For more references see the bibliography of neural networks on parallel machines [85].

A large number of neural network implementations on message-passing architectures have been reported in the last few years. These implementations usually deal with a conventional neural network adjusting its parameters (weights) after performing back propagation on a large number of input/output vectors. Such algorithms have a high degree of data parallelism, so they are intuitively easier to decompose, and many of them have already achieved linear speedup.

The aim of this chapter is to describe the implementation of a parallel neural network algorithm performing back propagation on a single sample pair consisting of an input vector and a desired output (target) vector at a given time. The weights are then adjusted for each sample input/output pair; this loop is called an epoch. In contrast with conventional neural networks, this function introduces a specific noise that is convenient in certain applications. Such networks are sometimes called neural networks with stochastic gradient learning.

3.2 Neural network algorithm specification

The neural network under consideration is a multilayer neural network using error back propagation with stochastic learning. The sigmoid activation function is used. The neuron under consideration is shown in Figure 3.1.


Figure 3.1: Artificial neuron j in layer l

As shown in Figure 3.2, consecutive layers are fully interconnected.


The following equations specify the function of the stochastic gradient learning algorithm simulating a multilayer neural network with one input, two hidden and one output layer. N_l is the number of neurons in layer l, k denotes an algorithm iteration index, I^l_j(k) denotes the input to the cell body of neuron j in layer l, u^l_j(k) denotes the output of neuron j in layer l, \delta^l_i(k) denotes the error back propagated through the cell body of neuron i in layer l, w^l_{ij}(k) denotes the synapse weight between cell body i in layer l − 1 and cell body j in layer l, \eta^l denotes the learning rate, and \alpha^l denotes the momentum term in layer l.

Activation - Forward Propagation

I^l_j(k) = \sum_{i=1}^{N_{l-1}} [ w^l_{ij}(k) \times u^{l-1}_i(k) ]
    \forall l = 1 \ldots 3, \forall j = 1 \ldots N_l
    I^0_j(k) \ldots j-th neural network input    (3.1)

u^l_j(k) = f(I^l_j(k)) = \frac{2}{1 + e^{-I^l_j(k)}} - 1
    \forall l = 0 \ldots 3, \forall j = 1 \ldots N_l    (3.2)

Error Back Propagation - Output layer

\delta^l_i(k) = f'(I^l_i(k)) \times ( u^{desired}_i(k) - u^l_i(k) )
    for l = 3    (3.3)

Error Back Propagation - Hidden layers

\delta^l_i(k) = f'(I^l_i(k)) \times \sum_{j=1}^{N_{l+1}} ( \delta^{l+1}_j(k) \times w^{l+1}_{ij}(k) )
    for l = 2, 1    (3.4)

Learning - Gradient Method

\Delta w^l_{ij}(k) = \eta^l \times \delta^l_j(k) \times u^{l-1}_i(k) + \alpha^l \times \Delta w^l_{ij}(k - 1)
    \forall l = 1 \ldots 3    (3.5)

w^l_{ij}(k) = w^l_{ij}(k - 1) + \Delta w^l_{ij}(k)
    \forall l = 1 \ldots 3    (3.6)



Figure 3.2: Example of a multilayer neural network (NN 2-4-4-2)

In this thesis, a stochastic gradient learning algorithm is assumed. The term "stochastic" is used because the weights are updated in each cycle (activation ⇒ back propagation ⇒ learning ⇒ activation ⇒ . . . ). Such processing introduces a little noise into the learning procedure, which could be advantageous in certain neural network applications.

Let us consider a neural network used for the simulation of a non-linear system with an unknown model, or a neural network used as a controller [39] interacting with a controlled system. In such cases, we deal with the problem of the dynamic behavior of a neural network algorithm. The problem is difficult to understand if the NN behavior is described just in terms of matrix operations; that is why a Petri net representation of the algorithm is given in paragraph 5.2.2.

3.3 Simple mapping

How the simulation task of the NN with various configurations and sizes is divided into subtasks is important for efficient parallel processing. The data partitioning approach proposed in [81] is dependent on the learning algorithm and needs the duplication of stored data. The network partitioning approach used by many researchers (e.g. [8, 56]), in this chapter called a "classical algorithm", uniformly divides the neurons from each layer among p node processors (NP). Then each processor simulates N_0/p + N_1/p + N_2/p + N_3/p neurons. One part of the activation phase in the second hidden layer is represented in Figure 3.3. The problem due to this partitioning is seen from the Petri net representation: each neuron at each processor has to receive the outputs of the previous layer from all other processors.


Figure 3.3: The activation in the second hidden layer (4 neurons in both hidden layers mapped on 4 NPs) and its Petri net representation.

In order to avoid this problem we split the neuron into synapses and a cell body. The splitting operation makes it possible to divide the computation of one neuron into several processes and to minimize the communication, as is shown in the following section and proven in section 3.9.

3.4 Cascaded torus topology of virtual processors

In this section the algorithm running on a network of virtual processors (VPs) is considered, so we do not yet have to care about load balancing and the delivery of training data. Problems of this type will be solved in the following two sections; in this section we focus on the algorithmic matters, so that the results will be applicable to several architectures.

The network of VPs arranged in the Cascaded Torus Topology (CTT) of size N_0 − N_1 − N_2 − N_3 corresponding to the neural network given in Figure 3.2 is shown in Figure 3.4. The VPs are divided into three categories:

• synapse virtual processors (SVPs)

• cell virtual processors (CVPs)

• input/output virtual processor (IO)



Figure 3.4: Cascaded torus topology of VPs (for NN 2-4-4-2)

Each SVP performs operations corresponding to the functions of a synapse in the neural network - the sum operation given in (3.1) and (3.3) and the weight updating given in (3.5) and (3.6). The CVP simulates the functions corresponding to the activation sigmoid function given in (3.2) and the error evaluations given in (3.3) and (3.4). All SVPs and CVPs are connected to their four neighbours by unidirectional channels.

Using the terminology introduced in section 2.4.3, the program simulating the multilayer neural network is described as follows:

• initialize the weights w^l_{ij} in the SVPs to random numbers;
  calculate u^0_j in IO and distribute it to the SVPs in layer 1 (scattering)

• for the 1st, 2nd and output layers do:
  calculate the product w^l_{ij} × u^{l-1}_i in the SVPs;
  accumulate I^l_j in the CVPs (single node accumulation);
  calculate u^l_j in the CVPs;
  send u^l_j to the SVPs in the following layer (broadcasting)

• receive u^3_j into IO;
  calculate the output error in IO and send it to the CVPs in the output layer

• for the output, 2nd and 1st layers do:
  calculate \delta^l_i in the CVPs;
  send \delta^l_i to the SVPs in the same layer (broadcasting);
  calculate the error \delta^{l+1}_j(k) × w^{l+1}_{ij}(k) in the SVPs;
  accumulate the products in the CVPs (single node accumulation)

• update the weights in the SVPs

The whole network (cascaded torus topology) could be seen as a set of rings (each ring has just one active virtual processor) because the communications are performed only in vertical or only in horizontal rings at a given time.

The cascaded torus topology of VPs is reminiscent of the systolic approach to parallel computing [50, 81]. But systolic processing is essentially pipelined array processing, and this algorithm has a very low degree of pipeline parallelism owing to the data dependency loop and the lack of tokens in this loop (see section 5.2.2 and [77]). Mapping VPs of different layers to one node processor avoids the inefficiency of the systolic approach, as will be seen in the next paragraph.

3.5 Mapping virtual processors onto processors

In this section VPs arranged in the CTT are mapped onto a torus of P × Q node processors (NPs), or simply processors.

The input arguments of the mapping algorithm given below are P, Q and N(1..3), the numbers of neurons. The output arguments are the row and the column node processor indexes of all SVPs and CVPs. Here colCVP(L,j) indicates the column index of the processor calculating the j-th cell in layer L, and rowSVP(L,i,j) indicates the row index of the processor calculating the synapse between the i-th cell in layer (L-1) and the j-th cell in layer L.

The instruction 'floor' rounds its argument towards minus infinity and the instruction 'rem' gives the remainder after division. (A Python transcription of the pseudocode is sketched after it.)

for L = 1...3
  for j = 1...N(L)
    for i = 1...N(L-1)
      if (L rem 2) = 1
        rowSVP(L,i,j) = floor((j-1)/(N(L)/P))
        colSVP(L,i,j) = floor((i-1)/(N(L-1)/Q))
      else
        rowSVP(L,i,j) = floor((i-1)/(N(L-1)/P))
        colSVP(L,i,j) = floor((j-1)/(N(L)/Q))
      end
    end
    if (L rem 2) = 1
      x = ((((j-1) rem (N(L)/P))*(N(L-1)/Q)) rem N(L-1)) + 1
    else
      x = ((((j-1) rem (N(L)/Q))*(N(L-1)/P)) rem N(L-1)) + 1
    end
    rowCVP(L,j) = rowSVP(L,x,j)
    colCVP(L,j) = colSVP(L,x,j)
  end
end
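The following Python sketch is a direct transcription of the pseudocode above, keeping the 1-based indices and assuming (as in the examples of this chapter) that every N(L) is divisible by both P and Q; the dictionary-based storage is an implementation choice of this sketch only.

def map_vps(P, Q, N):
    # N is indexed 0..3 with the numbers of neurons per layer
    rowSVP, colSVP, rowCVP, colCVP = {}, {}, {}, {}
    for L in range(1, 4):
        for j in range(1, N[L] + 1):
            for i in range(1, N[L - 1] + 1):
                if L % 2 == 1:
                    rowSVP[L, i, j] = (j - 1) // (N[L] // P)
                    colSVP[L, i, j] = (i - 1) // (N[L - 1] // Q)
                else:
                    rowSVP[L, i, j] = (i - 1) // (N[L - 1] // P)
                    colSVP[L, i, j] = (j - 1) // (N[L] // Q)
            if L % 2 == 1:
                x = (((j - 1) % (N[L] // P)) * (N[L - 1] // Q)) % N[L - 1] + 1
            else:
                x = (((j - 1) % (N[L] // Q)) * (N[L - 1] // P)) % N[L - 1] + 1
            rowCVP[L, j] = rowSVP[L, x, j]
            colCVP[L, j] = colSVP[L, x, j]
    return rowSVP, colSVP, rowCVP, colCVP

# Example: the NN 16-16-16-16 mapped on a 4 x 4 torus, as in Figure 3.5.
rowSVP, colSVP, rowCVP, colCVP = map_vps(4, 4, [16, 16, 16, 16])
print(rowCVP[1, 1], colCVP[1, 1])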

In order to fully demonstrate the mapping strategy, a larger neural network (containing 16 neurons in each layer) mapped on a torus of 16 processors is shown in Figure 3.5.


Figure 3.5: VPs simulating NN with 16-16-16-16 neurons mapped on 4× 4 NPs


Assuming an environment without virtual channels, we cannot simply map a group of VPs on one NP, because each pair of adjacent NPs is connected by one channel. One solution is to add multiplexing and demultiplexing processes. The solution chosen in our implementation is to create more complex processes that could function as groups of VPs and that eliminate internal communication.

To achieve uniform workload distribution among node processors, each NP needs VPs of both categories (CVP and SVP) and from all layers. The reason is seen from Figure 5.7 (for example: the output layer has to wait for results from layer 2 in the activation phase). In the following analysis, it is assumed that the number of neurons in each layer (N_0, N_1, N_2, N_3) is greater than or equal to the number of NPs (p = P × Q).

One possible solution for workload distribution is row and column permutation [33]. In this case, the CVPs of one layer are divided into P × Q parts and the mesh of SVPs from the following layer has to be divided into P × P × Q parts.

In our solution the CTT is split into six subregions (three rectangular subregions of SVPs and three diagonal subregions of CVPs) and each subregion is mapped onto P × Q processors. When using scattered mapping, each node processor has a part of each subregion. As seen from Figure 3.5, the solution to the mapping problem is done by reconfiguration of NPs in the case of layer 2. Then all subregions are divided into P × Q parts.

3.6 Data distribution

Delivery of training data is a crucial problem for efficient parallel simulation of large scale neural networks. We assume that the training data are available on one node processor - typically on the root processor (the processor connected to the host computer). Assuming what was mentioned in the previous section, the input data are delivered to the first row of the P × Q torus (see Figure 3.6). Output layer neurons are mapped onto all NPs, so the output has to be sent from all NPs to the root processor and the output error has to be sent by the root processor to all NPs.

The implementation realized on a transputer array is shown in Figure 3.6. The solution to data distribution is to create a message passing process (MP) on each node processor and to connect it to the process performing computation. MPs in the first row of the torus are connected in a horizontal ring with the IO process mapped on the root processor, and MPs in each column of the torus are connected into vertical rings. All MPs are connected by channels mapped on the same physical links as the channels connecting computation processes, but in the opposite direction. Each MP is a high priority process. This implementation written in OCCAM is available from the author upon request.

There are two kinds of MPs:

• 1) An MP in the first row of the torus transmits an input from its neighbour in the row, its neighbour in the column and its computation process. Then the message identification number is decoded and the message is sent in the correct direction.

Page 52: Parallel Algorithms for Distributed Control A Petri Net

40 CHAPTER 3. AN EXAMPLE OF PARALLEL ALGORITHM


Figure 3.6: Realization on an array of 17 transputers


• 2) The other MPs perform similar actions, with the exception of transmitting messages from the neighbour in the row.

It is clear that the node processors in the first row communicate more than the other node processors. A possible solution to this problem could be to create a more complex interconnection of MPs.


3.7 Time complexity analysis

Assumptions:

• 1) P × Q = p ... number of NPs without the root; P ≥ 2, Q ≥ 2, P = Q = \sqrt{p}.

• 2) Each node processor can transmit messages along one of its links at a time (1-port). We assume no gain of physical parallelism of the type (PAR in? out!) and no overlap of communication and computation (see reference [2]).

• 3) Oriented topologies (CTT with unidirectional links).

• 4) Linear time communication model τ_t = β + L × τ. We assume messages to be of constant length containing just one data unit (we don't assume any minimization of communication overhead). So the time required for transferring one data unit is τ_t.

• 5) The processing time required for the sigmoid function (or the derivative of the sigmoid function, respectively) is denoted τ_s. The processing time required for one multiplication and one addition is denoted τ_m.

• 6) Each node processor contains the same number of VPs.

According to the algorithm specification (Eq. 3.1 to 3.6 in section 3.2, the algorithm description and the synchronization between the root processor and the CTT), the time requirements for one iteration can be evaluated as:


T_{par}(N_0, N_1, N_2, N_3, p) =
      (\tau_s + 2\tau_t) N_0                                                                                   (scattering from the root)
    + \sum_{l=1}^{3} \left[ 2\tau_t \frac{N_{l-1} - N_{l-1}/p}{\sqrt{p}} + \tau_m \frac{N_{l-1} N_l}{p} + 2\tau_t \frac{N_l - N_l/p}{\sqrt{p}} + \tau_s \frac{N_l}{p} \right]    (activation)
    + 2(\tau_m + 2\tau_t) N_3                                                                                  (simulation in the root, gathering and scattering)
    + \tau_s \frac{N_3}{p} + 2\tau_t \frac{N_3 - N_3/p}{\sqrt{p}}                                              (back propagation, output layer)
    + \sum_{l=2}^{1} \left[ \tau_m \frac{N_{l+1} N_l}{p} + 2\tau_t \frac{N_l - N_l/p}{\sqrt{p}} + \tau_s \frac{N_l}{p} + 2\tau_t \frac{N_l - N_l/p}{\sqrt{p}} \right]            (back propagation, hidden layers)
    + \sum_{l=1}^{3} \left[ 3\tau_m \frac{N_{l-1} N_l}{p} \right]                                              (learning)
    (3.7)

In order to clarify the dependence on the problem size, let us assume the particular case when there is the same number of neurons in each layer (n = N_0 = N_1 = N_2 = N_3). Then the time complexity is given by:

T_{par}(n, p) = \underbrace{\tau_s n + 2\tau_m n + 6\tau_t n}_{T_{root}} + \underbrace{22\tau_t \frac{n - n/p}{\sqrt{p}}}_{T_{comm}} + \underbrace{\frac{14\tau_m n^2 + 6\tau_s n}{p}}_{T_{comp}}    (3.8)

• T_root is the sequential part of the algorithm (for its influence on the speedup, refer to Amdahl's law). Its communication part (6τ_t n) does not depend on p. This would not be the case if the communication overhead were minimized by omitting assumption 4. If all data for one NP were sent in one packet, the communication time with the root (scattering and gathering) would depend on (p × β + n × L × τ). If bigger packets (containing the data for one column of NPs) were sent in the first row of node processors, the communication with the root would depend on (2 × \sqrt{p} × β + n × L × τ).


[Plot of T_comp, T_root and T_comm (in ms) versus the number of node processors p.]

Figure 3.7: Separate parts of the execution time for NN with 64-64-64-64 neurons

The computation part could be overlapped, using pipeline parallelism (in the case when the input data are available in advance), with the computations in the CTT. This fact was not taken into account in the time complexity analysis.

• T_comm includes 11 communications (1 broadcasting, 5 times gossiping, 5 times multinode accumulation) on vertical/horizontal rings consisting of \sqrt{p} NPs, where each NP works with n/p data units.

• T_comp is the computational part of the algorithm. It corresponds to the part of T_seq (the processing time of the sequential algorithm running on one node processor) distributed among the p node processors in the CTT.

We write T_par(n, p) to denote that the processing time is a function of the number of neurons and the number of NPs, because τ_t, τ_s and τ_m are constants given by the parallel computer hardware. By assuming that the data unit length is 12 bytes (one REAL64 and one INT32 as the identification number) and by applying β = 3.9 µs and τ = 1.1 µs/byte (refer to [25]), we obtain τ_t = 17.1 µs/data unit. On the basis of simple benchmark program results, the floating point operation processing times were estimated as τ_m = 4.6 µs and τ_s = 32 µs.
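With these constants, Eq. (3.8) can be evaluated directly; the following small Python sketch does so (T_seq is taken as the numerator of Eq. (3.10); the chosen values of n and p are arbitrary examples).

import math

tau_t, tau_s, tau_m = 17.1, 32.0, 4.6          # microseconds, estimated above

def T_par(n, p):
    # one-iteration execution time of Eq. (3.8), in milliseconds
    T_root = tau_s * n + 2 * tau_m * n + 6 * tau_t * n
    T_comm = 22 * tau_t * (n - n / p) / math.sqrt(p)
    T_comp = (14 * tau_m * n ** 2 + 6 * tau_s * n) / p
    return (T_root + T_comm + T_comp) / 1000.0

def T_seq(n):
    # sequential time, as used in the numerator of Eq. (3.10), in milliseconds
    return (tau_s * n + 2 * tau_m * n + 14 * tau_m * n ** 2 + 6 * tau_s * n) / 1000.0

for p in (4, 16, 36):
    print(p, round(T_par(64, p), 1), round(T_seq(64) / T_par(64, p), 1))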


[Plot of the one-iteration execution time (in ms) as a function of p (number of node processors) and n (number of neurons in each layer), comparing the sequential and the parallel algorithm.]

Figure 3.8: Theoretical execution time of NN algorithm

Figure 3.8 visualizes the one-iteration processing time T_par given by Eq. (3.7), labelled as "parallel algorithm". The parabolic curve estimating T_seq is labelled as "sequential algorithm" (in this case τ_t = 0).

3.8 Some experimental results

Figure 3.9 compares the theoretical results given by Eq. (3.7) with the practical results measured on a parallel computer (Telmat T-node, 32 x T800). Small divergences are caused mainly by assumptions 1, 2 and 6, but in general we can claim that Eq. (3.7) estimates the time complexity of the given parallel algorithm very well. From the hyperbolic character of the curves in Figure 3.9, it seems that we succeeded in reducing the time complexity O(n × n) of the sequential algorithm to O(n × n/p) with the proposed parallel algorithm. This question is clarified by Figure 3.10, showing experimental speedup results.

The speedup is defined as the ratio:

S(n, p) = \frac{T_{seq}}{T_{par}} = \frac{T_{seq}}{T_{root} + T_{comm} + T_{comp}}    (3.9)


[Plot of the one-iteration execution time (in ms) versus the number of processing elements for NNs with 30-150-150-30, 64-64-64-64 and 32-32-32-32 neurons, together with the theoretical execution time.]

Figure 3.9: Comparison of theoretical and experimental results

It is clear that with a large scale neural network, a very good speedup could be achieved with any parallel algorithm that does not communicate anything dealing with the synapses (n × n). In the case of the presented algorithm, and referring to Eqs. (3.8) and (3.9), we can write:

\lim_{n \to \infty} S(n, p) = \lim_{n \to \infty} \frac{\tau_s n + 2\tau_m n + 14\tau_m n^2 + 6\tau_s n}{T_{par}} = p    (3.10)

A more difficult task is to achieve a reasonable speedup in the case when the number of node processors p approaches the number of neurons n (for example, imagine a real-time neural controller used to control a fast physical plant).

To get an indication of the speedup, dependent upon the network size, a number of different NN configurations have been executed. The aim in this case was not to learn a specific example problem, but to get general speedup results for the algorithm. To indicate the speedup, a small number of iterations suffices.

The results for varying sizes of 4-layer networks are given in Figure 3.10 and Table 3.1. The results are better in the case when (N_1 + N_2) > (N_0 + N_3), because there is relatively less work for the communication subsystem included in T_root.


[Plot of the speedup S versus the number of node processors p for NNs with 30-150-150-30, 64-64-64-64 and 32-32-32-32 neurons.]

Figure 3.10: Experimental results achieved on a T-node machine

Number of node processors |   1 |   4 |   6 |   9 |   15 |   20 |   25 |   30
Execution time [ms]        | 753 | 212 | 144 |  99 |   63 |   48 |   40 |   34
Speedup                    |   1 | 3.5 | 5.2 | 7.6 | 11.9 | 15.5 | 18.7 | 21.8

Table 3.1: Numerical values for neural network with 30-150-150-30 neurons

3.9 Comparison with a classical algorithm

In the following analysis, we will distinguish between a "classical algorithm" and the one explained in sections 3.4 to 3.8 - the "splitting algorithm". In the case of the classical algorithm it is assumed that each node processor handles one partition of neurons (refer to Figure 3.3), as shown in section 3.3. All weights coming into a neuron are stored at the same NP as the neuron. In other words, the neuron is not split into synapses and a cell body.

To derive the time complexity of the classical algorithm, let us assume the same conditions as in paragraph 3.7, with the exception of assumption 4. This means that the messages will differ in length, being of the type β + x × L × τ, where x is a count of data units and L is the data unit length.

Using the terminology of section 2.4.3, let us imagine one iteration of the classical algorithm:


• calculate the input layer at the ROOT;
  distribute the results [u^0_1, ..., u^0_{N_0}] to the processor network (scattering)

• for the 1st, 2nd and output layers do:
  calculate [u^l_1, ..., u^l_{N_l}];
  exchange the results with all other node processors (gossiping)

• collect the results [u^3_1, ..., u^3_{N_3}] at the ROOT (gathering)

• calculate the error at the ROOT;
  distribute the results [e^3_1, ..., e^3_{N_3}] to the processor network (scattering)

• calculate [\delta^3_1, ..., \delta^3_{N_3}]

• for the 2nd and 1st layers do:
  calculate the partial sums of errors,
  exchange the results (gossiping),
  add the partial sums and calculate [\delta^l_1, ..., \delta^l_{N_l}]

• update weights

As argued by [67], there is an upper bound for the gossiping problem. Let us omit assumptions 2) and 3) from section 3.7 and let us now consider a general topology. Each node processor in this topology has ∆ full-duplex links able to work in parallel (∆-port). During scattering, node processor 0 has to send (p − 1) packets of length n/p over ∆ links, so the solution time for scattering s_{F∗} is at least \frac{p-1}{\Delta} \frac{n}{p} L\tau. Let us consider that this topology has a diameter r, so the solution time for scattering s_{F∗} is also at least rβ. Then the lower bound for scattering is:

s_{F∗}(n) \ge \max\left( r\beta, \; L\tau \, \frac{p − 1}{\Delta} \, \frac{n}{p} \right)    (3.11)

This fact shows that T_root is at least proportional to n in both algorithms (classical and splitting). Communication with the ROOT could be accelerated by using processing elements having more communication links and arranged in a convenient architecture. This could be exploited more in the classical algorithm, because the connections of four links are already predefined in the splitting algorithm. Assuming ∆ is a constant given by the processor hardware, the mentioned acceleration is only a constant depending on ∆ and the given topology. Concerning the hierarchy of basic communication problems, it is evident that gossiping takes at least the same time as scattering (s_{F∗} ≤ g_{F∗}). During gossiping in the general topology, any node processor has to receive (p − 1) packets of length n/p over ∆ links, so the lower bound for gossiping (used only by the classical algorithm) is also at least proportional to n.


In the case of the classical algorithm, this means that T_{comm.clas} = 5 × g_{F∗}(n). On the other hand, in the case of the splitting algorithm, we communicate only n/\sqrt{p} data units in the vertical and horizontal rings, so T_{comm} = 11 × g_{F∗}(n/\sqrt{p}); please refer to Eq. (3.8). So finally we can write:

T_{splitting} = T_{root}(n) + T_{comm}(n/\sqrt{p}) + T_{comp}(n^2/p)    (3.12)

T_{classical} = T_{root.clas}(n) + T_{comm.clas}(n) + T_{comp}(n^2/p)    (3.13)

The above-mentioned equations express the difference between the two algorithms. The computational workload is the same, and the time for communication with the ROOT can differ, but it is a function of n in both cases. The only difference is in the communication time inside the processor network, which is decreased by a factor of \sqrt{p} in the case of the splitting algorithm. This difference is significant in the case of a large processor network. Equation (3.13) shows that the splitting algorithm is faster than the classical one, but the difference is not large.

The splitting algorithm is better than the other known algorithms in the case of fully connected neural networks adjusting weights for each input/output pair. The classical network partitioning approach is the most effective in the case of neural networks with sparse connections between layers. A data partitioning approach can be used only in the case where the neural network does not use stochastic learning. In such a case, separate input/output pairs are treated in different processors, each of them containing the whole neural network. When using a parallel computer with a big communication/computation ratio, the data partitioning algorithm is probably the only one achieving a reasonable speedup.

3.10 Conclusion

The problem of multilayer neural network simulation on message passing multiprocessors was addressed in this chapter.

The benefit of this chapter for the rest of the thesis lies namely in the fact that we have presented a typical algorithm performing regular numeric computations. Our parallelisation strategy needed a deeper analysis of the algorithm structure. It was argued that splitting the neuron into synapse and cell body makes it possible to efficiently simulate a neural network of the given class. The decomposition and the mapping onto this architecture were proposed, as well as a simple and convenient message passing scheme. The experimental results show a very good speedup, especially for networks having many neurons in the hidden layers. The time complexity analysis matches the experimental results well and facilitates estimation of the parallel execution time for large processor networks.


This example of a parallel algorithm reveals many interesting points:

• Fine grain partitioning leads to small granularity, which increases parallelism. On the other hand, small granularity can increase communications, in addition to increasing software complexity.

• Good overall speedup is achieved through a compromise between granularity and communication.

• Even if the data parallelism is very low, it is possible to obtain good results with the use of structural parallelism.

• Parallelism detection in iterative algorithms is a very complex task and it needs deep algorithm analysis.


Chapter 4

Structural analysis of Petri nets

Petri Nets make it possible to model and visualize behaviours comprising concurrency, synchronization and resource sharing. As such they are a convenient tool to model parallel algorithms. The objective of this chapter is to study structural properties of Petri Nets, which offer a profound mathematical background originating namely in linear algebra and graph theory.

Carl Adam Petri is a contemporary German mathematician who defined a general purpose mathematical tool for describing relations existing between conditions and events [76]. This work was conducted in the years 1960-62. From that time, these nets have been developed in the USA, notably at MIT in the early seventies. Since the late seventies, European researchers have been very active in organizing workshops and publishing conference proceedings on Petri Nets (PN) in the LNCS series by Springer-Verlag.

Murata [66] provides a good survey of the properties, analysis and applications of Petri Nets. Several books on Petri Nets, where theory takes an important place, have been published [10, 22, 75].

This chapter gives the basic terminology first; then the notion of linear invariants is introduced. It is argued that only positive invariants are of interest to us when analyzing structural net properties, and the notion of a generator is introduced. Then three existing algorithms finding a set of generators are explained and implemented. The importance of these vectors lies in their usefulness for analyzing net properties, especially parallelism detection, as will be seen in the following chapters. Time complexity measures are given and an original algorithm, first reducing fork-join parts of PNs and then finding a set of generators, is proposed. The three existing algorithms were implemented in Matlab and tested on various examples.

4.1 Basic notion

A Petri Net is a particular kind of directed graph, together with an initial state called the initial marking.


The underlying graph of a Petri Net is a directed, weighted, bipartite graph consisting of two kinds of nodes, called places and transitions, and arcs connecting places to transitions or transitions to places.

Definition 4.1 A Petri net is a four-tuple < P, T, Pre, Post > such that:
P is a finite and non-empty set of places (represented as a vector with entries P_1, . . . , P_i, . . . , P_m),
T is a finite and non-empty set of transitions (represented as a vector with entries T_1, . . . , T_j, . . . , T_n),
Pre is an input function, representing weighted arcs connecting places to transitions, called the precondition matrix of size [m, n],
Post is an output function, representing weighted arcs connecting transitions to places, called the postcondition matrix of size [m, n].


Figure 4.1: An example of Petri Net

In graphical representation (see Figure 4.1), places are drawn as circles, transitions as bars or boxes. Arcs are labeled with their weights (positive integers), where a k-weighted arc can be interpreted as a set of k parallel arcs. Labels of 1-weighted arcs are usually omitted.

A marking assigns to each place a nonnegative integer. If a marking assigns to place Pi a nonnegative integer k, we say that Pi is marked with k tokens.

Definition 4.2 A marked Petri net is a five-tuple < P, T, Pre, Post, M0 > such that M0 is an initial marking.

Pictorially, we place k black dots (tokens) in place Pi. A marking is denoted by M, a vector with entries M(1), . . . , M(i), . . . , M(m), where m is the total number of places. The i-th component of M, denoted by M(i), is the number of tokens in place i.


The behaviour of dynamic systems can be described in terms of system states and their changes. In order to simulate the dynamic behavior of a system, a state or marking in a Petri Net is changed according to the following firing rule:

1) A transition Tj is said to be enabled if each input place Pi is marked with at least Preij tokens, where Preij is the weight of the arc from Pi to Tj.

2) An enabled transition may or may not fire (depending on whether or not the event associated with the transition actually takes place).

3) A firing of an enabled transition Tj removes Preij tokens from each input place Pi of Tj, and adds Postkj tokens to each output place Pk of Tj, where Postkj is the weight of the arc from Tj to Pk.
A firing sequence is an ordered set of firing operations. To a firing sequence is associated a characteristic vector ~s, whose i-th component indicates the number of times the transition Ti is fired in the sequence. From the marking in Figure 4.2 one can have, for example, the firing sequences A = T1, B = T1T2T3 or C = T1T2T3T1T3, whose characteristic vectors are ~sA = (1, 0, 0), ~sB = (1, 1, 1), ~sC = (2, 1, 2). A characteristic vector may correspond to several firing sequences: for example (1,1,1) corresponds to both T1T2T3 and T1T3T2. But not all the ~s vectors whose components are positive or zero integers are possible; for example, there is no firing sequence with ~s = (0, 1, 1) from M0 since neither transition T2 nor transition T3 can be fired before a firing of transition T1.


Figure 4.2: A Petri Net (the billiard balls example in [22])

If a firing sequence ~s is applied from marking M then the reached marking M' is given by the fundamental equation:

M' = M + Post · ~s − Pre · ~s for M ≥ 0, ~s ≥ 0 (4.1)

A transition without any input place is called a source transition, and one without any output place is called a sink transition. A place Pi having the same transition Tj as input and output (Preij = Postij ≠ 0) is called a self loop place (for example place P4 in Figure 4.1). A transition Tj having one input place Pi and one output place identical to the input one (Preij = Postij) is called a self loop transition.


Definition 4.3 Let a PN with P = P1, ..., Pm and T = T1, ..., Tn be given. A matrix C = (cij), where 1 ≤ i ≤ m, 1 ≤ j ≤ n, is called the incidence matrix of the PN iff:

C = Post− Pre (4.2)

Then the fundamental equation 4.1 can be rewritten as:

M ′ = M + C · ~s for M ≥ 0, ~s ≥ 0 (4.3)

Remarks: The incidence matrix does not have the capability to hold information about self-loop places and self-loop transitions. The incidence matrix is sometimes called the change matrix.
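As an illustration, the following Matlab fragment evaluates the fundamental equation for the net of Figure 4.2. The Pre and Post matrices are an assumed encoding of that figure (places P1 to P4, transitions T1 to T3), consistent with the firing sequences discussed above but not taken from the thesis itself.

   Pre  = [1 0 0;        % P1 and P2 are the inputs of T1
           1 0 0;
           0 1 0;        % P3 is the input of T2
           0 0 1];       % P4 is the input of T3
   Post = [0 1 0;        % T2 returns a token to P1
           0 0 1;        % T3 returns a token to P2
           1 0 0;        % T1 marks P3 and P4
           1 0 0];
   C  = Post - Pre;      % incidence matrix (Equation 4.2)
   M0 = [1; 1; 0; 0];    % initial marking of Figure 4.2
   s  = [2; 1; 2];       % characteristic vector of the sequence C = T1 T2 T3 T1 T3
   M  = M0 + C*s         % Equation 4.3: marking reached after the sequence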

Definition 4.4 A Petri Net is pure iff:

∀Pi ∈ P and ∀ Tj ∈ T : Postij × Preij = 0 (4.4)

Definition 4.4 implies that a pure net does not contain any self-loop place or self-loop transition and it is fully representable by the incidence matrix C.

4.2 Linear invariants

In this section we introduce structural properties of Petri Nets. The structural properties are those that depend on the topological structure of Petri nets. They are independent of the initial marking M0 in the sense that these properties hold for any initial marking, or are concerned with the existence of certain firing sequences from some initial marking. Thus these properties can often be characterized in terms of the incidence matrix C and its associated homogeneous equations or inequalities. That is why we introduce the term linear invariants, comprising the P-invariants and T-invariants defined further in this chapter.

Let us consider the PN given in Figure 4.2. The sum

M(1) + M(3) (4.5)

is equal to 1 for M0 = (1, 1, 0, 0). Firing T1 or T2 does not change this sum; in fact, no firing changes it. So we can write for any marking M of the given Petri Net:

M(1) + M(3) = 1 (4.6)

That is why the subnet, formed by the set of places P1 and P3 and their input and output transitions, is called a conservative component.

Similarly, when considering the circuit of places P1, P2 and P3 in Figure 4.1 we can write:


2M(1) + M(2) + M(3) = 2 (4.7)

In order to have a general rule we multiply each term of the fundamental equation 4.3 by fT from the left side and we obtain:

fT ·M ′ = fT ·M + fT · C · ~s (4.8)

It is clear that the only possibility for fT to fulfil fT · M' = fT · M = constant for any firing vector ~s is to satisfy the following set of linear equations:

fT · C = 0 (4.9)

In a similar way we introduce a repetitive component. Let us have a look again at Figure 4.2. Firing transitions T1, T2 and T3 gives again the same marking. The subnet, formed by these transitions and their input and output places, is called a repetitive component.

Definition 4.5 Let a finite PN system with P = P1, ..., Pm and T = T1, ..., Tn be given.
1. A vector f ∈ Zm is called a P-invariant of the given PN, iff:

CT · f = 0 (4.10)

2. A vector s : s ∈ Zn is called a T-invariant of the given PN, iff:

C · s = 0 (4.11)

To find a solution of the homogeneous set of equations (4.10 or 4.11) is an easy task, solved by the Gauss elimination method in polynomial time. But in the following analysis we will be interested in a positive version of linear invariants.
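For completeness, a minimal Matlab sketch of this kernel computation (illustrative only, not part of the thesis code), using the incidence matrix of Figure 4.3 that reappears in Example 4.1 below; null(.,'r') returns a rational basis of the kernel, whose vectors need not be positive.

   C = [-1  1  0;
         1 -1  0;
         0  1 -1;
        -1  0  1];
   fbasis = null(C', 'r')   % columns are P-invariants:  C'*f = 0
   sbasis = null(C,  'r')   % columns are T-invariants:  C*s = 0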

Definition 4.6 Let a finite PN system with P = P1, P2, ..., Pm and T = T1, T2, ..., Tn be given.
1. A vector f : f ∈ Zm is called a positive P-invariant of the given PN, iff:

CT · f = 0 ∧ fi ≥ 0 ∀ i = 1, ..., m (4.12)

2. A vector s : s ∈ Zn is called a positive T-invariant of the given PN, iff:

C · s = 0 ∧ si ≥ 0 ∀ i = 1, ..., n (4.13)

Linear invariants were introduced by Lautenbach [58]. In the remainder of this chapter only P-invariants will be considered. To obtain T-invariants, we only need to transpose the matrix C and use the same method.


4.3 Finding positive P-invariants

We are looking for solutions of the equation:

CT · f = 0 for f ≥ 0

This is in fact a set of n homogeneous equations (n is the number of transitions) in m variables, the entries of the vector f = (f1, ..., fm)T (m is the number of places):

c11·f1 + c21·f2 + ... + cm1·fm = 0
c12·f1 + c22·f2 + ... + cm2·fm = 0
  ...       ...         ...
c1n·f1 + c2n·f2 + ... + cmn·fm = 0
(n equations)

Example 4.1: Let us consider for example the Petri net given in Figure 4.3. Then the set of equations is:

[ f1 f2 f3 f4 ] ·
[ -1  1  0;
   1 -1  0;
   0  1 -1;
  -1  0  1 ]  =  0

which gives:
-f1 + f2 - f4 = 0
 f1 - f2 + f3 = 0
      -f3 + f4 = 0

When solving this set of three equations with four variables we can deduce the following:
1) from the third equation we obtain f3 = f4
2) then from the first equation f3 = f2 - f1
3) the second equation is a linear combination of the first one and the third one
So the solution of this set of equations is: f = (f1, f2, f2 - f1, f2 - f1)T. By variation of the two parameters f1 and f2 we obtain all possible P-invariants. In other words, the dimension of the P-invariant subspace (which is Kernel(CT)) is: dim(P-invariant subspace) = 2.

With respect to linear algebra (see Newman [68]) we can deduce:

dim(subspace of P − invariants) = m− rank(C) (4.14)

For example one basis of the P-invariant subspace is:



Figure 4.3: A Petri Net with two positive P-invariants and one T-invariant

F = [f1 f2] =
[  1  0;
   0  1;
  -1  1;
  -1  1 ]
where the first column is obtained for f1 = 1, f2 = 0 and the second for f1 = 0, f2 = 1.

But the first P-invariant f1 is not positive, so we can perform a new variation of the parameters f1 and f2 in order to obtain positive P-invariants:

F =
[ 1  0;
  1  1;
  0  1;
  0  1 ]
where the first column is obtained for f1 = 1, f2 = 1 and the second for f1 = 0, f2 = 1.

It is not always possible to find positive P-invariants (see Figure 4.4, where we can find just one P-invariant (-1, 1, 1, 0, 0)T, which is negative). From equation 4.14 we deduce:

dim(subspace of positive P − invariants) ≤ m− rank(C) (4.15)

Definition 4.7 An invariant f = (f1, · · · , fm)T, solution of CT · f = 0, is called standardized iff:
1) f1, · · · , fm ∈ Z
2) for the first fi ≠ 0 it holds that fi > 0
3) f cannot be divided by an element k ∈ N, k > 1 (without violating condition 1).

Example 4.2: If

f1 = (0, -1/7, +3, -1/3)T



Figure 4.4: A Petri net with one negative P-invariant

the equivalent standardized invariant is:

f2 = (0, 3, -63, 7)T
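The standardization itself is mechanical; the following Matlab sketch (an illustration, not the thesis code) applies the three conditions of Definition 4.7 to the invariant f1 of Example 4.2.

   f = [0; -1/7; 3; -1/3];                    % the invariant f1
   [~, den] = rat(f);                         % denominators of the rational entries
   d = 1;
   for i = 1:numel(den), d = lcm(d, den(i)); end
   f = round(f*d);                            % condition 1: integer entries
   g = 0;
   for i = 1:numel(f), g = gcd(g, abs(f(i))); end
   f = f/g;                                   % condition 3: remove the common factor
   k = find(f ~= 0, 1);
   if f(k) < 0, f = -f; end                   % condition 2: first nonzero entry positive
   f                                          % yields (0, 3, -63, 7)', i.e. f2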

Definition 4.8 A standardized invariant f ≥ 0 is called minimal, iff it cannot be composed of other k standardized invariants in the form:

f = Σ_{i=1}^{k} λi·fi for λi ∈ Q+ (4.16)

Example 4.3: Among the following three invariants only the invariants f1 and f2 are minimal, because f3 = 0.5f1 + 0.5f2:

f1 = (2, 2, 0, 1)T,  f2 = (0, 2, 0, 1)T,  f3 = (1, 2, 0, 1)T



Figure 4.5: Event graph with two generators

More generally a given invariant f can be written as a composition of invariants called generators:

f = Σ_{i=1}^{g} λi·xi (4.17)

with factors λi, generators xi, and g the number of generators. The generators xi are the invariants used for the composition. A composition of the form 4.17 is obviously simpler if the factors λi are elements of Z+. Unfortunately, the higher the simplicity of the composition, the higher the complexity of calculating the generators xi, which is much higher than the complexity of calculating a basis of the P-invariant subspace.

For example the subspace of positive P-invariants of the Petri net in Figure 4.5 is given by two generators x1 = (1 1 0)T and x2 = (0 1 1)T. This subspace is shown in Figure 4.6.

Kruckenberg and Jaxy [52] considered several algorithms calculating the generators and divided the computations into five levels, as shown in Table 4.1.

Level | λi ∈   | Generators xi    | Set {xi}
  1   | Q      | xi ∈ Zm          | Base
  2   | Z      | xi ∈ Zm          | Base
  3   | Q+     | xi ≥ 0           | Unique
  4   | Z+     | xi ≥ 0           | Unique
  5   | {0,1}  | xi ∈ {0,1}m      | Unique

Table 4.1: Generator computational levels

In the following subsections we will show the algorithms finding the generators of the third level as the positive minimal standardized P-invariants. In Pascoletti [73] it is proved that the set of generators X is finite and unique for a given net if this set is characterized by a minimality condition.

In this thesis we will focus only on the third level, where each positive invariant is a positive linear combination of generators, which are minimal standardized invariants.


(3-D plot over the axes P1, P2, P3 showing the two generators and the beginning of the discrete solution subspace.)

Figure 4.6: Subspace of positive linear invariants



Figure 4.7: A Petri Net with four generators

It is evident that in the case of event graphs (a subclass of PNs where each place has no more than one input and one output arc, each with weight one) λi ∈ Z+ and xi ∈ {0, 1}m hold already for the third level. In other words, in the case of event graphs the sets of generators X for the third, fourth and fifth levels are identical. In the rest of this thesis the generators of the third level will be called simply generators (or Q+-generators) and the set of P-invariant generators will be denoted by the matrix X of size [m, g], where m is the number of places and g the number of generators.

It is clear that the number of generators can be larger than the dimension of the P-invariant subspace. This fact is demonstrated in Figure 4.7, where we find a unique set of four generators X, and by choosing three of them we obtain a P-invariant basis F.

X =
[ 1 0 1 0;
  0 1 0 1;
  1 1 0 0;
  0 0 1 1;
  1 1 1 1 ]

F =
[ 1 0 1;
  0 1 0;
  1 1 0;
  0 0 1;
  1 1 1 ]

When solving Example 4.1 it was possible to use the following approaches to find positive P-invariants:
1) find a basis of (m - rank(C)) linearly independent P-invariants by a modified Gauss Elimination Method (GEM) favoring positive P-invariants (see subsection 4.3.1)
2) find a set of positive P-invariant generators by solving a set of equations (see subsection 4.3.2)
3) first find a basis of a certain type and then construct the generators by varying the basis vectors (see subsection 4.3.3)

4.3.1 An algorithm based on Gauss Elimination Method

The method introduced in this subsection is explained in detail in [92], and it is extended to find P-invariants and T-invariants all at once in [71]. I am very grateful to professor Valette for various consultations on the subject.

The algorithm performs Gaussian row operations on the set of equations in order to find a maximum number of positive P-invariants.

Algorithm 4.1:

% GAUSS - Modified Gauss's algorithm (see the lecture notes by Valette, page 40).
% F = GAUSS(C) is the base of the Petri Net specified by the incidence matrix C.
% Rows of F are the P-invariants. To find T-invariants use F = GAUSS(C').
F^T = identity matrix of size m*m
while (dim(C) ≠ 0)   %end test
   % phase 1 - transition with one input and no output, or one output and no input;
   % a place connected to such a transition can never form a positive conservative component
   % (e.g. a source transition can generate an infinite number of tokens into just one place)
   j = 1
   while (j ≤ number of columns in C)
      if there is a unique nonzero element C(i,j) in column C(:,j)
         delete column C(:,j)
         delete row C(i,:)
         delete row F^T(i,:)
      else
         j = j + 1
      end
   end
   CATCHED = FALSE
   % phase 2.1 - transition with one output and at least one input
   j = 1
   while ((j ≤ number of columns in C) and (not CATCHED))
      if there is a unique positive element C(i,j) in column C(:,j)
         for all rows k with negative element C(k,j)
            %annul element C(k,j), e.g. with a = |C(i,j)| and b = |C(k,j)|
            C(k,:) = a*C(k,:) + b*C(i,:)
            F^T(k,:) = a*F^T(k,:) + b*F^T(i,:)
         end
         delete column C(:,j)
         delete row C(i,:)
         delete row F^T(i,:)
         CATCHED = TRUE
      else
         j = j + 1
      end
   end
   % phase 2.2 - transition with one input and at least two outputs
   j = 1
   while ((j ≤ number of columns in C) and (not CATCHED))
      if there is a unique negative element C(i,j) in column C(:,j)
         for all rows k with positive element C(k,j)
            %annul element C(k,j)
            C(k,:) = a*C(k,:) + b*C(i,:)
            F^T(k,:) = a*F^T(k,:) + b*F^T(i,:)
         end
         delete column C(:,j)
         delete row C(i,:)
         delete row F^T(i,:)
         CATCHED = TRUE
      else
         j = j + 1
      end
   end
   % phase 3 - transition with more inputs and outputs
   j = 1
   while ((j ≤ number of columns in C) and (not CATCHED))
      if there are at least two positive and two negative elements in C(:,j)
         choose one row C(i,:) with positive element C(i,j)
         choose one row C(y,:) with negative element C(y,j)
         for all rows k with negative element C(k,j) except row C(y,:)
            %annul negative element C(k,j)
            C(k,:) = a*C(k,:) + b*C(i,:)
            F^T(k,:) = a*F^T(k,:) + b*F^T(i,:)
         end
         CATCHED = TRUE   %the algorithm will continue in phase 2.2
      else
         j = j + 1
      end
   end
   % phase 4 - transition with many inputs and no output, or many outputs and no input;
   % places connected to these transitions can take part only in a negative conservative
   % component (even if a source transition generates an infinite number of tokens,
   % these tokens are subtracted owing to the negativeness of the component)
   j = 1
   while ((j ≤ number of columns in C) and (not CATCHED))
      if there are at least two positive or two negative elements in C(:,j)
         choose one row C(i,:) with nonzero element C(i,j)
         for all rows k with nonzero element C(k,j) except row C(i,:)
            %annul nonzero element C(k,j)
            C(k,:) = a*C(k,:) + b*C(i,:)
            F^T(k,:) = a*F^T(k,:) + b*F^T(i,:)
         end
         delete column C(:,j)
         delete row C(i,:)
         delete row F^T(i,:)
         CATCHED = TRUE
      else
         j = j + 1
      end
   end
end

. End of algorithm


Example 4.4: Let us apply Algorithm 4.1 to the Petri net shown below:

(Petri net with places P1, ..., P8 and transitions T1, ..., T6; the drawing is not reproduced here.)

C (rows P1, ..., P8; columns T1, ..., T6) =
[ 1 -1  0  0  0  0;
  1  0  0 -1  0  0;
  0 -1  0  1  0  0;
  0  1  0  0  0 -1;
  0  0  1  0 -1  0;
  0  0  0  1 -1  0;
  0  0  0 -1  0  1;
  0  0  0  0 -1  1 ]

F^T (columns P1, ..., P8) = identity matrix of size 8*8

Step 1 - phase 1 executed (row 5 and column 3 deleted); remaining rows P1, P2, P3, P4, P6, P7, P8 and columns T1, T2, T4, T5, T6.

C = [ 1 -1  0  0  0;
      1  0 -1  0  0;
      0 -1  1  0  0;
      0  1  0  0 -1;
      0  0  1 -1  0;
      0  0 -1  0  1;
      0  0  0 -1  1 ]

F^T = [ 1 0 0 0 0 0 0 0;
        0 1 0 0 0 0 0 0;
        0 0 1 0 0 0 0 0;
        0 0 0 1 0 0 0 0;
        0 0 0 0 0 1 0 0;
        0 0 0 0 0 0 1 0;
        0 0 0 0 0 0 0 1 ]

Step 2 - phase 2.1 executed (row 4 and column 2 deleted); rows now correspond to P1+4, P2, P3+4, P6, P7, P8 and columns to T1, T4, T5, T6.

C = [ 1  0  0 -1;
      1 -1  0  0;
      0  1  0 -1;
      0  1 -1  0;
      0 -1  0  1;
      0  0 -1  1 ]

F^T = [ 1 0 0 1 0 0 0 0;
        0 1 0 0 0 0 0 0;
        0 0 1 1 0 0 0 0;
        0 0 0 0 0 1 0 0;
        0 0 0 0 0 0 1 0;
        0 0 0 0 0 0 0 1 ]

Step 3 - phase 3 executed (row 5 replaced by the combination of rows 5 and 3); rows P1+4, P2, P3+4, P6, P3+4+7, P8.

C = [ 1  0  0 -1;
      1 -1  0  0;
      0  1  0 -1;
      0  1 -1  0;
      0  0  0  0;
      0  0 -1  1 ]

F^T = [ 1 0 0 1 0 0 0 0;
        0 1 0 0 0 0 0 0;
        0 0 1 1 0 0 0 0;
        0 0 0 0 0 1 0 0;
        0 0 1 1 0 0 1 0;
        0 0 0 0 0 0 0 1 ]

Step 4 - phase 2.1 executed (row 6 and column 4 deleted); rows P1+4+8, P2, P3+4+8, P6, P3+4+7 and columns T1, T4, T5.

C = [ 1  0 -1;
      1 -1  0;
      0  1 -1;
      0  1 -1;
      0  0  0 ]

F^T = [ 1 0 0 1 0 0 0 1;
        0 1 0 0 0 0 0 0;
        0 0 1 1 0 0 0 1;
        0 0 0 0 0 1 0 0;
        0 0 1 1 0 0 1 0 ]

Step 5 - phase 2.2 executed (row 2 and column 2 deleted); rows P1+4+8, P2+3+4+8, P2+6, P3+4+7 and columns T1, T5.

C = [ 1 -1;
      1 -1;
      1 -1;
      0  0 ]

F^T = [ 1 0 0 1 0 0 0 1;
        0 1 1 1 0 0 0 1;
        0 1 0 0 0 1 0 0;
        0 0 1 1 0 0 1 0 ]

Step 6 - phase 4 executed (row 1 and column 1 deleted); remaining rows correspond to -P1+P2+P3, -P1+P2-P4+P6-P8 and P3+P4+P7.

C (column T1) = [ 0; 0; 0 ]

F^T = [ -1  1  1  0  0  0  0  0;
        -1  1  0 -1  0  1  0 -1;
         0  0  1  1  0  0  1  0 ]

Algorithm 4.1 is successful when solving Example 4.4, because the basis F contains the maximum number of positive P-invariants (in this case we have just one such invariant, P3 + P4 + P7).

Example 4.5 shows a case where the algorithm finds just negative P-invariants in spite of the existence of one positive P-invariant, P4 + P6.


Example 4.5: Let us apply Algorithm 4.1 to the Petri net shown below:

(Petri net with places P1, ..., P6 and transitions T1, ..., T4; the drawing is not reproduced here.)

C (rows P1, ..., P6; columns T1, ..., T4) =
[ 0  1 -1  0;
  1  0  0 -1;
  0  0 -1  1;
  0  1  0 -1;
  1 -1  0  0;
  0 -1  0  1 ]

F^T (columns P1, ..., P6) = identity matrix of size 6*6

Step 1 - phase 3 executed (row 6 replaced by the combination of rows 6 and 1); rows P1, P2, P3, P4, P5, P1+6.

C = [ 0  1 -1  0;
      1  0  0 -1;
      0  0 -1  1;
      0  1  0 -1;
      1 -1  0  0;
      0  0 -1  1 ]

F^T = [ 1 0 0 0 0 0;
        0 1 0 0 0 0;
        0 0 1 0 0 0;
        0 0 0 1 0 0;
        0 0 0 0 1 0;
        1 0 0 0 0 1 ]

Step 2 - phase 2.2 executed (row 5 and column 2 deleted); rows P1+5, P2, P3, P4+5, P1+6 and columns T1, T3, T4.

C = [ 1 -1  0;
      1  0 -1;
      0 -1  1;
      1  0 -1;
      0 -1  1 ]

F^T = [ 1 0 0 0 1 0;
        0 1 0 0 0 0;
        0 0 1 0 0 0;
        0 0 0 1 1 0;
        1 0 0 0 0 1 ]

Step 3 - phase 3 executed (row 4 replaced by the combination of rows 4 and 3); rows P1+5, P2, P3, P3+4+5, P1+6.

C = [ 1 -1  0;
      1  0 -1;
      0 -1  1;
      1 -1  0;
      0 -1  1 ]

F^T = [ 1 0 0 0 1 0;
        0 1 0 0 0 0;
        0 0 1 0 0 0;
        0 0 1 1 1 0;
        1 0 0 0 0 1 ]

Step 4 - phase 2.2 executed (row 2 and column 3 deleted); rows P1+5, P2+3, P3+4+5, P1+2+6 and columns T1, T3.

C = [ 1 -1;
      1 -1;
      1 -1;
      1 -1 ]

F^T = [ 1 0 0 0 1 0;
        0 1 1 0 0 0;
        0 0 1 1 1 0;
        1 1 0 0 0 1 ]

Step 5 - phase 4 executed (row 1 and column 1 deleted); remaining rows correspond to -P1+P2+P3-P5, -P1+P3+P4 and P2-P5+P6.

C (column T3) = [ 0; 0; 0 ]

F^T = [ -1  1  1  0 -1  0;
        -1  0  1  1  0  0;
         0  1  0  0 -1  1 ]

Example 4.5 proves that Algorithm 4.1 is not always able to find the maximum number of positive P-invariants. The reason lies in phase 3, where one combination of input and output arcs is chosen among several possible ones.


Figure 4.8: Subnet of the PN representation of a Neural network algorithm


Figure 4.8 shows a Petri net with generators x1 to x5:

X^T (rows x1, ..., x5; columns P1, ..., P14) =
[ 0 0 0 1 0 0 1 1 1 0 0 0 0 1;
  0 0 1 1 0 0 1 0 1 0 0 1 0 1;
  0 0 1 1 0 0 1 0 0 1 0 0 0 0;
  0 0 0 0 0 0 0 0 0 0 1 1 0 1;
  0 0 0 1 0 0 1 1 0 1 1 0 0 1 ]

The fact that matrix X has rank 4 shows that four linearly independent positive P-invariants can be found in Figure 4.8, but the algorithm finds a basis containing just one positive P-invariant:

F^T (rows f1, ..., f7; columns P1, ..., P14) =
[ -1  0  0  1  1  0  0  0  0  0  0  0  0  0;
  -1 -1  0  0  1  1  0 -1 -1  0  0  0  0 -1;
   0  0  0  1  0  0  1  1  1  0  0  0  0  1;
   0  0  1  0  0  0  0 -1 -1  1  0  0  0 -1;
   0  0  0  0  0  0  0  0 -1  1  1  0  0  0;
   0  0  1  0  0  0  0 -1  0  0  0  1  0  0;
  -1 -1  0  0  0  0  0 -1 -1  0  0  0  1 -1 ]

When permuting rows and columns in matrix C before running the algorithm, we can get a basis with four positive P-invariants. This is another demonstration of the deficiency of Algorithm 4.1, because its result should not depend on the permutation of the input matrix C.

4.3.2 An algorithm based on combinations of all input and all output places

The algorithm presented in this subsection was published in [63]. It generalises in some sense Jensen's rules [49], offering a systematic way of finding all invariants. In the sense of Table 4.1, Algorithm 4.2 finds generators of the third level.

Algorithm 4.2:

%SILVA - algorithm Alfa2 /Martinez & Silva - LNCS 52/
%X = SILVA(C) is a non-negative matrix such that:
%1) each positive p-invariant could be done as a lambda combination of the columns of X
%2) no column of X could be done as a lambda combination of other columns of X
%Remark: lambda is a vector of non-negative rational numbers Q+
%phase 1 - initialize XT
XT = identity matrix of size m*m
for j = 1 : n   %eliminate all transitions
   %phase 2.1 - generate new places resulting as a non-negative linear combination
   %of one input and one output place of transition j (the new places are neither
   %input nor output places of transition j)
   for all rows p with positive element C(p,j)
      for all rows q with negative element C(q,j)
         %combine row p and row q, annulling the j-th entry,
         %and append the result to C and XT
         C(new,:)  = a*C(q,:)  + b*C(p,:)
         XT(new,:) = a*XT(q,:) + b*XT(p,:)
      end
   end
   %phase 2.2 - eliminate input and output places of transition j
   for all rows i with nonzero element C(i,j)
      delete row C(i,:)
      delete row XT(i,:)
   end
   %phase 2.3 - eliminate the non-minimal generators
   for all rows i
      %when XT(i,:) is already a completed P-invariant
      if C(i,:) is a zero line   %same effect as the usual condition XT(i,:)*C_original = 0
         %find the non-zero indices in row XT(i,:)
         arr = vector of the q indices of XT(i,:) that are non-zero
         %create a submatrix of the original incidence matrix
         M = C_original(arr,:)
         if q ≠ rank(M) + 1
            delete row C(i,:)
            delete row XT(i,:)
         end
      end
   end
end

. End of algorithm


Example 4.6: Let us apply Algorithm 4.2 to the Petri net shown below:

(Petri net with places P1, ..., P6 and transitions T1, ..., T4; the drawing is not reproduced here.)

C (rows P1, ..., P6; columns T1, ..., T4) =
[ 1 -1  0  0;
 -1  0  1  0;
  1  0 -1  0;
 -1  0  0  1;
  0  1  0 -1;
  0 -1  0  1 ]

X^T (columns P1, ..., P6) = identity matrix of size 6*6

Step 1 - transition T1 deleted; rows now correspond to P5, P6, P1+2, P2+3, P1+4, P3+4 and columns to T2, T3, T4.

C = [ 1  0 -1;
     -1  0  1;
     -1  1  0;
      0  0  0;
     -1  0  1;
      0 -1  1 ]

X^T = [ 0 0 0 0 1 0;
        0 0 0 0 0 1;
        1 1 0 0 0 0;
        0 1 1 0 0 0;
        1 0 0 1 0 0;
        0 0 1 1 0 0 ]

Step 2 - transition T2 deleted; rows P2+3, P3+4, P5+6, P1+2+5, P1+4+5 and columns T3, T4.

C = [ 0  0;
     -1  1;
      0  0;
      1 -1;
      0  0 ]

X^T = [ 0 1 1 0 0 0;
        0 0 1 1 0 0;
        0 0 0 0 1 1;
        1 1 0 0 1 0;
        1 0 0 1 1 0 ]

Step 3 - transition T3 deleted and the non-minimal invariant P1+P2+P3+P4+P5 eliminated; rows P2+3, P5+6, P1+4+5 and column T4.

C = [ 0; 0; 0 ]

X^T = [ 0 1 1 0 0 0;
        0 0 0 0 1 1;
        1 0 0 1 1 0 ]

Step 4 - transition T4 deleted; rows P2+3, P5+6, P1+4+5.

C = [ ]

X^T = [ 0 1 1 0 0 0;
        0 0 0 0 1 1;
        1 0 0 1 1 0 ]

Example 4.6 shows a case where a non-minimal invariant, P1+P2+P3+P4+P5, is eliminated because it contains the already existing invariant P2+P3.
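A compact Matlab sketch of the place-combination core of Algorithm 4.2 is given below. It implements only phases 2.1 and 2.2 (all non-negative combinations of one input and one output place per transition, followed by elimination of the rows connected to that transition); the minimality test of phase 2.3 is omitted, so the result may still contain non-minimal columns such as the P1+P2+P3+P4+P5 invariant of Example 4.6. This sketch is an illustration, not the thesis implementation.

   function X = silva_core(C)
     [m, n] = size(C);
     X = eye(m);                             % rows track combinations of places
     for j = 1:n                             % eliminate transition j
       pos = find(C(:,j) > 0);
       neg = find(C(:,j) < 0);
       newC = []; newX = [];
       for p = pos'
         for q = neg'
           a = C(p,j); b = -C(q,j);          % a*C(q,:) + b*C(p,:) annuls column j
           newC = [newC; a*C(q,:) + b*C(p,:)];
           newX = [newX; a*X(q,:) + b*X(p,:)];
         end
       end
       keep = (C(:,j) == 0);                 % drop every row connected to transition j
       C = [C(keep,:); newC];
       X = [X(keep,:); newX];
     end
     X = X';                                 % columns of X are the candidate generators
   end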

4.3.3 An algorithm finding a set of generators from a suitable Z basis

Kruckenberg and Jaxy [52] propose an algorithm which computes generators based on the algorithm [51] by Kannan and Bachem calculating a Hermite normal form.

Z basis coming from a Hermite normal form

Equations 4.12 and 4.13 can be regarded as a linear homogeneous Diophantine system:

A · x = 0, x ∈ Zm (4.18)

where the matrix A corresponds to the transpose of the incidence matrix C of the associated Petri Net.

Theories of sets of linear Diophantine equations can be found in the paper by Fiorot/Gorgan [30] and the book by Newman [68]. Algorithms for solving a set of linear Diophantine equations are described in Bradley [9] and Frumkin [32]. The basic idea used in their algorithms is to triangularize the augmented coefficient matrix of the system by a series of column (row) operations consisting of adding an integer multiple of one column (row) to another, multiplying a column (row) by -1 and interchanging two columns (rows). Frumkin [32] has observed that the growth of intermediate expressions in the algorithm by Bradley [9] can be exponential in the number of equations. Kannan and Bachem present in their paper [51] two polynomial algorithms. In the following we describe how the construction of the Hermite normal form can be used in solving linear Diophantine equations.


Theorem 4.1 Given a nonsingular (m,m) integer matrix A, there exists an (m,m) unimodular matrix K (a matrix whose determinant is either 1 or -1) such that AK is lower triangular with positive diagonal elements. Further, each off-diagonal element of AK is nonpositive and strictly less in absolute value than the diagonal element in its row.

Proof: see Hermite [46].
AK is called the Hermite normal form of A. The original algorithm computing the Hermite normal form is concerned with square nonsingular integer matrices. But an examination of the procedures (see Kannan/Bachem [51]) shows that the algorithm works on rectangular matrices as well.

Algorithm 4.3:

%Hermite Normal Form of Rectangular matrices - Kannan & Bachem
%SIAM J. Comput., Vol. 8, No. 4, November 1979, chap. 2, rectangular case pp. 500-504
%Supposing A to be a rectangular integer matrix with full row rank.
%K = HNFR(A) is a square unimodular matrix (determinant being +1 or -1)
%such that A*K is lower triangular with positive diagonal elements.
%Further, each off-diagonal element of A*K is nonpositive and
%strictly less in absolute value than the diagonal element in its row.
[t, u] = size(A)
%phase 1 - initialize K
K = identity matrix of size u*u
%phase 2
%permute the columns of A so that every principal minor from (1,1) to (t,t) is nonsingular
for i = 1 to t
   %find column j making the principal minor nonsingular
   j = i
   while ((j < u) and (d = 0))
      j = j + 1
      d = det(submatrix of A consisting of rows [1:i] and columns [1:(i-1), j])
   end
   interchange columns i and j of A and of K
end
%by joining an identity matrix (u-t, u-t) and a zero matrix (u-t, t)
%make matrix A square; put the first principal minor into HNF
if A(1,1) < 0
   A(:,1) = -A(:,1)
   K(:,1) = -K(:,1)
end
%phases 3 to 6
for i = 1 to (u-1)
   %phase 4 - put the (i+1)(i+1) principal minor into HNF
   for j = 1 to i
      %phase 4.1 - calculate the greatest common divisor
      [r, p, q] = gcd(A(j,j), A(j,i+1))   %where r = p*A(j,j) + q*A(j,i+1)
      %phase 4.2 - perform elementary column operations on A and K
      %so that A(j,i+1) becomes zero
      if r ≠ 0
         D(1,1) = p
         D(2,1) = q
         D(1,2) = -A(j,i+1)/r
         D(2,2) = A(j,j)/r
         A(:,[j,i+1]) = A(:,[j,i+1]) * D
         K(:,[j,i+1]) = K(:,[j,i+1]) * D
      end
      %phase 4.3
      if j > 1
         reduce off-diagonal elements in column j
      end
   end
   %phase 5
   reduce off-diagonal elements in column i+1
end

. End of algorithm

Algorithm 4.3 performs elementary column operations on the matrix A(n, m) (corresponding to CT). These operations are memorized in the matrix K(m, m) (serving a similar role to the matrix F in Algorithm 4.1 of subsection 4.3.1) and are performed in such a way that the product A · K is in Hermite normal form:

A[n,m] · K[m,m] =
[ h11   0  ...   0  | 0 ... 0 ]
[  .    .        .  | .     . ]
[ hi1  ... hii   0  | 0 ... 0 ]
[  .             .  | .     . ]
[ hn1  ...  ... hns | 0 ... 0 ]      (4.19)

where s = rank(A) and the last m - s columns of the product are zero.

The next theorem shows how a Z-basis (the generators of the 2nd level in the sense of Table 4.1) of a homogeneous linear Diophantine system can be found.


Theorem 4.2 Let A be an (n,m) integer matrix and K be an (m,m) unimodular matrix, such that

A ·K = (h1, · · · , hs, 0, · · · , 0) (4.20)

and the columns h1, · · · , hs are linearly independent. Then the set B = {ks+1, · · · , km} formed from the last r = (m - s) columns of K is a Z-basis of 4.18.

Proof: see Newman [68].
This reasoning is clear when we take into account that all the columns of B satisfy the basic equation of P-invariants 4.12 (see equation 4.19).

Finding Q+generators from Z basis

To demonstrate a procedure finding a set of Q+-generators we consider first the following set of linear homogeneous inequalities:

B · y ≥ 0 (4.21)

y ≥ 0

B is a Z-basis of size (m, r), where m is the number of places in the associated Petri net, r = m - rank(C), and y is a column vector of size (r, 1). Sets of linear equations may (in an obvious manner) be considered as a special case of sets of inequalities.

The general idea to find the Q+-generators is the following. The Z-basis B contains the complete information needed to generate all integer solutions of the problem, namely by linear combinations (with integer factors) of the vectors of B. The set of all nonnegative solutions is a subset of all integer solutions. The set of the Q+-generators is a subset of the set of all nonnegative solutions. Therefore it is possible to generate the Q+-generators by suitable linear combinations of the vectors of the Z-basis B. The problem is only to construct the Q+-generators by a systematic reduction procedure. The procedure described in Algorithm 4.5, taken from the paper [52], leads to the result.

The following Algorithm 4.4 shows how a set of Q+-generators is found using the procedures HNFR (Hermite normal form of rectangular matrices) of Algorithm 4.3 and JAXY of Algorithm 4.5 (explained afterwards).

Algorithm 4.4:

% Q+ generator - Kruckenberg & Jaxy (LNCS 266, pp. 104-131)
% Let C be an incidence matrix representing a Petri Net free of
% sink and source places (places having exclusively output or input arcs).
% X = QGEN(C) is a nonnegative matrix such that:
% 1) each positive p-invariant could be done as a lambda combination of the columns of X
% 2) no column of X could be done as a lambda combination of other columns of X
%Remark: lambda is a vector of nonnegative rational numbers Q+
%Calls - HNFR, JAXY
[m, n] = size(C)
%delete linearly dependent columns (then C has full column rank)
for j = 1 to n
   if column C(:,j) is a linear combination of C(:,1), ..., C(:,j-1)
      delete column C(:,j)
   end
end
%find the Z basis of the linear Diophantine system
K = hnfr(C')
B = K(:, from (rank(C)+1) to the last column)   % B is the Z-basis
%find a set of Q+ generators X
Y = jaxy(B)
X = B * Y

. End of algorithm

We now explain how a set of Q+-generators is found from the Z basis B(m, r) (Algorithm 4.5, consisting of the procedure JAXY). The idea is based on Theorem 4.3.

Theorem 4.3 Let Y_old be a finite set of Q+-generators of the set of linear homogeneous inequalities:

y ≥ 0

B · y ≥ 0 (4.22)

where B is a given (k, r) integer matrix with 1 ≤ k ≤ m and y is a column vector of size (r, 1) (this matrix is identical to the first k rows of the full matrix B of size (m, r)). Then Y_new, a new finite set of Q+-generators of the set enlarged by a new constraint b (the (k + 1)-th row of matrix B):

y ≥ 0

B · y ≥ 0

b · y ≥ 0 (4.23)

is created by the following rules:
1) old generators satisfying the constraint b · y ≥ 0 are put among the new generators
2) let POZ be the set of old generators satisfying the constraint b · y > 0 and NEG the set of old generators satisfying the constraint b · y < 0. Find all couples [yi, yj] for which the conditions a), b) and c) are fulfilled and make the new generator y = |b · yi| · yj + |b · yj| · yi.
Conditions:
a) yi ∈ POZ and yj ∈ NEG
b) the vectors yi and yj annul at least (r - 2) linear inequalities of the set 4.23 simultaneously
c) there does not exist a third vector yl, (l ≠ i, j), which annuls all those inequalities which are annulled by yi and yj

Proof:
Rule 1) the proof is trivial because the new constraint does not change anything on the capability of the vector y to be a generator.
Rule 2) here we should recall that each generator annuls at least (r - 1) inequalities, but two generators never annul the same (r - 1) inequalities. So the generator y annuls one inequality by its definition and (r - 2) inequalities by being a linear combination of two vectors annulling the same (r - 2) inequalities. In addition we have to exclude the case when there is already such a vector (condition c)).
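As a quick check of rule 2, the proposed vector indeed annuls the added constraint: since b · yi > 0 and b · yj < 0,

b · y = |b · yi| (b · yj) + |b · yj| (b · yi) = -|b · yi| |b · yj| + |b · yj| |b · yi| = 0.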

When applying iteratively the rules from Theorem 4.3 we calculate the set of Q+-generators:

Algorithm 4.5:

%Kruckenberg & Jaxy (LNCS 266, Theorem 6.4, page 121)
% Supposing B to be a rectangular integer matrix with full column rank
% ( size(B,2) = rank(B) )
% Y = JAXY(B) is a matrix of Q+ generators such that X = B*Y
% fulfils the following:
% 1) each nonnegative vector of the subspace given by the base B
%    could be done as a lambda combination of the columns of X
% 2) X >= 0
% 3) no column of X is a lambda combination of other columns
% Remark: lambda is a vector of nonnegative rational numbers Q+
[m, r] = size(B)
M = identity matrix of size r*r   %subspace (corresponds to y ≥ 0 from 4.22)
Y = identity matrix of size r*r   %old generators
for k = 1 : m   %each place in the Petri net gives one constraint
   %each iteration of this loop is based on Theorem 4.3
   %phase 0
   if equation k was not already present then
      %phase 1 - update the subspace
      M = k-th row of B joined to M
      %calculate the scalar products
      S = B(k,:) * Y
      POZ = indices of generators with S > 0
      ZER = indices of generators with S = 0
      NEG = indices of generators with S < 0
      %find new generators of the updated subspace
      NG = [ ]
      %phase 2 - keep generators with nonnegative scalar product
      for j = 1 to length of POZ
         NG = [NG, Y(:,POZ(j))]
      end
      for j = 1 to length of ZER
         NG = [NG, Y(:,ZER(j))]
      end
      %phase 3 - combine generators i, j whose scalar products differ in sign
      for i = 1 to length of POZ
         ii = POZ(i)
         for j = 1 to length of NEG
            jj = NEG(j)
            %phase 3.1 - do the generators ii, jj annul at least r-2 equations simultaneously?
            if annul
               ANNU = TRUE
               %phase 3.2 - does there exist another vector annulling the same equations?
               if exist
                  EX = TRUE
               end
            end
            if ANNU and (not EX)
               %add a new generator
               alpha = abs(S(1,ii)) / gcd(S(1,ii), S(1,jj))
               beta  = abs(S(1,jj)) / gcd(S(1,ii), S(1,jj))
               NG = [NG, alpha*Y(:,jj) + beta*Y(:,ii)]
            end
         end
      end
      %the new generators become the old generators for the next iteration
      Y = NG
   end
end

. End of algorithm


Example 4.7: Let us apply Algorithm 4.4 to the Petri net shown below.

First of all we find a Z-basis B applying the procedure hnfr (given in Algorithm 4.3):

(The Petri net of this example is not reproduced here.)

B (rows P1, ..., P8; columns b1, ..., b5) =
[ -1  1  1 -1  1;
   0  0  1 -1  0;
  -1  1  0 -1  0;
   1  0  0  0  0;
   0  1  0  0  0;
   0  0  1  0  0;
   0  0  0  1  0;
   0  0  0  0  1 ]

Afterwards we apply the procedure jaxy (the 8 steps of this procedure correspond to the 8 Petri net places):

Step 0 - initialize M and Y

M = identity matrix of size 5*5 (rows correspond to the constraints on b1, ..., b5)
Y = identity matrix of size 5*5 (old generators y1, ..., y5)

Step 1 - y1, y2, y3 are chosen in phase 2 (e.g. y1 shows that b2 is OK for place P1) and y4 to y9 are chosen in phase 3.1 (e.g. y4 shows a combination of b1 and b2 to be OK for P1).

M = [ identity matrix 5*5 ; -1  1  1 -1  1 ]

Y (rows b1, ..., b5; columns y1, ..., y9) =
[ 0 0 0 1 0 1 0 1 0;
  1 0 0 1 1 0 0 0 0;
  0 1 0 0 0 1 1 0 0;
  0 0 0 0 1 0 1 0 1;
  0 0 1 0 0 0 0 1 1 ]


Step 2 - y1 to y7 are chosen in phase 2.

M = [ identity matrix 5*5 ; -1  1  1 -1  1 ; 0  0  1 -1  0 ]

Y (rows b1, ..., b5; columns y1, ..., y7) =
[ 0 1 0 0 1 0 1;
  0 0 1 0 1 0 0;
  1 1 0 0 0 1 0;
  0 0 0 0 0 1 0;
  0 0 0 1 0 0 1 ]

Step 3 - y1 to y4 are chosen in phase 2, y5 and y6 are chosen in phase 3.1, and y6 is eliminated in phase 3.2 (y6 annuls the 3rd, 4th and 7th equations in M, but these are already annulled by y3).

M = [ identity matrix 5*5 ; -1  1  1 -1  1 ; 0  0  1 -1  0 ; -1  1  0 -1  0 ]

Y (rows b1, ..., b5; columns y1, ..., y5; the eliminated column y6 is not kept) =
[ 0 0 0 1 0;
  1 0 0 1 1;
  0 1 0 0 1;
  0 0 0 0 1;
  0 0 1 0 0 ]

Steps 4, 5, 6, 7, 8 - phase 0 (these rows are already present in M). The resulting set of generators is:

X^T (columns P1, ..., P8) =
[ 1 0 1 0 1 0 0 0;
  1 1 0 0 0 1 0 0;
  1 0 0 0 0 0 0 1;
  0 0 0 1 1 0 0 0;
  1 0 0 0 1 1 1 0 ]


4.3.4 Time complexity measures and problem reduction

When using the mentioned algorithms for further analysis it is important to know their time complexity. Algorithm 4.1 works in polynomial time, but its results are only of reduced applicability, because it is not able to find the maximum number of positive P-invariants.

Time complexity of non-polynomial algorithms is often evaluated in terms of the size of the algorithm output. In this case it is sufficient to find an instance in which the algorithm output is of non-polynomial size. Then we can say that the algorithm runs in non-polynomial time. Such an instance is shown in Figure 4.9, where g (the number of Q+-generators) is given by equation 4.24 (for example, with k = 2 and n = 10 the net has only 20 places but 2^10 = 1024 generators):

g = k^n (4.24)


Figure 4.9: Example of Petri Net with exponential number of generators

To facilitate the analysis of large systems, we often reduce the system model to a simpler one, while preserving the system properties to be analyzed. There exist many transformation techniques for Petri Nets [66, 91]. In our analysis we will use only very simple reduction rules which are a subset of more complex rules preserving liveness, safeness and boundedness. We will take only binary Petri Nets into consideration. In the case of the example in Figure 4.9 (a Fork-Join Petri Net) we can apply a reduction procedure (see Algorithm 4.6) and represent the Petri Net in the form of a 'recipe' containing the same information as the set of Q+-generators (the set of Q+-generators can be obtained as an evolution of the 'recipe'). The 'recipe' uses two operators (each operator specifies a relation between two nodes N1 and N2 representing either places or transitions):
‖ ... parallel relation,
→ ... serial relation.

Algorithm 4.6:

%FJ - algorithm reducing a fork-join part of a Petri Net
%[PPre,PPost,T,P] = FJ(Pre,Post) are nonnegative matrices such that:
%PPre and PPost are the precondition and postcondition matrices of the
%reduced network and T and P are sets of transitions and places where each
%entry is a parallel/serial combination of the original Petri Net transitions/places
PPre = Pre
PPost = Post
T = [T1, T2, ..., Tn]
P = [P1, P2, ..., Pm]
while not KERNEL
   %place with one input and one output arc
   if there exist Px, Ty, Tz such that
         PPost(x,y) is the only nonzero element in PPost(x,:) and in PPost(:,y)
         and PPre(x,z) is the only nonzero element in PPre(x,:) and in PPre(:,z)
      then eliminate Px, Ty and set Tz = Ty → Px → Tz
   %transition with one input and one output place
   elseif there exist Tx, Py, Pz such that
         PPre(y,x) is the only nonzero element in PPre(:,x) and in PPre(y,:)
         and PPost(z,x) is the only nonzero element in PPost(:,x) and in PPost(z,:)
      then eliminate Tx, Py and set Pz = Py → Tx → Pz
   %two places with one input and one output transition
   elseif there exist Px, Py such that
         PPre(x,:) = PPre(y,:) and PPre(x,:) has just one nonzero element
         and PPost(x,:) = PPost(y,:) and PPost(x,:) has just one nonzero element
      then eliminate Py and set Px = Py ‖ Px
   %two transitions with one input and one output place
   elseif there exist Tx, Ty such that
         PPre(:,x) = PPre(:,y) and PPre(:,x) has just one nonzero element
         and PPost(:,x) = PPost(:,y) and PPost(:,x) has just one nonzero element
      then eliminate Ty and set Tx = Ty ‖ Tx
   else
      %no serial/parallel relation found
      KERNEL = TRUE
   end
end

. End of algorithm

When applying Algorithm 4.6 to the example in Figure 4.9 we obtain in polynomial time the 'recipe' T1 → (P11 ‖ · · · ‖ P1k) → T2 → (P21 ‖ · · · ‖ P2k) → · · · → Tn → (Pn1 ‖ · · · ‖ Pnk), containing the same information as the set of Q+-generators (the set of Q+-generators can be obtained by expanding the 'recipe': each generator corresponds to one choice of a single place from every parallel group).

Unfortunately not all Petri Nets are representable by the 'recipe', but we can always run Algorithm 4.6 before running the algorithms finding the set of Q+-generators in order to restrict the non-polynomial execution time to the kernel (non-reducible) part of the Petri Net.


A Petri Net of this kind is shown in Example 4.8.


Example 4.8: Let us apply Algorithm 4.6 to the Petri Net shown below.

(Petri net with places P1, ..., P9 and transitions T1, ..., T7; the drawing is not reproduced here.)

C (rows P1, ..., P9; columns T1, ..., T7) =
[ 1 -1 -1  0  0  0  0;
  0  1  1 -1  0  0  0;
  0  0  0  1 -1  0  0;
  1  0  0  0  0 -1  0;
  0  0  0  0  1 -1  0;
  0  0  0  0  1  0 -1;
  0  0  0  0  0  1 -1;
  0  0  0  0  0  1 -1;
 -1  0  0  0  0  0  1 ]

The result is seen below: this Petri Net is the kernel, which cannot be written in the form of fork-join relations.

(Reduced net with places PP1, P4, P5, P6, PP2, P9 and transitions T1, T5, T6, T7.)

C (rows PP1, P4, P5, P6, PP2, P9; columns T1, T5, T6, T7) =
[ 1 -1  0  0;
  1  0 -1  0;
  0  1 -1  0;
  0  1  0 -1;
  0  0  1 -1;
 -1  0  0  1 ]

The 'recipe' for PP1 is P1 → (T2 ‖ T3) → P2 → T4 → P3 and the 'recipe' for PP2 is P7 ‖ P8. The Q+-generators of the kernel can be found either by Algorithm 4.4 or by Algorithm 4.2:

XX^T (columns PP1, P4, P5, P6, PP2, P9) =
[ 1 0 0 1 0 1;
  0 1 0 0 1 1;
  1 0 1 0 1 1 ]


The Q+-generators X of the original Petri Net are found by expanding PP1 and PP2.

X^T (columns P1, ..., P9) =
[ 1 1 1 0 0 1 0 0 1;
  0 0 0 1 0 0 1 0 1;
  0 0 0 1 0 0 0 1 1;
  1 1 1 0 1 0 1 0 1;
  1 1 1 0 1 0 0 1 1 ]
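Once C and X^T are entered, the generator property of Definition 4.6 can be verified directly in Matlab; the following lines (an illustration, not part of the thesis code) check that every row of X^T is a nonnegative left annuller of C.

   C  = [ 1 -1 -1  0  0  0  0;
          0  1  1 -1  0  0  0;
          0  0  0  1 -1  0  0;
          1  0  0  0  0 -1  0;
          0  0  0  0  1 -1  0;
          0  0  0  0  1  0 -1;
          0  0  0  0  0  1 -1;
          0  0  0  0  0  1 -1;
         -1  0  0  0  0  0  1 ];
   XT = [ 1 1 1 0 0 1 0 0 1;
          0 0 0 1 0 0 1 0 1;
          0 0 0 1 0 0 0 1 1;
          1 1 1 0 1 0 1 0 1;
          1 1 1 0 1 0 0 1 1 ];
   ok = all(all(XT*C == 0)) && all(all(XT >= 0))   % expected: ok = 1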

The reduction of Fork-Join Petri Nets leads to a polynomial-size 'recipe' containing the information about the structural properties. This fact reveals the possibility of basing scheduling algorithms for Fork-Join Petri Nets on the 'recipe' and not on the set of Q+-generators. This observation corresponds to studies proving the general scheduling problem to be NP-hard, while special polynomial algorithms are known only for special classes of DAGs (such as Fork-Join and trees) [16].

4.3.5 Conclusion

The objective of this chapter was to study structural properties of directed, weighted, bipartite graphs called Petri Nets.

Some of the most distinctive features of the chapter are:

• It introduces the notion of minimal standardized invariants called generators, which are used in data dependence analysis in Chapter 5 and in scheduling algorithms in Chapter 6.

• It proves the reduced applicability of Algorithm 4.1, because it is not able to find the maximum number of positive P-invariants.

• It shows an implementation of two different algorithms finding the set of Q+-generators, accompanied by illustrative examples.

• It refers to the non-polynomial size of the set of Q+-generators and it describes a simple and original reduction method.

One of the fundamental features of this chapter is its close relation to other disciplines, namely convex geometry [17] and integer linear programming [93]. Therefore we can find many algorithms, some of them dating from the last century [28, 29], using different terminologies. So non-negative left annullers of a net's flow matrix are called positive P-invariants, or P-semiflows, or directions of a positive cone. In a similar way the set of Q+-generators is called 'set of minimal support invariants' or 'set of extremal directions of a positive cone' or simply 'generator of P-semiflows'. A recent article [19] by Colom & Silva highlights the connection between convex geometry and Petri Nets and presents two algorithms using heuristics for selecting the columns to annul. Performance evaluation of several algorithms can be found in [87] by Treves.


Chapter 5

Petri net based algorithm modelling

This chapter focuses on algorithm representation by means of Petri Nets.

Algorithm modelling and parallelism analysis are essential for designing new parallel algorithms, transforming sequential algorithms into parallel forms, mapping algorithms onto architectures and, finally, designing specialized parallel computers. Some of these problems, like scheduling, will be addressed in the next chapter. PNs are frequently used in modelling, designing and analyzing concurrent systems (see [82], [11], [12]) owing to their capability to model and to visualize behaviours including concurrency, synchronization and resource sharing. Dependency analysis based on PNs for synthesizing large concurrent systems was given by Chen et al. [13]. The proposed method, knitting, synthesizes a large PN by combining smaller PNs, and basic properties are verified by dependence analysis.

This chapter gives some additional terminology first, then various modelling techniques are briefly compared. It is stated that a model can be based either on the problem analysis or on the sequential algorithm. The problems are modelled as noniterative or iterative ones and the corresponding algorithms are modelled as acyclic or cyclic ones. Finally an attempt is made to put DDGs and Petri Nets on the same platform when removing antidependencies and output dependencies.

This chapter contains many original ideas; that is why it is written in an educative manner. It is possible that some ideas were just rediscovered, and I will be very grateful for all comments on the subject.

5.1 Additional basic notions

This paragraph introduces additional terminology needed for further analysis of PN models and for the comparison of various formalisms.

87



Figure 5.1: Implicit place

5.1.1 Implicit place

Definition 5.1 Let a PN with P = P1, ..., Pm and T = T1, ..., Tn be given. A place Px is called an implicit place iff the two following conditions hold:
1) any reachable marking M(Px) is equal to a positive linear combination of markings of places from P
2) M0(Px) does not impose any additional condition to fire its output transitions

The meaning of condition 2 from Definition 5.1 is explained in Figure 5.1. The PN given in Figure 5.1(a) shows a case where the place P1 is implicit to places P2 and P3. The structural property reflected by condition 1 from Definition 5.1 is fulfilled because M(P1) = M(P2) + M(P3). Condition 2 is fulfilled because M0(P1) ≥ M0(P2) + M0(P3). In Figure 5.1(b) the place P1 is not implicit, because condition 2 is not fulfilled.

The notion of an implicit place is important for structural parallelism detection, because implicit places do not carry any information about data dependencies.

5.1.2 FIFO event graph

A basic assumption that will be made throughout this thesis is that both places and transitions are First In First Out channels.

Definition 5.2 A place pi is FIFO if the k-th token to enter this place is also the k-th token which becomes available in this place. A transition tj is FIFO if the k-th firing of tj to start is also the k-th to complete. An event graph is FIFO if all its places and transitions are FIFO.

This assumption is needed when cyclic algorithms are modelled with the use of event graphs.


5.1.3 Uniform graph

Let us introduce a new graph theory formalism, called the uniform graph. Uniform graphs are used to model cyclic algorithms (see chapter 9 in [16]) with uniform constraints.

The algorithms with uniform constraints consist of n statements having in general the following form:

xi(k) = fi(x1(k − βi1), ..., xn(k − βin)) (5.1)

where k is the iteration index (k = 0, 1, 2, ...) and βi1 to βin are constants from N. That means that there is no statement of the form x2(k) = f2(x1(2k - 3)) (linear constraints) or of the form x2(k + 2) = f2(x1(k + 1)) (general uniform constraints). It is important to notice here that only general uniform constraints will be considered further in this thesis. For more details see Equation 5.3.

(Diagram comparing the expressive power of: general Petri Nets, marked event graph, uniform graph, (Max,+) algebra, unmarked event graph, directed graph, DAG.)

Figure 5.2: Expressive power of different modelling methods

Definition 5.3 The triple (G, L, H) is a uniform graph if:
G(V, E) is a directed graph where V is a set of vertices and E is a set of edges,
L : E → N ∪ {0} is a length associated to each edge,
H : E → N is a height associated to each edge.

The notion of the uniform graph is given here in order to compare modelling methods based on various formalisms coming from different scientific branches. Figure 5.2 summarizes in brief the relationships among some modelling methods used in this thesis. The time aspect (usually associated to vertices or edges) is not taken into consideration for the sake of figure simplicity. Figure 5.2 shows a hierarchy of different modelling methods given by their generality; two methods in the same rectangle signify the same expressive power of the two methods. In a certain sense it clarifies why only DAGs and marked event graphs are used to model and schedule algorithms.
An event graph, in contrast to a general Petri Net, does not contain structural conflict, which introduces nonlinear behaviour. So event graphs are more popular among theoreticians analyzing net properties and solving scheduling problems.
An unmarked event graph containing a cycle does not tell whether the cycle contains zero tokens (evoking a deadlock situation), one token (reflecting the fact that all transitions in the cycle have to be fired in sequence) or more tokens. So only acyclic versions of directed graphs are of interest. For a deeper discussion on this subject see paragraph 6.1.3.

In the following text only event graphs, instead of general PNs, will be used, because no conditional branching, represented by structural conflict, will be taken into consideration.

5.2 Model based on the problem analysis

In this paragraph we will focus on the situation when the model is constructed directly from the problem specification. It means that we do not make use of any sequential algorithm, because a sequential algorithm specifies in which order the instructions will be performed. In other words, when we construct a model directly at the moment when we make the problem analysis and there is no conditional branching, then the model contains just data dependencies.

5.2.1 Noniterative problems

The situation mentioned above is seen in Figure 5.3, serving as a data-flow model of the simple numerical problem given by the equation y = (x^2 + 1)(x^2 - 1). Figure 5.3(a) shows a Petri Net model constructed in the following way:

• transitions correspond to computational blocks (e.g. procedures or separate instructions)

• data are represented by places

• the input relation of data to the computational blocks is represented by the matrix Pre

• the output relation of data to the computational blocks is represented by the matrix Post


• presence of a token in a place signifies validity of the data

For a correct use of PNs it is necessary to have the following restriction: each datum is represented by as many places as the number of times the datum is used. The example given in Figure 5.3(a) shows a data-flow computation where places P1 and P2 represent the same value.

(Panel (a): Petri net with places P0 = x, P1 = P2 = x^2, P3 = x^2 - 1, P4 = x^2 + 1, P5 = y and transitions T0 SQR, T1 DEC, T2 INC, T3 MUL; panel (b): the corresponding DAG with vertices v0, v1, v2, v3.)

Figure 5.3: Representation of data flow by means of PNs and DAG
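One possible Pre/Post encoding of the net in Figure 5.3(a) is sketched below in Matlab (an assumed encoding for illustration, with places ordered P0, ..., P5 and transitions T0 = SQR, T1 = DEC, T2 = INC, T3 = MUL).

   %        T0 T1 T2 T3
   Pre  = [ 1  0  0  0;     % P0 (x) is consumed by SQR
            0  1  0  0;     % P1 (first copy of x^2) is consumed by DEC
            0  0  1  0;     % P2 (second copy of x^2) is consumed by INC
            0  0  0  1;     % P3 (x^2 - 1) is consumed by MUL
            0  0  0  1;     % P4 (x^2 + 1) is consumed by MUL
            0  0  0  0 ];   % P5 (y) is not consumed
   Post = [ 0  0  0  0;
            1  0  0  0;     % SQR produces both copies of x^2
            1  0  0  0;
            0  1  0  0;     % DEC produces x^2 - 1
            0  0  1  0;     % INC produces x^2 + 1
            0  0  0  1 ];   % MUL produces y
   M0 = [1; 0; 0; 0; 0; 0]; % a token in P0: the input value x is available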

Let us clarify the relationship between two modelling methods, Petri Nets and DAGs. Figure 5.3(b) shows a DAG representation of the same problem as the Petri Net in Figure 5.3(a). In this case the DAG carries the same information for scheduling purposes as the Petri Net given in Figure 5.3(a) because:

• no program cycles are assumed

• no gain of pipe-line parallelism, enabled by more tokens, is assumed

Another example of a noniterative problem is the matrix-vector multiplication, see Figure 5.4. For a fixed matrix size [3,3] and a vector with three entries we can write the following equation:

[ a11 a12 a13 ]   [ b1 ]   [ c1 ]
[ a21 a22 a23 ] · [ b2 ] = [ c2 ]          (5.2)
[ a31 a32 a33 ]   [ b3 ]   [ c3 ]

This example will be used later to show the detection of data parallelism and global communications.



Figure 5.4: Matrix[3,3]-vector[3] multiplication
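The data parallelism of Equation 5.2 can be made explicit with a small Matlab fragment (illustrative only, with arbitrary numerical data): each c(i) depends only on the i-th row of the matrix and on the whole vector b, so the three scalar products are mutually independent and could be computed by three processors.

   A = magic(3);             % arbitrary example data
   b = [1; 2; 3];
   c = zeros(3,1);
   for i = 1:3               % the three iterations are independent of each other
       c(i) = A(i,:)*b;      % c_i = a_i1*b_1 + a_i2*b_2 + a_i3*b_3
   end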

5.2.2 Iterative problems

PN model of a PD controller

Let us imagine the simulation problem shown in Figure 5.5, where a discrete time linear system of second order with a PD controller is modelled.

(Block diagram: the reference w(k) and the output y(k) enter the sum block producing the error e(k); the controller's P branch gives p(k), the D branch uses e(k) and the delayed e(k-1) (z^-1 block) to give d(k); their sum u(k) drives the System, which produces y(k).)

Figure 5.5: Discrete time linear system

The following equations hold for the specific blocks of Figure 5.5 (please notice that the order in which the equations are written is not important):

• Sum block:

e(k) = f1(w(k), y(k))

• Controller:

p(k) = f2(e(k))
d(k) = f3(e(k), e(k − 1))
u(k) = f4(p(k), d(k))


• System:

x1(k + 1) = f5(x1(k), x2(k), u(k))
x2(k + 1) = f6(x1(k), x2(k), u(k))
y(k) = f7(x1(k), x2(k))


Figure 5.6: PN model of the discrete time linear system of second order with PD controller shown in Figure 5.5

The general structure of the iterative problem under consideration consists of n statements having in general the following form:

xi(k + αi) = fi(x1(k − βi1), ..., xn(k − βin)) (5.3)

where:
k is the iteration index (k = 0, 1, 2, ...)
α, β are constants from Z (this assumption implies that also a negative number of tokens will be under consideration)
f1, ..., fi, ..., fn are functions R^n → R represented by the PN transitions
x1, ..., xi, ..., xn are variables from R represented by the PN places (each variable is represented by as many PN places as the number of times it is read)

Rule 5.1: The PN model of iterative problems consisting of statements given by Equation 5.3 can be constructed in the following way:
1) create the transition Ti for each function fi with the input places corresponding to the input variables (matrix Pre)
2) put β tokens into the transition input places
3) draw arcs from transitions to places (matrix Post)
4) put α tokens into the transition output places.


Remark: Notice that if some α and β are negative numbers then the resulting number of tokens can still be positive.

Finally, a positive number of tokens in the PN model corresponds to variables that have to be initialized. An algorithm reading data before their actualisation implies that a certain part of the model is not live (e.g. comprising a negative number of tokens in a place).
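For illustration only, the construction of Rule 5.1 can be sketched in Python; the helper name build_event_graph and the representation of each statement as a pair (alpha_i, reads_i), where reads_i maps the index j of a read variable to beta_ij, are assumptions and not part of the thesis.

    def build_event_graph(statements):
        """statements[i] = (alpha_i, reads_i), where reads_i maps j -> beta_ij,
        i.e. statement i computes x_i(k + alpha_i) from x_j(k - beta_ij)."""
        n = len(statements)
        Pre, Post, M0 = [], [], []
        for i, (_, reads) in enumerate(statements):
            for j, beta_ij in reads.items():
                alpha_j = statements[j][0]        # shift of the producing statement
                pre_row = [0] * n
                post_row = [0] * n
                pre_row[i] = 1                    # step 1: input arc to T_i (matrix Pre)
                post_row[j] = 1                   # step 3: output arc from T_j (matrix Post)
                Pre.append(pre_row)
                Post.append(post_row)
                M0.append(beta_ij + alpha_j)      # steps 2 and 4: beta + alpha tokens
        return Pre, Post, M0

For the system block of Figure 5.5, for instance, x1(k + 1) = f5(x1(k), x2(k), u(k)) would be encoded as (1, {x1: 0, x2: 0, u: 0}) with the variables suitably numbered; each read of a variable then obtains its own place, as required by the restriction stated above.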

Neural network PN model

Behaviour of the neural network algorithm given by equations 3.1 to 3.6 in Section 3.2 is modelled by the PN in Figure 5.7. Place and transition representations are as follows:
P0 ... inputs to the neural network
T0 ... sigmoid function in the input layer
P1, P1′, P2, P2′, P2′′, P3, P3′, P3′′, P4 ... outputs from layers
T1, T2, T3 ... activation procedures
P5 ... desired network outputs
T4 ... error evaluation in the output layer
T5, T6 ... error evaluation in the hidden layers
P6, P6′, P7, P7′, P8 ... error values
T7, T8, T9 ... learning procedures (when α = 0)
P9, P9′, P9′′, P10, P10′, P10′′, P11, P11′ ... weights

The initial markings in P9, P9′, P9′′, P10, P10′, P10′′, P11, P11′ represent initial weights generated by a random generator.

5.2.3 Model reduction

In order to show which transitions could be computed in parallel it is necessary to simplify the PN model - to eliminate the places that do not influence the sequential execution of any transitions, because this sequential execution is given by other data dependencies. Implicit places and self-loop places are places of this kind.

Correct elimination of implicit places and self-loop places preserves the properties of liveness, safeness and boundedness because this elimination does not change the graph of reachable markings.

Manual reduction

Let us consider for example the Petri net given in Figure 5.7. Then the manual model reduction could be done as follows:

First of all we can eliminate the self-loop places P9′, P10′ and P11′. The place P3′ is implicit to the places P3-P4-P6 and the place P2′ is implicit to the places P2-P3-P4-P6-P7. The resulting PN is shown in Figure 5.8.



Figure 5.7: NN learning algorithm represented by Petri Net


Figure 5.8: Simplified PN model


Similarly P3′′ is implicit to P3-P4-P6′, P2′′ is implicit to P2-P3-P4-P6-P7′ and P1′ is implicit to P1-P2-P3-P4-P6-P7-P8.

The fact that P9′′ is implicit to P9-P4-P6 can be derived in detail as follows:

• From the cycle P4-P6′-P9 it is easy to derive the equation:

M(P4) + M(P6′) + M(P9) = 1 (5.4)

M(P6′) ≤ 1 (5.5)

• From the relationship between the transitions T4 and T5 we can derive:

M(P6′) + M(P9′′)− 1 = M(P6) (5.6)

which gives
M(P9′′) = M(P6) − M(P6′) + 1 (5.7)

• When substituting M(P6′) from equation 5.5 it is possible to eliminate the place P9′′ because:

M(P9′′) ≥ M(P6) (5.8)

In a similar way we can deduce that P10′′ is implicit to P10-P3-P4-P6-P7. The resulting PN model can be seen in Figure 5.9. When analyzing Figure 5.9 we can observe the following:

• T0 could be done in the pipe-line in parallel with the rest of the algorithm

• T7 could be done in parallel with the sequence T5-T6-T9-T1-T2

• T8 could be done in parallel with the sequence T6-T9-T1

• the equations:
M(P2) + M(P3) + M(P4) + M(P6) + M(P7) + M(P8) + M(P11) = 1
M(P3) + M(P4) + M(P6) + M(P7′) + M(P10) = 1
M(P4) + M(P6′) + M(P9) = 1
prove the liveness of the PN model because each generator holds at least one token (a sufficient condition for liveness of event graphs).

Automatic reduction

Elimination of implicit places and self-loop places can be done either manually using generally known reduction rules [91] (as shown above) or automatically by the use of reduction algorithms. This section explains in brief some reduction algorithms.



Figure 5.9: PN model after reduction


Rule 5.2: A self-loop place Pi can be eliminated if M0(Pi) > 0. After examining all the places in the PN model the reduced PN model is obtained.

Rule 5.3: An implicit place reduction algorithm can be schematically written in the following way:

for i = 1 ... number of places
    invert the arcs of place Pi
    find the generators of the new PN model
    if (there exists a generator Xl containing Pi) AND
       (M0(Pi) ≥ M0((all places in Xl) − Pi)) AND
       (there is no self-loop transition with Pi)
    then eliminate place Pi
    else revert the arcs of place Pi
    endif
endfor

The algorithm given in Rule 5.3 is based on a search for generators, so it is executed in nonpolynomial time (see Section 4.3.4). In fact we do not need to look for the whole set of generators. We are interested in finding a linear combination of places implicit to a place Pi. In terms of graph theory it means that we are interested in finding a path from one vertex (the input transition of Pi) to another vertex (the output transition of Pi). This task can be solved very simply by a slightly modified depth first search algorithm in polynomial time (see [64] or sections 3.1.1 to 3.1.4 in [23]).
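For illustration only, the path search mentioned above can be written as a short Python routine; the helper name has_alternative_path and the representation of the net by the sets pre[q] (consuming transitions) and post[q] (producing transitions) of each place are assumptions, and the token-count comparison required by Rule 5.3 is deliberately left out.

    def has_alternative_path(pre, post, p):
        """True if there is a directed path from the producer of place p to the
        consumer of place p that does not use p itself (a candidate implicit place).
        pre[q]/post[q]: sets of consuming/producing transitions of place q."""
        src = next(iter(post[p]))          # the transition producing p
        dst = next(iter(pre[p]))           # the transition consuming p
        stack, seen = [src], {src}
        while stack:
            t = stack.pop()
            for q in range(len(pre)):
                if q == p or t not in post[q]:
                    continue               # skip p itself and places not fed by t
                for t2 in pre[q]:
                    if t2 == dst:
                        return True        # found a path around p
                    if t2 not in seen:
                        seen.add(t2)
                        stack.append(t2)
        return False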

5.3 Model based on the sequential algorithm

The way of representing data by the PN places does not match the real function of the data in the computer. The data in the computer memory are written once and can be read several times, but a token in the PN is put into a place once and removed once. In other words this way of representing an algorithm by PNs is not exactly correct because PNs were designed to represent flows of control and not flows of data. For a correct use of PNs it is necessary to have the following restriction: each datum is represented by as many places as the number of times the datum is used. This restriction means that we have to know how often each datum will be used before we start to draw the PN. This fact could create problems in the case of conditional branching. Otherwise the PN representation can be derived easily from the algorithm specification.

In this section we will adopt the same method as in Section 5.2. In addition we introduce a new term, IP dependencies, reflecting the sequential character of a modelled algorithm.


5.3.1 Acyclic algorithms

In many sequential programs, there are dependencies that are introduced artificially by the programmer and that may be eliminated. Imagine for example an algorithm for the noniterative problem stated in Figure 5.3:

Example 5.1:

T0: A = X . X
T1: B = A - 1
T2: C = A + 1
T3: Y = C . B

Example 5.1 sequentializes the execution of statements T1 and T2. This kind of control dependency is given by an instruction pointer and will be called IP dependency in the rest of this thesis. IP dependencies will be modelled by thick lines. When modelling a sequential algorithm without conditional branching there should always be an oriented path of IP dependencies going through all transitions and holding just one token. It is obvious that IP dependencies are very often implicit (in the sense of Definition 5.1) to data dependencies. In other words, data dependencies order the execution of statements partially while IP dependencies order it totally.

The model of the algorithm given in Example 5.1 is shown in Figure 5.10. It is clear that this model is very similar to the one given in Figure 5.3 arising from the problem analysis.


Figure 5.10: Representation of the algorithm from Example 5.1


5.3.2 Cyclic algorithms

On a parallel computer, the statements of a loop body may be performed simultaneously. Let us assume that the number of iterations is very large, and that the processing time of each statement is independent of the iteration index k. The loop can be considered as a set of generic tasks to be performed infinitely often. The data dependencies partially order the occurrences of the generic tasks. Note that some dependencies may involve executions from different iterations.

Example 5.2: consider an example algorithm given by the following pseudo-code:

k = 0;
while TRUE
    k = k + 1
    x1(k) = 3 * x7(k − 1)
    x2(k + 2) = x1(k − 2) + 4
    x3(k) = x1(k) / 6
    x4(k + 1) = 2 * x3(k)
    x5(k − 1) = x2(k + 1) + x4(k + 1)
    x6(k) = x3(k − 1) + 100
    x7(k) = x5(k − 1) / x6(k)
endwhile;

This code is represented by the PN model given in Figure 5.11. The PN model is constructed with the use of Rule 5.1 in a similar way as the one shown in Figure 5.6; in addition it includes IP dependencies. Markings correspond to the presence of valid data computed by the previous transition or assigned in the algorithm initialisation.

5.3.3 Detection and removal of antidependencies and output dependencies in a PN model

Many dependencies are inherent and cannot be removed. However they may be modified by various transformation techniques. A profound analysis of this problem was done by researchers using data dependence graphs (see section 2.2.1 and [65] for more details). The aim of this section is to show that a similar analysis can also be done with the use of Petri Nets. It is important to notice that the analysis done by DDGs is based on knowledge of the computing art. On the other hand the analysis done by means of Petri Nets brings a new perspective arising from the fact that Petri Nets are a mathematical and visualisation tool with a profound theoretical background.



Figure 5.11: PN representation of Example 5.2

Example 5.3: consider a simple sequence of statements containing an antidependence:

T1: A = B
T2: B = C


Figure 5.12: Detection (a) and removal (b) of antidependence

This example shows a simple situation where variable B is used twice, but it can be replaced by two independent variables. The PN model, given in Figure 5.12(a), contains one generator IP1 − B with zero tokens. In order to eliminate this generator, evoking a deadlock situation, it is needed either to split IP1 (that means to change the order of the operations T1 and T2 → but this change would bring a result different from the original one) or to split B as shown in Figure 5.12(b). In this case B becomes an initialized variable (as well as variable C), which matches the real situation very well.

Remark: Please notice that the task is to remove dependencies and corresponding cycles, not only to avoid the deadlock situation (in such a case it would be sufficient only to put one token into B as shown in Figure 5.12(c), that means to initialize the variable B). In the terminology of DDGs we can state that it is possible to remove antidependence and output dependence, but it is not possible to remove flow dependence, represented by a normal (not IP) place in the PN model. Figure 5.12(c) has no more antidependence, but there is a flow dependence due to variable B which prevents firing T1 and T2 simultaneously. For more details see the discussion of Example 2.2 on page 8.

Page 115: Parallel Algorithms for Distributed Control A Petri Net

5.3. MODEL BASED ON THE SEQUENTIAL ALGORITHM 103

Example 5.4: consider the following loop program containing an antidependence:

FOR k=1:N
T1: A(k) = B(k)
T2: C(k) = A(k+2)

The PN model, given in Figure 5.13(a), contains a generator A − IP0 with minus one token. Please notice that this situation would be represented in a DDG by means of a flow dependence with a negative value in the dependence vector, but in fact it reflects the same relation as an antidependence.

In order to eliminate the generator A − IP0 it is needed to split variable A as shown in Figure 5.13(b). A negative number of tokens in the initialized variable A shows a shift of this vector. So the resulting code corresponding to Figure 5.13(b) is as follows (the fact that the vector B is initialized too is omitted):

initialize(A(3:N+2))
FOR k=1:N
T1: AA(k) = B(k)
T2: C(k) = A(k+2)


Figure 5.13: Antidependence in a cyclic algorithm

Rule 5.4: An antidependence is related to a non-live part of a Petri Net. In the case of event graphs it is related to a generator with a non-positive number of tokens.

Page 116: Parallel Algorithms for Distributed Control A Petri Net

104 CHAPTER 5. PETRI NET BASED ALGORITHM MODELLING

Remark: There is a polynomial-time algorithm finding a deadlock in event graphs based on Theorem 1 given in [20] (a marking is live iff the token count of every directed circuit is positive). If the graph is strongly connected, then it is sufficient to erase all places holding a token and then look for the existence of a directed circuit. If the event graph is not strongly connected, then it is a bit more complicated (one has to compute the strongly connected components), but the problem is still polynomial.
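For illustration only, the strongly connected case of this test can be written as a short Python routine; the helper name has_deadlock and the representation of places as (source transition, destination transition) pairs are assumptions, not part of the thesis.

    def has_deadlock(places, marking):
        """places: list of (src_transition, dst_transition) pairs;
        marking[i]: number of tokens in place i.
        A directed circuit with zero tokens means the marking is not live."""
        succ = {}
        for (src, dst), m in zip(places, marking):
            if m == 0:                       # keep only arcs through empty places
                succ.setdefault(src, []).append(dst)
        WHITE, GREY, BLACK = 0, 1, 2
        color = {}
        def dfs(t):                          # a back edge = a directed circuit
            color[t] = GREY
            for u in succ.get(t, []):
                c = color.get(u, WHITE)
                if c == GREY or (c == WHITE and dfs(u)):
                    return True
            color[t] = BLACK
            return False
        return any(color.get(t, WHITE) == WHITE and dfs(t) for t in succ)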

Example 5.5: consider a simple sequence of statements containing an output dependence:

T1: A = B
T2: A = C
T3: D = A


Figure 5.14: Output dependence

The PN model, given in Figure 5.14(a), shows that the place Ax is implicit to IP1 − Ay. Figure 5.14(b) shows the resulting model corresponding to the following code:

T1: AA = B
T2: A = C
T3: D = A

Definition 5.4 A place Px (output place of Tx) is a competitor of Py (output place of Ty) and Py is a competitor of Px iff:
1) they represent the same variable
2) Tx ≠ Ty.

Page 117: Parallel Algorithms for Distributed Control A Petri Net

5.3. MODEL BASED ON THE SEQUENTIAL ALGORITHM 105

Rule 5.5: Output dependence is related to a place implicit to its competitor. In the case of event graphs: two competitors Px and Py are related to an output dependence iff there is a generator containing both competitors Px and Py (one of them with inverted arcs).

The correctness of Rule 5.5 is proven in the following text.

Example 5.6: consider the sequence of statements given in Example 2.1 and modelled by a DDG in Figure 2.1.

T1: A = B + C
T2: B = A + E
T3: A = B


Figure 5.15: Two antidependencies and one output dependence

This example, modelled by the Petri Net in Figure 5.15(a), contains two antidependencies given by the generators Ax − B and B − Ay and one output dependence Ax − Ay. First of all we will try to remove the antidependencies. The only possibility to remove the generator Ax − B is to split the place B and the only possibility to disconnect the generator B − Ay is to split the place Ay. The resulting model is given in Figure 5.15(b) and reflects the same algorithm as the DDG model given in Figure 2.1(b). It is important to notice here that there is no more output dependence. The output dependence Ax − Ay disappeared when the generator B − Ay was disconnected. But this is not always the case.

Theorem 5.1 Let PNO be a Petri net used to model the output dependence in an antidependence-free algorithm. Then searching for a generator containing both competitors (Py with inverted arcs) in PNO is limited to the four instances given in Figure 5.16(c)(d)(e)(f).


[Figure 5.16: the six instances (a)–(f), each built from the competitors Px and Py, their source transitions CompX and CompY, the data destination Dest and the IP places IP1, IP2, IP3; in the cyclic instances (d)(e)(f) the IP places carry the token counts x and y.]

Figure 5.16: The six possible instances of an output dependence


Proof:

• Notice that instances (a)(b)(c) model the three possible combinations of CompX, CompY (which are the source transitions of the two competitors Px, Py) and one data destination Dest in acyclic algorithms, and instances (d)(e)(f) the three possible combinations in cyclic algorithms.

• Models (a)(b) cannot be taken into consideration because they hold antidependencies. For example Figure 5.15(a) is one case of instance (a).

• The following holds for instances (c)(d)(e)(f):
Any additional path (not shown in Figure 5.16) from CompX to Dest, from Dest to CompX, from CompY to Dest, from Dest to CompY or from CompX to CompY cannot take part in the generator containing both competitors Px and Py (Py with inverted arcs). So these paths need not be present in Figure 5.16 for the purposes of Theorem 5.1, which searches for a generator containing both competitors.

• The following holds for instances (c)(e)(f):
Any additional path from CompY to CompX is implicit to IP3.

• The following holds for instance (d):
Any additional path from CompY to CompX with zero tokens creates an antidependence with IP1 − IP2. Otherwise, with more tokens, the additional path after reduction to one place is implicit to IP3.

□

Theorem 5.2 Each of the instances (c)(d)(e)(f) has a generator containing both competitors Px and Py (Py with inverted arcs).

Proof: see Figure 5.16. □

Theorem 5.2 implies that we are able to prove whether place Py is implicit to IP3 − Px or not by comparing the number of tokens in the generator.

Theorem 5.3 Let PNO be a Petri net used to model the output dependence in an antidependence-free algorithm. Then the following holds for instances (c)(d)(e)(f):
Py is implicit to IP3 − Px ⇔ elimination of Py does not change the behaviour and results of the modelled algorithm.

Proof:


• instance c)
Py is always implicit ⇒ the sequence of IP dependencies shows that the data from the competitor Y are first overwritten by the competitor X and then read by the destination

• instance f)
Py is implicit (y ≥ x) ⇒ the same memory location is first assigned by competitor Y and then overwritten by the competitor X
Py is not implicit (y < x) ⇒ the same memory location is first assigned by competitor X from a previous iteration and then overwritten by the competitor Y in one of the successive iterations

• instance e)
x = 0 or y = 0 ⇒ there is an antidependence
x > 0 and y > 0 ⇒ the transition Dest could be fired and a marking equal to the one in instance c) reached

• instance d)
y = 0 ⇒ there is an antidependence
y > 0 ⇒ the transitions CompX and Dest could be fired and a marking equal to the one in instance c) reached

□

Theorem 5.4 Py is not implicit to IP3 − Px ⇔ elimination of Px does not change the behaviour and results of the modelled algorithm.

Proof: From Theorem 5.3, Py is not implicit ⇔ elimination of Py changes the algorithm behaviour ⇔ the data of the competitor Y are used ⇔ the data of the competitor X are not used (see Definition 5.4). □

Theorem 5.5 An algorithm detecting and removing antidependencies and output dependencies can be schematically written in the following way:

begin
1) get PN model from sequential algorithm
2) eliminate antidependencies
3) eliminate output dependencies
4) remove IP-dependencies
5) save result as DAG (acyclic algorithm)
   or event graph (cyclic algorithm)
end

Proof: see the theorems of this section. □


Example 5.7: consider the following loop program with a non-specified constant z:

FOR k=1:N
T1: A(k+1) = B(k)
T2: C(k) = A(k)
T3: A(k+z) = C(k)
T4: D(k) = A(k)


Figure 5.17: Problem with two competitors and two destinations

The PN model is given in Figure 5.17. Let us study the output dependencies for various values of the constant z:

• z ≤ 0:
first of all the antidependence P3 − IP2 has to be eliminated and then we can look for an output dependence in the new model

• z = 1:
P3 is not implicit to IP3 − IP0 − P1 ⇒ P1 is eliminated
P2 is implicit to IP1 − IP2 − P4 ⇒ P2 is eliminated

• z ≥ 2:
P3 is implicit to IP3 − IP0 − P1 ⇒ P3 is eliminated
P2 is not implicit to IP1 − IP2 − P4 ⇒ P4 is eliminated

Remark: the case when comparing P1 to P3 for destination T2 corresponds to instance d) and the case when comparing P2 to P4 for destination T4 corresponds to instance f).


5.4 Conclusion

The objective of this chapter was to build a PN model of problems given either in the form of equations or in the form of an algorithm.

The basic structural features of algorithms are dictated by their data and control dependencies. These dependencies refer to the precedence relations of computation that need to be satisfied in order to compute the problem correctly. The absence of dependencies indicates the possibility of simultaneous computations as addressed in the following chapter.

Figure 5.18 shows a general view of several modelling approaches where information is represented by ellipses and methods are represented by rectangles. As shown in Figure 5.18, both modelling approaches represented by Section 5.2 (model based on the problem analysis) and Section 5.3 (model based on the sequential algorithm) lead to the same outcome - a representation of data dependencies by event graphs with a nonnegative number of tokens in each place. Afterwards, for the purpose of parallelism detection, the event graph can be reduced as shown in paragraph 5.2.3.

The original features of the chapter are:

• It introduces a new term 'general uniform constraints' leading to a negative number of tokens and to antidependencies.

• It makes an attempt to compare various modelling techniques.

• It shows that the data dependencies of iterative problems can be modelled by an event graph.

• It shows how to simplify the model by reduction of implicit places and self-loop places, leading to a reduced model which is easier to schedule.

• It underlines the importance of antidependencies and output dependencies and it shows algorithm transformations based on PN analysis (removal of dependencies increases structural parallelism).

• As a whole it presents a modelling strategy which is coherent for noniterative as well as iterative problems.


[Figure 5.18 contains, as ellipses and rectangles, the elements: problem specification, problem analysis, programming, sequential algorithm, PN modelling (with removal of antidependencies and output dependencies), DDG modelling, DDG, DAG, event graph, reduction of implicit and self-loop places, reduced event graph, scheduling, cyclic scheduling.]

Figure 5.18: Rough comparison of the two modelling approaches


Chapter 6

Parallelization

In this chapter we focus our interest on specific questions that can be useful when designing parallel compilers. In order to better understand the nature of the problem we first clarify the terms data parallelism and structural parallelism.

Loop scheduling is particularly important when designing efficient compilers for parallel architectures. In this chapter a cyclic schedule of nonpreemptive tasks with precedence constraints and no communication delays on an unlimited number of identical processors will be proposed. In addition we will try to minimize the number of processors used without relaxing the time-optimality condition.

In the last paragraph we address one of the most important problems for a successful implementation of parallel algorithms - namely communication.

6.1 Basic principles

In order to make some terminology clear we introduce the following hypothesis.
Hypothesis: there is no other parallelism than data parallelism and structural parallelism.

6.1.1 Data parallelism

Let us have a look at the matrix-vector multiplication problem modelled by Figure 5.4 on page 92. This model contains three independent parts. In terms of graph theory we can say that the three subgraphs (G1 = T2 − T3 − T4 − T5, G2 = T6 − T7 − T8 − T9 and G3 = T10 − T11 − T12 − T13) of the underlying undirected graph (that is, the graph resulting from the PN model by ignoring places and edge directions) are not connected. In general we can say: if the subgraphs Gi and Gj are not connected, then the corresponding tasks can be done in parallel.
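The check for disconnected subgraphs of the underlying undirected graph can be mechanised. The following Python fragment is a sketch only; the helper name connected_components and the arc-list representation of the net are assumptions, not part of the thesis.

    def connected_components(n_transitions, places):
        """places: list of (src_transition, dst_transition) arcs of the event graph.
        Returns a component label per transition; transitions with different
        labels belong to disconnected subgraphs and can be done in parallel."""
        parent = list(range(n_transitions))
        def find(x):                         # union-find with path compression
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for src, dst in places:              # ignore places and arc directions
            parent[find(src)] = find(dst)
        return [find(t) for t in range(n_transitions)]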

We will talk about data parallelism in the two following cases:
1) if the problem can be modelled in such a way that the resulting PN consists of several identical disconnected subgraphs


2) if there is a P-invariant or degenerated P-invariant (in event graphs corresponding to a path from a source place to a sink place; when adding a dummy transition and connecting this transition to the source place and the sink place we obtain a P-invariant consisting of the degenerated P-invariant and the dummy transition) containing more than one token.


Figure 6.1: Two PN models of the same vector operation

An interesting example of data parallelism is a simple operation on vectors (e.g. z = 2.(x + y)). This problem can be modelled in two ways:
- in shortened form, as shown in Figure 6.1(a)
- in unfolded form, as shown in Figure 6.1(b), where n identical disconnected subgraphs can be drawn.
Remark: Notice that the model given in Figure 6.1(a) is not a FIFO Petri Net. When constructing a state graph associated to this Petri Net it is necessary to paint the tokens in order to distinguish among them and to obtain the same state graph for both PN models.

In our opinion it is very misleading to talk about 'pipe-line parallelism', because pipe-lining is not a source of the parallelism, but a scheduling strategy. That is why we will talk about pipe-line parallelization.

SIMD parallelization

In a similar way it is misleading to talk about 'data parallelization' when identical disconnected subgraphs are scheduled each on one processor. That is why we will use the term 'SIMD parallelization' as stated in paragraph 2.2 on page 6.

Pipe-line parallelization

For example, iterative problems without dependence cycles can be executed in a pipe-line manner, meaning that each instruction is scheduled on a separate processor (this technique was adopted for example in vector computers or inside ALUs). The example shown in Figure 6.1(a) could be done in a pipe-line manner, where one processor computes ADD and the second computes MUL.

Let us imagine that the execution time of instruction ADD is equal to the execution time of instruction MUL, both equal to 1. Then the example problem shown in Figure 6.1 solved by using SIMD parallelization will take 2 time units (in general s time units, where s is the number of instructions in a separate subgraph). The same problem solved with the use of the pipe-line parallelization takes n + 1 time units (in general n + (s − 1) time units).

In general we can say that the SIMD parallelization (where each subgraph holds one processor) is always faster than the pipe-line parallelization. On the other hand the pipe-line parallelization is usually used in quite different circumstances (when the amount of input data is much bigger than the number of processors or when the input data are not delivered together). At a certain level of abstraction we can see the SIMD parallelization and the pipe-line parallelization as two orthogonal approaches, both gaining from data parallelism.

Token’s view of scheduling problem

A generalization of data parallelism leads to a schedule allowing the data to be processed as soon as they are ready. The fact that the valid data are represented by the tokens leads to the following reasoning.

Definition 6.1 We say that 'token x holds its own processor y' iff there is no firing of any transition by any token z ≠ x scheduled on processor y.

Theorem 6.1 Let a schedule be such that for all markings M of a PN model each token holds its own processor; then the schedule is time-optimal.

Proof:
The token is the only active element of the PN. If all tokens for all possible markings hold their own processors, then the situation when any of the tasks is ready to be processed (the token is waiting in front of an enabled transition) and it is not processed can never appear. Such a schedule is time-optimal. □

Remark: Please notice that this approach is very similar to a dynamic scheduling policy: execute a task as soon as possible. It is important to mention that the 'token's view of scheduling' is of particular importance in the case of iterative problems, because data parallelism of noniterative problems can always be modelled by disconnected subgraphs (see Figure 6.1(b)). That is why we usually model noniterative problems by DAGs (see Figure 5.18 and paragraph 6.1.3). There are two main studies devoted to the dynamic schedule behaviour of cyclic problems. The first, developed by Chretienne [14], uses graph theory arguments and longest-path computations, and the second, developed by Cohen et al. [4], uses the (max,+) algebra approach.


6.1.2 Structural parallelism

This paragraph introduces some new notions used to analyze structural properties of cyclic problems. Structural properties are independent of marking, so event graphs could be replaced by ordinary directed graphs. Generalization of noniterative problems to iterative ones is trivial and will be shown afterwards.

Definition 6.2 Let Π be a square symmetrical matrix called a parallel matrix representing structural parallelism. An element Πij is equal to 1 iff there is no Q+ generator passing through transition Ti and transition Tj. Otherwise it is equal to 0.

An algorithm constructing the parallel matrix Π could be schematically written in the following way:

for i = 1 ... number of transitions
    for j = 1 ... number of transitions
        if (there is no Q+ generator passing through transition Ti and transition Tj)
        then Πij = 1
        else Πij = 0
        endif
    endfor
endfor
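Assuming the Q+ generators are already available as sets of transitions, the construction above amounts to the following Python sketch; the helper name parallel_matrix and the list-of-sets representation are assumptions, not part of the thesis.

    def parallel_matrix(n_transitions, generators):
        """generators: list of transition sets, one per Q+ generator.
        pi[i][j] = 1 iff no generator passes through both Ti and Tj."""
        pi = [[1] * n_transitions for _ in range(n_transitions)]
        for g in generators:
            for i in g:
                for j in g:
                    pi[i][j] = 0          # Ti and Tj lie on the same generator
        return pi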

The parallel matrix Π is symmetrical, so it can be represented by an undirected graph G(T,E). There is an edge e between vertices Ti and Tj (corresponding to the transitions of the underlying PN model) if and only if the transitions Ti and Tj could be fired concurrently. Please notice that Π represents just the structural parallelism, so it does not hold any information about data parallelism.

Example 6.1: Consider the simple cyclic algorithm given in Example 5.2 on page 100 (PN model given in Figure 5.11).

When analysing this PN we discover that there are no antidependencies and no output dependencies; that is why we can remove all IP-dependencies (see Figure 6.2). The resulting model will serve as a basis for structural parallelism analysis, because no other places (like implicit places) can be eliminated. The corresponding parallel matrix Π, given by the undirected graph in Figure 6.3, shows that the transitions T2, T4, T6, forming a clique, can be fired in parallel.

A similar approach, called concurrency theory, was suggested by C.A. Petri. Concurrency theory is an axiomatic theory of binary relations of concurrency and causality. For a selection of recent results see [55].



Figure 6.2: PN model of cyclic algorithm from Figure 5.11 after removal of IP-dependencies


Figure 6.3: Graph corresponding to the parallel matrix Π of the algorithm given in Figure 6.2


Remark: up to now we supposed cyclic algorithms to be modelled by an autonomous PN, where no time is associated with the places or the transitions. At a certain level of abstraction (scheduling with identical task processing times) we can see the problem of finding all possible task schedules as similar to the problem of finding all possible vertex-disjoint cliques in the graph representing Π. We should remember that there is no polynomial algorithm able to find a clique of order k.

6.1.3 Noniterative versus iterative scheduling problems

When talking about the relationship between noniterative and iterative scheduling problems we should recall that noniterative problems are represented by DAGs and iterative ones by FIFO event graphs. In this paragraph, where no communication delays are assumed, we can restrict our attention to the reduced models (elimination of the implicit places does not change a scheduling strategy when no communication delays are assumed).

Generalization of noniterative problems to iterative ones is trivial:
Let P be a classical noniterative scheduling problem. An instance I of P is defined by an arbitrary DAG G. Let C(P) be an iterative version of P; the instance I can be reduced to an instance C(I) of the iterative version C(P). An event graph associated with the precedence relations given by G is built as follows:
- add to G a dummy task source and directed arcs from source to all source vertices of G
- add to G a dummy task sink and directed arcs from all sink vertices of G to sink
- build an event graph by replacing the vertices of G by transitions and the arcs of G by places with no tokens
- add one place holding one token and connect this place to the input transition sink and the output transition source.

Figure 6.4 illustrates the transformation of the matrix-vector multiplication given in Figure 5.4 into its iterative version.
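The construction described above is mechanical; the following Python fragment is a sketch only (the helper name dag_to_event_graph and the arc-list representation are assumptions, not part of the thesis), returning the places and the initial marking of the iterative version.

    def dag_to_event_graph(n_vertices, edges):
        """edges: list of (u, v) arcs of the DAG.
        Returns (places, M0) with places as (src_transition, dst_transition) pairs."""
        SOURCE, SINK = n_vertices, n_vertices + 1
        indeg = [0] * n_vertices
        outdeg = [0] * n_vertices
        places = []
        for u, v in edges:
            outdeg[u] += 1
            indeg[v] += 1
            places.append((u, v))                # arcs of G become places with no tokens
        for v in range(n_vertices):
            if indeg[v] == 0:
                places.append((SOURCE, v))       # dummy source feeds the source vertices
            if outdeg[v] == 0:
                places.append((v, SINK))         # sink vertices feed the dummy sink
        M0 = [0] * len(places)
        places.append((SINK, SOURCE))            # feedback place closing the cycle
        M0.append(1)                             # holding one token
        return places, M0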

The transformation given above reveals a possibility to schedule noniterative problems with cyclic scheduling algorithms (such an approach is used in Example 6.3). But in practice we are interested in the opposite operation: to transform an event graph into a DAG. After this transformation, acyclic scheduling algorithms could be applied to schedule iterative problems. But this transformation is not a trivial one, because a DAG holds just the structural properties of the given scheduling problem. In this respect we present the following observations:
1) if there exists a generator containing zero tokens, then there is a deadlock (see Theorem 1 in [20]: a marking is live iff the token count of every directed circuit is positive)
2) if every generator contains exactly one token, then we eliminate the places holding tokens and we construct a graph which is acyclic.



Figure 6.4: Cyclic version of matrix[3,3] vector[3] multiplication

Then the DAG scheduling problem, requiring each vertex to be scheduled exactly once, is identical to the event graph scheduling problem (see Theorem 7 in [20]: let M be a live marking; then there exists a firing sequence leading from M to itself in which every transition is fired exactly once). Finally we recall that each generator is a P-invariant preserving the number of tokens, so we can state: if each generator g holds exactly one token, then all transitions in g have to be fired in a sequence; this sequence is given by the precedence relations. In other words, the issue of a directed circuit in an event graph (or a P-invariant in an event graph) matches exactly the issue of a directed path in a DAG.
3) if there is a generator containing more than one token, then the transformation has to lead to a more complicated DAG structure, where each transition of the original event graph will probably be represented by more vertices of the DAG. For transformations leading to an event graph with at most one token in each place in the initial marking see the construction of the "standard autonomous equation" in (max,+) algebra [4]. For "Transforming Cyclic Scheduling Problems into Acyclic Ones" see Chapter 11 in [16].

6.2 Cyclic scheduling

Loop scheduling is particularly important when designing efficient compilers for parallel architectures. Up to now, iterative scheduling problems have been studied from several points of view, depending on the target application. Some theoretical studies have recently been devoted to these problems, in which basic results are often proved independently using different formalisms. We hope that this chapter might contribute to a synthesis of this class of problems.


6.2.1 Additional terminology

In a directed graph G(V,E) we define a directed path from vertex u to vertex v as an alternating sequence of vertices and edges v1, e1, v2, e2, ..., ek−1, vk where v1 = u, vk = v, all vertices and edges are distinct, and successive vertices vi and vi+1 are the endpoints of the intermediate directed edge ei. If the first and the last vertices of a directed path coincide (u = v) then we call the resulting closed directed path a cycle. If we relax the restriction to directed edges (we allow ei to equal either (vi, vi+1) or (vi+1, vi)) the corresponding notions are those of semipath and semicycle in directed graphs.

A graph G(V, E) is connected iff there is a semipath between every pair of vertices. A directed graph G(V, E) is strongly connected if there is a directed path between every pair of vertices.

A tree is an acyclic connected graph. The following theorem summarizes the basic properties of trees:

Theorem 6.2 Let G(V, E) be a graph. Then G is a tree iff one of the following properties holds:
1) G is connected and |E| = |V| − 1
2) G is acyclic and |E| = |V| − 1
3) there exists a unique semipath between every pair of vertices in G

Proof: see chapter 4.2 in [23] or chapter 4.1 in [64].

We call a directed graph a tree if its underlying graph is a tree in the undirected sense. We call a vertex v a root of a directed graph G if there are directed paths from v to every other vertex in G. A directed graph is called a directed tree if it is a tree and it contains a root (the root has no input edge). We call a vertex with no output edge in a directed tree an endpoint.

There are simple relations between the spanning trees (sets of trees covering the graph) and cycles in undirected graphs. To describe these relations we will introduce some terminology. Let G(V, E) be a connected graph and let Tr be a spanning tree of G. An edge of G not lying in Tr is called a chord of Tr. Each chord of Tr determines a cycle in G, called a fundamental cycle; namely, the cycle produced by adding the chord to Tr. Any cycle in G can be represented as a linear combination of fundamental cycles.

It is evident that a set of fundamental cycles forms the cycle subspace in a similar way as a basis of P-invariants when the Petri Net under consideration is an event graph (see Table 4.1 on page 4.1).

In the following paragraphs we will adopt a scheduling policy making use first of the structural parallelism and then of the data parallelism.


6.2.2 Structural approach to scheduling

This paragraph introduces a structural approach to scheduling problems, so no tokens are assumed to be in the Petri Net model. The algorithms presented in this paragraph were inspired by P-invariants and by the fact that the set of Q+ generators is unique.

Definition 6.3 A timed event graph is a pair ⟨N, Θ⟩ such that:
N is a marked event graph
Θ is a time associated with the transitions, Θ : T → R+

Remark: in the case of event graphs a representation of the scheduled algorithm by a T-timed event graph (where a processing time is associated to a transition) is equivalent to a representation by a P-timed event graph (the processing time is associated to the output places). For a detailed discussion see paragraph 2.5.2.6 in [4].

Definition 6.4 Let Λ be a schedule matrix of size [number of transitions, number of processors]. An element Λij is equal to 1 iff a transition Ti is allocated to processor j.

Rule 6.1: Let X be a matrix specifying the set of Q+ generators of a given event graph. An algorithm A1 for structural scheduling (no gain of data parallelism is assumed) without communication on an unbounded number of processors could be schematically written in the following way:

Input: Θ, Pre^T, X
Output: Λ
while there exists a zero line in the schedule matrix Λ
    j = index of the maximum element in (Θ × Pre^T × X)
    concatenate the column [Pre^T × X]:j to the schedule matrix Λ
    for i = 1..number of places
        if Xij = 1
        then zero the line Xi:
        endif
    endfor
endwhile


Remark 1: Notice that the matrix [Pre^T × X] is the set of generators expressed in terms of transitions (each column is composed of the transitions present in the given generator). Then (Θ × Pre^T × X) is a vector whose entries correspond to the total execution time of each generator.

Remark 2: The schedule Λ is time-optimal because (time of parallel algorithm execution) = (time of the generator with the biggest execution time).
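The scheduling loop of Rule 6.1 can be sketched in a few lines of Python. The fragment below is only an illustration under an assumed data representation (Theta as a list of transition times, Pre and X as 0/1 nested lists); the function name schedule_A1 and this representation are not part of the thesis. On the instance of Example 6.2 below it should reproduce the two-processor schedule derived by hand.

    def schedule_A1(Theta, Pre, X):
        """Theta[t]: execution time of transition t; Pre[p][t]: input arc place p -> transition t;
        X[p][g] = 1 iff place p belongs to Q+ generator g.
        Returns the schedule as a list of processor columns, Lam[k][t] = 1 iff
        transition t is allocated to processor k."""
        n_t, n_g = len(Theta), len(X[0])
        X = [row[:] for row in X]                       # work on a copy
        Lam = []
        def covered(t):
            return any(col[t] for col in Lam)
        while not all(covered(t) for t in range(n_t)):
            # TX[t][g] = 1 iff transition t still has an uncovered input place on generator g
            TX = [[min(1, sum(Pre[p][t] * X[p][g] for p in range(len(X)))) for g in range(n_g)]
                  for t in range(n_t)]
            times = [sum(Theta[t] * TX[t][g] for t in range(n_t)) for g in range(n_g)]
            j = times.index(max(times))                 # generator with the longest remaining time
            Lam.append([TX[t][j] for t in range(n_t)])  # allocate its transitions to a new processor
            for p in range(len(X)):                     # zero the rows of the covered places
                if X[p][j] == 1:
                    X[p] = [0] * n_g
        return Lam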

Example 6.2: consider an example algorithm modelled by the event graph given in Figure 6.5 with Θ = [1 3 1 1 4 2 1]:


Figure 6.5: A simple instance for structural scheduling

Step 1 - generator X1 chosen to be scheduled on processor 1

X =
       X1  X2  X3  X4
  P1    1   1   1   1
  P2    1   1   0   0
  P3    0   0   1   1
  P4    1   1   0   0
  P5    0   0   1   1
  P6    1   0   1   0
  P7    0   1   0   1
  P8    1   0   1   0
  P9    0   1   0   1
(the rows of the places belonging to the chosen generator X1 are zeroed before the next step)

Θ × Pre^T × X = [10 8 8 6]

Λ =
  T1  1
  T2  1
  T3  0
  T4  1
  T5  1
  T6  0
  T7  1


Step 2 - generator X4 chosen to be scheduled on processor 2

X =
       X1  X2  X3  X4
  P1    0   0   0   0
  P2    0   0   0   0
  P3    0   0   1   1
  P4    0   0   0   0
  P5    0   0   1   1
  P6    0   0   0   0
  P7    0   1   0   1
  P8    0   0   0   0
  P9    0   1   0   1
(the rows of the places belonging to the chosen generator X4 are zeroed next)

Θ × Pre^T × X = [0 2 1 3]

Λ =
  T1  1  0
  T2  1  0
  T3  0  1
  T4  1  0
  T5  1  0
  T6  0  1
  T7  1  0

In the following we will show that fewer than rank(X) processors will be needed by the algorithm A1.

Theorem 6.3 Let G(V, E) be a graph consisting of k connected components. Then G contains m − n + k linearly independent cycles, where m = |E| and n = |V|.

Proof:
Notice that each of the k components is supposed to be connected (there exists a semipath from each vertex to each vertex), not strongly connected (there exists a directed path from each vertex to each vertex). The proof is given in paragraph 14.1 in [23].

Remark: notice the similarity with Equation 4.14 on page 56 where dim(P-invariant subspace) = m − rank(C). In a similar way we deduce dim(T-invariant subspace) = n − rank(C), which is in fact equal to k in the case of event graphs. By substitution of rank(C) we obtain the equation holding for event graphs: dim(P-invariant subspace) = m − n + k.

Theorem 6.4 Let G(V, E) be a strongly connected directed graph specifying the precedence constraints of instance I. Algorithm A1 allocates p processors to schedule the instance I, where p ≤ m − n + 1.

Proof:
1) from Theorem 6.3 ⇒ dim(semicycles of G) = m − n + 1
2) each cycle is a semicycle
3) dim(generators of G) = dim(cycles of G)
4) from 1), 2) and 3) ⇒ dim(generators of G) ≤ m − n + 1
5) each generator chosen in the k-th iteration of A1 is linearly independent of the generators chosen in the (k − 1) preceding iterations (this is due to the fact that the generator chosen in the k-th iteration has a nonzero element in (Θ × X), so it contains at least one element of Θ that was not zeroed in the (k − 1) preceding iterations) ⇒ by mathematical induction we prove: the chosen generators are linearly independent
6) from 4) ⇒ p = (number of processors) = (number of chosen generators) ≤ dim(generators of G)


7) from 4) and 6) ⇒ p ≤ m − n + 1 □

As stated before, the matrix X can have a nonpolynomial number of columns (corresponding to generators), so we will be more interested in finding an algorithm which operates on a basis of non-negative vectors (corresponding to cycles).

Theorem 6.5 Let G(V, E) be a strongly connected directed graph. It is possible to find in linear time a basis B such that:
i) B is nonnegative
ii) each cycle of G can be represented as a linear combination of fundamental cycles represented by the columns of B ($\sum_{i=1}^{g} \lambda_i b_i$).

Proof:
1) G is strongly connected ⇒ a Depth First Search algorithm (see basic textbooks on graph theory [23, 64]) starting in an arbitrary vertex r ∈ V finds a directed spanning tree T covering the graph G
2) the vertex r, called the root, has no input edge lying in T
3) there is a directed path from r to every other vertex
4) from 2) and (G is strongly connected) ⇒ there is at least one edge x ∈ E which is an input edge to r and which is a chord (not lying in T)
5) from 3) and 4) ⇒ there is a cycle bk consisting of the edge x and the path from r to the input vertex of x
6) when reducing (joining) all edges and all vertices of cycle bk into one vertex r, we obtain a new graph G, which is still strongly connected, and a new tree T, which is still a directed spanning tree of G
7) when repeating 2) 3) 4) 5) 6) we obtain as many cycles bk as there are chords
8) each cycle bk is linearly independent of the cycles b1, ..., bk−1 because it contains a chord x which was not present in the cycles b1, ..., bk−1
9) from Theorem 6.2 ⇒ there are m − (n − 1) chords
10) from 8) and 9) ⇒ all cycles b1, ..., bm−n+1 are linearly independent, so they are fundamental cycles found in polynomial time = (Depth First Search) + (m − n + 1) □

Remark: An algorithm finding directed spanning trees of general directed graphs (consisting of several strongly connected components) is given in paragraph 5.2.3 in [64].

Figure 6.6 illustrates the proof of Theorem 6.5 on a given instance reduced in three steps. The proof of Theorem 6.5 is in fact an algorithm finding the basis B:



Figure 6.6: Underlying directed graph for Figure 6.5 and its reduction

B =
       b1  b2  b3
  e1    1   1   1
  e2    1   0   1
  e3    0   1   0
  e4    1   0   1
  e5    0   1   0
  e6    1   1   0
  e7    0   0   1
  e8    1   1   0
  e9    0   0   1
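For illustration only, the chord-based part of this construction can be sketched in Python: a spanning tree is built first and every chord closes one fundamental cycle (Theorem 6.3 gives m − n + 1 of them for a connected graph). The helper name fundamental_cycles and the edge-list representation are assumptions; the cycles are taken in the underlying undirected graph, and the contraction step of the proof of Theorem 6.5 that orients the basis into nonnegative directed cycles is not reproduced here.

    def fundamental_cycles(n, edges):
        """edges: list of (u, v) pairs of the underlying undirected connected graph.
        Returns one list of edge indices per chord of a spanning tree."""
        adj = [[] for _ in range(n)]
        for idx, (u, v) in enumerate(edges):
            adj[u].append((v, idx))
            adj[v].append((u, idx))
        parent = {0: (None, None)}              # vertex -> (parent vertex, tree edge index)
        stack, tree = [0], set()
        while stack:                            # depth first search builds the spanning tree
            x = stack.pop()
            for y, idx in adj[x]:
                if y not in parent:
                    parent[y] = (x, idx)
                    tree.add(idx)
                    stack.append(y)
        def path_to_root(v):
            p = []
            while parent[v][0] is not None:
                p.append(parent[v][1])
                v = parent[v][0]
            return p
        cycles = []
        for idx, (u, v) in enumerate(edges):
            if idx not in tree:                 # every chord closes one fundamental cycle
                pu, pv = path_to_root(u), path_to_root(v)
                cycles.append([idx] + [e for e in pu + pv if (e in pu) != (e in pv)])
        return cycles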

Remark: A consequence for the structural properties of Petri Nets: we can always find a basis consisting of positive P-invariants in strongly connected event graphs. If the event graph under consideration is not strongly connected it is desirable to find the strongly connected components first.

Theorem 6.6 Let an algorithm A2 be created from the algorithm A1 in such a way that the set of generators X is replaced by a basis B. Then the schedule obtained by algorithm A2 is time-optimal.

Proof: Let us assign one token to each P-invariant corresponding to a column of the schedule matrix Λ.


Then for each marking each token holds at least one processor, because the number of tokens inside a P-invariant is constant and the schedule matrix Λ covers the whole net. The schedule is time-optimal as a consequence of Theorem 6.1. □

Remark 1: The algorithm A2 runs in polynomial time = (time to find B) + at most (m − n + 1) iterations of the algorithm A2.

Remark 2: The algorithm A2 does not find a schedule with a minimal number of processors. When scheduling the instance given in Figure 6.6(a) we obtain a schedule onto three processors:

Λ =
  t1  1  0  0
  t2  1  0  0
  t3  0  0  1
  t4  1  0  0
  t5  1  0  0
  t6  0  1  0
  t7  1  0  0

It is evident that the number of processors is not minimized, because it is possible to find a time-optimal schedule on two processors as shown in Example 6.2.

6.2.3 Quasi-dynamic scheduling

Only structural parallelism was under consideration up to now. In this paragraph we make use of both forms of parallelism - data parallelism and structural parallelism.

The new term 'quasi-dynamic' is used to express the fact that the scheduling policy adopted in this paragraph assigns each task to a set of processors. Nevertheless this scheduling policy is still static, because we are able to specify the processor on which the k-th iteration of a given task will be scheduled.

The quasi-dynamic scheduling policy is based on the observation given in Theorem 6.1, leading to algorithm A3.

Theorem 6.7 Let an algorithm A3 be created from algorithm A2 in such a way that a column of the schedule matrix Λ is replicated as many times as there are tokens in the corresponding positive P-invariant. Then the schedule obtained by algorithm A3 is time-optimal.

Proof: similar to the proof of Theorem 6.6:
1) the schedule matrix Λ covers the whole net
2) the number of tokens inside a P-invariant is constant
3) from 1) and 2) ⇒ each token in any marking holds at least one processor


4) from 3) ⇒ the sufficient condition of Theorem 6.1 is satisfied □

To be exact the algorithm A3 could be written in the following way:

Input: Θ, Pre^T, B, M
Output: Λ
while there exists a zero line in the schedule matrix Λ
    oldB = B
    j = index of the maximum element in (Θ × Pre^T × B)
    q = number of tokens in the P-invariant given by oldB:j
    concatenate q times the column Pre^T × B:j to the schedule matrix Λ
    for i = 1..number of places
        if Bij = 1
        then zero the line Bi:
        endif
    endfor
endwhile
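The only step that distinguishes A3 from A2 is the replication of a chosen column once per token of the corresponding P-invariant. As a sketch only (the helper name replicate_column is an assumption, the notation follows the pseudo-code above):

    def replicate_column(column, B_col, M0):
        """column: allocation of transitions to one processor; B_col[p] = 1 iff place p
        belongs to the chosen P-invariant; M0[p]: initial marking of place p."""
        q = sum(m for m, b in zip(M0, B_col) if b == 1)   # tokens inside the P-invariant
        return [list(column) for _ in range(q)]           # one processor per token (Theorem 6.7)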

6.3 Communications

6.3.1 The problem complexity

To get the best performance from distributed memory computers we need to search for the best compromise between parallelism and communication delays. For a given task graph to be executed, this is a scheduling problem where the communication times between dependent tasks assigned to distinct processors must be taken into account.

In the presence of communication, the above scheduling problem has been found to be much harder than the classical scheduling problem, where communication is ignored. The general problem is NP-hard, and even for simple graphs and an unlimited number of processors the problem is still NP-hard [15], [72], [83]. Only for special classes of DAGs, such as Fork-Join graphs and trees, are special polynomial algorithms known [3]. Good results are achieved for scheduling tasks with communication on an unlimited number of processors with task duplication when small communication times are assumed (see paragraph 2.2.4). A very good survey on scheduling algorithms with communications is given by Chretienne in [16].

As stated above, the distribution of data across processors is of critical importance to the efficiency of the parallel program in a distributed memory system. Since interprocessor communication is much more expensive than a local memory access, it is essential that a processor be able to do as much of its computation as possible using just local data.


Excessive communication among processors can easily offset any gains made by the use of parallelism. Mace [62] has shown that the problem of finding optimal data storage patterns for parallel processing is NP-hard.

Recently several researchers have addressed the problem of automatically determining a data partitioning scheme, or of providing help to the user in this task. Li and Chen [60] address the issue of data movement between processors due to cross-references between multiple distributed arrays. They also describe how explicit communication can be synthesized and communication costs estimated by analyzing reference patterns in the source program [60]. Gupta and Banerjee [38] introduce the notion of constraints on data distribution, and show how, based on performance considerations, a compiler identifies constraints to be imposed on the distribution of various data structures. These constraints are then combined by the compiler to obtain a complete and consistent picture of the data distribution. Tzen and Li [89] propose a data dependence uniformisation to overcome the difficulties in parallelizing a doubly nested loop with irregular dependence constraints. Their approach is based on the concept of vector decomposition.

Some of the approaches mentioned above have the problem of restricted applicability. It seems that any strategy for automatic data partitioning can be expected to work well only for applications with a regular computational structure and static dependence patterns that can be determined at compile time.

The method studied in the following paragraph does not find an optimal schedule with communications, but it tends to detect global communications in a task schedule.

6.3.2 Finding global communications

Global data movements such as broadcasting or gathering occur quite frequently in certain classes of algorithms [7]. If such data movement patterns can be identified from the PN model representation of the algorithm, then the calls to communication routines can be issued without detailed knowledge of the target machine, while the communication routines themselves are optimized for the specific target machine.

Definition 6.5 Let ∆ = C × Λ be a matrix of size [number of places, number of processors] representing data movements among processors so that:
∆ij = 1 signifies that processor j is a source of data i,
∆ij = −1 signifies that processor j is a destination of data i.

Rule 6.2: With the use of the matrix ∆ specifying the source and the destination of each communication, find the undirected graph where the vertices Pi and Pj corresponding to places are connected iff they could be elements of one global communication. Places Pi and Pj can be elements of one global communication iff: (there is no generator containing both places) AND (they have a common source processor OR a common destination processor).
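For illustration, Definition 6.5 and the test of Rule 6.2 can be sketched in Python; the helper names data_movements and can_join, and the nested-list representation of C, Λ and the generators, are assumptions, not part of the thesis.

    def data_movements(C, Lam):
        """Delta = C x Lambda.  C[p][t] = Post - Pre incidence of place p and transition t;
        Lam[t][k] = 1 iff transition t runs on processor k."""
        return [[sum(C[p][t] * Lam[t][k] for t in range(len(Lam)))
                 for k in range(len(Lam[0]))] for p in range(len(C))]

    def can_join(Delta, generators, p, q):
        """True iff places p and q may be elements of one global communication.
        generators: list of place sets (Q+ generators)."""
        if any(p in g and q in g for g in generators):
            return False                              # same generator: sequential, not one primitive
        same_src = any(Delta[p][k] == 1 and Delta[q][k] == 1 for k in range(len(Delta[p])))
        same_dst = any(Delta[p][k] == -1 and Delta[q][k] == -1 for k in range(len(Delta[p])))
        return same_src or same_dst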

Figure 6.7: Schedule of the cyclic version of the matrix[3,3]-vector[3] multiplication (from Figure 6.4 by elimination of implicit places); the figure shows places P1-P22, transitions T1-T14 and the processor assigned to each transition.

Example 6.3: When applying algorithm A3 to the cyclic version of the matrix-vector multiplication given in Figure 6.7 we obtain the following schedule on 9 processors (the processor number is indicated on each transition in Figure 6.7):

Λ = (rows: transitions T1-T14, columns: processors 1-9)

    T1   0 0 0 0 0 1 0 0 0
    T2   0 0 0 0 0 0 1 0 0
    T3   0 0 1 0 0 0 0 0 0
    T4   0 0 1 0 0 0 0 0 0
    T5   0 0 0 0 1 0 0 0 0
    T6   0 0 0 1 0 0 0 0 0
    T7   0 1 0 0 0 0 0 0 0
    T8   0 1 0 0 0 0 0 0 0
    T9   0 0 0 0 0 0 0 1 0
    T10  0 0 0 0 0 0 0 0 1
    T11  1 0 0 0 0 0 0 0 0
    T12  1 0 0 0 0 0 0 0 0
    T13  1 0 0 0 0 0 0 0 0
    T14  1 0 0 0 0 0 0 0 0

The places corresponding to communicated data are indicated by thick circles in Figure 6.7.


When applying Rule 6.2 we obtain the undirected graph shown in Figure 6.8, where the cliques of vertices are indicated by big circles. We can therefore make the following observations:
- places P2, P3, P9, P10, P16, P17 form a global communication that could be combined with places P1, P8. The global communication is a gathering because there is just one source processor, processor 1 (see Figure 6.7), and the data are personalized.
- places P5, P6 and places P12, P13 form two scatterings that could be combined with places P1 and P8.

- places P7, P14, P19, P20 form a scattering.

This structural analysis of global communications depends on the schedule Λ (here given as a time-optimal schedule) but is independent of the communication overhead. It therefore shows all possible combinations of global communications (the cliques in the graph of Figure 6.8).

Figure 6.8: Global communication primitives of matrix-vector multiplication

6.3.3 Relation to automatic parallelization

We propose a three-phase method for scheduling algorithms on a specific machine:
1) Schedule the tasks on an unbounded number of processors of a completely connected architecture.
2) Find global communications.
3) Map the task schedule with global communications onto the specific machine, taking into consideration the number of tasks, the number of processors and the limitations of the machine topology given by the global communications.

Figure 6.9: Automatic parallelization (algorithm → modelling → model → model reduction → reduced model → scheduling → schedule → detection of global communications → parallel algorithm → mapping → implementation; machine-dependent inputs: instruction time, communication time, communication model, topology)

Figure 6.9 briefly describes this method, including the algorithm modelling given in more detail in the previous chapter (see Figure 5.18). Information is represented by ellipses and methods by rectangles. The objective of this method is to perform as many operations as possible independently of the specific machine.
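The flow of Figure 6.9 can be summarized by a short driver; in the sketch below every phase function is a hypothetical placeholder supplied by the caller, standing for the corresponding step of the text rather than an existing implementation:

    def parallelize(algorithm, machine,
                    model, reduce_model, schedule, find_global_comms, map_onto):
        # Machine-independent part (Chapters 5 and 6).
        net = model(algorithm)                     # Petri Net modelling
        reduced = reduce_model(net)                # removal of implicit/self-loop places
        sched = schedule(reduced)                  # phase 1: unbounded number of processors
        comms = find_global_comms(reduced, sched)  # phase 2: global communications (Rule 6.2)
        # Machine-dependent part: instruction/communication times, comm. model, topology.
        return map_onto(sched, comms, machine)     # phase 3: mapping onto the machine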

Such a parallelization does not guarantee optimality, but it has the following advantages:
i) it should not be far from the time-optimal schedule, owing to the parallelism detected in the first phase,
ii) it represents a parallel algorithm in a very simple manner (see for example the representation of the parallel algorithm in paragraph 3.4), especially for algorithms performing matrix computations,
iii) the resulting implementation of the parallel algorithm can achieve a very good efficiency, because the global communications are able to minimize the communication start-up β by joining several communications together,
iv) the global communications can be optimized for a specific machine separately from the parallel algorithm.
On the other hand, it is important to notice that the resulting schedule can be far from time-optimal if we tend to make communications as global as possible.

6.4 Conclusion

The objective of this chapter was to bring original ideas to automatic parallelization.

Some of the most distinctive features of the chapter are:

• It specifies the terms 'data parallelism' and 'structural parallelism' in order to understand where parallelism comes from.

• It clarifies the relation between cyclic and noncyclic scheduling.

• An attempt is made to see the scheduling problem from the token's point of view.

• Via two cyclic scheduling algorithms exploiting only the structural parallelism, it arrives at an original cyclic scheduling algorithm called 'quasi-dynamic scheduling'. Quasi-dynamic scheduling is time-optimal and achieves the same results as the so-called 'Periodic scheduling' (see [16]) and very similar results to cyclic scheduling based on (max,+) algebra (see [4]), but it uses its own reasoning. This reasoning allows quasi-dynamic scheduling to be extended to general Petri Nets (not only event graphs), provided a positive P-invariant basis B can be found for general Petri Nets.

• It proposes an original parallelization method performing a maximum of operations independently of the target machine.

• It underlines the importance of global communication primitives and proposes a simple introductory technique to detect them.


Chapter 7

Conclusion

Developing efficient programs for parallel computers is difficult because of the architectural complexity of those machines. There are at least two reasons why the parallelization of sequential programs is important. The first is that there are many existing sequential programs that it would be convenient to execute on parallel computers. The second is that powerful parallelizers should facilitate programming by allowing much of the code to be developed in a familiar sequential programming language.

The aim of this thesis was to present a new dataflow model of parallel algorithms and its analysis based on Petri Nets and P-invariant generators. In order to develop a parallel algorithm, such an analysis is followed by a simple scheduling and a detection of global communication primitives.

Summary and thesis contribution

Chapter 2 was an introductory chapter presenting the terminology and some elementary methods from the field of parallel processing. It covered the majority of the important topics in parallel processing and systematically classified the large and redundant terminology needed for a comprehensive reading of the rest of the thesis. It introduced the concepts of data dependence graphs, static scheduling techniques and global communication primitives.

Chapter 3 presented a usual approach to parallel processing where parallelization is not done automatically. The benefit of this chapter for the rest of the thesis lies namely in the fact that we have presented a typical algorithm performing regular numeric computations where a good overall speedup is achieved through a compromise between granularity and communication. Both theoretical and experimental results showed that even if the data parallelism is very low, it is possible to obtain good results with the use of structural parallelism. Parallelism detection in iterative algorithms is a very complex task and it needs a deep algorithm analysis. This fact formulated the objective for the rest of the thesis - to facilitate the task of the programmer by translating some sections of the code and performing operations exploiting parallelism and detecting global data movements.

Chapter 4 showed that Petri Nets are a convenient tool for modelling systems comprising parallelism and synchronization. It was argued that only positive invariants are of interest when analyzing structural net properties, and the notion of Q+ generators was introduced. These vectors, generating all non-negative left annullers of a net's flow matrix, were used in the data dependence analysis in Chapter 5 and in the scheduling algorithms in Chapter 6. Three existing algorithms finding a set of generators were explained, implemented and evaluated. A simple argument showed that the set of Q+ generators is of non-polynomial size. That is why a simple reduction method applicable to many "real life" nets was given.

The basic structural features of algorithms are dictated by their data and control dependencies. These dependencies express the precedence relations of computation that need to be satisfied in order to compute the problem correctly. The absence of dependencies indicates the possibility of simultaneous computations (the fewer the dependencies, the bigger the parallelism). That is the principal reason why we are so interested in the removal of dependencies. Chapter 5 presented a modelling strategy which is coherent for noniterative as well as iterative problems. We showed that the data dependencies of iterative problems with 'general uniform constraints' can be modelled by an event graph. We distinguished between two modelling approaches - the first based on the problem analysis and the second based on the sequential algorithm, for which we introduced the term IP-dependencies. We showed how to simplify the model by the reduction of implicit places and self-loop places. The importance of antidependencies and output dependencies was underlined, and algorithm transformations based on Petri Net analysis were shown. This original approach allowed us to put the knowledge of automatic parallelization via data dependence graphs and via Petri Nets on the same theoretical platform and to join the two scientific branches.

In the last chapter we focused our interest on specific questions that can be useful when designing parallel compilers. In order to better understand the nature of the problem we clarified the terms data parallelism, structural parallelism and cyclic scheduling. We have shown how to find a positive basis of P-invariants in event graphs. We proposed a quasi-dynamic scheduling algorithm finding a cyclic schedule of nonpreemptive tasks with precedence constraints and no communication delays on an unlimited number of identical processors. We proved the time-optimality of the given algorithm, which is based on the token's view of the scheduling problem.

Finally, we proposed an introductory technique identifying data movement patterns from the Petri Net representation of the algorithm, allowing the calls to communication routines to be issued without a detailed knowledge of the target machine.


Future work

The ultimate goal of research in program parallelization is to develop a methodology that will be effective in translating a wide range of sequential programs for use with several classes of parallel machines. Although it is not clear how close we are to that goal, it is clear that we are not there yet and that our research effort must continue because of the great benefit that effective parallelizers bring to ordinary users.

This thesis contributes to the above-mentioned research effort and opens new research topics:

• To unify the redundant terminology of parallel processing.

• To develop a new Petri Net reduction technique leading to a 'recipe' containing information about the structural properties. With the use of more advanced reduction rules [91] it is possible to go further than the Fork-Join reduction algorithms (see Algorithm 4.6).

• To adopt methods developed in automatic parallelization via data dependence graphs and to elaborate them with the use of Petri Nets.

• To use a good model to formalize some hints and tricks arising from application programs, to compare scheduling strategies (which often depend on the target application) and to bring new ideas to scheduling theory.

• To develop an algorithm finding a positive P-invariant basis B for general Petri Nets. The quasi-dynamic scheduling algorithm would then be immediately applicable to general Petri Nets, which are possibly able to model algorithms with conditional branching.

• To develop a formal specification of parallel algorithms with global data movements in a similar way to the Petri Net algorithm representation. Such a specification can be advantageous particularly for algorithms performing matrix operations. The possible use of coloured Petri Nets or other abbreviation techniques should be considered, keeping in mind the complexity of the structural properties of coloured Petri Nets.

• To enjoy research!


Bibliography

[1] J. Allen, K. Kennedy, Automatic Translation of Fortran Programs to Vector Form, ACM Trans. on Programming Languages and Systems, Vol. 9, No. 4 (1987) 491-542.

[2] P. Atkin, Performance Maximization, INMOS Technical Note 17, 72 TCH01700, 1987.

[3] F.D. Anger, J. Hwang, Y. Chow, Scheduling with Sufficient Loosely Coupled Processors, J. Parallel Distrib. Computing, Vol. 9 (1990) 87-92.

[4] F. Baccelli, G. Cohen, G.J. Olsder, J.P. Quadrat, Synchronization and Linearity: An Algebra for Discrete Event Systems, John Wiley & Sons, (1992).

[5] U. Banerjee, R. Eigenmann, A. Nicolau, D.A. Padua, Automatic Program Parallelization, Proc. of IEEE, Vol. 81, No. 2 (1993).

[6] P. Banerjee et al., Parallel Simulated Annealing Algorithm for Standard Cell Placement on a Hypercube Multiprocessor, IEEE Transactions on Parallel and Distributed Systems, Vol. 1 (1990) 91-106.

[7] D.P. Bertsekas and J.N. Tsitsiklis, Parallel and Distributed Computation - Numerical Methods, (Prentice Hall, 1989).

[8] G. Blelloch and C.R. Rosenberg, Network Learning on the Connection Machine, in Proc. IJCAI, (Milano, 1987) 323-326.

[9] G.H. Bradley, Algorithms for Hermite and Smith Normal Matrices and Linear Diophantine Equations, Math. Comp., Vol. 25 (1971) 897-907.

[10] J.W. Brams, Reseaux de Petri: theorie et pratique, Masson Ed., Paris (1983).

[11] J. Carlier, P. Chretienne, C. Girault, Modelling Scheduling Problems with Timed Petri Nets, LNCS 188, Springer-Verlag (1985).

[12] J. Carlier, P. Chretienne, Timed Petri Net Schedules, LNCS 340, Springer-Verlag (1988) 62-84.


[13] Y. Chen, W.T. Tsai, D. Chao, Dependency Analysis - A Petri Net Based Technique for Synthesizing Large Concurrent Systems, IEEE Trans. on Par. and Distr. Systems, Vol. 4, No. 4 (1993) 414-426.

[14] P. Chretienne, Transient and Limiting Behavior of Timed Event Graphs, RAIRO-TSI, 4, (1985) 127-192.

[15] P. Chretienne, A Polynomial Algorithm to Optimally Schedule Tasks Over a Virtual Distributed System Under Tree-like Precedence Constraints, Eur. J. Oper. Res., 43, (1989) 348-354.

[16] P. Chretienne, E.G. Coffman, J.K. Lenstra, Z. Liu, Scheduling Theory and its Applications, John Wiley & Sons, (1995).

[17] V. Chvatal, Linear Programming, W.H. Freeman and Company, New York (1983).

[18] J.Y. Colin, P. Chretienne, CPM Scheduling with Small Communication Delays, Oper. Res., 39, (1991) 680-684.

[19] J.M. Colom, M. Silva, Convex Geometry and Semiflows in P/T Nets - A Comparative Study of Algorithms for Computation of Minimal P-semiflows, Advances in Petri Nets 1990, LNCS 483, Springer-Verlag, (1990), 79-112.

[20] F. Commoner, A.W. Holt, S. Even, A. Pnueli, Marked Directed Graphs, Journal of Computer and System Sciences, Vol. 5, (1971) 511-523.

[21] S.A. Cook, The Complexity of Theorem-Proving Procedures, Proceedings of the 3rd Annual ACM Symposium on Theory of Computing, (1971) 151-158.

[22] R. David, H. Alla, Du Grafcet aux Reseaux de Petri, Hermes, Paris (1989).

[23] J. Demel, Graphs (in Czech), (SNTL Prague, 1989).

[24] V. Demian, J.-C. Mignot, Optimization of the Self-Organizing Feature Map on Parallel Computers, in Proc. IJCNN, (Nagoya, 1993) 483-486.

[25] F. Desprez and B. Tourancheau, Modelisation des Performances de Communication sur le Tnode avec le Logical System Transputer Toolset, La lettre du transputer et des calculateurs distribues, (1990) 65-72.

[26] N. Dodd, Graph Matching by Stochastic Optimisation Applied to the Implementation of Multi-layer Perceptrons on Transputer Networks, Parallel Computing, Vol. 10 (1989) 135-142.

[27] H. El-Rewini, T.G. Lewis, H.H. Ali, Task Scheduling in Parallel and Distributed Systems, Prentice Hall (1994).


[28] J. Farkas, Theorie der einfachen Ungleichungen, Journal fur die reine und angewandte Mathematik, 124 (1902) 1-27.

[29] J.B.J. Fourier, Solution d'une question particuliere du calcul des inegalites, Oeuvres II, Gauthier-Villars, Paris (1826), 317-328.

[30] J.Ch. Fiorot, M. Gordan, Resolution des systemes lineaires en nombres entiers, E.D.F. - Bulletin de la Direction des Etudes et Recherche, Serie C - Mathematique, Informatique, No. 2 (1969) 65-116.

[31] M.J. Flynn, Very High-Speed Computing Systems, Proc. IEEE, Vol. 54 (1966) 1901-1909.

[32] M.A. Frumkin, Polynomial Algorithms in the Theory of Linear Diophantine Equations, in: M. Karpinski (ed.): Fundamentals of Computation Theory, LNCS 56, Springer, Berlin (1977) 386-392.

[33] Y. Fujimoto, N. Fukuda, T. Akabane, Massively Parallel Architectures for Large Scale Neural Networks Simulations, IEEE Transactions on Neural Networks, Vol. 3, No. 6 (1992) 876-887.

[34] A. Gerasoulis, T. Yang, On the Granularity and Clustering of Directed Acyclic Task Graphs, IEEE Trans. on Par. and Distr. Systems, Vol. 4, No. 6 (1993) 686-701.

[35] A. Gerasoulis, T. Yang, DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors, IEEE Transactions on Parallel and Distributed Systems, Vol. 5, No. 9 (1994) 951-967.

[36] R.L. Graham, Bounds for Certain Multiprocessing Anomalies, Bell System Tech. J., Vol. 45, (1966) 1563-1581.

[37] R.L. Graham, Bounds on Multiprocessing Timing Anomalies, SIAM J. Appl. Math., Vol. 17, (1969) 416-429.

[38] M. Gupta, P. Banerjee, Demonstration of Automatic Data Partitioning Techniques for Parallelizing Compilers on Multicomputers, IEEE Trans. on Par. and Distr. Systems, Vol. 3, No. 2 (1992) 179-193.

[39] Z. Hanzalek, Real-time Neural Controller Implemented on Parallel Architecture, in: A. Crespo (ed.): Proc. Artificial Intelligence in Real-Time Control, Elsevier Science, Amsterdam (1995) 313-316.

[40] Z. Hanzalek, Parallel Algorithm Design - Example, in: Ch.H. Nevison (ed.): Proc. Parallel Computing for Undergraduates, Colgate University, USA, (1994), 1-10.


[41] Z. Hanzalek, Neural Networks Simulation on Massively Parallel Architecture, in: L. Kulhavy, M. Karny, K. Warwick (eds.): Proc. of the IEEE European Workshop on Computer-Intensive Methods in Control and Signal Processing, UTIA CAV and University of Reading, Prague (1994), 299-305.

[42] Z. Hanzalek, Laboratory for Distributed Real-Time Control, in: J. Zalewski (ed.): Proc. of the IEEE Workshop on Real-Time Systems Education, Daytona Beach, IEEE Computer Society Press, Los Alamitos, Calif. (1996), 99-105.

[43] Z. Hanzalek, Finding Global Communication from a Petri Net Algorithm Representation, in: Proc. of the 2nd IEEE European Workshop on Computer Intensive Methods in Control and Signal Processing, UTIA CAV and University of Reading, Prague (1996), 37-43.

[44] Z. Hanzalek, E. Schmieder, P. Wenzel, Creation of the Laboratory for Fieldbus-based Automation Systems, in: D. Mosse (ed.): Proc. of the Second IEEE Real-Time Education Workshop, Montreal (1997), to be published by IEEE Computer Society Press.

[45] Z. Hanzalek, A Parallel Algorithm for Gradient Training of Feedforward Neural Networks, journal article accepted for publication by Parallel Computing, Elsevier Science.

[46] C. Hermite, Sur l'introduction des variables continues dans la theorie des nombres, J. Reine Angew. Math., Vol. 41 (1851) 191-216.

[47] J.A. Hoogeveen, J.K. Lenstra, B. Veltman, Three, Four, Five, Six, or the Complexity of Scheduling with Communication Delays, Oper. Res. Letters, 16, (1994) 129-137.

[48] D.R. Hush and B.G. Horne, Progress in Supervised Neural Networks, IEEE Signal Processing Magazine, Vol. 10 (1993) 8-39.

[49] K. Jensen, How to Find Invariants for Coloured Petri Nets, DAIMI PB-120, Computer Science Department, Aarhus University (1980).

[50] J. Kadlec, F.M.F. Gaston, G.W. Irwin, The Block Regularised Parameter Estimator and its Parallel Implementation, IFAC Automatica, Vol. 31, No. 7 (1995) 1125-1136.

[51] R. Kannan, A. Bachem, Polynomial Algorithms for Computing the Smith and Hermite Normal Forms of an Integer Matrix, SIAM J. Comput., Vol. 8, No. 4 (1979) 499-507.


[52] F. Kruckeberg, M. Jaxy, Mathematical Methods for Calculating Invariants in Petri Nets, in: G. Rozenberg (ed.): Advances in Petri Nets, LNCS 266, Springer (1987) 104-131.

[53] D.J. Kuck, R.H. Kuhn, D.A. Padua, B. Leasure, M. Wolfe, Dependence Graphs and Compiler Optimisation, Proc. 8th ACM Symp. on Principles of Programming Languages, (1981) 207-218.

[54] D.J. Kuck, R.H. Kuhn, B. Leasure, M. Wolfe, The Structure of an Advanced Vectorizer for Pipelined Processors, Proc. COMPSAC 80, The 4th Int. Computer Software and Applications Conf., (1980) 709-715.

[55] O. Kummer, M.O. Stehr, Petri's Axioms of Concurrency - A Selection of Recent Results, LNCS 1248, Springer-Verlag (1997) 195-215.

[56] S.Y. Kung, J.N. Hwang, Parallel Architectures for Artificial Neural Nets, in Proc. ICNN, (San Diego, 1988) Vol. 2, 165-172.

[57] T.E. Lange, Simulation of Heterogeneous Neural Networks on Serial and Parallel Machines, Parallel Computing, Vol. 14 (1990) 287-303.

[58] K. Lautenbach, H.A. Schmid, Use of Petri Nets for Proving Correctness of Concurrent Process Systems, IFIP 74, North Holland Pub. Co., (1974) 187-191.

[59] J.K. Lenstra, A.H.G. Rinnooy Kan, The Complexity of Scheduling under Precedence Constraints, Oper. Res., 26, (1978) 22-35.

[60] J. Li, M. Chen, Generating Explicit Communication from Shared-Memory Program References, in: Proc. Supercomput. '90, New York, NY, (1990) 865-876.

[61] R.P. Lippmann, An Introduction to Computing with Neural Nets, IEEE ASSP Magazine, Vol. 4(2), (1987) 4-22.

[62] M. Mace, Memory Storage Patterns in Parallel Processing, Boston, MA: Kluwer Academic (1987).

[63] J. Martinez, M. Silva, A Simple and Fast Algorithm to Obtain All Invariants of a Generalised Petri Net, in: C. Girault, W. Reisig (eds.): Application and Theory of Petri Nets, Informatik Fachberichte No. 52, Springer (1982), 301-310.

[64] J.A. McHugh, Algorithmic Graph Theory, (Prentice Hall, 1990).

[65] D.I. Moldovan, Parallel Processing - From Applications to Systems, (Morgan Kaufmann Publishers, 1993).


[66] T. Murata, Petri Nets: Properties, Analysis and Applications, Proc. IEEE, Vol. 77, No. 4 (1989) 541-580.

[67] J. Murre, Transputers and Neural Networks: An Analysis of Implementation Constraints and Performance, IEEE Transactions on Neural Networks, Vol. 4, No. 2 (1993) 284-292.

[68] M. Newman, Integral Matrices, Academic Press, New York, (1972).

[69] K. Obermayer, H. Ritter, K. Schulten, Large-scale Simulations of Self-Organizing Neural Networks on Parallel Computers: Application to Biological Modelling, Parallel Computing, Vol. 14 (1990) 381-404.

[70] D.A. Padua and M.J. Wolfe, Advanced Compiler Optimizations for Supercomputers, Comm. ACM, Vol. 29, (1986) 1184-1201.

[71] M. Paludetto, Sur la commande temps reel des procedes industriels: une methodologie basee objets et reseaux de Petri, PhD Thesis, Rapport LAAS-CNRS No. 91467, 1991.

[72] C. Papadimitriou and M. Yannakakis, Towards an Architecture Independent Analysis of Parallel Algorithms, SIAM J. Computing, Vol. 19, No. 2 (1990) 322-328.

[73] K.H. Pascoletti, Diophantische Systeme und Losungsmethoden zur Bestimmung aller Invarianten in Petri-Netzen, Berichte der GMD, Bonn, No. 160 (1986).

[74] H. Paugam-Moisy, Parallelisation de Reseaux de Neurones Artificiels sur Reseaux de Transputers, La lettre du transputer et des calculateurs distribues (1992) 7-18.

[75] J.L. Peterson, Petri Net Theory and the Modelling of Systems, Prentice Hall (1981).

[76] C.A. Petri, Kommunikation mit Automaten, Bonn: Institut fur Instrumentelle Mathematik, Schriften des IIM Nr. 3, 1962.

[77] A. Petrowski, G. Dreyfus, C. Girault, Performance Analysis of a Pipelined Backpropagation Parallel Algorithm, IEEE Transactions on Neural Networks, Vol. 4 (1993) 970-981.

[78] C. Picouleau, Etude de problemes d'optimisation dans les systemes distribues, These, Universite Pierre et Marie Curie, (1992).

[79] A. Pinti et al., Etude d'un Reseau de Neurones Multi-couches pour l'Analyse Automatique du sommeil sur T-Node, La lettre du transputer et des calculateurs distribues (1990) 21-32.


[80] C. Polychronopoulos, M. Reza Haghighat, C.L. Lee, B. Leung, D. Schouten, Parafrase-2: A New Generation Parallelizing Compiler, Int. J. High Speed Computing, Vol. 1, No. 1 (1989) 45-72.

[81] D.A. Pomerleau, G.L. Gusciora, D.S. Touretzky, H.T. Kung, Neural Network Simulation at Warp Speed: How We Got 17 Million Connections per Second, in Proc. ICNN, (San Diego, 1988) Vol. 2, 143-150.

[82] C. Ramchandani, Analysis of Asynchronous Systems by Timed Petri Nets, PhD Thesis, MIT, 1973.

[83] V. Sarkar, Partitioning and Scheduling Parallel Programs for Execution on Multiprocessors, Cambridge, MA: MIT Press (1989).

[84] M. Silva, R. Valette, Petri Nets and Flexible Manufacturing, Advances in PN'84, LNCS 84, Springer-Verlag.

[85] T. Tollenaere, Bibliography on Neural Networks on Parallel Machines, Parallel Computing, Vol. 14 (1990) 1-12.

[86] T. Tollenaere and G.A. Orban, Simulating Modular Neural Networks on Message-passing Multiprocessors, Parallel Computing, Vol. 17 (1991) 361-379.

[87] N. Treves, Comprehensive Study of Different Techniques for Semi-flows Computation in Place/Transition Nets, Advances in Petri Nets 1989, LNCS 424, Springer-Verlag, (1989) 433-452.

[88] P. Tvrdik, Parallel Systems and Algorithms (in Czech), Publishing House of CTU, Prague (1994).

[89] T.H. Tzen, L.M. Ni, Dependence Uniformisation: A Loop Parallelisation Technique, IEEE Trans. on Par. and Distr. Systems, Vol. 4, No. 5 (1993) 547-558.

[90] J. Valdes, R.E. Tarjan, E.L. Lawler, The Recognition of Series-Parallel Digraphs, SIAM J. Comput., Vol. 11 (1982) 298-313.

[91] R. Valette, Analysis of Petri Nets by Stepwise Refinement, J. Comput. Syst. Sci., Vol. 18 (1979) 35-46.

[92] R. Valette, Les Reseaux de Petri, polycopie, LAAS-CNRS, Toulouse (1992).

[93] H.P. Williams, Fourier-Motzkin Elimination Extension to Integer Programming Problems, Journal of Combinatorial Theory, (A) 21 (1976) 118-123.


Index

E - efficiency, 19
F* - full duplex, ∆-port, 21
F1 - full duplex, one port, 21
H* - half duplex, ∆-port, 21
H1 - half duplex, one port, 21
I^l_j(k) - input to the cell body of neuron j in layer l, 33
L - message length, 22
M - marking, 52
N_l - number of neurons in layer l, 33
P - set of places, 52
Post - postcondition matrix, 52
Pre - precondition matrix, 52
S - speedup, 19
T - set of transitions, 52
T_par - time of parallel execution, 19
T_seq - time of sequential execution, 19
X - set of P-invariant generators, 61
α^l - momentum term in layer l, 33
δ^l_i(k) - error back propagated through the cell body of neuron i in layer l, 33
η^l - learning rate in layer l, 33
ρ-approximation algorithm, 14
τ - communication time of the unit length message, 22
~s - firing sequence, 53
f - algorithm fraction, 20
f^T - P-invariant, 55
g - number of generators, 61
k - algorithm iteration index, 33
n - problem size, 19
p - number of processors, 19
r - diameter, 22, 25, 47
s^T - T-invariant, 55
u^l_j(k) - output of neuron j in layer l, 33
w^l_ij(k) - weight of synapse between cell body i in layer l − 1 and cell body j in layer l, 33

algorithm, 7
Amdahl's law, 20
ATA - All to All, 25, 26
ATO - All to One, 26
broadcasting, 25
cell body, 35
circuit switching, 21
coarse-grain parallelism, 10
communication/computation ratio, 20
connectivity: arc, 22; node, 22
conservative component, 54
constant time, 22
CTT - Cascaded Torus Topology, 35
CVP - cell virtual processor, 35
DAG - directed acyclic graph, 10
data dependence, 7
data parallelism, 9
DDG - data dependence graph, 7
Diophantine system, 106
duality, 27
dynamic allocation, 12
error back propagation, 32
event graph, 61
FIFO, 128
fine-grain parallelism, 10
firing rule, 53
full-duplex, 21
functional parallelization, 6
fundamental equation, 54
gathering, 26
generators, 59
gossiping, 25, 28, 29
granularity, 10
half-duplex, 21
hard real-time schedule, 12
hierarchy, 26, 47
hypercube, 23
implicit place, 128
incidence matrix, 54
invariant: minimal, 58; standardized, 57
IO - input/output virtual processor, 35
IP dependence, 140
linear processor array, 22
load balancing, 12
loosely coupled systems, 11
mesh, 23; with wraparound, 23
message passing, 6
message-switched, 21
MIMD - Multiple Instruction, Multiple Data stream, 6
MP - message passing process, 39
multinode accumulation, 26
NN - neural network, 32
node splitting, 9
NP - node processor, 37
NP-complete, 14
OTA - One to All, 25
overlap of communication and computation, 41
parallelism detection, 7
pATA - personalized All to All, 26
pATO - personalized All to One, 26
Petri Net, 51
pid - processor identity number, 23
places, 52
port: ∆-port, 21; k-port, 21; 1-port, 21
positive P-invariant, 55
positive T-invariant, 55
pOTA - personalized One to All, 26
preemptive scheduling, 12
queueing delay, 25
repetitive component, 55
ring, 23, 28
scattered mapping, 39
scattering, 26, 28, 29
scheduling, 12
self loop place, 53
self loop transition, 53
shared-memory, 6
SIMD parallelization, 6
SIMD - Single Instruction, Multiple Data stream, 6
single node accumulation, 26
sink transition, 53
SISD - Single Instruction, Single Data stream, 5
source transition, 53
spanning tree, 25
star, 23
start-up, 22
static allocation, 12
stochastic gradient learning, 32
store-and-forward, 21
SVP - synapse virtual processor, 35
synapses, 35
tightly coupled systems, 11
token, 52
topologies, 22
torus, 23, 29, 37
total exchange, 26
transitions, 52; enabled, 53
tree, 23
uniform graph, 129
variable renaming, 8
VP - virtual processor, 35
workload distribution, 39