DESIGN OF ALGORITHM TRANSFORMATIONS FOR
VLSI ARRAY PROCESSING
by
RAVISHANKAR DORAIRAJ, B.E.
A THESIS
IN
COMPUTER SCIENCE
Submitted to the Graduate Faculty of Texas Tech University in Partial Fulfillment of the Requirements for
the Degree of
MASTER OF SCIENCE
December, 1986
ACKNOWLEDGEMENTS
I am deeply indebted to the Committee Chairman, Dr. Gopal Lakhani, for his guidance in preparing this thesis. I am also grateful to the members of my Committee, Dr. Ichiro Suzuki and Dr. Don Gustafson, for their invaluable assistance.

I wish to express grateful appreciation to my family members and friends for their encouragement and help.
ABSTRACT
The rapid advances in very large scale integrated (VLSI) technology have created a flurry of research in designing future computer architectures. Many methods have been developed for parallel processing of algorithms by directly mapping them onto parallel architectures.

A procedure, based on the mathematical transformation of the index set and dependence vectors of an algorithm, is developed to find algorithm transformations for VLSI array processing. The algorithm is modeled as a program graph, which is a directed graph. Techniques are suggested to regularize the data-flow in an algorithm, thereby minimizing the communication requirements of the architecture.

We derive a set of sufficient conditions on the structure of data-flow of a class of algorithms for the existence of valid transformations. The VLSI array is modeled as a directed graph, and the program graph is mapped onto this graph using the algorithm transformation.
CONTENTS

ACKNOWLEDGEMENTS ii

ABSTRACT iii

CHAPTER

1. INTRODUCTION 1
   1.1 VLSI Architectural Design Principles 1
   1.2 VLSI Array Processors 3
       1.2.1 Systolic Array Architecture 4
   1.3 Synthesization of VLSI Array Algorithms 7
       1.3.1 Methodologies for VLSI Array Design 10
       1.3.2 Moldovan's Method to Design VLSI Arrays 16
   1.4 Limitations of the Existing Methods 18
   1.5 Outline of the Thesis 20

2. ALGORITHM MODEL AS A PROGRAM GRAPH 23
   2.1 Data Dependence Vectors 23
   2.2 Program Graphs 25
   2.3 Program Graph Model 27
   2.4 Pipelining of Data 27
       2.4.1 Example to Illustrate Pipelining 30
   2.5 Variable Dependence Vectors 32
   2.6 Modelling of the Dynamic Programming Algorithm as a Program Graph 38

3. THE TIMING FUNCTION 45
   3.1 The Description of the Timing Function 45
   3.2 Concurrency 46
   3.3 Sufficient Conditions for the Existence of the Timing Function 47
       3.3.1 Restrictions on the Class of Algorithms 49
   3.4 The Definition of the Timing Function 54
   3.5 The Timing Function for the Dynamic Programming Algorithm 56

4. THE ALLOCATION FUNCTION 60
   4.1 Model for VLSI Arrays 60
   4.2 The Definition of the Allocation Function 61
   4.3 VLSI Implementation of the Dynamic Programming Algorithm 66

5. PROCEDURE FOR VLSI IMPLEMENTATION OF ALGORITHMS 72
   5.1 Procedure for Mapping Algorithms onto VLSI Arrays 72
   5.2 VLSI Implementation of the Shortest Path Algorithm 75

6. CONCLUSIONS 90
   6.1 Contributions 90
   6.2 Future Work 91

REFERENCES 92
LIST OF FIGURES

1. VLSI Implementation for Polynomial Evaluation 6
2. Systolic Array Configurations 8
3. Types of Data-flow 28
4. Data-flow of the Convolution Product Algorithm 32
5. Data-flow when g is Many-to-one Function 36
6. Data-flow when u is Many-to-one Function 39
7. Original Program Graph of the Dynamic Programming Algorithm 42
8. Modified Program Graph of the Dynamic Programming Algorithm 44
9. Time Model of the Dynamic Programming Algorithm 48
10. Partitioning of the Index Space of the Dynamic Programming Algorithm 50
11. Data-flow Defined by Restriction R3 53
12. Application of the Timing Function on the Program Graph 59
13. A Square Array with 8-neighbor Connections 67
14. VLSI Array for the Dynamic Programming Algorithm 71
15. Original Program Graph of the Shortest Path Algorithm 77
16. Modified Program Graph of the Shortest Path Algorithm 79
17. VLSI Array for the Shortest Path Algorithm 88
LIST OF TABLES

1. Algorithms and Desired VLSI Array Structures 9
2. Data Dependence Vectors of the Dynamic Programming Algorithm 41
3. Data Dependence Vectors of the Shortest Path Algorithm 76
CHAPTER 1
INTRODUCTION
High-performance computers are under heavy demand in the areas of scientific and engineering applications. Even though faster and more reliable hardware devices do achieve high performance, major improvements in computer architecture and processing techniques are in order. Advanced computer architectures are centered around the concept of parallel processing. Parallel processing computers provide a cost-effective means to achieve high performance through concurrent activities.
The rapid advances in very large scale integrated (VLSI) technology have created a new architectural horizon in implementing parallel algorithms directly in hardware. It has been projected that by the late eighties it will be possible to fabricate VLSI chips which contain more than 10^6 individual transistors. The use of VLSI technology in designing high-performance multiprocessors and pipelined computing devices is currently under intensive investigation.
1.1 VLSI Architectural Design Principles
VLSI architectures should exploit the potential of the VLSI technology and also take into account the design constraints introduced by the technology. Some of the key design issues are summarized below:

Simplicity and Regularity: Cost effectiveness has always been a major concern in designing VLSI architectures. A structure, if decomposed into a few types of building blocks which are used repetitively with simple interfaces, results in great savings. In VLSI, there is an emphasis on keeping the overall architecture as regular and modular as possible, thus reducing the overall complexity. For example, memory and processing power will be relatively cheap as a result of high regularity and modularity.
Concurrency and Communication: With current technology, tens of thousands of gates can be put in a single chip, but no gate is much faster than its TTL counterpart of 10 years ago. The technological trend clearly indicates a diminishing growth rate for component speed. Therefore, any major improvement in computational speed must come from the concurrent use of many processing elements. Massive parallelism can be achieved if the underlying algorithm is designed to introduce high degrees of pipelining and parallel processing. When a large number of processors work in parallel, coordination and communication become important. Especially in VLSI technology, routing costs dominate the power, time, and area required to implement a computation. The issue here is to design algorithms that support high degrees of concurrency, and in the meantime employ only simple, regular communication and control to allow efficient implementation. The locality of interprocessor connections is a desired feature.
Computation Intensive: Compute-bound algorithms are more suitable for VLSI implementation than I/O-bound algorithms. In a compute-bound algorithm, the number of I/O operations is very small compared to the number of computing operations. It is the other way around in the case of I/O-bound algorithms. I/O-bound algorithms are not suitable for VLSI implementation because the VLSI package must be constrained to a limited number of pins. Therefore, a VLSI implementation must balance its computation with the I/O bandwidth.
1.2 VLSI Array Processors
The choice of an appropriate architecture for an implementation is very closely related to the VLSI technology. The constraints of power dissipation, I/O pin count, relatively long communication delays, etc., are all critical in VLSI. On the brighter side, VLSI offers very fast and inexpensive computational elements.

Parallel structures that need to communicate only with their nearest neighbors will gain the most from VLSI technology. If modules that are far apart must communicate, considerable time is lost, and the communication requirements of the system increase. The designer must keep this communication bottleneck uppermost in his or her mind when evaluating possible VLSI implementations.
1.2.1 Systolic Array Architecture
The systolic architectural concept was developed by Kung and associates {5, 10, 11, 18}. This subsection reviews the basic concepts of systolic architectures and explains why they should result in cost-effective, high-performance, special-purpose systems for a wide range of applications.
The term "systolic" comes from systole, meaning contraction, and in physiology refers to the contraction of the heart that drives blood through the circulatory system of the body. In a systolic system, the processors pump multiple streams of data throughout the system. The regular beating of these parallel processors maintains a constant flow of data through the network. Every processor computes on each clock tick; as it pumps data items through itself, it performs a quick operation which may update some of the items. All operands for a computation arrive at a processor simultaneously, just as they are necessary, and the processors compute rhythmically and perpetually {16}.
The crux of this approach is to ensure that once a data item is brought out from memory it can be used effectively at each processor it passes. This is possible for a wide class of compute-bound algorithms where multiple operations are performed on each data item repetitively. A concrete example of a VLSI array is the VLSI algorithm, taken from {14}, for the simple problem of polynomial evaluation.
Suppose we have the following polynomial:

    P(x) = A_m x^m + A_(m-1) x^(m-1) + ... + A_0

We wish to evaluate P(x_i) at points x_i, 1 <= i <= n. The polynomial can be reformulated as:

    P(x) = ( ... ((A_m x + A_(m-1))x + A_(m-2))x + ... + A_1)x + A_0

The value of P(x_i) for each x_i, denoted p_i, is computed by an algorithm whose inner loop is

    for i = 1 to n do
        for j = m downto 0 do
            p_i = p_i * x_i + A_j
The VLSI implementation is shown in Figure 1. In this design, the coefficients are held in cells through which the x_i and p_i data flow. On every clock cycle, each cell inputs x_i and p_i, multiplies them, adds its stored coefficient, and outputs the result as p_out while passing on x_i unchanged. The result P(x_i) appears as the output of the rightmost cell m clock cycles after x_i is input to the leftmost cell. This example illustrates the pipelining and the parallel operations of a typical VLSI array.
Figure 1: VLSI Implementation for Polynomial Evaluation
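The cell behavior can be made concrete with a minimal software simulation of this design. The following Python sketch is illustrative only (the function and variable names are assumptions, not from the original design): each cell holds one coefficient, and on every clock cycle computes p_out = p_in * x + A while passing x along unchanged.

    # Minimal simulation of the linear systolic array of Figure 1.
    # Cell j holds coefficient A_j; on each clock cycle it computes
    # p_out = p_in * x + A_j and passes x through unchanged.
    def systolic_poly_eval(coeffs, xs):
        # coeffs = [A_m, ..., A_1, A_0], one coefficient per cell
        m = len(coeffs)
        pipeline = [None] * m              # (x, p) latched at each cell
        results = []
        stream = [(x, 0.0) for x in xs]    # p enters the array as 0
        for _ in range(len(xs) + m):       # extra cycles drain the pipe
            out = pipeline[m - 1]
            if out is not None:
                results.append(out[1])     # P(x) leaves the last cell
            for j in range(m - 1, 0, -1):
                prev = pipeline[j - 1]
                pipeline[j] = (None if prev is None
                               else (prev[0], prev[1] * prev[0] + coeffs[j]))
            nxt = stream.pop(0) if stream else None
            pipeline[0] = (None if nxt is None
                           else (nxt[0], nxt[1] * nxt[0] + coeffs[0]))
        return results

    # P(x) = 2x^2 + 3x + 1 at x = 0, 1, 2  ->  [1.0, 6.0, 15.0]
    print(systolic_poly_eval([2, 3, 1], [0, 1, 2]))

Each result appears m cycles after its x enters the array, mirroring the latency described above.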
As a VLSI array is made up of simple cell types, it is cheaper to design than a circuit containing a variety of complex cells. Unlike architectures that broadcast data to many points, they can be easily scaled up to handle large problems. The cells communicate only with nearby cells, reducing data broadcasts. Therefore, being able to use each input data item a number of times is just one of the many advantages of the systolic approach. Other advantages include modular expansibility, simple and regular data and control flows, use of simple and uniform cells, elimination of global broadcasting, limited fan-in, and fast response time.
In general, VLSI array designs are applied to computation-intensive problems that are regular, i.e., ones where repetitive computations are performed on a large set of data. Systolic arrays (from now on referred to as VLSI arrays) can assume many different structures for different compute-bound algorithms. Figure 2 shows various systolic array configurations, and their potential applications are listed in Table 1 {8}. These algorithms form the basis of signal and image processing, matrix arithmetic, combinatorial, and database algorithms.
1.3 Synthesization of VLSI Array Algorithms
Synthesis of VLSI algorithms is concerned with generating an optimum VLSI network topology for a computation task. Previously, the mapping of a computation onto an architecture has been done in a rather ad hoc fashion. The design of the array depended upon the inherent concurrency and communication geometry of the algorithm. Several attempts have been made at the automatic synthesis of algorithms from a description of a computation in some high-level language. The synthesis must transform such a description into a correct array algorithm for the computation. In the following subsection, we shall trace the development of these methodologies for the synthesis of VLSI arrays.
(a) One-dimensional linear array
(b) Two-dimensional square array
(c) Two-dimensional hexagonal array
(d) Binary tree
(e) Triangular array
Figure 2: Systolic Array Configurations

TABLE 1
Algorithms and Desired VLSI Array Structures

VLSI array structure: Computational algorithms

1-D linear arrays: FIR-filter, convolution, discrete Fourier transform, solution of triangular linear systems, carry pipelining, cartesian product, odd-even transposition sort, real-time priority queue, pipeline arithmetic units.

2-D square arrays: Dynamic programming for optimal parenthesization, graph algorithms involving adjacency matrices.

2-D hexagonal arrays: Matrix arithmetic (matrix multiplication, L-U decomposition by Gaussian elimination without pivoting, QR-factorization), transitive closure, pattern match, DFT.

Trees: Searching algorithms (queries on nearest neighbor, rank, etc.), parallel function evaluation, recurrence evaluation.

Triangular arrays: Inversion of triangular matrix, formal language recognition.
1.3.1 Methodologies for VLSI Array Design
The mapping of computational algorithms onto VLSI arrays implies first a transformation of the algorithm into an equivalent but more appropriate form. The basic structural features of an algorithm are dictated by the data and control dependencies. These dependencies refer to precedence relations of computations which need to be satisfied in order to compute the problem correctly. The absence of dependencies indicates the possibility of parallel computations. So the study of data dependencies in an algorithm becomes the most critical step in parallelizing the computations of the algorithm.

In the last decade, considerable research has been done in the area of data dependencies in high-level languages. Muraoka {23} and Kuck et al. {9} studied the parallelism of simple loops. They addressed the problem of discovering operations which may be performed concurrently by examining algorithms at the statement level. For iterative loops, the index set and subscript forms are studied to transform scalar operations into array operations which may be performed in parallel. Syntactic trees are built for individual assignment statements and for blocks of assignment statements, and the techniques of arithmetic tree-height reduction are then applied to reduce the required processing time.
Towle {25}, Banerjee {2}, and Banerjee et al. {3} extended the methodology of transforming ordinary programs into highly parallel forms. They show that a large number of processors can be used effectively to speed up simple Fortran-like loops consisting of assignment statements. A practical method is given by which one can check whether or not a statement is dependent upon another. Four techniques, namely loop freezing, the wavefront method, the splitting lemma, and loop interchanging, have been suggested for transforming programs into highly parallel forms.
Lamport {15} has developed methods for the parallel execution of different iterations of a DO loop on both asynchronous multiprocessor computers and array computers. The concept of the data-dependence vector was introduced, and a computation was abstracted in terms of its data dependencies. In a given algorithm, there are some dependencies between the variables generated at different index points. These dependencies can be described as difference vectors of the index points where a variable is used and where that value of the variable was generated. Lamport's method determines hypersurfaces in the index set such that the index points on a hypersurface are not data-dependent. This means that the computations at all the points on a hypersurface can be done in parallel, still preserving the sense of the data dependencies.
In the methods mentioned above, the analysis was performed from the standpoint of a compiler for a multiprocessor computer. The study of data dependencies was done to detect the inherent parallelism in the loops. Though these methods contain many basic principles, they are not adequate for the synthesization of VLSI arrays. In addition to a high degree of parallelism, VLSI arrays employ pipelining of data. Also, the communication distance and time should be kept to a minimum. There is a communication path in the VLSI network for every data stream in the original algorithm. Therefore, there is a direct one-to-one correspondence between the data dependencies of the algorithm and the communication requirements of its VLSI implementation. The current approach in the synthesization of VLSI arrays focuses not only on the deduction of data dependencies but also on their modification. Here, we shall discuss some of the representative approaches in the design of VLSI arrays.
Weiser and Davis {26} proposed a wavefront notation tool for VLSI array design. In the wavefront implementation, a delay operator is used to delay or to rotate the direction of wavefronts. A wavefront is an ordered set in which no two elements belong to the same data stream, and all the elements of the set move uniformly in time or space. Delayed or rotated wavefronts are used to represent the desired operations. The interrelationship of the operations thus defines the array network topology.
Capello and Steiglitz {4} proposed a method based on geometric transformations. This transformation presents a technique to place designs in a geometric framework. Well-known VLSI designs for computing polynomial products are shown to be related to one another by affine transformations on a three-dimensional vector space. Design properties such as broadcasting and pipelining can be formally defined, and their presence or absence in a particular design can be readily ascertained. Likewise, a design's communication topology can be disclosed by projecting out the time dimension of the representation.
Moldovan's method {20, 21, 22} is based on the transformation of index sets and dependence vectors. This transformation derives the VLSI array topology by identifying algorithm transformations which favorably modify the index set and the data dependencies, but preserve the ordering imposed on the index set by the data dependencies. Data broadcast is eliminated to make local communication possible. The transformation is partitioned into two functions, timing and network geometry. The network topology is thus derived from an optimum transformation, and the network data flow is described by the timing function. As our work is an extension of this technique, it is described in more detail in the next subsection.
Miranker and Winkler {19} have developed an approach similar to the one developed by Moldovan. The physical process of computation is interpreted in terms of a graph in physical space and time, and then an embedding into this graph of another graph which characterizes the data flow in particular algorithms is given. The VLSI array is completely described as a special class of computational structure. A technique is developed for mapping the graph of a particular VLSI array algorithm onto a physical array.
Quinton's method {24} is a convex analysis approach where the algorithms are first expressed as a set of uniform recurrence equations over a convex set of cartesian coordinates. The method consists of two steps. The first step is to find a timing function, as a quasi-affine transform for the computations, that is compatible with the dependencies introduced by the equations. The second step maps the domain onto another finite set of coordinates, each representing a processor of the VLSI array, in such a way that concurrent computations are mapped onto different processors.
Kung {12} developed the cut-set method, in which a computational task is represented by a signal flow graph. The graph is then converted into a VLSI array geometry. The procedure first selects basic operation modules, then applies a localization procedure using the cut set. Finally, it combines delay and operation modules to form basic array elements.
Li and Wah {17} characterize a VLSI array by three classes of parameters: the velocities of data flows, the spatial distribution of data, and the periods of computation. By relating these parameters in constraint equations that govern the correctness of the design, the design is formulated as an optimization problem. The size of the search space is a polynomial of the problem size, and a methodology to systematically search and reduce this space and to obtain the optimal design is proposed.
In the next section, we will discuss Moldovan's method to design VLSI arrays. His technique forms the basis of our work.
1.3.2 Moldovan's Method to Design VLSI Arrays
Moldovan developed a technique which abstracts a computation in terms of its data dependencies. The method is based on mathematical transformation of the index sets and the data-dependence vectors associated with the given algorithm. As our work is mainly based on this concept of defining transformations for mapping, we shall briefly review the basis of his method in this section.

Moldovan's method focuses not only on the detection of the data dependence vectors, but also on their modification. To meet the specifications of the VLSI array, the program structure is modified, mainly to further the localization of the data communications. The computation is modeled as a lattice with nodes representing operations and edges representing data dependencies. The lattice is mapped onto a space-time domain.
The transformation T mapping the algorithm onto a space-time domain is divided into two independent transformations, a timing function P and an allocation function S. Using P(I) = constant, hyperplanes are determined on the index set F^n of the algorithm such that all points on a hyperplane can be executed in parallel. S maps the index set onto the processors of a VLSI array, and the dependence vectors D onto the communication links of the array. Thus the network geometry and the directions of the different data streams are derived from S, and the timing is derived from P. The matrices S and P together form the program transformation T, a monotonic function in the sense that it transforms the index set and at the same time preserves the ordering imposed by the data dependencies on the index set.
The following steps describe the method:

1. Find the set of dependence vectors D.
The appearance of an array variable on the left-hand side of an assignment statement is called a generation; otherwise, it is called a use. All possible pairs of generated and used variables are formed. Their indices are equated to compute the dependence vectors.

2. Identify a valid transformation T.
The transformation T is partitioned into two functions P and S. P maps the index set F^n of the algorithm to the first k coordinates of the new index space, which is selected a priori. S maps the index set onto the remaining coordinates of the new index space. Now, P can be related to the processing time, and S can be related to the geometrical properties of the algorithm, which will dictate the communication requirements of the VLSI array. P is calculated such that PD > 0, as this preserves the execution ordering. S can be chosen depending on the network geometry of the VLSI array chosen a priori.

3. Map the algorithm onto a VLSI array.
The functions F performed by the processing cells in the VLSI array are derived directly from the mathematical expressions in the algorithm. The network geometry G is derived from the mapping S. S maps the dependence vectors onto the communication links in the array, and the index space onto the processors. The directions of the different data streams are derived from S. The network timing, which specifies for each cell the time when the processing of functions F occurs and when the data communications take place, is given by P.
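The validity check in step 2 is easy to state in code. The following Python sketch is a small illustration (the helper name is an assumption, and the dependence vectors shown are the standard ones for three-nested-loop matrix multiplication, not an example from this method's papers):

    # Check PD > 0: a linear timing function P is valid only if every
    # dependence vector is scheduled with strictly positive delay.
    import numpy as np

    def is_valid_timing(P, D):
        # P: length-n row vector; D: n x m matrix of dependence columns
        return bool(np.all(np.asarray(P) @ np.asarray(D) > 0))

    D = np.array([[1, 0, 0],     # d1
                  [0, 1, 0],     # d2
                  [0, 0, 1]]).T  # columns are the dependence vectors

    print(is_valid_timing([1, 1, 1], D))   # True: preserves ordering
    print(is_valid_timing([1, -1, 1], D))  # False: d2 scheduled backward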
1.4 Limitations of the Existing Methods
A typical assignment statement in an algorithm is of the form

    X(I) = f{X(I-d_1), X(I-d_2), ..., X(I-d_n)}

where f is an n-variable function and d_1, d_2, ..., d_n are vectors of the index space of the algorithm, called dependence vectors. The dependence vectors denote that the computation of the variable X at the index point I depends on the computations of the variable X at the points I-d_1, I-d_2, ..., I-d_n. If the length of a dependence vector d is the same throughout the index space, it is termed a fixed or constant dependence vector. This is so when all the components of the dependence vector are constants, i.e., they are not functions of the index variables. A non-fixed dependence vector is termed a variable dependence vector. An algorithm with all fixed dependence vectors has a regular data flow.
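The distinction can be seen numerically. The following toy Python sketch (hypothetical loops of my own, not examples from this chapter) computes d = I_use - I_gen over a small index set for two recurrences: one with a fixed dependence vector, and one whose vectors grow with the problem size.

    # d = I_use - I_gen for every generated-used pair in a toy index set.
    def dep_vectors(pairs):
        return sorted({(iu - ig, ju - jg) for (ig, jg), (iu, ju) in pairs})

    # X(i,j) = f{X(i-1,j)}: one fixed vector (1,0) at every point
    fixed = [((i - 1, j), (i, j)) for i in range(1, 4) for j in range(3)]
    print(dep_vectors(fixed))      # [(1, 0)]

    # X(i,j) = f{X(0,j)}: a broadcast; the vector depends on i
    broadcast = [((0, j), (i, j)) for i in range(1, 4) for j in range(3)]
    print(dep_vectors(broadcast))  # [(1, 0), (2, 0), (3, 0)]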
The methods described above make two assumptions about the algorithm. The first assumption is that the data flow is regular, i.e., all dependence vectors are fixed dependence vectors. There exist algorithms, dynamic programming for example, whose dependence vectors have components that are functions of the index variables (refer to Chapter 2). These vectors may lead to broadcasting of data, which will strain the communication requirements of the final architecture. Also, the length of the dependence vector may vary depending on the size of the algorithm. This may lead to a very irregular data flow, and the transformation methods developed so far may not suffice.
The second assumption is that the computations at all points in the index space of the algorithm use the same set of dependence vectors. These methods may not be able to find the timing function if not all dependence vectors are associated with the computation at each point in the index set, or where a subset of dependence vectors forms a cycle. This is the case with the shortest path algorithm, as will be seen in Chapter 5. Furthermore, the timing function may be sub-optimal with respect to the one actually possible. This is because, by the forced association of all dependence vectors with all the points in the index set, we may be introducing unnecessary ordering between points which were not data-dependent originally. This lowers the degree of parallelization. Hence, the timing function may be inferior.
In the following section, we briefly outline the goal of the proposed research.
1.5 Outline of the Thesis
As explained in the previous section, there exist algorithms with irregular data-flow. We may not be able to define linear hyperplanes on the index set to deduce the inherent parallelism of the algorithm. This leads to our view of the algorithm as a nonlinear structure. Points in the index space which are data-independent can then be defined as points on a convex surface. As a simplification, we only consider the cases when a continuous nonlinear convex surface is formed by a set of intersecting hyperplanes. We determine sets of intersecting hyperplanes on the index space of an algorithm, such that points on a set of intersecting hyperplanes are not data-dependent, and consequently the computations at these points can be executed in parallel. This forms the crux of our approach.
A method has been developed to regularize the data flow in the case when the algorithm has variable dependence vectors. This is done by finding, for each variable dependence vector, an equivalent set of fixed dependence vectors. We ensure that the substitution of the variable dependence vectors by the fixed dependence vectors still preserves the original data flow.
Further, we partition the index space into regions. The data-flow in a region is regular, in the sense that all points in a region receive the same set of dependence vectors. So hyperplanes can be defined on the points within a region to denote data-independency among the points on a hyperplane. A separate timing function is defined for each region.

We also determine the allocation function for each region. This correctly maps the points in that region onto the processors of the VLSI array, which is selected a priori.
Next, we integrate the transformations of the different regions. This is done by taking into consideration the points which fall on the boundary of any two regions, and the dependence vectors which coexist in the data-flow in the different regions. Conditions are derived to ensure the validity of the integrated transformation.
This thesis is organized as follows. In Chapter 2, we define a data dependence vector and a program graph. Some techniques are suggested to pipeline the data flow. We examine the structural effects of variable dependence vectors on the program graph and develop a method to convert the variable dependence vectors to equivalent fixed dependence vectors. In Chapter 3, we discuss timing functions and derive the conditions sufficient for their existence. In Chapter 4, a VLSI array model is given. An allocation function is defined to map a given program graph onto a prechosen VLSI array. Chapter 5 gives a procedure for the synthesization of VLSI arrays based on the concepts developed in the previous chapters. Using our method, we determine an implementation of the all-pair shortest path problem {7}. Finally, in Chapter 6, we summarize the results of our work.
CHAPTER 2
ALGORITHM MODEL AS A PROGRAM GRAPH
2.1 Data Dependence Vectors
In our research we consider algorithms which can be represented by nested loops as shown below:

    for I^1 = l^1, u^1
      for I^2 = l^2, u^2
        ...
          for I^n = l^n, u^n
          begin
            S_1(I)
            S_2(I)
            ...
          end

Here l^j and u^j are integer-valued linear expressions involving I^1, ..., I^(j-1), and I = (I^1, I^2, ..., I^n). S_1, S_2, ..., S_m are assignment statements of the form X = E, where X is a variable and E is an expression of the variables of the loop. Let Z denote the set of all integers, and let Z^n denote the set of n-tuples of integers. The index set of the loop is a subset of Z^n and is defined as

    F^n = {(I^1, ..., I^n) : l^1 <= I^1 <= u^1, ..., l^n <= I^n <= u^n}.
A sequential execution of the loop defines an ordering on the points of the index set F^n, which is known as the lexicographical ordering of F^n. This ordering is an induced one, and can be modified.

The dependencies in the algorithm can be studied at several distinct levels to extract parallelism. Since we are designing algorithms for VLSI arrays, we will focus only on data dependencies at the variable level, which is the lowest possible level before the bit level. The appearance of an array variable on the left-hand side of an assignment statement is called a generation; otherwise, it is called a use. Values of the used variables in a statement are required to compute the value of the generated variable, i.e., the generated variable is data-dependent on the used variables.
Let g and u be two integer functions defined on the set F^n, and let X and Y be two variables whose indices are u and g, respectively. The variables X(u(I)) and Y(g(I)) are generated in statements S_1(I) and S_2(I), respectively. The variable Y(g(I_2)) is said to be data-dependent on the variable X(u(I_1)) if
a) I_1 < I_2 in the lexicographical sense,
b) u(I_1) = g(I_2), and
c) X(u(I)) is on the right-hand side of statement S_2(I).
Definition 1: A dependence vector d is the difference I_2 - I_1 of index points, where I_1 is the point of generation of a variable, and I_2 is the point of its use.

We denote by D the set of dependence vectors of a given algorithm. The last section of this chapter contains an example which demonstrates a computation of the set D.
In general, dependence vectors are functions of index points, i.e., d(I_1) need not equal d(I_2) for two points I_1 and I_2. There is, however, a large class of algorithms with fixed or constant dependence vectors, such as the matrix multiplication algorithm. On the other hand, there are algorithms in which computations at different index points use different subsets of dependence vectors. To extend the study of data-dependencies of these algorithms, we introduce the definition of "vector-computation."

Definition 2: A vector-computation is a computation which is dependent on a specific set of dependence vectors. Two vector-computations are equivalent only if the same set of dependence vectors is associated with the computations.
2.2 Program Graphs
As graphs provide a good deal of insight into the system they represent, and also as they can be combinatorially analyzed, we model the nested-loop algorithm by a data-flow program graph. We use the program graph, a directed graph, to describe the computations in the algorithm at a high level. The vertices of the program graph are index points of the loop. With each vertex, there is some computation of the algorithm associated. With each arc in the program graph, there is a dependence vector associated, which defines the data-dependence between the two index points connected by the arc. A path in the graph containing arcs associated with the same vector is called a data stream.

Incoming edges at a vertex represent a set of input values which are required to compute the function associated with the vertex. The outgoing edges represent a set of output values generated at the vertex. A set of input variables and a set of output variables can be defined at each vertex which carry the input values and the output values, respectively.

If the algorithm has only one vector-computation, the program graph is called a homogeneous graph, in the sense that the vectors associated with the arcs incident at all vertices are the same. On the other hand, a program graph representing an algorithm with more than one vector-computation is called a heterogeneous graph. In this case, different sets of vectors will be incident at different vertices.
2.3 Program Graph Model
We define the program graph model as follows:

Definition 3: A heterogeneous program graph is a 5-tuple PG = [V, D, C, X, Y], where:
a) V is the set of vertices having one-to-one correspondence with points in the algorithm's index space, F^n.
b) D is the set of labels assigned to the arcs. Each label corresponds to a dependence vector, and there are as many different labels as there are dependence vectors.
c) C is the set of vector-computations. Each vector-computation is defined by C_f, the mathematical function it computes, and C_d, the set of dependence vectors or labels associated with the computation.
d) X is the set of input variables of the algorithm.
e) Y is the set of output variables of the algorithm.
2.4 Pipelining of Data
The processing power of the VLSI array comes from the concurrent use of many simple cells rather than the sequential use of a few powerful processors. In addition to a high degree of parallelism, some properties of VLSI arrays are pipelining and reduced interprocessor distance. To make the best of the architecture, the computations in the algorithm should be arranged such that the locality of communications is exploited.
To achieve localized flow of data, the data broadcasts which may exist in the original algorithm should be avoided. If some value of a variable v is generated at vertex I_0 ∈ V of the program graph and is used at different vertices I_i ∈ V, i = 1, 2, ..., r, then the computed value must be broadcast to all vertices I_i. This, as shown in Figure 3(a), introduces r data dependence vectors. As a result, there need to be paths from the processing cell computing for I_0 to the cells processing for I_i. This would unnecessarily saturate the communication capability of any VLSI array. An elegant solution to this problem is to pipeline the value of the variable v to the r vertices as shown in Figure 3(b).

(a): Broadcasting of Data
(b): Pipelining of Data
Figure 3: Types of Data-flow
The aim, as can be visualized from Figure 3(b), is to reduce the number of data dependence vectors in the original algorithm. Data broadcasts in a program graph may occur for many different reasons. The points I_0, I_1, ..., I_r ∈ F^n may lie on a line, or a plane, or a hyper-surface, etc. Finding a general solution for the many different situations of broadcasting may not be feasible, and is generally not necessary. Here, we give some guidelines on how we can pipeline a value if the r+1 points lie along a line given by an equation y = ax + c, where y and x are two of the n axes in the index space, a is an integer taking values -1, 0, or 1, and c is an integer.
Case 1 (a = 0): The r+1 points are on a line y = c. This type of data broadcast is very common among the different algorithms which have been synthesized so far. Typically, the broadcast is signalled by a missing index of a variable. Suppose there are n indices involved in an algorithm, and there are only n-1 indices associated with a variable in the algorithm. This indicates that the value of the variable is being broadcast to all the points along the line parallel to the missing index.

In order to eliminate broadcasting in this case, we fill in the missing index, and introduce new artificial variables such that for each generated variable, there is only one destination.
Case 2 (a = -1): In this case, the points which receive the broadcast value lie on a line y = -x + c. Therefore, the data are pipelined along this line by sending the data from the point (x-1, y+1) to the point (x, y).

Case 3 (a = 1): This case is very similar to the previous one. Here the data broadcast is along the line given by y = x + c. We pipeline the data along this line by sending data from the point (x-1, y-1) to the point (x, y).

The three cases of data broadcasting are quite common in most of the general algorithms amenable to VLSI implementation. Any other situation of data broadcasting can be well handled by tailoring the above basic solutions to the needs of the algorithm. This, of course, requires a lot of intuition and insight.
2.4.1 Example to Illustrate Pipelining
Consider the convolution product algorithm. Given a sequence x(0), x(1), ..., x(i), ..., and a set of coefficients w(0), ..., w(K), the convolution algorithm computes the sequence y(0), ..., y(i), ... given by

    y(i) = SUM (k = 0 to K) w(k) * x(i-k)                    (2-1)

This equation can be rewritten as:

    y(i,-1) = 0
    for k = 0 to K
        y(i,k) = y(i,k-1) + w(k) * x(i-k)                    (2-2)

Consider the used variable w(k). As the index i is missing in this variable, this falls under Case 1. We modify the variable to w(i-1,k) and introduce an artificial variable statement:

    w(i,k) = w(i-1,k)

The value of the used variable x(i-k) must be pipelined along the line i-k = c, for all values of the constant c. This falls under Case 3. Therefore, x(i-k) is modified to x(i-1,k-1), and the statement

    x(i,k) = x(i-1,k-1)

is introduced. After the pipelining of the variables, loop (2-2) is rewritten as:

    y(i,-1) = 0
    for k = 0 to K begin
        y(i,k) = y(i,k-1) + w(i-1,k) * x(i-1,k-1)
        w(i,k) = w(i-1,k)
        x(i,k) = x(i-1,k-1)
    end                                                      (2-3)

Figure 4 illustrates the regularity in the data flow of this algorithm after pipelining.
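The transformation can be checked mechanically. The Python sketch below is a rendering of my own, assuming zero boundary values for the samples that have not yet entered (an assumption the loop above leaves implicit); it runs loop (2-3) and compares the result against the direct sum (2-1).

    # Run the pipelined loop (2-3) and compare with equation (2-1).
    def pipelined_conv(w, x):
        K, n = len(w) - 1, len(x)
        wp = [[0.0] * (K + 1) for _ in range(n)]   # values of w(i,k)
        xp = [[0.0] * (K + 1) for _ in range(n)]   # values of x(i,k)
        y = []
        for i in range(n):
            acc = 0.0                              # y(i,-1) = 0
            for k in range(K + 1):
                w_ik = w[k] if i == 0 else wp[i - 1][k]   # w(i-1,k)
                if k == 0:
                    x_ik = x[i]          # input sample enters at k = 0
                elif i == 0:
                    x_ik = 0.0           # zero padding: x(i-k), i < k
                else:
                    x_ik = xp[i - 1][k - 1]               # x(i-1,k-1)
                acc += w_ik * x_ik       # y(i,k) = y(i,k-1) + w*x
                wp[i][k], xp[i][k] = w_ik, x_ik
            y.append(acc)
        return y

    def direct_conv(w, x):               # equation (2-1), zero padded
        return [sum(w[k] * x[i - k]
                    for k in range(len(w)) if 0 <= i - k)
                for i in range(len(x))]

    w, x = [1.0, 2.0, 3.0], [1.0, 1.0, 2.0, 3.0]
    print(pipelined_conv(w, x) == direct_conv(w, x))   # True

The check makes the equivalence of the two program graphs concrete: the pipelined w and x values chain back exactly to w(k) and x(i-k).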
2.5 Variable Dependence Vectors
Definition 4: A dependence vector whose components are integer constants is a fixed dependence vector.

Definition 5: A dependence vector whose components are functions of the indices of the algorithm is a variable dependence vector.

Figure 4: Data-flow of the Convolution Product Algorithm
Fixed dependence vectors denote regularity in the communication structure of an algorithm, as shown in Figure 4. The program graph of such an algorithm has a regular data flow. Only algorithms with a regular program graph have been implemented on a VLSI array.

There are a number of algorithms which have variable dependence vectors. The length of the dependence vector is a function of the size of the algorithm, resulting in an irregular data flow. Also, broadcasting may exist, straining the communication requirements of the VLSI architecture.

To design a simple and regular VLSI array for these algorithms, the data flow should be regularized. This can be done by replacing the variable dependence vector by a set of fixed dependence vectors which will not change the output. In effect, we desire to change the program graph while preserving its equivalence in computations. Two program graphs are equivalent if the data flow is preserved. The following definition introduces a stronger criterion for the program graphs to be equivalent.
Definition 6: Two program graphs PG = [V, D, C, X, Y] and PG' = [V', D', C', X', Y'] are said to be equivalent if and only if:
a) the execution ordering of graph PG is equivalent to the execution ordering of PG', i.e., D is equivalent to D';
b) the graph PG is input-output equivalent to PG';
c) the set of vertices V = V';
d) any mathematical function in PG corresponds to an identical function in PG'.
For simplicity, we first develop a basic solution for a variable dependence vector having only one variable component. The solution for a variable dependence vector with any number of variable components can be obtained as an extension of the basic solution.

Let an assignment statement in the algorithm be of the form:

    X(g(I_1)) = F{X(u(I_2))}

Suppose the generated-used pair X(g(I_1)) and X(u(I_2)) form the dependence vector d. Then d = I_2 - I_1. Depending on the functions g and u, there are three cases which may make d a variable dependence vector: (a) g is a many-to-one function, (b) u is a many-to-one function, and (c) both g and u are many-to-one functions. We shall consider the first two cases. The third case can be solved as a combination of the first two.

Case 1: In this case, g is a many-to-one function. Let I_11, I_12, ..., I_1i, ..., I_1k be the solutions of g(I) = a constant; k is a function of the indices. Let d = (x_1, x_2, ..., x_j, ..., x_n) with the jth component as the variable one. As g has k solutions, x_j will have k values. Then d may be represented by a linear sum of {d_1, d_2, ..., d_i, ..., d_k}, where each one of these vectors is a fixed dependence vector. This situation is shown in Figure 5(a).
Theorem 1: The variable dependence vector d can be replaced by the following set of fixed dependence vectors:

d_min = (x_1, x_2, ..., min(abs(x_j)), ..., x_n), where the jth component gives the minimum of the absolute values assumed by x_j,
d_F = (0, 0, ..., 1, ..., 0), with the jth component = 1 and all the other components = 0,
d_L = (0, 0, ..., -1, ..., 0), with the jth component = -1 and all the other components = 0.

Proof: As d = {d_1, d_2, ..., d_i, ..., d_k}, we have to prove that each of these fixed dependence vectors is equivalent to the system of vectors d_min, d_F, and d_L.

As we are taking the minimum of the absolute values assumed by x_j, d_min exists for any set of values of x_j. Note that the value that x_j takes is a function of the indices of the algorithm. In other words, the arc labeled d_min exists for any size of the program graph.

Let d_min = I_2 - I_1min. Now,

    I_1min - I_1i = a*d_F + b*d_L, where a*b = 0 and a, b >= 0.

Also,

    I_2 - I_1i = (I_2 - I_1min) + (I_1min - I_1i),

i.e.,

    I_2 - I_1i = d_min + a*d_F + b*d_L.

Therefore, d_i = d_min + a*d_F + b*d_L. Hence, by replacing the variable dependence vector d by the set of fixed dependence vectors d_min, d_F and d_L, we are not changing the original execution ordering.

(a): Irregular Data-flow
(b): Regularized Data-flow
Figure 5: Data-flow when g is Many-to-one Function

The transformed data flow is as shown in Figure 5(b). From the definition of equivalent graphs, we conclude that the graph PG with the variable dependence vector d will be equivalent to the graph PG' with the equivalent set of fixed dependence vectors.
The function g is associated with the generated variable X(g(I_1)). This means that before the transformation, a partial value of the generated variable X is residing at each one of the nodes I_11, ..., I_1k (Figure 5(a)). By the transformation we are actually finding the final value of the generated variable and storing it at all the nodes, or at least in the node I_1min. There are many ways of finding the final value of X from the partial values residing at the different nodes, and storing it in the node I_1min. To give the designer full flexibility in choosing the appropriate way to do this, we do not include the vectors d_F and d_L in the equivalent set. Therefore, in the case when g is a many-to-one function, we replace d by d_min only.
Case 2: In this case, u is a many-to-one function. Let I_21, I_22, ..., I_2i, ..., I_2k be the many solutions of u(I) = a constant. Let d = (x_1, x_2, ..., x_j, ..., x_n) with the jth component as the variable one. As u has k solutions, x_j will have k values. Then d = {d_1, d_2, ..., d_i, ..., d_k}, where each one of these vectors is a fixed dependence vector. This situation is schematically shown in Figure 6(a).

Theorem 1 holds for this case too. It can be similarly proved that d can be replaced by an equivalent set of fixed dependence vectors d_min, d_F, and d_L. The transformed data flow is as shown in Figure 6(b).

(a): Irregular Data-flow
(b): Regularized Data-flow
Figure 6: Data-flow when u is Many-to-one Function

Thus, variable dependence vectors are eliminated from the set of dependence vectors D. This results in a program graph with a regular communication structure, which can be readily mapped onto a simple and regular VLSI array.
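The substitution of Theorem 1 is easy to mechanize. The helper below is a hypothetical Python sketch of my own (the thesis gives no code): given the values that the variable jth component can assume, it returns d_min, d_F, and d_L.

    # Replace a variable dependence vector by the fixed set of Theorem 1.
    def equivalent_fixed_set(d, j, values):
        # d: the vector with a placeholder in position j;
        # values: the integers the jth component can assume
        x_min = min(values, key=abs)       # minimum of absolute values
        d_min = d[:j] + [x_min] + d[j + 1:]
        d_F = [0] * len(d); d_F[j] = 1     # unit step along axis j
        d_L = [0] * len(d); d_L[j] = -1    # unit step, opposite sense
        return d_min, d_F, d_L

    # d3 = (1, 0, x) with x ranging over 1, ..., l-1 (here l = 4):
    print(equivalent_fixed_set([1, 0, None], 2, [3, 2, 1]))
    # -> ([1, 0, 1], [0, 0, 1], [0, 0, -1]), i.e., d_min = (1 0 1)^T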
2.6 Modelling of the Dynamic Programming Algorithm as a Program Graph

We consider here the optimal parenthesization algorithm based on dynamic programming {1}. It is given in nested loop form as shown below:

    for i = 1 to n do
        m(i,i) = 0
    for l = 1 to n-1 do
        for i = 1 to n-l do
        begin
            j = i + l
            m(i,j) = MIN {m(i,k) + m(k+1,j) + r(i-1)r(k)r(j)}, k = i, ..., j-1
        end                                                  (2-4)
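Loop (2-4) is the standard matrix-chain recurrence; reading r as the dimension vector of the matrix chain, the following direct sequential Python rendering (my own, for reference and later validation, not code from the thesis) computes the same table.

    # Sequential rendering of loop (2-4): optimal parenthesization cost.
    def parenthesize(r):
        n = len(r) - 1                       # number of matrices
        m = [[0] * (n + 1) for _ in range(n + 1)]
        for l in range(1, n):                # for l = 1 to n-1
            for i in range(1, n - l + 1):    # for i = 1 to n-l
                j = i + l
                m[i][j] = min(m[i][k] + m[k + 1][j] +
                              r[i - 1] * r[k] * r[j]
                              for k in range(i, j))   # k = i, ..., j-1
        return m[1][n]

    print(parenthesize([10, 20, 50, 1, 100]))   # -> 2200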
After pipelining, the nested loop (2-4) becomes:

    for i = 1 to n do
        m(i,i) = 0
    for l = 1 to n-1 do
        for i = 1 to n-l do
            for k = i to i+l-1 do
            begin
                m^l(i,k) = m^(l-1)(i,k)
                m^l(k+1,i+l) = m^(l-1)(k+1,i+l)
                m^l(i,i+l) = MIN {m^(l-1)(i,k) + m^(l-1)(k+1,i+l) + r(i-1)r(k)r(i+l)}
            end                                              (2-5)
The data dependencies derived from the above algorithm are shown in Table 2. There are four generated-used pairs. The data dependence vectors of the last two generated-used pairs have variable components, with x = l-1, l-2, ..., 1 and y = -1, -2, ..., -l+1.

TABLE 2
Data Dependence Vectors of the Dynamic Programming Algorithm

Pairs of generated-used variables          Data dependence vectors

m^l(i,k), m^(l-1)(i,k)                     (1 0 0)^T = d_1
m^l(k+1,i+l), m^(l-1)(k+1,i+l)             (1 -1 0)^T = d_2
m^l(i,i+l), m^(l-1)(i,k)                   (1 0 x)^T = d_3
m^l(i,i+l), m^(l-1)(k+1,i+l)               (1 -1 y)^T = d_4

Let the index function associated with the generated variable m^l(i,i+l) be g = (l, i, i+l). As can be verified, g is a many-to-one function. This results in the variable
dependence vectors d_3 and d_4. The data-flow graph of this algorithm for n = 6 is shown in Figure 7. We can see that the data-flow associated with the dependence vectors d_3 and d_4 is similar to the one shown in Figure 5(a).

The third component of d_3, which is variable, assumes a range of values given by x. The minimum of the absolute values it assumes is 1. Similarly, the minimum of the absolute values assumed by the third component of d_4 is -1. By Theorem 1, we substitute d_3 and d_4 by (1 0 1)^T and (1 -1 -1)^T, respectively. So the equivalent set of fixed dependence vectors is given as:

    d_1 = (1 0 0)^T
    d_2 = (1 -1 0)^T
    d_3 = (1 0 1)^T
    d_4 = (1 -1 -1)^T
Figure 7: Original Program Graph of the Dynamic Programming Algorithm
The data-flow graph of the transformed algorithm is given in Figure 8. The data-flow is very regular as compared to the original data-flow shown in Figure 7. From the data-flow graph, we see that there are two vector-computations. Vectors d_1, d_2 and d_3 are associated with one vector-computation, and vectors d_1, d_2 and d_4 are associated with the other one.

The program graph model PG = [V, D, C, X, Y] for this algorithm, found using Definition 3, is given below:

The set of vertices is V = {(l,i,k) : 1 <= l <= n-1, 1 <= i <= n-l, i <= k <= i+l-1}.

The set of labels assigned to the arcs is D = {d_1, d_2, d_3, d_4}, where d_1 = (1 0 0)^T, d_2 = (1 -1 0)^T, d_3 = (1 0 1)^T, and d_4 = (1 -1 -1)^T.

The set of vector-computations is C = {c_1, c_2}, where c_1f and c_2f : {+, MIN}, and c_1d : {d_1, d_2, d_3} and c_2d : {d_1, d_2, d_4}.

The set of input variables X and the set of output variables Y are easily identified using the indices and are omitted here.
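A small enumeration makes the model concrete. The Python sketch below (a helper of my own, not part of the thesis) lists the vertices V and the arcs labeled d_1 through d_4 for a given n, keeping only arcs whose source also lies in V.

    # Enumerate the program graph of the transformed algorithm.
    def build_program_graph(n):
        D = {"d1": (1, 0, 0), "d2": (1, -1, 0),
             "d3": (1, 0, 1), "d4": (1, -1, -1)}
        V = [(l, i, k)
             for l in range(1, n)              # 1 <= l <= n-1
             for i in range(1, n - l + 1)      # 1 <= i <= n-l
             for k in range(i, i + l)]         # i <= k <= i+l-1
        nodes = set(V)
        arcs = [(u, v, name)
                for v in V
                for name, d in D.items()
                for u in [tuple(a - b for a, b in zip(v, d))]
                if u in nodes]                 # arc u -d-> v inside V
        return V, arcs

    V, arcs = build_program_graph(6)
    print(len(V), "vertices,", len(arcs), "arcs")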
Figure 8: Modified Program Graph of the Dynamic Programming Algorithm
CHAPTER 3
THE TIMING FUNCTION
As described in the previous chapter, the data dependency in the program graph of an algorithm enforces an order of execution which need not be the same as the one given by the lexicographical order. Data dependency introduces a partial ordering on the points of the index space. In this chapter, we will discuss properties of timing functions, which define an order of execution of the program. This function must preserve the partial ordering implied in the program graph.
3.1 The Description of the Timing Function
Let T(I) denote the time at which the computation at I ∈ F^n is done. At time T(I), all the input arguments for the computation at node I should be available. We therefore define a timing function as a mapping P from F^n to F such that P is non-negative and monotonic, i.e., for every pair of points X, Y, with X data-dependent on Y, P(X) >= P(Y) + t_t, where t_t is the transmission time from Y to X. If P(X) > P(Y) + t_t, it means that the value required for the computation at X arrives at X before it is needed. In that case, the value is either buffered or delayed by using a delay operator. The data can be stored in a local memory associated with X and used later at time P(X). Otherwise, a delay operator can be introduced in the transmission path, thereby suitably delaying the arrival of the data at X.
Given a timing function P and a constant c, the set of points such that P(I) = c forms a hyper-surface in the index space. As the points on a hyper-surface are assigned the same time of execution, it is necessary that none of these points are data-dependent on each other.

If P is a linear function on the index space, the hyper-surfaces are hyperplanes, and the index space is divided into a set of parallel planes. The hyperplanes themselves are ordered in time such that the original execution ordering of the nodes is still preserved. The number of hyperplanes gives the order of the total computation time of the algorithm.
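Concretely, a linear P partitions the index points by their schedule time. The fragment below is an illustrative Python sketch (not from the thesis); it groups a small two-dimensional index set by P(I) = c, and every group is a hyperplane whose points may execute concurrently.

    # Group index points into the hyperplanes P(I) = c.
    from collections import defaultdict

    def hyperplanes(P, points):
        planes = defaultdict(list)
        for I in points:
            planes[sum(p * x for p, x in zip(P, I))].append(I)
        return dict(sorted(planes.items()))

    pts = [(i, j) for i in range(3) for j in range(3)]
    for c, plane in hyperplanes((1, 1), pts).items():
        print("t =", c, ":", plane)    # anti-diagonals run in parallel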
3.2 Concurrency
As pointed out in the previous chapter, the computations at all points in the index space of the algorithm may not use the same set of dependence vectors. This results in an irregular data-flow. We may not be able to divide the index space into a set of linear hyperplanes. To alleviate this problem, we view the index space as regions. We partition the index space into regions such that the data-flow is regular in each region. All points in a region use the same subset of dependence vectors. Hyperplanes are defined on the points within a region.

This can be well illustrated by assigning time elements to the nodes of the program graph for the dynamic programming problem considered in Chapter 2. Let us assign t = 1 to the node (1,1,1), and from there assign appropriate times to the other nodes, bearing two things in mind: one, that the execution ordering of the nodes should be preserved; and two, that the nodes which are not data-dependent can be done in parallel.

The time model is given in Figure 9. As can be seen, we can define convex surfaces consisting of planes, such that nodes on a convex surface are not data-dependent, and therefore the computations at these nodes can be done in parallel.
3.3 Sufficient Conditions for the Existence of the Timing Function
By defining convex surfaces over the index space F^n, with each convex surface consisting of hyper-surfaces, which, in turn, are associated with a set of dependence vectors, we are actually partitioning the index space into regions. Each region is associated with a set of dependence vectors. The wavefront of computation for each region is a hyper-surface.

Figure 9: Time Model of the Dynamic Programming Algorithm

The dependence vectors associated with the
computations at any node in a region belong to the set of dependence vectors associated with the definition of that region. This is well illustrated by Figure 10.

Let D_i, i = 1, ..., k, be subsets of the dependence vectors D of the program graph PG. The number of subsets, k, is equal to the number of vector-computations, as each D_i is the set of dependence vectors associated with a vector-computation. Also, D = D_1 ∪ ... ∪ D_k. Let D(I) denote the set of dependence vectors associated with the computation at node I. Let I -d-> J denote that J is dependent on I, and that the vector d is associated with this dependency.
3.3.1 Restrictions on the Class of Algorithms
Now, we introduce a set of conditions to be satisfied by the algorithm so that the timing function we defined can be computed:

(R0) For all I ∈ V, D(I) ⊆ D_i for some i.

(R1) There exists a minimum set of root nodes R ⊆ V such that D(R) ⊆ D_i for all i.

(R2) If Y -d-> X for index points X and Y, then D(X) ∩ D(Y) ⊆ D_i for some i.

(R3) For each node X ∈ V such that D(X) ⊆ D_i for some i, and D(X) ⊄ D_j, there exists a node Y ∈ V such that D(Y) ⊆ D_j, a path from Y = Y_0 through nodes Y_1, Y_2, ..., Y_(h-1) to X = Y_h, and a sequence of integers j = j_0, j_1, ..., j_h = i, not necessarily distinct, in the range 1 to k, such that either
(a) Y_t - Y_(t-1) = d ∈ D for t = 1, ..., h, and D(Y_t) ⊆ D_(j_t) ∩ D_(j_(t+1)) for t = 1, ..., h-1, or
(b) D(Y) ⊆ D_j; and let d_t = Y_t - Y_(t-1) for t = 1, ..., h; then D(Y + d_t) ⊆ D_i.

(R4) For each point X with D(X) ⊆ D_i ∩ D_j, there exists an ancestor node Y such that D(Y) ⊆ D_i ∩ D_j, and the path from Y to X is defined as a linear combination of vectors in D_i ∩ D_j.

Figure 10: Partitioning of the Index Space of the Dynamic Programming Algorithm
The above restrictions are natural, and some explanation is in order. As stated before, we assume that the wavefront in a partitioned region is a hyper-surface. A set of dependence vectors D_i is associated with a region i. Nodes in a region can be arranged on a set of parallel hyper-surfaces.

Restriction R0 is necessary for dividing D into the subsets D_i.

Restriction R1 defines the roots of the program graph, where the initial computations are performed. The roots can be considered as the input nodes. The definition of the roots corresponds to the definition of the initial boundary conditions in recurrence equations.
Restriction R2 defines data flow to be between two parallel hyper-surfaces in a region only. That is, data required for computation at nodes on a hyper-surface come from the nodes on the preceding parallel surfaces only. The nodes on the line of intersection of two surfaces may feed only nodes on surfaces parallel to these two surfaces.

Restriction R3(a) defines a data flow path between any two nodes Y and X through a sequence of nodes on the intersections of hyper-surfaces. The data flow between any two consecutive nodes on the path is through the same dependence vector d, i.e., X - Y = h*d, where h is an integer (see Figure 11(a)).

Restriction R3(b) considers the case where the path between nodes X and Y is a linear combination of a set of dependence vectors, not necessarily distinct. The nodes on the path lie on a set of parallel hyper-surfaces (see Figure 11(b)).
Restriction R4 defines the existence of flow between two nodes on the intersection of two hyper-surfaces associated with two regions. Let a node X lie on the boundary of regions i and j, with the associated sets of dependence vectors D_i and D_j, respectively. Then X lies on the intersection of two hyper-surfaces H_i and H_j. There exists a node Y on the intersection of a hyper-surface parallel to H_i and a hyper-surface parallel to H_j, and a data path from Y to X consists of vectors in D_i ∩ D_j only.

(a): Data-flow defined by R3(a)
(b): Data-flow defined by R3(b)
Figure 11: Data-flow Defined by Restriction R3
3.4 The Definition of the Timing Function
Let P_i be a linear function associated with D_i, for i = 1,...,k, such that P_i maps the index set into the integers. Let us define the timing function P of the program graph PG as P = max {P_i : i = 1,...,k}. Now, P is an equation of a set of convex surfaces, and each of the P_i defines the hyper-surfaces which constitute a convex surface.

Theorem 2: Let P_i, i = 1,...,k, satisfy the following conditions:

(C0) P_i(d) > 0 for each d ∈ D_i;

(C1) For each of the root nodes R, if D(R) ⊆ D_i, then for all i, P(R) = P_i(R);

(C2) If Y -d-> X, D(Y) ⊆ D_i ∩ D_j ∩ D_k for some i, j and k, and D(X) ⊆ D_i ∩ D_j, and D(X) ⊄ D_k, then P_i(d) > P_k(d), P_j(d) > P_k(d), and P_i(d) = P_j(d); i may possibly be equal to j.

Then, for program graphs satisfying the restrictions R0-R4, if D(X) ⊆ D_i for node X, P(X) = P_i(X).
Proof: Our aim is to show that if the set of dependence vectors associated with the computation at a node X satisfies D(X) ⊆ D_i, then P_i(X) is the maximum of the P sub-functions, thereby assigning the computation time to X. There are two cases to be considered: one when D(X) ⊆ D_j for some other j as well, and two when D(X) ⊄ D_j.

Case 1: D(X) ⊆ D_i ∩ D_j. We have to show that P_i(X) = P_j(X). This is true at the root nodes of the program graph from condition C1. The root nodes are ancestors to the other points in the program graph. By restriction R4, there exists an ancestor Y of X in the program graph such that D(Y) ⊆ D_i ∩ D_j. To prove by induction, let P_i(Y) = P_j(Y). By condition C2, we have that for every vector d in the path X-Y, P_i(d) = P_j(d). Therefore, P_i(X-Y) = P_j(X-Y). This results in P_i(X) = P_j(X).
Case 2: D(X) ⊆ D_i and D(X) ⊄ D_j. We have to prove that P_i(X) > P_j(X). By restriction R3, there is a nearest ancestor Y of X with D(Y) ⊆ D_j. We prove this case by induction on h in R3.

Case 2a: h = 1. That is, we have the situation Y -d-> X. Since D(Y) ⊆ D_i ∩ D_j, P_i(Y) = P_j(Y). As D(X) ⊄ D_j, P_i(d) > P_j(d). As the sub-functions of P are linear, it follows that P_j(X) = P_j(Y+d) = P_j(Y) + P_j(d) < P_i(Y) + P_i(d) = P_i(X). Therefore, we have P_i(X) > P_j(X).
Case 2b: h > 1. There are two types of flows to be considered, given by restrictions R3(a) and R3(b). In the first case, D(Y_t) ⊆ D_{j_t} ∩ D_{j_{t+1}}, which implies that P(Y_t) = P_{j_t}(Y_t) = P_{j_{t+1}}(Y_t). By condition C2, since Y_t -d-> Y_{t+1}, P_{j_t}(d) < P_{j_{t+1}}(d) for t = 0,...,h-1. Therefore, P_j(d) = P_{j_0}(d) < P_{j_1}(d) < ... < P_{j_h}(d) = P_i(d). Further, D(Y_{h-1}) and D(Y_h = X) are both contained in D_{j_m}, m = h-1 or h. If m = h-1, then D(X = Y_h) ⊆ D_{j_{h-1}} ∩ D_{j_h}; therefore, P_{j_h}(X) = P_{j_{h-1}}(X). By induction, since D(Y_{h-1}) ⊆ D_{j_{h-1}}, P_j(Y_{h-1}) < P_{j_{h-1}}(Y_{h-1}). By linearity of the functions, P_j(Y_h) = P_j(Y_{h-1}) + P_j(d) < P_{j_{h-1}}(Y_{h-1}) + P_{j_{h-1}}(d) = P_{j_{h-1}}(Y_h). Since D(Y_h) ⊆ D_{j_{h-1}} ∩ D_{j_h}, P_{j_{h-1}}(Y_h) = P_{j_h}(X) = P_i(X). Consequently, P_j(X) < P_i(X). Similar arguments hold for m = h.

When we consider the second type of data flow, given by R3(b), we have D(Y_0 = Y) ⊆ D_i and D(Y) ⊆ D_j. It follows that P_j(Y) = P_i(Y). Let d_t = Y_t - Y_{t-1}. Then D(Y + d_t) ⊆ D_i for t = 1,...,h and D(Y) ⊆ D_i ∩ D_j. Therefore, P_j(d_t) < P_i(d_t). By linearity of the functions P_i and P_j, P_j(X-Y) < P_i(X-Y). Consequently, P_j(X) < P_i(X).
Thus Theorem 2 is proved. Now the restrictions R0-R4 and the conditions C0-C2 can be used to construct the timing function P(X) = max {P_i(X) : i}.
3.5 The Timing Function for the Dynamic Programming Algorithm
Consider the program graph of the dynamic programming algorithm as shown in Figure 8. There are two root nodes, (1,1,1) and (1,2,2), and two vector-computations in this algorithm. So we will define a timing function which will define a set of convex surfaces, each consisting of two planes. The timing function will be of the form

P(l,i,k) = max {P_1(l,i,k), P_2(l,i,k)}

with

P_1 = [a_1 a_2 a_3 a_4]
P_2 = [b_1 b_2 b_3 b_4].
Applying Theorem 2 to the program graph, we have the following conditions.

From condition C0, we have

a_1 > 0;  a_1 - a_2 > 0;  a_1 - a_2 - a_3 > 0;
b_1 > 0;  b_1 - b_2 > 0;  b_1 + b_3 > 0.

From condition C1, we have

a_1 + a_2 + a_3 + a_4 = b_1 + b_2 + b_3 + b_4;
a_1 + 2*a_2 + 2*a_3 + a_4 = b_1 + 2*b_2 + 2*b_3 + b_4.

From condition C2, we have

a_1 > b_1;  a_1 - a_2 < b_1 - b_2;
a_1 + a_3 < b_1 + b_3;  a_1 - a_2 - a_3 > b_1 - b_2 - b_3.

Let us restrict the solution space by including the conditions:

a_1 + a_2 + a_3 + a_4 <= 3;
b_1 + b_2 + b_3 + b_4 <= 3.

Now we solve for the coefficients of P_1 and P_2 satisfying the above conditions. One of the solutions is:

P_1 = [2 1 -1 0] and P_2 = [1 -1 1 1].
So the timing function is:

P(l,i,k) = max {2l + i - k, l - i + k + 1}.
With this timing function, we reassign the time element of each node in the program graph, as shown in Figure 12. We see that the timing function P defines convex surfaces on the nodes, with the nodes on a convex surface associated with the same time element. The two sub-functions P_1 and P_2 define the two planes in a convex surface. The nodes are partitioned into regions, with each region associated with one of the sub-functions. The nodes on a plane receive data only from the nodes on the preceding parallel plane. Thus the wavefront in each region is the plane associated with the region.
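As a quick check on this solution, the Python sketch below transcribes the constraints of this section and verifies that the chosen coefficient vectors satisfy them; the helper names are ours, not part of the thesis notation.

    a = [2, 1, -1, 0]          # coefficients of P1: P1(l,i,k) = 2l + i - k
    b = [1, -1, 1, 1]          # coefficients of P2: P2(l,i,k) = l - i + k + 1

    constraints = [
        a[0] > 0, a[0] - a[1] > 0, a[0] - a[1] - a[2] > 0,        # C0 for P1
        b[0] > 0, b[0] - b[1] > 0, b[0] + b[2] > 0,               # C0 for P2
        sum(a) == sum(b),                                         # C1, root (1,1,1)
        a[0] + 2*a[1] + 2*a[2] + a[3]
            == b[0] + 2*b[1] + 2*b[2] + b[3],                     # C1, root (1,2,2)
        a[0] > b[0], a[0] - a[1] < b[0] - b[1],                   # C2
        a[0] + a[2] < b[0] + b[2],
        a[0] - a[1] - a[2] > b[0] - b[1] - b[2],
        sum(a) <= 3, sum(b) <= 3,                                 # solution-space bound
    ]
    assert all(constraints)

    def P(l, i, k):
        # The resulting timing function P(l,i,k) = max{2l+i-k, l-i+k+1}.
        return max(2*l + i - k, l - i + k + 1)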
Figure 12: The Program Graph of the Dynamic Programming Algorithm with the Reassigned Time Elements
CHAPTER 4
THE ALLOCATION FUNCTION
In the last chapter we defined the timing function for the program graph. Once the time of computation at the different nodes has been determined, it remains to map the computations onto a set of processors. This is done in such a way that at most one computation is performed at one processor at any given time, and computations which should be performed at the same time are mapped onto different processors. The VLSI array processor will be modeled as a digraph, so that the allocation function can be defined as a mapping of the program graph onto the VLSI array graph.
4.1 Model for VLSI Arrays
We define the array graph model as follows:

Definition 7: A heterogeneous array graph is a 3-tuple AG = [G, L, F] where

(a) G refers to the physical underlying structure of the network. The number of processors is given by the size of the index set G. The position of each processor is defined by its Cartesian coordinates.

(b) L refers to the interconnections between processors supporting the flow of data through the network. It is defined as a matrix L = [l_1 l_2 ... l_p] where l_i is a column vector representing a unique data communication link in a specific direction. These links are termed the primitive communication links of the array graph.

(c) F represents the totality of arithmetic and logic operations that a processor is capable of performing.
4.2 The Definition of the Allocation Function
For S to be a valid allocation function, the requirements on S include: (a) the nodes X on a hyper-surface for which P(X) is a constant should be mapped onto different processors, which enables the computations at these nodes to be done concurrently; (b) no node is mapped to more than one processor; (c) S satisfies the timing constraint, i.e., if Y -d-> X, then the propagation time of data from S(Y) to S(X) should be less than or equal to P(X) - P(Y); (d) a dependence vector is mapped to a unique communication path constituted by the primitive links in the network; and (e) no processor performs more than one operation at any time.

Recall that the index space is partitioned into regions, and that with each region a subset of data-dependence vectors and a timing sub-function are associated. We define an allocation function as follows:

Definition 8: For a node X ∈ V, S(X) = S_i(X) if P(X) = P_i(X).
The following theorem gives a method to construct an allocation function S with the desired properties.

Theorem 3: For a region i, let

(a) S_i be a linear function,

(b) P and S together form a bijection,

(c) for each j, and d ∈ D_i ∩ D_j, S_i(d) = S_j(d),

(d) S_i D_i = L U_i, where the matrix U_i indicates the utilization of the primitive communication links in L onto which D_i is mapped. Each column in U_i represents a vector d ∈ D_i, and the row entries in the column correspond to the primitive links used in the data path onto which vector d is mapped. The matrix U_i = [u_i(l,m)] is such that

u_i(l,m) >= 0                                                           (4-1)

Σ_l u_i(l,m) <= min {P_i(d), P_j(d)} for each j such that d ∈ D_i ∩ D_j, if the column m represents the vector d                                (4-2)

for each j, and d ∈ D_i ∩ D_j, u_i(l,m) = u_j(l,m)                      (4-3)

(e) for each root node R, S(R) = S_i(R).

Then S_i correctly maps the nodes in the region i, and S = {S_i : i} is the valid allocation function for the program graph PG.
Proof: Conditions (a) and (b) of the theorem ensure that the nodes with the same time of computation, i.e., those which can be done concurrently, are mapped onto different processors. We have to consider two cases. Let two nodes X and Y lie on a convex surface. This means that P(X) = P(Y), and therefore the computations at these nodes can be done in parallel. The function S should map these two nodes onto different processors.

Case A: The two nodes X and Y lie on a hyper-surface in the region i. Then P(X) = P_i(X) and P(Y) = P_i(Y). Since from (a) and (b) we have that P_i and S_i together form a bijection, S_i(X) ≠ S_i(Y), and therefore S(X) ≠ S(Y).
Case B: The two nodes X and Y lie on different hyper-surfaces, in the regions i and j respectively. Then P(X) = P_i(X) and P(Y) = P_j(Y). We resolve this case by considering the worst-case situation, i.e., when m nodes are on m different hyper-surfaces belonging to a convex surface. Let X_k be a node on the kth hyper-surface in the kth region. We have P(X_1) = P(X_2) = ... = P(X_m). The maximum value m, and hence k, can take is equal to the number of regions, and is a small number in general. We define P*(X) = m * P(X). This allocates m time units to the convex surface so that the m nodes X_1, X_2, ..., X_m can be done in sequence even at the same processor. Then we replace P by P*. This modification in the timing function will naturally increase the computation time, but will ensure correct mapping.
Next we have to prove that S is single-valued. This ensures that no node is mapped to more than one processor. For a node X in the region i, S_i is the mapping function. Since S_i is a linear function, by (a), S_i(X) gives a unique value. Thus S maps node X onto only one processor.

Another case to be considered is when a node X lies on the boundary of two regions i and j. We have to ensure that S(X) = S_i(X) = S_j(X). For the root nodes R, by (e), we have that S(R) = S_i(R) for all regions i on the boundary of which the node R lies. For the other nodes we apply induction. When a node X lies on the boundary of two regions i and j, we have D(X) ⊆ D_i ∩ D_j. By restriction R4 on the program graph, there exists a node Y such that D(Y) ⊆ D_i ∩ D_j and the path X-Y is a linear combination of the vectors belonging to D_i ∩ D_j. By induction, S(Y) = S_i(Y) = S_j(Y). From (c), we have that for each d in the path X-Y, S_i(d) = S_j(d). Hence, as S_i and S_j are linear functions by (a), it follows that S_i(X-Y) = S_j(X-Y). Therefore, again by linearity of the functions involved, we have S_i(X) = S_j(X). This proves that S uniquely maps a node X onto the array graph.
We have to prove that S satisfies the execution ordering. If Y -d-> X, then the time taken for data to propagate from S(Y) to S(X) should be less than or equal to P(X) - P(Y) = P(d). This ensures that the data computed at the processor S(Y) reaches the processor S(X) before the computation at S(X) starts. We defined a utilization matrix U_i in (d). Let X and Y be in the region i. Then S(X) = S_i(X) and S(Y) = S_i(Y), and U_i is the utilization matrix. In the matrix U_i, there is a unique column for the vector d. The summation of the elements of this column vector, say t, gives the propagation time of data between two processors through the communication path representing the vector d. Also, we know that P_i(d) is the difference in the computation-start times of the nodes connected by the vector d. By equation (4-2) we have that t <= P_i(d). This ensures that the execution ordering is preserved by the array graph. Also, by equation (4-3), we ensure that a data stream d is mapped uniquely onto a communication path.
The timing function P and the allocation function S together form the algorithm transformation T. The following theorem summarizes our results.

Theorem 4: A transformation

T = [ P ]
    [ S ]

of an algorithm such that P and S satisfy Theorem 2 and Theorem 3, respectively, maps the given algorithm onto a VLSI array in which the data flow is correct.
4.3 VLSI Implementation of the Dynamic Programming Algorithm

In the previous chapter, we computed a timing function for the dynamic programming problem. In the following section, we shall derive the S function for the dynamic programming problem.
As given by Definition 7, the VLSI array is modeled as an array graph AG = [G, L, F] where

G = {(g_1, g_2) : 1 <= g_1 <= n-1, 1 <= g_2 <= n-1},

L = [ 0  1 -1 -1  1  0  0  1 -1 ]
    [ 0  1 -1  1 -1  1 -1  0  0 ]
     l_1 l_2 l_3 l_4 l_5 l_6 l_7 l_8 l_9

F = {+, MIN}.

The VLSI array has 8-neighbor bidirectional communication paths and also a communication link within a processor (the null link l_1). A typical processor and its 8 neighbors in the array are as shown in Figure 13. Each processor is capable of performing the addition and MIN operations.
Next, we construct the S function which will allocate the processors in the VLSI array graph AG for the computations at the different nodes in the program graph PG. From Definition 8 in Section 4.2, S is given as S = {S_1, S_2}, such that S_i is the valid allocation function for the nodes X for which P(X) = P_i(X).
Figure 13: A Square Array with 8-neighbor Connections
Let

S_1 = [ a_11 a_12 a_13 ]
      [ a_21 a_22 a_23 ]

S_2 = [ b_11 b_12 b_13 ]
      [ b_21 b_22 b_23 ].

From condition (c) of Theorem 3, we have S_1 d = S_2 d for each d ∈ D_1 ∩ D_2. This leads to the following equations:

a_11 = b_11;  a_21 = b_21;
a_11 - a_12 = b_11 - b_12;  a_21 - a_22 = b_21 - b_22.

From this, we have

a_12 = b_12;  a_22 = b_22.

Applying condition (e) on the root node (1,1,1), we have

a_11 + a_12 + a_13 = b_11 + b_12 + b_13;
a_21 + a_22 + a_23 = b_21 + b_22 + b_23.

From the above equations, we derive that

a_11 = b_11;  a_12 = b_12;  a_13 = b_13;
a_21 = b_21;  a_22 = b_22;  a_23 = b_23.

Therefore, S = S_1 = S_2.
One possible set of utilization matrices U = {U_1, U_2} which satisfies the condition (d) is given below; each column of U_i selects the primitive link carrying the corresponding dependence vector:

U_1:  d_1 -> (0 0 0 0 0 0 0 1 0)^T (link l_8)
      d_2 -> (0 0 0 0 0 1 0 0 0)^T (link l_6)
      d_4 -> (0 0 0 0 0 1 0 0 0)^T (link l_6)

U_2:  d_1 -> (0 0 0 0 0 0 0 1 0)^T (link l_8)
      d_2 -> (0 0 0 0 0 1 0 0 0)^T (link l_6)
      d_3 -> (0 0 0 0 0 0 0 1 0)^T (link l_8)

With these utilization matrices, we have

S = S_1 = S_2 = [ 1  1  0 ]
                [ 0 -1  0 ].
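As a sanity check on these choices, the Python sketch below verifies condition (d) of Theorem 3, S d = L u, column by column. It assumes the link matrix L given at the beginning of this section and the dependence vectors d_1 = (1 0 0)^T, d_2 = (1 -1 0)^T, d_3 = (1 0 1)^T and d_4 = (1 -1 -1)^T, which are our inference from the constraints of Section 3.5 (the program graph itself is in Figure 8); treat those values as assumptions.

    import numpy as np

    # Link matrix of the array graph (columns l1..l9), as given above.
    L = np.array([[0, 1, -1, -1,  1, 0,  0, 1, -1],
                  [0, 1, -1,  1, -1, 1, -1, 0,  0]])

    S = np.array([[1,  1, 0],
                  [0, -1, 0]])

    # Assumed dependence vectors of the dynamic programming program graph.
    d = {1: (1, 0, 0), 2: (1, -1, 0), 3: (1, 0, 1), 4: (1, -1, -1)}

    # Link carrying each vector, read off the utilization matrices above.
    link = {1: 8, 2: 6, 3: 8, 4: 6}

    for m, vec in d.items():
        u = np.zeros(9)
        u[link[m] - 1] = 1             # unit column of U for this vector
        assert np.array_equal(S @ np.array(vec), L @ u)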
We apply the S function to the nodes of the program graph PG. We see that at most two nodes with the same computation time are mapped to the same processor. Note, however, that there is a two-unit computation time difference between these nodes and the nodes feeding the input values required for the computations at these nodes. Therefore, from the proof of Theorem 3, we ensure that P and S together form a monotonic function.
From Theorem 4, the algorithm transformation function T is given as

T = [ P ]
    [ S ]

where P: I^3 -> I and S: I^3 -> I^2.

The VLSI array architecture defined by the array graph AG and the allocation function S is as shown in Figure 14. This architecture was first proposed by Guibas et al. {6}. All the processing cells perform the same functions. The directions of the different data streams are given by the utilization matrices.
Figure 14: VLSI Array for the Dynamic Programming Algorithm
CHAPTER 5
PROCEDURE FOR VLSI IMPLEMENTATION OF ALGORITHMS
In this chapter, we propose a procedure to transform an algorithm into a highly parallel form, and then to map it onto a prechosen VLSI array architecture. The procedure summarizes the methods which were discussed so far.
5.1 Procedure for Mapping Algorithms onto VLSI Arrays
Step 1: Pipeline all variables in the algorithm.

Step 2: Find the set of data-dependence vectors.

Step 3: Replace each variable dependence vector by its equivalent set of fixed dependence vectors.

Step 4: Model the algorithm as a program graph.

Step 5: Divide the set of dependence vectors into subsets, where each subset is associated with a vector-computation in the algorithm.

Step 6: Compute a valid timing function for each subset of data dependence vectors.

Step 7: Model the VLSI array architecture as an array graph.

Step 8: Select a valid allocation function for each subset.

Step 9: Integrate the timing and allocation functions into one transformation.

Step 10: Map the program graph onto the VLSI array. (A schematic outline of how the ten steps compose is given below.)
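The skeleton below is a schematic Python outline of the procedure; every function named in it is a hypothetical placeholder for the corresponding step, not an implementation from this thesis, and the module only defines the driver without calling it.

    def map_algorithm_onto_vlsi_array(algorithm, array_model):
        # Steps 1-4: regularize the data-flow and build the program graph.
        algorithm = pipeline_variables(algorithm)                # Step 1
        D = data_dependence_vectors(algorithm)                   # Step 2
        D = replace_variable_vectors(D)                          # Step 3
        PG = program_graph(algorithm, D)                         # Step 4
        # Steps 5-8: derive the two halves of the transformation.
        subsets = partition_by_vector_computation(D, PG)         # Step 5
        P = timing_function(subsets, PG)                         # Step 6
        AG = array_graph(array_model)                            # Step 7
        S = allocation_function(subsets, P, AG)                  # Step 8
        # Steps 9-10: integrate and map.
        T = (P, S)                                               # Step 9
        return map_graph(PG, AG, T)                              # Step 10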
Explanation: The role of the first step is to eliminate all possible broadcasts which may exist in the original algorithm. Some methods have been suggested in Section 2.4 of Chapter 2 to identify broadcasting, and to eliminate it by pipelining the variables in the algorithm.

If components of a dependence vector are functions of the indices, i.e., the vector is not fixed, then this vector may lead to an irregular data-flow. Irregular data-flow will strain the communication paths of an architecture. Step 3 identifies such variable dependence vectors, and each such vector is replaced by an equivalent set of fixed dependence vectors. Theorem 1 in Section 2.5 of Chapter 2 states a technique to find the equivalent set.

As graphs are very suitable for representing systems, we model the nested-loop algorithm as a data-flow program graph. Our model-definition of the program graph is given in Definition 3 in Section 2.3 of Chapter 2.

In Step 5, we construct subsets of dependence vectors so that each subset is associated with a vector-computation. The conceptual necessity for such a division is stated in Section 3.2 of Chapter 3.
In Step 6, we construct a timing function which determines convex surfaces on the index set. Computations at the nodes on a convex surface can be done in parallel. Theorem 2 in Chapter 3 defines such a timing function and the method to construct a valid one.

Definition 7 in Chapter 4 gives the model-definition of a VLSI array graph. In Step 7, we specify the desired array model, which comprises the number of processors and their identification, the primitive communication links, and the arithmetic and logical operations that a processor is capable of performing.

In the next step, we compute an allocation function as given by Theorem 3 in Chapter 4. This function maps the nodes and vectors of the program graph onto the processors and the data-communication links of the VLSI array graph, respectively.

The timing function P and the allocation function S are integrated to form the algorithm transformation T. The validity of the transformation T is given by Theorem 4.

In the last step, we map the algorithm onto the VLSI array. The algorithm transformation T maps the execution of an algorithm onto a VLSI array architecture.

The ten steps of the procedure mentioned above describe a systematic method for the implementation of VLSI algorithms. The validity of the procedure is ascertained by Theorems 1-4.

In the next section, we shall exemplify our procedure on the shortest path problem.
5.2 VLSI Implementation of the Shortest Path Algorithm
Consider the all-pairs shortest path problem for a directed graph {7}. We shall map this algorithm onto a 2-D square array. A systolic architecture has been designed before for the shortest-path problem by Lakhani {13}, exploiting synchronization in computation.

Let A^k(i,j) denote the length of a shortest path from vertex i to vertex j going through no vertex of index greater than k. The shortest path algorithm is given by

for k = 1 to n do
  for i = 1 to n do
    for j = 1 to n do
      A^k(i,j) = min {A^(k-1)(i,j), A^(k-1)(i,k) + A^(k-1)(k,j)}     (5-1)

We shall apply the steps of our mapping procedure to loop (5-1) to design the VLSI array for this algorithm.
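Before transforming it, it may help to see loop (5-1) in executable form. The following Python rendering is a plain sequential version, given here only for reference; INF marks the absence of an edge, and A is the n x n weight matrix of the digraph.

    INF = float('inf')

    def all_pairs_shortest_paths(A):
        n = len(A)
        A = [row[:] for row in A]      # A^0 is the weight matrix itself
        for k in range(n):             # A^k is built from A^(k-1);
            for i in range(n):         # updating in place is safe because
                for j in range(n):     # row k and column k do not change
                    A[i][j] = min(A[i][j], A[i][k] + A[k][j])
        return A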
Step 1: The number of indices associated with each one of the variables in loop (5-1) is equal to the number of index variables of the loop. Therefore, Step 1 of the procedure does not apply in this case.

Step 2: The data-dependence vectors can be found using Definition 1. All possible pairs of generated and used variables are formed and their indices equated. The three pairs thus formed are <A^k(i,j), A^(k-1)(i,j)>, <A^k(i,j), A^(k-1)(i,k)> and <A^k(i,j), A^(k-1)(k,j)>. The set of data dependence vectors derived from loop (5-1) is shown in Table 3; these correspond to the generated-used pairs. The data dependence vectors of the last two pairs have variable components with x = 1-n, ..., 0, ..., n-1.
TABLE 3
Data Dependence Vectors of the Shortest Path Algorithm

Pair of generated-used variables     Data dependence vector (k i j)
A^k(i,j), A^(k-1)(i,j)               (1 0 0)^T = d_1
A^k(i,j), A^(k-1)(i,k)               (1 0 x)^T = d_2
A^k(i,j), A^(k-1)(k,j)               (1 x 0)^T = d_3
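The entries of Table 3 can be checked mechanically: each pair contributes the difference between the index of the generated variable and the index of the used variable. The small Python sketch below is our own illustration of that computation; the sample point is arbitrary.

    def dependence_vector(generated, used):
        # Componentwise difference of the generation and use indices.
        return tuple(g - u for g, u in zip(generated, used))

    k, i, j = 3, 1, 2                                          # any index point
    assert dependence_vector((k, i, j), (k - 1, i, j)) == (1, 0, 0)        # d1
    assert dependence_vector((k, i, j), (k - 1, i, k)) == (1, 0, j - k)    # d2, x = j-k
    assert dependence_vector((k, i, j), (k - 1, k, j)) == (1, i - k, 0)    # d3, x = i-k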
Step 3: The data-flow of this algorithm for n = 3 is given in Figure 15. We can see that the variable dependence vectors d_2 and d_3 result in data-flow similar to the one shown in Figure 6(a).

Figure 15: Data-flow of the Shortest Path Algorithm
Using Theorem 1, we obtain the dependence vectors d_1 = (1 0 0)^T, d_4 = (0 0 1)^T, and d_5 = (0 0 -1)^T as the equivalent fixed dependence vectors for d_2. Similarly, the equivalent set of fixed dependence vectors for d_3 is found to be d_1 = (1 0 0)^T, d_6 = (0 1 0)^T, and d_7 = (0 -1 0)^T.

The new set of dependence vectors is as given below:

d_1 = (1 0 0)^T
d_4 = (0 0 1)^T
d_5 = (0 0 -1)^T
d_6 = (0 1 0)^T
d_7 = (0 -1 0)^T.
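The replacement can be pictured as routing: a variable vector such as d_2 = (1 0 x)^T is realized as one step along d_1 = (1 0 0)^T followed by |x| unit steps along d_4 = (0 0 1)^T or d_5 = (0 0 -1)^T, according to the sign of x. The helper below is a hypothetical Python sketch of this expansion, not code from the thesis.

    def expand_variable_vector(x):
        # Path of fixed vectors whose sum is the variable vector (1, 0, x).
        step = (0, 0, 1) if x > 0 else (0, 0, -1)
        return [(1, 0, 0)] + [step] * abs(x)

    path = expand_variable_vector(-2)              # example with x = -2
    assert tuple(map(sum, zip(*path))) == (1, 0, -2)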
The data-flow graph of the modified algorithm is shown in Figure 16. From the data-flow graph we can see that there are four vector computations. The sets of dependence vectors associated with the four vector-computations are {d_1, d_4, d_6}, {d_1, d_5, d_7}, {d_1, d_4, d_7}, and {d_1, d_5, d_6}, respectively.

Step 4: The program-graph model PG = [V, D, C, X, Y] for the shortest path algorithm, constructed from Definition 3, is as given below:

The set of nodes is V = {(k,i,j) : 1 <= k <= n, 1 <= i <= n, 1 <= j <= n}.
Figure 16: Data-flow Graph of the Modified Shortest Path Algorithm
The set of labels assigned to the arcs is D = {d_1, d_4, d_5, d_6, d_7} where d_1 = (1 0 0)^T, d_4 = (0 0 1)^T, d_5 = (0 0 -1)^T, d_6 = (0 1 0)^T, and d_7 = (0 -1 0)^T.

The set of vector computations is C = {c_1, c_2, c_3, c_4} where c_1m = c_2m = c_3m = c_4m = {+, MIN}, and c_1d = {d_1, d_4, d_6}, c_2d = {d_1, d_5, d_7}, c_3d = {d_1, d_4, d_7}, and c_4d = {d_1, d_5, d_6}.

The set of input variables X and the set of output variables Y are easily identified using the indices and are omitted here.
Step 5: The set of dependence vectors D is divided into four subsets, given as follows:

D_1 = {d_1, d_4, d_6}
D_2 = {d_1, d_5, d_7}
D_3 = {d_1, d_4, d_7}
D_4 = {d_1, d_5, d_6}.
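These subsets mirror the four regions into which the planes i = k and j = k cut the index space. The classifier below is our own Python illustration of that correspondence; boundary points (where i = k or j = k) actually belong to the intersections of regions, and the if-ordering merely picks one representative.

    def region(k, i, j):
        if i >= k and j >= k:
            return 1                   # D1 = {d1, d4, d6}
        if i <= k and j <= k:
            return 2                   # D2 = {d1, d5, d7}
        if i <= k and j >= k:
            return 3                   # D3 = {d1, d4, d7}
        return 4                       # D4 = {d1, d5, d6}

    assert region(2, 3, 3) == 1 and region(3, 1, 2) == 2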
Step 6: Now, we derive the conditions for defining the timing function from Theorem 2. The root nodes are (1,1,1), (1,1,2), (1,1,3), (1,2,1), ..., (1,3,3). The timing function will be of the form

P(k,i,j) = max {P_1(k,i,j), P_2(k,i,j), P_3(k,i,j), P_4(k,i,j)}

with

P_1 = [w_1 w_2 w_3 w_4]
P_2 = [x_1 x_2 x_3 x_4]
P_3 = [y_1 y_2 y_3 y_4]
P_4 = [z_1 z_2 z_3 z_4].
Applying Theorem 2 to the program graph, we have the following conditions.

From condition C0, we have

w_1 > 0;  x_1 > 0;  y_1 > 0;  z_1 > 0;
w_2 > 0;  x_2 < 0;  y_2 < 0;  z_2 > 0;
w_3 > 0;  x_3 < 0;  y_3 > 0;  z_3 < 0.

From condition C1, we have

w_1 + w_2 + w_3 + w_4 = x_1 + x_2 + x_3 + x_4
                      = y_1 + y_2 + y_3 + y_4
                      = z_1 + z_2 + z_3 + z_4.

From condition C2, applied across each region boundary, we have

w_1 < x_1;  w_1 < y_1;  w_1 < z_1;  x_1 > y_1;  x_1 > z_1;  y_1 = z_1;
w_2 > x_2;  w_2 > y_2;  w_2 = z_2;  z_2 > x_2;  z_2 > y_2;  x_2 = y_2;
w_3 > x_3;  w_3 = y_3;  w_3 > z_3;  y_3 > x_3;  y_3 > z_3;  x_3 = z_3.

Let us restrict the solution space by including the conditions:

w_1 + w_2 + w_3 + w_4 <= 3;
x_1 + x_2 + x_3 + x_4 <= 3;
y_1 + y_2 + y_3 + y_4 <= 3;
z_1 + z_2 + z_3 + z_4 <= 3.
We solve for the coefficients of P_1, P_2, P_3 and P_4 satisfying the above conditions. The only solution in our solution space is:

P_1 = [1 1 1 0]
P_2 = [5 -1 -1 0]
P_3 = [3 -1 1 0]
P_4 = [3 1 -1 0].

So the timing function is

P(k,i,j) = max {k+i+j, 5k-i-j, 3k-i+j, 3k+i-j}.

After assigning the time unit to each node using the above timing function, the program graph is as shown in Figure 16. It can be verified that the timing function is a valid one.
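That verification can also be done by brute force over a small index set: along every dependence arc of the modified program graph, the timing function must strictly increase. The Python harness below is our own test, not part of the thesis; it encodes our reading of the modified data-flow, in which the pipelined copies of row k and column k travel strictly away from the planes i = k and j = k.

    def P(k, i, j):
        return max(k + i + j, 5*k - i - j, 3*k - i + j, 3*k + i - j)

    def deps(k, i, j):
        # Dependence vectors feeding node (k,i,j) under the assumed flow rule.
        d = [(1, 0, 0)]                # d1
        if j > k: d.append((0, 0, 1))  # d4
        if j < k: d.append((0, 0, -1)) # d5
        if i > k: d.append((0, 1, 0))  # d6
        if i < k: d.append((0, -1, 0)) # d7
        return d

    n = 4
    for k in range(1, n + 1):
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                for dk, di, dj in deps(k, i, j):
                    src = (k - dk, i - di, j - dj)
                    if all(1 <= c <= n for c in src):
                        assert P(k, i, j) > P(*src)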
Step 7: The VLSI array is modeled as an array graph AG = [G, L, F] as follows:

G = {(g_1, g_2) : 1 <= g_1 <= n, 1 <= g_2 <= n}.

L = [ 0  1 -1 -1  1  0  0  1 -1 ]
    [ 0  1 -1  1 -1  1 -1  0  0 ]
     l_1 l_2 l_3 l_4 l_5 l_6 l_7 l_8 l_9

F = {+, MIN}.

This array has 8-neighbor bidirectional connections and also a connection within a processor (the null link l_1), as shown in Figure 13.
Step 8: Now we have to construct an allocation function S which will map the nodes in the program graph onto the nodes in the array graph. From Definition 8 in Section 4.2, S is given as S = {S_1, S_2, S_3, S_4}, such that S_i is the valid allocation function for the nodes X for which P(X) = P_i(X).

Let

S_1 = [ a_11 a_12 a_13 ]
      [ a_21 a_22 a_23 ]

S_2 = [ b_11 b_12 b_13 ]
      [ b_21 b_22 b_23 ]

S_3 = [ c_11 c_12 c_13 ]
      [ c_21 c_22 c_23 ]

S_4 = [ d_11 d_12 d_13 ]
      [ d_21 d_22 d_23 ].
From condition (c) of Theorem 3, we have

S_1 d_1 = S_2 d_1 = S_3 d_1 = S_4 d_1;
S_1 d_4 = S_3 d_4;
S_2 d_5 = S_4 d_5;
S_1 d_6 = S_4 d_6;
S_2 d_7 = S_3 d_7.

This results in the following conditions on the coefficients:

a_11 = b_11 = c_11 = d_11;  a_21 = b_21 = c_21 = d_21;
a_13 = c_13;  a_23 = c_23;
b_13 = d_13;  b_23 = d_23;
a_12 = d_12;  a_22 = d_22;
b_12 = c_12;  b_22 = c_22.

Applying condition (e) on the root node (1,1,1), we have

a_11 + a_12 + a_13 = b_11 + b_12 + b_13 = c_11 + c_12 + c_13 = d_11 + d_12 + d_13;
a_21 + a_22 + a_23 = b_21 + b_22 + b_23 = c_21 + c_22 + c_23 = d_21 + d_22 + d_23.

From the above equations, we derive that

a_11 = b_11 = c_11 = d_11;  a_12 = b_12 = c_12 = d_12;  a_13 = b_13 = c_13 = d_13;
a_21 = b_21 = c_21 = d_21;  a_22 = b_22 = c_22 = d_22;  a_23 = b_23 = c_23 = d_23.

Therefore, S = S_1 = S_2 = S_3 = S_4.
One possible set of utilization matrices U = {U_1, U_2, U_3, U_4} which satisfies the condition (d) is given below; each column of U_i selects the primitive link carrying the corresponding dependence vector:

U_1:  d_1 -> (0 0 0 0 0 0 0 1 0)^T (link l_8)
      d_4 -> (1 0 0 0 0 0 0 0 0)^T (link l_1)
      d_6 -> (0 0 0 0 0 1 0 0 0)^T (link l_6)

U_2:  d_1 -> (0 0 0 0 0 0 0 1 0)^T (link l_8)
      d_5 -> (1 0 0 0 0 0 0 0 0)^T (link l_1)
      d_7 -> (0 0 0 0 0 0 1 0 0)^T (link l_7)

U_3:  d_1 -> (0 0 0 0 0 0 0 1 0)^T (link l_8)
      d_4 -> (1 0 0 0 0 0 0 0 0)^T (link l_1)
      d_7 -> (0 0 0 0 0 0 1 0 0)^T (link l_7)

U_4:  d_1 -> (0 0 0 0 0 0 0 1 0)^T (link l_8)
      d_5 -> (1 0 0 0 0 0 0 0 0)^T (link l_1)
      d_6 -> (0 0 0 0 0 1 0 0 0)^T (link l_6)
With these utilization matrices and the condition (d), we have

S = S_1 = S_2 = S_3 = S_4 = [ 1 0 0 ]
                            [ 0 1 0 ].
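As with the dynamic programming example, condition (d) can be checked mechanically; the link assignments follow the utilization matrices above, with L as reconstructed in Step 7. The Python sketch below is our own verification, not part of the thesis.

    import numpy as np

    L = np.array([[0, 1, -1, -1,  1, 0,  0, 1, -1],
                  [0, 1, -1,  1, -1, 1, -1, 0,  0]])   # columns l1..l9

    S = np.array([[1, 0, 0],
                  [0, 1, 0]])

    link = {(1, 0, 0): 8,     # d1 -> l8, the vertical channel (1 0)^T
            (0, 0, 1): 1,     # d4 -> l1, the internal link
            (0, 0, -1): 1,    # d5 -> l1
            (0, 1, 0): 6,     # d6 -> l6
            (0, -1, 0): 7}    # d7 -> l7

    for vec, m in link.items():
        u = np.zeros(9)
        u[m - 1] = 1
        assert np.array_equal(S @ np.array(vec), L @ u)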
Next, we have to ensure that P and S together form a monotonic function. The proof of Theorem 3 gives a method to this effect: we replace the P function by P* = m * P, where m = 4 in the worst case here. From Figure 16, however, we see that at most two nodes with the same time of computation are mapped to the same processor. Therefore, P* = 2 * P is sufficient to make T = [P*, S] a monotonic function. The modified P functions are as given below:

P_1* = [2 2 2 0]
P_2* = [10 -2 -2 0]
P_3* = [6 -2 2 0]
P_4* = [6 2 -2 0].
Step 9: As stated in Theorem 4, the timing function P* and the allocation function S together form the algorithm transformation function T such that

T = [ P* ]
    [ S  ]

where P*: I^3 -> I and S: I^3 -> I^2.
Step 10: The VLSI array architecture defined by the array graph and the allocation function S is as shown in Figure 17. All processors are identical, and the structure of a processor depends upon the computations required by the shortest path algorithm as well as the timing and data communication dictated by the transformed data dependencies. The communication paths are labeled with the dependence vectors they represent. Notice that a variable which has the dependence d_1 moves from a processor to the next via a vertical channel with direction (1 0)^T.

It is important to comment here that tradeoffs are possible between the time and space characteristics of the VLSI array. Simply selecting another transformation T results in a different parallel execution time, different array dimensions, and different interprocessor communications.
Figure 17: VLSI Array for the Shortest Path Algorithm
The number of valid algorithm transformations is usually very large. A transformation must be selected based on a performance index which measures the overall array performance. Some of the characteristics to be considered are speed, processor complexity, number of interprocessor connections, practical design considerations, etc.
CHAPTER 6
CONCLUSIONS
6.1 Contributions
In this thesis, we have developed a procedure to find algorithm transformations for VLSI array processing. The concepts on which our procedure is based have been described in the previous chapters. In this section, we summarize our work, highlighting its significance.

The most important information about an algorithm is contained in its data dependences, because these determine the algorithm's communication requirements. We have suggested techniques to pipeline the data propagation. We regularize the data-flow in algorithms with variable dependence vectors by replacing such vectors with equivalent sets of fixed dependence vectors.

Then we examined a class of algorithms with heterogeneous data-flow, where computations at all points are not dependent on the same set of dependence vectors. We provided a set of sufficient conditions on the structure of data-flow in this class for the existence of syntactically correct mappings onto a VLSI array. This is the most significant contribution of this thesis, as the existing methods are inadequate for finding mappings for the class of algorithms with heterogeneous data-flow.
The characterization developed in this work sheds some insight into the regularity of computations that can be mapped onto VLSI arrays. Finally, the concepts introduced in this thesis are not restricted to VLSI systems; they can also be used for mapping algorithms onto other fixed parallel computer architectures. The concepts we have developed to find the timing function can be used in the compilation of the class of algorithms specified above for supercomputation.
6.2 Future Work
Pipelining of data is a significant step in the design of VLSI arrays, as it dictates the communication requirements of the array. A unifying approach to pipelining the propagation of data has to be developed.

Techniques have to be developed to map algorithms of any size, with heterogeneous data-flow, onto fixed-size VLSI arrays.

There is no sound basis on which we can evaluate the transformations we have designed. More work is needed to identify a unifying performance index to measure the overall array performance, including speed, processor complexity, communication requirements, and practical design considerations.
REFERENCES

1. Aho,A.; Hopcroft,J.E.; and Ullman,J.D. "The design and analysis of computer algorithms," Addison-Wesley, 1982.

2. Banerjee,U. "Data dependence in ordinary programs," M.S. thesis, Department of Computer Science, University of Illinois, Urbana-Champaign, Nov. 1976.

3. Banerjee,U.; Chen,S.C.; Kuck,D.J.; and Towle,R.A. "Time and parallel processor bounds for Fortran-like loops," IEEE Transactions on Computers, Vol. C-28, Sep. 1979, pp. 660-670.

4. Cappello,P.R.; and Steiglitz,K. "Unifying VLSI array design with geometric transformation," Proceedings of the 1983 International Conference on Parallel Processing, Aug. 1983, pp. 448-457.

5. Foster,M.J.; and Kung,H.T. "The design of special-purpose VLSI chips," Computer Magazine, Vol. 13, Jan. 1980, pp. 26-40.

6. Guibas,L.J.; Kung,H.T.; and Thompson,C.D. "Direct VLSI implementation of combinatorial algorithms," Proceedings of the Caltech Conference on VLSI, Jan. 1979, pp. 509-525.

7. Horowitz,E.; and Sahni,S. "Fundamentals of computer algorithms," Computer Science Press, 1984.

8. Hwang,K.; and Briggs,F.A. "Computer architecture and parallel processing," McGraw-Hill Book Co., 1984.

9. Kuck,D.J.; Muraoka,Y.; and Chen,S.C. "On the number of operations simultaneously executable in Fortran-like programs and their resulting speedup," IEEE Transactions on Computers, Vol. C-21, Dec. 1972, pp. 1293-1310.

10. Kung,H.T. "Let's design algorithms for VLSI systems," Proceedings of the Caltech Conference on VLSI, Jan. 1979, pp. 65-90.

11. Kung,H.T. "Why systolic architectures?," Computer Magazine, Vol. 15, Jan. 1982, pp. 37-46.

12. Kung,S.Y. "On supercomputing with systolic/wavefront array processors," Proceedings of the IEEE, Vol. 72, July 1984.

13. Lakhani,G.D. "An improved distribution algorithm for shortest paths problem," IEEE Transactions on Computers, Vol. C-23, Sep. 1984, pp. 855-857.

14. Lam,M.S.; and Mostow,J. "A transformational model of VLSI systolic design," Computer, Vol. 18, Feb. 1985, pp. 42-52.

15. Lamport,L. "The parallel execution of DO loops," Communications of the ACM, Feb. 1974, pp. 83-93.

16. Leiserson,C.E. "Area efficient VLSI computation," Carnegie-Mellon University, Scribe Version 3A(1117), Nov. 1980.

17. Li,G.J.; and Wah,B.W. "The design of optimal systolic arrays," IEEE Transactions on Computers, Vol. C-34, Jan. 1985, pp. 66-77.

18. Mead,C.; and Conway,L. "Introduction to VLSI systems," Addison-Wesley Publishing Co., 1980.

19. Miranker,W.L.; and Winkler,A. "Spacetime representations of computational structures," Computing, Vol. 32, 1984, pp. 93-114.

20. Moldovan,D.I. "On the analysis and synthesis of VLSI algorithms," IEEE Transactions on Computers, Vol. C-31, Nov. 1982, pp. 1121-1126.

21. Moldovan,D.I. "On the design of algorithms for VLSI systolic arrays," Proceedings of the IEEE, Vol. 71, Jan. 1983, pp. 113-120.

22. Moldovan,D.I. "ADVIS: A software package for the design of systolic arrays," Proceedings of the 1984 International Conference on Parallel Processing, 1984, pp. 158-164.

23. Muraoka,Y. "Parallelism exposure and exploitation in programs," Ph.D. dissertation, Department of Computer Science, University of Illinois, Urbana-Champaign, Feb. 1971.

24. Quinton,P. "Automatic synthesis of systolic arrays from uniform recurrent equations," Proceedings of the 11th Annual International Symposium on Computer Architecture, June 1984, pp. 208-214.

25. Towle,R. "Control and data dependence for program transformations," Ph.D. dissertation, Department of Computer Science, University of Illinois, Urbana-Champaign, Mar. 1976.

26. Weiser,U.; and Davis,A. "A wavefront notation tool for VLSI array design," VLSI Systems and Computations, ed. by Kung, Sproull, and Steele, Computer Science Press, 1981, pp. 226-234.
PERMISSION TO COPY
In presenting this thesis in partial fulfillment of the requirements for a master's degree at Texas Tech University, I agree that the Library and my major department shall make it freely available for research purposes. Permission to copy this thesis for scholarly purposes may be granted by the Director of the Library or my major professor. It is understood that any copying or publication of this thesis for financial gain shall not be allowed without my further written permission and that any user may be liable for copyright infringement.
Disagree (Permission not granted) Agree (Permission granted)
Student's signature

Date