DESIGN OF ALGORITHM TRANSFORMATIONS FOR
VLSI ARRAY PROCESSING
by
RAVISHANKAR DORAIRAJ, B.E.
A THESIS
IN
COMPUTER SCIENCE
Submitted to the Graduate Faculty of Texas Tech University in Partial Fulfillment of the Requirements for
the Degree of
MASTER OF SCIENCE
December, 1986
ACKNOWLEDGEMENTS
I am deeply indebted to the Committee Chairman, Dr. Gopal Lakhani, for his guidance in preparing this thesis. I am also grateful to the members of my Committee, Dr. Ichiro Suzuki and Dr. Don Gustafson, for their invaluable assistance.

I wish to express grateful appreciation to my family members and friends for their encouragement and help.
ABSTRACT
The rapid advances in very large scale integrated (VLSI) technology have created a flurry of research in designing future computer architectures. Many methods have been developed for parallel processing of algorithms by directly mapping them onto parallel architectures.

A procedure, based on the mathematical transformation of the index set and dependence vectors of an algorithm, is developed to find algorithm transformations for VLSI array processing. The algorithm is modeled as a program graph, which is a directed graph. Techniques are suggested to regularize the data-flow in an algorithm, thereby minimizing the communication requirements of the architecture.

We derive a set of sufficient conditions on the structure of data-flow of a class of algorithms for the existence of valid transformations. The VLSI array is modeled as a directed graph, and the program graph is mapped onto this graph using the algorithm transformation.
CONTENTS

ACKNOWLEDGEMENTS ii

ABSTRACT iii

CHAPTER

1. INTRODUCTION 1
   1.1 VLSI Architectural Design Principles 1
   1.2 VLSI Array Processors 3
       1.2.1 Systolic Array Architecture 4
   1.3 Synthesization of VLSI Array Algorithms 7
       1.3.1 Methodologies for VLSI Array Design 10
       1.3.2 Moldovan's Method to Design VLSI Arrays 16
   1.4 Limitations of the Existing Methods 18
   1.5 Outline of the Thesis 20

2. ALGORITHM MODEL AS A PROGRAM GRAPH 23
   2.1 Data Dependence Vectors 23
   2.2 Program Graphs 25
   2.3 Program Graph Model 27
   2.4 Pipelining of Data 27
       2.4.1 Example to Illustrate Pipelining 30
   2.5 Variable Dependence Vectors 32
   2.6 Modelling of the Dynamic Programming Algorithm as a Program Graph 38

3. THE TIMING FUNCTION 45
   3.1 The Description of the Timing Function 45
   3.2 Concurrency 46
   3.3 Sufficient Conditions for the Existence of the Timing Function 47
       3.3.1 Restrictions on the Class of Algorithms 49
   3.4 The Definition of the Timing Function 54
   3.5 The Timing Function for the Dynamic Programming Algorithm 56

4. THE ALLOCATION FUNCTION 60
   4.1 Model for VLSI Arrays 60
   4.2 The Definition of the Allocation Function 61
   4.3 VLSI Implementation of the Dynamic Programming Algorithm 66

5. PROCEDURE FOR VLSI IMPLEMENTATION OF ALGORITHMS 72
   5.1 Procedure for Mapping Algorithms onto VLSI Arrays 72
   5.2 VLSI Implementation of the Shortest Path Algorithm 75

6. CONCLUSIONS 90
   6.1 Contributions 90
   6.2 Future Work 91

REFERENCES 92
LIST OF FIGURES

1. VLSI Implementation for Polynomial Evaluation 6
2. Systolic Array Configurations 8
3. Types of Data-flow 28
4. Data-flow of the Convolution Product Algorithm 32
5. Data-flow when g is Many-to-one Function 36
6. Data-flow when u is Many-to-one Function 39
7. Original Program Graph of the Dynamic Programming Algorithm 42
8. Modified Program Graph of the Dynamic Programming Algorithm 44
9. Time Model of the Dynamic Programming Algorithm 48
10. Partitioning of the Index Space of the Dynamic Programming Algorithm 50
11. Data-flow Defined by Restriction R3 53
12. Application of the Timing Function on the Program Graph 59
13. A Square Array with 8-neighbor Connections 67
14. VLSI Array for the Dynamic Programming Algorithm 71
15. Original Program Graph of the Shortest Path Algorithm 77
16. Modified Program Graph of the Shortest Path Algorithm 79
17. VLSI Array for the Shortest Path Algorithm 88
LIST OF TABLES

1. Algorithms and Desired VLSI Array Structures 9
2. Data Dependence Vectors of the Dynamic Programming Algorithm 41
3. Data Dependence Vectors of the Shortest Path Algorithm 76
CHAPTER 1
INTRODUCTION
High-performance computers are under heavy demand in the areas of scientific and engineering applications. Even though faster and more reliable hardware devices do achieve high performance, major improvements in computer architecture and processing techniques are in order. Advanced computer architectures are centered around the concept of parallel processing. Parallel processing computers provide a cost-effective means to achieve high performance through concurrent activities.
The rapid advances in very large scale integrated (VLSI) technology have created a new architectural horizon in implementing parallel algorithms directly in hardware. It has been projected that by the late eighties it will be possible to fabricate VLSI chips which contain more than 10^6 individual transistors. The use of VLSI technology in designing high-performance multiprocessors and pipelined computing devices is currently under intensive investigation.
1.1 VLSI Architectural Design Principles
VLSI architectures should exploit the potential of the VLSI technology and also take into account the design constraints introduced by the technology. Some of the key design issues are summarized below:

Simplicity and Regularity: Cost effectiveness has always been a major concern in designing VLSI architectures. A structure, if decomposed into a few types of building blocks which are used repetitively with simple interfaces, results in great savings. In VLSI, there is an emphasis on keeping the overall architecture as regular and modular as possible, thus reducing the overall complexity. For example, memory and processing power will be relatively cheap as a result of high regularity and modularity.
Concurrency and Communication: With current technology, tens of thousands of gates can be put in a single chip, but no gate is much faster than its TTL counterpart of 10 years ago. The technological trend clearly indicates a diminishing growth rate for component speed. Therefore, any major improvement in computational speed must come from the concurrent use of many processing elements. Massive parallelism can be achieved if the underlying algorithm is designed to introduce high degrees of pipelining and parallel processing. When a large number of processors work in parallel, coordination and communication become important. Especially in VLSI technology, routing costs dominate the power, time, and area required to implement a computation. The issue here is to design algorithms that support high degrees of concurrency, and in the meantime employ only simple, regular communication and control to allow efficient implementation. The locality of interprocessor connections is a desired feature.
Computation Intensive: Compute-bound algorithms are more suitable for VLSI implementation than I/O-bound algorithms. In a compute-bound algorithm, the number of I/O operations is very small compared to the number of computing operations. It is the other way around in the case of I/O-bound algorithms. I/O-bound algorithms are not suitable for VLSI implementation because the VLSI package must be constrained to a limited number of pins. Therefore, a VLSI implementation must balance its computation with the I/O bandwidth.
1.2 VLSI Array Processors
The choice of an appropriate architecture for an implementation is very closely related to the VLSI technology. The constraints of power dissipation, I/O pin count, relatively long communication delays, etc., are all critical in VLSI. On the brighter side, VLSI offers very fast and inexpensive computational elements.

Parallel structures that need to communicate only with their nearest neighbors will gain the most from VLSI technology. If modules that are far apart must communicate, considerable time is lost, and the communication requirements of the system increase. The designer must keep this communication bottleneck uppermost in his or her mind when evaluating possible VLSI implementations.
1.2.1 Systolic Array Architecture
The systolic architectural concept was developed by Kung and associates {5, 10, 11, 18}. This subsection reviews the basic concepts of systolic architectures and explains why they should result in cost-effective, high-performance, special-purpose systems for a wide range of applications.
The term "systolic" comes from systole, meaning contraction, and in physiology refers to the contraction of the heart that drives blood through the circulatory system of the body. In a systolic system, the processors pump multiple streams of data throughout the system. The regular beating of these parallel processors maintains a constant flow of data through the network. Every processor computes on each clock tick; as it pumps data items through itself, it performs a quick operation which may update some of the items. All operands for a computation arrive at a processor simultaneously, just as they are necessary, and the processors compute rhythmically and perpetually {16}.
The crux of this approach is to ensure that once a data item is brought out from memory it can be used effectively at each processor it passes. This is possible for a wide class of compute-bound algorithms where multiple operations are performed on each data item repetitively. A concrete example of a VLSI array is the VLSI algorithm, taken from {14}, for the simple problem of polynomial evaluation.
Suppose we have the following polynomial:

    P(x) = A_m x^m + A_(m-1) x^(m-1) + ... + A_0

We wish to evaluate P(x_i) at points x_i, 1 <= i <= n. The polynomial can be reformulated as:

    P(x) = ( ... ((A_m x + A_(m-1))x + A_(m-2))x + ... + A_1)x + A_0

The value of P(x_i) for each x_i, denoted p_i, is computed by an algorithm whose inner loop is

    for i = 1 to n do
        for j = m downto 0 do
            p_i = p_i * x_i + A_j
The VLSI implementation is shown in Figure 1. In this design, the coefficients are held in cells through which the x_i and p_i data flow. On every clock cycle, each cell inputs x_i and p_i, multiplies them, adds its stored coefficient, and outputs the result as p_out while passing on x_i unchanged. The result P(x_i) appears as the output of the rightmost cell m clock cycles after x_i is input to the leftmost cell. This example illustrates the pipelining and the parallel operations of a typical VLSI array.
Figure 1: VLSI Implementation for Polynomial Evaluation
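The cell behavior can be made concrete with a minimal software simulation of this design. The following Python sketch is illustrative only (the function and variable names are assumptions, not from the original design): each cell holds one coefficient, and on every clock cycle computes p_out = p_in * x + A while passing x along unchanged.

    # Minimal simulation of the linear systolic array of Figure 1.
    # Cell j holds coefficient A_j; on each clock cycle it computes
    # p_out = p_in * x + A_j and passes x through unchanged.
    def systolic_poly_eval(coeffs, xs):
        # coeffs = [A_m, ..., A_1, A_0], one coefficient per cell
        m = len(coeffs)
        pipeline = [None] * m              # (x, p) latched at each cell
        results = []
        stream = [(x, 0.0) for x in xs]    # p enters the array as 0
        for _ in range(len(xs) + m):       # extra cycles drain the pipe
            out = pipeline[m - 1]
            if out is not None:
                results.append(out[1])     # P(x) leaves the last cell
            for j in range(m - 1, 0, -1):
                prev = pipeline[j - 1]
                pipeline[j] = (None if prev is None
                               else (prev[0], prev[1] * prev[0] + coeffs[j]))
            nxt = stream.pop(0) if stream else None
            pipeline[0] = (None if nxt is None
                           else (nxt[0], nxt[1] * nxt[0] + coeffs[0]))
        return results

    # P(x) = 2x^2 + 3x + 1 at x = 0, 1, 2  ->  [1.0, 6.0, 15.0]
    print(systolic_poly_eval([2, 3, 1], [0, 1, 2]))

Each result appears m cycles after its x enters the array, mirroring the latency described above.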
As a VLSI array is made up of simple cell types, it is cheaper to design than a circuit containing a variety of complex cells. Unlike architectures that broadcast data to many points, they can be easily scaled up to handle large problems. The cells communicate only with nearby cells, reducing data broadcasts. Therefore, being able to use each input data item a number of times is just one of the many advantages of the systolic approach. Other advantages include modular expansibility, simple and regular data and control flows, use of simple and uniform cells, elimination of global broadcasting, limited fan-in, and fast response time.
In general, VLSI array designs are applied to computation-intensive problems that are regular, i.e., ones where repetitive computations are performed on a large set of data. Systolic arrays (from now on referred to as VLSI arrays) can assume many different structures for different compute-bound algorithms. Figure 2 shows various systolic array configurations, and their potential applications are listed in Table 1 {8}. These algorithms form the basis of signal and image processing, matrix arithmetic, combinatorial, and database algorithms.
1.3 Synthesization of VLSI Array Algorithms
Synthesis of VLSI algorithms is concerned with generating an optimum VLSI network topology for a computation task. Previously, the mapping of a computation onto an architecture has been done in a rather ad hoc fashion. The design of the array depended upon the inherent concurrency and communication geometry of the algorithm. Several attempts have been made at the automatic synthesis of algorithms from a description of a computation in some high-level language. The synthesis must transform such a description into a correct array algorithm for the computation. In the following subsection, we shall trace the development of these methodologies for the synthesis of VLSI arrays.
(a) One-dimensional linear array
(b) Two-dimensional square array
(c) Two-dimensional hexagonal array
(d) Binary tree
(e) Triangular array
Figure 2: Systolic Array Configurations

TABLE 1
Algorithms and Desired VLSI Array Structures

VLSI array structure: Computational algorithms

1-D linear arrays: FIR-filter, convolution, discrete Fourier transform, solution of triangular linear systems, carry pipelining, cartesian product, odd-even transposition sort, real-time priority queue, pipeline arithmetic units.

2-D square arrays: Dynamic programming for optimal parenthesization, graph algorithms involving adjacency matrices.

2-D hexagonal arrays: Matrix arithmetic (matrix multiplication, L-U decomposition by Gaussian elimination without pivoting, QR-factorization), transitive closure, pattern match, DFT.

Trees: Searching algorithms (queries on nearest neighbor, rank, etc.), parallel function evaluation, recurrence evaluation.

Triangular arrays: Inversion of triangular matrix, formal language recognition.
1.3.1 Methodologies for VLSI Array Design
The mapping of computational algorithms onto VLSI arrays implies first a transformation of the algorithm into an equivalent but more appropriate form. The basic structural features of an algorithm are dictated by the data and control dependencies. These dependencies refer to precedence relations of computations which need to be satisfied in order to compute the problem correctly. The absence of dependencies indicates the possibility of parallel computations. So the study of data dependencies in an algorithm becomes the most critical step in parallelizing the computations of the algorithm.

In the last decade, considerable research has been done in the area of data dependencies in high-level languages. Muraoka {23} and Kuck et al. {9} studied the parallelism of simple loops. They addressed the problem of discovering operations which may be performed concurrently by examining algorithms at the statement level. For iterative loops, the index set and subscript forms are studied to transform scalar operations into array operations which may be performed in parallel. Syntactic trees are built for individual assignment statements and for blocks of assignment statements, and the techniques of arithmetic tree-height reduction are then applied to reduce the required processing time.
Towle {25}, Banerjee {2}, and Banerjee et al. {3} extended the methodology of transforming ordinary programs into highly parallel forms. They show that a large number of processors can be used effectively to speed up simple Fortran-like loops consisting of assignment statements. A practical method is given by which one can check whether or not a statement is dependent upon another. Four techniques, namely loop freezing, the wavefront method, the splitting lemma, and loop interchanging, have been suggested for transforming programs into highly parallel forms.
Lamport {15} has developed methods for the parallel execution of different iterations of a DO loop on both asynchronous multiprocessor computers and array computers. The concept of the data-dependence vector was introduced, and a computation was abstracted in terms of its data dependencies. In a given algorithm, there are some dependencies between the variables generated at different index points. These dependencies can be described as difference vectors of the index points where a variable is used and where that value of the variable was generated. Lamport's method determines hypersurfaces in the index set such that the index points on a hypersurface are not data-dependent. This means that the computations at all the points on a hypersurface can be done in parallel, still preserving the sense of the data dependencies.
In the methods mentioned above, the analysis was performed from the standpoint of a compiler for a multiprocessor computer. The study of data dependencies was done to detect the inherent parallelism in the loops. Though these methods contain many basic principles, they are not adequate for the synthesization of VLSI arrays. In addition to a high degree of parallelism, VLSI arrays employ pipelining of data. Also, the communication distance and time should be kept to a minimum. There is a communication path in the VLSI network for every data stream in the original algorithm. Therefore, there is a direct one-to-one correspondence between the data dependencies of the algorithm and the communication requirements of its VLSI implementation. The current approach in the synthesization of VLSI arrays focuses not only on the deduction of data dependencies but also on their modification. Here, we shall discuss some of the representative approaches in the design of VLSI arrays.
Weiser and Davis {26} proposed a wavefront notation tool for VLSI array design. In the wavefront implementation, a delay operator is used to delay or to rotate the direction of wavefronts. A wavefront is an ordered set in which no two elements belong to the same data stream, and all the elements of the set move uniformly in time or space. Delayed or rotated wavefronts are used to represent the desired operations. The interrelationship of the operations thus defines the array network topology.
Capello and Steiglitz {4} proposed a method based on geometric transformations. This transformation presents a technique to place designs in a geometric framework. Well-known VLSI designs for computing polynomial products are shown to be related to one another by affine transformations on a three-dimensional vector space. Design properties such as broadcasting and pipelining can be formally defined, and their presence or absence in a particular design can be readily ascertained. Likewise, a design's communication topology can be disclosed by projecting out the time dimension of the representation.
Moldovan's method {20, 21, 22} is based on the transformation of index sets and dependence vectors. This transformation derives the VLSI array topology by identifying algorithm transformations which favorably modify the index set and the data dependencies, but preserve the ordering imposed on the index set by the data dependencies. Data broadcast is eliminated to make local communication possible. The transformation is partitioned into two functions, timing and network geometry. The network topology is thus derived from an optimum transformation, and the network data flow is described by the timing function. As our work is an extension of this technique, it is described in more detail in the next subsection.
Miranker and Winkler {19} have developed an approach similar to the one developed by Moldovan. The physical process of computation is interpreted in terms of a graph in physical space and time, and then an embedding into this graph of another graph which characterizes the data flow in particular algorithms is given. The VLSI array is completely described as a special class of computational structure. A technique is developed for mapping the graph of a particular VLSI array algorithm onto a physical array.
Quinton's method {24} is a convex analysis approach where the algorithms are first expressed as a set of uniform recurrence equations over a convex set of cartesian coordinates. The method consists of two steps. The first step is to find a timing function, as a quasi-affine transform for the computations, that is compatible with the dependencies introduced by the equations. The second step maps the domain onto another finite set of coordinates, each representing a processor of the VLSI array, in such a way that concurrent computations are mapped onto different processors.
Kung {12} developed the cut-set method, in which a computational task is represented by a signal flow graph. The graph is then converted into a VLSI array geometry. The procedure first selects basic operation modules, then applies a localization procedure using the cut set. Finally, it combines delay and operation modules to form basic array elements.
Li and Wah {17} characterize a VLSI array by three classes of parameters: the velocities of data flows, the spatial distribution of data, and the periods of computation. By relating these parameters in constraint equations that govern the correctness of the design, the design is formulated as an optimization problem. The size of the search space is a polynomial of the problem size, and a methodology to systematically search and reduce this space and to obtain the optimal design is proposed.
In the next section, we will discuss Moldovan's method to design VLSI arrays. His technique forms the basis of our work.
1.3.2 Moldovan's Method to Design VLSI Arrays
Moldovan developed a technique which abstracts a computation in terms of its data dependencies. The method is based on mathematical transformation of the index sets and the data-dependence vectors associated with the given algorithm. As our work is mainly based on this concept of defining transformations for mapping, we shall briefly review the basis of his method in this section.

Moldovan's method focuses not only on the detection of the data dependence vectors, but also on their modification. To meet the specifications of the VLSI array, the program structure is modified, mainly to further the localization of the data communications. The computation is modeled as a lattice with nodes representing operations and edges representing data dependencies. The lattice is mapped onto a space-time domain.
The transformation T mapping the algorithm onto a space-time domain is divided into two independent transformations, a timing function P and an allocation function S. Using P(I) = constant, hyperplanes are determined on the index set F^n of the algorithm such that all points on a hyperplane can be executed in parallel. S maps the index set onto the processors of a VLSI array, and the dependence vectors D onto the communication links of the array. Thus the network geometry and the directions of the different data streams are derived from S, and the timing is derived from P. The matrices S and P together form the program transformation T, a monotonic function in the sense that it transforms the index set and at the same time preserves the ordering imposed by the data dependencies on the index set.
The following steps describe the method:

1. Find the set of dependence vectors D.
The appearance of an array variable on the left-hand side of an assignment statement is called a generation; otherwise, it is called a use. All possible pairs of generated and used variables are formed. Their indices are equated to compute the dependence vectors.

2. Identify a valid transformation T.
The transformation T is partitioned into two functions P and S. P maps the index set F^n of the algorithm to the first k coordinates of the new index space, which is selected a priori. S maps the index set onto the remaining coordinates of the new index space. Now, P can be related to the processing time, and S can be related to the geometrical properties of the algorithm, which will dictate the communication requirements of the VLSI array. P is calculated such that PD > 0, as this preserves the execution ordering. S can be chosen depending on the network geometry of the VLSI array chosen a priori.

3. Map the algorithm onto a VLSI array.
The functions F performed by the processing cells in the VLSI array are derived directly from the mathematical expressions in the algorithm. The network geometry G is derived from the mapping S. S maps the dependence vectors onto the communication links in the array, and the index space onto the processors. The directions of the different data streams are derived from S. The network timing, which specifies for each cell the time when the processing of functions F occurs and when the data communications take place, is given by P.
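The validity check in step 2 is easy to state in code. The following Python sketch is a small illustration (the helper name is an assumption, and the dependence vectors shown are the standard ones for three-nested-loop matrix multiplication, not an example from this method's papers):

    # Check PD > 0: a linear timing function P is valid only if every
    # dependence vector is scheduled with strictly positive delay.
    import numpy as np

    def is_valid_timing(P, D):
        # P: length-n row vector; D: n x m matrix of dependence columns
        return bool(np.all(np.asarray(P) @ np.asarray(D) > 0))

    D = np.array([[1, 0, 0],     # d1
                  [0, 1, 0],     # d2
                  [0, 0, 1]]).T  # columns are the dependence vectors

    print(is_valid_timing([1, 1, 1], D))   # True: preserves ordering
    print(is_valid_timing([1, -1, 1], D))  # False: d2 scheduled backward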
1.4 Limitations of the Existing Methods
A typical assignment statement in an algorithm is of the form

    X(I) = f{X(I-d_1), X(I-d_2), ..., X(I-d_n)}

where f is an n-variable function and d_1, d_2, ..., d_n are vectors of the index space of the algorithm, called dependence vectors. The dependence vectors denote that the computation of the variable X at the index point I depends on the computations of the variable X at the points I-d_1, I-d_2, ..., I-d_n. If the length of a dependence vector d is the same throughout the index space, it is termed a fixed or constant dependence vector. This is so when all the components of the dependence vector are constants, i.e., they are not functions of the index variables. A non-fixed dependence vector is termed a variable dependence vector. An algorithm with all fixed dependence vectors has a regular data flow.
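The distinction can be seen numerically. The following toy Python sketch (hypothetical loops of my own, not examples from this chapter) computes d = I_use - I_gen over a small index set for two recurrences: one with a fixed dependence vector, and one whose vectors grow with the problem size.

    # d = I_use - I_gen for every generated-used pair in a toy index set.
    def dep_vectors(pairs):
        return sorted({(iu - ig, ju - jg) for (ig, jg), (iu, ju) in pairs})

    # X(i,j) = f{X(i-1,j)}: one fixed vector (1,0) at every point
    fixed = [((i - 1, j), (i, j)) for i in range(1, 4) for j in range(3)]
    print(dep_vectors(fixed))      # [(1, 0)]

    # X(i,j) = f{X(0,j)}: a broadcast; the vector depends on i
    broadcast = [((0, j), (i, j)) for i in range(1, 4) for j in range(3)]
    print(dep_vectors(broadcast))  # [(1, 0), (2, 0), (3, 0)]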
The methods described above make two assumptions about the algorithm. The first assumption is that the data flow is regular, i.e., all dependence vectors are fixed dependence vectors. There exist algorithms, dynamic programming for example, whose dependence vectors have components that are functions of the index variables (refer to Chapter 2). These vectors may lead to broadcasting of data, which will strain the communication requirements of the final architecture. Also, the length of the dependence vector may vary depending on the size of the algorithm. This may lead to a very irregular data flow, and the transformation methods developed so far may not suffice.
The second assumption is that the computations at all points in the index space of the algorithm use the same set of dependence vectors. These methods may not be able to find the timing function if not all dependence vectors are associated with the computation at each point in the index set, or where a subset of dependence vectors forms a cycle. This is the case with the shortest path algorithm, as will be seen in Chapter 5. Furthermore, the timing function may be sub-optimal with respect to the one actually possible. This is because, by the forced association of all dependence vectors with all the points in the index set, we may be introducing unnecessary ordering between points which were not data-dependent originally. This lowers the degree of parallelization. Hence, the timing function may be inferior.
In the following section, we briefly outline the goal of the proposed research.
1.5 Outline of the Thesis
As explained in the previous section, there exist algorithms with irregular data-flow. We may not be able to define linear hyperplanes on the index set to deduce the inherent parallelism of the algorithm. This leads to our view of the algorithm as a nonlinear structure. Points in the index space which are data-independent can then be defined as points on a convex surface. As a simplification, we only consider the cases when a continuous nonlinear convex surface is formed by a set of intersecting hyperplanes. We determine sets of intersecting hyperplanes on the index space of an algorithm, such that points on a set of intersecting hyperplanes are not data-dependent, and consequently the computations at these points can be executed in parallel. This forms the crux of our approach.
A method has been developed to regularize the data flow in the case when the algorithm has variable dependence vectors. This is done by finding, for each variable dependence vector, an equivalent set of fixed dependence vectors. We ensure that the substitution of the variable dependence vectors by the fixed dependence vectors still preserves the original data flow.
Further, we partition the index space into regions. The data-flow in a region is regular, in the sense that all points in a region receive the same set of dependence vectors. So hyperplanes can be defined on the points within a region to denote data-independency among the points on a hyperplane. A separate timing function is defined for each region.

We also determine the allocation function for each region. This correctly maps the points in that region onto the processors of the VLSI array, which is selected a priori.
Next, we integrate the transformations of the different regions. This is done by taking into consideration the points which fall on the boundary of any two regions, and the dependence vectors which coexist in the data-flow in the different regions. Conditions are derived to ensure the validity of the integrated transformation.
This thesis is organized as follows. In Chapter 2, we define a data dependence vector and a program graph. Some techniques are suggested to pipeline the data flow. We examine the structural effects of variable dependence vectors on the program graph and develop a method to convert the variable dependence vectors to equivalent fixed dependence vectors. In Chapter 3, we discuss timing functions and derive the conditions sufficient for their existence. In Chapter 4, a VLSI array model is given. An allocation function is defined to map a given program graph onto a prechosen VLSI array. Chapter 5 gives a procedure for the synthesization of VLSI arrays based on the concepts developed in the previous chapters. Using our method, we determine an implementation of the all-pair shortest path problem {7}. Finally, in Chapter 6, we summarize the results of our work.
CHAPTER 2
ALGORITHM MODEL AS A PROGRAM GRAPH
2.1 Data Dependence Vectors
In our research we consider algorithms which can be represented by nested loops as shown below:

    for I^1 = l^1, u^1
      for I^2 = l^2, u^2
        ...
          for I^n = l^n, u^n
          begin
            S_1(I)
            S_2(I)
            ...
          end

Here l^j and u^j are integer-valued linear expressions involving I^1, ..., I^(j-1), and I = (I^1, I^2, ..., I^n). S_1, S_2, ..., S_m are assignment statements of the form X = E, where X is a variable and E is an expression of the variables of the loop. Let Z denote the set of all integers, and let Z^n denote the set of n-tuples of integers. The index set of the loop is a subset of Z^n and is defined as

    F^n = {(I^1, ..., I^n) : l^1 <= I^1 <= u^1, ..., l^n <= I^n <= u^n}.
A sequential execution of the loop defines an ordering on the points of the index set F^n, which is known as the lexicographical ordering of F^n. This ordering is an induced one, and can be modified.

The dependencies in the algorithm can be studied at several distinct levels to extract parallelism. Since we are designing algorithms for VLSI arrays, we will focus only on data dependencies at the variable level, which is the lowest possible level before the bit level. The appearance of an array variable on the left-hand side of an assignment statement is called a generation; otherwise, it is called a use. Values of the used variables in a statement are required to compute the value of the generated variable, i.e., the generated variable is data-dependent on the used variables.
Let g and u be two integer functions defined on the set F^n, and let X and Y be two variables whose indices are u and g, respectively. The variables X(u(I)) and Y(g(I)) are generated in statements S_1(I) and S_2(I), respectively. The variable Y(g(I_2)) is said to be data-dependent on the variable X(u(I_1)) if
a) I_1 < I_2 in the lexicographical sense,
b) u(I_1) = g(I_2), and
c) X(u(I)) is on the right-hand side of statement S_2(I).
Definition 1: A dependence vector d is the difference I_2 - I_1 of index points, where I_1 is the point of generation of a variable, and I_2 is the point of its use.

We denote by D the set of dependence vectors of a given algorithm. The last section of this chapter contains an example which demonstrates a computation of the set D.
In general, dependence vectors are functions of index points, i.e., d(I_1) need not equal d(I_2) for two points I_1 and I_2. There is, however, a large class of algorithms with fixed or constant dependence vectors, such as the matrix multiplication algorithm. On the other hand, there are algorithms in which computations at different index points use different subsets of dependence vectors. To extend the study of data-dependencies of these algorithms, we introduce the definition of "vector-computation."

Definition 2: A vector-computation is a computation which is dependent on a specific set of dependence vectors. Two vector-computations are equivalent only if the same set of dependence vectors is associated with the computations.
2.2 Program Graphs
As graphs provide a good deal of insight into the system they represent, and also as they can be combinatorially analyzed, we model the nested-loop algorithm by a data-flow program graph. We use the program graph, a directed graph, to describe the computations in the algorithm at a high level. The vertices of the program graph are index points of the loop. With each vertex, there is some computation of the algorithm associated. With each arc in the program graph, there is a dependence vector associated, which defines the data-dependence between the two index points connected by the arc. A path in the graph containing arcs associated with the same vector is called a data stream.

Incoming edges at a vertex represent a set of input values which are required to compute the function associated with the vertex. The outgoing edges represent a set of output values generated at the vertex. A set of input variables and a set of output variables can be defined at each vertex which carry the input values and the output values, respectively.

If the algorithm has only one vector-computation, the program graph is called a homogeneous graph, in the sense that the vectors associated with the arcs incident at all vertices are the same. On the other hand, a program graph representing an algorithm with more than one vector-computation is called a heterogeneous graph. In this case, different sets of vectors will be incident at different vertices.
2.3 Program Graph Model
We define the program graph model as follows:

Definition 3: A heterogeneous program graph is a 5-tuple PG = [V, D, C, X, Y], where:
a) V is the set of vertices having one-to-one correspondence with points in the algorithm's index space, F^n.
b) D is the set of labels assigned to the arcs. Each label corresponds to a dependence vector, and there are as many different labels as there are dependence vectors.
c) C is the set of vector-computations. Each vector-computation is defined by C_f, the mathematical function it computes, and C_d, the set of dependence vectors or labels associated with the computation.
d) X is the set of input variables of the algorithm.
e) Y is the set of output variables of the algorithm.
2.4 Pipelining of Data
The processing power of the VLSI array comes from the concurrent use of many simple cells rather than the sequential use of a few powerful processors. In addition to a high degree of parallelism, some properties of VLSI arrays are pipelining and reduced interprocessor distance. To make the best of the architecture, the computations in the algorithm should be arranged such that the locality of communications is exploited.
To achieve localized flow of data, the data broadcasts which may exist in the original algorithm should be avoided. If some value of a variable v is generated at vertex I_0 ∈ V of the program graph and is used at different vertices I_i ∈ V, i = 1, 2, ..., r, then the computed value must be broadcast to all vertices I_i. This, as shown in Figure 3(a), introduces r data dependence vectors. As a result, there need to be paths from the processing cell computing for I_0 to the cells processing for I_i. This would unnecessarily saturate the communication capability of any VLSI array. An elegant solution to this problem is to pipeline the value of the variable v to the r vertices as shown in Figure 3(b).

(a): Broadcasting of Data
(b): Pipelining of Data
Figure 3: Types of Data-flow
The aim, as can be visualized from Figure 3(b), is to reduce the number of data dependence vectors in the original algorithm. Data broadcasts in a program graph may occur for many different reasons. The points I_0, I_1, ..., I_r ∈ F^n may lie on a line, or a plane, or a hyper-surface, etc. Finding a general solution for the many different situations of broadcasting may not be feasible, and is generally not necessary. Here, we give some guidelines on how we can pipeline a value if the r+1 points lie along a line given by an equation y = ax + c, where y and x are two of the n axes in the index space, a is an integer taking values -1, 0, or 1, and c is an integer.
Case 1 (a = 0): The r+1 points are on a line y = c. This type of data broadcast is very common among the different algorithms which have been synthesized so far. Typically, the broadcast is signalled by a missing index of a variable. Suppose there are n indices involved in an algorithm, and there are only n-1 indices associated with a variable in the algorithm. This indicates that the value of the variable is being broadcast to all the points along the line parallel to the missing index.

In order to eliminate broadcasting in this case, we fill in the missing index, and introduce new artificial variables such that for each generated variable, there is only one destination.
Case 2 (a = -1): In this case, the points which receive the broadcast value lie on a line y = -x + c. Therefore, the data are pipelined along this line by sending the data from the point (x-1, y+1) to the point (x, y).

Case 3 (a = 1): This case is very similar to the previous one. Here the data broadcast is along the line given by y = x + c. We pipeline the data along this line by sending data from the point (x-1, y-1) to the point (x, y).

The three cases of data broadcasting are quite common in most of the general algorithms amenable to VLSI implementation. Any other situation of data broadcasting can be well handled by tailoring the above basic solutions to the needs of the algorithm. This, of course, requires a lot of intuition and insight.
2.4.1 Example to Illustrate Pipelining
Consider the convolution product algorithm. Given a sequence x(0), x(1), ..., x(i), ..., and a set of coefficients w(0), ..., w(K), the convolution algorithm computes the sequence y(0), ..., y(i), ... given by

    y(i) = SUM (k = 0 to K) w(k) * x(i-k)                    (2-1)

This equation can be rewritten as:

    y(i,-1) = 0
    for k = 0 to K
        y(i,k) = y(i,k-1) + w(k) * x(i-k)                    (2-2)

Consider the used variable w(k). As the index i is missing in this variable, this falls under Case 1. We modify the variable to w(i-1,k) and introduce an artificial variable statement:

    w(i,k) = w(i-1,k)

The value of the used variable x(i-k) must be pipelined along the line i-k = c, for all values of the constant c. This falls under Case 3. Therefore, x(i-k) is modified to x(i-1,k-1), and the statement

    x(i,k) = x(i-1,k-1)

is introduced. After the pipelining of the variables, loop (2-2) is rewritten as:

    y(i,-1) = 0
    for k = 0 to K begin
        y(i,k) = y(i,k-1) + w(i-1,k) * x(i-1,k-1)
        w(i,k) = w(i-1,k)
        x(i,k) = x(i-1,k-1)
    end                                                      (2-3)

Figure 4 illustrates the regularity in the data flow of this algorithm after pipelining.
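The transformation can be checked mechanically. The Python sketch below is a rendering of my own, assuming zero boundary values for the samples that have not yet entered (an assumption the loop above leaves implicit); it runs loop (2-3) and compares the result against the direct sum (2-1).

    # Run the pipelined loop (2-3) and compare with equation (2-1).
    def pipelined_conv(w, x):
        K, n = len(w) - 1, len(x)
        wp = [[0.0] * (K + 1) for _ in range(n)]   # values of w(i,k)
        xp = [[0.0] * (K + 1) for _ in range(n)]   # values of x(i,k)
        y = []
        for i in range(n):
            acc = 0.0                              # y(i,-1) = 0
            for k in range(K + 1):
                w_ik = w[k] if i == 0 else wp[i - 1][k]   # w(i-1,k)
                if k == 0:
                    x_ik = x[i]          # input sample enters at k = 0
                elif i == 0:
                    x_ik = 0.0           # zero padding: x(i-k), i < k
                else:
                    x_ik = xp[i - 1][k - 1]               # x(i-1,k-1)
                acc += w_ik * x_ik       # y(i,k) = y(i,k-1) + w*x
                wp[i][k], xp[i][k] = w_ik, x_ik
            y.append(acc)
        return y

    def direct_conv(w, x):               # equation (2-1), zero padded
        return [sum(w[k] * x[i - k]
                    for k in range(len(w)) if 0 <= i - k)
                for i in range(len(x))]

    w, x = [1.0, 2.0, 3.0], [1.0, 1.0, 2.0, 3.0]
    print(pipelined_conv(w, x) == direct_conv(w, x))   # True

The check makes the equivalence of the two program graphs concrete: the pipelined w and x values chain back exactly to w(k) and x(i-k).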
2.5 Variable Dependence Vectors
Definition 4: A dependence vector whose components are integer constants is a fixed dependence vector.

Definition 5: A dependence vector whose components are functions of the indices of the algorithm is a variable dependence vector.

Figure 4: Data-flow of the Convolution Product Algorithm
Fixed dependence vectors denote regularity in the communication structure of an algorithm, as shown in Figure 4. The program graph of such an algorithm has a regular data flow. Only algorithms with a regular program graph have been implemented on a VLSI array.

There are a number of algorithms which have variable dependence vectors. The length of the dependence vector is a function of the size of the algorithm, resulting in an irregular data flow. Also, broadcasting may exist, straining the communication requirements of the VLSI architecture.

To design a simple and regular VLSI array for these algorithms, the data flow should be regularized. This can be done by replacing the variable dependence vector by a set of fixed dependence vectors which will not change the output. In effect, we desire to change the program graph while preserving its equivalence in computations. Two program graphs are equivalent if the data flow is preserved. The following definition introduces a stronger criterion for the program graphs to be equivalent.
Definition 6: Two program graphs PG = [V, D, C, X, Y] and PG' = [V', D', C', X', Y'] are said to be equivalent if and only if:
a) the execution ordering of graph PG is equivalent to the execution ordering of PG', i.e., D is equivalent to D';
b) the graph PG is input-output equivalent to PG';
c) the set of vertices V = V';
d) any mathematical function in PG corresponds to an identical function in PG'.
For simplicity, we first develop a basic solution for a variable dependence vector having only one variable component. The solution for a variable dependence vector with any number of variable components can be obtained as an extension of the basic solution.

Let an assignment statement in the algorithm be of the form:

    X(g(I_1)) = F{X(u(I_2))}

Suppose the generated-used pair X(g(I_1)) and X(u(I_2)) form the dependence vector d. Then d = I_2 - I_1. Depending on the functions g and u, there are three cases which may make d a variable dependence vector: (a) g is a many-to-one function, (b) u is a many-to-one function, and (c) both g and u are many-to-one functions. We shall consider the first two cases. The third case can be solved as a combination of the first two.

Case 1: In this case, g is a many-to-one function. Let I_11, I_12, ..., I_1i, ..., I_1k be the solutions of g(I) = a constant; k is a function of the indices. Let d = (x_1, x_2, ..., x_j, ..., x_n) with the jth component as the variable one. As g has k solutions, x_j will have k values. Then d may be represented by a linear sum of {d_1, d_2, ..., d_i, ..., d_k}, where each one of these vectors is a fixed dependence vector. This situation is shown in Figure 5(a).
Theorem 1: The variable dependence vector d can be replaced by the following set of fixed dependence vectors:

d_min = (x_1, x_2, ..., min(abs(x_j)), ..., x_n), where the jth component gives the minimum of the absolute values assumed by x_j,
d_F = (0, 0, ..., 1, ..., 0), with the jth component = 1 and all the other components = 0,
d_L = (0, 0, ..., -1, ..., 0), with the jth component = -1 and all the other components = 0.

Proof: As d = {d_1, d_2, ..., d_i, ..., d_k}, we have to prove that each of these fixed dependence vectors is equivalent to the system of vectors d_min, d_F, and d_L.

As we are taking the minimum of the absolute values assumed by x_j, d_min exists for any set of values of x_j. Note that the value that x_j takes is a function of the indices of the algorithm. In other words, the arc labeled d_min exists for any size of the program graph.

Let d_min = I_2 - I_1min. Now,

    I_1min - I_1i = a*d_F + b*d_L, where a*b = 0 and a, b >= 0.

Also,

    I_2 - I_1i = (I_2 - I_1min) + (I_1min - I_1i),

i.e.,

    I_2 - I_1i = d_min + a*d_F + b*d_L.

Therefore, d_i = d_min + a*d_F + b*d_L. Hence, by replacing the variable dependence vector d by the set of fixed dependence vectors d_min, d_F and d_L, we are not changing the original execution ordering.

(a): Irregular Data-flow
(b): Regularized Data-flow
Figure 5: Data-flow when g is Many-to-one Function

The transformed data flow is as shown in Figure 5(b). From the definition of equivalent graphs, we conclude that the graph PG with the variable dependence vector d will be equivalent to the graph PG' with the equivalent set of fixed dependence vectors.
The function g is associated with the generated variable X(g(I_1)). This means that before the transformation, a partial value of the generated variable X is residing at each one of the nodes I_11, ..., I_1k (Figure 5(a)). By the transformation we are actually finding the final value of the generated variable and storing it at all the nodes, or at least in the node I_1min. There are many ways of finding the final value of X from the partial values residing at the different nodes, and storing it in the node I_1min. To give the designer full flexibility in choosing the appropriate way to do this, we do not include the vectors d_F and d_L in the equivalent set. Therefore, in the case when g is a many-to-one function, we replace d by d_min only.
Case 2: In this case, u is a many-to-one function. Let I_21, I_22, ..., I_2i, ..., I_2k be the many solutions of u(I) = a constant. Let d = (x_1, x_2, ..., x_j, ..., x_n) with the jth component as the variable one. As u has k solutions, x_j will have k values. Then d = {d_1, d_2, ..., d_i, ..., d_k}, where each one of these vectors is a fixed dependence vector. This situation is schematically shown in Figure 6(a).

Theorem 1 holds for this case too. It can be similarly proved that d can be replaced by an equivalent set of fixed dependence vectors d_min, d_F, and d_L. The transformed data flow is as shown in Figure 6(b).

(a): Irregular Data-flow
(b): Regularized Data-flow
Figure 6: Data-flow when u is Many-to-one Function

Thus, variable dependence vectors are eliminated from the set of dependence vectors D. This results in a program graph with a regular communication structure, which can be readily mapped onto a simple and regular VLSI array.
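The substitution of Theorem 1 is easy to mechanize. The helper below is a hypothetical Python sketch of my own (the thesis gives no code): given the values that the variable jth component can assume, it returns d_min, d_F, and d_L.

    # Replace a variable dependence vector by the fixed set of Theorem 1.
    def equivalent_fixed_set(d, j, values):
        # d: the vector with a placeholder in position j;
        # values: the integers the jth component can assume
        x_min = min(values, key=abs)       # minimum of absolute values
        d_min = d[:j] + [x_min] + d[j + 1:]
        d_F = [0] * len(d); d_F[j] = 1     # unit step along axis j
        d_L = [0] * len(d); d_L[j] = -1    # unit step, opposite sense
        return d_min, d_F, d_L

    # d3 = (1, 0, x) with x ranging over 1, ..., l-1 (here l = 4):
    print(equivalent_fixed_set([1, 0, None], 2, [3, 2, 1]))
    # -> ([1, 0, 1], [0, 0, 1], [0, 0, -1]), i.e., d_min = (1 0 1)^T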
2.6 Modelling of the Dynamic Programming Algorithm as a Program Graph

We consider here the optimal parenthesization algorithm based on dynamic programming {1}. It is given in nested loop form as shown below:

    for i = 1 to n do
        m(i,i) = 0
    for l = 1 to n-1 do
        for i = 1 to n-l do
        begin
            j = i + l
            m(i,j) = MIN {m(i,k) + m(k+1,j) + r(i-1)r(k)r(j)}, k = i, ..., j-1
        end                                                  (2-4)
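Loop (2-4) is the standard matrix-chain recurrence; reading r as the dimension vector of the matrix chain, the following direct sequential Python rendering (my own, for reference and later validation, not code from the thesis) computes the same table.

    # Sequential rendering of loop (2-4): optimal parenthesization cost.
    def parenthesize(r):
        n = len(r) - 1                       # number of matrices
        m = [[0] * (n + 1) for _ in range(n + 1)]
        for l in range(1, n):                # for l = 1 to n-1
            for i in range(1, n - l + 1):    # for i = 1 to n-l
                j = i + l
                m[i][j] = min(m[i][k] + m[k + 1][j] +
                              r[i - 1] * r[k] * r[j]
                              for k in range(i, j))   # k = i, ..., j-1
        return m[1][n]

    print(parenthesize([10, 20, 50, 1, 100]))   # -> 2200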
After pipelining, the nested loop (2-4) becomes:

    for i = 1 to n do
        m(i,i) = 0
    for l = 1 to n-1 do
        for i = 1 to n-l do
            for k = i to i+l-1 do
            begin
                m^l(i,k) = m^(l-1)(i,k)
                m^l(k+1,i+l) = m^(l-1)(k+1,i+l)
                m^l(i,i+l) = MIN {m^(l-1)(i,k) + m^(l-1)(k+1,i+l) + r(i-1)r(k)r(i+l)}
            end                                              (2-5)
The data dependencies derived from the above algorithm are shown in Table 2. There are four generated-used pairs. The data dependence vectors of the last two generated-used pairs have variable components, with x = l-1, l-2, ..., 1 and y = -1, -2, ..., -l+1.

TABLE 2
Data Dependence Vectors of the Dynamic Programming Algorithm

Pairs of generated-used variables          Data dependence vectors

m^l(i,k), m^(l-1)(i,k)                     (1 0 0)^T = d_1
m^l(k+1,i+l), m^(l-1)(k+1,i+l)             (1 -1 0)^T = d_2
m^l(i,i+l), m^(l-1)(i,k)                   (1 0 x)^T = d_3
m^l(i,i+l), m^(l-1)(k+1,i+l)               (1 -1 y)^T = d_4

Let the index function associated with the generated variable m^l(i,i+l) be g = (l, i, i+l). As can be verified, g is a many-to-one function. This results in the variable
dependence vectors d_3 and d_4. The data-flow graph of this algorithm for n = 6 is shown in Figure 7. We can see that the data-flow associated with the dependence vectors d_3 and d_4 is similar to the one shown in Figure 5(a).

The third component of d_3, which is variable, assumes a range of values given by x. The minimum of the absolute values it assumes is 1. Similarly, the minimum of the absolute values assumed by the third component of d_4 is -1. By Theorem 1, we substitute d_3 and d_4 by (1 0 1)^T and (1 -1 -1)^T, respectively. So the equivalent set of fixed dependence vectors is given as:

    d_1 = (1 0 0)^T
    d_2 = (1 -1 0)^T
    d_3 = (1 0 1)^T
    d_4 = (1 -1 -1)^T
Figure 7: Original Program Graph of the Dynamic Programming Algorithm
The data-flow graph of the transformed algorithm is given in Figure 8. The data-flow is very regular as compared to the original data-flow shown in Figure 7. From the data-flow graph, we see that there are two vector-computations. Vectors d_1, d_2 and d_3 are associated with one vector-computation, and vectors d_1, d_2 and d_4 are associated with the other one.

The program graph model PG = [V, D, C, X, Y] for this algorithm, found using Definition 3, is given below:

The set of vertices is V = {(l,i,k) : 1 <= l <= n-1, 1 <= i <= n-l, i <= k <= i+l-1}.

The set of labels assigned to the arcs is D = {d_1, d_2, d_3, d_4}, where d_1 = (1 0 0)^T, d_2 = (1 -1 0)^T, d_3 = (1 0 1)^T, and d_4 = (1 -1 -1)^T.

The set of vector-computations is C = {c_1, c_2}, where c_1f and c_2f : {+, MIN}, and c_1d : {d_1, d_2, d_3} and c_2d : {d_1, d_2, d_4}.

The set of input variables X and the set of output variables Y are easily identified using the indices and are omitted here.
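A small enumeration makes the model concrete. The Python sketch below (a helper of my own, not part of the thesis) lists the vertices V and the arcs labeled d_1 through d_4 for a given n, keeping only arcs whose source also lies in V.

    # Enumerate the program graph of the transformed algorithm.
    def build_program_graph(n):
        D = {"d1": (1, 0, 0), "d2": (1, -1, 0),
             "d3": (1, 0, 1), "d4": (1, -1, -1)}
        V = [(l, i, k)
             for l in range(1, n)              # 1 <= l <= n-1
             for i in range(1, n - l + 1)      # 1 <= i <= n-l
             for k in range(i, i + l)]         # i <= k <= i+l-1
        nodes = set(V)
        arcs = [(u, v, name)
                for v in V
                for name, d in D.items()
                for u in [tuple(a - b for a, b in zip(v, d))]
                if u in nodes]                 # arc u -d-> v inside V
        return V, arcs

    V, arcs = build_program_graph(6)
    print(len(V), "vertices,", len(arcs), "arcs")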
Figure 8: Modified Program Graph of the Dynamic Programming Algorithm
CHAPTER 3
THE TIMING FUNCTION
As described in the previous chapter, the data dependency in the program graph of an algorithm enforces an order of execution which need not be the same as the one given by the lexicographical order. Data dependency introduces a partial ordering on the points of the index space. In this chapter, we will discuss properties of timing functions, which define an order of execution of the program. This function must preserve the partial ordering implied in the program graph.
3.1 The Description of the Timing Function
Let T(I) denote the time at which the computation at I ∈ F^n is done. At time T(I), all the input arguments for the computation at node I should be available. We therefore define a timing function as a mapping P from F^n to F such that P is non-negative and monotonic, i.e., for every pair of points X, Y, with X data-dependent on Y, P(X) >= P(Y) + t_t, where t_t is the transmission time from Y to X. If P(X) > P(Y) + t_t, it means that the value required for the computation at X arrives at X before it is needed. In that case, the value is either buffered or delayed by using a delay operator. The data can be stored in a local memory associated with X and used later at time P(X). Otherwise, a delay operator can be introduced in the transmission path, thereby suitably delaying the arrival of the data at X.
Given a timing function P and a constant c, the set of points such that P(I) = c forms a hyper-surface in the index space. As the points on a hyper-surface are assigned the same time of execution, it is necessary that none of these points are data-dependent on each other.

If P is a linear function on the index space, the hyper-surfaces are hyperplanes, and the index space is divided into a set of parallel planes. The hyperplanes themselves are ordered in time such that the original execution ordering of the nodes is still preserved. The number of hyperplanes gives the order of the total computation time of the algorithm.
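Concretely, a linear P partitions the index points by their schedule time. The fragment below is an illustrative Python sketch (not from the thesis); it groups a small two-dimensional index set by P(I) = c, and every group is a hyperplane whose points may execute concurrently.

    # Group index points into the hyperplanes P(I) = c.
    from collections import defaultdict

    def hyperplanes(P, points):
        planes = defaultdict(list)
        for I in points:
            planes[sum(p * x for p, x in zip(P, I))].append(I)
        return dict(sorted(planes.items()))

    pts = [(i, j) for i in range(3) for j in range(3)]
    for c, plane in hyperplanes((1, 1), pts).items():
        print("t =", c, ":", plane)    # anti-diagonals run in parallel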
3.2 Concurrency
As pointed out in the previous chapter, the computations at all points in the index space of the algorithm may not use the same set of dependence vectors. This results in an irregular data-flow. We may not be able to divide the index space into a set of linear hyperplanes. To alleviate this problem, we view the index space as regions. We partition the index space into regions such that the data-flow is regular in each region. All points in a region use the same subset of dependence vectors. Hyperplanes are defined on the points within a region.

This can be well illustrated by assigning time elements to the nodes of the program graph for the dynamic programming problem considered in Chapter 2. Let us assign t = 1 to the node (1,1,1), and from there assign appropriate times to the other nodes, bearing two things in mind: one, that the execution ordering of the nodes should be preserved; and two, that the nodes which are not data-dependent can be done in parallel.

The time model is given in Figure 9. As can be seen, we can define convex surfaces consisting of planes, such that nodes on a convex surface are not data-dependent, and therefore the computations at these nodes can be done in parallel.
3.3 Sufficient Conditions for the Existence of the Timing Function
By defining convex surfaces over the index space F^n, with each convex surface consisting of hyper-surfaces, which, in turn, are associated with a set of dependence vectors, we are actually partitioning the index space into regions. Each region is associated with a set of dependence vectors. The wavefront of computation for each region is a hyper-surface.

Figure 9: Time Model of the Dynamic Programming Algorithm

The dependence vectors associated with the
computations at any node in a region belong to the set of dependence vectors associated with the definition of that region. This is well illustrated by Figure 10.

Let D_i, i = 1, ..., k, be subsets of the dependence vectors D of the program graph PG. The number of subsets, k, is equal to the number of vector-computations, as each D_i is the set of dependence vectors associated with a vector-computation. Also, D = D_1 ∪ ... ∪ D_k. Let D(I) denote the set of dependence vectors associated with the computation at node I. Let I -d-> J denote that J is dependent on I, and that the vector d is associated with this dependency.
3.3.1 Restrictions on the Class of Algorithms
Now, we introduce a set of conditions to be satisfied by the algorithm so that the timing function we defined can be computed:

(R0) For all I ∈ V, D(I) ⊆ D_i for some i.

(R1) There exists a minimum set of root nodes R ⊆ V such that D(R) ⊆ D_i for all i.

(R2) If Y -d-> X for index points X and Y, then D(X) ∩ D(Y) ⊆ D_i for some i.

(R3) For each node X ∈ V such that D(X) ⊆ D_i for some i, and D(X) ⊄ D_j, there exists a node Y ∈ V such that D(Y) ⊆ D_j, a path from Y = Y_0 through nodes Y_1, Y_2, ..., Y_(h-1) to X = Y_h, and a sequence of integers j = j_0, j_1, ..., j_h = i, not necessarily distinct, in the range 1 to k, such that either
(a) Y_t - Y_(t-1) = d ∈ D for t = 1, ..., h, and D(Y_t) ⊆ D_(j_t) ∩ D_(j_(t+1)) for t = 1, ..., h-1, or
(b) D(Y) ⊆ D_j; and let d_t = Y_t - Y_(t-1) for t = 1, ..., h; then D(Y + d_t) ⊆ D_i.

(R4) For each point X with D(X) ⊆ D_i ∩ D_j, there exists an ancestor node Y such that D(Y) ⊆ D_i ∩ D_j, and the path from Y to X is defined as a linear combination of vectors in D_i ∩ D_j.

Figure 10: Partitioning of the Index Space of the Dynamic Programming Algorithm
The above restrictions are natural, and some explanation is in order. As stated before, we assume that the wavefront in a partitioned region is a hyper-surface. A set of dependence vectors D_i is associated with a region i. Nodes in a region can be arranged on a set of parallel hyper-surfaces.

Restriction R0 is necessary for dividing D into the subsets D_i.

Restriction R1 defines the roots of the program graph, where the initial computations are performed. The roots can be considered as the input nodes. The definition of the roots corresponds to the definition of the initial boundary conditions in recurrence equations.
Restriction R2 defines data flow to be between two parallel hyper-surfaces in a region only. That is, data required for computation at nodes on a hyper-surface come from the nodes on the preceding parallel surfaces only. The nodes on the line of intersection of two surfaces may feed only nodes on surfaces parallel to these two surfaces.

Restriction R3(a) defines a data flow path between any two nodes Y and X through a sequence of nodes on the intersections of hyper-surfaces. The data flow between any two consecutive nodes on the path is through the same dependence vector d, i.e., X - Y = h*d, where h is an integer (see Figure 11(a)).

Restriction R3(b) considers the case where the path between nodes X and Y is a linear combination of a set of dependence vectors, not necessarily distinct. The nodes on the path lie on a set of parallel hyper-surfaces (see Figure 11(b)).
Restriction R4 defines the existence of flow between two nodes on the intersection of two hyper-surfaces associated with two regions. Let a node X lie on the boundary of regions i and j, with the associated sets of dependence vectors D_i and D_j, respectively. Then X lies on the intersection of two hyper-surfaces H_i and H_j. There exists a node Y on the intersection of a hyper-surface parallel to H_i and a hyper-surface parallel to H_j, and a data path from Y to X consists of vectors in D_i ∩ D_j only.

(a): Data-flow defined by R3(a)
(b): Data-flow defined by R3(b)
Figure 11: Data-flow Defined by Restriction R3
3.4 The Definition of the Timing Function
Let P_i be a linear function associated with D_i, for i = 1,...,k, such that P_i maps the index set into the integers. Let us define the timing function P of the program graph PG as P = max {P_i : i = 1,...,k}. Now, P is an equation of a set of convex surfaces, and each of the P_i defines the hyper-surfaces which constitute a convex surface.

Theorem 2: Let P_i, i = 1,...,k, satisfy the following conditions:

(C0) P_i(d) > 0 for each d ∈ D_i;

(C1) For each of the root nodes R, if D(R) ⊆ D_i, then for all i, P(R) = P_i(R);

(C2) If Y -d-> X, D(Y) ⊆ D_i ∩ D_j ∩ D_k for some i, j and k, and D(X) ⊆ D_i ∩ D_j, and D(X) ⊄ D_k, then P_i(d) > P_k(d), P_j(d) > P_k(d), and P_i(d) = P_j(d); i may possibly be equal to j.

Then, for program graphs satisfying the restrictions R0-R4, if D(X) ⊆ D_i for node X, P(X) = P_i(X).
Proof: Our aim is to show that if the set of dependence vectors associated with the computation at a node X satisfies D(X) ⊆ D_i, then P_i(X) is the maximum of the P sub-functions, thereby assigning the computation time to X. There are two cases to be considered: one when D(X) ⊆ D_j for some other j as well, and two when D(X) ⊄ D_j.

Case 1: D(X) ⊆ D_i ∩ D_j. We have to show that P_i(X) = P_j(X). This is true at the root nodes of the program graph from condition C1. The root nodes are ancestors to the other points in the program graph. By restriction R4, there exists an ancestor Y of X in the program graph such that D(Y) ⊆ D_i ∩ D_j. To prove by induction, let P_i(Y) = P_j(Y). By condition C2, we have that for every vector d in the path X-Y, P_i(d) = P_j(d). Therefore, P_i(X-Y) = P_j(X-Y). This results in P_i(X) = P_j(X).
Case 2: D(X) ⊆ D_i and D(X) ⊄ D_j. We have to prove that P_i(X) > P_j(X). By restriction R3, there is a nearest ancestor Y of X with D(Y) ⊆ D_j. We prove this case by induction on h in R3.

Case 2a: h = 1. That is, we have the situation Y -d-> X. Since D(Y) ⊆ D_i ∩ D_j, P_i(Y) = P_j(Y). As D(X) ⊄ D_j, P_i(d) > P_j(d). As the sub-functions of P are linear, it follows that P_j(X) = P_j(Y+d) = P_j(Y) + P_j(d) < P_i(Y) + P_i(d) = P_i(X). Therefore, we have P_i(X) > P_j(X).
Case 2b: h > 1. There are two types of flows to be considered, given by restrictions R3(a) and R3(b). In the first case, D(Y_t) ⊆ D_{j_t} ∩ D_{j_{t+1}}, which implies that P(Y_t) = P_{j_t}(Y_t) = P_{j_{t+1}}(Y_t). By condition C2, since Y_t -d-> Y_{t+1}, P_{j_t}(d) < P_{j_{t+1}}(d) for t = 0,...,h-1. Therefore, P_j(d) = P_{j_0}(d) < P_{j_1}(d) < ... < P_{j_h}(d) = P_i(d). Further, D(Y_{h-1}) and D(Y_h = X) are both contained in D_{j_m}, m = h-1 or h. If m = h-1, then D(X = Y_h) ⊆ D_{j_{h-1}} ∩ D_{j_h}; therefore, P_{j_h}(X) = P_{j_{h-1}}(X). By induction, since D(Y_{h-1}) ⊆ D_{j_{h-1}}, P_j(Y_{h-1}) < P_{j_{h-1}}(Y_{h-1}). By linearity of the functions, P_j(Y_h) = P_j(Y_{h-1}) + P_j(d) < P_{j_{h-1}}(Y_{h-1}) + P_{j_{h-1}}(d) = P_{j_{h-1}}(Y_h). Since D(Y_h) ⊆ D_{j_{h-1}} ∩ D_{j_h}, P_{j_{h-1}}(Y_h) = P_{j_h}(X) = P_i(X). Consequently, P_j(X) < P_i(X). Similar arguments hold for m = h.

When we consider the second type of data flow, given by R3(b), we have D(Y_0 = Y) ⊆ D_i and D(Y) ⊆ D_j. It follows that P_j(Y) = P_i(Y). Let d_t = Y_t - Y_{t-1}. Then D(Y + d_t) ⊆ D_i for t = 1,...,h and D(Y) ⊆ D_i ∩ D_j. Therefore, P_j(d_t) < P_i(d_t). By linearity of the functions P_i and P_j, P_j(X-Y) < P_i(X-Y). Consequently, P_j(X) < P_i(X).
Thus Theorem 2 is proved. Now the restrictions R0-R4 and the conditions C0-C2 can be used to construct the timing function P(X) = max {P_i(X) : i}.
3.5 The Timing Function for the Dynamic Programming Algorithm
Consider the program graph of the dynamic programming algorithm as shown in Figure 8. There are two root nodes, (1,1,1) and (1,2,2), and two vector-computations in this algorithm. So we will define a timing function which will define a set of convex surfaces, each consisting of two planes. The timing function will be of the form

P(l,i,k) = max {P_1(l,i,k), P_2(l,i,k)}

with

P_1 = [a_1 a_2 a_3 a_4]
P_2 = [b_1 b_2 b_3 b_4].
Applying Theorem 2 to the program graph, we have the following conditions.

From condition C0, we have

a_1 > 0;  a_1 - a_2 > 0;  a_1 - a_2 - a_3 > 0;
b_1 > 0;  b_1 - b_2 > 0;  b_1 + b_3 > 0.

From condition C1, we have

a_1 + a_2 + a_3 + a_4 = b_1 + b_2 + b_3 + b_4;
a_1 + 2*a_2 + 2*a_3 + a_4 = b_1 + 2*b_2 + 2*b_3 + b_4.

From condition C2, we have

a_1 > b_1;  a_1 - a_2 < b_1 - b_2;
a_1 + a_3 < b_1 + b_3;  a_1 - a_2 - a_3 > b_1 - b_2 - b_3.

Let us restrict the solution space by including the conditions:

a_1 + a_2 + a_3 + a_4 <= 3;
b_1 + b_2 + b_3 + b_4 <= 3.

Now we solve for the coefficients of P_1 and P_2 satisfying the above conditions. One of the solutions is:

P_1 = [2 1 -1 0] and P_2 = [1 -1 1 1].
So the timing function is:

P(l,i,k) = max {2l + i - k, l - i + k + 1}.
With this timing function, we reassign the time element of each node in the program graph, as shown in Figure 12. We see that the timing function P defines convex surfaces on the nodes, with the nodes on a convex surface associated with the same time element. The two sub-functions P_1 and P_2 define the two planes in a convex surface. The nodes are partitioned into regions, with each region associated with one of the sub-functions. The nodes on a plane receive data only from the nodes on the preceding parallel plane. Thus the wavefront in each region is the plane associated with the region.
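As a quick check on this solution, the Python sketch below transcribes the constraints of this section and verifies that the chosen coefficient vectors satisfy them; the helper names are ours, not part of the thesis notation.

    a = [2, 1, -1, 0]          # coefficients of P1: P1(l,i,k) = 2l + i - k
    b = [1, -1, 1, 1]          # coefficients of P2: P2(l,i,k) = l - i + k + 1

    constraints = [
        a[0] > 0, a[0] - a[1] > 0, a[0] - a[1] - a[2] > 0,        # C0 for P1
        b[0] > 0, b[0] - b[1] > 0, b[0] + b[2] > 0,               # C0 for P2
        sum(a) == sum(b),                                         # C1, root (1,1,1)
        a[0] + 2*a[1] + 2*a[2] + a[3]
            == b[0] + 2*b[1] + 2*b[2] + b[3],                     # C1, root (1,2,2)
        a[0] > b[0], a[0] - a[1] < b[0] - b[1],                   # C2
        a[0] + a[2] < b[0] + b[2],
        a[0] - a[1] - a[2] > b[0] - b[1] - b[2],
        sum(a) <= 3, sum(b) <= 3,                                 # solution-space bound
    ]
    assert all(constraints)

    def P(l, i, k):
        # The resulting timing function P(l,i,k) = max{2l+i-k, l-i+k+1}.
        return max(2*l + i - k, l - i + k + 1)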
Figure 12: The Program Graph of the Dynamic Programming Algorithm with the Reassigned Time Elements
CHAPTER 4
THE ALLOCATION FUNCTION
In the last chapter we defined the timing function for the program graph. Once the time of computation at the different nodes has been determined, it remains to map the computations onto a set of processors. This is done in such a way that at most one computation is performed at one processor at any given time, and computations which should be performed at the same time are mapped onto different processors. The VLSI array processor will be modeled as a digraph, so that the allocation function can be defined as a mapping of the program graph onto the VLSI array graph.
4.1 Model for VLSI Arrays
We define the array graph model as follows:

Definition 7: A heterogeneous array graph is a 3-tuple AG = [G, L, F] where

(a) G refers to the physical underlying structure of the network. The number of processors is given by the size of the index set G. The position of each processor is defined by its Cartesian coordinates.

(b) L refers to the interconnections between processors supporting the flow of data through the network. It is defined as a matrix L = [l_1 l_2 ... l_p] where l_i is a column vector representing a unique data communication link in a specific direction. These links are termed the primitive communication links of the array graph.

(c) F represents the totality of arithmetic and logic operations that a processor is capable of performing.
4.2 The Definition of the Allocation Function
For S to be a valid allocation function, the requirements on S include: (a) the nodes X on a hyper-surface for which P(X) is a constant should be mapped onto different processors, which enables the computations at these nodes to be done concurrently; (b) no node is mapped to more than one processor; (c) S satisfies the timing constraint, i.e., if Y -d-> X, then the propagation time of data from S(Y) to S(X) should be less than or equal to P(X) - P(Y); (d) a dependence vector is mapped to a unique communication path constituted by the primitive links in the network; and (e) no processor performs more than one operation at any time.

Recall that the index space is partitioned into regions, and that with each region a subset of data-dependence vectors and a timing sub-function are associated. We define an allocation function as follows:

Definition 8: For a node X ∈ V, S(X) = S_i(X) if P(X) = P_i(X).
The following theorem gives a method to construct an allocation function S with the desired properties.

Theorem 3: For a region i, let

(a) S_i be a linear function,

(b) P and S together form a bijection,

(c) for each j, and d ∈ D_i ∩ D_j, S_i(d) = S_j(d),

(d) S_i D_i = L U_i, where the matrix U_i indicates the utilization of the primitive communication links in L onto which D_i is mapped. Each column in U_i represents a vector d ∈ D_i, and the row entries in the column correspond to the primitive links used in the data path onto which vector d is mapped. The matrix U_i = [u_i(l,m)] is such that

u_i(l,m) >= 0                                                           (4-1)

Σ_l u_i(l,m) <= min {P_i(d), P_j(d)} for each j such that d ∈ D_i ∩ D_j, if the column m represents the vector d                                (4-2)

for each j, and d ∈ D_i ∩ D_j, u_i(l,m) = u_j(l,m)                      (4-3)

(e) for each root node R, S(R) = S_i(R).

Then S_i correctly maps the nodes in the region i, and S = {S_i : i} is the valid allocation function for the program graph PG.
Proof: Conditions (a) and (b) of the theorem ensure that the nodes with the same time of computation, i.e., those which can be done concurrently, are mapped onto different processors. We have to consider two cases. Let two nodes X and Y lie on a convex surface. This means that P(X) = P(Y), and therefore the computations at these nodes can be done in parallel. The function S should map these two nodes onto different processors.

Case A: The two nodes X and Y lie on a hyper-surface in the region i. Then P(X) = P_i(X) and P(Y) = P_i(Y). Since from (a) and (b) we have that P_i and S_i together form a bijection, S_i(X) ≠ S_i(Y), and therefore S(X) ≠ S(Y).
Case B: The two nodes X and Y lie on different hyper-surfaces, in the regions i and j respectively. Then P(X) = P_i(X) and P(Y) = P_j(Y). We resolve this case by considering the worst-case situation, i.e., when m nodes are on m different hyper-surfaces belonging to a convex surface. Let X_k be a node on the kth hyper-surface in the kth region. We have P(X_1) = P(X_2) = ... = P(X_m). The maximum value m, and hence k, can take is equal to the number of regions, and is a small number in general. We define P*(X) = m * P(X). This allocates m time units to the convex surface so that the m nodes X_1, X_2, ..., X_m can be done in sequence even at the same processor. Then we replace P by P*. This modification in the timing function will naturally increase the computation time, but will ensure correct mapping.
Next we have to prove that S is single-valued. This ensures that no node is mapped to more than one processor. For a node X in the region i, S_i is the mapping function. Since S_i is a linear function, by (a), S_i(X) gives a unique value. Thus S maps node X onto only one processor.

Another case to be considered is when a node X lies on the boundary of two regions i and j. We have to ensure that S(X) = S_i(X) = S_j(X). For the root nodes R, by (e), we have that S(R) = S_i(R) for all regions i on the boundary of which the node R lies. For the other nodes we apply induction. When a node X lies on the boundary of two regions i and j, we have D(X) ⊆ D_i ∩ D_j. By restriction R4 on the program graph, there exists a node Y such that D(Y) ⊆ D_i ∩ D_j and the path X-Y is a linear combination of the vectors belonging to D_i ∩ D_j. By induction, S(Y) = S_i(Y) = S_j(Y). From (c), we have that for each d in the path X-Y, S_i(d) = S_j(d). Hence, as S_i and S_j are linear functions by (a), it follows that S_i(X-Y) = S_j(X-Y). Therefore, again by linearity of the functions involved, we have S_i(X) = S_j(X). This proves that S uniquely maps a node X onto the array graph.
We have to prove that S satisfies the execution ordering. If Y -d-> X, then the time taken for data to propagate from S(Y) to S(X) should be less than or equal to P(X) - P(Y) = P(d). This ensures that the data computed at the processor S(Y) reaches the processor S(X) before the computation at S(X) starts. We defined a utilization matrix U_i in (d). Let X and Y be in the region i. Then S(X) = S_i(X) and S(Y) = S_i(Y), and U_i is the utilization matrix. In the matrix U_i, there is a unique column for the vector d. The summation of the elements of this column vector, say t, gives the propagation time of data between two processors through the communication path representing the vector d. Also, we know that P_i(d) is the difference in the computation-start times of the nodes connected by the vector d. By equation (4-2) we have that t <= P_i(d). This ensures that the execution ordering is preserved by the array graph. Also, by equation (4-3), we ensure that a data stream d is mapped uniquely onto a communication path.
The timing function P and the allocation function S together form the algorithm transformation T. The following theorem summarizes our results.

Theorem 4: A transformation

T = [ P ]
    [ S ]

of an algorithm such that P and S satisfy Theorem 2 and Theorem 3, respectively, maps the given algorithm onto a VLSI array in which the data flow is correct.
4.3 VLSI Implementation of the Dynamic Programming Algorithm

In the previous chapter, we computed a timing function for the dynamic programming problem. In the following section, we shall derive the S function for the dynamic programming problem.
As given by Definition 7, the VLSI array is modeled as an array graph AG = [G, L, F] where

G = {(g_1, g_2) : 1 <= g_1 <= n-1, 1 <= g_2 <= n-1},

L = [ 0  1 -1 -1  1  0  0  1 -1 ]
    [ 0  1 -1  1 -1  1 -1  0  0 ]
     l_1 l_2 l_3 l_4 l_5 l_6 l_7 l_8 l_9

F = {+, MIN}.

The VLSI array has 8-neighbor bidirectional communication paths and also a communication link within a processor (the null link l_1). A typical processor and its 8 neighbors in the array are as shown in Figure 13. Each processor is capable of performing the addition and MIN operations.
Next, we construct the S function which will allocate the processors in the VLSI array graph AG for the computations at the different nodes in the program graph PG. From Definition 8 in Section 4.2, S is given as S = {S_1, S_2}, such that S_i is the valid allocation function for the nodes X for which P(X) = P_i(X).
Figure 13: A Square Array with 8-neighbor Connections
Let

S_1 = [ a_11 a_12 a_13 ]
      [ a_21 a_22 a_23 ]

S_2 = [ b_11 b_12 b_13 ]
      [ b_21 b_22 b_23 ].

From condition (c) of Theorem 3, we have S_1 d = S_2 d for each d ∈ D_1 ∩ D_2. This leads to the following equations:

a_11 = b_11;  a_21 = b_21;
a_11 - a_12 = b_11 - b_12;  a_21 - a_22 = b_21 - b_22.

From this, we have

a_12 = b_12;  a_22 = b_22.

Applying condition (e) on the root node (1,1,1), we have

a_11 + a_12 + a_13 = b_11 + b_12 + b_13;
a_21 + a_22 + a_23 = b_21 + b_22 + b_23.

From the above equations, we derive that

a_11 = b_11;  a_12 = b_12;  a_13 = b_13;
a_21 = b_21;  a_22 = b_22;  a_23 = b_23.

Therefore, S = S_1 = S_2.
One possible set of utilization matrices U = {U_1, U_2} which satisfies the condition (d) is given below; each column of U_i selects the primitive link carrying the corresponding dependence vector:

U_1:  d_1 -> (0 0 0 0 0 0 0 1 0)^T (link l_8)
      d_2 -> (0 0 0 0 0 1 0 0 0)^T (link l_6)
      d_4 -> (0 0 0 0 0 1 0 0 0)^T (link l_6)

U_2:  d_1 -> (0 0 0 0 0 0 0 1 0)^T (link l_8)
      d_2 -> (0 0 0 0 0 1 0 0 0)^T (link l_6)
      d_3 -> (0 0 0 0 0 0 0 1 0)^T (link l_8)

With these utilization matrices, we have

S = S_1 = S_2 = [ 1  1  0 ]
                [ 0 -1  0 ].
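As a sanity check on these choices, the Python sketch below verifies condition (d) of Theorem 3, S d = L u, column by column. It assumes the link matrix L given at the beginning of this section and the dependence vectors d_1 = (1 0 0)^T, d_2 = (1 -1 0)^T, d_3 = (1 0 1)^T and d_4 = (1 -1 -1)^T, which are our inference from the constraints of Section 3.5 (the program graph itself is in Figure 8); treat those values as assumptions.

    import numpy as np

    # Link matrix of the array graph (columns l1..l9), as given above.
    L = np.array([[0, 1, -1, -1,  1, 0,  0, 1, -1],
                  [0, 1, -1,  1, -1, 1, -1, 0,  0]])

    S = np.array([[1,  1, 0],
                  [0, -1, 0]])

    # Assumed dependence vectors of the dynamic programming program graph.
    d = {1: (1, 0, 0), 2: (1, -1, 0), 3: (1, 0, 1), 4: (1, -1, -1)}

    # Link carrying each vector, read off the utilization matrices above.
    link = {1: 8, 2: 6, 3: 8, 4: 6}

    for m, vec in d.items():
        u = np.zeros(9)
        u[link[m] - 1] = 1             # unit column of U for this vector
        assert np.array_equal(S @ np.array(vec), L @ u)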
We apply the S function to the nodes of the program graph PG. We see that at most two nodes with the same computation time are mapped to the same processor. Note, however, that there is a two-unit computation time difference between these nodes and the nodes feeding the input values required for the computations at these nodes. Therefore, from the proof of Theorem 3, we ensure that P and S together form a monotonic function.
From Theorem 4, the algorithm transformation function T is given as

T = [ P ]
    [ S ]

where P: I^3 -> I and S: I^3 -> I^2.

The VLSI array architecture defined by the array graph AG and the allocation function S is as shown in Figure 14. This architecture was first proposed by Guibas et al. {6}. All the processing cells perform the same functions. The directions of the different data streams are given by the utilization matrices.
Figure 14: VLSI Array for the Dynamic Programming Algorithm
CHAPTER 5
PROCEDURE FOR VLSI IMPLEMENTATION OF ALGORITHMS
In this chapter, we propose a procedure to transform an algorithm into a highly parallel form, and then to map it onto a prechosen VLSI array architecture. The procedure summarizes the methods which were discussed so far.
5.1 Procedure for Mapping Algorithms onto VLSI Arrays
Step 1: Pipeline all variables in the algorithm.

Step 2: Find the set of data-dependence vectors.

Step 3: Replace each variable dependence vector by its equivalent set of fixed dependence vectors.

Step 4: Model the algorithm as a program graph.

Step 5: Divide the set of dependence vectors into subsets, where each subset is associated with a vector-computation in the algorithm.

Step 6: Compute a valid timing function for each subset of data dependence vectors.

Step 7: Model the VLSI array architecture as an array graph.

Step 8: Select a valid allocation function for each subset.

Step 9: Integrate the timing and allocation functions into one transformation.

Step 10: Map the program graph onto the VLSI array. (A schematic outline of how the ten steps compose is given below.)
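The skeleton below is a schematic Python outline of the procedure; every function named in it is a hypothetical placeholder for the corresponding step, not an implementation from this thesis, and the module only defines the driver without calling it.

    def map_algorithm_onto_vlsi_array(algorithm, array_model):
        # Steps 1-4: regularize the data-flow and build the program graph.
        algorithm = pipeline_variables(algorithm)                # Step 1
        D = data_dependence_vectors(algorithm)                   # Step 2
        D = replace_variable_vectors(D)                          # Step 3
        PG = program_graph(algorithm, D)                         # Step 4
        # Steps 5-8: derive the two halves of the transformation.
        subsets = partition_by_vector_computation(D, PG)         # Step 5
        P = timing_function(subsets, PG)                         # Step 6
        AG = array_graph(array_model)                            # Step 7
        S = allocation_function(subsets, P, AG)                  # Step 8
        # Steps 9-10: integrate and map.
        T = (P, S)                                               # Step 9
        return map_graph(PG, AG, T)                              # Step 10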
Explanation: The role of the first step is to eliminate all possible broadcasts which may exist in the original algorithm. Some methods have been suggested in Section 2.4 of Chapter 2 to identify broadcasting, and to eliminate it by pipelining the variables in the algorithm.

If components of a dependence vector are functions of the indices, i.e., the vector is not fixed, then this vector may lead to an irregular data-flow. Irregular data-flow will strain the communication paths of an architecture. Step 3 identifies such variable dependence vectors, and each such vector is replaced by an equivalent set of fixed dependence vectors. Theorem 1 in Section 2.5 of Chapter 2 states a technique to find the equivalent set.

As graphs are very suitable for representing systems, we model the nested-loop algorithm as a data-flow program graph. Our model-definition of the program graph is given in Definition 3 in Section 2.3 of Chapter 2.

In Step 5, we construct subsets of dependence vectors so that each subset is associated with a vector-computation. The conceptual necessity for such a division is stated in Section 3.2 of Chapter 3.
In Step 6, we construct a timing function which determines convex surfaces on the index set. Computations at the nodes on a convex surface can be done in parallel. Theorem 2 in Chapter 3 defines such a timing function and the method to construct a valid one.

Definition 7 in Chapter 4 gives the model-definition of a VLSI array graph. In Step 7, we specify the desired array model, which comprises the number of processors and their identification, the primitive communication links, and the arithmetic and logical operations that a processor is capable of performing.

In the next step, we compute an allocation function as given by Theorem 3 in Chapter 4. This function maps the nodes and vectors of the program graph onto the processors and the data-communication links of the VLSI array graph, respectively.

The timing function P and the allocation function S are integrated to form the algorithm transformation T. The validity of the transformation T is given by Theorem 4.

In the last step, we map the algorithm onto the VLSI array. The algorithm transformation T maps the execution of an algorithm onto a VLSI array architecture.

The ten steps of the procedure mentioned above describe a systematic method for the implementation of VLSI algorithms. The validity of the procedure is ascertained by Theorems 1-4.

In the next section, we shall exemplify our procedure on the shortest path problem.
5.2 VLSI Implementation of the Shortest Path Algorithm
Consider the all-pairs shortest path problem for a directed graph {7}. We shall map this algorithm onto a 2-D square array. A systolic architecture has been designed before for the shortest-path problem by Lakhani {13}, exploiting synchronization in computation.

Let A^k(i,j) denote the length of a shortest path from vertex i to vertex j going through no vertex of index greater than k. The shortest path algorithm is given by

for k = 1 to n do
  for i = 1 to n do
    for j = 1 to n do
      A^k(i,j) = min {A^(k-1)(i,j), A^(k-1)(i,k) + A^(k-1)(k,j)}     (5-1)

We shall apply the steps of our mapping procedure to loop (5-1) to design the VLSI array for this algorithm.
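Before transforming it, it may help to see loop (5-1) in executable form. The following Python rendering is a plain sequential version, given here only for reference; INF marks the absence of an edge, and A is the n x n weight matrix of the digraph.

    INF = float('inf')

    def all_pairs_shortest_paths(A):
        n = len(A)
        A = [row[:] for row in A]      # A^0 is the weight matrix itself
        for k in range(n):             # A^k is built from A^(k-1);
            for i in range(n):         # updating in place is safe because
                for j in range(n):     # row k and column k do not change
                    A[i][j] = min(A[i][j], A[i][k] + A[k][j])
        return A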
Step 1: The number of indices associated with each one of the variables in loop (5-1) is equal to the number of index variables of the loop. Therefore, Step 1 of the procedure does not apply in this case.

Step 2: The data-dependence vectors can be found using Definition 1. All possible pairs of generated and used variables are formed and their indices equated. The three pairs thus formed are <A^k(i,j), A^(k-1)(i,j)>, <A^k(i,j), A^(k-1)(i,k)> and <A^k(i,j), A^(k-1)(k,j)>. The set of data dependence vectors derived from loop (5-1) is shown in Table 3; these correspond to the generated-used pairs. The data dependence vectors of the last two pairs have variable components with x = 1-n, ..., 0, ..., n-1.
TABLE 3
Data Dependence Vectors of the Shortest Path Algorithm

Pair of generated-used variables     Data dependence vector (k i j)
A^k(i,j), A^(k-1)(i,j)               (1 0 0)^T = d_1
A^k(i,j), A^(k-1)(i,k)               (1 0 x)^T = d_2
A^k(i,j), A^(k-1)(k,j)               (1 x 0)^T = d_3
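The entries of Table 3 can be checked mechanically: each pair contributes the difference between the index of the generated variable and the index of the used variable. The small Python sketch below is our own illustration of that computation; the sample point is arbitrary.

    def dependence_vector(generated, used):
        # Componentwise difference of the generation and use indices.
        return tuple(g - u for g, u in zip(generated, used))

    k, i, j = 3, 1, 2                                          # any index point
    assert dependence_vector((k, i, j), (k - 1, i, j)) == (1, 0, 0)        # d1
    assert dependence_vector((k, i, j), (k - 1, i, k)) == (1, 0, j - k)    # d2, x = j-k
    assert dependence_vector((k, i, j), (k - 1, k, j)) == (1, i - k, 0)    # d3, x = i-k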
Step 3: The data-flow of this algorithm for n = 3 is given in Figure 15. We can see that the variable dependence vectors d_2 and d_3 result in data-flow similar to the one shown in Figure 6(a).

Figure 15: Data-flow of the Shortest Path Algorithm
Using Theorem 1, we obtain the dependence vectors d_1 = (1 0 0)^T, d_4 = (0 0 1)^T, and d_5 = (0 0 -1)^T as the equivalent fixed dependence vectors for d_2. Similarly, the equivalent set of fixed dependence vectors for d_3 is found to be d_1 = (1 0 0)^T, d_6 = (0 1 0)^T, and d_7 = (0 -1 0)^T.

The new set of dependence vectors is as given below:

d_1 = (1 0 0)^T
d_4 = (0 0 1)^T
d_5 = (0 0 -1)^T
d_6 = (0 1 0)^T
d_7 = (0 -1 0)^T.
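The replacement can be pictured as routing: a variable vector such as d_2 = (1 0 x)^T is realized as one step along d_1 = (1 0 0)^T followed by |x| unit steps along d_4 = (0 0 1)^T or d_5 = (0 0 -1)^T, according to the sign of x. The helper below is a hypothetical Python sketch of this expansion, not code from the thesis.

    def expand_variable_vector(x):
        # Path of fixed vectors whose sum is the variable vector (1, 0, x).
        step = (0, 0, 1) if x > 0 else (0, 0, -1)
        return [(1, 0, 0)] + [step] * abs(x)

    path = expand_variable_vector(-2)              # example with x = -2
    assert tuple(map(sum, zip(*path))) == (1, 0, -2)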
The data-flow graph of the modified algorithm is shown in Figure 16. From the data-flow graph we can see that there are four vector computations. The sets of dependence vectors associated with the four vector-computations are {d_1, d_4, d_6}, {d_1, d_5, d_7}, {d_1, d_4, d_7}, and {d_1, d_5, d_6}, respectively.

Step 4: The program-graph model PG = [V, D, C, X, Y] for the shortest path algorithm, constructed from Definition 3, is as given below:

The set of nodes is V = {(k,i,j) : 1 <= k <= n, 1 <= i <= n, 1 <= j <= n}.
Figure 16: Data-flow Graph of the Modified Shortest Path Algorithm
The set of labels assigned to the arcs is D = {d_1, d_4, d_5, d_6, d_7} where d_1 = (1 0 0)^T, d_4 = (0 0 1)^T, d_5 = (0 0 -1)^T, d_6 = (0 1 0)^T, and d_7 = (0 -1 0)^T.

The set of vector computations is C = {c_1, c_2, c_3, c_4} where c_1m = c_2m = c_3m = c_4m = {+, MIN}, and c_1d = {d_1, d_4, d_6}, c_2d = {d_1, d_5, d_7}, c_3d = {d_1, d_4, d_7}, and c_4d = {d_1, d_5, d_6}.

The set of input variables X and the set of output variables Y are easily identified using the indices and are omitted here.
Step 5: The set of dependence vectors D is divided into four subsets, given as follows:

D_1 = {d_1, d_4, d_6}
D_2 = {d_1, d_5, d_7}
D_3 = {d_1, d_4, d_7}
D_4 = {d_1, d_5, d_6}.
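These subsets mirror the four regions into which the planes i = k and j = k cut the index space. The classifier below is our own Python illustration of that correspondence; boundary points (where i = k or j = k) actually belong to the intersections of regions, and the if-ordering merely picks one representative.

    def region(k, i, j):
        if i >= k and j >= k:
            return 1                   # D1 = {d1, d4, d6}
        if i <= k and j <= k:
            return 2                   # D2 = {d1, d5, d7}
        if i <= k and j >= k:
            return 3                   # D3 = {d1, d4, d7}
        return 4                       # D4 = {d1, d5, d6}

    assert region(2, 3, 3) == 1 and region(3, 1, 2) == 2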
Step 6: Now, we derive the conditions for defining the timing function from Theorem 2. The root nodes are (1,1,1), (1,1,2), (1,1,3), (1,2,1), ..., (1,3,3). The timing function will be of the form

P(k,i,j) = max {P_1(k,i,j), P_2(k,i,j), P_3(k,i,j), P_4(k,i,j)}

with

P_1 = [w_1 w_2 w_3 w_4]
P_2 = [x_1 x_2 x_3 x_4]
P_3 = [y_1 y_2 y_3 y_4]
P_4 = [z_1 z_2 z_3 z_4].
Applying Theorem 2 to the program graph, we have the following conditions.

From condition C0, we have

w_1 > 0;  x_1 > 0;  y_1 > 0;  z_1 > 0;
w_2 > 0;  x_2 < 0;  y_2 < 0;  z_2 > 0;
w_3 > 0;  x_3 < 0;  y_3 > 0;  z_3 < 0.

From condition C1, we have

w_1 + w_2 + w_3 + w_4 = x_1 + x_2 + x_3 + x_4
                      = y_1 + y_2 + y_3 + y_4
                      = z_1 + z_2 + z_3 + z_4.

From condition C2, applied across each region boundary, we have

w_1 < x_1;  w_1 < y_1;  w_1 < z_1;  x_1 > y_1;  x_1 > z_1;  y_1 = z_1;
w_2 > x_2;  w_2 > y_2;  w_2 = z_2;  z_2 > x_2;  z_2 > y_2;  x_2 = y_2;
w_3 > x_3;  w_3 = y_3;  w_3 > z_3;  y_3 > x_3;  y_3 > z_3;  x_3 = z_3.

Let us restrict the solution space by including the conditions:

w_1 + w_2 + w_3 + w_4 <= 3;
x_1 + x_2 + x_3 + x_4 <= 3;
y_1 + y_2 + y_3 + y_4 <= 3;
z_1 + z_2 + z_3 + z_4 <= 3.
We solve for the coefficients of P_1, P_2, P_3 and P_4 satisfying the above conditions. The only solution in our solution space is:

P_1 = [1 1 1 0]
P_2 = [5 -1 -1 0]
P_3 = [3 -1 1 0]
P_4 = [3 1 -1 0].

So the timing function is

P(k,i,j) = max {k+i+j, 5k-i-j, 3k-i+j, 3k+i-j}.

After assigning the time unit to each node using the above timing function, the program graph is as shown in Figure 16. It can be verified that the timing function is a valid one.
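That verification can also be done by brute force over a small index set: along every dependence arc of the modified program graph, the timing function must strictly increase. The Python harness below is our own test, not part of the thesis; it encodes our reading of the modified data-flow, in which the pipelined copies of row k and column k travel strictly away from the planes i = k and j = k.

    def P(k, i, j):
        return max(k + i + j, 5*k - i - j, 3*k - i + j, 3*k + i - j)

    def deps(k, i, j):
        # Dependence vectors feeding node (k,i,j) under the assumed flow rule.
        d = [(1, 0, 0)]                # d1
        if j > k: d.append((0, 0, 1))  # d4
        if j < k: d.append((0, 0, -1)) # d5
        if i > k: d.append((0, 1, 0))  # d6
        if i < k: d.append((0, -1, 0)) # d7
        return d

    n = 4
    for k in range(1, n + 1):
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                for dk, di, dj in deps(k, i, j):
                    src = (k - dk, i - di, j - dj)
                    if all(1 <= c <= n for c in src):
                        assert P(k, i, j) > P(*src)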
Step 7: The VLSI array is modeled as an array graph AG = [G, L, F] as follows:

G = {(g_1, g_2) : 1 <= g_1 <= n, 1 <= g_2 <= n}.

L = [ 0  1 -1 -1  1  0  0  1 -1 ]
    [ 0  1 -1  1 -1  1 -1  0  0 ]
     l_1 l_2 l_3 l_4 l_5 l_6 l_7 l_8 l_9

F = {+, MIN}.

This array has 8-neighbor bidirectional connections and also a connection within a processor (the null link l_1), as shown in Figure 13.
Step 8: Now we have to construct an allocation function S which will map the nodes in the program graph onto the nodes in the array graph. From Definition 8 in Section 4.2, S is given as S = {S_1, S_2, S_3, S_4}, such that S_i is the valid allocation function for the nodes X for which P(X) = P_i(X).

Let

S_1 = [ a_11 a_12 a_13 ]
      [ a_21 a_22 a_23 ]

S_2 = [ b_11 b_12 b_13 ]
      [ b_21 b_22 b_23 ]

S_3 = [ c_11 c_12 c_13 ]
      [ c_21 c_22 c_23 ]

S_4 = [ d_11 d_12 d_13 ]
      [ d_21 d_22 d_23 ].
From condition (c) of Theorem 3, we have

S_1 d_1 = S_2 d_1 = S_3 d_1 = S_4 d_1;
S_1 d_4 = S_3 d_4;
S_2 d_5 = S_4 d_5;
S_1 d_6 = S_4 d_6;
S_2 d_7 = S_3 d_7.

This results in the following conditions on the coefficients:

a_11 = b_11 = c_11 = d_11;  a_21 = b_21 = c_21 = d_21;
a_13 = c_13;  a_23 = c_23;
b_13 = d_13;  b_23 = d_23;
a_12 = d_12;  a_22 = d_22;
b_12 = c_12;  b_22 = c_22.

Applying condition (e) on the root node (1,1,1), we have

a_11 + a_12 + a_13 = b_11 + b_12 + b_13 = c_11 + c_12 + c_13 = d_11 + d_12 + d_13;
a_21 + a_22 + a_23 = b_21 + b_22 + b_23 = c_21 + c_22 + c_23 = d_21 + d_22 + d_23.

From the above equations, we derive that

a_11 = b_11 = c_11 = d_11;  a_12 = b_12 = c_12 = d_12;  a_13 = b_13 = c_13 = d_13;
a_21 = b_21 = c_21 = d_21;  a_22 = b_22 = c_22 = d_22;  a_23 = b_23 = c_23 = d_23.

Therefore, S = S_1 = S_2 = S_3 = S_4.
One possible set of utilization matrices U = {U_1, U_2, U_3, U_4} which satisfies the condition (d) is given below; each column of U_i selects the primitive link carrying the corresponding dependence vector:

U_1:  d_1 -> (0 0 0 0 0 0 0 1 0)^T (link l_8)
      d_4 -> (1 0 0 0 0 0 0 0 0)^T (link l_1)
      d_6 -> (0 0 0 0 0 1 0 0 0)^T (link l_6)

U_2:  d_1 -> (0 0 0 0 0 0 0 1 0)^T (link l_8)
      d_5 -> (1 0 0 0 0 0 0 0 0)^T (link l_1)
      d_7 -> (0 0 0 0 0 0 1 0 0)^T (link l_7)

U_3:  d_1 -> (0 0 0 0 0 0 0 1 0)^T (link l_8)
      d_4 -> (1 0 0 0 0 0 0 0 0)^T (link l_1)
      d_7 -> (0 0 0 0 0 0 1 0 0)^T (link l_7)

U_4:  d_1 -> (0 0 0 0 0 0 0 1 0)^T (link l_8)
      d_5 -> (1 0 0 0 0 0 0 0 0)^T (link l_1)
      d_6 -> (0 0 0 0 0 1 0 0 0)^T (link l_6)
With these utilization matrices and the condition (d), we have

S = S_1 = S_2 = S_3 = S_4 = [ 1 0 0 ]
                            [ 0 1 0 ].
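As with the dynamic programming example, condition (d) can be checked mechanically; the link assignments follow the utilization matrices above, with L as reconstructed in Step 7. The Python sketch below is our own verification, not part of the thesis.

    import numpy as np

    L = np.array([[0, 1, -1, -1,  1, 0,  0, 1, -1],
                  [0, 1, -1,  1, -1, 1, -1, 0,  0]])   # columns l1..l9

    S = np.array([[1, 0, 0],
                  [0, 1, 0]])

    link = {(1, 0, 0): 8,     # d1 -> l8, the vertical channel (1 0)^T
            (0, 0, 1): 1,     # d4 -> l1, the internal link
            (0, 0, -1): 1,    # d5 -> l1
            (0, 1, 0): 6,     # d6 -> l6
            (0, -1, 0): 7}    # d7 -> l7

    for vec, m in link.items():
        u = np.zeros(9)
        u[m - 1] = 1
        assert np.array_equal(S @ np.array(vec), L @ u)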
Next, we have to ensure that P and S together form a monotonic function. The proof of Theorem 3 gives a method to this effect: we replace the P function by P* = m * P, where m = 4 in the worst case here. From Figure 16, however, we see that at most two nodes with the same time of computation are mapped to the same processor. Therefore, P* = 2 * P is sufficient to make T = [P*, S] a monotonic function. The modified P functions are as given below:

P_1* = [2 2 2 0]
P_2* = [10 -2 -2 0]
P_3* = [6 -2 2 0]
P_4* = [6 2 -2 0].
Step 9: As stated in Theorem 4, the timing function P* and the allocation function S together form the algorithm transformation function T such that

T = [ P* ]
    [ S  ]

where P*: I^3 -> I and S: I^3 -> I^2.
Step 10: The VLSI array architecture defined by the array graph and the allocation function S is as shown in Figure 17. All processors are identical, and the structure of a processor depends upon the computations required by the shortest path algorithm as well as the timing and data communication dictated by the transformed data dependencies. The communication paths are labeled with the dependence vectors they represent. Notice that a variable which has the dependence d_1 moves from a processor to the next via a vertical channel with direction (1 0)^T.

It is important to comment here that tradeoffs are possible between the time and space characteristics of the VLSI array. Simply selecting another transformation T results in a different parallel execution time, different array dimensions, and different interprocessor communications.
Figure 17: VLSI Array for the Shortest Path Algorithm
The number of valid algorithm transformations is usually very large. A transformation must be selected based on a performance index which measures the overall array performance. Some of the characteristics to be considered are speed, processor complexity, number of interprocessor connections, practical design considerations, etc.
CHAPTER 6
CONCLUSIONS
6.1 Contributions
In this thesis, we have developed a procedure to find algorithm transformations for VLSI array processing. The concepts on which our procedure is based have been described in the previous chapters. In this section, we summarize our work, highlighting its significance.

The most important information about an algorithm is contained in its data dependences, because these determine the algorithm's communication requirements. We have suggested techniques to pipeline the data propagation. We regularize the data-flow in algorithms with variable dependence vectors by replacing such vectors with equivalent sets of fixed dependence vectors.

Then we examined a class of algorithms with heterogeneous data-flow, where computations at all points are not dependent on the same set of dependence vectors. We provided a set of sufficient conditions on the structure of data-flow in this class for the existence of syntactically correct mappings onto a VLSI array. This is the most significant contribution of this thesis, as the existing methods are inadequate for finding mappings for the class of algorithms with heterogeneous data-flow.
The characterization developed in this work sheds some insight into the regularity of computations that can be mapped onto VLSI arrays. Finally, the concepts introduced in this thesis are not restricted to VLSI systems; they can also be used for mapping algorithms onto other fixed parallel computer architectures. The concepts we have developed to find the timing function can be used in the compilation of the class of algorithms specified above for supercomputation.
6.2 Future Work
Pipelining of data is a significant step in the design of VLSI arrays, as it dictates the communication requirements of the array. A unifying approach to pipelining the propagation of data has to be developed.

Techniques have to be developed to map algorithms of any size, with heterogeneous data-flow, onto fixed-size VLSI arrays.

There is no sound basis on which we can evaluate the transformations we have designed. More work is needed to identify a unifying performance index to measure the overall array performance, including speed, processor complexity, communication requirements, and practical design considerations.
REFERENCES

1. Aho,A.; Hopcroft,J.E.; and Ullman,J.D. "The design and analysis of computer algorithms," Addison-Wesley, 1982.

2. Banerjee,U. "Data dependence in ordinary programs," M.S. thesis, Department of Computer Science, University of Illinois, Urbana-Champaign, Nov. 1976.

3. Banerjee,U.; Chen,S.C.; Kuck,D.J.; and Towle,R.A. "Time and parallel processor bounds for Fortran-like loops," IEEE Transactions on Computers, Vol. C-28, Sep. 1979, pp. 660-670.

4. Cappello,P.R.; and Steiglitz,K. "Unifying VLSI array design with geometric transformation," Proceedings of the 1983 International Conference on Parallel Processing, Aug. 1983, pp. 448-457.

5. Foster,M.J.; and Kung,H.T. "The design of special-purpose VLSI chips," Computer Magazine, Vol. 13, Jan. 1980, pp. 26-40.

6. Guibas,L.J.; Kung,H.T.; and Thompson,C.D. "Direct VLSI implementation of combinatorial algorithms," Proceedings of the Caltech Conference on VLSI, Jan. 1979, pp. 509-525.

7. Horowitz,E.; and Sahni,S. "Fundamentals of computer algorithms," Computer Science Press, 1984.

8. Hwang,K.; and Briggs,F.A. "Computer architecture and parallel processing," McGraw-Hill Book Co., 1984.

9. Kuck,D.J.; Muraoka,Y.; and Chen,S.C. "On the number of operations simultaneously executable in Fortran-like programs and their resulting speedup," IEEE Transactions on Computers, Vol. C-21, Dec. 1972, pp. 1293-1310.

10. Kung,H.T. "Let's design algorithms for VLSI systems," Proceedings of the Caltech Conference on VLSI, Jan. 1979, pp. 65-90.

11. Kung,H.T. "Why systolic architectures?," Computer Magazine, Vol. 15, Jan. 1982, pp. 37-46.

12. Kung,S.Y. "On supercomputing with systolic/wavefront array processors," Proceedings of the IEEE, Vol. 72, July 1984.

13. Lakhani,G.D. "An improved distribution algorithm for shortest paths problem," IEEE Transactions on Computers, Vol. C-23, Sep. 1984, pp. 855-857.

14. Lam,M.S.; and Mostow,J. "A transformational model of VLSI systolic design," Computer, Vol. 18, Feb. 1985, pp. 42-52.

15. Lamport,L. "The parallel execution of DO loops," Communications of the ACM, Feb. 1974, pp. 83-93.

16. Leiserson,C.E. "Area efficient VLSI computation," Carnegie-Mellon University, Scribe Version 3A(1117), Nov. 1980.

17. Li,G.J.; and Wah,B.W. "The design of optimal systolic arrays," IEEE Transactions on Computers, Vol. C-34, Jan. 1985, pp. 66-77.

18. Mead,C.; and Conway,L. "Introduction to VLSI systems," Addison-Wesley Publishing Co., 1980.

19. Miranker,W.L.; and Winkler,A. "Spacetime representations of computational structures," Computing, Vol. 32, 1984, pp. 93-114.

20. Moldovan,D.I. "On the analysis and synthesis of VLSI algorithms," IEEE Transactions on Computers, Vol. C-31, Nov. 1982, pp. 1121-1126.

21. Moldovan,D.I. "On the design of algorithms for VLSI systolic arrays," Proceedings of the IEEE, Vol. 71, Jan. 1983, pp. 113-120.

22. Moldovan,D.I. "ADVIS: A software package for the design of systolic arrays," Proceedings of the 1984 International Conference on Parallel Processing, 1984, pp. 158-164.

23. Muraoka,Y. "Parallelism exposure and exploitation in programs," Ph.D. dissertation, Department of Computer Science, University of Illinois, Urbana-Champaign, Feb. 1971.

24. Quinton,P. "Automatic synthesis of systolic arrays from uniform recurrent equations," Proceedings of the 11th Annual International Symposium on Computer Architecture, June 1984, pp. 208-214.

25. Towle,R. "Control and data dependence for program transformations," Ph.D. dissertation, Department of Computer Science, University of Illinois, Urbana-Champaign, Mar. 1976.

26. Weiser,U.; and Davis,A. "A wavefront notation tool for VLSI array design," VLSI Systems and Computations, ed. by Kung, Sproull, and Steele, Computer Science Press, 1981, pp. 226-234.
PERMISSION TO COPY
In presenting this thesis in partial fulfillment of the requirements for a master's degree at Texas Tech University, I agree that the Library and my major department shall make it freely available for research purposes. Permission to copy this thesis for scholarly purposes may be granted by the Director of the Library or my major professor. It is understood that any copying or publication of this thesis for financial gain shall not be allowed without my further written permission and that any user may be liable for copyright infringement.
Disagree (Permission not granted) Agree (Permission granted)
Student's signature

Date